Friday, June 13, 2008

AQ Database Milestones

Most of you reading this probably don't know much about databases. You've probably heard the word before, but it's a method us computer programmers use to store large quantities of data in an organized manner that can be looked up quickly. Information about you is probably located in thousands of databases around the world. You can't do anything without it showing up in a database somewhere. I consider it one of the most essential pieces of software that runs Atlas Quest, second perhaps to the operating system itself.

The website is just a pretty interface into the murky depths of the database. Oh, yes, there are some pages that don't require the use of the database, such as the tutorials, but that's not the reason most people come back day after day. No, you keep coming back because of the message boards, letterbox clues, and AQ mail. All of those data-driven features are heavily dependent on the database. Without the database, Atlas Quest would be an empty shell--it's heart, the database, missing.

It's rather remarkable, but my appreciation for databases didn't really take root until after I got my first programming job at TrueLink when we used a relatively simple database for storing account information in our projects. That's when it first hit me that any worth-while application is going to require the use of a database. I studied for a computer science degree for four years and even took a database class, and at the end of the class, I was left wondering, "So why did I waste my time taking this class?"

Looking back, it's astounding to think I thought a database class was a waste of time. Frankly, the class was a waste of time if it couldn't answer such an important and easy question. Oh, if I was able to teach that class today, I'd make sure every student who finished that course knew that it would likely be the most important class of their entire college career--because it should be!

At TrueLink, I first learned how to use a database. Basically, I learned everything that I should have learned in my database class! The stuff I did with them was pretty simple, but it was a huge eye opener for me.

Later, I moved on to bigger and better things, and started working at Intel. Early on, I started working with some code that interacted with--of course, a database. It was also fairly simple as far as databases go, but I was absolutely stunned to find out that I knew more about them than anyone else in my group! Now I did not consider myself an expert on matters of the database, but I knew what a foreign key constraint was and that you could tell the database to enforce them. What's a foreign key constraint, you ask? Glad you asked. =)

First, you have to realize that a decent database keeps track of data in what are called tables. It segregates various types of data into distinct pieces, such as member data (trail name, e-mail address, password, etc.) or data about letterboxes (box name, type, attributes, etc.).

Computers find numbers easy to work with, so pretty much everything on Atlas Quest gets assigned a number. There was a recent discussion about member IDs on the message boards, which sometimes shows up at the end of a URL as gMemberId=x where x is the number assigned to the person in question. My ID number is 1. And there's a table in the database that keeps track of all the information related to a specific member, such as my trail name, password, e-mail address, etc. The ID numbers are convenient to work with, so I can tell the database, "Give me all the information you have about member #1," and it tells me all the information it has about me.

Even thought letterboxes have their own table in the database and members have another table in the database, there is a relationship between the two. (Thus, you might hear the term relational database.) Members create letterboxes, and Atlas Quest needs to keep track of who carved the stamp, authored the clue, planted the box, owns the box, listed the box, and who all needs to be contacted when the letterbox is recorded as found. So there's a lot of data in the letterbox table that points to individual members in the member table using their ID numbers.

Creating a foreign key constraint is a way of telling the database that if you create a link from one table to another--in this case, from the letterboxing table to the member table--it MUST be to a valid ID number. It makes no sense that a letterbox was listed by member number 1,343,233 because there is no member with that number. If I set up a foreign key constraint and try to tell the database to link to an invalid number, it'll reject the request and tell me I screwed up. =)

Anyhow, I was absolutely stunned when I found out that none of my co-workers knew what a foreign key constraint was. It's elementary stuff for working with databases. You should be flunked from Databases 101 if you don't know what a foreign key constraint is after the first week of class. It's that basic.

And I was absolutely stunned when I realized that I was the resident database expert in my group, because I didn't feel like an expert at all. My co-workers had created convoluted pieces of code that would first double-check that a certain ID number existed somewhere before using it in another table of the database (if they remembered), and code that they could run periodically to make sure the numbers didn't get into an inconsistent state. They wrote hundreds of lines of code that did this sort of thing, and by specifying a foreign key constraint--one line of text--on the database, the database would have done all that work for them faster and more reliably.

Fortunately, the database we worked with was a simple, straight-forward database with minimal amounts of information, and I had no trouble using my "expertise" to fix up some obvious problems and amazed everyone with my smarts and ingenuity, but I knew the truth--I was an expert only because nobody else knew anything about databases. =)

If anyone reading this is planning to get into the computer science industry, seriously, learn everything you can about databases. You WILL work with them, and you'll be absolutely invaluable if you know them inside and out.

Ironically, Atlas Quest does not use foreign key constraints. It bothers me a little, but it's because the site runs with MySQL 5.0 which does not support foreign key constraints. (At least not the with the table types I'm using.) They have a release candidate for MySQL 5.1 out already, and that new version should be released soon and I'm looking very forward to it since it will support foreign key constraints and I'm anxious to start using them. =) One of the checks AQ runs each night is to search the database for foreign key constraints that have been violated and to clean up the mess. Once foreign key constraints are supported at the database level, those checks will be a thing of the past.

But I digress....

When I first had the thought to create a city-centered search engine for letterboxes, I knew I'd have to learn a whole heck of a lot about databases that I didn't know. My original data source for locations included information on about three million locations. When I narrowed it down to just cities and towns, there were still over half a million locations listed--far larger than anything I had worked with before.

Additionally, the queries--a word that means to look up data from the database--would be far more complex. Usually I just searched using an ID number, and told the database to return all the data associated with it. "Give me everything about member #1." This is fast, quick, and easy, because those numbers are indexed, much like the pages of a phone book. It's not like I have to search through every member checking their numbers until I stumbled onto the right number. They're all in a specific order and therefore quick and easy for the database to find.

That kind of index does not work when you ask the database, "Give me all towns within 10 miles of Seattle, WA." But searching randomly through more than a half-million towns for all matches is incredibly slow.

I developed a couple of tricks to optimize the searches--precalculating some data, bounding the results, yadda, yadda, yadda--and eventually got something to work that was extremely fast, which is still the same code that's largely being used today. I've since added additional optimizations and tweaked various things, but the principles being used are largely unchanged.

And when I finally got that working, it was the first time in my life I really felt like I was an expert at databases. When AQ 1.0 finally hit the web, the starting size of the database was about 70 megs of data (almost all of it data about locations), and all the data used was sorted into 12 tables.

It was the largest database I'd ever worked with or designed, and using the longest most complicated queries I'd ever had to construct. In this particular case, the queries were created on the fly as well, to support all the different search options people may want to use.

The Atlas Quest of today now has over 100 tables. The new tracker system I implemented put it over the century mark, and there are now 102 tables keeping track of all the happenings on the website. The size of the database, of this this morning, is 836 megabytes--more than 10 times larger than when I first started the website, and it's still growing as an enormous clip.

I've had to learn additional ways to keep the database running fast. I'm still learning new tricks to use and traps to avoid. There's a way to check how a database will process a query which is often useful for determining why a particular query is running slow (or not), for instance, and I was having trouble late last year with some message board queries running unacceptably slow whenever someone looked at really old messages.

So I tested the queries to find out where the queries were going wrong, and tried a couple of variations of the queries to see if that would give better results, and I was left with the results that I could optimize a query to run get results fast if it's a newer post, or optimize it for older posts, but there didn't seem to be any query that could retrieve data fast regardless of whether the post was new or old. That was a problem.

Then it hit me. Why don't I send the query to the database to examine before I actually run it? I was checking these queries manually to estimate which queries were most efficient so I could put in the ONE that would run the best, but why not automate the process so I could basically give several queries to the database, then have it pick the version that would get the best results.

And presto--it worked wonderfully! That's what happens when you read any message on the message boards now--it'll actually run one of two different queries depending on which one will run the most efficiently. They should both return the same results, but each one does so in a slightly different manner.

It's a clever solution to a perplexing problem I had, and it was a rather thrilling idea to me since I've never heard of anyone else who's done this sort of thing. I'm not naive enough to think that I'm the first person to ever think of this, but I don't know of anyone else who's done it before. It's not something you'll learn in a Databases 101 course, or even an advanced database course. In all my reading about databases, I've never found a source to suggest such an intriguing solution.

I still consider myself an expert on databases, but admittedly, there are still facets about it that I don't know much about such as replication. (When the database becomes too big to fit on a single server, it needs to be distributed among several servers that work together. Even at 836 megabytes, the AQ database is still puny in the grand scheme of things, so it's on a single server. If I ever need to split it among more than one server, however, that's when replication comes into play.) And even newer versions of databases have new features and quirks I must learn. (Foreign key constraints are what I'm most looking forward to in this next version coming up, and I'll have to tweak the existing code to start relying on it.)

But wow, there are now more than 100 tables that make up the Atlas Quest database. I hope the next hundred go just as smoothly! =)

-- Ryan

6 comments:

Anonymous said...

Love the technical posts you make and translate into English. Thanks for all you do Ryan.

Tee

Anonymous said...

Wow, Ryan, I actually could follow along your path, and I don't know one thing technical about computers!!! Now that is saying something!! You make a great teacher!!! Way to go.
Okie Dog

Anonymous said...

Quite interesting stuff. Thanks for a behind the scenes look. (Lord, give a piece of paper and pencil for my database.)

DrWatson

Anonymous said...

Very interesting post gives me even greater appreciation for how smoothly AQ functions and how capably you're able to keep it speeding along.

sula

Holly said...

I love it Ryan .... but I must admit you sound a lot like a bunch of my web guys that have a mantra of "we don't need no stinkin DBA" ... that is until ... they have themselves in a pickle and come sweet talking me ! :O)

Holly

Anonymous said...

I loved reading your technical stuff and how excited you are about it. (Makes the rest of us feel a bit less geeky...) It's amazing how much some programmers don't know, isn't it? I once consulted on a project for a MAJOR employer in my area to help out a guy that was building an Access database. Turns out that he didn't know anything and had only built a FORM for show with no tables behind it at all! (He was fired immmediately, of course.) Then I wasted my time writing complicated queries to later find out that they had changed the specs b/c the original guy had told them it was programmatically impossible! Yikes!