Saturday, April 28, 2007

Riding the Cheetah

A commonly known piece of trivia is that the cheetah is the fastest land animal on the planet, able to achieve speeds of 70 mph. Astounding, isn't it? Think about that. If you had to drive through Oregon on I-5 going the speed limit, that cheetah would be passing you! That's pretty fast. A wild turkey, in case you're wondering, tops out at 15 mph.

Why do I mention this? Because I like to think Atlas Quest is pretty darned fast. I spend a lot of time and effort trying to make it so. One source clocks AQ as being faster than 96% of the typical website on the Internet. Two thumbs up for me!

But it's a constant battle. The database that makes up Atlas Quest has now grown over 400 megabytes in size, and continues to grow a couple of more megabytes each week. I expect it'll grow to over 500 megabytes by the end of the year--perhaps even as much as 600 megabytes or more. That's a huge amount of growth to handle, and I'm not professionally trained for such large databases. I knew what most programmers know about them--terminology like foreign key constraints, joins, and primary keys. I could join a few tables together without too much trouble. But the Atlas Quest database is now about 1,000 times larger than anything I've ever worked with before. Let me spell that out for you. One thousand times larger. Things work a lot different on that sort of scale, and I'm constantly figuring out new tricks to keep things running fast.

One problem I had early in Atlas Quest history was figuring out exactly which queries were the biggest problems. A query, for those not in the know, is a technical term that essentially means to ask the database for information.

So I created a small bit of code to help me track down problem queries. Atlas Quest answers--on average--eight queries every single second. During the busiest times of the day, it can be much higher. Instead of querying the database directory, any code that needs information calls this special code I created. The special code logs the exact time (to a thousandth of a second), then forwards my query to the database. The database does it's thing then returns the data to the special code I wrote. That code then logs the exact time (to a thousandth of a second) that the data was returned. Subtract one from the other and it can tell you exactly how long it took the database to process that query. Most queries take less than 1/100th of a second to process. More complicated ones might take 1/10th of a second, which I consider acceptable.

Occasionally, a query might take a few seconds to process. Perhaps the database needs to load a large index from disk. Maybe Atlas Quest is doing nightly updates and needs to temporarily lock out other database queries for a few seconds while it's calculating everyone's PFX counts.

But sometimes, it's a symptom of a bigger problem. Maybe the query itself is inefficient. So I set a limit--if a query takes more than ten seconds to run, it logs a warning to a file that webmasters can access. It tells me exact query that was slow, when it ran, what page it was run from, who viewed the page (if known), and where on the page the query was called from.

Additionally, if it's a webmaster that's viewing the page, it will show every single query that was run to generate the page and the time each query took to run at the bottom of the page.

The message board update I did was a substantial change to the code base, so I've been paying more attention to the query profiling than usual to make sure it all stays running fast. I browse through the list of slow queries, looking for patterns. I surfed the message boards, just to scroll to the bottom of the page and check the query profiles. Are any of them running slower than expected? Do certain queries seem to show up as slow more often than others?

That's what I started to do yesterday afternoon. Squeezing every last ounce of performance out of Atlas Quest. When I spot something suspicious, I dissect the query and try multiple different ways to get the same information from the database. Which version run the fastest or the slowest? And why? What can I do to improve them?

Last night, shortly before I planned to go to sleep, I took one last brief through the list of slow queries and noticed a number of them coming from the nightly updates. This is nothing new--I have not spent much effort trying to keep those queries running fast. Why bother? They only run once per day, and late at night when few people are online to be affected anyhow. My efforts are better spent concentrating on the queries that run thousands of times each day.

But I spotted an easy fix to one of the queries. I wasn't even really looking for it--just kind of popped out at me. I fixed the query so it ran faster--much faster--then decided to look through those queries from the nightly updates a bit closer. Quite a few improvements could be made.

I ended spending the next two hours tweaking the queries that make up the night updates--LONG after I planned to go to sleep. The result of all this work? Fewer "hiccups" on Atlas Quest for you late night web surfers. Times when AQ seems to pause for maybe ten or twenty seconds before doing what you asked it to.

It'll still happen--I haven't fixed all of the slow queries. Some of them just can't be fixed, such as calculating everyone's PFX count each night. Atlas Quest has to pull up every one of the more than 40,000 letterboxes from the database and nearly 400,000 recorded finds and count them up, one at a time. Very slow. How else can I get an accurate count for everyone, though?

But the number of particularly slow queries during the nightly updates, I figure, has probably been cut by about 70%. The only difference anyone might notice are fewer "hiccups" when AQ is processing a time consuming query before it responds to your page request.

By 2:00am, I was falling asleep at the keyboard, so I finally called it a night and went to sleep. But I was anxious to continue my slow query slaying ways (say that ten times fast!) and continued work on the problem this morning. One slow query that showed up this morning--I went to sleep too early to see it last night--was the update for the Hall of Fame page.

Counting PFX counts can be tricky. AQ would first count all of your recorded finds, then subtract out a count of all finds on your own boxes, then subtract out any boxes you've counted more than once as a find. It was one messed up query. Counting all the finds for everyone is going to be a slow query no matter what, but this convoluted way of counting was really slow. So I started to fix the query.

Then I realized I was getting more than just counts of traditional boxes. It was also counting non-traditional boxes--just throwing that data away afterwards, though. I've wanted to update the Hall of Fame feature to include non-traditional boxes for quite some time now (it's a rather common feature request I get--just not high on my priority list). Now that I'm getting all this spare data anyway, I may as well use it!

So I spent the morning and afternoon rewriting the Hall of Fame code to now display top planters and finders of non-traditional letterboxes.

I had no idea when I went to sleep last night I'd be doing this particular feature today. It's was a surprise, but that's what I get for riding the cheetah. =)

I uploaded the changes--this particular feature didn't require me to take down the site for a period of time, so I uploaded the changes as soon as they were done. Then I went out for lunch (I was hungry), stretched my legs, and mailed an eBay item that Amanda sold.

Now I'm back, on the prowl for more ways turn Atlas Quest into a cheetah. Not a whole lot to work with at the moment. In the time it took me to write this blog entry, AQ only recorded one slow query. I'll check it out thought, and see if there's anything I can do to make it faster. During busy times on Atlas Quest is the best time to identify slow queries. It's when the database is bogged down trying to serve up a hundred queries per second that the slowest queries make their presence known, but Sunday is a typically slow day on Atlas Quest. Not much activity on today. Sunday night tends to pick up, though, and Mondays are far and away the busiest day on Atlas Quest. Tonight will be the first full night of nightly updates since my tweaking last night and this morning, so it'll be interesting to see exactly how often slow queries rear their ugly heads.

1 comment:

Anonymous said...

I have always been amazed at how fast AQ runs, even with my snail speed dial up, and wondered how you do it. Now I know! Thanks for your hard work, Ryan.
Nitrocat