Friday, September 05, 2008

Yowzers, that's fast!

Those folks with webmaster accounts on Atlas Quest get quite a bit of technical information included with every page they view on the site. It tells us if AQ detected any errors, any cookies that were sent, POST and GET data, the headers from each page, session variables, and--relevant to today's post--a record of all queries that AQ runs on the database that take more than five seconds to run. There's nothing special about five seconds. Originally I started it at ten seconds, then waited around for a bunch of slow queries to be recorded, then started figuring out ways to make them run faster. They were the worst of the lot and slowing down Atlas Quest.

Eventually, a day came along, and Atlas Quest did not record a single slow query. So then I cut the definition of a "slow" query to five seconds, and started fixing those up. And eventually, I knocked all of those out too.

I never cut the definition of a slow query after that--Atlas Quest was running fast and smooth and there seemed little reason for it. Five seconds is actually very slow, but an occasional slow query wasn't an issue and to be expected. During the nightly updates, for instance, certain chores such as backing up the database would cause queries to stop for the better part of a minute, but backups are very important and necessary and I wasn't going to stop doing them so queries at 3:00 in the morning ran quickly. So five seconds seemed like a reasonable limit--cutting out most of the temporary slow queries but still catching most of the problem queries.

A few weeks ago, I noticed that one particular query which runs on My Page was getting pegged as slow more and more often. My Page is one of the most commonly hit pages on Atlas Quest, and it has a slow query to boot? Not a good combination....

So I took a look at the queries on that page. Atlas Quest will display for webmasters every query that is run to generate a page along with how many milliseconds the query took to run. The vast majority of queries will run in less than 5 milliseconds. A few of the more complex queries may take as long as 10 or 20 milliseconds. And a small handful, under the right conditions, may take as long as 100 milliseconds. That's still 1/10th of a second, however, and pretty darned fast.

And on My Page, Atlas Quest typically runs about 10 queries to generate all the data you see on that page. And one of the queries--the query that kept getting tagged as slow--took longer to run than all the other queries combined! Most of the time I looked, it required about 200 to 300 milliseconds to run, but when the server had a particularly heavy load, that time would often extend beyond 5,000 milliseconds (that's five full seconds!) and AQ would record the slow query for my examination. For one of the most commonly viewed pages on Atlas Quest, that query was becoming a major problem, and a problem that continued to grow worse as the number of members on Atlas Quest and the number of messages to the message boards grew.

What did this query do? It's the one that calculated how many new posts were on your favorite boards. Somehow, I didn't think most of you would like it if I just took out that feature. I needed to think of a way to make that query faster.

And for weeks now, I pondered ways about doing that. I delved into the database documention, tried to implement a couple of solutions, but kept hitting a block wall. That query was a darn good query already--it had to be to have kept running so well for so long--and further optimizations were growing a lot harder to find. There were other problem queries on Atlas Quest, but I considered the one on My Page my number one priority since it was one of the most commonly viewed pages on Atlas Quest.

I don't know exactly how many hours I worked on figuring out a solution to this problem query. Probably at least 50 hours. Probably not more than 100, but I don't have a timesheet I check in with, so I don't really know, but it was long enough that I was getting very frustrated. After weeks of effort and dozens of hours of research and tests, and I was getting absolutely nowhere.

And the problem query just kept becoming a bigger and bigger problem, getting tagged as slow more and more often. Would the madness ever end?

Finally, more-or-less out of desperation, I had one last idea that required me to duplicate some data in the database in two different places. Actually, technically speaking, the data was duplicated--on average--about 7 times. I only duplicated one column, but for those of you wise in database design, duplication of data is best avoided. But I thought one particular index might help, and to create it, I needed to duplicate some data from one table in the database to another, then create an index on it. An index is just what it sounds like--something like out of a technical book you might read. It allows finding specific data in a table very fast.

So I copied over this data, created an index on it, and what an improvement! That problem query on My Page went from 200-300 milliseconds (most of the time) down to 20-30 milliseconds (most of the time). It's a huge improvement, running 10 times faster than before. The last time I viewed My Page, the queries it ran took the follow milliseconds to run: 1, 19, 6, 2, 6, 5, 0, and 3 for a total of 42 millseconds. That's pretty normal, and is a huge improvement over the 300+ milliseconds the page used to take to run. So if My Page seems like it's running faster today, it's because it is. AQ is generating it nearly 10 times faster than it was just yesterday.

Turns out, however, that was just the beginning. There was another problem query I was running into when someone viewed a page of message board posts. Each time you view a post, Atlas Quest needs to calculate where the previous and next buttons should point to. I put in a couple of ugly hacks many moons ago to help keep the message boards running fast, but they were starting to filter. The calculation for the previous message in particular was often the slowest of all, rarely finishing in less than 1000 milliseconds (one full second). And that's a pretty busy page to have a query that takes at least one full second to generate.

With this duplicated data now in the database, on a hunch, I added another index and checked out the results. The previous message query took 1 millisecond to run. One. That friggin' query took 1/1000th of a second to run--a thousand times faster than before. HOLY COW! I was floored. I clicked more messages to see if it was a fluke, and the query consistently came back within one or two milliseconds. It also make the query for calcuation the next post equally fast (improving from about 100-200 milliseconds to 1 or 2 milliseconds), and the query for the messages that were being viewed also improved from about 100-200 milliseconds to 1 or 2 milliseconds. With queries running that fast, I no longer needed a couple of the hacks I put in the code to improve the query speeds, so I took them out which saved another few hundred milliseconds off the overhead to generate a page.

All told, viewing a page of 30 posts was taking nearly 2000 milliseconds to create on average. Even during the lightest of server loads, the page was taking at least 1000 milliseconds to generate. During heavy server loads, it might take 10,000 milliseconds. (Ten full seconds!)

The last page of messages I just looked at included the following query times (in milliseconds): 1, 2, 0, 0, 1, 1, 0, and 1 for a total of 6 milliseconds--an improvement over one hundred times faster than before. Another page of messages took the database 15 milliseconds to generate--and that was one of the slower ones!

So to make a long story short.... if the message boards seem zippier today than they have in a long time, there's a good reason for that. They're running over 100 times faster today than they were yesterday. =)

9 comments:

Anonymous said...

Thanks for the cow reference. Even if I did get a little bored reading this!!

Anonymous said...

Yes, an amazing improvement, and VERY noticeable!

You are SUCH a geek. Really. And I mean this in a good way, as one geek to another. Although I kinda think this should be on the "Total Goober" blog...
-wassamatta_u

Anonymous said...

Well done!
And your explanation: Well done!

sula

Anonymous said...

Next time I'll know to scroll down to last paragraph where you "make a long story short"! I followed most of it...really I did! :)

Anonymous said...

While I would imagine this blog entry is Greek to most of us, ANYone can see that Ryan is to be rewarded for his expertise, perserverance, knowledge and determination to bring us one of the best websites on the WWW! EVERYone should be a Premium Member just to support this kind of loyalty. Three cheers for our Fearless Leader! (not a reference to a Bullwinkle character. . .)
~~Doublesaj & Old Blue~~

Anonymous said...

So you are saying that you like indexing "a lot"...

Anonymous said...

Hoooah!!!!
OD

Anonymous said...

Very cool. So did you basically duplicate the message ID, the board ID and then one other index (unread?) so that the query wouldn't have to read through so much text?

As I was reading your post, I was wondering if running a separate query every 5 seconds to count the number of new messages on each board and write it out to a table... oh wait, that wouldn't work because the counts would be different for each user and not all users are logged on... Sheesh, now I see why it took a long time to figure out! I'll have to remember that sometimes duplication is actually better!

Anonymous said...

Thank you for all of your hard work Ryan!! We really appreciate it!!

♥ Lady Lilac