DevBlog: Constantly Optimising for Performance
Post by trog @ 05:02pm 02/08/11 | Comments
Optimising the AusGamers site is something we try to do all the time. Here's one thing we've done recently which really sped things up!
We've always been about speed here at AusGamers. No-one knows better than gamers the impact that tiny amounts of time can have on an experience - everyone here grew up playing games like Quake, in which a mere handful of milliseconds meant the difference between virtual life and death, winning and losing.
We've always tried to ensure that in developing AusGamers we've kept everything as fast and efficient as possible, so our readers can get the info they want as quickly as possible. If you're an Australian gamer, chances are pretty high that at some point in the last decade you've downloaded a file from our file servers - making sure Australians have local access to the latest files, rather than having to compete with international users for overseas bandwidth, has always been our bread and butter.
Over the years we've constantly battled down performance issues as they've come up - with any code change comes the potential for problems, and we've had our share of them for sure.
A recent change we made was a great example of this - a simple change to the default number of items that get displayed in a thread, as part of an experiment we were doing in indexing, meant that our database server was doing a lot more work. This was expected and tested for extensively, but as with almost any battle plan, it didn't long survive contact with the enemy: once it was in production we saw some bad behaviour on the database side, simply as a result of the increased load. Nothing earth-shattering, but enough to cause some problems.
Now, this new change was something we wanted to keep, as we felt it was probably to our long-term benefit. So while rolling it back would have fixed the issue, it wasn't something we wanted to do unless we absolutely had to.
One possible solution would simply have been to upgrade the database server - another couple of gigabytes of RAM is reasonably cheap, and almost certainly would have solved the problem.
However, I personally hate buying new hardware except as an absolute last resort. It means spending money, which I am famously against here (I am the office scrooge) - especially when it is a performance-related thing, because I feel that we're reasonably clever and should be able to code our way around most performance problems.
So I started poking around in the guts of the system responsible for the increased load and dug into exactly what was going on - specifically, what database queries were running and what code was executing at the times we were seeing slowdowns.
Fortunately, the problem was easy to find, as it turned out to be something we've experienced many times over the last few years - MySQL choking on large DELETE operations. We've seen these many times in our MySQL slow log, but they've never been a big enough deal to do anything about - until the recent change tipped them over into unacceptable performance degradation.
The core problem was that MySQL was taking a while to delete big chunks of data. The DELETE query wasn't taking a while because it was working on millions of rows - it was because the rows it was working on were quite huge - often several megabytes in size. This meant that when we were doing a big DELETE, we were often deleting several gigabytes of data at a time - which meant lots of disk thrashing.
A corollary of this, obviously, was the performance impact of the queries that were putting this data into the database in the first place. That wasn't a high priority, as it didn't seem to be triggering slow queries, but given the volume of data moving in and out it was still a concern.
These huge chunks of data were basically just big blocks of text that only needed to be moved in and out of the database. MySQL didn't need to know about any of the contents of these big chunks of data - all it needed to do was take them in, store them, and spit them out on request. MySQL was simply suffering because of the volume of data.
So there was one simple, obvious fix that presented itself - compress the data.
I was a bit loath to do this initially because I was worried about what the CPU impact might be - but our CPU usage was pretty low anyway, generally averaging around 50% and rarely spiking. So I simply wrapped the code that moved those chunks of data in and out of the database with some compression - PHP's native gzcompress() function, with a compression level of 4. Anything bigger than around 10,000 bytes would now automatically get compressed before being pushed into the database.
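The wrapper described above is straightforward to sketch. Our actual code is PHP (gzcompress()), but the idea translates directly; here's a minimal illustration in Python using zlib, which produces the same zlib stream format as gzcompress(). The function names and the magic-byte detection trick are assumptions for the sketch, not our production implementation.

```python
import zlib

COMPRESS_THRESHOLD = 10_000  # bytes - the cutoff described above
COMPRESS_LEVEL = 4           # the gzcompress() level described above

def pack_body(text: str) -> bytes:
    """Compress a text blob before it is written to the database."""
    raw = text.encode("utf-8")
    if len(raw) < COMPRESS_THRESHOLD:
        return raw  # small blobs aren't worth the CPU cost
    return zlib.compress(raw, COMPRESS_LEVEL)

def unpack_body(blob: bytes) -> str:
    """Reverse pack_body() when the blob is read back."""
    # zlib streams start with the byte 0x78 ('x'); plain text almost
    # never does, so the magic byte is a cheap way to tell them apart.
    if blob[:1] == b"\x78":
        try:
            return zlib.decompress(blob).decode("utf-8")
        except zlib.error:
            pass  # rare false positive: plain text that starts with 'x'
    return blob.decode("utf-8")
```

Because the decompression side falls back to treating the blob as plain text, old uncompressed rows keep working unchanged - no migration needed, which is what made this safe to roll out incrementally.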
After some quick testing to make sure we weren't suddenly going to start spewing binary gibberish to users, we rolled this out and instantly saw the benefits. The overall load on the database server dropped significantly. CPU usage stayed roughly the same - maybe a tiny increase, but barely noticeable. Some of the other metrics showed improvement as well - I've got a few of the graphs showing the change below (the change was rolled out just after 12pm).
First up and most importantly to me was I/O. You can see the big drop off in writes - previously a big shaky green line flapping all over the place, after the change it stabilised at a much nicer level. Reads similarly dropped a bunch too.
Network usage was also interesting - you can see that the number of kilobytes received dropped through the floor as MySQL was no longer receiving these massive chunks of data from our application.
Finally, the server load average. There's a massive spike at around 12pm - that was an epic fail that left the site stuck for about 15 minutes when a bunch of big DELETEs went through at once, which prompted us to push the deploy live sooner rather than later. You can see that afterwards the load immediately drops right down and levels out, showing a much nicer line with no peaks.
A week later, the graphs are all holding steady. The performance of the site is rock solid - in fact, more solid than it has been for a long time, as this has resolved a performance issue that had been intermittently plaguing us for several years.
This is how we'd rather do things in any software development project - make sure that we're getting the absolute most out of the existing infrastructure. This is a great example of how seemingly simple changes in business requirements can lead to technical problems in an online service. Even though the extra load was anticipated, the volume of it was just enough to cause hassles, and I'm happy we were able to resolve this with what turned out to be a simple change in just the right place.
The AusGamers site is faster than ever - but this change is only the tip of the iceberg for us. As a result of this work we've identified a series of other optimisations that we can make in this particular system. We'll save disk space, speed up database operations, and generally improve the performance of the site across the board - reducing the impact we have on our hardware and allowing us to potentially save even more money, just by making clever decisions in software.
So if you're having performance issues on your website and thinking about upgrading your hardware - something your hosting service will no doubt be telling you is a great idea - maybe think twice.
(Blatant plug: of course, the developers on AusGamers are available to help you as well - just head over to our parent company Mammoth Media's website to get in touch!) :)