We dubbed it “our Christmas,” the day when all our good engineering deeds would pay off and we’d be rewarded with a bounty of happy active users for our mobile app, Avocado.
Up until this point we’d enjoyed steady and predictable growth, but with only two weeks’ notice we realized a few big, coinciding promotions would bring us a 30x increase in traffic and a membership boom for which we weren’t yet prepared.
Scaling for a “Christmas” moment is scary; it forces you to confront all the design decisions you may have made earlier when bootstrapping your app. No matter what the hidden weaknesses are, a massive group of new users will find them.
With only two weeks, how could we scale our infrastructure to meet demand and not fall over?
Preparing for scale
This is a document of how we screwed up. And then how we fixed our screw-ups. Please learn from us.
At first, we thought we knew where our weak points were. Well, until we ran some tests and found our problems were in a completely different place.
We’ve been using Amazon’s Web Services. Here’s the initial configuration of our deployment:
1 frontend server (stored: API, web presence, socket.io, HAProxy, stunnel/SSL encryption)
1 Redis server
1 MySQL server
1 batch server (running miscellaneous daemons)
Our API server was using Node.js since it’s lightweight, fast, easy to code for, and has a large community with some solid libraries we use (and some we have written and shared).
Given the results of early tests and under limited resources and time, we understood where we needed to focus: new user signups — clearly our most complex process. We put our energies into finding the pain points and where the system would die if we had a sudden influx. This was an important business decision: You can let a few normal day-to-day activities fail, but you don’t want new user creation to fail.
Smart scaling required us, as a wise man once said, to check ourselves before we irreparably wrecked ourselves. In other words, we tested, tested, and tested some more.
Early tests weren’t good enough so we created a replica environment: entire duplicates of our database and existing server and partitioned them off so nothing would affect our production environment. We ran through some common use cases like account creation as well as messaging and uploading images. We saw where Avocado failed and at what rates.
Pro Tip #1: We used much smaller EC2 instances in our test environment than in production. Not only does that provide limited resources all-around to more easily find problems, but it’s considerably cheaper as you pay for the amount of hardware you use on AWS.
Pro Tip #2: We built test scripts as well as used a tool called jmeter to simulate tens of thousands of user signups per minute.
Our testing process: Run the scripts, find weak points, re-run the scripts. We then experimented to see what would happen if, for instance, we added a second server or if we took advantage of all the CPUs on a given server.
We started to see things immediately. Any time we had thousands of users signing up at one time, things would grind to a halt and connections to Avocado would take ~30 seconds to complete.
What we discovered: Any time Node.js spawned off a separate process, the system would take a hit. At the time, we were spawning off a Python process to send verification emails and another for image resizes (required when a user uploads a photo to their activity stream). When that happened, we’d see a big ol’ CPU spike. We realized it was those processes that were slowing everything down.
Well, it was the processes. And the fact that we had everything on just one box.
We also noticed slow database query response times. We saw in the test environment that a handful of MySQL queries were showing up in MySQL’s slow query log (for instance, if we queried to see if a certain email address had received an invitation to join their “boo” on Avocado).
Pro Tip #3: MySQL query optimization can be something of a dark art but speeding up queries can be as easy as adding additional indexes to a couple tables.
One last thing: SSL encryption/decryption was a major load on the CPU as well, but we didn’t know to what degree it would impact the system under heavy user traffic in production.
What we changed after testing
Our first fix was to rewrite the email system, previously in Python, in Node.js so it wouldn’t need to call up another process. Moving to a node.js mailing library meant we could send 3x to 5x the number of emails in a period of time as it would take to call the external python program.
Next, we put stunnel (SSL encryption handler) and HAProxy onto their own EC2 server so we wouldn’t overload our core services.
We had a few possible solutions for the image resizing problem found in testing where spawning ImageMagick processes from node.js proved to be a huge resources hog. We explored the option of leveraging some faster libraries and third-party image resizing services.
But we ended up not rewriting any code or using any services. Either may have worked, but we didn’t want to do last minute code changes a day before our big promotion. So instead, we deployed four new EC2 instances whose sole purpose was image resizing. We took these down after the promotion.
Currently we’re back to just resizing images on our normal frontend servers (the servers that serve the API and Web page/client). For the future, we’d like to leverage a service that we can upload images to and request whatever size we want as we need them, thereby offloading the hardware intensive operation of image manipulation to someone else.
On the second day after “Christmas” began, we were woken up early in the morning by a notification that our server response times had slowed to a crawl due to traffic. The frontend server’s resources were maxed out. This, admittedly, was not our favorite way to start the weekend. #startuplife
We deployed a second frontend server and routed half the traffic to that server.
To review: At this point, we now had:
2 frontend servers
1 HAProxy/stunnel server
1 Redis server
1 MySQL server
1 batch server (running miscellaneous daemons)
4 image resizing servers
From that point onward, we kept an eye on servers and rolled out new EC2 instances as traffic increased. (We maxed out at 15 EC2 instances.)
Prior to adding all those servers, we came across another issue in testing that we waited to fix until we actually needed it: We realized that the larger EC2 instances had multiple CPUs available that would go mostly unused on our API servers due to the fact that node.js is single threaded. The HAProxy server was only pointed at one instance per box. We waited to deploy a fix because the solution was somewhat experimental — we didn’t want a “guess” to go to production. Yeah, it turns out we should’ve gambled on that guess.
All the while, whenever we would update the HAProxy configuration to handle our latest scaling solution, Avocado would go down for 1 to 2 seconds as our HAProxy restarted, thus loading the new config. Our next task was to create a redundant HAProxy server to prevent any down time.
In response, we used the Elastic Load Balancer to route traffic to the two HAProxy servers. The Elastic Load Balancer also handled the SSL encryption which took a huge load off of the proxy boxes. This allowed us to downsize the proxy boxes considerably, thus saving money in EC2 instance costs.
Everything seemed good. For, like, five minutes. Then there was socket.io.
We discovered our socket.io servers began to seriously eat up resources. Up to this point, we had a single socket.io server instance running on one of our two frontend servers. Which meant if we ever needed to update the frontend server, even if it had nothing to do with socket.io, our socket.io server would go down. It would only be a blip, but it’d still result in downtime. Downtime = sad users.
Scaling socket.io to multiple servers is tricky because if a user is connected to one socket.io server, they need to stay connected to that same one for the entire session – another socket.io server won’t have their credentials if they’re suddenly moved. So, we used something called “sticky sessions” to keep users connected to the same server for the life of their session. (When you home screen the app, we close the socket.io connection, and when you open the app again, we create another socket.io connection for the user.) Sticky sessions sets a cookie for the life of the session so it knows what socket.io server to send you to every single time. HAProxy will see that cookie and send you to the right socket.io server.
On that note, we’d like to say that while WebSockets are the new hotness (and while socket.io takes advantage of it), mobile apps should be warned, Websockets will not work on some cellular networks worldwide.
Instead, we had to use XHR-polling with socket.io, which uses HTTP requests to simulate a real-time connection. Furthermore, Amazon’s Elastic Load Balancer didn’t support WebSockets, so we had to use XHR-polling regardless.
Once that issue was resolved, we scaled out to two socket.io servers. After this initial scaling, we moved the socket.io server to its own box – inside of that we could have 8 socket.io servers (one per CPU), so we had a total of 16 instances of socket.io servers.
I’m sure at this point you’re saying, “16 socket.io servers?! You’re doing something wrong.” And in thinking that, you’d be right. So, we changed something else to improve our performance: database connection pooling.
Prior to this we had one connection to the database per server instance. Any time a process wanted to talk to our database it had to get in line. The process happened so quickly that the line wasn’t very long, but since we only had a single connection per server instance we ended up using a lot more resources than necessary.
For example, our socket.io servers were eating up 60 percent of the box’s CPUs across all 8 cores because of only having one connection to the database per socket.io server instance. When we added the connection pooling, we added anywhere from 2 to 10 connections to the database per server instance.
We didn’t do this connection pooling before mostly due to lack of time and because we didn’t want to make major changes to our database code with only three days before the promotion. The connection pooling meant that usage went from 60 percent of CPU to 3 percent. Response times were dramatically improved all around.
Our “Christmas” event took us global. And we hadn’t adequately prepared for that. Users in other parts of the world were connecting to our servers all the way in NorCal. And that meant sloooow connections. We have since launched servers throughout the world, relying on Amazon to route traffic based on latency.
And that’s all part of our…
We’re still in the afterglow of this growth and while things are slightly less hectic, we also have quite a few other big events planned, and each of those mean more scaling.
We expect to be using auto-deployment strategies. For example, Amazon will deploy new boxes automatically based on certain metrics, letting us deploy new servers when their average CPU usage exceeds a certain percentage.
One concern we’re talking about though: if you do auto-scaling wrong you might accidentally deploy 100 boxes… and then you’re paying for 100 boxes. So we’re cautiously proceeding here.
Biggest lesson learned: It would have been great to start the scaling process much earlier. Due to time pressure we had to make compromises –like dropping four of our media resizer boxes. While throwing more hardware at some scaling problems does work, it’s less than ideal.
Generally, it’s hard to know in advance what you’re going to need unless you know exactly what growth to expect. So plan for the worst, hope for the best, and good luck out there.
TL;DR: Plan ahead. Test your servers until they fall to pieces. Eat your veggies.
P.S. We’re hosting an Android Hack Night during Google IO this year; come by and chat us up if you want to learn more about what we did.
More: MobileBeat 2016 is focused on the paradigm shift from apps to AI, messaging, and chatbots. Don't miss this opportunity: July 12 and 13 in San Francisco.