Starting on Monday, February 24 at about 1 PM, AWeber began experiencing large, sustained, and repeated DDoS attacks that completely disabled all aspects of our service for extended periods of time. Events like this always expose cracks the victim didn't know existed, and force prioritization of issues they'd planned to address anyway. Our team was no exception: we had big plans for 2014 to address many things that might have lessened the immediate impact of the recent attacks, and we've also prioritized and executed on things that, frankly, weren't on our radar.
We hope this post, and future posts that'll take a deeper dive into the specific technical aspects, will help others learn from our experiences, good and bad.
AWeber was just one of a rather impressive list of sites attacked over the past week, and the attacks still continue on some sites. In social media channels, it's been referred to as a Distributed Denial of Service attack, although it is more correctly called a Distributed Reflective Denial of Service (DRDoS) attack. In this kind of attack, a malicious agent spoofs the source address of their requests to be that of the target site and sends large numbers of those requests to third-party servers, causing a correspondingly large number of responses to flood the target, making it impossible for the target to properly handle legitimate traffic. In the specific attacks we've seen recently (including the one that affected our site), the attack is also amplified: it utilized NTP servers configured to respond to a specific set of request types where the response generated by the NTP server is upwards of 500 times the size of the request. That makes this type of attack far more efficient for an attacker to execute than one where the response sizes are similar to the request size.
You can read more about UDP-based protocol amplification attacks at US-CERT.
You can read more about how NTP is utilized in this type of attack specifically at the cert.org knowledge base.
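To see why that amplification factor matters so much, here's a quick back-of-the-envelope calculation using the figures from our own attack (roughly 500x amplification, and a flood in the 70-75 Gbps range; the numbers are illustrative, not precise measurements):

```python
# Back-of-the-envelope amplification math for an NTP reflection attack.
# Figures are illustrative: a ~500x amplification factor and a 75 Gbps
# flood, roughly what the attacks described above involved.

AMPLIFICATION_FACTOR = 500   # response bytes generated per request byte
TARGET_FLOOD_GBPS = 75.0     # traffic arriving at the victim

# Bandwidth the attacker must actually generate to produce the flood:
attacker_gbps = TARGET_FLOOD_GBPS / AMPLIFICATION_FACTOR
attacker_mbps = attacker_gbps * 1000

print(f"Attacker needs only ~{attacker_mbps:.0f} Mbps "
      f"to generate a {TARGET_FLOOD_GBPS:.0f} Gbps flood")
```

In other words, a modest connection on the attacker's side is enough to saturate even multiple 10 Gbps uplinks on the victim's side, which is exactly what makes reflection-plus-amplification so popular.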
Nothing that AWeber or any other site under attack is seeing is magical. It's just massive. The traffic patterns are completely predictable, completely consistent, and obviously malicious, or at least obviously invalid. We captured 50 packets off our router and immediately knew what was going on. Our attack was even more consistent than those of others who've come under attack: the first hit we saw on Monday was 100% source port 123 UDP, and 100% of it was bound for port 80 on a single IP. We blocked it in all of about a minute. But why did we see this traffic at all, and why do some sites see this traffic while others who are targeted do not experience service interruptions?
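That signature (UDP packets with source port 123 aimed at port 80 on one IP) is trivial to match programmatically. A rough sketch of the classification logic in Python follows; the packet tuples are simplified stand-ins for illustration, not the output of any real capture library:

```python
# Sketch of the traffic signature described above: reflected NTP
# responses arrive as UDP with source port 123 (NTP), aimed at a web
# port on the victim. The (proto, src_port, dst_port) tuples below are
# simplified stand-ins, not real parsed packets.

NTP_PORT = 123

def looks_like_ntp_reflection(proto, src_port, dst_port):
    """Flag packets matching the reflected-NTP pattern we observed."""
    return proto == "udp" and src_port == NTP_PORT and dst_port == 80

packets = [
    ("udp", 123, 80),    # reflected NTP response to a web port: attack
    ("tcp", 54321, 80),  # ordinary web request: legitimate
    ("udp", 123, 123),   # NTP reply to our own NTP client: fine
]

for pkt in packets:
    print(pkt, "ATTACK" if looks_like_ntp_reflection(*pkt) else "ok")
```

The catch, of course, is that matching the traffic is the easy part; the hard part is having enough pipe to receive it all before you drop it, which is what the rest of this post is about.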
Maybe most importantly, what can *you* do to help avoid this kind of situation?
There are a lot of answers to that question, and in the interest of sharing whatever we can, let's go through at least a few important ones.
Have A Specific Attack Mitigation Plan
AWeber has averted many DoS and DDoS attacks in the past, as probably every site has at some point or another. Unfortunately, the sheer size of the recent attacks was rare to the point of actually being historic, for our service as well as many others: while past DDoS attacks against us peaked at a couple hundred Mbps, the most recent ones reached 70-75 Gbps. As we've seen over the past week, lots of sites, and even upstream providers, were ill-prepared for attacks of that size. If your network architecture is already set up such that you can mitigate these attacks on your own, you're probably sized similarly to a provider of network services or a DDoS mitigation service, especially considering that it was only this past December that smaller attacks (peaking at around 60 Gbps) brought down JPMorgan and Bank of America.
Most networks aren't prepared for attacks of this size, and that's kinda sad, since the peak size of these attacks has regularly doubled or tripled over the past couple of years.
For the rest of us (the other 99% of internet-based businesses), there are a few options you can move on quickly in the event you wake up to something like this. Be careful what you choose, though: a search for "DDoS Mitigation" on Google returns lots of results, and the potential avenues for mitigation vary wildly in how they're implemented. Depending on your needs and current situation, some may not be all that useful.
Two popular types of DDoS mitigation service are a hardware-based solution that gets installed at your site, and a proxy-based solution that forwards 'scrubbed' traffic to your site. Both of these were problematic from our viewpoint. We already have hardware that mitigates DDoS attacks and it did its job for the traffic that reached our network, but the attacks were so massive that they overwhelmed our internet uplinks and the uplinks of our datacenter provider. The proxy solutions we saw looked like they could be effective, but in the end we'd still be exposing other IPs in our infrastructure that could be easily discovered by an attacker and directly targeted, bypassing the proxy altogether. There are also services that'll serve a cached version of your site if you're down, but that wasn't going to work for us, since large parts of our site (customer control panel and dashboard, link shorteners, etc) aren't very useful as stale, static pages.
Instead we went with a routed solution offered by Prolexic (who just recently became a part of Akamai). The solution basically has Prolexic advertising our BGP routes on our behalf, and communicating with our site via GRE tunnels. In other words, there are no directly-targetable IP addresses associated with the site at all that aren't filtered through Prolexic's service. It was quicker to get on Prolexic's proxy service, which we were able to do same-day, but if you have more time, getting on the routed solution or one of their hardware-based (or hardware-aided) options should get you where you want to go.
It's possible that your architecture won't allow you to go with any of these solutions, or that you might choose to rely instead on agreements with upstream bandwidth providers that run pipes to your colo. This is doable, but the terms of your contract should be very explicit about what the provider is accountable for, what size attack they're capable of mitigating without a resulting interruption of your service, and what policies govern their ability to null route your network (think they can't? They can). Our own upstream provider not only null routed our network, they also demonstrably fell short in their understanding of how a simple DDoS attack works, didn't respond to repeated requests to implement a one-line fix to protect us and the rest of their customers, and told us their policy in situations like this was, in fact, to null route our entire network until we "secured our systems". No, seriously. They said that. Needless to say, after being in the same facility for 13 years and through 3 different acquisitions, we're changing providers very soon and bringing in additional third-party bandwidth in the meantime as a stopgap measure.
How Big Is Your Basket?
AWeber's site is the place our customers go to set up new lists and email marketing campaigns, and to see how those campaigns are doing. Our service also includes serving forms to our customers' sites and collecting data from those forms to add new subscribers, as well as tracking opens and clicks for messages sent from our service, among other things that are independent of the site but internet-based nonetheless. None of these things need to live in-house, and having them in-house just equates to more risk in scenarios like this. With those components moved out to the cloud, a good number of users of our service wouldn't have known we were down at all, as a great many customers calling in, tweeting, and posting on Facebook confirmed. So, if your site primarily exists as a tool to set up and measure a service, separate those concerns and make sure your service works if your site is down (and then make sure your site is redundant too, so nothing ever goes completely away).
To move things out of our network quickly, we wrote code that was basically the leanest dev approach you could ever imagine: start with just having the ability to collect data. So, when someone clicks 'submit' on a form, and our colo is completely unavailable, those requests can go to a service running in AWS that'll collect all of the form data in a way that it can be replayed against our internal system at a later time. Of course, we'll iterate on that to make it a fully-fledged service that's integrated with (but not dependent on) the rest of our in-house infrastructure. We've done that for two services in the past few days, and are working on more.
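The collect-now, replay-later idea can be sketched roughly like this. The class and method names here are ours for illustration only, and storage is an in-memory list; the real service persists submissions durably in AWS so they survive restarts:

```python
import time

# Sketch of the collect-now, replay-later approach described above.
# Names are hypothetical and storage is an in-memory list; a real
# service would persist submissions durably (e.g. S3 or a queue).

class FormCollector:
    def __init__(self):
        self._log = []

    def collect(self, form_data):
        """Accept a form submission even while the main system is down."""
        self._log.append({"received_at": time.time(), "data": form_data})

    def replay(self, handler):
        """Replay stored submissions against the real system, in order."""
        replayed = 0
        for entry in self._log:
            handler(entry["data"])
            replayed += 1
        self._log.clear()
        return replayed

collector = FormCollector()
collector.collect({"email": "subscriber@example.com", "list": "news"})
collector.collect({"email": "reader@example.com", "list": "blog"})

# Once the internal system is reachable again, drain the backlog:
subscribers = []
count = collector.replay(subscribers.append)
print(f"replayed {count} submissions")  # replayed 2 submissions
```

The design choice worth noting is that the collector knows nothing about the internal system; it just records submissions with timestamps so they can be replayed in order later, which is what lets it run in AWS entirely independent of the colo.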
As for forms, they're now completely hosted in a CDN, and we're already getting redundant CDNs in place for those assets. We had already moved lots of our static assets away from our in-house servers, but forms was the biggest, and we'd delayed moving them out until we had more confidence serving less-critical assets from the new service. So far, forms in the CDN are working, though we're attacking a couple of isolated issues that are bound to crop up when you move in such a hurry. We're on it.
Aside from code you write yourself, there's also infrastructure to consider. In our case, for example, we hosted our own DNS. This means that, even if our only issue was that DNS was down, once the TTLs on the DNS records expired, nobody would be able to have forms delivered to their site. For folks visiting our customers' sites who hadn't been to a site with an AWeber form in the past, the whole site might actually hang even before TTLs expired.
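To make the TTL behavior concrete, here's a toy model of a resolver cache. The hostname and numbers are made up, and real resolvers are far more involved, but it illustrates why everything keeps working until the TTLs run out:

```python
# Toy model of a resolver cache, illustrating the TTL point above:
# while a cached record's TTL is unexpired, visitors' resolvers keep
# answering from cache even though our DNS servers are unreachable;
# once the TTL expires, lookups start failing. This is a simplified
# sketch, not how any particular resolver is actually implemented.

class ToyResolverCache:
    def __init__(self):
        self._cache = {}  # name -> (address, expiry_time)

    def store(self, name, address, ttl, now):
        """Cache a record; it stays valid until now + ttl."""
        self._cache[name] = (address, now + ttl)

    def lookup(self, name, now):
        """Return the cached address, or None if the record is missing
        or expired (forcing a re-query, which fails while DNS is down)."""
        entry = self._cache.get(name)
        if entry and now < entry[1]:
            return entry[0]
        return None

cache = ToyResolverCache()
cache.store("forms.example.com", "192.0.2.10", ttl=300, now=0)

print(cache.lookup("forms.example.com", now=100))  # 192.0.2.10 (cache warm)
print(cache.lookup("forms.example.com", now=400))  # None (TTL expired)
```

The second failure mode mentioned above is worse: a visitor whose resolver has nothing cached gets no answer at all, and a page waiting on that lookup can hang until the resolver gives up.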
Our solution was, again, something we'd planned to do this year anyway: DNS is now hosted externally at Dyn, whose responsiveness, support, service, and general extra attention in light of what we were going through were incredible. It was almost like we weren't dealing with a traditional vendor. All service from all vendors should be half as good as what Dyn provided us.
How Solid Is Your Team and Culture?
Over the past week, we had a rotating staff of about 8 engineers, the CTO (me), the CEO, and a project manager working around the clock, dealing with our colo provider (and other colo providers), various bandwidth providers, Prolexic, Dyn, and others, and making huge changes to our network and service architecture. One of those days started before 8 AM and didn't end until 6:30 AM the next day. We had a lot of decisions to make this week, and a ton to execute on, both technically and non-technically (I should note that our social media presence has never been automated, and was also pretty much non-stop around the clock). I cannot possibly imagine getting through it without such an amazing team of individuals, whose actions and ability to execute are supported by what might be the most important part of our culture at AWeber: Trust.
As CTO, I didn't personally set up or even suggest Prolexic. The suggestion came from the team, we built consensus on it in under 20 minutes, and the wheels were moving immediately. I didn't personally get on the phone with anyone or micromanage negotiations; I told them to act like AWeber is theirs, because it is. They were entrusted to make the best call they knew how to make on behalf of our customers and our team, and to overcommunicate. That's exactly what they did. If you look around your office and can't possibly imagine delegating that level of responsibility to the people who ultimately run your business, the impact of that is far more severe and long-lasting than any attack on your systems.
More To Come
Over the coming weeks, we'll post more information here about specific steps we've taken, how we're managing things, and the technologies and vendors that are helping us out. We're fans of transparency here, so if you have questions and we can find some way to answer them, certainly let us know.