When PostgreSQL Did Everything
In the early stages of Sendwithus, we used one Heroku Postgres instance for everything. It was very fast to set up, easy to maintain, and reasonably priced.
We could backup, migrate, and upgrade our entire database within minutes. We could run ad-hoc SQL queries and expect near real-time performance. We could also quickly create new features and produce custom reports for our amazing customers.
As our product evolved and our company grew, so did our database.
We started handling millions of API calls per day — sometimes spiking to more than 20k requests/min. Sendwithus tracks a lot of data, and our one-size-fits-all PostgreSQL solution started falling behind.
Our tables contained hundreds of millions of rows. Our indexes were growing exponentially and no longer fit in memory. Our write throughput was constantly bumping up against the maximum for a single database instance.
And our monthly Heroku spend was getting large. Really large. In short, we’d outgrown Heroku Postgres and needed a new solution.
What do we actually need?
We called an emergency team meeting and laid out our database requirements:
1) Atomic and consistent writes. To power our current feature set, we needed strong data consistency. Enough said.
2) High write throughput. The bulk of our problems were related to write throughput. We needed enough capacity to 100x our current write throughput — that would solve the problem for at least twelve months.
3) Lower cost. The new solution needed to cost a fraction of what we were already paying. Going forward, cost should scale linearly with growth.
4) Hosted. Our team builds products — we’re not database experts, and at the time, we didn’t have the time or money to hire one. We decided the best course was to focus on what we’re good at and pay others to solve the problems they’re good at solving. (This is one of the reasons we chose Heroku Postgres in the first place.)
Experimentation and Prototyping
We were running short on time, so we did some time-boxed research and experimentation with a few different solutions.
HBase and Hadoop
Hadoop was a popular suggestion, especially given our high-write throughput requirements. We spun up a (mostly) working cluster and had data flowing into it within a few hours.
It was hard, though, to predict the limitations of a given cluster, and running queries to power our UI was too cumbersome. We considered building a Redis write-through cache for fast individual reads, but decided that would be too complex.
We also failed to find a Hadoop provider we really liked. We took a long look at TreasureData, who were offering a managed Hadoop solution at the time, but their solution felt unproven and expensive for our needs.
Cassandra
Even after a few hours, we never got a Cassandra cluster working. I still don’t quite understand how its wide-column schemas work.
We also couldn’t find a proven hosted solution, although we chatted briefly with the team at InstaClustr, who was in early beta at the time.
MongoDB
Ultimately, MongoDB was unable to meet our throughput requirements. There are a few great MongoDB providers available, and today we use a Compose MongoDB instance for some of our non-critical BI data.
CitusData
We were really impressed with what CitusData had to offer, and spent a lot of time with their team discussing our use case and our PostgreSQL problems. We also chatted with Heap Analytics, who use CitusData to perform large-scale data analysis.
If you’re looking for scalable PostgreSQL, we strongly recommend CitusData. Ultimately, we decided that our team lacked the PostgreSQL expertise and DBA experience to fully leverage and maintain their technology; we needed a solution with known, well-documented limitations that we could set and forget.
DynamoDB
As implied by the title of this post, we ended up choosing DynamoDB. Here’s why…
Within a couple of hours, we were able to achieve 10x our current write throughput. Even better, we were seeing <5ms latency on all writes regardless of load. We were confident we would be able to meet our throughput requirements in production.
And of course, it’s hosted and provides a well-maintained API. That meant we wouldn’t have to become DBAs or SysAdmins to use it in production.
The price also made sense. The bulk of the DynamoDB cost is based on throughput — not storage. You pay for write/read throughput capacity, meaning our cost would scale linearly with throughput and we could easily store terabytes of data without worrying about cost.
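That pricing model is easy to sanity-check with back-of-the-envelope arithmetic. The hourly rates below are hypothetical placeholders, not AWS’s actual prices; the point is only that cost tracks provisioned throughput, not storage:

```python
# Back-of-the-envelope provisioned-throughput cost model.
# The hourly rates are assumed placeholders for illustration only.
WCU_RATE = 0.00065   # $ per write capacity unit per hour (assumed)
RCU_RATE = 0.00013   # $ per read capacity unit per hour (assumed)
HOURS_PER_MONTH = 730

def monthly_cost(write_units, read_units):
    # Cost depends only on provisioned throughput, not on stored bytes.
    return (write_units * WCU_RATE + read_units * RCU_RATE) * HOURS_PER_MONTH
```

Double the provisioned throughput and the bill doubles; store ten times more data at the same throughput and (storage aside) the bill barely moves. That linearity was exactly what we wanted.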
We were impressed. Further load testing sealed the deal: at our current throughput, DynamoDB would cut our costs by more than 60%. We had a winner.
The Trade-offs
DynamoDB is a NoSQL datastore, and switching from PostgreSQL came with some disadvantages.
First, we had to be careful with our schema. DynamoDB provides one primary index per table, and you must choose it wisely. (Technically you can have N indexes, but you pay N times your write throughput, so the cost is effectively the same as having multiple tables.)
After a couple of days of testing, we had a workable schema. This post discusses the details of our process and where we landed.
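To make the single-index constraint concrete, here is a hedged sketch of what such a schema might look like as a CreateTable request payload. The table and attribute names are hypothetical, not our actual schema:

```python
# Hypothetical DynamoDB CreateTable payload illustrating a single
# primary index: a partition (hash) key plus a sort (range) key.
# Table and attribute names are invented for illustration.
create_table_request = {
    "TableName": "email_events",
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "sent_at", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},  # partition key
        {"AttributeName": "sent_at", "KeyType": "RANGE"},     # sort key
    ],
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 500,
    },
}
```

With boto3, a dict like this could be passed as `client.create_table(**create_table_request)`. The catch is that every read path in the app then has to be expressible in terms of the partition key (plus an optional sort-key range).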
At this point, the trade-offs became apparent.
For starters, we could no longer run ad-hoc queries. Any query not backed by the primary index falls back to a full table scan, which reads (and bills for) every row. Since cost is based on throughput, any non-indexed query became prohibitively expensive.
The other trade-off was the maximum item size. At the time, DynamoDB items could not be larger than 64 KB (I believe the limit has since increased to 400 KB). Our records can sometimes exceed 1 MB, so a 64 KB cap was unacceptable.
However, because our data was so infrequently read, we could code around this by offloading large rows to S3 and linking to them from DynamoDB. It’s not pretty, but this case occurs infrequently enough that it was an acceptable hack.
Migrating Data to DynamoDB
Our migration plan was pretty simple:
Step 1. We deployed code to production that would write to both PostgreSQL and DynamoDB in parallel. We also ran a Python script to confirm that data written to and fetched from both sources was identical.
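The dual-write-plus-verify pattern can be sketched roughly like this; the two stores are stubbed as dicts, and the function names are hypothetical stand-ins rather than our actual code:

```python
# Sketch of writing to both databases in parallel, with a verification
# pass. Dicts stand in for the real PostgreSQL and DynamoDB clients.
postgres = {}
dynamodb = {}

def dual_write(key, record):
    postgres[key] = record
    dynamodb[key] = record  # in production, a failure here is logged, not fatal

def verify(keys):
    """Return keys whose copies diverge between the two stores."""
    return [k for k in keys if postgres.get(k) != dynamodb.get(k)]
```

Any key reported by `verify` points at a bug in the new write path, which is exactly what you want to find while PostgreSQL is still the source of truth.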
Step 2. Once we were satisfied with the data going into DynamoDB, we had to backfill historical data from PostgreSQL.
We did this using SQS and a Python script that shuttled data from one database into the other. We also ran our data-verification script on every row copied.
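The backfill worker might be sketched like this, with a deque standing in for SQS and the fetch/write/verify callables as hypothetical stand-ins for the real ones:

```python
from collections import deque

# Sketch of the backfill worker loop. A deque stands in for the SQS
# queue of primary keys; the three callables are hypothetical.
def backfill(queue, fetch_from_postgres, write_to_dynamo, records_match):
    migrated, failed = 0, []
    while queue:
        primary_key = queue.popleft()
        row = fetch_from_postgres(primary_key)
        write_to_dynamo(primary_key, row)
        if records_match(primary_key):  # re-read both sides and compare
            migrated += 1
        else:
            failed.append(primary_key)  # surface edge cases for a re-run
    return migrated, failed
```

Driving the copy from a queue made it trivial to stop, fix an edge case, and re-enqueue only the failed keys.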
It took about 48 hours to migrate everything, and cost ~$250 for the increased write-throughput. (For the record, we had to stop and restart this migration twice because of edge cases that were caught by our verification tool. +1 for testing!)
Step 3. While the data was backfilling, we updated our app to use the new DynamoDB tables. Ideally, our code would have been modular enough to make this a trivial step, but startups rarely operate ideally… 😉
Once the backfill was complete, we deployed production code, switching all read operations over to DynamoDB.
Step 4. We continued writing to both PostgreSQL and DynamoDB in parallel for another two weeks, watching closely for bugs or unexpected complications. Afterwards, we turned off the PostgreSQL writes and immediately saw app-wide performance gains.
All our database troubles had gone away.
We copied the old tables to S3 for a backup and then dropped them. This freed up nearly 750GB in our PostgreSQL database.
Step 5. Team-wide sigh of relief. Problem solved.
At time of writing, we had been using DynamoDB at scale for nearly nine months, and while we ran into a few unexpected pitfalls (more on that in a future post) we remained generally very happy with it.
For us, the biggest wins were that it’s hosted, reliable, and incredibly scalable. We could focus on building a product our customers love and not worry about unexpected infrastructure problems.
On top of that, our costs were considerably less than what we were paying Heroku for our “very large” PostgreSQL database plan.
We ended up with more than a few terabytes of data in DynamoDB, and started to play with Elastic MapReduce and Redshift to power some more interesting queries.