In the previous blog post, we covered our story of migrating to AWS RDS using AWS Database Migration Service (DMS), a complex initiative with multiple issues we had to address.
Nevertheless, almost all of those migrations could be generalized to the same strategy - in the end, we found a way to use DMS that works well, even for mid-size databases.
One database, though, required some extra preparation - the 11 TB giant (10.9 TB, to be exact). Despite all the steps we took, it was not possible to migrate it with an AWS DMS full load within an acceptable time, even when applying parallel load. In that case, we had to develop our own custom migration script, which turned out to be almost 20 times faster than AWS DMS.
Even more surprising is that the database shrunk to 3.1% of its original size after the migration!
Let's review all the steps we took to prepare this giant database for the migration and what exactly our custom migration script looked like.
Before we get into the details, let's start by explaining how we got to the point where the database was so massive.
The primary culprits were two tables (and their huge indexes) that contributed approximately 90% of the total size of the database. One of them was an audit trail (paper trail versions, to be exact), and the second one was more domain-specific to short-term rentals: a pre-computed cache of prices for properties under various conditions, so that they don't need to be computed on the fly each time and can be easily distributed to other services.
In both cases, the origin of the problem is the same, just with a slightly different flavor—trying to store as much as possible in the Postgres database without defining any reasonable retention policy.
The tricky part is that we couldn't simply delete batches of records after a certain period defined by a retention policy. We had to keep some paper trail versions forever. And for the computed cache, we still had to preserve many computation results for an extended period to debug potential problems.
That doesn't mean we had to keep them in the Postgres database, though - they could live in cheaper storage such as AWS S3. Pulling data back on demand (and removing it once no longer needed) is an alternative that may take some extra time to develop, but it's much more efficient.
This is what we did, to a considerable extent, for the records representing pre-computed prices: shortly after they were no longer applicable, we started archiving them by pushing them to S3 and deleting them from the Postgres database, pulling the data back on demand when debugging was needed (and deleting it again when no longer required).
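To give an idea of what such an archival flow can look like, here is a minimal sketch (the PriceResult model, bucket name, and batch size below are illustrative, not our exact production code):

```ruby
# Minimal sketch - PriceResult and the bucket name are illustrative.
require "aws-sdk-s3"
require "json"

class PriceResultArchiver
  ARCHIVE_BUCKET = "price-results-archive" # hypothetical bucket

  def self.archive(price_results)
    s3 = Aws::S3::Client.new

    price_results.each_slice(1_000) do |batch|
      key = "price_results/#{batch.first.id}-#{batch.last.id}.json"

      # Keep the raw attributes in S3 so they can be pulled back on demand.
      s3.put_object(
        bucket: ARCHIVE_BUCKET,
        key: key,
        body: batch.map(&:attributes).to_json
      )

      # Delete from Postgres only once the upload has succeeded.
      PriceResult.where(id: batch.map(&:id)).delete_all
    end
  end
end
```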
We also applied a retention policy for paper trail versions and archived the records by uploading them to S3 and deleting them from the Postgres database. However, we also decided to split the giant unpartitioned table into several tables (one generic catch-all/default table and a few model-specific tables for the most important models), essentially introducing partitioning by model type. Due to that split/partitioning, we temporarily increased the total size of the database. Still, we could ignore the original table during the migration and, at the same time, make the migration via AWS DMS faster by simplifying parallelization during the full load.
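Such a split can be expressed with PaperTrail's support for custom version classes; the sketch below is only an illustration (Booking and Listing are example models, and the exact table layout we used differed):

```ruby
# Illustrative sketch - BookingVersion/Booking/Listing are example names.
# A dedicated versions table for one of the most important models...
class BookingVersion < PaperTrail::Version
  self.table_name = :booking_versions
end

class Booking < ApplicationRecord
  has_paper_trail versions: { class_name: "BookingVersion" }
end

# ...while other models keep writing to the generic catch-all `versions` table.
class Listing < ApplicationRecord
  has_paper_trail
end
```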
The interesting realization is that the majority of the data we stored were historical records, which are not critical business data.
Overall, we deleted a massive amount of records from the Postgres database. However, the database size didn't change at all. Not even by a single gigabyte!
What happened?
If you are in a similar situation, you will likely have a big issue to solve. This is due to a few reasons:

- deleting records in Postgres does not reduce the table size by itself - the freed space is only marked as reusable after a vacuum,
- actually shrinking the tables requires rewriting them, which takes extra effort (and locks or extra tooling),
- even when Postgres does return the space, block storage such as AWS EBS cannot be shrunk, so you keep paying for the already allocated volume.

Let's take a closer look at each of these issues.
It is essential to remember that deleting records in a Postgres database does nothing to reduce the table size - it only makes the records no longer visible. Running VACUUM afterwards marks the internal tuples from the deleted rows as reusable in the future. So it does slow down the table's growth (as part of the already allocated storage becomes reusable), yet we can't expect the size to go down.
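A quick way to observe this is to compare live vs. dead tuples and the on-disk size before and after a plain VACUUM, e.g., from a Rails console (a minimal sketch; price_results is an example table name):

```ruby
# Illustrative sketch - replace "price_results" with the table you are inspecting.
stats = ActiveRecord::Base.connection.select_one(<<~SQL)
  SELECT n_live_tup,
         n_dead_tup,
         pg_size_pretty(pg_total_relation_size('price_results')) AS total_size
  FROM pg_stat_user_tables
  WHERE relname = 'price_results'
SQL

# After a plain VACUUM, n_dead_tup drops (the tuples become reusable),
# but total_size stays the same - the space is not returned to the OS.
ActiveRecord::Base.connection.execute("VACUUM price_results")
```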
Nevertheless, there are multiple ways to shrink the table size:

- running VACUUM FULL, which rewrites the table into new data files but holds an exclusive lock for the whole operation,
- using an online rebuild tool such as pg_repack, which achieves a similar result without blocking writes for long,
- recreating the table (or the whole database) from scratch, e.g., via dump and restore or by migrating to a new cluster.
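For completeness, the most basic of these options looks like this from a Rails console (a sketch only - VACUUM FULL locks the table for the entire rewrite, which is why an online tool like pg_repack is usually preferable on a busy production database):

```ruby
# Illustrative sketch - "price_results" is an example table name.
# VACUUM FULL rewrites the table into new files and returns the freed space
# to the filesystem, but it holds an exclusive lock while doing so.
ActiveRecord::Base.connection.execute("VACUUM FULL price_results")
```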
While we have several solutions to the problem, the sad reality is that if we use block-storage services such as AWS EBS, we will still pay the same price for the allocated storage, which cannot be deallocated.
If the price is an issue, the options we could consider at this point would be moving data to a smaller EBS volume or migrating to a new cluster. We went with the second option as it naturally fit our plan to move from self-managed clusters to the AWS-managed service (AWS RDS).
After all the hard work of preparing the database for migration, we ran a test benchmark to see how long it would take. Initially, it looked very promising - almost all the tables migrated in approximately 3 hours. All except one - the table that stored the results of computed prices. After close to 5.5 hours, we gave up, as this was an unacceptable time for a single table. Five hours would have been fine for the entire downtime window, including index re-creation - not for migrating one table on top of the 3 hours it took to migrate the rest of the database.
That was a big surprise - the original table was huge (2.6 TB), but we deleted over 90% of the records, so we expected the migration to be really fast. Especially with the parallel load!
However, most of the storage space was occupied by LOBs - the arrays of decimals (prices) - which could have been the primary cause of the slowness. Most likely, the massive bloat left after deleting such a large part of the table didn't help either. And probably the most crucial reason is that there is a limit to how much the process can be parallelized: the maximum number of full-load subtasks allowed by AWS DMS is 49.
At that point, we had to figure out a custom solution, as we didn't want to change the migration strategy from a plain full load to a combination of full load and CDC. The good news was that we already had something that could be very useful in designing a custom solution - a bulk insert (using activerecord-import) of archived records, which had proved fast enough to restore a significant number of records. Also, nothing was preventing us from using a much higher degree of parallelization than DMS allows. This could be our solution.
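For context, that restore path is essentially a bulk insert along these lines (a simplified sketch; the model, bucket, and key layout are illustrative):

```ruby
# Illustrative sketch - PriceResult and the S3 key layout are example names.
require "aws-sdk-s3"
require "json"

def restore_archived_batch(key)
  body = Aws::S3::Client.new
                        .get_object(bucket: "price-results-archive", key: key)
                        .body.read
  rows = JSON.parse(body)

  # activerecord-import inserts all rows in a single statement,
  # which is dramatically faster than creating records one by one.
  PriceResult.import(rows, validate: false)
end
```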
We had to make a couple of tweaks to reuse the functionality we had implemented for restoring archived records. Not only would the source of the data be different (a Postgres database instead of S3), but there were also other important considerations:
And this is how we did it:
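In essence, each Sidekiq job copied one contiguous range of IDs from the old cluster to the new one using a single bulk insert per batch. The sketch below conveys the idea only - the model names, the :source_db connection, and the batch arithmetic are illustrative, not the exact production code:

```ruby
# Illustrative sketch - SourceRecord/SourcePriceResult/PriceResult and the
# :source_db connection are example names, not the exact production setup.

# Models reading from the old (source) cluster via a separate connection,
# assuming a :source_db entry exists in database.yml.
class SourceRecord < ApplicationRecord
  self.abstract_class = true
  connects_to database: { writing: :source_db, reading: :source_db }
end

class SourcePriceResult < SourceRecord
  self.table_name = :price_results
end

# Each job copies one bucket of IDs from the source to the target cluster.
class PriceResultMigrationJob
  include Sidekiq::Job

  BATCH_SIZE = 10_000

  def perform(from_id, to_id)
    (from_id..to_id).step(BATCH_SIZE) do |start_id|
      end_id = [start_id + BATCH_SIZE - 1, to_id].min

      # Read a contiguous slice of rows from the old cluster...
      rows = SourcePriceResult.where(id: start_id..end_id).map(&:attributes)
      next if rows.empty?

      # ...and write it to the new cluster in a single bulk insert.
      PriceResult.import(rows, validate: false)
    end
  end
end
```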
Take a few minutes to analyze how exactly this works.
Running the entire process was limited to merely executing this code:
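Something along these lines - again, only a sketch, reusing the illustrative names and an example bucket size from above:

```ruby
# Illustrative sketch - enqueue one Sidekiq job per bucket of IDs.
BUCKET_SIZE = 1_000_000

max_id = SourcePriceResult.maximum(:id).to_i

(0..max_id).step(BUCKET_SIZE) do |from_id|
  PriceResultMigrationJob.perform_async(from_id, from_id + BUCKET_SIZE - 1)
end
```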
And it was really that simple to implement an alternative to AWS DMS! We ran a test benchmark, and it turned out that with comparable parallelization (50 Sidekiq workers), it took close to 45 minutes - perfectly fine for our needs! Even more interestingly, there was no significant increase in database load, even while running it against the production database actively processing its standard workload. The potential for parallelization was even greater, which we wanted to see in action during the final migration.
And this day finally came - facing the 11 TB giant. Fortunately, the migration went perfectly fine! The entire process took approximately 5 hours (the actual downtime). And there were even more things to celebrate!
The migration of the troublesome table took merely 18 minutes with 125 Sidekiq workers! We hadn't let AWS DMS run beyond 5.5 hours for this table, but even assuming that would have been its final load time, our custom migration script, which took a few hours to build, turned out to be almost 20x faster (18.3x, to be precise). And there was plenty of room to optimize it even further - for example, we could play with different bucket sizes or run the process with 200 Sidekiq workers. Furthermore, we don't know how long AWS DMS would actually have needed to finish - maybe 6 or 7 hours. It would not be impossible, then, for a custom process to be 30x or even 40x faster.
And there was one more thing - the database size after the migration turned out to be merely 347 GB, which is 3.1% of the original size! Reducing all the bloat definitely paid off.
Ultimately, we managed to migrate the colossus, which started as an 11 TB giant and became a mid-size and well-maintained database of 347 GB (3.1% of the original size). That was a challenging initiative, yet thanks to all the steps we took, we managed to shrink it massively and migrate it during a reasonable downtime window. However, it wouldn't have been possible without our custom migration script, which we used together with the AWS DMS full load.
Stay tuned for the next part, as we will provide a solution for making AWS RDS Proxy work with Rails applications, which was not that trivial.