In the previous blog post, the intro to our database migration series, we promised to tell the story of our challenges with AWS Database Migration Service, which turned out to be far from the sunshine and rainbows it might seem to be after skimming through the documentation.
Once we started using it, the experience fell significantly short of expectations, with plenty of errors and unexpected surprises. Nevertheless, we made the migration possible and efficient by adding extra custom flows outside DMS.
If you already have data in any storage system (within or outside AWS) and want to move it to an Amazon-managed service, Database Migration Service is the way to go. That means it's not just a tool for migrating from self-managed PostgreSQL to AWS RDS, as we used it - that's only one of the possible paths. In fact, AWS DMS supports an impressive list of sources and targets, including non-obvious ones such as Redshift, S3, or Neptune.
For example, you can use AWS DMS to migrate data from PostgreSQL to S3, which already gives an idea of how powerful the service can be.
Essentially, we can have two types of migrations:
- homogeneous - between the same database engines (e.g., PostgreSQL to PostgreSQL)
- heterogeneous - between different database engines (e.g., Oracle to PostgreSQL)
In our case, that was a homogeneous migration (from PostgreSQL to PostgreSQL), which sounds way simpler than a heterogeneous one (which is likely to require tricky schema conversions, among other things).
When performing the migration via AWS DMS, we also need a middleman between the source and target databases, responsible for reading data from the source and loading it into the target.
There are two ways to work with that middleman:
- DMS Serverless, where AWS provisions and manages the replication capacity for us
- a replication instance that we provision and manage ourselves
On top of that, we have three types of homogeneous PostgreSQL migrations:
- Full Load - a one-time copy of the existing data
- Full Load + CDC (change data capture) - an initial copy followed by ongoing replication of changes
- CDC only - ongoing replication of changes without the initial copy
The choice here comes down mainly to the trade-off between simplicity and downtime. If you can tolerate some downtime (depending on the database size and the type of data stored), Full Load sounds like the preferable option, as fewer things can go wrong - it's simpler. If that's not an option, using CDC (with or without Full Load) is the only way to achieve near-zero downtime. However, the added complexity might be a big concern.
The initial plan for migration and the first round of apps
Our initial plan assumed that for applications where we can afford downtime outside of business hours (like 3 or 4 AM UTC+1), we would proceed with the Full Load approach, and for applications where we cannot tolerate the downtime required to perform the entire migration, we would likely go with Full Load and CDC.
DMS Serverless also looked appealing as it would remove the overhead of managing the replication instance.
We tested that approach with staging apps, and all migrations were smooth - there were no errors, and the process was fast. The databases were tiny, which helped with the speed, but in the end, the entire process looked promising and frictionless.
So, we started with the migration of the production apps. The first few small ones were relatively fast, yet substantially slower than staging. That made sense, as the databases were significantly larger - no longer a few hundred megabytes but a few gigabytes or tens of gigabytes.
Then, we got into far bigger databases, reaching over 100 GB. And this is where the severe issues began.
Before migrating any database, it's necessary to perform test migrations to get an idea of whether it will work at all and how much time it might take. For big databases, it's even more critical to benchmark the process a few times to estimate the duration reliably. So, we did exactly that for the bigger databases and achieved promising and consistent results. The migrations were supposed to take quite a while, but that was still acceptable, so we proceeded and scheduled the longer migrations.
Then, we ran into the first significant issue. In the earlier benchmarks, we consistently achieved a migration time below 1 hour while the database was under normal load. And then, out of nowhere, with no traffic on the source database, the migration was taking almost 2 hours with no sign of being close to finishing! Based on the expected size of the target database, which we knew from the test migrations, there was still a long way to go.
Sadly, we had to stop the migration process, bring the app back up and running on the original database cluster, and think about what went wrong. Overall, waking up after 5 AM and the extended downtime were for nothing. We tried to replicate the issue with another test migration, but it worked just fine, so we treated this as an isolated incident and committed to performing another migration the next day, although for a different application, as we wanted to avoid extended downtime two days in a row.
However, the next day wasn't any better. Even though the process took more or less as long as expected based on the tests, the database shrank from 267 GB to... 5332 MB! We expected some bloat to be removed along the way, but bloat couldn't possibly account for the majority of the size. And it was a very different result from what we had achieved during the test runs.
The migration status inside the AWS DMS UI was Load Complete, but after checking the number of records in the tables, it turned out they were all empty!
That was another failed migration, the second in a row, with no apparent reason for the failure.
At that point, we concluded that the Serverless approach was not an option. It proved unreliable for bigger databases, and the lack of control over the process became an issue.
Fortunately, we had one more option left for the Full Load strategy - doing it via a Replication Instance. It looked more complex, but it also gave us more control over the process.
We attempted the first migration, and wow, that was fast! It was way faster than Serverless, and all the records were there! That looked very promising. Except for the fact that all secondary indexes were missing! And foreign key constraints... And other constraints... And the sequences for generating numeric IDs! Literally everything was missing except for the actual rows...
We double-checked the configuration, and there was nothing about excluding constraints. Also, the config for LOBs was correct - a setting one needs to be very careful about, as AWS DMS makes it easy for many data types to either not be migrated at all or to be truncated beyond a specific limit. And apparently, it's not only about JSON or array types but also text types!
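For context, LOB behavior is controlled in the replication task settings under TargetMetadata. Below is a rough sketch of the relevant fragment; the values are illustrative ones for a "don't truncate anything" setup, not our exact configuration:

```bash
# Sketch only: the LOB-related fragment of AWS DMS task settings (values are
# illustrative, not the exact settings used in this migration).
# FullLobMode migrates LOBs in chunks of LobChunkSize (KB); with LimitedSizeLobMode,
# anything above LobMaxSize (KB) is silently truncated.
cat > task-settings.json <<'JSON'
{
  "TargetMetadata": {
    "SupportLobs": true,
    "FullLobMode": true,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": false,
    "LobMaxSize": 0
  }
}
JSON
```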
We re-read the documentation to see what happens to indexes during the migration, and we found conflicting information, especially given our previous Serverless migrations, which had migrated the indexes and constraints without any issues.
Let's see what AWS DMS documentation says about indexes:
Anyway, at least we found the reason why the migration was so fast. We also had to find a way to recreate all the missing constraints, indexes, and other objects.
Fortunately, native tools helped us a lot. To get all the indexes and constraints, we used pg_dump with the --section=post-data option, then inlined the content of the dump and ran it directly from the Postgres console to have better visibility and control over the process. To bring back the sequences, we used this script. It is very odd that AWS DMS does not have any option to handle this - it's capable of migrating Oracle to Neptune, yet it can't smoothly handle indexes in the Replication Instance strategy, even though it's a trivial operation.
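For reference, the pg_dump part boils down to something like the snippet below; the connection strings and the users table are placeholders, and the sequence reset only shows the idea behind the linked script for a single table rather than the script itself:

```bash
# Dump only the post-data section (indexes, constraints, triggers) from the source.
pg_dump "$SOURCE_DATABASE_URL" --section=post-data -f post_data.sql

# Review post_data.sql, then apply it on the target database.
psql "$TARGET_DATABASE_URL" -f post_data.sql

# Reset a single sequence to the current maximum of its column
# (the linked script does the equivalent for every sequence in the database).
psql "$TARGET_DATABASE_URL" -c \
  "SELECT setval(pg_get_serial_sequence('users', 'id'), COALESCE(MAX(id), 1)) FROM users;"
```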
After recreating all these items, the state of the application database looked correct according to our post-migration check script (which will be shared later)—all the indexes and constraints were there, and the record counts matched for all tables.
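To give an idea of what such a check involves (this is a simplified sketch, not the actual script we'll share later), comparing the outputs of queries like these on the source and target already catches both failure modes we hit, missing rows and missing indexes:

```sql
-- Estimated row counts per table (run on both databases and diff the results;
-- use SELECT count(*) per table if you need exact numbers).
SELECT c.relname AS table_name, c.reltuples::bigint AS estimated_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY c.relname;

-- Number of indexes per table.
SELECT tablename, count(*) AS index_count
FROM pg_indexes
WHERE schemaname = 'public'
GROUP BY tablename
ORDER BY tablename;
```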
At that point, we concluded that we were ready for another migration. And it looked smooth this time! It went fast, and the state of the source and target databases looked correct. We could bring back the application using a new database.
It looked just perfect! At least until we started receiving very unexpected errors from Sentry: PG::StringDataRightTruncation: ERROR: value too long for type character varying(8000) (ActiveRecord::ValueTooLong). Why did it stop working correctly after the migration? And where is this 8000 number coming from? Did AWS DMS convert the schema without saying anything about this?
We quickly modified the database schema to remove the limit, and everything returned to normal. However, we had to find out what had happened.
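The schema fix itself is a one-liner per column; in PostgreSQL, dropping or widening a varchar length limit is a metadata-only change, so it's quick even on big tables (the table and column names below are made up for the example):

```sql
-- Widen the column back to an unbounded type.
ALTER TABLE comments ALTER COLUMN body TYPE text;
-- Or keep character varying, just without the length limit:
ALTER TABLE comments ALTER COLUMN body TYPE character varying;
```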
Let's see what the documentation says about schema conversion: "AWS DMS doesn't perform schema or code conversion". That clearly explains what happened! Yet another case where the documentation does not reflect reality.
We couldn't find anything in the AWS DMS docs regarding the magic 8000 number for the character varying type. However, we found this - the docs for Qlik and the mapping between PostgreSQL types and Qlik Replicate data types. And it was there: "CHARACTER VARYING - If no length is specified: WSTRING (8000)", which was precisely our case! More conversions were also mentioned, for example NUMERIC -> NUMERIC(28,6), which also happened to a couple of columns in our case.
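If you suspect similar silent conversions in your own target database, the schema can be scanned for these telltale definitions with a query along these lines (a sketch; adjust the schema name to your setup):

```sql
-- Find columns that ended up as character varying(8000) or numeric(28,6) on the target.
SELECT table_name, column_name, data_type,
       character_maximum_length, numeric_precision, numeric_scale
FROM information_schema.columns
WHERE table_schema = 'public'
  AND ((data_type = 'character varying' AND character_maximum_length = 8000)
    OR (data_type = 'numeric' AND numeric_precision = 28 AND numeric_scale = 6))
ORDER BY table_name, column_name;
```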
It's not clear whether the two services are related in any way, but this finding is definitely an interesting one.
We haven't been able to confirm with 100% certainty why this exact magic number (8000) was applied, but it's likely related to the PostgreSQL page size, which is 8 kB by default.
That was not the end of our problems, though. The content of the affected columns got truncated! To fix this, we had to look for all records with content over 8000 characters and backfill the data from the source database to the target for every record that hadn't been updated yet on the new database.
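Finding the candidates is the easy part; the query below (with an illustrative table and column) runs against the source database, and the actual backfill then has to skip any row that was already modified on the new database after the cutover:

```sql
-- On the source database: rows longer than the limit the target silently enforced,
-- i.e. rows that were truncated to 8000 characters during the migration.
SELECT id, length(body) AS body_length, updated_at
FROM comments
WHERE length(body) > 8000
ORDER BY id;
```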
We also had to do 3 more things:
Until that point, the AWS DMS experience had been a horror story. Fortunately, this is where it stopped. We finally found a strategy that worked! The subsequent migrations were smooth, and we haven't experienced any issues since. Even the migrations of databases closer to 1 TB went just fine, although they were a bit slow and required a few hours of downtime.
We could have achieved much better results in terms of minimizing the downtime by using CDC, but after our experience with Full Load, which is the most straightforward approach, we didn't want to enable logical replication, let AWS DMS handle it, and then discover yet another disaster. We had lost trust in DMS, and we wanted to stick to something we knew worked.
This approach worked well almost until the very end. The only friction we experienced with this final flow was the migration of the biggest database. We ran into a specific scenario where performance for one of the tables was far from acceptable, so we developed a simple custom service to speed up the migration. Yet, the other tables were perfectly migratable via DMS. Before the migration to RDS, that database was almost 11 TB, so it also required a serious effort to shrink its size before moving it to RDS.
We will cover everything we've done to prepare that database for the migration in the upcoming blog post, along with the custom database migration service.
The story might look chaotic, but that's on purpose - even though we had found a couple of negative opinions about AWS DMS, the magnitude of the problems wasn't apparent, so this is the article we wish we had read before all the migrations. Hopefully, it clarifies that AWS DMS is a tool that looks magnificent, but at the time of writing, the quality in many areas is closer to an open beta service than to a production-grade one that is supposed to deal with a business-critical asset - the data. Especially since AWS DMS proved incapable of handling the homogeneous migration on its own - we had to use pg_dump/pg_restore to make it work.
Nevertheless, if we were to migrate self-managed PostgreSQL clusters to AWS RDS one more time, we would use Database Migration Service again - we have mastered the migration flow and understood the service's limitations well enough to be confident we could make it work. And we developed a post-migration verification script that performs a thorough check to ensure the target database's state is correct. Hopefully, after reading this article, you will be able to do the same without the problems we encountered on our migration journeys.
Here is the final list of hints and recommendations for using AWS DMS when performing a homogeneous migration from self-managed PostgreSQL to AWS RDS PostgreSQL:
Amazon Database Migration Service might initially seem like a perfect tool for a smooth and straightforward migration to RDS. However, our overall experience was closer to that of an open beta product than a production-ready tool for dealing with a critical asset of any company - its data. Nevertheless, with extra adjustments, we made it work for almost all our needs.
Stay tuned for the next part of the series, where we will focus on preparing the enormous database for the migration and on a very particular use case where our simple custom database migration tool was far more efficient than DMS (even up to 20x faster in a benchmark!), allowing us to migrate one of the databases using both AWS DMS and our custom solution simultaneously, each handling different tables.