How we migrated 17M+ records to a new Postgres database for our subreddit tracker


Things used

  • VPS (16 GB RAM, 4 cores)
  • PgBouncer (for connection pooling)
  • pg_partman (for partitioning)
  • pgBackRest (incremental and full backups to S3)
  • Grok deep think for planning
  • Lots of planning, notes, and ~30 minutes of downtime

I could have used real-time WAL streaming to shrink the downtime window even further, but that felt like overkill for my use case.

The problem: database queries to find the best time to post (for the day and for the week) started taking 30 seconds and eventually a minute, even with indexes applied. The queries scan through the past months of data, and since we collect data every 10 minutes for more than 1,100 subreddits, that quickly adds up. On top of that, my AWS VPS had 2 GB of RAM and 2 cores, and it was struggling to maintain this 17M+ row table of Reddit traffic data points.
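
For context, the "best time to post" numbers are essentially an aggregation over the raw data points. A minimal sketch of that query's shape, assuming a hypothetical active_counts table with subreddit_id, active_users, and collected_at columns (not the real schema or query):

```sql
-- Hypothetical sketch of the "best time to post" aggregation.
-- Table and column names are assumptions, not the real schema.
SELECT
    extract(dow  from collected_at) AS day_of_week,
    extract(hour from collected_at) AS hour_of_day,
    avg(active_users)               AS avg_active_users
FROM active_counts
WHERE subreddit_id = $1
  AND collected_at >= now() - interval '3 months'
GROUP BY 1, 2
ORDER BY avg_active_users DESC;
```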

Also, my AWS bill went from $35/mo to $80/mo because of all the row scans hitting the SSD. I know a precomputed materialized view could help, but I don't want to use one here. So with the increase in users there were a lot of reads and row scans on the database, and I figured it was time to move my VPS to another platform with more predictable pricing. And since I was on Postgres 15, I thought I might as well move to Postgres 17, the latest stable release.

The planning:

I did a lot of back and forth while planning the strategy behind this migration, chatting with Grok deep think mode, o3, 4o, etc.


Also, after chatting with LLMs and the Postgres subreddit, I realized we needed to partition the active counts table: 17M+ rows is manageable, but we don't use all of them in our queries, so it's better to archive the older ones. That's how partitioning entered the picture. It caused a few headaches, because I then had to check the trade-offs of foreign keys that point at a partitioned table (any unique key they reference has to include the partition column), but luckily in our case the foreign keys pointed at other tables.

Then I had to plan the retention period and the partition interval and type. I went with monthly partitions, because 1.5M-2M records is manageable for a month's worth of data points, especially spread across a handful of tables rather than kept in one table. I could also set up the pg_partman config to drop or detach old partitions; in my case detach, since I can archive the older tables to R2/S3 and maybe even attach them back on demand if premium customers want analytics further back in time. A sketch of that config is below.
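
As a rough sketch of that pg_partman config (assuming pg_partman 5.x, a table called public.active_counts partitioned on a collected_at timestamp, and a 6-month retention window; none of these are the exact production values):

```sql
-- Sketch: monthly partitions managed by pg_partman (5.x-style call).
SELECT partman.create_parent(
    p_parent_table => 'public.active_counts',
    p_control      => 'collected_at',
    p_interval     => '1 month',
    p_premake      => 4                -- pre-create 4 future partitions
);

-- Detach (don't drop) partitions older than the retention window,
-- so they can be archived to R2/S3 and re-attached on demand.
UPDATE partman.part_config
SET retention            = '6 months',   -- assumed window, not the real one
    retention_keep_table = true
WHERE parent_table = 'public.active_counts';
```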

This was two hours of painful but worthwhile effort, because it won't bottleneck me from introducing more features. And since the data in the database keeps growing, I can use the same learnings to keep my DB lean, offload archived data to S3, and still have the confidence that the data is there.

The setup

I set up my new VPS, and one thing I skipped this time was Docker, so as not to complicate my steps. On my old database I ran Postgres inside Docker, and every step needed 2-3 extra commands just to get stuff in and out of the container; since I'm not running a large fleet of servers, a plain DB on a bare-metal VPS works fine. So I set up PG 17 on the VPS, and it was smooth to just run commands and set things up: pg_partman, PgBouncer, pgBackRest, etc.
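
On the Postgres side there isn't much to it; a minimal sketch, assuming the pg_partman extension files are already installed on the host (e.g. from the PGDG packages):

```sql
-- Enable pg_partman in its own schema (the usual convention).
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;
```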

I did a dry run of the partitioning on my local database first, just to check that everything was good and to build a checklist for making it happen smoothly in production. Got the final checklist.

Time for production. I updated the maintenance banner on the live site, saying the site would be down for 2 hours for maintenance. I also put my two scripts on hold and rerouted the app routes to the home page. Deployed the change.


I quickly got to work: I made a copy of the active counts table (where all 17M+ data points live), since you can't partition an existing table in place. I renamed the old table and created a partitioned table for active counts.
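
Roughly, that swap looks like this (table and column names are assumptions; the real table has more columns):

```sql
-- Sketch: swap the old table out and create a partitioned replacement.
-- Names and columns are assumptions, not the real schema.
ALTER TABLE active_counts RENAME TO active_counts_old;

CREATE TABLE active_counts (
    subreddit_id bigint      NOT NULL,
    active_users integer     NOT NULL,
    collected_at timestamptz NOT NULL
) PARTITION BY RANGE (collected_at);
```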

Then I updated my pg_partman config to pre-make 4 monthly partitions and inserted the old data into the new table. The data got routed to the correct partitions, and I checked with fresh inserts too. Connected my local app to the new prod database and it worked!!
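
The backfill itself is a plain INSERT ... SELECT, and Postgres routes each row into the right monthly partition, assuming partitions covering the historical months already exist (pg_partman's partition_data_proc can also do this in batches). A sketch with the same assumed names:

```sql
-- Backfill the partitioned table from the renamed original.
INSERT INTO active_counts
SELECT * FROM active_counts_old;

-- Sanity check: row counts per partition.
SELECT tableoid::regclass AS partition, count(*)
FROM active_counts
GROUP BY 1
ORDER BY 1;
```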

New problem

Then the new problem started. I was all happy that things worked; then I updated the env variables on production and in the scripts (email reports, etc.), and I started seeing connection timed out errors. I got stressed but calmly debugged for 10 minutes; still nothing, and PgBouncer wasn't showing any logs. Then I saw my port was wrong. I quickly fixed it and things started working properly.

New beginning

Emails were being sent properly, and new traffic data was being routed to the correct partitions!! Then I set up pgBackRest (incremental and full backups), and in a few months I'll use scripts to detach and archive the older data to S3. Each monthly partition of active counts will now hold around 2M data points, which is manageable for now!
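
Once retention kicks in, pg_partman's regular maintenance handles the detach; the actual upload to S3/R2 happens outside the database (e.g. by dumping the detached table). A sketch:

```sql
-- pg_partman maintenance: creates upcoming partitions and, with
-- retention_keep_table = true, detaches (rather than drops) expired ones.
CALL partman.run_maintenance_proc();

-- A partition can also be detached manually (name is illustrative):
-- ALTER TABLE active_counts DETACH PARTITION active_counts_p20250101;
```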


My database, the heart of my app, is fresh and faster, and no longer a bottleneck to introducing new features. Queries are faster and the Postgres configuration is tuned. On my last DB I never cared to configure shared_buffers (a common starting point is ~25% of RAM) or the cache size; this time I dove deep to understand them, so I have more clarity on why and how things happen inside the database.
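
For a 16 GB box dedicated to Postgres, that tuning boils down to a couple of settings; the values below are illustrative starting points rather than my exact production config:

```sql
-- Rough tuning sketch for a 16 GB dedicated Postgres server.
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~75% of RAM
SELECT pg_reload_conf();   -- shared_buffers still needs a full restart
```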

This move buys me time for the next 3-4 months, maybe 5+, to tinker with adding more data, use this knowledge to do more fun stuff with the queries, and help customers get better insights.

Eventually I'll be introducing an AI keyword monitoring feature; I've done a ton of research over many months, and customers want it. Hopefully the database is now ready to handle the scale for it!

Let's see how it goes; if I face any issues, I'll keep you guys posted on how this helps me scale with lots of data :)

Last updated: June 13, 2025
