Hacker News | garrettf's comments

Howdy all, author here! Sharding our Postgres monolith was a huge undertaking for a small team. We got a big boost from folks who joined just weeks before the switch-over date. If you're interested in thinking about problems like this, I'd love to chat. We're also hiring across all roles; check my profile for details.

I’m happy to answer questions about the project here, feel free to reply below.


Nice write-up! Two questions:

- Can you share details on the routing? I.e., how does the app know which database + schema it needs to go to for a given workspace?

- Did you consider using several databases on the same Postgres host (instead of schemas within a single database)? Not sure what's better really, curious whether you have any thoughts about it.

Thanks!


> Can you share details on the routing?

All in the application layer! All of our server code runs from the same repo, and every Postgres query gets routed through the same module. This means that it was relatively easy to add a required "shard key" argument to all of our existing queries, and then within our Postgres module consult an in-app mapping between shard key range and DB+schema.

Plumbing that shard key argument through the application was more difficult, but luckily possible due to the hierarchical nature[0] of our data model.
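To make the routing concrete, here is a minimal sketch of what an in-app mapping from shard key to DB + schema might look like. The shard counts, host names, and schema naming are all made up for illustration (not Notion's actual values), and it assumes workspace IDs are UUIDs:

```python
import uuid
from dataclasses import dataclass

N_LOGICAL = 480                          # illustrative: logical shards, fixed up front
HOSTS = [f"pg-{i}" for i in range(32)]   # 480 / 32 = 15 logical shards per host

@dataclass(frozen=True)
class ShardLocation:
    host: str     # physical Postgres machine
    schema: str   # schema within that machine's database

def logical_shard(workspace_id: str) -> int:
    """Range-based mapping: split the 128-bit UUID space into N_LOGICAL
    contiguous ranges, so each workspace lands in a stable logical shard."""
    return uuid.UUID(workspace_id).int * N_LOGICAL >> 128

def route(workspace_id: str) -> ShardLocation:
    """Resolve the shard key that every query must carry into a DB + schema."""
    shard = logical_shard(workspace_id)
    # Logical shards map to physical hosts in fixed blocks; rebalancing a hot
    # shard later only means changing this assignment, not the shard key.
    host = HOSTS[shard // (N_LOGICAL // len(HOSTS))]
    return ShardLocation(host=host, schema=f"shard_{shard:04d}")
```

Because the workspace-to-logical-shard mapping is stable, moving a logical shard to a different machine is purely a change to the shard-to-host assignment.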

> Did you consider using several databases on the same Postgres host

If I recall correctly, you cannot use a single client connection to connect to multiple databases on the same host, and so this could have ballooned our connection counts across the application. This is not something we explored too deeply though, would love to hear about potential benefits of splitting tables in this way.
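Back-of-envelope arithmetic on why one-database-per-shard balloons connection counts: a Postgres client connection is bound to a single database, so each app instance would need a pool per database, whereas schemas share one database per host and one pool suffices. All numbers below are illustrative, not Notion's actual fleet:

```python
# Illustrative connection-count math for databases-per-shard vs. schemas.
app_instances = 100
hosts = 32
dbs_per_host = 15   # hypothetical: one database per logical shard on each host
pool_size = 10      # connections in each pool

# One pool per (instance, host, database) when each shard is its own database.
conns_databases = app_instances * hosts * dbs_per_host * pool_size

# One pool per (instance, host) when shards are schemas in a shared database.
conns_schemas = app_instances * hosts * pool_size

print(conns_databases, conns_schemas)  # 480000 vs. 32000
```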

[0] https://www.notion.so/blog/data-model-behind-notion


Ah yes, good point about connections being DB-specific. The schema approach seems more lightweight in that regard. Thanks!


Followup question: does the sharding happen within the app servers that talk to the DB, or do you shard traffic before it hits them? In the former case the total number of DB connections required presumably grows something like n^2.


What were the limitations that required you to move all customers to a sharded system at once?

Could you have selected some workspaces with lower traffic to migrate first? That would have decreased the load on the primary, potentially speeding up the replication, which is a flywheel to enable more customers to migrate to shards.


Good question, that was an option. The main motivating factor here was that vacuums were beginning to take dangerously long: O(weeks) to complete, independent of the load on the database. While migrating spaces in segments would have reduced the number of records future vacuums needed to scan, we were already racing against the clock to complete one vacuum prior to TXID wraparound[0]. To kick off replication for specific spaces, we would have needed to write our shard key to all data owned by those spaces. That would have further contributed to TXID growth, and was not something we were comfortable doing.
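For readers unfamiliar with the wraparound clock: Postgres transaction IDs are 32-bit, and roughly 2^31 (~2.1 billion) ids can be "in the past" before wraparound forces an emergency shutdown. A rough runway calculation, with purely illustrative inputs:

```python
# Txids in Postgres are 32-bit; roughly 2**31 ids can be "in the past"
# before wraparound. All inputs below are illustrative numbers.
WRAP_LIMIT = 2**31  # ~2.1 billion

def days_until_wraparound(oldest_xid_age: int, txids_per_day: int) -> float:
    """Runway before the oldest unfrozen row hits the wraparound horizon,
    assuming no vacuum completes. (In production, oldest_xid_age comes from
    something like: SELECT max(age(datfrozenxid)) FROM pg_database.)"""
    return (WRAP_LIMIT - oldest_xid_age) / txids_per_day

# e.g. an xid age of 1.5B while consuming 10M txids/day leaves ~65 days,
# an uncomfortable margin when a single vacuum takes O(weeks) to finish.
print(round(days_until_wraparound(1_500_000_000, 10_000_000)))
```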

At the end of the day, this is something we could have explored in more depth, but we were ultimately comfortable with the risk tradeoff of migrating all users at once vs. the consequences of depending on the monolith for longer, largely thanks to the effort we put into validating our migration strategy.

[0] https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...


Seems like something that might still be worth exploring. If I'm thinking about this correctly, it would allow you to create new shards on the fly, and to migrate workspaces between shards while locking only one workspace at a time, and only for the time required to catch up that single workspace.


Are you at all concerned about the selection of the workspace ID as the partition key? With every workspace on a single partition, couldn't a high-throughput workspace create a "hot" partition that will negatively impact other workspaces on that partition?

At the cost of potentially introducing more cross-partition queries, you might benefit from splitting up high-throughput workspaces. See strategy in https://d0.awsstatic.com/whitepapers/Multi_Tenant_SaaS_Stora..., pages 17-20.


Perhaps this is why they have 15 logical shards per physical machine. Seems like if one logical shard gets too hot, it could be moved to a new physical machine.


> [2] In addition to packaged solutions, we considered a number of alternatives: switching to another database system such as DynamoDB (deemed too risky for our use case)..

1) Can you please talk more about the risks of using DynamoDB or a similar NoSQL solution?

2) Did you consider Spanner, which is the SQL DB of choice within Google and is available on Google Cloud?

Thanks for the wonderful engineering blog posts!


Oh yes, this footnote is not knocking non-relational DBs; in fact, we run DynamoDB in production for other use cases today.

We were on a tight timeline due to impending TXID wraparound. Switching database technologies would have required us to rewrite our queries, reexamine all of our indexes, and then validate that queries were both correct and performant on the new database. Even if we had the time to derisk those concerns, we'd be moving our most critical data from a system with scaling thresholds and failure modes we understand to a system we are less familiar with. More generally, we were already familiar with the performance characteristics of Postgres, and could leverage a decent amount of monitoring and tooling we'd built atop it.

There is nothing inherent about non-relational DBs that make them unsuitable for our workload. If we were designing our data storage architecture from scratch, we'd consider databases that are optimized for our heaviest queries (vs. the flexibility afforded by Postgres today). A large number of those are simple key-value lookups, and a plain key/value store like DynamoDB is great for that. We're considering these alternatives going forward, especially as we optimize specific user workloads.

Re: Cloud Spanner: we didn't consider a cross-cloud migration due to the same time constraints. It sounds like a wonderful product; we just weren't ready at the time.


Makes sense. Thanks for the detailed response!


Curious if you had any tables that were "universal" and weren't shardable, global stuff that isn't workspace-specific. Would these be replicated across all shards? Or just stored in some special db (in which case joining to them might be difficult).


Plenty! But not significant compared to page content in terms of storage or load. That unsharded data is stored in separate unsharded databases today, but each table could be sharded later using a different shard key (often user id).

Luckily we did not have many join queries prior to sharding. The few cross-database joins were trivial to implement as separate queries in application logic.
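A sketch of what replacing a cross-database join with two queries looks like in application code. The fetch functions and table/column names here are hypothetical stand-ins for real queries against the sharded and unsharded databases; they return canned rows so the example runs standalone:

```python
# Hypothetical app-level "join" across the sharded and unsharded databases.

def fetch_pages(workspace_id: str) -> list[dict]:
    # Stand-in for e.g. SELECT id, author_id FROM pages on the workspace's shard.
    return [{"id": "p1", "author_id": "u1"}, {"id": "p2", "author_id": "u2"}]

def fetch_users(user_ids: set[str]) -> dict[str, dict]:
    # Stand-in for e.g. SELECT id, name FROM users on the unsharded database.
    directory = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
    return {uid: directory[uid] for uid in user_ids}

def pages_with_authors(workspace_id: str) -> list[dict]:
    pages = fetch_pages(workspace_id)
    # Second query instead of a SQL JOIN; merged in application logic.
    users = fetch_users({p["author_id"] for p in pages})
    return [{**p, "author": users[p["author_id"]]["name"]} for p in pages]
```

Two round trips instead of one, but each query stays local to a single database, which is what makes this trivial once the join volume is low.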

Our main concern here was referential consistency: if you need to write to multiple databases at once, what happens if one of your writes fails? This was not a problem in practice for us since (1) our unsharded data is relatively static and (2) there are very few bidirectional pointers between data across databases.

Long term there are more interesting problems to solve when distributing unsharded data globally. However, given that our unsharded data is less dynamic and consistency is less critical, we have many levers we can pull here.

