What’s the default “order by” for a database?
I’ve noticed a factor related to Prisma many times and finally decided to figure out the “logical explanation” behind it.
Suppose I fetch multiple rows of a table and take the first row. I will always get the most frequently updated row as a result. I knew this couldn’t be a Prisma thing and must depend on the underlying database. So let’s figure out what’s happening here. For the rest of this article every time I mention a database, I’m talking about PostgreSQL because the internal architecture will be different for each database.
In which format is data stored?
If you thought binary, you are partially right. But there’s more to it. Databases like PostgreSQL generally have a B-Tree architecture.
A quick recap on B-Tree (Skip if you know already) — B-Trees are an extended version of Binary Trees. In a binary tree, you have a root, if you have a node with a smaller value it will go to the left of the root and if the node has a higher value it will go to the right.
The profits of this structure are faster querying time. B-trees take this one step further by having multiple keys in the root and partitions. Think of it as B-trees are clever binary trees that allow even further querying improvements for larger data sizes.
So yeah, data in PostgreSQL is by default stored as B-trees.
Where is the data when you’re trying to retrieve it?
PostgreSQL has multiple components that ensure faster and more robust readability. Here are a couple of things to keep in mind
- Heap Order: In PostgreSQL, without an explicit
ORDER BY
clause, rows are usually returned in the order they are found in the heap, which is the underlying storage structure of the table. This order may reflect the order in which the rows were inserted, but it's not guaranteed. - Index Order: If an index is used to satisfy the query, the order of rows returned may be influenced by the order of the index. However, this order might not always be apparent or consistent, especially for complex queries involving multiple indexes or index types.
- Vacuuming and Reordering: PostgreSQL periodically runs a process called vacuuming to reclaim storage occupied by deleted or outdated rows and to reorganize the data for better performance. This process can affect the physical order of rows in the table over time, leading to changes in the default order of rows returned by queries.
- Concurrency and MVCC: PostgreSQL uses Multiversion Concurrency Control (MVCC) to manage concurrency and ensure data consistency. This can also affect the order of rows returned by a query, especially in environments with high levels of concurrent transactions.
How do you justify this behaviour?
So my guess is when you update a particular column, it is stored in the heap, just in case it gets updated again. To get a more technical answer here are a couple of reasons to justify this behavior:
MVCC and Transaction ID (XID) Order: PostgreSQL uses MVCC to manage concurrent transactions and ensure data consistency. Each transaction is assigned a unique Transaction ID (XID), and PostgreSQL keeps track of the visibility of each row based on the XID. When you update a row, PostgreSQL creates a new version of the row with a new XID and marks the old version as obsolete. I’m not sure but these IDs could be chronologically increasing over time and when you retrieve it, it comes in the same sorted manner.
Visibility Rules: By default, PostgreSQL uses a snapshot of the database at the start of the transaction to determine which rows are visible to the transaction. Rows that were updated or deleted after the start of the transaction are not visible. When you issue a SELECT *
without specifying an ORDER BY
clause, PostgreSQL may return rows in a manner consistent with the visibility rules and the XID order.
Heap Structure and Vacuuming: PostgreSQL stores data in a heap structure, and the physical order of rows in the heap can be influenced by various factors, including the order in which rows were inserted and the impact of vacuuming processes. If rows are frequently updated or deleted, vacuuming may lead to the most recently updated records being physically placed at the front of the heap.
Well this wasn’t the most interesting blog ever, but it answers my question to a point. I will try to figure out more about how this works internally and post a second part.