What is cursor (seek) database pagination and how is it different from offset pagination?


Relational SQL databases offer LIMIT and OFFSET parameters to SELECT queries to retrieve data. Using these parameters the user is able to pick a fragment of data from the database. Such partial data fetching always corresponds to rows (not columns) and is called pagination (picking columns would be called projection). The name comes from the fact that it's mainly used to display data in pages.

Let's imagine a simple case of browsing products in an Internet shop. Once a user picks the proper category, applies the filters he/she is interested in (like a price range) and applies sorting (for example by speed of potential delivery), he/she usually sees the list of product offers divided into pages. There is no use in presenting all the data at once, as the user will most probably pick something from the top anyway (since the items are sorted). What is more, presenting only part of the data reduces database stress (a smaller number of rows has to be computed) and network usage (a smaller number of rows has to be fetched from the database to the back-end application and then presented to the user). Each page is limited to a defined number of products (LIMIT). The limit stays the same between pages. What changes when the user switches pages is the OFFSET. It's the parameter that tells the database how many rows to skip before retrieving the proper data. On the first page it will be zero (no OFFSET). On the second it will be the size of the skipped first page (so OFFSET = LIMIT). On the third it will be the size of the two skipped pages (OFFSET = LIMIT * 2), and so on.
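To make this concrete, here is a minimal sketch of such an offset-paginated query. The products table and its columns (category_id, price, delivery_days) are made up for this example.

-- Page 3 of a product listing with a page size of 20:
-- skip the first two pages (OFFSET = LIMIT * 2) and return the next 20 rows.
SELECT id, name, price
FROM products
WHERE category_id = 7
  AND price BETWEEN 10 AND 50
ORDER BY delivery_days ASC
LIMIT 20 OFFSET 40;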

What is worth noting, proper pagination requires sorting (ORDER BY in SQL). Even if the user hasn't applied any sorting, it has to be applied on the back-end anyway. This is because most databases do not guarantee the order of rows returned by a SELECT statement. That means that multiple executions of the same query can return results in a different order. Knowing this, if we slice the whole result into parts (pages) and take, let's say, results from row 20 to 30 (out of 100), we can expect totally different results each time (as the order of the total 100 rows will be random). That makes page switching impossible. Each page would contain random rows. Some would even be duplicates from previous pages, as the call for the next page would be a separate, subsequent call with another random order. What is even more interesting, sometimes adding sorting on non-unique values (a column) doesn't resolve this issue. Let's say we're sorting on people's names. Once we have 20 different persons named Alice, the order of those Alices will still be random (or at least not guaranteed to be deterministic). To solve this problem, sorting on a unique key can be added to the query (ORDER BY id), either as the main order (if there is no other sorting enabled) or as a secondary one (ORDER BY something, id). This makes pagination consistent.
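For illustration, the same page query with a unique tie-breaker could look like this (again using the hypothetical products table from above):

-- "id" is unique, so rows with equal delivery_days always come back in the same order.
SELECT id, name, price
FROM products
ORDER BY delivery_days ASC, id ASC
LIMIT 20 OFFSET 40;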

Once we have consistent pagination, there are still some drawbacks to the offset pagination presented above. The first of them is floating results. Let's consider a case where, while the user browses the items page by page, somebody else adds or removes items that meet his/her criteria. During page switches the items will move. The user will either miss some of the items when switching to the next page or see some duplicates from the previous page, as the number of entries changed. Of course this can be avoided by some additional conditions, like skipping items that were added after some date (the date when the user initially entered the first page of the results), but it creates additional trouble to deal with. For example: when to reset the date to display newly added records? How to tackle deleted rows? How to tackle modifications of items that cause them to sort differently now? In most cases such side effects are left alone, so the user sees some strange software behaviour from time to time while browsing pages.
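One hedged way to "freeze" the result set against newly added rows is sketched below. It assumes the table has a creation timestamp column (created_at is an invented name for this example).

-- :snapshot is a placeholder for the timestamp captured when the user opened the first page.
SELECT id, name, price
FROM products
WHERE created_at <= :snapshot
ORDER BY delivery_days ASC, id ASC
LIMIT 20 OFFSET 20;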

Another major drawback is performance. To calculate a page that starts from OFFSET 50, the database in fact has to first compute the 50 rows to skip and only then the rows to show. This is trivial until a user jumps to the last page and increases the OFFSET to many thousands, maybe millions. That means the database needs to compute all the preceding rows just to return a few for this last page. This may stress the database a lot, especially if we deal with multiple users at once and the query itself is complicated.
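To illustrate, a query like the one below (on the same hypothetical table) forces the database to produce and discard a million rows before returning twenty:

SELECT id, name, price
FROM products
ORDER BY delivery_days ASC, id ASC
LIMIT 20 OFFSET 1000000;  -- rows 1..1,000,000 still have to be computed, sorted and skipped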

How is cursor (seek) pagination different?

Cursor pagination (also known as seek pagination) is a slightly different approach.

Let's consider a search engine (like Google). Once the user enters a search phrase and is presented with the first page of the results, he/she is most often happy with what he/she received. If not, he/she either changes the search phrase or goes to the second page of results. That makes browsing continuous (the user goes from the first page to the second, from the second to the third and so on). After switching through a few pages in a row, nobody ever looks at the... last one. In fact most search engines don't even provide the possibility to jump to the last pages. Knowing that most of the interesting results are on the first pages, and that those are usually browsed in a row (rarely are any pages skipped), we can come up with a solution that works best in such a case.

The idea of cursor pagination is that the user is always provided with a cursor to the next page and to the previous page (if any). Such a cursor is actually the last result from the current page (cursor to the next page) or the first one (cursor to the previous page). Knowing the edge result from the previous page, we can calculate the next page using a WHERE condition on the columns we sort on, to return rows that are larger than the last one (or smaller than the first one if we go backwards). Consider the following example:

ID | Name
 3 | Alice
20 | Alice
 1 | Tom

On this page we see results sorted by name and then by unique ID (to avoid the randomness of results mentioned before). In fact we're interested only in the fields we sort on. The cursor to the next page will be "Tom" with ID 1. To calculate the next page we may use the following SQL:

SELECT id, name
FROM persons
WHERE name > 'Tom' OR (name = 'Tom' AND id > 1)  -- rows after the cursor ("Tom", 1)
ORDER BY name ASC, id ASC
LIMIT 3;

As we can see, we just take results that are "bigger" than "Tom", so later in the alphabet. Additionally, as we also sort on IDs, we need to cover the situation where the next (yet unknown) page starts with another "Tom" with a higher ID. The part of this SQL in parentheses does that. We also apply the LIMIT, as the next page has a limited number of rows as well. Now let's see what an example next page can look like:

ID | Name
 5 | Tom
22 | Tom
 4 | Zoe

As we can see, this page also contains 3 rows, so possibly there is another one after it (if there were fewer rows we could be sure this is the last page). We also know we're no longer on the first page. That means we can calculate a cursor to the previous page (just as we can to the third one). That cursor will be the name "Tom" with ID 5. To get the previous page we can either just buffer it in our application or re-fetch it. The re-fetch SQL has to go backwards, so it's a little more complicated than the previous one:

SELECT id, name
FROM (
    SELECT id, name
    FROM persons
    WHERE name < 'Tom' OR (name = 'Tom' AND id < 5)  -- rows before the cursor ("Tom", 5)
    ORDER BY name DESC, id DESC                      -- walk backwards from the cursor
    LIMIT 3
) AS page
ORDER BY name ASC, id ASC;                           -- restore ascending order for display

As we can see, the internal query does the same kind of selection as the previous one, but with inverted conditions and sorting. We construct a sub-result with names earlier in the alphabet than "Tom", or equal to "Tom" but with a smaller ID. The sub-result is also limited to the page size (3 in this case) and returns the rows in reverse order (as we go backwards). The external query operates only on the small subset of data selected by the internal, limited one, so it's very fast. It is used just to reverse the row order again, so the rows are presented as they should be (in ascending order).

In this approach the database calculates only the rows it needs to return. Its speed can be further increased by creating proper indexes on the fields we sort on most often. Its obvious limitation is that it makes page leaping impossible (you cannot jump from page 5 straight to page 10 without calculating the pages in between).
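As a sketch, a composite index matching the sort order of the example above would let the database seek straight to the cursor position instead of scanning (the index name is arbitrary):

-- Index on the exact column order used in ORDER BY (name, id).
CREATE INDEX persons_name_id_idx ON persons (name, id);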

The kind of data presentation this pagination usually suits best is so-called infinite scrolling. This can be seen as a single page that grows once the user reaches the bottom of it or presses a "More" button (the next page is fetched from the database using the cursor and appended to the presentation on the front-end). An example of this is presented here (note the parts of the page appearing while you scroll down, and even the URL changing to the next pages fetched while you were doing it).

What about the concurrency issue? If we stick to the infinite scrolling presentation it won't appear at all. No matter how many changes were made while the user is browsing the data and switching pages, he/she will always just get rows that are "greater than something", so won't see any duplicates, and the amount of total results won't make a difference like when using the OFFSET parameter. It may cause problems on the first page when we try to use standard page buttons, though. If we present data that way (only a chunk of data on the screen and buttons to the previous and next page with their numbers), we can face a situation where the user goes from the first page to the second one and then goes back to the first one. If another person removed a row from the first page in the meantime, it will turn out that the first page now contains one row less than it should. If another person added a row that would fit on the first page, it will turn out to be in fact... the second one (as the amount of data on the first page increased and LIMIT takes only a part of it). That's why it's better to drop the page numbering to avoid confusion. Users usually don't care which page they are looking at. Just having links to the previous page and the next one should be enough (without numbers). We can also consider dropping previous-page cursor support so the first-page problem won't ever reach us.

As we can see, cursor pagination has its limitations, but it mainly solves the drawbacks of the offset one. It is usually more effective not only for simple data presentation but also for chunk-by-chunk data processing. Such processing fetches just a page of data into the application's memory to process it without overflowing the computer's memory. Once it's done processing a page, it fetches another one using a cursor, without worrying that the data "moved" during processing due to row insertions or deletions. Obviously it stops when a page is smaller than expected (below the LIMIT), as this indicates the last page and there is no need to fetch further.
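For batch processing a whole table by its primary key, the seek condition shrinks to a single column. One fetch step could look like the sketch below, where :last_id is a placeholder holding the highest ID processed so far (starting at 0):

-- Fetch the next chunk of up to 1000 rows after the last processed ID.
SELECT id, name
FROM persons
WHERE id > :last_id
ORDER BY id ASC
LIMIT 1000;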

More and more libraries and APIs support both kinds of pagination nowadays. Don't be afraid to pick and use the pagination method that is most suitable for your needs.

I hope you found this 100% steem power post interesting and see you next time! 😃


Interesting. I knew only about "LIMIT", so I learned something new. :)

What is more, presenting only part of the data reduces database stress (a smaller number of rows has to be computed) and network usage (a smaller number of rows has to be fetched from the database to the back-end application and then presented to the user).

I believe it is only partly true. Whenever you use an ORDER BY clause, all rows that match the given filters (the WHERE clause) have to be computed by the database engine. Only after that does the engine put them internally in the right order, and only then can it select part of them according to the given LIMIT and OFFSET. So this is only true when we consider the profit from fetching a smaller number of records.

Not quite. You can perform the tests yourself on a bigger table. The bigger the OFFSET, the bigger the delay in returning data. This is due to the fact that it's easier for the database to pick only the top 20 rows than to order the whole table to get 20 from the middle. Using a simple thought experiment: what is the sorting cost if you pick only the top 20 elements? How would you implement this? Its complexity is less than n*20, right? In the meantime, the worst-case scenario of sorting the whole table could be even n^2. Mature databases use the presence of a LIMIT for such sorting optimisations.
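One way to check this yourself (PostgreSQL syntax assumed here, using the hypothetical persons table from the post) is to compare the plans and timings of a small-LIMIT query and a huge-OFFSET one:

EXPLAIN ANALYZE
SELECT id, name
FROM persons
ORDER BY name ASC, id ASC
LIMIT 20;                 -- the planner can stop early (top-N sort or index scan)

EXPLAIN ANALYZE
SELECT id, name
FROM persons
ORDER BY name ASC, id ASC
LIMIT 20 OFFSET 500000;   -- the first 500,000 rows still have to be produced and skipped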