if you’ve used mongodb’s skip() and limit() to do ur pagination, you might have noticed it’s possibly as slow as your number of documents. No surprise, skip is O(n) operation by natural. What I wanted to achieve was a much faster pagination, somehow making skip or alternative an O(1) operation.

It’s a perfect world if you could define an OFFSET field who starts from 0, and is incremental and continuous [0,1,2…], and all u need to do next is to use OFFSET as ranged queries between [PAGE_OFFSET * PAGE_SIZE, PAGE_OFFSET*PAGE_SIZE+PAGE_SIZE). Ok, but this isn’t a perfect world. 1st you cannot guarantee it’s continuous, it’s possible some documents get deleted in the long time, even if you don’t have deletions, you might have to deal with insertion failures, and as your incremental offset is an atomic operation, it might just get slipped forever; So face it, it might never be continuous, but incremental characteristics should be perserved. For the same reason, you might not even get 0 as a beginning either.

Ok, it’s incremental only, how to speed up the pagination then? It’s in fact broken down into 2 parts, 1st to find the actual offset from the logical one (PAGE_OFFSET*PAGE_SIZE), 2nd is to get number of PAGE_SIZE documents from actual offset obtained. 2nd part could be much simplified, using limit() if you’d like, as you pay for O(k) in this case anyway. It’s critical how fast the 1st part could be done.

Let’s say I believe PAGE_OFFSET * PAGE_SIZE is the actual offset, or in fact, could be quite close to it. So we could look for that document, and we could find about Nth document of this one in the collection by counting every offset that is smaller. By doing this, we get an delta, if we’re lucky, it’s 0, otherwise, we could recursively look at PAGE_OFFSET * PAGE_SIZE + delta till the delta becomes zero.

So far it looks worse, as more O(n) is done for the counting. But we could expect the recursion depth to be shallow as long as the documents offsets are close to each other. Next we observe that skip is super slow when you search towards the end of the collection. But if the total number of documents remain stable, we could do a small trick while doing the counting, as we could count the number of documents whose offset is equal to or larger than the PAGE_OFFSET * PAGE_SIZE, and use TOTAL – count to get the value. As this point we could reduce the O(n) operations by half! The end of the collection becomes the best-case scenarios instead of the worst!

Updated on July/26/12

A quick recap, given a large data set, we can guarantee the incremental offset on some field, nothing else. Trying to get the actual offset from the logical one, we’ve boosted the worst case scenario, which is at the end of the collection. But that doesn’t change the big O in the new worst case (in the middle). To avoid this new worst case scenario, we need some trade-off, memory for time as usual. Think of the end of the collection as some special pivot that we chose to fasten the actual offset calculation, all that we need is just to get more of such pivots. Let’s see what if I chose number of sqrt of total documents. For each of such pivot it holds a mapping between logical offset to the actual one. Then for each logical offset to calc, it will fall into a range whose size is no more than sqrt of total documents. That would reduce the worst case to O(sqrt(n)). And the memory consumption is O(sqrt(n)), think of a collection of one million, you’ll have a map of 1k size to hold the pivots, and the pagination will be boosted to O(sqrt(n)) even in the worst case scenario.

As my application is no more than 10 million size, i’m quite happy with the new performance benchmark now. With the growth of data, memory and time might become a concern again. But i’ll give it more thoughts till then.