Extrinsic sorting: A benchmark

in PHP2 months ago

Sorting algorithms are generally old hat to most programmers. They've either analyzed them to death in class, already written many of them, or work in a language where one is provided and they don't need to think about it. Or all three.

For PHP developers, we have a suite of sorting tools available to us: sort(), usort(), ksort(), uasort(), and various other letter combinations. All use Quick Sort internally, which is generally the best performing single-threaded option. Most importantly, many of them let us provide a custom comparison function to use when determining which of two values is larger or smaller.

However, that doesn't mean the last word has been spoken. Most sorting algorithms are all about picking what to compare to minimize the number of comparisons, not doing the actual comparison. Additionally, we don't always want to sort data by itself. Quite often, we need to sort records by some extrinsic value, such as a priority or before/after flags.

What are the performance characteristics of those options? Since my PSR-14 Event Dispatcher implementation, Tukio, supports both priorty and before/after logic, I wanted to see which was faster, and just how fast. The results were quite interesting!

Types of sorting

There are two main kinds of extrinsic sorting we're going to consider: Priority Sort and Topological Sort. Then we'll try combining them in various ways and see what happens.

If you want to follow along at home, the benchmarking code is available on GitHub. I've inlined the relevant highlights below. An important point is that for each sorting style, we maintain an internal array of Item objects. The Item object is an internal class (so just public properties, no methods) that holds the item value and its priority, or its before/after rules, or both. You'll see them at work in the code samples below.

Priority Sort

In a Priority Sort setup, each item to be sorted is given a priority integer value. When sorting, values are sorted by priority, with larger numbers coming first. That means a priority 5 item will come before a priority 3 item, but after a priority 8 item. If two items have the same priority, their relative order doesn't matter although getting them back in the same order they were originally is preferable.

There's another version of Priority Sort that calls the extra integer a "weight," and is the exact same thing but the lower numbers come first. Some systems use one, some use the other, but for our purposes they're interchangeable.

If you've ever used the Symfony EventDispatcher component, its "priority" values are exactly what we're talking about. (Tukio does the same thing.)

The basic algorithm for priority sort is the same as any other sort, but we compare the priority number associated with each item rather than the item itself.

The full code is here, but the important bits are simply this:

class PrioritySortNaive
{
    // ...

    /** @var array<string, PriorityItem>  */
    protected array $items;

    public function sorted(): iterable
    {
        if (!$this->sorted) {
            $this->sort();
            $this->sorted = true;
        }
        return array_map(static fn (PriorityItem $item): mixed => $item->item, $this->items);
    }

    protected function sort(): array
    {
        // We want higher numbers to come first, hence the * -1 to invert it.
        $sort = static fn (PriorityItem $a, PriorityItem $b): int 
            => -1 * ($a->priority <=> $b->priority);

        usort($this->items, $sort);

        return $this->items;
    }
}

It's pretty straightforward. The sort() method just defers everything to PHP's own usort() function, which sorts by the priority value on each item. (The -1 in there is to reverse the sort order so larger numbers come first.) The sorted() method sorts if necessary, then uses array_map() to easily get just the value out of each entry to return.

Topological Sorting

Topological Sorting is sorting a series of nodes that have a before/after relationship with other nodes. So node A comes "before" node B, node B comes "after" node C, etc. Topological sorting is somewhat more robust than Priority sorting, but has two additional costs:

  1. In order to specify nodes relative to each other, they all must have a unique ID that you can reference.
  2. There is the potential for "cycles," where A comes before B, B comes before C, and C comes before A. That's impossible and becomes a failure condition we have to handle.

There are a couple of topological sort algorithms available. The most straightforward goes roughly like so:

  1. Normalize all before/after rules to be just "before" rules.
  2. Build a list of how many nodes point to (must be before) each other node.
  3. Find any nodes that have zero incoming "before" arrows, meaning those are the first (in any order). Save those into a result list.
  4. Decrement the inbound-before counters for every node pointed at by the nodes we just found.
  5. Repeat steps 3 and 4 until we run out of nodes, in which case the result list is now sorted, or no nodes have zero inbound-before counters but there are still some to process, in which case we've found a cycle and there is no possible sorting order.

I mostly cribbed my implementation from a Python example, but the logic is reasonably straightforward. Here's the full version and the relevant bits:

class TopSortBasic
{
    // ...
    
    /** @var array<string, TopologicalItem>  */
    protected array $items;

    protected ?array $sorted = null;
    
    public function sorted(): iterable
    {
        return $this->sorted ??= array_map(fn (string $id): mixed 
            => $this->items[$id]->item, $this->sort());
    }

    protected function sort(): array
    {
        $this->normalizeDirection();

        // Compute the initial indegrees for all items.
        $indegrees = array_fill_keys(array_keys($this->items), 0);
        foreach ($this->items as $id => $node) {
            foreach ($node->before as $neighbor) {
                if (isset($this->items[$neighbor])) {
                    $indegrees[$neighbor]++;
                }
            }
        }

        // Find items with nothing that comes before it.
        $usableItems = [];
        foreach ($this->items as $id => $item) {
            if ($indegrees[$id] === 0) {
                $usableItems[] = $id;
            }
        }

        // Because the items were pushed onto the usable list, we need
        // to reverse it to get them back in the order they were added.
        $usableItems = array_reverse($usableItems);

        // Keep removing usable items until there are none left.
        $sorted = [];
        while (count($usableItems)) {
            // Grab an available item. We know it's sorted.
            $id = array_pop($usableItems);
            $sorted[] = $id;

            // Decrement the neighbor count of everything that item was before.
            foreach ($this->items[$id]->before as $neighbor) {
                if (isset($indegrees[$neighbor])) {
                    $indegrees[$neighbor]--;
                    if ($indegrees[$neighbor] === 0) {
                        $usableItems[] = $neighbor;
                    }
                }
            }
        }

        // We've run out of nodes with no incoming edges.
        // Did we add all the nodes or find a cycle?
        if (count($sorted) === count($this->items)) {
            return $sorted;
        }

        throw new CycleFound();
    }
    
    protected function normalizeDirection(): void
    {
        foreach ($this->items as $node) {
            foreach ($node->after ?? [] as $afterId) {
                // If this item should come after something that doesn't exist,
                // that's the same as no restrictions.
                if ($this->items[$afterId]) {
                    $this->items[$afterId]->before[] = $node->id;
                }
            }
        }
    }
}

Again, sorted() is just calling sort() and then mapping over the result to get back just the values. The sort() method is the algorithm we described a moment ago. normalizeDirection() just converts all "after" rules into "before" rules so that we have only one direction to check.

Grouped Priority Sort

The way priority sort works, if two items have the same priority we don't care what their order is, or at least prefer them to be left alone (so first added remains first). That suggests a different implementation.

Instead of a flat list of Items with priorities, we have an associative array of Items where the key is the priority and the value is an array of items in the order they were added. Then we need only sort the arrays by their key, which should (we hope) be a far smaller number of items, and know that the whole list is now sorted enough. In theory, it should be a huge time savings.

Here's the full code and the fun parts:

class PrioritySortGrouped
{
    // ...
    
    /** @var array<int, <string, <PriorityItem>> */
    protected array $items;
    
    public function sorted(): iterable
    {
        $this->sort();

        $it = (function () {
            foreach ($this->items as $itemList) {
                yield from array_map(static fn (PriorityItem $item) => $item->item, $itemList);
            }
        })();

        return iterator_to_array($it, false);
    }

    protected function sort(): void
    {
        if (!$this->sorted) {
            krsort($this->items);
            $this->sorted = true;
        }
    }
}

The sorting part is now dead-simple. We just defer to PHP's provided krsort() function, which sorts an array by its key values, in descending order (so that larger numbers come first, as desired). The sorted() method gets slightly more complex, as we now need to iterate over the arrays within the arrays and pluck out the values at each point. I found this most convenient to do with a generator, but there are other options.

Internalized Topological Sort

On a lark, I wanted to see what would happen if I used the before/after rules as the comparison routine given to PHP's internal functions. Would that be any faster? Would it even work?

The code goeth thusly:

class TopSortInternal
{
    protected function sort(): array
    {
       $this->normalizeDirection();

        uasort($this->items, static function (TopologicalItem $left, TopologicalItem $right): int {
            if (in_array($right->id, $left->before, true)) {
                return -1;
            }
            if (in_array($left->id, $right->before, true)) {
                return 1;
            }
            return 0;
        });

        return array_keys($this->items);
    }
}

(sorted() is the same as for TopSortBasic.)

Do you think that would work?

Turns out, it doesn't. PHP... never actually compares every item to every other item in QuickSort, so some rules simply end up unchecked. That leads to invalid results. Even if it had worked, we would have lost cycle detection. We'll exclude this from our benchmarks, but it's still good to try new things like this and see if, and why, they fail.

Combined Sort (Priority)

What if we want to allow users of a library to sort either by priority, or by before/after rules? Can we combine these two sorting mechanisms?

We can, to an extent. Specifically, we can allow users to provide either a priority or a topological ordering rule, and then before sorting convert one into the other. We can tell just by thinking about it that there will be some added cost to doing so, since in either case we're doing the same sort as before, plus some conversion logic. But for some use cases maybe that's OK.

If we want to use priority sorting internally, that does mean we lose cycle detection. That may also be OK for many situations.

Here's what I came up with for a priority-centric combined sort:

class CombinedSortPriority
{
    /** @var array<int, <string, <CombinedItem>> */
    protected array $items;

    /** @var CombinedItem[] */
    protected array $toPrioritize = [];

    /** @var array<string, CombinedItem>  */
    protected array $itemIndex = [];

    protected function sort(): void
    {
        $this->normalizeDirection();

        // Convert any before/after items to priorities.
        $this->prioritizePendingItems();

        if (!$this->sorted) {
            krsort($this->items);
            $this->sorted = true;
        }
    }

    protected function prioritizePendingItems(): void
    {
        // If there are no prioritized items at all yet, pick one
        // and declare it priority 0. (The last one is the fastest
        // to access and remove from the list, so do that.)
        if (empty($this->items)) {
            $item = array_pop($this->toPrioritize);
            $this->items[$item->priority][$item->id] = $item;
        }

        foreach ($this->toPrioritize as $item) {
            // Get the priority of all of the items this one must come before.
            $otherPriorities = array_map(fn(string $before): int 
                => $this->itemIndex[$before]->priority, $item->before);

            // This seems backwards, but that's because we want higher numbers
            // to come first.  So something that comes "before" another item
            // needs to have a higher priority than it.
            $priority = max($otherPriorities) + 1;

            $item->priority = $priority;
            $this->items[$priority][] = $item;
        }

        // We never need to reprioritize these again.
        $this->toPrioritize = [];
    }
}

There's more bookkeeping going on, although because the objects are never duplicated the overall cost shouldn't too high. sorted() is the same as in Grouped Priority Sort, which is what I built this example off of, so I didn't repeat it. The key is that we keep a separate list of items that have a priority and items that have a topological order. Then before sorting, we go through the topological list and convert all the before/after rules into priorities. To do so, we get the priority of every item it comes before, get the maximum priority of those, and add one to it. That guarantees that the item will sort before the others. We can then add it to the priority list and sort everything exactly as before.

Combined Sort (Topological)

The general concept for a Topological-centric combined sort is basically the same, but the other way around. We keep two separate lists, then convert the priority-based items into before-based items right before sorting.

Here's my first attempt at that:


    protected function sort(): array
    {
        $this->normalizeDirection();
        $this->topologizePendingItems();

        // The same as before.
    }

    protected function topologizePendingItems(): void
    {
        foreach ($this->toTopologize as $priority => $items) {
            foreach ($items as $item) {
                foreach ($this->toTopologize as $otherPriority => $otherItems) {
                    // For every other pending item, if this item's priority is
                    // greater than it, that means it must come before it. Mark
                    // the item as such.
                    if ($priority > $otherPriority) {
                        $add = array_map(static fn(CombinedItem $i) => $i->id, $otherItems);
                        $item->before = [...$item->before, ...array_values($add)];
                    }
                    $this->items[$item->id] = $item;
                }
            }
        }

        // We don't need to reprioritize these again.
        $this->toTopologize = [];
    }
}

In this case I am again storing the prioritized items grouped by priority. Then we iterate over those (which thus takes two foreach() loops, but is still only an O(n) operation) and for each one, add everything with a lower priority to its before list. The result is guaranteed to be a valid representation of the result we want, which can then be topologically sorted as before.

I did not actually benchmark this version. Or rather, I tried to and it never finished. Can you see why?

Because this is an O(n^2) algorithm! We're comparing every priority-based item against every other priority-based item. That's very bad on performance. In addition, we're adding way more information than we need to. If there are 5000 items, do we really need to say that a priority 100 item comes before all other 4999 items? No, we don't, but that would likely add more runtime to the topological sort routine.

So let's toss out that version, and instead do this:

class CombinedSortTop
{
    protected function topologizePendingItems(): void
    {
        // First, put the priorities in order, low numbers first.
        ksort($this->toTopologize);

        while (count($this->toTopologize)) {
            // Get the highest priority set.  That's the last item in the
            // list, which is fastest to access.
            $items = array_pop($this->toTopologize);

            // We don't actually care what the next priority is, but need it
            // as a lookup value to get the items in that priority.
            $otherPriority = array_key_last($this->toTopologize);

            /** @var CombinedItem $item */
            foreach ($items as $item) {
                // If $otherPriority is null, it means this is the last priority set
                // so there is nothing else it comes before.
                if ($otherPriority) {
                    $item->before = array_map(static fn(CombinedItem $i) 
                        => $i->id, $this->toTopologize[$otherPriority]);
                }
                $this->items[$item->id] = $item;
            }
        }
    }
}

In this version, we take advantage of the fact that we have the priorities already grouped. We first sort by priority, with low numbers first. We then pop items off the end (so starting with the highest priority) and say that all of those items must come before every item in the next priority after that, but we don't care about anything in priorities further down. That becomes implicit.

This approach was far faster, fast enough that I could benchmark it effectively. However, there's still a problem. Can you spot it?

The Results

Before we look at the data, look back at the code above. What do you expect the results to be? Which algorithm do you expect to perform best, and in what situations? Why? Ponder that a moment before continuing.

Tests were run on my local laptop, a 7th Gen Lenovo Thinkpad X1 Carbon with 16 GB of RAM running Kubuntu 22.04. Obviously if you have a different system your absolute results will be different but the relative values should be about the same. All tests were built with PHPBench. I used a list size of 100,000, because if I set it higher the benchmarks just took too damned long.

Priority Sort Naive

I ran four test on the naive priority sort.

  • PrioritySortNaiveUniquePreSortedBench - All items have a unique priority, and they come pre-sorted.
  • PrioritySortNaiveUniqueReverseSortedBench - All items have a unique priority, and the come pre-sorted backwards.
  • PrioritySortNaiveOnePriorityBench - All items have the same priority, so we don't really care about the order.
  • PrioritySortNaiveRandomPriorityBench - All items have a random priority between 1 and 100,000.

Results:

benchmarkmem_peakmeanrstdev
PrioritySortNaiveUniquePreSortedBench32.765mb187.187ms±0.49%
PrioritySortNaiveUniqueReverseSortedBench32.765mb131.790ms±4.42%
PrioritySortNaiveOnePriorityBench32.765mb134.550ms±3.11%
PrioritySortNaiveRandomPriorityBench32.765mb217.323ms±4.03%

I don't quite understand why the reversed data is faster, but I suspect it's jitter. When everything has the same priority, the time is 134ms. That's a not-small amount of time spend on the noop case. The more real-world test is the random priority, which clocked in at 217ms.

Priority Sort Grouped

I ran the same four tests for the grouped approach:

benchmarkmem_peakmeanrstdev
PrioritySortGroupedUniquePreSortedBench69.322mb59.864ms±0.71%
PrioritySortGroupedUniqueReverseSortedBench70.367mb53.726ms±2.74%
PrioritySortGroupedOnePriorityBench32.766mb17.148ms±4.05%
PrioritySortGroupedRandomPriorityBench51.307mb83.053ms±7.39%

Wow! Yeah, we can conclude that the grouped version is way better. That's hardly surprising, though. In the best case, with only one priority, we are effectively sorting a single item. If the number of priorities actually used is low, the sorting time is low. Additionally, we are not using a custom comparison routine here. The PHP engine is doing the entire job, in C, on numbers, without ever calling back to user-space. It's not surprising this is so fast.

Of note, though, it does seem to use more memory. It's telling that the one-priority version is almost the same memory use as the naive version. Nearly all the extra memory usage is from the extra array levels. PHP arrays are quite memory hungry.

Topological Sort

Topological Sort also had four tests.

  • TopSortBasicUniqueNoConstraintsBench - The noop case, where there's no rules provided at all.
  • TopSortBasicUniqueLinearSequenceBench - All items have a single "before" entry, pointing to the next item in the list. That is, they're already sorted.
  • TopSortBasicRandomSequenceBench - Every item is "before" some other random item in the list. In practice, I made it point to an item before it in the list as that ensured I had no cycles.
  • TopSortBasicRandomSequenceAfterBench - The same as the other random test, but using all "after" directives to see how costly that conversion step is.
benchmarkmem_peakmeanrstdev
TopSortBasicUniqueNoConstraintsBench38.569mb54.961ms±1.31%
TopSortBasicUniqueLinearSequenceBench75.172mb77.226ms±1.40%
TopSortBasicRandomSequenceBench77.273mb190.283ms±1.26%
TopSortBasicRandomSequenceAfterBench156.319mb350.023ms±0.38%

Topological sorting looks notably faster than priority sorting in the common case, and way better in the degenerate cases. Where it falls down is the after->before conversion. Apparently, that's non-fast. I wonder if we could speed that up. No matter what, it would still have a cost so that's something to keep in mind: If you can use "before" markers instead of "after", do so. (One could probably rewrite the algorithm to favor "after" markers instead of "before" if you knew that was the more common case.)

Also of note, it seems the before/after conversion is really bad on memory, too. Perhaps it would be better if we unset the "after" data after converting it?

Combined Sort (Priority)

For both combined sorts, I ran three tests:

  • All the items have a random priority.
  • All the items have a topological order.
  • 50% of items have a random priority, 50% have a topological order (with "before").
benchmarkmem_peakmeanrstdev
CombinedSortPriorityAllPriorityDataBench59.750mb77.418ms±2.28%
CombinedSortPriorityAllTopDataBench81.960mb118.359ms±1.93%
CombinedSortPriorityMixedRandomDataBench117.796mb172.262ms±2.00%

When the data is all-priority, the combined version is in the same neighborhood as the grouped priority. That is to be expected since it's the same algorithm, and just isn't using the conversion step beforehand. That it's slightly faster in this version I chalk up to test jitter.

It's odd that the all-topological case is actually faster than the mixed case. I suspect that's because the conversion logic naturally falls out when there's nothing with a given priority yet, so its internal logic is pretty fast. It looks like in the mixed case, the converter takes around 100ms, which is more than expected. Perhaps a faster version could be written.

The memory usage is much higher than the plain grouped version. Once again, that's down to the additional lookup arrays we have to maintain. There may be ways to reduce that a bit, but it won't be eliminated.

Combined sort (Topological)

benchmarkmem_peakmeanrstdev
CombinedSortTopAllPriorityDataBench102.575mb326.539ms±2.48%
CombinedSortTopAllTopDataBench78.880mb188.096ms±1.21%
CombinedSortTopMixedRandomDataBench133.662mb310.055ms±1.48%

Oof. That priority-to-topology conversion algorithm is still decidedly non-fast. If the data is all topological to start with, it's right around the same as the basic topological sort, as expected. The conversion logic, though, increases the runtime by over 130ms to the common case. That's... actually only 30% more than the topological-to-priority converter.

Overall results

Here's everything in one big table.

benchmarkmem_peakmeanrstdev
PrioritySortNaiveUniquePreSortedBench32.765mb187.187ms±0.49%
PrioritySortNaiveUniqueReverseSortedBench32.765mb131.790ms±4.42%
PrioritySortNaiveOnePriorityBench32.765mb134.550ms±3.11%
PrioritySortNaiveRandomPriorityBench32.765mb217.323ms±4.03%
PrioritySortGroupedUniquePreSortedBench69.322mb59.864ms±0.71%
PrioritySortGroupedUniqueReverseSortedBench70.367mb53.726ms±2.74%
PrioritySortGroupedOnePriorityBench32.766mb17.148ms±4.05%
PrioritySortGroupedRandomPriorityBench51.307mb83.053ms±7.39%
TopSortBasicUniqueNoConstraintsBench38.569mb54.961ms±1.31%
TopSortBasicUniqueLinearSequenceBench75.172mb77.226ms±1.40%
TopSortBasicRandomSequenceBench77.273mb190.283ms±1.26%
TopSortBasicRandomSequenceAfterBench156.319mb350.023ms±0.38%
CombinedSortPriorityAllPriorityDataBench59.750mb77.418ms±2.28%
CombinedSortPriorityAllTopDataBench81.960mb118.359ms±1.93%
CombinedSortPriorityMixedRandomDataBench117.796mb172.262ms±2.00%
CombinedSortTopAllPriorityDataBench102.575mb326.539ms±2.48%
CombinedSortTopAllTopDataBench78.880mb188.096ms±1.21%
CombinedSortTopMixedRandomDataBench133.662mb310.055ms±1.48%

And sorted by runtime if you prefer:

benchmarkmem_peakmeanrstdev
PrioritySortGroupedOnePriorityBench32.766mb17.148ms±4.05%
PrioritySortGroupedUniqueReverseSortedBench70.367mb53.726ms±2.74%
TopSortBasicUniqueNoConstraintsBench38.569mb54.961ms±1.31%
PrioritySortGroupedUniquePreSortedBench69.322mb59.864ms±0.71%
TopSortBasicUniqueLinearSequenceBench75.172mb77.226ms±1.40%
CombinedSortPriorityAllPriorityDataBench59.750mb77.418ms±2.28%
PrioritySortGroupedRandomPriorityBench51.307mb83.053ms±7.39%
CombinedSortPriorityAllTopDataBench81.960mb118.359ms±1.93%
PrioritySortNaiveUniqueReverseSortedBench32.765mb131.790ms±4.42%
PrioritySortNaiveOnePriorityBench32.765mb134.550ms±3.11%
CombinedSortPriorityMixedRandomDataBench117.796mb172.262ms±2.00%
PrioritySortNaiveUniquePreSortedBench32.765mb187.187ms±0.49%
CombinedSortTopAllTopDataBench78.880mb188.096ms±1.21%
TopSortBasicRandomSequenceBench77.273mb190.283ms±1.26%
PrioritySortNaiveRandomPriorityBench32.765mb217.323ms±4.03%
CombinedSortTopMixedRandomDataBench133.662mb310.055ms±1.48%
CombinedSortTopAllPriorityDataBench102.575mb326.539ms±2.48%
TopSortBasicRandomSequenceAfterBench156.319mb350.023ms±0.38%

Conclusions

Was this what you expected?

The biggest take-away for me is that the grouping on priority sort is an absolute must. In PHP specifically, it's a massive win, yielding between 3- and 4-fold improvement. I suspect in other languages with a less strong split between user-space and bundled-engine-libraries the difference would be less, but it's likely still significant.

For most cases, topological sorting is pretty fast. It's faster than naive priority sorting, but not as fast as grouped sorting. Again, that's likely due in part to the sorting logic happening all in the engine for the grouped version. Where it breaks down is if you have lots of "after" markers to process, as that conversion is surprisingly expensive.

If you want to offer both options in your API, using grouped priorities internally is the better strategy by a wide margin. With mixed data, it's around twice as fast. Even if your data is mostly topological, if you're willing to lose cycle detection, converting it to priorities first and letting the engine take care of it from there is a win. (That may not be true in other languages. Run your own benchmarks.)

Fortunately, it turns out Tukio, the library that started all this, is already using that approach! In fact, it's doing slightly better as it tries to prioritize the before/after items when they're added if the item it's coming before/after is already available, making the "prioritize these first" step even faster. It's nice when you guess right. I will likely split that off as a stand-alone library sometime in the near-ish future, now that I'm confident it's the best approach.

Sort:  

Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!

Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).

You may also include @stemsocial as a beneficiary of the rewards of this post to get a stronger support. 
 

Dear @crell,
May I ask you to review and support the Dev Marketing Proposal (https://peakd.com/me/proposals/232) we presented on Conference Day 1 at HiveFest?
The campaign aims to onboard new application developers to grow our ecosystem. If you missed the presentation, you can watch it on YouTube.
You cast your vote for the proposal on Peakd, Ecency,

Hive.blog / https://wallet.hive.blog/proposals
or using HiveSigner.

Thank you!