PHP: Use associative arrays basically never

Posted in #php · 6 years ago (edited)

The other day I was working on some sample code to test out an idea that involved an object with an internal nested array. This is a pretty common pattern in PHP: You have some simple one-off internal data structure so you make an informal struct using PHP associative arrays. Maybe you document it in a docblock, or maybe you're a lazy jerk and you don't. (Fight me!) But really, who bothers with defining a class for something that simple?

But that got me wondering, is that common pattern really, you know, good? Are objects actually more expensive or harder to work with than arrays? Or, more to the point, is that true today on PHP 7 given all the optimizations that have happened over the years compared with the bad old days of PHP 4?

So like any good scientist I decided to test it: What I found will shock you!

Benchmark environment

My test system is a Lenovo X1 Carbon 2017 Edition, i5-7300U CPU @ 2.60GHz, 16 GB of RAM, running Kubuntu 18.04. The PHP version is 7.2.5-0ubuntu0.18.04.1. XDebug is disabled. (Always do that before running benchmarks!) I have as much background processing turned off as I could manage, though on modern systems runtime optimizations mean there will always be some variation and jitter.

You will almost certainly get different absolute numbers than I do but the relative values should be about the same.

Associative arrays (Baseline)

The baseline test looks like this:

<?php
declare(strict_types=1);

error_reporting(E_ALL | E_STRICT);

const TEST_SIZE = 1000000;

$list = [];
$start = $stop = 0;

$start = microtime(true);

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $list[$i] = [
    'a' => random_int(1, 500),
    'b' => base64_encode(random_bytes(16)),
  ];
}

ksort($list);

usort($list, function($first, $second) {
  return [$first['a'], $first['b']] <=> [$second['a'], $second['b']];
});

$stop = microtime(true);
$memory = memory_get_peak_usage();
printf("Runtime: %s\nMemory: %s\n", $stop - $start, $memory);

That is, we build an array of 1 million items, where each item is an associative array containing an int and a short string. This "anonymous struct" is very typical of the type of data structure I'm talking about, which is often assigned to a private property within an object and only accessed within it. (Although some systems like to expose these anonymous structs as though they were an API, which is one of the most developer-hostile API designs I have ever seen. You know who you are.) 1 million items is somewhat larger than a typical use case but we want to stress test it, so go big or go home.

The goal is to measure the memory used by all of those nested arrays as well as the time it takes to process them. For that, we're sorting the array twice, once by the key (which should be a no-op) and once by the array itself, using a custom sort function.
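As a refresher, the spaceship operator (`<=>`) compares equal-length arrays element by element, left to right, which is what makes that one-line comparator work:

```php
<?php

// <=> on equal-length arrays compares element by element, left to right.
var_dump([1, 'a'] <=> [1, 'b']); // int(-1): the ints tie, 'a' sorts before 'b'
var_dump([2, 'a'] <=> [1, 'z']); // int(1): 2 beats 1, so 'z' is never consulted
var_dump([1, 'a'] <=> [1, 'a']); // int(0): identical
```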

As a second test, I also want to check the serialization size. These giant lookup tables are often built once and serialized to a database for cache lookup, so knowing the trade off there is also useful. For that we use this slightly different script:

<?php
declare(strict_types=1);

error_reporting(E_ALL | E_STRICT);

const TEST_SIZE = 1000000;

$list = [];
$start = $stop = 0;

$start = microtime(true);

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $list[$i] = [
    'a' => random_int(1, 500),
    'b' => base64_encode(random_bytes(16)),
  ];
}

$ser = serialize($list);
unserialize($ser);

$stop = microtime(true);
$memory = memory_get_peak_usage();
printf("Runtime: %s\nMemory: %s\nSize: %s\n", $stop - $start, $memory, strlen($ser));

To account for natural jitter in the process, I ran each test once to prime it (on the CLI that shouldn't matter, but it doesn't hurt), then ran it three more times in a row and averaged the results. Here are the results for our baseline test:

Associative array (Sorting)

Run Runtime (s) Memory (bytes)
1 9.4488079547882 541450384
2 9.8389720916748 541450384
3 9.0056548118591 541450384
Avg 9.4311 541450384

Associative array (Serialize)

Run Runtime (s) Memory (bytes) Size
1 1.8638360500336 1100384368 68673068
2 1.8579361438751 1100384368 68672734
3 1.8860640525818 1100388464 68673514
Avg 1.8692 1100385733 68673105

So about 9.4 seconds and half a GB of memory to work with associative arrays. The serialized form is 68 MB. The runtime is pretty stable and the memory usage is constant, as expected. (The slight variation in serialized size is most likely due to randomly generated numbers of different lengths.) Those are the values to beat.
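That size wobble makes sense once you look at how scalars serialize; longer random values simply take more bytes:

```php
<?php

// Integers serialize with their decimal digits inline, so a 3-digit value
// costs two more bytes than a 1-digit one. Hence the tiny size variation
// between otherwise identical runs.
echo serialize(5), "\n";   // i:5;
echo serialize(500), "\n"; // i:500;
```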

stdClass

For completeness, let's switch to a stdClass object. I predicted this would be about the same, as structurally stdClass objects are basically associative arrays that pass by handle instead of by value. Here are the new tests (the boilerplate start and end parts omitted):

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $o = new stdClass();
  $o->a = random_int(1, 500);
  $o->b = base64_encode(random_bytes(16));
  $list[$i] = $o;
}

ksort($list);

usort($list, function($first, $second) {
  return [$first->a, $first->b] <=> [$second->a, $second->b];
});

And here's the data:

stdClass (Sorting)

Run Runtime (s) Memory (bytes)
1 10.945838928223 589831120
2 11.50714302063 589831120
3 11.199006080627 589831120
Avg 11.2173 589831120

stdClass (Serialize)

Run Runtime (s) Memory (bytes) Size
1 3.1958901882172 1210154464 81672386
2 3.3245379924774 1210154464 81673031
3 3.2109470367432 1210154464 81673730
Avg 3.2437 1210154464 81673049

Huh. I expected the serialized version to be a bit bigger, as it needs to store the string "stdClass" over and over again. I didn't expect it to also be measurably slower and less memory efficient than associative arrays. It's not a massive difference, and at smaller cardinality it probably wouldn't be measurable, but it's definitely there.

Why does anyone use stdClass again?

Object with public properties

Now let's get into the real test. In this case we'll predefine a class to use for our list, with two public properties. PHP 7.2 doesn't support typed properties (although it looks like an upcoming version probably will), but the engine still applies various optimizations to object structures when it knows the properties in advance. Let's see if those optimizations pan out in practice.

Here's our test code:

class Item
{
  public $a;
  public $b;
}

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $o = new Item();
  $o->a = random_int(1, 500);
  $o->b = base64_encode(random_bytes(16));
  $list[$i] = $o;
}

ksort($list);

usort($list, function($first, $second) {
  return [$first->a, $first->b] <=> [$second->a, $second->b];
});

And the data:

Public properties (Sorting)

Run Runtime (s) Memory (bytes)
1 8.1981730461121 253831584
2 8.0346500873566 253831584
3 8.4190359115601 253831584
Avg 8.2172 253831584

Public properties (Serialize)

Run Runtime (s) Memory (bytes) Size
1 3.096804857254 1326154736 77673599
2 3.0712831020355 1326154736 77672792
3 3.0746259689331 1326154736 77672696
Avg 3.081 1326154736 77673029

BOOM! For sorting, a proper classed object is measurably faster than an array but the big difference is on memory. It uses half as much memory as the array version did. Half.

Serialization didn't fare quite so well. It's about on par with stdClass time-wise but a bit more efficient space-wise. I strongly suspect that's because the string "Item" is shorter than "stdClass", which gets repeated over and over in the serialized value. That's something to note when dealing with a namespaced class, as the serialized class name can then be quite long.
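You can see that class-name overhead directly in the serialization format, which embeds the name (prefixed by its length) in every object record:

```php
<?php

class Item
{
  public $a = 1;
  public $b = 'x';
}

// Every serialized object repeats its class name, so a shorter name
// means a smaller serialized payload at high cardinality.
echo serialize(new Item()), "\n";     // O:4:"Item":2:{s:1:"a";i:1;s:1:"b";s:1:"x";}
echo serialize(new stdClass()), "\n"; // O:8:"stdClass":0:{}
```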

Object with private properties

A lot of people (like yours truly) preach against using public properties in favor of protected properties exposed through methods. That does introduce more method calls into our test, though. How will that fare?

Here's the new test code:

class Item
{
  protected $a;
  protected $b;

  public function __construct(int $a, string $b)
  {
    $this->a = $a;
    $this->b = $b;
  }

  public function a() : int { return $this->a; }
  public function b() : string { return $this->b; }
}

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $list[$i] = new Item(random_int(1, 500), base64_encode(random_bytes(16)));
}

ksort($list);

usort($list, function(Item $first, Item $second) {
  return [$first->a(), $first->b()] <=> [$second->a(), $second->b()];
});

And the data:

Private properties (Sorting)

Run Runtime (s) Memory (bytes)
1 11.160441160202 253833000
2 10.926701068878 253833000
3 11.177386045456 253833000
Avg 11.0881 253833000

Private properties (Serialize)

Run Runtime (s) Memory (bytes) Size
1 3.2856619358063 1332152352 83672594
2 3.1651678085327 1332152352 83672048
3 3.2460420131683 1332152352 83672899
Avg 3.2322 1332152352 83672513

As predicted, adding methods to the mix slows it down a bit. The memory usage is very close to the public property version. Somehow the serialized version got a little bit slower and larger, but not dramatically. Again, at lower cardinality it would probably not be measurable.

Anonymous classes

Of course, some people are allergic to defining classes. I don't know why but they still view it as a slow and expensive thing to do. Maybe they're concerned about file count (given that PHP by convention uses file-per-class structure, although nothing in the language mandates that). For completeness, though, let's define an anonymous class inline and see how it measures up. We'll only do the public-property version as we know that adding methods will slow it down a tad.

One thing to note, however, is that anonymous classes cannot be serialized. If you need to serialize your data structure then anonymous classes are a no-go. We'll skip that test, of course.
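If you're curious, attempting it fails loudly rather than silently:

```php
<?php

$o = new class {
  public $a = 1;
};

try {
  serialize($o);
} catch (\Exception $e) {
  // PHP refuses with "Serialization of 'class@anonymous' is not allowed".
  echo $e->getMessage(), "\n";
}
```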

Here's the code:

for ($i = 0; $i < TEST_SIZE; ++$i) {
  $o = new class(random_int(1, 500), base64_encode(random_bytes(16))) {
    public $a;
    public $b;

    public function __construct(int $a, string $b)
    {
      $this->a = $a;
      $this->b = $b;
    }
  };
  $list[$i] = $o;
}

And the data:

Anonymous class (Sorting)

Run Runtime (s) Memory (bytes)
1 8.0319430828094 253832368
2 7.9839849472046 253832368
3 8.3128731250763 253832368
Avg 8.1095 253832368

Right in the same neighborhood as the named class, give or take. So for about the same performance and no ability to serialize it, you don't need to define a class by name. I'm sure someone will argue that is a good trade off but that someone would not be me.

Summary

Here's our final data, showing the percent change relative to our baseline for each value (negative number means decrease, which is good):

Summary (Sorting)

Technique Runtime (s) Memory (bytes)
Associative array 9.4311 (n/a) 541450384 (n/a)
stdClass 11.2173 (+18.94%) 589831120 (+8.94%)
Public properties 8.2172 (-12.87%) 253831584 (-53.12%)
Private properties 11.0881 (+17.57%) 253833000 (-53.12%)
Anonymous class 8.1095 (-14.07%) 253832368 (-53.12%)

Summary (Serialize)

Technique Runtime (s) Memory (bytes) Size
Associative array 1.8692 (n/a) 1100385733 (n/a) 68673105 (n/a)
stdClass 3.2437 (+73.53%) 1210154464 (+9.98%) 81673049 (+18.93%)
Public properties 3.081 (+64.83%) 1326154736 (+20.52%) 77673029 (+13.11%)
Private properties 3.2322 (+72.92%) 1332152352 (+21.06%) 83672513 (+21.84%)

What can we conclude from all of this?

First off, a reminder that we're dealing with a cardinality of 1 million here. If your cardinality is 4, odds are you won't notice an earth-shattering difference no matter what you do. However, it's still worth getting into good habits in case your cardinality grows considerably.

The first thing we can conclude is that if the one and only thing you care about is serialization/deserialization performance, associative arrays still win. They're the most time efficient by more than 50%, and the most space efficient by up to 20%.

The second thing we can conclude is that stdClass should be used basically never. It's slower and more memory intensive than arrays in every circumstance. Just don't go there.

In just about every other situation I can think of, named classes win. Their memory usage is half that of a corresponding array. The optimizations the engine can do when it knows up front what the structure of your data is going to be are massive and pay off huge dividends in memory consumption. They're also over 10% faster. The only downside is when trying to serialize them when there is an added cost to time, memory, and stored size. When we also consider that a classed object is far more self-documenting than an associative array, gives IDEs the ability to auto-complete for you, and gives you a place to include additional documentation (which you should include), it's one of the clearest wins I've seen in PHP.

In other words, if you're one of those people who claims that "good code is self-documenting, you don't need comments", and you're not using a classed object, then you're not just wrong, you're a hypocrite who's also wrong. Don't be that person.

The question of public properties vs methods is, I would argue, open. They do offer a more structured, self-documenting, more flexible approach but at the same time do have a hefty CPU penalty over associative arrays. (They still destroy arrays on memory, though.) Whether that is a good trade off or not depends on your use case. My default recommendation would be, when we're talking about what is essentially a private class, use public properties for the main data but don't feel shy about adding additional methods to the object if you want to compute stuff off of it, or it makes sorting easier, or it somehow otherwise is helpful for your use case. Putting a constructor on the class so you can initialize it in a single line is probably a good idea, and I expect would be a wash performance-wise.
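Concretely, my default would look something like this sketch (the `label()` method is an invented example of the kind of derived behavior I mean, not something from the benchmarks):

```php
<?php

class Item
{
  public $a;
  public $b;

  public function __construct(int $a, string $b)
  {
    $this->a = $a;
    $this->b = $b;
  }

  // Derived behavior lives next to the data it derives from.
  public function label() : string
  {
    return $this->a . ':' . $this->b;
  }
}

$item = new Item(42, 'hello');
echo $item->label(), "\n"; // 42:hello
```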

As another consideration, it's common these days for larger frameworks to generate code based on plugin information and store that on disk not as a serialized string but as a generated PHP class that can then be just loaded like any other. (Think Dependency Injection Containers, Event Dispatchers, theme systems where you can register template plugins, etc.) In that case the serialization point is moot and you have absolutely no excuse for not using a named class. Generating out a big nested associative array into your compiled code is just flat out inexcusably wasteful. Don't do that. Stop it.

Although I only ran the tests on PHP 7.2 I'm reasonably confident these results will hold back to PHP 7.0 and later. It's possible they would be different on PHP 5, but since all versions of PHP 5 will be fully unsupported within 6 months I really don't care if they're applicable.

tl;dr: Use named classes with public properties for big internal data structures. If you're still using nested associative arrays for that, You're Doing It Wrong(tm).


I always start with arrays for quick prototyping, then I move back to objects for storing the same data. Not only because I suspected it would be faster (because of the class definition) but because the data I'm sharing has its own methods that know how to deal with it. Here is an example where I moved array structures into their own class; the code is much nicer and it runs a bit faster if you measure a few million iterations.

Interesting article that confirms my theory :D, thanks for writing it.

Nice! Yeah, the ability to encapsulate behavior is one of the most obvious benefits of a class but there's been a general belief in PHP for years that doing so was more expensive than doing it "manually". That may have been true once, but it's definitely not true today. In fact quite the opposite.

An addendum, as a few people have pointed out to me on Twitter:

This applies to runtime behavior. PHP has another optimization where, if you define an array as a const it gets placed in shared memory with the code, so the net memory cost to each process using that array is 0.

That's really only applicable if:

  • You are generating compiled code.
  • The compiled array contains only scalars and arrays (no objects or closures).
  • The compiled array will never be modified at runtime.

In that case, a const big nested array may indeed be better both for CPU and memory.
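A hypothetical generated file along these lines would qualify (all names here are invented for illustration):

```php
<?php

// Because this is a compile-time constant containing only scalars and
// arrays, opcache can keep one shared copy for the whole server instead
// of one copy per process.
const SERVICE_MAP = [
  'logger' => ['class' => 'FileLogger', 'args' => ['/var/log/app.log']],
  'mailer' => ['class' => 'SmtpMailer', 'args' => ['localhost', 25]],
];
```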

The runtime builder for that compiled code, though, is still better off using objects for memory efficiency so that you can produce that compiled code.

As always, context matters. :-)

Great write up, Larry. I won’t fight you. You made a good argument.

Oh good. We have enough things to fight about. I'd hate to add programming optimization to the list. :-)

Nice benchmark !

And what about implementing Serializable on the named class to still store it as an associative array?

Is it the best win-win combo ? Of course we need to ask if defining serialization for simple data struct is relevant 😊.

My guess is it would be slower because it has to call serialize/deserialize in user-space for each class. It might end up being smaller but the performance cost is likely not worth it. That said, I haven't tried.
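For anyone who wants to run that experiment, a rough sketch would be the following (untested performance-wise; also note that Serializable was later deprecated in PHP 8.1 in favor of __serialize()/__unserialize()):

```php
<?php

class Item implements Serializable
{
  public $a;
  public $b;

  public function __construct(int $a, string $b)
  {
    $this->a = $a;
    $this->b = $b;
  }

  // Store a plain positional array to avoid repeating property names.
  public function serialize()
  {
    return serialize([$this->a, $this->b]);
  }

  // Called instead of the constructor when unserializing.
  public function unserialize($data)
  {
    [$this->a, $this->b] = unserialize($data);
  }
}

$restored = unserialize(serialize(new Item(7, 'abc')));
echo $restored->b, "\n"; // abc
```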

How about using the array_multisort() instead of usort():

array_multisort(
      $list, 
      array_column($list,'a'), 
      array_column($list, 'b')
);

Running on a MacBook Pro, 2.2GHz Intel Core i7, 16GB, average of 3 runs:

Associative array (Sorting)

Method Runtime (s) Memory (bytes)
usort 15.6589 541414232 (516.33 MB)
array_multisort 8.8314 706785816 (674.04 MB)

Object with public properties (sorting):

Method Runtime (s) Memory (bytes)
usort 12.9263 253795352 (242.04 MB)
array_multisort 7.6217 419166808 (399.75 MB)

a tradeoff between memory and runtime ...
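(For readers trying this at home, the more common array_multisort() idiom puts the sort columns first and the array being reordered last; a minimal sketch with invented data:)

```php
<?php

// Usual idiom: column arrays first (primary key, then tiebreaker),
// the array being reordered goes last. All three are reordered together.
$list = [
  ['a' => 2, 'b' => 'x'],
  ['a' => 1, 'b' => 'z'],
  ['a' => 1, 'b' => 'y'],
];

$byA = array_column($list, 'a');
$byB = array_column($list, 'b');
array_multisort($byA, $byB, $list);

// $list is now ordered by 'a' then 'b': 1/y, 1/z, 2/x
```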

Interesting observation! If you're sorting an array, yes, that would make a big difference. However, the purpose of usort() here was to provide a direct comparison between objects and arrays, so they had to be used in the same way. That meant usort() so that we could compare the property access in each. I didn't as much care about the sorting itself as sorting was an easy way to call $array['a'] and $object->a a few zillion times. :-)

This is rather old now, but here's a post from Nikita Popov explaining the difference in storage in PHP 5.4:

The structs have changed dramatically in PHP 7, but the basic optimization he describes is still with us, and is the reason for these results.

Some more recent posts on the topic, too:

https://nikic.github.io/2011/12/12/How-big-are-PHP-arrays-really-Hint-BIG.html
https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html

Hi there:

I really tried to use model classes but it is impractical.

Let's say we want to json_serialize. OK, that's not a problem. But what if we have a field that is composed of another model?

class Customer {
      public $fieldId = 1;
      public $name = "";
      public $typeCustomer; // `new` isn't allowed in a property default, so initialize in the constructor

      public function __construct() {
            $this->typeCustomer = new TypeCustomer();
      }
}

Serializing it is not fun. De-serializing (JSON) is an even bigger challenge, because the system doesn't understand that the field $typeCustomer is an object; it de-serializes it as stdClass, and then every method attached to TypeCustomer fails.
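A common workaround (my own sketch, not the commenter's code; the field names are illustrative) is an explicit hydration method that rebuilds nested models from the decoded data:

```php
<?php

class TypeCustomer
{
  public $code;

  public static function fromArray(array $data) : self
  {
    $o = new self();
    $o->code = $data['code'] ?? null;
    return $o;
  }
}

class Customer
{
  public $fieldId;
  public $name;
  public $typeCustomer;

  public static function fromJson(string $json) : self
  {
    $data = json_decode($json, true); // decode to arrays, not stdClass
    $o = new self();
    $o->fieldId = $data['fieldId'] ?? null;
    $o->name = $data['name'] ?? '';
    // Rebuild the nested model explicitly so its methods work again.
    $o->typeCustomer = TypeCustomer::fromArray($data['typeCustomer'] ?? []);
    return $o;
  }
}

$c = Customer::fromJson('{"fieldId":1,"name":"Ada","typeCustomer":{"code":"vip"}}');
echo get_class($c->typeCustomer), "\n"; // TypeCustomer
```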

https://dev.to/jorgecc/php-is-bad-for-object-oriented-programming-oop-282a

Very good post. I often had these issues with associative arrays while writing the code for websites like https://www.receivesms.co