Minimizing readable JSON

Yair Morgenstern
6 min read · Aug 7, 2024


If you want to send a lot of compressed data, JSON is not the way to go — you should be using a binary format intended for network transfer.

What JSON is good at is human readability. Large data structures are simple to prettify and human-parse, as well as human-edit.

In Unciv, we made a conscious decision to value this human readability over file size, a decision that has also proved incredibly valuable for debugging broken saves. But we still occasionally run into problems with Android users who aren’t tech-savvy and just want to send us the error as text, which is as simple as ‘copy to clipboard’.
This is fine for stack traces, but the save files themselves can reach 2 MB, which is a lot of text, and on many devices that doesn’t fit in the clipboard.

Problem summary

  • The goal — Minimize game save to fit in clipboard
  • The constraints — Saved games should be human parseable — valid (for formatting), readable json

Results

These figures are for our json data; obviously, results will depend on your data.

We can separate this into 2 sub-problems:

Minimize readable JSON

  • Don’t write default values — ~15% size decrease
  • Custom serialization for classes — ~15% size decrease

Pack JSON into unreadable, minimized text

  • Why not just a binary format?
  • Gzip, base64 — ~70% size decrease
  • Alternatively, replace known keys/values — ~50% size decrease
  • Going too far — hashsets to bitsets — ~17% size decrease

Minimize readable JSON

Don’t write default values

According to a friend of mine, there are 2 kinds of defaults: ‘no data’ defaults and ‘default data’ defaults.

‘No data’ includes nulls, empty lists and sets, empty strings, and zero numbers. As long as you don’t allow for both null and empty, these are consistent — the default value will not change, therefore it’s safe to simply not write empty values, and let the deserializer fill the default values in for you.

If your code differentiates ‘undefined’ from ‘null’, you need to be careful with this.
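As a minimal sketch of skipping ‘no data’ values, assuming a hypothetical City class and hypothetical helper names (none of this is Unciv code): the writer drops empty fields and the reader relies on the class defaults to fill them back in. It only handles top-level fields, and it leaves booleans alone.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class City:  # hypothetical save-game class, for illustration only
    name: str = ""
    population: int = 0
    buildings: list = field(default_factory=list)


def is_no_data(item):
    """True for None, zero numbers, and empty strings, lists, or dicts."""
    if item is None:
        return True
    if isinstance(item, bool):
        return False  # booleans are always written in this sketch
    if isinstance(item, (int, float)):
        return item == 0
    if isinstance(item, (str, list, dict)):
        return len(item) == 0
    return False


def to_compact_json(city):
    """Serialize only the fields that carry data."""
    data = {key: value for key, value in asdict(city).items() if not is_no_data(value)}
    return json.dumps(data, separators=(",", ":"))


def from_compact_json(text):
    """Missing fields fall back to the dataclass defaults."""
    return City(**json.loads(text))
```

Because every default here is an ‘empty’ value, a round trip through to_compact_json and from_compact_json gives back an equal object.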

However, ‘default data’ values — Enums, Strings, etc that have been determined in-code as “the default” — can change, and therefore not serializing them will lock you in to the current defaults.

This has happened to us: we changed a default, and all data saved under the previous default was now read as the new default. This is Not Fun At All. Once your user has saved the new data and you want to roll back, you can’t actually tell what the original value was, because both the old and new values were considered default at some point and thus omitted!

This can be overcome in-code by specifying different defaults for different save versions, but I do not recommend it. It is a huge hassle.

Custom serialization for classes

This is relatively simple. A vector class with an X and a Y value would be serialized by default as e.g. {"x":4,"y":5}, which is 13 characters.

If we create a custom serialization to make this a string, "4/5", we’re down to 5 characters (quotes included), a ~60% reduction with no loss in readability!
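A minimal sketch of such a custom serializer, assuming integer coordinates and a hypothetical Vector class (the function names are made up for illustration):

```python
import json
from dataclasses import dataclass


@dataclass
class Vector:  # hypothetical class, assumed to hold integer coordinates
    x: int
    y: int


def vector_to_json(vector):
    """Write the vector as a single "x/y" json string."""
    return json.dumps(f"{vector.x}/{vector.y}")


def vector_from_json(text):
    """Parse the "x/y" json string back into a Vector."""
    x, y = json.loads(text).split("/")
    return Vector(int(x), int(y))
```

The result is still valid json (a plain string), so any prettifier or parser can handle it.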

If you have users with different versions sharing data, you need to ensure that if A writes something then B can read it — so you can’t immediately shift to writing the new format. Instead:

  • Allow reading both old and new formats, write old format
  • Wait a reasonable amount of time for most users to upgrade
  • Shift to writing new format

I advise including a ‘version number’ in the json that records which version wrote it, so that we can check it when a json parse error occurs. If we’re behind the version that wrote the file, it’s probably not a data corruption issue but a format we can’t parse yet, so we can tell the user to update.
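The transition and the version check can be sketched together. Everything here is an assumption for illustration: the CURRENT_VERSION constant, the "version" and "position" field names, and the SaveTooNewError exception. The json is assumed to be syntactically valid, with the failure happening when the fields are interpreted.

```python
import json

CURRENT_VERSION = 3  # hypothetical version number of the running game


class SaveTooNewError(Exception):
    """The save was written by a newer version than the one reading it."""


def parse_vector(value):
    """Accept both the new "x/y" string and the old {"x":..,"y":..} object."""
    if isinstance(value, str):
        x, y = value.split("/")
        return (int(x), int(y))
    return (value["x"], value["y"])


def load_save(text):
    data = json.loads(text)
    try:
        return {"position": parse_vector(data["position"])}
    except (KeyError, ValueError, TypeError) as error:
        # A parse failure on a save from a newer version is probably a
        # format change rather than corruption, so ask the user to update.
        if data.get("version", 0) > CURRENT_VERSION:
            raise SaveTooNewError("Please update the game to load this save") from error
        raise
```

Reading both formats first and switching the writer later means older readers never see a format they can’t parse.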

Pack JSON into unreadable, minimized text

As we said, we value readability. But for over-the-network transfer, we value minimizing size even more.

Why not just a binary format?

If we’ve abandoned our goal of human readability, what is all this even about anymore? Why not just binary?

  • Simplicity — base64 decoding doesn’t require any specialized knowledge beyond ‘this is the encoding’
  • Partial readability > no readability: This json is human readable, once you decode it. You can still see and parse the object structure.
  • Trade-offs: Decomposing our big problem into smaller techniques lets us weigh trade-offs in each part of the json. Some parts are essential to human readability. Other parts are really not all that important in the grand scheme, so we can cheat a little and make them go away.

Gzip, base64

The obvious answer to ‘how do I compress data’ is ‘use a professional data compression algorithm’.

Why base64? Because (a) not all devices support binary clipboard data, and (b) our users often send emails, so text is the way to go.
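A minimal sketch of this packing step using Python’s standard library (the pack and unpack names are made up for illustration, and this is not Unciv’s actual code):

```python
import base64
import gzip
import json


def pack(save):
    """Serialize to compact json, gzip it, and encode the bytes as base64 text."""
    raw = json.dumps(save, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def unpack(text):
    """Reverse of pack: base64 text -> gzip bytes -> json -> object."""
    raw = gzip.decompress(base64.b64decode(text))
    return json.loads(raw.decode("utf-8"))
```

Base64 emits 4 characters for every 3 bytes, so it gives back roughly a third of what gzip saved. The result is plain ASCII that survives clipboards and email.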

Replace known keys/values

If you use Gzip, you should not be doing this manually. Gzip does this automagically. This is included because it still retains a readable json, at the expense of manual text replacements.

Many json frameworks allow specifying names for entries, which lets you replace long strings with single digits. But this requires prior knowledge on the code side of what those strings actually mean — thus the json is no longer self-contained.

Instead, you can add a new field to your json root, which is a mapping (like "mapping":{"a":"Apple","b":"Banana"}), and then use the ‘minimized values’ everywhere within the json. By containing its own encoding, this is like a “string-level gzip”.

So encoding is:

  • Gather all strings to minimize
  • Generate minimized versions (can be numbers, for less char count and still valid json)
  • Add mapping to object, jsonify object
  • Replace all values except for mapping itself with mapped values

And decoding is:

  • Read just the mapping from the json (hack version: Regex, semi-hack: object with just ‘mapping’ field)
  • Replace instances of "a" with "Apple", etc. Note that if you allow single-quoted strings, this will also replace inner text, e.g. 'hi I'm "a" boy' -> 'hi I'm "Apple" boy'. YMMV on whether this behaviour is desirable.
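The encode and decode steps above can be sketched as follows. Note one deliberate swap: this sketch walks the parsed structure instead of doing text replacement on the json string, which sidesteps the inner-text issue. It assumes the root is an object with no existing "mapping" key, and all function names are hypothetical.

```python
import json
from collections import Counter


def _count_strings(node, counts):
    """Count every string used as a key or a value."""
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            _count_strings(value, counts)
    elif isinstance(node, list):
        for item in node:
            _count_strings(item, counts)
    elif isinstance(node, str):
        counts[node] += 1


def _replace(node, table):
    """Return a copy of node with keys and string values swapped via table."""
    if isinstance(node, dict):
        return {table.get(key, key): _replace(value, table) for key, value in node.items()}
    if isinstance(node, list):
        return [_replace(item, table) for item in node]
    if isinstance(node, str):
        return table.get(node, node)
    return node


def minimize(data):
    counts = Counter()
    _count_strings(data, counts)
    # Only repeated strings longer than a code are worth a mapping entry.
    repeated = [text for text, seen in counts.most_common() if seen > 1 and len(text) > 2]
    codes, next_code = {}, 0
    for text in repeated:
        while str(next_code) in counts:  # skip codes that already occur as real strings
            next_code += 1
        codes[text] = str(next_code)
        next_code += 1
    packed = _replace(data, codes)
    packed["mapping"] = {code: text for text, code in codes.items()}
    return json.dumps(packed, separators=(",", ":"))


def restore(text):
    packed = json.loads(text)
    mapping = packed.pop("mapping")
    return _replace(packed, mapping)
```

The mapping itself costs characters, so this only pays off once strings repeat often enough; on a tiny object the output can come out larger than the input.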

Going too far — hashsets to bitsets

In our games, we often have hashsets of string values — to indicate what buildings are built in a city, what civs have explored a tile, etc.

However, these strings are repeated frequently. The lightbulb-moment here is that a list of known values allows you to encode value hashsets as bitsets, which are representable as numbers. This is useless for a single hashset, but useful for when you have many hashsets with the same possible values.

For example: we have the strings “one”, “two”, “three”, “four”, etc. One hashset has “one”, “two”, “three”; the other has “one”, “two”, “four”.

Classic json serialization would serialize these hashsets as ["one","two","three"] and ["one","two","four"].

We can instead make a list of all values seen in hashsets:

hashlist: ["one","two","three","four"]

and then create a bitset where bit N (counting from the least significant bit) indicates whether value N exists in this hashset. Thus the first would be 0111 and the second 1011, or 7 and 11 in json. For extra efficiency we’d want the hashlist to be sorted most-seen to least-seen, so our base-10 numbers are consistently low.

You can see that the more hashsets we have, and the more values they have, the more efficient this packing is. In a way, this is like turning the values into a json-internal Enum :)
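A minimal sketch of the bitset packing, with hypothetical function names. Ties in the most-seen ordering are broken alphabetically here only to keep the output deterministic.

```python
from collections import Counter


def build_hashlist(hashsets):
    """List every value seen in any hashset, most-seen first."""
    counts = Counter(value for values in hashsets for value in values)
    return sorted(counts, key=lambda value: (-counts[value], value))


def encode(values, hashlist):
    """Set bit N when hashlist[N] is in the hashset."""
    index = {value: position for position, value in enumerate(hashlist)}
    number = 0
    for value in values:
        number |= 1 << index[value]
    return number


def decode(number, hashlist):
    """Rebuild the hashset from the bits that are set."""
    return {value for position, value in enumerate(hashlist) if (number >> position) & 1}
```

With the hashlist from the example, encode({"one", "two", "three"}, hashlist) gives 7 and encode({"one", "two", "four"}, hashlist) gives 11. One caveat: Python integers are unbounded, but json parsers that read numbers as doubles lose precision beyond 53 bits, so very long hashlists would need to be split or written as strings.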

Obviously, if we’ve already replaced known strings then our hashsets go from e.g. ["1","2","3"] to 7 — which is still a gain, but vastly less so. Unlike other steps here, these two are competing on the same efficiency-space.

This is no longer human readable, which is why it’s “too far”: unlike previous steps, unpacking this json requires specialized code. But as part of our trade-offs, we see there are places where the minification is worth the non-readability.


Written by Yair Morgenstern

Creator of Unciv, an open-source multiplatform reimplementation of Civ V https://github.com/yairm210/Unciv
