Profiling Java Set operations

Yair Morgenstern
Dec 31, 2024


Using bitwise operations for set intersection, we managed to decrease the latency of a core Kotlin service by nearly half, across all percentiles — e.g. p99 went from ~35ms to ~20ms.

Our purpose here is to present the methodology of reaching this result, as well as the actual changes we made to do so.

Lessons learnt

  • Set<T> operations for latency-critical paths, where all possible set values are known, should be handled with bitwise operations and not HashSet<T>
  • Kotlin Deferred comes with handling overhead — for fast tasks, it is faster to pass the function lambda around and invoke it only when necessary
  • Kotlin’s Collections library is fantastic for quickly writing working code, but prioritizes simplicity over performance — it is far from performance-optimized

Context

Forter provides trust to eCommerce merchants. In other words, we determine the legitimacy of end-user actions, such as transactions — this is called a decision.

The Identity group at Forter is responsible for matching the data from a decision being made with the corpus of identities that we already know about, in real time — meaning at the lowest latency possible. Is the user a legitimate returning customer, or maybe a repeat credit-card thief? It’s our job to say.

In the process of this matching, we execute several queries over different databases. One of these is a specialized search engine that we wrote internally in Kotlin to provide very low-cost search capabilities compared to previously used Elasticsearch.

We had already achieved lower latencies than the corresponding Elasticsearch queries, but we were still on the critical path of the decision — meaning reducing latencies in this service would reduce the overall latency for the entire decision. Better for us, for our customers, and for the end users.

Setup

Given a user request, our system retrieves possibly relevant data from a local database (i.e. performs I/O), filters the results against a list of predefined rules, and returns the relevant data.

Due to existing latency metrics in-code, we already knew that the latency-heavy part was the rules-based filtering section.

This fact is interesting. We expected the I/O in the pipeline to be the latency-heavy part, but in reality it’s the in-memory data crunching that delays the response the most.

On the other hand, this is terrific when trying to optimize the latency. It meant that we didn’t need to actually perform the retrieval in order to optimize the filtering — we could test with the retrieved data as input, which would give us the same behaviour as the service.

Profiling a live service with tools like Pyroscope is definitely possible, but the results are much less detailed, and the feedback loop much longer, than profiling locally with a tool like IntelliJ.

We therefore:

  • Serialized samples of retrieved data in a local file on the server (appended to upon X% of requests)
  • Uploaded that file to a shared location, downloaded locally
  • Wrote a custom reader to deserialize these files and provide the inputs as JVM classes

Our profiling could then be done from within IntelliJ, using production inputs.
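To give a concrete picture, here is a minimal sketch of what such sampling and replay can look like — the sample rate, file handling, and reader below are illustrative stand-ins, not our actual code:

```kotlin
import java.io.File
import kotlin.random.Random

// Hypothetical sketch: append a fraction of serialized inputs to a local sample
// file on the server, then replay them later into the profiled code path.
const val SAMPLE_RATE = 0.01 // stands in for the "X%" of requests mentioned above

fun maybeRecordSample(serializedInput: String, sampleFile: File) {
    if (Random.nextDouble() < SAMPLE_RATE) {
        // One serialized input per line, appended so the file accumulates over time
        sampleFile.appendText(serializedInput + "\n")
    }
}

// The matching local reader: deserialize each line back into the JVM classes the
// filtering code expects, so the profiler runs on production-shaped inputs.
fun <T> readSamples(sampleFile: File, deserialize: (String) -> T): List<T> =
    sampleFile.readLines().map(deserialize)
```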

Profiling

We started by running on all inputs we got from live data retrievals in production. We then took the original inputs and split them by latency percentile into new input files — for p90, p99, and p99.9, thinking that the tail latencies could be caused by behaviour uncharacteristic of most requests.

We discovered that the majority of the time went on comparing the fields that were equal against the fields that needed to be equal for each of the rules — Set contains — as well as other Set intersection and union functions.
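For intuition, this is roughly the shape of a HashSet-based rule check — the rule structure and names are assumptions for illustration, not our actual rules:

```kotlin
// Illustrative only: a rule "matches" when every field it requires to be equal
// is among the fields that actually turned out equal for this candidate.
data class Rule(val requiredEqualFields: Set<String>)

fun matches(equalFields: Set<String>, rule: Rule): Boolean =
    // containsAll on a HashSet hashes and probes for every required field —
    // cheap per call, but dominant when evaluated for every rule on every candidate
    equalFields.containsAll(rule.requiredEqualFields)
```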

Another significant portion of the time went to overhead for creating and cancelling Kotlin Deferred.

(How to read: The bottom function is the caller, each function above is a callee; The width of each function represents its execution time; The leftmost functions are the longest)

After (Spoiler!):

Set functions virtually disappear, Deferred overhead actually disappears

In short, the changes were:

  • Use specialized sets instead of generic HashSets
  • Pass around the “Deferred” function and execute it only if necessary, rather than trying to parallelize it (sketched below)
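A minimal sketch of the Deferred change, assuming a computation that is cheap and only sometimes needed (function names are illustrative):

```kotlin
import kotlinx.coroutines.Deferred
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

// Before: eagerly launching a Deferred pays coroutine creation/cancellation
// overhead on every request, even when the result is never used.
suspend fun withDeferred(needed: Boolean, compute: () -> Int): Int? = coroutineScope {
    val deferred: Deferred<Int> = async { compute() }
    if (needed) {
        deferred.await()
    } else {
        deferred.cancel()
        null
    }
}

// After: pass the lambda itself and invoke it only on the path that needs it.
fun withLambda(needed: Boolean, compute: () -> Int): Int? =
    if (needed) compute() else null
```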

BitSet vs EnumSet

Since all possible fields for the set operations were well-known, we decided to use bitwise operations for set unions, contains, and intersection operations.

We knew that EnumSet did much of the same thing, but it requires that all the fields are known at compile-time. Since we wanted to remain open to the idea of dynamic fields, we decided to implement much of the same logic using java.util.BitSet.

When testing this for performance, we found EnumSet to be twice as fast as our BitSet implementation — using the BitSet version led to 15% higher end-to-end latencies.

How does EnumSet achieve this? It distinguishes between 2 cases, based on the size of its “universe”, or number of possible values:

  • Number of possible values <= 64 — use a single Long for all bit operations; since a Long has 64 bits, the entire set fits in one primitive
  • Number of possible values > 64 — use an array of Longs (effectively a bit set), which can be arbitrarily large
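As a minimal sketch of the <= 64 case (illustrative function names): the whole set is a single Long, with bit i set when value i is a member, so each set operation becomes a bitwise instruction or two:

```kotlin
// Each value is identified by an ordinal in 0..63; the set itself is just a Long.
fun singleton(ordinal: Int): Long = 1L shl ordinal

fun contains(set: Long, ordinal: Int): Boolean = (set and singleton(ordinal)) != 0L
fun union(a: Long, b: Long): Long = a or b
fun intersection(a: Long, b: Long): Long = a and b
fun containsAll(a: Long, b: Long): Boolean = (b and a.inv()) == 0L // every bit of b is in a
```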

With this information, we were able to build an implementation that made the same distinction, but also allowed the possible fields to be declared dynamically.

Since we lack the compile-time guarantees of EnumSet<T>, which knows that its “universe” is always the same, we need to validate, for operations between 2 sets, that both have the same universe.

As documented in the code:

 * |                           | `Set<Int>` | [EnumSet]            | [BitSet] | [FixedSizeBitSet]  |
 * |---------------------------|------------|----------------------|----------|--------------------|
 * | Works with                | Any Int    | Enums only           | Any Int  | Any Int < capacity |
 * | Max size                  | Dynamic    | Fixed (compile-time) | Dynamic  | Fixed (runtime)    |
 * | Speed when capacity <= 64 | Slow       | Very fast            | Fast     | Very fast          |
 * | Speed when capacity > 64  | Slow       | Fast                 | Fast     | Fast               |
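A rough sketch of the idea behind such a set — only the arbitrary-capacity path is shown, backed by java.util.BitSet (the <= 64 case would hold a single Long as above); the internals and names are guesses at the shape, not the real implementation:

```kotlin
import java.util.BitSet

// Fixed-capacity set over Ints, with the universe declared at runtime.
class FixedSizeBitSet(val capacity: Int) {
    private val bits = BitSet(capacity)

    fun add(value: Int) {
        require(value in 0 until capacity) { "Value $value is outside a universe of size $capacity" }
        bits.set(value)
    }

    operator fun contains(value: Int): Boolean =
        value in 0 until capacity && bits.get(value)

    fun intersect(other: FixedSizeBitSet): FixedSizeBitSet {
        // Without EnumSet's compile-time guarantee, we must check at runtime
        // that both sets share the same universe before operating on them.
        require(capacity == other.capacity) { "Sets have different universes" }
        return FixedSizeBitSet(capacity).also { result ->
            result.bits.or(bits)        // copy this set's bits
            result.bits.and(other.bits) // keep only bits present in both
        }
    }
}
```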

Live Comparison

To test this end-to-end, we had 2 apples-to-apples comparisons to make: Latency (are we faster?) and Correctness (are we returning correct results?).

We achieved this by replicating requests to the production service and sending them to an experimental service — with our existing latency monitoring, this was enough to compare speed.

We then recorded the results returned by each of these services, and compared them to ensure equivalence.

As for the results: we managed to cut latency by half across the board 😀

This obviously also translates into:

  • Lower cost — each instance can do twice as much
  • Fewer GC pauses — each EnumSet-equivalent is just one object in memory (a JVM object header of 12–16 bytes plus a single primitive long of 8 bytes, roughly 24 bytes in total), compared to the much larger footprint of a HashSet

Written by Yair Morgenstern

Creator of Unciv, an open-source multiplatform reimplementation of Civ V https://github.com/yairm210/Unciv
