search term:

hashing bakeoff

This is a blog post on sbt 2.x development, continuing from sbt 2.x remote cache, sudori part 4, part 5, part 7 etc.

background

Hashing comes up while developing sbt, often in the context of caching and incremental compilation. Basically when we have either a list of files or some parameters, we want to figure out if the calculation has ever been done before. That require hashing the inputs. It’s a neat trick to be able to treat various shape of data as numbers.

The downside of this approach is that we are hashing things all the time, which takes time. In 2020, I ran a small experiment to hash 1000 JAR files on my laptop to compare hashing algorithms. This is how I settled on using non-cryptographic FarmHash via Zero-Allocation-Hashing (ZAHa) in sbt 1.x and Zinc 1.x VirtualFile.

Fast forward to 2026, the relevance of hashing has only increased as sbt 2.x attempts to cache more tasks. One issue with ZAHa is its use of sun.misc.Unsafe, which will be removed in future JDK:

WARNING: Please consider reporting this to the maintainers of class net.openhft.hashing.UnsafeAccess
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
WARNING: A restricted method in java.lang.System has been called

XXHash and WyHash

Since Scala 3.x and sbt 2.x are collectively leaping forward to JDK 17, we can port non-cryptographic hashing functions to Scala 3 using Variable Handles, or VarHandles.

In sbt#9267 I’ve ported XXHash and WyHash to Scala 3.

kcrypt/scala-blake3

I’ve added Kcrypt Lab’s Blake3 implementation into the benchmark. The original C implementation of Blake3 boasts impressive performance.

Blake3 can output a range of digest length, 64-bit or 256-bit. It’s also a cryptographic hash that’s recognized by Bazel cache system.

benchmarking hashing fuctions

Here’s JMH benchmark result of hashing 2048 bytes on GitHub Actions:

[info] Benchmark                                Mode  Cnt     Score    Error  Units
[info] FarmHashHashBenchmark.hashByteArray      avgt    5     0.255 ±  0.001  us/op
[info] XXHash64HashBenchmark.hashByteArray      avgt    5     0.216 ±  0.001  us/op
[info] WyHash64HashBenchmark.hashByteArray      avgt    5     0.377 ±  0.002  us/op
[info] MurmurHash32HashBenchmark.hashByteArray  avgt    5     1.340 ±  0.007  us/op
[info] MurmurHash64HashBenchmark.hashByteArray  avgt    5     2.368 ±  0.020  us/op
[info] Blake3HashBenchmark.hashByteArray        avgt    5     6.639 ±  0.067  us/op
[info] Md5HashBenchmark.hashByteArray           avgt    5     7.936 ±  0.059  us/op
[info] Sha256HashBenchmark.hashByteArray        avgt    5     9.299 ±  0.103  us/op

I’ve included the current ZAHa FarmHash as the baseline, as well as scala.util.MurmurHash from the Scala standard library, MD5, and SHA256. I sometimes wonder if MD5 can be a “non-cryptographic” hash. The result shows that FarmHash took 255 ns to hash 2048 bytes. To normalize the score, we can convert it to GB/s:

Hash Score
(lower is better)
GB/s Speedup
FarmHash (ZAHa) 0.255 7.48 1
XXHash64 0.216 8.83 1.18x faster
WyHash64 0.377 5.06 1.5x slower
MurmurHash32 (Scala.util) 1.340 1.42 5.3x slower
MurmurHash64 2.368 0.805 9.3x slower
Blake3 (scala-blake3 3.1.2) 6.639 0.287 26x slower
Md5Hash 7.936 0.240 31x slower
Sha256 9.299 0.205 36x slower

This doesn’t necessarily mean that we’d be able to hash 7 GB in 1s, but it gives us a rough idea. This also shows that XXHash64 is faster than ZAHa FarmHash on GitHub Actions, and that non-crypograph hash functions are order of magnitude faster than MD5, Blake3, or SHA256.

benchmarking the file hashes

Here is a hashing benchmark of hashing 1MB file:

[info] Benchmark                                Mode  Cnt     Score    Error  Units
[info] ImoXXHash64FileHashBenchmark.hashFile    avgt    5    26.312 ±  0.199  us/op
[info] ImoWyHash64FileHashBenchmark.hashFile    avgt    5    27.851 ±  0.046  us/op
[info] XXHash64FileHashBenchmark.hashFile       avgt    5   239.858 ±  1.517  us/op
[info] WyHash64FileHashBenchmark.hashFile       avgt    5   282.985 ±  1.474  us/op
[info] Sha1FileHashBenchmark.hashFile           avgt    5   783.139 ±  5.554  us/op
[info] Sha256FileHashBenchmark.hashFile         avgt    5   827.696 ±  3.025  us/op
[info] Blake3FileHashBenchmark.hashFile         avgt    5  3147.657 ±  1.786  us/op

ImoXXHash64 uses the ImoHash, which hashes only the beginning, middle, and the tail of the file. It would hash any file in approx 26 μs.

Hash Score
(lower is better)
GB/s
ImoXXHash64 26.312 n/a
ImoWhyHash64 27.851 n/a
XXHash64 239.858 4.1
WyHash64 282.985 3.5
Sha1 783.139 1.2
Sha256 827.696 1.2
Blake3 (scala-blake3 3.1.2) 3147.657 0.3

This shows that for file hashing, the difference between XXHash64 vs SHA256 difference isn’t as dramatic.

Regardless, constant-time ImoHash would be significantly faster than either XXHash64 or SHA256. I’m a bit surprised that Blake3 is order-of-magnitude slower than everything else, including SHA256. I wonder if this is due to some GC that needs to be optimized out using VarHandle etc.

see also hash4j

To drop the usage of ZAHa I’ve opted to port hashing functions to Scala myself, but there’s also a Java library called hash4j, which implements these functions. I have not tested them.

summary