sbt 2.x remote cache
introduction
A remote cache, or a cloud build system, can speed up builds dramatically by sharing build results (Mokhov 2018). This is a feature that I’ve been interested in ever since I heard about Blaze (now open sourced as Bazel). In 2020, I implemented cached compilation in sbt 1.x. reibitto has reported that “what was once 7 minutes to compile everything now takes 15 seconds.” Others have also reported 2x ~ 5x speedups. While this is promising, it’s a bit clunky, and it works only for the compile task. In March 2023, I jotted down RFC-1: sbt cache ideas to outline the current issues and a solution design. Here are some of the problems:
- Problem 1: sbt 1.x implements remote caching for compile, and disk caching for some other tasks, but we would like a solution that custom tasks can participate in
- Problem 2: sbt 1.x has separate mechanisms for the disk cache and the remote cache, but we would like one mechanism that build users can switch between local and remote caching
- Problem 3: sbt 1.x used the Ivy resolver as the cache abstraction, but we’d like a more open design for the remote cache backend
As my December Adventure 2023 project, I decided to tackle the sbt 2.x remote cache feature in my free time. The proposal is on GitHub #7464. This post explores the details of the change. Note: It shouldn’t require too much sbt internal knowledge, but the target audience is advanced, since this is more of an extended PR description.
low-level foundation
In the abstract, we can think of a cached task as:
(In1, In2, In3, ...) => (A1, Seq[Path])
If we saved the hash of the inputs and the result somewhere, like on a disk, we can skip the evaluation of expensive tasks and present the result instead. The result of a cached task is represented as an ActionResult:
import xsbti.HashedVirtualFileRef
class ActionResult[A1](a: A1, outs: Seq[HashedVirtualFileRef]):
  def value: A1 = a
  def outputs: Seq[HashedVirtualFileRef] = outs
  ....
end ActionResult
We’ll come back to HashedVirtualFileRef later, but it carries a file name with some content hash. Using these, we can define the cache function as follows:
import sjsonnew.{ HashWriter, JsonFormat }
import xsbti.VirtualFile
object ActionCache:
  def cache[I: HashWriter, O: JsonFormat: ClassTag](
      key: I,
      codeContentHash: Digest,
      extraHash: Digest,
      tags: List[CacheLevelTag],
  )(
      action: I => (O, Seq[VirtualFile])
  )(
      config: BuildWideCacheConfiguration
  ): O =
    val input =
      Digest.sha256Hash(codeContentHash, extraHash, Digest.dummy(Hasher.hashUnsafe[I](key)))
    ....
end ActionCache
The type parameter I would typically be a tuple. The signature of the action function looks a bit odd, because it includes Seq[VirtualFile]. This is to capture file output effects during a task.
automatic derivation of cacheable task
sbt’s DSL is an Applicative do-notation, which translates
someKey := {
  name.value + version.value + "!"
}
into an Applicative mapN expression via macros:
someKey <<= i.mapN((wrap(name), wrap(version)), (q1: String, q2: String) => {
  q1 + q2 + "!"
})
Using Scala 3 macros, we can automatically derive a cacheable task by further wrapping the output:
someKey <<= i.mapN((wrap(name), wrap(version)), (q1: String, q2: String) => {
  ActionCache.cache[(String, String), String](
    key = (q1, q2),
    otherInputs = 0): input =>
      (q1 + q2 + "!", Nil)
})
For this to work, the input tuple must satisfy sjsonnew.HashWriter, and the result type, for example String, must satisfy JsonFormat. One way to think about this is that we are constructing a Merkle tree out of the abstract syntax tree of your build.sbt and pseudo case classes.
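For a custom data type, both typeclasses can be supplied via sjson-new’s LList isomorphism, which describes the type as a labelled heterogeneous list (a node of that Merkle tree). Here is a sketch roughly following sjson-new’s documented pattern; the Greeting type is made up for illustration:

import sjsonnew.*
import LList.:*:
import BasicJsonProtocol.*

// hypothetical custom result type; the LList isomorphism lets sjson-new
// derive both JsonFormat (for cached values) and HashWriter (for cache keys)
case class Greeting(message: String, count: Int)
object Greeting:
  given IsoLList.Aux[Greeting, String :*: Int :*: LNil] = LList.iso(
    (g: Greeting) => ("message", g.message) :*: ("count", g.count) :*: LNil,
    (in: String :*: Int :*: LNil) => Greeting(in.head, in.tail.head)
  )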
cache backend
The following trait abstracts over a cache backend.
opaque type Digest = String
/**
* An abstraction of a remote or local cache store.
*/
trait ActionCacheStore:
  def put[A1: ClassTag: JsonFormat](
      actionDigest: Digest,
      value: A1,
      blobs: Seq[VirtualFile],
  ): ActionResult[A1]
  def get[A1: ClassTag: JsonFormat](input: Digest): Option[ActionResult[A1]]
  def putBlobs(blobs: Seq[VirtualFile]): Seq[HashedVirtualFileRef]
  def getBlobs(refs: Seq[HashedVirtualFileRef]): Seq[VirtualFile]
  def syncBlobs(refs: Seq[HashedVirtualFileRef], outputDirectory: Path): Seq[Path]
end ActionCacheStore
Hopefully the methods are self-explanatory, but this API is for someone who wants to implement a cache backend, so understanding the details isn’t important. An interesting thing to note is that it requires only 5 methods. For the initial testing, I’m going to focus on a local disk cache.
Here’s how the cache directory looks after running packageBin, which is a cached task:
$ tree $HOME/Library/Caches/sbt/v2/
~/Library/Caches/sbt/v2/
├── ac
│ ├── sha256-d3ea49940f3ec7f983ddfe91f811161d2fee53c19ec58db224c789b63c5d759d
│ └── sha256-e2d1010d6ce5808902e35222ec91d340ae7ecb013ec7cb3b568c3b2c33c3ffa0
└── cas
├── sha256-02775d17841ec170a97b2abec01f56fb3e3949fefc8d69121e811f80c041cfb1
├── sha256-601ba6379aeed7fefd522d3a916b3750c35fe8cd02afe95a7be4960de1fbcfa7
└── sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027
The file content of ac/sha256-d3ea49940f3ec7f983ddfe91f811161d2fee53c19ec58db224c789b63c5d759d is:
{"$fields":["value","outputFiles"],"value":"${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT.jar>sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027","outputFiles":["${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT.jar>sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027"]}
cas/sha256-f824ffe... is a JAR file:
$ unzip -l $HOME/Library/Caches/sbt/v2/cas/sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027
Archive: ~/Library/Caches/sbt/v2/cas/sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027
Length Date Time Name
--------- ---------- ----- ----
298 01-01-2010 00:00 META-INF/MANIFEST.MF
0 01-01-2010 00:00 example/
608 01-01-2010 00:00 example/Greeting.class
363 01-01-2010 00:00 example/Greeting.tasty
....
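The cas/ directory above is plain content-addressable storage: each blob is stored under the SHA-256 of its contents, and outputs are materialized as symbolic links to it. A minimal self-contained sketch of the idea, not sbt’s actual store implementation:

import java.nio.file.{ Files, Path }
import java.security.MessageDigest

// minimal sketch of a disk-based CAS: blobs live under cas/sha256-<hex>
class ToyDiskCas(root: Path):
  private def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  // store a blob under its content hash; writing is idempotent
  def put(bytes: Array[Byte]): String =
    val digest = s"sha256-${sha256Hex(bytes)}"
    val target = root.resolve("cas").resolve(digest)
    Files.createDirectories(target.getParent)
    if !Files.exists(target) then Files.write(target, bytes)
    digest

  // materialize a blob into an output path as a symbolic link,
  // similar in spirit to syncBlobs above
  def sync(digest: String, output: Path): Path =
    Files.createDirectories(output.getParent)
    Files.deleteIfExists(output)
    Files.createSymbolicLink(output, root.resolve("cas").resolve(digest))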
practical problems with caching
If caching were easy, it wouldn’t be listed as one of the hardest problems in computer science, along with making profits from open source (and off-by-one errors).
serialization issues
First, caching is serialization-hard, i.e. at least as hard as the serialization problem. For sbt, a build tool that has existed in its current shape for 10+ years, this is going to be the biggest hurdle to cross. For instance, there’s a datatype called Attributed[A1] that holds data A1 with arbitrary metadata key-values. Basic things like the classpath are expressed using Seq[Attributed[File]], which is used to associate a Zinc Analysis with classpath entries.
As long as we were executing tasks like compile in-memory, Attributed[A1], which is effectively a Map[String, Any], worked ok. But in light of caching, we’d need HashWriter for inputs and JsonFormat for cached values, which is not possible for Any. In this case, I’ve worked around the issue by creating StringAttributeMap.
file serialization issues
Caching is file-serialization-hard, i.e. at least as hard as serializing a file. java.io.File (or Path) is such a special beast that it requires its own consideration, not because of technicality, but mostly because of our own assumptions about what it means. When we say a “file”, it could actually mean:
- a relative path from a well-known location
- a unique proof of a file, or a content hash
- a materialized actual file
When we use java.io.File, it’s somewhat ambiguous which of the above three is meant. Technically speaking, a File just means the file path, so we can deserialize just the filename, such as target/a/b.jar. This will fail the downstream tasks if they assumed that target/a/b.jar would exist in the file system.
To disambiguate, xsbti.VirtualFileRef is used for relative paths only, and xsbti.VirtualFile is used for materialized virtual files with contents. However, for the purpose of caching a list of files, neither is great. The filename alone doesn’t guarantee that the file will be the same, and carrying the entire content of the files in JSON etc. is too inefficient. Given that the same JAR can be repeated within a build, it doesn’t make sense to embed the contents when we need just a reference.
This is where the mysterious second option, a unique proof of a file, comes in handy. One of the key innovations of the Bazel cache is the idea of content-addressable storage (CAS). You can think of it as a directory full of files, each named using the content hash of its contents. By knowing the content hash, we can always materialize it into an actual file, but for the purpose of data we can address it using the content hash. Actually, we’d also need the name of the file as well, so in sbt 2.x I’ve added HashedVirtualFileRef to represent this:
public interface HashedVirtualFileRef extends VirtualFileRef {
  String contentHashStr();
}
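As the cached JSON earlier shows, a HashedVirtualFileRef serializes compactly as the path and the content hash joined by >. A toy round-trip of that notation (the real codec lives in sbt’s JSON formats):

// toy encoder/decoder for the "path>sha256-..." notation in the cached JSON
def encodeRef(path: String, hash: String): String = s"$path>$hash"
def decodeRef(s: String): (String, String) =
  val i = s.lastIndexOf('>')
  (s.take(i), s.drop(i + 1))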
effect issues
Caching is IO-hard, if we generalize the file serialization issue to all side effects. We need to manage any side effects that the tasks perform that we care about, which might include displaying text on the console. We might also need to think about composition.
declaring the outputs
In sbt 2.x, I’m introducing a new function Def.declareOutput:
Def.declareOutput(out)
This would be called from within a task to declare a file output. In a typical build tool, file creation is performed via side effects, and a task may generate many files, which the downstream tasks may or may not actually use. With a remote-cached build tool, we need to declare the outputs so that the expected files are downloaded. Also note that some tasks like compile currently generate files but do not have a file as their return type.
someKey := Def.cachedTask {
  val output = StringVirtualFile1("a.txt", "foo")
  Def.declareOutput(output)
  name.value + version.value + "!"
}
This becomes:
someKey <<= i.mapN((wrap(name), wrap(version)), (q1: String, q2: String) => {
  var o1: VirtualFile = _
  ActionCache.cache[(String, String), String](
    key = (q1, q2),
    otherInputs = 0): input =>
      val output = StringVirtualFile1("a.txt", "foo")
      o1 = output
      (q1 + q2 + "!", List(o1))
})
When we run the task for the first time, sbt evaluates q1 + q2 + "!", but it’ll also store o1 into the CAS and calculate an ActionResult, which contains a list of HashedVirtualFileRef. During the second run, ActionCache.cache(...) can materialize it into a physical file and return a VirtualFile for it.
opting out of serialization
In the previous example, all input settings/tasks were assumed to be part of the cache key:
ActionCache.cache[(String, String), String](
  key = (q1, q2),
  ....
This is probably a decent default behavior, but in practice there are some keys that you’d want to exclude from the cache key. For example, the streams key is used for logging and is given a fresh value each time, so it has no meaningful value for serialization. There’s no reason to try to turn it into JSON.
I’ve added an annotation called cacheLevel(...) for this purpose:
@meta.getter
class cacheLevel(
    include: Array[CacheLevelTag],
) extends StaticAnnotation

enum CacheLevelTag:
  case Local
  case Remote
end CacheLevelTag
Now we can opt streams out as follows:
@cacheLevel(include = Array.empty)
val streams = taskKey[TaskStreams]("Provides streams for logging and persisting data.")
  .withRank(DTask)
In general, we might want to exclude anything machine-specific or non-hermetic from the cache key when possible.
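For instance, a hypothetical key whose value is machine-specific could participate in the local disk cache but stay out of the remote one (localToolchain is made up for illustration):

// hypothetical key: allow local caching only, since the value is machine-specific
@cacheLevel(include = Array(CacheLevelTag.Local))
val localToolchain = taskKey[String]("Describes the machine-local toolchain.")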
latency tradeoff issues
Caching is latency-tradeoff-hard. If the compile task generated 100 .class files and packageBin created a .jar, a cache hit for the compile task then incurs 100 file reads for a disk cache, or 100 file downloads for a remote cache. Given that a JAR file can approximate .class files, we should use JAR files for compile to reduce the file-download chattiness.
hermeticity issues
Remote caching is hermeticity-hard. The premise of a remote cache is that the cached results are sharable across different machines. When we end up capturing machine-specific information unintentionally into the artifact, we could end up with a growing cache size, a low cache hit rate, or a runtime error. This is called a hermeticity break.
Two common issues are capturing an absolute path via java.io.File, or the current timestamp. More subtle ones that I’ve seen are a JVM bug that captures the timezone of the machine, and GraalVM capturing the glibc version.
package aggregation issue
Cache invalidation is package-aggregation-hard. See the Analysis of Zinc talk for more details. I just made up the name “package aggregation” here, but the gist of the issue is that the more source files you aggregate into a subproject, the more inter-connected the subprojects become, and a naïve invalidation that simply inverts the dependency graph would end up spreading the initial invalidation (code changes) to most of the monorepo like a wildfire.
Build tools deal with this issue in various ways:
- Make the subprojects more granular, like the 1:1:1 rule (one directory, one package, one target)
- Ignore transitive dependencies, also known as strict deps (Bazel does this for Java)
- Track dependencies at the method-usage granularity (Zinc does this)
- Remove unused imports and library dependencies
Initially I’m going to implement the simple naïve invalidation, but we should leave the door open to iterate in this area. (Thanks Matthias Berndt for reminding me about this.)
case study: packageBin task
The packageBin task creates the JAR file from the class files. In general, the package* family of tasks is created using the packageTaskSettings and packageTask functions and the Package object. We can try turning packageBin into a cached task.
First, we need to make PackageOption serializable. I turned it into a Scala 3 enum, implemented JsonFormat for each case, and then defined a union:
enum PackageOption:
  case JarManifest(m: Manifest)
  case MainClass(mainClassName: String)
  case ManifestAttributes(attributes: (Attributes.Name, String)*)
  case FixedTimestamp(value: Option[Long])

object PackageOption:
  ....
  given JsonFormat[PackageOption] = flatUnionFormat4[
    PackageOption,
    PackageOption.JarManifest,
    PackageOption.MainClass,
    PackageOption.ManifestAttributes,
    PackageOption.FixedTimestamp,
  ]("type")
end PackageOption
The Package.Configuration class was modified as follows:
// in sbt 1.x
final class Configuration(
    val sources: Seq[(File, String)],
    val jar: File,
    val options: Seq[PackageOption]
)

// in sbt 2.x
final class Configuration(
    val sources: Seq[(HashedVirtualFileRef, String)],
    val jar: VirtualFileRef,
    val options: Seq[PackageOption]
)
Note that HashedVirtualFileRef represents the input sources, while VirtualFileRef is used to specify the output file name. The action code that creates a JAR file, Pkg.apply(...), will return VirtualFile instead of Unit.
The packageBin key in Keys.scala was changed to:
val packageBin = taskKey[HashedVirtualFileRef]("Produces a main artifact, such as a binary jar.").withRank(ATask)
and the new packageTask looks like this:
def packageTask: Initialize[Task[HashedVirtualFileRef]] =
  Def.cachedTask {
    val config = packageConfiguration.value
    val s = streams.value
    val converter = fileConverter.value
    val out = Pkg(
      config,
      converter,
      s.log,
      Pkg.timeFromConfiguration(config)
    )
    Def.declareOutput(out)
    out
  }
A subtle point I want to make is that in the above, I chose to use HashedVirtualFileRef instead of VirtualFile as the return type, even though out is a VirtualFile. In fact, it would not compile if the task key were changed to Initialize[Task[VirtualFile]]:
[error] -- [E172] Type Error: /user/xxx/sbt/main/src/main/scala/sbt/Defaults.scala:1979:5
[error] 1979 | }
[error] | ^
[error] |Cannot find JsonWriter or JsonFormat type class for xsbti.VirtualFile.
Recall ac/sha256-d3ea49940f3ec7f983ddfe91f811161d2fee53c19ec58db224c789b63c5d759d in the disk cache:
{"$fields":["value","outputFiles"],"value":"${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT.jar>sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027","outputFiles":["${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT.jar>sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027"]}
If the task’s return type were VirtualFile, we’d have to serialize the entire file content into the above JSON. Instead, we store only the relative path along with its unique proof of the file, calculated using SHA-256: "${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT.jar>sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027". The actual content is given to the CAS via Def.declareOutput(out).
Once the disk cache is hydrated, even after clean, packageBin will now be able to quickly make a symbolic link to the disk cache instead of zipping the inputs.
case study: compile task
Now that packageBin is cached automatically, we can extend this idea to compile as well. One of the challenges, as mentioned above, is the latency-tradeoff problem. In sbt 1.x, we can create tasks as fine-grained as we want, since a task is used only to denote a chunk of work that is typed and can be parallelized. In sbt 2.x, we might need to be mindful of network latencies (something we should experiment with). Thankfully we have the JAR file, which the compiler is already used to dealing with, so we can let compile generate a JAR instead of trying to cache all *.class files.
Here’s a rough snippet of compileIncremental:
compileIncremental := (Def.cachedTask {
  val s = streams.value
  val ci = (compile / compileInputs).value
  val c = fileConverter.value
  // do the normal incremental compilation here:
  val analysisResult: CompileResult =
    BspCompileTask
      .compute(bspTargetIdentifier.value, thisProjectRef.value, configuration.value) {
        bspTask => compileIncrementalTaskImpl(bspTask, s, ci, ping, reporter)
      }
  val analysisOut = c.toVirtualFile(setup.cachePath())
  Def.declareOutput(analysisOut)
  // inline packageBin to create a JAR file
  val mappings = ....
  val pkgConfig = Pkg.Configuration(...)
  val out = Pkg(...)
  s.log.info(s"wrote $out")
  Def.declareOutput(out)
  analysisResult.hasModified() -> (out: HashedVirtualFileRef)
})
  .tag(Tags.Compile, Tags.CPU)
  .value,
Here’s how we can use this:
$ sbt
[info] welcome to sbt 2.0.0-alpha8-SNAPSHOT (Azul Systems, Inc. Java 1.8.0_352)
[info] loading project definition from hello1/project
[info] compiling 1 Scala source to hello1/target/out/jvm/scala-3.3.1/hello1-build/classes ...
[info] wrote ${OUT}/jvm/scala-3.3.1/hello1-build/hello1-build-0.1.0-SNAPSHOT-noresources.jar
....
sbt:Hello> compile
[info] compiling 1 Scala source to hello1/target/out/jvm/scala-3.3.1/hello/classes ...
[info] wrote ${OUT}/jvm/scala-3.3.1/hello/hello_3-0.1.0-SNAPSHOT-noresources.jar
[success] Total time: 3 s
sbt:Hello> clean
[success] Total time: 0 s
sbt:Hello> compile
[success] Total time: 1 s
sbt:Hello> run
[info] running example.Hello
hello
[success] Total time: 1 s
sbt:Hello> exit
[info] shutting down sbt server
This shows that even after clean, which currently cleans the target directory, compile is cached. It’s actually not a no-op, because some of the dependent tasks are not yet cached, but it finished in 1s. We can also exit the sbt session and remove target/ to be sure:
$ rm -rf project/target
$ rm -rf target
$ sbt
[info] welcome to sbt 2.0.0-alpha8-SNAPSHOT (Azul Systems, Inc. Java 1.8.0_352)
....
sbt:Hello> run
[info] running example.Hello
hello
[success] Total time: 2 s
sbt:Hello> exit
[info] shutting down sbt server
$ ls -l target/out/jvm/scala-3.3.1/hello/
total 0
drwxr-xr-x 4 xxx staff 128 Dec 27 03:44 classes/
lrwxr-xr-x 1 xxx staff 113 Dec 27 03:44 hello_3-0.1.0-SNAPSHOT-noresources.jar@ -> /Users/xxx/Library/Caches/sbt/v2/cas/sha256-02775d17841ec170a97b2abec01f56fb3e3949fefc8d69121e811f80c041cfb1
lrwxr-xr-x 1 xxx staff 113 Dec 27 03:44 hello_3-0.1.0-SNAPSHOT.jar@ -> /Users/xxx/Library/Caches/sbt/v2/cas/sha256-f824ffec2c48cbc5e4cdcaec71670983064312055d3e9cfcc1220d7f4f193027
drwxr-xr-x 5 xxx staff 160 Dec 27 03:44 streams/
drwxr-xr-x 3 xxx staff 96 Dec 27 03:44 sync/
drwxr-xr-x 3 xxx staff 96 Dec 27 03:44 update/
drwxr-xr-x 3 xxx staff 96 Dec 27 03:44 zinc/
Again, run worked without invoking the Scala compiler. The reason why we have two JARs is that technically the compile task does not include src/main/resources/ contents. In sbt 1.x, that’s the job of the copyResources task, which is called by products.
Again, there’s a tradeoff in task granularity. By separating compilation and resources, we can avoid uploading resource files into the cache when we make source changes. On the other hand, the separation requires double uploading when you want the product output, which for us is packageBin.
new Classpath type
As mentioned above, in sbt 1.x, classpaths were expressed using Seq[Attributed[File]]. java.io.File isn’t suitable as a cache input, since it ends up capturing the absolute path and is woefully unaware of content changes. In sbt 2.x, the new Classpath is defined as follows:
type Classpath = Seq[Attributed[HashedVirtualFileRef]]
Note that HashedVirtualFileRef can always be turned back into a Path given an instance of FileConverter, which is available via fileConverter.value. There’s a Scala 3 extension method files that can be used to turn a classpath into a Seq[Path]:
given FileConverter = fileConverter.value
val cp = (Compile / dependencyClasspath).value.files
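A single reference can likewise be materialized back into a concrete path. A sketch, reusing the packageBin output from the earlier case study:

// resolve a single HashedVirtualFileRef back to a concrete java.nio.file.Path
val converter = fileConverter.value
val jarRef: HashedVirtualFileRef = (Compile / packageBin).value
val jarPath: java.nio.file.Path = converter.toPath(jarRef)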
summary
Based on RFC-1: sbt cache ideas, #7464 implements an automatic cached task called Def.cachedTask:
someKey := Def.cachedTask {
  val output = StringVirtualFile1("a.txt", "foo")
  Def.declareOutput(output)
  name.value + version.value + "!"
}
This uses a Scala 3 macro to automatically track the dependent tasks as cache keys, and to serialize and deserialize the outputs. The inputs must implement sjsonnew.HashWriter, a typeclass for building a Merkle tree, and the result type must satisfy sjsonnew.JsonFormat.
To track files, sbt 2.x uses two types: VirtualFile and HashedVirtualFileRef. VirtualFile is used by tasks for actual reading and writing, while HashedVirtualFileRef is used as a cache-friendly reference to files, including in classpath-related tasks.
Def.declareOutput(...) is used to explicitly declare the file creation that is relevant to the task. For example, the compile task may create *.class files, but they will not be cached. Instead, a JAR file will be registered using Def.declareOutput(...).
To put the mechanism to the test, #7464 implements automatic caching for both the packageBin and compile tasks.