RFC-1: sbt cache ideas
In sbt 2.0 ideas I wrote:
idea 6: more disk cache and remote cache
Extending the idea of cached compilation in sbt 1.4.0, we should generalize the mechanism so any task can participate in the remote caching.
Here are some more concrete ideas for caching.
problem space
To summarize the general problem space: setting up disk caching for tasks is currently manual work, so it’s under-utilized, and remote caching is limited to cached compilation.
Generally we would like:
- Easier caching for tasks
- Participation in the remote cache
- Open design for remote cache implementation
caching in abstract
In the abstract, we can think of a cache as:
(A1, A2, A3, ...) => (Seq[Path] && B1)
Why is Seq[Path] so special? We need to treat files completely differently. Let’s say you created a text file for each cached output B1, and it says foo/Hello.jar. That’s good, but it’s not good enough for a build tool, because we need the actual file to exist on disk to perform other tasks.
So really, what we need to encode is the notion of “file output”. If you think about sbt tasks like update or compile, the return types of these tasks are reports about the dependency or source graph, but it’s expected that the file creation has also taken place as a side effect.
one cache pipeline, multiple backends
What’s neat about Bazel is that the caching mechanism is abstracted away from the plugin authors.
Let’s say the caching code looks something like this:
(A1, A2, A3, ...) =>
  val inputHash = hash((a1, a2, a3, ...) + other_inputs)
  getCachedAction(inputHash) match {
    case Some(ac) =>
      retrieveBlobs(ac.outputs)
      (ac.outputs, ac.value)
    case None =>
      val ac = doActual()
      sendBlobs(ac.outputs)
      putCachedAction(inputHash, ac)
      (ac.outputs, ac.value)
  }
We can create multiple cache backends that implement getCachedAction(inputHash), retrieveBlobs(outputs), etc.
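To make the split between pipeline and backend concrete, here is a minimal sketch of the pseudocode above. All names (ActionResult, CacheBackend, InMemoryBackend, cached) are hypothetical and not sbt APIs. I'm also assuming that the input hash has already been computed and that outputs are identified by digest strings.

```scala
import scala.collection.mutable

object CachePipelineSketch {
  // Hypothetical: a task's value plus the digests of the files it produced.
  final case class ActionResult[B](value: B, outputs: Seq[String])

  // The four operations from the pseudocode; each backend supplies its own.
  trait CacheBackend[B] {
    def getCachedAction(inputHash: String): Option[ActionResult[B]]
    def putCachedAction(inputHash: String, result: ActionResult[B]): Unit
    def sendBlobs(outputs: Seq[String]): Unit
    def retrieveBlobs(outputs: Seq[String]): Unit
  }

  // Toy backend that keeps everything in memory and records blob traffic.
  final class InMemoryBackend[B] extends CacheBackend[B] {
    private val actions = mutable.Map.empty[String, ActionResult[B]]
    val sent = mutable.Buffer.empty[String]
    val retrieved = mutable.Buffer.empty[String]
    def getCachedAction(inputHash: String): Option[ActionResult[B]] =
      actions.get(inputHash)
    def putCachedAction(inputHash: String, result: ActionResult[B]): Unit =
      actions.update(inputHash, result)
    def sendBlobs(outputs: Seq[String]): Unit = sent ++= outputs
    def retrieveBlobs(outputs: Seq[String]): Unit = retrieved ++= outputs
  }

  // The pipeline itself: identical for every backend.
  def cached[B](backend: CacheBackend[B], inputHash: String)(
      doActual: => ActionResult[B]
  ): ActionResult[B] =
    backend.getCachedAction(inputHash) match {
      case Some(ac) =>
        backend.retrieveBlobs(ac.outputs) // cache hit: restore files only
        ac
      case None =>
        val ac = doActual // cache miss: run the real task
        backend.sendBlobs(ac.outputs)
        backend.putCachedAction(inputHash, ac)
        ac
    }
}
```

Calling cached twice with the same input hash runs the task body only once; the second call just asks the backend to retrieve the blobs.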
disk cache
The basic caching setup would be to use the local cache. This would replace the per-task caching that’s done in sbt 1.x.
- getCachedAction can check whether the corresponding result file exists, and the content could be a text file.
- retrieveBlobs can’t just rely on the file name, since the content may change over time. Bazel uses content-addressable storage (CAS) to keep track of the hash of the files.
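As a rough illustration of content-addressable storage on disk, the sketch below stores each blob under the hash of its content, so the original file name plays no part in the lookup. The object name, the cas/ directory layout, and the choice of SHA-256 are my assumptions for the example, not a description of what Bazel or sbt actually does.

```scala
import java.nio.file.{ Files, Path, StandardCopyOption }
import java.security.MessageDigest

object DiskCasSketch {
  // Hex-encoded SHA-256 of the file content (hash choice is an assumption).
  def digest(file: Path): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(Files.readAllBytes(file))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Copies the file to <cacheDir>/cas/<digest> and returns the digest.
  def sendBlob(cacheDir: Path, file: Path): String = {
    val d = digest(file)
    val casDir = Files.createDirectories(cacheDir.resolve("cas"))
    Files.copy(file, casDir.resolve(d), StandardCopyOption.REPLACE_EXISTING)
    d
  }

  // Restores the blob to `target`; returns false on a cache miss.
  def retrieveBlob(cacheDir: Path, d: String, target: Path): Boolean = {
    val blob = cacheDir.resolve("cas").resolve(d)
    if (Files.exists(blob)) {
      Option(target.getParent).foreach(p => Files.createDirectories(p))
      Files.copy(blob, target, StandardCopyOption.REPLACE_EXISTING)
      true
    } else false
  }
}
```

With this layout, the cached action record only needs to list digests and target paths, and two outputs with identical content share one blob.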
remote caching: HTTP
To start with, a plain HTTP server could serve as the remote cache. The upside is that it’s easy to set up; the downside is that reading and writing one file at a time is slow.
In any case, we can use some URL scheme like:
http://example.com/cache/ac/30c6172189093a9d0a4cf1fbfa79632b
http://example.com/cache/cas/3b8e48b651b51e2e03b6575347c64e6f
- getCachedAction would be GET on ac/...
- retrieveBlobs would also be a series of GET per file
- sendBlobs would be a series of PUT per file
- putCachedAction would be PUT on ac/...
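The mapping from the four operations to HTTP requests could be sketched with the JDK's java.net.http API as follows. The object name and method signatures are hypothetical, the base URL is the example one from above, and the requests are only built here, not sent.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object HttpCacheSketch {
  // Example base URL from the text; a real setup would make this configurable.
  val base = "http://example.com/cache"

  def acUri(inputHash: String): URI = URI.create(s"$base/ac/$inputHash")
  def casUri(digest: String): URI = URI.create(s"$base/cas/$digest")

  // getCachedAction: a single GET on ac/<inputHash>
  def getCachedAction(inputHash: String): HttpRequest =
    HttpRequest.newBuilder(acUri(inputHash)).GET().build()

  // retrieveBlobs: one GET per file
  def retrieveBlobs(digests: Seq[String]): Seq[HttpRequest] =
    digests.map(d => HttpRequest.newBuilder(casUri(d)).GET().build())

  // sendBlobs: one PUT per file, keyed by content digest
  def sendBlobs(blobs: Seq[(String, Array[Byte])]): Seq[HttpRequest] =
    blobs.map { case (d, bytes) =>
      HttpRequest.newBuilder(casUri(d)).PUT(BodyPublishers.ofByteArray(bytes)).build()
    }

  // putCachedAction: a single PUT on ac/<inputHash> with the serialized result
  def putCachedAction(inputHash: String, serialized: String): HttpRequest =
    HttpRequest.newBuilder(acUri(inputHash)).PUT(BodyPublishers.ofString(serialized)).build()
}
```

Each request would then be handed to a java.net.http.HttpClient. Because every blob is its own round trip, this is also where the one-file-at-a-time slowness comes from.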
remote caching: others
Using these as starting points, people can implement their own remote cache backends that are better suited to their environment.
participating in the cache system
Depending on how well it works, it would be nice if a plain task could automatically participate in the caching system.
foo := {
  val s = streams.value
  s.log.info("hi")
  SomethingReport()
}
If it’s implemented this way, it would also mean that we won’t execute any side effects (such as the logging above) when a cached result is available, locally or remotely, unless we also design a way to track them explicitly.
We’d also need some opt-out:
foo := Def.uncachedTask {
  SomethingReport()
}
declaring the outputs
As mentioned above, sbt tasks like update and compile do not directly have Seq[Path] as the return type. This means we would need a new mechanism to declare the outputs:
foo := {
  doSomething(target.value / "a.jar")
  declareOutput(target.value / "a.jar")
  SomethingReport()
}
This should let the macro know which files need to be tracked as outputs for caching.
file inputs
Similar to the output story, we would need to include the content hashes of files into the input hash, not just the file name.
We might need to set up a cascade of typeclasses so that existing typeclasses can be reused: for example, use Hashable1[A] if it is available, otherwise fall back to Hashable2[A], perhaps via summon?
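One way such a cascade might look in Scala 3 is scala.compiletime.summonFrom, which tries each case in order at compile time. This is only a sketch: Hashable1 and Hashable2 are the placeholder typeclasses named above, with made-up instances and a made-up hash(a) method, and inputHash is a hypothetical helper.

```scala
import scala.compiletime.summonFrom

object HashCascadeSketch {
  // Placeholder typeclasses; the instances below exist only for the example.
  trait Hashable1[A] { def hash(a: A): String }
  object Hashable1 {
    given Hashable1[Int] = new Hashable1[Int] {
      def hash(a: Int): String = s"h1:$a"
    }
  }

  trait Hashable2[A] { def hash(a: A): String }
  object Hashable2 {
    given Hashable2[Int] = new Hashable2[Int] {
      def hash(a: Int): String = s"h2:$a"
    }
    given Hashable2[String] = new Hashable2[String] {
      def hash(a: String): String = s"h2:$a"
    }
  }

  // Prefer Hashable1[A]; fall back to Hashable2[A]; otherwise fail to compile.
  inline def inputHash[A](a: A): String =
    summonFrom {
      case h: Hashable1[A] => h.hash(a)
      case h: Hashable2[A] => h.hash(a)
    }
}
```

Here inputHash(1) resolves to the Hashable1 instance even though a Hashable2[Int] also exists, while inputHash("a") falls through to Hashable2. A type with neither instance is rejected at compile time rather than silently left out of the hash.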
Also, in general, similar to what I had to do in Zinc for cached compilation, we’d likely need to remove absolute paths and use a mapper so that any cacheable input paths are machine-independent.
- sbt 1.x:
  new File("/Users/yourname/workspace/foo/bar/src/main/scala/foo/bar/Hello.scala")
- sbt 2.x:
  VirtualFile("${BASE}/foo/bar/src/main/scala/foo/bar/Hello.scala")
Tasks that require an actual File can convert VirtualFileRefs back using a mapper, which would know about all the absolute paths needed for the build.
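To illustrate the round trip, here is a minimal sketch of such a mapper that handles a single root. RootMapper, toVirtual, and toPath are hypothetical names, and plain strings stand in for virtual file references; the real Zinc types are not used here.

```scala
import java.nio.file.Path

object PathMapperSketch {
  // Hypothetical mapper: swaps a machine-specific root for the ${BASE} placeholder.
  final class RootMapper(base: Path) {
    private val root = base.toAbsolutePath.normalize

    // Absolute path -> machine-independent id such as "${BASE}/foo/Hello.scala".
    def toVirtual(file: Path): String = {
      val abs = file.toAbsolutePath.normalize
      require(abs.startsWith(root), s"$abs is outside $root")
      "${BASE}/" + root.relativize(abs).toString.replace('\\', '/')
    }

    // Machine-independent id -> absolute path on this machine.
    def toPath(id: String): Path = {
      require(id.startsWith("${BASE}/"), s"unexpected id: $id")
      root.resolve(id.stripPrefix("${BASE}/"))
    }
  }
}
```

Only the id string would go into the input hash, so two machines with different checkout locations compute the same hash for the same sources. A real mapper would need more roots than this, for example for the dependency cache and the JDK.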
other inputs
Let’s take a look at the example task again:
foo := {
  doSomething(target.value / "a.jar")
  declareOutput(target.value / "a.jar")
  SomethingReport()
}
In addition to target.value, note that the task also calls the doSomething(...) function, which means that we would need a way to keep track of the declarations and classpath that are available to build.sbt as part of the cache.
The shape of the source code also needs to be part of the input hash. In Scala 3, this would likely use Expr#show (or a tree hash, per Guillaume Martres).
feedback
I created a discussion thread https://github.com/sbt/sbt/discussions/7180 on GitHub. Let me know what you think there.