RFC-1: sbt cache ideas

2023-03-19 / sbt

idea 6: more disk cache and remote cache

Extending the idea of cached compilation in sbt 1.4.0, we should generalize the mechanism so that any task can participate in remote caching.

Here are some more concrete ideas for caching.

problem space

To summarize the general problem space: setting up disk caching for tasks is currently manual work, so it’s under-utilized. Remote caching is limited to cached compilation.

Generally we would like:

Easier caching for tasks
Participation in remote caching
Open design for remote cache implementation

caching in abstract

In the abstract, we can think of a cache as:

(A1, A2, A3, ...) => (Seq[Path] && B1)

Why is Seq[Path] so special? Files need completely different treatment. Suppose each cached output B1 were stored as a text file that says foo/Hello.jar. That records the result, but it’s not good enough for a build tool, because other tasks need the actual file to exist on disk.

So what we really need to encode is the notion of a file output. For sbt tasks like update or compile, the return types are reports about the dependency or source graph, but the tasks are also expected to have created files as a side effect.
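
One way to picture this pairing is a small wrapper type. The following is only a sketch: ActionResult, CompileReport, and fakeCompile are names made up for illustration and are not part of sbt.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical type: pairs the in-memory result B with the files
// the task is expected to have written as a side effect.
final case class ActionResult[B](value: B, outputs: Seq[Path])

// Hypothetical report type standing in for what a task like compile returns.
final case class CompileReport(classCount: Int)

// A stand-in task: it reports both the value and the file it claims to produce.
def fakeCompile(outDir: Path): ActionResult[CompileReport] = {
  val jar = outDir.resolve("Hello.jar")
  // (a real task would actually write the jar here)
  ActionResult(CompileReport(classCount = 1), Seq(jar))
}
```

With a shape like this, a cache can store value as data and treat outputs as files that must be restored on disk.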

one cache pipeline, multiple backends

What’s neat about Bazel is that the caching mechanism is abstracted away from the plugin authors.

Let’s say the caching code looks something like this:

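(a1: A1, a2: A2, a3: A3, ...) =>
  // the key covers the task arguments plus anything else that affects the result
  val inputHash = hash((a1, a2, a3, ...) + other_inputs)
  getCachedAction(inputHash) match {
    case Some(ac) =>
      // cache hit: materialize the output files and skip the actual work
      retrieveBlobs(ac.outputs)
      (ac.outputs, ac.value)
    case None     =>
      // cache miss: run the task, then publish its files and its result
      val ac = doActual()
      sendBlobs(ac.outputs)
      putCachedAction(inputHash, ac)
      (ac.outputs, ac.value)
  }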

We can create multiple cache backends, each implementing getCachedAction(inputHash), retrieveBlobs(outputs), etc.
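
To make the backend split concrete, here is a minimal runnable sketch, assuming the four operations are grouped in a trait. CacheBackend, InMemoryBackend, cached, and the String-based ActionResult are hypothetical simplifications, not a proposed sbt API.

```scala
import java.security.MessageDigest
import scala.collection.mutable

// Hypothetical simplification: the value is a String and outputs are blob digests.
final case class ActionResult(value: String, outputs: Seq[String])

// The four operations from the pseudocode, as an interface a backend implements.
trait CacheBackend {
  def getCachedAction(inputHash: String): Option[ActionResult]
  def putCachedAction(inputHash: String, ac: ActionResult): Unit
  def retrieveBlobs(outputs: Seq[String]): Unit
  def sendBlobs(outputs: Seq[String]): Unit
}

// Simplest possible backend: actions live in a map, blobs are only counted.
final class InMemoryBackend extends CacheBackend {
  private val actions = mutable.Map.empty[String, ActionResult]
  var blobsSent = 0
  var blobsRetrieved = 0
  def getCachedAction(inputHash: String): Option[ActionResult] = actions.get(inputHash)
  def putCachedAction(inputHash: String, ac: ActionResult): Unit = actions.update(inputHash, ac)
  def retrieveBlobs(outputs: Seq[String]): Unit = blobsRetrieved += outputs.size
  def sendBlobs(outputs: Seq[String]): Unit = blobsSent += outputs.size
}

def sha256Hex(s: String): String =
  MessageDigest.getInstance("SHA-256").digest(s.getBytes("UTF-8")).map(b => "%02x".format(b)).mkString

// The pipeline itself only talks to the CacheBackend interface.
def cached(backend: CacheBackend)(inputs: Seq[String])(doActual: => ActionResult): ActionResult = {
  val inputHash = sha256Hex(inputs.mkString("\u0000"))
  backend.getCachedAction(inputHash) match {
    case Some(ac) =>
      backend.retrieveBlobs(ac.outputs)
      ac
    case None =>
      val ac = doActual
      backend.sendBlobs(ac.outputs)
      backend.putCachedAction(inputHash, ac)
      ac
  }
}
```

Calling cached twice with the same inputs runs doActual only once; a disk or HTTP backend would slot in by implementing the same trait.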

disk cache

The basic caching setup would be to use the local cache. This would replace the per-task caching that’s done in sbt 1.x.

getCachedAction can check whether the corresponding result file exists, and the result itself could be stored as plain text.
retrieveBlobs can’t just rely on the file name, since the content may change over time. Bazel uses content-addressable storage (CAS), which keys each file by the hash of its content.
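
As an illustration of the content-addressable idea, here is a minimal sketch of a local store. The DiskCas class, the cas/ directory layout, and the choice of SHA-256 are assumptions for the example, not how sbt or Bazel lay out their caches.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.security.MessageDigest

// Hypothetical layout: <cacheRoot>/cas/<sha256 of the file content>
final class DiskCas(cacheRoot: Path) {
  private val casDir: Path = Files.createDirectories(cacheRoot.resolve("cas"))

  private def digest(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map(b => "%02x".format(b)).mkString

  // Copies the file into the store and returns its content digest.
  def put(file: Path): String = {
    val d = digest(Files.readAllBytes(file))
    val blob = casDir.resolve(d)
    if (!Files.exists(blob)) Files.copy(file, blob)
    d
  }

  // Materializes the blob with the given digest at dest; returns false on a miss.
  def get(d: String, dest: Path): Boolean = {
    val blob = casDir.resolve(d)
    if (Files.exists(blob)) {
      Option(dest.getParent).foreach(p => Files.createDirectories(p))
      Files.copy(blob, dest, StandardCopyOption.REPLACE_EXISTING)
      true
    } else false
  }
}
```

Because the key is derived from the bytes, two versions of a.jar get two different entries, and an older result can still be restored after the file has changed.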

remote caching: HTTP

As a first step, a plain HTTP server could serve as the remote cache. The good thing is that it’s easy to set up; the downside is that reading and writing one file at a time is slow.

In any case, we can use some URL scheme like:

http://example.com/cache/ac/30c6172189093a9d0a4cf1fbfa79632b
http://example.com/cache/cas/3b8e48b651b51e2e03b6575347c64e6f

getCachedAction would be a GET on ac/...
retrieveBlobs would be a series of GETs on cas/..., one per file
sendBlobs would be a series of PUTs on cas/..., one per file
putCachedAction would be a PUT on ac/...
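
The mapping above could look roughly like this with the JDK’s java.net.http API. This sketch only builds the requests; HttpCacheRequests and its method signatures are hypothetical, and actually sending the requests with an HttpClient (and treating a 404 as a cache miss) is left out.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

// Hypothetical helper: one request per action entry or blob, following the
// ac/<hash> and cas/<hash> URL scheme shown above.
final class HttpCacheRequests(baseUrl: String) {
  private def uri(kind: String, digest: String): URI =
    URI.create(s"$baseUrl/$kind/$digest")

  def getCachedAction(inputHash: String): HttpRequest =
    HttpRequest.newBuilder(uri("ac", inputHash)).GET().build()

  def putCachedAction(inputHash: String, body: Array[Byte]): HttpRequest =
    HttpRequest.newBuilder(uri("ac", inputHash)).PUT(BodyPublishers.ofByteArray(body)).build()

  // One GET per blob digest.
  def retrieveBlobs(digests: Seq[String]): Seq[HttpRequest] =
    digests.map(d => HttpRequest.newBuilder(uri("cas", d)).GET().build())

  // One PUT per (digest, content) pair.
  def sendBlobs(blobs: Seq[(String, Array[Byte])]): Seq[HttpRequest] =
    blobs.map { case (d, bytes) =>
      HttpRequest.newBuilder(uri("cas", d)).PUT(BodyPublishers.ofByteArray(bytes)).build()
    }
}
```

The one-request-per-file shape is also where the slowness mentioned above comes from; a batching protocol would be a different backend.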

remote caching: others

Using these as starting points, people can implement their own remote caches that are better suited to their environment.

participating in the cache system

It depends on how well it works, but it would be nice if a plain task could automatically participate in the caching system.

foo := {
  val s = streams.value
  s.log.info("hi")
  SomethingReport()
}

If it’s implemented this way, it would also mean that side effects are not executed when a cached result is available (locally or remotely), unless we also design a way to track them explicitly.

We’d also need some opt-out:

foo := Def.uncachedTask {
  SomethingReport()
}

declaring the outputs

As mentioned above, sbt tasks like update and compile do not directly have Seq[Path] as the return type. This means we would need a new mechanism to declare the outputs:

foo := {
  doSomething(target.value / "a.jar")
  declareOutput(target.value / "a.jar")
  SomethingReport()
}

This should let the macro know which files need to be tracked as outputs for caching.

file inputs

Similar to the output story, we would need to include the content hashes of files into the input hash, not just the file name.

We might need to set up a cascade of typeclasses so that existing ones can be reused, for example use Hashable1[A] if available, otherwise fall back to Hashable2[A] via summon?
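
One possible shape for such a cascade in Scala 3 is scala.compiletime.summonFrom, which tries each case in order at compile time. In this sketch Hashable1 and Hashable2 are placeholder traits defined locally, and inputHash and the given instances are made up for illustration.

```scala
import scala.compiletime.summonFrom

// Placeholder typeclasses standing in for two existing hashing typeclasses.
trait Hashable1[A] { def hash(a: A): String }
trait Hashable2[A] { def hash(a: A): String }

// Prefer Hashable1[A]; fall back to Hashable2[A] only if no Hashable1[A] is found.
// If neither instance exists, the call fails to compile.
inline def inputHash[A](a: A): String = summonFrom {
  case h: Hashable1[A] => h.hash(a)
  case h: Hashable2[A] => h.hash(a)
}

// Int has both instances, String only has the fallback.
given intH1: Hashable1[Int] = new Hashable1[Int] { def hash(a: Int): String = s"h1:$a" }
given intH2: Hashable2[Int] = new Hashable2[Int] { def hash(a: Int): String = s"h2:$a" }
given strH2: Hashable2[String] = new Hashable2[String] { def hash(a: String): String = s"h2:$a" }
```

Here inputHash(42) resolves through Hashable1 and inputHash("x") through Hashable2. A low-priority given hierarchy would be another way to get the same ordering.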

Also, in general, similar to what I had to do in Zinc for cached compilation, we’d likely need to remove the absolute paths and use a mapper so that any cacheable input paths are machine-independent.

sbt 1.x: new File("/Users/yourname/workspace/foo/bar/src/main/scala/foo/bar/Hello.scala")
sbt 2.x: VirtualFile("${BASE}/foo/bar/src/main/scala/foo/bar/Hello.scala")

Tasks that require an actual File can convert VirtualFileRefs back using a mapper, which would know about all the absolute paths needed for the build.
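
A rough sketch of such a mapper, assuming a single ${BASE} root. BaseMapper, toVirtual, and toPath are hypothetical names for this example and are not Zinc’s actual converter API; a real mapper would likely handle several roots.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical mapper between absolute paths and "${BASE}/..." identifiers.
final class BaseMapper(base: Path) {
  private val root: Path = base.toAbsolutePath.normalize
  private val prefix = "${BASE}/" // plain string literal, no interpolation

  // Machine-independent id for a path under the base directory, if it is under it.
  def toVirtual(p: Path): Option[String] = {
    val abs = p.toAbsolutePath.normalize
    if (abs.startsWith(root))
      Some(prefix + root.relativize(abs).toString.replace('\\', '/'))
    else None
  }

  // Converts an id back to an absolute path on this machine.
  def toPath(id: String): Option[Path] =
    if (id.startsWith(prefix)) Some(root.resolve(id.drop(prefix.length)))
    else None
}
```

Only the id would go into the input hash, so two machines with different checkout locations can still produce the same cache key.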

other inputs

Let’s take a look at the example task again:

foo := {
  doSomething(target.value / "a.jar")
  declareOutput(target.value / "a.jar")
  SomethingReport()
}

In addition to target.value, note that the task calls the doSomething(...) function, which means that we would need a way to keep track of the declarations and the classpath available to build.sbt as part of the cache.

The shape of the source code also needs to be part of the input hash. In Scala 3, this would likely use Expr#show (or a tree hash, per Guillaume Martres).

feedback

I created a discussion thread https://github.com/sbt/sbt/discussions/7180 on GitHub. Let me know what you think there.