sudori part 4
This is a blog post on sbt 2.x development, continuing from sudori part 3, sbt 2.x remote cache, and sbt 2.x remote cache with Bazel compatibility. I work on sbt 2.x in my own time, in collaboration with the Scala Center. These posts are extended PR descriptions to share the features that will hopefully come to a future version of sbt.
August 2024 status quo
Since April 2024, we have had Bazel-compatible remote cache capability. The implementation currently supports file output as a cached side effect. In other words, even if we start from a fresh machine, if the remote cache is hydrated, we can download JAR files from the cache instead of running the compiler.
In practice, however, we need to support an arbitrary number of files in a directory to support incremental compilation. There are potentially other avenues too, but I think supporting directories is the safe next step.
file directory problem
Caching of a file directory hits on a number of caching issues outlined in the sbt 2.x remote cache post:
- A file directory can just be a relative path, a unique proof of the directory, or a materialized actual directory in the file system
- An actual file directory may contain an arbitrary number of files
- We don’t want to make too many network calls to cache a directory
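To make the "unique proof" bullet concrete, here is a minimal sketch, using only the JDK and not sbt's actual code, of how a fingerprint of a directory could be computed: hash each file's relative path and content, in sorted order, into a single SHA-256 digest.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.jdk.CollectionConverters.*

// Hypothetical sketch, not sbt's implementation: fold every file's
// relative path and bytes, in sorted order, into one SHA-256 digest.
def directoryFingerprint(dir: Path): String =
  val md = MessageDigest.getInstance("SHA-256")
  val stream = Files.walk(dir)
  try
    val files = stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .toVector
      .sortBy(p => dir.relativize(p).toString)
    files.foreach { p =>
      md.update(dir.relativize(p).toString.getBytes("UTF-8"))
      md.update(0.toByte) // separator keeps path/content boundaries unambiguous
      md.update(Files.readAllBytes(p))
    }
  finally stream.close()
  md.digest().map("%02x".format(_)).mkString
```

Two directories with identical contents then share one fingerprint regardless of their absolute location, which is exactly the property a cache key needs.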
declaring the outputs
In sbt/sbt#7621, I’m introducing a new output called `Def.declareOutputDirectory`:

```scala
Def.declareOutputDirectory(dir)
```
This would be called from within a task to declare a directory output. This is different from the return value of a task. For example, the `compile` task returns an `Analysis`, but it generates `*.class` files on the side, which downstream tasks expect to be there in some agreed-upon directory. Declaration makes this process a bit more explicit. Here’s an example usage:
```scala
import sjsonnew.BasicJsonProtocol.given

lazy val someKey = taskKey[Int]("")
someKey := (Def.cachedTask {
  val conv = fileConverter.value
  val dir = target.value / "foo"
  IO.write(dir / "bar.txt", "1")
  val vf = conv.toVirtualFile(dir.toPath())
  Def.declareOutputDirectory(vf)
  1
}).value
```
setting up NativeLink cache
NativeLink is a relatively new Bazel remote execution backend implemented in Rust with an emphasis on performance. It’s open source, and there is also NativeLink Cloud, available for free trial, which my friend Adam Singer has been telling me about.
- To enable remote caching, add `addRemoteCachePlugin` to `project/plugins.sbt`.
- From https://app.nativelink.com/, go to Quickstart and take note of the URLs and `--remote_header`.
- Create a file called `$HOME/.sbt/nativelink_credential.txt` and put in the API key:

```
x-nativelink-api-key=*******
```
The sbt 2.x configuration would look like this:

```scala
Global / remoteCache := Some(uri("grpcs://something.build-faster.nativelink.net"))
Global / remoteCacheHeaders += IO.read(BuildPaths.defaultGlobalBase / "nativelink_credential.txt").trim
```
See sbt 2.x remote cache with Bazel compatibility for other remote cache solutions.
running the task
Given the setup, we can now try running the `someKey` task:

```
> someKey
[success] elapsed time: 1 s, cache 0%, 1 onsite task
> exit
```
Next, we want to wipe out the local cache, and see if we can recover the directory:
```
$ rmtrash $HOME/Library/Caches/sbt/v2/ && rmtrash target
$ sbt
> someKey
[success] elapsed time: 1 s, cache 100%, 1 remote cache hit
> exit
$ tree target/out/jvm/scala-3.4.2/dirtest/
target/out/jvm/scala-3.4.2/dirtest/
├── aaa
│   └── bbb.txt -> $HOME/Library/Caches/sbt/v2/cas/sha256-6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b-1
└── aaa.sbtdir.zip -> $HOME/Library/Caches/sbt/v2/cas/sha256-b8e8af292865273d51e6ab681d52cc2410cd6e4d33aa563f6e691b8cd3c6e665-904
$ cat target/out/jvm/scala-3.4.2/dirtest/aaa/bbb.txt
1
```
This shows that the `aaa` directory was recovered from the remote cache.
Implementation details
In short, I’ve emulated directory caching by actually creating a `.zip` file called `aaa.sbtdir.zip` and caching the zip file instead. This allows us to deal with a concrete file that can be hashed for caching, and to reuse the same mechanism as `Def.declareOutput`.
macro expansion of outputs
Before going into directories, let’s recap how a cached task works with an output. Suppose we have something like the following:
```scala
Def.cachedTask {
  val vf = StringVirtualFile1("a.txt", "foo")
  Def.declareOutput(vf)
  name.value + version.value + "!"
}
```
In the macro, this expands to the following function calls:

```scala
i.mapN((wrap(name), wrap(version)), (q1: String, q2: String) => {
  var o1: VirtualFile = _
  ActionCache.cache[(String, String), String](
    key = (q1, q2),
    otherInputs = 0): input =>
      val vf = StringVirtualFile1("a.txt", "foo")
      o1 = vf
      InternalActionResult(q1 + q2 + "!", List(o1))
})
```
Note how `Def.declareOutput(vf)` expands into three different places:

- declaration of a synthetic variable `var o1: VirtualFile`
- `Def.declareOutput(vf)` becomes a simple assignment `o1 = vf`
- later, `o1` is passed into `InternalActionResult`, which is passed into `ActionCache.cache`
What I did for `Def.declareOutputDirectory` is mostly the same:

- declaration of a synthetic variable `var o1`
- `Def.declareOutputDirectory(vf)` becomes the assignment `o1 = ActionCache.packageDirectory(vf)`
- later, `o1` is passed into `InternalActionResult`, which is passed into `ActionCache.cache`
Inside of `ActionCache.packageDirectory`, we can create a zip file.
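Here is a minimal sketch of what such a packaging helper could look like using only the JDK (the real `ActionCache.packageDirectory` lives in sbt/sbt#7621 and differs in its details): walk the directory and write each file into the zip in sorted order with a fixed timestamp, so that byte-identical inputs yield a byte-identical, hashable zip.

```scala
import java.nio.file.{Files, Path}
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.jdk.CollectionConverters.*

// Hypothetical sketch, not sbt's implementation: deterministic zip of a
// directory, suitable for content-addressed caching.
def packageDirectory(dir: Path, zipFile: Path): Path =
  val zos = new ZipOutputStream(Files.newOutputStream(zipFile))
  val stream = Files.walk(dir)
  try
    val files = stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .toVector
      .sortBy(p => dir.relativize(p).toString)
    files.foreach { p =>
      val entry = new ZipEntry(dir.relativize(p).toString)
      entry.setTime(0L) // fixed timestamp keeps the zip reproducible
      zos.putNextEntry(entry)
      zos.write(Files.readAllBytes(p))
      zos.closeEntry()
    }
  finally
    stream.close()
    zos.close()
  zipFile
```

Sorting the entries and pinning the timestamps matters: without them, two runs over the same directory could produce different bytes and therefore different cache keys.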
acting on the effect
So far, all we have is zip file caching. Next, I’ve included a manifest file in the zip:
```
$ unzip -p target/out/jvm/scala-3.4.2/dirtest/aaa.sbtdir.zip sbtdir_manifest.json | jq
{
  "version": "0.1.0",
  "outputFiles": [
    "${OUT}/jvm/scala-3.4.2/dirtest/aaa/bbb.txt>sha256-6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b/1"
  ]
}
```
This lists all the items in the directory and their content hashes. When sbt comes across an output whose name ends with `.sbtdir.zip`, it will open this manifest file, compare the SHA-256 hashes against the existing files, and sync them using the disk cache. This means that if we come across the same item more than once, which we will during compilation etc., we cache it once centrally rather than overwriting it each time.
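The comparison step can be sketched with plain JDK calls. `ManifestEntry`, `parseEntry`, and `upToDate` below are hypothetical names, not sbt's actual classes; the entry format `<path>>sha256-<hex>/<size>` mirrors the manifest shown above.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Hypothetical model of one manifest line.
final case class ManifestEntry(path: String, sha256: String, size: Long)

// Split "<path>>sha256-<hex>/<size>" at the last '>' and the last '/'.
def parseEntry(line: String): ManifestEntry =
  val gt = line.lastIndexOf('>')
  val rest = line.substring(gt + 1)
  val slash = rest.lastIndexOf('/')
  ManifestEntry(
    path = line.substring(0, gt),
    sha256 = rest.substring(0, slash).stripPrefix("sha256-"),
    size = rest.substring(slash + 1).toLong)

// A file is in sync when its size and SHA-256 hash match the manifest,
// in which case the sync step can skip rewriting it.
def upToDate(entry: ManifestEntry, file: Path): Boolean =
  Files.isRegularFile(file) && Files.size(file) == entry.size && {
    val md = MessageDigest.getInstance("SHA-256")
    md.digest(Files.readAllBytes(file)).map("%02x".format(_)).mkString == entry.sha256
  }
```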
case study: compile task
The previous implementation of the `compile` task in sbt 2.x remote cache was based on the idea of creating a JAR file minus the resource files. Going back to using a directory makes the implementation simpler:
```scala
val analysisResult = Retry(compileIncrementalTaskImpl(bspTask, s, ci, ping))
....
val dir = ci.options.classesDirectory
val vfDir = c.toVirtualFile(dir)
val packedDir = Def.declareOutputDirectory(vfDir)
(analysisResult.hasModified(), vfDir: VirtualFileRef, packedDir: HashedVirtualFileRef)
```
Since this would upload a zip file, effectively it would be similar to creating a JAR file, except we now have the individual files unzipped in a machine-independent fashion.
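The machine-independent part can be sketched as placeholder substitution: the `${OUT}` prefix below mirrors the manifest shown earlier, while the helper names are hypothetical, not sbt's.

```scala
// Hypothetical sketch: strip the local output root when packing, and
// substitute it back when extracting on another machine, so the same
// zip resolves to valid local paths everywhere.
def toPortable(absolutePath: String, outRoot: String): String =
  if absolutePath.startsWith(outRoot) then "${OUT}" + absolutePath.drop(outRoot.length)
  else absolutePath

def fromPortable(portablePath: String, outRoot: String): String =
  portablePath.replace("${OUT}", outRoot)
```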
beware the hermeticity breakage
The `compile` task example reminds me of a potential pitfall with `Def.declareOutputDirectory` when used with `Def.cachedTask(...)`: if we’re not careful, we could end up breaking hermeticity. Specifically, in this case we could end up reusing an old, incorrect cache.
This is because the `VirtualFileRef` that represents the directory name does not contain the hash of the content, so it does not carry enough information to invalidate downstream tasks. For example, if a cached task `foo` produces a directory and passes it to another cached task `bar`, passing the `VirtualFileRef` alone will not be a sufficient cache key.
```scala
// bad
foo := (Def.cachedTask {
  val dir = target.value / "foo"
  ....
  val vfDir = c.toVirtualFile(dir)
  Def.declareOutputDirectory(vfDir)
  vfDir // returning VirtualFileRef may not be safe here
}).value

bar := (Def.cachedTask {
  val vfDir = foo.value
  ....
}).value
```
To work around this, `Def.declareOutputDirectory` returns a `VirtualFile` for the synthetic zip file, which does contain the hash, and should be able to invalidate the downstream tasks.
```scala
// good
foo := (Def.cachedTask {
  val dir = target.value / "foo"
  ....
  val vfDir = c.toVirtualFile(dir)
  val packedDir = Def.declareOutputDirectory(vfDir)
  (vfDir, packedDir)
}).value

bar := (Def.cachedTask {
  val (vfDir, _) = foo.value
  ....
}).value
```
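Why returning the packed file fixes invalidation can be shown with a toy key derivation. This is not sbt's actual cache key logic, and the `sha256-aaaa`/`sha256-bbbb` hashes are made up: the point is only that a key built from the directory's path alone is identical before and after the contents change, while a key that also includes the content hash is not.

```scala
import java.security.MessageDigest

// Hypothetical key derivation: hash all the input strings together.
def cacheKey(parts: String*): String =
  val md = MessageDigest.getInstance("SHA-256")
  parts.foreach(p => md.update(p.getBytes("UTF-8")))
  md.digest().map("%02x".format(_)).mkString

@main def demo(): Unit =
  // path-only keys collide across a content change: bar reuses a stale cache
  assert(cacheKey("target/foo") == cacheKey("target/foo"))
  // path + content hash keys differ: bar re-runs as it should
  assert(cacheKey("target/foo", "sha256-aaaa") != cacheKey("target/foo", "sha256-bbbb"))
```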
summary
sbt/sbt#7621 introduces a new output called `Def.declareOutputDirectory(...)` that can produce an arbitrary number of files from a cached task. The intent of this output is to make it easy to port sbt 1.x tasks, such as incremental compilation.
Internally this produces a zip file, so any Bazel remote cache implementation should support this feature. However, special care must be taken when passing a directory to other cached tasks to avoid breaking hermeticity.
Donate to Scala Center
The Scala Center is a non-profit center at EPFL that supports education and open source. Please consider donating to them, and publicly tweet/toot at @eed3si9n and @scala_lang when you do (I don’t work for them, but we maintain sbt together).