Reflow: Graph-Based Workflows with Checkpointing

One of my work projects is publicly available! We call it Reflow. Written in Java, it's a library for composing individual units of work into a directed acyclic graph. My team has started using it to drive a bunch of our data processing, and now you can try it out too. Source and documentation can be found on GitHub, and as of today, we're also publishing build artifacts to JCenter.

Once a dependency graph has been defined, Reflow enables you to run it end to end in a single method call, with multiple tasks executing in parallel when possible. And there's more:

  • Each task can declare that it will produce some output (database tables, files on local disk, etc.), and those tasks can be skipped when the output is already present.
  • Task definitions are extremely flexible—in fact, the only real requirement is that each task be representable with a Java object. If your tasks happen to implement the Runnable interface, it's easy to get them scheduled on an Executor of your choosing, but you can opt to handle scheduling yourself and even schedule tasks outside of the JVM.
  • If you do schedule tasks externally, the state of the overall workflow can be serialized even while tasks are running. This allows you to bring down one “coordinator” process and bring up another without missing a beat.

The library was inspired by my team's data handling needs. Every morning, we ingest data from the previous day and kick off an hours-long processing pipeline. It's important that we finish by the end of the day, and things don't always go off without a hitch. The input data is occasionally incorrect or contains unanticipated combinations of values, and such issues must be identified and resolved quickly if we want to meet our deadline.

One of the easiest ways to save time is to do several things at once. Several of the stages in our pipeline admit parallelism: for example, we need to aggregate a particular data set over multiple independent attributes, meaning each aggregation can begin as soon as the common input data is ready. With Reflow, all we have to do is define those dependencies, and the work is automatically performed in parallel.

Another easy win is avoiding work altogether. If a piece of input data turns out to be bad, only the pipeline stages downstream of that data should be affected; if a particular stage fails, there's no need to rerun everything upstream when it's been fixed. By resuming intelligently in the wake of a failure, we save both time and resources and the experience is less stressful for everyone.

“Hang on,” you cry, “it's 2018! Surely someone has done this already?” I did spend some time searching for alternatives, and the closest thing I found was this library from Spotify. Java 8 also introduced a CompletableFuture class for chaining asynchronous computations.

In contrast to those options, Reflow is geared towards larger-scale tasks that communicate through shared external state. It makes no attempt to handle inter-task data transfer, and the in-memory graph data structure is fairly heavyweight. If you're looking to chain together some ordinary quick-running Java methods, you'll have to handle all the input and output, and the scheduling overhead will be relatively high. But Reflow is especially well-suited to JVM-external tasks, which tend to run longer and work with JVM-external data: We often use it to coordinate jobs on a Hadoop cluster, for example.

Above all, I've tried to make a library that I enjoy working with, and I hope you like it too.