Timely dataflow is a low-latency cyclic dataflow computational model, introduced in the paper Naiad: a timely dataflow system. This project is an extended and more modular implementation of timely dataflow in Rust.
This project is something akin to a distributed abstractions.
Timely dataflow is based on a directed graph with vertices at the nodes and logically timestamped messages flowing along edges. The graph may contain cycles. Input vertices receive sequences of messages from external producers, and output vertices send a sequence of messages out to external consumers. Jun 08, 2016 The Dataflow Monitoring Interface job summary already tells you the Elapsed Time of your job in real seconds. In this case, about 20 minutes. In this case, about 20 minutes. It also shows the Total Worker Time — how much compute time your job has spent in worker seconds.
There are currently a few options for writing timely dataflow programs. Ideally this set will expand with time, as interested people write their own layers (or build on those of others).
Timely dataflow: Timely dataflow includes several primitive operators, including standard operators like
map
,filter
, andconcat
. It also including more exotic operators for tasks like entering and exiting loops (enter
andleave
), as well as generic operators whose implementations can be supplied using closures (unary
andbinary
).Differential dataflow: A higher-level language built on timely dataflow, differential dataflow includes operators like
group
,join
, anditerate
. Its implementation is fully incrementalized, and the details are pretty cool (if mysterious).
There are also a few applications built on timely dataflow, including a streaming worst-case optimal join implementation and a PageRank implementation, both of which should provide helpful examples of writing timely dataflow programs.
If you are interested in working with or helping out with timely dataflow, great!
There are a few classes of work that are helpful for us, and may be interesting for you. There are a few broad categories, and then an ever-shifting pile of issues of various complexity.
If you would like to write programs using timely dataflow, this is very interesting for us. Ideally timely dataflow is meant to be an ergonomic approach to a non-trivial class of dataflow computations. As people use it and report back on their experiences, we learn about the classes of bugs they find, the ergonomic pain points, and other things we didn't even imagine ahead of time. Learning about timely dataflow, trying to use it, and reporting back is helpful!
If you like writing little example programs or documentation tests, there are many places throughout timely dataflow where the examples are relatively sparse, or do not actually test the demonstrated functionality. These can often be easy to pick up, flesh out, and push without a large up-front obligation. It is probably also a great way to get one of us to explain something in detail to you, if that is what you are looking for.
If you like the idea of getting your hands dirty in timely dataflow, the issue tracker has a variety of issues that touch on different levels of the stack. For example:
Timely currently does more copies of data than it must, in the interest of appeasing Rust's ownership discipline most directly. Several of these copies could be elided with some more care in the resource management (for example, using shared regions of one
Vec<u8>
in the way that the bytes crate does). Not everything is obvious here, so there is the chance for a bit of design work too.We recently landed a bunch of logging changes, but there is still a list of nice to have features that haven't made it yet. If you are interested in teasing out how timely works in part by poking around at the infrastructure that records what it does, this could be a good fit! It has the added benefit that the logs are timely streams themselves, so you can even do some log processing on timely. Whoa...
There is an open issue on integrating Rust ownership idioms into timely dataflow. Right now, timely streams are of cloneable objects, and when a stream is re-used, items will be cloned. We could make that more explicit, and require calling a
.cloned()
method to get owned objects in the same way that iterators require it. At the same time, using a reference to a stream without taking ownership should get you the chance to look at the records that go past without taking ownership (and without requiring a clone, as is currently done). This is often plenty for exchange channels which may need to serialize the data and can't take much advantage of ownership anyhow.There is a bunch of interesting work in scheduling timely dataflow operators, where when given the chance to schedule many operators, we might think for a moment and realize that several of them have to work to do and can be skipped. Better, we might maintain the list of operators with anything to do, and do nothing for those without work to do.
There are also some larger themes of work, whose solutions are not immediately obvious and each with the potential to sort out various performance issues:
Rate-controlling output
At the moment, the implementations of unary
and binary
operators allow their closures to send un-bounded amounts of output. This can cause unwelcome resource exhaustion, and poor performance generally if the runtime needs to allocate lots of new memory to buffer data sent in bulk without being given a chance to digest it. It is commonly the case that when large amounts of data are produced, they are eventually reduced given the opportunity.
With the current interfaces there is not much to be done. One possible change would be to have the input
and notificator
objects ask for a closure from an input message or timestamp, respectively, to an output iterator. This gives the system the chance to play the iterator at the speed they feel is appropriate. As many operators producet horrible but could plausibly be improved. It may also be that leaving the user with control of the granularity leaves them with more control over the latency/throughput trade-off, which could be a good thing for the system to do.
This project is something akin to a distributed abstractions.
There are currently a few options for writing timely dataflow programs. Ideally this set will expand with time, as interested people write their own layers (or build on those of others).
Timely dataflow: Timely dataflow includes several primitive operators, including standard operators like
map
,filter
, andconcat
. It also including more exotic operators for tasks like entering and exiting loops (enter
andleave
), as well as generic operators whose implementations can be supplied using closures (unary
andbinary
).Differential dataflow: A higher-level language built on timely dataflow, differential dataflow includes operators like
group
,join
, anditerate
. Its implementation is fully incrementalized, and the details are pretty cool (if mysterious).
There are also a few applications built on timely dataflow, including a streaming worst-case optimal join implementation and a PageRank implementation, both of which should provide helpful examples of writing timely dataflow programs.
If you are interested in working with or helping out with timely dataflow, great!
There are a few classes of work that are helpful for us, and may be interesting for you. There are a few broad categories, and then an ever-shifting pile of issues of various complexity.
If you would like to write programs using timely dataflow, this is very interesting for us. Ideally timely dataflow is meant to be an ergonomic approach to a non-trivial class of dataflow computations. As people use it and report back on their experiences, we learn about the classes of bugs they find, the ergonomic pain points, and other things we didn't even imagine ahead of time. Learning about timely dataflow, trying to use it, and reporting back is helpful!
If you like writing little example programs or documentation tests, there are many places throughout timely dataflow where the examples are relatively sparse, or do not actually test the demonstrated functionality. These can often be easy to pick up, flesh out, and push without a large up-front obligation. It is probably also a great way to get one of us to explain something in detail to you, if that is what you are looking for.
If you like the idea of getting your hands dirty in timely dataflow, the issue tracker has a variety of issues that touch on different levels of the stack. For example:
Timely currently does more copies of data than it must, in the interest of appeasing Rust's ownership discipline most directly. Several of these copies could be elided with some more care in the resource management (for example, using shared regions of one
Vec<u8>
in the way that the bytes crate does). Not everything is obvious here, so there is the chance for a bit of design work too.We recently landed a bunch of logging changes, but there is still a list of nice to have features that haven't made it yet. If you are interested in teasing out how timely works in part by poking around at the infrastructure that records what it does, this could be a good fit! It has the added benefit that the logs are timely streams themselves, so you can even do some log processing on timely. Whoa...
There is an open issue on integrating Rust ownership idioms into timely dataflow. Right now, timely streams are of cloneable objects, and when a stream is re-used, items will be cloned. We could make that more explicit, and require calling a
.cloned()
method to get owned objects in the same way that iterators require it. At the same time, using a reference to a stream without taking ownership should get you the chance to look at the records that go past without taking ownership (and without requiring a clone, as is currently done). This is often plenty for exchange channels which may need to serialize the data and can't take much advantage of ownership anyhow.There is a bunch of interesting work in scheduling timely dataflow operators, where when given the chance to schedule many operators, we might think for a moment and realize that several of them have to work to do and can be skipped. Better, we might maintain the list of operators with anything to do, and do nothing for those without work to do.
There are also some larger themes of work, whose solutions are not immediately obvious and each with the potential to sort out various performance issues:
Rate-controlling output
At the moment, the implementations of unary
and binary
operators allow their closures to send un-bounded amounts of output. This can cause unwelcome resource exhaustion, and poor performance generally if the runtime needs to allocate lots of new memory to buffer data sent in bulk without being given a chance to digest it. It is commonly the case that when large amounts of data are produced, they are eventually reduced given the opportunity.
With the current interfaces there is not much to be done. One possible change would be to have the input
and notificator
objects ask for a closure from an input message or timestamp, respectively, to an output iterator. This gives the system the chance to play the iterator at the speed they feel is appropriate. As many operators producet horrible but could plausibly be improved. It may also be that leaving the user with control of the granularity leaves them with more control over the latency/throughput trade-off, which could be a good thing for the system to do.