Current State of Data Pipelines frameworks[November 2020]
As part of building a new Recommendation System my team decided to take sometime and review available tooling for data pipelines. Since we were building a new system we might as well embrace some new data pipeline tools. We had a play with what seems to be the most remarkable names out there, in this article I will cover my experiences working on toy pipelines with the mentioned tools.
Data pipelines are a needed part of the machine learning release cycle.
Until recently they just provided automation: run task1, then task2, then task3 and so on..
However as our understanding as an industry of machine learning development cycles expanded we understood that tracking just code changes is not enough. We also need to track the data and transformations that produce a model. So modern Data pipelines take the responsibility of providing reproducible experiments/models as well as data traceability.
In other words they try to version and track: models, experiment results, code version, and data version.
I think a good way to classify data pipeline tools is as explained by Ludovic Santos in . He classifies pipelines as : Task-driven and Data-driven
Task driven frameworks don’t really care too much about what’s the input or output of a step in a pipeline. They only care about orchestrating a workflow. Luigi and airflow are good examples of task driven pipeline frameworks.
These kind of frameworks not only orchestrate a workflow but also are aware of the data being passed between tasks. They are aware of types of input/output of a task. They also consider the output of a task to be artifacts which can be versioned. Usually tools in this realm come with caching which helps avoiding rerunning tasks . Testing artifacts is also part of these frameworks.
Luigi is a task driven pipeline tool. We have been using Luigi for a long time. I think what we like about Luigi is : it’s simplicity.
In our case we could easily define a Luigi layer which plays as a caching layer avoiding us from rerunning tasks.
Luigi was coded in a pre-container era. However it has a Kubernetes task which provides a very basic interface for running tasks on Kubernetes.
It seems Luigi development is not moving forward these days and so the Kubernetes task feels a bit outdated and a bit brittle.
Luigi is bare bones, it has only the basics . Pipeline versioning and data versioning is something you will have to handle. There is no scheduler which means you need some other tool to schedule Luigi pipelines.
I guess at this point you might be dismissing Luigi but I beg you not to. Luigi shines because it lets you easily write tests, because it is bare bones you can extend it to do whatever you want. Also the way you define tasks means you can easily reuse them and compose them.
I particularly like the fact that Luigi can run any task even if that’s a task in the middle of a pipeline. This is something that as far as I am aware can’t be done on airflow. (Please correct me if I am wrong ) on airflow you always have to run a complete pipeline (DAG), and it would be difficult to run only half of it.
Luigi’s UI is very basic and it is really hard to know what’s has run and what has not.
Airflow is another task driven framework. Unlike Luigi is in active development, a new version is being released soon and it mostly focuses on : making easier to pass data between tasks (operators) something that has been cumbersome until now . The new release also vastly improves interacting with Kubernetes.
Airflow lets you run tasks on Kubernetes via KubernetesOperator.
Something I really liked about airflow is how stable the KubernetesOperator is. As a comparison Luigi’s equivalent has some problems getting logs during failures or it can leak Kubernetes resources in some circumstances.
Airflow also comes with a very sophisticated ui and scheduler. You can define for example that a pipeline would run every day at a particular time.
I think if your use case consider having multiple pipelines, for example a recurrent pipeline per client that gets triggered once every day , then airflow might be a good framework to use. The ui provides an easy way to check what’s going on as well as looking at logs.
Unlike Luigi caching here seems more difficult and it is left entirely to the user. Versioning and artifacts are also left entirely to the user .
I guess if you are on GCP giving a try to airflow makes sense as cloud composer is a managed service offering airflow.
Comparing Luigi and airflow I think airflow operators are way less composable.
Kubeflow is a bunch of things. It is like a box of lego pieces, it shines integrating various parts of the machine learning release process . For example model serving could be well integrated with data pipelines.
As far as this post goes I will only refer to “kubeflow pipelines” which is the Kubeflow component which lets you define data pipelines .
Kubeflow pipelines belong to what I think is a second generation of frameworks, it is a data driven framework, it tries to version everything. All tasks are meant to run in containers.
Let me start by describing the process of writing and running pipelines. While on Luigi and Airflow you pretty much write and run standard python, in Kubeflow the process is a bit different
Kubeflow development cycle
1. You define pipelines in a Python DSL. Every task is a container that will be run on Kubernetes.
- Not all tasks have to provide a container. DSL provides some tools to convert a python function into a task.
- Kubeflow provides the pieces for creating Kubernetes volumes and Kubernetes snapshots as part of pipeline actions .
- Kubeflow also provide tasks related to model serving (Seldon, https://github.com/SeldonIO)
- passing outputs between tasks is very easy in Kubeflow Unlike Airflow where there are many limitations.
2. There is a cmd tool that takes a python file defining a pipeline and transforms it into a .tar.gz file which you can upload to Kubeflow
- Cmd tool will check types and burps if there is any error in your pipeline definition. Since tasks are type aware this can be useful.
- It essentially saves you from trying to run a pipeline that would burp half way through
3. Before you upload a pipeline you need to go to the ui and create “an experiment “ and a “Pipeline name “ . You can also do this via api.
4. You can upload a tar.gz pipeline via the UI or a rest API. A pipeline upload will be versioned. Every time you upload a pipeline it will be versioned.
5. You can define recurrent pipeline runs via the UI or a rest API. You can also trigger a single run.
6. Task outputs get cached in a database. They are cached based on pipeline version , task and task parameters. If a task was run before with the same pipeline version and inputs , then it won’t be rerun.
Current state of Kubeflow
Kubeflow pipelines have many features. Every time I tried to do something with Kubeflow it had a feature for it.
⚠️ However very few features are documented. Most of the time I had to go to the source code and dig code samples or tests to check how to do something .
⚠️Some parts are a bit brittle at the moment. For example: I could not get caching to work; when defining a recurrent run(i.e run everyday at 9am) I could not specify a pipeline version in an easy manner.
In my opinion Kubeflow might be good for a team who is already invested in Kubernetes. Very often you will find yourself fighting with Kubernetes (or in my case my lack of deep knowledge on Kubernetes).
Kubeflow installs many things in the background: Istio, Seldon, Prometheus.
Getting through the installation and getting it to work is quite easy. Knowing what’s happening underneath seems more difficult. This particularly hit me when I checked all pods running under Kubeflow namespace.
As a team we only thought of using Kubeflow pipelines so other components looked a bit overkill, but I can see why it could become a very popular project, it puts a lot of things together which are very hard to ingrate : pipelines, model serving, hyper parameter search, experiment notebooks, a/b testing, model observability.
Caching is also something we particularly wanted, IMO it looks super neat implemented in Kubeflow since it already versions everything.
I think Kubeflow is a promising project but unless your team is willing to go the extra mile contributing documentation and dealing with a few broken things then it might be better to wait a bit more.
Kubeflow certainly opens a lot of new possibilities for example the idea of sharing data through Kubernetes volumes, and maybe having volume snapshots as Artifacts.
UI is pretty good and it is easy to get around recurrent runs. It is also straight forward to get logs.
Kubeflow provides an abstraction called experiments . All resources belong to an experiment, think of it to be some sort of
namespace .Your pipeline belongs to an experiment, in an experiment you can track metrics.
For example one of your metrics could be fscore , recall etc. your pipeline can push results to an experiment .
Artifacts (tasks outputs) are versioned and tracked. I think there is a lot of potential features in this realm.
Closing the gap between ops data science
I think under the hood what Kubeflow really wants to do is to close the gap between ops and data science.
If you use Luigi or airflow your data science team will eventually have the problem of needing some infrastructure to run that orchestration somewhere. Even after that problem is solved, assuming your data science team is creating models, then a new problem will arise: the infrastructure needed to run those models.
This also happens during experimentation when data scientists want to take their python notebook from a local environment to a cloud instance with more power: “how do I get there?”. From the perspective of ops engineers the data science tooling is hard, full of unknown dependencies and assumptions.
Kubeflow closes this gap by separation of concerns. An ops engineer can then focus on Kubernetes stuff only without worrying about specific docker images, python notebooks or any other data science dependencies.
Simultaneously data scientists don’t have a wall of ops knowledge to get something running on the company’s Infrastructure.
Argo is another framework for defining pipelines.
Parts of Kubeflow itself are built on top of Argo. However Kubeflow just focuses on machine learning specifics.
It is quite generic and we simply decided to skip it and instead experiment with Kubeflow.
TFX pipelines (Tensorflow Extended Pipelines)
This is a new python framework for defining pipelines.
It relies either on Kubeflow or Airflow as a backend, so we did not dig into it too much.
As far as I read it supports caching, and it does experiment tracking.
MLFlow is not a pipeline framework, but chances are you will be hearing about MLFlow .
MLFlow is a python library that lets you track experiments. It powers up your pipelines so they can track metrics (recall, precision, loss..)
You can use MLFlow with any pipeline framework i.e: Luigi or airflow.
MLFlow can also help you deploy models, in this sense it also overlaps with Kubeflow.
At the end of the day I think Kubeflow looks promising.
I would give Kubeflow a try if my team is already using Kubernetes and if I have someone to poke from time to time to ask Kubernetes specific questions.
As for us, we are building an MVP. So we decided to make sure all our tasks are running in containers and jumped into Luigi. Luigi is simple, we have been using it for a long time already.
As we move out of our MVP phase shall we need something more powerful then we would consider something else. As for now there are more important features to tackle rather than fighting with Kubeflow/Kubernetes .