Kubeflow, the freely available machine learning platform cofounded by developers at Google, Cisco, IBM, Red Hat, CoreOS, and CaiCloud, made its debut at the annual KubeCon conference in 2017. Three years later, Kubeflow has reached version 1.0, its first major release, as the project has grown to hundreds of contributors across more than 30 participating organizations. Companies including US Bank, Chase, GoJek, Amazon Web Services, Bloomberg, Uber, Shopify, GitHub, Canonical, Intel, Alibaba Cloud, TuSimple, Dell, Shell, Arrikto, and Volvo are among those using it in production.
Project coauthors Jeremy Lewi, Josh Bottum, Elvira Dzhuraeva, David Aronchick, Amy Unruh, Animesh Singh, and Ellis Bigelow announced the news in a Medium post this morning. “Kubeflow’s goal is to make it easy for machine learning engineers and data scientists to leverage cloud assets (public or on-premise) for [machine learning] workloads,” they wrote. “With Kubeflow, there is no need for data scientists to learn new concepts or platforms to deploy their applications, or to deal with ingress, networking certificates, etc.”
Kubeflow 1.0 graduates a core set of stable components needed to develop, build, train, and deploy models efficiently on Kubernetes, the Google-developed open source container-orchestration system for automating app deployment, scaling, and management. In addition to Kubeflow’s central dashboard UI and its Jupyter notebook controller and web app, Kubeflow 1.0 ships with the TensorFlow Operator (TFJob) and PyTorch Operator (both for distributed training), kfctl (for deployment and upgrades), and a profile controller and UI for multiuser management.
With Kubeflow 1.0, developers can use the programming notebook platform Jupyter and Kubeflow tools like Kubeflow’s Python software development kit to develop models, build containers, and create Kubernetes resources to train those models. Trained models can optionally be funneled through Kubeflow’s KFServing resource to create, deploy, and auto-scale an inferencing server across a range of hardware. KFServing also gains new explainability and payload logging features, both in alpha.
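As a rough illustration of the serving step described above, the following Python sketch builds a KFServing manifest as a plain dictionary. It assumes the `serving.kubeflow.org/v1alpha2` `InferenceService` schema that KFServing used around this release; the service name, namespace, and storage location are hypothetical placeholders, and the helper function is not part of any Kubeflow SDK.

```python
import json


def inference_service(name, namespace, storage_uri):
    """Build an InferenceService manifest for a TensorFlow model.

    Assumes the v1alpha2 KFServing schema; field names may differ
    in other KFServing versions.
    """
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "default": {
                "predictor": {
                    # Location of the exported model; hypothetical here.
                    "tensorflow": {"storageUri": storage_uri}
                }
            }
        },
    }


if __name__ == "__main__":
    manifest = inference_service(
        "demo-model", "team-a", "gs://example-bucket/models/demo"
    )
    # kubectl accepts JSON manifests, so this output could be piped to
    # `kubectl apply -f -` on a cluster that has KFServing installed.
    print(json.dumps(manifest, indent=2))
```

The manifest only declares where the model lives; creating the server and scaling it is left to the KFServing controller.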
Kubeflow 1.0 introduces a command-line interface and configuration files that enable it to be deployed with a single command. It also includes modules that are still under development, such as Pipelines. (Pipelines is partly based on and utilizes libraries from TensorFlow Extended, which was used internally at Google to build machine learning components and then allow developers on various internal teams to utilize that work and put it into production.) Other work-in-progress apps in Kubeflow 1.0 are Metadata (for tracking datasets, jobs, and models); Katib (for hyperparameter tuning); and distributed operators for other frameworks like XGBoost. These components are expected to graduate to 1.0 in future Kubeflow releases.
As before, Kubeflow enables data scientists and teams to run workloads within namespaces. (Namespaces provide security and resource isolation, and, using Kubernetes resource quotas, admins can limit how many resources an individual or team can consume to ensure fair scheduling.) From the Kubeflow UI, users can launch programming notebooks by choosing one of the pre-built images or entering the URL of a custom image. They can then set how many processors and graphics cards to attach to their notebook, as well as which configuration and secrets parameters to include from repositories and databases. Plus, they’re able to define a TFJob or PyTorch resource and have the controller take care of spinning up and managing processes and configuring them to talk to one another.
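To make the quota and training-job ideas above more concrete, here is a minimal Python sketch that assembles two Kubernetes manifests as dictionaries: a `ResourceQuota` that caps what a namespace can request, and a `TFJob` with two workers. The namespace, job name, image, and limits are hypothetical values chosen for illustration, and the helper functions are not part of Kubeflow itself.

```python
import json


def resource_quota(namespace, cpu, memory, gpus):
    """Build a ResourceQuota capping the total requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),
                "requests.memory": memory,
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


def tfjob(name, namespace, image, workers):
    """Build a TFJob; the operator creates and connects the worker pods."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # The TFJob operator expects the training
                                # container to be named "tensorflow".
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical team namespace and container image.
    items = [
        resource_quota("team-a", cpu=16, memory="64Gi", gpus=2),
        tfjob("mnist-train", "team-a", "registry.example.com/team-a/mnist:latest", workers=2),
    ]
    # Wrap both manifests in a List so they can be applied together.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The user declares only the desired number of workers; starting the processes and telling them how to reach one another is the controller's job, which is the division of labor the paragraph above describes.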
“This was a significant investment. It has taken several organizations and a lot of precious resources to get here,” wrote Cisco distinguished engineer and Kubeflow contributor Debo Dutta in a blog post. “We are very excited about the future of Kubeflow. We would like to see the community get stronger and more diverse, and we would like to request more individuals and organizations to join the community.”