The DevOps Success Story
It's mid-2020, and the software development life cycle has reached an appreciable level of maturity. Two things stand out now:
DevOps culture & practices have evolved immensely. From version control to build phases, CI/CD, automated tests, deployment orchestration, and cloud infrastructure-as-code, all these processes have created a synergy for successful software delivery.
The tool ecosystem for developing software applications is remarkably rich. And DevOps processes have revolved around these tools to 'Automate everything humanly possible!'
For the uninitiated, DevOps is defined as "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality".
DevOps has helped software businesses succeed in three major ways:
Collaboration: DevOps has helped break down silos between software developers and operations engineers. It's a culture that promotes "We build it, we run it," not the "Throw the code over the wall" paradigm.
Speed: DevOps practices heavily advocate the "Automate Everything" ideology and this leads to faster time to market.
Reliability: The use of standardised CI/CD pipelines leads to reproducibility and near-zero errors. So things fail fast (which is good, because they can then be fixed) or do not fail at all!
Enter Artificial Intelligence
In parallel with the rise of cloud-native software services & DevOps, one more area making waves in technology circles is Artificial Intelligence (AI).
Minsky and McCarthy, often called the fathers of AI, described artificial intelligence as any task performed by a program or a machine that, if a human carried out the same activity, we would say the human had to apply intelligence to accomplish the task.
AI now has many sub-disciplines and helps solve problems in areas such as planning, learning, reasoning, problem solving, knowledge representation, perception, motion, and manipulation and, to a lesser extent, social intelligence and creativity.
On the one hand, DevOps is the de facto standard for application development. On the other hand, modern ML (Machine Learning) and AI do not yet have a standard tooling or process ecosystem. This is understandable:
Best practices have not emerged yet because the tools are changing rapidly, and a single body of knowledge is still needed here.
The excerpt below, from the Microsoft Azure Blog, throws more light on the topic:
"AI/ML projects
Like DevOps, these methodologies [CRISP-DM and TDSP] are grounded in principles and practices learned from real-world projects. AI/ML teams use an approach unique to data science projects where there are frequent, small iterations to refine the data features, the model, and the analytics question. It's a process intended to align a business problem with AI/ML model development. The release process is not a focus for CRISP-DM or TDSP and there is little interaction with an operations team. DevOps teams (today) are not yet familiar with the tools, languages, and artifacts of data science projects.
DevOps and AI/ML development are two independent methodologies with a common goal: to put an AI application into production. Today it takes effort to bridge the gaps between the two approaches. AI/ML projects need to incorporate some of the operational and deployment practices that make DevOps effective and DevOps projects need to accommodate the AI/ML development process to automate the deployment and release process for AI/ML models.
DevOps for AI/ML
DevOps for AI/ML has the potential to stabilize and streamline the model release process. It is often paired with the practice and toolset to support Continuous Integration/Continuous Deployment (CI/CD). Here are some ways to consider CI/CD for AI/ML workstreams:
The AI/ML process relies on experimentation and iteration of models and it can take hours or days for a model to train and test. Carve out a separate workflow to accommodate the timelines and artifacts for a model build and test cycle. Avoid gating time-sensitive application builds on AI/ML model builds.
For AI/ML teams, think about models as having an expectation to deliver value over time rather than a one-time construction of the model. Adopt practices and processes that plan for and allow a model lifecycle and evolution.
DevOps is often characterized as bringing together business, development, release, and operational expertise to deliver a solution. Ensure that AI/ML is represented on feature teams and is included throughout the design, development, and operational sessions.
Establish performance metrics and operational telemetry for AI/ML
Use metrics and telemetry to inform what models will be deployed and updated. Metrics can be standard performance measures like precision, recall, or F1 scores. Or they can be scenario specific measures like the industry-standard fraud metrics developed to inform a fraud manager about a fraud model's performance. Here are some ways to integrate AI/ML metrics into an application solution:
Define model accuracy metrics and track them through model training, validation, testing, and deployment.
Define business metrics to capture the business impact of the model in operations. For an example see R notebook for fraud metrics.
Capture data metrics, like dataset sizes, volumes, update frequencies, distributions, categories, and data types. Model performance can change unexpectedly for many reasons and it's expedient to know if changes are due to data.
Track operational telemetry about the model: how often is it called? By which applications or gateways? Are there problems? What are the accuracy and usage trends? How much compute or memory does the model consume?
Create a model performance dashboard that tracks model versions, performance metrics, and data sets.
AI/ML models need to be updated periodically. Over time, and as new and different data becomes available — or customers or seasons or trends change — a model will need to be re-trained to continue to be effective. Use metrics and telemetry to help refine the update strategy and determine when a model needs to be re-trained.
Automate the end-to-end data and model pipeline
The AI/ML pipeline is an important concept because it connects the necessary tools, processes, and data elements to produce and operationalize an AI/ML model. It also introduces another dimension of complexity for a DevOps process. One of the foundational pillars of DevOps is automation, but automating an end-to-end data and model pipeline is a byzantine integration challenge.
Workstreams in an AI/ML pipeline are typically divided between different teams of experts where each step in the process can be very detailed and intricate. It may not be practical to automate across the entire pipeline because of the difference in requirements, tools, and languages. Identify the steps in the process that can be easily automated like the data transformation scripts, or data and model quality checks. Consider the following workstreams:
An automated end-to-end process for the AI/ML pipeline can accelerate development and drive reproducibility, consistency, and efficiency across AI/ML projects."
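To make the excerpt's advice on metrics and re-training a little more concrete, here is a minimal Python sketch. It is not taken from the Azure post. It computes precision, recall and F1 for a binary classifier and flags a model for re-training when F1 on recent data drops too far below a baseline. The names ClassificationMetrics, compute_metrics and needs_retraining, as well as the 0.05 threshold, are assumptions made purely for illustration; a real project would more likely lean on an established metrics library and choose its own thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationMetrics:
    """Standard performance measures for one model evaluation run."""
    precision: float
    recall: float
    f1: float


def compute_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return ClassificationMetrics(precision, recall, f1)


def needs_retraining(baseline, current, max_f1_drop=0.05):
    """Hypothetical policy: re-train when F1 falls more than max_f1_drop
    below the baseline recorded at deployment time."""
    return (baseline.f1 - current.f1) > max_f1_drop


if __name__ == "__main__":
    # Labels and predictions captured at deployment time (illustrative data).
    baseline = compute_metrics([1, 0, 1, 1, 0, 1, 0, 0],
                               [1, 0, 1, 0, 0, 1, 1, 0])
    # Labels and predictions from a more recent window (illustrative data).
    recent = compute_metrics([1, 1, 1, 0, 0, 1, 0, 0],
                             [1, 0, 0, 0, 1, 1, 0, 0])
    print(baseline, recent, needs_retraining(baseline, recent))
```

In a pipeline, a check like this could run on a schedule against freshly labelled data, with the result feeding the model performance dashboard or triggering a separate model build workflow.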
The problems plaguing AI/ML engineers and data scientists are the lack of toolchains, automation pipelines, knowledge of standard model-training frameworks, and easy hardware access: different teams need different numbers of GPUs, FPGAs, CPUs, TPUs, or even IPUs.
Here are some of the challenges, posed as questions:
Who manages and maintains these resources for AI teams?
Who administers hardware resources?
Who prioritizes the jobs?
How is the sanity of resource allocations maintained?
Who supports automation scripting and defining pipelines?
Who handles security issues, authentication & authorization?
Who ensures all the accelerators and nodes are optimized?
Who profiles slow applications and helps the data scientists?
Who maintains the toolchains and cloud servers for AI teams?
Who handles any other infrastructure- or system-specific issues?
So who is the one with the cape?
DevOps the Superhero!
The answer to all this is again DevOps. But the DevOps of the application development era will not fit in here as it is! This is another beast, and it needs some more superpowers in addition to its core strengths. Knowledge of newer tools and practices like Kubeflow, TensorFlow, Google ML-Ops, Azure AI pipelines, and Amazon SageMaker Studio will be required. And it's high time all this knowledge was aggregated and standardised. I will follow up with more soon; until then, enjoy this insightful white paper from google.ai with some research findings along these lines: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf