Search Posts on Binpipe Blog

Terraforming a Landing Zone on Google Cloud

A landing zone is a well-defined and secure architecture on a cloud platform that serves as a starting point for an organization's cloud adoption journey. It typically includes a set of foundational resources, such as virtual private clouds (VPCs), subnets, security groups, and identity and access management (IAM) policies, that are required to establish a secure and stable environment for running applications and workloads on the cloud.

A landing zone on Google Cloud Platform (GCP) is a set of resources that are created and configured in a specific way to meet the organization's security and compliance requirements, as well as to support its future cloud adoption strategy. These resources can include VPCs, subnets, firewall rules, IAM policies, and other cloud services that are needed to build and deploy applications on GCP.

The main purpose of a landing zone is to provide a secure and compliant environment for organizations to migrate their applications and workloads to the cloud, and to enable them to quickly and easily scale and manage their cloud infrastructure as their needs evolve over time. It serves as a foundation for an organization's cloud infrastructure and helps to ensure that it is well-architected, reliable, and secure.

Here is a basic Terraform script that you can use as a starting point for a landing zone on Google Cloud Platform (GCP). It creates a virtual private cloud (VPC) network, a subnet within that network, and a firewall rule that allows incoming SSH connections from any IP address. Opening SSH to every address keeps the example short; narrow the source range before using it for real workloads.

# Configure the Google Cloud provider
provider "google" {
  # Your GCP project ID
  project = "my-gcp-project"

  # The region where you want to create your resources
  region = "us-central1"
}

# Create a VPC network
resource "google_compute_network" "my_vpc" {
  name                    = "my-vpc"
  auto_create_subnetworks = true
}

# Create a subnet
resource "google_compute_subnetwork" "my_subnet" {
  name    = "my-subnet"
  network = google_compute_network.my_vpc.id

  # Example range; replace it with the CIDR block you want for this subnet
  ip_cidr_range = "10.0.0.0/24"

  # The region where you want to create your subnet
  region = "us-central1"
}

# Create a firewall rule that allows SSH from any IP address
resource "google_compute_firewall" "allow_ssh" {
  name    = "allow-ssh"
  network = google_compute_network.my_vpc.name

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["0.0.0.0/0"]
}

You can then use this script as a starting point and customize it to meet your specific requirements. For example, you can add additional resources, such as Compute Engine instances or Cloud Storage buckets, and define their properties and dependencies.

Empirical Evaluation of FinOps Framework for Sustainable Cloud Engineering | Doctoral Research | Prasanjit Singh

Alongside my work in the cloud computing industry spanning 15+ years, I have always remained a student and pursued academics. That quest led me to complete my Bachelor's and Master's degrees in Computer Science, and I am honoured to now be shortlisted as a PhD scholar in the same field.

In my doctoral pursuit, my research interests revolve around building and evaluating frameworks to achieve energy and cost efficiency in cloud computing systems. With modern cloud platforms becoming increasingly large-scale and distributed, there is a pressing need for cost-effective, energy-efficient systems that lower the carbon footprint of computing. In this spirit, and building on advances in Green Cloud Computing and the evolution of FinOps practices, I am pursuing an empirical approach to sustainable distributed computing systems.

My approach to addressing systems research challenges is grounded on concrete understanding through practical evaluation of real systems. In summary, the objectives of this research work are:

  • To create and analyze FinOps frameworks that achieve energy and cost efficiency for cloud computing systems.
  • To carry out a detailed review and build concrete knowledge through practical assessment of real-world FinOps systems.
  • To embed sustainability into daily design, development and operational processes in cloud engineering.

I will be documenting my research outcomes in this repository and on my YouTube channel, among other channels. Thanks!

[FinOps] Cost Optimisation Strategies in Alibaba Cloud | Prasanjit Singh

Alibaba Cloud offers a plethora of services to assist customers with cloud cost management, i.e., the planning that lets a company control what it spends on cloud technology. Even so, many users struggle to control their expenditure. Here are some measures you can use to reduce your company's Alibaba Cloud costs.

  • Terminate unused ECS instances

Using the cost analysis and resource optimization reports in the Alibaba Cloud console, you can get a list of idle or low-utilization instances. Once you identify these instances, you can stop or downsize them. Stopping alone is not enough for an instance you no longer need, because a stopped instance still incurs charges for its Elastic Block Storage (EBS) volumes. Terminating the ECS instance stops both the ECS and the EBS charges.

  • Cut oversized instances and volumes

Before deciding which instances and volumes to shrink, analyze all the available data in depth. Do not rely on a short observation period: use at least one month of data, and check for seasonal peaks. Remember that an EBS volume cannot be shrunk in place, so once you know the size you need, create a new volume of that size and copy the data over from the old one.
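
The right-sizing advice above can be sketched as a small calculation. This is a minimal illustration, not an Alibaba Cloud tool: the function name `recommend_vcpus`, the 60% target utilization, the hourly sampling and the power-of-two instance sizes are all assumptions made for the example.

```python
def recommend_vcpus(cpu_samples, current_vcpus, target_utilization=0.6,
                    min_samples=30 * 24):
    """Suggest a vCPU count from hourly CPU utilization samples (0.0 to 1.0).

    All thresholds are illustrative assumptions, not provider defaults.
    """
    # Refuse to decide on less than roughly one month of hourly data.
    if len(cpu_samples) < min_samples:
        raise ValueError("need at least one month of hourly samples")

    ordered = sorted(cpu_samples)
    # Nearest-rank 95th percentile: ceil(0.95 * n), computed with integers.
    rank = -(-(95 * len(ordered)) // 100)
    p95 = ordered[rank - 1]

    # vCPUs needed so that the p95 load sits at the target utilization.
    needed = p95 * current_vcpus / target_utilization

    # Round up to the next power of two (assumed instance size steps).
    size = 1
    while size < needed:
        size *= 2

    return {
        "p95": p95,
        "peak": ordered[-1],  # report the peak so seasonal spikes stay visible
        "recommended_vcpus": min(size, current_vcpus),
    }
```

The reported `peak` is there on purpose: if it is far above `p95`, the workload has spikes that deserve a closer look before anything is downsized.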

  • Use private IPs

Whenever ECS instances talk to each other through public IPs or an internet-facing Server Load Balancer (SLB), you pay data transfer charges, even when both ends are in the same region. Use private IPs for internal communication to avoid paying this extra fee.

  • Delete low-usage Alibaba EBS volumes

Track Elastic Block Storage (EBS) volumes for at least one week and identify those with low activity (less than 1 I/O operation per second on average per day). Take a snapshot of these volumes (in case you need them at a future date) and then delete them.
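
As a rough sketch of that check, the function below flags volumes whose average I/O rate stays under a threshold on every day of the observation window. The input format (a dictionary of daily I/O operation counts per volume), the function name and the default thresholds are assumptions for illustration; real numbers would come from your monitoring data.

```python
SECONDS_PER_DAY = 86_400

def find_idle_volumes(daily_io_counts, min_days=7, max_avg_iops=1.0):
    """Return the IDs of volumes that look idle.

    daily_io_counts maps a volume ID to a list with the total number of
    I/O operations recorded on each day (hypothetical input format).
    """
    idle = []
    for volume_id, counts in daily_io_counts.items():
        # Skip volumes without enough history to judge.
        if len(counts) < min_days:
            continue
        # A volume is idle only if every day averages below the threshold.
        if all(count / SECONDS_PER_DAY < max_avg_iops for count in counts):
            idle.append(volume_id)
    return sorted(idle)
```

A volume returned by this function is only a candidate: snapshot it first, as described above, before deleting anything.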

  • Use Alibaba Cloud Savings Plan

Alibaba Cloud Savings Plan is a flexible pricing model with a term of one to three years. You get a lower price on ECS and Elastic Container Instance (ECI) usage in exchange for committing to a steady amount of usage over that term. The committed usage is usually discounted by more than 30%. A savings plan is ideal for stable businesses that know their resource requirements.

  • Utilize Reserved Instances

By reserving an instance, you may save up to 70%. But if you use the reserved instance less than you expected, you may end up overpaying, because you pay for 24/7 usage over the entire reserved period whether or not you actually use the resource.
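
The trade-off can be checked with simple arithmetic. The sketch below assumes a reservation is billed for every hour of the year at a flat discount off the pay-as-you-go rate; the function names and the prices in the usage note are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365

def break_even_utilization(discount):
    """Fraction of the year an instance must run for a reservation to pay off.

    A reservation costs (1 - discount) of the pay-as-you-go price for every
    hour, so it wins once actual usage exceeds that same fraction.
    """
    return 1 - discount

def cheaper_option(hourly_price, discount, expected_utilization):
    """Compare one year of pay-as-you-go usage with a one-year reservation."""
    pay_as_you_go = hourly_price * HOURS_PER_YEAR * expected_utilization
    reserved = hourly_price * (1 - discount) * HOURS_PER_YEAR
    return "reserved" if reserved < pay_as_you_go else "pay-as-you-go"
```

With a 70% discount the break-even point is 30% utilization: `cheaper_option(1.0, 0.7, 0.5)` favours the reservation, while `cheaper_option(1.0, 0.7, 0.2)` favours pay-as-you-go.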

  • Buy reserved instances on the Alibaba Cloud marketplace

The Alibaba Cloud Marketplace is like a stock market. You can sometimes buy Standard Reserved Instances at extremely affordable prices in comparison to buying directly from Alibaba Cloud. In this way, you can end up saving almost 75%.

  • Utilize Alibaba ECS Spot Instances

Spot instances can reduce costs by almost 90%. Spot instances are great for workloads that are fault-tolerant, for example, big data, web servers, containerized workloads, and high-performance computing (HPC). Auto-scaling automatically requests spot instances to meet target capacity during interruptions. 

  • Configure autoscaling

Autoscaling allows your ECS fleet to grow or shrink based on demand. By configuring autoscaling, you can start instances only when they are needed and stop them when demand drops. You can review your scaling activity using the CLI, then check whether instances can be added less aggressively or whether the minimum fleet size can be reduced while still serving requests.
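
To make the effect of the minimum and maximum fleet size concrete, here is a minimal sketch of a target-tracking style calculation. It is a generic formula used for illustration, not the exact algorithm of Alibaba Cloud Auto Scaling, and the function and parameter names are hypothetical.

```python
import math

def desired_capacity(current_instances, metric_value, target_value,
                     min_size, max_size):
    """Scale the fleet so the average metric moves toward the target.

    Example: 4 instances at 90% CPU with a 60% target need
    ceil(4 * 90 / 60) = 6 instances.
    """
    if current_instances <= 0 or target_value <= 0:
        raise ValueError("current_instances and target_value must be positive")
    raw = math.ceil(current_instances * metric_value / target_value)
    # The configured bounds always win over the computed value.
    return max(min_size, min(max_size, raw))
```

If the computed value is regularly clamped up to `min_size`, the minimum is probably higher than the workload needs, which is exactly the kind of finding a scaling activity review should surface.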

  • Choose availability zones and regions

The cost of Alibaba Cloud resources varies by region, and data transfers between different availability zones can incur an extra fee. Where your availability requirements allow, keep components that exchange a lot of data in the same region and availability zone.

Cloud Engineering Podcast Covering AWS, GCP, Azure & Alibaba Cloud

Good news! I have started a series of monologues and dialogues about Cloud Engineering and the Podcasts are available on multiple channels across various platforms. This will help you learn about the cloud on the go!

The cloud is not just another method of running your organization's IT needs. It's the technological leap that will move you from the status quo into a future world of business innovation. Deloitte's industry-leading cloud professionals will enable your end-to-end journey from on-premise legacy systems to the cloud, from design through deployment, and leading to your ultimate destination—a transformed organization primed for growth.

Cloud Infrastructure & Engineering services help clients integrate technology services seamlessly into the fabric of their day-to-day business. Deloitte experts provide infrastructure and networking solutions to connect, optimize, and manage private, public, and hybrid cloud solutions across leading platforms, including AWS, Azure, GCP, Alibaba, VMware and Cisco.

Container Registry at Alibaba Cloud

In simple words, a container registry is a repository, or collection of repositories, used to store container images for Kubernetes, DevOps, and container-based application development.

Container Registry allows you to manage images throughout the image lifecycle. It provides secure image management, stable image build creation across global regions, and easy image permission management. This service simplifies the creation and maintenance of the image registry and supports image management in multiple regions. Combined with other cloud services such as Container Service, Container Registry provides an optimized solution for using Docker in the cloud.

Container images
A container image is a packaged copy of a container (the files and components within it that make up an application) which can be replicated to scale out quickly, or moved to other systems as needed. Once a container image is created, it serves as a template that can be used to create new apps, or to expand and scale an existing app.

When working with container images, you need somewhere to save and access them as they are created and that's where a container registry comes in. The registry essentially acts as a place to store container images and share them out via a process of uploading to (pushing) and downloading from (pulling). Once the image is on another system, the original application contained within it can be run on that system as well.
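
The push and pull flow can be pictured with a toy in-memory registry. This is a conceptual sketch only: real registries store images as manifests plus layers and speak an HTTP API, and the class and method names here are invented for the example.

```python
import hashlib

class Registry:
    """Toy registry: stores image bytes by digest and maps tags to digests."""

    def __init__(self):
        self._blobs = {}  # digest -> image bytes
        self._tags = {}   # "repository:tag" -> digest

    def push(self, repository, tag, image_bytes):
        """Upload an image and point repository:tag at it."""
        digest = "sha256:" + hashlib.sha256(image_bytes).hexdigest()
        self._blobs[digest] = image_bytes
        self._tags[f"{repository}:{tag}"] = digest
        return digest

    def pull(self, repository, tag):
        """Download the image that repository:tag currently points at."""
        digest = self._tags[f"{repository}:{tag}"]
        return self._blobs[digest]
```

Because content is addressed by digest, pushing the same bytes under a second tag stores nothing new; only the tag mapping grows.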

In addition to container images, registries also store application programming interface (API) paths and access control parameters.

Public vs. private container registries
There are two types of container registry: public and private.

Public registries are great for individuals or small teams that want to get up and running with their registry as quickly as possible. They are basic in their abilities/offerings and are easy to use.

New and smaller organizations can take advantage of standard and open source images to start and can grow from there. As they grow, however, there are security issues like patching, privacy, and access control that can arise.

Private registries provide a way to incorporate security and privacy into enterprise container image storage, either hosted remotely or on-premises. A company can choose to create and deploy its own container registry, or it can choose a commercially supported private registry service. These private registries often come with advanced security features and technical support, a good example being Alibaba Cloud Container Registry.

What to look for in a private container registry
A major advantage of a private container registry is the ability to control who has access to what, scan for vulnerabilities and patch as needed, and require authentication of images as well as users.

Some important things to look for when choosing a private container registry service for your enterprise:

- Support for multiple authentication systems
- Role-based access control (RBAC) management
- Vulnerability scanning capabilities
- Ability to record usage in auditable logs so that activity can be traced to a single user
- Optimized for automation

Role-based access control allows the assignment of abilities within the registry based on the user's role. For instance, a developer would need access to upload to, as well as download from, the registry, while a team member or tester would only need access to download.
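
A minimal sketch of that rule, combined with the audit trail mentioned in the list above, could look like the following. The role names mirror the example in the text; the function name, the log format and the user names in the usage note are assumptions.

```python
from datetime import datetime, timezone

# Role -> actions allowed in the registry (illustrative mapping).
ROLE_PERMISSIONS = {
    "developer": {"push", "pull"},
    "tester": {"pull"},
}

def authorize(user, role, action, audit_log):
    """Check an action against the role and record the decision."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    # Every decision is logged so activity can be traced to a single user.
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

For example, `authorize("tom", "tester", "push", log)` returns `False` and still leaves an entry in `log`, while an unknown role is denied everything by default.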

For organizations with a user management system like AD or LDAP, that system can be linked to the container registry directly and used for RBAC.

A private registry keeps images with vulnerabilities, or those from an unauthorized user, from getting into a company's system. Regular scans can be performed to find any security issues and then patch as needed.  

A private registry also allows for authentication measures to be put in place to verify the container images stored on it. With such measures in place, an image must be digitally "signed" by the person uploading it before it can be uploaded to the registry. This allows that activity to be tracked, and it prevents the upload if the user is not authorized. Images can also be tagged at various stages so that you can roll back to an earlier one if needed.

Alibaba Cloud container registry
Alibaba Cloud Container Registry is a private container image registry that enables you to build, distribute, and deploy containers with the storage you need to scale quickly. It scans your images for security vulnerabilities, so that potential issues can be identified and addressed before they become security risks.

Alibaba Cloud Container Registry ensures your apps are stored privately with powerful access and authentication settings that you can control, as well as the following features and benefits:

- Compatibility with multiple storage backends and identity providers
- Logging and auditing
- A flexible and extensible API
- Intuitive user interface (UI)
- Automated software deployments using robot accounts
- Automatic and continuous image garbage collection to efficiently use resources for active objects without the need for downtime or read-only mode.

Understanding Alibaba Cloud VPC & Use Cases

Alibaba Virtual Private Cloud (Alibaba Cloud VPC) is a service that lets you launch Alibaba Cloud resources in a logically isolated virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. You can use both IPv4 and IPv6 for most resources in your virtual private cloud, helping to ensure secure and easy access to resources and applications.

As one of Alibaba Cloud's foundational services, Alibaba Cloud VPC makes it easy to customize your VPC's network configuration. You can create a public-facing subnet (a vSwitch, in Alibaba Cloud terms) for web servers that need internet access, and place backend systems, such as databases or application servers, in a private subnet with no internet access. Alibaba Cloud VPC lets you use multiple layers of security, including security groups and network access control lists, to help control access to the ECS instances in each subnet.

Use cases of VPC

- Host a simple, public-facing website
Host a basic web application, such as a blog or simple website, in a VPC and gain the additional layers of privacy and security afforded by Alibaba Cloud VPC. You can help secure the website by creating security group rules which allow the web server to respond to inbound HTTP and HTTPS requests from the internet while simultaneously prohibiting the web server from initiating outbound connections to the internet. Create a VPC that supports this use case by selecting "VPC with a Single Public Subnet Only" from the Alibaba Cloud VPC console wizard.

- Host multi-tier web applications
Host multi-tier web applications and strictly enforce access and security restrictions between your web servers, application servers, and databases. Launch web servers in a publicly accessible subnet while running your application servers and databases in private subnets. This will ensure that application servers and databases cannot be directly accessed from the internet. You control access between the servers and subnets using inbound and outbound packet filtering provided by network access control lists and security groups. To create a VPC that supports this use case, you can select "VPC with Public and Private Subnets" in the Alibaba Cloud VPC console wizard.

- Back up and recover your data after a disaster
By using Alibaba Cloud VPC for disaster recovery, you receive all the benefits of a disaster recovery site at a fraction of the cost. You can periodically back up critical data from your data center to a small number of ECS instances with Elastic Block Storage (EBS) volumes, or import your virtual machine images to ECS. To ensure business continuity, Alibaba Cloud VPC allows you to quickly launch replacement compute capacity in Alibaba Cloud. When the disaster is over, you can send your mission-critical data back to your data center and terminate the ECS instances that you no longer need.

- Extend your corporate network into the cloud
Move corporate applications to the cloud, launch additional web servers, or add more compute capacity to your network by connecting your VPC to your corporate network. Because your VPC can be hosted behind your corporate firewall, you can seamlessly move your IT resources into the cloud without changing how your users access these applications. Furthermore, you can host your VPC subnets in Alibaba Cloud Outposts, a service that brings native Alibaba Cloud services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility. Select "VPC with a Private Subnet Only and Hardware VPN Access" from the Alibaba Cloud VPC console wizard to create a VPC that supports this use case.

- Securely connect cloud applications to your datacenter
An IPsec VPN connection between your Alibaba Cloud VPC and your corporate network encrypts all communication between the application servers in the cloud and databases in your data center. Web servers and application servers in your VPC can leverage ECS elasticity and Auto Scaling features to grow and shrink as needed. Create a VPC to support this use case by selecting "VPC with Public and Private Subnets and Hardware VPN Access" in the Alibaba Cloud VPC console wizard.

Alibaba's VPC functionality:

- Create a Virtual Private Cloud on Alibaba Cloud's scalable infrastructure, and specify its private IP address range from any block you choose.
- Divide your VPC's private IP address range into one or more subnets in a manner convenient for managing applications and services you run in your VPC.
- Bridge together your VPC and your IT infrastructure via an encrypted VPN connection.
- Add Alibaba Cloud resources, such as ECS instances, to your VPC.
- Route traffic between your VPC and the Internet over the VPN connection so that it can be examined by your existing security and networking assets before heading to the public Internet.
- Extend your existing security and management policies within your IT infrastructure to your VPC as if they were running within your infrastructure.
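
Dividing a VPC's private address range into subnets, as the list above describes, can be planned with Python's standard `ipaddress` module before anything is created in the cloud. The function name, the example CIDR block and the subnet names are assumptions for illustration.

```python
import ipaddress

def plan_subnets(vpc_cidr, new_prefix, names):
    """Assign consecutive subnets of the VPC range to the given names."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if not vpc.is_private:
        raise ValueError("a VPC should use a private address range")
    # Number of subnets of the requested size that fit in the VPC range.
    available = 2 ** (new_prefix - vpc.prefixlen)
    if len(names) > available:
        raise ValueError("not enough address space for all subnets")
    subnets = vpc.subnets(new_prefix=new_prefix)
    return {name: str(subnet) for name, subnet in zip(names, subnets)}
```

For instance, `plan_subnets("10.0.0.0/16", 24, ["public-web", "private-app", "private-db"])` yields `10.0.0.0/24`, `10.0.1.0/24` and `10.0.2.0/24`, leaving the rest of the /16 free for later growth.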

To get started you'll need to not only sign up but create a VPN connection to your own network from Alibaba's datacenter. You'll need information about your hardware such as its IP address and other networking-related data. 

Alibaba Container Service for Kubernetes (ACK)

Enterprise businesses have been rapidly adopting the cloud and various cloud services to modernize their workloads and increase their agility and scalability. Through concepts like containerization and orchestration, companies have found ways to make applications more portable, increase efficiency and address challenges surrounding the deployment of code.

Alibaba Cloud, a major global cloud provider, offers a variety of cloud services, including Container Service for Kubernetes (ACK), a fully managed Kubernetes service.

Running Kubernetes on Alibaba Cloud was once a challenge because of the manual configuration involved, which required extensive operational expertise and effort. With ACK, Alibaba solved that problem. ACK can now be used for a variety of use cases, including web applications powered by a headless CMS such as Crafter.

Dissecting Containerization and Kubernetes Orchestration
First of all, before diving into Container Service for Kubernetes (ACK), let's go over containerization, orchestration and Kubernetes.

What is Containerization?
A popular trend in software development and deployment, containerization involves the packaging of software code so that it can run uniformly and consistently on any infrastructure.

Containerization enables developers to build and deploy applications faster and with more security. Traditionally, code is developed in a specific environment, and bugs can be introduced when it is moved to a different one.

With containerization, this problem is removed since application code, configuration files and dependencies required for the code to run are all bundled together. This container can stand alone and run on any platform or in the cloud.

What is Orchestration?
Orchestration helps IT operations manage complex tasks and workflows by automatically configuring, managing, and coordinating applications, systems, and services.

When ops have to manage multiple servers and applications, orchestration helps to combine multiple automated tasks and configurations across groups of systems.

What is Kubernetes?
Kubernetes is an open source container-orchestration system that enables teams to deploy, scale and manage containerized applications. It handles the scheduling of containers in a cluster and manages workloads so that everything runs as intended.

Kubernetes was designed for software development teams and IT operations to work together, so it allows for easy adoption of GitOps workflows.

Kubernetes also manages clusters of Alibaba ECS instances and runs containers on those instances. With Container Service for Kubernetes (ACK), Alibaba makes it easy to run Kubernetes in the cloud.

Digging Deeper with Container Service for Kubernetes (ACK)
ACK offers the best way to run Kubernetes for a number of reasons and takes away the manual effort that development teams once had to go through in setting up Kubernetes clusters on Alibaba.

You can run ACK workloads on Elastic Container Instance (ECI), a serverless compute service for containers that removes the need to provision and manage servers and isolates applications by design to improve security.

ACK integrates deeply with other Alibaba Cloud services such as CloudMonitor, Resource Access Management (RAM), and Virtual Private Cloud (VPC). These services provide a seamless experience for monitoring, scaling, and load-balancing applications.

ACK also provides a highly-available and scalable control plane that runs across multiple availability zones, eliminating any single points of failure.

ACK Benefits
The Kubernetes Community
Applications managed by ACK are fully compatible with those managed by a standard Kubernetes environment. That's because ACK runs upstream Kubernetes and is certified Kubernetes-conformant.

Since Kubernetes is open source, the community contributes code to its ongoing development, along with Alibaba's contributions as part of that community.

High Availability
The Kubernetes management infrastructure is run by ACK across multiple Alibaba Availability Zones. This allows ACK to automatically detect unhealthy control plane nodes and replace them and also leads to on-demand, zero downtime upgrades and security patches.

The latest security patches are automatically applied to the cluster control plane. Plus, Alibaba works with the Kubernetes community to make sure critical issues are resolved before new releases are deployed to existing clusters.

ACK Use Cases
Hybrid Deployment
ACK can be used on Alibaba Outposts to run low latency containerized applications to on-prem systems. Alibaba Outposts is another fully managed service from Alibaba that extends Alibaba infrastructure, services, tools and APIs to essentially any connected site.

ACK on Outposts allows you to manage on-premise containers just as easily as if you were managing containers in the cloud.

Batch Processing
Run sequential or parallel batch work on an ACK cluster by using the Kubernetes Jobs API. ACK lets you plan, schedule and execute batch workloads across the range of Alibaba Cloud compute options, whether you're using ECS, ECI or spot instances.
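
As an illustration of the Kubernetes Jobs API, the sketch below builds a `batch/v1` Job manifest as a Python dictionary. The helper name, the job name, the image reference and the command are hypothetical; the manifest fields themselves (`completions`, `parallelism`, `backoffLimit`, `restartPolicy`) are standard Job fields.

```python
import json

def batch_job_manifest(name, image, command, completions=1, parallelism=1):
    """Build a Kubernetes batch/v1 Job manifest as a plain dictionary."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,  # how many pods must finish successfully
            "parallelism": parallelism,  # how many pods may run at the same time
            "backoffLimit": 3,           # retries before the Job is marked failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }

def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Setting `parallelism=1` gives sequential execution, while a higher value lets several pods work through the `completions` count in parallel. The JSON output can be saved to a file and submitted with `kubectl apply -f`.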

Web Apps
Build web applications that can scale up and down automatically and run in a highly available configuration across multiple Availability Zones. When using ACK, web apps can leverage the performance, scalability, availability and reliability benefits of Alibaba.

Container Service for Kubernetes (ACK) for Content Management
With ACK, Alibaba has made it easier for organizations to deploy cloud-native applications. Having a cloud-native CMS, for instance, allows organizations to leverage the benefits of containers and apply them to running a content management system and CMS-driven web and mobile apps.

As companies look for ways to improve the digital customer experience by publishing content to multiple channels, a cloud-native CMS can help in a number of ways.

It allows for lower upfront costs compared to on-premise solutions, more accessibility for content authors at any time and on any device, developer-friendly tools and services, and the capacity to scale as required.

Container Service for Kubernetes (ACK) allows enterprises to deploy cloud-scalable CMS environments and serverless digital experience applications quickly and cost effectively.

Alibaba Cloud OSS Overview

Alibaba Cloud's OSS is a versatile, economical, and safe way of storing data objects in the cloud. The name stands for "Object Storage Service," and it provides a simple organization for storing and retrieving information. Unlike a database, it doesn't do anything fancy. It does one thing: letting you store as much data as you want. Its data is stored redundantly across multiple sites. That makes the chances of data loss or downtime tiny, far lower than they would be if you used on-premises hardware. It has good security, with options to make it still stronger.

OSS vs. other services
OSS isn't a database, in the sense of a service with a query language for adding and extracting data fields. If that's what you want, you should look at Alibaba Cloud's RDS. With RDS, you can choose from several different SQL engines. Alternatively, you can host a database on your own servers, with all the responsibility that entails. OSS is more economical than RDS if you don't need all the features of a database.

OSS also isn't a full-blown file system. It consists of buckets which hold objects, but you can't nest buckets inside other buckets. For a general-purpose, hierarchical file system, you should look at Alibaba Cloud's Apsara File Storage NAS or set up a virtual machine and use its file directories. If you set up a cloud VM using a service like ECS, you pay for storage as part of the VM's ongoing costs.

Alibaba Cloud OSS is optimized for "write once, read many" operation. When you update an object, you replace the whole object. If your data requires constant modifications, it's better to use RDS, NAS, or the local file system of a VM.

The basics of OSS
The organization of information in OSS is very simple. Information consists of objects, which are stored in buckets. A bucket belongs to one account. An object is just a bunch of data plus some metadata describing it. Metadata are key-value pairs. OSS works with the metadata, but the object data is just a collection of bytes as far as it's concerned.

You can save multiple versions of an object, letting you go back to an earlier version if you change or delete something by mistake. Within a bucket, each object is identified by its key and, when versioning is enabled, a version ID.
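
The bucket, object, metadata and version model can be pictured with a small in-memory class. This is a conceptual sketch, not the OSS SDK: the class and method names are invented, and real version identifiers are opaque strings rather than the sequential integers used here.

```python
class VersionedBucket:
    """Toy bucket that keeps every version of every object."""

    def __init__(self, name):
        self.name = name
        self._versions = {}  # key -> list of (data, metadata) tuples

    def put(self, key, data, **metadata):
        """Store a new version of the object and return its version number."""
        versions = self._versions.setdefault(key, [])
        # The whole object is replaced on each write; old versions are kept.
        versions.append((data, dict(metadata)))
        return len(versions)

    def get(self, key, version=None):
        """Return (data, metadata) for the latest or a specific version."""
        versions = self._versions[key]
        number = len(versions) if version is None else version
        if not 1 <= number <= len(versions):
            raise KeyError(f"{key} has no version {number}")
        return versions[number - 1]
```

Writing `logo.png` twice and then calling `get("logo.png", version=1)` returns the first upload, which is the recovery path the paragraph above describes.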

You can specify the geographic region a bucket is stored in. That lets you keep latency down, and it may help to meet regulatory requirements.

Normally OSS reads or writes whole objects, but OSS Select allows retrieving just part of an object. This is a new feature available to all customers.

Uses for OSS
Wherever an application calls for retrieving moderate to large units of data that don't change often, OSS can be a great choice.

- Backup: OSS can hold a backup copy of a website, a database, or a whole disk. With very high durability, it gives confidence your data won't be lost.
- Disaster recovery: A complete, up-to-date disk image can be stored on OSS. If a disaster makes a primary server unavailable, the saved image is available to launch another server and keep business operations going.
- Application data: OSS can hold large amounts of data for use by a web or mobile application. For instance, it could hold images of all the products a business sells or geographic data about its locations.
- Website data: OSS can host a complete static website (one which doesn't require running any code on the server). To set it up, you tell OSS to configure a bucket as a website endpoint.

Access control and security
Buckets and objects are secure by default, and you can make them more secure by applying the right options. You have control over how they're shared, and you can encrypt the data.

The system of bucket policies gives you detailed control over access. You can limit access by account, IP address, or membership in an access group. Multi-factor authentication can be mandated. Read access can be open to everyone while write access is restricted to just a few users. If you prefer, you can use Alibaba Cloud Resource Access Management (RAM) to manage access.
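
To show how such rules combine, here is a simplified policy evaluator. It is a sketch under stated assumptions, not the OSS policy language: the statement format, the field names and the "an explicit Deny wins, otherwise default deny" logic are chosen for illustration.

```python
import ipaddress

def _matches(statement, account, source_ip, action):
    """Check whether one policy statement applies to the request."""
    account_ok = "*" in statement["accounts"] or account in statement["accounts"]
    action_ok = action in statement["actions"]
    network = ipaddress.ip_network(statement.get("cidr", "0.0.0.0/0"))
    ip_ok = ipaddress.ip_address(source_ip) in network
    return account_ok and action_ok and ip_ok

def is_allowed(policy, account, source_ip, action):
    """Default deny; a matching Deny overrides any matching Allow."""
    matching = [s for s in policy if _matches(s, account, source_ip, action)]
    if any(s["effect"] == "Deny" for s in matching):
        return False
    return any(s["effect"] == "Allow" for s in matching)

# Example policy: everyone may read, only "editor" may write,
# and only from one address range (documentation-reserved addresses).
EXAMPLE_POLICY = [
    {"effect": "Allow", "accounts": ["*"], "actions": ["read"]},
    {"effect": "Allow", "accounts": ["editor"], "actions": ["write"],
     "cidr": "203.0.113.0/24"},
]
```

With `EXAMPLE_POLICY`, a read from any address succeeds, while a write succeeds only for `editor` connecting from inside `203.0.113.0/24`.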

For additional protection of data, you can use server-side or client-side encryption. That way, even if someone steals a password and gets access to your objects, they won't be able to do anything with them.

Getting started
If you have an Alibaba Cloud account, setting up OSS usage is straightforward. From the console, select the OSS service. You'll be given the option to create a new bucket. You need to give it a unique name and select a region. There are a number of options you can then choose, including logging and versioning. Next, you can give permission to other accounts to access the bucket. The console will let you review your settings, after which you confirm the creation of the bucket.

Next, you can upload objects to the bucket and set permissions and properties for them. If you're using OSS through other Alibaba Cloud services, you may never need to upload directly. You'll still want to check the OSS console occasionally to verify that your usage and costs are in the range you expected and that bucket authorizations are what they should be.

When deciding whether OSS is the best way to handle the storage for your application, evaluate how it stacks up against your needs. If you don't require a full file system and you don't need to rewrite data often, OSS can be a very cost-effective choice. It provides high data availability and security at a very reasonable price.

Delivering Results Under Tight Deadlines

"Quality, Budget, or Time - pick any two!"

That is the rule of thumb when it comes to delivering projects in the Software Engineering world.

In the real world, no one likes that. No sane project manager wants to compromise on quality, go over budget, or miss a deadline!

As a part of Engineering at STARZPLAY, we are often required to deliver results under tight deadlines. Recently we had to pull through a project under extremely unrealistic deadlines, for something that would ideally have taken at least 3X the given time. So how did we manage to maintain quality, stay within budget, and still meet our deadline?

Here are some of my takeaways from being a part of this project:

- Parallelism & Sequence:
Having clear expectations is the basic necessity for planning well. Once the plan is on the table, the most important factor for rapid delivery is identifying the tasks that can be executed in parallel and zeroing in on the order (sequence) of execution for the rest. That is where time is saved.

- Automation:
Replace manual efforts around deployment, and the provisioning, cloning, and sharing of environments with automation scripts & tools as far as possible.

- Flat Hierarchy:
People are important for any successful execution. However, it's not just the number of people but skilled, productive people who make the difference, because we are looking at meeting a tight deadline and not grooming a team for the future (that is another paradigm and a subject for another day). Each member of the team is considered a leader who owns their tasks. Instead of one large team, we kept the team small and fully autonomous, with carefully picked members who brought a variety of skill sets.

- Tracking Time & Direction:
One member of the team tracks actions and decisions, ensures follow-ups on a daily basis, and documents the outcomes. This keeps the members focussed in the desired direction and within budget, and helps in reconfiguring priorities when needed.

- Scope & Acceptance Criteria:
So we keep our quality, we stick to budget, and we meet our deadline, but what's delivered on that deadline is continually up for discussion and consideration. That is where scope comes in. One should be clear on the acceptance criteria for every deliverable, because optimization is a process that can go on indefinitely. To be able to close a project on time, a 'scope' for every task needs to be agreed upon.

- Tweak your process not the outcome:
For high-velocity projects, it helps to give more weight to `people and interactions` than to following a process just because it's a "process"! Quoting Steve Jobs - "Customers don't measure you on how you do it or how hard you try, they measure you on what you deliver".
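To make the Parallelism & Sequence takeaway a little more concrete, here is a minimal sketch of how a plan can be split into steps whose tasks can run side by side. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names and dependencies are hypothetical and are not our actual project plan.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_batches(dependencies):
    """Group tasks into batches; tasks within one batch can run in parallel.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unblock successors
    return batches


# Hypothetical project plan, for illustration only.
plan = {
    "provision_env": set(),
    "build_api": set(),
    "build_ui": set(),
    "integration_tests": {"provision_env", "build_api", "build_ui"},
    "release": {"integration_tests"},
}

for step, batch in enumerate(parallel_batches(plan), start=1):
    print(f"Step {step}: {', '.join(batch)}")
```

Each printed step is a group of tasks with no unmet dependencies, so the number of steps, not the number of tasks, is what bounds the timeline.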

How DevOps practices reinforce AI/ML

The DevOps Success Story

It's mid-2020 and the software development life cycle has reached an appreciable level of maturity. Two things that stand out now are -

  • DevOps culture & practices have evolved immensely. From version control to build phases, CI/CD, automation tests, deployment orchestration, cloud infrastructure-as-code - all these processes have created a synergy for successful software delivery.

  • The tool ecosystem for developing software applications is remarkably rich. And DevOps processes have revolved around these tools to 'Automate everything humanly possible!'

For the uninitiated, DevOps is defined as "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality".


DevOps has helped software businesses succeed in three major ways -

  • Collaboration: DevOps has helped break silos between software developers and operation engineers. So it's a culture that has promoted "We build it, we run it," not the "Throw the code over the wall" paradigm.

  • Speed: DevOps practices heavily advocate the "Automate Everything" ideology and this leads to faster time to market.

  • Reliability: Use of standardised CI/CD pipelines leads to near-zero errors and to reproducibility. So things fail fast (which is good, and can be fixed) or do not fail at all!

Enter Artificial Intelligence

In parallel to the rise of Cloud-native software services & DevOps, one more area that is making waves in the technology circles is Artificial Intelligence (AI).

Minsky and McCarthy, often called the fathers of AI, described artificial intelligence as any task performed by a program or a machine that, if a human carried out the same activity, we would say the human had to apply intelligence to accomplish the task.

AI has many sub-disciplines and helps solve problems in areas such as planning, learning, reasoning, problem solving, knowledge representation, perception, motion, and manipulation and, to a lesser extent, social intelligence and creativity.



On the one hand, DevOps is the de facto standard for application development. On the other, modern ML (Machine Learning) and AI do not have a standard tooling or process ecosystem. This makes sense for a number of reasons -

  • AI research was confined to Universities & Labs. They had their own development methodologies including CRISP-DM and Microsoft Team Data Science Process (TDSP).

  • The best practices have not emerged as of now because the tools are changing rapidly and there is a need for a single body of knowledge here.

The excerpt below from the Microsoft Azure Blog throws more light on the topic -

"AI/ML projects

Like DevOps, these methodologies are grounded in principles and practices learned from real-world projects. AI/ML teams use an approach unique to data science projects where there are frequent, small iterations to refine the data features, the model, and the analytics question. It's a process intended to align a business problem with AI/ML model development. The release process is not a focus for CRISP-DM or TDSP and there is little interaction with an operations team. DevOps teams (today) are not yet familiar with the tools, languages, and artifacts of data science projects.

DevOps and AI/ML development are two independent methodologies with a common goal: to put an AI application into production. Today it takes the effort to bridge the gaps between the two approaches. AI/ML projects need to incorporate some of the operational and deployment practices that make DevOps effective and DevOps projects need to accommodate the AI/ML development process to automate the deployment and release process for AI/ML models.

DevOps for AI/ML

DevOps for AI/ML has the potential to stabilize and streamline the model release process. It is often paired with the practice and toolset to support Continuous Integration/Continuous Deployment (CI/CD). Here are some ways to consider CI/CD for AI/ML workstreams:

  • The AI/ML process relies on experimentation and iteration of models and it can take hours or days for a model to train and test. Carve out a separate workflow to accommodate the timelines and artifacts for a model build and test cycle. Avoid gating time-sensitive application builds on AI/ML model builds.

  • For AI/ML teams, think about models as having an expectation to deliver value over time rather than a one-time construction of the model. Adopt practices and processes that plan for and allow a model lifecycle and evolution.

  • DevOps is often characterized as bringing together business, development, release, and operational expertise to deliver a solution. Ensure that AI/ML is represented on feature teams and is included throughout the design, development, and operational sessions.

Establish performance metrics and operational telemetry for AI/ML

Use metrics and telemetry to inform what models will be deployed and updated. Metrics can be standard performance measures like precision, recall, or F1 scores. Or they can be scenario specific measures like the industry-standard fraud metrics developed to inform a fraud manager about a fraud model's performance. Here are some ways to integrate AI/ML metrics into an application solution: 

  • Define model accuracy metrics and track them through model training, validation, testing, and deployment.

  • Define business metrics to capture the business impact of the model in operations. For an example see R notebook for fraud metrics.

  • Capture data metrics, like dataset sizes, volumes, update frequencies, distributions, categories, and data types. Model performance can change unexpectedly for many reasons and it's expedient to know if changes are due to data.

  • Track operational telemetry about the model:  how often is it called? By which applications or gateways? Are there problems? What are the accuracy and usage trends? How much compute or memory does the model consume?

  • Create a model performance dashboard that tracks model versions, performance metrics, and data sets.
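As a small illustration of the standard measures named above (my own sketch, not part of the Azure excerpt), precision, recall, and F1 for a binary classifier can be computed from true and predicted labels in plain Python. The function name and the sample labels are made up for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Logging a dictionary like this for every model version at training, validation, and deployment time gives the dashboard something consistent to plot.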

AI/ML models need to be updated periodically. Over time, and as new and different data becomes available — or customers or seasons or trends change — a model will need to be re-trained to continue to be effective. Use metrics and telemetry to help refine the update strategy and determine when a model needs to be re-trained.
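One simple way to turn that idea into an automated check (again my own sketch, not part of the excerpt) is to compare recent metric values against the score recorded at deployment and flag the model when the average drops too far. The threshold of 0.05 and the minimum of three samples are arbitrary assumptions, to be tuned per model.

```python
from statistics import mean


def needs_retraining(baseline_score, recent_scores, max_drop=0.05, min_samples=3):
    """Return True when the recent average falls too far below the baseline."""
    if len(recent_scores) < min_samples:
        return False  # not enough telemetry yet to decide
    return baseline_score - mean(recent_scores) > max_drop


# Hypothetical weekly F1 scores for a model deployed with F1 = 0.90.
degrading = needs_retraining(0.90, [0.88, 0.84, 0.80])
stable = needs_retraining(0.90, [0.89, 0.90, 0.88])
print(degrading, stable)
```

A check like this can run on a schedule and raise an alert or kick off the training pipeline, rather than relying on someone noticing the drift.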

Automate the end-to-end data and model pipeline

The AI/ML pipeline is an important concept because it connects the necessary tools, processes, and data elements to produce and operationalize an AI/ML model. It also introduces another dimension of complexity for a DevOps process. One of the foundational pillars of DevOps is automation, but automating an end-to-end data and model pipeline is a byzantine integration challenge.

Workstreams in an AI/ML pipeline are typically divided between different teams of experts where each step in the process can be very detailed and intricate. It may not be practical to automate across the entire pipeline because of the difference in requirements, tools, and languages. Identify the steps in the process that can be easily automated like the data transformation scripts, or data and model quality checks. Consider the following workstreams:  




Data Analysis   

Includes data acquisition and focusing on exploring, profiling, cleaning, and transforming. Also includes enriching, and staging data for modeling.

Develop scripts and tests to move and validate the data. Also create scripts to report on the data quality, changes, volume, and consistencies.


Includes feature engineering, model fitting, and model evaluation.

Develop scripts, tests, and documentation to reproduce the steps and capture model outputs and performance.

Release Process

Includes the process for deploying a model and data pipeline into production.

Integrate the AI/ML pipeline into the release process


Includes capturing operational and performance metrics.

Create operational instrumentation for the AI/ML pipeline. For subsequent model retraining cycles, capture and store model inputs, and outputs.

Model Re-training and Refinement

Determine a cadence for model re-training.

Instrument the AI/ML pipeline with alerts and notifications to trigger retraining.


Develop an AI/ML dashboard to centralize information and metrics related to the model and data. Include accuracy, operational characteristics, business impact, history, and versions.


An automated end-to-end process for the AI/ML pipeline can accelerate development and drive reproducibility, consistency, and efficiency across AI/ML projects."
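The excerpt suggests scripts that report on data quality, changes, and volume. A minimal sketch of such a report, assuming a column arrives as a plain Python list with `None` marking missing values, could look like this. The function names and the sample payment data are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: size, missing values, types, and top categories."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "top_values": Counter(present).most_common(3),
    }


def profile_changes(previous, current):
    """List the profile fields that differ between two snapshots."""
    return sorted(key for key in current if previous.get(key) != current[key])


# Two made-up snapshots of a payment-method column.
yesterday = profile_column(["card", "card", "cash", None])
today = profile_column(["card", "cash", "cash", "wire", None, None])
print(profile_changes(yesterday, today))
```

Storing one profile per pipeline run makes it easier to tell whether a change in model performance lines up with a change in the data.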

The Challenges

The problems plaguing AI/ML engineers and data scientists are the lack of toolchains, automation pipelines, knowledge of standard model training frameworks, and easy hardware access: different teams need different numbers of GPUs, FPGAs, CPUs, TPUs, or even IPUs.

Here are some of the challenges, put as questions -

  • Who manages and maintains these resources for AI teams?

  • Who administers  hardware resources? 

  • Who prioritizes the jobs?

  • How is the sanity of resource allocations maintained? 

  • Who supports automation scripting and defining pipelines?

  • Who handles security issues, authentication & authorization?

  • Who ensures all the accelerators and nodes are optimized?

  • Who profiles slow applications and helps the Data Scientists?

  • Who maintains the toolchains and cloud servers for AI teams?

  • Who maintains any other infrastructure or systems specific issues?

  • So who is the one with the cape? 

DevOps the Superhero!

The answer to all this is, again, DevOps. But the DevOps of the application development era would not fit in here as it is! This is another beast and needs some more superpowers in addition to its core strengths. Knowledge of newer tools and practices like Kubeflow, TensorFlow, Google ML-Ops, Azure AI pipelines, and AWS SageMaker Studio will be required. And it's high time all this knowledge is aggregated and standardised. I will follow up with more soon; until then, enjoy this insightful white paper with some research findings along these lines -

AWS Certified Machine Learning Specialty

I must admit that the sense of accomplishment after clearing the "AWS Certified Machine Learning Specialty" exam, and the adrenaline rush when you hit the submit button, are slightly addictive!

This one is special because it is my first certification from AWS. Being a DevOps Engineer, it would have been easier for me to take the "AWS Certified Solutions Architect" or the "DevOps Engineer" track instead of exploring the less familiar terrain of "Machine Learning". However, stepping out of my comfort zone to learn something new made it even more fulfilling.

This post is about the learning path I followed in the run-up to this certification.

Machine Learning in itself is a vast field, and this course only scratches the surface and gets you started.

In summary, you will need to know the following to clear this exam:

  • How to identify the problem (Supervised, Unsupervised, Classification, Regression)

  • How to choose the algorithm (Linear models, CNN, RNN, Tree Ensemble)

  • How to train your model

  • Data preparation & transformation

  • How to use AWS ecosystem to solve the above

Distribution of Questions Asked:

The topic-wise weightage of the questions asked was as follows:



  • Machine Learning

  • Deep Learning


  • AWS SageMaker


  • About AWS Services


  • The total time to complete the exam was 3 hours

  • There were 65 questions asked

Preparing for the exam:

  1. I got started by watching `AWS Tech Talk` and `Deep Dive` videos on YouTube, not just about ML but about related services as well:

  2. Followed the free training videos and tutorials from AWS (not all of them though):

  3. ML/DL requires revisiting some high-school and college-level mathematics. Linear Algebra, Probability & Statistics, Multivariable Calculus, and Optimization worked for me.

  4. Data Visualisation using Jupyter notebooks.

  5. Regression and gradient descent.

  6. DL Models - CNN, RNN

  7. Worked on understanding the following concepts-

    1. Supervised, unsupervised and reinforcement learning.

    2. Purpose of training, validation and testing data.

    3. Various ML Algorithms & Model Types-

      1. Logistic Regression

      2. Linear Regression

      3. Support Vector Machines

      4. Decision Trees / Random Forests

      5. K-means Clustering

      6. K-Nearest Neighbours
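For the regression and gradient descent item in the list above, a tiny worked example helped me more than the formulas alone. The sketch below fits a straight line by batch gradient descent on mean squared error in plain Python. The learning rate, epoch count, and data points are arbitrary choices for illustration, not recommended settings.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every point with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

On this noise-free data the fit recovers a slope close to 2 and an intercept close to 1. With real data, scaling the features first usually matters more than tuning the learning rate.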

Once the above concepts are understood, go ahead and try out the following AWS services-

  • SageMaker

  • Rekognition

  • Polly

  • Transcribe

  • Lex

  • Translate

  • Comprehend

  • S3 including how to secure your data

  • Athena including performance

  • Kinesis Firehose and Analytics 

  • Elastic MapReduce (EMR)

  • AWS Glue

  • QuickSight 

Those of you who regularly use AWS services won't have much of a problem grasping these.

Finally, work through plenty of practice exam questions, like the ones from the link below:

You should also have a go at the official practice exam before going for the main exam. So that was it, folks. I am still learning this discipline, and it's all volatile right now. I will feel more confident with ML once I start applying it in some real-world applications. I will write about those experiences as they come by.