Amazon EC2 - A web service that provides secure, resizable compute capacity in the cloud
Amazon EBS - An easy-to-use, high-performance block storage service designed for use with Amazon Elastic Compute Cloud
Azure Virtual Machines - A service to provision Windows and Linux virtual machines in seconds
Azure Disk Storage - A high-performance, durable block storage for Azure Virtual Machines
Google Cloud Compute Engine - A customizable compute service that lets you create and run virtual machines on Google's infrastructure
Networking
Amazon VPC - A service that lets you launch AWS resources in a logically isolated virtual network that you define
Amazon ELB - A service that automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and Lambda functions
Azure Virtual Network - The fundamental building block for your private network in Azure
access to high-performance networking
Azure Load Balancer - A service that allows you to distribute traffic to your backend virtual machines
Google Cloud VPC - A virtual version of a physical network that is implemented inside of Google's production network by using Andromeda
Cloud Load Balancing - A fully distributed, software-defined, managed service for all your traffic
Application Hosting Platform (PaaS)
Azure App Service - An HTTP-based service for hosting web applications, REST APIs, and mobile back ends
AWS Elastic Beanstalk - An easy-to-use service for deploying and scaling web applications and services
Google Cloud App Engine - A fully managed, serverless platform for developing and hosting web applications at scale
Cloud Emulators
LocalStack - A fully functional local cloud stack to develop and test your cloud and serverless apps offline
HashiCorp Terraform - An infrastructure as code tool that lets you build, change, and version infrastructure safely and efficiently
OpenTofu - An open-source alternative to Terraform
Pulumi - An infrastructure as code platform that allows you to use familiar programming languages and tools to build, deploy, and manage cloud infrastructure
Configuration Management & Automation
Ansible - An open source IT automation engine that automates provisioning, configuration management, application deployment, orchestration, and many other IT processes
cloud-init - The standard for customising cloud instances
Image Building
HashiCorp Packer - A tool for creating identical machine images for multiple platforms from a single source configuration
Terragrunt - A thin wrapper that provides extra tools for keeping your configurations DRY, working with multiple Terraform modules, and managing remote state
Terratest - A Go library that provides patterns and helper functions for testing infrastructure
Atmos - A universal tool for DevOps and Cloud Engineering that orchestrates workflows and simplifies the management of infrastructure
Distributed Version Control - A form of version control where the complete codebase, including its full history, is mirrored on every developer's computer
Git - A free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency
local repository, remote repository
branch, tag, worktree
push, pull, fetch, rebase, reset, stash
staging, commit
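As a rough illustration of what staging does under the hood: Git stores each staged file as a content-addressed "blob" object whose ID is a hash of a short header plus the file bytes. The sketch below assumes the default SHA-1 object format (repositories created with the SHA-256 format will produce different IDs), and the function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git would assign to a blob with this content.

    Assumes the SHA-1 object format. The hashed payload is the header
    "blob <size>\\0" followed by the raw file bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: echo 'test content' | git hash-object --stdin
    print(git_blob_id(b"test content\n"))
```

Because the ID depends only on the content, two identical files share one blob, and any change to a file yields a new object rather than modifying the old one.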
Git LFS - An open source Git extension for versioning large files
Git Lint - A command line interface for linting Git commits by ensuring you maintain a clean, easy to read, debuggable, and maintainable project history
git-cliff - A highly customizable changelog generator
pre-commit - A framework for managing and maintaining multi-language pre-commit hooks
TortoiseGit - A Windows Shell interface to Git, based on TortoiseSVN
Git hosting services
GitLab SCM - The single source of truth for collaborating on code and projects
Gitea - A painless self-hosted all-in-one software development service, including Git hosting, code review, team collaboration, package registry and CI/CD
Codeberg - A community-led effort that provides Git hosting and other services for free and open source projects
Forgejo - A self-hosted lightweight software forge
Soft Serve - A tasty, self-hostable Git server for the command line
Azure Repos - A set of version control tools that you can use to manage your code
GitHub - The AI-powered developer platform to build, scale, and deliver secure software
Practices
Trunk Based Development - A source-control branching model in which developers collaborate on code in a single branch called 'trunk' and resist any pressure to create other long-lived development branches by employing documented techniques
Conventions
Keep a Changelog - A convention for maintaining a changelog: a file which contains a curated, chronologically ordered list of notable changes for each version of a project
BusyBox - A single small executable that combines tiny versions of many common UNIX utilities
The Open Container Initiative (OCI) - An open governance structure for the express purpose of creating open industry standards around container formats and runtimes
Containers for Development
Development Containers - An open specification for enriching containers with development-specific settings, tools, and configuration
Kubernetes - An open-source system for automating deployment, scaling, and management of containerized applications
Architecture
Master node
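kube-apiserver - Exposes the Kubernetes API and acts as the front end of the control plane
kube-scheduler - Assigns newly created Pods to nodes
kube-controller-manager - Runs the controller processes (control loops) that reconcile the cluster's actual state with its desired state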
Compute node
kubelet - Watches the API server for Pods assigned to its node and makes sure they are running
cAdvisor - Collects resource usage metrics about the containers running on that node
kube-proxy - Watches the API server for Pod and Service changes in order to keep the node's network rules up to date
container runtime - Manages container images and runs containers on that node
Interface Standards
CNI (Container Networking Interface)
Calico - A networking and security solution that enables Kubernetes workloads and non-Kubernetes/legacy workloads to communicate seamlessly and securely
Cilium - An open source, cloud native solution for providing, securing, and observing network connectivity between workloads, fueled by the revolutionary Kernel technology eBPF
CSI (Container Storage Interface)
CRI (Container Runtime Interface)
CRI-O - An implementation of the Kubernetes CRI (Container Runtime Interface) to enable using OCI (Open Container Initiative) compatible runtimes
liveness probe - A probe the kubelet uses to know when to restart a container
requests and limits
eviction
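Requests, limits, and eviction are linked through a Pod's Quality of Service (QoS) class, which Kubernetes derives from the resources stanza of each container. The following is a minimal sketch of that derivation, not the kubelet's actual code: the `Container` type is a hypothetical stand-in, quantities are compared as plain strings (so `"500m"` and `"0.5"` would wrongly count as different), and only `cpu` and `memory` are considered.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical, simplified stand-in for a container's resources stanza."""
    requests: dict = field(default_factory=dict)
    limits: dict = field(default_factory=dict)


def qos_class(containers):
    """Classify a Pod as BestEffort, Burstable, or Guaranteed."""
    # BestEffort: no container sets any request or limit.
    if all(not c.requests and not c.limits for c in containers):
        return "BestEffort"
    # Guaranteed: every container has cpu and memory limits,
    # and each request equals the corresponding limit.
    for c in containers:
        for resource in ("cpu", "memory"):
            limit = c.limits.get(resource)
            if limit is None:
                return "Burstable"
            # An omitted request defaults to the limit.
            if c.requests.get(resource, limit) != limit:
                return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    pod = [Container(requests={"cpu": "250m"}, limits={"cpu": "500m"})]
    print(qos_class(pod))
```

As a rough guide, BestEffort Pods are the likeliest candidates for eviction under node resource pressure and Guaranteed Pods the least likely, although the kubelet's real ranking also weighs Pod priority and usage relative to requests.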
Deployment, ReplicaSet, StatefulSet, DaemonSet
Services, Load Balancing & Networking
Kubernetes network model - A set of fundamental requirements and principles for networking in a Kubernetes cluster
Service, Ingress, Ingress Controllers
Storage - A powerful volume subsystem with an API that abstracts how storage is provided and consumed
PersistentVolume, PVC, StorageClass
Configuration - A range of mechanisms that let you inject configuration data into the Pods that run your applications
Secret, ConfigMap
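One practical difference between the two objects: values under a Secret's `data` field are base64-encoded, whereas ConfigMap `data` values are plain strings. The sketch below shows that encoding step with the Python standard library; the helper names are invented for this example, and it is worth stressing that base64 is an encoding, not encryption.

```python
import base64


def encode_secret_data(plain: dict) -> dict:
    """Encode plain-text values the way a Secret's `data` field expects."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in plain.items()
    }


def decode_secret_data(data: dict) -> dict:
    """Reverse the encoding; anyone who can read the Secret can do this."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


if __name__ == "__main__":
    encoded = encode_secret_data({"username": "admin"})
    print(encoded)  # {'username': 'YWRtaW4='}
    print(decode_secret_data(encoded))
```

Since decoding is trivial, protecting Secrets in practice relies on access control and, where configured, encryption at rest rather than on the encoding itself.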
Security & Policy
Kubernetes RBAC - A method of regulating access to computer or network resources based on the roles of individual users within an enterprise
PodDisruptionBudget - An object that limits the number of concurrent disruptions that your application experiences, allowing for high availability
Security context - A definition of privilege and access control settings for a Pod or Container
Autoscaling
HPA - The component that automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization
Cluster Autoscaler - A tool that automatically adjusts the size of the Kubernetes cluster
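The core of the HPA's decision can be written as one formula: desired replicas = ceil(current replicas × current metric value / target metric value), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that core; it deliberately ignores min/max replica bounds, stabilization windows, and the handling of not-yet-ready Pods, and the function name is an assumption for this example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 replicas averaging 200m CPU against a 100m target: scale out.
    print(desired_replicas(4, 200, 100))  # 8
    # 4 replicas averaging 50m CPU against a 100m target: scale in.
    print(desired_replicas(4, 50, 100))   # 2
```

The Cluster Autoscaler works at a different level: it adds or removes nodes when Pods cannot be scheduled or nodes are underused, so the two are typically used together.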
Seabird - The native desktop app that simplifies working with Kubernetes
Headlamp - A user-friendly Kubernetes UI focused on extensibility
Local K8s
Minikube - A tool that lets you run Kubernetes locally
Kind - A tool for running local Kubernetes clusters using Docker container “nodes”
K8s Operators
Prometheus Operator - The operator that creates/configures/manages Prometheus clusters atop Kubernetes
kube-prometheus - A collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules combined with documentation and scripts to provide easy to operate end-to-end Kubernetes cluster monitoring
Serverless Computing - A cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers
Azure Kubernetes Service (AKS) - A fully managed Kubernetes service for deploying and managing containerized applications
Simplified Container Hosting
Amazon Elastic Container Service - A fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications
AWS Fargate - A serverless compute engine for containers that works with both ECS and EKS
Azure Container Apps - A fully managed serverless container service built on Kubernetes
Google Cloud Run - A managed compute platform that lets you run containers that are automatically scaled
Function as a Service (FaaS)
AWS Lambda - A serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers
Azure Functions - An event-driven, serverless compute platform that helps you develop more efficiently using the programming language of your choice
Google Cloud Run Functions - A serverless execution environment for building and connecting cloud services
KEDA (Kubernetes Event-driven Autoscaling) - A single-purpose and lightweight component that can be added into any cluster to provide event-driven scale for any container running in the environment
Dapr (Distributed Application Runtime) - A portable, event-driven runtime that makes it easy for any developer to build resilient, stateless, and stateful applications that run on the cloud and edge and embraces the diversity of languages and developer frameworks
Serverless Computing
OpenFaaS - A framework that makes it easy for developers to deploy event-driven functions and microservices to Kubernetes
Knative - A Kubernetes-based platform to build, deploy, and manage modern serverless workloads
Service Mesh & Discovery
Istio - An open source service mesh that layers transparently onto existing distributed applications
Kiali - The service mesh observability and configuration tool for Istio
Linkerd - An ultralight, security-first service mesh for Kubernetes
HashiCorp Consul - A service networking solution to connect and secure services across any runtime platform and public or private cloud
Traefik Mesh - A straight-forward, easy to configure, and non-invasive service mesh
Edge Proxies & Ingress
Envoy Proxy - An open source edge and service proxy
Traefik Proxy - A leading modern open source reverse proxy and ingress controller
Azure Artifacts - A service that enables you to create and share Maven, npm, NuGet, and Python package feeds from public and private sources
GitOps Style CD
Argo CD - A declarative, GitOps continuous delivery tool for Kubernetes
FluxCD - A tool for keeping Kubernetes clusters in sync with sources of configuration (like Git repositories), and automating updates to configuration when there is new code to deploy
Cloud-Native Application Delivery
Open Application Model - A specification for describing applications so that they can be deployed and managed across any platform
KubeVela - A modern software delivery platform that makes deploying and operating applications across today's hybrid, multi-cloud environments easier, faster and more reliable
Flagger - A progressive delivery tool that automates the release process for applications running on Kubernetes
Site Reliability Engineering - A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems
Service Level Objectives (SLOs) - A target value or range of values for a service level that is measured by a service level indicator (SLI)
Dickerson's Hierarchy of Service Reliability - A model that illustrates the foundational elements required to build and maintain reliable services, often visualized as a pyramid
The Four Golden Signals - The four key metrics (Latency, Traffic, Errors, and Saturation) that Google SREs use for monitoring user-facing systems
Ishikawa diagram - A causal diagram created by Kaoru Ishikawa that shows the potential causes of a specific event
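An SLO becomes actionable once it is turned into an error budget: the amount of unreliability the target still permits. The sketch below shows the arithmetic for both a time-based and a request-based budget; the function names and the 30-day window are assumptions chosen for illustration, not part of any standard.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime in the window for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_consumed(slo_percent, total_requests, failed_requests):
    """Fraction of a request-based error budget already spent (1.0 = exhausted)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all.
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_consumed(99.9, 1_000_000, 250), 2))
```

Here the error rate used as the SLI corresponds to the "Errors" signal among the Four Golden Signals; a latency SLO would be computed the same way, counting requests slower than a threshold as failures.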
Observability - A measure of how well internal states of a system can be inferred from knowledge of its external outputs
Instrumentation Libraries
OpenTelemetry - A vendor-neutral open source Observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs
Micrometer - A metrics instrumentation library for JVM-based applications
Tools
Uptime Kuma - An easy-to-use self-hosted monitoring tool
node-exporter - An exporter for hardware and OS metrics exposed by *NIX kernels
blackbox-exporter - A tool that allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC
Grafana Alloy - An open source OpenTelemetry collector with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles
Fluent Bit - A super fast, lightweight, and highly scalable logging, metrics, and traces processor and forwarder
Fluentd - An open source data collector, which lets you unify the data collection and consumption for a better use and understanding of data
Filebeat - A lightweight shipper for forwarding and centralizing log data
Logstash - An open source server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite "stash"
Telegraf - An open source server agent that helps you collect metrics from your stacks, sensors, and systems
Metricbeat - A lightweight shipper that you can install on your servers to periodically collect metrics from the operating system and from services running on the server
rsyslog - The rocket-fast system for log processing
Vendor-specific Tools
Azure Monitor Agent - The agent that collects monitoring data from the guest operating system of Azure and hybrid virtual machines
CloudWatch Agent - The agent you can use to collect both system-level metrics and log files from Amazon EC2 instances and on-premises servers
Grafana Tempo - An open source, easy-to-use and high-scale distributed tracing backend
TraceQL - A query language designed for selecting traces
Elasticsearch - An open source distributed, RESTful search and analytics engine, scalable data store, and vector database
Elastic Common Schema - An open source specification, developed with support from the Elastic user community, that defines a common set of fields for storing event data in Elasticsearch
Ingest pipelines - A feature that lets you perform common transformations on your data before indexing
Dissect and Grok - The processors that let you extract structured fields out of a single text field
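To make the idea of extracting structured fields out of a single text field concrete, here is an analogous sketch using Python's `re` module with named groups. This is not Grok or Dissect syntax and does not call Elasticsearch; the log line format (a common access log layout) and the field names are assumptions for illustration.

```python
import re

# Named groups play the role that named captures play in a Grok pattern.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def extract_fields(line):
    """Return a dict of structured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the status code so it can be filtered and aggregated numerically.
    fields["status"] = int(fields["status"])
    return fields


if __name__ == "__main__":
    sample = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
    print(extract_fields(sample))
```

Grok is regex-based like this sketch, while Dissect splits on fixed delimiters without regular expressions, which tends to be faster when the line layout never varies.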
Graphite - A highly scalable real-time graphing system
Grafana Alerting - A feature that allows you to create and manage alerts for your data
OpenObserve - An open-source observability platform designed for modern applications
Vendor-specific Tools
Azure Monitor - A comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments
Kusto Query Language - A powerful tool to explore your data and discover patterns, identify anomalies and outliers, create statistical models, and more
App Insights - A feature of Azure Monitor that provides an extensible Application Performance Management (APM) service for developers and DevOps professionals
Amazon CloudWatch - A monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers
Visualization Tools
Grafana - The open source data visualization and monitoring solution
Grafonnet - A Jsonnet library for generating Grafana dashboards
Kibana - A free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack
Chaos Engineering - The practice of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production