Big data - Data sets that are too large or complex to be dealt with by traditional data-processing application software
DIKW pyramid - A class of models representing purported structural and/or functional relationships between data, information, knowledge, and wisdom
Garbage in, garbage out - A concept in computer science and information and communications technology that the quality of the output is determined by the quality of the input
Core Data Engineering & Database Concepts
Concurrency control - The mechanism ensuring that correct results for concurrent operations are generated efficiently
CRUD operations - The four basic operations of persistent storage: create, read, update, and delete
Shard - A horizontal partition of data in a database or search engine
ETL - A three-phase process where data is extracted from an input source, transformed, and loaded into an output data container
Algebra - A branch of mathematics that deals with abstract systems, known as algebraic structures, and the manipulation of expressions within those systems
Relational model - An approach to managing data using a structure and language consistent with first-order predicate logic
ACID properties - A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps
Atomicity, Consistency, Isolation, and Durability
Codd's Twelve Rules - A set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd to define what is required from a database management system in order for it to be considered relational
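The CRUD and ACID entries above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, its rows, and the `transfer` helper are hypothetical names invented for this example. It assumes the module's default transaction handling, where using the connection as a context manager commits on success and rolls back on an exception.

```python
import sqlite3

# In-memory database; the "accounts" table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, owner TEXT, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)

# Create: insert two rows.
conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("alice", 100))
conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("bob", 50))
conn.commit()


def balances(conn):
    """Read: return {owner: balance} for every account."""
    return dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))


def transfer(conn, src, dst, amount):
    """Update: move money atomically; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_small = transfer(conn, "alice", "bob", 30)   # within alice's balance
after_ok = balances(conn)
ok_large = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
after_failed = balances(conn)

# Delete: remove one row.
conn.execute("DELETE FROM accounts WHERE owner = ?", ("bob",))
conn.commit()
after_delete = balances(conn)
```

The failed transfer leaves both balances unchanged, which is the atomicity and consistency behavior the ACID entry describes. Isolation and durability are not exercised by a single in-memory connection.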
Languages & Dialects
Structured Query Language (SQL) - A domain-specific language used for managing data held in a relational database management system
Command Categories
DDL - Data Definition Language
DQL - Data Query Language
DML - Data Manipulation Language
DCL - Data Control Language
TCL - Transaction Control Language
SQL Join - A clause that combines columns from one or more tables in a relational database
Transact-SQL - The proprietary extension to SQL used to program and manage SQL Server
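As a rough illustration of the command categories and the join entry above, the sketch below runs one statement of each kind through Python's `sqlite3` module. The `authors` and `books` tables and their contents are hypothetical. SQLite has no `GRANT` or `REVOKE`, so DCL appears only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema (table and column names are invented for this sketch).
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(id)
    );
    """
)

# DML: change the data.
conn.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("Notes on Engines", 1), ("Untitled Draft", None)],
)

# TCL: make the changes permanent.
conn.commit()

# DCL (GRANT / REVOKE) is not supported by SQLite, so it is omitted here.

# DQL with an inner join: only books that have a matching author.
inner_rows = conn.execute(
    "SELECT b.title, a.name FROM books b "
    "JOIN authors a ON a.id = b.author_id ORDER BY b.title"
).fetchall()

# DQL with a left join: every author, with NULL (None) where no book matches.
left_rows = conn.execute(
    "SELECT a.name, b.title FROM authors a "
    "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.name"
).fetchall()
```

The inner join drops the book without an author, while the left join keeps the author without a book and fills the missing title with `None`.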
Database Management Systems (DBMS)
Client-Server RDBMS
PostgreSQL - An object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department
MySQL - The most popular open-source SQL database management system, developed, distributed, and supported by Oracle Corporation
MariaDB Community Server - The open source relational database that is a community-developed fork of MySQL
Distributed SQL
TiDB - An open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads
Embedded / In-Process
SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine
PGlite - A WASM Postgres build packaged into a TypeScript/JavaScript client library that enables you to run Postgres in the browser, Node.js, and Bun
Cloud Services & Platforms
Managed Database Services
Amazon RDS - A collection of managed services that makes it simple to set up, operate, and scale databases in the cloud
Amazon Aurora - A modern relational database service built for the cloud, with MySQL and PostgreSQL compatibility
Azure SQL Database - An intelligent, scalable, relational database service built for the cloud
Object-Relational Mapping (ORM) - A programming technique for converting data between incompatible type systems using object-oriented programming languages
Prisma - A next-generation ORM that makes it easy to build reliable and scalable applications with databases
Hibernate - An object-relational mapping tool for the Java programming language
GORM - The fantastic ORM library for Golang that aims to be developer friendly
DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
Azure Data Studio - A modern open-source, cross-platform hybrid data analytics tool designed to simplify the data landscape
Beekeeper Studio - A modern, easy to use, and good looking SQL editor and database manager
Developer Libraries & Drivers
Vanna.AI - A Python package that uses retrieval augmentation to help you generate accurate SQL queries for your database using LLMs
Psycopg - The most popular PostgreSQL adapter for the Python programming language
Command-Line & Deployment Utilities
sqlcmd utility - A command-line utility for ad hoc, interactive execution of Transact-SQL statements and scripts and for automating T-SQL scripting tasks
sqlpackage - A command-line utility that automates several database development tasks
DAC (Data-tier Applications) - A logical database management concept that defines all of the SQL Server objects associated with a user's database
Monitoring & Analysis
pgBadger - A PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering
CAP theorem - A theorem stating that any distributed data store can provide at most two of the following three guarantees: Consistency, Availability, and Partition tolerance
BASE properties - A database model (Basically Available, Soft state, Eventually consistent) that prioritizes availability over consistency
Data model - An abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities
Data orientation - A perspective of data that emphasizes the data itself, rather than the applications that use the data
Object-relational impedance mismatch - A set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style
Multi-model Databases
Azure Cosmos DB - A fully managed, serverless distributed database for modern app development
Amazon DynamoDB - A fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale
Document Databases
MongoDB - A document database designed for ease of application development and scaling
Cloud Firestore - A cloud-hosted, NoSQL database that your Apple, Android, and web apps can access directly via native SDKs
DocumentDB - A powerful, scalable open-source document database built for modern applications
Key-value Stores
etcd - A distributed, reliable key-value store for the most critical data of a distributed system
Redis - An in-memory data store used by millions of developers as a cache, vector database, document database, and streaming engine
Neo4j - A high-speed graph database with unbounded scale, security, and data integrity
Amazon Neptune - A fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets
Apache HBase - The Hadoop database, a distributed, scalable, big data store
Google Cloud Bigtable - A NoSQL wide-column database service for large analytical and operational workloads
Vector Databases
pgvector - An open-source extension for vector similarity search in Postgres
Weaviate - An open-source vector database that simplifies the development of AI applications
Milvus - A high-performance open-source vector database built to handle billions of vectors
Chroma - The AI-native open-source embedding database
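The core operation these vector databases provide, finding the stored vectors closest to a query vector, can be sketched in plain Python. This is a brute-force illustration with hypothetical three-dimensional embeddings and made-up helper names. The systems listed above generally rely on approximate indexes rather than a full scan, and none of their APIs are shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy embeddings invented for illustration; real embeddings have hundreds of dimensions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

nearest = top_k([1.0, 0.0, 0.0], embeddings, k=2)
```

A full scan like this costs one similarity computation per stored vector, which is why dedicated indexes matter once collections reach the billions of vectors mentioned above.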
450 - Distributed Processing and Application Integration (WIP)
Base Frameworks
Apache Hadoop - A framework that allows for the distributed processing of large data sets
mrjob - The easiest route to writing Python programs that run on Hadoop
Apache Spark - The unified engine for large-scale data analytics
PySpark - The Python API for Apache Spark, allowing big data processing with Python
Ray - An open-source unified compute framework that makes it easy to scale AI and Python workloads
Full-fledged ETL
Azure Data Factory - Azure's cloud ETL service for scale-out serverless data integration and data transformation
AWS Glue - A serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources
Google Cloud Data Fusion - A fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines
Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data
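The three ETL phases that the services above automate at scale can be sketched with the Python standard library alone. The CSV content, the `payments` table, and the function names are all hypothetical. The transform step here is just one possible cleaning policy: trim and lowercase names, and drop rows whose amount is not numeric.

```python
import csv
import io
import sqlite3

# Hypothetical input; a real pipeline would read from a file, API, or queue.
RAW_CSV = """name,amount
 alice ,10.5
BOB,not-a-number
carol,4
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and drop rows with a non-numeric amount."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip bad input instead of loading it
        clean.append((row["name"].strip().lower(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table in one transaction."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    with conn:
        conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

Keeping the three phases as separate functions mirrors how the managed tools model pipelines as distinct source, transformation, and sink steps.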
Numerical & Scientific Computing
JAX - A Python library for accelerator-oriented array computation and program transformation
Stream Processing Engines
Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
Apache Storm - A free and open source distributed realtime computation system
Apache Flink - A framework and distributed processing engine for stateful computations over unbounded and bounded data streams
Google Cloud Dataflow - A fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing
Event Ingestion / Message Queues
Amazon Kinesis - The service making it easy to collect, process, and analyze real-time, streaming data
Azure Event Hubs - A highly scalable and reliable event streaming platform capable of ingesting millions of events per second
Apache Kafka - An open-source distributed event streaming platform
Message broker - An intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver
Azure Service Bus - A fully managed enterprise message broker with message queues and publish-subscribe topics
RabbitMQ - A reliable and mature messaging and streaming broker
Apache Lucene - A Java library providing powerful indexing and search features
Faiss - A library for efficient similarity search and clustering of dense vectors
Analytics Platforms
Apache Hive - A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
Presto - A distributed SQL query engine designed for fast, reliable, and efficient analytics at any scale
Trino - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
Amazon EMR - A cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications
Amazon Redshift - A fully managed, petabyte-scale data warehouse service in the cloud
Amazon Athena - An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data stores using standard SQL
Databricks - The platform that allows your entire organization to use data and AI
Microsoft Fabric - An end-to-end analytics solution with full-service capabilities including data movement, data lakes, data engineering, data integration, data science, real-time analytics, and business intelligence
Azure Synapse Analytics - An enterprise analytics service that accelerates time to insight across data warehouses and big data systems
Google Cloud BigQuery - A fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud