Data - Any sequence of one or more symbols; datum is a single symbol of data
Metadata - The data that provides information about other data, but not the content of the data
Big data - The data sets that are too large or complex to be dealt with by traditional data-processing application software
Data model - An abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities
Data orientation - A perspective of data that emphasizes the data itself, rather than the applications that use the data
DIKW pyramid - A class of models representing purported structural and/or functional relationships between data, information, knowledge, and wisdom
Garbage in, garbage out - A concept in computer science and information and communications technology that the quality of the output is determined by the quality of the input
Data cleansing - The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database
Data lifecycle management - A policy-based approach to managing the flow of an information system's data throughout its life cycle
Master data management - A technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets
Data quality - A measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability, and timeliness
Single source of truth - The practice of structuring information models and associated data schema such that every data element is mastered (or edited) in only one place
Concurrency control - The mechanism ensuring that correct results for concurrent operations are generated efficiently
CRUD operations - The four basic operations of persistent storage: create, read, update, and delete
Shard - A horizontal partition of data in a database or search engine
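As a minimal sketch of horizontal partitioning, the snippet below routes each record key to one of a fixed number of shards by hashing the key. The helper name `shard_for`, the shard count, and the sample keys are illustrative assumptions, not taken from any particular database; production systems often use consistent hashing or range partitioning instead.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each shard holds only the rows whose keys hash to it.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user_id)].append(user_id)
```

A stable hash (rather than Python's built-in `hash`, which is salted per process for strings) keeps the key-to-shard mapping the same across runs.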
ETL - A three-phase process where data is extracted from an input source, transformed, and loaded into an output data container
ELT - A data integration process where raw data is moved from a source system to a destination resource, such as a data warehouse, and then transformed for use
Data pipeline - A set of data processing elements connected in series, where the output of one element is the input of the next one
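The ETL and data pipeline entries above can be sketched with three chained generator stages, where the output of one stage is the input of the next. The CSV text, the `payments` table, and the function names are made up for illustration; this is one possible arrangement, not a reference design.

```python
import csv
import io
import sqlite3

RAW_CSV = "name,amount\nalice, 10\nbob, 5\nalice, 7\n"  # made-up input data

def extract(text):
    """Extract: yield raw rows (dicts of strings) from a CSV source."""
    yield from csv.DictReader(io.StringIO(text))

def transform(rows):
    """Transform: normalize names and convert amounts to integers."""
    for row in rows:
        yield row["name"].strip().title(), int(row["amount"].strip())

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]  # 22
```

In an ELT arrangement the raw rows would be loaded first and the cleanup would run inside the destination, for example as SQL.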
Data governance - A data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data
Data lineage - The process of understanding, recording, and visualizing data as it flows from data sources to consumption
SageMath - A free open-source mathematics software system licensed under the GPL
statsmodels - A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration
R - A free software environment for statistical computing and graphics
Tidyverse - An opinionated collection of R packages designed for data science
Wolfram Language - A symbolic language, deliberately designed with the breadth and unity needed to develop powerful programs quickly
Specialized Tools
latexify - A Python package to compile a fragment of Python source code to a corresponding LaTeX expression
handcalcs - A Python library to render Python calculation code automatically in LaTeX, but in a manner that mimics how one might format their calculation if it were written with a pencil
NetworkX - A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
JAX - A Python library for accelerator-oriented array computation and program transformation
Data Sources
GeoLite2 - A set of free geolocation and ASN data in downloadable database and web service formats
JupyterLab - A web-based interactive development environment for notebooks, code, and data
Jupyter Notebook - The original web application for creating and sharing computational documents
VSCode Jupyter Extension - A VS Code extension that provides basic notebook support for language kernels supported in the environment
nbviewer - A simple way to share Jupyter Notebooks
R Markdown - An authoring framework that helps you create dynamic analysis documents combining code, rendered output, and prose
Wolfram Notebooks - A powerful environment for exploration and communication, combining text, literate programming, graphics and custom interactive elements
Voilà - A tool that turns Jupyter notebooks into standalone web applications
Single point of failure - A part of a system that, if it fails, will stop the entire system from working
Fault tolerance - The property that enables a system to continue operating properly in the event of the failure of some of its components
Load balancing - The process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient
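One of the simplest load balancing strategies is round-robin, sketched below under the assumption that all tasks cost roughly the same. The server names and the `dispatch` helper are hypothetical.

```python
import itertools
from collections import Counter

servers = ["app-1", "app-2", "app-3"]  # hypothetical backend names
next_server = itertools.cycle(servers)

def dispatch(task):
    """Assign the task to the next server in rotation."""
    return next(next_server), task

assignments = [dispatch(f"task-{i}") for i in range(6)]
load = Counter(server for server, _ in assignments)  # 2 tasks per server
```

Round-robin ignores how busy each server actually is; strategies such as least-connections trade that simplicity for better balance under uneven workloads.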
Fallacies of distributed computing - A set of assertions describing false assumptions that programmers new to distributed applications invariably make
Byzantine fault - A condition of a distributed system, where components may fail and there is imperfect information about whether a component has failed
Consensus - A fault-tolerant mechanism that is used in distributed systems to achieve the necessary agreement on a single data value among distributed processes or systems
CAP theorem - A theorem stating that any distributed data store can provide only two of the following three guarantees: Consistency, Availability, and Partition tolerance
BASE properties - A database design model (Basically Available, Soft state, Eventually consistent) that prioritizes availability over consistency
Algebra - A branch of mathematics that deals with abstract systems, known as algebraic structures, and the manipulation of expressions within those systems
Boolean algebra - A branch of algebra that differs from elementary algebra in that the values of the variables are the truth values true and false, usually denoted by 1 and 0, and it uses logical operators such as conjunction (and), disjunction (or), and negation (not)
Elementary algebra - A branch of mathematics that encompasses the basic concepts of algebra
Equation - A mathematical formula that expresses the equality of two expressions, by connecting them with the equals sign =
Logarithm - The exponent to which a fixed value, the base, must be raised to produce a given number
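As a small worked example of the logarithm entry, the snippet below checks that raising the base to the logarithm recovers the original number, using Python's standard `math` module. The specific numbers are arbitrary.

```python
import math

# log_b(x) is the exponent y such that b ** y == x.
y = math.log(8, 2)     # approximately 3, because 2 ** 3 == 8
z = math.log10(1000)   # approximately 3, because 10 ** 3 == 1000

# Raising the base to the logarithm recovers the number (up to rounding).
recovered_y = 2 ** y
recovered_z = 10 ** z

# Change of base: log_b(x) = ln(x) / ln(b)
change_of_base = math.log(8) / math.log(2)
```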
Linear algebra - The branch of mathematics concerning linear equations, linear maps, and their representations in vector spaces and through matrices
Vector space - A set whose elements, often called vectors, can be added together and multiplied ("scaled") by numbers called scalars
Matrix - A rectangular array of numbers or other mathematical objects with elements or entries arranged in rows and columns, usually satisfying certain properties of addition and multiplication
Sparse matrix - A matrix in which most of the elements are zero
Rank - The dimension of the vector space generated (or spanned) by the columns of a matrix
Determinant - A scalar-valued function of the entries of a square matrix
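To make the determinant entry concrete, here is a minimal sketch that computes it by cofactor (Laplace) expansion along the first row. The function name `det` and the sample matrices are illustrative; this approach is fine for tiny matrices but far too slow for large ones, where libraries use LU decomposition instead.

```python
def det(m):
    """Determinant of a square matrix (list of lists) by cofactor expansion."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        # Minor: drop the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

d_regular = det([[1, 2], [3, 4]])    # 1*4 - 2*3 = -2
d_singular = det([[1, 2], [2, 4]])   # 0: the second row is twice the first
```

A zero determinant signals that the columns are linearly dependent, so the matrix has less than full rank.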
Calculus - The mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations
Differential calculus - A subfield of calculus that studies the rates at which quantities change
Integral calculus - A subfield of calculus that studies integrals, the continuous analog of a sum, which are used to calculate areas, volumes, and their generalizations
Differential equation - An equation that relates one or more unknown functions and their derivatives
Geometry - A branch of mathematics concerned with properties of space such as the distance, shape, size, and relative position of figures
Trigonometry - A branch of mathematics concerned with relationships between angles and side lengths of triangles
Coordinate system - A system that uses one or more numbers, or coordinates, to uniquely determine and standardize the position of the points or other geometric elements on a manifold such as Euclidean space
Euclidean distance - The length of the line segment between two points in a Euclidean space
Root mean square - The square root of the mean of the squares of a set of numbers
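The Euclidean distance and root mean square entries both reduce to "square, combine, take the square root". The following sketch spells that out; the function names are my own, and for the distance the standard library also offers `math.dist`.

```python
import math

def euclidean_distance(p, q):
    """Length of the line segment between points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def root_mean_square(values):
    """Square root of the mean of the squares."""
    return math.sqrt(sum(v * v for v in values) / len(values))

d = euclidean_distance((0, 0), (3, 4))   # 5.0 (the 3-4-5 right triangle)
rms = root_mean_square([1, -1, 1, -1])   # 1.0, although the plain mean is 0
```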
Transforms
Discrete cosine transform - A transform that expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies
Discrete Fourier transform - A discrete version of the Fourier transform that converts a finite sequence of equally-spaced samples of a function into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT)
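For intuition, the discrete Fourier transform can be written directly from its definition, X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/N). The sketch below is a naive O(N^2) version for illustration only; the function name `dft` is my own, and practical code would use a fast Fourier transform routine.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a finite sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A constant signal has all of its energy in the zero-frequency bin.
constant_spectrum = dft([1, 1, 1, 1])   # approximately [4, 0, 0, 0]

# A unit impulse has a flat spectrum.
impulse_spectrum = dft([1, 0, 0, 0])    # approximately [1, 1, 1, 1]
```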
Probability theory - The branch of mathematics concerned with probability
Bayes' theorem - A mathematical rule for inverting conditional probabilities, allowing the probability of a cause to be found given its effect
Central limit theorem (CLT) - A theorem that states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution
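A short worked example of Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B): given a positive test result (the effect), how likely is the condition (the cause)? The prevalence, sensitivity, and false-positive rate below are invented numbers chosen only to illustrate the calculation.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test, with or without the condition.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Assumed inputs: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
# p is about 0.161: most positives still come from the healthy majority.
```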
Information theory - A scientific study of the quantification, storage, and communication of digital information
Entropy - The average level of 'information', 'surprise', or 'uncertainty' inherent in a random variable's possible outcomes
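Entropy in this sense is Shannon entropy, H = -sum(p * log2(p)) over the possible outcomes, measured in bits. A minimal sketch follows; the function name and example distributions are illustrative.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])      # 1.0 bit
four_sided = shannon_entropy([0.25] * 4)     # 2.0 bits
biased_coin = shannon_entropy([0.9, 0.1])    # less than 1 bit: less surprise
```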
Hypothesis testing - A method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis
Null hypothesis - The default hypothesis that no statistical relationship or effect exists in a set of observations, for example between two sets of observed data or measured phenomena
Confidence interval (CI) - A range of values which is likely to contain (in repeated sampling) the true value of an unknown statistical parameter, such as a population mean
P-value - The probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct
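To illustrate the p-value entry, the sketch below computes an exact one-sided p-value for a coin-flip experiment: the probability of seeing at least the observed number of heads if the null hypothesis (a fair coin) were true. The scenario and the function name are assumptions made for the example.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Observed: 9 heads in 10 flips. Under a fair coin this has probability
# (C(10,9) + C(10,10)) / 2**10 = 11 / 1024, roughly 0.0107.
p_value = binomial_p_value(9, 10)
```

Whether 0.0107 counts as sufficient evidence depends on the significance level chosen before the test, commonly 0.05 or 0.01.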
Numerical methods
Significant figures - The digits of a number written in positional notation that carry both reliability and necessity in conveying a particular quantity
Relational model - An approach to managing data using a structure and language consistent with first-order predicate logic
ACID properties - A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps
Atomicity, Consistency, Isolation, and Durability
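Atomicity and consistency can be demonstrated with Python's built-in `sqlite3` module. In this sketch, a transfer either applies both updates or neither; the `accounts` table, the names, and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance

ok = transfer(conn, "alice", "bob", 30)        # True: both updates applied
failed = transfer(conn, "alice", "bob", 500)   # False: credit to bob is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {"alice": 70, "bob": 80}
```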
Codd's Twelve Rules - A set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd to define what is required from a database management system in order for it to be considered relational
Database normalization - The process of organizing columns (attributes) and tables (relations) of a relational database to minimize data redundancy
Languages & Dialects
Structured Query Language (SQL) - A domain-specific language used for managing data held in a relational database management system
Command Categories
DDL - Data Definition Language
DQL - Data Query Language
DML - Data Manipulation Language
DCL - Data Control Language
TCL - Transaction Control Language
SQL Join - A clause that combines columns from one or more tables in a relational database
Aggregate function - A function where the values of multiple rows are grouped together to form a single summary value
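The join and aggregate function entries can be seen together in one query. The sketch below uses an in-memory SQLite database through Python's `sqlite3` module; the `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20), (2, 1, 35), (3, 2, 15);
""")

# JOIN pairs each order with its customer; COUNT and SUM then collapse
# each customer's rows into a single summary row.
rows = conn.execute("""
SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()
# rows == [("Ada", 2, 55), ("Grace", 1, 15)]
```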
Transact-SQL - The proprietary extension to SQL used to program and manage SQL Server
PostgreSQL - An object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department
MySQL - A widely used open source SQL database management system that is developed, distributed, and supported by Oracle Corporation
MariaDB Community Server - The open source relational database that is a community-developed fork of MySQL
Distributed SQL
TiDB - An open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads
Embedded / In-Process
SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, and full-featured database engine
PGlite - A WASM build of Postgres packaged into a TypeScript/JavaScript client library that enables you to run Postgres in the browser, Node.js, and Bun
DuckDB - An in-process SQL OLAP database management system
Storage Engines
Storage Engine - A software component that a database management system uses to create, read, update and delete (CRUD) data from a database
InnoDB - A transactional storage engine for MySQL and MariaDB
Connection pool - A cache of database connections maintained so that the connections can be reused when future requests to the database are required
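The idea behind a connection pool can be sketched in a few lines: open a fixed number of connections up front, hand them out on request, and return them instead of closing them. The `ConnectionPool` class below is a hypothetical, simplified illustration built on `sqlite3` and `queue.Queue`; real pools also handle health checks, timeouts, and connection recycling.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Minimal fixed-size pool (illustrative, not production-ready)."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection move between threads.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it to the pool instead of closing it

pool = ConnectionPool(size=2)
with pool.connection() as conn:
    value = conn.execute("SELECT 1 + 1").fetchone()[0]  # 2
```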
ODBC - A standard application programming interface for accessing database management systems
JDBC - An API that allows access to virtually any tabular data source from the Java programming language
Jdbi - A library that provides a more idiomatic way to use the Java Database Connectivity API
Object-Relational Mapping (ORM) - A programming technique for converting data between incompatible type systems using object-oriented programming languages
Prisma - A next-generation ORM that makes it easy to build reliable and scalable applications with databases
Hibernate - An object-relational mapping tool for the Java programming language
SQLAlchemy - The Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL
GORM - An ORM library for Go that aims to be developer friendly
DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
Azure Data Studio - A modern open-source, cross-platform hybrid data analytics tool designed to simplify the data landscape
Beekeeper Studio - A modern, easy to use, and good looking SQL editor and database manager
Command-Line & Deployment Utilities
sqlcmd utility - A command-line utility for ad hoc, interactive execution of Transact-SQL statements and scripts and for automating T-SQL scripting tasks
sqlpackage - A command-line utility that automates several database development tasks
DAC (Data-tier Applications) - A logical database management concept that defines all of the SQL Server objects associated with a user's database
pgroll - A zero-downtime, reversible, schema migration tool for PostgreSQL
Monitoring & Analysis
pgBadger - A PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering
NocoDB - An open-source, no-code platform that turns any database into a smart spreadsheet, providing a collaborative interface for relational databases
Object-relational impedance mismatch - A set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style
Document Databases
MongoDB - A document database designed for ease of application development and scaling
DocumentDB - A powerful, scalable open-source document database built for modern applications
Key-value Stores
etcd - A distributed, reliable key-value store for the most critical data of a distributed system
Redis - An in-memory data store used by millions of developers as a cache, vector database, document database, and streaming engine
Azure Cosmos DB - A fully managed, serverless distributed database for modern app development
Amazon DynamoDB - A fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale
Document Databases
Cloud Firestore - A cloud-hosted, NoSQL database that your Apple, Android, and web apps can access directly via native SDKs
Graph Databases
Amazon Neptune - A fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets
Wide-column Databases
Google Cloud Bigtable - A NoSQL wide-column database service for large analytical and operational workloads
Message broker - An intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver
Dead-letter queue - A specialized queue used in message queuing systems to store messages that could not be delivered or processed successfully
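The dead-letter queue pattern can be sketched without any broker: retry a failing message a limited number of times, then set it aside so it does not block the rest of the queue. The retry limit, the `process` handler, and the sample messages below are all assumptions made for illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit

def process(message):
    """Hypothetical handler: fails on payloads that are not integers."""
    return int(message) * 2

main_queue = deque([("1", 0), ("oops", 0), ("3", 0)])  # (payload, attempts so far)
dead_letters = []
results = []

while main_queue:
    payload, attempts = main_queue.popleft()
    try:
        results.append(process(payload))
    except ValueError:
        if attempts + 1 >= MAX_ATTEMPTS:
            dead_letters.append(payload)                # give up: park for inspection
        else:
            main_queue.append((payload, attempts + 1))  # requeue for another try

# results == [2, 6]; dead_letters == ["oops"]
```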
Messaging & Streaming Platforms (Software)
Apache Kafka - An open-source distributed event streaming platform
RabbitMQ - A reliable and mature messaging and streaming broker
Cloud Services
Amazon Kinesis - A service making it easy to collect, process, and analyze real-time, streaming data
Azure Event Hubs - A highly scalable and reliable event streaming platform capable of ingesting millions of events per second
Azure Service Bus - A fully managed enterprise message broker with message queues and publish-subscribe topics
Apache Hadoop - A framework that allows for the distributed processing of large data sets
mrjob - The easiest route to writing Python programs that run on Hadoop
Apache Spark - The unified engine for large-scale data analytics
PySpark - The Python API for Apache Spark, allowing large-scale data processing in Python
Ray - An open-source unified compute framework that makes it easy to scale AI and Python workloads
Joblib - A set of tools to provide lightweight pipelining in Python
Workflow Orchestration & ETL Tools (Software)
Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
dbt - A unified platform for delivering trusted data that empowers teams to deliver reliable, governed data at scale
Dagu - A local-first workflow engine that provides a declarative, file-based, and self-contained platform to orchestrate tasks from a single binary that scales from a laptop to a distributed cluster
Managed ETL & Data Integration Services
Azure Data Factory - The cloud ETL service for scale-out serverless data integration and data transformation
AWS Glue - A serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources
Google Cloud Data Fusion - A fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines
Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
Apache Storm - A free and open source distributed realtime computation system
Apache Flink - A framework and distributed processing engine for stateful computations over unbounded and bounded data streams
Cloud Services
Google Cloud Dataflow - A fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing
Google Search - The search engine that allows you to search the world's information, including webpages, images, videos and more
DuckDuckGo - The search engine that doesn't track you
Answer Engines
Wolfram|Alpha - A computational knowledge engine that computes expert-level answers using breakthrough algorithms, knowledgebase and AI technology
Perplexity AI - An AI-powered answer engine that provides accurate, trusted, and real-time answers to any question
Search Platforms and Tools
Azure AI Search - A fully managed, cloud-hosted service that unifies access to enterprise and web content for AI-powered search and retrieval-augmented generation
Reciprocal Rank Fusion (RRF) - An algorithm that evaluates the search scores from multiple, previously executed queries to produce a unified result set
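Reciprocal Rank Fusion is usually described by the formula score(d) = sum over rankings of 1 / (k + rank of d), where k is a smoothing constant (60 is a commonly cited choice). The sketch below implements that formula in plain Python; it is a generic illustration, not the implementation used by any particular search service, and the document IDs are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a full-text query
vector_hits = ["doc_b", "doc_d", "doc_a"]    # e.g. from a vector query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Because only ranks are used, the method does not require the input scores to be on comparable scales.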
Elasticsearch - An open source distributed, RESTful search and analytics engine, scalable data store, and vector database
Painless - A simple, secure scripting language designed specifically for use with Elasticsearch
ES|QL - A piped query language that allows you to filter, transform, and analyze data stored in Elasticsearch
Kibana - The open source interface to query, analyze, visualize, and manage your data stored in Elasticsearch
Apache Hive - A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
Presto - A distributed SQL query engine designed for fast, reliable, and efficient analytics at any scale
Trino - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
Amazon EMR - A cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications
Amazon Redshift - A fully managed, petabyte-scale data warehouse service in the cloud
Amazon Athena - An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data stores using standard SQL
Databricks - The platform that allows your entire organization to use data and AI
Microsoft Fabric - An end-to-end analytics solution with full-service capabilities including data movement, data lakes, data engineering, data integration, data science, real-time analytics, and business intelligence
Microsoft OneLake - A single, unified, logical data lake for your whole organization
Lakehouse vs Data Warehouse - A guide for choosing between a lakehouse and a data warehouse based on data volume, structure, and processing requirements
Azure Synapse Analytics - An enterprise analytics service that accelerates time to insight across data warehouses and big data systems
Google Cloud BigQuery - A fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud
Amazon QuickSight - An AI-powered business intelligence service that enables users to analyze data, create visualizations, and gain insights from various enterprise data sources