05 - Data Science & Engineering

Foundational Concepts

General Data Principles

General Data Concepts & Principles
- Big data - Data sets that are too large or complex to be dealt with by traditional data-processing application software
- Data model - An abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities
- Data orientation - A perspective of data that emphasizes the data itself, rather than the applications that use the data
- DIKW pyramid - A class of models representing purported structural and/or functional relationships between data, information, knowledge, and wisdom
- Garbage in, garbage out - A concept in computer science and information and communications technology that the quality of the output is determined by the quality of the input
- Data cleansing - The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database

Core Data Engineering

Core Data Engineering & Database Concepts
- Concurrency control - The mechanism ensuring that correct results for concurrent operations are generated efficiently
- CRUD operations - The four basic operations of persistent storage: create, read, update, and delete
- Shard - A horizontal partition of data in a database or search engine
- ETL - A three-phase process where data is extracted from an input source, transformed, and loaded into an output data container
- ELT - A data integration process where raw data is moved from a source system to a destination resource, such as a data warehouse, and then transformed for use
- Online transaction processing (OLTP) - A type of data processing that consists of executing a number of transactions occurring concurrently
- Online analytical processing (OLAP) - An approach to answering multi-dimensional analytical queries swiftly in computing
- Search engine indexing - The collecting, parsing, and storing of data to facilitate fast and accurate information retrieval
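
The ETL entry above can be illustrated with a minimal sketch using only the Python standard library. The input data, the `payments` table, and the function names are hypothetical and chosen for illustration; a real pipeline would read from files, APIs, or another database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,amount
alice, 10.5
bob,
carol,7
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, tidy names, cast amounts to float."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip incomplete records (a simple form of data cleansing)
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three phases in order and return what ended up in the target table."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

In an ELT variant, the raw rows would be loaded first and the cleanup in `transform` would instead run as SQL inside the destination.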

Distributed Systems

Core Concepts

Distributed Computing - A field of computer science that studies distributed systems
- Single point of failure - A part of a system that, if it fails, will stop the entire system from working
- Fault tolerance - The property that enables a system to continue operating properly in the event of the failure of some of its components
- Load balancing - The process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient
- Fallacies of distributed computing - A set of assertions describing false assumptions that programmers new to distributed applications invariably make
- Byzantine fault - A condition of a distributed system, where components may fail and there is imperfect information about whether a component has failed
  - Consensus - A fault-tolerant mechanism that is used in distributed systems to achieve the necessary agreement on a single data value among distributed processes or systems
- CAP theorem - A theorem stating that any distributed data store can provide only two of the following three guarantees: Consistency, Availability, and Partition tolerance
- BASE properties - A database model that prioritizes availability over consistency

Distributed File Systems & Storage

Distributed File Systems
- HDFS - A distributed file system designed to run on commodity hardware
Object storage - A computer data storage architecture that manages data as objects
- Amazon S3 - An object storage service offering industry-leading scalability, data availability, security, and performance
- Azure Blob Storage - Microsoft's object storage solution for the cloud, optimized for storing massive amounts of unstructured data
- Google Cloud Storage - A RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure
- Cloud Storage for Firebase - A service that lets you upload and share user-generated content, such as images and video
- Supabase Storage - A service that makes it simple to store and serve large files like photos and videos
- Self-hosted (advanced)
  - Ceph - An open-source, distributed storage system
  - MinIO - A high-performance, S3 compatible object store
- Tooling
  - s5cmd - A very fast S3 and local filesystem execution tool
  - Rclone - A command-line program to manage files on cloud storage
  - Azure Storage Explorer - A standalone app that makes it easy to work with Azure Storage data on Windows, macOS, and Linux

Mathematics & Statistics

Base Mathematics

Algebra - A branch of mathematics that deals with abstract systems, known as algebraic structures, and the manipulation of expressions within those systems
Calculus
Geometry
Root mean square - The square root of the mean of the squares of a set of numbers
Related Resources
- NIST Digital Library of Mathematical Functions - The definitive reference for the special functions of applied mathematics
  - Notations - A list of notations used in the library
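
As a rough illustration of the root mean square entry above, the definition translates directly into a few lines of Python. The function name `root_mean_square` is an illustrative choice, not part of any library.

```python
import math


def root_mean_square(values):
    """Return the square root of the mean of the squared values."""
    values = list(values)
    if not values:
        raise ValueError("root mean square is undefined for an empty sequence")
    mean_of_squares = sum(v * v for v in values) / len(values)
    return math.sqrt(mean_of_squares)
```

Because each value is squared before averaging, the sign of the inputs does not matter: `root_mean_square([1, -1])` and `root_mean_square([1, 1])` give the same result.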

Probability & Information Theory

Probability theory
- Bayes' theorem
- Central limit theorem (CLT)
Information theory - A scientific study of the quantification, storage, and communication of digital information
- Entropy - The average level of 'information', 'surprise', or 'uncertainty' inherent in a random variable's possible outcomes
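
The entropy entry above can be made concrete with a small sketch that estimates Shannon entropy, in bits, from observed outcomes. It assumes the empirical frequencies stand in for the true probabilities; the function name `shannon_entropy` is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(outcomes):
    """Estimate entropy in bits from a sequence of observed outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("entropy is undefined for an empty sequence")
    # Sum p * (-log2 p) over the observed outcomes.
    return sum((c / total) * -math.log2(c / total) for c in counts.values())
```

A sequence with a single repeated outcome carries no surprise and yields zero, while two equally likely outcomes yield one bit.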

Statistics

Statistics
- Sampling
- Errors and residuals
- Standard deviation
- Root mean square deviation - The square root of the average of the squared differences between the predicted values and the actual values
- Correlation
  - Pearson correlation coefficient
- Hypothesis testing
Numerical methods
- Significant figures
Resources
- OpenStax Introductory Statistics
- OpenIntro Statistics
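
The root mean square deviation entry in the list above can be sketched as follows. The function name `rmsd` and the pairing of predictions with observations by position are assumptions made for illustration.

```python
import math


def rmsd(predicted, actual):
    """Square root of the mean squared difference between paired values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Squaring the residuals means large errors dominate the result, so a single outlier can raise the value noticeably.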

Data Formats & Storage

Data Formats

Apache Parquet - An open source, column-oriented data file format designed for efficient data storage and retrieval
Apache ORC - The smallest, fastest columnar storage for Hadoop workloads
BSON - A binary-encoded serialization of JSON-like documents
Apache Avro - The leading serialization format for record data, and first choice for streaming data pipelines

Table Formats

Delta Lake - An open-source storage framework that enables building a format agnostic Lakehouse architecture with compute engines
Apache Iceberg - The open table format for huge analytic datasets
Apache Hudi - The Streaming Data Lake Platform

Databases

Relational Databases (SQL)

Foundational Concepts
- Relational model - An approach to managing data using a structure and language consistent with first-order predicate logic
- ACID properties - A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps
  - Atomicity, Consistency, Isolation, and Durability
- Codd's Twelve Rules - A set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd to define what is required from a database management system in order for it to be considered relational
- Database normalization - The process of organizing columns (attributes) and tables (relations) of a relational database to minimize data redundancy
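
To illustrate the atomicity part of the ACID entry above, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table and the `transfer` helper are hypothetical; the point is only that a failed transaction leaves no partial changes behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def demo():
    """Run one valid and one invalid transfer and report the final balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    ok = transfer(conn, "alice", "bob", 30)
    failed = transfer(conn, "alice", "bob", 500)
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

The second transfer overdraws the source account, so the exception triggers a rollback and both of its updates are discarded.
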
Languages & Dialects
- Structured Query Language (SQL) - A domain-specific language used for managing data held in a relational database management system
  - Command Categories
    - DDL - Data Definition Language
    - DQL - Data Query Language
    - DML - Data Manipulation Language
    - DCL - Data Control Language
    - TCL - Transaction Control Language
  - SQL Join - A clause that combines columns from one or more tables in a relational database
  - Aggregate function - A function where the values of multiple rows are grouped together to form a single summary value
- Transact-SQL - The proprietary extension to SQL used to program and manage SQL Server
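
The SQL join and aggregate function entries above can be seen together in a short sketch, run here through Python's `sqlite3` module. The `customers` and `orders` tables and their contents are made up for illustration.

```python
import sqlite3


def orders_per_customer():
    """Join customers to their orders and summarize each customer's activity."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
        """
    )
    query = """
        SELECT c.name,
               COUNT(o.id) AS order_count,          -- aggregate: rows per group
               COALESCE(SUM(o.total), 0) AS spent   -- aggregate: sum per group
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()
```

A `LEFT JOIN` is used so that customers without orders still appear, with a count of zero; an inner join would omit them.
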
Database Management Systems (DBMS)
- Client-Server RDBMS
  - PostgreSQL - An object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department
  - MySQL - The most popular Open Source SQL database management system, developed, distributed, and supported by Oracle Corporation
  - MariaDB Community Server - The open source relational database that is a community-developed fork of MySQL
- Distributed SQL
  - TiDB - An open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads
- Embedded / In-Process
  - SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine
  - PGlite - A WASM Postgres build packaged into a TypeScript/JavaScript client library, that enables you to run Postgres in the browser, Node.js and Bun
Cloud Services & Platforms
- Managed Database Services
  - Amazon RDS - A collection of managed services that makes it simple to set up, operate, and scale databases in the cloud
  - Amazon Aurora - A modern relational database service built for the cloud, with MySQL and PostgreSQL compatibility
  - Azure SQL Database - An intelligent, scalable, relational database service built for the cloud
  - Google Cloud SQL - A fully-managed database service that helps you set up, maintain, manage, and administer your relational databases on Google Cloud
  - Neon - A serverless, fault-tolerant, and scalable Postgres with a generous free tier
- Backend-as-a-Service (BaaS)
  - Supabase Database - An open source Firebase alternative
Connectivity & Abstraction
- Connectivity APIs
  - ODBC - A standard application programming interface for accessing database management systems
  - JDBC - An API that allows access to virtually any tabular data source from the Java programming language
    - Jdbi
- Object-Relational Mapping (ORM) - A programming technique for converting data between incompatible type systems using object-oriented programming languages
  - Prisma - A next-generation ORM that makes it easy to build reliable and scalable applications with databases
  - Hibernate - An object-relational mapping tool for the Java programming language
  - SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper
  - GORM - The fantastic ORM library for Golang that aims to be developer friendly
  - XORM - A Simple and Powerful ORM for Go
  - Diesel - A Safe, Extensible ORM and Query Builder for Rust
Tooling & Ecosystem
- Database Clients & IDEs
  - pgAdmin - The most popular and feature rich Open Source administration and development platform for PostgreSQL
  - SSMS (SQL Server Management Studio) - An integrated environment for managing any SQL infrastructure, from SQL Server to Azure SQL Database
  - DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
  - Azure Data Studio - A modern open-source, cross-platform hybrid data analytics tool designed to simplify the data landscape
  - Beekeeper Studio - A modern, easy to use, and good looking SQL editor and database manager
- Developer Libraries & Drivers
  - Vanna.AI - A Python package that uses retrieval augmentation to help you generate accurate SQL queries for your database using LLMs
  - Psycopg - The most popular PostgreSQL adapter for the Python programming language
- Command-Line & Deployment Utilities
  - sqlcmd utility - A command-line utility for ad hoc, interactive execution of Transact-SQL statements and scripts and for automating T-SQL scripting tasks
  - sqlpackage - A command-line utility that automates several database development tasks
  - DAC (Data-tier Applications) - A logical database management concept that defines all of the SQL Server objects associated with a user's database
  - pgroll - A zero-downtime, reversible, schema migration tool for PostgreSQL
- Monitoring & Analysis
  - pgBadger - A PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering

NoSQL Databases

Foundational Concepts
- Object-relational impedance mismatch - A set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style
Multi-model Databases
- Azure Cosmos DB - A fully managed, serverless distributed database for modern app development
- Amazon DynamoDB - A fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale
Document Databases
- MongoDB - A document database designed for ease of application development and scaling
- Cloud Firestore - A cloud-hosted, NoSQL database that your Apple, Android, and web apps can access directly via native SDKs
- DocumentDB - A powerful, scalable open-source document database built for modern applications
Key-value Stores
- etcd - A distributed, reliable key-value store for the most critical data of a distributed system
- Redis - An in-memory data store used by millions of developers as a cache, vector database, document database, and streaming engine
- Dragonfly - A drop-in Redis replacement
Graph Databases
- Neo4j - A high-speed graph database with unbounded scale, security, and data integrity
- Amazon Neptune - A fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets
Wide-column Databases
- Apache Cassandra - An open source NoSQL distributed database
- Apache HBase - The Hadoop database, a distributed, scalable, big data store
- Google Cloud Bigtable - A NoSQL wide-column database service for large analytical and operational workloads
Vector Databases
- pgvector - An open-source vector similarity search extension for Postgres
- Weaviate - An open-source vector database that simplifies the development of AI applications
- Milvus - A high-performance open-source vector database built to handle billions of vectors
- Chroma - The AI-native open-source embedding database

Data Processing & Pipelines

Batch Processing & ETL/ELT

Base Frameworks
- Apache Hadoop - A framework that allows for the distributed processing of large data sets
  - mrjob - The easiest route to writing Python programs that run on Hadoop
- Apache Spark - The unified engine for large-scale data analytics
  - PySpark - The Python API for Apache Spark, allowing big data processing with Python
- Ray - An open-source unified compute framework that makes it easy to scale AI and Python workloads
Full-fledged ETL
- Azure Data Factory - Azure's cloud ETL service for scale-out serverless data integration and data transformation
- AWS Glue - A serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources
- Google Cloud Data Fusion - A fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines
- Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data
- Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Stream Processing

Stream Processing Engines
- Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
- Apache Storm - A free and open source distributed realtime computation system
- Apache Flink - A framework and distributed processing engine for stateful computations over unbounded and bounded data streams
- Google Cloud Dataflow - A fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing
Event Ingestion / Message Queues
- Amazon Kinesis - A service that makes it easy to collect, process, and analyze real-time, streaming data
- Azure Event Hubs - A highly scalable and reliable event streaming platform capable of ingesting millions of events per second
- Apache Kafka - An open-source distributed event streaming platform
Message Brokers - An intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver
- Azure Service Bus - A fully managed enterprise message broker with message queues and publish-subscribe topics
- RabbitMQ - A reliable and mature messaging and streaming broker

Data Analytics & Science

Methodologies

Data Analytics Methodologies and Architectures
- Data warehouse - A system used for reporting and data analysis that is a core component of business intelligence
- Data lake - A system or repository of data stored in its natural/raw format, usually object blobs or files
- Data lakehouse - A new, open architecture that combines the best elements of data lakes and data warehouses
- Medallion Architecture - A data design pattern used to logically organize data in a lakehouse
- CRISP-DM - An open standard process model that describes common approaches used by data mining experts

Analytics & Search Platforms

Web Search Engines
- Google Search - The search engine that allows you to search the world's information, including webpages, images, videos and more
- DuckDuckGo - The search engine that doesn't track you
Answer Engines
- Wolfram|Alpha - A computational knowledge engine that computes expert-level answers using breakthrough algorithms, knowledgebase and AI technology
Search Platforms and Tools
- Elasticsearch - An open source distributed, RESTful search and analytics engine, scalable data store, and vector database
  - Painless - A simple, secure scripting language designed specifically for use with Elasticsearch
  - ES|QL
  - Kibana
  - Kibana Query Language
  - Elasticsearch vector database
- Apache Solr - The popular, blazing-fast, open source enterprise search platform built on Apache Lucene
  - Apache Lucene - A Java library providing powerful indexing and search features
- Faiss - A library for efficient similarity search and clustering of dense vectors
- Meilisearch - A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow
- Typesense - A lightning-fast, open source, search-as-you-type engine for building delightful search experiences
Analytics Platforms
- Apache Hive - A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
- Presto - A distributed SQL query engine designed for fast, reliable, and efficient analytics at any scale
- Trino - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
- Amazon EMR - A cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications
- Amazon Redshift - A fully managed, petabyte-scale data warehouse service in the cloud
- Amazon Athena - An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data stores using standard SQL
- Databricks - The platform that allows your entire organization to use data and AI
- Microsoft Fabric - An end-to-end analytics solution with full-service capabilities including data movement, data lakes, data engineering, data integration, data science, real-time analytics, and business intelligence
- Azure Synapse Analytics - An enterprise analytics service that accelerates time to insight across data warehouses and big data systems
- Google Cloud BigQuery - A fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud

Toolkit & Libraries

Languages & Core Libraries
- Python
  - Pandas - A fast, powerful, flexible and easy to use open source data analysis and manipulation tool
  - Polars - A blazingly fast DataFrame library for manipulating structured data
  - Narwhals - A lazy-first, type-agnostic, and framework-agnostic dataframe library in Python
  - NumPy - The fundamental package for scientific computing with Python
  - SciPy - Fundamental algorithms for scientific computing in Python
  - SymPy - A Python library for symbolic mathematics
  - SageMath
  - statsmodels
- R - A free software environment for statistical computing and graphics
  - Tidyverse - An opinionated collection of R packages designed for data science
    - dplyr, tidyr, stringr, purrr, readr
- Wolfram Language
Interactive Computing Environments
- JupyterLab - A web-based interactive development environment for notebooks, code, and data
- Jupyter Notebook - The original web application for creating and sharing computational documents
  - VSCode Jupyter Extension - A VS Code extension that provides basic notebook support for language kernels supported in Jupyter Notebooks
- nbviewer - A simple way to share Jupyter Notebooks
- BeakerX - A collection of kernels and extensions to the Jupyter interactive computing environment
- R Markdown - An authoring framework that helps you create dynamic analysis documents combining code, rendered output, and prose
- Wolfram Notebooks
Expression Generators
- latexify
- handcalcs
Network Analysis
- NetworkX - A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
Numerical & Scientific Computing
- JAX - A Python library for accelerator-oriented array computation and program transformation
Data Sources
- GeoLite2 - A free IP geolocation database

Data Visualization

Chart Types

Common Chart Types
- Histogram - An approximate representation of the distribution of numerical data
- Scatter plot - A type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data
- Box plot - A method for graphically demonstrating the locality, spread, and skewness of groups of numerical data through their quartiles
- Error bar - A graphical representation of the variability of data, used on graphs to indicate the error or uncertainty in a reported measurement
- Heat map - A data visualization technique that shows magnitude of a phenomenon as color in two dimensions
- Choropleth map - A type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable
- Proportional symbol map - A type of thematic map that uses map symbols that vary in size to represent a quantitative variable
- Tag cloud - A novelty visual representation of text data

Tools & Libraries

Tools and Libraries
- gnuplot - A portable command-line driven graphing utility
- matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python
- seaborn - A Python data visualization library based on matplotlib
- Plotly - The interactive graphing library for Python (includes Plotly Express)
- ggplot2 - A system for declaratively creating graphics, based on The Grammar of Graphics
- Vega - A visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs
- Vega-Lite - A high-level grammar of interactive graphics
- D3 - The JavaScript library for bespoke data visualization
- GoJS - A JavaScript library that lets you easily create interactive diagrams in web browsers
- Chart.js - A simple yet flexible JavaScript charting library for the modern web
- Recharts - A composable charting library built on React components
- WordCloud for Python - A little word cloud generator in Python

Dashboarding & Web Apps

Dash - The original low-code framework for rapidly building data apps in Python, R, Julia, and F#
Panel - A powerful Python library that lets you create interactive web apps and dashboards
Voila - A tool that turns Jupyter notebooks into standalone web applications
