400 - Data Science and Engineering

400 - Concepts and Math

400 - Core Concepts

  • General Data Concepts & Principles
    • Big data - Data sets that are too large or complex to be dealt with by traditional data-processing application software
    • DIKW pyramid - A class of models representing purported structural and/or functional relationships between data, information, knowledge, and wisdom
    • Garbage in, garbage out - A concept in computer science and information and communications technology that the quality of the output is determined by the quality of the input
  • Core Data Engineering & Database Concepts
  • Network science
    • Centrality - A measure of the relative importance of a node or vertex within a graph in graph theory and network analysis
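
As a rough illustration of the centrality entry above, the sketch below computes normalized degree centrality, one of the simplest centrality measures, in plain Python. The function name `degree_centrality` and the toy edge list are assumptions made for this example; libraries such as NetworkX provide many more measures.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # this sketch ignores self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible degree (n - 1) so scores fall in [0, 1].
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical toy network: "hub" is connected to every other node.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
scores = degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

Here `most_central` is `"hub"`, since that node touches all three other nodes and so gets the maximum score of 1.0.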

401 - Base Mathematics

402 - Statistics

410 - Data Science Toolkit

  • Languages & Core Libraries
    • Python
      • Pandas - A fast, powerful, flexible and easy to use open source data analysis and manipulation tool
      • Polars - A blazingly fast DataFrame library for manipulating structured data
      • NumPy - The fundamental package for scientific computing with Python
      • SciPy - Fundamental algorithms for scientific computing in Python
      • SymPy - A Python library for symbolic mathematics
      • SageMath
      • statsmodels
      • Pydantic
    • R - A free software environment for statistical computing and graphics
      • Tidyverse - An opinionated collection of R packages designed for data science
        • dplyr, tidyr, stringr, purrr, readr
    • Wolfram Language
  • Interactive Computing Environments
    • JupyterLab - A web-based interactive development environment for notebooks, code, and data
    • Jupyter Notebook - The original web application for creating and sharing computational documents
      • VSCode Jupyter Extension - A VS Code extension that provides basic notebook support for language kernels supported in Jupyter Notebooks
    • BeakerX - A collection of kernels and extensions to the Jupyter interactive computing environment
    • R Markdown - An authoring framework that helps you create dynamic analysis documents combining code, rendered output, and prose
    • Wolfram Notebooks
  • Expression Generators
  • Network Analysis
    • NetworkX - A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
  • Data Visualization
    • Common Chart Types
      • Histogram - An approximate representation of the distribution of numerical data
      • Scatter plot - A type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data
      • Box plot - A method for graphically demonstrating the locality, spread, and skewness of groups of numerical data through their quartiles
      • Error bar - A graphical representation of the variability of data and used on graphs to indicate the error or uncertainty in a reported measurement
      • Heat map - A data visualization technique that shows magnitude of a phenomenon as color in two dimensions
      • Choropleth map - A type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable
      • Proportional symbol map - A type of thematic map that uses map symbols that vary in size to represent a quantitative variable
      • Tag cloud - A novelty visual representation of text data
    • Tools and Libraries
      • gnuplot - A portable command-line driven graphing utility
      • matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python
      • seaborn - A Python data visualization library based on matplotlib
      • ggplot2 - A system for declaratively creating graphics, based on The Grammar of Graphics
      • Vega - A visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs
      • Vega-Lite - A high-level grammar of interactive graphics
      • D3 - The JavaScript library for bespoke data visualization
      • GoJS - A JavaScript library that lets you easily create interactive diagrams in web browsers
      • Chart.js - A simple yet flexible JavaScript charting library for the modern web
      • Recharts
      • WordCloud for Python - A little word cloud generator in Python
  • Data Sources
    • GeoLite2 - A free IP geolocation database
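
To make the histogram entry under Common Chart Types concrete, here is a minimal sketch of equal-width binning in plain Python. The function name `histogram_counts` and the sample values are assumptions for illustration; in practice a plotting library such as matplotlib handles both the binning and the drawing.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample data.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts = histogram_counts(values, bins=4)
```

Each entry of `counts` is the height of one bar; the bin edges here run from the minimum to the maximum value in four equal steps.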

420 - Data Formats and Storage

  • Data Formats
    • Apache Parquet - An open source, column-oriented data file format designed for efficient data storage and retrieval
    • Apache ORC - The smallest, fastest columnar storage for Hadoop workloads
    • BSON - A binary-encoded serialization of JSON-like documents
    • Apache Avro - The leading serialization format for record data, and first choice for streaming data pipelines
  • Data Storage Systems
    • Object storage - A computer data storage architecture that manages data as objects
      • Amazon S3 - An object storage service offering industry-leading scalability, data availability, security, and performance
      • Azure Blob Storage - Microsoft's object storage solution for the cloud, optimized for storing massive amounts of unstructured data
      • Google Cloud Storage - A RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure
      • Cloud Storage for Firebase - A service that lets you upload and share user-generated content, such as images and video
      • Supabase Storage - A service that makes it simple to store and serve large files like photos and videos
      • Self-hosted (advanced)
        • Ceph - An open-source, distributed storage system
        • MinIO - A high-performance, S3 compatible object store
      • Tooling
        • s5cmd - A very fast S3 and local filesystem execution tool
        • Rclone - A command-line program to manage files on cloud storage
        • Azure Storage Explorer - A standalone app that makes it easy to work with Azure Storage data on Windows, macOS, and Linux
    • Distributed File Systems
      • HDFS - A distributed file system designed to run on commodity hardware

430 - Relational Databases

  • Foundational Concepts
    • Relational model - An approach to managing data using a structure and language consistent with first-order predicate logic
    • ACID properties - A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps
      • Atomicity, Consistency, Isolation, and Durability
    • Codd's Twelve Rules - A set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd to define what is required from a database management system in order for it to be considered relational
  • Languages & Dialects
    • Structured Query Language (SQL) - A domain-specific language used for managing data held in a relational database management system
      • Command Categories
        • DDL - Data Definition Language
        • DQL - Data Query Language
        • DML - Data Manipulation Language
        • DCL - Data Control Language
        • TCL - Transaction Control Language
      • SQL Join - A clause that combines columns from one or more tables in a relational database
    • Transact-SQL - The proprietary extension to SQL used to program and manage SQL Server
  • Database Management Systems (DBMS)
    • Client-Server RDBMS
      • PostgreSQL - An object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department
      • MySQL - The most popular Open Source SQL database management system, developed, distributed, and supported by Oracle Corporation
      • MariaDB Community Server - The open source relational database that is a community-developed fork of MySQL
    • Distributed SQL
      • TiDB - An open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads
    • Embedded / In-Process
      • SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine
      • PGlite - A WASM Postgres build packaged into a TypeScript/JavaScript client library that enables you to run Postgres in the browser, Node.js, and Bun
  • Cloud Services & Platforms
    • Managed Database Services
      • Amazon RDS - A collection of managed services that makes it simple to set up, operate, and scale databases in the cloud
      • Amazon Aurora - A modern relational database service built for the cloud, with MySQL and PostgreSQL compatibility
      • Azure SQL Database - An intelligent, scalable, relational database service built for the cloud
    • Backend-as-a-Service (BaaS)
  • Connectivity & Abstraction
    • Connectivity APIs
      • ODBC - A standard application programming interface for accessing database management systems
      • JDBC - An API that allows access to virtually any tabular data source from the Java programming language
    • Object-Relational Mapping (ORM) - A programming technique for converting data between incompatible type systems using object-oriented programming languages
      • Prisma - A next-generation ORM that makes it easy to build reliable and scalable applications with databases
      • Hibernate - An object-relational mapping tool for the Java programming language
      • GORM - The fantastic ORM library for Golang that aims to be developer friendly
      • XORM - A Simple and Powerful ORM for Go
      • Diesel - A Safe, Extensible ORM and Query Builder for Rust
  • Tooling & Ecosystem
    • Database Clients & IDEs
      • pgAdmin - The most popular and feature rich Open Source administration and development platform for PostgreSQL
      • SSMS (SQL Server Management Studio) - An integrated environment for managing any SQL infrastructure, from SQL Server to Azure SQL Database
      • DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
      • Azure Data Studio - A modern open-source, cross-platform hybrid data analytics tool designed to simplify the data landscape
      • Beekeeper Studio - A modern, easy to use, and good looking SQL editor and database manager
    • Developer Libraries & Drivers
      • Vanna.AI - A Python package that uses retrieval augmentation to help you generate accurate SQL queries for your database using LLMs
      • Psycopg - The most popular PostgreSQL adapter for the Python programming language
    • Command-Line & Deployment Utilities
      • sqlcmd utility - A command-line utility for ad hoc, interactive execution of Transact-SQL statements and scripts and for automating T-SQL scripting tasks
      • sqlpackage - A command-line utility that automates several database development tasks
      • DAC (Data-tier Applications) - A logical database management concept that defines all of the SQL Server objects associated with a user's database
    • Monitoring & Analysis
      • pgBadger - A PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering
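
The SQL Join and ACID entries above can be tried out with SQLite through Python's standard `sqlite3` module. The sketch below is only an illustration: the `authors` and `books` tables and their rows are hypothetical. It shows an inner join, a left join, and a transaction that rolls back as a whole when one statement fails (atomicity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY, "
    "author_id INTEGER NOT NULL REFERENCES authors(id), "
    "title TEXT NOT NULL)"
)
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO books (author_id, title) VALUES (?, ?)",
    [(1, "Notes"), (1, "Sketches")],
)
conn.commit()

# INNER JOIN: only authors that have at least one matching book appear.
inner_rows = conn.execute(
    "SELECT a.name, b.title FROM authors AS a "
    "JOIN books AS b ON b.author_id = a.id ORDER BY b.title"
).fetchall()

# LEFT JOIN: every author appears, even those with no books.
left_rows = conn.execute(
    "SELECT a.name, COUNT(b.id) FROM authors AS a "
    "LEFT JOIN books AS b ON b.author_id = a.id "
    "GROUP BY a.id ORDER BY a.id"
).fetchall()

# Atomicity: the connection context manager commits on success and rolls back on error.
try:
    with conn:
        conn.execute("INSERT INTO authors (id, name) VALUES (3, 'Edgar')")
        conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Duplicate')")  # primary key clash
except sqlite3.IntegrityError:
    pass  # the first insert in the block is rolled back together with the failed one

author_count = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
```

After the failed transaction, `author_count` is still 2: the valid insert of author 3 was undone along with the one that violated the primary key.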

440 - NoSQL Databases

  • Foundational Concepts
    • CAP theorem - A theorem stating that any distributed data store can provide only two of the following three guarantees: Consistency, Availability, and Partition tolerance
    • BASE properties - A database model that prioritizes availability over consistency
    • Data model - An abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities
    • Data orientation - A perspective of data that emphasizes the data itself, rather than the applications that use the data
    • Object-relational impedance mismatch - A set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style
  • Multi-model Databases
    • Azure Cosmos DB - A fully managed, serverless distributed database for modern app development
    • Amazon DynamoDB - A fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale
  • Document Databases
    • MongoDB - A document database designed for ease of application development and scaling
    • Cloud Firestore - A cloud-hosted, NoSQL database that your Apple, Android, and web apps can access directly via native SDKs
    • DocumentDB - A powerful, scalable open-source document database built for modern applications
  • Key-value Stores
    • etcd - A distributed, reliable key-value store for the most critical data of a distributed system
    • Redis - An in-memory data store used by millions of developers as a cache, vector database, document database, and streaming engine
    • Dragonfly - A drop-in Redis replacement
  • Graph Databases
    • Neo4j - A high-speed graph database with unbounded scale, security, and data integrity
    • Amazon Neptune - A fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets
  • Wide-column Databases
    • Apache Cassandra - An open source NoSQL distributed database
    • Apache HBase - The Hadoop database, a distributed, scalable, big data store
    • Google Cloud Bigtable - A NoSQL wide-column database service for large analytical and operational workloads
  • Vector Databases
    • pgvector - An open-source vector similarity search extension for Postgres
    • Weaviate - An open-source vector database that simplifies the development of AI applications
    • Milvus - A high-performance open-source vector database built to handle billions of vectors
    • Chroma - The AI-native open-source embedding database

450 - Distributed Processing and Application Integration (WIP)

  • Base Frameworks
    • Apache Hadoop - A framework that allows for the distributed processing of large data sets
      • mrjob - The easiest route to writing Python programs that run on Hadoop
    • Apache Spark - The unified engine for large-scale data analytics
      • PySpark - The Python API for Apache Spark, allowing big data processing with Python
    • Ray - An open-source unified compute framework that makes it easy to scale AI and Python workloads
  • Full-fledged ETL
    • Azure Data Factory - Azure's cloud ETL service for scale-out serverless data integration and data transformation
    • AWS Glue - A serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources
    • Google Cloud Data Fusion - A fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines
    • Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data
  • Numerical & Scientific Computing
    • JAX - A Python library for accelerator-oriented array computation and program transformation
  • Stream Processing Engines
    • Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
    • Apache Storm - A free and open source distributed realtime computation system
    • Apache Flink - A framework and distributed processing engine for stateful computations over unbounded and bounded data streams
    • Google Cloud Dataflow - A fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing
  • Event Ingestion / Message Queues
    • Amazon Kinesis - A service that makes it easy to collect, process, and analyze real-time, streaming data
    • Azure Event Hubs - A highly scalable and reliable event streaming platform capable of ingesting millions of events per second
    • Apache Kafka - An open-source distributed event streaming platform
  • Message Brokers - An intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver
    • Azure Service Bus - A fully managed enterprise message broker with message queues and publish-subscribe topics
    • RabbitMQ - A reliable and mature messaging and streaming broker

460 - Search and Analytics (WIP)

  • Web Search Engines
  • Answer Engines
  • Data Analytics Methodologies and Architectures
    • Data warehouse - A system used for reporting and data analysis that is a core component of business intelligence
    • Data lake - A system or repository of data stored in its natural/raw format, usually object blobs or files
    • Data lakehouse - A new, open architecture that combines the best elements of data lakes and data warehouses
    • Medallion Architecture - A data design pattern used to logically organize data in a lakehouse
    • CRISP-DM - An open standard process model that describes common approaches used by data mining experts
  • Table Formats
    • Delta Lake - An open-source storage framework that enables building a format agnostic Lakehouse architecture with compute engines
    • Apache Iceberg - The open table format for huge analytic datasets
    • Apache Hudi - The Streaming Data Lake Platform
  • Search Platforms and Tools
  • Analytics Platforms
    • Apache Hive - A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
    • Presto - A distributed SQL query engine designed for fast, reliable, and efficient analytics at any scale
    • Trino - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
    • Amazon EMR - A cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications
    • Amazon Redshift - A fully managed, petabyte-scale data warehouse service in the cloud
    • Amazon Athena - An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data stores using standard SQL
    • Databricks - The platform that allows your entire organization to use data and AI
    • Microsoft Fabric - An end-to-end analytics solution with full-service capabilities including data movement, data lakes, data engineering, data integration, data science, real-time analytics, and business intelligence
    • Azure Synapse Analytics - An enterprise analytics service that accelerates time to insight across data warehouses and big data systems
    • Google Cloud BigQuery - A fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud