05 - Data Science & Engineering
Foundational Concepts
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.1 Strategic Utilization of Data and AI > Understanding and Utilization of Data and AI
General Data Concepts & Principles
- Data - Any sequence of one or more symbols; datum is a single symbol of data
- Metadata - Data that provides information about other data, but not the content of that data
- Big data - Data sets that are too large or complex to be dealt with by traditional data-processing application software
- Data model - An abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities
- Data orientation - A perspective of data that emphasizes the data itself, rather than the applications that use the data
- DIKW pyramid - A class of models representing purported structural and/or functional relationships between data, information, knowledge, and wisdom
- Garbage in, garbage out - A concept in computer science and information and communications technology that the quality of the output is determined by the quality of the input
- Data cleansing - The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database
- Data lifecycle management - A policy-based approach to managing the flow of an information system's data throughout its life cycle
- Master data management - A technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets
- Data quality - A measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it's up to date
- Single source of truth - The practice of structuring information models and associated data schema such that every data element is mastered (or edited) in only one place
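As an illustration of data cleansing and "garbage in, garbage out", the following minimal sketch corrects repairable records and removes the rest. The field names (`id`, `email`) and the cleansing rules are assumptions made for the example, not a standard.

```python
# Hypothetical raw records; field names and rules are illustrative only.
raw_records = [
    {"id": "1", "email": " Ada@Example.com "},   # repairable: trim and lowercase
    {"id": "2", "email": ""},                     # missing email: removed
    {"id": "1", "email": "ada@example.com"},      # duplicate id: removed
    {"id": "x", "email": "bob@example.com"},      # non-numeric id: removed
    {"id": "3", "email": "carol@example.com"},
]


def cleanse(records):
    """Return corrected records, dropping ones that cannot be repaired."""
    seen_ids = set()
    cleaned = []
    for record in records:
        email = record["email"].strip().lower()  # correct formatting issues
        if not email or not record["id"].isdigit():
            continue  # inaccurate or incomplete record: remove it
        record_id = int(record["id"])
        if record_id in seen_ids:
            continue  # keep only the first record per id
        seen_ids.add(record_id)
        cleaned.append({"id": record_id, "email": email})
    return cleaned


cleaned = cleanse(raw_records)
```

Keeping the first record per `id` is one possible deduplication policy; a real pipeline would pick the rule that matches its master data definition.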
Core Data Engineering & Database Concepts
- Concurrency control - The mechanism ensuring that correct results for concurrent operations are generated efficiently
- CRUD operations - The four basic operations of persistent storage: create, read, update, and delete
- Shard - A horizontal partition of data in a database or search engine
- ETL - A three-phase process where data is extracted from an input source, transformed, and loaded into an output data container
- ELT - A data integration process where raw data is moved from a source system to a destination resource, such as a data warehouse, and then transformed for use
- Data pipeline - A set of data processing elements connected in series, where the output of one element is the input of the next one
- Data governance - A data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data
- Data lineage - The process of understanding, recording, and visualizing data as it flows from data sources to consumption
- Online transaction processing (OLTP) - A type of data processing that consists of executing a number of transactions occurring concurrently
- Online analytical processing (OLAP) - An approach to answering multi-dimensional analytical queries swiftly in computing
- Search engine indexing - The collecting, parsing, and storing of data to facilitate fast and accurate information retrieval
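The ETL and data pipeline entries above can be sketched with the Python standard library alone. The CSV content, the `payments` table, and the three function names are hypothetical; the point is only that each stage's output feeds the next stage.

```python
import csv
import io
import sqlite3

# Hypothetical input source: a small CSV held in memory.
RAW_CSV = "name,amount\nalice,10.5\nbob,not_a_number\ncarol,4\n"


def extract(text):
    """Extract: read rows from the input source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and drop rows whose amount is not numeric."""
    out = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        out.append((row["name"].title(), amount))
    return out


def load(rows, conn):
    """Load: write the transformed rows into the output container."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

An ELT variant would load the raw rows first and run the transformation inside the destination, for example as a SQL statement.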
Data Governance, Quality & Architecture
- Data Catalog - A centralized metadata repository that helps organizations manage and discover data assets
- Data Stewardship - A set of practices and processes for managing an organization's data assets to ensure quality, security, and compliance
- Data Privacy - The right and ability of an individual to determine what happens to information about themselves
- Data Contract - An explicit agreement on data structure, quality, and semantics between data producers and consumers
- Schema Evolution - The process of modifying a database schema while maintaining compatibility with existing data and applications
- Dimensional Modeling - A database design technique used to optimize data warehouses for analytical queries using facts and dimensions
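A data contract can be approximated as a mapping from field names to expected types that a consumer checks before accepting a record. This is a minimal sketch, assuming a hypothetical `CONTRACT` for order records; real contracts usually also cover semantics and quality thresholds.

```python
# Hypothetical contract agreed between a producer and a consumer.
CONTRACT = {"order_id": int, "amount": float, "currency": str}


def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


ok = violations({"order_id": 1, "amount": 9.99, "currency": "JPY", "note": "gift"})
bad = violations({"order_id": "1", "amount": 9.99})
```

Extra fields such as `note` are accepted here, so a producer can add columns without breaking existing consumers, which is one simple form of backward-compatible schema evolution.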
Data Science Toolkit
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.2 AI and Data Science > Mathematical Statistics, Multivariate Analysis, and Data Visualization
Programming Languages & Libraries
- Python - A programming language that lets you work quickly and integrate systems more effectively
- Awesome Python - A curated list of awesome Python frameworks, libraries, tools, and resources
- Pandas - A fast, powerful, flexible and easy to use open source data analysis and manipulation tool
- Polars - A blazingly fast DataFrame library for manipulating structured data
- Narwhals - A lazy-first, type-agnostic, and framework-agnostic dataframe library in Python
- NumPy - The fundamental package for scientific computing with Python
- SciPy - Fundamental algorithms for scientific computing in Python
- SymPy - A Python library for symbolic mathematics
- SageMath - A free open-source mathematics software system licensed under the GPL
- statsmodels - A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration
- R - A free software environment for statistical computing and graphics
- Tidyverse - An opinionated collection of R packages designed for data science
- GNU Octave - A high-level language, primarily intended for numerical computations
- Wolfram Language - A symbolic language, deliberately designed with the breadth and unity needed to develop powerful programs quickly
Specialized & Scientific Tools
- latexify - A Python package to compile a fragment of Python source code to a corresponding LaTeX expression
- handcalcs - A Python library to render Python calculation code automatically in LaTeX, but in a manner that mimics how one might format their calculation if it were written with a pencil
- NetworkX - A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- JAX - A Python library for accelerator-oriented array computation and program transformation
Data Sources & Geospatial
- GeoLite2 - A set of free geolocation and ASN data in downloadable database and web service formats
Spreadsheet & Collaborative Data Platforms
- Microsoft Excel - The industry-leading spreadsheet software program and a powerful data visualization and analysis tool
- Grist - A relational spreadsheet that combines the familiar interface of a spreadsheet with the power and structure of a relational database
- NocoBase - A scalability-first, open-source no-code platform designed for building complex business applications and internal tools
- NocoDB - An open-source, no-code platform that turns any database into a smart spreadsheet, providing a collaborative interface for relational databases
- Airtable - A platform that combines the flexibility of a spreadsheet with the power of a database to help teams manage their work
Interactive Computing Environments
- JupyterLab - A web-based interactive development environment for notebooks, code, and data
- Jupyter Notebook - The original web application for creating and sharing computational documents
- VSCode Jupyter Extension - A VS Code extension that provides basic notebook support for language kernels supported in the environment
- nbviewer - A simple way to share Jupyter Notebooks
- R Markdown - An authoring framework that helps you create dynamic analysis documents combining code, rendered output, and prose
- Wolfram Notebooks - A powerful environment for exploration and communication, combining text, literate programming, graphics and custom interactive elements
- Voila - A tool that turns Jupyter notebooks into standalone web applications
Data Visualization
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.2 AI and Data Science > Mathematical Statistics, Multivariate Analysis, and Data Visualization
Common Chart Types
- Histogram - A representation of the distribution of numerical data
- Scatter plot - A type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data
- Box plot - A method for graphically showing the locality, spread, and skewness of groups of numerical data through their quartiles
- Error bar - A graphical representation of the variability of data used on graphs to indicate the uncertainty in a reported measurement
- Heat map - A technique that shows magnitude of a phenomenon as color in two dimensions
- Choropleth map - A type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable
- Proportional symbol map - A type of thematic map that uses symbols that vary in size to represent a quantitative variable
- Tag cloud - A novelty visual representation of text data
Visualization Tools & Libraries
- Python Libraries
- matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python
- seaborn - A Python data visualization library based on matplotlib
- Plotly - The interactive, open-source, and browser-based graphing library for Python (includes Plotly Express)
- WordCloud for Python - A little word cloud generator in Python
- JavaScript Libraries
- D3 - The JavaScript library for bespoke data visualization
- GoJS - A JavaScript library that lets you easily create interactive diagrams in web browsers
- Chart.js - A simple yet flexible JavaScript charting library for the modern web
- Recharts - A composable charting library built on React components
- Tabulator - An easy to use, simple to code, fully featured, interactive JavaScript library for creating tables and data grids
- Grammars & Other
- gnuplot - A portable command-line driven graphing utility
- ggplot2 - A system for declaratively creating graphics, based on The Grammar of Graphics
- Vega - A visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs
- Vega-Lite - A high-level grammar of interactive graphics
Dashboarding & Web Apps
- Dash - The original low-code framework for rapidly building data apps in Python, R, Julia, and F#
- Panel - A powerful Python library that lets you create interactive web apps and dashboards
- Streamlit - A faster way to build and share data apps
Distributed Systems
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.3 Data Management > Data Engineering (Design, Collection, Integration, Provision)
Distributed Computing Principles
- Distributed computing - A field of computer science that studies distributed systems, whose components are located on different networked computers
- Single point of failure - A part of a system that, if it fails, will stop the entire system from working
- Fault tolerance - The property that enables a system to continue operating properly in the event of the failure of some of its components
- Load balancing - The process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient
- Fallacies of distributed computing - A set of assertions describing false assumptions that programmers new to distributed applications invariably make
- Byzantine fault - A condition of a distributed system, where components may fail and there is imperfect information about whether a component has failed
- Consensus - A fault-tolerant mechanism that is used in distributed systems to achieve the necessary agreement on a single data value among distributed processes or systems
- CAP theorem - A theorem stating that any distributed data store can provide only two of the following three guarantees: Consistency, Availability, and Partition tolerance
- BASE properties - A database model (Basically Available, Soft state, Eventually consistent) that prioritizes availability over consistency
Consensus & Replication Strategies
- Raft Consensus Algorithm - A consensus algorithm designed to be more understandable than Paxos, enabling safe state machine replication across clusters
- Paxos Algorithm - A family of protocols for solving consensus in a network of unreliable or asynchronous processors
- Data Replication - The frequent electronic copying of data from a computer or server to another location, computer, or server
- Master-Slave Replication - A pattern where one primary (master) node accepts writes and replica (slave) nodes copy its data
Distributed Patterns & Observability
- Circuit Breaker Pattern - A design pattern to prevent cascading failures in distributed systems
- Distributed Tracing - A method for profiling and monitoring applications, especially those built using microservices architecture
- Event Sourcing - A pattern where all changes to application state are stored as a sequence of immutable events
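The circuit breaker pattern can be sketched as a small wrapper that stops calling a failing dependency for a while. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions chosen for illustration, not the API of any particular library.

```python
import time


class CircuitBreaker:
    """Reject calls after repeated failures, then allow a trial call later."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0           # a success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast instead of waiting on a dependency that is already struggling, which is how the pattern limits cascading failures.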
Distributed Storage Systems
- Distributed File Systems
- Object storage - A computer data storage architecture that manages data as objects
- Amazon S3 - An object storage service offering industry-leading scalability, data availability, security, and performance
- Azure Blob Storage - Microsoft's object storage solution for the cloud, optimized for storing massive amounts of unstructured data
- Azure Data Lake Storage (ADLS) - A scalable and secure data lake for high-performance analytics workloads
- Google Cloud Storage - A RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure
- Cloud Storage for Firebase - A service that lets you upload and share user-generated content, such as images and video
- Supabase Storage - A service that makes it simple to store and serve large files like photos and videos
- Self-hosted (advanced)
- Tooling
- s5cmd - A very fast S3 and local filesystem execution tool
- Rclone - A command-line program to manage files on cloud storage
- Azure Storage Explorer - A standalone app making it easy to work with Azure Storage data on Windows, macOS, and Linux
- Azurite - An open-source Azure Storage emulator
Mathematics & Statistics
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.2 AI and Data Science > Mathematical Statistics, Multivariate Analysis, and Data Visualization
Base Mathematics
- Algebra - A branch of mathematics that deals with abstract systems, known as algebraic structures, and the manipulation of expressions within those systems
- Boolean algebra - A branch of algebra that differs from elementary algebra in that the values of the variables are the truth values true and false, usually denoted by 1 and 0, and it uses logical operators such as conjunction (and), disjunction (or), and negation (not)
- Elementary algebra - A branch of mathematics that encompasses the basic concepts of algebra
- Abstract algebra - The study of algebraic structures, which are sets with specific operations acting on their elements
- Linear algebra - The branch of mathematics concerning linear equations, linear maps, and their representations in vector spaces and through matrices
- Vector space - A set whose elements, often called vectors, can be added together and multiplied ("scaled") by numbers called scalars
- Matrix - A rectangular array of numbers or other mathematical objects with elements or entries arranged in rows and columns, usually satisfying certain properties of addition and multiplication
- Sparse matrix - A matrix in which most of the elements are zero
- Rank - The dimension of the vector space generated (or spanned) by its columns
- Determinant - A scalar-valued function of the entries of a square matrix
- Calculus - The mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations
- Differential calculus - A subfield of calculus that studies the rates at which quantities change
- Integral calculus - The study of integrals, the continuous analog of a sum, used to calculate areas, volumes, and their generalizations
- Differential equation - An equation that relates one or more unknown functions and their derivatives
- Geometry - A branch of mathematics concerned with properties of space such as the distance, shape, size, and relative position of figures
- Trigonometry - A branch of mathematics concerned with relationships between angles and side lengths of triangles
- Coordinate system - A system that uses one or more numbers, or coordinates, to uniquely determine and standardize the position of the points or other geometric elements on a manifold such as Euclidean space
- Euclidean distance - The length of the line segment between two points in a Euclidean space
- Category theory - A general theory of mathematical structures and their relations
- Functor - A mapping between categories
- Root mean square - The square root of the mean of the squares of a set of numbers
- Transforms
- Discrete cosine transform - A transform that expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies
- Discrete Fourier transform - A discrete version of the Fourier transform that converts a finite sequence of equally-spaced samples of a function into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT)
- Related Resources
- NIST Digital Library of Mathematical Functions - The definitive reference for the special functions of applied mathematics
- Notations - A list of notations used in the library
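Two of the definitions above, Euclidean distance and root mean square, translate directly into code. This is a minimal sketch in plain Python; the function names are chosen for the example.

```python
import math


def euclidean_distance(p, q):
    """Length of the line segment between points p and q (same dimension)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def root_mean_square(values):
    """Square root of the mean of the squares of the values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


distance = euclidean_distance((0, 0), (3, 4))   # the classic 3-4-5 triangle
rms = root_mean_square([1, -1, 1, -1])          # the sign does not matter
```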
Probability & Information Theory
- Probability theory - The branch of mathematics concerned with probability
- Bayes' theorem - A mathematical rule for inverting conditional probabilities, allowing the probability of a cause to be found given its effect
- Central limit theorem (CLT) - A theorem that states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution
- Information theory - A scientific study of the quantification, storage, and communication of digital information
- Entropy - The average level of 'information', 'surprise', or 'uncertainty' inherent in a random variable's possible outcomes
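Bayes' theorem and entropy are easier to grasp with numbers. The sketch below uses a hypothetical diagnostic test (1% prior, 99% sensitivity, 5% false positive rate); these figures are invented for illustration.

```python
import math


def posterior(prior, likelihood, false_positive_rate):
    """P(cause | effect) from P(cause), P(effect | cause), P(effect | no cause)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


p_condition_given_positive = posterior(0.01, 0.99, 0.05)
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
```

With these assumed numbers, a positive result raises the probability of the condition from 1% to roughly 17%, because false positives from the large unaffected group dominate. The fair coin carries exactly one bit of entropy, and the biased coin carries less because its outcome is more predictable.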
Statistics & Numerical Methods
- Statistics - A discipline that concerns the collection, organization, analysis, interpretation, and presentation of data
- Sampling - The selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population
- Errors and residuals - The measures of the deviation of an observed value of an element of a statistical sample from its "true value"
- Standard deviation - A measure of the amount of variation of the values of a variable about its average
- Root mean square deviation - The square root of the average of the squared differences between the predicted values and the actual values
- F-score - A measure of predictive performance in statistical analysis of binary classification and information retrieval systems
- Correlation - A kind of statistical relationship between two random variables or bivariate data
- Pearson correlation coefficient - A correlation coefficient that measures linear correlation between two sets of data
- Hypothesis testing - A method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis
- Null hypothesis - The claim in statistical inference that the effect being studied does not exist, for example that there is no relationship between two sets of observed data or measured phenomena
- Confidence interval (CI) - A range of values which is likely to contain (in repeated sampling) the true value of an unknown statistical parameter, such as a population mean
- P-value - The probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct
- Numerical methods
- Significant figures - The digits in a number written in positional notation that are both reliable and necessary for conveying a particular quantity
- Resources
- OpenStax Introductory Statistics - An open-source textbook for introductory statistics courses
- OpenIntro Statistics - A dynamic take on the traditional curriculum, being successfully used at Community Colleges to the Ivy League
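Several of the statistical measures defined above can be computed with the standard library. This is a minimal sketch; `pearson` and `rmsd` are hand-written helpers for the example, and the sample values are arbitrary.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def rmsd(predicted, actual):
    """Root mean square deviation between predictions and observations."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(statistics.mean(squared_errors))


population_sd = statistics.pstdev([2, 4, 4, 4, 5, 5, 7, 9])
r_linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])
error = rmsd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

`statistics.pstdev` treats the data as a whole population; `statistics.stdev` applies the sample correction instead.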
Data Formats & Architecture
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.3 Data Management > Data Engineering (Design, Collection, Integration, Provision)
- 2. Data Preparation & Utilization > 2.3 Data Management > Improvement of Data Quality and Safety
Data Formats & Table Formats
- Apache Parquet - An open source, column-oriented data file format designed for efficient data storage and retrieval
- Apache ORC - The smallest, fastest columnar storage for Hadoop workloads
- Apache Arrow - A universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
- BSON - A binary-encoded serialization of JSON-like documents
- Apache Avro - The leading serialization format for record data, and first choice for streaming data pipelines
- Delta Lake - An open-source storage framework that enables building a format agnostic Lakehouse architecture with compute engines
- Apache Iceberg - The open table format for huge analytic datasets
- Apache Hudi - The Streaming Data Lake Platform
Data Architectures & Methodologies
- Data warehouse - A system used for reporting and data analysis that is a core component of business intelligence
- Data lake - A system or repository of data stored in its natural/raw format, usually object blobs or files
- Data lakehouse - A new, open architecture that combines the best elements of data lakes and data warehouses
- Medallion Architecture - A data design pattern used to logically organize data in a lakehouse
- CRISP-DM - An open standard process model that describes common approaches used by data mining experts
Data Governance & Metadata Management
- Apache Atlas - A scalable and extensible set of core foundational governance services that enable enterprises to meet compliance requirements
- Collibra - An enterprise data governance platform providing a common language for data management
- Informatica Metadata Manager - A comprehensive metadata management solution for enterprise data governance
- OpenMetadata - An open-source metadata management platform for data discovery, governance, and collaboration
Data Quality & Validation
- Great Expectations - A Python library for defining, documenting, and testing data quality
- Apache Griffin - A data quality solution built on Apache Spark and Apache Hadoop for distributed data quality measurement
- Soda - A data quality monitoring solution that integrates with modern data stacks
Data Versioning & Schema Management
- Schema Registry - A hosted schema management service that centralizes schemas for Kafka topics
- Git-based Schema Management - Using Git repositories to version control database schemas
- dbt Contracts - Explicit data contracts defining input and output data requirements
Relational Databases (SQL)
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.3 Data Management > Data Engineering (Design, Collection, Integration, Provision)
SQL Fundamentals
- Foundational Concepts
- Relational model - An approach to managing data using a structure and language consistent with first-order predicate logic
- ACID properties - A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps
- Atomicity, Consistency, Isolation, and Durability
- Codd's Twelve Rules - A set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd to define what is required from a database management system in order for it to be considered relational
- Database normalization - The process of organizing columns (attributes) and tables (relations) of a relational database to minimize data redundancy
- Languages & Dialects
- Structured Query Language (SQL) - A domain-specific language used for managing data held in a relational database management system
- Command Categories
- DDL - Data Definition Language
- DQL - Data Query Language
- DML - Data Manipulation Language
- DCL - Data Control Language
- TCL - Transaction Control Language
- SQL Join - A clause that combines columns from one or more tables in a relational database
- Aggregate function - A function where the values of multiple rows are grouped together to form a single summary value
- Transact-SQL - The proprietary extension to SQL used to program and manage SQL Server
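The SQL command categories, joins, and aggregate functions listed above can be tried with Python's built-in `sqlite3` module. The `customers` and `orders` tables are hypothetical, and SQLite has no DCL statements such as GRANT, so that category is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL NOT NULL
);
""")

# DML: insert rows.
conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
                 [(1, "Ada"), (2, "Bob")])
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(1, 20.0), (1, 5.0), (2, 7.5)])

# TCL: make the changes durable.
conn.commit()

# DQL: join the two tables and aggregate per customer.
rows = conn.execute("""
SELECT c.name, COUNT(o.id), SUM(o.total)
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()
```

`COUNT` and `SUM` collapse each customer's order rows into a single summary row, which is what makes them aggregate functions.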
Database Management Systems (DBMS)
- Client-Server RDBMS
- PostgreSQL - An object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department
- MySQL - The most popular open source SQL database management system, developed, distributed, and supported by Oracle Corporation
- MariaDB Community Server - The open source relational database that is a community-developed fork of MySQL
- Distributed SQL
- TiDB - An open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads
- Embedded / In-Process
- SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, and full-featured database engine
- PGlite - A WASM build packaged into a TypeScript/JavaScript client library, that enables you to run the database in the browser, Node.js and Bun
- DuckDB - An in-process SQL OLAP database management system
- Storage Engines
- Storage Engine - A software component that a database management system uses to create, read, update and delete (CRUD) data from a database
- InnoDB - A transactional storage engine for MySQL and MariaDB
Cloud & Managed Services
- Managed Database Services
- Amazon RDS - A collection of managed services that makes it simple to set up, operate, and scale databases in the cloud
- Amazon Aurora - A fully managed relational database engine offering high performance and availability at global scale for PostgreSQL, MySQL, and DSQL
- Azure SQL Database - An intelligent, scalable, relational database service built for the cloud
- Google Cloud SQL - A fully-managed database service that helps you set up, maintain, manage, and administer your relational databases on Google Cloud
- Neon - A serverless, fault-tolerant, and scalable Postgres with a generous free tier
Connectivity & Tooling
- Connectivity APIs & ORMs
- Connection pool - A cache of database connections maintained so that the connections can be reused when future requests to the database are required
- ODBC - A standard application programming interface for accessing database management systems
- JDBC - An API that allows access to virtually any tabular data source from the Java programming language
- Jdbi - A library that provides a more idiomatic way to use the Java Database Connectivity API
- Object-Relational Mapping (ORM) - A programming technique for converting data between incompatible type systems using object-oriented programming languages
- Prisma - A next-generation ORM that makes it easy to build reliable and scalable applications with databases
- Hibernate - An object-relational mapping tool for the Java programming language
- SQLAlchemy - The Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL
- GORM - The fantastic ORM library for Golang that aims to be developer friendly
- XORM - A Simple and Powerful ORM for Go
- Diesel - A Safe, Extensible ORM and Query Builder for Rust
- Developer Libraries & Drivers
- Database Clients & IDEs
- pgAdmin - The most popular and feature rich Open Source administration and development platform for PostgreSQL
- SSMS (SQL Server Management Studio) - An integrated environment for managing any SQL infrastructure, from SQL Server to Azure SQL Database
- DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite
- Azure Data Studio - A modern open-source, cross-platform hybrid data analytics tool designed to simplify the data landscape
- Beekeeper Studio - A modern, easy to use, and good looking SQL editor and database manager
- Command-Line & Deployment Utilities
- sqlcmd utility - A command-line utility for ad hoc, interactive execution of Transact-SQL statements and scripts and for automating T-SQL scripting tasks
- sqlpackage - A command-line utility that automates several database development tasks
- DAC (Data-tier Applications) - A logical database management concept that defines all of the SQL Server objects associated with a user's database
- pgroll - A zero-downtime, reversible, schema migration tool for PostgreSQL
- Monitoring & Analysis
- pgBadger - A PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering
NoSQL & Specialized Databases
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.3 Data Management > Data Engineering (Design, Collection, Integration, Provision)
NoSQL Data Models
- Object-relational impedance mismatch - A set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style
- Document Databases
- MongoDB - A document database designed for ease of application development and scaling
- DocumentDB - A powerful, scalable open-source document database built for modern applications
- Key-value Stores
- Graph Databases
- Wide-column Databases
- Apache Cassandra - An open source NoSQL distributed database
- Apache HBase - The Hadoop database, a distributed, scalable, big data store
- ClickHouse - A fast, open-source OLAP (Online Analytical Processing) database management system designed for real-time analytics
Vector & AI Databases
- Concepts
- HNSW (Hierarchical Navigable Small Worlds) - A top-performing index for vector similarity search
- Vector Databases
- Pinecone - A purpose-built vector database delivering relevant results at any scale
- pgvector - An open-source vector similarity search for Postgres
- Elasticsearch vector database - The world's most widely deployed, open source vector database
- Weaviate - An open-source vector database that simplifies the development of AI applications
- Milvus - A high-performance open-source vector database built to handle billions of vectors
- Chroma - The AI-native open-source embedding database
- Qdrant - A high-performance vector search engine built entirely in Rust that helps developers build AI retrieval at any scale
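At their core, the vector databases above answer "which stored vectors are most similar to this query vector?". The sketch below does this by brute force with cosine similarity; it is not HNSW, which is an approximate index that avoids comparing the query against every vector. The tiny three-dimensional `EMBEDDINGS` are made up for the example.

```python
import math

# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(vectors,
                    key=lambda name: cosine_similarity(query, vectors[name]),
                    reverse=True)
    return ranked[:k]


top_two = nearest([1.0, 0.0, 0.0], EMBEDDINGS, k=2)
```

Brute force costs one comparison per stored vector, which is why large collections rely on indexes such as HNSW.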
Cloud NoSQL Services
- Multi-model Databases
- Azure Cosmos DB - A fully managed, serverless distributed database for modern app development
- Amazon DynamoDB - A fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale
- Document Databases
- Cloud Firestore - A cloud-hosted, NoSQL database that your Apple, Android, and web apps can access directly via native SDKs
- Graph Databases
- Amazon Neptune - A fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets
- Wide-column Databases
- Google Cloud Bigtable - A NoSQL wide-column database service for large analytical and operational workloads
Data Processing & Messaging
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.3 Data Management > Data Engineering (Design, Collection, Integration, Provision)
Enterprise Integration
- Enterprise Integration Patterns - A pattern language of 65 integration patterns that helps developers design and build distributed applications or integrate existing ones
- Apache Camel - An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data
Message Queuing & Event Streaming
- Concepts
- Message broker - An intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver
- Dead-letter queue - A specialized queue used in message queuing systems to store messages that could not be delivered or processed successfully
- Messaging & Streaming Platforms (Software)
- Apache Kafka - An open-source distributed event streaming platform
- Apache Kafka Ecosystem
- Kafbat UI - A versatile, fast, lightweight, and flexible web interface designed to monitor and manage Apache Kafka clusters
- RabbitMQ - A reliable and mature messaging and streaming broker
- Cloud Services
- Amazon Kinesis - A service making it easy to collect, process, and analyze real-time, streaming data
- Azure Event Hubs - A highly scalable and reliable event streaming platform capable of ingesting millions of events per second
- Azure Service Bus - A fully managed enterprise message broker with message queues and publish-subscribe topics
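The dead-letter queue concept listed above can be sketched with an in-memory queue. This is a simplified illustration, not the behavior of any particular broker: the retry limit `MAX_ATTEMPTS`, the `process` handler, and the `consume` function are all hypothetical names, and real brokers configure dead-lettering through their own settings.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit before a message is dead-lettered


def process(message):
    """Hypothetical handler: fails on payloads that are not valid integers."""
    return int(message)


def consume(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages, moving repeated failures to a dead-letter list."""
    main_queue = deque((m, 0) for m in messages)  # (payload, attempts so far)
    dead_letters = []
    results = []
    while main_queue:
        payload, attempts = main_queue.popleft()
        try:
            results.append(handler(payload))
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Give up: keep the message and the failure reason for inspection.
                dead_letters.append(
                    {"payload": payload, "error": repr(exc), "attempts": attempts}
                )
            else:
                # Requeue at the back so other messages are not blocked.
                main_queue.append((payload, attempts))
    return results, dead_letters


results, dead_letters = consume(["1", "oops", "3"], process)
print(results)
print(dead_letters)
```

Keeping the failed payload together with its error lets an operator inspect or replay it later instead of losing it or blocking the main queue.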
Batch Processing (ETL/ELT)
- Base Frameworks
- Apache Hadoop - A framework that allows for the distributed processing of large data sets
- mrjob - The easiest route to writing Python programs that run on Hadoop
- Apache Spark - The unified engine for large-scale data analytics
- PySpark - The Python API for Apache Spark, allowing big data processing in Python
- Ray - An open-source unified compute framework that makes it easy to scale AI and Python workloads
- Joblib - A set of tools to provide lightweight pipelining in Python
- Workflow Orchestration & ETL Tools (Software)
- Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data
- Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
- dbt - A unified platform for delivering trusted data that empowers teams to deliver reliable, governed data at scale
- Dagu - A local-first workflow engine that provides a declarative, file-based, and self-contained platform to orchestrate tasks from a single binary that scales from a laptop to a distributed cluster
- Managed ETL & Data Integration Services
- Azure Data Factory - The cloud ETL service for scale-out serverless data integration and data transformation
- AWS Glue - A serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources
- Google Cloud Data Fusion - A fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines
Stream Processing
- Stream Processing Engines (Software)
- Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
- Apache Storm - A free and open source distributed realtime computation system
- Apache Flink - A framework and distributed processing engine for stateful computations over unbounded and bounded data streams
- Cloud Services
- Google Cloud Dataflow - A fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing
Data Analytics & Search
Relevant DSS-P Skills
- 2. Data Preparation & Utilization > 2.1 Strategic Utilization of Data and AI > Understanding and Utilization of Data and AI
Search Engines & Platforms
- Web Search Engines
- Google Search - The search engine that allows you to search the world's information, including webpages, images, videos and more
- DuckDuckGo - The search engine that doesn't track you
- Answer Engines
- Wolfram|Alpha - A computational knowledge engine that computes expert-level answers using breakthrough algorithms, knowledgebase and AI technology
- Perplexity AI - An AI-powered answer engine that provides accurate, trusted, and real-time answers to any question
- Search Platforms and Tools
- Azure AI Search - A fully managed, cloud-hosted service that unifies access to enterprise and web content for AI-powered search and retrieval-augmented generation
- Reciprocal Rank Fusion (RRF) - An algorithm that evaluates the search scores from multiple, previously executed queries to produce a unified result set
- Elasticsearch - An open source distributed, RESTful search and analytics engine, scalable data store, and vector database
- Painless - A simple, secure scripting language designed specifically for use with Elasticsearch
- ES|QL - A piped language that allows you to filter, transform, and analyze data stored in Elasticsearch
- Kibana - The open source interface to query, analyze, visualize, and manage your data stored in Elasticsearch
- Kibana Query Language - A simple text-based query language for filtering data
- Apache Solr - The popular, blazing-fast, open source enterprise search platform built on Apache Lucene
- Apache Lucene - A Java library providing powerful indexing and search features
- Faiss - A library for efficient similarity search and clustering of dense vectors
- Meilisearch - A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow
- Typesense - A lightning-fast, open source, search-as-you-type engine for building delightful search experiences
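Reciprocal Rank Fusion, listed above, can be sketched in a few lines. This assumes the commonly cited form of the algorithm, in which each document receives 1 / (k + rank) from every result list it appears in and the contributions are summed. The constant `k = 60` is a conventional default used here as an assumption, and the document IDs and function name are hypothetical. A specific search service may differ in its details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting
    at 1. k=60 is an assumed conventional default, not a required value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword query and a vector query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because only rank positions are used, the lists can come from scorers with incomparable scales, such as BM25 and cosine similarity, without any normalization step.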
Analytics Engines & Platforms
- Software & Managed Services
- Apache Hive - A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
- Presto - A distributed SQL query engine designed for fast, reliable, and efficient analytics at any scale
- Trino - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources
- Amazon EMR - A cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications
- Amazon Redshift - A fully managed, petabyte-scale data warehouse service in the cloud
- Amazon Athena - An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data stores using standard SQL
- Databricks - The platform that allows your entire organization to use data and AI
- Microsoft Fabric - An end-to-end analytics solution with full-service capabilities including data movement, data lakes, data engineering, data integration, data science, real-time analytics, and business intelligence
- Microsoft OneLake - A single, unified, logical data lake for your whole organization
- Lakehouse vs Data Warehouse - A guide for choosing between a lakehouse and a data warehouse based on data volume, structure, and processing requirements
- Azure Synapse Analytics - An enterprise analytics service that accelerates time to insight across data warehouses and big data systems
- Google Cloud BigQuery - A fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud
- Amazon QuickSight - An AI-powered business intelligence service that enables users to analyze data, create visualizations, and gain insights from various enterprise data sources
Semantic Layer
- Cube - The agentic analytics platform to deploy AI agents to model, analyze, and report on your data
- Open Semantic Interchange (OSI) - The universal standard for semantic model exchange enabling semantic metadata interchange across analytics, AI, and BI platforms