Data Engineering and Computer Science
The role of data engineering is to ensure an uninterrupted flow of data between servers and applications
# Resources
- https://github.com/ossu/computer-science
- What is Data Engineering and Why Is It So Important?
- ETL (extract, transform, load)
- Have we bridged the gap between Data Science and DevOps?
- Codelabs
- Google Developers Codelabs provide a guided, tutorial, hands-on coding experience. Most codelabs will step you through the process of building a small application, or adding a new feature to an existing application
# Python
# Julia
# Javascript
- https://www.w3schools.com/js/
- https://codesandbox.io
- https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics
- https://dtabio.gitbooks.io/data-science-with-javascript/content/links_and_resources.html
- http://www.kdnuggets.com/2016/06/top-machine-learning-libraries-javascript.html
# Bash
# CUDA
- https://developer.nvidia.com/cuda-education
- https://dragan.rocks/articles/18/Interactive-GPU-Programming-1-Hello-CUDA
# Books
See “Books” section in AI/DS and DataEng/Python
- #BOOK Mining of Massive Datasets (Leskovec, 2014 CAMBRIDGE)
- #BOOK Advanced Analytics with Spark (Ryza, 2017 OREILLY)
- #BOOK The Big Book of Data Engineering (Databricks)
# R
- #BOOK R para profesionales de los datos: una introducción
- #BOOK Geocomputation with R
- #BOOK Efficient R programming
- #BOOK Engineering Production-Grade Shiny Apps
- #BOOK Advanced R
- #BOOK Hands-On Programming with R
- #BOOK R Packages (Wickham 2020)
# Courses
- See “Courses” section in AI/DS and DataEng/Python
- #COURSE Intro to Hadoop and MapReduce
- #COURSE Mining Massive Data Sets (CS246 Stanford)
- #COURSE Getting and Cleaning Data (Coursera)
- SQL:
- Tutorial and exercises
- SQL (basic, intermediate, advanced / pet problems)
# Code
- See AI/DS and DataEng/ML Ops
- #CODE ABSL.flags - Defines a distributed command line system, replacing getopt-style and manual argument parsing
- #CODE Memray - A memory profiler for Python
- #CODE mmap.ninja - Memory-mapped numpy arrays of varying shapes
    - You can use mmap_ninja with any training framework (such as TensorFlow, PyTorch, MXNet), as it stores your dataset as a memory-mapped numpy array
    - A memory-mapped file is a file that is physically present on disk in a way that the correlation between the file and the memory space permits applications to treat the mapped portions as if they were primary memory, allowing very fast I/O
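A minimal sketch of the memory-mapping idea using numpy's built-in `np.memmap` (not mmap_ninja's own API; file name and shape are made up for illustration):

```python
import numpy as np

# Create a file-backed array on disk; it behaves like a regular ndarray
arr = np.memmap("data.bin", dtype=np.float32, mode="w+", shape=(1000, 128))
arr[0] = np.arange(128, dtype=np.float32)
arr.flush()  # persist pending writes to disk

# Reopen read-only: pages are loaded lazily, so huge datasets
# can be indexed without reading the whole file into RAM
ro = np.memmap("data.bin", dtype=np.float32, mode="r", shape=(1000, 128))
print(ro[0, :5])
```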
- #CODE Polars - Fast multi-threaded, hybrid-out-of-core DataFrame library in Rust | Python | Node.js
- #CODE Pandas AI/DS and DataEng/Pandas
- #CODE Modin - Scale your pandas workflows by changing one line of code
- #CODE Xarray AI/DS and DataEng/Xarray
- #CODE Dedupe - A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution
- #CODE PyTables
- #CODE H5py
- #CODE Singer - Simple, Composable Open Source ETL
- #CODE Docker
- #CODE Kubernetes - K8s is an open-source system for automating deployment, scaling, and management of containerized applications.
# Business Intelligence
- #CODE kuwala
# Big data, distributed computing
- #CODE Dask
- #CODE Ray - A system for parallel and distributed Python that unifies the ML ecosystem
    - https://ray.readthedocs.io/en/latest/
    - https://ray-project.github.io/
    - #TALK Ray: A Distributed Execution Framework for AI | SciPy 2018 | Robert Nishihara
    - #TALK Ray: A System for Scalable Python and ML | SciPy 2020 | Robert Nishihara
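A minimal Ray sketch (local mode) showing how remote tasks parallelize plain Python functions:

```python
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x):
    # Runs as a task on a worker process
    return x * x

# .remote() returns futures immediately; ray.get() blocks for the results
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```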
- #CODE PyGDF - GPU Data Frame (now cuDF, part of NVIDIA RAPIDS)
- #CODE Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
    - https://www.quora.com/What-is-the-difference-between-Apache-Spark-and-Apache-Hadoop-Map-Reduce
    - Intro to Hadoop and MapReduce (Udacity)
    - https://datawanderings.com/2017/01/15/your-first-diy-hadoop-cluster/
    - http://ruhanixedu.com/blog/interview-question-and-answers/big-data/
- #CODE Apache Spark - A unified analytics engine for large-scale data processing
    - http://cacm.acm.org/magazines/2016/11/209116-apache-spark/fulltext
    - http://www.kdnuggets.com/2015/11/introduction-spark-python.html
    - https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
    - #TALK A brief introduction to Distributed Computing with PySpark (PyData)
    - #TALK Connecting Python To The Spark Ecosystem
    - http://tech.marksblogg.com/billion-nyc-taxi-rides-spark-2-1-0-emr.html
    - http://ruhanixedu.com/blog/interview-question-and-answers/apache-spark-interview-questions-answers/
    - Text Normalization with Spark
    - Spark ML
        - MLlib: http://spark.apache.org/mllib/ and https://spark.apache.org/docs/latest/ml-guide.html
    - PySpark
    - Optimus
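A minimal PySpark word-count sketch (the canonical MapReduce-style example, run locally):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# flatMap splits lines into words, map emits (word, 1) pairs,
# reduceByKey sums the counts per word across partitions
lines = spark.sparkContext.parallelize(["a b a", "b c"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())  # e.g. [('a', 2), ('b', 2), ('c', 1)]
spark.stop()
```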
- #CODE Apache Storm
- #CODE Apache Arrow
- #CODE Blaze
# Databases
- SQL:
- #CODE SQLAlchemy
- #CODE Pyodbc
- #CODE ClickHouse
- NoSQL:
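A minimal SQLAlchemy sketch for the SQL entries above (2.0-style API; in-memory SQLite chosen only for illustration):

```python
from sqlalchemy import create_engine, text

# Swap the URL for your actual DBMS (PostgreSQL, ClickHouse via a driver, etc.)
engine = create_engine("sqlite:///:memory:")

with engine.begin() as conn:  # begin() commits automatically on success
    conn.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text("INSERT INTO users (name) VALUES (:n)"),
                 [{"n": "ada"}, {"n": "grace"}])
    rows = conn.execute(text("SELECT id, name FROM users")).fetchall()
    print(rows)  # [(1, 'ada'), (2, 'grace')]
```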
# Subtopics
# Open datasets (for ML, DL and DS)
See AI/DS and DataEng/Open ML data
# MLOps
# Feature engineering
- https://en.wikipedia.org/wiki/Feature_engineering
- Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of ML, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning
- http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
- https://tech.zalando.com/blog/feature-extraction-science-or-engineering/
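A small pandas sketch of typical manual feature engineering (columns are made up; calendar parts, one-hot encoding, and a log transform):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01 08:00", "2021-01-02 23:30"]),
    "city": ["Madrid", "Paris"],
    "price": [100.0, 250.0],
})

# Calendar features extracted from a raw timestamp
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek

# One-hot encode the categorical variable
df = pd.get_dummies(df, columns=["city"])

# Domain-driven transform: prices are often modeled on a log scale
df["log_price"] = np.log(df["price"])
```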
# Feature extraction
See AI/Feature learning techniques in AI/Computer Vision/Computer vision
# Data mining
- http://nbviewer.jupyter.org/github/ptwobrussell/Mining-the-Social-Web-2nd-Edition/tree/master/ipynb/
- https://www.dataquest.io/course/apis-and-scraping
# Web scraping
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
- https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa#.hrjljvffd
- https://antonio-maiolo.com/2016/12/01/web-crawler-scrapy-crawl-spider-tutorial/
- http://stackoverflow.com/questions/19021541/scrapy-scrapping-data-inside-a-javascript
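A minimal scraping sketch with requests + BeautifulSoup (static HTML only; JavaScript-rendered pages need Selenium, as the links above discuss):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()

# Parse the HTML and pull out every link target
soup = BeautifulSoup(resp.text, "html.parser")
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```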
# API
- A categorized public list of APIs from around the web
- A collective list of public JSON APIs for use in web development
- Public APIs
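A minimal sketch of calling a public JSON API (GitHub's REST API used here as an example endpoint):

```python
import requests

resp = requests.get("https://api.github.com/users/octocat", timeout=10)
resp.raise_for_status()

data = resp.json()  # the API returns a JSON document, parsed to a dict
print(data["login"], data["public_repos"])
```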
# Databases
- https://en.wikipedia.org/wiki/Distributed_database
- ACID (Atomicity, Consistency, Isolation, Durability)
- SQL vs NoSQL
# SQL
- https://en.wikipedia.org/wiki/SQL
- https://en.wikipedia.org/wiki/Relational_database
- A relational database is a digital database whose organization is based on the relational model of data.
- https://www.analyticsvidhya.com/blog/2017/01/46-questions-on-sql-to-test-a-data-science-professional-skilltest-solution/
- Tutorial and exercises
- SQL (basic, intermediate, advanced / pet problems)
- List of SQL Commands
- JOIN
- A SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-join) or more tables by using values common to each. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.
- https://periscopedata.com/blog//how-joins-work.html
- https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems
- Python interface
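A minimal sketch of a JOIN through Python's stdlib sqlite3 interface (made-up tables; LEFT JOIN keeps customers with no orders, INNER JOIN would drop them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0);
""")

# Combine columns from both tables using the values common to each
for row in cur.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
"""):
    print(row)  # ('ada', 9.5), ('ada', 20.0), ('grace', None)
```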
# NoSQL
- https://en.wikipedia.org/wiki/NoSQL
- Not only SQL: A NoSQL database provides a mechanism for storage and retrieval of data which is modeled by means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed.
- Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
- #TALK GOTO 2012 - Introduction to NoSQL - Martin Fowler
- Graph:
- A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with a single operation.
- Graph databases employ nodes, edges and properties.
- Nodes represent entities/items you might want to keep track of (people, businesses, accounts).
- Edges, also known as graphs or relationships, are the lines that connect nodes to other nodes; they represent the relationship between them.
- Properties are pertinent information that relates to nodes (similar to keywords).
- AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
- https://neo4j.com/developer/graph-database/
- Key-value:
- https://en.wikipedia.org/wiki/Key-value_database
- A key-value store, or key-value database, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash.
- Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database.
- Document-oriented database
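A minimal illustration of the key-value paradigm above using Python's stdlib shelve, a persistent dictionary-like store (real stores such as Redis follow the same get/put-by-key model; since values here are arbitrary objects, it also hints at the document-oriented style):

```python
import shelve

# Records are stored and retrieved by a key that uniquely identifies them
with shelve.open("kvstore") as db:
    db["user:1"] = {"name": "ada", "interests": ["math", "computing"]}
    record = db["user:1"]
    print(record["name"])  # ada
```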
# Data munging
# Data preparation
- Data cleansing: Missing data
- Variable encoding
- Normalisation, scaling
- Outlier detection
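A compact pandas sketch covering the four steps above (data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "country": ["ES", "FR", None],
                   "income": [30000.0, 52000.0, 61000.0]})

# Missing data: impute numeric with the median, categorical with a sentinel
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna("unknown")

# Variable encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["country"])

# Normalisation/scaling: standardize income to zero mean, unit variance
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Outlier detection: flag values beyond 3 standard deviations
df["age_outlier"] = (df["age"] - df["age"].mean()).abs() > 3 * df["age"].std()
```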
# Exploratory data analysis
- https://www.codementor.io/jadianes/data-science-python-r-exploratory-data-analysis-visualization-du107jjms
- http://blog.districtdatalabs.com/data-exploration-with-python-2
- https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
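A first-pass EDA checklist in pandas (input file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input

print(df.shape)                    # rows x columns
print(df.dtypes)                   # column types
print(df.isna().sum())             # missing values per column
print(df.describe(include="all"))  # summary statistics
print(df.corr(numeric_only=True))  # pairwise numeric correlations
```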
# Big data
- http://www.datasciencecentral.com/profiles/blogs/25-big-data-terms-you-must-know-to-impress-your-date-or-whoever
- Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest
# MapReduce
- https://en.wikipedia.org/wiki/MapReduce
- MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
- A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)
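A pure-Python sketch of the two phases using the student example above (a real framework distributes the map and reduce steps across the cluster and handles the shuffle between them):

```python
from collections import defaultdict

students = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# Map: emit a (key, value) pair per record -- here, (name, 1)
mapped = [(name, 1) for name in students]

# Shuffle: group values by key (done by the framework between phases)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: summarize each group -- here, the count per name
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'Alice': 3, 'Bob': 2, 'Carol': 1}
```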