Getting Oriented in the RAPIDS Distributed ML Ecosystem, ETL
This blog post, the first of two exploring this emerging ecosystem, is an introduction to distributed ETL using the dask, cudf, and dask_cudf APIs.

For a long time, being a data scientist who worked with large datasets or large models meant mastering two sets of tools: one for local work and one for "big data". pandas, numpy, and scikit-learn make it easy to work on your local machine, but they can’t handle anything too big to fit in RAM. Once the data gets too big, or training gets too costly, you have to move to a "big data" tool that pools the resources of several machines to get the job done. Traditionally this meant Apache Spark, which, though powerful, requires learning a brand new API and maybe even a brand new language (performant Spark code is typically written in Scala).

Enter Dask. Dask is a distributed ETL tool that’s tightly integrated into the Python data science ecosystem. Dask is extremely popular among data scientists because its core API is a subset of the pandas, numpy, and scikit-learn APIs. This flattens the learning curve considerably: most Pythonistas can be productive with Dask almost immediately.

As part of its RAPIDS initiative, NVIDIA is going one step further, partnering with the community to build an ecosystem for distributed data science on GPUs on top of Dask. Its new cudf Python package already boasts some impressive results, such as one from Capital One Labs showing a log-scale speedup for an internal ETL job that previously ran on CPU.

Along the way, we build a mental map of the ecosystem.

Author: @ResidentMario. Building tools for doing data science @spellrun. {📊, 💻, 🛠️}. Previously: @quiltdata, @recursecenter, @Kaggle, @MODA-NYC.