For a long time, being a data scientist who worked with large datasets and/or models meant mastering two sets of tools: one for local work and one for "big data".
scikit-learn makes it easy to work on your local machine, but it can't handle anything too big to fit in RAM. Once the data gets too big, or training too costly, you have to move to a "big data" tool that pools the resources of several machines to get the job done. This has traditionally meant Apache Spark, which, though powerful, requires learning a brand-new API and perhaps even a brand-new language (performant Spark code is written in Scala).
Enter Dask. Dask is a distributed ETL tool that's tightly integrated into the Python data science ecosystem. Dask is extremely popular among data scientists because its core API mirrors a subset of the familiar pandas and NumPy APIs. This flattens the learning curve considerably: most Pythonistas can be productive with Dask almost immediately.
As part of its RAPIDS initiative, NVIDIA is going one step further, partnering with the community to build an ecosystem for distributed data science on GPUs on top of Dask. Its new
cudf Python package already boasts some pretty impressive results, like this one from Capital One Labs showing a log-scale speedup for an internal ETL job that was previously run on CPU.
This blog post, the first of two exploring this emerging ecosystem, is an introduction to distributed ETL using the dask, cudf, and dask_cudf APIs. We build the following mental map of the ecosystem: