Developers
July 28, 2020

Ray vs Dask vs Celery: The Road to Parallel Computing in Python

Multiple frameworks are making Python a parallel computing juggernaut.

Python consistently ranks as one of the most popular programming languages in existence. In fact, since 2003, it has stayed in the top ten most popular languages, according to the TIOBE Programming Community Index. At the time of writing, Python sits at the third spot on the list.

There are a number of reasons for Python’s popularity. As an interpreted language, Python is relatively easy to learn, especially when compared with languages such as C, C++ or Java. Because it’s interpreted, development is often faster, as there is no need to recompile the application to test new features or code.

Python’s straightforward approach is another significant factor in its popularity. Unlike many languages that emphasize creativity, or multiple paths to the same destination, Python emphasizes the idea that “there should be one-- and preferably only one --obvious way to do it.” This approach is best described in the Zen of Python document:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than right now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!

For programmers just getting started, this approach can make it easier to pick up the language and start being productive, rather than spending time trying to choose between a bunch of different ways to accomplish a task.

Another significant factor is Python’s extensibility. Python creator Guido van Rossum designed Python around a relatively small core, with the ability to extend it via modules and libraries.

Parallel Computing

Parallel computing represents a significant upgrade in the performance ceiling of modern computing. Traditionally, software tended to be sequential—completing a single task before moving on to the next.

Parallel computing, on the other hand, allows large tasks to be broken into smaller chunks and enables multiple tasks to be accomplished simultaneously, significantly speeding up computational performance. Concurrent programming is a related concept, defined by a system's ability to work on multiple tasks that may be completely unrelated or out of order.

While Python does have a built-in multiprocessing module, it has a number of limitations: it is designed for a single machine, offers no built-in fault tolerance, and requires data passed between processes to be serialized. Given the advantages parallel computing provides, it’s not surprising there are several options designed to add such abilities to Python. Three of the common ones are Ray, Dask and Celery.
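For comparison, here is the built-in multiprocessing module at its simplest, a minimal sketch using a worker pool to map a function over a range of inputs in parallel (the `square` function is just an illustration):

```python
from multiprocessing import Pool

def square(x):
    # any CPU-bound function works here; each call runs in a worker process
    return x * x

if __name__ == "__main__":
    # distribute the calls across four worker processes
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This works well on one machine, but the frameworks below pick up where it leaves off.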

Ray

According to its GitHub page, Ray is “a fast and simple framework for building and running distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.”

Ray solves a number of the issues with Python’s built-in multiprocessing module, including adding the ability to run the same code on multiple machines, handling machine failures, scaling easily from a single computer to a full-scale cluster and much more.

Dask

Dask is another parallel computing library, with a special focus on data science. Python has become one of the most popular languages for data science applications, but the built-in libraries are primarily designed for single computer use.

Dask, on the other hand, is designed to mimic the APIs of Pandas, Scikit-Learn, and Numpy, making it easy for developers to scale their data science applications from a single computer on up to a full cluster.

As its web page highlights:

”Python includes computational libraries like Numpy, Pandas, and Scikit-Learn, and many others for data access, plotting, statistics, image and signal processing, and more. These libraries work together seamlessly to produce a cohesive ecosystem of packages that co-evolve to meet the needs of analysts in most domains today.

”This ecosystem is tied together by common standards and protocols to which everyone adheres, which allows these packages to benefit each other in surprising and delightful ways.

”Dask evolved from within this ecosystem. It abides by these standards and protocols and actively engages in community efforts to push forward new ones. This enables the rest of the ecosystem to benefit from parallel and distributed computing with minimal coordination. Dask does not seek to disrupt or displace the existing ecosystem, but rather to complement and benefit it from within.”

Celery

Celery is a distributed, asynchronous task queue. Celery allows tasks to be completed concurrently, either asynchronously or synchronously. While Celery is written in Python, the protocol can be used in other languages.

Celery is used in some of the most data-intensive applications, including Instagram. As such, Celery is extremely powerful but also can be difficult to learn.

Which Should You Choose?

Each of these libraries offers similarities and differences. Ray may be the easiest choice for developers looking to build general-purpose distributed applications. Dask can also be used for general-purpose work but really shines in the realm of data science. Meanwhile, Celery has firmly cemented itself as a distributed computing workhorse.

Tags: Parallel Computing, Python, Ray, Dask, Celery
Matt Milano
Technical Writer
Matt is a tech journalist and writer with a background in web and software development.
