Developers
July 14, 2020

Google’s Data Analytics Service Dataproc Introduces Spark 3 and Hadoop 3

GCP is launching Dataproc image version 2.0, the latest set of open source software that offers fully configured autoscaling clusters.

Today we will talk about Dataproc, a Google Cloud data analytics service that makes open-source data analytics fast and simple in the cloud.

The clusters provided are autoscaling and come fully configured. They start in less than 90 seconds on custom machine types. This is why many choose Dataproc for open-source data analytics.

Dataproc uses images to define the bundle of open-source software installed on Hadoop and Spark clusters. The service also offers optional components that extend the bundle with other open-source technologies.

Clusters can be customized with your own configurations and deployed quickly using initialization actions. There is an entire GitHub repository of ready-made initialization actions that can help with the installation process.
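As a rough sketch, creating a Dataproc image 2.0 cluster that runs an initialization action at startup might look like the following. The cluster name, region, and script path are illustrative placeholders, not values from the article:

```shell
# Create a Dataproc image 2.0 cluster and run a startup script on each node.
# "my-spark3-cluster" and the gs:// path are placeholders for your own values.
gcloud dataproc clusters create my-spark3-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --initialization-actions=gs://my-bucket/my-init-action.sh
```

The initialization action is an ordinary shell script stored in Cloud Storage; Dataproc runs it on every node as the cluster comes up, which is how extra packages or configuration get installed.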

Use Apache Spark 3 in preview

Apache Spark 3 is the newest generation of Apache Spark. It is already available as open source, but it is not yet recommended for production workloads. If you don't want to adopt Spark 3 yet, one option is to try it out on isolated clusters running Dataproc image version 2.0 while you migrate your data.

Spark 3 has been built for one thing above all: high performance. The way it processes data changes from the previous version, helping developers work to a higher standard.

Spark 3 brings three performance improvements: adaptive query execution, dynamic partition pruning, and GPU acceleration. Adaptive query execution re-optimizes a query while it is running, using runtime statistics that previously were not available at planning time. Dynamic partition pruning avoids unnecessary data scans in the common schema of a single fact table joined to many dimension tables. GPU acceleration, developed in collaboration with NVIDIA, lets Spark offload work to GPUs.
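To make the dynamic partition pruning idea concrete, here is a minimal pure-Python sketch of the concept, not Spark's actual implementation: the filter on the dimension table is evaluated first, and only fact-table partitions whose keys survive that filter are ever scanned. All table names and values below are invented for illustration:

```python
# Conceptual sketch of dynamic partition pruning (illustration only).
# A fact table partitioned by date_key; in a real system each partition
# would be a set of files on disk or in object storage.
fact_partitions = {
    "2020-07-01": [("2020-07-01", "SKU1", 10), ("2020-07-01", "SKU2", 5)],
    "2020-07-02": [("2020-07-02", "SKU1", 7)],
    "2020-07-03": [("2020-07-03", "SKU3", 2)],
}

# A small dimension table with an attribute we filter on at runtime.
dim_dates = [
    {"date_key": "2020-07-01", "is_holiday": False},
    {"date_key": "2020-07-02", "is_holiday": True},
    {"date_key": "2020-07-03", "is_holiday": False},
]

def join_with_pruning(partitions, dim_rows, dim_filter):
    # 1. Evaluate the dimension-side filter first.
    surviving_keys = {row["date_key"] for row in dim_rows if dim_filter(row)}
    # 2. Prune: only scan fact partitions whose key survived the filter.
    scanned = [rows for key, rows in partitions.items() if key in surviving_keys]
    # 3. Return the joined (here: concatenated) rows; pruned partitions
    #    were never read at all.
    return [row for part in scanned for row in part]

rows = join_with_pruning(fact_partitions, dim_dates, lambda r: r["is_holiday"])
print(rows)  # only rows from the 2020-07-02 partition are ever read
```

The point of the technique is step 2: without pruning, all three partitions would be scanned and most rows discarded after the join.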

High performance is the main focus of this release, but not the only one. It also enables dynamic scaling, and you will be able to run Dataproc jobs on Google Kubernetes Engine (GKE). Many developers choose this migration option to work with Spark 3.

Spark 3 also brings deprecations: parts of the project that are no longer recommended for use. Users constantly give feedback on any service in any industry, and it is the responsibility of software providers to listen to the market, even against their own instincts. Something might look good on paper, but if customers reject it in practice, it is time to move on.

The older RDD-based MLlib API has been deprecated. It still works, but it will no longer be updated or supported. Google recommends moving away from it when you migrate to Spark 3. You can also consider using a deep learning model, as Spark 3 powers deep learning workloads.

GraphX has been deprecated in favor of a new component, SparkGraph. The new component is based on the Cypher query language, a much richer offering than GraphX.

The original DataSource API has been deprecated and replaced by DataSource V2, which unifies the way Spark writes to multiple sources.

Python 2.7 support has been deprecated in favor of Python 3. It is reasonable that users prefer the latest version of the language, as it keeps them in tune with the community and with current standards.
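For teams still carrying Python 2.7 code, the move mostly means a handful of mechanical changes. A few of the most common ones, written as valid Python 3:

```python
# Common Python 2.7 -> Python 3 changes (illustrative examples).

# print is a function, not a statement.
print("hello")             # Python 2: print "hello"

# / is true division in Python 3; use // for the old floor behaviour.
assert 7 / 2 == 3.5        # Python 2: 7 / 2 == 3
assert 7 // 2 == 3

# Strings are Unicode by default; bytes conversions are explicit.
s = "café"                 # already Unicode text in Python 3
b = s.encode("utf-8")      # explicit conversion to bytes
assert b.decode("utf-8") == s
```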

Hadoop 3 is available

Another release in the Dataproc image version is Hadoop 3. This release rests on two pillars: HDFS and YARN.

HDFS (Hadoop Distributed File System) is a file system built to run on low-cost hardware. In most cloud use cases, HDFS storage will be replaced by Cloud Storage.

YARN is used for scheduling resources within a cluster. Many Hadoop customers are asking for the same resource management in cloud services. Dataproc currently offers job-scoped clusters sized to the task at hand, which is more cost-effective than configuring a cluster with more capacity than needed.
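One way the job-scoped pattern tends to look in practice is a create/submit/delete cycle, sketched below. The cluster name, region, worker count, and job path are placeholders, not values from the article:

```shell
# Ephemeral, job-scoped cluster: create it sized for one job, run the job,
# then tear it down. All names and paths are illustrative placeholders.
gcloud dataproc clusters create one-job-cluster \
    --region=us-central1 --image-version=2.0 --num-workers=2

gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
    --cluster=one-job-cluster --region=us-central1

gcloud dataproc clusters delete one-job-cluster \
    --region=us-central1 --quiet
```

Because the cluster exists only for the duration of the job, you pay for exactly the capacity the task needs rather than keeping an oversized cluster idle.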

When migrating your infrastructure to Google Cloud, it is recommended to keep all your current tools and processes working, then adopt new cloud methodologies gradually, as they are needed.

While Hadoop 3 is designed for specific use cases, some features have drawn, and will keep drawing, user interest: native support for GPUs in the YARN scheduler, and YARN containerization.
In conclusion, Google's Dataproc image version 2.0 brings two major introductions: Spark 3 and Hadoop 3. Spark 3 was built with one purpose in mind, high performance, and its performance features include adaptive query execution, dynamic partition pruning, and GPU acceleration. Hadoop 3, in turn, gives users the ability to manage their data on low-cost hardware and rests on two pillars: HDFS, used for file storage, and YARN, used to schedule resources within a cluster. When migrating your infrastructure to Google Cloud, keep all your currently used tools and processes working.

Tags: GCP, Data Analytics Service, Dataproc, Spark, Hadoop
Lucas Bonder
Technical Writer
Lucas is an Entrepreneur, Web Developer, and Article Writer about Technology.
