Developers
August 3, 2020

Genomics Analysis With Hail, BigQuery, and Dataproc

Google Cloud Platform offers healthcare tools that help researchers advance treatments and pharmaceutical development.

Today we will talk about genomics data analysis and how Google Cloud helps organizations conduct research that advances treatments and pharmaceuticals.

Let's start by describing what Hail is. Hail is an open-source, Python-based data analysis library that provides data types and methods for working with genomic data on Apache Spark.

Hail is designed and built to scale with your applications. The Hail team made the software available to the community under the MIT license. This makes Hail a natural complement to the Google Cloud Life Sciences suite for processing genomics data.

What does Dataproc do? It is a managed cloud service that makes open-source analytics tools, including Apache Spark, available in a fast, easy, and secure way, which helps accelerate data science work.

Google Cloud Platform and Healthcare

What makes Google Cloud different from other cloud computing platforms is that it offers healthcare tools. These are the same tools used for genomic data analysis.

Genotype data is combined with phenotype data from electronic health records, device data, and medical notes. Google Cloud-based analysis platforms such as AI Platform Notebooks and Dataproc Hub let researchers work together safely and reliably.

Pip installations of Hail come bundled with a command-line tool, hailctl. It has a submodule named dataproc for working with Dataproc clusters.

To get started with Dataproc and Hail, go to the Google Cloud console and click the Cloud Shell icon at the top of the console window. Cloud Shell gives you command-line access to your cloud resources directly from your browser, so you don't have to install tools on your local system.

To install Hail, simply type “sudo pip3 install hail”. After the installation, create a Dataproc cluster configured for Apache Spark and Hail by using the following command:

“hailctl dataproc start my-first-hail-cluster --region={REGION} --project={PROJECT-ID}”
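If you prefer to fill in the placeholders from a script rather than by hand, the same command can be assembled in Python. This is only a minimal sketch: the function names are hypothetical, the region and project values are example placeholders, and it assumes hailctl is already installed and on your PATH.

```python
import subprocess


def build_start_command(cluster_name, region, project_id):
    """Return the argument list for the hailctl command shown above."""
    return [
        "hailctl", "dataproc", "start", cluster_name,
        f"--region={region}",
        f"--project={project_id}",
    ]


def start_cluster(cluster_name, region, project_id):
    """Run the command; raises CalledProcessError if hailctl exits with an error."""
    subprocess.run(build_start_command(cluster_name, region, project_id), check=True)
```

Calling start_cluster("my-first-hail-cluster", "us-central1", "my-project") would then run the same command as above with those example values substituted.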

After the Dataproc cluster is created, click the “Open Editor” button in Cloud Shell. This takes you directly to the built-in editor, where you can create and modify your Hail script.

The Cloud Shell Editor saves the file automatically, but just in case, you can also save it from the menu. After saving the file, return to the command-line terminal by clicking the “Open Terminal” button.
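The file you save in the editor can be a very small Hail script. The sketch below is an illustration, not the original article's script: it uses Hail's built-in Balding-Nichols simulation so that no input data is needed, and the function name and default sizes are assumptions.

```python
def simulate_counts(n_populations=3, n_samples=50, n_variants=100):
    """Simulate a small genotype dataset with Hail and return its dimensions."""
    # Imported inside the function so the file can be loaded where Hail is absent.
    import hail as hl

    # Balding-Nichols simulation: builds a synthetic MatrixTable, no input files.
    mt = hl.balding_nichols_model(
        n_populations=n_populations,
        n_samples=n_samples,
        n_variants=n_variants,
    )
    # count() returns a (number of variants, number of samples) tuple.
    return mt.count()
```

Saved under a hypothetical name such as hail_script.py, with a final line that prints simulate_counts(), the script could be sent to the cluster with “hailctl dataproc submit my-first-hail-cluster hail_script.py”. Depending on your gcloud configuration, you may also need to specify the region.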

Creating a Dataproc Hub Environment for Hail

As mentioned earlier, as of Hail version 0.2.15, pip installations come bundled with hailctl, a command-line tool that has a submodule called dataproc for working with Google Dataproc clusters. This includes a fully configured notebook environment that you can open simply by calling: "hailctl dataproc connect CLUSTER_NAME notebook".

To use the notebook features that are unique to Dataproc, you need the Dataproc initialization action, which provides a standalone version of Hail. Creating a Dataproc cluster that runs Hail within a Dataproc environment takes a single command. After you create the cluster, click on it, choose the Web Interfaces tab, and follow the Component Gateway link to open JupyterLab.

You can run a genome-wide association study (GWAS) directly in BigQuery by using SQL logic to push the processing down into BigQuery. You can later bring the query results into a Pandas DataFrame in a notebook.
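To make "pushing the processing down" more concrete, here is a hedged sketch of one common per-variant statistic, the chi-squared test on a 2x2 table of allele counts in cases and controls. The table name, the column names, and the choice of statistic are all assumptions for illustration; the source does not say which test it uses. The SQL string shows the arithmetic as BigQuery would run it, and the Python function computes the same formula so you can check a value by hand.

```python
# Hypothetical table with one row per variant and four allele-count columns.
CHI_SQUARED_SQL = """
SELECT
  variant_id,
  SAFE_DIVIDE(
    (case_ref + case_alt + control_ref + control_alt)
      * POW(case_ref * control_alt - case_alt * control_ref, 2),
    (case_ref + case_alt) * (control_ref + control_alt)
      * (case_ref + control_ref) * (case_alt + control_alt)
  ) AS chi_squared
FROM `my-project.my_dataset.allele_counts`
"""


def allelic_chi_squared(case_ref, case_alt, control_ref, control_alt):
    """Chi-squared statistic for a 2x2 allele-count table (1 degree of freedom).

    Returns None when a row or column total is zero, mirroring SAFE_DIVIDE.
    """
    total = case_ref + case_alt + control_ref + control_alt
    denominator = (
        (case_ref + case_alt)
        * (control_ref + control_alt)
        * (case_ref + control_ref)
        * (case_alt + control_alt)
    )
    if denominator == 0:
        return None
    cross_difference = case_ref * control_alt - case_alt * control_ref
    return total * cross_difference ** 2 / denominator
```

With this layout the heavy arithmetic stays in BigQuery, and only one small row per variant needs to travel back to the notebook.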

Many tutorials are available. One explains how to perform GWAS analysis in BigQuery from a notebook. Another explains BigQuery ML, a feature of BigQuery that lets you run basic regression and clustering techniques using SQL queries.

BigQuery is usually used for the preliminary steps of GWAS, including feature engineering, cohort definition, and descriptive analysis. We can take a look at how it works by using the 1000 Genomes variant data hosted in BigQuery public datasets.

Such a query populates a Pandas DataFrame with basic information about the 1000 Genomes project samples. You can then run Python and Pandas functions to review, plot, and understand the data available for the cohort. By using the Apache Spark BigQuery connector from Dataproc, BigQuery becomes another source to read and write data.
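A query of this kind could look like the sketch below. It assumes the google-cloud-bigquery client library (with its pandas support) is installed in the notebook environment, and it assumes the sample metadata lives in the public table named in SAMPLE_INFO_TABLE. Please verify that table name in the BigQuery console before relying on it. The function names are hypothetical.

```python
# Assumed location of the 1000 Genomes sample metadata; verify before use.
SAMPLE_INFO_TABLE = "bigquery-public-data.human_genome_variants.1000_genomes_sample_info"


def build_sample_query(limit=100):
    """Return SQL that selects a limited number of sample metadata rows."""
    return f"SELECT *\nFROM `{SAMPLE_INFO_TABLE}`\nLIMIT {int(limit)}"


def load_samples(client=None, limit=100):
    """Run the query and return the result as a Pandas DataFrame."""
    if client is None:
        # Imported lazily so the file loads even where the library is missing.
        from google.cloud import bigquery

        client = bigquery.Client()
    return client.query(build_sample_query(limit)).to_dataframe()
```

In a notebook, df = load_samples() followed by df.describe() or df.head() gives a quick first look at the cohort.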

In conclusion, genomics analysis on Google Cloud draws on several technologies, including Hail, BigQuery, and Dataproc. Google Cloud differs from other cloud platforms in that it offers healthcare tools, and it is widely used by researchers and scientists for genomics and other scientific investigations. Hail is an open-source Python data analysis library, used for genomics and other purposes, that is built to scale with your applications. Together, these tools help organizations conduct research, such as genomics data analysis, that advances treatments and pharmaceuticals.

Tags: GCP, Genomics, Data Analysis
Lucas Bonder
Technical Writer
Lucas is an Entrepreneur, Web Developer, and Article Writer about Technology.
