
Downloading Kaggle Data Sets using the Linux Command Line

The following tutorial covers how to download datasets from Kaggle using Linux command-line tools on a virtual machine running on Google Cloud Platform.

This tutorial breaks the task down into the following steps:

In Kaggle: Download API Token file and then locate a data set

In Google Cloud:

  • Create a Compute Engine VM
  • Upload API Token file
  • Set up Python environment
  • Download Kaggle Data set
  • Unzip Kaggle Data set
  • Copy files to Google Cloud Storage
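
The tutorial walks through these steps from the Linux command line. As a rough preview, the download and unzip steps can also be scripted with the Kaggle Python package; the sketch below is illustrative only, uses a placeholder dataset identifier, and assumes the API Token file (kaggle.json) has already been uploaded to ~/.kaggle/ on the VM.

  # Illustrative sketch: download and unzip a Kaggle dataset from Python.
  # Assumes "pip install kaggle" has been run and that the API Token file
  # has been copied to ~/.kaggle/kaggle.json (with permissions set to 600).
  from kaggle.api.kaggle_api_extended import KaggleApi

  api = KaggleApi()
  api.authenticate()  # reads ~/.kaggle/kaggle.json

  # "owner/dataset-name" is a placeholder; substitute the <owner>/<dataset>
  # slug shown on the Kaggle page of the data set you located.
  api.dataset_download_files("owner/dataset-name", path="./data", unzip=True)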

If you would prefer a video presentation of these materials, please use this link to YouTube.

Prerequisites

Before starting this tutorial, make sure you create an account on Kaggle.com and then log in to your Kaggle.com account.

Log in to your Google Cloud Platform account and make sure you have enabled the Compute Engine API.

Tutorial Sections

Get started on the tutorial by downloading the Kaggle API Token file as described on the next page.


Running Jupyter Notebook on Google Cloud Platform Dataproc

Posted on January 16, 2024

The purpose of this tutorial is to demonstrate setting up Jupyter Notebook to run on the Google Cloud Platform service called Dataproc.

Google Cloud Dataproc is Google’s implementation of the Hadoop ecosystem that includes the Hadoop Distributed File System (HDFS) and the MapReduce and Spark processing frameworks. The Google Cloud Dataproc system also includes several applications, such as Hive, Mahout, Pig and Hue, that are built on top of Hadoop.

Apache Spark is a processing framework that operates on top of the Hadoop Distributed File System (HDFS) (as well as other data stores). It features interactive shells that can launch distributed process jobs across a cluster. Spark supports programming language interfaces for Scala, Java, Python and SQL. PySpark is the Python interface to Spark.
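
To make the PySpark interface concrete, here is a minimal, illustrative sketch of the kind of code that might run in a notebook cell on a Dataproc cluster; the input path is a placeholder and is not specific to this tutorial's examples.

  # Minimal PySpark sketch: create (or reuse) a SparkSession and read a file.
  from pyspark.sql import SparkSession

  # On Dataproc a SparkSession is usually already available as "spark";
  # building it explicitly also works in a standalone script.
  spark = SparkSession.builder.appName("dataproc-example").getOrCreate()

  # Placeholder path: Dataproc clusters can read from Cloud Storage (gs://)
  # as well as from HDFS on the cluster.
  df = spark.read.csv("gs://your-bucket/data.csv", header=True, inferSchema=True)
  df.printSchema()
  print(df.count())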

Prerequisites

Before starting this tutorial, make sure the following prerequisites are in place.

This tutorial assumes you have a basic understanding of Hadoop and Spark as well as some programming experience with Python and PySpark.

This tutorial also assumes you already have a Google Cloud Platform account set up and funded. If you do not have a Google Cloud Platform account set up, please follow these instructions first. Within GCP you will have at least one “Project” with a Project Name and Project ID. In this tutorial, the example project name is “My First Project” and the Project ID is: ‘handy-bonbon-142723’. Make a note of your Project Name and Project ID and use your own in place of what you see in the examples in this tutorial.

Google Cloud Platform operates using a series of Application Programming Interfaces (APIs). Most GCP services have an associated API that must be enabled before you can use the service. For the Dataproc service, make sure you have the following APIs enabled:

  • Cloud Dataproc API
  • Compute Engine API
  • Cloud Resource Manager API

You can enable APIs by visiting the APIs & Services console page.

These instructions were created in January 2024. Some configuration pages may have changed since then.

Topics Outline

Running Jupyter Notebook on Google Cloud Dataproc requires the following main steps:

  • Enable the Google Cloud Compute Engine API and Cloud Dataproc API
  • Create, Configure and Launch a Google Cloud Dataproc cluster
  • Create, Upload, Download and work with Jupyter Notebooks
  • Shut down cluster and remove any temporary resources
  • Create a Dataproc cluster with Google Cloud Shell

Each of these topics will be covered in the sections that follow.


Running Jupyter Notebook on Amazon EC2

Introduction

This tutorial is current as of October 2023.

Amazon Elastic Compute Cloud (EC2) is a service for hosting virtual machines in the Amazon Web Services cloud. An EC2 instance can be created with a variety of hardware configurations (CPU, disk, memory) and operating systems.

Jupyter Notebooks are a popular way to write software for Python and PySpark. A notebook can be hosted in a web browser and can provide code and documentation cells for individual parts of the software.

This tutorial covers how to install and configure Jupyter Notebook so that it will run on an EC2 instance with Amazon Linux.

This tutorial assumes you have an Amazon Web Services Account and that you can log in and view the AWS Management Console.
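
Whether or not this exact approach appears later in the tutorial, one step that commonly comes up when configuring Jupyter on a remote host such as an EC2 instance is generating a hashed password for the Jupyter configuration file. The snippet below is a small, version-hedged sketch of that step.

  # Illustrative sketch: generate a hashed password to paste into
  # jupyter_notebook_config.py (or jupyter_server_config.py) on the instance.
  try:
      from jupyter_server.auth import passwd   # Jupyter Server / Notebook 7+
  except ImportError:
      from notebook.auth import passwd          # classic Notebook < 7

  # Prints a hash such as 'argon2:...' or 'sha1:...' depending on the version.
  print(passwd("choose-a-strong-password"))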


The first step is to create an EC2 instance (if you do not already have one); the instructions follow on the next page.


Running an Amazon EMR Cluster in the AWS Academy Data Engineering Sandbox

Introduction

This is a brief tutorial about how to run an Amazon Elastic Map Reduce (EMR) cluster in the AWS Academy Sandbox environment.

Before you begin, make sure you are completely logged out of your personal AWS Management Console. Or, use a different web browser to log in to AWS Academy.

Make sure you have accepted your professor’s invitation to join the Data Engineering course on AWS Academy.

Please note that it can take 5 to 10 minutes to start up the Sandbox and then another 15 to 20 minutes to start up an EMR cluster.


Reverse Engineering a Google BigQuery Schema with Dataedo 10

Introduction

The purpose of this tutorial is to demonstrate how to reverse engineer a Google BigQuery dataset into a physical level diagram using the Dataedo version 10 software. Reverse Engineering is the process of reconstructing a physical and/or ER model (diagram) from an existing database schema. Reverse Engineering can be used to provide documentation on an existing application, provide applications developers with a better overall picture of the schema and provide better guidance when performing changes to the schema.

Dataedo is a powerful software tool that is able to connect to an extensive number of different database management systems including cloud-based databases like AWS Redshift and Google BigQuery. Currently, Dataedo runs only on the Microsoft Windows operating system.

Tutorial Pre-Requisites

This tutorial uses Google BigQuery as the source database management system. You should also have a Google Cloud Platform (GCP) Project set up and associated with a billing account. Be sure to note your Project id and GMail account associated with the GCP account. After setting up your account, enable the Cloud Resource Manager API using the APIs and Services menu in the GCP Console.

The BigQuery dataset to be reverse engineered should be located under your GCP Project. If you currently do not have any data sets, consider copying one of the free open source data sets into your own BigQuery project. For example, this tutorial makes use of the CMS Synthetic Patient Data OMOP Data set [Link], which has the name cms_synthetic_patient_data_omop and is available through BigQuery Public datasets. Use the BigQuery console to copy this data set into your own Project.

Connecting to Google BigQuery from Dataedo requires a Service Account and Key File. This tutorial [link] provides instructions on how to create a Service Account and Key File. If you have not created your Service Account and Key File, please do that before you move on with the rest of this tutorial. Make sure you store your Key File in a secure location. This tutorial assumes you have the Key File stored in your Documents folder. In Windows this would be the c:\users\username\Documents folder (where “username” is your Windows user name). Use Windows File Explorer to view the Documents folder and confirm your Key File has been downloaded.

Before starting this tutorial, download and install the Dataedo software. Currently (January 2023), Dataedo offers a free 14-day trial license. Sign up for the free trial and complete the e-mail verification. During the sign-up process, select the operating system (Mac, Windows or Linux) that you are using. Then the Download page should appear as shown below. Only the Dataedo Desktop is required.

After downloading the file (such as Dataedo_Desktop_10.4.0_setup.exe) to your downloads folder, double-click on the file to launch the installer. Follow all of the default steps to install Dataedo.

The first step in working with Dataedo is to establish a repository for database metadata. The steps to create a simple single user repository in Dataedo are provided on the next page.


Reverse Engineering a Google BigQuery Schema with DbSchema

Introduction

The purpose of this tutorial is to demonstrate how to reverse engineer a Google BigQuery dataset into a physical level diagram using the DbSchema software. Reverse Engineering is the process of reconstructing a physical and/or ER model from an existing schema. Reverse Engineering can be used to provide documentation on an existing application, provide applications developers with a better overall picture of the schema and provide better guidance when performing changes to the schema.

DbSchema is a powerful software tool that is able to connect to an extensive number of different database management systems including cloud-based databases like AWS Redshift and Google BigQuery. The DbSchema software is supported on MS Windows, Mac OSX and Linux. DbSchema offers a free license version that has limited features.

Tutorial Pre-Requisites

Before starting this tutorial, download and install the DbSchema software for your operating system.

You should also have a Google Cloud Platform (GCP) Project set up and associated with a billing account. Be sure to note your Project id and GMail account associated with the GCP account. The BigQuery dataset to be reverse engineered should be located under your GCP Project. If you currently do not have any data sets, consider copying one of the free open source data sets into your own BigQuery project. For example, this tutorial makes use of the CMS Synthetic Patient Data OMOP Data set that has the name: cms_synthetic_patient_data_omop. This data set is available through BigQuery Public datasets. Use the BigQuery console to copy this data set into your own Project.

Connecting to Google BigQuery from DbSchema requires a Service Account and Key File. This tutorial [link] provides instructions on how to create a Service Account and Key File. If you have not created your Service Account and Key File, please do that before you move on with the rest of this tutorial. Make sure you store your Key File in a secure location. This tutorial assumes you have the Key File stored in your Documents folder. In Windows this would be the c:\users\username\Documents folder (where “username” is your Windows user name). In MacOSX, this will be the /Users/username/Documents folder. In Linux this will be the /home/username/Documents folder.

With these pre-requisites in place, the reverse engineering process can be started by connecting to a BigQuery dataset. These steps are described on the next page.


Reverse Engineering a Google BigQuery Schema with Dataedo 9

Introduction

The purpose of this tutorial is to demonstrate how to reverse engineer a Google BigQuery dataset into a physical level diagram using the Dataedo Version 9 software. A new tutorial covers version 10. Reverse Engineering is the process of reconstructing a physical and/or ER model from an existing schema. Reverse Engineering can be used to provide documentation on an existing application, provide applications developers with a better overall picture of the schema and provide better guidance when performing changes to the schema.

Dataedo is a powerful software tool that is able to connect to an extensive number of different database management systems including cloud-based databases like AWS Redshift and Google BigQuery. Dataedo runs on Microsoft Windows.

Tutorial Pre-Requisites

Before starting this tutorial, download and install the Dataedo software. Currently (February 2022), Dataedo offers a free 14-day trial license.

You should also have a Google Cloud Platform (GCP) Project set up and associated with a billing account. Be sure to note your Project id and GMail account associated with the GCP account. After setting up your account, enable the Cloud Resource Manager API using the APIs and Services menu in the GCP Console.

The BigQuery dataset to be reverse engineered should be located under your GCP Project. If you currently do not have any data sets, consider copying one of the free open source data sets into your own BigQuery project. For example, this tutorial makes use of the CMS Synthetic Patient Data OMOP Data set that has the name: cms_synthetic_patient_data_omop. This dataset is available through BigQuery Public datasets. Use the BigQuery console to copy this data set into your own Project.

Connecting to Google BigQuery from Dataedo requires a Service Account and Key File. This tutorial [link] provides instructions on how to create a Service Account and Key File. If you have not created your Service Account and Key File, please do that before you move on with the rest of this tutorial. Make sure you store your Key File in a secure location. This tutorial assumes you have the Key File stored in your Documents folder. In Windows this would be the c:\users\username\Documents folder (where “username” is your Windows user name).

The first step in working with Dataedo is to establish a repository for database metadata. The steps to create a simple single user repository in Dataedo are provided on the next page.


Python Programming with Google BigQuery

Introduction

Google BigQuery is a cloud-based data warehouse for tabular data that is queried using SQL. BigQuery is suited for analytics applications where data is loaded once and then repeatedly analyzed using SQL. BigQuery stores tables in a columnar format and is built on Google’s Dremel query engine and distributed storage infrastructure.

Python is a popular programming language for analytics. Some typical use cases for Python and BigQuery include:

  • Querying data stored in BigQuery and analyzing/manipulating that data in Python.
  • Using Python to process data from a data source and then loading the data into BigQuery.
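
As a preview of the first use case, here is a rough sketch of what querying BigQuery from Python looks like with the google-cloud-bigquery client library. The query below runs against a public data set and assumes credentials are already configured; authentication is covered in its own section later in this tutorial.

  # Rough sketch: run a SQL query against BigQuery and iterate over the rows.
  # Assumes "pip install google-cloud-bigquery" and configured credentials.
  from google.cloud import bigquery

  client = bigquery.Client()  # picks up the default project and credentials

  sql = """
      SELECT name, SUM(number) AS total
      FROM `bigquery-public-data.usa_names.usa_1910_2013`
      GROUP BY name
      ORDER BY total DESC
      LIMIT 5
  """
  for row in client.query(sql).result():
      print(row.name, row.total)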

This tutorial is divided into nine sections:

  1. Pre-Requisites
  2. Installing Python Modules
  3. Running a simple SQL query in Python
  4. Configuring Authentication with a Service Account Key File
  5. Obtaining BigQuery Dataset Metadata
  6. Downloading data from BigQuery into a Pandas DataFrame
  7. Downloading query results into a Pandas DataFrame
  8. Loading data from a Pandas DataFrame into a BigQuery table
  9. Conclusions

The following page discusses the necessary pre-requisites for connecting Python to Google BigQuery.


Creating a Service Account and Key File for Google BigQuery

Introduction

In the Google Cloud Platform (GCP), a Service Account is a special type of user account that allows one GCP service to access services or resources on another GCP service. For example, a Service Account can be used in a virtual machine to access data stored in Google Cloud Storage. When it comes to working with databases such as Google BigQuery, a Service Account can be used to allow other software to connect to a BigQuery data set. Some examples of this include:

  • Connecting an ETL tool like dbt to BigQuery to run ETL or ELT jobs that query, transform and load data back to BigQuery.
  • Connecting a Python (or other software) program to BigQuery so that data can be loaded to BigQuery or downloaded via queries.
  • Connecting Tableau to BigQuery so that data can be queried and visualized.

Service Accounts are generated within a Project in GCP and use an RSA Public/Private key pair for authentication.
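
To show where the Key File eventually gets used, the sketch below illustrates two common ways client code (here, Python with the google-cloud-bigquery library) can pick up a Service Account Key File; the key path is a placeholder.

  # Illustrative sketch: two common ways client code uses a Key File.
  import os
  from google.cloud import bigquery

  # Option 1: point the standard environment variable at the key file;
  # most Google client libraries then find the credentials automatically.
  os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-key-file.json"
  client = bigquery.Client()

  # Option 2: load the key file explicitly when constructing the client.
  client = bigquery.Client.from_service_account_json("/path/to/your-key-file.json")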

There are many details and additional use cases for Service Accounts that are beyond the scope of this tutorial. The full documentation for Service Accounts can be found on this page: IAM > Documentation > Guides > Service Accounts.

Pre-Requisites

Before a Service Account and Key File can be created, you should have a GCP account and Project set up. The Project should be associated with a billing account. Finally, ensure that the BigQuery API is enabled. It should be enabled by default; if not, follow these instructions.

The steps to create a new Service Account are provided on the next page.


Getting started with PySpark on Google Cloud Platform Dataproc

Introduction

The purpose of this tutorial is to demonstrate the PySpark Python interface to Spark running on the Google Cloud Platform service called Dataproc.

Google Cloud Dataproc is Google’s implementation of the Hadoop ecosystem that includes the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. The Google Cloud Dataproc system includes a number of applications, such as Hive, Mahout, Pig, Spark and Hue, that are built on top of Hadoop.

Apache Spark is a processing framework that operates on top of the Hadoop Distributed File System (HDFS) (as well as other data stores). It features interactive shells that can launch distributed process jobs across a cluster. Spark supports programming language interfaces for Scala, Java, Python and SQL. PySpark is the Python interface to Spark.
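
As a small taste of the kind of PySpark commands issued later in this tutorial, here is an illustrative word-count sketch typed into the pyspark shell on the cluster master node; the HDFS path is a placeholder.

  # Illustrative sketch: word count over a file stored in HDFS, run in the
  # pyspark shell where the SparkContext is already available as "sc".
  counts = (
      sc.textFile("hdfs:///user/your_user/input.txt")   # placeholder path
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
  )
  for word, count in counts.take(10):
      print(word, count)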

Prerequisites

Before starting this tutorial, the following tutorials and notes should be reviewed.

Basic familiarity with the Hadoop Ecosystem is required to get the most out of this tutorial. My Introduction to Hadoop notes – Including Spark can help with a review of these concepts.

This tutorial assumes you already have a Google Cloud Platform account set up and funded. If you do not have a Google Cloud Platform account set up, please follow these instructions first.

Topics Outline

Processing a data set using Spark on Google Cloud Dataproc requires the following main steps:

  • Enable the Google Cloud Compute Engine API
  • Create, Configure and Launch a Google Cloud Dataproc cluster
  • Log in to the Hadoop cluster master node
  • Load data to the HDFS
  • Run the Command Line Interface and issue PySpark commands to process the data
  • Shut down cluster and remove any temporary resources

Each of these topics will be covered in the sections that follow.