Feature Engineering

Performing Basic Feature Engineering in BQML

# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > feature_engineering > labs and open 1_bqml_basic_feat_eng_bqml-lab.ipynb.
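
# A minimal Python sketch (illustrative, not the notebook's exact code) of the
# kind of basic feature engineering the lab does in BQML: deriving day-of-week
# and hour-of-day features from the pickup timestamp with SQL. Assumes the
# google-cloud-bigquery client and the public NYC taxi table.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = """
SELECT
  fare_amount,
  EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime) AS hourofday
FROM
  `nyc-tlc.yellow.trips`
LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))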

Performing Basic Feature Engineering in Keras

# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > feature_engineering > labs and open 3_keras_basic_feat_eng-lab.ipynb.
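
# A minimal Python sketch (column names are illustrative placeholders, not the
# notebook's exact code) of the basic Keras feature engineering the lab covers:
# a numeric feature column plus a bucketized version, fed through DenseFeatures.
import tensorflow as tf

numeric_col = tf.feature_column.numeric_column('housing_median_age')
bucketized_col = tf.feature_column.bucketized_column(
    numeric_col, boundaries=[10, 20, 30, 40])  # one-hot age buckets

feature_layer = tf.keras.layers.DenseFeatures([bucketized_col, numeric_col])
example = {'housing_median_age': tf.constant([[15.0]])}
print(feature_layer(example))  # dense tensor ready to feed a Keras model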

Simple Dataflow Pipeline

# Activate Cloud Shell
# List the active account name
gcloud auth list

# List the project ID
gcloud config list project

# Clone the repository from the Cloud Shell command line
git clone https://github.com/GoogleCloudPlatform/training-data-analyst

# In the Console, on the Navigation menu, click Cloud Storage > Browser, then click Create Bucket and create a uniquely named bucket.
# In Cloud Shell, store the bucket name in a variable (the value below is an example; use your own bucket name)
BUCKET="qwiklabs-gcp-00-f40ffad65927"
echo $BUCKET

# Return to the browser tab containing Cloud Shell. In Cloud Shell navigate to the directory for this lab
cd ~/training-data-analyst/courses/data_analysis/lab2/python

# Install the necessary dependencies for Python dataflow
sudo ./install_packages.sh

# Verify that you have the right version of pip. (It should be > 8.0)
pip3 -V

# Execute the pipeline locally
# In the Cloud Shell command line, locally execute grep.py
cd ~/training-data-analyst/courses/data_analysis/lab2/python
python3 grep.py
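
# grep.py itself is only a few lines; a hedged Python sketch of its shape (the
# search term and paths follow the lab, details may differ from the real file):
import apache_beam as beam

def my_grep(line, term):
    # Map step: emit the line only if it starts with the search term
    if line.startswith(term):
        yield line

input_files = '../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java'

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'GetJava' >> beam.io.ReadFromText(input_files)
     | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, 'import'))
     | 'write' >> beam.io.WriteToText('/tmp/output'))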

# The output is written with the prefix /tmp/output. Beam shards it into parts named like output-00000-of-00001. If necessary, you can locate the correct files by their timestamps
ls -al /tmp

# Examine the output (the "-*" wildcard matches every shard; replace it with a specific suffix to view just one)
cat /tmp/output-*

# Execute the pipeline on the cloud
# Copy the Java files that the pipeline will search to Cloud Storage
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://$BUCKET/javahelp

# Edit the Dataflow pipeline in grepc.py. In the Cloud Shell code editor, navigate to the directory training-data-analyst/courses/data_analysis/lab2/python and edit the file grepc.py.
# Replace PROJECT and BUCKET with your Project ID and Bucket name. Here are easy ways to retrieve the values:
echo $DEVSHELL_PROJECT_ID
echo $BUCKET

# Submit the Dataflow job to the cloud:
python3 grepc.py
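
# A hedged sketch of what grepc.py adds over grep.py: the same transforms, but
# pipeline options selecting the DataflowRunner and pointing staging, temp and
# I/O locations at your bucket (PROJECT, BUCKET and region are placeholders):
import apache_beam as beam

PROJECT = 'your-project-id'   # replace with $DEVSHELL_PROJECT_ID
BUCKET = 'your-bucket-name'   # replace with $BUCKET

argv = [
    '--project={0}'.format(PROJECT),
    '--job_name=examplejob2',
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--region=us-central1',
    '--runner=DataflowRunner',
]

p = beam.Pipeline(argv=argv)
(p
 | 'GetJava' >> beam.io.ReadFromText('gs://{0}/javahelp/*.java'.format(BUCKET))
 | 'Grep' >> beam.FlatMap(lambda line: [line] if line.startswith('import') else [])
 | 'write' >> beam.io.WriteToText('gs://{0}/javahelp/output'.format(BUCKET)))
p.run()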

# Download the file in Cloud Shell and view it:
gsutil cp gs://$BUCKET/javahelp/output* .
cat output*

MapReduce in Dataflow (Python)

# Activate Cloud Shell
# List the active account name
gcloud auth list

# List the project ID
gcloud config list project

# Identify Map and Reduce operations
# In Cloud Shell, clone the source repo, which has the starter scripts for this lab
git clone https://github.com/GoogleCloudPlatform/training-data-analyst

# Then navigate to the code for this lab.
cd training-data-analyst/courses/data_analysis/lab2/python
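
# Before running anything, open is_popular.py and identify which steps are the
# Map and which the Reduce. A hedged Python sketch of the pattern you should
# find (the lab's script also expands package prefixes and keeps the top 5):
import apache_beam as beam

def package_use(line, keyword):
    # Map: for each import statement, emit (package_name, 1)
    if line.startswith(keyword):
        yield (line.split()[1].rstrip(';'), 1)

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'GetJava' >> beam.io.ReadFromText('../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java')
     | 'PackageUse' >> beam.FlatMap(lambda line: package_use(line, 'import'))  # Map
     | 'TotalUse' >> beam.CombinePerKey(sum)                                   # Reduce
     | 'write' >> beam.io.WriteToText('/tmp/output'))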

# Execute the pipeline
# Install the necessary dependencies for Python dataflow
sudo ./install_packages.sh

# Verify that you have the right version of pip (should be > 8.0)
pip3 -V

# Run the pipeline locally
python3 ./is_popular.py

# Examine the output file
cat /tmp/output-*

# Use command line parameters
# Change the output prefix from the default value
python3 ./is_popular.py --output_prefix=/tmp/myoutput

# Note that we now have a new file in the /tmp directory
ls -lrt /tmp/myoutput*
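
# The flag works because is_popular.py parses its own arguments before handing
# the rest to Beam; a hedged sketch of that argparse wiring (the flag name
# matches the lab, the surrounding details are illustrative):
import argparse

parser = argparse.ArgumentParser(description='Find the most-used Java packages')
parser.add_argument('--output_prefix', default='/tmp/output',
                    help='prefix for the sharded output files')
args, beam_args = parser.parse_known_args()  # unrecognized flags go to Beam
print(args.output_prefix)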

Computing Time-Windowed Features in Cloud Dataprep

# Open the Google Cloud Console at console.cloud.google.com. On the left-side Navigation menu, go to Cloud Storage.
# Click Create Bucket (or use an existing bucket).
# Create BigQuery Dataset to store Cloud Dataprep Output
# Open BigQuery: In the Google Cloud Console, select Navigation menu > BigQuery
# For Dataset ID, type taxi_cab_reporting and click CREATE DATASET
# Open the Navigation menu. Under Big Data, click on Dataprep.

# Import NYC Taxi Data from GCS into a Dataprep Flow
# In the Cloud Dataprep UI, click on the Dataprep icon on the top left corner to go to the home screen and then click Create Flow on the top right side of the page. Click Untitled Flow and specify the following Flow details:
Flow Name: NYC Taxi Cab Data Reporting
Flow Description: Ingesting, Transforming, and Analyzing Taxi Data
# Click on the + under Dataset to add a new data source and then click Import datasets.
# In the data importer left side menu, click GCS (Google Cloud Storage). Click the Pencil icon to edit the GCS path.
# Paste in the 2015 taxi rides dataset CSV from Google Cloud Storage:
gs://cloud-training/gcpml/c4/tlc_yellow_trips_2015_small.csv

# Before selecting Import, click the Pencil Icon to edit the GCS path a second time and paste in the 2016 CSV below:
gs://cloud-training/gcpml/c4/tlc_yellow_trips_2016_small.csv
# Click Import & Add to Flow

# While your Cloud Dataprep flow starts and manages your Cloud Dataflow job, you can preview your output results by using BigQuery to query this pre-populated table:
#standardSQL
SELECT
  pickup_hour,
  FORMAT("$%.2f",ROUND(average_3hr_rolling_fare,2)) AS avg_recent_fare,
  ROUND(average_trip_distance,2) AS average_trip_distance_miles_by_hour,
  FORMAT("%'d",sum_passenger_count) AS total_passengers_by_hour
FROM
  `cloud-training-demos.demos.nyc_taxi_reporting`
ORDER BY
  pickup_hour DESC;
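
# Illustrative only: average_3hr_rolling_fare above is a time-windowed feature.
# A small pandas sketch of the same idea, averaging each hour with the two
# preceding hours (toy data, not the lab's dataset):
import pandas as pd

fares = pd.DataFrame({
    'pickup_hour': pd.date_range('2016-01-01', periods=6, freq='H'),
    'average_fare': [11.0, 12.5, 13.0, 12.0, 14.5, 15.0],
})
fares['average_3hr_rolling_fare'] = (
    fares['average_fare'].rolling(window=3, min_periods=1).mean().round(2))
print(fares)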

Improve ML Model with Feature Engineering

# Create a bucket
# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > feateng and open feateng.ipynb.
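
# A hedged Python sketch of the kind of feature cross feateng.ipynb adds:
# crossing day-of-week with hour-of-day so the model can learn combinations
# such as Friday rush hour (bucket counts and column names are illustrative):
import tensorflow as tf

dayofweek = tf.feature_column.categorical_column_with_identity('dayofweek', num_buckets=8)
hourofday = tf.feature_column.categorical_column_with_identity('hourofday', num_buckets=24)
day_hr = tf.feature_column.crossed_column([dayofweek, hourofday], hash_bucket_size=24 * 7)
day_hr_embedded = tf.feature_column.embedding_column(day_hr, dimension=10)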

Performing Advanced Feature Engineering in BQML

# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > feature_engineering > labs and open 2_bqml_adv_feat_eng-lab.ipynb.
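
# A hedged sketch (placeholder dataset and table names) of the TRANSFORM-style
# preprocessing the notebook explores, using BigQuery ML functions such as
# ML.FEATURE_CROSS and ML.BUCKETIZE inside CREATE MODEL:
from google.cloud import bigquery

client = bigquery.Client()
ddl = """
CREATE OR REPLACE MODEL `your_dataset.taxi_model`
TRANSFORM(
  ML.FEATURE_CROSS(STRUCT(CAST(dayofweek AS STRING) AS day,
                          CAST(hourofday AS STRING) AS hr)) AS day_hr,
  ML.BUCKETIZE(pickup_latitude, [40.6, 40.7, 40.8]) AS lat_bucket,
  fare_amount
)
OPTIONS(model_type='linear_reg', input_label_cols=['fare_amount']) AS
SELECT dayofweek, hourofday, pickup_latitude, fare_amount
FROM `your_dataset.taxi_trips`
"""
client.query(ddl).result()  # runs the CREATE MODEL job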

Performing Advanced Feature Engineering in Keras

# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > feature_engineering > labs and open 4_keras_adv_feat_eng-lab.ipynb.
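
# A hedged Python sketch of one advanced feature the notebook engineers with a
# Keras Lambda layer: Euclidean distance between pickup and dropoff coordinates
# (input names are illustrative placeholders):
import tensorflow as tf

def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    return tf.sqrt((lon2 - lon1) ** 2 + (lat2 - lat1) ** 2)

inputs = {name: tf.keras.Input(shape=(1,), name=name)
          for name in ('pickup_lon', 'pickup_lat', 'dropoff_lon', 'dropoff_lat')}
distance = tf.keras.layers.Lambda(euclidean, name='euclidean')(
    [inputs['pickup_lon'], inputs['pickup_lat'],
     inputs['dropoff_lon'], inputs['dropoff_lat']])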

Exploring tf.transform

# On the Notebook instances page, click NEW INSTANCE. Select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# In the notebook interface, navigate to training-data-analyst > courses > machine_learning > deepdive2 > feature_engineering > labs and open 5_tftransform_taxifare.ipynb.
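
# A hedged sketch of the preprocessing_fn at the heart of tf.Transform: analyze
# steps (here the global min/max behind scale_to_0_1) run as a full pass over
# the data, and the result is baked into the serving graph (feature names are
# illustrative):
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}
    outputs['fare_amount'] = inputs['fare_amount']
    outputs['passengers'] = tf.cast(inputs['passengers'], tf.float32)
    # requires the dataset-wide min and max of trip_distance
    outputs['scaled_distance'] = tft.scale_to_0_1(inputs['trip_distance'])
    return outputs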