Detailed Course Outline
Module 01 - Introduction to Data Engineering
Topics:
- Explore the role of a data engineer.
- Analyze data engineering challenges
- Introduction to BigQuery
- Data lakes and data warehouses
- Transactional databases versus data warehouses
- Partner effectively with other data teams
- Manage data access and governance
- Build production-ready pipelines
- Review Google Cloud customer case study
Objectives:
- Understand the role of a data engineer
- Discuss benefits of doing data engineering in the cloud
- Discuss challenges of data engineering practice and how building data pipelines in the cloud helps to address these
- Review and understand the purpose of a data lake versus a data warehouse, and when to use which
Activities:
- Lab: Using BigQuery to do Analysis
Module 02 - Building a Data Lake
Topics:
- Introduction to data lakes
- Data storage and ETL options on Google Cloud
- Building a data lake using Cloud Storage
- Securing Cloud Storage
- Storing all sorts of data types
- Cloud SQL as a relational data lake
Objectives:
- Understand why Cloud Storage is a great option for building a data lake on Google Cloud
- Learn how to use Cloud SQL for a relational data lake
Activities:
- Lab: Loading Taxi Data into Cloud SQL
Module 03 - Building a Data Warehouse
Topics:
- The modern data warehouse
- Introduction to BigQuery
- Getting started with BigQuery
- Loading data
- Exploring schemas
- Schema design
- Nested and repeated fields
- Optimizing with partitioning and clustering
Objectives:
- Discuss requirements of a modern warehouse
- Understand why BigQuery is the scalable data warehousing solution on Google Cloud
- Understand core concepts of BigQuery and review options of loading data into BigQuery
Activities:
- Lab: Loading Data into BigQuery
- Lab: Working with JSON and Array Data in BigQuery
Module 04 - Introduction to Building Batch Data Pipelines
Topics:
- EL, ELT, ETL
- Quality considerations
- How to carry out operations in BigQuery
- Shortcomings
- ETL to solve data quality issues
Objectives:
- Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL
- Discuss data quality considerations and when to use ETL instead of EL and ELT
Module 05 - Executing Spark on Dataproc
Topics:
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimize Dataproc
Objectives:
- Review the parts of the Hadoop ecosystem
- Learn how to lift and shift your existing Hadoop workloads to the cloud using Dataproc
- Understand considerations around using Cloud Storage instead of HDFS for storage
- Learn how to optimize Dataproc jobs
Activities:
- Lab: Running Apache Spark jobs on Dataproc
Module 06 - Serverless Data Processing with Dataflow
Topics:
- Introduction to Dataflow
- Why customers value Dataflow
- Dataflow pipelines
- Aggregating with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
- Dataflow SQL
Objectives:
- Understand how to decide between Dataflow and Dataproc for processing data pipelines
- Understand the features that customers value in Dataflow
- Discuss core concepts in Dataflow
- Review the use of Dataflow templates and SQL
Activities:
- Lab: A Simple Dataflow Pipeline (Python/Java)
- Lab: MapReduce in Dataflow (Python/Java)
- Lab: Side inputs (Python/Java)
Module 07 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Topics:
- Building batch data pipelines visually with Cloud Data Fusion
- Components
- UI overview
- Building a pipeline
- Exploring data using Wrangler
- Orchestrating work between Google Cloud services with Cloud Composer
- Apache Airflow environment
- DAGs and operators
- Workflow scheduling
- Monitoring and logging
Objectives:
- Discuss how to manage your data pipelines with Data Fusion and Cloud Composer
- Understand Data Fusion’s visual design capabilities
- Learn how Cloud Composer can help to orchestrate the work across multiple Google Cloud services
Activities:
- Lab: Building and Executing a Pipeline Graph in Data Fusion
- Optional Lab: An introduction to Cloud Composer
Module 08 - Introduction to Processing Streaming Data
Topics: Processing Streaming Data
Objectives:
- Explain streaming data processing
- Describe the challenges with streaming data
- Identify the Google Cloud products and tools that can help address streaming data challenges
Module 09 - Serverless Messaging with Pub/Sub
Topics:
- Introduction to Pub/Sub
- Pub/Sub push versus pull
- Publishing with Pub/Sub code
Objectives:
- Describe the Pub/Sub service
- Understand how Pub/Sub works
- Gain hands-on Pub/Sub experience with a lab that simulates real-time streaming sensor data
Activities:
- Lab: Publish Streaming Data into Pub/Sub
Module 10 - Dataflow Streaming Features
Topics:
- Steaming data challenges
- Dataflow windowing
Objectives:
- Understand the Dataflow service
- Build a stream processing pipeline for live traffic data
- Demonstrate how to handle late data using watermarks, triggers, and accumulation
Activities:
- Lab: Streaming Data Pipelines
Module 11 - High-Thoughput BigQuery and Bigtable Streaming Features
Topics:
- Streaming into BigQuery and visualizing results
- High-throughput streaming with Cloud Bigtable
- Optimizing Cloud Bigtable performance
Objectives:
- Learn how to perform ad hoc analysis on streaming data using BigQuery and dashboards
- Understand how Cloud Bigtable is a low-latency solution
- Describe how to architect for Bigtable and how to ingest data into Bigtable
- Highlight performance considerations for the relevant services
Activities:
- Lab: Streaming Analytics and Dashboards
- Lab: Streaming Data Pipelines into Bigtable
Module 12 - Advanced BigQuery Functionality and Performance
Topics:
- Analytic window functions
- Use With clauses
- GIS functions
- Performance considerations
Objectives:
- Review some of BigQuery’s advanced analysis capabilities
- Discuss ways to improve query performance
Activities:
- Lab: Optimizing your BigQuery Queries for Performance
- Optional Lab: Partitioned Tables in BigQuery
Module 13 - Introduction to Analytics and AI
Topics:
- What is AI?
- From ad-hoc data analysis to data-driven decisions
- Options for ML models on Google Cloud
Objectives:
- Understand the proposition that ML adds value to your data
- Understand the relationship between ML, AI, and Deep Learning
- Identify ML options on Google Cloud
Module 14 - Prebuilt ML Model APIs for Unstructured Data
Topics:
- Unstructured data is hard
- ML APIs for enriching data
Objectives:
- Discuss challenges when working with unstructured data
- Learn the applications of ready-to-use ML APIs on unstructured data
Activities:
- Lab: Using the Natural Language API to Classify Unstructured Text
Module 15 - Big Data Analytics with Notebooks
Topics:
- What’s a notebook?
- BigQuery magic and ties to Pandas
Objectives:
- Introduce Notebooks as a tool for prototyping ML solutions
- Learn to execute BigQuery commands from Notebooks
Activities:
- Lab: BigQuery in Jupyter Labs on AI Platform
Module 16 - Production ML Pipelines
Topics:
- Ways to do ML on Google Cloud
- Vertex AI Pipelines
- AI Hub
Objectives:
- Describe options available for building custom ML models
- Understand the use of tools like Vertex AI Pipelines
Activities:
- Lab: Running Pipelines on Vertex AI
Module 17 - Custom Model Building with SQL in BigQuery ML
Topics:
- BigQuery ML for quick model building
- Supported models
Objectives:
- Learn how to create ML models by using SQL syntax in BigQuery
- Demonstrate building different kinds of ML models using BigQuery ML
Activities:
- Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
- Lab option 2: Movie Recommendations in BigQuery ML
Module 18 - Custom Model Building with AutoML
Topics:
- Why AutoML?
- AutoML Vision
- AutoML NLP
- AutoML tables
Objectives:
- Explore various AutoML products used in machine learning
- Learn to use AutoML to create powerful models without coding