Data Analysis in the Cloud at Scale

Spring 2020-2023

Duke MIDS

Course Description

This course is designed to give you a comprehensive view of cloud computing, including Big Data and Machine Learning. A variety of learning resources will be used, including interactive labs on cloud platforms (Google, AWS, Azure). This is a project-based course with extensive hands-on assignments.

Course Goals and Learning Objectives

Upon successful completion of this course, you will be able to:

  1. Summarize the fundamentals of cloud computing
  2. Evaluate the economics of cloud computing
  3. Accurately evaluate distributed computing challenges and opportunities and apply this knowledge to real-world projects
  4. Develop non-linear life-long learning skills
  5. Build, share, and present compelling portfolios using GitHub, Hugging Face, YouTube, and LinkedIn
  6. Develop metacognition skills (by teaching, we learn)

Conceptual Topics

  1. Cloud Computing Foundations

    1. Overview of Cloud Computing
    2. Cloud Adoption Framework(s)
    3. Economics of Cloud Computing
    4. Types of Cloud Services: SaaS, PaaS, IaaS, MaaS, Serverless
    5. IaC (Infrastructure as Code) w/ Terraform
    6. Continuous Delivery
  2. Virtualization & Containerization

    1. CPU, Memory, I/O
    2. SDN (Software Defined Networks)
    3. SDS (Software Defined Storage)
    4. Containers: Docker, Kubernetes, EKS (Elastic Kubernetes Service), Google Kubernetes Engine, Container Registries
  3. Challenges and Opportunities in Distributed Computing

    1. CAP Theorem
    2. Eventual Consistency
    3. Amdahl’s Law
    4. End of Moore’s Law
    5. ASICs and Accelerators: GPUs, TPUs, FPGAs
  4. Cloud Storage

    1. Cloud Databases: HBase, MongoDB, Cassandra, DynamoDB, Google BigQuery
    2. Cloud Object Storage: Amazon S3, GCP Cloud Storage, Amazon Glacier, Data Lakes, OpenStack Swift
    3. Distributed File Systems: Red Hat Ceph, Amazon EFS (Elastic File System), HDFS
  5. Serverless

    1. AWS Cloud9 Development Environment
    2. FaaS (Function as a Service): AWS Lambda, GCP Cloud Functions, Azure Functions
    3. Cloud-Native Primitives: AWS Step Functions, AWS SQS, AWS SNS, AWS Cognito, AWS API Gateway
    4. Google Cloud Shell Development Environment
    5. Google App Engine
  6. Big Data Platforms

    1. Batch Processing: EMR/Hadoop, AWS Batch
    2. ETL (Extract Transform Load): AWS Glue, AWS Athena
    3. Stream Processing: EMR/Spark, AWS Kinesis, Kafka
  7. Managed Machine Learning Systems and Platforms

    1. AWS SageMaker
    2. GCP AI Platform
    3. Azure ML Studio
  8. Edge Computing

    1. IoT: AWS Greengrass, Raspberry Pi
    2. Edge Machine Learning: TensorFlow Lite, Intel Movidius, Apple X12

Discussion Forum

The purpose of the async discussion forum is to facilitate a free exchange of ideas. Remain respectful of other ideas. Active, relevant, and timely discussion is encouraged. Please refrain from simple replies such as “I agree”. Use the critical thinking framework described in the preface of the O’Reilly book Practical MLOps.

The requirement each week is both to create a post according to the assignment and to comment in a meaningful way on posts by two other students.

MLOps Template GitHub Pull Requests

Students can choose each week either to answer a discussion question or to create a pull request that addresses a ticket on the MLOps Template project. The queue of work will be organized by the TAs; advanced students can also create a ticket and then work on it.

Weekly Demo

  • Each week you will give a 1-5 minute demo (hard capped at 5 minutes). This trains your metacognitive abilities.
  • You have the option of doing a demo on work from GitHub contributions related to class or class projects.

Cloud Resources (Labs, Credits, and Accounts)

This course will make use of several free resources that allow students to use real cloud environments on AWS, Google Cloud, and Azure. Please set up accounts as follows:

Required and Optional Readings and Resources

Required Readings and Media

Optional Readings & Media

Additional Coursera Labs: Bash and Linux

Supplementary Readings & Media

AWS
GCP
Python
Development Environment
Linux and Systems Engineering
Containers and Kubernetes
MLOps and DataOps