Data Analysis in the Cloud at Scale
Spring 2020-2023
Duke MIDS
Course Description
This course is designed to give you a comprehensive view of cloud computing including Big Data and Machine Learning. A variety of learning resources will be used including interactive labs on Cloud Platforms (Google, AWS, Azure). This is a project-based course with extensive hands-on assignments.
Course Goals and Learning Objectives
Upon successful completion of this course, you will be able to:
- Summarize the fundamentals of cloud computing
- Evaluate the economics of cloud computing
- Accurately evaluate distributed computing challenges and opportunities and apply this knowledge to real-world projects.
- Develop non-linear life-long learning skills
- Build, share, and present compelling portfolios using GitHub, Hugging Face, YouTube, and LinkedIn.
- Develop metacognition skills (by teaching, we learn)
Conceptual Topics
- Cloud Computing Foundations
  - Overview of Cloud Computing
  - Cloud Adoption Framework(s)
  - Economics of Cloud Computing
  - Types of Cloud Services: SaaS, PaaS, IaaS, MaaS, Serverless
  - IaC (Infrastructure as Code) w/ Terraform
  - Continuous Delivery
- Virtualization & Containerization
  - CPU, Memory, I/O
  - SDN (Software Defined Networks)
  - SDS (Software Defined Storage)
  - Containers: Docker, Kubernetes, EKS (Elastic Kubernetes Service), Google Kubernetes Engine, Container Registries
- Challenges and Opportunities in Distributed Computing
  - CAP Theorem
  - Eventual Consistency
  - Amdahl’s Law
  - End of Moore’s Law
  - ASICs and other accelerators: GPUs, TPUs, FPGAs
- Cloud Storage
  - Cloud Databases: HBase, MongoDB, Cassandra, DynamoDB, Google BigQuery
  - Cloud Object Storage: Amazon S3, GCP Cloud Storage, Amazon Glacier, Data Lakes, OpenStack Swift
  - Distributed File Systems: Red Hat Ceph, Amazon EFS (Elastic File System), HDFS
- Serverless
  - AWS Cloud9 Development Environment
  - FaaS (Function as a Service): AWS Lambda, GCP Cloud Functions, Azure Functions
  - Cloud-Native Primitives: AWS Step Functions, AWS SQS, AWS SNS, AWS Cognito, AWS API Gateway
  - Google Cloud Shell Development Environment
  - Google App Engine
- Big Data Platforms
  - Batch Processing: EMR/Hadoop, AWS Batch
  - ETL (Extract Transform Load): AWS Glue, AWS Athena
  - Stream Processing: EMR/Spark, AWS Kinesis, Kafka
- Managed Machine Learning Systems and Platforms
  - AWS SageMaker
  - GCP AI Platform
  - Azure ML Studio
- Edge Computing
  - IoT: AWS Greengrass, Raspberry Pi
  - Edge Machine Learning: TensorFlow Lite, Intel Movidius, Apple A12
Discussion Forum
The purpose of the async discussion forum is to facilitate a free exchange of ideas. Remain respectful of other ideas. Active, relevant, and timely discussion is encouraged. Please refrain from simple replies such as “I agree”. Use the Critical Thinking framework described in the preface of the O’Reilly book Practical MLOps.
Each week you are required both to create a post according to the assignment and to comment in a meaningful way on posts by two other students.
MLOps Template GitHub Pull Requests
Each week, students can choose either to answer a discussion question or to create a pull request that works on a ticket in the MLOps Template project. The queue of work will be organized by the TAs. Advanced students can also create a ticket and then work on it.
Weekly Demo
- Each week you will give a 1-5 minute demo (hard capped at 5 minutes). This trains your metacognitive abilities.
- You have the option of doing a demo on work from GitHub contributions related to class or class projects.
Cloud Resources (Labs, Credits, and Accounts)
This course will make use of several free resources that allow students to use real cloud environments on AWS, Google, and Azure. Please set up accounts as follows:
- AWS: Create an account on AWS Educate using your school email account: https://aws.amazon.com/education/awseducate/. This will be where “free” sandboxed AWS Environments will launch. (Note: you are also encouraged to sign up for a “free tier” AWS account: https://aws.amazon.com/free/)
- GCP: Create an account on Qwiklabs using your school email account: https://www.qwiklabs.com/
- Azure: Create an account on Azure for students using your school email account: https://azure.microsoft.com/en-us/free/students/.
Required and Optional Readings and Resources
Required Readings and Media
- Berkeley View of Cloud Computing
- Google Cloud Adoption Framework (Read Whitepaper)
- AWS Cloud Adoption Framework (optional)
- The Economics of the Cloud-Microsoft
- Introduction to AWS Economics
- Gartner AI Hype Cycle
- Python for DevOps-Book
- Python for Programmers-Book
- Data Engineering with Python and AWS Lambda-Video
- Duke+Coursera: Cloud Computing for Data Coursera Course
- Gift, N (2021) Practical MLOps, Sebastopol, CA: O’Reilly
- Gift, N (2021) Cloud Computing for Data Analysis
- Gift, N (2020) Pragmatic AI: An Introduction to Cloud-Based Machine Learning
- Gift, N (2022) Developing on AWS with CSharp
Optional Readings & Media
Additional Coursera Labs: Bash and Linux
- Coursera-DE-C2-Lab1-Linux
- Coursera-DE-C2-Lab2-Using-Bash
- Coursera-DE-C2-Lab3-Building-Bash-Scripts
- Coursera-DE-C2-Lab4-Composing-File-Data-Solutions
Supplementary Readings & Media
- AWS Training
- AWS Educate
- AWS Academy
- Google Qwiklabs
- Microsoft Learn
- Python in One Hour
- Know Thyself: The Science of Self-Awareness
- DataCamp - CLI Automation Python
- AWS Training & Certification
- Google Qwiklabs - Hands-On Cloud Training
- Coursera
- Google Cloud Platform Fundamentals: Core Infrastructure
- edX
- Applied Computer Vision with Python Lectures: https://learning.oreilly.com/videos/applied-computer-vision/60652VIDEOPAIMLL/
- Learn Python in One Hour: https://learning.oreilly.com/videos/learn-python-in/60645VIDEOPAIML/
- Cloud Computing with Python: https://learning.oreilly.com/videos/cloud-computing-with/60650VIDEOPAIML/
- Python for Data Science with Colab and Pandas in One Hour: https://learning.oreilly.com/videos/python-for-data/62062021VIDEOPAIML/
- GCP Cloud Functions: https://learning.oreilly.com/videos/learn-gcp-cloud/50101VIDEOPAIML/
- Azure AutoML: https://learning.oreilly.com/videos/learn-azure-ml/50104VIDEOPAIML/
AWS
- AWS Bootcamp
- Logic to Live
- AWS Lambda Python Cloud9 and Boto3 One Hour
- Learn AWS Cloudshell
- Using AWS Sagemaker
- Learn to build Data Pipelines
- Hello World IAC with AWS CDK
- GitHub Actions vs AWS CodeBuild for CI
- AWS Sagemaker Autopilot from Zero
- AWS Cloud Practitioner
- AWS ML
- AWS SA
GCP
- Building AI Applications with GCP: https://learning.oreilly.com/videos/building-ai-applications/9780135973462/
- Build GCP Cloud Functions: https://learning.oreilly.com/videos/learn-gcp-cloud/50101VIDEOPAIML/
- Google Cloud Functions for the Impatient
Python
- Data Science, Pandas, and Colab: https://learning.oreilly.com/videos/python-for-data/62062021VIDEOPAIML/
- Python and DevOps: https://learning.oreilly.com/videos/python-devops-in/61272021VIDEOPAIML/
- Python Command-line Tools: https://learning.oreilly.com/videos/learn-python-command-line/50102VIDEOPAIML/
- Build a useful Python decorator
- Python Functions in One Hour
- Python command-line in one hour
- Fast, documented Machine Learning APIs with FastAPI
Development Environment
Linux and Systems Engineering
- Docker containers: https://learning.oreilly.com/videos/learn-docker-containers/50103VIDEOPAIML/
- Learn the Vim Text Editor: https://learning.oreilly.com/videos/learn-vim-in/50100VIDEOPAIML/
Containers and Kubernetes
- Learn Docker containers in One Hour Video
- Setup a Remote Kubernetes Environment
- Packaging Machine Learning Models with Docker