Cloud ETL
The cloud takes complex problems that once required a team of 50 people and reduces them to a button click. In the “real world” you have to automate the data pipeline with an ETL (Extract, Transform, Load) process. The diagram below shows how AWS S3 is the central repository for the data.
Next, AWS Glue crawls the S3 bucket, catalogs its contents, and creates a database that AWS Athena can query. What is unique about this?
- Almost no code (only a little SQL to query)
- Serverless
- Automatable
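To make the "little SQL" and "automatable" points concrete, here is a minimal sketch of running a query against a Glue-created database from Python. It assumes a boto3 Athena client is passed in; the database name, table name, and S3 output location in the usage note are hypothetical placeholders, not values from this chapter.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL statement on Athena and return rows as lists of strings.

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is read here; a larger result set
    # would need pagination via NextToken.
    results = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT COUNT(*) FROM my_table", "my_glue_db", "s3://my-bucket/athena-results/")`, where every name is a placeholder. For a `SELECT`, the first returned row is typically the column header.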
Here is a screencast of AWS Glue and AWS Athena working together to catalogue data and search it at scale:
Real-World Problems with ETL: Building a Social Network From Scratch
Cold-Start Problem
How do you bootstrap a social network and get users?
Building a Social Network Machine Learning Pipeline From Scratch
How can you predict the impact on the platform from social media signals?
Results of ML Prediction Pipeline:
How do you construct a news feed?
- How many users should someone follow?
- What should be in the feed?
- What algorithm do you use to generate the feed?
- Could you get a feed to be an O(1) lookup? Hint: pre-generate the feed.
- What if one user posts 1000 items a day, but you follow 20 users and the feed paginates at 25 results?
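One possible answer to the last two questions is to pre-generate each feed at write time ("fan-out on write") and cap how many entries any single author may occupy. The sketch below is an in-memory illustration only; `FeedStore`, `FEED_CAPACITY`, and `MAX_PER_AUTHOR` are hypothetical names and values chosen for the example, and a real system would likely keep the feeds in a cache or database.

```python
from collections import defaultdict, deque
from itertools import islice

PAGE_SIZE = 25        # from the question above
FEED_CAPACITY = 200   # assumption: entries kept per user
MAX_PER_AUTHOR = 5    # assumption: entries one author may hold in a feed


class FeedStore:
    def __init__(self):
        self.followers = defaultdict(set)  # author -> set of follower ids
        # Each feed is stored newest-first and bounded in size.
        self.feeds = defaultdict(lambda: deque(maxlen=FEED_CAPACITY))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, author, post_id):
        """Push a new post into every follower's pre-generated feed."""
        for follower in self.followers[author]:
            feed = self.feeds[follower]
            own = [item for item in feed if item[0] == author]
            if len(own) >= MAX_PER_AUTHOR:
                # Drop this author's oldest entry so a prolific poster
                # cannot crowd everyone else out of the feed.
                feed.remove(own[-1])
            feed.appendleft((author, post_id))

    def first_page(self, user):
        """Read cost depends only on PAGE_SIZE, not on total post volume."""
        return list(islice(self.feeds[user], PAGE_SIZE))


store = FeedStore()
store.follow("me", "heavy")
store.follow("me", "quiet")
store.publish("quiet", "q0")
for i in range(1000):
    store.publish("heavy", f"h{i}")
page = store.first_page("me")
```

The trade-off is that the work moves to the write path: each post costs time proportional to the author's follower count, which is why accounts with very large audiences are sometimes handled differently.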