Cloud ETL

The cloud takes complex problems that today might require a team of 50 people and reduces them to a button click. In the “real world” you have to automate the data pipeline via an ETL (Extract, Transform, Load) process. The diagram below shows how AWS S3 is the central repository for the data.

aws-glue-athena

Next, AWS Glue indexes the cloud storage bucket and creates a database that can be used by AWS Athena. What is unique about this?

  • Almost no code (only a little SQL to query)
  • Serverless
  • Automatable
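To make the "only a little SQL" point concrete, here is a minimal sketch of querying a Glue-cataloged database through Athena from Python. It assumes a boto3 Athena client is passed in; the function name `run_athena_query`, the polling parameters, and the database, table, and bucket names in the usage comment are all hypothetical placeholders, not values from this chapter.

```python
import time


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start an Athena query and poll until it reaches a terminal state.

    `client` is expected to behave like boto3.client("athena").
    Returns (query_execution_id, final_state).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)

    raise TimeoutError(f"Athena query {query_id} did not finish in time")


# Hypothetical usage (requires AWS credentials and real resource names):
# import boto3
# client = boto3.client("athena")
# run_athena_query(
#     client,
#     "SELECT * FROM my_table LIMIT 10",   # placeholder table
#     "my_glue_database",                  # placeholder Glue database
#     "s3://my-bucket/athena-results/",    # placeholder results location
# )
```

Because the function only needs a client object, the same code can be scheduled from a serverless function or a cron job, which is one way the "Automatable" bullet could play out in practice.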

Here is a screencast of AWS Glue and AWS Athena working together to catalogue data and search it at scale:

AWS Glue

Real-World Problems with ETL: Building a Social Network From Scratch

Cold-Start Problem

How do you bootstrap a social network and get users?

cold start

Building Social Network Machine Learning Pipeline From Scratch

How can you predict the impact on a platform from social media signals?

ml pipeline

Results of ML Prediction Pipeline:

Brett Favre

feedback

conor

signals

How do you construct a news feed?

  • How many users should someone follow?
  • What should be in the feed?
  • What algorithm do you use to generate the feed?
  • Could you get a feed to be an O(1) lookup? Hint: pre-generate the feed.
  • What if one user posts 1000 items a day, but you follow 20 users and the feed paginates at 25 results?
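One possible way to explore the last two questions is a "fan-out on write" design: each post is pushed into every follower's stored feed at write time, so reading a feed becomes a single dictionary lookup. The sketch below is an in-memory illustration only; the class `FeedStore`, the `max_per_author` cap, and its default of 5 are assumptions made for this example, not a recommendation from the text.

```python
from collections import defaultdict

PAGE_SIZE = 25  # page size taken from the question above


class FeedStore:
    """Pre-generated feeds: work happens on write, reads are a dict lookup."""

    def __init__(self, page_size=PAGE_SIZE, max_per_author=5):
        self.page_size = page_size
        self.max_per_author = max_per_author  # assumed cap for heavy posters
        self.followers = defaultdict(set)     # author -> users following them
        self.feeds = defaultdict(list)        # user -> newest-first (author, item)

    def follow(self, user, author):
        self.followers[author].add(user)

    def post(self, author, item):
        # Fan out the new item to every follower's stored feed.
        for user in self.followers[author]:
            candidate = [(author, item)] + self.feeds[user]
            kept, seen_from_author = [], 0
            for entry in candidate:
                if entry[0] == author:
                    seen_from_author += 1
                    if seen_from_author > self.max_per_author:
                        continue  # drop this author's older items beyond the cap
                kept.append(entry)
            self.feeds[user] = kept[: self.page_size]

    def first_page(self, user):
        # One hash lookup; the copy is bounded by page_size.
        return list(self.feeds.get(user, []))


store = FeedStore()
store.follow("alice", "bob")
store.follow("alice", "heavy_poster")
store.post("bob", "hello")
for i in range(1000):
    store.post("heavy_poster", f"item-{i}")
print(len(store.first_page("alice")))
```

The trade-off is that writes now cost work proportional to the number of followers, and the per-author cap is just one of several ways to stop a user who posts 1000 items a day from crowding the other 19 followed users off a 25-item page.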

References