{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Lesson2-AWSML-Data-Engineering.ipynb", "version": "0.3.2", "provenance": [], "collapsed_sections": [ "c_Id55m6Jsbu", "bq4VmHjPpMOR", "8oYv9z6P6Ukj", "DJxHWoBmcY_4", "c69G5pk8Dh7Z", "1skrZ0TuDpVq", "LYQ8rXU-Doab", "LQIpSWC8HkO0", "dBGcloLTH41z", "czV9D-QSIJGp", "WESUX507e4jg", "bDlM65z5nM0i", "ewIs7l_AptvW", "PlVk0ly3m8SH", "oWeNDZCtp6NP", "Hc2I3g2xnH3Y", "AZ6JLteQ6wXO", "_x5EC-sO61UF", "wMDw-IEhAgT4", "yo0I3xoETEpQ", "SNpJZKlHJ-_A", "pNkmKQ6m627N", "fYH_66i97RoN", "q7B3KpSfnCYo", "gUMlly9JFYQ9", "Skhxh0LhFjp9", "AUbie4x-7JXM", "6t1wD_-57Q3t", "u8jPN7MZLIsn", "qZkC2QC-MRiS", "VdheVbrOeWxk", "y8fseYaIeVPX", "lR42AyxvlNTn", "xUbJmruLldUe", "h8y_QAAElsY1", "oA0ECumOmBLz", "6fhL7WGJ7Scv", "5U8QWB0mEH7m", "iG_FBgbY6vqH", "M22l9fGmApsu", "qe8SKDDv7Y6D", "h6HlzS687ZOF", "FGeYIIyfilcU", "qJrkb72JioOs", "hIOu8DEbsjPi", "hLqAX6oR8gUD", "lhDdV7ydTyas", "_E5UUskr8jWc", "WKxr2MTcigL7" ], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "metadata": { "id": "okSLzwCiiiS-", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# Lesson 2 Data Engineering for ML on AWS\n", "\n", "[Watch Lesson 2: Data Engineering for ML on AWS Video](https://learning.oreilly.com/videos/aws-certified-machine/9780135556597/9780135556597-ACML_01_02_00)" ] }, { "metadata": { "id": "c_Id55m6Jsbu", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## Pragmatic AI Labs\n", "\n" ] }, { "metadata": { "id": "e5p96AqpSDZa", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![alt text](https://paiml.com/images/logo_with_slogan_white_background.png)\n", "\n", "This notebook was produced by [Pragmatic AI Labs](https://paiml.com/). 
You can continue learning about these topics by:\n", "\n", "* Buying a copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](http://www.informit.com/store/pragmatic-ai-an-introduction-to-cloud-based-machine-9780134863863) from Informit.\n", "* Buying a copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](https://www.amazon.com/Pragmatic-AI-Introduction-Cloud-Based-Learning/dp/0134863860) from Amazon.\n", "* Reading an online copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/)\n", "* Watching video [Essential Machine Learning and AI with Python and Jupyter Notebook-Video-SafariOnline](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118) on Safari Books Online.\n", "* Watching video [AWS Certified Machine Learning-Specialty](https://learning.oreilly.com/videos/aws-certified-machine/9780135556597)\n", "* Purchasing video [Essential Machine Learning and AI with Python and Jupyter Notebook- Purchase Video](http://www.informit.com/store/essential-machine-learning-and-ai-with-python-and-jupyter-9780135261095)\n", "* Viewing more content at [noahgift.com](https://noahgift.com/)\n" ] }, { "metadata": { "id": "bq4VmHjPpMOR", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## Load AWS API Keys" ] }, { "metadata": { "id": "aWrzIk7WpRoh", "colab_type": "text" }, "cell_type": "markdown", "source": [ "Copy your AWS credentials file to a local or remote Google Drive folder: \n", "\n", "`cp ~/.aws/credentials /Users/myname/Google\\ Drive/awsml/`" ] }, { "metadata": { "id": "hPWO_zyRopXN", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Mount GDrive\n" ] }, { "metadata": { "id": "XI73HZNLobp4", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "from google.colab import drive\n", "drive.mount('/content/gdrive', force_remount=True)" ], "execution_count": 0, "outputs": [] }, { "metadata": { 
"id": "UNyzZwgmoxwm", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "import os;os.listdir(\"/content/gdrive/My Drive/awsml\")" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "fYu0ekUlqPk6", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Install Boto" ] }, { "metadata": { "id": "dJDDrUkWrYRY", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "!pip -q install boto3\n" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "FpJhrpSQsK5E", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Create API Config" ] }, { "metadata": { "id": "QxRwGOZtsN0-", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "!mkdir -p ~/.aws &&\\\n", " cp /content/gdrive/My\\ Drive/awsml/credentials ~/.aws/credentials " ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "Kj977UW3rph_", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Test Comprehend API Call" ] }, { "metadata": { "id": "P-A8Cia-raT0", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "import boto3\n", "comprehend = boto3.client(service_name='comprehend', region_name=\"us-east-1\")\n", "text = \"There is smoke in San Francisco\"\n", "comprehend.detect_sentiment(Text=text, LanguageCode='en')" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "8oYv9z6P6Ukj", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.1 Data Ingestion Concepts" ] }, { "metadata": { "id": "v7OC_Fh5QV9A", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Data Lakes" ] }, { "metadata": { "id": "Nt6VTvFVQb-D", "colab_type": "text" }, "cell_type": "markdown", "source": [ "**Central Repository** for all data at any scale" ] }, { "metadata": { "id": "8iyQzWR3amgh", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![data_lake](https://user-images.githubusercontent.com/58792/49777724-8aef8300-fcb6-11e8-981e-96d14498a801.png)" ] }, { 
"metadata": { "id": "DJxHWoBmcY_4", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### AWS Lake Formation" ] }, { "metadata": { "id": "X5iz82w_dCJ_", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* New Service Announced at Reinvent 2018\n", "* Build a secure lake in days...**not months**\n", "* Enforce security policies\n", "* Gain and manage insights" ] }, { "metadata": { "id": "4GwDZDvXchGJ", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![aws_lake](https://user-images.githubusercontent.com/58792/49777834-f9ccdc00-fcb6-11e8-84a0-7295a0c69a15.png)" ] }, { "metadata": { "id": "c69G5pk8Dh7Z", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Kinesis (STREAMING)" ] }, { "metadata": { "id": "GaH-jrqEIRtp", "colab_type": "text" }, "cell_type": "markdown", "source": [ "**Solves Three Key Problems**\n", "\n", "\n", "\n", "* Time-series Analytics\n", "* Real-time Dashboards\n", "* Real-time Metrics\n", "\n" ] }, { "metadata": { "id": "1skrZ0TuDpVq", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Kinesis Analytics Workflow\n", "![Kinesis Analytics](https://user-images.githubusercontent.com/58792/49440264-02ce2280-f778-11e8-9d7e-149819e74807.png)" ] }, { "metadata": { "id": "LYQ8rXU-Doab", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Kinesis Real-Time Log Analytics Example\n", "\n", "![Real-Time Log Analytics](https://user-images.githubusercontent.com/58792/49440433-7cfea700-f778-11e8-8cd5-55999cb7713c.png)" ] }, { "metadata": { "id": "LQIpSWC8HkO0", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Kinesis Ad Tech Pipeline\n", "\n", "![Ad Tech Pipeline](https://user-images.githubusercontent.com/58792/49441021-285c2b80-f77a-11e8-82e2-da9006dc4c6d.png)" ] }, { "metadata": { "id": "dBGcloLTH41z", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Kinesis IoT\n", "\n", "![Kinesis 
IoT](https://user-images.githubusercontent.com/58792/49441101-5e011480-f77a-11e8-9727-4f7706361a08.png)" ] }, { "metadata": { "id": "czV9D-QSIJGp", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [Demo] Kinesis" ] }, { "metadata": { "id": "Vq64AVvuILKx", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "WESUX507e4jg", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### AWS Batch (BATCH)" ] }, { "metadata": { "id": "lk-T-dSsfHC0", "colab_type": "text" }, "cell_type": "markdown", "source": [ "One example use case is financial services trade analysis" ] }, { "metadata": { "id": "AQfrQvV7h2IA", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![financial_services_trade](https://user-images.githubusercontent.com/58792/49778503-64334b80-fcba-11e8-85e7-dcdbfe473cd9.png)" ] }, { "metadata": { "id": "bDlM65z5nM0i", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Using AWS Batch for ML Jobs\n", "\n", "* *[Watch Video Lesson 11.6: Use AWS Batch for ML Jobs](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_06)*\n" ] }, { "metadata": { "id": "gn6fah4M4Sa1", "colab_type": "text" }, "cell_type": "markdown", "source": [ "https://aws.amazon.com/batch/\n", "\n", "![alt text](https://d1.awsstatic.com/Test%20Images/Kate%20Test%20Images/Dilithium-Diagrams_Visual-Effects-Rendering.ad9c0479c3772c67953e96ef8ae76a5095373d81.png)\n", "\n", "\n", "Example submission tool\n", "\n", "```python\n", "import click\n", "\n", "# NOTE: `cli` (a click group) and `submit_job` are assumed to be defined\n", "# elsewhere in the project; they are not shown in this excerpt.\n", "\n", "@cli.group()\n", "def run():\n", "    \"\"\"Run AWS Batch\"\"\"\n", "\n", "@run.command(\"submit\")\n", "@click.option(\"--queue\", default=\"first-run-job-queue\", help=\"Batch Queue\")\n", "@click.option(\"--jobname\", default=\"1\", help=\"Name of Job\")\n", "@click.option(\"--jobdef\", default=\"test\", help=\"Job Definition\")\n", "@click.option(\"--cmd\", default=[\"uname\"], help=\"Container Override Commands\")\n", "def submit(queue, 
jobname, jobdef, cmd):\n", "    \"\"\"Submit a job\"\"\"\n", "\n", "    result = submit_job(\n", "        job_name=jobname,\n", "        job_queue=queue,\n", "        job_definition=jobdef,\n", "        command=cmd\n", "    )\n", "    click.echo(\"CLI: Run Job Called\")\n", "    return result\n", "```" ] }, { "metadata": { "id": "ewIs7l_AptvW", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Lambda (EVENTS)" ] }, { "metadata": { "id": "dX_F3p8ipyJu", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "* Serverless\n", "* Used in most, if not all, ML platforms\n", "    - DeepLens\n", "    - SageMaker\n", "    - S3 Events\n", "\n" ] }, { "metadata": { "id": "PlVk0ly3m8SH", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### Starting AWS Lambda Development in Python with Chalice\n", "\n", "* *[Watch Video Lesson 11.3: Use AWS Lambda development with Chalice](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_03)*\n", "\n" ] }, { "metadata": { "id": "RS4-D7b72XwE", "colab_type": "text" }, "cell_type": "markdown", "source": [ "***Demo on SageMaker Terminal***\n", "\n", "https://github.com/aws/chalice\n", "\n", "*Hello World Example:*\n", "\n", "```python\n", "$ pip install chalice\n", "$ chalice new-project helloworld && cd helloworld\n", "$ cat app.py\n", "\n", "from chalice import Chalice\n", "\n", "app = Chalice(app_name=\"helloworld\")\n", "\n", "@app.route(\"/\")\n", "def index():\n", "    return {\"hello\": \"world\"}\n", "\n", "$ chalice deploy\n", "...\n", "https://endpoint/dev\n", "\n", "$ curl https://endpoint/api\n", "{\"hello\": \"world\"}\n", "```\n", "\n", "References:\n", "\n", "[Serverless Web Scraping Project](https://github.com/noahgift/web_scraping_python)" ] }, { "metadata": { "id": "oWeNDZCtp6NP", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [Demo] Deploying Hello World Lambda Function" ] }, { "metadata": { "id": "N8HXfU4GqUcq", "colab_type": "text" }, "cell_type": 
"markdown", "source": [ "" ] }, { "metadata": { "id": "Hc2I3g2xnH3Y", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Using Step functions with AWS\n", "\n", "* *[Watch Video Lesson 11.5: Use AWS Step Functions](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_05)*" ] }, { "metadata": { "id": "316vRqUj3ELe", "colab_type": "text" }, "cell_type": "markdown", "source": [ "https://aws.amazon.com/step-functions/\n", "\n", "![Step Functions](https://d1.awsstatic.com/product-marketing/Step%20Functions/AmazonCloudWatchUpdated4.a57e968b08739e170aa504feed8db3761de21e60.png)\n", "\n", "Example Project:\n", "\n", "https://github.com/noahgift/web_scraping_python" ] }, { "metadata": { "id": "rExHzr2wqYgy", "colab_type": "text" }, "cell_type": "markdown", "source": [ "[Demo] Step Function" ] }, { "metadata": { "id": "AZ6JLteQ6wXO", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.2 Data Cleaning and Preparation" ] }, { "metadata": { "id": "_x5EC-sO61UF", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Ensuring High Quality Data" ] }, { "metadata": { "id": "SNLwvwMbJub7", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "* Validity\n", "* Accuracy\n", "* Completeness\n", "* Consistency\n", "* Uniformity\n", "\n" ] }, { "metadata": { "id": "wMDw-IEhAgT4", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Dealing with missing values\n", "\n", "Often easy way is to drop missing values\n" ] }, { "metadata": { "id": "OiXVT6tfAw-9", "colab_type": "code", "outputId": "f039f74a-f0c4-4266-b1e8-643592dbfffc", "colab": { "base_uri": "https://localhost:8080/", "height": 1071 } }, "cell_type": "code", "source": [ "import pandas as pd\n", "df = pd.read_csv(\"https://raw.githubusercontent.com/noahgift/real_estate_ml/master/data/Zip_Zhvi_SingleFamilyResidence.csv\")\n", "df.isnull().sum()" ], "execution_count": 0, "outputs": [ { "output_type": 
"execute_result", "data": { "text/plain": [ "RegionID 1\n", "RegionName 1\n", "City 1\n", "State 1\n", "Metro 1140\n", "CountyName 1\n", "SizeRank 1\n", "1996-04 4440\n", "1996-05 4309\n", "1996-06 4285\n", "1996-07 4278\n", "1996-08 4265\n", "1996-09 4265\n", "1996-10 4265\n", "1996-11 4258\n", "1996-12 4258\n", "1997-01 4212\n", "1997-02 3588\n", "1997-03 3546\n", "1997-04 3546\n", "1997-05 3545\n", "1997-06 3543\n", "1997-07 3543\n", "1997-08 3357\n", "1997-09 3355\n", "1997-10 3353\n", "1997-11 3347\n", "1997-12 3341\n", "1998-01 3317\n", "1998-02 3073\n", " ... \n", "2015-04 13\n", "2015-05 1\n", "2015-06 1\n", "2015-07 1\n", "2015-08 1\n", "2015-09 2\n", "2015-10 3\n", "2015-11 1\n", "2015-12 1\n", "2016-01 1\n", "2016-02 19\n", "2016-03 19\n", "2016-04 19\n", "2016-05 19\n", "2016-06 1\n", "2016-07 1\n", "2016-08 1\n", "2016-09 1\n", "2016-10 1\n", "2016-11 1\n", "2016-12 51\n", "2017-01 1\n", "2017-02 1\n", "2017-03 1\n", "2017-04 1\n", "2017-05 1\n", "2017-06 1\n", "2017-07 1\n", "2017-08 1\n", "2017-09 1\n", "Length: 265, dtype: int64" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "metadata": { "id": "UWY9zSOCB9LJ", "colab_type": "code", "outputId": "79c39da1-7ab5-4b62-9012-649c00f0db5b", "colab": { "base_uri": "https://localhost:8080/", "height": 1071 } }, "cell_type": "code", "source": [ "df2 = df.dropna()\n", "df2.isnull().sum()" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RegionID 0\n", "RegionName 0\n", "City 0\n", "State 0\n", "Metro 0\n", "CountyName 0\n", "SizeRank 0\n", "1996-04 0\n", "1996-05 0\n", "1996-06 0\n", "1996-07 0\n", "1996-08 0\n", "1996-09 0\n", "1996-10 0\n", "1996-11 0\n", "1996-12 0\n", "1997-01 0\n", "1997-02 0\n", "1997-03 0\n", "1997-04 0\n", "1997-05 0\n", "1997-06 0\n", "1997-07 0\n", "1997-08 0\n", "1997-09 0\n", "1997-10 0\n", "1997-11 0\n", "1997-12 0\n", "1998-01 0\n", "1998-02 0\n", " ..\n", "2015-04 0\n", "2015-05 0\n", "2015-06 0\n", "2015-07 
0\n", "2015-08 0\n", "2015-09 0\n", "2015-10 0\n", "2015-11 0\n", "2015-12 0\n", "2016-01 0\n", "2016-02 0\n", "2016-03 0\n", "2016-04 0\n", "2016-05 0\n", "2016-06 0\n", "2016-07 0\n", "2016-08 0\n", "2016-09 0\n", "2016-10 0\n", "2016-11 0\n", "2016-12 0\n", "2017-01 0\n", "2017-02 0\n", "2017-03 0\n", "2017-04 0\n", "2017-05 0\n", "2017-06 0\n", "2017-07 0\n", "2017-08 0\n", "2017-09 0\n", "Length: 265, dtype: int64" ] }, "metadata": { "tags": [] }, "execution_count": 8 } ] }, { "metadata": { "id": "DFVk5GQ-CaZN", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "yo0I3xoETEpQ", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Cleaning Wikipedia Handle Example" ] }, { "metadata": { "id": "r3iyGrpYTJbe", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "```python\n", "\"\"\"\n", "Example Route To Construct:\n", "https://wikimedia.org/api/rest_v1/ +\n", "metrics/pageviews/per-article/ +\n", "en.wikipedia/all-access/user/ +\n", "LeBron_James/daily/2015070100/2017070500 +\n", "\"\"\"\n", "import requests\n", "import pandas as pd\n", "import time\n", "import wikipedia\n", "\n", "BASE_URL =\\\n", " \"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user\"\n", "\n", "def construct_url(handle, period, start, end):\n", " \"\"\"Constructs a URL based on arguments\n", " Should construct the following URL:\n", " /LeBron_James/daily/2015070100/2017070500 \n", " \"\"\"\n", "\n", " \n", " urls = [BASE_URL, handle, period, start, end]\n", " constructed = str.join('/', urls)\n", " return constructed\n", "\n", "def query_wikipedia_pageviews(url):\n", "\n", " res = requests.get(url)\n", " return res.json()\n", "\n", "def wikipedia_pageviews(handle, period, start, end):\n", " \"\"\"Returns JSON\"\"\"\n", "\n", " constructed_url = construct_url(handle, period, start,end)\n", " pageviews = query_wikipedia_pageviews(url=constructed_url)\n", " return pageviews\n", 
"\n", "def wikipedia_2016(handle,sleep=0):\n", " \"\"\"Retrieve pageviews for 2016\"\"\" \n", " \n", " print(\"SLEEP: {sleep}\".format(sleep=sleep))\n", " time.sleep(sleep)\n", " pageviews = wikipedia_pageviews(handle=handle, \n", " period=\"daily\", start=\"2016010100\", end=\"2016123100\")\n", " if not 'items' in pageviews:\n", " print(\"NO PAGEVIEWS: {handle}\".format(handle=handle))\n", " return None\n", " return pageviews\n", "\n", "def create_wikipedia_df(handles):\n", " \"\"\"Creates a Dataframe of Pageviews\"\"\"\n", "\n", " pageviews = []\n", " timestamps = [] \n", " names = []\n", " wikipedia_handles = []\n", " for name, handle in handles.items():\n", " pageviews_record = wikipedia_2016(handle)\n", " if pageviews_record is None:\n", " continue\n", " for record in pageviews_record['items']:\n", " pageviews.append(record['views'])\n", " timestamps.append(record['timestamp'])\n", " names.append(name)\n", " wikipedia_handles.append(handle)\n", " data = {\n", " \"names\": names,\n", " \"wikipedia_handles\": wikipedia_handles,\n", " \"pageviews\": pageviews,\n", " \"timestamps\": timestamps \n", " }\n", " df = pd.DataFrame(data)\n", " return df \n", "\n", "\n", "def create_wikipedia_handle(raw_handle):\n", " \"\"\"Takes a raw handle and converts it to a wikipedia handle\"\"\"\n", "\n", " wikipedia_handle = raw_handle.replace(\" \", \"_\")\n", " return wikipedia_handle\n", "\n", "def create_wikipedia_nba_handle(name):\n", " \"\"\"Appends basketball to link\"\"\"\n", "\n", " url = \" \".join([name, \"(basketball)\"])\n", " return url\n", "\n", "def wikipedia_current_nba_roster():\n", " \"\"\"Gets all links on wikipedia current roster page\"\"\"\n", "\n", " links = {}\n", " nba = wikipedia.page(\"List_of_current_NBA_team_rosters\")\n", " for link in nba.links:\n", " links[link] = create_wikipedia_handle(link)\n", " return links\n", "\n", "def guess_wikipedia_nba_handle(data=\"data/nba_2017_br.csv\"):\n", " \"\"\"Attempt to get the correct wikipedia 
handle\"\"\"\n", "\n", " links = wikipedia_current_nba_roster() \n", " nba = pd.read_csv(data)\n", " count = 0\n", " verified = {}\n", " guesses = {}\n", " for player in nba[\"Player\"].values:\n", " if player in links:\n", " print(\"Player: {player}, Link: {link} \".format(player=player,\n", " link=links[player]))\n", " print(count)\n", " count += 1\n", " verified[player] = links[player] #add wikipedia link\n", " else:\n", " print(\"NO MATCH: {player}\".format(player=player))\n", " guesses[player] = create_wikipedia_handle(player)\n", " return verified, guesses\n", "\n", "def validate_wikipedia_guesses(guesses):\n", " \"\"\"Validate guessed wikipedia accounts\"\"\"\n", "\n", " verified = {}\n", " wrong = {}\n", " for name, link in guesses.items():\n", " try:\n", " page = wikipedia.page(link)\n", " except (wikipedia.DisambiguationError, wikipedia.PageError) as error:\n", " #try basketball suffix\n", " nba_handle = create_wikipedia_nba_handle(name)\n", " try:\n", " page = wikipedia.page(nba_handle)\n", " print(\"Initial wikipedia URL Failed: {error}\".format(error=error))\n", " except (wikipedia.DisambiguationError, wikipedia.PageError) as error:\n", " print(\"Second Match Failure: {error}\".format(error=error))\n", " wrong[name] = link\n", " continue\n", " if \"NBA\" in page.summary:\n", " verified[name] = link\n", " else:\n", " print(\"NO GUESS MATCH: {name}\".format(name=name))\n", " wrong[name] = link\n", " return verified, wrong\n", "\n", "def clean_wikipedia_handles(data=\"data/nba_2017_br.csv\"):\n", " \"\"\"Clean Handles\"\"\"\n", "\n", " verified, guesses = guess_wikipedia_nba_handle(data=data)\n", " verified_cleaned, wrong = validate_wikipedia_guesses(guesses)\n", " print(\"WRONG Matches: {wrong}\".format(wrong=wrong))\n", " handles = {**verified, **verified_cleaned}\n", " return handles\n", "\n", "def nba_wikipedia_dataframe(data=\"data/nba_2017_br.csv\"):\n", " handles = clean_wikipedia_handles(data=data)\n", " df = create_wikipedia_df(handles) \n", " 
return df\n", "\n", "def create_wikipedia_csv(data=\"data/nba_2017_br.csv\"):\n", " df = nba_wikipedia_dataframe(data=data)\n", " df.to_csv(\"data/wikipedia_nba.csv\")\n", "\n", "\n", "if __name__ == \"__main__\":\n", " create_wikipedia_csv() \n", "```\n", "\n" ] }, { "metadata": { "id": "SNpJZKlHJ-_A", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Related AWS Services" ] }, { "metadata": { "id": "Gb0fhTgMKqvs", "colab_type": "text" }, "cell_type": "markdown", "source": [ "These services could all help prepare and clean data\n", "\n", "\n", "* AWS Glue\n", "* AWS Machine Learning\n", "* AWS Kinesis\n", "* AWS Lambda\n", "* AWS Sagemaker\n", "\n" ] }, { "metadata": { "id": "pNkmKQ6m627N", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.3 Data Storage Concepts" ] }, { "metadata": { "id": "fYH_66i97RoN", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Database Overview\n", "\n" ] }, { "metadata": { "id": "DHXy69aAJbDn", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![Database Styles](https://user-images.githubusercontent.com/58792/48925585-2214a800-ee7a-11e8-8546-767177679328.png)\n", "\n", "* [One size database doesn't fit anyone](https://www.allthingsdistributed.com/2018/06/purpose-built-databases-in-aws.html)" ] }, { "metadata": { "id": "q7B3KpSfnCYo", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Using AWS DynamoDB\n", "\n", "* *[Watch Video Lesson 11.4: Use AWS DynamoDB](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_04)*" ] }, { "metadata": { "id": "2cheDvcB2x02", "colab_type": "text" }, "cell_type": "markdown", "source": [ "https://aws.amazon.com/dynamodb/\n", "\n", "![alt text](https://d1.awsstatic.com/video-thumbs/dynamodb/AWS-online-games-wide.ada4247744e9be9a6d857b2e13b7eb78b18bf3a5.png)\n", "\n", "Query Example:\n", "\n", "```python\n", "def query_police_department_record_by_guid(guid):\n", " \"\"\"Gets 
one record in the PD table by guid\n", " \n", " In [5]: rec = query_police_department_record_by_guid(\n", " \"7e607b82-9e18-49dc-a9d7-e9628a9147ad\"\n", " )\n", " \n", " In [7]: rec\n", " Out[7]: \n", " {'PoliceDepartmentName': 'Hollister',\n", " 'UpdateTime': 'Fri Mar 2 12:43:43 2018',\n", " 'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}\n", " \"\"\"\n", " \n", " # NOTE: dynamodb_resource(), REGION, POLICE_DEPARTMENTS_TABLE and log are\n", " # assumed to be defined elsewhere in the project (not shown here).\n", " db = dynamodb_resource()\n", " extra_msg = {\"region_name\": REGION, \"aws_service\": \"dynamodb\", \n", " \"police_department_table\":POLICE_DEPARTMENTS_TABLE,\n", " \"guid\":guid}\n", " log.info(\"Get PD record by GUID\", extra=extra_msg)\n", " pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)\n", " response = pd_table.get_item(\n", " Key={\n", " 'guid': guid\n", " }\n", " )\n", " return response['Item']\n", "```\n" ] }, { "metadata": { "id": "gUMlly9JFYQ9", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [Demo] DynamoDB" ] }, { "metadata": { "id": "G1SgzvGTFcCY", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "Skhxh0LhFjp9", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Redshift" ] }, { "metadata": { "id": "X357SPI1FpHx", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* Data warehouse solution for AWS\n", "* Columnar data store (great at aggregating large datasets)" ] }, { "metadata": { "id": "AUbie4x-7JXM", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.4 Learn ETL Solutions (Extract-Transform-Load)" ] }, { "metadata": { "id": "6t1wD_-57Q3t", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### AWS Glue" ] }, { "metadata": { "id": "u8jPN7MZLIsn", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### AWS Glue Is a Fully Managed ETL Service\n", "\n", "![AWS Glue Screen](https://user-images.githubusercontent.com/58792/49441953-dff23d00-f77c-11e8-9065-dab53c47c345.png)" ] }, { "metadata": { "id": "qZkC2QC-MRiS", "colab_type": "text" }, "cell_type": "markdown", "source": [ 
"#### AWS Glue Workflow" ] }, { "metadata": { "id": "GuJTGk5GMbZn", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "* Build Data Catalog\n", "* Generate and Edit Transformations\n", "* Schedule and Run Jobs\n", "\n" ] }, { "metadata": { "id": "VdheVbrOeWxk", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [DEMO] AWS Glue" ] }, { "metadata": { "id": "XGLxtZlrebR0", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "y8fseYaIeVPX", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### EMR" ] }, { "metadata": { "id": "6uzJV6gpk-7P", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* Can be used for large scale distributed data jobs" ] }, { "metadata": { "id": "lR42AyxvlNTn", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Athena" ] }, { "metadata": { "id": "n2-VhYhUlYWH", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* Can replace many ETL\n", "* Serverless\n", "* Built on Presto w/ SQL Support\n", "* Meant to query Data Lake" ] }, { "metadata": { "id": "xUbJmruLldUe", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [DEMO] Athena" ] }, { "metadata": { "id": "-cI7gQxxlgFg", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "h8y_QAAElsY1", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Data Pipeline" ] }, { "metadata": { "id": "gf_jQMNLl5yv", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* create complex data processing workloads that are fault tolerant, repeatable, and highly available" ] }, { "metadata": { "id": "oA0ECumOmBLz", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [Demo] Data Pipeline" ] }, { "metadata": { "id": "gmsXNAT-mElm", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "6fhL7WGJ7Scv", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.5 
Batch vs Streaming Data" ] }, { "metadata": { "id": "5U8QWB0mEH7m", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Impact on ML Pipeline" ] }, { "metadata": { "id": "qxGzIx74EK0W", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* More control of model training in batch (can decide when to retrain)\n", "* Continuously retraining the model could produce better or worse prediction results\n", "    - Did the input stream suddenly get more or fewer users?\n", "    - Is there an A/B testing scenario?" ] }, { "metadata": { "id": "iG_FBgbY6vqH", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Batch" ] }, { "metadata": { "id": "wdJonIVdAfaw", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* Data is batched at intervals\n", "* Simplest approach to create predictions\n", "* Many Services on AWS Capable of Batch Processing\n", "    - AWS Glue\n", "    - AWS Data Pipeline\n", "    - AWS Batch\n", "    - EMR\n", "\n", "\n", "\n" ] }, { "metadata": { "id": "M22l9fGmApsu", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Streaming\n" ] }, { "metadata": { "id": "UsVFF-9NAr-H", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* Data is continuously polled or pushed\n", "* More complex method of prediction\n", "* Many Services on AWS Capable of Streaming\n", "    - Kinesis\n", "    - IoT" ] }, { "metadata": { "id": "qe8SKDDv7Y6D", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.6 Data Security" ] }, { "metadata": { "id": "h6HlzS687ZOF", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### AWS KMS (Key Management Service)" ] }, { "metadata": { "id": "N9vq6t5qPjzK", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "* Integrated with AWS Encryption SDK\n", "* CloudTrail gives an independent view of who accessed encrypted data\n", "\n" ] }, { "metadata": { "id": "FGeYIIyfilcU", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### AWS CloudTrail" ] }, { 
"metadata": { "id": "s5MYRh_JAL2H", "colab_type": "text" }, "cell_type": "markdown", "source": [ "![cloud_trail](https://user-images.githubusercontent.com/58792/49812752-f834ff80-fd1a-11e8-9ad6-bafa8e1b0779.png)" ] }, { "metadata": { "id": "1PXzeAzY_FO6", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "* enables governance, compliance, operational auditing\n", "* visibility into user and resource activity\n", "* security analysis and troubleshooting\n", "* security analysis and troubleshooting\n", "\n" ] }, { "metadata": { "id": "qJrkb72JioOs", "colab_type": "text" }, "cell_type": "markdown", "source": [ "#### [Demo] Cloud Trail" ] }, { "metadata": { "id": "xEnGV9o9isvZ", "colab_type": "text" }, "cell_type": "markdown", "source": [ "" ] }, { "metadata": { "id": "hIOu8DEbsjPi", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Other Aspects" ] }, { "metadata": { "id": "VOpNTb9LsmnI", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* IAM Roles\n", "* Security Groups\n", "* VPC" ] }, { "metadata": { "id": "hLqAX6oR8gUD", "colab_type": "text" }, "cell_type": "markdown", "source": [ "## 2.7 Data Backup and Recovery" ] }, { "metadata": { "id": "lhDdV7ydTyas", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### Most AWS Services Have Snapshot and Backup Capabilities" ] }, { "metadata": { "id": "jYC5kQUDT5Ti", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* RDS\n", "* S3\n", "* DynamoDB" ] }, { "metadata": { "id": "_E5UUskr8jWc", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### S3 Backup and Recovery\n" ] }, { "metadata": { "id": "l-_pBmYsQ33O", "colab_type": "text" }, "cell_type": "markdown", "source": [ "* S3 Snapshots\n", "* Amazon Glacier archive" ] }, { "metadata": { "id": "WKxr2MTcigL7", "colab_type": "text" }, "cell_type": "markdown", "source": [ "### [Demo] S3 Snapshot Demo\n" ] }, { "metadata": { "id": "xsd2w7XjijOL", "colab_type": "text" }, "cell_type": 
"markdown", "source": [ "" ] }, { "metadata": { "id": "MJyRBDb9RdVZ", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "" ], "execution_count": 0, "outputs": [] } ] }