{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Lesson2-AWSML-Data-Engineering.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [
        "c_Id55m6Jsbu",
        "bq4VmHjPpMOR",
        "8oYv9z6P6Ukj",
        "DJxHWoBmcY_4",
        "c69G5pk8Dh7Z",
        "1skrZ0TuDpVq",
        "LYQ8rXU-Doab",
        "LQIpSWC8HkO0",
        "dBGcloLTH41z",
        "czV9D-QSIJGp",
        "WESUX507e4jg",
        "bDlM65z5nM0i",
        "ewIs7l_AptvW",
        "PlVk0ly3m8SH",
        "oWeNDZCtp6NP",
        "Hc2I3g2xnH3Y",
        "AZ6JLteQ6wXO",
        "_x5EC-sO61UF",
        "wMDw-IEhAgT4",
        "yo0I3xoETEpQ",
        "SNpJZKlHJ-_A",
        "pNkmKQ6m627N",
        "fYH_66i97RoN",
        "q7B3KpSfnCYo",
        "gUMlly9JFYQ9",
        "Skhxh0LhFjp9",
        "AUbie4x-7JXM",
        "6t1wD_-57Q3t",
        "u8jPN7MZLIsn",
        "qZkC2QC-MRiS",
        "VdheVbrOeWxk",
        "y8fseYaIeVPX",
        "lR42AyxvlNTn",
        "xUbJmruLldUe",
        "h8y_QAAElsY1",
        "oA0ECumOmBLz",
        "6fhL7WGJ7Scv",
        "5U8QWB0mEH7m",
        "iG_FBgbY6vqH",
        "M22l9fGmApsu",
        "qe8SKDDv7Y6D",
        "h6HlzS687ZOF",
        "FGeYIIyfilcU",
        "qJrkb72JioOs",
        "hIOu8DEbsjPi",
        "hLqAX6oR8gUD",
        "lhDdV7ydTyas",
        "_E5UUskr8jWc",
        "WKxr2MTcigL7"
      ],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/noahgift/aws-ml-guide/blob/master/Lesson2_AWSML_Data_Engineering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "metadata": {
        "id": "okSLzwCiiiS-",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Lesson 2 Data Engineering for ML on AWS\n",
        "\n",
        "[Watch Lesson 2:  Data Engineering for ML on AWS Video](https://learning.oreilly.com/videos/aws-certified-machine/9780135556597/9780135556597-ACML_01_02_00)"
      ]
    },
    {
      "metadata": {
        "id": "c_Id55m6Jsbu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Pragmatic AI Labs\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "e5p96AqpSDZa",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![alt text](https://paiml.com/images/logo_with_slogan_white_background.png)\n",
        "\n",
        "This notebook was produced by [Pragmatic AI Labs](https://paiml.com/).  You can continue learning about these topics by:\n",
        "\n",
        "*   Buying a copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](http://www.informit.com/store/pragmatic-ai-an-introduction-to-cloud-based-machine-9780134863863) from Informit.\n",
        "*   Buying a copy of  [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](https://www.amazon.com/Pragmatic-AI-Introduction-Cloud-Based-Learning/dp/0134863860) from Amazon\n",
        "*   Reading an online copy of [Pragmatic AI:Pragmatic AI: An Introduction to Cloud-Based Machine Learning](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/)\n",
        "*  Watching video [Essential Machine Learning and AI with Python and Jupyter Notebook-Video-SafariOnline](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118) on Safari Books Online.\n",
        "* Watching video [AWS Certified Machine Learning-Speciality](https://learning.oreilly.com/videos/aws-certified-machine/9780135556597)\n",
        "* Purchasing video [Essential Machine Learning and AI with Python and Jupyter Notebook- Purchase Video](http://www.informit.com/store/essential-machine-learning-and-ai-with-python-and-jupyter-9780135261095)\n",
        "*   Viewing more content at [noahgift.com](https://noahgift.com/)\n"
      ]
    },
    {
      "metadata": {
        "id": "bq4VmHjPpMOR",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Load AWS API Keys"
      ]
    },
    {
      "metadata": {
        "id": "aWrzIk7WpRoh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Put keys in local or remote GDrive:  \n",
        "\n",
        "`cp ~/.aws/credentials /Users/myname/Google\\ Drive/awsml/`"
      ]
    },
    {
      "metadata": {
        "id": "hPWO_zyRopXN",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Mount GDrive\n"
      ]
    },
    {
      "metadata": {
        "id": "XI73HZNLobp4",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/gdrive', force_remount=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "UNyzZwgmoxwm",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import os;os.listdir(\"/content/gdrive/My Drive/awsml\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "fYu0ekUlqPk6",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Install Boto"
      ]
    },
    {
      "metadata": {
        "id": "dJDDrUkWrYRY",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip -q install boto3\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FpJhrpSQsK5E",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Create API Config"
      ]
    },
    {
      "metadata": {
        "id": "QxRwGOZtsN0-",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!mkdir -p ~/.aws &&\\\n",
        "  cp /content/gdrive/My\\ Drive/awsml/credentials ~/.aws/credentials "
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Kj977UW3rph_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Test Comprehend API Call"
      ]
    },
    {
      "metadata": {
        "id": "P-A8Cia-raT0",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import boto3\n",
        "comprehend = boto3.client(service_name='comprehend', region_name=\"us-east-1\")\n",
        "text = \"There is smoke in San Francisco\"\n",
        "comprehend.detect_sentiment(Text=text, LanguageCode='en')"
      ],
      "execution_count": 0,
      "outputs": []
    },
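    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "The `detect_sentiment` call above returns a dictionary. A minimal sketch of reading the label and its score, assuming the documented response keys `Sentiment` and `SentimentScore`; the helper name `summarize_sentiment` and the sample values are hypothetical, not output from a real call:\n",
        "\n",
        "```python\n",
        "def summarize_sentiment(response):\n",
        "    \"\"\"Return (label, score) from a detect_sentiment response\"\"\"\n",
        "\n",
        "    label = response[\"Sentiment\"]  # POSITIVE, NEGATIVE, NEUTRAL or MIXED\n",
        "    # SentimentScore keys are capitalized: Positive, Negative, Neutral, Mixed\n",
        "    score = response[\"SentimentScore\"][label.capitalize()]\n",
        "    return label, score\n",
        "\n",
        "# Hypothetical response, shaped like the documented API output\n",
        "sample = {\n",
        "    \"Sentiment\": \"NEUTRAL\",\n",
        "    \"SentimentScore\": {\n",
        "        \"Positive\": 0.05, \"Negative\": 0.15, \"Neutral\": 0.79, \"Mixed\": 0.01\n",
        "    },\n",
        "}\n",
        "print(summarize_sentiment(sample))\n",
        "```"
      ]
    },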
    {
      "metadata": {
        "id": "8oYv9z6P6Ukj",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.1 Data Ingestion Concepts"
      ]
    },
    {
      "metadata": {
        "id": "v7OC_Fh5QV9A",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Data Lakes"
      ]
    },
    {
      "metadata": {
        "id": "Nt6VTvFVQb-D",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Central Repository** for all data at any scale"
      ]
    },
    {
      "metadata": {
        "id": "8iyQzWR3amgh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![data_lake](https://user-images.githubusercontent.com/58792/49777724-8aef8300-fcb6-11e8-981e-96d14498a801.png)"
      ]
    },
    {
      "metadata": {
        "id": "DJxHWoBmcY_4",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### AWS Lake Formation"
      ]
    },
    {
      "metadata": {
        "id": "X5iz82w_dCJ_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* New Service Announced at Reinvent 2018\n",
        "* Build a secure lake in days...**not months**\n",
        "* Enforce security policies\n",
        "* Gain and manage insights"
      ]
    },
    {
      "metadata": {
        "id": "4GwDZDvXchGJ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![aws_lake](https://user-images.githubusercontent.com/58792/49777834-f9ccdc00-fcb6-11e8-84a0-7295a0c69a15.png)"
      ]
    },
    {
      "metadata": {
        "id": "c69G5pk8Dh7Z",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Kinesis (STREAMING)"
      ]
    },
    {
      "metadata": {
        "id": "GaH-jrqEIRtp",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Solves Three Key Problems**\n",
        "\n",
        "\n",
        "\n",
        "*   Time-series Analytics\n",
        "*   Real-time Dashboards\n",
        "*   Real-time Metrics\n",
        "\n"
      ]
    },
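    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Real-time dashboards and metrics start with a producer writing events to a stream. A minimal sketch using the argument names of boto3's Kinesis `put_record` call; the stream name `clickstream`, the event fields, and the helper `build_record` are assumptions for illustration, and the actual call is commented out because it needs credentials and an existing stream:\n",
        "\n",
        "```python\n",
        "import json\n",
        "\n",
        "def build_record(stream_name, event, key_field):\n",
        "    \"\"\"Build keyword arguments for kinesis.put_record\"\"\"\n",
        "\n",
        "    return {\n",
        "        \"StreamName\": stream_name,\n",
        "        \"Data\": json.dumps(event).encode(\"utf-8\"),  # payload as bytes\n",
        "        \"PartitionKey\": str(event[key_field]),      # decides the target shard\n",
        "    }\n",
        "\n",
        "event = {\"user_id\": 42, \"page\": \"/pricing\", \"latency_ms\": 120}\n",
        "record = build_record(\"clickstream\", event, key_field=\"user_id\")\n",
        "print(record[\"PartitionKey\"])\n",
        "\n",
        "# With credentials and an existing stream (assumption):\n",
        "# import boto3\n",
        "# kinesis = boto3.client(\"kinesis\", region_name=\"us-east-1\")\n",
        "# kinesis.put_record(**record)\n",
        "```"
      ]
    },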
    {
      "metadata": {
        "id": "1skrZ0TuDpVq",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Kinesis Analytics Workflow\n",
        "![Kinesis Analytics](https://user-images.githubusercontent.com/58792/49440264-02ce2280-f778-11e8-9d7e-149819e74807.png)"
      ]
    },
    {
      "metadata": {
        "id": "LYQ8rXU-Doab",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Kinesis Real-Time Log Analytics Example\n",
        "\n",
        "![Real-Time Log Analytics](https://user-images.githubusercontent.com/58792/49440433-7cfea700-f778-11e8-8cd5-55999cb7713c.png)"
      ]
    },
    {
      "metadata": {
        "id": "LQIpSWC8HkO0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Kinesis Ad Tech Pipeline\n",
        "\n",
        "![Ad Tech Pipeline](https://user-images.githubusercontent.com/58792/49441021-285c2b80-f77a-11e8-82e2-da9006dc4c6d.png)"
      ]
    },
    {
      "metadata": {
        "id": "dBGcloLTH41z",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Kinesis IoT\n",
        "\n",
        "![Kinesis IoT](https://user-images.githubusercontent.com/58792/49441101-5e011480-f77a-11e8-9727-4f7706361a08.png)"
      ]
    },
    {
      "metadata": {
        "id": "czV9D-QSIJGp",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [Demo] Kinesis"
      ]
    },
    {
      "metadata": {
        "id": "Vq64AVvuILKx",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "WESUX507e4jg",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### AWS Batch (BATCH)"
      ]
    },
    {
      "metadata": {
        "id": "lk-T-dSsfHC0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Example could be Financial Service Trade Analysis"
      ]
    },
    {
      "metadata": {
        "id": "AQfrQvV7h2IA",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![financial_services_trade](https://user-images.githubusercontent.com/58792/49778503-64334b80-fcba-11e8-85e7-dcdbfe473cd9.png)"
      ]
    },
    {
      "metadata": {
        "id": "bDlM65z5nM0i",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Using AWS Batch for ML Jobs\n",
        "\n",
        "* *[Watch Video Lesson 11.6:  Use AWS Batch for ML Jobs](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_06)*\n"
      ]
    },
    {
      "metadata": {
        "id": "gn6fah4M4Sa1",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "https://aws.amazon.com/batch/\n",
        "\n",
        "![alt text](https://d1.awsstatic.com/Test%20Images/Kate%20Test%20Images/Dilithium-Diagrams_Visual-Effects-Rendering.ad9c0479c3772c67953e96ef8ae76a5095373d81.png)\n",
        "\n",
        "\n",
        "Example submissions tool\n",
        "\n",
        "```python\n",
        "@cli.group()\n",
        "def run():\n",
        "    \"\"\"Run AWS Batch\"\"\"\n",
        "\n",
        "@run.command(\"submit\")\n",
        "@click.option(\"--queue\", default=\"first-run-job-queue\", help=\"Batch Queue\")\n",
        "@click.option(\"--jobname\", default=\"1\", help=\"Name of Job\")\n",
        "@click.option(\"--jobdef\", default=\"test\", help=\"Job Definition\")\n",
        "@click.option(\"--cmd\", default=[\"uname\"], help=\"Container Override Commands\")\n",
        "def submit(queue, jobname, jobdef, cmd):\n",
        "    \"\"\"Submit a job\"\"\"\n",
        "\n",
        "    result = submit_job(\n",
        "        job_name=jobname,\n",
        "        job_queue=queue,\n",
        "        job_definition=jobdef,\n",
        "        command=cmd\n",
        "    )\n",
        "    click.echo(\"CLI:  Run Job Called\")\n",
        "    return result\n",
        "```"
      ]
    },
    {
      "metadata": {
        "id": "ewIs7l_AptvW",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Lambda (EVENTS)"
      ]
    },
    {
      "metadata": {
        "id": "dX_F3p8ipyJu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "* Serverless\n",
        "*   Used in most if not all ML Platforms\n",
        " - DeepLense\n",
        " - Sagemaker\n",
        " - S3 Events\n",
        "\n"
      ]
    },
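    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "As a sketch of the event-driven style, a Lambda handler for S3 events reads the bucket and key out of each record. The sample event and its bucket and key names are hypothetical; the `Records` / `s3` / `bucket` / `object` layout follows the S3 event notification format, which URL-encodes object keys:\n",
        "\n",
        "```python\n",
        "from urllib.parse import unquote_plus\n",
        "\n",
        "def lambda_handler(event, context):\n",
        "    \"\"\"Return the (bucket, key) pairs found in an S3 event\"\"\"\n",
        "\n",
        "    uploads = []\n",
        "    for record in event.get(\"Records\", []):\n",
        "        bucket = record[\"s3\"][\"bucket\"][\"name\"]\n",
        "        # Keys arrive URL-encoded, for example spaces become '+'\n",
        "        key = unquote_plus(record[\"s3\"][\"object\"][\"key\"])\n",
        "        uploads.append((bucket, key))\n",
        "    return uploads\n",
        "\n",
        "# Hypothetical, trimmed-down event for local testing\n",
        "sample_event = {\n",
        "    \"Records\": [\n",
        "        {\"s3\": {\"bucket\": {\"name\": \"my-bucket\"},\n",
        "                \"object\": {\"key\": \"raw/trade+data.csv\"}}}\n",
        "    ]\n",
        "}\n",
        "print(lambda_handler(sample_event, None))\n",
        "```\n",
        "\n",
        "Because the handler is a plain function, it can be exercised locally with a sample event before it is deployed."
      ]
    },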
    {
      "metadata": {
        "id": "PlVk0ly3m8SH",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Starting development with AWS Python Lambda development with Chalice\n",
        "\n",
        "* *[Watch Video Lesson 11.3:  Use AWS Lambda development with Chalice](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_03)*\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "RS4-D7b72XwE",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "***Demo on Sagemaker Terminal***\n",
        "\n",
        "https://github.com/aws/chalice\n",
        "\n",
        "*Hello World Example:*\n",
        "\n",
        "```python\n",
        "$ pip install chalice\n",
        "$ chalice new-project helloworld && cd helloworld\n",
        "$ cat app.py\n",
        "\n",
        "from chalice import Chalice\n",
        "\n",
        "app = Chalice(app_name=\"helloworld\")\n",
        "\n",
        "@app.route(\"/\")\n",
        "def index():\n",
        "    return {\"hello\": \"world\"}\n",
        "\n",
        "$ chalice deploy\n",
        "...\n",
        "https://endpoint/dev\n",
        "\n",
        "$ curl https://endpoint/api\n",
        "{\"hello\": \"world\"}\n",
        "```\n",
        "\n",
        "References:\n",
        "\n",
        "[Serverless Web Scraping Project](https://github.com/noahgift/web_scraping_python)"
      ]
    },
    {
      "metadata": {
        "id": "oWeNDZCtp6NP",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [Demo] Deploying Hello World Lambda Function"
      ]
    },
    {
      "metadata": {
        "id": "N8HXfU4GqUcq",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "Hc2I3g2xnH3Y",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Using Step functions with AWS\n",
        "\n",
        "* *[Watch Video Lesson 11.5:  Use AWS Step Functions](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_05)*"
      ]
    },
    {
      "metadata": {
        "id": "316vRqUj3ELe",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "https://aws.amazon.com/step-functions/\n",
        "\n",
        "![Step Functions](https://d1.awsstatic.com/product-marketing/Step%20Functions/AmazonCloudWatchUpdated4.a57e968b08739e170aa504feed8db3761de21e60.png)\n",
        "\n",
        "Example Project:\n",
        "\n",
        "https://github.com/noahgift/web_scraping_python"
      ]
    },
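    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "A Step Functions workflow is described in Amazon States Language, a JSON document. A minimal sketch of a two-step state machine built as a Python dictionary; the state names and Lambda ARNs are placeholders, not resources from the example project:\n",
        "\n",
        "```python\n",
        "import json\n",
        "\n",
        "# Placeholder ARNs (assumptions), replace with real Lambda functions\n",
        "SCRAPE_ARN = \"arn:aws:lambda:us-east-1:123456789012:function:scrape\"\n",
        "STORE_ARN = \"arn:aws:lambda:us-east-1:123456789012:function:store\"\n",
        "\n",
        "state_machine = {\n",
        "    \"Comment\": \"Scrape a page, then store the result\",\n",
        "    \"StartAt\": \"Scrape\",\n",
        "    \"States\": {\n",
        "        \"Scrape\": {\"Type\": \"Task\", \"Resource\": SCRAPE_ARN, \"Next\": \"Store\"},\n",
        "        \"Store\": {\"Type\": \"Task\", \"Resource\": STORE_ARN, \"End\": True},\n",
        "    },\n",
        "}\n",
        "\n",
        "definition = json.dumps(state_machine, indent=2)\n",
        "print(definition)\n",
        "```\n",
        "\n",
        "The resulting string is the kind of value the `definition` parameter of boto3's `stepfunctions` `create_state_machine` call expects."
      ]
    },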
    {
      "metadata": {
        "id": "rExHzr2wqYgy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "[Demo] Step Function"
      ]
    },
    {
      "metadata": {
        "id": "AZ6JLteQ6wXO",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.2 Data Cleaning and Preparation"
      ]
    },
    {
      "metadata": {
        "id": "_x5EC-sO61UF",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Ensuring High Quality Data"
      ]
    },
    {
      "metadata": {
        "id": "SNLwvwMbJub7",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "*   Validity\n",
        "*   Accuracy\n",
        "*   Completeness\n",
        "*   Consistency\n",
        "*   Uniformity\n",
        "\n"
      ]
    },
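    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Some of these dimensions can be checked mechanically. A minimal sketch for completeness and validity on plain dictionaries; the field names, the sample records, and the rule that a price must be positive are assumptions for illustration:\n",
        "\n",
        "```python\n",
        "REQUIRED_FIELDS = (\"city\", \"state\", \"price\")  # assumed schema\n",
        "\n",
        "def check_record(record):\n",
        "    \"\"\"Return a list of data quality problems found in one record\"\"\"\n",
        "\n",
        "    problems = []\n",
        "    for field in REQUIRED_FIELDS:\n",
        "        # Completeness: every required field is present and non-empty\n",
        "        if record.get(field) in (None, \"\"):\n",
        "            problems.append(\"missing {field}\".format(field=field))\n",
        "    price = record.get(\"price\")\n",
        "    # Validity: a price must be a positive number\n",
        "    if isinstance(price, (int, float)) and price <= 0:\n",
        "        problems.append(\"invalid price\")\n",
        "    return problems\n",
        "\n",
        "records = [\n",
        "    {\"city\": \"Austin\", \"state\": \"TX\", \"price\": 350000},\n",
        "    {\"city\": \"\", \"state\": \"CA\", \"price\": -1},\n",
        "]\n",
        "for record in records:\n",
        "    print(check_record(record))\n",
        "```"
      ]
    },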
    {
      "metadata": {
        "id": "wMDw-IEhAgT4",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Dealing with missing values\n",
        "\n",
        "Often easy way is to drop missing values\n"
      ]
    },
    {
      "metadata": {
        "id": "OiXVT6tfAw-9",
        "colab_type": "code",
        "outputId": "f039f74a-f0c4-4266-b1e8-643592dbfffc",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1071
        }
      },
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "df = pd.read_csv(\"https://raw.githubusercontent.com/noahgift/real_estate_ml/master/data/Zip_Zhvi_SingleFamilyResidence.csv\")\n",
        "df.isnull().sum()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "RegionID         1\n",
              "RegionName       1\n",
              "City             1\n",
              "State            1\n",
              "Metro         1140\n",
              "CountyName       1\n",
              "SizeRank         1\n",
              "1996-04       4440\n",
              "1996-05       4309\n",
              "1996-06       4285\n",
              "1996-07       4278\n",
              "1996-08       4265\n",
              "1996-09       4265\n",
              "1996-10       4265\n",
              "1996-11       4258\n",
              "1996-12       4258\n",
              "1997-01       4212\n",
              "1997-02       3588\n",
              "1997-03       3546\n",
              "1997-04       3546\n",
              "1997-05       3545\n",
              "1997-06       3543\n",
              "1997-07       3543\n",
              "1997-08       3357\n",
              "1997-09       3355\n",
              "1997-10       3353\n",
              "1997-11       3347\n",
              "1997-12       3341\n",
              "1998-01       3317\n",
              "1998-02       3073\n",
              "              ... \n",
              "2015-04         13\n",
              "2015-05          1\n",
              "2015-06          1\n",
              "2015-07          1\n",
              "2015-08          1\n",
              "2015-09          2\n",
              "2015-10          3\n",
              "2015-11          1\n",
              "2015-12          1\n",
              "2016-01          1\n",
              "2016-02         19\n",
              "2016-03         19\n",
              "2016-04         19\n",
              "2016-05         19\n",
              "2016-06          1\n",
              "2016-07          1\n",
              "2016-08          1\n",
              "2016-09          1\n",
              "2016-10          1\n",
              "2016-11          1\n",
              "2016-12         51\n",
              "2017-01          1\n",
              "2017-02          1\n",
              "2017-03          1\n",
              "2017-04          1\n",
              "2017-05          1\n",
              "2017-06          1\n",
              "2017-07          1\n",
              "2017-08          1\n",
              "2017-09          1\n",
              "Length: 265, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "metadata": {
        "id": "UWY9zSOCB9LJ",
        "colab_type": "code",
        "outputId": "79c39da1-7ab5-4b62-9012-649c00f0db5b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1071
        }
      },
      "cell_type": "code",
      "source": [
        "df2 = df.dropna()\n",
        "df2.isnull().sum()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "RegionID      0\n",
              "RegionName    0\n",
              "City          0\n",
              "State         0\n",
              "Metro         0\n",
              "CountyName    0\n",
              "SizeRank      0\n",
              "1996-04       0\n",
              "1996-05       0\n",
              "1996-06       0\n",
              "1996-07       0\n",
              "1996-08       0\n",
              "1996-09       0\n",
              "1996-10       0\n",
              "1996-11       0\n",
              "1996-12       0\n",
              "1997-01       0\n",
              "1997-02       0\n",
              "1997-03       0\n",
              "1997-04       0\n",
              "1997-05       0\n",
              "1997-06       0\n",
              "1997-07       0\n",
              "1997-08       0\n",
              "1997-09       0\n",
              "1997-10       0\n",
              "1997-11       0\n",
              "1997-12       0\n",
              "1998-01       0\n",
              "1998-02       0\n",
              "             ..\n",
              "2015-04       0\n",
              "2015-05       0\n",
              "2015-06       0\n",
              "2015-07       0\n",
              "2015-08       0\n",
              "2015-09       0\n",
              "2015-10       0\n",
              "2015-11       0\n",
              "2015-12       0\n",
              "2016-01       0\n",
              "2016-02       0\n",
              "2016-03       0\n",
              "2016-04       0\n",
              "2016-05       0\n",
              "2016-06       0\n",
              "2016-07       0\n",
              "2016-08       0\n",
              "2016-09       0\n",
              "2016-10       0\n",
              "2016-11       0\n",
              "2016-12       0\n",
              "2017-01       0\n",
              "2017-02       0\n",
              "2017-03       0\n",
              "2017-04       0\n",
              "2017-05       0\n",
              "2017-06       0\n",
              "2017-07       0\n",
              "2017-08       0\n",
              "2017-09       0\n",
              "Length: 265, dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
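    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Dropping every row that has any missing value can discard a lot of data. A hedged sketch of two alternatives on a small made-up DataFrame (the column names and values are illustrative, not taken from the Zillow file): drop rows only when a key column is missing, then fill numeric gaps with the column median.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "df_small = pd.DataFrame({\n",
        "    \"City\": [\"Austin\", None, \"Denver\", \"Boise\"],\n",
        "    \"2017-08\": [300.0, 410.0, np.nan, 250.0],\n",
        "    \"2017-09\": [305.0, np.nan, 420.0, 255.0],\n",
        "})\n",
        "\n",
        "# Drop rows only when the key column is missing\n",
        "has_city = df_small.dropna(subset=[\"City\"])\n",
        "\n",
        "# Fill numeric gaps with each column's median\n",
        "numeric_cols = [\"2017-08\", \"2017-09\"]\n",
        "filled = has_city.copy()\n",
        "filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())\n",
        "\n",
        "print(len(df_small.dropna()), len(has_city))\n",
        "print(filled.isnull().sum().sum())\n",
        "```\n",
        "\n",
        "Which strategy fits depends on why the values are missing; median filling is only one assumption among several reasonable ones."
      ]
    },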
    {
      "metadata": {
        "id": "DFVk5GQ-CaZN",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "yo0I3xoETEpQ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Cleaning Wikipedia Handle Example"
      ]
    },
    {
      "metadata": {
        "id": "r3iyGrpYTJbe",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "```python\n",
        "\"\"\"\n",
        "Example Route To Construct:\n",
        "https://wikimedia.org/api/rest_v1/ +\n",
        "metrics/pageviews/per-article/ +\n",
        "en.wikipedia/all-access/user/ +\n",
        "LeBron_James/daily/2015070100/2017070500 +\n",
        "\"\"\"\n",
        "import requests\n",
        "import pandas as pd\n",
        "import time\n",
        "import wikipedia\n",
        "\n",
        "BASE_URL =\\\n",
        " \"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user\"\n",
        "\n",
        "def construct_url(handle, period, start, end):\n",
        "    \"\"\"Constructs a URL based on arguments\n",
        "    Should construct the following URL:\n",
        "    /LeBron_James/daily/2015070100/2017070500 \n",
        "    \"\"\"\n",
        "\n",
        "    \n",
        "    urls  = [BASE_URL, handle, period, start, end]\n",
        "    constructed = str.join('/', urls)\n",
        "    return constructed\n",
        "\n",
        "def query_wikipedia_pageviews(url):\n",
        "\n",
        "    res = requests.get(url)\n",
        "    return res.json()\n",
        "\n",
        "def wikipedia_pageviews(handle, period, start, end):\n",
        "    \"\"\"Returns JSON\"\"\"\n",
        "\n",
        "    constructed_url = construct_url(handle, period, start,end)\n",
        "    pageviews = query_wikipedia_pageviews(url=constructed_url)\n",
        "    return pageviews\n",
        "\n",
        "def wikipedia_2016(handle,sleep=0):\n",
        "    \"\"\"Retrieve pageviews for 2016\"\"\" \n",
        "    \n",
        "    print(\"SLEEP: {sleep}\".format(sleep=sleep))\n",
        "    time.sleep(sleep)\n",
        "    pageviews = wikipedia_pageviews(handle=handle, \n",
        "            period=\"daily\", start=\"2016010100\", end=\"2016123100\")\n",
        "    if not 'items' in pageviews:\n",
        "        print(\"NO PAGEVIEWS: {handle}\".format(handle=handle))\n",
        "        return None\n",
        "    return pageviews\n",
        "\n",
        "def create_wikipedia_df(handles):\n",
        "    \"\"\"Creates a Dataframe of Pageviews\"\"\"\n",
        "\n",
        "    pageviews = []\n",
        "    timestamps = []    \n",
        "    names = []\n",
        "    wikipedia_handles = []\n",
        "    for name, handle in handles.items():\n",
        "        pageviews_record = wikipedia_2016(handle)\n",
        "        if pageviews_record is None:\n",
        "            continue\n",
        "        for record in pageviews_record['items']:\n",
        "            pageviews.append(record['views'])\n",
        "            timestamps.append(record['timestamp'])\n",
        "            names.append(name)\n",
        "            wikipedia_handles.append(handle)\n",
        "    data = {\n",
        "        \"names\": names,\n",
        "        \"wikipedia_handles\": wikipedia_handles,\n",
        "        \"pageviews\": pageviews,\n",
        "        \"timestamps\": timestamps \n",
        "    }\n",
        "    df = pd.DataFrame(data)\n",
        "    return df    \n",
        "\n",
        "\n",
        "def create_wikipedia_handle(raw_handle):\n",
        "    \"\"\"Takes a raw handle and converts it to a wikipedia handle\"\"\"\n",
        "\n",
        "    wikipedia_handle = raw_handle.replace(\" \", \"_\")\n",
        "    return wikipedia_handle\n",
        "\n",
        "def create_wikipedia_nba_handle(name):\n",
        "    \"\"\"Appends basketball to link\"\"\"\n",
        "\n",
        "    url = \" \".join([name, \"(basketball)\"])\n",
        "    return url\n",
        "\n",
        "def wikipedia_current_nba_roster():\n",
        "    \"\"\"Gets all links on wikipedia current roster page\"\"\"\n",
        "\n",
        "    links = {}\n",
        "    nba = wikipedia.page(\"List_of_current_NBA_team_rosters\")\n",
        "    for link in nba.links:\n",
        "        links[link] = create_wikipedia_handle(link)\n",
        "    return links\n",
        "\n",
        "def guess_wikipedia_nba_handle(data=\"data/nba_2017_br.csv\"):\n",
        "    \"\"\"Attempt to get the correct wikipedia handle\"\"\"\n",
        "\n",
        "    links = wikipedia_current_nba_roster() \n",
        "    nba = pd.read_csv(data)\n",
        "    count = 0\n",
        "    verified = {}\n",
        "    guesses = {}\n",
        "    for player in nba[\"Player\"].values:\n",
        "        if player in links:\n",
        "            print(\"Player: {player}, Link: {link} \".format(player=player,\n",
        "                 link=links[player]))\n",
        "            print(count)\n",
        "            count += 1\n",
        "            verified[player] = links[player] #add wikipedia link\n",
        "        else:\n",
        "            print(\"NO MATCH: {player}\".format(player=player))\n",
        "            guesses[player] = create_wikipedia_handle(player)\n",
        "    return verified, guesses\n",
        "\n",
        "def validate_wikipedia_guesses(guesses):\n",
        "    \"\"\"Validate guessed wikipedia accounts\"\"\"\n",
        "\n",
        "    verified = {}\n",
        "    wrong = {}\n",
        "    for name, link in guesses.items():\n",
        "        try:\n",
        "            page = wikipedia.page(link)\n",
        "        except (wikipedia.DisambiguationError, wikipedia.PageError) as error:\n",
        "            #try basketball suffix\n",
        "            nba_handle = create_wikipedia_nba_handle(name)\n",
        "            try:\n",
        "                page = wikipedia.page(nba_handle)\n",
        "                print(\"Initial wikipedia URL Failed: {error}\".format(error=error))\n",
        "            except (wikipedia.DisambiguationError, wikipedia.PageError) as error:\n",
        "                print(\"Second Match Failure: {error}\".format(error=error))\n",
        "                wrong[name] = link\n",
        "                continue\n",
        "        if \"NBA\" in page.summary:\n",
        "            verified[name] = link\n",
        "        else:\n",
        "            print(\"NO GUESS MATCH: {name}\".format(name=name))\n",
        "            wrong[name] = link\n",
        "    return verified, wrong\n",
        "\n",
        "def clean_wikipedia_handles(data=\"data/nba_2017_br.csv\"):\n",
        "    \"\"\"Clean Handles\"\"\"\n",
        "\n",
        "    verified, guesses = guess_wikipedia_nba_handle(data=data)\n",
        "    verified_cleaned, wrong = validate_wikipedia_guesses(guesses)\n",
        "    print(\"WRONG Matches: {wrong}\".format(wrong=wrong))\n",
        "    handles = {**verified, **verified_cleaned}\n",
        "    return handles\n",
        "\n",
        "def nba_wikipedia_dataframe(data=\"data/nba_2017_br.csv\"):\n",
        "    handles = clean_wikipedia_handles(data=data)\n",
        "    df = create_wikipedia_df(handles)    \n",
        "    return df\n",
        "\n",
        "def create_wikipedia_csv(data=\"data/nba_2017_br.csv\"):\n",
        "    df = nba_wikipedia_dataframe(data=data)\n",
        "    df.to_csv(\"data/wikipedia_nba.csv\")\n",
        "\n",
        "\n",
        "if __name__ == \"__main__\":\n",
        "    create_wikipedia_csv() \n",
        "```\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "SNpJZKlHJ-_A",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Related AWS Services"
      ]
    },
    {
      "metadata": {
        "id": "Gb0fhTgMKqvs",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "These services could all help prepare and clean data\n",
        "\n",
        "\n",
        "*   AWS Glue\n",
        "*   AWS Machine Learning\n",
        "*   AWS Kinesis\n",
        "*   AWS Lambda\n",
        "*   AWS Sagemaker\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "pNkmKQ6m627N",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.3 Data Storage Concepts"
      ]
    },
    {
      "metadata": {
        "id": "fYH_66i97RoN",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Database Overview\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "DHXy69aAJbDn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![Database Styles](https://user-images.githubusercontent.com/58792/48925585-2214a800-ee7a-11e8-8546-767177679328.png)\n",
        "\n",
        "* [One size database doesn't fit anyone](https://www.allthingsdistributed.com/2018/06/purpose-built-databases-in-aws.html)"
      ]
    },
    {
      "metadata": {
        "id": "q7B3KpSfnCYo",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Using AWS DynamoDB\n",
        "\n",
        "* *[Watch Video Lesson 11.4:  Use AWS DynamoDB](https://www.safaribooksonline.com/videos/essential-machine-learning/9780135261118/9780135261118-EMLA_01_11_04)*"
      ]
    },
    {
      "metadata": {
        "id": "2cheDvcB2x02",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "https://aws.amazon.com/dynamodb/\n",
        "\n",
        "![alt text](https://d1.awsstatic.com/video-thumbs/dynamodb/AWS-online-games-wide.ada4247744e9be9a6d857b2e13b7eb78b18bf3a5.png)\n",
        "\n",
        "Query Example:\n",
        "\n",
        "```python\n",
        "def query_police_department_record_by_guid(guid):\n",
        "    \"\"\"Gets one record in the PD table by guid\n",
        "    \n",
        "    In [5]: rec = query_police_department_record_by_guid(\n",
        "        \"7e607b82-9e18-49dc-a9d7-e9628a9147ad\"\n",
        "        )\n",
        "    \n",
        "    In [7]: rec\n",
        "    Out[7]: \n",
        "    {'PoliceDepartmentName': 'Hollister',\n",
        "     'UpdateTime': 'Fri Mar  2 12:43:43 2018',\n",
        "     'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}\n",
        "    \"\"\"\n",
        "    \n",
        "    db = dynamodb_resource()\n",
        "    extra_msg = {\"region_name\": REGION, \"aws_service\": \"dynamodb\", \n",
        "        \"police_department_table\":POLICE_DEPARTMENTS_TABLE,\n",
        "        \"guid\":guid}\n",
        "    log.info(f\"Get PD record by GUID\", extra=extra_msg)\n",
        "    pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)\n",
        "    response = pd_table.get_item(\n",
        "        Key={\n",
        "            'guid': guid\n",
        "            }\n",
        "    )\n",
        "    return response['Item']\n",
        "```\n"
      ]
    },
    {
      "metadata": {
        "id": "gUMlly9JFYQ9",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [Demo] DynamoDB"
      ]
    },
    {
      "metadata": {
        "id": "G1SgzvGTFcCY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "Skhxh0LhFjp9",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Redshift"
      ]
    },
    {
      "metadata": {
        "id": "X357SPI1FpHx",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* Data Warehouse Solution for AWS\n",
        "* Column Data Store (Great at counting large data)"
      ]
    },
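    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "To see why a column layout suits counting and other aggregations, the sketch below stores the same records row-wise and column-wise in plain Python. It only illustrates the layout idea and is not how Redshift is implemented; the sample data and the `to_columns` helper are made up for this example.\n",
        "\n",
        "```python\n",
        "# Hypothetical sample data, for illustration only\n",
        "rows = [\n",
        "    {\"player\": \"A\", \"team\": \"GSW\", \"points\": 30},\n",
        "    {\"player\": \"B\", \"team\": \"GSW\", \"points\": 12},\n",
        "    {\"player\": \"C\", \"team\": \"CLE\", \"points\": 25},\n",
        "]\n",
        "\n",
        "\n",
        "def to_columns(records):\n",
        "    \"\"\"Pivot a list of row dicts into a dict of column lists.\"\"\"\n",
        "    columns = {}\n",
        "    for record in records:\n",
        "        for key, value in record.items():\n",
        "            columns.setdefault(key, []).append(value)\n",
        "    return columns\n",
        "\n",
        "\n",
        "columns = to_columns(rows)\n",
        "\n",
        "# Row layout: every full record is visited to aggregate one field\n",
        "total_from_rows = sum(record[\"points\"] for record in rows)\n",
        "\n",
        "# Column layout: only the single column that is needed gets scanned\n",
        "total_from_columns = sum(columns[\"points\"])\n",
        "```\n",
        "\n",
        "Both totals are identical; the difference is how much data has to be read to compute them, which is what matters at warehouse scale."
      ]
    },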
    {
      "metadata": {
        "id": "AUbie4x-7JXM",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.4 Learn ETL Solutions (Extract-Transform-Load)"
      ]
    },
    {
      "metadata": {
        "id": "6t1wD_-57Q3t",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### AWS Glue"
      ]
    },
    {
      "metadata": {
        "id": "u8jPN7MZLIsn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### AWS Glue is fully managed ETL Service\n",
        "\n",
        "![AWS Glue Screen](https://user-images.githubusercontent.com/58792/49441953-dff23d00-f77c-11e8-9065-dab53c47c345.png)"
      ]
    },
    {
      "metadata": {
        "id": "qZkC2QC-MRiS",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### AWS Glue Workflow"
      ]
    },
    {
      "metadata": {
        "id": "GuJTGk5GMbZn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "*   Build Data Catalog\n",
        "*   Generate and Edit Transformations\n",
        "*   Schedule and Run Jobs\n",
        "\n"
      ]
    },
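    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "The workflow steps above can be sketched with the boto3 Glue client. This is a minimal sketch, assuming a crawler and an ETL job have already been created in the Glue console; the function names and the crawler and job names are placeholders, and `glue` is expected to be `boto3.client(\"glue\")` (it is passed in so it can be stubbed).\n",
        "\n",
        "```python\n",
        "import time\n",
        "\n",
        "\n",
        "def wait_for_crawler(glue, crawler_name, poll_seconds=10):\n",
        "    \"\"\"Block until the crawler reports the READY (idle) state.\"\"\"\n",
        "    while glue.get_crawler(Name=crawler_name)[\"Crawler\"][\"State\"] != \"READY\":\n",
        "        time.sleep(poll_seconds)\n",
        "\n",
        "\n",
        "def run_glue_workflow(glue, crawler_name, job_name, job_args=None, poll_seconds=10):\n",
        "    \"\"\"Refresh the Data Catalog, then start an ETL job run and return its id.\"\"\"\n",
        "    # 1. Build or refresh the Data Catalog by running an existing crawler\n",
        "    glue.start_crawler(Name=crawler_name)\n",
        "    wait_for_crawler(glue, crawler_name, poll_seconds)\n",
        "    # 2. The transformation script itself is authored in Glue beforehand\n",
        "    # 3. Run the job on demand (a Glue trigger could schedule it instead)\n",
        "    response = glue.start_job_run(JobName=job_name, Arguments=job_args or {})\n",
        "    return response[\"JobRunId\"]\n",
        "```\n",
        "\n",
        "A call might look like `run_glue_workflow(boto3.client(\"glue\"), \"my-crawler\", \"my-etl-job\")`, where both names are hypothetical."
      ]
    },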
    {
      "metadata": {
        "id": "VdheVbrOeWxk",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [DEMO] AWS Glue"
      ]
    },
    {
      "metadata": {
        "id": "XGLxtZlrebR0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "y8fseYaIeVPX",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### EMR"
      ]
    },
    {
      "metadata": {
        "id": "6uzJV6gpk-7P",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* Can be used for large scale distributed data jobs"
      ]
    },
    {
      "metadata": {
        "id": "lR42AyxvlNTn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Athena"
      ]
    },
    {
      "metadata": {
        "id": "n2-VhYhUlYWH",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* Can replace many ETL\n",
        "* Serverless\n",
        "* Built on Presto w/ SQL Support\n",
        "* Meant to query Data Lake"
      ]
    },
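    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Because Athena is serverless, querying the data lake comes down to submitting SQL and polling for the result. The sketch below shows one way to do that with the boto3 Athena client; the function name is illustrative, the database and S3 output location are values you would supply, and `athena` is expected to be `boto3.client(\"athena\")`.\n",
        "\n",
        "```python\n",
        "import time\n",
        "\n",
        "\n",
        "def run_athena_query(athena, sql, database, output_s3, poll_seconds=1):\n",
        "    \"\"\"Run a SQL query with Athena and return the first page of rows.\"\"\"\n",
        "    started = athena.start_query_execution(\n",
        "        QueryString=sql,\n",
        "        QueryExecutionContext={\"Database\": database},\n",
        "        ResultConfiguration={\"OutputLocation\": output_s3},\n",
        "    )\n",
        "    query_id = started[\"QueryExecutionId\"]\n",
        "\n",
        "    # Queries run asynchronously, so poll until a terminal state is reached\n",
        "    while True:\n",
        "        status = athena.get_query_execution(QueryExecutionId=query_id)\n",
        "        state = status[\"QueryExecution\"][\"Status\"][\"State\"]\n",
        "        if state in (\"SUCCEEDED\", \"FAILED\", \"CANCELLED\"):\n",
        "            break\n",
        "        time.sleep(poll_seconds)\n",
        "    if state != \"SUCCEEDED\":\n",
        "        raise RuntimeError(f\"Athena query ended in state {state}\")\n",
        "\n",
        "    results = athena.get_query_results(QueryExecutionId=query_id)\n",
        "    return [\n",
        "        [cell.get(\"VarCharValue\") for cell in row[\"Data\"]]\n",
        "        for row in results[\"ResultSet\"][\"Rows\"]\n",
        "    ]\n",
        "```\n",
        "\n",
        "For a `SELECT`, the first returned row holds the column names. Larger result sets are paginated, which this sketch does not handle."
      ]
    },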
    {
      "metadata": {
        "id": "xUbJmruLldUe",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [DEMO] Athena"
      ]
    },
    {
      "metadata": {
        "id": "-cI7gQxxlgFg",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "h8y_QAAElsY1",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Data Pipeline"
      ]
    },
    {
      "metadata": {
        "id": "gf_jQMNLl5yv",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "*  create complex data processing workloads that are fault tolerant, repeatable, and highly available"
      ]
    },
    {
      "metadata": {
        "id": "oA0ECumOmBLz",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [Demo] Data Pipeline"
      ]
    },
    {
      "metadata": {
        "id": "gmsXNAT-mElm",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "6fhL7WGJ7Scv",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.5 Batch vs Streaming Data"
      ]
    },
    {
      "metadata": {
        "id": "5U8QWB0mEH7m",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Impact on ML Pipeline"
      ]
    },
    {
      "metadata": {
        "id": "qxGzIx74EK0W",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* More control of model training in batch (can decide when to retrain)\n",
        "* Continuously retraining model could provide better prediction results or worse results\n",
        " - Did input stream suddenly get more users or less users?\n",
        " - Is there an A/B testing scenario?"
      ]
    },
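    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "One way to guard continuous retraining against the sudden input changes listed above is to compare traffic between two windows before retraining. The sketch below is a deliberately simple illustration; the function names and the 50% threshold are assumptions, not recommendations.\n",
        "\n",
        "```python\n",
        "def volume_shift(previous_count, current_count):\n",
        "    \"\"\"Relative change in input volume between two time windows.\"\"\"\n",
        "    if previous_count == 0:\n",
        "        return float(\"inf\") if current_count else 0.0\n",
        "    return (current_count - previous_count) / previous_count\n",
        "\n",
        "\n",
        "def should_retrain(previous_count, current_count, max_shift=0.5):\n",
        "    \"\"\"Retrain automatically only when the input volume looks stable.\"\"\"\n",
        "    return abs(volume_shift(previous_count, current_count)) <= max_shift\n",
        "\n",
        "\n",
        "stable = should_retrain(1000, 1100)   # 10% more users\n",
        "spiking = should_retrain(1000, 2500)  # 150% more users\n",
        "```\n",
        "\n",
        "Here `stable` is `True` and `spiking` is `False`, so the spike would be held back for a manual look instead of being fed straight into the model."
      ]
    },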
    {
      "metadata": {
        "id": "iG_FBgbY6vqH",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Batch"
      ]
    },
    {
      "metadata": {
        "id": "wdJonIVdAfaw",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "*   Data is batched at intervals\n",
        "*   Simplest approach to create predictions\n",
        "*   Many Services on AWS Capable of Batch Processing\n",
        " - AWS Glue\n",
        " - AWS Data Pipeline\n",
        " - AWS Batch\n",
        " - EMR\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "M22l9fGmApsu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Streaming\n"
      ]
    },
    {
      "metadata": {
        "id": "UsVFF-9NAr-H",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* Continously polled or pushed\n",
        "* More complex method of prediction\n",
        "* Many Services on AWS Capable of Streaming\n",
        " - Kinesis\n",
        " - IoT"
      ]
    },
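    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "The difference between the batch and streaming approaches can be shown without any AWS service. In this plain-Python sketch, `predict` is a stand-in for a real model and the event values are made up; the batch version scores records in groups, while the streaming version scores each record as it arrives.\n",
        "\n",
        "```python\n",
        "def predict(value):\n",
        "    \"\"\"Stand-in for a model: flags values above a fixed threshold.\"\"\"\n",
        "    return value > 10\n",
        "\n",
        "\n",
        "def batch_predictions(records, batch_size):\n",
        "    \"\"\"Collect records into fixed-size batches, then score each batch at once.\"\"\"\n",
        "    for start in range(0, len(records), batch_size):\n",
        "        batch = records[start:start + batch_size]\n",
        "        yield [predict(value) for value in batch]\n",
        "\n",
        "\n",
        "def streaming_predictions(stream):\n",
        "    \"\"\"Score each record as soon as it arrives from the stream.\"\"\"\n",
        "    for value in stream:\n",
        "        yield predict(value)\n",
        "\n",
        "\n",
        "events = [3, 12, 7, 25, 1]  # hypothetical input values\n",
        "batched = list(batch_predictions(events, batch_size=2))\n",
        "streamed = list(streaming_predictions(iter(events)))\n",
        "```\n",
        "\n",
        "`batched` is `[[False, True], [False, True], [False]]` and `streamed` is `[False, True, False, True, False]`. The predictions match, but the streaming version never waits for a batch to fill."
      ]
    },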
    {
      "metadata": {
        "id": "qe8SKDDv7Y6D",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.6 Data Security"
      ]
    },
    {
      "metadata": {
        "id": "h6HlzS687ZOF",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### AWS KMS (Key Management Service)"
      ]
    },
    {
      "metadata": {
        "id": "N9vq6t5qPjzK",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "*   Integrated with AWS Encryption SDK\n",
        "*   CloudTrail gives independent view of who accessed encrypted data\n",
        "\n"
      ]
    },
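    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "As a rough illustration of using a KMS key directly, the sketch below wraps the boto3 KMS `encrypt` and `decrypt` calls. The helper names are made up for this example, `kms` is expected to be `boto3.client(\"kms\")`, and the key id or alias is one you would supply.\n",
        "\n",
        "```python\n",
        "def encrypt_text(kms, key_id, text):\n",
        "    \"\"\"Encrypt a short string under a KMS key and return the ciphertext bytes.\"\"\"\n",
        "    response = kms.encrypt(KeyId=key_id, Plaintext=text.encode(\"utf-8\"))\n",
        "    return response[\"CiphertextBlob\"]\n",
        "\n",
        "\n",
        "def decrypt_text(kms, ciphertext):\n",
        "    \"\"\"Decrypt ciphertext produced by encrypt_text and return the string.\"\"\"\n",
        "    # For symmetric keys, KMS reads the key id from the ciphertext itself\n",
        "    response = kms.decrypt(CiphertextBlob=ciphertext)\n",
        "    return response[\"Plaintext\"].decode(\"utf-8\")\n",
        "```\n",
        "\n",
        "Direct `encrypt` calls are limited to 4 KB of plaintext, so larger payloads are normally handled with data keys, which is what the AWS Encryption SDK automates. Each call is also recorded by CloudTrail."
      ]
    },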
    {
      "metadata": {
        "id": "FGeYIIyfilcU",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### AWS Cloud Trail"
      ]
    },
    {
      "metadata": {
        "id": "s5MYRh_JAL2H",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "![cloud_trail](https://user-images.githubusercontent.com/58792/49812752-f834ff80-fd1a-11e8-9ad6-bafa8e1b0779.png)"
      ]
    },
    {
      "metadata": {
        "id": "1PXzeAzY_FO6",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "*  enables governance, compliance, operational auditing\n",
        "*  visibility into user and resource activity\n",
        "*  security analysis and troubleshooting\n",
        "*  security analysis and troubleshooting\n",
        "\n"
      ]
    },
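    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "For the auditing and troubleshooting uses listed above, recent activity can be pulled with the boto3 CloudTrail `lookup_events` call. This is a minimal sketch: the helper name is illustrative, and `cloudtrail` is expected to be `boto3.client(\"cloudtrail\")`.\n",
        "\n",
        "```python\n",
        "def recent_events_by_name(cloudtrail, event_name, max_results=10):\n",
        "    \"\"\"Return (time, user, event name) tuples for recent matching events.\"\"\"\n",
        "    response = cloudtrail.lookup_events(\n",
        "        LookupAttributes=[\n",
        "            {\"AttributeKey\": \"EventName\", \"AttributeValue\": event_name}\n",
        "        ],\n",
        "        MaxResults=max_results,\n",
        "    )\n",
        "    return [\n",
        "        (event.get(\"EventTime\"), event.get(\"Username\"), event.get(\"EventName\"))\n",
        "        for event in response[\"Events\"]\n",
        "    ]\n",
        "```\n",
        "\n",
        "Passing `\"Decrypt\"` as the event name, for example, would list recent KMS decrypt calls and who made them."
      ]
    },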
    {
      "metadata": {
        "id": "qJrkb72JioOs",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### [Demo] Cloud Trail"
      ]
    },
    {
      "metadata": {
        "id": "xEnGV9o9isvZ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "hIOu8DEbsjPi",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Other Aspects"
      ]
    },
    {
      "metadata": {
        "id": "VOpNTb9LsmnI",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* IAM Roles\n",
        "* Security Groups\n",
        "* VPC"
      ]
    },
    {
      "metadata": {
        "id": "hLqAX6oR8gUD",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 2.7 Data Backup and Recovery"
      ]
    },
    {
      "metadata": {
        "id": "lhDdV7ydTyas",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Most AWS Services Have Snapshot and Backup Capabilities"
      ]
    },
    {
      "metadata": {
        "id": "jYC5kQUDT5Ti",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* RDS\n",
        "* S3\n",
        "* DynamoDB"
      ]
    },
    {
      "metadata": {
        "id": "_E5UUskr8jWc",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### S3 Backup and Recovery\n"
      ]
    },
    {
      "metadata": {
        "id": "l-_pBmYsQ33O",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "* S3 Snapshots\n",
        "* Amazon Glacier archive"
      ]
    },
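    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Both options above can be configured through the boto3 S3 client. The sketch below turns on bucket versioning, so overwritten or deleted objects can be recovered, and adds a lifecycle rule that moves objects to Glacier. The helper names, the rule id, and the 30-day default are assumptions for illustration; `s3` is expected to be `boto3.client(\"s3\")`.\n",
        "\n",
        "```python\n",
        "def enable_versioning(s3, bucket):\n",
        "    \"\"\"Keep earlier versions of objects so they can be restored later.\"\"\"\n",
        "    s3.put_bucket_versioning(\n",
        "        Bucket=bucket,\n",
        "        VersioningConfiguration={\"Status\": \"Enabled\"},\n",
        "    )\n",
        "\n",
        "\n",
        "def glacier_lifecycle(days=30, prefix=\"\"):\n",
        "    \"\"\"Build a lifecycle configuration that archives objects to Glacier.\"\"\"\n",
        "    return {\n",
        "        \"Rules\": [\n",
        "            {\n",
        "                \"ID\": \"archive-to-glacier\",  # hypothetical rule name\n",
        "                \"Status\": \"Enabled\",\n",
        "                \"Filter\": {\"Prefix\": prefix},\n",
        "                \"Transitions\": [{\"Days\": days, \"StorageClass\": \"GLACIER\"}],\n",
        "            }\n",
        "        ]\n",
        "    }\n",
        "\n",
        "\n",
        "def archive_to_glacier(s3, bucket, days=30):\n",
        "    \"\"\"Apply the Glacier lifecycle rule to every object in the bucket.\"\"\"\n",
        "    s3.put_bucket_lifecycle_configuration(\n",
        "        Bucket=bucket,\n",
        "        LifecycleConfiguration=glacier_lifecycle(days),\n",
        "    )\n",
        "```\n",
        "\n",
        "Note that `put_bucket_lifecycle_configuration` replaces any lifecycle rules already on the bucket, so existing rules would need to be merged in first."
      ]
    },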
    {
      "metadata": {
        "id": "WKxr2MTcigL7",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### [Demo] S3 Snapshot Demo\n"
      ]
    },
    {
      "metadata": {
        "id": "xsd2w7XjijOL",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        ""
      ]
    },
    {
      "metadata": {
        "id": "MJyRBDb9RdVZ",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}