Cloud Databases

A big takeaway in the cloud is that you don’t have to start with a relational database. The CTO of Amazon, Werner Vogels, brings up some of the available options in the blog post “A one size fits all database doesn’t fit anyone.”

Source: allthingsdistributed.com

Key Value Databases

A good example of a serverless key/value database is DynamoDB.

How could you query it in Python?

def query_police_department_record_by_guid(guid):
    """Gets one record in the PD table by guid
    
    In [5]: rec = query_police_department_record_by_guid(
        "7e607b82-9e18-49dc-a9d7-e9628a9147ad"
        )
    
    In [7]: rec
    Out[7]: 
    {'PoliceDepartmentName': 'Hollister',
     'UpdateTime': 'Fri Mar  2 12:43:43 2018',
     'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}
    """
    
    db = dynamodb_resource()
    extra_msg = {"region_name": REGION, "aws_service": "dynamodb", 
        "police_department_table":POLICE_DEPARTMENTS_TABLE,
        "guid":guid}
    log.info("Get PD record by GUID", extra=extra_msg)
    pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)
    response = pd_table.get_item(
        Key={
            'guid': guid
            }
    )
    return response['Item']
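
One detail worth handling: when no record matches the key, the response from get_item has no 'Item' entry, so response['Item'] raises a KeyError. The sketch below shows one way to guard against that. The helper name item_or_none is hypothetical (it is not part of boto3), and the two dictionaries are hand-written stand-ins for real responses, so the example runs without an AWS account.

```python
from typing import Any, Dict, Optional


def item_or_none(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the stored record, or None when DynamoDB found no match.

    Hypothetical helper, not part of boto3: `response` is the dict that
    Table.get_item returns, which omits the "Item" key on a miss.
    """
    return response.get("Item")


# Hand-written stand-ins for get_item responses (no AWS call is made).
hit = {"Item": {"guid": "7e607b82-9e18-49dc-a9d7-e9628a9147ad",
                "PoliceDepartmentName": "Hollister"}}
miss = {"ResponseMetadata": {"HTTPStatusCode": 200}}

print(item_or_none(hit)["PoliceDepartmentName"])  # Hollister
print(item_or_none(miss))  # None
```

Returning None pushes the "not found" decision to the caller; raising a custom exception would be an equally reasonable choice.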

Graph Databases

Why Not Relational Databases?

  • Relationship-heavy data is a poor fit for relational databases.
  • Example:
    • Think about the SQL query needed to select all third-degree connections of an individual in a social network.
      • Imagine the number of joins needed.
    • Think about the SQL query needed to get the full social network of an individual.
      • Imagine the number of recursive joins required.
  • Relational databases are good at representing one-to-many relationships, in which one table is connected to multiple tables.
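
To make the contrast concrete, here is a minimal sketch of the third-degree query expressed as a graph traversal instead of a chain of self-joins. It uses a plain Python dictionary rather than a graph database, and the people and friendships are invented for illustration.

```python
from collections import deque


def connections_at_degree(graph, start, degree):
    """Return the people whose shortest path from `start` is exactly `degree` hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == degree:
            continue  # no need to expand beyond the requested degree
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return {p for p, d in distance.items() if d == degree}


# Hypothetical undirected friendship data, stored as adjacency sets.
friends = {
    "ann": {"bob"},
    "bob": {"ann", "cara"},
    "cara": {"bob", "dev"},
    "dev": {"cara", "eli"},
    "eli": {"dev"},
}

print(connections_at_degree(friends, "ann", 3))  # {'dev'}
```

Changing the degree only changes one argument here, whereas the SQL version needs one more self-join for every extra hop.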

AWS Neptune

Neo4j

Key Concepts

Reference: the Neo4j website sandbox tutorial.

Graph Database

A graph database can store:

  • Nodes - graph data records
  • Relationships - connect nodes
  • Properties - named data values

Simplest Graph

  • One node
  • Has some properties
  1. Start by drawing a circle for the node
  2. Add the name Emil
  3. Note that he is from Sweden
  • Nodes are the name for data records in a graph
  • Data is stored as Properties
  • Properties are simple name/value pairs

Labels

Nodes can be grouped together by applying a Label to each member. In our social graph, we’ll label each node that represents a Person.

  1. Apply the label “Person” to the node we created for Emil
  2. Color “Person” nodes red
    • A node can have zero or more labels
    • Labels do not have any properties
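
As a rough mental model, a labeled node can be pictured as a small record that holds a set of labels and a dictionary of properties. The sketch below is plain Python, not the Neo4j API, and the Node class is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph record: zero or more labels plus name/value properties."""
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


emil = Node(labels={"Person"}, properties={"name": "Emil", "from": "Sweden"})
unlabeled = Node(properties={"note": "a node may carry no label at all"})

# Grouping by label is a membership test on each node's label set.
nodes = [emil, unlabeled]
people = [n for n in nodes if "Person" in n.labels]
print([n.properties["name"] for n in people])  # ['Emil']
```

Labels are plain strings here, which mirrors the point that a label carries no properties of its own.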

More Nodes

Like any database, storing data in Neo4j can be as simple as adding more records. We’ll add a few more nodes:

  1. Emil has a Klout score of 99
  2. Johan, from Sweden, who is learning to surf
  3. Ian, from England, who is an author
  4. Rik, from Belgium, has a cat named Orval
  5. Allison, from California, who surfs
  • Similar nodes can have different properties
  • Properties can be strings, numbers, or booleans
  • Neo4j can store billions of nodes
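
The five people above can be sketched as plain Python dictionaries to show that similar nodes need not share the same properties. The values come from the list above; the property names (klout, learn, title, pet, hobby) are assumptions made for this illustration.

```python
# Each dictionary stands in for one "Person" node and its properties.
people = [
    {"name": "Emil", "from": "Sweden", "klout": 99},
    {"name": "Johan", "from": "Sweden", "learn": "surfing"},
    {"name": "Ian", "from": "England", "title": "author"},
    {"name": "Rik", "from": "Belgium", "pet": "Orval"},
    {"name": "Allison", "from": "California", "hobby": "surfing"},
]

# Nodes of the same kind need not share a schema: only Emil has "klout".
with_klout = [p["name"] for p in people if "klout" in p]
from_sweden = [p["name"] for p in people if p["from"] == "Sweden"]

print(with_klout)   # ['Emil']
print(from_sweden)  # ['Emil', 'Johan']
```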

Relationships

The real power of Neo4j is in connected data. To associate any two nodes, add a Relationship which describes how the records are related.

In our social graph, we simply say who KNOWS whom:

  1. Emil KNOWS Johan and Ian
  2. Johan KNOWS Ian and Rik
  3. Rik and Ian KNOWS Allison
  • Relationships always have direction
  • Relationships always have a type
  • Relationships form patterns of data
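
The KNOWS relationships listed above can be sketched as (start, type, end) tuples, which makes both the direction and the type explicit. This is a plain-Python illustration, not how Neo4j stores data, and the helper names outgoing and incoming are hypothetical.

```python
# Each relationship is (start node, TYPE, end node); direction is start -> end.
relationships = [
    ("Emil", "KNOWS", "Johan"),
    ("Emil", "KNOWS", "Ian"),
    ("Johan", "KNOWS", "Ian"),
    ("Johan", "KNOWS", "Rik"),
    ("Rik", "KNOWS", "Allison"),
    ("Ian", "KNOWS", "Allison"),
]


def outgoing(person, rel_type="KNOWS"):
    """People that `person` points to through relationships of `rel_type`."""
    return [end for start, kind, end in relationships
            if start == person and kind == rel_type]


def incoming(person, rel_type="KNOWS"):
    """People who point to `person` through relationships of `rel_type`."""
    return [start for start, kind, end in relationships
            if end == person and kind == rel_type]


print(outgoing("Emil"))     # ['Johan', 'Ian']
print(incoming("Allison"))  # ['Rik', 'Ian']
print(outgoing("Allison"))  # []
```

Because every relationship has a direction, Allison is known by two people in this data but has no outgoing KNOWS relationship herself.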

Relationship Properties

In a property graph, relationships are data records that can also contain properties. Looking more closely at Emil’s relationships, note that:

  • Emil has known Johan since 2001
  • Emil rates Ian 5 (out of 5)
  • Everyone else can have similar relationship properties
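
One way to picture this is a relationship record that carries its own property dictionary, just as a node does. The sketch below is plain Python; the property names since and rating, and the 2005 cutoff in the filter, are assumptions chosen for illustration.

```python
# Each relationship is a record with its own (possibly empty) properties.
relationships = [
    {"start": "Emil", "type": "KNOWS", "end": "Johan", "properties": {"since": 2001}},
    {"start": "Emil", "type": "KNOWS", "end": "Ian", "properties": {"rating": 5}},
    {"start": "Johan", "type": "KNOWS", "end": "Ian", "properties": {}},
]

# Filter on a relationship property: acquaintances dating from 2005 or earlier.
long_standing = [
    (r["start"], r["end"])
    for r in relationships
    if r["properties"].get("since", float("inf")) <= 2005
]
print(long_standing)  # [('Emil', 'Johan')]
```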

Key Graph Algorithms (with Neo4j)

  • Centrality - What are the most important nodes in the network? PageRank, Betweenness Centrality, Closeness Centrality

  • Community detection - How can the graph be partitioned? Union Find, Louvain, Label Propagation, Connected Components

  • Pathfinding - What are the shortest paths or best routes available given cost? Minimum Weight Spanning Tree, All-Pairs and Single-Source Shortest Path, Dijkstra

The following Cypher query lists the installed procedures whose names start with "algo":

CALL dbms.procedures()
YIELD name, signature, description
WITH * WHERE name STARTS WITH "algo"
RETURN *
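
To give a feel for what a centrality algorithm computes, here is a minimal pure-Python sketch of PageRank by power iteration, run on the small KNOWS graph from earlier. It is not Neo4j's implementation; the damping factor of 0.85 and the fixed 50 iterations are conventional choices assumed for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        # Every node passes an equal share of its rank along each outgoing edge.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


knows = [
    ("Emil", "Johan"), ("Emil", "Ian"),
    ("Johan", "Ian"), ("Johan", "Rik"),
    ("Rik", "Allison"), ("Ian", "Allison"),
]

scores = pagerank(knows)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {score:.3f}")
```

Allison comes out on top because both Rik and Ian point only to her, while Emil, whom nobody points to, ends up with the lowest score.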

Russian Troll Walkthrough [Demo]

To run through the example, run this Cypher code in the Neo4j sandbox:

:play https://guides.neo4j.com/sandbox/twitter-trolls/index.html

Finding top Trolls with Neo4j

The list of prominent people who tweeted out links from the account, @Ten_GOP, which Twitter shut down in August, includes political figures such as Michael Flynn and Roger Stone, celebrities such as Nicki Minaj and James Woods, and media personalities such as Ann Coulter and Chris Hayes.

A screenshot of the Neo4j interface for the phrase “thanks obama”.

PageRank score for Trolls

The code below is walked through in a Colab notebook called social network theory, which you can use as a reference.

def enable_plotly_in_cell():
  """Load require.js so Plotly figures render inside a Colab output cell."""
  from IPython.display import HTML, display
  from plotly.offline import init_notebook_mode
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

The trolls are exported from Neo4j and then loaded into Pandas.

import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/noahgift/essential_machine_learning/master/pagerank_top_trolls.csv")
df.head()

Next up, the data is graphed with Plotly.

import plotly.offline as py
import plotly.graph_objs as go

# enable_plotly_in_cell() already calls init_notebook_mode(connected=False)
enable_plotly_in_cell()

# One marker per troll: x position, size, and (log-scaled) color all encode PageRank
fig = go.Figure(data=[go.Scatter(
    x=df.pagerank,
    text=df.troll,
    mode='markers',
    marker=dict(
        color=np.log(df.pagerank),
        size=df.pagerank*5),
)])
py.iplot(fig, filename='3d-scatter-colorscale')

Top Troll Hashtags

import pandas as pd
import numpy as np

df2 = pd.read_csv("https://raw.githubusercontent.com/noahgift/essential_machine_learning/master/troll-hashtag.csv")
df2.columns = ["hashtag", "num"]
df2.head()

Now plot these troll hashtags.

import plotly.offline as py
import plotly.graph_objs as go

# enable_plotly_in_cell() already calls init_notebook_mode(connected=False)
enable_plotly_in_cell()

# One marker per hashtag: x position, size, and (log-scaled) color all encode its count
fig = go.Figure(data=[go.Scatter(
    x=df2.num,
    text=df2.hashtag,
    mode='markers',
    marker=dict(
        color=np.log(df2.num),
        size=df2.num),
)])
py.iplot(fig)

You can see these trolls love to use the hashtag #maga.

References