Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

Post on 16-Apr-2017


NO-SQL PYTHON

Aileen Nielsen, Software Engineer, One Drop, NYC

aileen@onedrop.today

OUTLINE

1. WHY? (OTHER THAN THE TRENDY NAME)

2. HOW?

3. WHY? (AGAIN)

1. WHY?

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Name   Day   Score
Allen  1     25
Joe    3     17
Joe    2     14
Mary   2     14
Mary   1     11
Allen  3     9
Mary   3     9
Joe    1     1

What makes this data tidy?

• Observations are in rows

• Variables are in columns

• Contained in a single data set

But can you tell me anything useful about this data set?

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Sure. These are easy to see:

• Highest score

• Lowest score

• Total observations

LET’S START WITH STANDARD SQL-LIKE, TIDY DATA

Not so easy:

• How many people?

• Who’s doing the best?

• Who’s doing the worst?

• How are individuals doing?

HOW ABOUT NOW?

What Changed?

• The data’s still tidy, but we’ve changed the organizing principle

Name   Score   Day
Allen  25      1
Mary   11      1
Joe    1       1
Mary   14      2
Joe    14      2
Joe    17      3
Allen  9       3
Mary   9       3

OK HOW ABOUT NOW? (LAST TIME I PROMISE)

Name   Ordered Scores
Joe    [1, 14, 17]
Mary   [11, 14, 9]
Allen  [25, NA, 9]

This data’s NOT TIDY but...

I can eyeball it easily

And new questions become interesting and easier to answer:

• How many students are there?

• Who improved?

• Who missed a test?

• Who was kind of meh?

DON’T GET MAD

I’m not saying to kill tidy

But I worry we don’t use these methods more often simply because it’s not as easy as it could be.

BEFORE I GOT INTO THE ’NOSQL’ MINDSET I SIGHED WHEN ASKED QUESTIONS LIKE THESE

• App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?

• Health research: Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are?

• Consumer research: Do people like things because they like them or because of the ordering they saw them in?

I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN

• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.

• Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade.

• Especially deep finding: humans are lazy.

IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU.

You can always ’reconstruct’ these trajectories of what happened by making a data frame per user

>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1

Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]

Option 2:
>>> new_list = []
>>> for name_tuple in df.groupby(['Name']):
...     new_list.append({name_tuple[0]: list(zip(name_tuple[1]['Day'], name_tuple[1]['Score']))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]

Option 3:
>>> def process(new_df):
...     return [new_df[new_df['Day']==i]['Score'].values[0] if i in list(new_df['Day']) else None for i in range(1, 4)]
...
>>> df.groupby(['Name']).apply(process)
Name
Allen    [25, None, 9]
Joe        [1, 14, 17]
Mary       [11, 14, 9]
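Not shown in the talk, but worth a note: a pivot-based sketch gets the same per-person lists and surfaces Allen’s missing day as NaN automatically (assumes pandas; column names follow the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Allen", "Joe", "Joe", "Mary", "Mary", "Allen", "Mary", "Joe"],
    "Day":   [1, 3, 2, 2, 1, 3, 3, 1],
    "Score": [25, 17, 14, 14, 11, 9, 9, 1],
})

# One row per person, one column per day; absent (person, day) pairs
# become NaN, matching the [25, NA, 9] row on the slides.
wide = df.pivot(index="Name", columns="Day", values="Score")
ordered = wide.apply(lambda row: row.tolist(), axis=1)
print(ordered["Allen"])  # [25.0, nan, 9.0]
```

Because one cell is NaN, the whole column is upcast to float, which is why the scores come back as 25.0 rather than 25.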

LET’S BE HONEST… NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY… AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES

• Google ads (well… maybe less so in Europe)

• Wearable sensors

• The unit of an observation should be the actor, not the particular action observed at a particular time.

• Maybe we should rethink what we mean by ‘observations’

• High scalability

• Distributed computing

• Schema flexibility

• Semi-structured data

• No complex relationships

• Schemas change all the time

• Patterns change all the time

• Same units of interest repeating new things


We don’t look for No-SQL because we have No-SQL databases... We have No-SQL databases because we have No-SQL data.

WHAT IS NO-SQL PYTHON?

Data that doesn’t seem like it fits in a data frame

• Arbitrarily nested data

• Ragged data

• Comparative time series


WHERE DO WE FIND NO-SQL DATA?

Here’s where I’ve found it…

• Physics lab

• Running data

• Health data

• Reddit


2. HOW?


GETTING THE DATA INTO PYTHON WHEN IT’S STRAIGHTFORWARD

• Scenario: you’re grabbing a bunch of NoSQL data from an API or from a NoSQL db.

• We’ll stick with JSON since it’s a common format.

• Best-case scenario: you’ll take everything however you can get it. In this case stick with pandas. json_normalize works great.
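As a minimal runnable sketch of that happy path (assuming pandas 1.0+, where json_normalize is available at the top level; the sample mirrors the JSON on the next slides):

```python
import json
import pandas as pd

raw = json.loads("""
{"samples": [
  {"name": "Jane Doe", "age": 42, "profession": "architect",
   "series": [{"day": 0, "measurement_value": 0.97},
              {"day": 1, "measurement_value": 1.55},
              {"day": 2, "measurement_value": 0.67}]},
  {"name": "Bob Smith", "age": 37, "hobbies": ["tennis", "cooking"],
   "series": {"day": 0, "measurement_value": 1.25}}
]}
""")

# One row per sample; max_level=0 keeps nested values as Python objects
# in the cells instead of exploding them into dotted columns.
normalized = pd.json_normalize(raw["samples"], max_level=0)

# Nested records stay addressable...
print(normalized["series"][0][1])

# ...and adding derived columns is one apply() away.
normalized["length"] = normalized["series"].apply(len)
```

Without max_level, json_normalize would instead flatten Bob’s dict-valued series into series.day / series.measurement_value columns, which is sometimes what you want and sometimes not.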

JSON_NORMALIZE WORKS PRETTY WELL

{"samples": [
  {
    "name": "Jane Doe",
    "age": 42,
    "profession": "architect",
    "series": [
      {"day": 0, "measurement_value": 0.97},
      {"day": 1, "measurement_value": 1.55},
      {"day": 2, "measurement_value": 0.67}
    ]
  },
  {
    "name": "Bob Smith",
    "hobbies": ["tennis", "cooking"],
    "age": 37,
    "series": {"day": 0, "measurement_value": 1.25}
  }
]}

Easy to process:

>> import json
>> from pandas.io.json import json_normalize
>> with open(json_file) as data_file:
>>     data = json.load(data_file)
>> normalized_data = json_normalize(data['samples'])

Basically, it just works:

>> print(normalized_data['series'][0][1])
>> {u'measurement_value': 1.55, u'day': 1}

Easy to add columns:

>> normalized_data['length'] = normalized_data['series'].apply(len)

USING SOME PROGRAMMER STUFF ALSO HELPS

ITEM_TO_GET = 'measurement_value'  # the nested field to pull out of each element

class dfList(list):
    def __init__(self, originalValue):
        if isinstance(originalValue, list):
            list.__init__(self, originalValue)
        else:
            list.__init__(self, [originalValue])

    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]
        except (TypeError, KeyError, IndexError):
            return result

    def __iter__(self):
        for i in range(list.__len__(self)):
            yield self.__getitem__(i)

    def __call__(self):
        return sum(self) / list.__len__(self)

• Subclass an iterable to shorten your apply() calls

• In particular, you need to override at least __getitem__ and __iter__

• You should probably override __init__ as well for the case of inconsistent format

• Then __call__ can be a catch-all adjustable function... best to load it up with a call to a class function, which you can adjust at will anytime.
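A self-contained, runnable demo of the same pattern (the class and field names here are illustrative, not the author’s exact code): lookups and iteration drill straight into the nested records, and calling the object gives a quick summary.

```python
class DfList(list):
    """List subclass whose __getitem__ drills into nested records.

    ITEM_TO_GET names the field pulled out of each element; swapping it
    changes what every lookup and iteration returns.
    """
    ITEM_TO_GET = "score"

    def __getitem__(self, index):
        result = list.__getitem__(self, index)
        try:
            return result[self.ITEM_TO_GET]
        except (TypeError, KeyError, IndexError):
            return result  # element has no such field; hand it back as-is

    def __iter__(self):
        for i in range(list.__len__(self)):
            yield self[i]

    def __call__(self):
        # Catch-all summary: mean of the extracted field.
        return sum(self) / list.__len__(self)


scores = DfList([{"day": 1, "score": 25}, {"day": 3, "score": 9}])
print(scores[0])   # 25
print(scores())    # 17.0
```

Stored in a data frame column, such objects mean `df['series'].apply(lambda s: s())` replaces a pile of per-row dictionary digging.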

CUSTOM CLASSES PAIR NICELY WITH CLASS METHODS

class Test:
    def __init__(self, name):
        self.name1 = name

    def print_class_instance(instance):
        print(instance.name1)

    def print_self(self):
        self.__class__.print_class_instance(self)

>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> test1.print_self()
test1
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string

• Design flexible classes that often reference class methods rather than instance methods

• Then as you are processing data, you can quickly swap out methods to call different field names in the event of highly nested JSON

• Data processing is faster, and no mental gymnastics or annoying parse efforts required

GETTING NOSQL DATA: COMMONLY-ENCOUNTERED PROBLEMS

• CSVs with arrays

• Highly-nested JSON

• Unknown or unreliably formatted API results

SOMETIMES YOU GET WEIRD CSV FILES…

• Sometimes your problem is as simple as getting a csv file with nested data

• This is pretty straightforward to deal with… use regex and common Python string operations to clean up the data

• apply() is your best friend

• Common problem: spaces between “,” and the column name or column value; use a parameter to avoid it: df = pd.read_csv("in.csv", sep=",", skipinitialspace=1)

SOMETIMES YOU GET WEIRD CSV FILES…

name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"

This isn’t even that weird. Downright straightforward:

>> df = pd.read_csv(file_name, sep=",")

Hmmm….

>> print(df['favorites'][0][1])
>> m

Regex to the rescue… Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing:

>> df['favorites'] = df['favorites'].apply(lambda s: s[1:-1].split(','))
>> print(df['favorites'][0][1])
>> elvis

WHAT ABOUT THIS ONE?

name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

This isn’t even that weird. Downright straightforward?

>> df = pd.read_csv(file_name, sep=",")

Actually this fails miserably:

>> print(df['favorites'])
>> joe     [madonna    elvis
>> mary    [lady gaga   adele]    36
>> Name: name, dtype: object

We need more regex…this time before applying read_csv()....


name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]

Missing quotes around arrays. Basically, put in the quotation marks to help out read_csv():

pattern = "(\[.*\])"
with open(file_name) as f:
    for line in f:
        new_line = line
        match = re.finditer(pattern, line)
        try:
            m = next(match)
            while m:
                replacement = '"' + m.group(1) + '"'
                new_line = new_line.replace(m.group(1), replacement)
                m = next(match)
        except StopIteration:
            pass
        with open(write_file, 'a') as write_f:
            write_f.write(new_line)

new_df = pd.read_csv(write_file)

With multiple arrays per row, you’re gonna need to accommodate the greedy nature of regex:

pattern = "(\[.*?\])"

62
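The same quoting repair can be done in one pass with re.sub (a sketch; the non-greedy pattern handles multiple arrays on one row, and the inline CSV stands in for a file):

```python
import re

# Rows whose bracketed arrays were written without surrounding quotes.
raw_csv = """name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
"""

# Non-greedy match so each array gets its own pair of quotes;
# \1 re-inserts the matched bracketed text between them.
quoted = re.sub(r"(\[.*?\])", r'"\1"', raw_csv)

print(quoted.splitlines()[1])  # joe,"[madonna,elvis,u2]",28
```

After this pass, read_csv sees the arrays as single quoted fields instead of splitting them on their internal commas.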

THAT WAS A LOT OF TEXT…ALMOST DONE

SOMETIMES YOU GET JSON AND YOU KNOW THE STRUCTURE, YOU JUST DON’T LIKE IT

• Use json_normalize() and then shed columns you don’t want. You’ve seen that today already (slides 32-38).

• Use some magic: the sh module with jq to simplify your life… you can pick out the fields you want with jq either on the command line or via sh

• jq has a straightforward, easy-to-learn syntax: . = value, [] = array operation, etc…

cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
json.loads(out)

AND SOMETIMES YOU HAVE NO IDEA WHAT’S IN AN ENORMOUS JSON FILE

• Inconsistent or undocumented API

• Legacy Mongo database

• Someone handed you some gnarly JSON because they couldn’t parse it


YOU’RE A PROGRAMMER…USE ITERATORS

• The ijson module is an iterator JSON parser…you can deal with structure one bit at a time

• This also gives you a great opportunity to make data parsing decisions as you go

• This isn’t fast, but it’s also not fast to shoot from the hip when you’re talking about gnarly JSON

with open(file_name, 'rb') as f:
    results = ijson.items(f, "samples.item")
    for record in results:
        for k in record.keys():
            if isinstance(record[k], (dict, list)):
                recursive_check(record[k])
        process(record)

total_dict = defaultdict(lambda: False)

def recursive_check(d):
    if isinstance(d, dict):
        if not total_dict[tuple(sorted(d.keys()))]:
            class_name = input("Input the new class name (also the name of the module file defining it): ")
            mod = import_module(class_name)
            cls = getattr(mod, class_name)
            total_dict[tuple(sorted(d.keys()))] = cls
        for k in d.keys():
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[tuple(sorted(d.keys()))]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False

• Basically, you can build custom classes or generate appropriate named tuples as you go. This lets you know what you have and lets you build data structures to accommodate what you have.

• Storing these objects in a class rather than a simple dictionary again gives you the option to customize __call__() to your needs.

• Again, remember that class methods can easily be adjusted dynamically, so it’s good to code classes with instances that reference class methods.

3. WHY? (AGAIN)


CLUSTERING TIME SERIES

• Reports of clustering and classifying time series are surprisingly rare

• Methods are computationally demanding, O(N²)… but we’re getting there

• Relatedly, ‘classification’ can also be used for series-related predictions

• You can use many commonly applied clustering algorithms once you have a distance metric

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf

WHEN DO PEOPLE GO RUNNING?

Actually, I made these plots with R…

NANO-SCALE PHYSICS

Meisner et al., J. Am. Chem. Soc. 2012, 134, 20440−20445

• You can build an electrical circuit which has a single molecule as its narrowest part

• It turns out it’s quite easy to distinguish different molecules depending on their trajectory as you pull on them

• In particular, their summed behavior looks quite different

• Suggests that we could cluster and identify individual measurements with reasonable certainty

REDDIT

• Several months of pulling the top 25 threads off Reddit’s front page shows significantly different trends for different subreddits.

• Some kinds of posts don’t last long (r/TwoX and r/videos)

• r/personalfinance shows a remarkable ability to have a second peak/second life on the front page

• r/videos do great but burn out quickly

QUICK: HOW IT WORKS

• O(N²) in theory

• Various lower-bounding techniques significantly reduce processing time

• It’s a dynamic programming problem

http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
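A bare-bones version of that dynamic program, dynamic time warping distance with no lower-bounding speedups, as a sketch:

```python
def dtw_distance(a, b):
    """Minimal DTW: O(len(a) * len(b)) time and space.

    cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match step
    return cost[len(a)][len(b)]

ts1 = [0, 1, 2, 1, 0]
ts2 = [0, 0, 1, 2, 1, 0]  # same shape, shifted in time
print(dtw_distance(ts1, ts2))  # 0.0
```

Note the zero distance for the time-shifted pair: exactly the behavior plain Euclidean distance cannot give, which motivates the next slide.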

WHY THE FANCY METHOD?

Euclidean distance matches ts3 to ts1, despite our intuition that ts1 and ts2 are more alike.

http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf

http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb

BIKE-SHARING STANDS

http://ofdataandscience.blogspot.nl/2013/03/capital-bikeshare-time-series-clustering.html?m=1

FUTURE RESEARCH POSSIBILITIES


http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf

WHY?

Time series classification and related metrics can be one more thing to know… or even several more things to know.

Name   Ordered Scores   Score Trajectory Type   Number of Tests   Predicted Score for Next Test
Joe    [1, 14, 17]      good                    3                 19
Mary   [11, 14, 9]      meh                     3                 11
Allen  [25, NA, 9]      underachiever           2                 35

Info from easy apply() calls (Ordered Scores, Number of Tests), from classification (Trajectory Type), and from prediction (Predicted Score).

THE SHORT VERSION

• Pandas is already well-adapted to the No-SQL world

• Make your data format work for you

• Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough.

• Non-time series collections are also informative. This was just one example of what you can do.

THANK YOU
