Seeing With Your Eyes Closed Ellen Friedman NoSQL Matters Barcelona 22 November 2014
Jul 14, 2015
© 2014 Ellen Friedman 1
Contact Information Ellen Friedman
Solutions Consultant and Commentator Apache Mahout committer, Apache Drill contributor
Email [email protected]
[email protected] Twitter @Ellen_Friedman @ApacheDrill
Hashtag today: #NoSQL14
Thinking With Your Eyes Closed
When some people think…
… they close their eyes in order to “see”.
Getting Past the Details
• Look at your data with an open mind
• Listen to what data tells you
• Find the key concepts in what you do
• Give yourself an opportunity for discovery
NoSQL
• Founded on discovery
• Solution-driven
• Don’t be bound by the tool
• Flexibility is important
• How do you keep your ability for invention?
Basic idea:
“Eyes open”: Details
“Eyes closed”: Discovery
Imagination, technology and careful reasoning
Think where this may take you.
Things don’t always turn out the way you predict…
With exploration into new frontiers, you may meet your goal in surprising ways.
A Perfect Red, by Amy Butler Greenfield
Spanish explorers came to the Americas in search of riches.
They were looking for gold and silver.
They found cochineal.
Red dye worth a fortune.
Big Data and Open Source in the 19th Century
Here’s a story that combines the power of vision (eyes-closed thinking) with keen observation and attention to detail (eyes-open thinking). It’s got:
• Adventure on the high seas
• Time series data (a hot topic in the NoSQL world today)
• Clever community building for open source participation
• A world speed record
• (but no pirates)
Matthew Fontaine Maury was a sailor in the 1830s. After an injury left him unfit for sea duty, the US Navy gave him a “desk job”.
Oddly, that’s where the real adventure starts!
Time Series Data – An Old Idea
Captain’s log book entry for the Steam Ship Bear, 1884 trip to the Arctic. From an image digitized by www.oldweather.org and provided via www.naval-history.net. Image modified by Ellen Friedman and Ted Dunning.
Ship captains kept log books with various comments plus measurements recorded at specific times.
Time Series Data – An Old Idea
The basis of a time series is the repeated measurement of parameters over time, together with the times at which the measurements were made.
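A minimal sketch of that definition in Python: each sample keeps the measured value together with the time it was taken. The `Sample` class, the `make_series` helper and the temperature readings are illustrative assumptions, not data from the log book.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    """One measurement together with the time it was taken."""
    time: datetime
    value: float


def make_series(start, interval, values):
    """Pair each measured value with its timestamp, spaced `interval` apart."""
    return [Sample(start + i * interval, v) for i, v in enumerate(values)]


# Hypothetical air temperatures (degrees F) logged every four hours.
temperatures = make_series(
    start=datetime(1884, 7, 1, 0, 0),
    interval=timedelta(hours=4),
    values=[41.0, 40.5, 43.0, 46.5],
)

for sample in temperatures:
    print(sample.time.isoformat(), sample.value)
```

Keeping the timestamp inside every sample is what lets measurements from many logs be merged and compared later.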
Time Series Data – An Old Idea
At his desk job in the U.S. Navy Office of Charts, Maury discovered boxes with hundreds of ships’ logs, largely forgotten.
Big data project: Bring the data together
• Using the log data, Maury and his team built maps to indicate wind, temperature, currents
– They extracted, transformed and aggregated this huge volume of data
– By hand!
• Mariners would be able to predict conditions on various routes at different times of the year
• His theory was that this would help navigation
• Maury published his Winds and Currents charts so that they would be widely available
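What Maury’s team did by hand can be sketched in a few lines of Python: extract entries from the logs, transform each position into a chart square, and aggregate by month. The log entries, the 5-degree square size and the function names are assumptions made for illustration; they are not taken from Maury’s charts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log entries: (month, latitude, longitude, wind speed in knots).
log_entries = [
    (1, 12.4, -38.2, 14.0),
    (1, 13.9, -36.7, 18.0),
    (1, -3.1, -33.5, 9.0),
    (7, 12.8, -37.9, 22.0),
]

CELL_DEGREES = 5  # assumed size of one chart square


def chart_cell(lat, lon, size=CELL_DEGREES):
    """Return the south-west corner of the grid square containing a position."""
    return (int(lat // size) * size, int(lon // size) * size)


def build_chart(entries):
    """Average wind speed per (month, grid square)."""
    grouped = defaultdict(list)
    for month, lat, lon, wind in entries:
        grouped[(month, chart_cell(lat, lon))].append(wind)
    return {key: mean(winds) for key, winds in grouped.items()}


chart = build_chart(log_entries)
for (month, cell), avg_wind in sorted(chart.items()):
    print(month, cell, avg_wind)
```

Grouping by month as well as by square is what would let a mariner look up expected conditions for a route at a given time of year.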
Big data project: Maury’s Wind and Currents charts
At first, nobody was interested in them…
Maury’s Wind and Currents charts
Using Maury’s carefully compiled data, Captain Jackson got back one month early on a trip from Baltimore in the US to Rio de Janeiro in Brazil.
Maury’s Wind and Currents charts
Now everybody wanted one of his charts. Here’s where the open source part comes in…
Maury’s Open Source Project: The Abstract Log
Maury wanted better data from the ships’ captains. To get one of Maury’s Winds and Currents charts:
• Captains first had to fill in a special template for one of their trips
• They returned the template, called the Abstract Log, to Maury and got a chart
• Maury’s team collected new data that was better than before: regular and systematic time series data
Data-Driven Decisions Set a World Record
• In 1851, the clipper ship Flying Cloud set the record for the fastest sailing from New York City to San Francisco
• Maury’s charts played a key role in the navigator’s expert, data-driven decisions about the route
• Surprisingly, the navigator was a woman, Eleanor Creesy
Key Lessons from Maury’s Work
• Give to get
– Give the Abstract Log to captains, get data collected in a careful way
• Big data consortium wins
– Merging data gives pictures nobody else can see
• Building open source community is valuable
– The collective effort builds the basis for exploration and discovery
• Lessons much like today’s, just 150 years before everybody else
Exploration takes you to surprising places
The really scary part is knowing the amount of computing power in the Apollo 11 guidance system…
Buzz Aldrin steps onto the Moon. Photo by Neil Armstrong, Apollo 11, 20 July 1969. NASA photo http://1.usa.gov/1uXi53U
Computing power in familiar objects
For comparison: the SIM chip in a smart card, similar to the SIM chip in a cell phone, has about
• 0.5 kilobytes RAM
• 16.0 kilobytes ROM
Only a little less than Apollo…
Computing power in familiar objects
The SIM chip has about 0.5 kilobytes RAM and 16.0 kilobytes ROM.
The phone processor is far more powerful: 1.3 GHz, dual core, 1 GB of RAM.
Much more powerful than Apollo
Computing power in familiar objects
Arduino is a small microcontroller board with enough power to interact with sensors in the IoT
The question is, what can you use these powerful, compact technologies to do?
Things may not turn out the way you predict
Surprising use for a microprocessor: a family cat equipped with a “smart collar” investigates the neighborhood and reveals weak security for local wi-fi.
A humorous glimpse at the potential for IoT
https://www.mapr.com/blog/the-internet-of-cat-toys
Who Needs Time Series Data?
Utility providers use smart meters to monitor very short term changes in energy usage
Who Needs Time Series Data?
Manufacturers who monitor equipment on the assembly line
Manufacturers who produce “smart parts” that report back after the parts are in operation
Unmanned Ocean Robot: Wave Glider
• Made by Liquid Robotics
http://liquidr.com/technology/waveglider/how-it-works.html
• Powered by wave motion
• Onboard sensors are solar powered
• Travelled from San Francisco to Hawaii, Japan & Australia
• Survived shark attack and typhoon
• Cool
Environmental Monitoring
• Big trend and growing
• Companies to collect, store and analyze data
• Example: Planet OS
– Multi-sensor, machine data
– Time series + spatial data
– https://planetos.com
Smart Shirt
• Sensors embedded in fabric
– Measures heart rate & movement
– Includes time stamp and geo data
• Smart fabric uses smart phone as hub
• Fabric also used for other industries
• Made by Smart Sensing, part of Cityzen Sciences Consortium
• Also cool.
Feb 2014 article in gizmag http://www.gizmag.com/cityzen-smart-shirt-sensing-fabric-health-monitoring/30428/
Cityzen Data
• Spin-off from consortium Cityzen Sciences
• Provides data platform for storage & analysis of sensor data, including the smart shirt
• http://www.cityzendata.com
• Presentation by Cityzen Data CTO Mathias Herberts, “From Thread to API” (Feb 2014) https://www.youtube.com/watch?v=RV_Wgc-0yOs
• Presentation in Silicon Valley in June 2014 http://www.slideshare.net/Mathias-Herberts/20140611-io-tsiliconvalley
When is a NoSQL time series database useful?
Build a NoSQL time series database when
• Most of your scans are based on a time range
• Data is at large scale
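One way to picture why time-range scans suit a key-ordered NoSQL table: if the row key ends in a sortable timestamp, a time window becomes one contiguous range of keys. The sketch below is a toy in-memory stand-in, not a real HBase or MapR-DB client; the `SortedStore` class, the key layout and the meter names are all assumptions made for illustration.

```python
import bisect
from datetime import datetime, timezone


def row_key(metric, when):
    """Key = metric name + zero-padded epoch seconds, so keys sort by time."""
    return f"{metric}#{int(when.timestamp()):012d}"


class SortedStore:
    """Toy stand-in for a key-ordered NoSQL table (not a real database client)."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return (key, value) pairs with start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def utc(hour):
    """Hypothetical helper: a fixed example day at the given hour, in UTC."""
    return datetime(2014, 11, 22, hour, tzinfo=timezone.utc)


store = SortedStore()
for hour, reading in [(9, 1.2), (10, 1.5), (11, 1.1), (12, 1.9)]:
    store.put(row_key("meter42.kwh", utc(hour)), reading)
store.put(row_key("meter07.kwh", utc(10)), 0.4)

# Time-range scan: meter42 readings from 10:00 (inclusive) to 12:00 (exclusive).
window = store.scan(row_key("meter42.kwh", utc(10)), row_key("meter42.kwh", utc(12)))
print([value for _, value in window])
```

Putting the metric name first keeps each meter’s readings together; putting the timestamp last keeps them in time order within that meter.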
Lesson: It’s scary to go to the Moon with the computing power of a credit card!
Like monkeys trying to describe a Capybara…
Seen on Twitter: https://twitter.com/rudytheelder/status/500471789042954240
Getting Past the Details
It’s no longer acceptable for technical and non-technical teams to be unable to communicate
• Data science team needs to clearly exchange ideas about project goals, resources and planning with domain experts
• Find a new language to describe your work appropriately
• Find the key concepts in what you do
• Describe them in a way that makes sense to your audience
e-books currently available courtesy of MapR
Time Series Databases by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
http://bit.ly/1GMk9yY
How to store & access time series data using a NoSQL database (HBase or MapR-DB)
Innovations in Recommendation by Ted Dunning and Ellen Friedman © Feb 2014 (published by O’Reilly)
A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Basic idea:
“Eyes open”: Present
“Eyes closed”: Future
How would you like to be able to…
• Query multiple data types including JSON or Parquet with SQL?
• Use a directory name as a table name when you query, so you don’t have to know in advance the files you’re going for?
• Use standard SQL queries on Hadoop or NoSQL, with low latency?
• Go schema-less!? (shocking!)
• Reduce the distance to your data?
• This is where Apache Drill comes in…
Apache Drill
• Low latency SQL query engine for Apache Hadoop and NoSQL
• Extremely flexible:
– 1st and only distributed SQL query engine that does not require schema
– Uses wide range of data types including nested, JSON, Parquet
• Convenient:
– Uses familiar ANSI SQL commands
– Lets you continue to use standard BI tools
• Open source community:
– Approaching graduation
Real SQL instead of “SQL-like”
• May be surprising to boast at a NoSQL conference, but flexibility is important – find solutions, not bound by one tool
• Sample TPC-H SQL benchmark query that Drill can run “as is”:
Schema-less distributed SQL engine
• Save weeks or months
– that would have been spent on defining schema, ETL and maintaining schema
• Drill automatically understands the structure of data
• Simply point Drill at data and run queries
– Works on file, directory, HBase or MapR-DB table, etc.
Query complex, semi-structured data “as is”
• No need to flatten or transform data prior to query execution
• Intuitive extensions to SQL to work with nested data
• Here is a simple query on a JSON file:
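The slide’s original query is not reproduced in this text. As a rough stand-in for the idea only, here is a plain-Python sketch (not Drill, and not Drill’s SQL syntax) of querying nested JSON records “as is”, with no schema declared up front and no flattening step beforehand. The records, field names and the `topping_types` function are hypothetical.

```python
import json

# Hypothetical JSON records; note that the second one lacks "toppings".
raw = """
{"name": "glazed", "price": 0.9, "toppings": [{"type": "sugar"}, {"type": "maple"}]}
{"name": "plain", "price": 0.7}
{"name": "filled", "price": 1.4, "toppings": [{"type": "chocolate"}]}
"""

records = [json.loads(line) for line in raw.strip().splitlines()]


def topping_types(records, min_price):
    """Roughly: select name and nested topping type where price >= min_price,
    producing one output row per nested topping."""
    rows = []
    for rec in records:
        if rec.get("price", 0) < min_price:
            continue
        for topping in rec.get("toppings", []):
            rows.append((rec["name"], topping["type"]))
    return rows


print(topping_types(records, min_price=0.8))
```

The structure is discovered while reading each record, which is the sense in which a missing field is simply absent rather than a schema violation.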
Apache Drill
• Open source, open opportunities
• What would you use Drill to do?
• Best use case will be featured in upcoming book on Drill
What if you needed to uniquely identify every person in India?!
All 1.2 billion of them?!
Largest Biometric Database in the World
1.2 B people
The Aadhaar Project:
• Unique 12-digit number for each person in India
• Proof of identity and address, authenticated anytime, anywhere
• Runs on NoSQL database MapR-DB
A Day in the Life of the Aadhaar Project
Data platform must handle:
• 1 million new enrollments/day
– After 4 years, ~600 million of the 1.2 billion already enrolled
– 4+ PB of raw data
• Each new enrollment needs de-duplication
– 100s of millions of transactions over billions of records, doing 100s of trillions of biometric matches/day
• Online sub-second authentications
– as many as 100 million per day
From Pramod Varma, Chief Architect of UIDAI, at Strata / Hadoop World NYC Oct 2014
http://strataconf.com/stratany2014/public/schedule/detail/36305
Official website of Unique Identification Authority of India (UIDAI)
http://uidai.gov.in
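A back-of-the-envelope check of those daily volumes, using only the figures quoted above. The one added assumption, flagged in the code, is that de-duplication compares each new enrollment against every record already enrolled; the real system may well prune that search.

```python
SECONDS_PER_DAY = 24 * 60 * 60

enrollments_per_day = 1_000_000        # from the slide
enrolled_so_far = 600_000_000          # from the slide (approximate)
authentications_per_day = 100_000_000  # from the slide (peak)

# Assumption: each new enrollment is compared against every existing record.
matches_per_day = enrollments_per_day * enrolled_so_far

print(f"enrollments/s: {enrollments_per_day / SECONDS_PER_DAY:.1f}")
print(f"authentications/s: {authentications_per_day / SECONDS_PER_DAY:.0f}")
print(f"matches/day: {matches_per_day:.1e}")
```

Under that assumption the product comes to 600 trillion comparisons a day, which is consistent with the “100s of trillions of biometric matches/day” quoted above.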
What does Aadhaar mean for India?
• Better delivery of welfare services
• More open society
– Identification without regard to caste, creed, religion or geography
• Reduction in embezzlement – save billions in government funds
• NoSQL is changing society for the better
Basic idea:
“Eyes open”: Implementation
“Eyes closed”: Vision
Exploration takes you to surprising places
Buzz Aldrin steps onto Moon photo by Neil Armstrong, Apollo 11 20 July 1969 NASA photo http://1.usa.gov/1uXi53U
India’s Space Program: Mission to Mars
• India’s ISRO reached Mars orbit on its 1st try
• US NASA & India’s ISRO look forward to collaboration (while @MarsOrbiter chats with @MarsCuriosity)
• Also cool
India’s Women Engineers at ISRO
• ISRO and NASA have many women engineers
• Very cool
European Space Agency: Rosetta Mission to Comet
• Mission took 10 years, 8 mo, 19 days
• Philae lander touched down on comet on 12 November 2014
• Outrageously cool!