Welcome to the Birmingham Big Data Science Group (BIDS) Faizan Javed 5/25/2011 Intermark Group Sponsor: Intermark Group

1st Birmingham Big Data Science Group meetup

Jan 19, 2015


Technology

Faizan Javed

 
Transcript
Page 1: 1st Birmingham Big Data Science Group meetup

Welcome to the Birmingham Big Data Science Group

(BIDS)

Faizan Javed, 5/25/2011

Intermark Group

Sponsor: Intermark Group

Page 2: 1st Birmingham Big Data Science Group meetup

BIDS Stats

• Founded April 10, 2011

• 9 members (and counting)

• Founder: Faizan Javed, Co-Founder: Qasim Ijaz

• Online presence:

Meetup.com for co-ordinating meetups: http://www.meetup.com/bham-bids

Also on (for related articles and announcements):

LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219

Facebook: http://www.facebook.com/home.php?sk=group_202221519811444

Page 3: 1st Birmingham Big Data Science Group meetup

Agenda

• What is Big Data?

• Quick overview of related technologies:

Large-scale distributed systems and platforms

NoSQL data stores

Intelligent algorithms/web-mining/information retrieval techniques

Highly-scalable systems

Page 4: 1st Birmingham Big Data Science Group meetup

What is Big Data?

• More people connected to the internet

• Social media explosion (Web 2.0): Facebook, Twitter, etc.

• Huge volumes of data being collected: sensors, mobile devices, machine-to-machine communications, social media and retail sites web logs for browsing patterns

• “Big” in Big Data is relative: today's “big” is certainly tomorrow's “medium” and next week's “small.”

• “Big Data” is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes! http://radar.oreilly.com/2010/06/what-is-data-science.html

Page 5: 1st Birmingham Big Data Science Group meetup
Page 6: 1st Birmingham Big Data Science Group meetup

Big Data, Big Numbers

McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp

Page 7: 1st Birmingham Big Data Science Group meetup

Why care about big data?

• Deep analysis of data can be a competitive advantage.

• More data makes it easier to find consistent patterns

• More data usually beats better algorithms

• Ex 1: Predict customer preferences and target ads on an ecommerce website.

• Ex 2: Improve search quality.

• Ex 3: Bank risk modeling (aggregate customer activity from different lines of businesses)

http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/

http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC

Key point: “Many different sources” & “unstructured data”

Page 8: 1st Birmingham Big Data Science Group meetup

Big Players on the Big Data Scene

The Government http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef

Page 9: 1st Birmingham Big Data Science Group meetup

The need for new techniques

• Traditional “relational” techniques break down at scale.

Solutions:

• NoSQL databases: Cassandra, HBase, Riak, etc.

• Large-scale “commodity” scale-out distributed computing techniques: MapReduce/Hadoop, Percolator, etc.

• Analytics platforms: IBM BigInsights, EMC Greenplum

Page 10: 1st Birmingham Big Data Science Group meetup

The NoSQL revolution http://www.infoq.com/news/2011/04/newsql

Page 11: 1st Birmingham Big Data Science Group meetup

Prominent NoSQL database users

• Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com

• Riak: Mozilla, Ask.com, Comcast

• Voldemort: LinkedIn

• MongoDB: Foursquare, Etsy, bit.ly, Intuit

• HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com

Page 12: 1st Birmingham Big Data Science Group meetup

Hadoop-based SMAQ stack http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

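// Reducer from the Hadoop word-count example: sums the counts emitted for each word.
// Needs java.io.IOException, org.apache.hadoop.io.Text, org.apache.hadoop.io.IntWritable
// and org.apache.hadoop.mapreduce.Reducer.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();   // add up every count emitted for this key
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}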
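
The Reduce class above is only the last of the three MapReduce phases. The following is a minimal, Hadoop-free sketch of the whole map, shuffle, and reduce flow in plain Java, meant only to illustrate the idea. The class name WordCountSketch and its methods are hypothetical and are not part of the Hadoop API. A real job would run the phases in parallel across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-machine illustration of map -> shuffle -> reduce.
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: same logic as the Reduce class, summing the values of one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Drives the three phases in sequence over a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {algorithms=1, beats=1, big=2, data=2, is=1}
        System.out.println(wordCount(List.of("big data is big", "data beats algorithms")));
    }
}
```

Keeping the three phases as separate methods mirrors how Hadoop separates the Mapper, the framework-managed shuffle, and the Reducer.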

Page 13: 1st Birmingham Big Data Science Group meetup

Hadoop-based SMAQ stack

• Hadoop comes with HDFS – Hadoop Distributed File System.

• Can be used alongside various NoSQL systems (HBase most common)

Page 14: 1st Birmingham Big Data Science Group meetup

Hadoop-based SMAQ stack

• Pig (Yahoo)

input = LOAD 'input/sentences.txt' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();

• Hive (Facebook)

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';
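
For readers more at home in Java than SQL, the Hive query can be read as a filter over page-view rows. The sketch below expresses the same two conditions (a date range and a referrer ending in xyz.com) over an in-memory list. The PageView class, its field names, and the method name are assumptions made for illustration. In particular, the date is assumed to be parsed into a LocalDate, whereas the Hive query compares date strings.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of the Hive WHERE clause.
class PageViewFilterSketch {

    // Assumed stand-in for one row of the page_views table.
    static final class PageView {
        final String pageUrl;
        final LocalDate date;
        final String referrerUrl;

        PageView(String pageUrl, LocalDate date, String referrerUrl) {
            this.pageUrl = pageUrl;
            this.date = date;
            this.referrerUrl = referrerUrl;
        }
    }

    // Keep views dated 2008-03-01 through 2008-03-31 whose referrer ends with xyz.com.
    static List<PageView> fromXyzInMarch2008(List<PageView> pageViews) {
        LocalDate start = LocalDate.of(2008, 3, 1);
        LocalDate end = LocalDate.of(2008, 3, 31);
        return pageViews.stream()
                // date >= '2008-03-01' AND date <= '2008-03-31'
                .filter(v -> !v.date.isBefore(start) && !v.date.isAfter(end))
                // referrer_url LIKE '%xyz.com'
                .filter(v -> v.referrerUrl.endsWith("xyz.com"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PageView> views = List.of(
                new PageView("/a", LocalDate.of(2008, 3, 15), "http://www.xyz.com"),
                new PageView("/b", LocalDate.of(2008, 4, 1), "http://www.xyz.com"),
                new PageView("/c", LocalDate.of(2008, 3, 31), "http://other.org"));
        System.out.println(fromXyzInMarch2008(views).size());
    }
}
```

The point of Hive is that the declarative query is compiled into MapReduce jobs over data far too large for a list like this one.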

Page 15: 1st Birmingham Big Data Science Group meetup

Next-generation systems: going beyond MapReduce/Hadoop

http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html

• Mostly Google and Yahoo innovations.

• Percolator – “real-time” MapReduce. Powers Google Instant.

• Dremel – superfast “Hive” for interacting with large datasets. In-house at Google.

• Pregel – highly efficient graph computing for analyzing social graphs. In-house at Google. Open-source implementations available.

• Megastore – scalable NoSQL-like system with ACID semantics but lower consistency across partitions. In-house at Google.

• Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4000 nodes per cluster), support for multiple programming paradigms, enhanced cluster utilization.

Page 16: 1st Birmingham Big Data Science Group meetup

Intelligent Web & machine learning

• Recommendation systems, data/web mining, natural language processing

• Recommendation systems:

  • Build on collaborative filtering and information retrieval techniques.

  • Use user profiles, ratings, and browsing habits to recommend items the user has not yet considered.

  • First made famous in the commercial arena by Amazon.com
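
To make the idea of recommending "items not yet considered" concrete, here is a minimal sketch of user-based collaborative filtering in plain Java. It is an illustrative toy, not how Amazon.com or any production system works. The class name, the similarity measure (number of shared items), and the sample data are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy recommender: users who share items with you vote for their other items.
class RecommenderSketch {

    static List<String> recommend(Map<String, Set<String>> likes, String user) {
        Set<String> own = likes.getOrDefault(user, Set.of());
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : likes.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            // Similarity = number of items both users have in common.
            Set<String> shared = new HashSet<>(other.getValue());
            shared.retainAll(own);
            int similarity = shared.size();
            if (similarity == 0) {
                continue;
            }
            // Each unseen item gets a vote weighted by the other user's similarity.
            for (String item : other.getValue()) {
                if (!own.contains(item)) {
                    scores.merge(item, similarity, Integer::sum);
                }
            }
        }
        // Highest score first; the stable sort keeps ties in alphabetical order.
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> scores.get(b) - scores.get(a));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> likes = Map.of(
                "alice", Set.of("A", "B"),
                "bob", Set.of("A", "B", "C"),
                "carol", Set.of("B", "D"),
                "dave", Set.of("E"));
        // prints [C, D]
        System.out.println(recommend(likes, "alice"));
    }
}
```

Here bob shares two items with alice, so his extra item C outranks carol's item D, and dave contributes nothing because he shares no items with alice.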

Page 17: 1st Birmingham Big Data Science Group meetup

Amazon.com & Netflix recommendation systems

Page 18: 1st Birmingham Big Data Science Group meetup

Foursquare (3/2011) and Google Places (5/2011) http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/

http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html

Page 19: 1st Birmingham Big Data Science Group meetup

Hot area! Netflix and Overstock.com competitions

Page 20: 1st Birmingham Big Data Science Group meetup

Search Engines (Google, Bing, Wolfram, Lucene/Nutch, etc)

Page 21: 1st Birmingham Big Data Science Group meetup

Search innovations @ LinkedIn http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/

http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/

• Uses open-source Lucene project for social graph search and real-time indexing and searching.

• Dynamic filters automatically generated based on your query results!
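
The dynamic filters described above are commonly called facets: for each field of interest, the values found in the current result set are counted and offered as filters. The sketch below shows that counting step in plain Java. It is not LinkedIn's implementation and does not use the Lucene API. The class name, method name, and field names such as "company" and "location" are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical facet counter over an in-memory list of search results.
class FacetSketch {

    // Each result is a map of field name -> value.
    // Returns, per facet field, how many results carry each value.
    static Map<String, Map<String, Integer>> facetCounts(
            List<Map<String, String>> results, List<String> facetFields) {
        Map<String, Map<String, Integer>> facets = new TreeMap<>();
        for (String field : facetFields) {
            Map<String, Integer> counts = new TreeMap<>();
            for (Map<String, String> result : results) {
                String value = result.get(field);
                if (value != null) {
                    counts.merge(value, 1, Integer::sum);
                }
            }
            facets.put(field, counts);
        }
        return facets;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = List.of(
                Map.of("company", "Acme", "location", "Birmingham"),
                Map.of("company", "Acme", "location", "Atlanta"),
                Map.of("company", "Initech"));
        // prints {company={Acme=2, Initech=1}, location={Atlanta=1, Birmingham=1}}
        System.out.println(facetCounts(results, List.of("company", "location")));
    }
}
```

Because the counts are computed from the results of the current query, the offered filters change with every search, which is what makes them dynamic.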

Page 22: 1st Birmingham Big Data Science Group meetup

Conclusion

• Big Data is a very challenging and promising area

• Can be used to get a competitive advantage

• Usually brings about advances in computer science

• Vast area of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs

• Machine learning techniques: search engines, recommender systems