From Data Mining to Discovery Analytics Naren Ramakrishnan CS@VT CS Open House, March 25, 2011
What is data mining?
• Extracting non-trivial and actionable patterns from data (lots of data)
• Integrates ideas from – Algorithms – Databases – Statistical Inference – Visualization
Data mining research at CS@VT
• Algorithmic innovations motivated by real applications
• One technique, multiple uses – Storytelling
• Life sciences, intelligence analysis – Event sequence discovery
• Manufacturing, neuroscience, sustainability
– Graph mining • Social networks, biochemical networks
L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, Technical Report, Department of Biochemistry, Virginia Tech, 2010.
M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.
?
Connecting the dots
L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.
J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.
C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.
CBS domains
ABC transporters Ligands bound to CBS domains
Hydrogen sulfide
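One way to picture "connecting the dots" is as a path search: documents are nodes, two documents are linked when they share enough terms, and a story is a chain from a start document to an end document. The sketch below is only an illustration under those assumptions. The slides do not say which algorithm is used, and the names `jaccard`, `find_story`, `min_sim`, and the toy documents and terms are all hypothetical (they are not taken from the cited papers).

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_story(docs, start, end, min_sim=0.3):
    """Breadth-first search for a shortest chain of documents.

    Assumed linking rule: two documents are neighbors when their
    term sets have Jaccard similarity >= min_sim.
    Returns the chain as a list of document ids, or None.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == end:
            story = []
            while cur is not None:  # walk back to the start
                story.append(cur)
                cur = parent[cur]
            return story[::-1]
        for other in docs:
            if other not in parent and jaccard(docs[cur], docs[other]) >= min_sim:
                parent[other] = cur
                queue.append(other)
    return None


# Toy documents with invented terms (illustration only).
docs = {
    "doc_a": {"stress", "dehydration", "transporter", "membrane"},
    "doc_b": {"transporter", "membrane", "domain"},
    "doc_c": {"membrane", "domain", "ligand"},
    "doc_d": {"domain", "ligand", "gas"},
    "doc_e": {"ligand", "gas", "hibernation"},
}

story = find_story(docs, "doc_a", "doc_e")
print(" -> ".join(story))
```

In this toy data no two end documents are directly similar, so the chain has to pass through intermediate documents, which is the point of the storytelling idea.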
Stories mined from Wikileaks
[Figure: stories linking Spain, USA, Venezuela, European Union, Libya, Al-Qaeda, Ghana, Afghanistan, and UN. Analyst annotations on the links:]
• USA is concerned about ships with Venezuela
• USA creates pressure on Netherlands to boycott Venezuela
• Spain claims that their relationship with Venezuela is strictly financial and political
• Netherlands want to maintain good relationship with Venezuela
• US Embassy in Venezuela is trying to create positive impression about USA among Venezuelans
• Spanish Foreign Minister visits Cuba. Spain has relation with Cuba.
• Al-Qaeda is a concern to Libya
• Spain suspects Al-Qaeda for Madrid bombing
• Spain tries to convince EU to involve Cuba in the political community
• Libya provides help to boost agriculture in Ghana
• Libyan company invests in Liberia
• Libya lends tractors to Mozambique
• Spain wants to increase presence in Afghanistan
• USA and Libya tie about counter-terrorism cooperation, prospective military-to-military ties, and petroleum resources.
• Human rights violation in Cuba concerns UN
Story 1: 05MADRID1604 05MADRID703 08CARACAS420 07THEHAGUE2012
Story 2: 05MADRID1604 05MADRID1879 09MADRID1121 09LONDON2592
Story 3: 09TRIPOLI221 08TRIPOLI680 09TRIPOLI73 04MADRID974 06MADRID2657
Three automatically discovered stories summarized by an analyst
Event sequence discovery
One long sequence of events
$(E_1, t_1), (E_2, t_2), \ldots, (E_n, t_n)$
[Figure: raster of event types A, B, C, D (rows, "Events") against "Time" (horizontal axis)]
• Event of Type A occurred at t = 6.1
• Event of Type D occurred at t = 5.2
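As a rough sketch of this data model, an event sequence can be held as a list of (event type, time) pairs ordered by time, and the raster view is the same data grouped by event type. The helper names `as_sequence` and `raster` are hypothetical, and only the two events quoted on the slide are real; the others are made up for illustration.

```python
from collections import defaultdict

# (event type, time) pairs. Only ("A", 6.1) and ("D", 5.2) come from
# the slide; the remaining events are invented for illustration.
events = [("A", 6.1), ("D", 5.2), ("B", 1.0), ("C", 3.4), ("A", 2.5)]


def as_sequence(events):
    """Return the events as one long sequence ordered by time."""
    return sorted(events, key=lambda e: e[1])


def raster(events):
    """Group occurrence times by event type (one row per type)."""
    rows = defaultdict(list)
    for etype, t in as_sequence(events):
        rows[etype].append(t)
    return dict(rows)


for etype, times in sorted(raster(events).items()):
    print(etype, times)
```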
From Events to Episodes to Structures
[Figure: events over time (labels A to H). Episode mining extracts episodes such as ACDEFH, BDCFEH, ACDEFH, BCDEFH, ...; structure discovery then combines them into one structure using (or), (and), and (not) connectives.]
Why is this problem difficult?
CQLQSOQKDRQXCDRZSNQRVDXPDTBHYOCJUSWLPEDFTEQEDYASTKRYIVDTGZJUYEUPXFEQYCTCEFSSFAEOJSOBKREKSWIEQEKLSISRNSMEDWNCRESXNDQEFNSXEBYSBYRRYQTWDAOOWPKJEIINAUECBIMSFEFSSRJIBOEIPSWEEXYQTDXIRMSISNMEAREARSDSJDKFJHCIOWBSEUPAUSSBXEHYTMTLBPWERYKQDIHEBJOWSMUEROXPYFKPTINEOSSASJPEKNEBIGNMESQLCQLQIGWFIELJMELSRLNSGCTZPXJXETZSOOPAEKIERSKOIQDIPKTEDFIASCNHTDONDCYUXSNYHEXTOIXEXKGEJSYIMOLECKZXIKGIFMUESSSICEHSYZOUISERBEEADASCAW
Episode        Frequency
C→S            682
O→P→E→N        439
H→O→U→S→E      260
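To make the idea of episode frequency concrete, here is a minimal sketch that counts occurrences of a serial episode (events in a fixed order, with other events allowed in between) in a stream of single-letter events. It assumes a non-overlapped counting rule; the slide does not state which frequency definition or time constraints produced the numbers above, so this is not expected to reproduce them. The function name `count_serial_episode` is hypothetical.

```python
def count_serial_episode(stream, episode):
    """Count non-overlapped occurrences of `episode` in `stream`.

    An occurrence is the episode's events appearing in order, with any
    number of other events in between. After a full match, matching
    restarts from the first event of the episode (assumed rule).
    """
    if not episode:
        return 0
    count = 0
    pos = 0  # index of the next episode event we are waiting for
    for event in stream:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1
                pos = 0
    return count


# Small made-up stream: "HOUSE" is hidden twice among noise letters.
print(count_serial_episode("XHQOUZSEHOUSE", "HOUSE"))
```

Counting one given episode is cheap; the hard part is that the set of candidate episodes grows quickly with episode length, so a miner cannot simply try them all.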
Biochemistry by search
“How do cells remember? What is the biochemical basis of memory?”
“Family tree” of > 3000 switches discovered by mining > 100 CPU years of simulation results
Experiences redux
• Data mining research can be organized into “horizontals” and “verticals”
• Data mining is beneficial when – First-principles answers are not available – Information integration is key
Discovery Analytics Center
• A new ICTAS center focused on the use of analytics for scientific discovery
• Brings together – Core faculty from CS, STAT, MATH, ECE – Applications faculty from various other departments
• Some initial areas of emphasis – Intelligence analysis, sustainability, neuroscience
For more info
• Contact – Naren Ramakrishnan – 2050 Torgersen Hall – [email protected] – http://www.cs.vt.edu/~naren