Kansas State University, Department of Computing and Information Sciences
Kansas State University KDD Lab (www.kddresearch.org)
Collaborative Filtering
Intelligent Information Retrieval and the Grid
Friday 11 October 2002
William H. Hsu
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
http://www.kddresearch.org
This presentation is:
http://www.kddresearch.org/KSU/CIS/KU-20021010.ppt
Acknowledgements
• Kansas State University Lab for Knowledge Discovery in Databases
– Graduate research assistants: Haipeng Guo, Roby Joehanes
– Other grad students: Prashanth Boddhireddy, Siddharth Chandak, Ben B. Perry, Rengakrishnan Subramanian
– Undergraduate programmers: James W. Plummer, Julie A. Thornton
• Joint Work with
– KSU Bioinformatics and Medical Informatics (BMI) group: Sanjoy Das (EECE), Judith L. Roe (Biology), Stephen M. Welch (Agronomy)
– KSU Microarray group: Scot Hulbert (Plant Pathology), J. Clare Nelson (Plant Pathology), Jan Leach (Plant Pathology)
– Kansas Geological Survey, Kansas Biological Survey, KU EECS
• Other Research Partners
– NCSA Automated Learning Group (Michael Welge, Tom Redman)
– University of Manchester (Carole Goble, Robert Stevens)
– The Institute for Genomic Research (John Quackenbush, Alex Saeed)
– International Rice Research Institute (Richard Bruskiewich)
Overview
• Filtering
– Collaborative filtering (CF) and relatives
– Application to intelligent information retrieval (IR)
• Computational Grids
– High-Performance Computing (HPC) services
  • Scientific data, metadata (ontologies, specifications), documentation
  • Software tools (source codes, application servers)
  • Experimental results
– Grid initiatives: TeraGrid (USA), eScience (UK, EBI)
• Challenge: Personalization of Services
• Application: Bioinformatics
• Methodology: Learning Relational Probabilistic Models
– User modeling and collaborative filtering (CF)
– DESCRIBER system: integrative CF for computational genomics
• Current Research and Open Problems
Collaborative Filtering in Action: Amazon.com [1]
[Figure: Amazon.com screenshot, © 2002 Amazon.com, Inc. Callouts: Cross-Selling (based upon Market Basket Analysis); Collaborative Recommendation]
Collaborative Filtering in Action: Amazon.com [2]
[Figure: Amazon.com screenshot, © 2002 Amazon.com, Inc. Callouts: Classification and Regression based upon Historical Customer Data; Explanation from Recommender (Decision Support) System]
Filtering and Recommendation Approaches
• Collaborative
– Collect: recorded decisions (actions) of user(s)
– Infer: preferences of user(s)
– Model: associational relationships among entities (e.g., purchases)
– Use to: recommend similar decisions to users in similar context
• Structural
– Collect: recorded decisions (actions) of user(s)
– Infer: preferences of user(s)
– Model: causal relationships among entities (e.g., use cases)
– Use to: make recommendation and explain
• Content-Based: Driven by Key Word / Phrase
• Collective: Driven by Consensus, Stochastic Mixture Model
(e.g., “Swarm Intelligence”, Ant Colony Optimization)
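The collaborative approach above (collect decisions, infer preferences, model associations, recommend in similar contexts) can be sketched as a small neighbor-based recommender. This is only an illustration under stated assumptions: the purchase log, the user and item names, and the choice of cosine similarity over binary item sets are all hypothetical and are not taken from the talk.

```python
from math import sqrt

# Hypothetical purchase log: user -> set of items (illustrative data only).
decisions = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_e"},
}

def similarity(a, b):
    """Cosine similarity between two users' binary item sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, decisions, k=1):
    """Return up to k items the user lacks, weighted by neighbor similarity."""
    own = decisions[user]
    scores = {}
    for other, items in decisions.items():
        if other == user:
            continue
        weight = similarity(own, items)
        if weight == 0.0:
            continue  # users with nothing in common contribute no evidence
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + weight
    # Sort by descending score, then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]
```

Calling `recommend("ann", decisions)` scores only items chosen by users who overlap with "ann"; a user with no overlapping history, such as "cat" here, receives an empty list, which is the usual cold-start limitation of purely associational models.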
A Filtering Problem: Text Mining for Information Retrieval (IR)
[Figure: ThemeScapes visualization of 6500 news stories from the WWW in 1997. ThemeScapes © 1999 SPIRIX software http://www.cartia.com]
Another Filtering Application: Commercial Fraud Monitoring
Stages of Data Mining and Knowledge Discovery in Databases
Adapted from Fayyad, Piatetsky-Shapiro, and Smyth (1996)
NCSA D2K: Visual Programming System for Rapid Application Development in KDD
Data to Knowledge (D2K) © 2002 NCSA http://archive.ncsa.uiuc.edu/STI/ALG/d2k/
NCSA D2K Workflow: Decision Support in Insurance Pricing
Hsu, Welge, Redman, Clutter (2002) Data Mining and Knowledge Discovery, 6(4):361-391
Computational Grids [1]: High-Performance Distributed Computing
• What is The Grid?
– Infrastructure: Distributed Processing, Networks, Software
– Paradigm for Very Large-Scale Scientific Computing
• End Users of The Grid – Adapted from Goble (2002)
– Providers
  • Tool builders
  • Systems/network administrators, service providers, etc.
– Researchers
  • Scientific discipline – e.g., Biology
  • Computational Science and Engineering (CSE) – e.g., Bioinformatics
  • Patent Intelligence!
– “End users”
  • Developers: e.g., pharmaceutical
  • Medical doctors, patients
Computational Grids [2]: Personalization of Services
• What Services?
– High-Performance Computing (HPC) facilities
  • Compute clusters (Beowulf, NT, etc.)
  • Massively distributed networks
– Software
– Scientific data servers
• Metadata
– Ontologies: Definitional Data Models (cf. Semantic Web)
– Service Type Directory
• Dynamic Design of Workflows – myGrid, Goble et al. (2002) http://www.ebi.ac.uk/mygrid
• Challenge: Personalization
– Intelligent Filtering Approach: User Modeling
– “Users Who Used (Your) Specified Resources Also Used…”
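The "users who used your specified resources also used" idea can be illustrated with a plain co-occurrence count over usage sessions. The sketch below is an assumption-laden toy, not the DESCRIBER implementation: the session data and resource names are hypothetical, and raw co-occurrence counting stands in for whatever user model a real system would learn.

```python
from collections import Counter

# Hypothetical usage sessions: each set lists resources one user accessed.
sessions = [
    {"yeast_microarray_data", "clustering_tool", "go_ontology"},
    {"yeast_microarray_data", "clustering_tool"},
    {"yeast_microarray_data", "bayes_net_learner"},
    {"rice_genome_data", "bayes_net_learner"},
]

def also_used(resource, sessions, top=2):
    """Rank other resources by how often they co-occur with `resource`."""
    counts = Counter()
    for used in sessions:
        if resource in used:
            counts.update(used - {resource})
    # Sort by descending count, then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top]]
```

A query such as `also_used("yeast_microarray_data", sessions)` ranks tools and metadata by shared usage. Counts like these capture association only; they say nothing about why the resources were used together, which is the gap the structural (causal) approach is meant to address.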
DESCRIBER: An Experimental Intelligent Filter
DESCRIBER: Data Entity and Source Code Repository Index for Bioinformatics Experimental Research
[Figure: system diagram. Labels: Domain-Specific Repositories (Experimental Data; Source Codes and Specifications; Data Models; Ontologies; Models); Interface(s) to Distributed Repository; Learning and Inference Components; Historical Use Case & Query Data; Decision Support Models; Domain-Specific Collaborative Filtering; Personalized Interface; New Queries; Users of Scientific Document Repository]
Example Queries:
• What experiments have found cell cycle-regulated metabolic pathways in Saccharomyces?
• What codes and microarray data were used, and why?
DESCRIBER [1]: Overview
[Figure: module diagram.
Module 1: Intelligent Collaborative Filtering Front-End
Module 2: Learning & Validation of Bayesian Network Models for Use Cases
Module 4: Learning & Validation of Bayesian Network Models for MAGE Data & Codes
Modules 3 and 5 also appear in the diagram.
Other labels: Relational Models of MAGE Data; MAGE Data Model; Data; Historical Use Case & Query Data; Graphical Models of Use Cases; Constrained Models of Use Cases; Estimation of Constraint Parameters; Personalized Interface; User; New Queries]
DESCRIBER [2]: Collaborative Filtering Module
[Figure: detail of Module 1, the Intelligent Collaborative Filtering Front-End. Labels: Personalized Interface; New Query from User; Response to User; Relational Models of (Domain-Specific) Data; Constrained Models of Use Cases; Relational Probabilistic Model; Constraint Selector; Integrated Reasoning Component: XML Validator and Constraint Checker; Constraints on Repository Content]
Computational Genomics and Microarray Data Mining
[Figure: microarray experiment. Labels: Treatment 1 (Control); Treatment 2 (Pathogen); Messenger RNA (mRNA) Extract 1; Messenger RNA (mRNA) Extract 2; cDNA (one per extract); DNA Hybridization Microarray (under LASER)]
Components of A Microarray Experiment: Hybridization
[Figure: entity diagram. Labels: Publication (e.g., PubMed); Source (e.g., Taxonomy); Gene (e.g., GenBank); Experiment; Sample; Hybridization; Array; Normalization/Discretization; Data]
Components of A Microarray Experiment: Computational Gene Expression Modeling
[Figure: component diagram. Labels: Computational Workflows (e.g., myGrid); Experimental Services & Metadata (MAGE-ML XML); Gene Expression Model; Data Preprocessing Specification; Feature Selection Specification; Pathway & Network Learning Specification; Parameter Learning Specification; Model Analysis Specification; Discretization Use Case; Data Mining Use Case; Validation (e.g., Bootstrap) Use Case]
Graphical Models of Probability for Collaborative Filtering (CF)
• Goal: Estimate $P(X_t^i \mid y_{1:r})$ – Murphy (2002)
• Filtering: r = t
– Intuition: infer current state from observations
– Applications: signal identification
– Variation: Viterbi algorithm
• Prediction: r < t
– Intuition: infer future state
– Applications: prognostics
• Smoothing: r > t
– Intuition: infer past hidden state
– Applications: signal enhancement
• CF Tasks
– Plan recognition by smoothing
– Prediction cf. WebCANVAS – Cadez et al. (2000)
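The three cases above (filtering, prediction, smoothing) can be made concrete on a small discrete hidden Markov model, the simplest dynamic Bayesian network. The sketch below is an assumption for illustration only: it uses a single hidden variable rather than the per-node marginals of a general DBN, and the two-state model parameters (`PRIOR`, `TRANS`, `EMIT`) are hypothetical toy values, not taken from the talk or from any of the tools cited.

```python
# Hypothetical two-state model (state 0 = "active", state 1 = "idle").
PRIOR = [0.5, 0.5]                  # P(X_1)
TRANS = [[0.7, 0.3], [0.3, 0.7]]    # TRANS[i][j] = P(X_{t+1}=j | X_t=i)
EMIT = [[0.9, 0.1], [0.2, 0.8]]     # EMIT[i][y]  = P(y | X_t=i)

def normalize(v):
    """Rescale a non-negative vector so it sums to 1."""
    total = sum(v)
    return [x / total for x in v]

def predict(belief, trans, steps=1):
    """Prediction (r < t): push a belief forward with no new evidence."""
    n = len(belief)
    for _ in range(steps):
        belief = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return belief

def forward(prior, trans, emit, obs):
    """Filtering (r = t): return P(X_t | y_{1:t}) for every t."""
    beliefs = []
    belief = prior
    for t, y in enumerate(obs):
        if t > 0:
            belief = predict(belief, trans)
        belief = normalize([belief[i] * emit[i][y] for i in range(len(belief))])
        beliefs.append(belief)
    return beliefs

def smooth(prior, trans, emit, obs):
    """Smoothing (r > t): return P(X_t | y_{1:T}) via forward-backward."""
    n = len(prior)
    fwd = forward(prior, trans, emit, obs)
    back = [1.0] * n  # backward message, initialized at the last time step
    out = [None] * len(obs)
    for t in range(len(obs) - 1, -1, -1):
        out[t] = normalize([fwd[t][i] * back[i] for i in range(n)])
        y = obs[t]
        # Fold observation y_t into the message passed back to time t-1.
        back = [sum(trans[i][j] * emit[j][y] * back[j] for j in range(n))
                for i in range(n)]
    return out
```

With observations `[0, 0]`, `forward(PRIOR, TRANS, EMIT, [0, 0])` gives the filtered beliefs, `predict(...)` applied to the last of them gives a one-step-ahead forecast, and `smooth(...)` revises the first belief using the later observation. The backward message is left unnormalized, which is adequate for short sequences but would need rescaling for long ones.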
Tools for Building Graphical Models
• Commercial Tools: Ergo, Netica, TETRAD, Hugin
• Bayes Net Toolbox (BNT) – Murphy (1997-present)
– Distribution page http://http.cs.berkeley.edu/~murphyk/Bayes/bnt.html
– Development group http://groups.yahoo.com/group/BayesNetToolbox
• Bayesian Network tools in Java (BNJ) – Hsu et al. (1999-present)
– Distribution page http://bndev.sourceforge.net
– Development group http://groups.yahoo.com/group/bndev
– Current (re)implementation projects for KSU KDD Lab
  • Continuous state: Minka (2002) – Hsu, Guo, Perry, Boddhireddy
  • Formats: XML BNIF (MSBN), Netica – Guo, Hsu
  • Space-efficient DBN inference – Joehanes
  • Bounded cutset conditioning – Chandak
Experimenters’ Workbench
[Figure: learning workflow. Labels: Learning Environment; D: Data (User, Microarray); [A] Structure Learning; G = (V, E): Graph Component of BN; [B] Parameter Estimation; B = (V, E, Θ): BN with Probabilities; D_val (Model Validation by Inference); Specification Fitness (Inferential Loss). Two example networks over nodes G1 to G5 are drawn.]
References [1]: Intelligent Filtering, IR, and KDD
• Intelligent Filtering
– Taxonomy of Filtering Approaches: Rocha (2001) http://www.c3.lanl.gov/~rocha/GB0/adapweb_GB0.html
– Microsoft Research: Cadez et al. (1999), Heckerman and Meek (2002), Kadie (2002)
– Technical report: survey, Hsu (2002) http://www.kddresearch.org/Publications/Techreports/BMI-2001.pdf
– NCSA Automated Learning Group http://www.ncsa.uiuc.edu/STI/ALG
• Machine Learning, Data Mining, and Knowledge Discovery
– K-State KDD Lab: literature survey and resource catalog (2002) http://www.kddresearch.org/Resources
– Bayesian Network tools in Java (BNJ): Hsu, Guo, Joehanes, Perry, Thornton (2002) http://bndev.sourceforge.net
– Machine Learning in Java (MLJ): Hsu, Louis, Plummer (2002) http://mldev.sourceforge.net
References [2]: The Grid and Bioinformatics
• The Grid
– United Kingdom eScience Initiative: Taylor et al. (2002) http://www.research-councils.ac.uk/escience
– Access Grid: Foster and Kesselman (1999), Foster (2002) http://www-fp.mcs.anl.gov/fl/accessgrid
– NSF NPACI lecture: Reed (10 Apr 2002) http://www.interact.nsf.gov/cise/conferences.nsf/cise_lectures
• Bioinformatics
– European Bioinformatics Institute Tutorial: Brazma et al. (2001) http://www.ebi.ac.uk/microarray/biology_intro.htm
– Hebrew University: Friedman, Pe’er, et al. (1999, 2000, 2002) http://www.cs.huji.ac.il/labs/compbio/
– K-State BMI Group: literature survey and resource catalog (2002) http://www.kddresearch.org/Groups/Bioinformatics
Kohavi (1998): “Crossing the Chasm” http://robotics.stanford.edu/~ronnyk/chasm.pdf