Bots & spiders
Posted on 27-Nov-2014
Bio-informatica II, 19/04/2012
Maté Ongenaert
Center for Medical Genetics, Ghent University Hospital, Belgium
Part 1: Bots & spiders: background
Part 2: Real-life case studies: the use of bots and spiders in bio-informatics
About the presenter
Bio-engineer, cell and gene biotechnology (2005)
• Master thesis: identification of cancer-specific methylated genes
PhD, applied biological sciences: cell and gene biotechnology (2009)
• PhD thesis: cellular reprogramming
Industrial experience
• Research scientist (methylation biomarkers)
Currently: postdoc at CMGG
• Prognostic methylation biomarkers in neuroblastoma
Part 1
Bots & spiders: background
Overview
Bots and spiders
• Introduction
• Bots
• Spiders
• The Google case
Bots/spiders and bio-informatics
• Automated querying
• APIs
– NCBI E-Utils (PubMed/GenBank)
– Ensembl
Bots and spiders
The web history
• In 1989, while working at CERN, Tim Berners-Lee invented a network-based implementation of the hypertext concept
• Since then, information can be retrieved by ‘following links’ instead of having to know its exact location in advance
• Information is not kept at a single location; it is dynamic and spread across machines
Bots
Webbots
• Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet
Bots perform tasks that are:
• Simple
• Structurally repetitive
• Carried out at a much higher rate than would be possible for a human
• Typical case: an automated script fetches, analyses and files information from web servers at many times the speed of a human
Other uses:
• Chatbots / IM / Skype / Wiki bots
• Malicious bots and bot networks (zombies)
Bots
A spam bot, called the ‘Zunker Bot’
• Is installed on unpatched Windows machines
• Controls the clients through a neat application
• Can install additional software and execute commands
Spiders
Webspiders
• Webspiders / crawlers are programs or automated scripts that browse the World Wide Web in a methodical, automated manner. A spider is one type of bot
The spider starts with a list of URLs to visit, called the seeds
• As the crawler visits these URLs, it identifies all the hyperlinks in the page
• It adds them to the list of URLs to visit, called the crawl frontier
• URLs from the frontier are recursively visited according to a set of policies
• This process is called web crawling: in most cases a means of collecting up-to-date data
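The seed / frontier loop described above can be sketched in a few lines. This is a minimal illustration in Python (not the Perl used in the course demos), and it runs offline: the `PAGES` dictionary, the example.org URLs and the two "policies" (never queue a URL twice, stop after `max_pages`) are assumptions made for the example. The frontier is handled as a queue rather than by literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" (URL -> HTML) so the sketch needs no network.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": "no links here",
}


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)  # the crawl frontier: URLs still to visit
    seen = set(seeds)        # policy (assumed): never queue a URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:     # policy (assumed): skip unreachable pages
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

Swapping `PAGES.get` for a function that downloads a page would turn this into a real crawler; a real one would also need politeness rules such as rate limiting.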
Spiders (figure)
Spiders
Use of webcrawlers:
• Mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches
• Automating maintenance tasks on a website, such as checking links or validating HTML code
• Can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses
The most commonly used crawler is probably the Googlebot crawler
• Crawls
• Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)
• Serves results: PageRank technology
PageRank (figures)
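The slides name PageRank without spelling it out. As a rough sketch of the commonly published formulation (each page shares its rank equally over its outgoing links, with a damping factor), the iteration below may help. The damping value of 0.85, the fixed number of iterations and the four-page toy graph are assumptions for illustration; this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank equally over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy link graph (hypothetical): C is linked to by A, B and D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In the toy graph, the page with the most incoming links (C) ends up with the highest rank, and the page nobody links to (D) with the lowest.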
Google
Hardware
• Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
• 2009 estimate: 450 000 servers; 2 million $/month electricity cost
Software
• Webserver (not Apache-based)
• Storage (Google File System / BigTable): distributed storage, mostly in memory
• Borg: job scheduling and monitoring
• Indexing services: Caffeine / Percolator
• MapReduce: cluster system that splits complex problems and sends ‘jobs’ to worker nodes (Map); the answers are gathered and combined to solve the original question (Reduce)
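To make the Map / Reduce split concrete, here is a minimal single-machine sketch in Python using the classic word-count task. The function names and the two example documents are assumptions; a real cluster would run `map_phase` on many worker nodes in parallel, which this sketch only imitates with a loop.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single answer."""
    return key, sum(values)


# Each document stands in for the 'job' sent to one worker node.
DOCS = ["bots crawl the web", "spiders crawl the web too"]
mapped = [pair for doc in DOCS for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```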
Bots/spiders and bio-informatics
Automated querying
• Collecting information nowadays means having the power to automatically query data sources (databases, websites, Google, Ensembl or NCBI databases)
• Query in web terms: GET / POST
• Web queries using Perl: LWP library
LWP: a set of Perl modules that provides a simple and consistent application programming interface (API) to the World-Wide Web
• Free LWP e-book: http://lwp.interglacial.com/
LWP for newbies
• LWP::Simple (demo1)
• Go to a URL, fetch data, ready to parse
• Attention: HTML tags and regular expressions
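The fetch-then-parse pattern can be sketched as follows. The example is in Python's standard library rather than Perl's LWP, and it is not the course's demo1: the example.org URL, the query parameters and the sample HTML are assumptions. `fetch` plays roughly the role of LWP::Simple's `get`, and the small parser shows one way around the "HTML tags and regular expressions" pitfall, namely letting an HTML parser strip the tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


def build_get_url(base, params):
    """A GET query carries its parameters in the URL itself."""
    return base + "?" + urlencode(params)


def fetch(url, timeout=10):
    """URL in, page text out (needs network access; not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Keeps the visible text and drops the tags, without regular expressions."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical page content, standing in for what fetch() would return.
SAMPLE_HTML = "<html><body><h1>TP53</h1><p>tumor <b>protein</b> p53</p></body></html>"

url = build_get_url("http://example.org/search", {"gene": "TP53", "species": "human"})
extractor = TextExtractor()
extractor.feed(SAMPLE_HTML)
text = " ".join(extractor.chunks)
print(url)
print(text)
```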
Bots/spiders and bio-informatics
Some more advanced features
• LWP::UserAgent (demo2: show server access logs)
• Fill in forms and parse results
• Depending on content: follow hyperlinks to other pages and parse these again, …
• Mechanize package (WWW::Mechanize): follow links, fill in forms, …
Bioinformatics examples
• Use genome browser data (demo3) and sequences
• Get gene aliases and symbols from GeneCards (demo4)
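What a user agent adds over a simple fetch is control over the request itself: which agent name the server sees in its access log, and what form fields are sent. A minimal Python sketch of the idea, assuming a hypothetical form at example.org and a made-up bot name; the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields, agent):
    """Build a POST request that submits form fields under a chosen agent name."""
    data = urlencode(fields).encode("ascii")  # form fields travel in the request body
    return Request(url, data=data, headers={"User-Agent": agent})


# Hypothetical form URL, field names and bot name.
request = build_form_request(
    "http://example.org/genes/search",
    {"symbol": "BRCA1", "format": "text"},
    "BioinfoCourseBot/0.1 (contact: student@example.org)",
)

print(request.get_method())              # POST, because a body is attached
print(request.data)                      # the encoded form fields
print(request.get_header("User-agent"))  # the name the server would log
```

Passing the request to `urllib.request.urlopen` would actually submit it. Identifying a bot with a descriptive agent string and a contact address is a common courtesy towards the server's administrators.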
Bots/spiders and bio-informatics
Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)?
• Google allows automated queries: 1000 queries a day per account
• Google uses snippets: the short pieces of text you get in the main search results
• These are the result of its indexing and parsing algorithms
• Demo5: LWP and Google APIs combined, and parsing the results
API: Application Programming Interface
• Hides complexity by sharing ‘libraries’ with functions that can be applied within another programming language
• Bridges programming languages; crosses abstraction layers
• Example: displaying on a screen; printing; querying Google or NCBI from within a programming language
Bots/spiders and bio-informatics
APIs
• The Google example used the Google API
NCBI API
• The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP)
• Programmers may write software applications that access the E-Utilities using any SOAP development tool
• Main tools (demo6):
– E-Search: searches and retrieves primary IDs and term translations, and optionally retains results for future use in the user's environment
– E-Fetch: retrieves records in the requested format from a list of one or more primary IDs
Ensembl API (demo7)
• Uses ‘Slices’ and adaptors
• You have to know the ‘application’ or database (Compara/Core/…)
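The E-Search then E-Fetch sequence can be sketched as below. Note the swap: this uses the plain URL interface of the E-Utilities (`esearch.fcgi` / `efetch.fcgi`) rather than the SOAP service named on the slide, and it is written in Python rather than Perl. The search term, the `retmax` value and the IDs in the sample XML are made up for illustration; nothing is sent over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """E-Search: ask for the primary IDs that match a query."""
    return EUTILS + "esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})


def efetch_url(db, ids, retmode="xml"):
    """E-Fetch: ask for the full records behind a list of primary IDs."""
    return EUTILS + "efetch.fcgi?" + urlencode({"db": db, "id": ",".join(ids), "retmode": retmode})


def parse_esearch(xml_text):
    """Pull the primary IDs out of an E-Search XML answer."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Hypothetical E-Search answer; the IDs are placeholders, not real records.
SAMPLE_ESEARCH = (
    "<eSearchResult><Count>2</Count><RetMax>2</RetMax>"
    "<IdList><Id>11111111</Id><Id>22222222</Id></IdList></eSearchResult>"
)

search = esearch_url("pubmed", "neuroblastoma AND methylation")
ids = parse_esearch(SAMPLE_ESEARCH)
fetch = efetch_url("pubmed", ids)
print(search)
print(fetch)
```

In a live script, the text downloaded from the first URL would replace `SAMPLE_ESEARCH`, and the second URL would then be downloaded to obtain the records.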
Bots/spiders and bio-informatics
APIs: NCBI API
A frequently used NCBI database is PubMed
• PubMed can be queried using E-Utils
• Uses the same syntax as the regular PubMed website
• Get the data back in the same data formats as on the website (XML, plain text)
• Parse XML results and apply more advanced text-mining techniques
• Demo8
• Parse results and present them in an interface
– Methylated genes in cancer: http://matrix.ugent.be/mate/methylome/result1.html
– miRNAs in cancer: http://matrix.ugent.be/mate/textmining/preprocess/
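Parsing the XML that PubMed returns could look roughly like this in Python. The element names (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Author`/`LastName`) follow the PubMed XML layout, but the record itself is invented for the example, and real records carry many more fields than this sketch reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened PubMed record (not a real article).
SAMPLE_PUBMED = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <ArticleTitle>Example title about MGMT methylation</ArticleTitle>
        <Abstract>
          <AbstractText>MGMT promoter methylation was observed.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
          <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_pubmed(xml_text):
    """Extract PMID, title, abstract and author names from PubMed XML."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext(".//ArticleTitle"),
            # An abstract may be split over several AbstractText elements.
            "abstract": " ".join(
                "".join(node.itertext()).strip() for node in article.iter("AbstractText")
            ),
            "authors": [author.findtext("LastName") for author in article.iter("Author")],
        })
    return records


records = parse_pubmed(SAMPLE_PUBMED)
print(records[0]["pmid"], records[0]["authors"])
```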
Part 2
Real-life case studies: the use of bots and spiders in bio-informatics
TextMining
Create and translate query
• User query -> query suited for PubMed
Query is executed, results are returned
• Results format: XML, TXT, MEDLINE, ASN, …
• Human-readable <> parsable (XML parsers)
Parse results
• Extract information: authors, title, abstract
• Store results
Analyse results
• Identify gene names, keywords, GO terms, … -> score
• Semantic analysis / NLP processing / …
Visualise results
• Highlighting, hierarchy, filters, searches, graphics
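The "analyse" and "visualise" steps can be illustrated with a deliberately simple scorer: count how often known gene symbols and keywords occur in an abstract, then mark the hits. This is a sketch under stated assumptions, not the scoring used in the course: the weights (a gene hit counts double), the `[[...]]` highlight markers, the term lists and the sample sentence are all made up.

```python
import re


def score_abstract(text, gene_symbols, keywords):
    """Count whole-word hits per term and combine them into one score."""
    hits = {}
    for term in list(gene_symbols) + list(keywords):
        pattern = r"\b" + re.escape(term) + r"\b"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            hits[term] = count
    gene_hits = sum(hits.get(gene, 0) for gene in gene_symbols)
    keyword_hits = sum(hits.get(word, 0) for word in keywords)
    # Assumed weighting: a gene mention counts twice as much as a keyword.
    return 2 * gene_hits + keyword_hits, hits


def highlight(text, terms):
    """Wrap every whole-word occurrence of a term in [[...]] markers."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: "[[" + match.group(0) + "]]", text)


ABSTRACT = "MGMT promoter methylation silences MGMT in glioma; RASSF1A is also methylated."
GENES = ["MGMT", "RASSF1A"]
KEYWORDS = ["methylation", "methylated"]

score, hits = score_abstract(ABSTRACT, GENES, KEYWORDS)
marked = highlight(ABSTRACT, GENES)
print(score, hits)
print(marked)
```

Plain string matching like this misses synonyms and is confused by gene symbols that are also ordinary words, which is where the semantic analysis / NLP step comes in.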
TextMining (figures)
TextMining
Demonstration: GoldMine web application
• Translate query: find aliases for genes or miRNAs and incorporate them in the search
• Query NCBI PubMed using E-Fetch
• Get the results and process them
– Count
– Highlight
– Rank
– Visualization
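The query translation step (expanding a gene symbol with its aliases before searching PubMed) might be sketched like this. It is not GoldMine's actual code: the alias table is a tiny hand-written stand-in for a real alias source, and restricting every term to the `[Title/Abstract]` field is an assumption.

```python
# Hypothetical alias table; a real application would load this from a gene database.
ALIASES = {
    "TP53": ["p53", "LFS1"],
}


def translate_query(gene, topic, aliases):
    """Turn a gene symbol plus topic into a PubMed query that includes aliases."""
    names = [gene] + aliases.get(gene, [])
    gene_part = " OR ".join('"%s"[Title/Abstract]' % name for name in names)
    return "(%s) AND %s[Title/Abstract]" % (gene_part, topic)


query = translate_query("TP53", "cancer", ALIASES)
print(query)
```

The resulting string can be passed as the search term of an E-Utilities query; genes without known aliases simply fall back to the symbol itself.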
Data analysis
NCBI GEO: Gene Expression Omnibus
• Raw expression data on FTP server
• Annotation: can be queried using NCBI E-Utils
• Annotation: in Excel files on the FTP server
• For specific experimental conditions, get all raw data and annotations and perform an automated analysis
Create a scheme of how you would proceed. Biological question: superficial vs. infiltrating bladder cancer
Case study: superficial vs. infiltrating bladder cancer
• Find experiments on GEO
• Annotation of samples: up to the submitters
• ‘Uniform’ sample sheet available (Matrix file)
• Current update of GEO: view ‘factors’ in graphical overview
Case study: superficial vs. infiltrating bladder cancer (figure)
Case study: superficial vs. infiltrating bladder cancer
• Use this to couple sample annotation features (stage, age, risk, sex) to the unique sample ID (GSMxxxxxxx)
• Get raw data for each sample in the dataset
• Either txt files (uniform) or raw data files (such as Affy CEL files)
• Depends on the platform used: GPLxxxx
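Coupling annotation features to sample IDs can be sketched by reading the header lines of a GEO series matrix file, where each `!Sample_...` line holds one tab-separated value per sample. The Python below is an illustration under assumptions: the sample IDs and characteristics are invented, and it relies on submitters writing characteristics as `key: value`, which (as noted above) is up to them.

```python
import csv
import io

# Hypothetical excerpt of a series matrix file (IDs and values are made up).
SAMPLE_MATRIX = (
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"stage: Ta"\t"stage: T2"\n'
    '!Sample_characteristics_ch1\t"sex: male"\t"sex: female"\n'
    '!series_matrix_table_begin\n'
)


def sample_annotation(matrix_text):
    """Map each sample ID to a dictionary of its 'key: value' characteristics."""
    ids = []
    rows = []
    for fields in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not fields:
            continue
        if fields[0] == "!Sample_geo_accession":
            ids = fields[1:]
        elif fields[0] == "!Sample_characteristics_ch1":
            rows.append(fields[1:])
        elif fields[0] == "!series_matrix_table_begin":
            break  # the expression values start here; annotation is done
    annotation = {gsm: {} for gsm in ids}
    for row in rows:
        for gsm, cell in zip(ids, row):
            key, separator, value = cell.partition(":")
            if separator:  # ignore cells that are not 'key: value'
                annotation[gsm][key.strip()] = value.strip()
    return annotation


annotation = sample_annotation(SAMPLE_MATRIX)
print(annotation)
```

The resulting dictionary is what later turns into the class labels of the analysis, for example 0 for one stage group and 1 for the other.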
Case study: superficial vs. infiltrating bladder cancer
• Platform / data files / samples / sample annotation relationship
• Set up a standardised analysis strategy
• Make use of sample annotations
• Combine studies or keep them separate?
• Normalisation
• RankProd analysis
Case study: superficial vs. infiltrating bladder cancer
library(affy)       # just.rma(); exprs() comes from Biobase, which affy loads
library(RankProd)   # RP()
data.justrma <- just.rma("GSM90305.CEL", "…   # SAMPLES (file list truncated on the slide)
expression <- exprs(data.justrma)             # NORMALISATION
results[,2:103] <- expression
library(hgu95av2.db)                          # PLATFORM
cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,    # ANNOTATION (class labels, truncated on the slide)
RP.out.stage <- RP(results[,3:104], cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = TRUE, rand = 123)   # ANALYSIS STRATEGY
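The idea behind a rank product analysis can be shown on a toy example: rank the genes by fold change within each comparison, then take the geometric mean of each gene's ranks. The Python below is a simplified sketch of that core idea only, not the RankProd package: it assumes the log fold changes per comparison are already computed, uses made-up gene names and values, ignores ties and leaves out the permutation step (`num.perm`) that RankProd uses to estimate significance.

```python
import math


def rank_product(log_fold_changes):
    """Geometric mean of per-comparison ranks (rank 1 = most up-regulated).

    log_fold_changes maps gene -> list of log fold changes, one per comparison.
    """
    genes = list(log_fold_changes)
    n_comparisons = len(next(iter(log_fold_changes.values())))
    log_rank_sum = {gene: 0.0 for gene in genes}
    for j in range(n_comparisons):
        # Rank genes within this comparison, largest fold change first.
        ordered = sorted(genes, key=lambda gene: log_fold_changes[gene][j], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sum[gene] += math.log(rank)
    # exp(mean(log(rank))) is the geometric mean of the ranks.
    return {gene: math.exp(log_rank_sum[gene] / n_comparisons) for gene in genes}


# Hypothetical log fold changes for four genes in three comparisons.
LFC = {
    "geneA": [2.1, 1.8, 2.5],
    "geneB": [0.2, -0.1, 0.3],
    "geneC": [1.0, 2.0, 0.9],
    "geneD": [-1.5, -1.2, -2.0],
}

rp = rank_product(LFC)
ranking = sorted(rp, key=rp.get)
print(ranking)
```

A small rank product means a gene is near the top in every comparison, which is why the method copes well with data from different studies or platforms: only the within-comparison order matters, not the absolute values.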
Case study: superficial vs. infiltrating bladder cancer
• Combine results across studies
• Biological question <> data analysis
• Scoring scheme, prioritisation
– Superficial vs. infiltrating
– Metastasis vs. primary cancer
– High stage vs. low stage
– Normal vs. cancer
OncoMine (figure)
Integrated analysis
Rank Meth Pca Lit Meth other Expression Pca Progression Rank1 2 3 4 5 6 7 8
EXPRESSION RE-EXP CpG Pc
1 1 x 0,95 1 0,993 0,997 0,84 1
2 0,998 0,995 1 0,958 0,091 0,994
3 1 x x x 1 0,993 1 0,996 0,312
4 1 x x x 0,995 0,767 0,96 1 0,931 0,998 0,635
5 1 x 0,997 0,968 1 1 0,364 0,746 0,199
6 x 0,711 0,948 0,994 0,559 0,991 0,993
7 0,998 0,993 0,83 0,936 0,996
8 0,997 0,99 0,998 0,759 0,726 0,575
9 1 x x 0,886 0,995 0,997 1 0,7
10 1 0,998 0,409 0,99 0,88 0,998 0,779
11 1 x x 0,995 0,999 0,995 0,687
12 1 x x 0,997 0,999 0,999 0,257
13 1 x x x 0,799 0,996 0,969 0,994 0,848 0,981 0,887
14 1 x x 0,916 0,568 0,99 0,993 0,994 0,988 0,558
15 0,986 0,995 0,956 0,983 0,998
16 1 x 0,157 1 0,925 0,989 0,984 0,993
Acknowledgments
CMGG
• Anneleen Decock
• Frank Speleman
• Jo Vandesompele
BioBix
• Leander Van Neste
• Tim De Meyer
• Gerben Menschaert
• Geert Trooskens
• Wim Van Criekinge