Bots & spiders

Bio-informatica II, 19/04/2012

Maté Ongenaert

Center for Medical Genetics, Ghent University Hospital, Belgium

Part 1: Bots & spiders
Background

Part 2: Real-life case studies
The use of bots and spiders in bio-informatics

About the presenter

Bio-engineer, cell and gene biotechnology (2005)
• Master thesis: identification of cancer-specific methylated genes

PhD in applied biological sciences: cell and gene biotechnology (2009)
• PhD thesis: cellular reprogramming

Industrial experience
• Research scientist (methylation biomarkers)

Currently: postdoc at CMGG
• Prognostic methylation biomarkers in neuroblastoma

Part 1
Bots & spiders: background

Overview

Bots and spiders
• Introduction
• Bots
• Spiders
• The Google case

Bots/spiders and bio-informatics
• Automated querying
• APIs
– NCBI E-Utils (PubMed/GenBank)
– Ensembl

Bots and spiders
The web history

• In 1989, while working at CERN, Tim Berners-Lee invented a network-based implementation of the hypertext concept

• Since then, information can be retrieved by ‘following links’ instead of having to know its exact location in advance

• Information is no longer kept at a single location: it is dynamic and spread across machines


Bots
Webbots

• Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet

Bots perform tasks that are:
• Simple
• Structurally repetitive
• Carried out at a much higher rate than would be possible for a human
• Example: an automated script fetches, analyses and files information from web servers at many times the speed of a human

Other uses:
• Chatbots / IM / Skype / Wiki bots
• Malicious bots and bot networks (zombies)


Bots
A spam bot, called the ‘Zunker Bot’
• Is installed on unpatched Windows machines
• Controls the clients through a neat application
• Can install additional software and execute commands


Spiders
Webspiders

• Webspiders (crawlers) are programs or automated scripts that browse the World Wide Web in a methodical, automated manner. A spider is one type of bot

The spider starts with a list of URLs to visit, called the seeds
• As the crawler visits these URLs, it identifies all the hyperlinks in the page
• It adds them to the list of URLs to visit, called the crawl frontier
• URLs from the frontier are recursively visited according to a set of policies
• This process is called web crawling: in most cases a means of collecting up-to-date data
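The seed and frontier mechanism described above can be sketched in a few lines. This is a minimal illustration in Python (the lecture demos use Perl), not a production crawler: the toy link table `TOY_WEB`, the `get_links` callback and the `max_pages` limit are assumptions made for the example, and the only "policy" applied is "never visit the same URL twice".

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "seed.example/": ["seed.example/a", "seed.example/b"],
    "seed.example/a": ["seed.example/b", "seed.example/c"],
    "seed.example/b": ["seed.example/"],
    "seed.example/c": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # policy: never queue a URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):   # hyperlinks identified on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["seed.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A real crawler would replace `get_links` with a function that downloads the page and extracts its hyperlinks, and would add further policies such as politeness delays and respect for robots.txt.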


Spiders
Use of web crawlers:

• Mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches

• Automating maintenance tasks on a website, such as checking links or validating HTML code

• Gathering specific types of information from web pages, such as harvesting e-mail addresses

The most commonly used crawler is probably the Googlebot crawler
• Crawls
• Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)
• Serves results: PageRank technology


PageRank
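The slide itself only names PageRank. As a rough illustration of the idea (a page is important when important pages link to it), the textbook power-iteration formulation can be sketched as follows. The four-page link table, the damping factor of 0.85 and the fixed number of iterations are assumptions for the example; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: C is linked to by A, B and D; nobody links to D.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links (C) ends up on top, and the page nobody links to (D) ends up at the bottom.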

Google
Hardware
• Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
• 2009 estimate: 450 000 servers, with an electricity cost of 2 million $/month

Software
• Web server (not Apache-based)
• Storage (Google File System / BigTable): distributed storage, mostly in memory
• Borg: job scheduling and monitoring
• Indexing services: Caffeine / Percolator
• MapReduce: cluster system that splits complex problems and sends ‘jobs’ to worker nodes (Map); the answers are gathered and combined to solve the original question (Reduce)
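The Map and Reduce steps can be illustrated on a single machine with the classic word-count example. This sketch only shows the programming model: the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for the example, and on a real cluster each phase would run in parallel on many worker nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["bots and spiders", "spiders crawl the web", "bots fetch pages"])
print(counts)
```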


Bots/spiders and bio-informatics
Automated querying

• Collecting information nowadays requires the ability to query data sources automatically (databases, websites, Google, Ensembl or NCBI databases)

• Query in web terms: GET / POST
• Web queries using Perl: LWP library

LWP: a set of Perl modules that provides a simple and consistent application programming interface (API) to the World Wide Web
• Free LWP e-book: http://lwp.interglacial.com/

LWP for newbies
• LWP::Simple (demo1)
• Go to a URL, fetch data, ready to parse
• Attention: HTML tags and regular expressions
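The "fetch, then parse" pattern and the warning about HTML tags can be illustrated as follows. The sketch uses Python's standard library as a stand-in for LWP::Simple (the lecture demos themselves are in Perl). The helper names `fetch` and `visible_text` and the sample HTML are assumptions for the example. Instead of stripping tags with a regular expression, it lets an HTML parser separate tags from text.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Download a page as text (comparable in spirit to LWP::Simple's get)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; with network access this could be fetch(some_url).
sample_html = "<html><body><h1>TP53</h1><p>Tumor <b>protein</b> p53</p></body></html>"
text = visible_text(sample_html)
print(text)
```

Regular expressions remain useful on the extracted text (for example to find gene symbols), but they are fragile when applied directly to nested or malformed HTML.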


Bots/spiders and bio-informatics
Some more advanced features

• LWP::UserAgent (demo2: show server access logs)
• Fill in forms and parse results
• Depending on content: follow hyperlinks to other pages and parse these again, …
• Mechanize package: follow links, fill in forms, …
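What a user agent adds on top of a simple fetch is control over the request itself: the name the bot reports to the server (visible in the server access logs) and the form fields sent with a POST. A minimal Python sketch of that idea, as a stand-in for LWP::UserAgent, is shown below. The URL, the form fields and the agent string are all hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields, agent):
    """Build (but do not send) a POST request that submits form fields."""
    data = urlencode(fields).encode("ascii")   # form body, e.g. gene=TP53&...
    # Supplying data makes this a POST; the User-Agent identifies the bot.
    return Request(url, data=data, headers={"User-Agent": agent})


request = build_form_request(
    "https://example.org/search",                      # hypothetical form URL
    {"gene": "TP53", "species": "human"},              # hypothetical form fields
    "course-demo-bot/0.1 (contact: you@example.org)",  # hypothetical bot name
)
print(request.get_method(), request.full_url)
print(request.data)
```

Passing the request to `urllib.request.urlopen` would actually submit the form; the response could then be parsed and its links followed, as the bullets above describe.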

Bioinformatics examples
• Use genome browser data (demo3) and sequences
• Get gene aliases and symbols from GeneCards (demo4)


Bots/spiders and bio-informatics
Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)?
• Google allows automated queries: 1000 queries a day per account
• Google uses snippets: the short pieces of text you get in the main search results
• These are the result of its indexing and parsing algorithms
• Demo5: LWP and Google APIs combined, and parsing the results

API: Application Programming Interface
• Hides complexity by sharing ‘libraries’ with functions that can be applied within another programming language
• Bridges programming languages and crosses abstraction layers
• Example: displaying on a screen; printing; querying Google or NCBI from within a programming language


Bots/spiders and bio-informatics
APIs
• The Google example used the Google API

NCBI API
• The NCBI Web service is a web program that enables developers to access the Entrez Utilities via the Simple Object Access Protocol (SOAP)

• Programmers may write software applications that access the E-Utilities using any SOAP development tool

• Main tools (demo6):
– E-Search: searches and retrieves primary IDs and term translations, and optionally retains results for future use in the user's environment
– E-Fetch: retrieves records in the requested format from a list of one or more primary IDs
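The two-step pattern (E-Search to obtain IDs, E-Fetch to retrieve the records) can be sketched with the URL-based form of the E-Utilities, which is used here instead of the SOAP interface mentioned above. The search term and the IDs below are placeholders, and the sketch only builds the request URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for an E-Search request: returns primary IDs matching the term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, retmode="xml"):
    """URL for an E-Fetch request: returns the records for the given IDs."""
    params = {"db": db, "id": ",".join(ids), "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search = esearch_url("pubmed", "neuroblastoma AND methylation")
fetch = efetch_url("pubmed", ["12345678", "23456789"])  # placeholder IDs
print(search)
print(fetch)
```

In a real script the IDs passed to `efetch_url` would be parsed from the E-Search response, and requests should be rate-limited in line with NCBI's usage guidelines.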

Ensembl API (demo7)
• Uses ‘Slices’ and adaptors
• You have to know the ‘application’ or database (Compara/Core/…)


Bots/spiders and bio-informatics
APIs: NCBI API
A frequently used NCBI database is PubMed
• PubMed can be queried using E-Utils
• Uses the same syntax as the regular PubMed website
• Get the data back in the same data formats as on the website (XML, plain text)
• Parse XML results and apply more advanced text-mining techniques
• Demo8
• Parse results and present them in an interface
– Methylated genes in cancer: http://matrix.ugent.be/mate/methylome/result1.html
– miRNAs in cancer: http://matrix.ugent.be/mate/textmining/preprocess/
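Parsing the XML returned for PubMed records might look like the sketch below. The embedded record is invented for the example (the PMID, title, abstract and author names are not real), and it only mimics the element names of PubMed XML that the code relies on (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Author`/`LastName`); real records contain many more fields.

```python
import xml.etree.ElementTree as ET

# Invented, heavily shortened record in the style of PubMed XML.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>Hypothetical title about MGMT methylation</ArticleTitle>
        <Abstract><AbstractText>Invented abstract text.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
          <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Extract PMID, title, abstract and author names from each article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        abstract_parts = ["".join(node.itertext()) for node in article.iter("AbstractText")]
        records.append({
            "pmid": article.findtext(".//PMID", default=""),
            "title": article.findtext(".//ArticleTitle", default=""),
            "abstract": " ".join(abstract_parts),
            "authors": [a.findtext("LastName", default="") for a in article.iter("Author")],
        })
    return records


records = parse_records(SAMPLE_XML)
print(records[0]["pmid"], records[0]["authors"])
```

Using an XML parser rather than string matching keeps the extraction robust when fields are missing or appear in a different order.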


Part 2
Real-life case studies: the use of bots and spiders in bio-informatics

TextMining
Create and translate query
• User query -> query suited for PubMed

Query is executed, results are returned
• Results format: XML, TXT, MEDLINE, ASN, …
• Human-readable <> parsable (XML parsers)

Parse results
• Extract information: authors, title, abstract
• Store results

Analyse results
• Identify gene names, keywords, GO terms, … -> score
• Semantic analysis / NLP processing / …

Visualise results
• Highlighting, hierarchy, filters, searches, graphics
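The "analyse" and "visualise" steps can be illustrated with a deliberately simple scoring scheme: count how often each gene symbol occurs in the abstracts and mark the hits in the text. The abstracts below are invented, and plain exact matching on symbols is an assumption for the example; it ignores aliases, ambiguous symbols and the semantic analysis mentioned above.

```python
import re
from collections import Counter


def symbol_pattern(gene_symbols):
    """Regular expression matching any of the symbols as a whole word."""
    return re.compile(r"\b(" + "|".join(map(re.escape, gene_symbols)) + r")\b")


def count_mentions(abstracts, gene_symbols):
    """Score each gene by its number of mentions across all abstracts."""
    pattern = symbol_pattern(gene_symbols)
    counts = Counter()
    for text in abstracts:
        counts.update(pattern.findall(text))
    return counts


def highlight(text, gene_symbols):
    """Mark each recognised symbol with square brackets."""
    return symbol_pattern(gene_symbols).sub(r"[\1]", text)


genes = ["MGMT", "RASSF1A", "CDKN2A"]
abstracts = [  # invented example abstracts
    "MGMT promoter methylation predicts response; MGMT is silenced.",
    "RASSF1A and MGMT are hypermethylated.",
    "No genes named here.",
]
counts = count_mentions(abstracts, genes)
ranked = counts.most_common()
print(ranked)
print(highlight(abstracts[1], genes))
```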


TextMining
Demonstration: GoldMine web application
• Translate query: find aliases for genes or miRNAs and incorporate them in the search
• Query NCBI PubMed using E-Fetch
• Get the results and process them
– Count
– Highlight
– Rank
– Visualisation
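The query-translation step (adding aliases to the search) could be sketched as below. This is an illustration of the idea and not GoldMine's actual code: the function name `expand_query`, the short alias list for CDKN2A and the choice of the `[Title/Abstract]` field tag are assumptions for the example.

```python
def expand_query(symbol, aliases, context_terms):
    """Build a PubMed-style query that searches for a gene under all its names."""
    names = [symbol] + [alias for alias in aliases if alias != symbol]
    gene_part = " OR ".join(f'"{name}"[Title/Abstract]' for name in names)
    if not context_terms:
        return f"({gene_part})"
    context_part = " AND ".join(context_terms)
    return f"({gene_part}) AND ({context_part})"


# Illustrative alias list; a real application would look aliases up in a
# gene-nomenclature resource.
query = expand_query("CDKN2A", ["p16"], ["methylation", "cancer"])
print(query)
```

The resulting string can be used as the search term of a PubMed query, after which the returned records are counted, highlighted and ranked.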


Data analysis
NCBI GEO (Gene Expression Omnibus)
• Raw expression data on FTP server
• Annotation: can be queried using NCBI E-Utils
• Annotation: in Excel files on the FTP server
• For specific experimental conditions, get all raw data and annotations and perform an automated analysis

Create a scheme of how you would proceed. Biological question: superficial vs. infiltrating bladder cancer


Case study: superficial vs. infiltrating bladder cancer
• Find experiments on GEO
• Annotation of samples: up to the submitters
• ‘Uniform’ sample sheet available (Matrix file)
• Current update of GEO: view ‘factors’ in graphical overview


Case study: superficial vs. infiltrating bladder cancer
• Use this to couple sample annotation features (stage, age, risk, sex) to the unique sample ID (GSMxxxxxxx)
• Get raw data for each sample in the dataset
• Either txt files (uniform) or raw data files (such as Affy CEL files)
• Depends on the platform used: GPLxxxx
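Coupling annotation features to sample IDs can be sketched by reading the sample rows of a GEO series matrix file, where each `!Sample_...` line holds one tab-separated, quoted value per sample. The file content below is invented (the accessions and stages are placeholders), and the assumption that characteristics are written as `key: value` does not hold for every submission.

```python
import csv
import io

# Invented excerpt in the layout of a GEO series matrix file.
SAMPLE_MATRIX = (
    '!Series_title\t"Hypothetical bladder cancer series"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\t"GSM0000003"\n'
    '!Sample_characteristics_ch1\t"stage: Ta"\t"stage: T2"\t"stage: T1"\n'
    '!Sample_characteristics_ch1\t"sex: male"\t"sex: female"\t"sex: male"\n'
)


def sample_annotations(matrix_text):
    """Return {sample ID: {feature: value}} from the '!Sample_' rows."""
    accessions = []
    characteristics = []
    for row in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not row:
            continue
        if row[0] == "!Sample_geo_accession":
            accessions = row[1:]
        elif row[0].startswith("!Sample_characteristics"):
            characteristics.append(row[1:])
    annotations = {accession: {} for accession in accessions}
    for values in characteristics:
        for accession, value in zip(accessions, values):
            key, separator, content = value.partition(":")
            if separator:  # skip values that are not "key: value"
                annotations[accession][key.strip()] = content.strip()
    return annotations


annotations = sample_annotations(SAMPLE_MATRIX)
print(annotations["GSM0000002"])
```

The resulting dictionary can then be used to assign each raw data file to a class (for example superficial or infiltrating) before the analysis.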


Case study: superficial vs. infiltrating bladder cancer
• Platform / data files / samples / sample annotation relationship
• Set up a standardised analysis strategy
• Make use of sample annotations
• Combine studies or keep them separate?
• Normalisation
• RankProd analysis


Case study: superficial vs. infiltrating bladder cancer

library(affy)      # provides just.rma() and exprs()
library(RankProd)  # provides RP()
data.justrma <- just.rma("GSM90305.CEL", …)   # SAMPLES (further CEL files elided on the slide)
expression <- exprs(data.justrma)             # NORMALISATION
results[,2:103] <- expression
library(hgu95av2.db)                          # PLATFORM
cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, …) # ANNOTATION (class labels, remainder elided on the slide)
RP.out.stage <- RP(results[,3:104], cl, num.perm = 100,
                   logged = TRUE, na.rm = FALSE, plot = TRUE, rand = 123)   # ANALYSIS STRATEGY
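The idea behind a rank-product analysis can be shown with a strongly simplified sketch: rank the genes within each comparison and take the geometric mean of each gene's ranks. This is not the RankProd package's implementation. It omits the two-class pairing and the permutation-based significance estimates that `RP()` provides, ties are not handled, and the gene names and log fold changes are invented.

```python
import math


def rank_products(fold_changes):
    """Geometric mean of per-comparison ranks (rank 1 = most up-regulated)."""
    genes = list(fold_changes)
    k = len(next(iter(fold_changes.values())))   # number of comparisons
    log_sum = {gene: 0.0 for gene in genes}
    for i in range(k):
        ordered = sorted(genes, key=lambda gene: fold_changes[gene][i], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_sum[gene] += math.log(rank)
    return {gene: math.exp(log_sum[gene] / k) for gene in genes}


# Invented log fold changes for three genes in three comparisons.
toy_data = {
    "geneA": [2.0, 1.8, 2.2],
    "geneB": [0.1, 1.9, -0.2],
    "geneC": [-1.0, -0.5, -0.9],
}
rp = rank_products(toy_data)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```

A small rank product means the gene is consistently near the top of the list in every comparison, which is what makes the statistic suitable for combining noisy replicates.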


Case study: superficial vs. infiltrating bladder cancer
• Combine results across studies
• Biological question <> data analysis
• Scoring scheme, prioritisation
– Superficial vs. infiltrating
– Metastasis vs. primary cancer
– High stage vs. low stage
– Normal vs. cancer

OncoMine

Integrated analysis

Rank Meth Pca Lit Meth other Expression Pca Progression Rank
1 2 3 4 5 6 7 8

EXPRESSION RE-EXP CpG Pc

1 1 x 0,95 1 0,993 0,997 0,84 1

2 0,998 0,995 1 0,958 0,091 0,994

3 1 x x x 1 0,993 1 0,996 0,312

4 1 x x x 0,995 0,767 0,96 1 0,931 0,998 0,635

5 1 x 0,997 0,968 1 1 0,364 0,746 0,199

6 x 0,711 0,948 0,994 0,559 0,991 0,993

7 0,998 0,993 0,83 0,936 0,996

8 0,997 0,99 0,998 0,759 0,726 0,575

9 1 x x 0,886 0,995 0,997 1 0,7

10 1 0,998 0,409 0,99 0,88 0,998 0,779

11 1 x x 0,995 0,999 0,995 0,687

12 1 x x 0,997 0,999 0,999 0,257

13 1 x x x 0,799 0,996 0,969 0,994 0,848 0,981 0,887

14 1 x x 0,916 0,568 0,99 0,993 0,994 0,988 0,558

15 0,986 0,995 0,956 0,983 0,998

16 1 x 0,157 1 0,925 0,989 0,984 0,993

Acknowledgments

CMGG
• Anneleen Decock
• Frank Speleman
• Jo Vandesompele

BioBix
• Leander Van Neste
• Tim De Meyer
• Gerben Menschaert
• Geert Trooskens
• Wim Van Criekinge
