Chemistry-to-Protein Relationship Quality

[1]

The Chemistry-to-Protein Relationship Quality Challenge: Confounding Linked Data?

(Poster, Chris Southan, BioIT Boston, 2012)

Introduction

As evidenced by this meeting, data integration to facilitate the generation of new knowledge is undergoing a quantum jump, driven by the generation of larger data sets, expanded computational capacity and semantic web federated queries across linked open sources.

However, the cloud in this bright future is that molecular mechanistic relationships inferred from data of equivocal quality can become a house of cards. On a good day, these may remain local artefacts in the uber-network. On a bad day, the very linking on which utility depends can propagate errors instantly, remorselessly, globally and permanently.

This poster compares inferred mechanistic mappings between chemical structures and proteins, both in curated drug databases and large chemogenomic data portals. A surprising degree of discordance and different error types were found. It could also be shown that various curatorial and automated parsing errors were being transitively passed on between databases.

The results are given below as a series of problems that are potentially confounding for linking between chemistry <> protein databases.

[2]

Problem I: Constitutive Mapping Challenges

We know mapping between chemicals and proteins is neither pure nor simple. The following is not even a complete list of what "compound X <> protein Y" relationships can encompass in databases.

• Binds-to and modulates activity
• Binds-to with known specificity (e.g. active or allosteric site in PDB)
• Binds-to with molecular mechanism-of-action (mmoa): inhibitor, activator, agonist, antagonist
• Binds-to with quantitative mmoa (Ki, IC50, Kd etc.)
• Binds-to and is metabolically transformed by (e.g. P450)
• Binds-to and is transported by (e.g. multidrug resistance-associated protein)
• Binds-to but no activity modulation (e.g. albumin)
• X transformation affects binding to Y (e.g. prodrug > drug > salt > metabolite)
• X is non-canonical (e.g. enantiomers with different affinity for Y)
• One X to many proteins (e.g. a panel screen)
• Data source ambiguous in description of X (e.g. errors or tautomers)
• Data source ambiguous in description of Y (e.g. protein ID not resolved)
• X does not bind Y, thus mmoa is indirect (e.g. up or down regulation of Y)
• Many compounds to one Y (e.g. a high-throughput assay)
• X has relevant linked data in addition to binding Y (e.g. plasma clearance)
• Y is part of a functional complex (e.g. gamma secretase)
• X-Y mechanistic coupling at different system levels (e.g. in vitro, in cellulo, in vivo and in clinico)
• Y is species-specific
• Y is non-canonical (e.g. splice variant, phosphorylated, activation clipped etc.)
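One way to keep these distinctions from collapsing into a bare "X <> Y" link is to type each relationship explicitly. The following is a minimal sketch, not a schema used by any of the databases discussed; the class names, field names and the handful of relation types chosen are illustrative assumptions, and the IDs are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """A few of the relationship classes listed above (not exhaustive)."""
    BINDS_MODULATES = "binds-to and modulates activity"
    METABOLISED_BY = "binds-to and is metabolically transformed by"
    TRANSPORTED_BY = "binds-to and is transported by"
    BINDS_NO_MODULATION = "binds-to but no activity modulation"
    INDIRECT = "does not bind; indirect effect (e.g. regulation)"


@dataclass(frozen=True)
class CompoundProteinLink:
    """One typed X <> Y assertion (hypothetical field names)."""
    compound_id: str                 # placeholder for e.g. a PubChem CID
    protein_id: str                  # placeholder for e.g. a UniProt accession
    relation: RelationType
    species: Optional[str] = None    # Y can be species-specific
    evidence: Optional[str] = None   # e.g. a PMID or PDB code, if recorded


def is_direct_binding(link: CompoundProteinLink) -> bool:
    """True when the link asserts physical binding rather than an indirect effect."""
    return link.relation is not RelationType.INDIRECT


# Placeholder records: the same pair of entities, two very different meanings.
direct = CompoundProteinLink("X", "Y", RelationType.BINDS_MODULATES, species="human")
indirect = CompoundProteinLink("X", "Y", RelationType.INDIRECT)
```

With typed links, a query for "targets of X" can exclude indirect or non-modulating relationships instead of treating every link as equivalent.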

[3]

Problem II: The Numbers Don’t Add Up

• The statistical differences in orders of magnitude are only partially interpretable
• No consensus definitions or hierarchies of "target" or "interaction" as concepts
• Ipso facto, curation and/or parsing rules are very different
• Evidence filtration functionality differs
• Extraction substrates are mostly similar (e.g. journals, PubMed and other dbs)
• Explicit but also cryptic circularity (e.g. large dbs subsuming smaller dbs)

A collation of entity and relationship counts between databases and curated sets, ranked by compounds-per-protein

[4]

Problem III: Differential Chemistry Capture

• We can compare the two premier academic drug mapping resources, DrugBank (DB) and the Therapeutic Target Database (TTD), which in principle have convergent capture concepts.
• Both use expert curation teams to extract from the same primary data corpora.
• The intra-PubChem comparison of chemical content (at the CID level) is shown below

DB = 6720, TTD = 14631, Union = 19803, Intersect = 1548

• Results show very different capture (e.g. the union is over 10x larger than the intersect)
• Some of this is explicable (e.g. DB's historical emphasis on PDB ligands and TTD picking up BioAssayed compounds from ChEMBL) but reasons for other differences are less clear.
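The overlap figures quoted above can be checked with a few lines of arithmetic. This sketch uses only the four counts given on this slide; the function name and the choice of summary ratios (Jaccard index, per-source shared fraction) are illustrative assumptions rather than anything computed on the poster.

```python
def overlap_stats(size_a: int, size_b: int, size_intersect: int) -> dict:
    """Summarise the overlap of two ID sets from their sizes alone."""
    union = size_a + size_b - size_intersect   # inclusion-exclusion
    return {
        "union": union,
        "jaccard": size_intersect / union,
        "union_to_intersect": union / size_intersect,
        "fraction_of_a_shared": size_intersect / size_a,
        "fraction_of_b_shared": size_intersect / size_b,
    }


# CID counts from the slide: DrugBank = 6720, TTD = 14631, intersect = 1548.
stats = overlap_stats(6720, 14631, 1548)
for name, value in stats.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

The computed union matches the 19803 reported, and the shared fraction works out at roughly 23% of the DrugBank CIDs and roughly 11% of the TTD CIDs.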

[5]

Problem IV: Differential Target Capture

• The Venn compares DrugBank with TTD and a re-curated DrugBank sub-set (Rask-Andersen et al., "Trends in the exploitation of novel drug targets", 2011, PMID: 21804595)

• While there are caveats related to set definitions, species filters and protein ID cross-mappings, the differential capture of the three manually curated sets is clear

• The intersect, at only 170 human UniProt IDs, is roughly half the expected number of primary targets

• Some of this is explicable (e.g. Rask-Andersen et al. picking up new targets) but the causes of other differences are unclear

• Over 900 targets (this comparison excluded enzymes and transporters) are unique to DrugBank so their curatorial rules are clearly different

[6]

Problem V: Large chemistry <> protein Dbs

• Leading expert teams and significant resources
• Overlaps in concepts and utility
• Differences in approaches and technical implementation

[7]

Problem V (ctd): Too Large to Verify but too Divergent to Trust?

• Comparing atorvastatin <> proteins in four large-scale Dbs

• The 4-database intersect is only 8 out of 143 proteins

• 6 of these are probably indirect (no binding) and mechanistically unclear

• Significant database-unique capture (e.g. CTD)

• There are caveats with these exact numbers because they depend on protein database x-mappings
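A multi-database comparison of this kind reduces to set operations once protein IDs have been mapped to a common namespace. The sketch below assumes that mapping has already been done; the database names `db_a` to `db_d` and the protein labels `P1` to `P7` are placeholders, not the actual atorvastatin results.

```python
from collections import Counter


def compare_sets(sets: dict) -> tuple:
    """Return (union, intersect of all sources, per-source unique members)."""
    union = set().union(*sets.values())
    core = set.intersection(*sets.values())
    # Count how many sources report each protein.
    support = Counter(p for members in sets.values() for p in members)
    unique = {
        name: {p for p in members if support[p] == 1}
        for name, members in sets.items()
    }
    return union, core, unique


# Toy input only: four hypothetical sources reporting proteins for one compound.
toy = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P1", "P2", "P4"},
    "db_c": {"P1", "P5"},
    "db_d": {"P1", "P2", "P6", "P7"},
}

union, core, unique = compare_sets(toy)
print(f"intersect is {len(core)} of {len(union)}")
print({name: sorted(members) for name, members in unique.items()})
```

Because the result depends entirely on the cross-mapping step, it is worth re-running such a comparison with alternative mappings before reading much into the exact counts.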

[8]

Problem VI: Whose curation is "correct"?

• Protein <> atorvastatin results, automated vs curated (ChEMBL and DrugBank)
• Sum is the proteins from the four dbs in the previous slide
• Consensus is only HMGCR and CYP450 3A4
• Unique capture of transporters and metabolic enzymes by DrugBank
• Targets unique to DrugBank: human Dipeptidyl peptidase 4, Aryl hydrocarbon receptor
• Targets unique to ChEMBL: Cruzipain, pig Dipeptidyl peptidase 4

[9]

Problem VII. The PDB Hetero Entry Trap: False Drug/ligands and False Targets

E.g. STITCH makes high-scoring links from DPPIV to galactose and fucose

[10]

Problem VII ctd. STITCH X-refs the Same Errors in DrugBank that Passed them to PubChem

DrugBank links to the wrong sugar isomer as CID 671379 and PubChem inherited the 40 targets in the "Biomolecular Interactions and Pathways" field. The DrugBank entry is now deprecated.
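Tracing where an error can travel is a graph reachability question once the "who takes data from whom" relationships are written down. This sketch is illustrative only: the edges from DrugBank to PubChem and STITCH follow the case described above (treating a cross-reference as a data flow is an assumption), while `hypothetical_portal` is an invented downstream consumer added to show a second hop.

```python
from collections import deque


def downstream_of(source: str, feeds: dict) -> set:
    """Breadth-first search for every resource that can inherit data from `source`."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for consumer in feeds.get(node, []):
            # Skip already-visited nodes so circular feeds do not loop forever.
            if consumer not in seen and consumer != source:
                seen.add(consumer)
                queue.append(consumer)
    return seen


# feeds[a] lists the resources that take records or cross-references from a.
feeds = {
    "DrugBank": ["PubChem", "STITCH"],
    "PubChem": ["hypothetical_portal"],   # invented node, for illustration
}

affected = downstream_of("DrugBank", feeds)
print(sorted(affected))
```

The same traversal tolerates circular feeds (dbs within dbs), which is where a deprecated record at the source can otherwise linger unnoticed downstream.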

[11]

Problem VII ctd. Mixed mappings of the "Wrong" and "Right" (drug-relevant) Ligands

Most of the mappings above are "right"; the one on the left is "wrong" (the sugar is in the crystal but is not a ligand or a drug in this context)

[12]

Problem VIII: False-negatives

• This clinically significant inferred interaction is missed by (all?) Dbs

• A guess is that neither text mining nor curation rules (as implemented in the 7 dbs checked here) connected the individual drug names to the general case triple "statins-inhibited-PAR-1"

• We can grapple with false-positives via filtration rules and heuristic tuning but false-negatives are a more difficult and potentially more serious problem
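One possible mitigation for this kind of false-negative is to expand class-level triples to the individual class members, while flagging the results as inferred so a curator can confirm them. The sketch below is an assumption about how that might be done, not a description of any database's rules; the statin list is a partial, hand-written example and the function name is hypothetical.

```python
# Partial, hand-written class membership table (illustrative, not exhaustive).
DRUG_CLASSES = {
    "statins": ["atorvastatin", "simvastatin", "rosuvastatin", "pravastatin"],
}


def expand_class_triples(triples, classes):
    """Expand (class, predicate, object) triples to one triple per class member.

    Each result carries a flag: True means "inferred from a class-level
    statement", which should be reviewed rather than trusted outright.
    """
    expanded = []
    for subject, predicate, obj in triples:
        members = classes.get(subject)
        if members is None:
            expanded.append((subject, predicate, obj, False))
        else:
            expanded.extend((m, predicate, obj, True) for m in members)
    return expanded


# The general-case triple quoted above.
class_level = [("statins", "inhibited", "PAR-1")]
drug_level = expand_class_triples(class_level, DRUG_CLASSES)
for row in drug_level:
    print(row)
```

Keeping the inferred flag matters: a class-level finding may not hold equally for every member, so the expansion trades a false-negative for a candidate that still needs evidence.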

[13]

Ameliorating the Problems

• Avoid "brainless parsing" and go for precision over recall
• Make circularity explicit (e.g. dbs within dbs and curatorial recycling)
• Refresh and update cross-links between dbs
• Define biochemical and pharmacological relationships
• Rigorous and deep QC (e.g. actually eyeball records)
• Referential integrity checks (e.g. spot orphaned entities)
• Display relationship distributions, inspect the extreme tails and attempt to understand them
• Document curatorial practice (e.g. equivocality handling rules)
• Facilitate annotation judgments and quality-based filtration (i.e. curatorial empowerment)
• Consider canonical merging of chemical structures with multiplexed bioactivity mappings
• Crowdsourcing (e.g. DrugBank comments > fixes and deprecations)
• Encourage author mark-up at source (i.e. MIABE, PMID: 21878981)
• "But wait, hold on – did anyone peer review the database?" (Williams and Ekins, 2012 ACS presentation)
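Two of the checks listed above, referential integrity and inspection of distribution tails, are simple to automate. The sketch below works on a flat list of (compound, protein) pairs; the function names and the toy IDs are assumptions for illustration and do not correspond to real records.

```python
from collections import Counter


def find_orphans(links, known_compounds, known_proteins):
    """Return compound and protein IDs used in links but absent from the entity tables."""
    orphan_compounds = {c for c, _ in links if c not in known_compounds}
    orphan_proteins = {p for _, p in links if p not in known_proteins}
    return orphan_compounds, orphan_proteins


def compounds_per_protein(links):
    """Count distinct compounds linked to each protein."""
    return Counter(p for _, p in set(links))


# Toy data with placeholder IDs.
links = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"), ("C1", "P2"), ("C9", "P3")]
known_compounds = {"C1", "C2", "C3"}
known_proteins = {"P1", "P2"}

orphans = find_orphans(links, known_compounds, known_proteins)
counts = compounds_per_protein(links)
tail = counts.most_common(1)   # the most heavily linked protein, worth eyeballing

print("orphans:", orphans)
print("extreme tail:", tail)
```

Neither check replaces manual inspection; they only point to the records most worth looking at.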

[14]

Conclusions

• Linked Open Data is the new mining rock and roll; but...
• Even just chemistry <> protein is subject to the caveats in this poster (and more besides)
• At the very least circumspection is needed if inferences from database linking are to be acted upon, validated and exploited
• In the end, nothing saves us from database quality so this has to be addressed by all of us

Dr Christopher Southan

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Email: [email protected]
Twitter: @cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan
