Semantic Search Engine for Bioinformatics Company

Post on 24-Jun-2020


Azati designed and developed a semantic search engine powered by machine learning. It extracts the actual meaning from the search query and looks for the most relevant results across huge scientific datasets.

CUSTOMER: A US company focused on the development of in vitro diagnostic (IVD) and biopharmaceutical products. It provides products and services that support research and development activities and accelerate the time to market of products.

The customer offers clinical trial management services, biological materials, central laboratory testing, and other solutions that enable product development and research in infectious diseases, oncology, rheumatology, endocrinology, cardiology, and genetic disorders.

Many companies lack an accurate and fast search engine that can handle substantial scientific datasets. Scientific datasets are known for their structural complexity and a vast number of interconnected terms and abbreviations, which make data processing quite tricky.

The customer was looking for a partner who could overcome this challenge.

OBJECTIVE: The customer wanted us to build an intelligent search engine for its internal inventory. The inventory included a considerable number of blood samples. Each blood sample was described using several tags, grouped into subcategories, which were in turn grouped into larger categories, and so on.

The customer's employees had to select many tags by hand to get the information they wanted, and a single search took several minutes. Worse, if an employee made a single mistake or entered an inaccurate query, the result page came back empty.

The entire data lookup process was a huge disappointment and headache for the personnel and the customer. There were several challenges to overcome to improve the customer's workflow.

CHALLENGE #1:

Every blood sample was described using a textual description and specific tags, which an external data entry vendor mapped manually according to the description.

It was common for a blood sample to have misstatements in both the description and the tags. This meant that any approach to improving the search by tag or by description would fail because of inconsistent data.

CHALLENGE #2:

The first thing we thought about was cleansing the data. We faced two interconnected issues. The first was the lack of knowledge about all the possible factors that can differentiate one blood sample from another.

The second was the lack of knowledge about all alternative disease names: for example, Hepatitis B, HBV DNA, Hepatitis B Virus, HBV PCR, and Hepatitis B Virus Genotype by Sequencing basically mean the same thing.
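The alias problem described above can be illustrated with a small lookup table. This is only a sketch under assumptions: the alias list is taken from the example in the text, while the canonical label "Hepatitis B", the table name ALIASES, and the function canonical_disease are hypothetical and do not come from the project.

```python
from typing import Optional

# Hypothetical alias table. The aliases come from the example in the text;
# choosing "Hepatitis B" as the canonical label is an assumption.
ALIASES = {
    "Hepatitis B": [
        "Hepatitis B",
        "HBV DNA",
        "Hepatitis B Virus",
        "HBV PCR",
        "Hepatitis B Virus Genotype by Sequencing",
    ],
}


def _key(text: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(text.lower().split())


# Reverse index: normalized alias -> canonical name.
CANONICAL = {
    _key(alias): name for name, aliases in ALIASES.items() for alias in aliases
}


def canonical_disease(text: str) -> Optional[str]:
    """Return the canonical disease name, or None for an unknown alias."""
    return CANONICAL.get(_key(text))
```

A hand-written table like this only covers aliases someone already knows, which is exactly the gap the text describes.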

CHALLENGE #3:

Another challenge was the amount of data. There was a significant number of entries to process and, more importantly, there was no sample data for training an algorithm to match the tags automatically.

PROCESS: At the very beginning, the customer provided a list of keywords that describe blood samples. Our team soon discovered that this list was incomplete and required additional research; on its own it was not enough to complete the project. Issues like this did not stop our team from completing the project on time.

So we decided to split the final solution into two pluggable modules. The first handled intelligent matching: it determined the level of confidence while tagging a blood sample. The second extracted all possible tags from search queries; in effect, it turned unstructured user input into structured data.
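To make the idea of a confidence level during tagging more concrete, here is a minimal sketch. It is not the project's matching module, which was a trained model: it swaps in plain keyword overlap as a stand-in. The tags, the keywords, the 0.6 threshold, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical tags and keywords, for illustration only.
TAG_KEYWORDS = {
    "Disease: Hepatitis B": {"hepatitis", "b", "hbv"},
    "Biofluid: Plasma": {"plasma"},
    "Biofluid: Whole Blood": {"whole", "blood"},
}


def tokenize(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tag_confidence(description, keywords):
    """Fraction of a tag's keywords that occur in the description (0.0 to 1.0)."""
    return len(tokenize(description) & keywords) / len(keywords)


def suggest_tags(description, threshold=0.6):
    """Return the tags whose confidence reaches the (assumed) threshold."""
    scored = {
        tag: tag_confidence(description, keywords)
        for tag, keywords in TAG_KEYWORDS.items()
    }
    return {tag: round(score, 2) for tag, score in scored.items() if score >= threshold}
```

Returning the score next to each tag lets a reviewer look only at low-confidence samples instead of re-checking every entry.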

The first challenge our engineers overcame was the lack of sample data. We trained a custom model on a hundred thousand life science documents related to blood samples from open data sources. Our data scientist used Word2vec to analyze the connections between the most common words from the thesaurus, find synonyms, and determine how these words are related to each other.
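The intuition behind finding synonyms this way is that words used in similar contexts end up with similar vectors. The sketch below is not Word2vec: it uses raw co-occurrence counts and cosine similarity as a dependency-free stand-in for the same idea. The toy corpus and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; the real model was trained on
# about a hundred thousand documents.
CORPUS = [
    "hbv detected in plasma sample",
    "hepatitis detected in plasma sample",
    "hbv detected in serum sample",
    "hepatitis detected in serum sample",
    "female donor from thailand",
    "male donor from thailand",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank the other words by similarity of their context vectors."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

In this toy corpus "hbv" and "hepatitis" always appear in the same contexts, so they rank as each other's nearest neighbors, while "hbv" and "thailand" share no context at all.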

AS A RESULT, THE MODEL COULD AUTOMATICALLY ANALYZE THE DESCRIPTION AND TAGS OF BLOOD SAMPLES WITH A HIGH CONFIDENCE LEVEL - CLOSE TO 98%.

The module responsible for entity detection in search queries was partially ready. We had already built a similar module while developing a platform for custom chatbot development. All that was left to do was to retrain the model according to the list of entities: sample types, geography, diseases, genders, etc.

To achieve a high level of confidence, we analyzed a large number of user search queries collected from open data sources. In the end, we compiled a collection of patterns that users follow when forming search queries.
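The step from free text to structured entities can be sketched with a simple dictionary lookup. This is a stand-in for the trained entity detection model, not a description of it. The entity types follow the list above; the gazetteer contents and the function extract_entities are hypothetical.

```python
import re

# Hypothetical gazetteer: entity type -> {surface phrase: normalized value}.
GAZETTEER = {
    "sample_type": {"whole blood": "Whole Blood", "plasma": "Plasma", "serum": "Serum"},
    "geography": {"south africa": "South Africa", "thailand": "Thailand", "spain": "Spain"},
    "disease": {"hepatitis b": "Hepatitis B", "hbv": "Hepatitis B", "syphilis": "Syphilis"},
    "gender": {"female": "Female", "male": "Male", "women": "Female", "men": "Male"},
}


def extract_entities(query):
    """Return at most one normalized value per entity type found in the query."""
    # Normalize to lowercase tokens joined by single spaces.
    text = " ".join(re.findall(r"[a-z0-9]+", query.lower()))
    entities = {}
    for entity_type, phrases in GAZETTEER.items():
        # Try longer phrases first so multi-word names win over shorter ones.
        for phrase in sorted(phrases, key=len, reverse=True):
            # Word boundaries keep "male" from matching inside "female".
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                entities[entity_type] = phrases[phrase]
                break
    return entities
```

A lookup like this breaks on typos and unseen phrasings, which is one reason a model trained on real query patterns is the more robust choice.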

SOLUTION: The final solution consists of three separate, interconnected modules hosted in the cloud. This approach lets us maintain the system remotely and avoids on-site personnel training. The cloud architecture makes the application more flexible and cuts development and maintenance costs.

THE SYSTEM CONSISTS OF THREE MODULES:

Query Analysis module

Search Engine module

User Interface API

We are proud to say that two of the modules are powered by machine learning. The Query Analysis module uses natural language processing algorithms to extract entities from search queries, while the Search Engine module matches the extracted entities with their synonyms to perform an accurate and fast search.

Modules are built as independent RESTful microservices, which helps us to scale the finalsolution to any size in the cloud.

We significantly optimized traditional search algorithms. Instead of searching the whole dataset, we processed about 150,000 samples with about 100 tags and performed the search among these tags. We cached all processed samples in Redis, which let us implement in-memory data lookups and avoid the bottleneck of reading and writing data on a hard drive.

These optimizations helped us provide outstanding search quality and blazing speed.
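Searching among tags rather than the full records amounts to an inverted index: each tag points to the set of samples that carry it, and a query intersects those sets. The sketch below keeps the index in a plain Python dictionary as a stand-in for the Redis cache; with Redis, sets filled with SADD and combined with SINTER could play the same role. The class TagIndex, the field names, and the sample data are assumptions.

```python
from collections import defaultdict


class TagIndex:
    """In-memory inverted index: (field, value) tag -> set of sample ids."""

    def __init__(self):
        self._ids_by_tag = defaultdict(set)

    def add(self, sample_id, tags):
        """Register a sample under each of its (field, value) tags."""
        for field, value in tags.items():
            self._ids_by_tag[(field, value)].add(sample_id)

    def search(self, **criteria):
        """Return ids of samples that carry every requested tag."""
        if not criteria:
            return set()
        id_sets = [
            self._ids_by_tag.get((field, value), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*id_sets)


# Usage with made-up samples.
index = TagIndex()
index.add(1, {"biofluid": "Plasma", "country": "South Africa", "gender": "Female"})
index.add(2, {"biofluid": "Whole Blood", "country": "Thailand", "gender": "Female"})
index.add(3, {"biofluid": "Plasma", "country": "Spain", "gender": "Male"})
plasma_women = index.search(biofluid="Plasma", gender="Female")
```

Because every lookup is a set intersection in memory, query time depends on how many samples share a tag rather than on the size of the stored descriptions.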

TECHNOLOGIES:

TensorFlow, Python, Word2Vec, Flask, CRFsuite, Redis

SCREENSHOTS:

[Screenshots of the search interface: natural-language queries typed by the user, the entities extracted from each query, and a results table with columns such as Biofluid, Country, Disease, Gender, and Sample Count.]

RESULTS: We successfully implemented a commercial semantic search engine that can handle massive scientific datasets. We used modern natural language processing technologies to extract entities from search queries and categorize scientific texts by tags. The algorithms we built helped the customer eliminate ineffective search results and significantly improve employees' satisfaction with the data lookup process.

SOME NUMBERS:

150K SAMPLES WERE ANALYZED TO BUILD A SEMANTIC SEARCH ENGINE

A considerable amount of samples helped us to train the machine learning model effectively.

27 MILLISECONDS IT TAKES TO ANALYZE A SEARCH QUERY AND RETURN A RESULT

Advanced caching and algorithm improvements helped to make a blazing fast search engine structure.

3 MINUTES IT TAKES TO RETRAIN NEURAL NETWORKS FOR THE NEW DATASET

Our engineers built a scalable system that can easily be retrained to any amount of similar samples.

NOW: We successfully launched the semantic search engine in the middle of March. We are now maintaining the application, processing new datasets, increasing search quality, and scaling the system in the cloud.

CONTACT US:

USA: 184 South Livingston Avenue, Section 9, Suite 119, Livingston, NJ 07039. Phone: +1 973 5971000

Belarus: 31 K. Marks Street, Sections 5-6, Grodno, 230025. Phone: +375 29 6845855

Email: sales@azati.com, info@azati.com

Web: www.azati.com
