Jun 24, 2020

dariahiddleston
Azati

Semantic Search Engine for Bioinformatics Company

Azati designed and developed a semantic search engine powered by machine learning. It extracts the actual meaning from the search query and looks for the most relevant results across huge scientific datasets.

CUSTOMER:

A US company focused on the development of in vitro diagnostic (IVD) and biopharmaceutical products. It provides products and services that support research and development activities and accelerate the time to market of products.

The customer offers clinical trial management services, biological materials, central laboratory testing, and other solutions that enable product development and research in infectious diseases, oncology, rheumatology, endocrinology, cardiology, and genetic disorders.

Many companies suffer from the lack of an accurate and fast search engine that can handle substantial scientific datasets. Scientific datasets are known for their structural complexity and a vast number of interconnected terms and abbreviations, which make data processing quite tricky.

The customer was looking for a partner who could overcome this challenge.

OBJECTIVE:

The customer wanted us to build an intelligent search engine to help its employees search the internal inventory. The inventory included a considerable number of blood samples. Each blood sample was described using several tags, grouped into subcategories, which were in turn grouped into larger categories, and so on.

The customer's employees were forced to select many tags by hand to get the information they wanted. It took several minutes to perform a single search. Even more disappointing, if an employee made a single mistake or provided an inaccurate query, he or she got an empty result page.

The entire data lookup process was a huge disappointment and headache for the personnel and the customer. There were several challenges to overcome to improve the customer's workflow.

CHALLENGE #1:

Every blood sample was described using a textual description and specific tags, manually mapped by an external data entry vendor according to the description.

It was common for a blood sample to have misstatements in both the description and the tags. This meant that any approach to improving the search by tag or by description would fail due to inconsistent data.

CHALLENGE #2:

The first thing we thought about was cleansing the data. We faced two interconnected issues. The first was the lack of knowledge about all possible factors that can differentiate one blood sample from another.

The second was the lack of knowledge about all alternative disease names: for example, Hepatitis B, HBV DNA, Hepatitis B Virus, HBV PCR, and Hepatitis B Virus Genotype by Sequencing basically mean the same thing.
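To make the alias problem concrete, here is a minimal sketch of normalizing alternative disease names to one canonical label. The alias table holds only the Hepatitis B variants listed above; the function name `normalize_disease` and the table layout are assumptions made for illustration, not the project's actual code.

```python
from typing import Optional

# Hypothetical alias table: the keys are the Hepatitis B variants from the
# text, lower-cased. A real table would need one group per disease.
DISEASE_ALIASES = {
    "hepatitis b": "Hepatitis B",
    "hbv dna": "Hepatitis B",
    "hepatitis b virus": "Hepatitis B",
    "hbv pcr": "Hepatitis B",
    "hepatitis b virus genotype by sequencing": "Hepatitis B",
}


def normalize_disease(raw: str) -> Optional[str]:
    """Return the canonical disease name, or None if the alias is unknown."""
    # Lower-case and collapse repeated whitespace before the lookup.
    key = " ".join(raw.lower().split())
    return DISEASE_ALIASES.get(key)


print(normalize_disease("HBV  PCR"))
print(normalize_disease("Lyme Disease"))
```

A hand-built table like this only works when every alias is known in advance, which is exactly the knowledge the team was missing.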

CHALLENGE #3:

Another challenge was the amount of data. There was a significant number of entries to process and, more importantly, there was no sample data for training an algorithm to match the tags automatically.

PROCESS:

From the very beginning, the customer provided a list of keywords that describe blood samples. Very soon our team discovered that this list was incomplete and required additional research: on its own, it was not enough to complete the project. We could not let issues like this keep our team from completing the project on time.

So we decided to split the final solution into two pluggable modules. The first handled intelligent matching and determined the level of confidence while tagging a blood sample. The second extracted all possible tags from search queries. In effect, the second module transformed unstructured user input into structured data.

The first challenge our engineers overcame was the lack of sample data. We trained a custom model on a hundred thousand life science documents related to blood samples from open data sources. Our data scientist used Word2vec to analyze the connections between the most common words from the thesaurus, to find synonyms and determine how these words are related to each other.
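The synonym lookup described here rests on comparing word vectors. The sketch below shows only that comparison step, using cosine similarity over a few toy vectors. The vectors, the vocabulary, and the `most_similar` helper are invented for illustration; in the project, a Word2vec model trained on the life science corpus would supply the real vectors.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. A trained
# Word2vec model would provide much larger vectors learned from text.
VECTORS = {
    "hbv": (0.9, 0.1, 0.0),
    "hepatitis_b": (0.8, 0.2, 0.1),
    "plasma": (0.1, 0.9, 0.2),
    "serum": (0.2, 0.8, 0.3),
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, topn=1):
    """Rank every other vocabulary word by similarity to `word`."""
    target = VECTORS[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in VECTORS.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


print(most_similar("hbv"))
print(most_similar("plasma"))
```

Words that appear in similar contexts end up with nearby vectors, so the top-ranked neighbors of a term are candidate synonyms that still need review.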

AS A RESULT, THE MODEL COULD AUTOMATICALLY ANALYZE THE DESCRIPTION AND TAGS OF BLOOD SAMPLES WITH A HIGH CONFIDENCE LEVEL - CLOSE TO 98%.

The module responsible for entity detection in search queries was partially ready. We had already built a similar module while developing a platform for custom chatbot development. All that was left to do was retrain the model according to the list of entities: sample types, geography, diseases, genders, etc.

To achieve a high level of confidence, we analyzed a massive number of user search queries collected from open data sources. In the end, we compiled a collection of patterns used to form search queries.
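As a rough illustration of turning a free-text query into structured entities, the sketch below uses a plain dictionary lookup with word-boundary matching. This is a deliberately simpler stand-in for the trained entity-detection model described above, and the vocabularies and the `extract_entities` function are hypothetical.

```python
import re

# Hypothetical vocabularies, one per entity type. The real module was a
# trained model; this lookup only illustrates the input and output shapes.
ENTITY_TERMS = {
    "gender": {"female": "Female", "male": "Male"},
    "country": {"south africa": "South Africa", "spain": "Spain"},
    "biofluid": {"plasma": "Plasma", "serum": "Serum", "whole blood": "Whole Blood"},
    "disease": {"hepatitis b": "Hepatitis B", "syphilis": "Syphilis"},
}


def extract_entities(query):
    """Map a free-text query to a dict of {entity type: canonical value}."""
    text = " ".join(query.lower().split())
    found = {}
    for entity_type, terms in ENTITY_TERMS.items():
        # Try longer phrases first so multi-word terms win over shorter ones.
        for phrase in sorted(terms, key=len, reverse=True):
            # \b keeps "male" from matching inside "female".
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found[entity_type] = terms[phrase]
                break
    return found


print(extract_entities("serum samples from female patients with syphilis from South Africa"))
```

The structured result is what a downstream search step can consume, instead of the raw query string.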

SOLUTION:

The final solution consists of three separate interconnected modules hosted in the cloud. This approach lets us maintain the system remotely and avoids on-site personnel training. The cloud architecture makes the application more flexible and cuts development and maintenance costs.

THE SYSTEM CONSISTS OF THREE MODULES:

Query Analysis module

Search Engine module

User Interface API

We are proud to say that two of the modules are powered by machine learning. The Query Analysis module uses natural language processing algorithms to extract entities from search queries, while the Search Engine module matches the extracted entities with their synonyms to perform an accurate and fast search.

The modules are built as independent RESTful microservices, which helps us scale the final solution to any size in the cloud.

We significantly optimized traditional search algorithms. Instead of searching the whole dataset, we processed about 150,000 samples with about 100 tags and performed the search among these tags. We cached all processed samples with Redis, which let us implement in-memory data lookups and avoid the bottleneck of reading and writing data on a hard drive.
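The idea of searching precomputed tags rather than raw records can be sketched as an in-memory inverted index. In this sketch a plain Python dictionary of sets stands in for the Redis cache (Redis sets with the `SINTER` command would be a natural equivalent). The sample records and the `build_index` and `search` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-tagged samples. In the project, tags came from the
# matching module and were cached in Redis rather than in a Python dict.
SAMPLES = {
    "S1": {"biofluid": "Plasma", "country": "South Africa", "gender": "Female"},
    "S2": {"biofluid": "Whole Blood", "country": "South Africa", "gender": "Male"},
    "S3": {"biofluid": "Plasma", "country": "Spain", "gender": "Male"},
}


def build_index(samples):
    """Build an inverted index: (tag type, value) -> set of sample ids."""
    index = defaultdict(set)
    for sample_id, tags in samples.items():
        for tag_type, value in tags.items():
            index[(tag_type, value)].add(sample_id)
    return index


def search(index, entities):
    """Return the ids of samples that carry every requested tag."""
    if not entities:
        return set()
    id_sets = [index.get((t, v), set()) for t, v in entities.items()]
    # A sample matches only if it appears in every tag's id set.
    return set.intersection(*id_sets)


INDEX = build_index(SAMPLES)
print(sorted(search(INDEX, {"country": "South Africa"})))
print(sorted(search(INDEX, {"biofluid": "Plasma", "gender": "Male"})))
```

Because each lookup is a set intersection over precomputed id sets, query time depends on the size of the matching sets rather than on scanning every sample description.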

These optimizations helped us provide outstanding search quality and blazing speed.

TECHNOLOGIES:

TensorFlow, Python, Word2Vec, Flask, CRFsuite, Redis

SCREENSHOTS:

[Screenshots of the search interface: four example natural-language queries (filtering by, for example, gender, country, disease, and sample type), each shown with the entities extracted from the query and a results table with the columns BIOFLUID, COUNTRY, DISEASE, GENDER, and SAMPLE COUNT.]

RESULTS:

We successfully implemented a commercial semantic search engine that can handle massive scientific datasets. We used modern natural language processing technologies to extract entities from search queries and categorize scientific texts by tags. The algorithms we built helped the customer eliminate ineffective search results and significantly improve employees' satisfaction with the data lookup process.

SOME NUMBERS:

150K SAMPLES WERE ANALYZED TO BUILD A SEMANTIC SEARCH ENGINE
A considerable number of samples helped us train the machine learning model effectively.

27 MILLISECONDS IT TAKES TO ANALYZE A SEARCH QUERY AND RETURN A RESULT
Advanced caching and algorithm improvements helped make a blazing fast search engine structure.

3 MINUTES IT TAKES TO RETRAIN NEURAL NETWORKS FOR THE NEW DATASET
Our engineers built a scalable system that can easily be retrained on any number of similar samples.

NOW:

We successfully launched the semantic search engine in the middle of March. We are now maintaining the application, processing new datasets, increasing search quality, and scaling the system in the cloud.

CONTACT US:

USA
184 South Livingston Avenue, Section 9, Suite 119, Livingston, NJ 07039
+1 973 5971000

Belarus
31 K. Marks Street, Sections 5-6, Grodno, 230025
+375 29 6845855

sales@azati.com
info@azati.com
www.azati.com