Azati

Semantic Search Engine for Bioinformatics Company

Azati designed and developed a semantic search engine powered by machine learning. It extracts the actual meaning from the search query and looks for the most relevant results across huge scientific datasets.

CUSTOMER:
A US company focused on the development of in vitro diagnostic (IVD) and biopharmaceutical products. It provides products and services that support research and development and shorten the time to market of products. The customer offers clinical trial management services, biological materials, central laboratory testing and other solutions that enable product development and research in infectious diseases, oncology, rheumatology, endocrinology, cardiology, and genetic disorders.

Many companies lack an accurate and fast search engine that can handle substantial scientific datasets. Scientific datasets are known for their structural complexity and a vast number of interconnected terms and abbreviations, which make data processing quite tricky. The customer was looking for a partner who could overcome this challenge.

OBJECTIVE:
The customer wanted us to build an intelligent search engine to help its employees search the internal inventory. The inventory included a considerable number of blood samples. Each blood sample was described using several tags, grouped into subcategories, which were in turn grouped into larger categories, and so on. The customer's employees had to select many tags by hand to get the information they wanted, and a single search took several minutes. Even more disappointing, if an employee made a single mistake or entered an inaccurate query, the result page came back empty. The entire data lookup process was a headache for both the personnel and the customer.

There were several challenges to overcome to improve the customer's workflow.

CHALLENGE #1:
Every blood sample was described using a textual description and specific tags, which an external data entry vendor mapped manually according to the description. It was common for a blood sample to have errors in both the description and the tags. As a result, any attempt to improve the search by tag or by description alone would fail because of inconsistent data.

CHALLENGE #2:
The first thing we considered was cleansing the data. Here we faced two interconnected issues. The first was the lack of knowledge about all the factors that can differentiate one blood sample from another. The second was the lack of knowledge about all the alternative disease names: for example, Hepatitis B, HBV DNA, Hepatitis B Virus, HBV PCR, and Hepatitis B Virus Genotype by Sequencing basically mean the same thing.

CHALLENGE #3:
Another challenge was the amount of data. There was a significant number of entries to process and, more importantly, there was no sample data for training an algorithm to match the tags automatically.

PROCESS:
From the very beginning, the customer provided a list of keywords that describe blood samples. Very soon our team discovered that this list was incomplete and required additional research: on its own, it was not enough to complete the project. Issues like this could not be allowed to keep our team from completing the project on time.

For this reason we decided to split the final solution into two pluggable modules. The first handled intelligent matching: it determined the level of confidence while tagging a blood sample. The second extracted all possible tags from search queries. In effect, the second module turned unstructured user input into structured data.

The first challenge our engineers overcame was the lack of sample data. We trained a custom model on a hundred thousand life science documents related to blood samples, taken from open data sources.

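The alternative-name problem from Challenge #2 can be illustrated with a small Python sketch. It is not the trained model described here: it substitutes a hand-written synonym table and the standard library's `difflib.SequenceMatcher` for the learned matching, simply to show what "assign a tag with a confidence level" can look like. The `SYNONYMS` table, the `tag_with_confidence` function and the scoring rule are assumptions made for illustration; only the Hepatitis B names come from the text.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table: canonical tag -> known alternative names.
# The Hepatitis B names are quoted from the text; the layout is an assumption.
SYNONYMS = {
    "Hepatitis B": [
        "Hepatitis B",
        "HBV DNA",
        "Hepatitis B Virus",
        "HBV PCR",
        "Hepatitis B Virus Genotype by Sequencing",
    ],
    "Lyme Disease": ["Lyme Disease"],
}


def tag_with_confidence(phrase: str) -> tuple[str, float]:
    """Return the closest canonical tag and a similarity score in [0, 1]."""
    needle = phrase.strip().lower()
    best_tag, best_score = "", 0.0
    for tag, names in SYNONYMS.items():
        for name in names:
            # Character-level similarity stands in for a learned model here.
            score = SequenceMatcher(None, needle, name.lower()).ratio()
            if score > best_score:
                best_tag, best_score = tag, score
    return best_tag, best_score


tag, confidence = tag_with_confidence("hbv pcr")
print(tag, round(confidence, 2))
```

A hand-written table like this only covers names someone has already listed, which is why a model trained on a large document collection is the more practical route for a real inventory.
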
Our data scientist used Word2vec to analyze the connections between the most common words from the thesaurus, to find synonyms and determine how these words are related to each other.

AS A RESULT, THE MODEL COULD AUTOMATICALLY ANALYZE THE DESCRIPTION AND TAGS OF BLOOD SAMPLES WITH A HIGH CONFIDENCE LEVEL, CLOSE TO 98%.

The module responsible for entity detection in search queries was partially ready. We had already built a similar module while developing a platform for custom chatbot development. All that was left to do was to retrain the model on the list of entities: sample types, geography, diseases, genders, etc. To achieve a high level of confidence, we analyzed a large number of user search queries collected from open data sources. In the end, we compiled a collection of patterns used to form search queries.

SOLUTION:
The final solution consists of three separate interconnected modules hosted in the cloud. This approach lets us maintain the system remotely and avoid on-site personnel training. Cloud architecture makes the application more flexible, cutting down development and maintenance costs.

THE SYSTEM CONSISTS OF THREE MODULES:
Query Analysis module, Search Engine module, User Interface API.

We are proud to say that two of the modules are powered by machine learning. The Query Analysis module uses natural language processing algorithms to extract entities from search queries, while the Search Engine module matches the extracted entities with their synonyms to perform an accurate and fast search. The modules are built as independent RESTful microservices, which helps us scale the final solution to any size in the cloud.

We significantly optimized traditional search algorithms. Instead of searching the whole dataset at query time, we processed about 150,000 samples against about 100 tags in advance and performed the search among these tags.

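The flow from Query Analysis to Search Engine can be sketched in Python as follows. This is a simplified illustration, not the production code: the `SAMPLES` rows, the `VOCABULARY` mapping and all function names are hypothetical, and a plain phrase lookup stands in for the machine-learning entity extraction. It only shows the idea of searching a small tag index instead of scanning every sample.

```python
import re
from collections import defaultdict

# Hypothetical inventory rows, one dict per sample.
SAMPLES = [
    {"id": 1, "biofluid": "plasma", "country": "south africa",
     "disease": "hepatitis b", "gender": "female"},
    {"id": 2, "biofluid": "whole blood", "country": "united states",
     "disease": "lyme disease", "gender": "female"},
    {"id": 3, "biofluid": "whole blood", "country": "south africa",
     "disease": "hepatitis b", "gender": "male"},
]

# Hypothetical vocabulary: surface phrase -> (entity type, canonical tag).
# Synonyms such as "hbv" point at the same canonical tag.
VOCABULARY = {
    "plasma": ("biofluid", "plasma"),
    "whole blood": ("biofluid", "whole blood"),
    "south africa": ("country", "south africa"),
    "united states": ("country", "united states"),
    "hepatitis b": ("disease", "hepatitis b"),
    "hbv": ("disease", "hepatitis b"),
    "lyme disease": ("disease", "lyme disease"),
    "female": ("gender", "female"),
    "male": ("gender", "male"),
}


def extract_entities(query: str) -> dict[str, str]:
    """Turn free text into {entity type: canonical tag} via whole-word lookup."""
    text = query.lower()
    entities = {}
    for phrase, (kind, tag) in VOCABULARY.items():
        # Word boundaries keep "male" from matching inside "female".
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            entities[kind] = tag
    return entities


def build_index(samples: list[dict]) -> dict:
    """Map each (entity type, tag) pair to the set of matching sample ids."""
    index = defaultdict(set)
    for sample in samples:
        for kind, tag in sample.items():
            if kind != "id":
                index[(kind, tag)].add(sample["id"])
    return index


def search(query: str, index: dict) -> list[int]:
    """Return ids of samples that carry every tag found in the query."""
    entities = extract_entities(query)
    if not entities:
        return []
    hits = [index.get((kind, tag), set()) for kind, tag in entities.items()]
    return sorted(set.intersection(*hits))


INDEX = build_index(SAMPLES)
print(search("blood samples from female patients with HBV from South Africa", INDEX))
```

Because the index is keyed by tag, a lookup touches only the id sets for the tags named in the query, so its cost depends on the number of tags involved rather than on the size of the inventory.
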
We cached all processed samples in Redis, which let us implement in-memory data lookups and avoid the bottleneck of reading and writing data on a hard drive. These optimizations helped us provide outstanding search quality and blazing speed.

TECHNOLOGIES:
TensorFlow, Python, Word2Vec, Flask, crfsuite, Redis

SCREENSHOTS:
[Screenshots: four free-text search queries, for example samples from male donors in South Africa or samples from male donors with syphilis, each followed by a results table with columns such as BIOFLUID, COUNTRY, DISEASE, GENDER and SAMPLE COUNT.]

RESULTS:
We successfully implemented a commercial semantic search engine that can handle massive scientific datasets. We used modern natural language processing technologies to extract entities from search queries and categorize scientific texts by tags. The algorithms we built helped the customer eliminate ineffective search results and significantly improve employee satisfaction with the data lookup process.

SOME NUMBERS:
150K SAMPLES were analyzed to build the semantic search engine.
27 MILLISECONDS is what it takes to analyze a search query and return a result.
3 MINUTES is what it takes to retrain the neural networks for a new dataset.

The considerable number of samples helped us train the machine learning model effectively. Advanced caching and algorithm improvements helped make the search engine blazing fast. Our engineers built a scalable system that can easily be retrained on any number of similar samples.

NOW:
We successfully launched the semantic search engine in the middle of March. We are now maintaining the application, processing new datasets, increasing search quality and scaling the system in the cloud.

CONTACT US:
USA: 184 South Livingston Avenue, Section 9, Suite 119, Livingston, NJ 07039. Phone: +1 973 5971000
Belarus: 31 K. Marks Street, Sections 5-6, Grodno, 230025. Phone: +375 29 6845855
Email: sales@azati.com, info@azati.com
Web: www.azati.com