

Genomics 100 (2012) 93–101

Contents lists available at SciVerse ScienceDirect

Genomics

journal homepage: www.elsevier.com/locate/ygeno

Population-ethnic group specific genome variation allele frequency data: A querying and visualization journey

Emmanouil Viennas a,1, Vassiliki Gkantouna a,1, Marina Ioannou a, Marianthi Georgitsi b, Maria Rigou a, Konstantinos Poulas b, George P. Patrinos b, Giannis Tzimas c,⁎

a Department of Computer Engineering and Informatics, Faculty of Engineering, University of Patras, Patras, Greece
b Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece
c Department of Applied Informatics in Management & Economy, Faculty of Management and Economics, Technological Educational Institute of Messolonghi, Messolonghi, Greece

⁎ Corresponding author at: Department of Applied Informatics in Management & Economy, Faculty of Management and Economics, Technological Educational Institute of Messolonghi, Messolonghi, TEI of Messolonghi, Nea Ktiria, GR30200 Messolonghi, Greece. Fax: +30 26310 23204.

E-mail address: [email protected] (G. Tzimas).

1 These authors contributed equally to this work.

0888-7543/$ – see front matter © 2012 Elsevier Inc. All rights reserved.
doi:10.1016/j.ygeno.2012.05.009

article info

Article history:
Received 19 March 2012
Accepted 21 May 2012
Available online 30 May 2012

Keywords:
National/ethnic mutation databases (NEMDBs)
Genetic variation
Genetic disease
Querying tools
Visualization tools
Data mining

abstract

National/ethnic mutation databases aim to document the genetic heterogeneity in various populations and ethnic groups worldwide. We have previously reported the development and upgrade of FINDbase (www.findbase.org), a database recording causative mutations and pharmacogenomic marker allele frequencies in various populations around the globe. Although this database has recently been upgraded, we continuously try to enhance its functionality by providing more advanced visualization tools that would further assist effective data querying and comparisons. We are currently experimenting with various visualization techniques on the existing FINDbase causative mutation data collection, aiming to provide a dynamic research tool for the worldwide scientific community. We have developed an interactive web-based application for population-based mutation data retrieval. It supports sophisticated data exploration, allowing users to apply advanced filtering criteria to a set of multiple views of the underlying data collection, and enables browsing the relationships between individual datasets in a novel and meaningful way.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

Genetic or mutation databases are online mutation data depositories that summarize genomic knowledge and genotype–phenotype correlations pertaining, most of the time, to monogenic inherited disorders. To date, there are three common types of genetic databases: general or core databases, which summarize genotype–phenotype information on all genes; locus-specific databases (LSDBs), which cover a single gene or a few genes related to a single-gene disorder; and national/ethnic mutation databases (NEMDBs), which describe, at a summary level, the genetic heterogeneity and allele frequencies of causative genomic variations identified specifically for a population or ethnic group [1]. Since the early 1990s, which marked the first attempts to develop genetic databases, the field has grown remarkably, and such databases are now available for hundreds to thousands of human genes (sometimes with more than one database per gene). However, the genetic database field suffers from a number of serious deficiencies, such as data content heterogeneity, lack of data models, ontology options, etc. [2]. There is also an urgent need to link genomic databases with databases reporting phenotypes and clinical conditions, so that genomic data become meaningful [3]. Most of these needs are currently being addressed by several consortia, such as GEN2PHEN [4] and the Human Variome Project [5]. Among the identified deficiencies is the lack of standardized data visualization tools that would allow more efficient assessment of the data stored within these repositories and, thus, more useful conclusions. A thorough domain analysis has indicated that very few LSDBs (13.5%) show mutation maps and few can provide graphical projections, including dynamic graphing tools, depicting the location of variations throughout the gene (or protein) sequence [2]. The VariVis tool has previously been described for LSDB data visualization [6] and has been implemented in the SERPINA1 LSDB [7], while the MUTbase-based LSDBs employ a built-in visualization tool [8].

Currently, no data visualization and querying tool of this kind is employed for NEMDBs [9] that would allow comparison of causative genome variation allele frequency data among different populations. To this end, we are experimenting with various visualization techniques, aiming to bridge the gap currently left open by other related databases. Whereas the initial launch of the Frequency of Inherited Disorders database (FINDbase) included some basic data visualization functionality [10], the recently upgraded version allows data querying to be coupled to data visualization [11]. In this work, we present the development and implementation of an interactive web-based data visualization and querying tool for FINDbase, which allows users to combine large groups of similar elements and identify hidden relationships between individual pieces of information. This tool meets one of the significant challenges of large and complex datasets, namely the effective presentation of, and interaction with, the data. More specifically, we have built a web-based multimedia front-end, based on PivotViewer [12], a software tool launched by Microsoft, in order to support a high-level visualization of the data collection and the mining process. It supports dynamic data visualization, sorting, organization and categorization. A different visual encoding is provided for each gene and population, aiming to facilitate population-based mutation data collection and retrieval. In addition to the PivotViewer, there is an alternative visualization querying interface based on the Flare visualization toolkit [13], which provides two extra visualization types of the underlying data collection, namely the Gene and Mutation Map and the Mutation Dependency Graph, and allows users to query mutation distributions and correlations among populations. FINDbase is the first effort to provide standardized visualization software for genetic databases, in the hope of also providing a modern educational and diagnostic visualization paradigm to the worldwide medical community.

2. Results

It is common practice among health professionals and researchers to collect the information related to their scientific research in large spreadsheets, which do not make it easy to explore, analyze and understand data in a user-friendly way. As a result, there is an imperative need to represent information in a visual form that enables them to drill down into the data and gain meaningful insights by revealing underlying patterns and previously unseen correlations among large datasets. Taking this into consideration, we have developed a data visualization environment for the FINDbase causative mutation data collection, designed to give users a much more intuitive way to digest large amounts of information without losing their orientation, an idea that anyone who has analyzed extensive spreadsheets may welcome.

The large volume and complexity of the FINDbase causative mutation data collection, as well as the numerous relationships between the data, make it hard for researchers to maintain a global view of the whole dataset. Driven by the need to help scientists obtain a broader picture of the underlying datasets and, at the same time, identify and extract the valuable knowledge that lies in them, we have given special attention to the development of the data and querying visualization. Our primary objective was to give researchers an active role in the querying process, allowing them to interact directly with the large amount of available data in an intuitive and meaningful way.

On this basis, we have built a visualization tool providing two basic querying interfaces that rely on the following visualization approaches: (a) the main interface, based on the PivotViewer, and (b) the alternative interface, based on the Flare Visualization Toolkit. Both enable users to see data from many different perspectives by providing multiple views, and thus to assess and understand the data better. Moreover, they let users zoom in from the extensive FINDbase datasets to particular gene-specific, genome variation-specific and/or population-specific data. In this way, users can detect the links between distant data items and handle them in a way that intuitively reflects their semantic proximity. Any user can perform a query based on a variety of filtering criteria, which allows them to see specific genetic variants and to zoom further into a particular population. It is noteworthy that experimenting with different querying scenarios may sometimes lead a user to incidental discoveries of potentially high biological importance.

During our experiments, we have found querying scenarios that produce insightful visualizations, presenting information that was previously undiscovered. In the following paragraphs, some indicative examples based on the provided querying interfaces are presented.

2.1. PivotViewer visualization tool

Each genomic variant is described in the form of a card (Fig. 1A), in order to provide a more human-centric visualization approach. The card consists of a chromosomal depiction of the gene location (derived from GeneCards [14]) combined with a sidebar information panel (Fig. 1B). The panel provides in-depth information concerning the particular genetic variation and population, and appears when users zoom in on the card.

Fig. 1. FINDbase card item: (A) A card represents a single variant in the underlying collection occurring in a specific population. The card consists of the gene name, information about the genetic variant, such as the disease name and the mutation description, a chromosomal depiction of the gene location and a flag corresponding to the particular population. (B) A sidebar information panel appears when users zoom in on a card, providing in-depth information about the card currently selected. Users can zoom in or out of a card in the following ways: by clicking on the card, by using the zoom slider in the upper right corner of the PivotViewer frame, and by using the zoom wheel on their mouse. The information panel is visible only when a specific card is selected; otherwise it disappears.

The entire FINDbase causative mutation data collection, as produced by PivotViewer, is shown in Fig. 2A. A data filtering panel (displayed on the left side; Fig. 2B) provides a variety of filtering criteria to be applied to the underlying data collection. The FINDbase PivotViewer application enables users to explore the underlying datasets smoothly and quickly and to include or exclude specific items by applying filters. At the same time, they can change the way the resulting set of cards (each depicting a genetic variant) is displayed by choosing between the grid and the graph view (by clicking on the corresponding button; Fig. 3; highlighted areas).

In this way, users can sort, organize and categorize data dynamically according to common characteristics selected from the data query menu and then zoom in for a closer look, either by filtering the collection to get a subset of information or by clicking on a particular card. Allowing users to focus on a specific area or zoom out for an overall view of the data enables them to uncover short- or long-distance relationships. Moreover, hyperlinks from each gene name to the OMIM [15] and HGMD [16] databases offer direct access to additional information.

By exploiting the above functionalities, users can experiment with different gene-, mutation- and population-specific scenarios, which may guide them to discover previously unseen trends. Users can also perform compound queries. One example is the query for rare pathogenic mutations with frequencies from 5% to 30% in the Israeli population, sorted by gene name. The query returns 19 alleles for six genes, namely CFTR, FANCA, GBA, HBB, HEXA, and MEFV (Fig. 4), which can be further separated according to either the Arab or the Ashkenazi Jewish ethnic group (Fig. 4; highlighted area).
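The compound query described above amounts to a conjunction of filters followed by a sort and a grouping step. The following is a minimal Python sketch of that logic, not the PivotViewer implementation: the `AlleleRecord` type, its field names, the `compound_query` helper and the sample rows are all hypothetical and the values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleRecord:
    """Hypothetical record layout; field names are assumptions."""
    gene: str
    mutation: str
    population: str
    ethnic_group: str
    frequency: float  # rare allele frequency, in percent


def compound_query(records, population, min_freq, max_freq):
    """Keep records of one population whose frequency lies in
    [min_freq, max_freq], sorted by gene name (then mutation)."""
    hits = [
        r for r in records
        if r.population == population and min_freq <= r.frequency <= max_freq
    ]
    return sorted(hits, key=lambda r: (r.gene, r.mutation))


# Made-up rows for illustration only; these are not FINDbase data.
SAMPLE = [
    AlleleRecord("HBB", "variant-1", "Israeli", "Arab", 12.0),
    AlleleRecord("CFTR", "variant-2", "Israeli", "Ashkenazi Jewish", 29.5),
    AlleleRecord("GBA", "variant-3", "Israeli", "Ashkenazi Jewish", 3.1),
    AlleleRecord("HBB", "variant-1", "Greek", "", 18.0),
]

result = compound_query(SAMPLE, "Israeli", 5.0, 30.0)

# Second step of the scenario: separate the hits by ethnic group.
by_group = {}
for record in result:
    by_group.setdefault(record.ethnic_group, []).append(record.gene)
```

Each additional filter category simply adds another condition to the list comprehension, which is how compound filtering narrows the set of displayed cards.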

2.2. Flare visualization tool

The FINDbase data collection can also be presented in two more alternative representations, namely the Gene and Mutation Map and the Mutation Dependency Graph.

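Fig. 2. FINDbase data collection: (A) The entire FINDbase causative mutation data collection, as produced by PivotViewer, is a set of cards, each one representing a particular genetic variation of the collection. (B) A filtering panel is available to the users, containing several categories that can be used to include or exclude specific cards and to help users apply a particular querying scenario. Users can apply a filter by selecting the check box of the corresponding category. Compound filtering is possible through the application of multiple filter categories.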

2.2.1. Gene and Mutation Map

The Gene and Mutation Map is based on a treemap. A treemap is an easy way of analyzing large amounts of data in a small space. Treemaps, introduced by Ben Shneiderman in 1991 [17], are a space-filling approach to showing hierarchies in which the rectangular screen space is divided into regions, and each region is then divided again for each level in the hierarchy.

The FINDbase Gene and Mutation Map presents a treemap of mutation frequencies estimated for each population. Each rectangle represents a mutation in a population, and a specific color corresponds to each population. The area of each rectangle encodes the frequency of rare alleles. Each time the user clicks on a node, the occurrence of the selected mutation is shown over all populations. Moreover, when a user hovers the mouse over a rectangle, the Gene and Mutation Map displays additional information pertaining to the selected mutation, such as the population, the gene name, the mutation, and the mutation's frequency.
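To make the space-filling idea concrete, the sketch below lays out a two-level hierarchy (population, then mutation) so that each rectangle's area is proportional to its frequency. It uses the simple slice-and-dice scheme, in which the split direction alternates per level; the text does not state which layout algorithm the Flare-based map uses, so this is an assumption, and the function names and the `FREQS` data are hypothetical.

```python
def slice_and_dice(items, x, y, w, h, vertical=True):
    """Split rectangle (x, y, w, h) among (label, weight) items so that
    each sub-rectangle's area is proportional to its weight."""
    total = sum(weight for _, weight in items)
    rects = []
    offset = 0.0  # fraction of the rectangle already consumed
    for label, weight in items:
        share = weight / total
        if vertical:  # side-by-side columns
            rects.append((label, x + offset * w, y, share * w, h))
        else:         # stacked rows
            rects.append((label, x, y + offset * h, w, share * h))
        offset += share
    return rects


def treemap(freqs, width, height):
    """freqs maps population -> {mutation: frequency}.
    Returns a list of (population, mutation, x, y, w, h) cells."""
    populations = [(pop, sum(muts.values())) for pop, muts in freqs.items()]
    cells = []
    # Level 1: one column per population.
    for pop, px, py, pw, ph in slice_and_dice(populations, 0.0, 0.0, width, height, True):
        # Level 2: split each column into rows, one per mutation.
        for mut, mx, my, mw, mh in slice_and_dice(list(freqs[pop].items()), px, py, pw, ph, False):
            cells.append((pop, mut, mx, my, mw, mh))
    return cells


# Made-up frequencies for illustration only.
FREQS = {"PopA": {"m1": 3.0, "m2": 1.0}, "PopB": {"m3": 4.0}}
cells = treemap(FREQS, 8.0, 4.0)
areas = {(pop, mut): w * h for pop, mut, _, _, w, h in cells}
```

In this layout the ratio of area to frequency is the same for every cell, which is the property that lets a reader compare rare allele frequencies by eye.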

The Gene and Mutation Map can provide a visual depiction of thedistribution of mutations per population (Fig. 5A), while it can alsoprovide information for the allele frequencies per population and theoccurrence of a specific mutation over all populations.

A typical query example is shown in Fig. 5B, where the user queries for the occurrence of HBB: c.315+1 G>A, a mutation leading to β-thalassemia. The graph indicates that this causative mutation is more prevalent in the European and Middle-East populations. By contrast, the HBB: c.124_127delTTCT mutation is more prevalent in the populations of South-East Asia, where it reaches particularly high frequencies, as shown in Fig. 5C.

Fig. 3. The grid and the graph view: The PivotViewer enables users to modify the sort order of the displayed set of cards, changing from grid view to graph view by clicking on the corresponding button (highlighted areas). Users can see different sort order effects based on whether the PivotViewer is in the grid or the graph view. (A) The grid view displays all card items in a single group, arrayed from left to right, top to bottom in ascending order, based on the order category chosen. (B) The graph view organizes the data in a columnar format that allows for qualitative comparison. All card items are sorted into separate groups and displayed in several discrete sets resembling a bar chart. By picking one of these groups, users can filter out the other groups and redistribute items within the selected group. If there are more values than can be displayed conveniently, then aggregated groups are formed and the label indicates the first and last value in that group. By clicking one of these aggregated groups, the other groups are filtered out and the items in the selected group are redistributed with new labels or aggregate labels.

2.2.2. Mutation Dependency Graph

The Mutation Dependency Graph visualizes the dependencies among populations based on a selected mutation. Population names are placed along a circle. A link between two populations indicates that they share the same mutation. When a user hovers the mouse over a specific population, the corresponding incident links are highlighted. Red links show all the populations that are associated with the selected mutation. By clicking on a specific population, the user can see all the relevant dependencies of that population concerning the selected mutation.
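The structure behind such a graph can be sketched in a few lines: populations are placed evenly on a circle, and a link is drawn between every pair of populations that carry the selected mutation. The Python sketch below illustrates this under stated assumptions; the function names and the `OCCURRENCES` data are hypothetical, and it does not reproduce the Flare implementation.

```python
import math
from itertools import combinations


def circle_layout(names, radius=1.0):
    """Place names evenly along a circle; returns {name: (x, y)}."""
    n = len(names)
    return {
        name: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, name in enumerate(names)
    }


def shared_mutation_links(occurrences, mutation):
    """occurrences maps population -> set of mutations found in it.
    Returns every pair of populations that both carry `mutation`."""
    carriers = sorted(p for p, muts in occurrences.items() if mutation in muts)
    return list(combinations(carriers, 2))


def incident_links(links, population):
    """Links touching one population (what a hover would highlight)."""
    return [link for link in links if population in link]


# Made-up occurrence data for illustration only.
OCCURRENCES = {
    "PopA": {"m1", "m2"},
    "PopB": {"m1"},
    "PopC": {"m2"},
    "PopD": {"m1"},
}

positions = circle_layout(sorted(OCCURRENCES))
links = shared_mutation_links(OCCURRENCES, "m1")
highlighted = incident_links(links, "PopB")
```

Here `links` plays the role of the red links for the selected mutation, and `highlighted` corresponds to the incident links shown when one population is selected.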

One use of this visualization functionality is to treat the presence of a specific mutation in different populations as demographic evidence for their inter-dependencies. For instance, querying for the occurrence of the c.124_127delTTCT mutation leading to β-thalassemia shows that it is present only in South-East Asian and British populations, most likely because of the large population subgroups from Pakistan, India, Taiwan, etc. living in the United Kingdom. Similarly, the most common CFTR gene mutation, p.508delF, is predominant in the European populations (Fig. 6A). The same is true for the c.1677delTA mutation, which is almost exclusively present in European populations (Fig. 6B,C).

It is apparent from the above that such dynamic visualization tools could be used for studying human demographic history, drawing conclusions on the origin of specific mutations, and stratifying molecular diagnostic services for minority groups (for instance, patients from Pakistan and India in the UK).

3. Discussion

In this work, we present the integration of a novel visualization tool within the FINDbase 2.0 version, in an attempt to provide a state-of-the-art data and querying visualization environment for population-based mutation data collection and retrieval. Our primary goal is to offer a robust and stable tool that supports and broadens multidisciplinary knowledge among scientists worldwide. The lack of detailed disease and phenotypic descriptions for each genetic variant and of links to relevant patient organizations are deficiencies that, if addressed, would allow NEMDBs to better serve the clinical genetics community. To this end, the first efforts by several groups, including ours, have yielded some promising results [18].

Fig. 4. PivotViewer query example: Query output screen for retrieving the rare pathogenic mutations with frequencies from 5% to 30% in the Israeli population, sorted by gene name. The query returns 19 alleles for six genes, namely CFTR, FANCA, GBA, HBB, HEXA, and MEFV, which can be further separated according to either the Arab or the Ashkenazi Jewish ethnic group (highlighted area).

In the near future, we are planning to experiment with more querying and alternative visualization types and techniques able to manipulate multidimensional data, in order to create a global visualization framework allowing users to conduct multifaceted comparative studies of the querying results. We intend to explore alternatives to the Silverlight PivotViewer technology, since it limits the number of source items of a collection that can be displayed to approximately 6000. The FINDbase data collection includes about 4000 mutations at the moment but, as our collection grows, we are trying to find ways to overcome this limit. One alternative would be to use Linked Collections, which can handle much larger and frequently updated data sources without an upper size limit. Given a big collection of 60,000 items, for example, one can create 10 simple collections of 6000 items each and then create a Linked Collection that is a superset of the 10 small ones. We are also planning to use HTML 5 instead for the visualization, not only of the PivotViewer but of the whole web application, since HTML 5 is on track to become a standard for the development of Web multimedia applications in the near future.
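The arithmetic of the Linked Collection workaround can be illustrated with a short sketch: a large item list is cut into parts no larger than the display limit, and an index ties the parts together. This is only a sketch of the partitioning idea; it does not use or reproduce the PivotViewer collection format or API, and the helper names and the structure of the index are hypothetical.

```python
def split_into_collections(items, max_items=6000):
    """Cut a list into consecutive parts of at most max_items each."""
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    return [items[i:i + max_items] for i in range(0, len(items), max_items)]


def link_collections(collections):
    """Hypothetical stand-in for a linked collection: an index that
    names each part and records its size."""
    return {
        "total_items": sum(len(part) for part in collections),
        "parts": [
            {"name": f"collection_{i:02d}", "size": len(part)}
            for i, part in enumerate(collections)
        ],
    }


# The example from the text: 60,000 items, 6000 per simple collection.
items = list(range(60_000))
parts = split_into_collections(items)
linked = link_collections(parts)
```

With the current collection of about 4000 mutations the split yields a single part, so the approach only changes anything once the collection outgrows the limit.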

In addition, since the volume of data produced in biomedical research is growing exponentially, we intend to incorporate data mining techniques to further enhance and automate the discovery of valuable information hidden in large biological datasets. As the exploration and analysis of vast amounts of data become increasingly difficult, information visualization in conjunction with data mining techniques can help to deal with the flood of information. More specifically, for data mining to be effective, it is important to include the human in the data exploration process in order to utilize the flexibility, creativity and perception abilities of the human mind. Visual data exploration allows the user to be directly involved in the data mining process and to maintain a global view of large datasets. Visualization in data mining, known as visual data mining, is a novel and promising approach to exploratory data analysis that has emerged from the technological coupling of automated data mining algorithms and visualization techniques. The combination of automatic analysis methods and human perception promises more effective inspection and understanding of, and interaction with, huge data collections.

The incorporation of artificial intelligence methods is another idea, directed towards the development of an automated pattern extraction mechanism. Moreover, the extension of this tool to the FINDbase pharmacogenomic marker data collection is at the top of our priority list, as is data exposure through web services based on the OData protocol [19].

4. Materials and methods

FINDbase is a web application that records allele frequencies of causative genetic variations and pharmacogenomic markers, at a summary level [20], in various populations worldwide. The database contents are public and there are no registration requirements for data querying. FINDbase can be accessed at: http://www.findbase.org.

4.1. System architecture and database structure

The underlying structure of FINDbase is a relational database developed with Microsoft SQL Server, a flexible software product offering advanced capabilities in database development, manageability, and data warehousing. The database records on which the application is based include the population and ethnic group and/or the geographical region, the disorder name, the related gene name and its variation parameters, and the rare allele frequencies, accompanied by links to the respective OMIM and HGMD databases. The overall database schema is depicted in Fig. 7.
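As a rough illustration of how the record fields listed above could be arranged relationally, the sketch below builds a small three-table layout. It is not the schema of Fig. 7: every table and column name is an assumption, and SQLite (via Python's standard `sqlite3` module) is used only as a runnable stand-in for Microsoft SQL Server.

```python
import sqlite3

# Hypothetical layout; table and column names are assumptions.
SCHEMA = """
CREATE TABLE population (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    ethnic_group TEXT,
    region       TEXT
);
CREATE TABLE variation (
    id          INTEGER PRIMARY KEY,
    gene        TEXT NOT NULL,
    disorder    TEXT NOT NULL,
    description TEXT NOT NULL,
    omim_link   TEXT,
    hgmd_link   TEXT
);
CREATE TABLE allele_frequency (
    population_id INTEGER NOT NULL REFERENCES population(id),
    variation_id  INTEGER NOT NULL REFERENCES variation(id),
    frequency     REAL NOT NULL CHECK (frequency BETWEEN 0 AND 100),
    PRIMARY KEY (population_id, variation_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows for illustration only.
conn.execute("INSERT INTO population VALUES (1, 'PopA', NULL, 'RegionX')")
conn.execute(
    "INSERT INTO variation VALUES (1, 'GENE1', 'Disorder1', 'variant-1', NULL, NULL)"
)
conn.execute("INSERT INTO allele_frequency VALUES (1, 1, 2.5)")

# A summary-level query: frequency of each variation per population.
rows = conn.execute(
    """
    SELECT p.name, v.gene, v.description, f.frequency
    FROM allele_frequency AS f
    JOIN population AS p ON p.id = f.population_id
    JOIN variation AS v ON v.id = f.variation_id
    """
).fetchall()
```

Keeping frequencies in their own table, keyed by population and variation, reflects the summary-level nature of the data: one frequency per variation per population rather than one row per patient.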

The overall system architecture is based on a three-tier client–server model. The user interface, the functional process logic (“business rules”), and the computer data storage and data access are developed and maintained as independent modules. The system comprises three main components: the client application, the application server and the database server. The three-tier architecture is intended to allow any of the three tiers to be upgraded or replaced independently in response to changes in requirements or technology. The client application contains only the presentation logic. As a result, fewer resources are required on the client workstation and no client modification is needed if the database location changes. Changes to business logic are automatically enforced by the server, and possible future changes are restricted to the application server software that will have to be installed. The three-tier architecture is a robust model, flexible enough to aggregate multiple information sources and integrate modular development [21].
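The separation of tiers can be illustrated with a deliberately small sketch in which each tier only talks to the one directly below it. The class and function names and the sample rows are hypothetical, and the in-memory list stands in for the database server; the sketch shows the layering principle, not the FINDbase code.

```python
class DatabaseTier:
    """Data storage and access; here an in-memory stand-in."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetch_by_gene(self, gene):
        return [row for row in self._rows if row["gene"] == gene]


class ApplicationTier:
    """Business rules: normalizes input and orders the results."""

    def __init__(self, database):
        self._db = database

    def frequencies_for_gene(self, gene):
        rows = self._db.fetch_by_gene(gene.upper())
        return sorted(rows, key=lambda row: row["frequency"], reverse=True)


def render(rows):
    """Presentation logic only: formats rows for display on the client."""
    return [
        f'{row["population"]}: {row["gene"]} {row["frequency"]:.1f}%'
        for row in rows
    ]


# Made-up rows for illustration only.
db = DatabaseTier([
    {"gene": "GENE1", "population": "PopA", "frequency": 1.5},
    {"gene": "GENE1", "population": "PopB", "frequency": 4.0},
    {"gene": "GENE2", "population": "PopA", "frequency": 0.7},
])
app = ApplicationTier(db)
lines = render(app.frequencies_for_gene("gene1"))
```

Because `render` never touches `DatabaseTier` directly, the storage back end could be replaced without changing the client, which is the independence property the three-tier model is meant to provide.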

4.2. Technologies used

4.2.1. PivotViewer

FINDbase queries can be performed using the PivotViewer control [22], which runs in the Silverlight web browser plug-in. Microsoft Silverlight is an application framework for writing and running rich Internet applications, with features and purposes similar to those of Adobe Flash. PivotViewer was used to implement the main querying interface, since it leverages Deep Zoom [23], a technology for fast, smooth zooming on the Web. As a result, it displays full, high-resolution content without long loading times, while the animations and natural transitions provide context and prevent users from feeling overwhelmed by large quantities of information. The PivotViewer enables users to interact with thousands of objects at once, and to sort and browse data in a way that helps them see trends and quickly find what they are looking for.

Fig. 5. The Gene and Mutation Map: (A) The Gene and Mutation Map provides a visual depiction of the distribution of mutations per population, while it can also provide information on the allele frequencies per population and the occurrence of a specific mutation over all populations. (B) Display of the occurrence of HBB: c.315+1 G>A, a mutation leading to β-thalassemia. The query indicates that it is more prevalent in the European and Middle-East populations. (C) Another sample query, for the occurrence of HBB: c.124_127delTTCT, shows prevalence in Asian populations.

4.2.2. Flare visualization toolkit

FINDbase query output can also be visualized using an alternative interface based on the Flare Visualization Toolkit. Flare is an open-source library written in ActionScript 3 [24], an object-oriented programming language, for creating data visualizations that run in the Adobe Flash Player. It offers a wide variety of features, ranging from basic charts to complex interactive graphs, and supports data management, visual encoding, animation, and interaction techniques. Flare is a Flash version of its predecessor Prefuse [25], a visualization toolkit for Java. Instead of creating Java applications with complex visualizations, Flare offers the ability to develop thin-client, web-based, richly interactive environments. It has already been used in many well-known web-based visualization applications, such as the Many-Eyes website [26], built by the IBM Visual Communication Lab for user-contributed data visualization, as well as the BBC SuperPower website [27] for mapping the top 100 sites on the Internet.

Fig. 6. The Mutation Dependency Graph: (A) Query formulation for the occurrence of the CFTR gene mutation p.508delF. The generated graph shows that it is predominant in the European populations. (B–C) Similarly, the query output for the occurrence of the c.1677delTA CFTR gene mutation.

Fig. 7. Depiction of the FINDbase database schema.

We have used Flare to implement two alternative representations of the underlying data collection, namely the Gene and Mutation Map (see Section 2.2.1) and the Mutation Dependency Graph (see Section 2.2.2). They are implemented as Flash applications using the Adobe Flex platform and the Flex 3.6 software development kit, in order to be compatible with a wide variety of operating systems and browsers. The underlying data structures and functionalities have been implemented in ActionScript 3, and the Flex component library has been used to exploit its advanced capabilities. We chose Flash as our platform because it is nearly ubiquitous on a variety of internet-enabled desktop as well as mobile operating systems.

Fig. 8. Data entry process diagram.

The FINDbase data collection is composed of approximately 3000 entries. As a result, data preprocessing was a crucial step in our project in order to reduce the loading time, which plays an essential role in a web-based application running on a personal computer. For this purpose, a program written in the C# programming language receives the FINDbase dataset as input and outputs a data file that contains all the necessary relationships. This results in significantly lower response times as well as faster loading.
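The preprocessing idea can be sketched as follows. The program described above is written in C# and its input and output formats are not specified here, so this Python sketch is only an illustration under stated assumptions: the field names (`gene`, `mutation`, `population`, `frequency`), the JSON output format and the sample values are all hypothetical. It shows the general technique of resolving relationships once, offline, so that the client loads a ready-made structure instead of computing it at start-up.

```python
import json
from collections import defaultdict


def build_relationships(entries):
    """Group flat entries into gene -> mutation -> {population: frequency}."""
    tree = defaultdict(lambda: defaultdict(dict))
    for entry in entries:
        tree[entry["gene"]][entry["mutation"]][entry["population"]] = entry["frequency"]
    # Convert to plain dicts so the result serializes and compares predictably.
    return {gene: dict(mutations) for gene, mutations in tree.items()}


def serialize(entries):
    """Return the precomputed relationships as a JSON string for the client."""
    return json.dumps(build_relationships(entries), sort_keys=True)


# Made-up illustrative values, not FINDbase data.
SAMPLE = [
    {"gene": "GENE1", "mutation": "mut_a", "population": "Pop X", "frequency": 0.02},
    {"gene": "GENE1", "mutation": "mut_a", "population": "Pop Y", "frequency": 0.07},
    {"gene": "GENE2", "mutation": "mut_b", "population": "Pop X", "frequency": 0.01},
]

print(serialize(SAMPLE))
```

With the relationships nested in advance, a visualization only needs to look up a gene or mutation key rather than scan every entry, which is one plausible way the reported reduction in loading time could be obtained.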

4.3. Administration environment

The administration environment of FINDbase is another aspect of great importance. Researchers all over the world can concurrently contribute their data, and thus FINDbase can become a worldwide, up-to-date and comprehensive variation data repository. Moreover, the multiple information flows coming from different data submitters will improve the overall database quality and specifically data accuracy. FINDbase supports multiple user profiles, providing scaled access to the information, with users divided into four main groups according to their access rights, namely administrators, national coordinators, curators and advisors, as well as simple users.

Administrators have full access rights to all database functionalities and data. They are responsible for the activation of user accounts, and are assigned data entry and modification rights. One level below in the hierarchy, the national coordinators have data entry and modification rights, and can re-allocate data to another advisor/curator belonging to the country that they are in charge of. National coordinators are also responsible for managing the overall construction and maintenance of an NEMDB contributing to FINDbase. Advisors and curators have data entry and modification rights only for data that they have entered themselves, and under no circumstances can they alter data entered by another advisor/curator. If an advisor/curator wishes to end their involvement in FINDbase, their data will be allocated to another curator, who will then be responsible for their curation.
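The access rules above can be expressed as a small permission check. This is a hedged sketch rather than the FINDbase code: the `Role`, `User` and `Entry` types and both functions are hypothetical names, and the sketch assumes that a national coordinator's modification rights are limited to entries from their own country, which the text implies but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "administrator"
    NATIONAL_COORDINATOR = "national coordinator"
    CURATOR = "advisor/curator"
    SIMPLE_USER = "simple user"


@dataclass(frozen=True)
class User:
    name: str
    role: Role
    country: str


@dataclass(frozen=True)
class Entry:
    owner: str    # name of the advisor/curator who submitted the entry
    country: str  # country the entry belongs to


def can_modify(user, entry):
    """Return True if the user may modify the given entry."""
    if user.role is Role.ADMINISTRATOR:
        return True
    if user.role is Role.NATIONAL_COORDINATOR:
        # Assumption: coordinator rights are scoped to their own country.
        return user.country == entry.country
    if user.role is Role.CURATOR:
        # Curators may only touch data they entered themselves.
        return user.name == entry.owner
    return False


def can_reallocate(user, entry, new_owner):
    """Return True if the user may hand the entry over to new_owner."""
    if new_owner.role is not Role.CURATOR:
        return False
    if user.role is Role.ADMINISTRATOR:
        return True
    if user.role is Role.NATIONAL_COORDINATOR:
        # Re-allocation stays within the coordinator's country.
        return user.country == entry.country == new_owner.country
    return False


# Hypothetical users and entry, for illustration only.
admin = User("admin", Role.ADMINISTRATOR, "Country A")
coordinator = User("coord", Role.NATIONAL_COORDINATOR, "Country A")
curator_1 = User("curator_1", Role.CURATOR, "Country A")
curator_2 = User("curator_2", Role.CURATOR, "Country A")
foreign_curator = User("curator_3", Role.CURATOR, "Country B")
entry = Entry(owner="curator_1", country="Country A")

print(can_modify(curator_2, entry))
```

Keeping the rules in two pure functions makes each sentence of the policy traceable to one branch, and a change to the hierarchy would only touch these functions.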

Data entry and modification in FINDbase is possible only for registered users. Over the last few months, we have redesigned the data entry page in an effort to make it more user-friendly. Data entry is conducted by uploading an Excel spreadsheet, from which the data are automatically extracted into the system. More specifically, the data entry process is based on the role hierarchy and is depicted in Fig. 8. When an advisor/curator uploads an Excel spreadsheet, the submitted data are temporarily stored in the system (they are not displayed by the visualization tools until after the administrator's or the national coordinator's approval). During the uploading stage, an automatic process that runs in the background applies validation rules to the submitted data to verify that they conform to the appropriate data types. When erroneous values are detected, the system notifies the user and prompts them to correct the values in order to complete the uploading process. Upon successful submission, the administrator and the national coordinator corresponding to the advisor's/curator's country of origin are automatically notified to review the submitted data against certain content criteria (minimum number of chromosomes studied, justification of data submission, statistical analysis of the findings and means of genotyping). In particular, they have full access to all the uploaded data according to their privileges and can modify them when necessary. They are responsible for the identification and correction of any data discrepancies in order to maintain data accuracy and uniformity. Furthermore, a control for the detection of similar or duplicate entries is available. Each time such entries are detected, the appropriate user is notified and can select the correct entry to be published in the whole visualization environment. Once the data are approved, they become part of the main FINDbase data collection.
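The submission workflow described above (type validation on upload, a pending state until approval, and a duplicate check) can be sketched as follows. This is an illustration under assumptions, not the FINDbase implementation: the field names, the status labels, the rule that a frequency is a number between 0 and 1, and the use of gene, mutation and population as the duplicate key are all hypothetical. The reviewer's content criteria are not modelled.

```python
def validate_row(row):
    """Return a list of data-type errors for one spreadsheet row."""
    errors = []
    for name in ("gene", "mutation", "population"):
        if not isinstance(row.get(name), str) or not row[name].strip():
            errors.append(f"{name}: expected non-empty text")
    chromosomes = row.get("chromosomes")
    if not isinstance(chromosomes, int) or isinstance(chromosomes, bool) or chromosomes < 1:
        errors.append("chromosomes: expected a positive integer")
    frequency = row.get("frequency")
    is_number = isinstance(frequency, (int, float)) and not isinstance(frequency, bool)
    if not is_number or not 0 <= frequency <= 1:
        errors.append("frequency: expected a number between 0 and 1")
    return errors


def entry_key(row):
    """Hypothetical identity of an entry, used for duplicate detection."""
    return (row["gene"], row["mutation"], row["population"])


def submit(rows):
    """Validate an upload; valid uploads are held as pending, not published."""
    errors = {}
    for index, row in enumerate(rows):
        row_errors = validate_row(row)
        if row_errors:
            errors[index] = row_errors
    if errors:
        return {"status": "needs_correction", "errors": errors, "rows": []}
    return {"status": "pending", "errors": {}, "rows": list(rows)}


def find_duplicates(submission, published):
    """Return keys of submitted rows that match already published entries."""
    existing = {entry_key(row) for row in published}
    return [entry_key(row) for row in submission["rows"] if entry_key(row) in existing]


def approve(submission, reviewer_role, published):
    """Publish a pending submission after review by an authorised role."""
    if reviewer_role not in {"administrator", "national coordinator"}:
        raise PermissionError("only administrators and national coordinators approve data")
    if submission["status"] != "pending":
        raise ValueError("only pending submissions can be approved")
    published.extend(submission["rows"])
    submission["status"] = "approved"


# Made-up illustrative row, not FINDbase data.
GOOD_ROW = {"gene": "GENE1", "mutation": "mut_a", "population": "Pop X",
            "chromosomes": 200, "frequency": 0.05}

published = []
pending = submit([GOOD_ROW])
approve(pending, "national coordinator", published)
print(pending["status"], len(published))
```

Separating `submit` from `approve` reflects the point that uploaded data are stored but stay invisible to the visualization tools until an administrator or national coordinator has reviewed them.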

Acknowledgments

We are indebted to all FINDbase users worldwide for their valuable comments and suggestions, which helped us to keep the information as up to date and complete as possible and also contributed to the continuous improvement of the database profile and contents. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 200754 (the GEN2PHEN project) to GPP and from the Golden Helix Institute of Biomedical Research.

References

[1] G.P. Patrinos, A.J. Brookes, DNA, disease and databases: disastrously deficient, Trends Genet. 21 (2005) 333–338.

[2] C. Mitropoulou, A.J. Webb, K. Mitropoulos, A.J. Brookes, G.P. Patrinos, Locus-specific database domain and data content analysis: evolution and content maturation toward clinical use, Hum. Mutat. 31 (10) (2010) 1109–1116.

[3] The human phenotype ontology (HPO), Available: http://www.human-phenotype-ontology.org/, 14/3/2012.

[4] GEN2PHEN, Available: http://www.gen2phen.org/, 14/3/2012.

[5] Human variome project, Available: http://www.humanvariomeproject.org/, 14/3/2012.

[6] T.D. Smith, R.G. Cotton, VariVis: a visualisation toolkit for variation databases, BMC Bioinformatics 9 (2008) 206.

[7] S. Zaimidou, S. van Baal, T.D. Smith, K. Mitropoulos, M. Ljujic, D. Radojkovic, R.G. Cotton, G.P. Patrinos, A1ATVar: a relational database of human SERPINA1 gene variants leading to alpha1-antitrypsin deficiency, Hum. Mutat. 30 (2009) 308–313.

[8] P. Riikonen, M. Vihinen, MUTbase: maintenance and analysis of distributed mutation databases, Bioinformatics 15 (1999) 852–859.

[9] G.P. Patrinos, National and ethnic mutation databases: recording populations' genography, Hum. Mutat. 27 (2006) 879–887.

[10] S. van Baal, P. Kaimakis, M. Phommarinh, D. Koumbi, H. Cuppens, F. Riccardino, M. Macek Jr., C.R. Scriver, G.P. Patrinos, FINDbase: a relational database recording frequencies of genetic defects leading to inherited disorders worldwide, Nucleic Acids Res. 35 (2007) D690–D695.

[11] M. Georgitsi, E. Viennas, D.I. Antoniou, V. Gkantouna, S. van Baal, E.F. Petricoin III, K. Poulas, G. Tzimas, G.P. Patrinos, FINDbase: a worldwide database for genetic variation allele frequencies updated, Nucleic Acids Res. 39 (2011) D926–D932.

[12] Microsoft live labs: Pivot, Available: http://www.microsoft.com/silverlight/pivotviewer/, 14/3/2012.

[13] Flare: data visualization for the Web, Available: http://flare.prefuse.org/, 14/3/2012.

[14] GeneCards, Available: http://www.genecards.org/, 14/3/2012.

[15] Online Mendelian inheritance in man (OMIM), Available: http://www.ncbi.nlm.nih.gov/omim, 14/3/2012.

[16] The human gene mutation database (HGMD), Available: http://www.hgmd.cf.ac.uk/ac/index.php, 14/3/2012.

[17] Treemap, Available: http://www.cs.umd.edu/hcil/treemap/, 14/3/2012.

[18] J. van Baal, J. Zlotogora, G. Lagoumintzis, V. Gkantouna, G. Tzimas, K. Poulas, A. Tsakalidis, G. Romeo, G.P. Patrinos, ETHNOS: a versatile electronic tool for the development and curation of national genetic databases, Hum. Genomics 4 (2010) 361–368.

[19] Open data protocol, Available: http://www.odata.org/, 14/3/2012.

[20] A. Gialluisi, T. Pippucci, Y. Anikster, U. Ozbek, M. Medlej-Hashim, A. Mégarbané, G. Romeo, Estimating the allele frequency of autosomal recessive disorders through mutational records and consanguinity: the homozygosity index (HI), Ann. Hum. Genet. 76 (2012) 159–167.

[21] Wayne W. Eckerson, Three tier client/server architecture: achieving scalability, performance, and efficiency in client server applications, Open Information Systems, 10, 1995, 1 (January 1995): 3(20).

[22] Microsoft live labs: Pivot–Silverlight Control, Available: http://www.silverlight.net/learn/data-networking/pivot-viewer/pivotviewer-control, 14/3/2012.

[23] Deep zoom, Available: http://www.microsoft.com/silverlight/deep-zoom/, 14/3/2012.

[24] ActionScript 3, Available: http://www.adobe.com/devnet/actionscript.html, 14/3/2012.

[25] The Prefuse visualization toolkit, Available: http://prefuse.org/, 14/3/2012.

[26] Many-eyes, Available: http://www-958.ibm.com/software/data/cognos/manyeyes/, 14/3/2012.

[27] BBC SuperPower: visualizing the Internet, Available: http://news.bbc.co.uk/2/hi/technology/8562801.stm, 14/3/2012.