Database in Healthcare

Chapter 5: Databases in Healthcare

5.1 Databases and Types of Data Structure

Databases are collections of data with a specific, well-defined structure and purpose. In hospitals, databases are the spinal cord of hospital information systems. Databases in healthcare are collections of health data. The programs used to develop and manipulate these data are called Database Management Systems (DBMS).

Are these databases?
- An Excel file with the names and medication of patients within a hospital
- A nurse's agenda with to-dos
- A schedule of the shifts for next week
- A list of the medicines available
- The medical record of a patient

Databases are structured collections of data, so what decides the answer is not the items listed in this textbox but the way these data are organized.

Types of Data Structure
- Flat data
- Hierarchical data
- Relational data
- Object-oriented data
- NoSQL databases

Data entered into electronic health records are most often stored in relational or object-oriented databases or, in a few recent cases, in NoSQL databases.

5.1.1 Flat Files

A flat file can be a plain text file, usually containing one record per line, or it can be a binary file. Most existing software offers easy access to flat data files, and for simple data flat databases work well. Their drawbacks:
- They waste computer storage, because every record must keep space for items that are not logically applicable to it.
- They are not suited to complicated queries.
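The storage waste and the lack of query support can be illustrated with a minimal sketch that uses Python's standard csv module. The column names and the three patient records are invented for illustration; they do not come from any real system.

```python
import csv
import io

# Hypothetical flat file: one record per line, every record carries every column.
FLAT_FILE = """patient_id,name,sex,pregnancy_week,medication
1,Alice,F,12,folic acid
2,Bob,M,,aspirin
3,Carol,F,,
"""

rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))

# Count fields that are stored but empty, e.g. pregnancy_week for a male patient.
empty_fields = sum(1 for row in rows for value in row.values() if value == "")
total_fields = sum(len(row) for row in rows)

# Even a simple question needs a hand-written scan over every line.
on_aspirin = [row["name"] for row in rows if row["medication"] == "aspirin"]

print(f"{empty_fields} of {total_fields} stored fields are empty")
print("Patients on aspirin:", on_aspirin)
```

Every record reserves a slot for pregnancy_week even when the item cannot apply, and any question about the data has to be answered by application code that reads the whole file.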

5.1.2 Hierarchical Models

Hierarchical models are data models in which lower-level items inherit from the higher-level items they belong to. An example of a hierarchical structure is the arrangement of folders and subfolders on our computers.

Does this structure actually facilitate the real life health process?

Pros: Actions on parents save time, since they affect all children.
Cons: In the real healthcare world most relationships are not hierarchical.
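A small sketch of both points, assuming a made-up hierarchy of hospital, ward and patient folders (the Node class and the archive action are hypothetical and only serve the illustration):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One item in a hierarchy; each child has exactly one parent."""
    name: str
    archived: bool = False
    children: List["Node"] = field(default_factory=list)

    def archive(self) -> None:
        # An action on a parent propagates to all of its descendants.
        self.archived = True
        for child in self.children:
            child.archive()


# Hypothetical hierarchy: hospital -> ward -> patient folders.
patient_a = Node("patient A")
patient_b = Node("patient B")
ward = Node("cardiology ward", children=[patient_a, patient_b])
hospital = Node("hospital", children=[ward])

ward.archive()
print(patient_a.archived, patient_b.archived, hospital.archived)
```

One call on the ward reaches every patient folder below it. The limitation shows as soon as a patient is treated by two wards: the structure offers no natural place for an item with two parents.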

5.1.3 The Relational Database Model

Major Elements
- The database is a collection of tables, which represent entities and relationships
- The table name is the title of the relation
- Columns represent the characteristics of the entity
- Rows represent data

5.2 Relational Databases

The principles of relational databases can be summarized in the following points:

- Data in a relational database are the values stored in the database. Data alone are useless.
- Relational databases are composed of a set of tables.
- Each table includes records, which are the table rows, and fields, which are the table columns.
- Fields can be of various data types: alphanumeric, numeric, date-time, Boolean etc.
- We access a table record using keys. The key that uniquely identifies a record is the primary key.
- An index is the physical mechanism that improves database efficiency. It is part of the physical structure of the database and is not related to keys, which are part of the logical structure.
- In a relational database, a view is a virtual table composed of a subset of the actual tables.
- In relational databases there exist one-to-one, one-to-many and many-to-many relationships.
- The term data integrity describes the accuracy, validity and consistency of the data.
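The following sketch puts these principles together using SQLite through Python's standard sqlite3 module. The patient and prescription tables, their columns and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,      -- primary key: uniquely identifies a record
    name       TEXT NOT NULL,
    birth_date TEXT
);
CREATE TABLE prescription (
    prescription_id INTEGER PRIMARY KEY,
    patient_id      INTEGER NOT NULL REFERENCES patient(patient_id),  -- one-to-many
    drug            TEXT NOT NULL
);
-- Index: physical structure that speeds up lookups; the logical model is unchanged.
CREATE INDEX idx_prescription_patient ON prescription(patient_id);
-- View: a virtual table defined over the actual tables.
CREATE VIEW patient_drugs AS
    SELECT p.name, r.drug
    FROM patient p JOIN prescription r ON r.patient_id = p.patient_id;
""")

conn.execute("INSERT INTO patient VALUES (1, 'Alice', '1980-04-02')")
conn.executemany(
    "INSERT INTO prescription (patient_id, drug) VALUES (?, ?)",
    [(1, "aspirin"), (1, "metformin")],
)

drugs = conn.execute("SELECT name, drug FROM patient_drugs ORDER BY drug").fetchall()
print(drugs)

# Data integrity: a prescription for a non-existent patient is rejected.
try:
    conn.execute("INSERT INTO prescription (patient_id, drug) VALUES (99, 'insulin')")
    integrity_enforced = False
except sqlite3.IntegrityError:
    integrity_enforced = True
print("foreign key enforced:", integrity_enforced)
```

Dropping the index would change only the speed of the lookup, not the result of any query, which is the sense in which indexes belong to the physical rather than the logical structure.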

An example of a relational database

Advantages of the Relational Schema
- Databases can be examined from many different perspectives
- No need to enter missing information for variables that are not logically possible
- Easy to modify, because adding new entities involves adding new tables rather than altering old ones (provided that the database is adequately normalized)

Normalization in Relational Databases
- Normalization is the process in which insufficiently normalized schemas are split into smaller schemas with more desirable characteristics.
- With normalization, we minimize anomalies during data entry, update and deletion.
- Normal forms provide the methodological framework for analyzing the database schema based on the database keys and the functional dependencies.

Edgar Codd, 1923-2003: The father of normalization (have a look at his 1971 paper)

Normalization: Easy-to-Remember Rules
- Every characteristic belongs to the entity it characterizes.
- Every characteristic exists only once in a database.
- Keys fully define the records.
- Each value of the same characteristic is stored in the database only once.
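As a sketch of why these rules matter, the example below compares an unnormalized table with a normalized pair of tables in SQLite (via Python's sqlite3 module). The table names, the ward phone numbers and the patients are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized: the ward phone number is repeated on every patient row.
CREATE TABLE admission_flat (patient TEXT, ward TEXT, ward_phone TEXT);
INSERT INTO admission_flat VALUES
    ('Alice', 'Cardiology', '555-0100'),
    ('Bob',   'Cardiology', '555-0100');

-- Normalized: the phone number belongs to the ward and is stored once.
CREATE TABLE ward (ward TEXT PRIMARY KEY, ward_phone TEXT);
CREATE TABLE admission (patient TEXT PRIMARY KEY, ward TEXT REFERENCES ward(ward));
INSERT INTO ward VALUES ('Cardiology', '555-0100');
INSERT INTO admission VALUES ('Alice', 'Cardiology'), ('Bob', 'Cardiology');
""")

# Update anomaly: changing the phone on one row leaves the flat table inconsistent.
conn.execute("UPDATE admission_flat SET ward_phone = '555-0199' WHERE patient = 'Alice'")
flat_phones = conn.execute(
    "SELECT DISTINCT ward_phone FROM admission_flat "
    "WHERE ward = 'Cardiology' ORDER BY ward_phone"
).fetchall()

# In the normalized schema a single update is seen by every patient of the ward.
conn.execute("UPDATE ward SET ward_phone = '555-0199' WHERE ward = 'Cardiology'")
normalized_phones = conn.execute(
    "SELECT DISTINCT w.ward_phone FROM admission a JOIN ward w ON w.ward = a.ward"
).fetchall()

print("flat table:", flat_phones)
print("normalized:", normalized_phones)
```

The flat table ends up with two different phone numbers for the same ward, while the normalized schema can only ever hold one.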

5.3 Database categories found in healthcare

5.3.1 Distributed Databases in healthcare
- Data are kept in different settings and on different computers.
- Since the data produced are huge, replicating and distributing databases improves database performance in healthcare settings.
- Distributed databases need to address both the location of the data and the audit log, that is, a chronological record of sources and destinations that provides documentary evidence of the sequence of activities that have affected a specific procedure at any time.
Advantages
- Data loss is limited to the nodes affected, which is critical for healthcare
- Since they are decentralized, they are more flexible and allow different units to update and maintain their own data

5.3.2 Large Healthcare Utilization Databases
- They are used to study the use and outcome of treatments
- Their huge size allows the study of rare events
- Since they represent routine clinical care, they can address real-world effectiveness and utilization patterns

5.3.3 BLOBs: Binary Large Objects
- Very frequent in healthcare settings
  - Images (CT, MRI)
  - Audio (heartbeat sequences)
  - Video (ultrasounds)
- The dilemma: should we move these data to data warehouses or keep them at their source?

5.3.4 Data-less databases
- They are distributed databases that are set up without any data until the need for data arises
- They may be useful in healthcare
  - Less expensive than centralized registries (they require no equipment and little personnel)
  - Use of the system does not require vague and time-independent patient consents
  - The system does not require duplication of data in different databases

5.3.5 Object Oriented Data Models
- They can be more efficient
- They use real-life objects (entities)
- They can be queried with SQL or SQL-like languages
- Much higher programming flexibility, since the database can be integrated with object oriented programming languages (e.g. Java, C#)
- Not yet fully standardized, but this is ongoing
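A minimal sketch of the object oriented idea in Python: data and behaviour are kept together in classes that mirror real-life entities. The Patient and Observation classes and the "heart_rate" label are hypothetical, and the example only builds the objects in memory; it does not show the API of any particular object database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    code: str      # hypothetical label, not a real terminology code
    value: float
    unit: str


@dataclass
class Patient:
    name: str
    observations: List[Observation] = field(default_factory=list)

    def latest(self, code: str) -> Observation:
        # Behaviour lives with the data: the object answers questions about itself.
        matches = [o for o in self.observations if o.code == code]
        if not matches:
            raise KeyError(code)
        return matches[-1]


patient = Patient("Alice")
patient.observations.append(Observation("heart_rate", 72.0, "bpm"))
patient.observations.append(Observation("heart_rate", 80.0, "bpm"))
print(patient.latest("heart_rate"))
```

An object database would store such objects directly, so the program works with the same structures in memory and on disk instead of translating them to and from tables.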

Example of object oriented model

Chapter 6: A comparison between SQL and NoSQL databases

6.1 The Structured Query Language (SQL)

6.1.1 History of the SQL Standards
SQL is standardized as ISO/IEC 9075 and is the most widely used language for relational DBMSs. New versions of the standard are published every few years. Some of the most important updates include the following features:
- 1987: Initial ISO/IEC standard
- 1989: Referential integrity
- 1999: SQL99, Call-Level Interface; standardized communication of queries with the DBMS
- 2003: XML functionality
- 2008: New expansions and updates

6.1.2 SQL Characteristics
- Data stored in columns and tables
- Relationships represented by data
- Data Manipulation Language and Data Definition Language
- Transactions
- Abstraction from the physical layer
  - SQL is independent of the applications that use the data
  - Applications specify what, not how
  - The physical layer can change without modifying applications

6.1.2.1 Data Definition Language (DDL)
- Schema defined at the start
    CREATE TABLE TableName (Col1 type1, Col2 type2)
- Constraints to define and enforce relationships (primary key, foreign key)
- ALTER, DROP
- Security and access

6.1.2.2 Data Manipulation Language (DML)
Data are manipulated with SELECT, INSERT, UPDATE and DELETE statements:

    SELECT T1.Column1, T2.Column2
    FROM Table1 T1, Table2 T2
    WHERE T1.Column1 = T2.Column1
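The DDL and DML statements can be tried out with SQLite through Python's sqlite3 module. This is a sketch under assumptions: the Label column, the column types and the sample values are invented, and the aliases T1 and T2 are declared in the FROM clause so that the join over Table1 and Table2 runs as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema before any data is inserted.
conn.execute("CREATE TABLE Table1 (Column1 INTEGER PRIMARY KEY, Label TEXT)")
conn.execute(
    "CREATE TABLE Table2 ("
    " Column1 INTEGER REFERENCES Table1(Column1),"
    " Column2 TEXT)"
)

# DML: insert, update, delete and select.
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO Table2 VALUES (?, ?)", [(1, "x"), (2, "y"), (2, "z")])
conn.execute("UPDATE Table2 SET Column2 = 'w' WHERE Column2 = 'z'")
conn.execute("DELETE FROM Table2 WHERE Column2 = 'x'")

result = conn.execute(
    "SELECT T1.Column1, T2.Column2 "
    "FROM Table1 T1, Table2 T2 "
    "WHERE T1.Column1 = T2.Column1 "
    "ORDER BY T2.Column2"
).fetchall()
print(result)
```

The query states only which rows are wanted; how the two tables are scanned and matched is left to the DBMS.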

6.1.3 Transactions
ACID Properties of SQL Transactions
- Atomic: all of the work in a transaction completes (commit) or none of it completes
- Consistent: a transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
- Isolated: the results of changes made during a transaction are not visible until the transaction is over
- Durable: the results of a committed transaction survive failures
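Atomicity and consistency can be observed in a short sketch with SQLite and Python's sqlite3 module. The stock table, the two locations and the transfer function are hypothetical; the CHECK constraint plays the role of the consistency rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (location TEXT PRIMARY KEY, units INTEGER CHECK (units >= 0))"
)
conn.executemany("INSERT INTO stock VALUES (?, ?)", [("ward_a", 5), ("ward_b", 0)])
conn.commit()


def transfer(units: int) -> bool:
    """Move units from ward_a to ward_b as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE stock SET units = units + ? WHERE location = 'ward_b'", (units,)
            )
            conn.execute(
                "UPDATE stock SET units = units - ? WHERE location = 'ward_a'", (units,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(3)        # valid: both updates are committed together
failed = transfer(10)   # would drive ward_a below zero: both updates are undone
balances = dict(conn.execute("SELECT location, units FROM stock"))
print(ok, failed, balances)
```

In the failed transfer the first update has already been executed when the constraint is violated, yet after the rollback neither change is visible.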

6.2 NoSQL Databases (Not Only SQL)

6.2.1 NoSQL Definition (www.nosql-database.org)
Next-generation databases mostly addressing some of the following points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication, simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more.

The term NoSQL usually refers to modern database systems that use document stores, key-value stores, XML databases, graph databases, column stores, object stores, etc.

- These databases assume that data storage does not require fixed table schemas
- NoSQL systems are those database management systems that do not adhere to the widely used SQL relational database management system model
- NoSQL has become well known with the advent of web-scale data and the systems built by Google, Facebook, Amazon, Twitter etc. to manage such data
- There are over one hundred different NoSQL databases

6.2.2 NoSQL: multiple types based on the architecture

1. Key-value databases. These are based on a hash table of keys.
2. Document-based systems (e.g. MongoDB). These store documents made up of tagged elements.
3. Column family systems. Each storage block contains data from only one column.
4. Graph databases.

6.2.1.1 Column Store types of NoSQL databases
- Each storage block contains data from only one column.
- When multiple rows are inserted at once, column store databases can perform better than traditional row-by-row insert methods.

Examples of column store NoSQL databases:
- Hadoop/HBase (http://hadoop.apache.org/): Yahoo, Facebook
- Ingres VectorWise (http://www.ingres.com/products/vectorwise): column store integrated with an SQL database
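The column-oriented layout can be pictured with plain Python lists. This is a toy illustration with invented vital-sign records, not the storage format of any of the products named here.

```python
# Hypothetical vital-sign records, first in row-oriented form.
rows = [
    {"patient_id": 1, "heart_rate": 72, "temperature": 36.6},
    {"patient_id": 2, "heart_rate": 88, "temperature": 37.9},
    {"patient_id": 3, "heart_rate": 64, "temperature": 36.4},
]

# Column-oriented form: each "storage block" holds the values of one column only.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Inserting several rows at once appends one batch per column block.
new_rows = [
    {"patient_id": 4, "heart_rate": 91, "temperature": 38.2},
    {"patient_id": 5, "heart_rate": 70, "temperature": 36.8},
]
for name, block in columns.items():
    block.extend(row[name] for row in new_rows)

# A query on one column reads a single block and never touches the others.
mean_heart_rate = sum(columns["heart_rate"]) / len(columns["heart_rate"])
print(mean_heart_rate)
```

Batching pays off because each column block is touched once per batch rather than once per inserted row.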

6.2.1.2 Document Store or Document Oriented types of NoSQL databases
These assume that documents encapsulate and encode data in standard formats or encodings.

Examples of document store NoSQL databases:
- CouchDB: http://couchdb.apache.org/
- MongoDB: http://www.mongodb.org/
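A document store can be pictured as a collection of self-describing documents. The sketch below uses JSON and Python's standard json module; the documents, their fields and the find helper are hypothetical and do not reproduce the API of CouchDB or MongoDB.

```python
import json

# Each document is self-describing; documents in one collection need not share fields.
collection = [
    json.loads('{"_id": 1, "name": "Alice", "allergies": ["penicillin"]}'),
    json.loads('{"_id": 2, "name": "Bob", "implant": {"type": "pacemaker", "year": 2019}}'),
]


def find(documents, field_name):
    """Return the documents that contain the given top-level field."""
    return [doc for doc in documents if field_name in doc]


with_implant = find(collection, "implant")
print([doc["name"] for doc in with_implant])
```

No schema had to be declared before the second document introduced a nested "implant" field, which is what schema-free means in practice.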

6.2.1.3 Key-Value Store types of NoSQL databases
- Hash tables of keys
- Values stored with keys
- Fast access to small data values

Examples of key-value store NoSQL databases:
- MemcacheDB: http://memcachedb.org/
- Project Voldemort: http://www.project-voldemort.com/
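The key-value idea reduces to a hash table, which a Python dict provides directly. The KeyValueStore class and the key naming scheme below are hypothetical and only sketch the concept; real products add persistence, replication and networking.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        # Lookup is by exact key only; the store knows nothing about the value.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("patient:1:blood_type", b"A+")
print(store.get("patient:1:blood_type"), store.get("patient:2:blood_type"))
```

Access by key is fast, but there is no way to ask for "all patients with blood type A+" without scanning every key, because the store treats values as opaque.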

- Document store: BaseX, Clusterpoint, Apache Couchbase, eXist, Jackrabbit, Lotus Notes and IBM Lotus Domino LotusScript, MarkLogic Server, MongoDB, OpenLink Virtuoso, OrientDB, RavenDB, SimpleDB, Terrastore
- Graph: AllegroGraph, DEX, FlockDB, InfiniteGraph, Neo4j, OpenLink Virtuoso, OrientDB, Pregel, Sones GraphDB, OWLIM
- Key-value: BigTable, CDB, Keyspace, LevelDB, membase, MemcacheDB, MongoDB, OpenLink Virtuoso, Tarantool, Tokyo Cabinet, TreapDB, Tuple space
- Eventually consistent: Apache Cassandra, Dynamo, Hibari, OpenLink Virtuoso, Project Voldemort, Riak
- Hierarchical: GT.M, InterSystems Caché
- Tabular: BigTable, Apache Hadoop, Apache HBase, Hypertable, Mnesia, OpenLink Virtuoso
- Object database: db4o, Eloquera, GemStone/S, InterSystems Caché, JADE, NeoDatis ODB, ObjectDB, Objectivity/DB, ObjectStore, OpenLink Virtuoso, Versant Object Database, Wakanda, ZODB
- Multivalue databases: Extensible Storage Engine (ESE/NT), jBASE, OpenQM, OpenInsight, Rocket U2, D3 Pick database, InterSystems Caché, InfinityDB
- Tuple store: Apache River, OpenLink Virtuoso, Tarantool

Many NoSQL DBMSs are currently available

6.2.2 NoSQL Distinguishing Characteristics

- Large data volumes (e.g. Google's big data)
- Scalable replication and distribution
  - Potentially thousands of machines
  - Potentially distributed around the world
- Queries need to return answers quickly
- Mostly query based, few updates
- Asynchronous inserts and updates
- Schema-less
- Open source development

6.2.3 CAP Theorem: in distributed databases you can only choose two out of three

Brewer's CAP Theorem: a distributed system can support only two of the following three characteristics:
- Consistency: all nodes see the same data at the same time
- Availability: node failures do not prevent survivors from continuing to operate
- Partition tolerance: operations will complete even if individual components are unavailable

6.2.4 Storing and Modifying Data in NoSQL databases
- Syntax varies (e.g. Java, HTML)
- Asynchronous: inserts and updates do not wait for confirmation
- Versioned
- Optimistic concurrency: multiple transactions can complete without affecting each other

6.2.5 Querying data in NoSQL
- Syntax varies
  - No set-based query language
  - Procedural programming languages such as Java, C, etc.
- Application specifies the retrieval path
- No query optimizer
- Speed is prioritized over accuracy
- There may not be a single right answer
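The versioning and optimistic concurrency mentioned in 6.2.4 can be sketched as follows. The VersionedStore class is a hypothetical toy and does not correspond to any product's API: nothing is locked, every write names the version it was based on, and a write based on a stale version is rejected rather than silently overwriting newer data.

```python
class VersionConflict(Exception):
    pass


class VersionedStore:
    """Toy store where every write must name the version it was based on."""

    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        return self._items.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else wrote in between: reject instead of overwriting.
            raise VersionConflict(key)
        self._items[key] = (current_version + 1, value)


store = VersionedStore()
store.write("patient:1:weight", 70, expected_version=0)

# Two clients read the same version, then both try to write.
version_a, _ = store.read("patient:1:weight")
version_b, _ = store.read("patient:1:weight")
store.write("patient:1:weight", 71, expected_version=version_a)
try:
    store.write("patient:1:weight", 69, expected_version=version_b)
    conflict = False
except VersionConflict:
    conflict = True

print(store.read("patient:1:weight"), conflict)
```

The approach is called optimistic because it assumes conflicts are rare: most writes succeed without waiting, and only the occasional loser has to re-read and retry.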

6.3 The debate: SQL vs noSQL

The two most significant differences between SQL and NoSQL are:
- Scaling: SQL databases do not lend themselves to massively parallel processing, which leads to larger computers (scale up), whereas NoSQL databases distribute the load across numerous commodity servers, virtual machines or cloud instances (scale out).
- Modeling: SQL databases are highly normalized and require pre-defined data models before data can be inserted into the system. In contrast, NoSQL databases do not require pre-defined data models.

6.3.1 SQL Positive and Negative Points

Positive
- Advanced data aggregation options, statistics and reports at data level (integrated)
- High-performance OLTP databases
- Good transaction features
- Complex SQL queries possible for diverse cases
- A large array of tools and compatible software
- Data-application independence

Weaknesses
- SQL complexity and cost for large solutions
- Not so fast to develop
- The learning curve is not low
- Limited data volume per server (500 GB is quoted as the maximum)
- Scalability issues
- Performance issues
- Maintenance issues

6.3.2 NoSQL Positive and Negative Points

Positive
- Rapid development and easy to program
- They support insert, delete and select functions
- Faster performance (compared to SQL) and high read-write throughput
- NoSQL solutions may handle huge BLOBs
- NoSQL solutions may have sufficient querying possibilities
- Good for constantly changing data
- Efficient in horizontal scalability

Weaknesses
- Lack of relations between one key and another
- Often little or no built-in security or authentication of users
- Data storage cannot be used efficiently for analytics, aggregations and reporting

SQL Databases                             | NoSQL Databases
Predefined schema required                | Predefined schema is not required or does not exist
Standard definition & interface language  | Definition and interface language depend on the product
Great consistency, well-defined semantics | Not always consistent
Consistent and accurate                   | Getting an answer quickly is more important than getting a correct answer

Summary of differences between SQL and NoSQL

6.4 Implementations of noSQL databases in healthcare

- Not so many yet
- Most NoSQL products are still in beta and largely open source, lacking in support.
- Some say that medical applications are inevitably going to be extremely conservative, because people could die if the IT system fouls up.
- But this is the future: Electronic Health Records can use NoSQL.

Some characteristics of future Health Information Systems are expected to support such applications. This is because healthcare has:
- Semantic interoperability (3M HDD, SNOMED, LOINC, HL7, ICD 9 & 10, RxNorm, CPT, etc.), metadata and master data management (EMPI, providers, organizations, locations, devices, etc.)
- Cloud-based architecture
- Standardized or flexible data models
- Health Information Systems that are distributed and moving to the cloud

Examples of Electronic Health Records cited as NoSQL-based: VistA, CHCS, AHLTA, Epic, Cerner etc.
