Storage Solutions for Bioinformatics
Li Yan, Director of FlexLab, Bioinformatics Core Technology Laboratory
[email protected]
http://www.genomics.cn/FlexLab/index.html
Science and Technology Division, BGI-Shenzhen
Mar 29, 2015
OUTLINE
• Background
• Hardware Infrastructure of Data Storage
• Data Management
• Data Storage Architecture In BGI
• Distributed Computing on Storage Server
Background: Fast Growing Big Data
Sequencing, sequencing and sequencing
• From small genomes to large complex genomes
  E. coli genome: 4.9 Mb
  Caenorhabditis elegans genome: 100 Mb
  Human genome: 3 Gb
  Wheat genome: 16 Gb
  Salamander genome: 45 Gb
• From one sample to populations
  Human genome: 3 billion DNA subunits (A, T, C, G)
  80~100X sequencing: 600 GB of raw data for an individual study
  1000 Genomes Project: 600 TB of raw data for a population study
• From first-generation sequencing to second-generation sequencing
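The raw data figures above can be roughly reproduced with simple arithmetic. The sketch below assumes about 2 bytes of uncompressed raw data per sequenced base (one character for the base and one for its quality score) and a round cohort of 1000 samples; both numbers are illustrative assumptions, not values stated on the slide.

```python
def raw_data_bytes(genome_bases: int, depth: float, bytes_per_base: float = 2.0) -> float:
    """Estimate raw sequencing data volume in bytes.

    bytes_per_base=2.0 is an assumption: one byte for the base call and
    one byte for its quality score in an uncompressed text format.
    """
    return genome_bases * depth * bytes_per_base


HUMAN_GENOME_BASES = 3_000_000_000  # 3 billion subunits, as on the slide

# One individual sequenced at 100X depth.
per_sample = raw_data_bytes(HUMAN_GENOME_BASES, depth=100)
print(f"One sample at 100X: {per_sample / 1e9:.0f} GB")

# A population study; 1000 samples is an assumed round number.
population = 1000 * per_sample
print(f"1000 samples at 100X: {population / 1e12:.0f} TB")
```

Under these assumptions one sample comes to 600 GB and 1000 samples to 600 TB, which matches the order of magnitude quoted on the slide.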
Long-Term Data Storage Needs
• Properly secure the data
  Plan for data redundancy, which generally means mirroring data in two or more copies
• Available (24x7x365) for all kinds of uses
  Readily accessible and in the right format
• Fast data transfer for collaborations
  Fast network server (Aspera) instead of mailing a hard drive
• Scalable, easy to scale up
  Choose reliable file systems
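The redundancy point (two or more copies) can be illustrated with a small mirroring routine that verifies each copy by checksum. This is only a sketch: the function names `sha256_of` and `mirror`, the replica directory names and the sample file are all hypothetical, and a production setup would normally rely on the storage system itself for replication.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy source into every replica directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demonstration in a temporary directory (hypothetical file and replica names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "sample.fastq"
    sample.write_text("@read1\nACGT\n+\nIIII\n")
    copies = mirror(sample, [root / "replica_a", root / "replica_b"])
    copy_names = [copy.parent.name for copy in copies]
    all_match = all(sha256_of(copy) == sha256_of(sample) for copy in copies)
    print(copy_names, all_match)
```

Verifying the checksum after each copy is what turns "two copies" into two copies known to be identical at write time.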
Hardware infrastructure of data storage
Type of Storage infrastructure
• Disk library
  A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optical (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
• Magnetic tape
  A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
• Redundant array of independent disks (RAID)
  A storage technology that combines multiple disk drive components into a logical unit.
• Direct-attached storage (DAS)
  A digital storage system directly attached to a server or workstation, without a storage network in between.
• Network-attached storage (NAS)
  File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
• Storage area network (SAN)
  A dedicated network that provides access to consolidated, block-level data storage.
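How RAID can tolerate the loss of one drive can be shown with XOR parity, the idea behind parity-based RAID levels such as RAID 5. The slides do not name a RAID level, so the choice of parity here is an assumption, and the three "drives" are just short byte strings used for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per (hypothetical) drive, plus one parity block.
data_blocks = [b"ACGT", b"TTGA", b"CCAG"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive: XOR of the survivors and the parity
# block reproduces the missing block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

The same property explains the "possible false sense of security" noted for RAID: a single parity block can rebuild one lost drive, not two.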
Type of storage: pros, cons and general use

Disk library
  Pros: fast; high storage capacity; high data availability
  Cons: not as easily accessible as DAS; intended for write-once, read-rarely information
  General use: disk-to-disk backup; archiving; near-line storage

Magnetic tape
  Pros: low cost per megabyte; portable; unlimited capacity (with multiple tapes)
  Cons: inconvenient for fast recovery of individual files or groups of files
  General use: archiving; limited-budget businesses; offsite storage

Redundant array of independent disks (RAID)
  Pros: fast; high storage capacity; high data availability; reliable; security; fault tolerance
  Cons: possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
  General use: swap files; Internet service providers; redundant storage

Direct-attached storage (DAS)
  Pros: simple; low starting cost; easy to use
  Cons: needs separate storage for each server; not easy to transfer data over the network; server takes the application processing load
  General use: data and application sharing; data backup; archiving

Network-attached storage (NAS)
  Pros: fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
  Cons: less convenient than SAN for moving large blocks of data
  General use: backup; archiving; redundant storage

Storage area network (SAN)
  Pros: excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
  Cons: expensive; lack of standardization; management complexity
  General use: large databases; bandwidth-intensive applications; mission-critical applications
Software Level of Data storage
Data flow of NGS
[Figure: data flow from Sequencer through Raw Data, Alignment/Assembly, Association and a complex workflow to Meaningful Biology Data and the Data Store]
Complex workflow
• Annotation of features
• Variations/Mutations
• Protein structure
• Gene expression
• Function networks
Data Management
• Classify the data into different levels
  First level of storage: dynamic, fast, temporary
  Second level of storage: slower than the first level, but durable and safe
  Third level of storage: high-capacity medium for backups and archives
• Choosing file systems
  Current popular distributed file systems include: Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.
Classify the data into different levels
• First level of storage: dynamic, fast, temporary
  Intermediate results of data analysis
  Reference data
  …
• Second level of storage: slower than the first level, but durable and safe
  Sequencing raw data
  Meaningful data
• Third level of storage: high-capacity medium for backups and archives
  Backups and archives of raw data and meaningful data
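The three-level classification above can be expressed as a small lookup. The category names (`intermediate_result`, `reference_data`, and so on) and the function `tiers_for` are hypothetical labels chosen for this sketch; only the mapping of data kinds to levels comes from the slide.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "first level: dynamic, fast, temporary"
    SECOND = "second level: slower, but durable and safe"
    THIRD = "third level: high-capacity backups and archives"


# Hypothetical category names; the mapping itself follows the slide.
TIER_BY_CATEGORY = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
}


def tiers_for(category: str) -> list[Tier]:
    """Return the storage levels on which a data category should live.

    Raw and meaningful data sit on the second level and are additionally
    backed up or archived on the third level.
    """
    primary = TIER_BY_CATEGORY[category]
    if primary is Tier.SECOND:
        return [primary, Tier.THIRD]
    return [primary]


print(tiers_for("reference_data"))
print(tiers_for("raw_sequencing_data"))
```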
Storage server: distributed file systems
Distributed File systems
• Lustre
  Lustre is a large-scale, safe, reliable and highly available cluster file system, formerly developed and maintained by Sun Microsystems. Lustre can support more than 10,000 nodes and petabyte-scale storage systems.
• Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage; it is a framework designed for running large-scale distributed applications on clusters of general-purpose computing hardware.
• OneFS
  OneFS enables a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
Distributed File systems
• MogileFS (www.danga.com)
• FreeNAS (www.openqrm.org)
• FastDFS (code.google.com/p/fastdfs)
• OpenAFS (www.openafs.org)
• MooseFS (derf.homelinux.org)
• pNFS (www.pnfs.com)
• GoogleFS
Data compression && Data security
• Data compression
  Commonly used: Lempel-Ziv, BWT
  Used exclusively for DNA sequences: Biocompress, GenCompress, CTW+LZ, GeNML, fqzcomp, sam_comp
• Data security
  RAID system failure/redundancy
  File system
  Network
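The difference between general-purpose and DNA-specific compression can be sketched with the Python standard library. Here `zlib` stands in for the Lempel-Ziv family and `bz2` for BWT-based compression, while the 2-bit packing is a deliberately simple DNA-specific encoding written for this example; it is not the algorithm of any of the tools named above, and the random test sequence is an assumption (real genomes contain repeats and also N bases, which this packing does not handle).

```python
import bz2
import random
import zlib

BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}


def pack_2bit(sequence: str) -> bytes:
    """Pack four bases into each byte, two bits per base."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Reverse pack_2bit; length is the original number of bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


random.seed(0)
sequence = "".join(random.choice(BASES) for _ in range(10_000))
raw = sequence.encode("ascii")

sizes = {
    "raw": len(raw),
    "zlib (Lempel-Ziv based)": len(zlib.compress(raw, 9)),
    "bz2 (BWT based)": len(bz2.compress(raw, 9)),
    "2-bit packing": len(pack_2bit(sequence)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Knowing that the alphabet has only four symbols lets the 2-bit packing reach a fixed 4:1 ratio with no modelling at all, which is the starting point that dedicated DNA compressors improve on.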
Data Storage Architecture In BGI
[Figure: data storage architecture in BGI. Components: Sequencers, Compute Nodes, Tape Library. Arrow labels: Write, Read, Read/Write, Two Copies, Archiving. Successive versions of the slide highlight First Level Storage, Second Level Storage and Third Level Storage.]
Distributed Computing on Storage Server
Traditional Genome Assembly
[Figure: Users; NGS read file; Storage; Sequence Assembly on a large-memory server (>500 GB)]
• Costly, unscalable
Distributed Genome Assembly
[Figure: Assembly …… distributed across several storage servers (IBM3630*16 for human genome)]
• Cost-effective, scalable
Hecate
• Constructing de Bruijn graph
• Solving tiny repeats
• Merging bubbles
• Scaffolding
• Merging contigs
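The first Hecate step, constructing a de Bruijn graph, can be illustrated with a textbook sketch. This is not Hecate's implementation: the function names, the toy reads and the choice k=4 are assumptions, and the sketch ignores sequencing errors, reverse complements, repeats and bubbles, which the later steps in the list address.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def extend_unambiguous(graph: dict[str, set[str]], start: str) -> str:
    """Walk from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop on a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


# Toy overlapping reads (hypothetical data).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

Because the graph is keyed by k-mer content, its nodes can in principle be partitioned across machines, which is one reason the approach suits distributed assembly.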
Gaea 2.1
[Figure: Reads and Reference genome pass through Preprocessing, Locating, Aligning and SNP calling]
• Distributed indexing for load balancing
• Dynamic programming for robust gap alignment
• Standard mapping quality for SNP calling
• Flexible splitting tolerates more mismatches
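Dynamic programming for gapped alignment can be illustrated with the classic Needleman-Wunsch recurrence. The slides do not describe Gaea's actual algorithm or scoring scheme, so the function below is a generic textbook sketch and the match, mismatch and gap values are assumed for illustration only.

```python
def global_alignment_score(
    read: str,
    reference: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = -2,
) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    rows, cols = len(read) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if read[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in the reference
                score[i][j - 1] + gap,       # gap in the read
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "ACT"))
```

With these assumed scores, an exact match of four bases scores 4, and aligning "ACGT" to "ACT" scores 1 (three matches and one gap).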
Q&A