SRAdb - an R/Bioconductor Package
Jack Zhu
NCBI SRA
• SRA: Sequence Read Archive
  – Archive of high-throughput sequencing data
• What is stored?
  – Raw sequence data
  – Now stores alignment information in sra format
  – EBI: still hosts fastq files
• The international partnership (INSDC):
  – SRA: NCBI Sequence Read Archive
  – ERA: EBI Sequence Read Archive
  – DRA: DDBJ Sequence Read Archive
  – All data is shared and synchronized between SRA, ERA and DRA.
http://www.ncbi.nlm.nih.gov/Traces/sra
NCBI SRA Web Site
SRAdb Bioconductor Package
• SRAdb SQLite database:
  – SRA metadata: faithfully parsed from NCBI SRA XML files
  – Main tables: submission, study, sample, experiment, run
  – MySQL database → SQLite database
  – Portable and local
  – Programmatic access to data through R and SQL
  – Updated weekly
SRAdb Download Stats
SRAdb Entities/Data Types
SRAdb Functions
Function      Category  Description
sraConvert    Query     Cross-reference between SRA data types
getFASTQinfo  Query     Get SRA fastq file information and associated metadata from EBI ENA
getSRAinfo    Query     Get SRA data file information from NCBI SRA
listSRAfile   Query     List sra, sra-lite or fastq data file names associated with input SRA accessions
SRAdb Functions – cont.
Function      Category  Description
getSRA        Query     Full-text search of SRA metadata using the SQLite fts3 module
getSRAdbFile  Download  Download and unzip the latest version of SRAmetadb.sqlite.gz from the server
getSRAfile    Download  Download SRA data files through ftp or fasp
ascpR         Download  Fasp file downloading using the ascp command line program
ascpSRA       Download  Fasp SRA data file downloading using the ascp command line program
entityGraph   Graph     Create a new graphNEL object from an input entity matrix or data.frame
sraGraph      Graph     Create a new graphNEL object of SRA accessions from an SRA full-text search
SRAdb Functions – cont.
Function      Category  Description
startIGV      IGV       Start IGV from R with a choice of maximum memory settings
IGVclear      IGV       Clear the loaded IGV tracks
IGVcollapse   IGV       Collapse tracks in IGV
IGVgenome     IGV       Set the IGV genome
IGVgoto       IGV       Go to a specified region in IGV
IGVload       IGV       Load data into IGV via a remote port call
IGVsession    IGV       Create an IGV session file
IGVsnapshot   IGV       Save a snapshot of the current IGV screen to a file
IGVsocket     IGV       Create a socket connection to IGV
IGVsort       IGV       Sort an alignment track by the specified option
Getting Started
> library(SRAdb)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: graph
Loading required package: RCurl
Loading required package: bitops
Setting options('download.file.method.GEOquery'='auto')
> sqlfile <- getSRAdbFile()
trying URL 'http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz'
Content type 'application/x-gzip' length 916403786 bytes (874.0 MB)
==================================================
downloaded 874.0 MB
Unzipping...
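The later slides use a connection object named sra_con, which this slide does not create. A minimal sketch of that step, assuming the usual DBI/RSQLite pattern of opening the downloaded SQLite file. The temporary file is only a stand-in for the path returned by getSRAdbFile(), and the one-row study table (including the accession SRP000000) is made up so that the example runs without the 874 MB download.

```r
library(DBI)
library(RSQLite)

# Stand-in for the real file: in the slides, sqlfile is returned by getSRAdbFile().
sqlfile <- tempfile(fileext = ".sqlite")

# Open a connection; this is the kind of object the later slides call sra_con.
sra_con <- dbConnect(SQLite(), sqlfile)

# Hypothetical one-row table so that the stand-in database is not empty.
dbWriteTable(sra_con, "study",
             data.frame(study_accession = "SRP000000",
                        study_type = "Other",
                        stringsAsFactors = FALSE))

# Inspect the tables and columns before writing queries.
sra_tables <- dbListTables(sra_con)
study_fields <- dbListFields(sra_con, "study")
print(sra_tables)
print(study_fields)

# Close the connection and remove the stand-in file.
dbDisconnect(sra_con)
unlink(sqlfile)
```

Against the real SRAmetadb.sqlite file, the same dbListTables() and dbListFields() calls are a quick way to see which tables and columns are available for SQL queries.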
SQL Query
> rs <- dbGetQuery( sra_con, paste( "SELECT study_type AS StudyType,
+   count( * ) AS Number FROM `study` GROUP BY study_type order
+   by Number DESC ", sep="") )
> rs
                 StudyType Number
1  Whole Genome Sequencing  26563
2                    Other  13908
3   Transcriptome Analysis   7179
4             Metagenomics   4500
5                     <NA>   3117
6              Epigenetics    845
7      Population Genomics    692
8         Exome Sequencing    141
9          Cancer Genomics     77
10 Pooled Clone Sequencing     32
11      Synthetic Genomics      9
12                  RNASeq      3
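The same kind of aggregate query can be tried without the full database. The sketch below builds a small in-memory SQLite table named study with made-up rows (the accessions S1 to S6 are hypothetical) and runs an equivalent GROUP BY query through DBI. It illustrates the query pattern only and says nothing about the real SRA counts.

```r
library(DBI)
library(RSQLite)

# In-memory database, used here in place of the real SRAmetadb.sqlite file.
con <- dbConnect(SQLite(), ":memory:")

# Made-up rows: three WGS studies, two "Other", one metagenomics study.
toy_study <- data.frame(
  study_accession = c("S1", "S2", "S3", "S4", "S5", "S6"),
  study_type = c("Whole Genome Sequencing", "Whole Genome Sequencing",
                 "Whole Genome Sequencing", "Other", "Other", "Metagenomics"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "study", toy_study)

# Count studies per type, largest group first (same shape as the slide's query).
toy_counts <- dbGetQuery(con, "
  SELECT study_type AS StudyType, count(*) AS Number
  FROM study
  GROUP BY study_type
  ORDER BY Number DESC
")
print(toy_counts)

dbDisconnect(con)
```

Because the result comes back as an ordinary data.frame, it can be passed directly to other R functions for plotting or further filtering.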
Accession Conversion
> conversion = sraConvert( c('SRP001007','SRP000931'), sra_con )
> conversion[1:3,]
      study submission    sample experiment       run
1 SRP000931  SRA009053 SRS003453  SRX006122 SRR018256
2 SRP000931  SRA009053 SRS003454  SRX006123 SRR018257
3 SRP000931  SRA009053 SRS003464  SRX006135 SRR018269
Full Text Search
> rs <- getSRA( search_terms = "breast cancer", out_types = c('run','study'), sra_con )
> dim(rs)
[1] 11081    23
> rs <- getSRA( search_terms = '"breast cancer"', out_types = c('run','study'), sra_con )
> dim(rs)
[1] 9803   23
> rs <- getSRA( search_terms = "breast OR cancer", out_types = c('run','study'), sra_con )
> dim(rs)
[1] 74250    23
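The three searches above return different numbers of rows because the search string is interpreted as a full-text query: two bare words must both appear, a quoted string must appear as a phrase, and OR accepts either word. The sketch below reproduces that behaviour on a toy SQLite fts3 table. The table name toy_ft, its rows, and the helper count_hits are hypothetical, and this is not the table that getSRA itself queries.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Toy full-text table (hypothetical name and rows), built with the fts3 module.
dbExecute(con, "CREATE VIRTUAL TABLE toy_ft USING fts3(title)")
dbExecute(con, "
  INSERT INTO toy_ft (title) VALUES
    ('breast cancer cell lines'),
    ('cancer of the breast'),
    ('lung cancer exome'),
    ('breast tissue methylation')
")

# Hypothetical helper: number of rows whose title matches a full-text query.
count_hits <- function(query) {
  dbGetQuery(con,
             "SELECT count(*) AS n FROM toy_ft WHERE title MATCH ?",
             params = list(query))$n
}

n_and    <- count_hits('breast cancer')    # both words, anywhere in the title
n_phrase <- count_hits('"breast cancer"')  # the exact phrase
n_or     <- count_hits('breast OR cancer') # either word

print(c(and = n_and, phrase = n_phrase, or = n_or))

dbDisconnect(con)
```

The ordering matches the pattern on the slide: the phrase search is the most restrictive and the OR search the least.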
Fasp Protocol Downloading
> ascpCMD <- "ascp -QT -l 300m -i '/Users/zhujack/Applications/AsperaConnect.app/Contents/Resources/asperaweb_id_dsa.putty'"
> getSRAfile( c("SRX000122"), sra_con, fileType = 'sra', srcType = 'fasp', ascpCMD = ascpCMD )
Files are saved to: '/Users/zhujack/Documents/R_WD'
Completed: 130939K bytes transferred in 7 seconds (148,846K bits/sec), in 1 file.
Completed: 422K bytes transferred in 0 seconds (4,067K bits/sec), in 1 file.
Completed: 843K bytes transferred in 1 seconds (5,337K bits/sec), in 1 file.
Completed: 159492K bytes transferred in 14 seconds (90,572K bits/sec), in 1 file.
----
Interaction with IGV
> startIGV("mm")
> sock <- IGVsocket()
> IGVgenome(sock, 'hg19')
> IGVload(sock, exampleBams)
> IGVgoto(sock, 'chr1:1-1000')
> IGVsnapshot(sock)
Acknowledgements
• Paul Meltzer
• Sean Davis
• All members of the Meltzer lab
• NCI SRA Team
  – Christopher O'Sullivan
  – Ben Busby