This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
Introduction to Linux for Bioinformatics
Managing data
Joachim Jacob
5 and 12 May 2014
Part 4 of 'Introduction to Linux for bioinformatics': Managing data
This is part 4 of the training session 'Introduction to Linux for bioinformatics'. It covers the basics of data management and gives tips for handling big data effectively. Interested in following this training session? Please contact me at http://www.jakonix.be/contact.html
Widely used compression tools:
● GNU zip (gzip)
● Block Sorting compression (bzip2)
Typically, compression tools work on one file. How to compress complete directories and their contents?
Tar without compression
Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tarball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
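The create and extract syntax above can be tried safely in a scratch directory. This is a minimal sketch: the file names and the temporary directory are made up for illustration, and the listing step with -t is an addition that the slide does not mention.

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (hypothetical names and content).
echo "ACGT" > file1
echo "TTGA" > file2

# c = create, f = write to the archive file named next.
tar -cf archive.tar file1 file2

# t = list the contents of the archive without extracting it.
listing=$(tar -tf archive.tar)
echo "$listing"

# x = extract; -C selects the directory to extract into.
mkdir extracted
tar -xf archive.tar -C extracted
ls extracted
```

Listing an archive before extracting it is a cheap way to check what it will write and where.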
Compression: a typical case
Archiving and compression mostly occur together. The most used formats are tar.gz and tar.bz2 (sometimes written tar.bz). These files are the result of two processes.
1. Archiving (tar)
2. Compressing (gzip or bzip2)
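To make the two processes visible, they can be run as two separate commands. The sketch below assumes a small example directory called docs/ (a made-up name) inside a temporary directory.

```shell
# Scratch directory with a small example folder (hypothetical content).
workdir=$(mktemp -d)
cd "$workdir"
mkdir docs
echo "sample text" > docs/notes.txt

# Step 1: archiving. Bundle the directory into one uncompressed file.
tar -cf docs.tar docs/

# Step 2: compressing. gzip replaces docs.tar with docs.tar.gz.
gzip docs.tar

ls
```

After the second step only docs.tar.gz and the original docs/ directory remain, which is why the file carries both extensions.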
Compression: on your desktop
Compression: on the command line
Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option.
Creating a compressed tar archive:
$ tar cvfz mytararchive.tar.gz docs/
$ tar cvfj mytararchive.tar.bz docs/
Decompressing a compressed tar archive:
$ tar xvfz mytararchive.tar.gz
$ tar xvfj mytararchive.tar.bz
Meaning of the options: c = create, x = extract, v = verbose, f = file (the archive name follows), z or j = compression technique (gzip or bzip2).
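The one-step commands can be tried as follows. This is a sketch under assumptions: docs/ and its content are invented for the example, and the archives use the .tar.gz and .tar.bz2 extensions, which is only a naming convention.

```shell
# Scratch directory with a small example folder (hypothetical content).
workdir=$(mktemp -d)
cd "$workdir"
mkdir docs
echo "sample text" > docs/notes.txt

# z = compress with gzip, j = compress with bzip2.
tar -czf docs.tar.gz docs/
tar -cjf docs.tar.bz2 docs/

# t lists the contents; both archives hold the same files.
gz_listing=$(tar -tzf docs.tar.gz)
bz_listing=$(tar -tjf docs.tar.bz2)
echo "$gz_listing"

# Extract the gzip-compressed archive into a separate directory.
mkdir restored
tar -xzf docs.tar.gz -C restored
```

The v option is left out here to keep the output short; adding it prints each file name as it is processed.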
De-/compression
To compress one or more files:
$ gzip [options] file(s)
$ bzip2 [options] file(s)
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)
Every file is compressed separately, and .gz (gzip) or .bz2 (bzip2) is appended to its name. The original file is replaced by the compressed one.
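A short sketch of what happens to the file names, using two made-up example files in a temporary directory:

```shell
# Scratch directory with two small example files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
echo "ACGTACGT" > reads1.txt
echo "TTGATTGA" > reads2.txt

# Each file is compressed on its own and replaced by a .gz file.
gzip reads1.txt reads2.txt
after_gzip=$(ls)
echo "$after_gzip"

# gunzip restores the original files and removes the .gz files.
gunzip reads1.txt.gz reads2.txt.gz

# bzip2 works the same way but appends .bz2.
bzip2 reads1.txt
after_bzip2=$(ls)
echo "$after_bzip2"

bunzip2 reads1.txt.bz2
```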
Tips
1. Do you have to uncompress a big text file to read it? No! Some tools can read compressed files directly (instead of first unpacking, then reading). Time saver!
$ zcat file(s)
$ bzcat file(s)
2. Compression is always a balance between time and compression ratio. Gzip is faster; bzip2 usually produces smaller files.
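Both tips can be checked on a small test file. The sketch below generates a repetitive text file (the file name and content are invented), compresses it with both tools, reads it back without unpacking, and prints the sizes in bytes. The actual sizes depend on the data, so no expected numbers are given.

```shell
# Scratch directory with a generated example file (hypothetical content).
workdir=$(mktemp -d)
cd "$workdir"
for i in $(seq 1 2000); do
    echo "read_$i ACGTACGTACGTACGT"
done > reads.txt

# -c writes the compressed data to standard output, so reads.txt is kept.
gzip  -c reads.txt > reads.txt.gz
bzip2 -c reads.txt > reads.txt.bz2

# Tip 1: read the first line without unpacking to disk.
first_gz=$(zcat reads.txt.gz | head -n 1)
first_bz=$(bzcat reads.txt.bz2 | head -n 1)
echo "$first_gz"

# Tip 2: compare the sizes in bytes.
wc -c reads.txt reads.txt.gz reads.txt.bz2
```

Prefixing the two compression commands with time shows the other side of the balance on your own data.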
A symbolic link (or symlink) is a file which points to the location of another file. You can do anything with the symlink that you can do with the original file. But when the original file is moved away from its location, the symlink is 'dead'.
~
  Downloads/
  Projects/
    Rice/
      Sequences/
        alignment.sam
      Annotation/
    Butterfly/
Symlinks
To create a symlink, move to the folder in which the symlink must be created, and execute ln with the -s option.
~/Projects $ cd Butterfly
~/Projects/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam
~
  Downloads/
  Projects/
    Rice/
      Sequences/
        alignment.sam
      Annotation/
    Butterfly/
      Link_to_alignment.sam
~/Projects/Butterfly $ ls -lh Link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam -> ../Sequences/alignment.sam
The symlink is created. You can check with ls. To delete a symlink, use unlink.
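The whole life cycle of a symlink (create, check, break, delete) can be replayed in a scratch copy of the directory tree. This is a sketch: the tree is recreated inside a temporary directory, the file content is a placeholder, and readlink and the test operators -L and -e are additions that the slides do not mention.

```shell
# Recreate the example tree in a throwaway location.
workdir=$(mktemp -d)
mkdir -p "$workdir/Projects/Rice/Sequences" "$workdir/Projects/Butterfly"
echo "@HD VN:1.0" > "$workdir/Projects/Rice/Sequences/alignment.sam"

# Create the symlink from inside the folder that should contain it.
cd "$workdir/Projects/Butterfly"
ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam

# readlink prints the path stored in the link; cat reads through the link.
target=$(readlink Link_to_alignment.sam)
content=$(cat Link_to_alignment.sam)
echo "$target"

# Moving the original breaks the link: it is still a symlink (-L),
# but the file it points to no longer exists (-e fails).
mv ../Rice/Sequences/alignment.sam ../Rice/alignment.sam
if [ -L Link_to_alignment.sam ] && [ ! -e Link_to_alignment.sam ]; then
    link_state="dead"
else
    link_state="alive"
fi
echo "$link_state"

# unlink removes the symlink itself, never the original file.
unlink Link_to_alignment.sam
```

Because the link stores a relative path, it also breaks if the link itself is moved to another folder; an absolute target path avoids that.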
A disk can be divided into parts, called partitions.
An internal disk which runs an operating system is usually divided into several partitions, each with its own function.
An external disk is usually not divided into partitions (but it can be partitioned).
Check out the disk utility tool
The system disk
Name of the disk
Name of the currently highlighted partition
Place in the directory structure where the partition can be accessed
An example of a USB disk
Place in the directory structure where the partition can be accessed
The USB disk is 'mounted' automatically on the directory tree under /media.
This is the type of file system on the partition (i.e. the way data is stored on this partition).
The partition is said to be formatted in FAT32 (in this case).
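The same information (where a partition is mounted and which file system it uses) is also available on the command line. The sketch below uses df, which is not shown in the slides, and assumes the GNU version found on most Linux systems. The root directory / is used as the example because it always exists; for a USB disk, pass its mount point under /media instead.

```shell
# -h prints sizes in human-readable units,
# -T adds a column with the file system type (GNU df).
fs_info=$(df -hT /)
echo "$fs_info"
```

The last column of the output is the mount point, i.e. the place in the directory structure where the partition can be accessed.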
File system formats
By default, many USB flash disks are formatted in FAT32.
Other types are NTFS, ext4, ZFS.
FAT32 – max 4 GB files
NTFS – maximum portability (also for use under Windows)
Ext4 – default file system in Linux
Btrfs – the next default file system in Linux in the near future
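The FAT32 file size limit matters in practice when copying large sequencing files to a USB disk. Below is a sketch of a check to run before copying. The function name find_too_big is made up for this example, and the limit assumes the commonly cited FAT32 maximum of 4 GiB minus one byte. To keep the demo light, it runs on tiny files with a 10-byte limit.

```shell
# Largest file FAT32 can store: 4 GiB minus one byte (assumed limit).
FAT32_MAX_BYTES=4294967295

# Print every regular file under directory $1 that is larger than
# $2 bytes (default: the FAT32 limit). The c suffix means bytes.
find_too_big() {
    find "$1" -type f -size +"${2:-$FAT32_MAX_BYTES}"c
}

# Demo on tiny files (hypothetical names) with a 10-byte limit.
workdir=$(mktemp -d)
printf 'ACGT' > "$workdir/small.txt"                 # 4 bytes
printf 'ACGTACGTACGTACGTACGT' > "$workdir/big.txt"   # 20 bytes

too_big_demo=$(find_too_big "$workdir" 10)
echo "$too_big_demo"

# With the real FAT32 limit, neither demo file is reported.
too_big_real=$(find_too_big "$workdir")
```

Called with only a directory, for example find_too_big ~/Projects, the function lists the files that would not fit on a FAT32 disk.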