This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
Introduction to Linux for Bioinformatics
Working the command line
Joachim Jacob
5 and 12 May 2014
Part 5 of "Introduction to Linux for Bioinformatics": Working the command line's text tools
This is part 5 of the training "Introduction to Linux for Bioinformatics". Here we introduce more advanced use of the command line (piping, redirecting) and provide a selection of GNU text mining and analysis tools that help tremendously in handling your bioinformatics data. Interested in following this training session? Contact me at http://www.jakonix.be/contact.html
Transcript
PATH contains a set of directories, separated by ':'
$ echo $PATH
/home/joachim/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
11 of 81
Installing is just placing the executable
1. You copy the executable to one of the folders in PATH
2. You create a sym(bolic) link to the executable in one of the folders in PATH (see previous week)
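The second approach can be sketched as follows. This is only an illustration: the tool name hello_tool and the temporary directories are hypothetical, and PATH is changed for the current shell only.

```shell
#!/bin/bash
# Hypothetical demo: make a tiny executable reachable through PATH.
workdir=$(mktemp -d)
mkdir -p "$workdir/bin" "$workdir/downloads"

# A stand-in executable (hypothetical name: hello_tool).
printf '#!/bin/bash\necho "hello from hello_tool"\n' > "$workdir/downloads/hello_tool"
chmod +x "$workdir/downloads/hello_tool"

# Create a symbolic link to the executable inside a directory that is in PATH.
ln -s "$workdir/downloads/hello_tool" "$workdir/bin/hello_tool"

# Put the demo directory in front of PATH (this shell only).
PATH="$workdir/bin:$PATH"

# The tool can now be called by name, without its full path.
install_result=$(hello_tool)
echo "$install_result"
```

Copying the file into the PATH directory with cp instead of linking it would correspond to the first approach.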
/etc is the directory that contains configuration text files. It is owned by root, so the settings there are system-wide.
A 'normal' user can create the file ~/.pam_environment to set variables for their own sessions (session-wide settings).
14 of 81
Recap: editing files
Create a text file with the name .pam_environment and open in an editor:
$ nano .pam_environment → quit by pressing ctrl-x
$ gedit .pam_environment → graphical
$ vim .pam_environment → quit without saving by typing :q! and pressing Enter
15 of 81
Create .pam_environment
In ~/.pam_environment, type:
TREES DEFAULT=green
Save the file. Log out and log back in.
16 of 81
Bash variables are limited in scope
You can assign any variable you like in bash, like:
The name of the variable can be any normal string. This variable exists only in this terminal. The command echo can display the value assigned to that variable. The value of a variable is referred to by ${varname} or $varname.
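A small sketch of assigning and reading variables; the variable names (organism, prefix) are made up for illustration.

```shell
#!/bin/bash
# Assign a variable: no spaces around '='. The name "organism" is illustrative.
organism="Arabidopsis thaliana"

# Both forms refer to the value.
echo "$organism"
echo "${organism}"

# Braces are needed when other text follows the name directly.
prefix="chr"
chrom_label="${prefix}1_genes"
echo "$chrom_label"

# The variable exists only in this shell: a child bash does not see it
# (unless it is exported), so the brackets below stay empty.
child_view=$(bash -c 'echo "[${organism}]"')
echo "$child_view"
```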
17 of 81
It can be used in scripts!
All commands you type, you can put one after the other in a text file, and let bash execute it.
Let's try!
Make a file in your ~ called 'space_left':
Enter the following two bash commands in this file:
df -h .
du -sh */
18 of 81
Running our first bash script
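One way to run such a file is to hand it to bash explicitly. The sketch below builds the two-line script in a temporary directory instead of ~ (an assumption made to keep the example self-contained) and runs it.

```shell
#!/bin/bash
# Build the two-line script in a temporary directory and run it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir data   # so that 'du -sh */' has a directory to report on

printf 'df -h .\ndu -sh */\n' > space_left

# Passing the file to bash works without a shebang or execute permission.
script_output=$(bash space_left)
echo "$script_output"
```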
19 of 81
The shebang
Simple text files become Bash scripts when adding a shebang line as first line, saying which program should read and execute this text file.
#!/bin/bash
#!/usr/bin/perl
#!/usr/bin/python
(see our other trainings for perl and python)
20 of 81
Things to remember
● Linux determines file types based on content (not extension).
● Change the permissions of scripts to read and execute to allow running them on the command line:
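A minimal sketch of setting those permissions with chmod, assuming a script without any extension in a temporary directory (the file name and location are illustrative).

```shell
#!/bin/bash
# A script with a shebang and no extension, made executable with chmod.
workdir=$(mktemp -d)
script="$workdir/space_left"   # no .sh extension needed

printf '#!/bin/bash\necho "script ran"\n' > "$script"

# Give the owner read and execute permission.
chmod u+rx "$script"

# Run it by its path; the shebang line selects the interpreter.
chmod_result=$("$script")
echo "$chmod_result"
```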
This is the ultimate power of Unix-like OSes. The philosophy is that every tool should do one small, specific task. By combining tools we can create a bigger piece of software that fulfils our needs.
How do we combine different tools?
1. writing scripts
2. pipes
31 of 81
Chaining the output to input
What the programs take in, and what they print out...
32 of 81
Chaining the output to input
We can take the output of one program, store it,
and use it as input for another program
~ Assembly line
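The assembly line can be sketched in two ways: with an intermediate file, or with a pipe (|) that connects the programs directly. The file names and the sample words are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Step by step: store the output of one program in a file...
printf 'banana\napple\ncherry\n' > "$workdir/unsorted.txt"
# ...and use that file as input for the next program.
sort "$workdir/unsorted.txt" > "$workdir/sorted.txt"
first_via_file=$(head -n 1 "$workdir/sorted.txt")

# The same assembly line with pipes: no intermediate files needed.
first_via_pipe=$(printf 'banana\napple\ncherry\n' | sort | head -n 1)

echo "$first_via_file $first_via_pipe"
```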
33 of 81
Deliverance through channels
When a program is executed, 3 channels are opened:
● stdin: an input channel – what is read by the program
● stdout: channel used for functional output
● stderr: channel used for error reporting
In UNIX, open files have an identification number called a file descriptor:
● stdin is referred to by 0
● stdout is referred to by 1
● stderr is referred to by 2
(*) by convention
http://www.linux-tutorial.info/modules.php?name=MContent&pageid=21
Can write output to files, instead of the terminal
[Diagram: cat with its three channels: 0 (input), 1 (STDOUT) and 2 (STDERR)]
40 of 81
Output redirection
The stdout output of a program can be saved to a file (or device):
$ cat 1> file
or short:
$ cat > file
Examples:
$ ls -lR / > /tmp/ls-lR
$ less /tmp/ls-lR
41 of 81
Chaining the output to input
You will have noticed that running:
$ ls -lR / > /tmp/ls-lR
still outputs some warnings/errors on the screen: this is all output of STDERR (note: channel 1 is redirected to a file, leaving only channel 2 to the terminal).
42 of 81
Chaining the output to input
Redirect the errors to a file called 'error.txt'.
$ ls -lR / ?
43 of 81
Chaining the output to input
Redirect the error channel to a file error.txt.
$ ls -lR / 2> error.txt
(no space between 2 and >, otherwise 2 is passed to ls as a file name)
$ less error.txt
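Both channels can be redirected in the same command, each to its own file. A small sketch with made-up file names; ls is asked for one existing and one missing file so that both channels produce output.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch exists.txt

# ls reports the existing file on stdout (1) and complains about the
# missing one on stderr (2). '|| true' only hides the non-zero exit status.
ls exists.txt missing.txt > out.txt 2> error.txt || true

echo "--- out.txt (channel 1) ---"
cat out.txt
echo "--- error.txt (channel 2) ---"
cat error.txt
```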
44 of 81
Beware of overwriting output
IMPORTANT: if you redirect to an existing file, its contents are replaced by the output.
To append to a file, use:
$ cat 1>> file
or short:
$ cat >> file
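The difference between overwriting (>) and appending (>>) in a short sketch; the file name log.txt is illustrative.

```shell
#!/bin/bash
workdir=$(mktemp -d)
log="$workdir/log.txt"

echo "first"  > "$log"    # creates the file
echo "second" > "$log"    # overwrites: "first" is gone
after_overwrite=$(cat "$log")

echo "third" >> "$log"    # appends: "second" is kept
line_count=$(wc -l < "$log")

echo "$after_overwrite"
cat "$log"
```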
A general tool for counting lines, words and characters:
$ wc [options] file(s)
-c: show number of bytes (equal to the number of characters in plain ASCII text)
-w: show number of words
-l: show number of lines
How many mRNA entries are on chr1 of A. thaliana?
$ wc -l chr1_TAIR9_mRNA.bed
or
$ grep chr1 TAIR9_mRNA.bed | wc -l
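The same counting can be tried on a tiny made-up file. The file demo_mRNA.bed and its four columns are invented for this sketch and are not the TAIR9 data; grep -w is used here so that only the whole word chr1 matches.

```shell
#!/bin/bash
workdir=$(mktemp -d)
bed="$workdir/demo_mRNA.bed"

# Made-up, tab-separated records: chromosome, start, end, name.
printf 'chr1\t100\t200\tgeneA\nchr1\t300\t400\tgeneB\nchr2\t150\t250\tgeneC\n' > "$bed"

total_lines=$(wc -l < "$bed")                # all records
chr1_lines=$(grep -w chr1 "$bed" | wc -l)    # records mentioning chr1
word_count=$(wc -w < "$bed")                 # whitespace-separated words

echo "lines: $total_lines, on chr1: $chr1_lines, words: $word_count"
```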
61 of 81
Translate
To replace characters:
$ tr 's1' 's2'
! tr always reads from stdin – you cannot specify any files as command line arguments. Characters in s1 are replaced by the characters at the same position in s2.
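A short sketch of tr on made-up DNA strings. Since tr only reads stdin, a file is fed to it with input redirection (<); the file name seq.txt is illustrative.

```shell
#!/bin/bash
# Each character of the first set is replaced by the character at the
# same position in the second set.
upper=$(echo "acgtacgt" | tr 'acgt' 'ACGT')
complement=$(echo "AACG" | tr 'ACGT' 'TGCA')

# Reading from a file: redirect it to stdin.
workdir=$(mktemp -d)
echo "ggcc" > "$workdir/seq.txt"
from_file=$(tr 'acgt' 'ACGT' < "$workdir/seq.txt")

echo "$upper $complement $from_file"
```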
● Using a fixed delimiter
$ cut [-d delim] -f <fields> [file]
● Chopping on fixed width
$ cut -c <fields> [file]
For <fields>:
N the Nth element
N-M the Nth till the Mth element
N- from the Nth element on
-M till the Mth element
The first element is 1.
68 of 81
Cutting columns from text files
Fixed width example:
Suppose there is a file fixed.txt with content
12345ABCDE67890FGHIJ
To extract a range of characters:
$ cut -c 6-10 fixed.txt
ABCDE
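The delimiter form works along the same lines. In this sketch the file demo.bed and the comma-separated string are made up; without -d, cut splits on tabs.

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf 'chr1\t100\t200\tgeneA\nchr2\t150\t250\tgeneC\n' > "$workdir/demo.bed"

# Tab is the default delimiter, so -d can be left out here.
names=$(cut -f 4 "$workdir/demo.bed")      # 4th field of every line
coords=$(cut -f 2-3 "$workdir/demo.bed")   # fields 2 till 3

# Another delimiter, and a field list combining N and N-.
csv_fields=$(echo "a,b,c,d" | cut -d ',' -f 1,3-)

# Fixed width on the same content as fixed.txt.
fixed=$(echo "12345ABCDE67890FGHIJ" | cut -c 6-10)

echo "$names"
echo "$coords"
echo "$csv_fields $fixed"
```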
69 of 81
Sorting output
To sort lines of text alphabetically or numerically:
$ sort [options] file(s)
When more files are specified, they are read one by one, but all lines together are sorted.
70 of 81
Sorting options
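● -n sort numerically
● -f fold – case-insensitive
● -r reverse sort order
● -ts use s as field separator (instead of whitespace)
● -kn sort on the n-th field (1 being the first field)
Example: sort mRNA by chromosome and next by number of exons.
$ sort -k1,1 -k10,10n TAIR9_mRNA.bed > out.bed
(-k1,1 restricts the first key to field 1, sorted alphabetically because chromosome names such as chr1 are not plain numbers; -k10,10n sorts field 10 numerically.)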
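A sketch of a two-key sort on a tiny made-up file with two columns (chromosome and exon count); it is not the TAIR9 file, so the exon count sits in field 2 here instead of field 10.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Made-up columns: chromosome, number of exons.
printf 'chr2 3\nchr1 10\nchr1 2\nchr2 1\n' > "$workdir/exons.txt"

# First key: field 1 only, alphabetical. Second key: field 2 only, numeric.
sorted=$(sort -k1,1 -k2,2n "$workdir/exons.txt")
echo "$sorted"

# Without the 'n' modifier the second key is compared as text,
# so "10" would sort before "2".
sorted_as_text=$(sort -k1,1 -k2,2 "$workdir/exons.txt")
echo "$sorted_as_text"
```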
71 of 81
Detecting unique records with uniq
● eliminate duplicate lines in a set of files
● display unique lines
● display and count duplicate lines
Very important: uniq always needs sorted input, because it only compares adjacent lines.
Useful option: -c prefixes each line with the number of times it occurs.
72 of 81
Eliminate duplicates
● Example:
$ who
root tty1 Oct 16 23:20
james tty2 Oct 16 23:20
james pts/0 Oct 16 23:21
james pts/1 Oct 16 23:22
james pts/2 Oct 16 23:22
$ who | awk '{print $1}' | sort | uniq
james
root
73 of 81
Display unique or duplicate lines
● To display lines that occur only once:
$ uniq -u file(s)
● To display lines that occur more than once:
$ uniq -d file(s)
Example:
$ who | awk '{print $1}' | sort | uniq -d
james
● To display the counts of the lines:
$ uniq -c file(s)
Example:
$ who | awk '{print $1}' | sort | uniq -c
4 james
1 root
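Because the output of who differs per machine, the sketch below replays the same user names from a fixed, made-up list. The last command illustrates why sorted input matters.

```shell
#!/bin/bash
# Same user names as in the 'who' example, as a fixed list.
users=$(printf 'root\njames\njames\njames\njames\n')

unique_users=$(echo "$users" | sort | uniq)   # every name once
once=$(echo "$users" | sort | uniq -u)        # names occurring exactly once
dupes=$(echo "$users" | sort | uniq -d)       # names occurring more than once
counts=$(echo "$users" | sort | uniq -c)      # count in front of each name

echo "$unique_users"
echo "$once"
echo "$dupes"
echo "$counts"

# Without sorting, only adjacent duplicates are merged: 'a' survives twice.
unsorted_lines=$(printf 'a\nb\na\n' | uniq | wc -l)
echo "$unsorted_lines"
```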
74 of 81
Edit per line with sed
Sed (the stream editor) can make changes in text per line. It works on files or on STDIN.
See http://www.grymoire.com/Unix/Sed.html
This is also a very big tool, but we will only look at the substitute function (the most used one).
$ sed -e 's/r1/s1/' file(s)
s: the substitute command
/: separator
r1: regex to be replaced
s1: text that will replace the regex match
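A few substitutions on made-up strings, as a sketch. The trailing g flag (replace every match on a line instead of only the first) is not covered above; it is a standard part of the sed substitute command.

```shell
#!/bin/bash
# Only the first match on each line is replaced by default.
first_only=$(echo "chr1 chr1 chr2" | sed -e 's/chr/Chr/')

# With the g flag, every match on the line is replaced.
all_matches=$(echo "chr1 chr1 chr2" | sed -e 's/chr/Chr/g')

# r1 is a regex: here one or more digits are replaced by a single N.
with_regex=$(echo "gene_0042" | sed -e 's/[0-9][0-9]*/N/')

echo "$first_only"
echo "$all_matches"
echo "$with_regex"
```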