Understanding and using GNU/Linux
by Giuseppe Profiti, January 2014
Tutorial for the Programming for Bioinformatics course, International Master of Bioinformatics
University of Bologna, Italy
http://www.biocomp.unibo.it/lsbioinfo/
January 2014 Giuseppe Profiti 2/64
Goals and means
● Goals
– Understanding what an Operating System is
– Knowing how to use GNU/Linux proficiently
● Means
– Simple examples (maybe biology-inspired)
– Hands-on exercises
● Not covered
– Formal details
– “How do I use <our favourite software>?”
What is an Operating System?
● It's a piece of software
● It manages hardware and software resources
● It's useful for general purpose and heterogeneous hardware systems
Image from Wikimedia Commons, Public Domain
Hardware, OS and software
Hardware
Operating system
Image from Flickr, released under Creative Commons BY by Petr Dosek
Same OS, different software
Image from Wikimedia Commons, Public Domain (NASA)
Image from Flickr, Creative Commons BY by Texas A&M University
Another example
Image from Flickr, released under Creative Commons BY by Andrea Arden
Different hardware, different OS and software
GNU/Linux
● Originates from Unix
● Linux is the kernel
– Manages the hardware, memory and so on
● GNU is a set of software and tools
– They run on top of Linux
– Provide functionality
● Multi user, multi threaded
● Ubuntu, Lubuntu, Xubuntu, Debian, Red Hat...
● MacOS is based on Unix too
What's the difference?
Image from Wikimedia Commons, GNU GPL license
A Linux distribution includes
● The Kernel (Linux)
● An install system for the distribution
● Drivers
– How the system can manage specific hardware
● A package manager
– To install and update software
– Usually different from one distribution to the other
Login
● Once started, the system asks for your
– Username
– Password
● Each user has a different main folder on disk
● Users have different access rights
● The superuser (called “root”) can do everything
● On Ubuntu, the main user you created when installing can run programs as root, if needed
Shell
● It is the main interface with the system
● Can be used to
– Navigate the file system
– Execute tools
– Install software
– Connect to other machines
– Edit files
– … everything the system can do
● Also called Console, or Terminal
What a shell looks like
Image from Wikimedia Commons, licensed as Public Domain by User:AVRS
“It's a trap!”
Every time you use the mouse in a shell, you are doing something wrong.
Image by Manuel R., Wikimedia Commons, CC-BY
Exercise 1: Open a shell
● If you don't use the Graphical User Interface
– You already are in a shell
● If you use the Graphical User Interface
– In Ubuntu: Click the logo, type “terminal”, select it
– Other systems: find the terminal icon somewhere
● The terminal may have a black, white or colour background
– No matter the colour, it works in the same way
The prompt
● It is a string saying that the shell is ready
● It may state the current directory
● It ends with $, %, > or #
● After that, you can type a command
● After a command, you press the Enter key
Exercise 2: create a directory
● To create a directory (or folder) type:
mkdir tutorial-p2b
● and press the Enter key ↵
● What do you see?
● To check the existence of the new directory:
ls
● and press the Enter key ↵
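The exercise above can be sketched in script form as follows. The scratch directory created with mktemp is an assumption added here, so that the example does not touch your real files; the two commands are the ones from the exercise.

```bash
# Work in a throw-away directory so that nothing else shows up in the listing.
workdir=$(mktemp -d)
cd "$workdir"

# mkdir prints nothing when it succeeds.
mkdir tutorial-p2b

# ls now lists the new directory.
listing=$(ls)
echo "$listing"
```

The silence of mkdir is the answer to "What do you see?": on success many commands print nothing at all.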
Upper-case and lower-case
● The shell is CASE SENSITIVE
– Upper-case and lower-case are different
● LS is different from ls
● Tutorial-p2b is not tutorial-p2b
● Then, to run a program, you have to type its name correctly
● You can use the TAB key ↹ to complete a filename after typing its first letters
– IF the system can distinguish what file you want
Exercise 3: look inside a directory
● Type:
ls tutorial-p2b ↵
● Type:
ls Tutorial-p2b ↵
● Type:
ls tut
● Then press the TAB key ↹, then the Enter key ↵
File system
● It stores both data files and programs
● Directories are lists of files
● Hierarchical structure
● The root of the tree is the directory /
/
├── home
│   ├── me
│   └── you
├── etc
└── bin
Filesystem
● Files and directories are stored in a filesystem● The filesystem is like a tree:
– It has one root directory “/”
– Each subdirectory is a branch in the tree
– Each file is a leaf
Path
● A path specifies a location in the filesystem
● It indicates the branches to follow
● Each branch (directory) is separated by /
● The path can be absolute or relative
● Absolute: always starts from the root
– e.g. “/home/Alice/Desktop/vacation/sunset.jpg”
● Relative: starts from your current directory
– e.g. “Desktop/vacation/sunset.jpg” if you are in /home/Alice/
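A minimal sketch of the difference between the two kinds of path. The tree is rebuilt under a scratch directory (an assumption of this example, so that it runs without a real /home/Alice), and the variable names are made up for illustration.

```bash
# Scratch copy of the tree from the slide; "$root" stands in for "/".
root=$(mktemp -d)
mkdir -p "$root/home/Alice/Desktop/vacation"
echo "sunset" > "$root/home/Alice/Desktop/vacation/sunset.jpg"

# Absolute path: it works from any current directory.
cd /
abs=$(cat "$root/home/Alice/Desktop/vacation/sunset.jpg")

# Relative path: it works only because we first move into home/Alice.
cd "$root/home/Alice"
rel=$(cat Desktop/vacation/sunset.jpg)

echo "$abs $rel"
```

Both commands read the same file; only the starting point of the path changes.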
Special directories
● The current directory is “.”
– So “sunset.jpg” and “./sunset.jpg” are the same file
● The parent directory is “..”
– e.g. if you are in “/home/Alice/Desktop/work/”, you write “../vacation/sunset.jpg”
– If you are in “/home/Alice/experiment/data/”, you type “../../Desktop/vacation/sunset.jpg”
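As a hedged illustration of “.” and “..”, the sketch below checks that different spellings reach the same file. It assumes bash, whose test command offers -ef ("same file"); the scratch tree and the variable names are made up for this example.

```bash
# Scratch copy of the directories named on the slide.
root=$(mktemp -d)
mkdir -p "$root/home/Alice/Desktop/work" "$root/home/Alice/Desktop/vacation"
touch "$root/home/Alice/Desktop/vacation/sunset.jpg"

# From Desktop/work, ".." is Desktop, so ../vacation is its sibling.
cd "$root/home/Alice/Desktop/work"
same_via_parent=no
if [ ../vacation/sunset.jpg -ef "$root/home/Alice/Desktop/vacation/sunset.jpg" ]; then
    same_via_parent=yes
fi

# Inside vacation, "sunset.jpg" and "./sunset.jpg" are the same file.
cd ../vacation
same_via_dot=no
if [ sunset.jpg -ef ./sunset.jpg ]; then
    same_via_dot=yes
fi

echo "$same_via_parent $same_via_dot"
```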
Exercise 4: path
[Figure: directory tree rooted at /, with directories HOME, WORK, A and B and files 1.TXT, 2.TXT and 3.TXT]
While in /home/ check the following relative paths:
● A/1.TXT
● ../WORK/1.TXT
● ../WORK/A/../1.TXT
● ../WORK/A/../../HOME/B/../A/1.TXT
Specify the absolute paths for the following files:
● leftmost and rightmost 3.TXT
● leftmost and rightmost 1.TXT
File permissions
● Files can be read, written and executed
● The owner of a file can restrict these operations
– For herself
– For other members of the group
– For everyone else
Examples:
● Experiment data that should not be overwritten
● Data shared only with group members for read purposes
File permissions
● Permissions can be changed using chmod
● The shortcuts are:
– User (u), Group (g), Others (o), All (a)
– adding (+), removing (-)
– Read (r), Write (w) and eXecute (x)
● To remove the write permission from the group:
chmod g-w filename
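A small sketch of chmod in action. The file name experiment.dat and the scratch directory are hypothetical; the permissions are read back from the first ten characters of ls -l.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch experiment.dat

# Start from a known state: user rw, group rw, others r.
chmod u=rw,g=rw,o=r experiment.dat
before=$(ls -l experiment.dat | cut -c 1-10)

# Remove the write permission from the group.
chmod g-w experiment.dat
after=$(ls -l experiment.dat | cut -c 1-10)

echo "$before -> $after"   # prints: -rw-rw-r-- -> -rw-r--r--
```

In the ls -l string, the first character is the file type, followed by three rwx triplets for user, group and others.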
File types
● Extensions mean nothing
– .doc, .jpg and so on are just conventions
● Text and binary files
– Text can be printed and read by humans
● Plain text, CSV, XML are all text-based
– Binary can be read by programs
● Data and programs
– A program can be executed by the system (the executable permission alone does not make a file a program)
Programs and processes
● An executable program sits on the disk
● A running program becomes a process
– You can have multiple processes spawned from the same program, e.g. many blastall running
● Each process has a unique identifier (pid)
● To inspect the running processes: ps or top
● To quit a running process, use CTRL+c or
kill <pid>
Exercise 5: processes
● Open two shells
● In one shell run the following command
sleep 20m
● In the other shell, run ps to find the pid of sleep
● Kill the process using
kill <pid>
● Note: on remote servers you can't CTRL-C unless you keep the connection open
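A single-shell variant of this exercise can be sketched as below. It is an adaptation, not the exercise itself: the process is started in the background with &, and the shell variable $! gives its pid directly, so the manual lookup with ps is not needed. The variable names are made up for illustration.

```bash
# Start sleep in the background; 1200 seconds is the same 20 minutes.
sleep 1200 &
pid=$!            # $! holds the pid of the last background process

# kill -0 sends no signal: it only checks that the process exists.
running_before=no
kill -0 "$pid" && running_before=yes

# Terminate the process, then wait for it to finish.
kill "$pid"
status=0
wait "$pid" 2>/dev/null || status=$?

# After the kill, the pid no longer exists.
running_after=yes
kill -0 "$pid" 2>/dev/null || running_after=no

echo "before: $running_before, after: $running_after, exit status: $status"
```

The exit status reported by wait is non-zero because the process was ended by a signal instead of finishing normally.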
Parameters vs arguments
● The argument(s) is the subject of the operation
– ls /home/Alice/Desktop
– kill 260046
● Parameters (or options) modify the behaviour
– ls -l /home/Beatrix/Desktop
– top -h
● Parameters usually start with a minus sign
– A single one for single-letter parameters (-h, -p, -t)
– A double one for longer parameters (--help, --out)
Inspecting a file
● head prints the first 10 lines
● tail prints the last 10 lines
– You can change the number of lines of both head and tail with the -n parameter
● cat shows the whole file
– Beware of long files
● more shows the whole file, paginated
Exercise files
● Download the following file
http://profiti.web.cs.unibo.it/res/p2b/ex.tar.gz
● Uncompress it
● It should contain the following files:
– test1.txt
– test2.txt
– data1.txt
– data2.txt
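For the "Uncompress it" step, a .tar.gz archive is usually handled with tar. The sketch below builds a small stand-in archive locally (its content is made up, so that the example does not depend on the download) and then lists and extracts it.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in archive with hypothetical content.
mkdir src
echo "hello" > src/test1.txt
echo "world" > src/test2.txt
tar -czf ex.tar.gz -C src test1.txt test2.txt

# -t lists the content without extracting, -x extracts;
# z means gzip-compressed, f gives the archive name.
tar -tzf ex.tar.gz
mkdir extracted
tar -xzf ex.tar.gz -C extracted

names=$(ls extracted)
content=$(cat extracted/test1.txt)
echo "$names"
```

With the real archive, tar -xzf ex.tar.gz run in the download directory would be the equivalent step.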
Exercise 6: looking into a file
● Print the content of test1.txt
cat test1.txt
● Print the first 10 lines of test1.txt
head test1.txt
● Print the first line of test1.txt
head -n 1 test1.txt
● Print the last line of test1.txt
tail -n 1 test1.txt
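The same commands can be tried on a file whose content is known in advance. In this sketch sample.txt is a hypothetical stand-in for test1.txt, filled with the numbers 1 to 25 by seq.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# 25 numbered lines, one number per line.
seq 1 25 > sample.txt

first=$(head -n 1 sample.txt)             # first line
last=$(tail -n 1 sample.txt)              # last line
default_count=$(head sample.txt | wc -l)  # head without -n prints 10 lines

echo "first=$first last=$last default=$default_count"
```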
Finding text: grep
● It prints the lines containing a match
grep "pattern" filename
● Pattern can be a string or a regular expression
● Useful parameters
– -w matches whole words (i.e. spaces around)
– -x matches whole lines
– -i ignore case (uppercase = lowercase)
– -v reverse match (i.e. lines NOT containing pattern)
Exercise 7: grep
● Find all the lines containing “m” in test1.txt
grep "m" test1.txt
● Find all the lines NOT containing “m” in test1.txt
grep -v "m" test1.txt
● Find all the lines containing “omen” in test1.txt
grep "omen" test1.txt
● Please notice that “momentum” matches
– Try using the -w option
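A minimal sketch of these options on a tiny word list. The file words.txt and its content are made up for illustration; they are not the content of test1.txt.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'omen\nmomentum\nwomen\ncat\n' > words.txt

# Substring match: omen, momentum and women all contain "omen".
with_omen=$(grep "omen" words.txt)

# Whole-word match: only the line where "omen" stands alone.
whole_word=$(grep -w "omen" words.txt)

# Inverted match: lines that do NOT contain "m".
without_m=$(grep -v "m" words.txt)

# Case-insensitive match.
ignore_case=$(grep -i "CAT" words.txt)

echo "$whole_word / $without_m / $ignore_case"
```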
Finding text: grep /2
● You can provide a file of patterns
grep -f patterns.txt data.txt
● Each line of patterns.txt is used as a separate pattern
● It may take a while if the two files are big
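A short sketch of grep -f. The identifiers and annotations in the two files are invented for the example.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# One pattern per line.
printf 'P12345\nQ67890\n' > patterns.txt
printf 'P12345 kinase\nO00000 unknown\nQ67890 transporter\n' > data.txt

# Prints every line of data.txt that matches at least one pattern.
hits=$(grep -f patterns.txt data.txt)
echo "$hits"
```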
Comparing
● Look for the differences in two similar files
diff file1 file2
● Compares the two files line by line
● Output
– Line numbers for the different lines
– “<” for lines only in file1
– “>” for lines only in file2
● It is not very easy to use
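To make the output format concrete, here is a minimal sketch with two three-line files that differ only in the second line (the file contents are made up).

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > file1
printf 'a\nx\nc\n' > file2

# diff exits with status 1 when the files differ, hence the "|| true".
out=$(diff file1 file2) || true
echo "$out"
# 2c2      line 2 of file1 was changed into line 2 of file2
# < b      the line as it is in file1
# ---
# > x      the line as it is in file2
```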
Sorting
● Diffing is easier when data are sorted
sort filename
● Useful parameters:
– -n numerical sort (otherwise 100 < 2)
– -r reverse sort
– -k x sort on column number x
– -t x uses x as column separator
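The effect of these parameters can be sketched on a few made-up values. tr is used here only to put the sorted lines on a single line for display.

```bash
# Default sort compares text character by character: "100" comes before "2".
text_order=$(printf '100\n2\n15\n' | sort | tr '\n' ' ')

# -n compares the values as numbers.
num_order=$(printf '100\n2\n15\n' | sort -n | tr '\n' ' ')

# -r reverses the order.
rev_order=$(printf '100\n2\n15\n' | sort -n -r | tr '\n' ' ')

# -t sets the column separator, -k picks the column to sort on.
by_col=$(printf 'b,30\na,4\nc,100\n' | sort -t "," -k 2 -n | tr '\n' ' ')

echo "$text_order| $num_order| $rev_order| $by_col"
```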
Getting columns
● Printing a specific column with cut (ex.: 3rd)
cut -f 3 filename
● You can specify the column separator with -d (the default is TAB)
● Useful arguments for -f:
– N prints the Nth column, counted starting from 1
– N- prints from the Nth to the end of the line
– N-M prints from Nth to Mth (included)
– -M prints from 1 up to Mth (included)
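A minimal sketch of the four forms of -f, on made-up input lines.

```bash
# A TAB-separated line (TAB is the default separator of cut).
third=$(printf 'id\tname\tchain\tscore\n' | cut -f 3)

# N- : from the Nth column to the end of the line.
from_second=$(printf 'id\tname\tchain\tscore\n' | cut -f 2-)

# N-M and -M, with a comma as separator.
first_two=$(printf 'a,b,c,d\n' | cut -d "," -f 1-2)
up_to_third=$(printf 'a,b,c,d\n' | cut -d "," -f -3)

echo "$third | $first_two | $up_to_third"
```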
Redirection
● You can save the result of commands to a file
● The output is redirected using >
ls > files.list
● Append with >>
● Errors are not “output”, use 2>
● Both output and error redirected with &>
● Input redirection with <
cat < file.txt
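The redirections above can be tried together in a scratch directory. All file and directory names in this sketch are made up.

```bash
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
touch data/a.txt data/b.txt

# > creates (or overwrites) the file with the output of the command.
ls data > files.list

# >> appends to the file instead of overwriting it.
echo "extra line" >> files.list

# Errors are a separate stream: 2> catches them, > stays empty here.
ls no-such-dir > out.txt 2> err.txt || true
errors_captured=no
if [ -s err.txt ]; then errors_captured=yes; fi

# < feeds the file to the command as its input.
lines=$(wc -l < files.list)

echo "lines=$lines errors_captured=$errors_captured"
```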
Pipe: motivation
● Example: I want the file names for all the files with rwx permissions
● Solution with redirection:
ls -l > files.list
grep "rwx" files.list > wanted-files.list
cut -f 10- -d " " wanted-files.list > result.list
Pipe
● Too many intermediate files
– Possibly big: disk space issues
– Hard to remember: do I need myfiles.list or my.list?
● Rule of thumb: keep an intermediate result only if you need it later for other analyses
● For everything else, use the pipe |
ls -l | grep "rwx" | cut -f 10- -d " " > result.list
● Pipe sends the result of a command to the input of the following one
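A sketch of the same pipeline on two made-up files. One assumption differs from the slide: the last step uses awk's $NF (the last field, introduced later in this tutorial) instead of cut, because ls -l may pad its columns with several spaces and the position of the file name then varies.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch notes.txt run.sh
chmod 600 notes.txt   # rw- for the user only
chmod 700 run.sh      # rwx for the user only

# No intermediate files: each command reads the output of the previous one.
result=$(ls -l | grep "rwx" | awk '{print $NF}')
echo "$result"
```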
Pipe
● All the previous commands also work when their input comes from a pipe instead of a file
● The first 10 lines of a list of files
ls | head
● The first column of the last line of a sorted file
sort file.txt | tail -n 1 | cut -f 1
Pipe vs sequence
● Pipe sends the result to the next command
● If you want to execute commands in sequence, separate them using ;
ls; head test.txt
● What if the second depends on the first? Use &&: the second command runs only if the first one succeeds
python my.py > a.txt && sort a.txt
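The difference between ; and && shows up when the first command fails. In this sketch false is a standard command that always fails and true one that always succeeds; the variable names are made up.

```bash
# With ; the second command runs even though the first one failed.
seq_result=$(false; echo "ran")

# With && the second command is skipped when the first one fails...
and_after_failure=$(false && echo "ran") || true

# ...and runs when the first one succeeds.
and_after_success=$(true && echo "ran")

echo "[$seq_result] [$and_after_failure] [$and_after_success]"
```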
Editing a file
● Too many editors to list them all, just the most common
● On the shell
– cat > filename writes everything you type to the file
● CTRL+d ends the input
– nano, pico: easy to use
– vim, emacs: more advanced
● On the GUI
– gedit
– gvim
Shell scripting
● What if the command is very long and you have to use it again?
● What if you have to repeat the same operations for many inputs?
● Shell scripting is programming for the shell
● Same primitives as other programming languages
– if choices, for loops
– Parameters, variables
Shell scripting /2
● Save commands to a text file
● Add execution permissions to the file
● Call the file from the shell
● Example:
for i in $(ls *.fasta); do echo $i, $(grep "^>" $i | wc -l); done | sort -n -k 2 > $1
Shell scripting /3
for i in $(ls *.fasta); do echo $i, $(grep "^>" $i | wc -l); done | sort -n -k 2 > $1
● $( ) returns the output of the commands inside
● Useful with cat and any other command that prints content
Shell scripting /4
for i in $(ls *.fasta); do echo $i, $(grep "^>" $i | wc -l); done | sort -n -k 2 > $1
● for executes the commands between do and done once for each value in the list
● i is the iteration variable: at each iteration it takes one of the values (in the example, a file name); you access its value using $i
Shell scripting /5
for i in $(ls *.fasta); do echo $i, $(grep "^>" $i | wc -l); done | sort -n -k 2 > $1
● The combined output of all the loop iterations is passed to sort
● This script returns a list of fasta files, each with its number of entries, sorted by that number
Shell scripting /6
for i in $(ls *.fasta); do echo $i, $(grep "^>" $i | wc -l); done | sort -n -k 2 > $1
● The final result is redirected to a file, specified at command line
● Examples:
bash myscript.sh result1.txt
bash myscript.sh result2.txt
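An end-to-end sketch of saving and calling such a script. It is a variation on the one-liner, not the author's exact script: it loops over the glob *.fasta directly instead of $(ls *.fasta), quotes the variables so that file names with spaces survive, and counts with grep -c instead of grep | wc -l. The two FASTA files are made up.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny made-up FASTA files: 3 entries and 1 entry.
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > a.fasta
printf '>only\nACGTACGT\n' > b.fasta

# Save the commands to a file; the quoted 'EOF' keeps $i and $1 unexpanded.
cat > myscript.sh <<'EOF'
for i in *.fasta; do
    echo "$i, $(grep -c "^>" "$i")"
done | sort -n -k 2 > "$1"
EOF

# $1 inside the script is the first argument given on the command line.
bash myscript.sh result1.txt
result=$(cat result1.txt)
echo "$result"
```

The file with fewer entries comes first, because sort -n -k 2 orders the lines by the count in the second column.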
Awk
● Awk executes a series of commands for each line of the input
● It can execute different commands for different lines, using matching regular expressions
● It may be faster than other tools
● It is easy to use and powerful
Awk /2
awk '/<regex>/ {<commands>}' a.txt
● You can specify multiple regular expressions
● Commands can contain if and assignments
● Two special keywords instead of regex
– BEGIN matches the beginning of the input, before the first line
– END matches the end of the input, after the last line
Awk /3
awk 'BEGIN {a=0} {a=a+1} END{print a}'
● It counts the number of lines
● Before the first line, sets the variable a to zero
● For each line, increases the counter
– There is no regex, so each line matches
● At the end, prints the value of the counter
● Unlike wc -l, it also counts a last line that does not end with a newline
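The comparison with wc -l can be checked on a three-line input whose last line has no final newline (printf adds none here). The input is made up for the example.

```bash
# Three lines, but only two newline characters.
awk_count=$(printf 'a\nb\nc' | awk 'BEGIN {a=0} {a=a+1} END {print a}')

# wc -l counts newline characters, so it misses the unterminated last line.
wc_count=$(printf 'a\nb\nc' | wc -l)

echo "awk: $awk_count, wc: $wc_count"
```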
Awk /4
awk '{print $2,$3}'
● Prints the second and the third column
● Columns are separated by whitespace by default
● You can specify a different separator with -F
awk -F "," '{print $2,$3}'
● NF is the number of columns (or “fields”)
● $NF is the value of the last column
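A minimal sketch of columns, separators, NF and $NF on made-up one-line inputs.

```bash
# Default separator: whitespace. The comma in print puts a space in the output.
by_space=$(echo "a b c d" | awk '{print $2,$3}')

# Comma as separator.
by_comma=$(echo "a,b,c,d" | awk -F "," '{print $2,$3}')

# NF is the number of fields, $NF the content of the last one.
field_count=$(echo "a,b,c,d" | awk -F "," '{print NF}')
last_field=$(echo "a,b,c,d" | awk -F "," '{print $NF}')

echo "$by_space | $by_comma | $field_count | $last_field"
```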
Awk /5
awk '/^ATOM/ {if ($5=="A") print $7,$8,$9}'
● Prints the positions for each atom in the A chain
● It matches only lines starting with “ATOM”
● You can select lines not matching a pattern
awk '!/(TAG)|(TAA)|(TGA)/ {print $3,$4}'
● The ! means “not matching”
● Round brackets group patterns
● | is for alternatives
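Both commands can be tried on a few invented lines. The input below is a simplified, whitespace-separated imitation of PDB records and of a codon table; it is an assumption of this sketch, not real data.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Made-up records: two ATOM lines (chains A and B) and one HETATM line.
printf 'ATOM 1 N MET A 1 27.340 24.430 2.614\n'  > mini.pdb
printf 'ATOM 2 CA MET B 1 26.266 25.413 2.842\n' >> mini.pdb
printf 'HETATM 3 O HOH A 2 10.000 11.000 12.000\n' >> mini.pdb

# Only lines starting with ATOM whose 5th column is A.
coords=$(awk '/^ATOM/ {if ($5=="A") print $7,$8,$9}' mini.pdb)

# Lines NOT containing any of the three stop codons.
printf 'x y ATG 1\nx y TAA 2\nx y TGA 3\n' > codons.txt
not_stop=$(awk '!/(TAG)|(TAA)|(TGA)/ {print $3,$4}' codons.txt)

echo "$coords | $not_stop"
```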
Awk exercise 1
1. Print lines containing m in test1.txt
2. Print lines not containing m in test1.txt
3. Print lines with A in second column in test1.txt
4. Print the third column of test2.txt
(a) Use comma as separator
(b) Use E as separator
Awk /6
awk 'BEGIN {name=""} /^>/ {name=$0; d[name]=0} !/^>/ {d[name]=d[name]+length($0)} END {for (i in d) print substr(i,2,length(i)),d[i]}'
● Uses an array d; it is like a Python dictionary
● $0 is the whole line
● substr extracts a substring; positions start from 1
● Prints a list of fasta entries and their length
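A sketch of the command on a made-up FASTA input with two entries. The final sort is an addition of this example: awk does not guarantee the order in which for (i in d) visits the keys, so sorting makes the output predictable.

```bash
# seq1 has two sequence lines (4 + 2 letters), seq2 has one (3 letters).
lengths=$(printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' |
    awk 'BEGIN {name=""}
         /^>/  {name=$0; d[name]=0}
         !/^>/ {d[name]=d[name]+length($0)}
         END   {for (i in d) print substr(i,2,length(i)),d[i]}' |
    sort)

echo "$lengths"
```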
Awk exercise 2
● Print the sum of the elements of the third column of test1.txt
● Print the average of the elements of the fourth column of test1.txt
● Take a look at data1.txt and data2.txt
– Did you just open them with an editor?
– Did you just use “cat”? Or “more”?
Awk exercise 3
● How many lines in data1.txt and data2.txt?
$ wc -l data*
2999997 data1.txt
2999999 data2.txt
● Is it true?
– data1.txt contains 2999998 lines
– data2.txt contains 3000000 lines
● They contain the same numbers, except for 2
● Which ones?
Awk /7
awk 'BEGIN {while ((getline<"patterns.txt")>0)diz[$1]=0} {if ($1 in diz) print $0}'
● Works like grep -f patterns.txt, but each pattern must be equal to the first column
● getline reads the file one line at a time
● Each line becomes a key in the array
● The input is then checked against existing keys
● For big files, it is faster than grep
– grep is O(N*M), awk is O(N+M)
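A sketch of the lookup on made-up files, next to grep -f for comparison. The identifiers are invented; the last data line is there to show that awk compares the whole first column, while grep also accepts a pattern found inside a longer word.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'P12345\nQ67890\n' > patterns.txt
printf 'P12345 kinase\nO00000 unknown\nQ67890 transporter\nP123456 other\n' > data.txt

# Load the patterns as array keys, then test the first column of each line.
awk_hits=$(awk 'BEGIN {while ((getline < "patterns.txt") > 0) diz[$1]=0}
                {if ($1 in diz) print $0}' data.txt)

# grep -f matches substrings, so P123456 is reported too.
grep_hit_count=$(grep -c -f patterns.txt data.txt)

echo "$awk_hits"
echo "grep matched $grep_hit_count lines"
```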
Awk exercise 3, solution
diff <(sort data1.txt) <(sort data2.txt)
● Diff is picky, the result is not that good
– Took 14 seconds on a test computer
grep -v -f data1.txt data2.txt
● Good luck, it may take a while
– It may freeze your computer
● Awk takes 4 seconds on a test computer
Awk vs Python
● Reading fasta, awk style
awk 'BEGIN {name=""} /^>/ {name=$0; d[name]=0} !/^>/ {d[name]=d[name]+length($0)} END {for (i in d) print substr(i,2,length(i)),d[i]}'
● Note: awk scripts can be saved to a file
● Use the -f option to call the saved file
Awk vs Python
● Reading fasta, Python style
import sys

f = open(sys.argv[1])
d = {}       # entry name -> sequence length
name = ""
for r in f:
    r = r.rstrip()
    if r.startswith('>'):
        name = r[1:]      # drop the leading '>'
        d[name] = 0
    else:
        d[name] += len(r)
f.close()
for k in d:
    print(k, d[k])