Linux for Biology DEDAN GITHAE, BIOINFORMATICIAN BECA-ILRI HUB
Importance of computers to biology
◦ Availability of vast research data shared online
◦ Automated analysis leading to the generation of massive data
◦ Interaction with other research communities and shared databases
◦ Speed and efficiency in processing, storage and data mining
Big Data: Volume, Variety, Velocity & Veracity
Volume:
◦ More content has already been generated and is available over open access
◦ More content is being generated per run as a result of technological advancement
◦ Costs get cheaper over time
Velocity:
◦ Technology is making data generation faster and more efficient
Variety:
◦ Sequences, annotation, structures, image processing
Veracity:
◦ Some ambiguities, inconsistencies, incomplete data, model approximations
Other computational tasks: analysis and interpretation
Biology activities:
◦ Prediction: functional and structural
◦ Pattern recognition: domains, homology
◦ Sequence alignments
◦ Statistical analysis
◦ Structural modelling
◦ Genetic diversity and interactions between organisms and between populations
Linux
What is Linux?
A family of
◦ free and open-source software
◦ operating system
◦ distributions built around the Linux kernel.
Ubuntu? Fedora? Mint? Debian? openSUSE?
◦ Free: anyone is freely licensed to use, copy, study, and change the software in any way.
◦ Open-source software: the source code is openly shared so that people are encouraged to voluntarily improve the design of the software.
◦ Operating system: system software that manages computer hardware and software resources and provides common services for computer programs.
◦ Distributions built around the Linux kernel: the kernel is the part of the operating system that mediates access to system resources, e.g. input/output requests from software, translating them into data-processing instructions for the central processing unit.
Kernel
Some applications to biological tasks
◦ Repetitive tasks: processing several sequences
◦ Automating analysis processes: scripts, piping to programs
◦ Text processing: regex, grep, sed
◦ Extracting fields using cut/awk
◦ We'll see more of this in the tutorial
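The text-processing tools named above can be tried on a toy file. The sketch below is only an illustration: the FASTA records, sequence IDs and file names are made up for the demo and are not part of the tutorial data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing else is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# A tiny made-up FASTA file (hypothetical IDs and sequences).
cat > demo.fasta <<'EOF'
>seq1 Drosophila melanogaster
ATGCGTACGT
>seq2 Drosophila simulans
ATGCCTACGA
>seq3 Anopheles gambiae
TTGCGTACGT
EOF

# grep: count the sequences by counting header lines.
n_seqs=$(grep -c '^>' demo.fasta)

# grep with a regex: keep only the Drosophila headers.
grep '^>.*Drosophila' demo.fasta > drosophila_headers.txt

# sed strips the leading ">" from header lines;
# cut keeps the first space-delimited field (the ID).
sed -n 's/^>//p' demo.fasta | cut -d ' ' -f 1 > ids.txt

# awk: remember the ID on header lines, then print ID and sequence length.
awk '/^>/ {id = substr($1, 2); next} {print id "\t" length($0)}' demo.fasta > lengths.tsv

echo "sequences: $n_seqs"
cat lengths.tsv
```

Each command reads plain text and writes plain text, so the steps can be chained with pipes or repeated over many files in a loop.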
The ILRI High Performance Computing (HPC) Cluster
Users log in to the HPC (the master).
To log in:
Then "jump" to the rest of the cluster (the computing servers).
To do this, type
interactive
Software: to find out whether the software and version you need are installed, type
module avail
To use a piece of software, e.g. BLAST, type
module load blast
To see which software is ready for use (loaded), type
module list
SLURM: Simple Linux Utility for Resource Management
Interactive jobs have a time limit of 8 hours. If you are running a longer job, write a batch script to schedule it.
How do we write scripts?
Writing a Slurm script
◦ To see the available options, type
sbatch -u [ man sbatch for a detailed explanation of usage ]
Example of a batch script
#!/usr/bin/env bash
#SBATCH -p batch
#SBATCH -J blastn
#SBATCH -n 4
# load the blast module
module load blast/2.6.0+
# run blastn with 4 CPU threads (cores), matching the 4 tasks requested above
blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4
To run the script, type
sbatch [ scriptname.sbatch ]
Best practice: overview
Run the job on the computing node
interactive
Make a directory in the scratch space, and "go" there
mkdir -p /var/scratch/userX ; cd $_
Create the script
Run the script
sbatch [scriptname.sbatch]
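The steps above, after the interactive session has been opened, can be put together roughly as follows. This is a sketch under assumptions, not the cluster's official procedure. `userX` is the slide's placeholder for your own username. The file name `blastn.sbatch`, the fallback to a temporary directory when `/var/scratch` is missing, and the `-num_threads 4` option (added so that blastn uses the 4 cores the script requests) are choices made for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: fall back to a temporary directory when /var/scratch is not
# available, so the sketch also runs away from the cluster.
scratch_root="/var/scratch"
if [ ! -d "$scratch_root" ] || [ ! -w "$scratch_root" ]; then
    scratch_root="$(mktemp -d)"
fi

# "$_" expands to the last argument of the previous command, so this
# creates the directory and then moves into it.
mkdir -p "$scratch_root/userX" ; cd "$_"

# Create the script. The quoted 'EOF' stops the shell from expanding
# anything inside the here-document.
cat > blastn.sbatch <<'EOF'
#!/usr/bin/env bash
#SBATCH -p batch
#SBATCH -J blastn
#SBATCH -n 4

# load the blast module
module load blast/2.6.0+

# run blastn with 4 CPU threads (cores)
blastn -query ~/data/sequences/drosoph_14_sequences.seq -db nt -num_threads 4
EOF

# Check the script for shell syntax errors without running it.
bash -n blastn.sbatch

# Submit only where Slurm is installed; elsewhere just report the location.
if command -v sbatch >/dev/null 2>&1; then
    sbatch blastn.sbatch
else
    echo "sbatch not found; script written to $PWD/blastn.sbatch"
fi
```

Working in the scratch space keeps the job's input and output on the computing node's local disk rather than in your home directory.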
Enjoy!