LOOPP

T. Galor

INDEX

The database of the program
The database of LOOPP
Inserting a new driver to LOOPP
Inserting a new model to LOOPP
The MAIN module
The OPTION module
The loopp_interf module
The ALIGN module
The THREAD module
The SEQ module
The PDB module
The MPS module
Developing a new potential (SVM, BPMPD, PCX)
TE13 module
The global variables
The parameter file
Installing LOOPP
Running LOOPP
Interpretation of LOOPP results
Reference

The database

This chapter describes the main data structures defined in LOOPP. The definitions are given in the file db.h, and the allocation and de-allocation of these structures is done in the file db.c.

The PROTEIN structure

PROTEIN
  protein function: function
  protein name: prot_name
  PDB acronym: pdb_code
  resolution (taken from the PDB file)
  number of amino acids: n_res
  amino acid vector: res
  secondary structure vector (taken from DSSP): struc_2nd
  surface exposure vector: surface
  x vector, y vector, z vector for the geometric side chain, C alpha and C beta: coordinate[3]
  structure site: struc_info
  the number of coordinates is low: corrupted
  number of good coordinates (used for training a potential): n_good_coord
  start of good coordinates (used for training): start_good_coord
  last good coordinate (used for training): end_good_coord

Structure site info (struc_info)
  contact map: CM
  identity of a THOM2 environment
  counts of the number of a THOM2 environment: counts_2nd_shell_contacts
  number of contacts greater
  number of contacts less

[Figure: example of the structure site information for the protein 1aga. The annotations of the figure read:]
  The header holds the PDB name and the number of residues.
  Position 0 has contacts with positions 55 and 57.
  Position 9 has a contact with position 32 and with a previous residue (4).
  Every site has information on its structural sites.
  For site 13, the number of contacts greater is 23 and the number of contacts less is 23.

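The protein: coordinates

[Figure: the coordinate vector of a protein has three entries: geometric side chain, C alpha and C beta. Each loaded entry points to its x, y and z vectors; an entry that is not loaded is NULL.]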

The options geometric_chain, C_alpha and C_beta define which of the coordinate sets are loaded into memory. The vectors are allocated in the file db.c: the yellow vector in the picture is allocated by zalloc_coord(), and the red vectors are allocated by the routine init_coord().

The coordinates are read into memory by read_xyz_loopp_format(), defined in the file loop_interf.c.

By default LOOPP reads the geometric side chain.

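Contact Map CM

[Figure: the contact map holds one entry per site (Site 1 ... Site n). Each entry points to a vector of size MAX_CONTACT with the indices of the sites in contact, for example 3 4 79 or 5 6 7 105; an entry without contacts is NULL.]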

The contact map vector (the red vector in the picture) is generated during the allocation of the protein in alloc_info() if the option compute_CM is set. Its size equals the number of residues in the protein.

The yellow vectors are allocated in Get_CM_for_a_prot(), defined in the file cm.c. That routine reads the CM file if it exists; otherwise it generates the file and then loads the contact map into memory.

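First shell neighbor g and first shell neighbor l

[Figure: two vectors, first shell neighbor g and first shell neighbor l, with one integer entry per site (Site 1 ... Site 4).]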

The first shell neighbor g/l vectors contain, for each site, the number of contacts with sites whose index is greater/less than the index of the site, respectively.
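The counting described above can be sketched as follows. This is a minimal illustration, not the code of cm.c: the function name, the -1 terminator of each contact list and the convention that a site's list holds all of its contacts are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define END_OF_CONTACTS (-1)   /* assumed list terminator, not taken from db.h */

/* Count, for every site, how many of its contacts have a larger index
 * (first_shell_g) and how many have a smaller index (first_shell_l).
 * cm[i] is the contact list of site i, or NULL when the site has none. */
void count_first_shell(int **cm, int n_res,
                       int *first_shell_g, int *first_shell_l)
{
    assert(n_res >= 0);
    for (int i = 0; i < n_res; i++) {
        first_shell_g[i] = 0;
        first_shell_l[i] = 0;
        if (cm[i] == NULL)
            continue;                      /* site without contacts */
        for (int k = 0; cm[i][k] != END_OF_CONTACTS; k++) {
            if (cm[i][k] > i)
                first_shell_g[i]++;
            else if (cm[i][k] < i)
                first_shell_l[i]++;
        }
    }
}
```

A site whose list is {1, 2, -1} at index 0 would get g = 2 and l = 0 under these assumptions.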

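Id_2nd_shell_contact

[Figure: one vector per site (Site 1 ... Site n). In the example the vector of Site 1 holds 2 3 -1; a site without contacts holds NULL.]

Each cell contains the type of a structural site in contact with site 1. For example, THOM2 has 16 different types of structural sites, numbered from 0 to 15.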

Count_2nd_shell_contact

[Figure: one vector per site (Site 1 ... Site n). In the example the vector of Site 1 holds 2 1 -1; a site without contacts holds NULL.]

Each cell contains the multiplicity of the corresponding structural site in the vector Id_2nd_shell_contact. In the example there are 2 contacts of type 2 and 1 contact of type 3 in contact with site 1.

The red vectors are allocated during alloc_info() if the option read_CM is set. The yellow vectors are generated with get_thom2_env_per_site(), defined in the file env.c. One can also imagine a structural site definition different from that of THOM2.
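The relation between the two vectors can be illustrated with a short sketch. It is not the code of env.c; the function name, the input layout (one site type per neighbour) and the ordering of the output by type are assumptions made for the example.

```c
#include <assert.h>

#define N_SITE_TYPES 16   /* THOM2: site types numbered 0..15 */
#define END_MARK (-1)     /* closes both output vectors, as in the figures */

/* types: structural-site type of each neighbour of one site (n entries).
 * id_out / count_out receive the distinct types and their multiplicities,
 * each closed with END_MARK; both need room for N_SITE_TYPES + 1 ints.
 * Returns the number of distinct types found. */
int summarize_site_types(const int *types, int n, int *id_out, int *count_out)
{
    int histogram[N_SITE_TYPES] = {0};
    int n_distinct = 0;

    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        assert(types[k] >= 0 && types[k] < N_SITE_TYPES);
        histogram[types[k]]++;
    }
    for (int t = 0; t < N_SITE_TYPES; t++) {
        if (histogram[t] > 0) {
            id_out[n_distinct] = t;
            count_out[n_distinct] = histogram[t];
            n_distinct++;
        }
    }
    id_out[n_distinct] = END_MARK;
    count_out[n_distinct] = END_MARK;
    return n_distinct;
}
```

For neighbours of types {2, 3, 2} the sketch produces the id vector 2 3 -1 and the count vector 2 1 -1, which matches the example in the text.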

The model

A model is a set of information that describes the rules for calculating the structural environment of a protein site, the cost of an alignment, and the constraints.

MODEL
  energy model
  train info
  number of models
  model type
  number of variables to train
  contact radius: r_min, r_max

ENERGY MODEL
  model type
  scale the cost with: unit_conversion
  cost matrix index: base_score
  calculate the gap cost per site: get_gap_per_site
  calculate the position of the feature in env: res2pos
  cost matrix
  alphabet size
  flag for a post-alignment energy: post_ene
  compute the cost of a gap for a feature: get_gap_cost
  calculate the structural site: get_contact_type_for_multiple_env_per_site
  compute the cost of (feature, feature): get_energy_cost
  potential file name: f_potn

There might be more than one energy model per model. In this case we have a mixed model.

HP

Alphabet_HP[2]={HYD,POL}

Model HP_M

M_env_HP[2]={15,15}

Base_score[2]={0,15}

db.c: alphabet={ALA,ARG,ASN,ASP,CYS,GLN,GLU,GLY,HIS,ILE,LEU,LYS,MET,PHE,PRO,SER,THR,TRP,TYR,VAL,GAP,GINS,GDEL,HYD,POL,GLX,ASX,CHG,CHN,CST,HST,USR1,ACE,MSE,UNK}

Energy model: the cost matrix

Cost matrix (part of the Energy Model)
  matrix
  dim1, dim2, dim3
  symmetric
  with_gap_score

[Figure: example matrix with the values 0.5, 0.65, 0.1 and 0.2.]

In this example the matrix is a 2 by 2 array that contains the potential of the current energy model:

dim1=2; dim2=2; dim3=0; symmetric=NO; with_gap_score=NO

A value in the matrix is accessed using the macro INDEX_POTEN, defined in the file db.h:

index = INDEX_POTEN[res, env_x, base_score] = base_score[res] + env_x
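The index rule can be illustrated with a small sketch. The macro body follows the formula in the text, but the actual definition in db.h may differ; the helper build_base_score() and the reading of base_score as a running sum of the per-residue environment counts are assumptions, chosen because they reproduce the HP example (M_env = {15, 15}, Base_score = {0, 15}).

```c
#include <assert.h>

/* Sketch of the lookup rule; the real macro lives in db.h and may differ. */
#define INDEX_POTEN(res, env_x, base_score) ((base_score)[(res)] + (env_x))

/* Hypothetical helper: base_score[r] is the number of environments of all
 * residues before r, so each residue owns a contiguous block of the table. */
void build_base_score(const int *m_env, int n_alphabet, int *base_score)
{
    int offset = 0;

    assert(n_alphabet >= 0);
    for (int r = 0; r < n_alphabet; r++) {
        base_score[r] = offset;
        offset += m_env[r];
    }
}
```

With the HP values, residue 1 (POL) in environment 3 would be stored at position 15 + 3 = 18 of the flat potential vector.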

The cost matrix

The program stores the values of the potential in the energy model during the call of the routine set_****_attributes(…).

The potential values are read by the routine read_scoring_matrixes(…).

TRAIN_INFO (the train info part of the MODEL)
  include the model for training: include_model
  a vector of flags indicating which alphabet is trained: to_train
  calculate the constraint coefficients per site: get_constraints_coef_of_site
  calculate a constraint: define_ineq
  convert a feature to an environment name: dig2env
  score matrix: cost

More than one model might be trained simultaneously.

The alignment

ALIGNMENT

Alignment input
  protein column: prot_col
  protein row: prot_row
  use loopp index: use_loop_indx
  pdb2loopp index 1, pdb2loopp index 2
  compute Zscore
  alignment type: global/local
  alignment id: thread/seq/struc/...

Alignment assessment
  Zscore
  average energy: score
  energy: ene
  post ene, post score, post zscore
  #ins, #del, #match, identity
  hydrophobic, polarity, charge
  number of gap segments
  number of mismatches
  rms

ALIGN TRACE
  align length: align_len
  begin 1, begin 2

M D M I

* * * a - - h g r

* * * - w w h g a

A local alignment can start at a different location in each of the two proteins.

M = match, D = delete, I = insert

mvs

The database

DATABASE
  vector of PROTEIN: prot
  number of proteins: n_prot
  MODEL
  PDB list of names: list_name

Database LIST
  number of proteins: n_prot
  current protein index: i_prot
  f_xyz, F_xyz
  f_seq, F_seq
  f_not_used, F_not_used
  f_coord_missing, F_coord_missing
  f_low_complexity, F_low_complexity
  f_pdb_membrane, F_pdb_membrane
  model

The database is used when all proteins are stored in memory.

The database list is used when only one protein at a time is stored in memory. F_xxxx stands for the pointer to the file and f_xxxx stands for the file name.

The database is allocated in the file db.c with zalloc_db(), and it is read into memory with the routine Build_protein_db_from_file(), defined in loop_interf.c. The proteins are read from a file containing a list of PDB names, including chains.

The database list is initialized with Init_read_db(), and each new protein is read into memory with read_nxt_prot(). After all proteins are processed, the list is cleaned with the routine finish_reading_db().

The decoy

DECOY
  PROT1
  PROT2
  alignment method: method
  ene_RHS
  ene_LHS
  coefficient
  decoy name: decoy

A decoy is a set of two proteins and their alignment method.

The alignment can be an identity alignment of SN into XN, a threading alignment of SN into XD, or a sequence alignment.

The alignment energy

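We calculate the total LHS energy, the RHS energy and the coefficient vector, given an initial guess for the score. The coefficient vector C counts the number of times an amino acid a_i is assigned to a structural site x_j.

\[
E(S, X) = \sum_{i=1}^{\mathrm{align\_len}} \epsilon(a_i, x_i)
        = \sum_{i=1}^{\mathrm{alphabet}} \sum_{j=1}^{\mathrm{structural\ site}} n(a_i, x_j)\, \epsilon(a_i, x_j)
        = C \cdot P = C_f \cdot P_f + C_{nf} \cdot P_{nf}
\]

Here n(a_i, x_j) is the count stored in C and P is the potential. The frozen part (f) goes to the RHS and the part that is not frozen (nf) goes to the LHS. The cost of a single assignment is written \epsilon here.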
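The bookkeeping behind the coefficient vector can be sketched as below. This is an illustration under stated assumptions, not LOOPP code: the function names, the row-major layout of the counters and the use of doubles are all hypothetical.

```c
#include <assert.h>

/* Accumulate the coefficient vector C for one alignment.
 * res_pos[i] : position of aligned residue i in the alphabet
 * site[i]    : structural site assigned to residue i
 * coef       : n_alphabet * n_site counters, row-major, zeroed here */
void count_assignments(const int *res_pos, const int *site, int align_len,
                       int n_alphabet, int n_site, double *coef)
{
    assert(align_len >= 0 && n_alphabet > 0 && n_site > 0);
    for (int k = 0; k < n_alphabet * n_site; k++)
        coef[k] = 0.0;
    for (int i = 0; i < align_len; i++) {
        assert(res_pos[i] >= 0 && res_pos[i] < n_alphabet);
        assert(site[i] >= 0 && site[i] < n_site);
        coef[res_pos[i] * n_site + site[i]] += 1.0;
    }
}

/* Energy as the dot product of the coefficients and the potential. */
double dot_energy(const double *coef, const double *potential, int n)
{
    double e = 0.0;

    for (int k = 0; k < n; k++)
        e += coef[k] * potential[k];
    return e;
}
```

Splitting the index range into a frozen and a non-frozen part and calling dot_energy() on each part would give the RHS and LHS contributions separately.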

The constraint to train

DEF_EQ
  decoy 1 = pseudo prot1, decoy 2 = pseudo prot2
  coefficient 1, coefficient 2
  LHS_eq, RHS_eq
  Log: ene, dist
  norm1, norm2
  contact / prot length 1, contact / prot length 2

A pseudo protein is defined by a decoy, where a decoy is a set of two proteins and their alignment.

The equation is defined as the information of decoy 1 subtracted from that of decoy 2.

LOOPP outputs three files for training: the RHS file, the LHS file and the Log file.

In the Log file we save the norms of the two coefficient vectors, the distance and the energy.

In the LHS file we save the left-hand side of a constraint.

In the RHS file we save the right-hand side value.

The Database of LOOPP.

LOOPP has a set of about 3888 proteins that span the known folds of the PDB. The folds are 6 Å apart, as found by the LOOPP v1 structural alignment, and are updated from time to time using the CE program. The database is stored at H:\\CBSU\LOOPP\DB\DB_jm on the theory center cluster. The list of the proteins is given in H:\\CBSU\LOOPP\LIST\jm_list.

The database so far holds four types of data. Each file starts with a header containing the name and the chain of the protein together with the number of residues.

****.seq contains the list of the amino acids.
****.xyz contains the coordinates in 9 columns. The first three columns are the (x,y,z) of the geometric side chain, the next triplet is the C alpha coordinates and the last triplet is the C beta coordinates. Missing coordinates are designated by 999.9.
****.2nd contains the secondary structure produced by the DSSP program, in 5 columns. The first column has the name of the amino acid; the second column contains the secondary structure: A for alpha helix, B for beta sheet and X for the others. The last three columns are the dihedral angles. The number 3600 is used for an unknown angle.
****.surf contains the surface exposure.
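As an illustration of the 9-column coordinate layout, the sketch below parses one line and tests a triplet for the 999.9 marker. It is not the reader in loop_interf.c; the struct, the function names and the whitespace-separated line format are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>

#define MISSING_COORD 999.9   /* marker for a missing coordinate */

typedef struct {              /* hypothetical record, not a LOOPP type */
    double side_chain[3];     /* geometric side chain (x, y, z) */
    double c_alpha[3];
    double c_beta[3];
} XYZ_RECORD;

/* Parse one 9-column line; returns 1 on success and 0 otherwise. */
int parse_xyz_line(const char *line, XYZ_RECORD *rec)
{
    assert(line != NULL && rec != NULL);
    int n = sscanf(line, "%lf %lf %lf %lf %lf %lf %lf %lf %lf",
                   &rec->side_chain[0], &rec->side_chain[1], &rec->side_chain[2],
                   &rec->c_alpha[0], &rec->c_alpha[1], &rec->c_alpha[2],
                   &rec->c_beta[0], &rec->c_beta[1], &rec->c_beta[2]);
    return n == 9;
}

/* A triplet counts as missing when any component carries the marker. */
int is_missing(const double xyz[3])
{
    for (int k = 0; k < 3; k++)
        if (xyz[k] > MISSING_COORD - 0.05 && xyz[k] < MISSING_COORD + 0.05)
            return 1;
    return 0;
}
```

The tolerance of 0.05 around the marker is an arbitrary choice for the sketch; it only guards against formatting differences in the text file.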

Updating the database

The database is updated using the Perl script DB.pl found in H:\users\galor\loopp\perl. To run the script, the user has to set some of the parameters in the Perl script.

Inserting a new driver to LOOPP

[Figure: the components of a driver drv_xxxx(option*, input*, output), in order:]
  Set option
  Set model
  Open output
  Read database to memory
  Read list to memory
  Do xxxx
  Clean

In this section we explain how to insert a new driver in LOOPP. A driver is a function that the user can choose from the startup menu; an example of a driver is threading a list of sequences against the database. As the figure above shows, a driver consists of several components. We start with the first component, set option.

Default options are set at the beginning of the LOOPP program in main.c. Some of these options are set according to the choice of the user of the program, and the programmer sets the rest. Some of the options are driver dependent and are set by the programmer in the driver. Let us return to our example of threading:

op->threadAlignment = YES;
op->compute_CM_TE13 = NO;

These options mean that the alignment type is threading and that the contact map (CM) of TE13 is not computed.

The next step is to set the model, which defines the energy function for LOOPP. In the first example the model is set according to the user's choice; in the second example it is set by the programmer:

model = get_model(op->header.model_type, op->header.potential_type,
                  op->header.alphabet_name, train, op);    /* user choice */

OR

model = get_model(model_type, potential_type, alphabet_name,
                  train, op);                              /* programmer choice */

The programmer can set the model type, the potential type and the alphabet of the model. The programmer can also decide whether the model is to be trained; when the variable train is set to YES, space is allocated for the training information.

Next, we read the database of structures into the memory of the program. To this end, we allocate the space with alloc_db(). We attach the model to the database and, with set_option_struc(), set the options so that the program reads all files connected to structures. Next we prepare to load only a portion of the database in case LOOPP is run on several processors: a subset list of structures is created with take_portion_of_db_based_on_number_of_processes(op, io_in). Finally, we build the database of structures with the routine build_protein_db_from_pdb_list(db_structure, io_in, io_out, op):

db_structure = alloc_db();
db_structure->model = model;
set_option_struc(op,model);
take_portion_of_db_based_on_number_of_processes(op,io_in);
if (op->header.prot_DB_type == LOOP)
    build_protein_db_from_pdb_list(db_structure,io_in,io_out,op);
else {
    fprintf(F_stdout,"WARNING: loopp accept only loopp format. Please modify header file\n");
    return;
}

Next, we load the list of sequences into the memory in the same manner.

db_sequence = alloc_db();
db_sequence->model = model;
set_option_seq(op,model);
strcpy(io_in->f_current_list,op->list_pdbs_file);
if (op->header.prot_SEQ_type == LOOP)
    build_protein_db_from_pdb_list(db_sequence,io_in,io_out,op);
else if (op->header.prot_SEQ_type == FASTA)
    read_seq_list_fasta2loop_format(db_sequence,io_in,io_out,op);

We again allocate memory for the database and assign it to the variable db_sequence, and we attach the appropriate model to db_sequence. We then tell the program to load only the information relevant to sequences with set_option_seq(op, model). Next, we copy the name of the sequence list defined by the user in op->list_pdbs_file into the variable io_in->f_current_list. Finally, the sequences are read in LOOPP format or FASTA format according to the user setting.

The database of structures can be divided among several processors, which is the purpose of take_portion_of_db_based_on_number_of_processes().

Inserting a new model to LOOPP

We start with the smallest component of a MODEL, the ENERGY_MODEL_TEMPLET. The energy template contains the definition of the protein and the operations on it. In addition it contains the cost function and its parameters for calculating an alignment of two proteins belonging to the same model.

The name of the model is stored in the variable model_type.

The definition of a protein is given by its list of residues, the ALPHABET, and by its structural sites, *env. An example of a valid alphabet is alphabet_20_ins_del, which has the twenty usual amino acid types and two gaps, namely insertion and deletion. The size of the alphabet is stored in n_alphabet.

ALPHABET alphabet_20_ins_del[22] = {ALA,ARG,ASN,ASP,CYS,GLN,GLU,GLY,HIS,ILE,LEU,LYS,MET,PHE,PRO,SER,THR, TRP,TYR,VAL,GINS,GDEL};

The vector *env holds the number of different environments per site. In the example below, each amino acid has SEQZ sites and each gap has THOM2Z1 sites. In this particular model gaps are treated differently than amino acids: gaps are assigned to THOM2 structural sites.

static int m_env_seq_with_thom2_ins_del[22] ={SEQZ,SEQZ,SEQZ,SEQZ,SEQZ, SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,SEQZ,THOM2Z1,THOM2Z1};

Next we list the operations that can be used on a protein. These are routines that must be programmed for every new model in order for it to function smoothly in LOOPP:

• get_gap_per_site(): If the gap depends on the structural site in a particular model, the total gap cost for each site can be computed a priori. This means that the gap costs are computed automatically at the time the protein features are loaded into memory.

• copy_struc_feature(): Every new protein feature besides the protein coordinates or residues has to have a copy routine for that feature. Examples are secondary structure, surface exposure, or any new feature that will be added in the future.

• res2pos(): Converts a residue number to its position in the ALPHABET vector.

• get_contact_types_for_multiple_env_per_site(): Computes the number of contacts in the case of multiple environments per site. So far it has been used only for the THOM2 structural site.

• std_residue(): Checks whether an amino acid name is standard for that model.

Recall that ENERGY_MODEL_TEMPLET also contains the cost function and its parameters. The parameters are stored in the variable cost and are loaded, at the time set_option_align() is called, from the file whose name is stored in f_potn. The cost function is divided into two parts, one for the amino acids and one for the gaps. These routines must also be written for any new model introduced to LOOPP.

• get_energy_cost(): Calculates the energy for assigning an amino acid to a structural site.

• get_gap_cost(): Calculates the energy for assigning a gap to a structural site.

typedef struct ENERGY_MODEL_TEMPLET{
    MODEL_TYPE model_type;
    float unit_conversion;
    float potn_scale;
    int n_alphabet;
    int n_env;
    int *env;
    int *base_score;
    int post_ene;
    ALPHABET *pos2res;
    char f_potn[MAXS];
    SCORE_MATRIX cost;
    void (*get_gap_per_site) (PROTEIN *prot,struct ENERGY_MODEL_TEMPLET *m, int );
    void (*copy_struc_feature) (PROTEIN *to_prot, PROTEIN *from_prot, int *to, int *from,int n);
    int (*res2pos) (ALPHABET res);
    void (*get_contact_types_for_multiple_env_per_site) (PROTEIN *prot);
    int (*std_residue) (ALPHABET res);
    float (*get_energy_cost) (OPTION *op,struct ENERGY_MODEL_TEMPLET *m,PROTEIN *prot1,
                              PROTEIN *prot2,int pos1, int pos2);
    void (*get_gap_cost) (OPTION *op,struct ENERGY_MODEL_TEMPLET *m, PROTEIN *prot1, int pos1,
                          PROTEIN *prot2,int pos2, float* ins, float* del );
} ENERGY_MODEL;

As an example of a new model, here is a sequence model in which the gap depends on the THOM2 structural site.

void set_model_seq_with_thom2_ins_del(ENERGY_MODEL *m,OPTION *op){
    m->model_type = SEQ_M;
    m->n_alphabet = 22;
    m->n_env = THOM2Z;                 // used for computation of gaps
    m->env = m_env_seq_with_thom2_ins_del;
    m->pos2res = alphabet_20_ins_del;
    m->cost.symmetric = YES;
    m->cost.with_gap_score = YES;
    m->post_ene = NO;                  // YES for pairwise potential
    m->potn_scale = op->Dpotn_scale;
    m->get_gap_per_site = &get_thom2_ins_del_per_site;
    m->copy_struc_feature = NULL;
    m->res2pos = &res2pos_20_indel;
    m->std_residue = &std_residue_20_ins_del;
    m->get_contact_types_for_multiple_env_per_site = &get_thom2_env_per_site;
    m->get_energy_cost = &seq_alignment_scoring_energy;
    m->get_gap_cost = &seq_alignment_with_thom2_ins_del_penalty;
    if (op->NDalignment) strcpy(m->f_potn,op->NDpotn);
    if (op->NHalignment) strcpy(m->f_potn,op->NHpotn);
}

Some of the models in the file model.c are experimental and should be used with caution. Below is the list of available models for LOOPP:

TE13: set_model_te13_regular20()
PDB: set_model_clean_pdb()
SEQ: set_model_seq_alignment()
THOM2: set_model_thom2_regular_20_gap()
Secondary structure: set_model_2nd_struc()
Surface exposure: set_model_surf_regular_20()

A model can be a mix of several models. As an example we use the mix model of OT:

.....
else if (model_type == SEQ_THOM2 && alphabet == REGULAR_20_GAP){
    n_models = 2;                             // number of mixed models is two
    model = alloc_model(op,n_models,0);       // allocation of space
    m = model->ene_model[0];
    set_model_seq_alignment(m,op);            // first model is sequence
    m->unit_conversion = op->lambda;          // set mixing parameter to scale the cost matrix
    m = model->ene_model[1];
    set_model_thom2_regular_20_gap(m,op);     // second model is THOM2
    m->unit_conversion = 1.0 - op->lambda;    // set mixing parameter to scale the cost matrix
}
.....

This section is plugged in the file model.c in set_model ().

The main module

The main routine has the following functions:

Decipher the command line for LOOPP:
  Loopp.exe                                        Interactive mode
  Loopp.exe x.x loopp.par                          Batch mode
  Loopp.exe x.x loopp.par #proc proc_Id proc_Id    Batch mode, multiple processors

Set the options chosen by the user with the function set_option().

Print interactively the available command options, each with a short explanation.

Call the driver that corresponds to the command option.

Print the end message of LOOPP.

The option module

set_option(): reads loopp.par and sets the values of the structure OPTION. Sets the pointer F_stdout (a global variable) for redirecting the output to the screen or to an output file.

set_option_seq(): sets the options before reading sequence information.

set_option_struc(): sets the options before reading the structural information of a protein.

set_option():

Parses the parameter file loopp.par. Every line in loopp.par starts with a pound sign (#) for a comment or with an at sign (@) for an option definition.

#comment                  comment line

@USR_PARAMETER value      option definition

The same option definition can appear several times in the file loopp.par with different values; only the last definition counts.

Adding a new option to LOOPP:

Add the appropriate new option field to the structure OPTION in the file db.h.

Add to set_option() the lines that parse the new option.

As an example we add a new option field called parameter, which accepts a real-valued number:

if (strcmp("USR_PARAMETER",operator) == EQ){

sscanf(line,"%s%s%f",crd_opening,operator,&fval);

fprintf(F_stdout,"%s\t\t\t%f\n",operator,fval);

op->parameter = fval;

}
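The parsing rule for loopp.par (comment lines, option lines, last definition wins) can be tried out in isolation with the sketch below. It is a simplified stand-in for set_option(), not its actual code: DEMO_OPTION and apply_par_line() are hypothetical names, and the sketch accepts the option name either directly after the @ or separated from it by blanks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {          /* hypothetical stand-in for the OPTION structure */
    float parameter;
    int seen;             /* 1 once USR_PARAMETER has been defined */
} DEMO_OPTION;

/* Apply one line of a loopp.par-style file.
 * Returns 1 if the line set the option and 0 otherwise. */
int apply_par_line(const char *line, DEMO_OPTION *op)
{
    char name[64];
    float fval;

    assert(line != NULL && op != NULL);
    if (line[0] != '@')
        return 0;                     /* '#' comments and other lines are skipped */
    if (sscanf(line + 1, "%63s %f", name, &fval) != 2)
        return 0;                     /* malformed option line */
    if (strcmp(name, "USR_PARAMETER") != 0)
        return 0;                     /* an option this sketch does not know */
    op->parameter = fval;             /* a later definition overwrites an earlier one */
    op->seen = 1;
    return 1;
}
```

Feeding the lines of a file to apply_par_line() in order leaves the value of the last @USR_PARAMETER line in op->parameter.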

The module loop_interf

The major task of this module is to load protein information into the memory of the program.

build_protein_db_from_file(): Builds the protein database from the old loopp format.

read_a_pdb_in_loop_format(): Stores protein information given in the new loopp format.

get_prot_name(), read_header_loop_format(), read_log_loop_format(), read_seq_loop_format(), read_xyz_loop_format(), read_surf_loop_format(), read_2ndstruc_loop_format(): Read the different files of loopp.

build_protein_db_from_pdb_list() calls:

rm_path(), get_prot_len(), read_a_pdb_in_loop_format(), get_db_TE13_CM_from_pdb_list(), get_gap_per_site_for_db(), get_list_env_per_site_for_db()

[Figure: call tree of build_protein_db_from_pdb_list(). It calls rm_path(), get_prot_name(), get_prot_len(), read_a_pdb_in_loopp_format(), get_db_TE13_CM_from_pdb_list(), get_gap_per_site_for_db() and get_list_env_per_site(); read_a_pdb_in_loopp_format() in turn calls read_header(), read_log_loop_format(), read_seq_loop_format(), read_xyz_loop_format(), read_surf_loop_format() and read_2ndstruc_loop_format().]

check_if_missing_coord():

Computes the percentage of missing coordinates. If the percentage is greater than the threshold set by op->check_percent_missing, the protein is diagnosed as corrupted and is not loaded into memory.

Computes the size and the edge indices of the reliable chunk of a protein. Usually both edges of the protein contain many missing coordinates. These edges are trimmed and are not used for training a new potential.
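The two computations described above can be sketched as follows. This is not the code of check_if_missing_coord(); the function names, the one-coordinate-per-residue input and the tolerance around the 999.9 marker are assumptions made for the example.

```c
#include <assert.h>

#define MISSING_COORD 999.9   /* marker used in the ****.xyz files */

static int coord_is_missing(double x)
{
    return x > MISSING_COORD - 0.05 && x < MISSING_COORD + 0.05;
}

/* x holds one coordinate per residue. Returns the percentage of missing
 * entries and stores the index of the first and of the last residue with a
 * valid coordinate (both -1 when every entry is missing). */
double scan_missing(const double *x, int n_res, int *start_good, int *end_good)
{
    int n_missing = 0;

    assert(n_res > 0);
    *start_good = -1;
    *end_good = -1;
    for (int i = 0; i < n_res; i++) {
        if (coord_is_missing(x[i])) {
            n_missing++;
        } else {
            if (*start_good < 0)
                *start_good = i;      /* first valid residue */
            *end_good = i;            /* last valid residue seen so far */
        }
    }
    return 100.0 * n_missing / n_res;
}
```

A caller would compare the returned percentage with the threshold and use the two indices as the trimmed range for training.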

Convert old loopp format to new loopp format, printing routine:

prn_db_in_loop_format(), prn_db_in_old_loop_format(), drv_transform_nloopp_to_oloopp_format(), drv_get_list_from_old_loop_format().

define_a_sublist():

Used when LOOPP is run on several processors. This routine calculates the portion of the database list that a specific processor loads into memory.
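One common way to split a list among processes is shown below. It is a sketch of the idea only; how define_a_sublist() actually divides the list is not stated in this document, and the function name, the 0-based process id and the half-open range are assumptions.

```c
#include <assert.h>

/* Split n_items list entries over n_proc processes as evenly as possible.
 * Process proc_id (0-based) gets the half-open range [*first, *last). */
void sublist_range(int n_items, int n_proc, int proc_id, int *first, int *last)
{
    assert(n_items >= 0 && n_proc > 0 && proc_id >= 0 && proc_id < n_proc);
    int base = n_items / n_proc;      /* entries every process receives */
    int extra = n_items % n_proc;     /* the first `extra` processes get one more */

    *first = proc_id * base + (proc_id < extra ? proc_id : extra);
    *last = *first + base + (proc_id < extra ? 1 : 0);
}
```

With 10 proteins and 3 processes the ranges are [0,4), [4,7) and [7,10), so every protein is loaded by exactly one process.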

FASTA format:

read_seq_list_fasta2loop_format(), read_seq_prot_fasta2loop_format()

Loading one protein at a time from the loopp database when memory is insufficient:

init_read_db(), read_nxt_prot(), finish_reading_db(), LM_read_a_pdb_in_loop_format(), LM_read_xyz_loop_format()

The align module

How to use the align module

Set the alignment attributes (*** can be seq/thread/...):

set_***_attributes(&align,model,op,io_in,mvs,indx1,indx2); len_list=0;

Generate an alignment list:

align.input->prot_col = prot2 = db_sequence->prot[seq_j];
align.input->prot_row = prot1 = db_structure->prot[seq_i];
align(&align,model,op);
add_alignment_to_list(&align,alignment_list,&len_list,op);

Print the *.stat and *.info files according to the options set by the user. The alignments are first ranked according to energy; then the Z score is computed and they are ranked according to Z score:

summarize_best_align_accord_to_ene_and_compute_zscore(prot2,alignment_list,len_list,model,op,io_out)

Clean the alignment list:

clean_align_list(alignment_list, len_list); len_list=0;

The dynamic matrix

The dynamic matrix is allocated dynamically. Its size depends on the sizes of the query and of the structure. LOOPP has a local and a global algorithm, both implemented in align.c.

[Figure: the dynamic matrix. Prot_col (structural sites x1 x3 x2 x1 x2) labels the columns and Prot_row (residues ala val pro cys hys arg val) labels the rows. Align: Prot_col ------- Prot_row]

The dynamic matrix is computed with the following routines: scoring_energy and gap_energy.

As the code below shows, the cost is the sum over all existing energy models that are not post-energy models (TE13 is considered a post-energy model).

float scoring_energy(OPTION *op, MODEL *model, PROTEIN *prot_col, PROTEIN *prot_row, int pos1, int pos2 ){

int k;

float ret_score = 0.0;

ENERGY_MODEL *m;

for (k=0; k<model->n_ene_models;k++){

m = &model->ene_models[k];

if (m->get_energy_cost != NULL && !m->post_ene){

ret_score += m->unit_conversion * m->get_energy_cost(op,m,prot_col,prot_row,pos1,pos2);

}

}

return(ret_score);

}
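To show how a per-cell cost such as scoring_energy() feeds the dynamic matrix, here is a minimal global alignment sketch. It is not the algorithm of align.c: the constant gap penalty, the convention that a lower energy is better, the callback type PAIR_COST and the toy cost identity_cost() are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback: cost of matching row site i with column site j. */
typedef double (*PAIR_COST)(int row, int col);

/* Toy cost for experiments: identical indices are favourable. */
double identity_cost(int row, int col)
{
    return row == col ? -1.0 : 1.0;
}

static double min3(double a, double b, double c)
{
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fill the (n_row+1) x (n_col+1) dynamic matrix and return the energy of
 * the best global alignment; gap is the penalty per inserted/deleted site. */
double global_align_energy(int n_row, int n_col, PAIR_COST cost, double gap)
{
    assert(n_row >= 0 && n_col >= 0 && cost != NULL);
    int w = n_col + 1;
    double *t = malloc((size_t)(n_row + 1) * (size_t)w * sizeof *t);
    if (t == NULL)
        abort();

    for (int i = 0; i <= n_row; i++)
        t[i * w] = i * gap;                 /* first column: only gaps */
    for (int j = 0; j <= n_col; j++)
        t[j] = j * gap;                     /* first row: only gaps */

    for (int i = 1; i <= n_row; i++)
        for (int j = 1; j <= n_col; j++)
            t[i * w + j] = min3(t[(i - 1) * w + (j - 1)] + cost(i - 1, j - 1),
                                t[(i - 1) * w + j] + gap,
                                t[i * w + (j - 1)] + gap);

    double best = t[n_row * w + n_col];
    free(t);
    return best;
}
```

Recording which of the three terms was chosen in each cell would give the M/D/I trace that the alignment structure stores.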

There are two routines for debugging the dynamic matrix. The first one prints the dynamic table to the screen. The size of the window is given by the last four parameters. The call must be inserted in local_align or global_align before the routine exits.

The second routine can be called after align(..) has been called, to see the energy along the alignment path.

DEBUG(1, dbg_align_window(seq1_length,seq2_length,S,T,align_info->trace ,0,prot1,prot2,0,20,0,20));

DEBUG(1, dbg_align(seq1_length, seq2_length, S,T,align_info->trace,align_info->align_len));

[Figure: the trace T = M M I M M D together with the score S.]

Each line of the output lists: index, align = M/D/I, native, structure, ene, cost, count of structural site, structural site.

Align protein1.seq.1 ---> seq.2

0 M TYR GLU: ene = -1.112 score=-1.112 ( 4 4)

1 M PHE GLU: ene = -1.112 score=0.000

2 M GLN ASP: ene = -1.376 score=-0.264 ( 1 0) ( 1 1)

3 M GLY GLU: ene = -1.264 score=0.111 ( 3 4)

4 M HIS GLU: ene = -1.163 score=0.102 ( 1 0) ( 1 1)

5 M MET GLU: ene = -1.456 score=-0.294 ( 2 1)

6 M ASN PHE: ene = -0.866 score=0.591 ( 1 6) ( 4 7) ( 1 8)

Ene.dbg output file example

align: 8fab_B---->8atc_A total_ene=405.881042

align_length=310 prot2=224 prot1=310

index of window printing prot2=[214 224] prot1=[300 310]

TRACE ALIGN

D D D D D D D D D D D D D D D D D D D D D D D D D D m m m m m D D D D D D m m D D m m m m m m m m m D m m m m m D m m m m m m D m m m m D m m m m D D m m m D m m m m m D m D m m D m m m m m m D m m m m m m m m m m D m m m m m m m m D m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m D m D m D m m D m m D D m D m D m m D m m D m D m m m D m m m m m m m m m m m D D m m D D D m D D m m D m D m m m m m D m m D m m D D m m m m m m m m m m D m m m m D D m m D m D m m m m D D m m D m m m D m m m m m m m m m m m m D m D m

DYNAMIC MATRIX FOR GLOBAL ALIGNMENT

300 301 302 303 304 305 306 307 308 309 310

LEU ALA LEU VAL LEU ASN ARG ASP LEU VAL LEU

LYS 411.4 421.6 435.3 444.7 445.9 459.5 470.0 471.3 479.1 480.3 490.0

VAL 398.5 411.1 426.0 433.2 444.2 445.7 459.4 460.7 470.3 471.6 479.7

ASP 392.1 400.6 418.4 426.9 435.6 444.2 445.8 447.1 461.5 462.7 472.3

LYS 386.6 395.5 409.1 420.0 430.5 436.0 443.8 445.1 447.8 449.0 463.5

LYS 383.2 389.9 403.9 410.7 423.5 430.8 435.6 436.8 445.8 447.1 449.8

VAL 368.6 382.8 394.3 401.8 410.2 423.4 430.7 431.9 435.8 437.1 446.5

GLU 356.5 370.9 389.9 395.5 404.3 410.6 423.5 424.8 432.6 433.8 437.9

PRO 348.7 358.3 377.7 391.0 397.6 404.7 411.0 412.3 425.8 427.0 434.4

LYS 351.4 352.0 366.7 379.3 394.6 397.9 404.2 405.5 413.0 414.2 427.8

SER 340.3 352.7 358.3 367.6 381.0 394.8 398.1 399.4 406.5 407.8 415.6

CYS 324.1 338.7 355.9 355.6 366.1 379.4 393.9 395.1 396.9 398.2 405.9

An example for the dynamic matrix

if (input->compute_zscore == YES){

srand(RANDOM_SEED);

shuffled_prot = alloc_prot(op,MAX_SEQ);

for ( k=0; k<n_rnd_alignments; k++ ){

shuffle_sequence(prot2,shuffled_prot);

rnd_input->prot_row = prot1;

rnd_input->prot_col = shuffled_prot ;

if (align_data->input.alignment_type == GLOBAL){

if (op->strucAlignment) rnd_ene = struc_global_align(&rnd_align,do_trace_back,op,model);

else rnd_ene = global_align(&rnd_align,do_trace_back,op,model);

}

else if (align_data->input.alignment_type == LOCAL)

rnd_ene= local_align(&rnd_align,do_trace_back,op,model);

sumT += rnd_ene;

sumT2 += rnd_ene*rnd_ene;

}

Computing the Z score

The code above performs the following steps:

Add the proteins to the align.input structure.

Shuffle the sequence residues of prot2.

Compute the random energy of aligning the shuffled sequence into prot1.

Compute the average energy of aligning the random sequences into prot1.

The Z score measures the homology of prot2 to prot1 with respect to random noise.

The Z score is computed in the routine align(…) in the file align.c:

avT = sumT/n_rnd_alignments;

avT2 = sumT2/n_rnd_alignments;

norm = fabs(avT2 - avT*avT);

align_data->assess.score = avT;

if (norm == 0) align_data->assess.zscore = -999.9;

else align_data->assess.zscore = -(align_data->assess.ene - avT)/sqrt(norm);

if (align_data->assess.zscore < -999.9) align_data->assess.zscore = -999.9;

Printing the statistics of aligning the query to the LOOPP database

summarize_best_align_accord_to_ene_and_compute_zscore() performs the following steps, in this order:

Print the header for *.stat.
set_***_attributes()
Rank according to energy with rank_according_to(ENERGY,alignment_list,len_list,rank_ene,op);
Compute the Z score with align(alignment_list[rank_ene[addr_i]],model,op);
Rank according to Z score with rank_according_to(ZSCORE,alignment_list,len_list,rank_zs,op);
Calculate the counter of printing: count
Print the statistics to the file ****.stat
Print the alignment to the file ****.info with align(&align_p,model,op); print_alignment(&align_p,op,io,header)

#Mon Jul 07 10:56:28 2003

#LOOPP V2: ALIGNMENT INFORMATION

#======================================================

#This file contains statistics of sequence to sequence alignment

#with constant gap penalty 8.000000

#and the potential is multiplied with the factor scale 1.000000

#Alignment type : GLOBAL

#The following models were used:

#Potential : NHseq_gte_thom2.pot

#Model : SEQ_M with mixing parameter: 1.000000

#The model produced the alignment : YES

#

#Data Base : H:\users\galor\LISTS\test

#The difference in length between the query sequence and the data base sequence is less then 30.00 percent

#

#The number of random sequence to compute zscore was set to 100

#Only prints zscore above threshold 0.00

# ========================================================

# 1 matches to 1dbt_A zscore ene identity te_ene te_zscore length align_len

7tim_A 0.11 -89.00 5.40 999.00 999.90 247 278

The Threading module

What is threading

Sequence information is used for the probe protein. Structural information is used for the target.

S1: AWGHKI

X2: s1s0s2s3s0s3

Example of a sequence aligned onto structure sites:

A  W  -  G  H  K  -  I

s1 -  s3 -  s1 s1 s5 s1

drv_threading_a_list_of_seq_against_the_db() : Thread a list of sequences against the database

drv_threading_a_seq_against_the_db() : Thread one sequence against the database.

LM_drv_threading_a_list_against_the_db() : Thread a list of sequences against the database (Low memory)

drv_threading_a_seq_against_a_struc() : Thread a sequence against one structure

drv_threading_a_db_against_itself() : Thread the database against itself (used for recognizing the native)

set_threading_attributes() : Set attributes for alignment of sequence to structure.

thom2_gapless_threading_gap_penalty() : Compute gaps for Thom2 model REJM model

thom2_threading_scoring_energy() : Compute scoring energy for Thom2.

LIST OF FUNCTIONS:

The seq module

drv_seq_alignment_of_a_list_of_seq_against_the_db(): Align a list of sequences against the database

LM_drv_seq_alignment_a_list_against_the_db(): Align a sequence against the database (Low memory)

drv_seq_alignment_of_db_against_the_db(): Align the database against itself (for recognizing the native)

drv_seq_alignment_of_seq_against_seq(): Align one sequence against one sequence.

set_seq_attributes(): Set attributes for sequence alignment.

seq_alignment_gap_penalty(): Compute structural gap dependent penalty.

seq_alignment_constant_gap_penalty(): Compute constant gap penalty

constant_seq_alignment_gap_penalty_for_pdb_seq_to_atom(): Compute gap penalty for aligning SEQRES to ATOM section for a PDB file.

seq_alignment_scoring_energy(): Compute scoring energy using BLOSUM 50

The PDB module

The main task of this module is to create the database for LOOPP. The database is created in two stages. In the first step the PDB files are cleaned. In the second step the LOOPP files are created.

The first step:

A PDB file pdb****.ent is converted into two files: pdb****.ent.new, a clean PDB derived from the original PDB, and pdb****.ent.log, a log file which contains information on the clean PDB. The latter file contains lines of the form "tag resName resSeqNum atomCounter gapIndicator CA-distance", which describe how the file *.new was derived from the original PDB file *.ent.

<tag> is one of the characters +, -, =, or *:

+ NTER and CTER cards added in *.new as chain designators

- residue deleted in *.new

= residue copied to *.new

* residue copied to *.new, but some of its atoms are missing

<atomCounter>: The number of atoms found for the current residue.

<gapIndicator>: The index in a chain. A chain starts with index 1 and terminates with index 0, if no CA is found at the current residue.

<CA-distance>: The C-alpha distance between the previous and the current residue.
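A line of the *.log file could be read with a parser along the following lines. This is only a sketch under stated assumptions: the struct and function names are hypothetical, the fields are assumed to be separated by whitespace, resSeqNum is assumed to be a plain integer, and every line is assumed to carry all six fields. The actual LOOPP parser in PDBparser.c may differ.

```c
#include <stdio.h>

/* Hypothetical record for one line of pdb****.ent.log (not a LOOPP type). */
typedef struct {
    char   tag;            /* one of '+', '-', '=', '*'                      */
    char   res_name[8];    /* residue name, e.g. "ALA"                       */
    int    res_seq_num;    /* residue number (assumed to be an integer)      */
    int    atom_counter;   /* number of atoms found for the residue          */
    int    gap_indicator;  /* index in the chain                             */
    double ca_distance;    /* C-alpha distance to the previous residue       */
} LogLine;

/* Parse "tag resName resSeqNum atomCounter gapIndicator CA-distance".
 * Returns 1 on success, 0 if a field is missing or the tag is unknown. */
int parse_log_line(const char *line, LogLine *out)
{
    int n = sscanf(line, " %c %7s %d %d %d %lf",
                   &out->tag, out->res_name, &out->res_seq_num,
                   &out->atom_counter, &out->gap_indicator,
                   &out->ca_distance);
    if (n != 6) return 0;
    return out->tag == '+' || out->tag == '-' ||
           out->tag == '=' || out->tag == '*';
}
```

For example, the made-up line "= ALA 12 5 3 3.80" would be accepted and stored as a copied ALA residue with 5 atoms.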

The location of the new files *.new and *.log is defined by the option USR_PDB_PATH in the parameter file loopp.par.

The routine which is responsible for cleaning the pdb is drv_clean_pdb_from_a_list_of_pdb_names(). It calls the interface routine openInterfaceToCleanPDB() in PDBparser.c file.

Step 2: Generating loopp database

The routine structure for parsing a pdb file containing all sections as defined by RCSB database:

check_if_to_skip_the_current_chain

read_header_loop_format

parse_a_chain_from_pdblog_toget_resatom

get_SEQRES_section_from_pdb

global_align

rm_redundant_match_to_left

rm_redundant_match_to_the_right

match_triplet_of_SEQRES_with_ATOM_section_in_pdb_file

improve_alignment_based_on_gap_pos

pdb_error_ask_for_hel

improve_alignment_manually

assess_alignment_and_ask_for_help_in_case_pdb_corrupted

print_seq_log_format

check_if_memebrane

parse_a_chain_and_print_seq_log

pick_Calpha_atom

pick_Cbeta_Atom

pick_side_chain_atom

generate_dummy_coordinate_if_not_in_atom

Tokenize_atom_line_of_pdb_file

parse_a_chain_and_prints_its_coordinate

parse_clean_pdb_and_print_loop_format

parse_clean_atom_section_and_print_loop_format

drv_build_loop_files_from_a_list_of_pdb_names

The main database in the file pdb.c: PDB_INFO

code_name

chain_code_file

file_name

res_atom

res_seq

save_first_res_index

save_last_res_index

n_atoms_list

n_cards

trace_pass2

trace_original

trace_final

gapMarkBond

jumpMarkBond

jumpMarkPdb

align_info

current_chain_id

discrepancyInJump

errInPdb

matchNotIdentical

messInSEQRES

code_name : PDB acronym

chain_code_file : Chain identity (extracted from the file name)

res_seq : A protein whose sequence is taken from the SEQRES section

res_atom : A protein whose sequence and coordinates are taken from the ATOM section

n_cards : Number of cards in the ATOM section

n_atoms_list[] : The number of atoms for the current residue

trace_pass2 : Pass 2 alignment trace

trace_original : Pass 1 alignment trace

trace_final : Final alignment trace

gapMarkBond : Saved gapIndicator from the *.log file

jumpMarkBond : Gap marker according to C-alpha bond length

jumpMarkPdb : Gap marker according to bond residue index

align_info : Alignment of the SEQRES section onto the ATOM section

current_chain_id : The current chain, in case there are several chains in the PDB file

discrepancyInJump : Flag set when jumpMarkBond and jumpMarkPdb disagree

matchNotIdentical : In the alignment of SEQRES onto ATOM there is a match but the residues are not identical (error in the PDB file)
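The two gap markers and the discrepancy flag described above can be illustrated with a small sketch. All function names here are hypothetical and the distance threshold is a parameter chosen by the caller; consecutive C-alpha atoms in an unbroken chain are typically about 3.8 Angstrom apart, so a cutoff somewhat above that value is a reasonable assumption. How LOOPP itself fills jumpMarkBond and jumpMarkPdb is not shown in this document.

```c
/* Hypothetical sketch (not LOOPP code).
 * jump[i] = 1 when the C-alpha atoms of residues i-1 and i are farther
 * apart than max_ca_dist (Angstrom); jump[0] is always 0. */
void mark_jumps_by_ca_distance(const double *x, const double *y, const double *z,
                               int n, double max_ca_dist, int *jump)
{
    for (int i = 0; i < n; ++i) {
        jump[i] = 0;
        if (i > 0) {
            double dx = x[i] - x[i - 1];
            double dy = y[i] - y[i - 1];
            double dz = z[i] - z[i - 1];
            /* compare squared distances to avoid a square root */
            if (dx * dx + dy * dy + dz * dz > max_ca_dist * max_ca_dist)
                jump[i] = 1;
        }
    }
}

/* jump[i] = 1 when the PDB residue number does not increase by exactly 1. */
void mark_jumps_by_res_index(const int *res_seq_num, int n, int *jump)
{
    for (int i = 0; i < n; ++i)
        jump[i] = (i > 0 && res_seq_num[i] != res_seq_num[i - 1] + 1) ? 1 : 0;
}

/* Number of positions where the two markers disagree. */
int count_jump_discrepancies(const int *jump_bond, const int *jump_pdb, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (jump_bond[i] != jump_pdb[i]) ++count;
    return count;
}
```

A non-zero discrepancy count corresponds to the situation flagged by discrepancyInJump: the geometry suggests a chain break where the residue numbering does not, or the reverse.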

Algorithm outline:

The PDB files from the RCSB database are full of discrepancies. One way to fish out these problems is to align the SEQRES section to the residues of the ATOM section. The program uses a sequence alignment with a constant gap penalty and a constant match score. After the first pass the alignment trace (alignment path) must be checked to see whether the alignment makes sense: that gaps are concentrated in distinct areas, that gaps according to the C-alpha distance correspond to jumps in the PDB index, and, for matched segments, that the residue in the SEQRES section coincides with that of the ATOM section.

The program tries to correct some of the errors in pass 2, by shifting gaps or by using a different alignment not based on dynamic programming. The user can choose which alignment path makes more sense based on reading the comments in the PDB file, or manually make their own version if the two alignments fail.

In case the program fails, only a small portion of the alignment needs to be corrected. In this case the program prints one section of the alignment at a time and waits for approval or correction. Unfortunately, on rare occasions the algorithm may fail completely.
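The first-pass alignment of the outline above (constant gap penalty, constant match score) can be sketched as a standard global dynamic-programming alignment. The function below is a hypothetical illustration, not the LOOPP routine global_align: it returns only the optimal score, assumes one-letter residue codes, and takes the match score, mismatch score and gap penalty as caller-supplied values.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

static int max3(int a, int b, int c)
{
    int m = (a > b) ? a : b;
    return (m > c) ? m : c;
}

/* Hypothetical sketch: optimal global alignment score of two sequences with
 * a constant match score, constant mismatch score and constant gap penalty
 * (gap is given as a positive number and subtracted per gapped position).
 * Returns INT_MIN if memory cannot be allocated. */
int global_align_score(const char *seqres, const char *atom,
                       int match, int mismatch, int gap)
{
    size_t n = strlen(seqres), m = strlen(atom);
    int *prev = malloc((m + 1) * sizeof *prev);   /* row i-1 of the DP table */
    int *curr = malloc((m + 1) * sizeof *curr);   /* row i of the DP table   */
    int score;

    if (!prev || !curr) { free(prev); free(curr); return INT_MIN; }

    for (size_t j = 0; j <= m; ++j)
        prev[j] = -(int)j * gap;                  /* leading gaps            */

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = -(int)i * gap;
        for (size_t j = 1; j <= m; ++j) {
            int s = (seqres[i - 1] == atom[j - 1]) ? match : mismatch;
            curr[j] = max3(prev[j - 1] + s,       /* align residue i with j  */
                           prev[j] - gap,         /* gap in the ATOM section */
                           curr[j - 1] - gap);    /* gap in SEQRES           */
        }
        int *tmp = prev; prev = curr; curr = tmp; /* reuse the two rows      */
    }
    score = prev[m];
    free(prev);
    free(curr);
    return score;
}
```

Residues present in SEQRES but missing from the ATOM section show up as gaps on the ATOM side; recovering the trace itself would require keeping the full table or back-pointers, which is omitted here for brevity.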

MPS module

Convert loopp format output to MPS format:

The design of a new potential leads eventually to solving a set of linear equations. Loopp generates LHS and RHS files which contain the coefficients of the inequalities. One of the options to solve this set is to use the BPMPD software, which requires the MPS input format.

Basic routines:

read_coeff

set_axuliary_var

write_col

write_row

write_rhs

write_bound

write_Qmatrix

write_linear_part_of_obj

set_space

get_var_name

MPS format

Here is a simple example of an MPS file:

NAME example2.mps

ROWS

N obj

L c1

L c2

COLUMNS

x1 obj -1 c1 -1

x1 c2 1

x2 obj -2 c1 1

x2 c2 -3

x3 obj -3 c1 1

x3 c2 1

RHS

rhs c1 20 c2 30

BOUNDS

UP BOUND x1 40

ENDATA

E = c_1 x_1 + ... + c_n x_n (right-hand side: R)

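read_coeff : Read the loopp LHS and RHS files into memory

write_Qmatrix : Write the bilinear objective (a quadratic form to be minimized, with terms in x_1^2, x_1 x_2 and x_2^2)

write_linear_part_of_obj : Write the linear objective

set_space : Define space of field

get_var_name : Define variable name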

http://www.sztaki.hu/~meszaros/bpmpd/

http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/continuous/constrained/linearprog/mps.html

http://www.mosek.com/products/2/tools/doc/html/tools/node17.html

drv_loop2mps calls: loop2mps_obj, loop2mps_nobj, loop2mps_mixed_obj_nobj

The driver to convert loopp to MPS format is drv_loopp2mps(). This routine calls different routines depending on the parameters in the *.par file. The most common routine solves the linear set of inequalities without an objective: loopp2mps_nobj().
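To make the conversion concrete, here is a minimal sketch of writing a set of inequalities without an objective in MPS form. It is not the LOOPP routine loopp2mps_nobj(); the function name, the in-memory layout of the coefficients (row-major array) and the row and column naming (c1..cm, x1..xn) are assumptions. The sketch also assumes inequalities of the form "greater than or equal" (row type G) and declares every variable free (bound type FR), because MPS variables are non-negative by default.

```c
#include <stdio.h>

/* Hypothetical sketch: write m inequalities
 *     a[j][0]*x1 + ... + a[j][n-1]*xn >= r[j]
 * as an MPS file with an empty objective row.
 * `a` is stored row-major: a[j * n + k]. Returns 0 on success, -1 on error. */
int write_mps_nobj(FILE *fp, const char *name,
                   const double *a, const double *r, int m, int n)
{
    fprintf(fp, "NAME %s\n", name);

    fprintf(fp, "ROWS\n N obj\n");               /* empty objective row      */
    for (int j = 0; j < m; ++j)
        fprintf(fp, " G c%d\n", j + 1);          /* one G row per inequality */

    fprintf(fp, "COLUMNS\n");                    /* column-wise, nonzeros    */
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < m; ++j)
            if (a[j * n + k] != 0.0)
                fprintf(fp, " x%d c%d %.6g\n", k + 1, j + 1, a[j * n + k]);

    fprintf(fp, "RHS\n");
    for (int j = 0; j < m; ++j)
        if (r[j] != 0.0)
            fprintf(fp, " rhs c%d %.6g\n", j + 1, r[j]);

    fprintf(fp, "BOUNDS\n");                     /* parameters may be < 0    */
    for (int k = 0; k < n; ++k)
        fprintf(fp, " FR BOUND x%d\n", k + 1);

    fprintf(fp, "ENDATA\n");
    return ferror(fp) ? -1 : 0;
}
```

Zero coefficients are skipped, which keeps the file small when the constraint matrix is sparse. Whether BPMPD accepts this free-form spacing or needs fixed-column fields should be checked against its documentation.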

Design new potential

Gapless threading

Create equations of the type a_j[0]x[0]+...+a_j[n]x[n] = r_j, where the coefficients a_j[0],...,a_j[n] (j=1...m) are stored in the file op->train_lhs_file_nobj and the right-hand sides r_j (j=1...m) are stored in op->train_rhs_file_nobj.

The equations are generated by the gapless threading method: a sequence is assigned to a structure without gaps. The pair (seq_i,struc_j) constructs a pseudo protein denoted as a decoy. Each equation is defined as the energy difference between assigning a native sequence to a decoy structure (a non-native structure) and assigning the native sequence to its native structure.

E(N->D)-E(N->N)=A*X > 0.

The energy definition depends on the model chosen by the user in USR_MODEL_TYPE. The length of the sequence should be shorter than that of the structure. The sequence is slid along the structure (N_struc - n_seq + 1) times, or fewer depending on USR_GAPLESS_THREADING_WINDOW.
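The construction of one constraint row can be sketched with a toy energy model. Everything in this block is an assumption made for illustration: the energy is taken to be a sum of parameters x[amino acid][site type], so that E is linear in x and the row A is simply the difference of feature counts between the decoy and the native. The real THOM2 energy, the routine names and the exact meaning of USR_GAPLESS_THREADING_WINDOW (treated here as a cap on the number of offsets, with 0 meaning no cap) are not specified in this document.

```c
#include <string.h>

/* Toy sizes (assumption): 20 amino acid types, 3 structural site types. */
enum { N_AA = 20, N_SITE = 3, N_PAR = N_AA * N_SITE };

/* Count how often each (amino acid, site type) pair occurs when the sequence
 * is placed without gaps on the structure, starting at site `offset`. */
static void count_features(const int *seq, int n_seq,
                           const int *site, int offset, double *counts)
{
    memset(counts, 0, N_PAR * sizeof *counts);
    for (int i = 0; i < n_seq; ++i)
        counts[seq[i] * N_SITE + site[offset + i]] += 1.0;
}

/* Fill one row a[0..N_PAR-1] of E(N->D) - E(N->N) = A*X > 0:
 * native sequence threaded onto the decoy structure at `offset`,
 * minus the native sequence on its own structure. */
void gapless_constraint_row(const int *seq, int n_seq,
                            const int *native_site,
                            const int *decoy_site, int offset, double *a)
{
    double nat[N_PAR], dec[N_PAR];
    count_features(seq, n_seq, native_site, 0, nat);
    count_features(seq, n_seq, decoy_site, offset, dec);
    for (int k = 0; k < N_PAR; ++k)
        a[k] = dec[k] - nat[k];
}

/* Number of gapless placements: N_struc - n_seq + 1, optionally capped. */
int n_gapless_placements(int n_struc, int n_seq, int window)
{
    int n = n_struc - n_seq + 1;
    if (n < 0) n = 0;                       /* sequence longer than structure */
    if (window > 0 && window < n) n = window;
    return n;
}
```

A decoy placement that reproduces the native environment gives a row of zeros, which carries no information for the potential; such rows would normally be skipped before the LHS file is written.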

There are two main routines for gapless threading:

drv_compute_fix_threading_constrains(): Generate the LHS, RHS and LOG files from a LOOPP database.

drv_compute_fix_threading_constraints_where_the_db_is_based_on_abintio_decoys(): Generate the LHS, RHS and LOG files for an ab initio database. The difference is that the native is gapless threaded onto its family of decoys. Examples of decoy sets are the Skolnick set, the TB set, and the Baker set. The decoy length is equal to that of its native.
