Data quality and model parameterisation Martyn Winn CCP4, Daresbury Laboratory, U.K. Prague, April 2009

Feb 06, 2016

Transcript
  • Data quality and model parameterisation

    Martyn Winn, CCP4, Daresbury Laboratory, U.K.

    Prague, April 2009

  • Model Parameters
    E.g. asymmetric unit contains n copies of a protein of N atoms.
    Coordinates:
      • 3 x N x n xyz co-ordinates
      • or ... 6 x M x n if each protein is modelled as M rigid bodies
      • or ... ~ 0.5 x N x n torsion angles
    Displacement parameters:
      • 1 x N x n B factors
      • or ... 6 x N x n anisotropic U factors
      • or ... 20 x M x n if each protein has M TLS groups
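The counting rules on this slide can be written down as a small helper. This is only a sketch: the function names and the `model` keywords are made up for illustration, and the torsion-angle count is the approximate 0.5 per atom quoted above.

```python
def coordinate_parameters(n_atoms, n_copies, model="xyz", n_rigid_bodies=0):
    """Positional parameters for n_copies of a protein with n_atoms atoms."""
    if model == "xyz":
        return 3 * n_atoms * n_copies
    if model == "rigid":
        # 3 rotations + 3 translations per rigid body
        return 6 * n_rigid_bodies * n_copies
    if model == "torsion":
        # approximate: about 0.5 torsion angles per atom
        return round(0.5 * n_atoms * n_copies)
    raise ValueError(f"unknown coordinate model: {model}")


def displacement_parameters(n_atoms, n_copies, model="iso", n_tls_groups=0):
    """Displacement parameters for the same asymmetric unit."""
    if model == "iso":
        return 1 * n_atoms * n_copies
    if model == "aniso":
        return 6 * n_atoms * n_copies
    if model == "tls":
        # 20 parameters per TLS group
        return 20 * n_tls_groups * n_copies
    raise ValueError(f"unknown displacement model: {model}")


# Hypothetical example: 2 copies of a 1000-atom protein
print(coordinate_parameters(1000, 2) + displacement_parameters(1000, 2))
print(coordinate_parameters(1000, 2) + displacement_parameters(1000, 2, "aniso"))
```

Switching from isotropic B factors to anisotropic U factors more than doubles the total, which is why the choice of parameterisation has to follow the amount of data available.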

  • Model Parameters (2)
    Occupancies:
      • Usually fixed at 1.0 for protein
      • ... except for alternative conformations (usually sum to 1.0)
      • Water/ligand occupancies
    Scaling parameters etc.:
      • k_overall, B_overall, k_Babinet, B_Babinet, k_solvent, B_solvent
      • twin fraction
    Ultra-high resolution:
      • Multipolar expansion coefficients
      • Interatomic scatterers

  • Reflection Data
    Number of independent reflections, dependent on:
      • spacegroup
      • resolution
      • completeness
    For each reflection, one has at least F and sigF.
    Might also have reliable experimental phases or F(+)/F(-).

  • Data / parameter ratio
    Refinement means minimising -log(likelihood):
      • Nonlinear function of model parameters.
      • Global minimum and many local minima.
      • Need good data/parameter ratio.

    Strong dependence on resolution. No strong dependence on protein size.

    Generally not enough data ....
      • Reduce number of parameters - constraints
      • Add data - restraints
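The resolution dependence can be illustrated by counting reciprocal lattice points inside the sphere of radius 1/d_min. The sketch below is a rough estimate only: it assumes a primitive cell and 100% completeness, and the function name and arguments are illustrative rather than taken from any program.

```python
import math


def approx_unique_reflections(cell_volume, d_min, laue_order=2):
    """Rough number of unique reflections to resolution d_min (in Å).

    cell_volume: volume of a primitive unit cell in Å^3 (assumption: no centring)
    laue_order:  order of the Laue group (2 for P1, i.e. Friedel pairs only)
    """
    # Lattice points inside a sphere of radius 1/d_min in reciprocal space
    lattice_points = (4.0 / 3.0) * math.pi * cell_volume / d_min ** 3
    return lattice_points / laue_order


# Hypothetical P1 cell of 100,000 Å^3 at three resolutions
for d_min in (3.0, 2.0, 1.0):
    print(d_min, round(approx_unique_reflections(100_000.0, d_min)))
```

The count scales as 1/d_min^3, so going from 2 Å to 1 Å gives about eight times as many reflections. A larger protein needs a proportionally larger cell, so reflections and atoms grow together and the ratio stays roughly constant.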

  • Restraints
    Expected geometry of the protein treated as additional data:
      • bond lengths
      • bond angles
      • torsions / dihedral (but not φ, ψ)
      • chirality (e.g. chiral volume)
      • planarity
      • non-bonded (VdW, H-bonds, etc.)
      • B factors (between bonded atoms)
      • U factor restraints (similarity, sphericity, rigid bond)
      • NCS (position or conformation)

  • Data / parameter ratio
    Estimate as: (no. reflections + no. restraints) / no. parameters
    Not really true ... assumes all data are independent. Counter-examples:
      • bond length, angle and planar restraints in a ring system
      • bond length restraint vs. high resolution diffraction data

    Restraints may be more necessary in poorly determined parts of the structure.
    Restraints have associated weights:
      • Overall w.r.t. reflection data
      • Individual weights e.g. WB

  • calmodulin at 1.8 Å (1clm)
    1132 protein atoms, 4 Ca atoms, 71 waters
    4828 x, y, z, B factors

    No. of unique reflections 10610 (deposited 1993, no test set!)
    data/parameter = 2.2 (reflections only)
    Bond restraints: 1144
    Angle restraints: 1536
    Torsion restraints: 429
    Chiral restraints: 170
    Planar restraints: 874
    Non-bonded restraints: 1391
    B factor restraints: 2680
    (no NCS)
    total restraints = 8224
    data/parameter = 3.9 (reflections + restraints)

  • calmodulin at 1.0 Å (1exr)
    1467 protein atoms (inc. alt. conf.), 5 Ca atoms, 178 waters
    4950 x, y, z
    + 9900 anisotropic U factors
    + 316 occupancy parameters
    total parameter count = 15166

    No. of unique reflections 77150
    No. in test set 7782 (10%)
    Data for refinement 69368

    No. of restraints (PDB header) 22732

    data/parameter = 4.6 (reflections only)
    data/parameter = 6.1 (reflections + restraints)

  • GCPII at 1.75 Å (3d7g)
    5724 protein atoms (inc. alt. conf.), 211 ligand atoms, 617 waters
    26046 x, y, z, B factors
    + 162 anisotropic U factors (S, Zn, Ca, Cl only)
    + 225 occupancy parameters
    total parameter count = 26433

    No. of unique reflections 105077
    No. in test set 1550 (1.5%)
    Data for refinement 103527

    No. of restraints (PDB header) 44652

    data/parameter = 3.9 (reflections only)
    data/parameter = 5.6 (reflections + restraints)

  • Thioredoxin reductase at 3.0 Å (1h6v)
    22514 protein atoms, 552 ligand atoms, 9 waters
    92300 x, y, z, residual B factors
    6 TLS groups
    120 TLS parameters

    No. of unique reflections 69328
    No. in test set 3441 (5%)
    Data for refinement 65887

    No. of restraints 209378 (inc. 44484 NCS restraints)
    data/parameter = 0.7 (reflections only)
    data/parameter = 3.0 (reflections + restraints)
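The four worked examples can be reproduced with a few lines of arithmetic. The numbers below are copied from the slides; the function name and dictionary layout are illustrative, and the 1h6v parameter count is taken here as 92300 + 120 TLS parameters.

```python
def data_parameter_ratio(n_reflections, n_parameters, n_restraints=0):
    """(no. reflections + no. restraints) / no. parameters."""
    return (n_reflections + n_restraints) / n_parameters


# Working-set reflections, parameters and restraints from the slides
examples = {
    "1clm": dict(n_reflections=10610, n_parameters=4828, n_restraints=8224),
    "1exr": dict(n_reflections=69368, n_parameters=15166, n_restraints=22732),
    "3d7g": dict(n_reflections=103527, n_parameters=26433, n_restraints=44652),
    "1h6v": dict(n_reflections=65887, n_parameters=92300 + 120, n_restraints=209378),
}

for name, e in examples.items():
    reflections_only = data_parameter_ratio(e["n_reflections"], e["n_parameters"])
    with_restraints = data_parameter_ratio(**e)
    print(f"{name}: {reflections_only:.1f} (reflections only), "
          f"{with_restraints:.1f} (with restraints)")
```

At 3.0 Å the reflections alone give fewer than one observation per parameter, and it is the restraints (including NCS) that bring the ratio back to a workable value.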

  • Getting a good R-factor
    The old way:
      • Refine parameters so that Fcalc (from model) agrees with Fobs for all reflections
      • Calculate: $R = \sum \bigl|\,|F_{obs}| - s\,|F_{calc}|\,\bigr| \,/\, \sum |F_{obs}|$ (Note: precise value may depend on scaling used)
      • Add parameters until R is sufficiently low

    What's wrong with that?

  • Avoiding overfitting: Rfree
    What's wrong?
      • Can add any old parameters to improve R-factor, when data/parameter ratio is low
      • May not be physically correct: "overfitting"

    Solution:
      • Calculate R-factor on a set of reflections not used in refinement = "Rfree"
      • If changes to model improve Rfree as well as R, then they are good.
      • Note: Rfree is a global number - useful for refinement strategies, not useful for assessing changes to a few atoms
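A minimal sketch of the R and Rfree calculation, assuming amplitudes are held in plain Python lists and that a single scale factor s is supplied by the caller. Real refinement programs apply more elaborate scaling, so values from this sketch would not match theirs exactly.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """R = sum ||Fobs| - s|Fcalc|| / sum |Fobs| over the given reflections."""
    numerator = sum(abs(abs(fo) - scale * abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, is_free, scale=1.0):
    """Split reflections by free flag and return (R_work, R_free)."""
    work_obs = [fo for fo, flag in zip(f_obs, is_free) if not flag]
    work_calc = [fc for fc, flag in zip(f_calc, is_free) if not flag]
    free_obs = [fo for fo, flag in zip(f_obs, is_free) if flag]
    free_calc = [fc for fc, flag in zip(f_calc, is_free) if flag]
    return (r_factor(work_obs, work_calc, scale),
            r_factor(free_obs, free_calc, scale))


# Made-up amplitudes; the last reflection is in the free set
f_obs = [10.0, 20.0, 30.0, 40.0]
f_calc = [11.0, 19.0, 33.0, 36.0]
is_free = [False, False, False, True]
print(r_work_and_r_free(f_obs, f_calc, is_free))
```

Only the working reflections should drive refinement; the free reflections are scored but never fitted, which is what makes Rfree a check on overfitting.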

  • Choosing your free reflections
      • Usually a randomly chosen subset.
      • Typically 5-10% (CCP4 default is 5%)
      • If you have enough reflections, impose maximum number (2000 in phenix.refine)
      • Free set also used in maximum likelihood to estimate σA parameters
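The random selection with an optional cap can be sketched as follows. This is not how CCP4 or phenix.refine implement it; the function name, arguments and seed handling are assumptions for illustration.

```python
import random


def choose_free_set(n_reflections, fraction=0.05, max_free=None, seed=0):
    """Return a list of booleans, True where the reflection is in the free set."""
    n_free = round(fraction * n_reflections)
    if max_free is not None:
        # Optional cap on the size of the free set
        n_free = min(n_free, max_free)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]


flags = choose_free_set(10000)                       # 5% of 10000 reflections
capped = choose_free_set(105077, max_free=2000)      # large dataset, capped
print(sum(flags), sum(capped))
```

Keeping the seed (or better, the flags themselves) matters because the same free set must be reused throughout refinement.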

  • Rfree and NCS
    NCS operators map different regions of the reciprocal asymmetric unit onto each other. Reflections in these regions are correlated.
    [Figure: working reflections and free reflections in reciprocal space; gaps = free set]

  • Rfree and NCS
    Solution: choose free set from thin shells in reciprocal space
    Pros:
      • NCS operators link regions of same resolution, which should both be in a shell or both outside it
    Cons:
      • Large number of shells → thin shells → most free reflections close to edge and correlated to non-free reflections
      • Small number of shells → significant gaps in resolution range, poor determination of σA
    SFTOOLS: RFREE 0.05 SHELL 0.001
    3rd argument = width of shells in Å⁻¹
    Also DATAMAN.
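One simple way to picture thin-shell selection is to flag a reflection as free whenever its 1/d falls in the first `shell_width` of a repeating interval of length `shell_width / fraction`. This is only an illustration of the idea and is not claimed to be the algorithm used by SFTOOLS or DATAMAN; the helper for 1/d assumes an orthogonal cell.

```python
import math


def inverse_d_orthogonal(hkl, cell):
    """1/d in Å^-1 for an orthogonal cell (a, b, c in Å); assumption: all angles 90°."""
    h, k, l = hkl
    a, b, c = cell
    return math.sqrt((h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2)


def free_flags_in_thin_shells(inv_d, shell_width=0.001, fraction=0.05):
    """Flag reflections whose 1/d lies in regularly spaced thin shells."""
    period = shell_width / fraction  # spacing between the starts of free shells
    return [math.fmod(s, period) < shell_width for s in inv_d]


# Reflections at the same resolution always get the same flag
inv_d = [0.1005, 0.11, 0.3003, 0.1005]
print(free_flags_in_thin_shells(inv_d))
```

Because the flag depends only on 1/d, NCS-related reflections (which share the same resolution) end up on the same side of the working/free split. A narrower `shell_width` gives more shells, each with more of its reflections near a shell edge.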

  • [Figure: free-set shells for two cases. 1xmp (1.8 Å): width 0.01, 3 shells vs. width 0.0013, 20 shells (default). XXX (3.8 Å): width 0.005, 3 shells vs. width 0.0005, 20 shells (default).]

  • Rfree and NCS
    Can increase size of free set to mitigate edge effects.
    Or use NCS-related free set islands.

    Reflections are also correlated to immediate neighbours in reciprocal space - can exclude these from working and free sets.
    Fabiola, Korostelev & Chapman, Acta Cryst D62, 227, (2006)
    Rapidly run out of working reflections!

    Be aware that correlations can artificially reduce your Rfree.

  • Rfree and twinning
    Twinning operator might relate e.g. reflection (1,2,3) to (2,1,-3)

    These two reflections should both be in the working set or the free set.

      • Select free set in thin shells (as for NCS)
      • Select free reflections in higher lattice symmetry
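The requirement that twin-related reflections share a flag can be sketched by assigning flags per twin pair rather than per reflection. The code assumes a two-fold twin operator, and `to_asu` is a hypothetical placeholder for a routine that maps an index back into the asymmetric unit (identity by default); neither is taken from an existing program.

```python
import random


def twin_aware_free_flags(reflections, twin_op, to_asu=lambda hkl: hkl,
                          fraction=0.05, seed=0):
    """Give each reflection and its twin mate the same free flag.

    Assumes twin_op applied twice returns the original index (two-fold twin law).
    """
    rng = random.Random(seed)
    group_is_free = {}
    flags = []
    for hkl in reflections:
        # Both members of a twin pair map to the same key
        key = min(to_asu(hkl), to_asu(twin_op(hkl)))
        if key not in group_is_free:
            group_is_free[key] = rng.random() < fraction
        flags.append(group_is_free[key])
    return flags


def example_twin_op(hkl):
    """Twin law from the slide: (h, k, l) -> (k, h, -l)."""
    h, k, l = hkl
    return (k, h, -l)


reflections = [(1, 2, 3), (2, 1, -3), (1, 1, 1), (1, 1, -1)]
print(twin_aware_free_flags(reflections, example_twin_op, fraction=0.5))
```

In practice the twin mate usually has to be mapped back into the asymmetric unit by Friedel and crystal symmetry before the comparison, which is the job the `to_asu` placeholder stands in for.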

  • Transferring free R sets
    Use the same free set for:
      • additional datasets for same protein
      • datasets from isomorphous proteins (derivatives, complexes, etc.)
        (how isomorphous is not clear, but play safe ...)
    Otherwise initial R & Rfree will be similar and low for second structure - it has been refined against most of your free reflections.
    Further refinement may lead to divergence of R & Rfree, masking the bias. Harder to detect over-fitting, although further refinement may eventually reset Rfree.

    How: Use "CAD" / "Merge MTZ files (CAD)" in CCP4.
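The logic of carrying a free set over to a second dataset can be illustrated independently of CAD. The sketch below keys flags on the Miller index; giving fresh random flags to reflections absent from the first dataset (for example at higher resolution) is an assumption of this sketch, not something stated on the slide.

```python
import random


def transfer_free_flags(old_flags, new_reflections, fraction=0.05, seed=0):
    """Copy free flags from an existing dataset onto a new list of reflections.

    old_flags:       dict mapping (h, k, l) -> bool from the first dataset
    new_reflections: iterable of (h, k, l) for the new dataset
    """
    rng = random.Random(seed)
    new_flags = {}
    for hkl in new_reflections:
        if hkl in old_flags:
            # Reflection seen before: keep its original working/free status
            new_flags[hkl] = old_flags[hkl]
        else:
            # Assumption: unseen reflections get a new random flag
            new_flags[hkl] = rng.random() < fraction
    return new_flags


old = {(1, 0, 0): True, (0, 1, 0): False}
new = transfer_free_flags(old, [(1, 0, 0), (0, 1, 0), (0, 0, 1)])
print(new)
```

Both datasets must be indexed consistently for the match on (h, k, l) to be meaningful.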

  • Useful resources
      • http://ccp4wiki.org/ - CCP4 Wiki
      • http://strucbio.biologie.uni-konstanz.de/ccp4wiki/ - CCP4 community wiki
      • Proceedings of Study Weekend 2004 (Acta Cryst D, Dec 2004)

    * Data is only 71% complete.
    * SHELXL-97; additional restraints probably for aniso Us.
    * Refmac5. Paper describes mixed isotropic/anisotropic model, but not clear why or if it helped!
    * Also NCS-related islands.
    * (2,1,-3) -> (-2,-1,3) -> (2,1,3). Someone mentioned phenix.xtriage but not sure if it does free R sets - just to find twin law I think. phenix.refine seems to have something about the symmetry of free R set.