Modelling procedures for directed network of data blocks
Agnar Höskuldsson, Centre for Advanced Data Analysis, Copenhagen
Data structures:
- Directed network of data blocks
- Input data blocks
- Output data blocks
- Intermediate data blocks
Methods:
- Optimization procedures for each passage through the network
- Balanced optimization of fit and prediction (H-principle)
- Scores, loadings, loading weights, regression coefficients for each data block
- Methods of regression analysis applicable at each data block
- Evaluation procedures at each data block
- Graphic procedures at each data block
Chemometric methods
1. Regression estimation, X, Y
   - Traditional presentation: Yest = XB, and standard deviations for B.
   - Latent structure:
     X = TP’ + X0. X0 is not used.
     Y = TQ’ + Y0. Y0 is not explained.
2. Fit and precision
   - Both fit and precision are controlled.
3. Selection of score vectors
   - As large as possible
   - Describe Y as well as possible
   - Modelling stops when no more are found (cross-validation)
4. Graphic analysis of latent structure
   - Score and loading plots
   - Plots of weight (and loading weight) vectors
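The latent structure X = TP’ + X0, Y = TQ’ + Y0 can be illustrated with a minimal sketch. It assumes a standard PLS-type algorithm in which each weight vector maximises the covariance between the score vector and Y; the slides do not spell out the algorithm, and the H-principle balancing of fit and prediction is not reproduced here. The function name `latent_structure` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def latent_structure(X, Y, n_components):
    """PLS-type decomposition X = T P' + X0, Y = T Q' + Y0 for centred X, Y.

    Assumption: each weight vector w maximises the covariance between the
    score t = X0 w and Y0 (dominant left singular vector of X0'Y0).
    """
    X0, Y0 = X.astype(float), Y.astype(float)   # astype returns copies
    W, T, P, Q = [], [], [], []
    for _ in range(n_components):
        w = np.linalg.svd(X0.T @ Y0, full_matrices=False)[0][:, 0]  # unit-length weights
        t = X0 @ w                    # score vector
        p = X0.T @ t / (t @ t)        # X loadings
        q = Y0.T @ t / (t @ t)        # Y loadings
        X0 = X0 - np.outer(t, p)      # remove the part of X described by t
        Y0 = Y0 - np.outer(t, q)      # remove the part of Y described by t
        W.append(w); T.append(t); P.append(p); Q.append(q)
    W, T, P, Q = (np.column_stack(m) for m in (W, T, P, Q))
    R = W @ np.linalg.inv(P.T @ W)    # scores from the original data: T = X R
    B = R @ Q.T                       # regression coefficients: Yest = X B
    return T, P, Q, R, B, X0, Y0

# Synthetic, centred data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
Y = X[:, :2] @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + 0.1 * rng.normal(size=(30, 2))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)

T, P, Q, R, B, X0, Y0 = latent_structure(X, Y, n_components=2)
print("Fraction of X described:", 1 - np.sum(X0**2) / np.sum(X**2))
print("Fraction of Y described:", 1 - np.sum(Y0**2) / np.sum(Y**2))
```

In this sketch the score vectors are mutually orthogonal, the residuals X0 and Y0 are what remains after the selected components, and the estimate XB equals TQ’.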
Chemometric methods
5. Covariance as a measure of relationship
   - X’Y for scaled data measures the strength of the relationship
   - X1’Y = 0 implies that X1 is removed from the analysis
6. Causal analysis, T = XR
   - From score plots we can infer about the original measurement values
   - Control charts for score values can be related to contribution charts
7. Analysis of X
   - Most of the analysis time is devoted to understanding the structure of X
   - Plots are marked by symbols to better identify points in score or loading plots
8. Model validation
   - Cross-validation is used to validate the results
   - Bootstrapping (re-sampling from data) is used to establish confidence intervals
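Cross-validation as a stopping rule for the number of score vectors can be sketched as follows. This is a hedged illustration, not the author's procedure: it assumes K-fold cross-validation with the prediction error sum of squares (PRESS) as criterion and a generic PLS-type regression as the method. The names `pls_fit` and `press_by_components`, the fold count and the synthetic data are all assumptions.

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """Coefficients B of a PLS-type regression on centred X, Y (assumed variant)."""
    X0, Y0 = X.copy(), Y.copy()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = np.linalg.svd(X0.T @ Y0, full_matrices=False)[0][:, 0]
        t = X0 @ w
        p = X0.T @ t / (t @ t)
        q = Y0.T @ t / (t @ t)
        X0 -= np.outer(t, p)
        Y0 -= np.outer(t, q)
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = (np.column_stack(m) for m in (W, P, Q))
    return W @ np.linalg.solve(P.T @ W, Q.T)

def press_by_components(X, Y, max_components, n_folds=5, seed=0):
    """PRESS summed over folds, for 1..max_components score vectors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    press = np.zeros(max_components)
    for test_idx in folds:
        train = np.setdiff1d(np.arange(n), test_idx)
        # Centre with training means only, so the left-out samples stay unseen
        x_mean, y_mean = X[train].mean(axis=0), Y[train].mean(axis=0)
        Xc, Yc = X[train] - x_mean, Y[train] - y_mean
        for a in range(1, max_components + 1):
            B = pls_fit(Xc, Yc, a)
            Y_pred = (X[test_idx] - x_mean) @ B + y_mean
            press[a - 1] += np.sum((Y[test_idx] - Y_pred) ** 2)
    return press

# Synthetic data in which Y depends on two separate directions in X
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
B_true = np.zeros((6, 2))
B_true[0, 0], B_true[2, 0] = 1.0, 0.5
B_true[1, 1], B_true[3, 1] = 1.0, -0.5
Y = X @ B_true + 0.05 * rng.normal(size=(40, 2))

press = press_by_components(X, Y, max_components=4)
best = int(np.argmin(press)) + 1
print("PRESS per number of components:", press)
print("Number of components with the smallest PRESS:", best)
```

Modelling would stop once adding a further score vector no longer lowers the cross-validated error by a meaningful amount.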
Chemometric methods
9. Different methods
   - Different types of data or situations may require different types of method
   - One is looking for interpretations of the latent structure found
10. Theory generation
   - Results from the analysis are used to establish views or theories on the data
   - Results motivate further analysis (groupings, non-linearity, etc.)
Partitioning data, 1
[Figure: data blocks X1, X2, ..., XL, Y1, Y2 and Z1, Z2, Z3, with the labels ’Measurement data’, ’Response data’ and ’Reference data’.]
Partitioning data, 2
- There is often a natural sub-division of data.
- It is often required to study the role of a sub-block.
- A data block with few variables may ’disappear’ among one with many variables; e.g. optical instruments often give many variables.
[Figure: instrumental data X and response data Y, sub-divided into X1, X2, X3 and Y1, Y2; block labels: engineering, chemical process, quality, chemical results.]
Path diagram 1
[Figure: path diagram connecting the data blocks X1, X2, X3, X4, X5, X6 and X7.]
Examples:
- Production process
- Organisational data
- Diagram for sub-processes
- Causal diagram
Path diagram 2, schematic application of modelling
[Figure: path diagram of the data blocks X1 to X7, with new samples x10, x20 and x30.]
x10 is a new sample from X1, x20 is a new one from X2, and x30 is a new one from X3. How do they generate new samples for X4, X5, X6 and X7?
Resulting estimating equations:
X4,est = X1 B14 + X2 B24 + X3 B34
X5,est = X1 B15 + X2 B25 + X3 B35
X6,est = X4 B46 + X5 B56
X7,est = X6 B67
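The estimating equations above can be sketched in code. This is only an illustration under stated assumptions: ordinary least squares stands in for whatever regression method is applied at each block, the dictionary `PATHS` is a hypothetical encoding of the path diagram, and the data are synthetic and noise-free.

```python
import numpy as np

# Hypothetical encoding of the path diagram: target block -> source blocks,
# listed in an order in which every source is available before it is needed.
PATHS = {
    "X4": ["X1", "X2", "X3"],
    "X5": ["X1", "X2", "X3"],
    "X6": ["X4", "X5"],
    "X7": ["X6"],
}

def fit_network(blocks, paths):
    """Regress each target block on its stacked source blocks (least-squares stand-in)."""
    coefs = {}
    for target, sources in paths.items():
        X = np.hstack([blocks[s] for s in sources])
        coefs[target] = np.linalg.lstsq(X, blocks[target], rcond=None)[0]
    return coefs

def propagate(inputs, coefs, paths):
    """Pass new input samples through the network, block by block."""
    est = dict(inputs)
    for target, sources in paths.items():
        # [X1 X2 X3] times the stacked coefficients equals X1 B14 + X2 B24 + X3 B34
        est[target] = np.hstack([est[s] for s in sources]) @ coefs[target]
    return est

# Synthetic, noise-free network data (illustration only)
rng = np.random.default_rng(0)
blocks = {
    "X1": rng.normal(size=(30, 3)),
    "X2": rng.normal(size=(30, 2)),
    "X3": rng.normal(size=(30, 2)),
}
true = {
    "X4": rng.normal(size=(7, 3)),
    "X5": rng.normal(size=(7, 2)),
    "X6": rng.normal(size=(5, 2)),
    "X7": rng.normal(size=(2, 1)),
}
for target, sources in PATHS.items():
    blocks[target] = np.hstack([blocks[s] for s in sources]) @ true[target]

coefs = fit_network(blocks, PATHS)

# New samples x10, x20, x30 for the input blocks
new = {
    "X1": rng.normal(size=(1, 3)),
    "X2": rng.normal(size=(1, 2)),
    "X3": rng.normal(size=(1, 2)),
}
est = propagate(new, coefs, PATHS)
print({name: est[name].round(3) for name in PATHS})
```

In this sketch B14, B24 and B34 are the row blocks of `coefs["X4"]` (rows 0 to 2, 3 to 4 and 5 to 6). Later blocks are estimated from the estimates of earlier blocks, which is how a new input sample reaches X7.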
Path diagram 3
[Figure: path diagram of the data blocks X1 to X7, with time t1 and time t2 marked.]
Data blocks can be aligned to time. Modelling can start at time t2.
Notation and schematic illustrations
[Figure: instrumental data X and response data Y, with the vectors w, t, q and u.]
w: weight vector (to be found)
t: score vector, t = Xw = w1x1 + ... + wKxK
Different views:
a) As a part of a path
b) If the results are viewed marginally
c) If only Xi → Xk
...
Stages in batch processes
[Figure: batch data arranged by batches, time and stages 1, 2, ..., K, with blocks X1, X2, ..., XK and final quality Y.]
Paths:
- X1 → X2 → ... → XK → Y. Given a sample x10, the path model gives estimated samples for later blocks.
- [X1 X2 X3] → X4 → Y. Given values of (x10 x20 x30), estimates for values of x4 and y are given.
- [X1 X2 X3] → [X4 X5] → Y. Given values of (x10 x20 x30), estimates for values of (x4 x5) and y are given.
Schematic illustration of the modelling task for sequential processes
[Figure: sequence of stages with blocks X1, X2, X3, X4 and Y; labels: initial conditions, known process parameters, next stage, later stages, now.]
Plots of score vectors
[Figure: score vectors t1, t2, ..., tL of the blocks X1, X2, ..., XL; panels labelled X1, X1 - X2 (t1 and t2) and X1 - XL (t1 and tL).]
The plots will show how the changes are relative to the first data block.
Graphic software to specify paths
[Figure: screen with the blocks X1, X2, X3, X4, X5, ..., XL.]
Blocks are dragged onto the screen. Relationships are specified.
Pre-processing of data
• Centring. If desired, centring of data is carried out.
• Scaling. In the computations all variables are scaled to unit length (or unit standard deviation if centred). It is checked whether scaling disturbs the variable, e.g. if it is constant except for two values, or if the variable is at the noise level. When the analysis has been completed, values are scaled back so that results are in the original units.
• Redundant variable. It is investigated whether a variable fails to contribute to the explanation of any of the variables that the present block leads to. If it is redundant, it is eliminated from the analysis.
• Redundant data block. It is investigated whether a data block can provide a significant description of the blocks that it is connected to later in the network. If it cannot contribute to the description of those blocks, it is removed from the network.
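The pre-processing steps above can be sketched as follows. The checks are assumptions made for illustration: the slides do not say how ’at the noise level’ or ’redundant’ are decided, so the thresholds `noise_level` and `tol`, the median-based test for nearly constant variables, and the function names are all hypothetical.

```python
import numpy as np

def preprocess(X, centre=True, noise_level=1e-8):
    """Centre (optionally) and scale columns; flag columns where scaling is unsafe."""
    mean = X.mean(axis=0) if centre else np.zeros(X.shape[1])
    Xc = X - mean
    if centre:
        scale = Xc.std(axis=0, ddof=1)        # unit standard deviation
    else:
        scale = np.linalg.norm(Xc, axis=0)    # unit length
    at_noise = scale <= noise_level           # assumed absolute noise threshold
    # Assumed check: constant except for at most two values
    few_deviating = np.array([np.sum(col != np.median(col)) <= 2 for col in X.T])
    suspect = at_noise | few_deviating
    safe_scale = np.where(at_noise, 1.0, scale)   # avoid dividing by (near) zero
    return Xc / safe_scale, mean, safe_scale, suspect

def back_transform(Z, mean, scale):
    """Return scaled values to the original units."""
    return Z * scale + mean

def redundant_variables(Xs, Ys, tol=1e-8):
    """Flag columns of Xs with (near) zero covariance with every column of Ys."""
    return np.linalg.norm(Xs.T @ Ys, axis=1) <= tol

# Synthetic block with two ordinary and two problematic variables
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(10.0, 2.0, size=20),
    rng.normal(-3.0, 0.5, size=20),
    np.full(20, 7.0),                        # constant variable
    np.r_[np.full(18, 5.0), 6.0, 4.0],       # constant except for two values
])
Y = X[:, :2] @ np.array([[1.0], [2.0]])      # later block depends on the first two variables

Z, mean, scale, suspect = preprocess(X)
redundant = redundant_variables(Z, Y - Y.mean(axis=0))
print("Scaling suspect:", suspect)
print("Redundant:", redundant)
```

With real data a covariance is never exactly zero, so a practical redundancy test would compare it with a significance or noise threshold rather than the tiny `tol` used here.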
Post-processing of results
Score vectors computed in the passages through the network are evaluated in the analysis at one passage. Apart from the input blocks, the score vectors found between passages are not independent. The score vectors found in a relationship Xi → Xj are evaluated to see whether all are significant or some should be removed for this relationship.
Cross-validation as in standard regression methods.
Confidence intervals for parameters by resampling techniques.
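Confidence intervals by resampling can be illustrated with a percentile bootstrap. This is a sketch under assumptions: the slides do not name the resampling scheme, so resampling of rows with replacement, the percentile interval and least squares as the regression method are choices made here for illustration. The function name `bootstrap_intervals` is hypothetical.

```python
import numpy as np

def bootstrap_intervals(X, y, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals for least-squares regression coefficients."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    coefs = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample samples with replacement
        coefs[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(coefs, [alpha, 1.0 - alpha], axis=0)
    return lower, upper

# Synthetic data: the third variable has no effect on y
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=60)

lower, upper = bootstrap_intervals(X, y)
for j, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"coefficient {j}: [{lo:.3f}, {hi:.3f}]")
```

An interval that covers zero suggests that the corresponding parameter may not be significant. In a network model the same resampling would be applied to the parameters of each relationship.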