Statistical Modeling: Building a Better Mouse Trap, and others
Dec 10, 2012 at the University of Hong Kong

Stephen Sauchi Lee
Associate Professor of Statistics
Affiliated Professor of Bioinformatics and Computational Biology
Department of Statistics, University of Idaho
Moscow, Idaho, USA

Statistical Modeling
On 3 projects
  Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
  Predicting Parkinson Disease Status
  Demographic Impacts on Social Vulnerability in Norway

Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
Academy of Criminal Justice Sciences NYC 2012-03-16
Zachary Hamilton, PhD
Melanie-Angela Neuilly, PhD
Robert Barnoski, PhD
Washington State University, Pullman, Washington
Stephen S. Lee, PhD
University of Idaho, Moscow, Idaho

Emerging technology of risk assessment
  Four generations:
    1) clinical judgment
    2) static predictors
    3) dynamic factors
    4) automated

Regression methods utilized for instrument creation
  LSI-R: Logistic Regression
  COMPAS: Survival Regression

Recent advancements in prediction rarely utilized for criminal risk assessment
  Decision trees
  Neural networks
  Latent class analysis

Non-linear prediction models
  Linear approaches assume equal additive quality for all factors (Steadman, et al., 2000)
    Typically neglect interaction effects
  Machine-learning approaches mirror diagnostic processes more closely (Steadman, et al., 2000)
  Machine-learning approaches most commonly used:
    Classification Trees (CT) and other recursive partitioning models (CHAID, CART, ICT, Random Forests, etc.)
    Neural Networks (NN)
CHAID = Chi-squared Automatic Interaction Detection

Classification Trees
  Hierarchical question-decision tree model (Breiman, 1984)
  The final answer is the result of a series of conditioning answers (if this -> then that, etc.)
  Used in diagnostic reasoning
  No statistical significance

Random Forests
  Inductive statistical learning
  Aggregation of hundreds of Classification Trees
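
The two ideas above, a tree as a chain of if-then questions and a forest as an aggregation of many trees, can be illustrated with a minimal sketch. The three trees, their predictors and their cut-offs are hypothetical and hand-written for illustration; they are not taken from the study. A real random forest would grow hundreds of trees from bootstrap samples with random subsets of predictors, whereas this sketch only shows the majority-vote step.

```python
from collections import Counter

# Hypothetical toy trees: predictor names and cut-offs are invented for
# illustration and are NOT the rules fitted in the study.
def tree_a(case):
    if case["adult_felonies"] >= 3:
        return 1
    if case["age_at_risk"] < 25:
        return 1
    return 0

def tree_b(case):
    if case["juvenile_felony"]:
        return 1 if case["misd_property"] else 0
    return 1 if case["adult_felonies"] >= 4 else 0

def tree_c(case):
    if case["age_at_risk"] >= 40:
        return 0
    return 1 if case["misd_property"] else 0

def forest_predict(trees, case):
    """Majority vote over the individual tree predictions (1 = recidivism)."""
    votes = Counter(tree(case) for tree in trees)
    return votes.most_common(1)[0][0]

TREES = [tree_a, tree_b, tree_c]  # an odd count avoids ties for a binary outcome

young = {"adult_felonies": 2, "age_at_risk": 22,
         "juvenile_felony": False, "misd_property": True}
older = {"adult_felonies": 1, "age_at_risk": 45,
         "juvenile_felony": False, "misd_property": False}
print(forest_predict(TREES, young), forest_predict(TREES, older))
```

For the first case two of the three trees vote 1, so the forest predicts 1 even though one tree disagrees; that disagreement between trees is what the aggregation is meant to smooth out.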

Hypothetical recidivism tree (figure)

Neural Networks
  Developed in Artificial Intelligence research
  Data mining technique for pattern recognition
  Aim at modeling the lower-level brain functions
  Layered nodes of fact-sets instead of rules, used to train the network
  Based on the training data, the network learns to deduce the right answer to any new piece of information
  Used in psychiatric diagnostics

Schematic Neural Network (figure)
  Recalculation of weights based on predicted and actual outputs

Previous CT and NN research on recidivism prediction
  Studies using CT-like analyses, as well as NN, tend to make use of smaller samples (around 1,500 or fewer) (except Berk et al., 2009; Palocsay et al., 2000; and Silver et al., 2000)
  Overall, results are mixed, but those finding significant improvement via CT use lack proper validation (Liu, 2010)
  Studies using NN show very split results (Liu, 2010)

Gaps in the literature
  Overall, very few studies have investigated the utility of CT and NN for predicting recidivism
  Previous studies have been limited
    In power
    To violence prediction
  The current study remedies such limitations
    Close to million cases
    General recidivism as well as possibilities for investigating offense-specific recidivism

Washington State Static Risk Assessment
  Previously utilized LSI-R
    Found laborious by community corrections officers
    Evaluated to be strengthened by increase of static items (Barnoski, 2003)
  Created current instrument in 2006
    Factors strongly related to recidivism: demographics, juvenile record, commitments to DOC, felonies, misdemeanors, and violations
    Removed dynamic items (interview not required)
    Instrument scored from logistic regression (logit weights)
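
As a rough illustration of what scoring from logit weights means, the sketch below adds up weighted item values and converts the total to a probability with the inverse logit. The intercept, the three items and their weights are hypothetical placeholders chosen for the example; the instrument's actual 24 items and weights are not given in these slides.

```python
import math

# Hypothetical intercept and logit weights, for illustration only.
# These are NOT the Washington State instrument's actual values.
INTERCEPT = -1.2
WEIGHTS = {"male": 0.30, "juvenile_felony": 0.45, "misd_property": 0.50}

def risk_score(case):
    """Linear predictor: intercept plus the sum of weight * item value."""
    return INTERCEPT + sum(w * case.get(name, 0) for name, w in WEIGHTS.items())

def probability(score):
    """Inverse logit: maps the linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

case = {"male": 1, "juvenile_felony": 0, "misd_property": 2}
score = risk_score(case)
print(round(score, 2), round(probability(score), 3))
```

Because the score is a plain weighted sum, it can be computed from static records alone, which fits the slide's point that no interview is required.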

Comparable Predictive Validity for WA Sample (WSIPP, 2007)
  LSI-R AUC = .66
  WA Static Risk AUC = .74

Analysis Plan
  24 variables included from current risk prediction instrument
  3-year follow-up (release from incarceration)
  Any felony recidivism

2-step creation
  Construction sample
    All offenders released from prison or jail placed on community supervision from 1986 to 2000 (N = 287,417)
  Validation sample
    All offenders released from prison or jail placed on community supervision from 2001 to 2002 (N = 71,957)

Compare methods of prediction models
  Area under the receiver operating characteristic (AUC)
    Values of .500s indicate no predictive accuracy
    Where .600s are weak, .700s moderate, and above .800 strong predictive accuracy

Descriptive Statistics
(N=359,374)
Predictor                                              %/Mean (SD)
    White (not included in model)                      79.7
 1. Male                                               18.7
 2. Age At Risk                                        31.7 (10.2)
 3. Adult Felonies                                     2.1 (1.9)
 4. Juvenile Felony Score                              32
 5. Juvenile Person Score                              6
 6. Number of DOC Commitments                          2.0 (1.7)
 7. Homicide/Manslaughter                              1
 8. Felony Sex                                         7
 9. Felony Violent Property                            9
10. Felony Non-Domestic Violence Assault               16
11. Felony Domestic Violence Assault                   2
12. Felony Weapon                                      4
13. Felony Property                                    85
14. Felony Drug                                        62
15. Felony Escape                                      8
16. Misdemeanor Non-Domestic Violence Assault          23
17. Misdemeanor Domestic Violence Assault              21
18. Misdemeanor Sex                                    3
19. Misdemeanor Domestic Violence Other                1
20. Misdemeanor Weapon                                 4
21. Misdemeanor Property                               52
22. Misdemeanor Drug                                   17
23. Misdemeanor Escape                                 1
24. Misdemeanor Alcohol                                17
    New Felony (Outcome)                               44

Logistic regression method
  Extended validation sample of original instrument construction
  Strongest model predictors (weights) were: 1) Misd. Property, 2) Juvenile Felony, 3) Misd. Domestic Violence Assault, 4) Misd. Drug, 5) Misd. Sex, 6) Male
  Findings comparable to original instrument construction

Models             Construction Sample ROCs   Validation Sample ROCs
Original Sample    .756                       .742
Extended Sample    .750                       .749

Random Forest Model
  Strongest model predictors: 1) Felony Adjudications, 2) Misd. Property, 3) Sentence Length, 4) Juvenile Felony, 5) Age, 6) Felony Property

Models                 Construction Sample ROC (SE)   Validation Sample ROC (SE)
Logistic Regression    .750 (.001)                    .749* (.002)
Neural Network         .755* (.001)                   .750* (.002)
Random Forest          .750 (.001)                    .734 (.002)

Model Comparisons
  The ROC curve plots sensitivity (y axis) against specificity (x axis, conventionally shown as 1 - specificity, the false positive rate). The closer the curve is to the upper left corner, the higher the overall accuracy.
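
To make the AUC concrete, here is a minimal sketch that computes it as the proportion of (recidivist, non-recidivist) pairs in which the recidivist received the higher score, with ties counted as one half. This pairwise form is equivalent to the area under the ROC curve. The scores and labels are made-up toy values, and the verbal bands follow the cut-offs quoted earlier in the deck (treating exactly .800 as strong is a choice made here).

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 = event (e.g. new felony), 0 = no event.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def describe(value):
    """Verbal bands for AUC values as quoted in the slides."""
    if value >= 0.8:
        return "strong"
    if value >= 0.7:
        return "moderate"
    if value >= 0.6:
        return "weak"
    return "none"

# Toy data, invented for illustration.
toy_scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(round(toy_auc, 3), describe(toy_auc))
```

The double loop is quadratic in the sample size, which is fine for a toy example; for samples as large as the ones in this study a rank-based implementation would be the practical choice.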

Model Comparisons
  Significant differences found
    Neural network: significantly greater predictive validity than random forest
    Neural network: significantly greater predictive validity than logistic regression, but only in the construction sample

Incremental Utility of Methodological Advancements
  Neural networks performed best, followed by logistic regression and random forest
  ROC differences between methods found to be significant, but not universally
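
The slides do not say which test was used to compare ROC areas. As a rough plausibility check only, the sketch below forms a z statistic from the reported areas and standard errors, treating the two estimates as independent. That independence is an assumption made here: the models were scored on the same cases, so their AUCs are correlated, and a paired method would be the appropriate formal test.

```python
import math

# ROC areas and standard errors as reported in the model comparison table.
REPORTED = {
    "construction": {"LR": (0.750, 0.001), "NN": (0.755, 0.001), "RF": (0.750, 0.001)},
    "validation": {"LR": (0.749, 0.002), "NN": (0.750, 0.002), "RF": (0.734, 0.002)},
}

def z_difference(a, b):
    """Unpaired z statistic for two (AUC, SE) pairs; ignores their correlation."""
    (auc_a, se_a), (auc_b, se_b) = a, b
    return (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)

for sample, models in REPORTED.items():
    for other in ("LR", "RF"):
        z = z_difference(models["NN"], models[other])
        print(f"{sample}: NN vs {other}, z = {z:.2f}")
```

Under this approximation the neural network exceeds logistic regression clearly in the construction sample but not in the validation sample, and exceeds random forest in both, which is in line with the pattern the slide reports.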
  Preliminary nature of findings is stressed

Limitations
  Lack of specificity of outcome measure and sample heterogeneity
    Any felony within 3 years
    Specialization and taxonomic structures not considered
  Unit of analysis is incarceration cycle
    Violation of independence assumption for repeat incarcerations
  Exclusion of dynamic predictors

Future Findings and Policy Implications
  Add dynamic predictors to models
    Available since 2008
    Prior/preliminary findings indicate only modest improvement
  Examine impact of latent variable methods
    4th potential model
  Disentangle heterogeneity
    Subgroup analyses based on offense specialties, i.e. drug, violent, sex offender

Predicting Parkinson's disease status with vocal dysphonia measurements
Roxana Hickey
Bioinformatics & Computational Biology
Statistics 519 Multivariate Statistics Term Project
Professor Stephen Lee
April 27, 2011

Outline
  Background
    Parkinson's disease
    Vocal dysphonia
  Study dataset
  Statistical analyses
  Conclusions

Parkinson's disease
  Neurological disorder that leads to shaking and difficulty with walking, movement and coordination [1]
  Affects >1 million people in North America [2]
    rapidly increased prevalence after age 60 [3]
  No cure, but medication available to alleviate symptoms, especially in early stages [4]
    early detection key to effective treatment strategies
  http://www.healthtree.com/articles/parkinsons-disease/causes/
  1 = PubMed Health

Parkinson's disease & vocal impairment
  ~90% of individuals with Parkinson's disease have some form of vocal impairment [5, 6]
  characteristics [7]
    dysphonia (impaired production of vocal sounds)
    dysarthria (problems with normal articulation in speech)
    may be one of earliest indicators of onset of illness [8]
  Tests for vocal impairment [9, 10]
    sustained phonations [11, 12] (focus of this study)
      produce single vowel and hold pitch constant
    running speech [12]
      speak standard sentences that contain a representative sample of linguistic units

Measures of assessing vocal dysphonia
  Traditional methods [11, 12]
    pitch (F0, fundamental frequency of vocal oscillation)
    absolute sound pressure level (loudness)
    jitter (variation in F0 from vocal cycle to vocal cycle)
    shimmer (variation in amplitude)
    noise-to-harmonics ratio
  Novel methods [13, 14]
    nonlinear dynamical systems theory and nonlinear time series analysis
    recurrence period density entropy
    detrended fluctuation analysis
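
Jitter and shimmer are both cycle-to-cycle variation measures, so they can be sketched with the same arithmetic applied to different inputs. The functions below use common textbook definitions: local jitter is the mean absolute difference between consecutive periods divided by the mean period, and the DDP variant uses differences of differences, as described for the dataset's Jitter:DDP attribute. These are assumptions for illustration; the MDVP software's exact algorithms may differ, and the input values are invented.

```python
def jitter_local(periods):
    """Mean absolute difference between consecutive periods / mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def jitter_ddp(periods):
    """Mean absolute difference of consecutive differences / mean period."""
    if len(periods) < 3:
        raise ValueError("need at least three cycles")
    first = [b - a for a, b in zip(periods, periods[1:])]
    second = [abs(b - a) for a, b in zip(first, first[1:])]
    return (sum(second) / len(second)) / (sum(periods) / len(periods))

def shimmer_local(amplitudes):
    """Same cycle-to-cycle ratio as jitter_local, applied to peak amplitudes."""
    return jitter_local(amplitudes)

# Invented glottal cycle lengths in milliseconds, for illustration only.
toy_periods = [10.0, 12.0, 10.0, 12.0]
print(round(jitter_local(toy_periods), 4), round(jitter_ddp(toy_periods), 4))
```

A perfectly steady voice would give zero for all three measures; larger values indicate less stable phonation.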

Measures of assessing vocal dysphonia
  Measurements differ in robustness [14]
    uncontrolled variation in acoustic environment
    physical condition and characteristics of subject
  Therefore, chosen measurement methods should be as robust as possible to this variation
  Goal of the study: identify an optimal feature set that is both robust to uncontrolled variation and able to classify patients with Parkinson's disease based on vocal dysphonic symptoms
  Additional advantage: possibility of monitoring patients remotely

http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data

Subjects & methods
  Subjects
    31 individuals
      8 healthy
      23 with Parkinson's disease (PD)
    average of six sustained vowel phonations recorded from each subject
    Total n=195
  Calculation of features via software programs
    traditional measures
    non-standard measures, including new measure proposed by authors: pitch period entropy

Variables
Attribute           Description
MDVP:Jitter(%)      MDVP jitter as percentage
MDVP:Jitter(Abs)    MDVP absolute jitter in microseconds
MDVP:RAP            MDVP Relative Amplitude Perturbation
MDVP:PPQ            MDVP five-point Period Perturbation Quotient
Jitter:DDP          Average absolute difference of differences between cycles, divided by the average period
MDVP:Shimmer        MDVP local shimmer
MDVP:Shimmer(dB)    MDVP local shimmer in decibels
Shimmer:APQ3        3-pt Amplitude Perturbation Quotient
Shimmer:APQ5        5-pt Amplitude Perturbation Quotient
MDVP:APQ            MDVP 11-point Amplitude Perturbation Quotient
Shimmer:DDA         Avg abs. diff. between consecutive differences between the amplitudes of consecutive periods
NHR                 Noise-to-Harmonics Ratio
HNR                 Harmonics-to-Noise Ratio
RPDE                Recurrence Period Density Entropy
D2                  Correlation dimension
DFA                 Detrended Fluctuation Analysis
spread1             Nonlinear measure of fundamental frequency variation
spread2             Nonlinear measure of fundamental frequency variation
PPE                 Pitch period entropy
MDVP:Fo(Hz)         Average vocal fundamental frequency
MDVP:Fhi(Hz)        Maximum vocal fundamental frequency
MDVP:Flo(Hz)        Minimum vocal fundamental frequency
MDVP = (Kay Pentax) Multi-Dimensional Voice Program

Variable groups
  Measures of variation in amplitude
  Measures of variation in fundamental frequency
  Measures of ratio of noise to tonal components in voice
  Nonlinear dynamical complexity measures
  Signal fractal scaling exponent
  Nonlinear measures of fundamental frequency variation
Grouping variable: status
  =0 (healthy)
  =1 (PD)

Statistical analyses
  EDA
  PCA
  MANOVA
  Hotelling's T2
  QDA
  Classification tree (with random forest)

EDA (figures; 0 = healthy, 1 = PD), including a parallel coordinate plot

PCA (figure)

MANOVA
  H0: μ(healthy) = μ(Parkinson's)

Hotelling's T2 test
  H0: μ(healthy) = μ(Parkinson's) (p=22)
  T-square test statistic = 187.48
  df = 48 + 147 - 2 = 193
  critical T2(0.05; 22, 193) ≈ 47 (extrapolated)
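
For reference, the two-sample statistic and its usual F-based critical value can be written as below. This is the standard textbook relation, not something shown on the slide, which quotes a critical value extrapolated from a table. The symbols n_1, n_2 (group sizes, here 48 and 147), the group mean vectors, and the pooled covariance S_pooled are notation introduced for this sketch.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{x}_1 - \bar{x}_2)^{\top} S_{\text{pooled}}^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
T^2_{\alpha;\,p,\,n_1+n_2-2}
  = \frac{(n_1 + n_2 - 2)\,p}{n_1 + n_2 - p - 1}\;
    F_{\alpha;\,p,\,n_1+n_2-p-1}
```

With p = 22 and n_1 + n_2 - 2 = 193, the multiplier is 193 * 22 / 172, roughly 24.7, applied to the upper 5% point of an F distribution with 22 and 172 degrees of freedom.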
  Conclusion: reject H0 (α = 0.05)
  Coefficients of a indicate relative importance of variables in multivariate T-square test

QDA, cross-validated (park.qda.cv)
                 Classified
Actual           0       1
0 (healthy)      35      13
1 (PD)           9       138

  500,000 NOK $85,000 USD (2010)
  Percent elderly (Old age dependency) (2010)
  Percent employed in primary industries, i.e. mining, fishing, farming (2010)
  Percent Labor Force participation (2010)
  Percent unemployed (2010)
  Percent paid for Social Assistance (2009)
    Percent over age 25 with only completed primary education (2010)
    Percent over age 25 with secondary education attainment (2010)
    Percent over age 25 with attainment beyond secondary w/o completion of tertiary (2010)
    Percent over age 25 with attainment of tertiary education (2010)
  Percent Voter turnout (2008)
  Percent Municipal Net Loan to Gross Revenue (2010)
  Percent Municipal Net loan debt per capita (2010)
  Percent Municipal Long term debt to Revenue (2010)

Barents vs. Non-Barents Municipalities (map)
  N=430
  Non-Barents (N=342)
  Barents (N=88)

            Df   Hotelling-Lawley   approx F   num Df   den Df   Pr(>F)
barentsF     1             1.5733     43.424       15      414   < 2.2e-16 ***
Residuals  428
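
The approximate F in this output can be reproduced from the Hotelling-Lawley trace. For a hypothesis with a single degree of freedom (two groups), F equals the trace times den Df over num Df, with num Df equal to the number of response variables and den Df equal to the residual Df minus that number plus one. The count of 15 response variables is inferred here from the num Df column; it is not stated elsewhere on the slide.

```python
def approx_f_single_df(hl_trace, n_responses, residual_df):
    """Convert a Hotelling-Lawley trace to F for a one-df hypothesis.

    Returns (F, numerator df, denominator df).
    """
    num_df = n_responses
    den_df = residual_df - n_responses + 1
    return hl_trace * den_df / num_df, num_df, den_df

# Trace and residual df from the output above; 15 responses is an inference.
f_value, num_df, den_df = approx_f_single_df(1.5733, 15, 428)
print(round(f_value, 3), num_df, den_df)
```

The result agrees with the reported 43.424 up to rounding of the trace, and the degrees of freedom come out as 15 and 414, matching the table.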

Component                                            Eigenvalues   % of total variance
  Variables and (component loadings)

1. Age, Income, School, Migration and Labor Force    2.324         33.75%
  Percent Elderly (0.730)
  Income < 150,000 (0.762)
  Percent Primary Sector (0.651)
  Percent Tertiary School1 (-0.738)
  Percent Tertiary School2 (-0.641)
  Net Migration (-0.695)
  Labor Force Part. (-0.612)
  Income > 500,000 (-0.814)

2. Social Welfare                                    1.692         17.90%
  Percent Unemploy. (0.599)
  Upper Secondary Ed. (-0.576)
  Social Assistance (0.543)
  Labor Force Part. (-0.527)

3. Debt                                              1.305         10.65%
  Net Loan to Gross Rev. (0.617)
  Long Term debt (0.637)
  Net Loan Debt/capita (0.709)

4. Education                                         1.115         7.77%
  Tertiary Ed (-0.543)
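
A quick arithmetic check relates the two numeric columns of the component table. If the PCA was run on 16 standardized variables (an assumption made here, based on the 16 variables mentioned for the QDA), total variance is 16, and squaring the first numeric column and dividing by 16 reproduces the reported percentages. That suggests the column holds component standard deviations, that is, square roots of the eigenvalues, rather than the eigenvalues themselves.

```python
# Values copied from the component table. N_VARIABLES = 16 is an assumption
# taken from the "same 16 variables" wording on the QDA slide.
N_VARIABLES = 16
TABLE = [
    ("Age, Income, School, Migration and Labor Force", 2.324, 33.75),
    ("Social Welfare", 1.692, 17.90),
    ("Debt", 1.305, 10.65),
    ("Education", 1.115, 7.77),
]

def percent_variance(std_dev, n_variables=N_VARIABLES):
    """Share of total variance for one component of standardized data."""
    return 100.0 * std_dev ** 2 / n_variables

for name, std_dev, reported in TABLE:
    computed = percent_variance(std_dev)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

All four computed shares land within about 0.01 percentage points of the reported ones, which is consistent with rounding of the tabled values.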

Plots of municipalities on First 3 Principal Components (figure; Barents vs. Non-Barents)

Standard Deviations on First Principal Component (figure)

QDA Analysis
  Quadratic Discriminant Analysis on the same 16 variables.
  Results illustrate a discernible difference between North and South.
  Correct classification rate of 94.65% (cross-validated)
  1 = Non-Barents, 2 = Barents

Discussion
  Distinction between North and South
    urbanization
    Migration and social vulnerability
    Life Biography (20 somethings)
  Caveats
    Missing variables (ethnic minority)
    Indigenous group Sami
  Further research
    Community level analysis

Photo by Hildegun Johnsen

Questions, Feedback?
Thank you

Determining the Geographic Origin of Potatoes with Trace Metal Analysis Using Statistical and Neural Network Classifiers
  The objective of this research was to develop a method to confirm the geographical authenticity of Idaho-labeled potatoes as Idaho-grown potatoes.
  Elemental analysis (K, Mg, Ca, Sr, Ba, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, S, Cd, Pb, and P)
  PCA, CDA, discriminant function analysis, k-nearest neighbors, and neural network