Worcester Polytechnic Institute Towards Assessing Students’ Fine Grained Knowledge: Using an Intelligent Tutor for Assessing Mingyu Feng August 18 th , 2009 Ph.D. Dissertation Committee: Prof. Neil T. Heffernan (WPI) Prof. Carolina Ruiz (WPI) Prof. Joseph E. Beck (WPI) Prof. Kenneth R. Koedinger (CMU)
Motivation – the need
Concerns about poor student performance on new state tests
High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act
Student performance is not satisfactory
Massachusetts (2003: 20% failed 10th grade math on the first try)
Worcester
Secondary teachers are asked to be data-driven
MCAS test reports
Formative assessment and practice tests
Provided by Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc.
Motivation – the problems
I: Formative assessment takes time from instruction
NCLB or NCLU (No Child Left Untested)?
Every hour spent assessing students is an hour lost from instruction
Limited classroom time compels teachers to make a choice
Motivation – the problems
II: Performance reports are not satisfactory
Teachers want more frequent and more detailed reports
Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action. At http://www.syrce.org/State_Board.htm
Main Contributions
Improved assessment system by taking into account how much assistance students need (WWW’06; ITS’06; EDM’08; UMUAI Journal’09 (nominated for James Chen award))
Established a way to track and predict performance longitudinally over multiple years (WWW’06; EDM’08)
Rigorously evaluated the effectiveness of the skill models of various granularities (AAAI’06 EDM Workshop; TICL’07; IEEE Journal’09)
Used data mining approach to evaluate effectiveness of individual contents (AIED’09)
Used data mining to refine existing skill models (EDM’09; in preparation)
Developed an online reporting system deployed and used by real teachers (AIED’05; Book chapter’07; TICL Journal’06; JILR Journal’07)
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students’ progress
Teachers like ASSISTments
Students like ASSISTments
We break multi-step items (original question) into scaffolding questions
Attempt: a student takes an action to answer a question
Response: the correctness of student answer (1/0)
Hint Messages: given on demand that give hints about what step to do next
Buggy Message: a context sensitive feedback message
Skill: a piece of knowledge required to answer a question
An ASSISTment
Facts about ASSISTments
5000+ students have used the system regularly
More than 10 million data records collected
Other features: learning experiments; authoring tools; account and class management toolkit …
The dissertation uses data of about 1000 students who used ASSISTments during 2004-2006
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K.R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent Educational Machines within the Intelligent Systems Engineering Book Series . pp.23-49. Springer Berlin / Heidelberg.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.
Where does this score come from?
Automated Assessment
Big idea: use data collected while a student uses ASSISTment to assess him
Lots of types of data available (last screen just used % correct on original questions)
Lots of other possible measures
Why should we be more complicated?
A Grade Book Report
Static – does not distinguish “Tom” and “Jack”
Average – ignores development over time
Uninformative – not informative for classroom instruction
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic assessment
Dynamic Assessment – the idea
Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children’s learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.
Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983).
Dynamic vs. Static Assessment
Developing dynamic testing metrics
# attempts
# minutes to come up with an answer; # minutes to complete an ASSISTment
# hint requests; # hint-before-attempt requests; # bottom-out hints
% correct on scaffolds
# problems solved
“Static” measure: correct/wrong on original questions
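As a rough illustration of how the dynamic metrics listed above could be derived from a tutoring log, here is a minimal Python sketch. The `Action` record, its field names, and the sample log are hypothetical and are not taken from the ASSISTments data format; "% correct on scaffolds" is assumed here to mean the share of scaffolding questions answered correctly on the first attempt.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Action:
    """One logged student action (hypothetical schema, for illustration only)."""
    question: str               # hypothetical question id
    kind: str                   # "attempt" or "hint"
    correct: bool = False       # only meaningful for attempts
    bottom_out: bool = False    # only meaningful for hints
    scaffold: bool = False      # True when the question is a scaffolding question
    minutes: float = 0.0        # time spent before this action


def dynamic_metrics(log: List[Action]) -> Dict[str, float]:
    """Summarize one student's log into assistance-based measures."""
    seen = set()                # questions that already have at least one action
    first_scaffold = {}         # scaffold question -> correctness of its first attempt
    hint_first = 0              # questions where a hint came before any attempt
    for a in log:
        if a.question not in seen:
            seen.add(a.question)
            if a.kind == "hint":
                hint_first += 1
        if a.kind == "attempt" and a.scaffold and a.question not in first_scaffold:
            first_scaffold[a.question] = a.correct
    attempts = [a for a in log if a.kind == "attempt"]
    hints = [a for a in log if a.kind == "hint"]
    pct = sum(first_scaffold.values()) / len(first_scaffold) if first_scaffold else 0.0
    return {
        "attempts": len(attempts),
        "hints": len(hints),
        "hint_before_attempt": hint_first,
        "bottom_out_hints": sum(h.bottom_out for h in hints),
        "pct_correct_scaffolds": pct,
        "minutes": sum(a.minutes for a in log),
    }


# Made-up log: one wrong answer on the original question, then two scaffolds.
log = [
    Action("orig", "attempt", correct=False, minutes=1.5),
    Action("s1", "hint", scaffold=True, minutes=0.5),
    Action("s1", "attempt", correct=True, scaffold=True, minutes=0.5),
    Action("s2", "attempt", correct=True, scaffold=True, minutes=1.0),
]
metrics = dynamic_metrics(log)
```

The static measure would keep only the first row (wrong on the original question); the dynamic measures also record that the student needed a single hint and then answered both scaffolds.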
Dynamic Assessment – data
2004-2005 Data
Sept, 2004 – May, 2005
391 students
Online data
ASSISTments data enables us to assess more accurately
The relative success of the assistance model over the standard test model highlights the power of the dynamic measures
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online System that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal). 19(3), 2009.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Can we have our cake and eat it, too?
Most large standardized tests are unidimensional or low-dimensional.
Yet, teachers need fine grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie, & Ciofalo, 2008; Stiggins, 2005)
Can we have our cake and eat it, too?
Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY.
Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. Retrieved from http://www.tcrecord.org/PrintContent.asp?ContentID=15363 on October 13, 2008.
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4), 324-328.
McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring.
Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis.
Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities?
McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E. and McCalla, G. I., (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pages 39-62. Springer-Verlag, Berlin.
Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).
Building Skill Models
[Figure: skill model hierarchy. WPI-1: Math. WPI-5 (categories shown): Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense]
Cognitive Diagnostic Assessment – data
2004-2005 Data
Sept, 2004 – May, 2005
447 students
Online data: 7.3 days; 87 items (sd. = 35)
Item level response of 8th grade MCAS test (May, 2005)
2005-2006 Data
Sept, 2005 – May, 2006
474 students
Online data: 5 days; 51 items (sd. = 24)
Item level 8th grade MCAS scores (May, 2006)
All online and MCAS items have been tagged with all four skill models
Cognitive Diagnostic Assessment - modeling
Fit mixed-effects logistic regression model
Predict MCAS score
Extrapolate the fitted model in time to the month of the MCAS test
Obtain probability of getting each MCAS question correct, based upon skill tagging of the MCAS item
Sum up probabilities to get total score
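The three prediction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-skill intercepts and slopes are made-up numbers standing in for one student's fitted log-odds terms, the skill tags are hypothetical, and the test month is taken to be 8 (May, counting September as month 0).

```python
import math
from typing import Dict, List


def inv_logit(x: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_score(intercepts: Dict[str, float],
                  slopes: Dict[str, float],
                  item_skills: List[str],
                  month: float) -> float:
    """Extrapolate each skill to `month`, then sum per-item probabilities."""
    total = 0.0
    for skill in item_skills:
        log_odds = intercepts[skill] + slopes[skill] * month
        total += inv_logit(log_odds)
    return total


# Hypothetical fitted values for one student (log-odds scale).
intercepts = {"Geometry": 0.0, "Measurement": -0.8}
slopes = {"Geometry": 0.0, "Measurement": 0.1}

# Hypothetical skill tags for a three-item test.
test_items = ["Geometry", "Geometry", "Measurement"]

predicted = predict_score(intercepts, slopes, test_items, month=8)
```

With a finer-grained skill model the same loop runs over more specific tags, so two items in the same broad strand can receive different probabilities.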
-- X_ijkt is the 0/1 response of student i on question j tapping skill k in month t
-- Month_t is elapsed month in the study; 0 for September, 1 for October, and so on
-- β_0k and β_1k: respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping skill k
-- β_00 and β_10: the group average incoming knowledge level and rate of change
-- β_0 and β_1: the baseline level of achievement and rate of change of the student
Longitudinal model (e.g. Singer & Willett, 2003)
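The model equation itself did not survive in this transcript. One form consistent with the definitions above, offered as a reconstruction rather than the slide's exact formula, adds the group, student, and skill terms on the log-odds scale. The student subscript i on the student-level terms is an assumption; the transcript shows them without it.

```latex
\[
\operatorname{logit}\,\Pr(X_{ijkt} = 1)
  = (\beta_{00} + \beta_{0i} + \beta_{0k})
  + (\beta_{10} + \beta_{1i} + \beta_{1k}) \cdot \mathrm{Month}_t
\]
```

Read this way, the first bracket is the student's baseline for skill k in September and the second is the monthly rate of change in log-odds.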
How do I Evaluate Models?

04-05 Data: real MCAS score and ASSISTment predicted score by skill model
Student   Real MCAS   WPI-1   WPI-5   WPI-39   WPI-78
Mary      25.00       23.31   22.85   22.18    20.47
Tom       32.00       29.66   29.15   28.67    27.13
…
Sue       29.00       28.46   28.23   27.85    26.26
Dick      28.00       27.41   26.70   26.12    24.30
Harry     22.00       23.33   22.58   22.02    20.14

Absolute Difference
Student   WPI-1   WPI-5   WPI-39   WPI-78
Mary      1.69    2.15    2.82     4.53
Tom       2.34    2.85    3.33     4.87
…
Sue       0.54    0.77    1.15     2.74
Dick      0.59    1.30    1.88     3.70
Harry     1.33    0.58    0.02     1.86
MAD       4.42    4.37    4.22     4.11
%Error    13.00%  12.85%  12.41%   12.09%

Paired two-sample t-test
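The evaluation measures can be sketched as follows, using only the five students visible in the table. Because the elided rows are missing, the MAD computed here will not match the reported 4.42 and 4.11. The full score of 34 is an assumption inferred from the reported pairs (for example 4.42 / 34 = 13.00%), and pairing the t-test on per-student absolute differences is likewise an assumption about how the comparison was run.

```python
import math
from statistics import mean, stdev

FULL_SCORE = 34  # assumption: inferred from 4.42 / 34 = 13.00%

# Visible rows only: Mary, Tom, Sue, Dick, Harry.
real = [25.00, 32.00, 29.00, 28.00, 22.00]
predicted = {
    "WPI-1": [23.31, 29.66, 28.46, 27.41, 23.33],
    "WPI-78": [20.47, 27.13, 26.26, 24.30, 20.14],
}


def abs_diffs(actual, pred):
    """Per-student absolute difference between real and predicted score."""
    return [abs(a - p) for a, p in zip(actual, pred)]


def mad(actual, pred):
    """Mean absolute difference across students."""
    return mean(abs_diffs(actual, pred))


def pct_error(mad_value, full_score=FULL_SCORE):
    """MAD expressed as a percentage of the full test score."""
    return 100.0 * mad_value / full_score


def paired_t(xs, ys):
    """t statistic for paired samples (mean difference over its standard error)."""
    d = [x - y for x, y in zip(xs, ys)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


mad_1 = mad(real, predicted["WPI-1"])
mad_78 = mad(real, predicted["WPI-78"])
t_stat = paired_t(abs_diffs(real, predicted["WPI-1"]),
                  abs_diffs(real, predicted["WPI-78"]))
```

A p-value would come from comparing `t_stat` against a t distribution with n - 1 degrees of freedom; that step is left out to keep the sketch free of third-party dependencies.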
Comparing Models of Different Granularities
P-values shown on the slide: P = 0.21; P < 0.001; P = 0.006; P = 0.10
1-parameter IRT model (MAD, %Error as shown): 4.67, 13.70%; 4.36, 12.83%

04-05 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          4.42     4.37     4.22     4.11
%Error       13.00%   12.85%   12.41%   12.09%

05-06 Data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          6.58     6.51     4.83     4.99
%Error       19.37%   19.14%   15.10%   14.70%

P-values shown on the slide: P < 0.001; P < 0.001; P < 0.001; P = 0.03
The Effect of Scaffolding - hypothesis
Only using original questions makes it hard to decide which skill to “blame”
Scaffolding questions aid in diagnosis by directly assessing a single skill
Hypotheses
Using responses to scaffolding questions will improve prediction accuracy
Scaffolding questions are more useful for fine-grained models
The Effect of Scaffolding - results
Fine-grained models do the best job estimating student skill level overall
Not necessarily the best for all consumers (e.g. principals)
Need ability to diagnose (e.g. scaffolding questions)
Scaffolding questions
Help improve overall prediction accuracy
More useful for fine-grained models
Feng, M., Heffernan, N.T., Mani, M. & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds). Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. pp. 57-66.
Feng, M., Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)
Pardos, Z., Feng, M., Heffernan, N.T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed effect methods. In Luckin & Koedinger (Eds.) Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press. pp. 626-628.
Future Work - Skill Model Refinement
We found that WPI-78 predicts a state test better than some less fine-grained models
However, WPI-78 may have some mis-taggings
Expert-built models are subject to the risk of “expert blind spot”
Our best guess in a 7-hour coding session
A best-guess model should be iteratively tested and refined
Skill Model Refinement - approaches
Human experts manually update hand-crafted models
(1,000+ items) * (100+ skills)
Not practical to do it often
Data mining can help
Skills or items with high residuals
Skills consistently over-predicted or under-predicted
“Un-learned” skills (i.e. negative slopes from mixed-effects models)
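The three data-mining signals above could be turned into a simple screening pass over a fitted skill model. The sketch below is illustrative only: the residual values, slopes, and the 0.2 threshold are made up, and a residual is assumed to be observed minus predicted, so a consistently positive residual means the skill is under-predicted.

```python
from statistics import mean
from typing import Dict, List


def flag_skills(residuals: Dict[str, List[float]],
                slopes: Dict[str, float],
                threshold: float = 0.2) -> Dict[str, List[str]]:
    """Return, per skill, the reasons it may deserve an expert's second look."""
    flags: Dict[str, List[str]] = {}
    for skill, res in residuals.items():
        reasons = []
        if mean(abs(r) for r in res) > threshold:
            reasons.append("high residual")
        if all(r > 0 for r in res):
            reasons.append("under-predicted")
        elif all(r < 0 for r in res):
            reasons.append("over-predicted")
        if slopes.get(skill, 0.0) < 0:
            reasons.append("negative slope")
        if reasons:
            flags[skill] = reasons
    return flags


# Hypothetical per-skill residuals (observed - predicted) and learning slopes.
residuals = {
    "Circle-area": [0.30, 0.40, 0.35],
    "Perimeter": [0.05, -0.04, 0.02],
    "Congruence": [-0.10, -0.15, -0.12],
}
slopes = {"Circle-area": 0.02, "Perimeter": -0.01, "Congruence": 0.03}

flags = flag_skills(residuals, slopes)
```

The output is a shortlist for a human expert to review rather than an automatic re-tagging, which keeps the manual effort focused on the suspicious skills.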
Feng, M., Heffernan, N., Beck, J., & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.). Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.
Some items in a random sequence cause significantly less learning than others
Hypothesis
Problems that “don’t help” students learn might be teaching a different skill(s)
Create factor tables
Preliminary results show some validity
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
Skill Factor
Circle-area High
Circle-area High
Circle-area High
Circle-area Low
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Conclusion of the Dissertation
The dissertation establishes novel assessment methods to better assess students in tutoring systems
Assess students better by analyzing their learning behaviors when using the tutor
Assess students longitudinally by tracking learning over time
Assess students diagnostically by modeling fine-grained skills
Comments from the Education Secretary
Secretary of Education, Arne Duncan weighed in (in Feb 2009) on the NCLB Act, and called for continuous assessment
Duncan says he is concerned about overtesting but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. "If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing," he says.
Ramírez, E., & Clark, K. (Feb., 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved on March 8th, 2009 from http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html.
General implication
Continuous assessment systems are possible to build (we built one)
Save classroom instruction time by assessing students during tutoring
Track individual progress and help stakeholders get student performance information
Provide teachers with fine-grained, cognitively diagnostic feedback to be “data-driven”
A metaphor for this shift
Committee on the Foundations of Assessment; Board on Testing and Assessment; Center for Education; National Research Council. James W. Pellegrino, Naomi Chudowsky, Robert Glaser (page 284).
Businesses don’t close down periodically to take inventory of stock any more
Bar code; auto-checkout
Non-stop business
Richer information
Acknowledgement
My advisor Neil Heffernan
Committee members
Ken Koedinger
Carolina Ruiz
Joe Beck
The ASSISTment team
My family
Many more…
Thanks!
Questions?
Backup slides
Motivation – the problems
III: The “moving” target problem
Testing and instruction have been separate fields of research with their own goals
Psychometric theory assumes a fixed target for measurement
ITS wants student ability to “move”
More Contributions
Working systems
www.ASSISTment.org
The reporting system that gives cognitive diagnostic reports to teachers in a timely fashion
Establish an easy approach to detect the effectiveness of individual tutoring content
AIED’09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). pp. 523-530. Amsterdam, Netherlands: IOS Press.
Evidence
62% 50% 37% 37%
Evidence
1. Congruence
2. Perimeter
3. Equation-Solving
Terminology
MCAS
Item/question/problem
Response
Original question
Scaffolding question
Hint message
Bottom-out hint
Buggy message
Attempt
Skill/knowledge component
Skill model/cognitive model/Q-matrix
Single mapping model
Multi-mapping model
The reporting system
I developed the first reporting system for ASSISTments in 2004 that is online, live, and gives detailed feedback at a grain size for guiding instruction
The grade book
“It’s spooky; he’s watching everything we do”. – a student
BIC: Bayesian Information Criterion (the lower, the better)
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.). Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin. pp. 31-40. 2006.
Mixed effects models
Individuals in the population are assumed to have their own subject-specific mean response trajectories over time
The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects)
It is possible to predict how individual response trajectories change over time
Flexibility in accommodating imbalance in longitudinal data
Methodological features:
1) 3 or more waves of data
2) An outcome variable (dependent variable) whose values change systematically over time
3) A sensible metric for time that is the fundamental predictor in the longitudinal study
Sample longitudinal data
Comparison of Approaches
Ayers & Junker (2006)
Estimate student proficiency using
1-PL IRT model
LLTM (linear logistic test model)
Main question difficulty decomposed into K skills
1-PL IRT fits dramatically better
Only main questions used
Additive, non-temporal
WinBUGS
Comparison of Approaches
Pardos et al. (2006)
Conjunctive Bayes nets
Non-temporal
Scaffolding used
Bayes Net Toolbox (Murphy, 2001)
DINA model (Anozie, 2006)
Comparison of Approaches
Feng, Heffernan, Mani & Heffernan (2006)
Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM)
Temporal
X_ijkt is the 0/1 response of student i on question j tapping KC k in month t
Month_t is elapsed month in the study; β_0k and β_1k are respective fixed effects for baseline and rate of change in probability of correctly answering a question tapping KC k
R lme4 library
Comparison of Approaches
Comparing to LLTM in Ayers & Junker (2006)
Student proficiency depends on time
Question difficulty depends on KC and time
Assign only the most difficult skill instead of full Q-matrix mapping of multiple skills as in LLTM
Scaffolding used to gain identifiability
Ayers & Junker (2006) use regression to predict MCAS after obtaining estimate of student ability (θ) (MAD = 10.93%)
No such regression process in my work
logit(p=1) = θ – 0; estimated score = full score * p
Higher MAD, but provides diagnostic information
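The last formula, which maps an ability estimate directly to a test score without an extra regression step, can be illustrated with a small Python sketch. The full score of 34 and the function name are assumptions made for the example; the difficulty term is fixed at 0 as on the slide.

```python
import math


def estimated_score(theta: float, full_score: float = 34.0,
                    difficulty: float = 0.0) -> float:
    """Map ability theta to a score: logit(p) = theta - difficulty, score = full_score * p."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return full_score * p


average_student = estimated_score(0.0)   # theta = 0 gives p = 0.5
stronger_student = estimated_score(1.0)  # higher ability gives a higher score
```

Because the score is a fixed transformation of θ, no parameters are tuned against MCAS results, which is one plausible reason for the higher MAD noted on the slide.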
Comparison of Approaches
Comparing to Bayes nets and conjunctive models
Bayes: probability reasoning; conjunctive
GLMM: linear learning; max-difficulty reduction
Computationally much easier and faster
Results are still comparable
GLMM is better than Bayes nets when WPI-1, WPI-5 used
GLMM is comparable with Bayes nets when WPI-39 or WPI-
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
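[Table: sample learning decomposition data for student Tom. Columns: Student, Item, Prior encounters (t1, t2, t3, t4), Correct?. Values as extracted: 1, 0, 0, 1 under Prior encounters; 1, 1, 1, 0 under Correct?; rows 011, 010, 000, 000 under t4 t3 t2]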
Detect relative instructional effectiveness among items in the same GLOP using learning decomposition.
Searching Results
Among 38 GLOPs, LFA found significantly better models for 12
Shall I be happy?
“Sanity” check: randomly assigned factor tables

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)

Further work needs to be done
Quantitatively measure whether and how data analysis results can be helpful for subject-matter experts
Explore the automatic factor assigning approach on more data for other systems
Contrast with human experts as controlled condition