Journal of Theoretical and Applied Computer Science
Vol. 6, No. 3, 2012

Contents:

QCA & CQCA: QUAD COUNTRIES ALGORITHM AND CHAOTIC QUAD COUNTRIES ALGORITHM
M. A. Soltani-Sarvestani, Shahriar Lotfi .......................................... 3

EFFECTIVENESS OF MINI-MODELS METHOD WHEN DATA MODELLING WITHIN A 2D-SPACE IN AN INFORMATION DEFICIENCY SITUATION
Marcin Pietrzykowski ....................................................... 21

SMARTMONITOR: RECENT PROGRESS IN THE DEVELOPMENT OF AN INNOVATIVE VISUAL SURVEILLANCE SYSTEM
Dariusz Frejlichowski, Katarzyna Gościewska, Paweł Forczmański, Adam Nowosielski, Radosław Hofman .......................................................... 28

NONLINEARITY OF HUMAN MULTI-CRITERIA IN DECISION-MAKING
Andrzej Piegat, Wojciech Sałabun .............................................. 36

METHOD OF NON-FUNCTIONAL REQUIREMENTS BALANCING DURING SERVICE DEVELOPMENT
Larisa Globa, Tatiana Kot, Andrei Reverchuk, Alexander Schill ....................... 50

DONOR LIMITED HOT DECK IMPUTATION: EFFECTS ON PARAMETER ESTIMATION
Dieter William Joenssen, Udo Bankhofer ........................................ 58
Journal of Theoretical and Applied Computer Science
Vol. 6, No. 3, 2012
QCA & CQCA: QUAD COUNTRIES ALGORITHM AND CHAOTIC QUAD COUNTRIES ALGORITHM
Journal of Theoretical and Applied Computer Science
Scientific quarterly of the Polish Academy of Sciences, The Gdańsk Branch, Computer Science Commission
Scientific advisory board:
Chairman:
Prof. Henryk Krawczyk, Corresponding Member of Polish Academy of Sciences,
Gdańsk University of Technology, Poland
Members:
Prof. Michał Białko, Member of Polish Academy of Sciences, Koszalin University of Technology, Poland
Prof. Aurélio Campilho, University of Porto, Portugal
Prof. Ran Canetti, School of Computer Science, Tel Aviv University, Israel
Prof. Gisella Facchinetti, Università del Salento, Italy
Prof. André Gagalowicz, The National Institute for Research in Computer Science and Control (INRIA), France
Prof. Constantin Gaindric, Corresponding Member of Academy of Sciences of Moldova, Institute of Mathematics and Computer
Science, Republic of Moldova
Prof. Georg Gottlob, University of Oxford, United Kingdom
Prof. Edwin R. Hancock, University of York, United Kingdom
Prof. Jan Helmke, Hochschule Wismar, University of Applied Sciences, Technology, Business and Design, Wismar, Germany
Prof. Janusz Kacprzyk, Member of Polish Academy of Sciences, Systems Research Institute, Polish Academy of Sciences, Poland
Prof. Mohamed Kamel, University of Waterloo, Canada
Prof. Marc van Kreveld, Utrecht University, The Netherlands
Prof. Richard J. Lipton, Georgia Institute of Technology, USA
Prof. Jan Madey, University of Warsaw, Poland
Prof. Kirk Pruhs, University of Pittsburgh, USA
Prof. Elisabeth Rakus-Andersson, Blekinge Institute of Technology, Karlskrona, Sweden
Prof. Leszek Rutkowski, Corresponding Member of Polish Academy of Sciences, Czestochowa University of Technology, Poland
Prof. Ali Selamat, Universiti Teknologi Malaysia (UTM), Malaysia
Prof. Stergios Stergiopoulos, University of Toronto, Canada
Prof. Colin Stirling, University of Edinburgh, United Kingdom
Prof. Maciej M. Sysło, University of Wrocław, Poland
Prof. Jan Węglarz, Member of Polish Academy of Sciences, Poznan University of Technology, Poland
Prof. Antoni Wiliński, West Pomeranian University of Technology, Szczecin, Poland
Prof. Michal Zábovský, University of Žilina, Slovakia
Prof. Quan Min Zhu, University of the West of England (UWE), Bristol, United Kingdom
Editorial board:
Editor-in-chief:
Dariusz Frejlichowski, West Pomeranian University of Technology, Szczecin, Poland
Managing editor:
Piotr Czapiewski, West Pomeranian University of Technology, Szczecin, Poland
Section editors:
Michaela Chocholata, University of Economics in Bratislava, Slovakia
Piotr Dziurzański, West Pomeranian University of Technology, Szczecin, Poland
Paweł Forczmański, West Pomeranian University of Technology, Szczecin, Poland
Przemysław Klęsk, West Pomeranian University of Technology, Szczecin, Poland
Radosław Mantiuk, West Pomeranian University of Technology, Szczecin, Poland
Jerzy Pejaś, West Pomeranian University of Technology, Szczecin, Poland
Izabela Rejer, West Pomeranian University of Technology, Szczecin, Poland
ISSN 2299-2634
The on-line edition of JTACS can be found at: http://www.jtacs.org. The printed edition is to be considered the primary one.
Publisher:
Polish Academy of Sciences, The Gdańsk Branch, Computer Science Commission
Journal of Theoretical and Applied Computer Science Vol. 6, No. 3, 2012, pp. 28–35
ISSN 2299-2634 http://www.jtacs.org
SmartMonitor: recent progress in the development of an innovative visual surveillance system
Dariusz Frejlichowski1, Katarzyna Gościewska1,2, Paweł Forczmański1, Adam Nowosielski1, Radosław Hofman2
1 Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Poland
2 Smart Monitor sp. z o.o., Szczecin, Poland
Abstract: This paper describes recent improvements in developing SmartMonitor — an innovative security system based on existing traditional surveillance systems and video content analysis algorithms. The system is being developed to ensure the safety of people and assets within small areas. It is intended to work without the need for user supervision and to be widely customizable to meet an individual's requirements. In this paper, the fundamental characteristics of the system are presented, including a simplified representation of its modules. Methods and algorithms that have been investigated so far, alongside those that could be employed in the future, are described. In order to show the effectiveness of the methods and algorithms described, some experimental results are provided together with a concise explanation.
Keywords: SmartMonitor, visual surveillance system, video content analysis
1. Introduction
Existing monitoring systems usually require supervision by a responsible person whose role it is to observe multiple monitors and report any suspicious behaviour. The existing intelligent surveillance systems that have been built to perform additional video content analysis tend to be very specific, narrowly targeted and expensive. For example, the Bosch IVA 4.0 [1], an advanced surveillance system with VCA functionality, is designed to help operators of CCTV monitoring and is applied primarily for the monitoring of public buildings or larger areas, hence making it unaffordable for personal use. In turn, SmartMonitor is being designed for individual customers and home use, and user interaction will only be necessary during system calibration. SmartMonitor's aim is to satisfy the needs of a large number of people who want to ensure the safety of both themselves and their possessions. It will allow for the monitoring of buildings (e.g. houses, apartments, small enterprises) and their surroundings (e.g. yards, gardens), where only a small number of objects need to be tracked. Moreover, it will utilize only commonly available and inexpensive hardware such as a personal computer and digital cameras. Another intelligent monitoring system, described in [2], analyses human location, motion trajectory and velocity in an attempt to classify the type of behaviour. It requires both the participation of a qualified employee and the preparation of a large database during the learning process. These steps are unnecessary with the SmartMonitor system due to its simple calibration mechanism and feature-based methods. Moreover, a precise calibration can improve the system's effectiveness and allow the system's sensitivity to be adjusted to situations that do not require any system reaction. The customization ability offered by SmartMonitor is very advantageous. In [3], the problem of automatic monitoring systems with object classification was described. It was assumed that the background model used for foreground subtraction does not change with time. This is a crucial limitation given the background variability of real videos. Therefore, and due to the planned system scenarios, the model that best adapts to changes in the scene will be utilized.
SmartMonitor will be able to operate in four independent modes (scenarios) that will provide home/surroundings protection against unauthorized intrusion, allow for the supervision of people who are ill, detect suspicious behaviours and sudden changes in object trajectory and shape, and detect smoke or fire. Each scenario is characterized by a group of performed actions and conditions, such as movement detection, object tracking, object classification, region limitation, object size limitation, object feature change, weather conditions and work time (with artificial lighting required at night). A more detailed explanation of system scenarios and parameters is provided in [4].
The rest of the paper is organised as follows: Section 2 contains the description of the main system modules; the algorithms and methods utilised in each module are briefly described in Section 3; Section 4 contains selected experimental results; and Section 5 concludes the paper.
2. System Modules
SmartMonitor will be composed of six main modules: background modelling, object tracking, artefacts removal, object classification, event detection and system response. Some of these are common to the intelligent surveillance systems that were reviewed in [5]. A simplified representation of these system modules is displayed in Fig. 1.
Figure 1. Simplified representation of system modules
Background modelling detects movement through the use of background subtraction methods. Foreground objects that are larger than a specified size and coherent are extracted as objects of interest (OOI). The second module, object tracking, tracks object locations across consecutive video frames. When multiple objects are tracked, each object is labelled accordingly. Every object moves along a specified path called a trajectory. Trajectories can be compared and analysed in order to detect suspicious behaviours. The third module, artefacts removal, is an important step preceding classification and should be performed correctly. Artefacts such as shadows, reflections or false detection results enlarge the foreground region and usually move with the actual OOI. The fourth module, object classification, will allow for simple classification using object parameters and object templates. The template base will be customizable so that new objects can be added. A more detailed classification will also be possible using more sophisticated methods. The key issue of the fifth module, event detection, is to detect changes in object features. The system will react to both sudden changes (mainly in shape) and a lack of movement. The final module defines how the system responds to detected events. Since the human factor is eliminated, it is important to determine which situations should set off alarms or cause information to be sent to the appropriate services.
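The module chain can be pictured as a linear processing pipeline. The skeleton below is our own illustration of the flow in Fig. 1 (the paper publishes no code); every class and method name is a hypothetical stand-in.

class SmartMonitorPipeline:
    """Hypothetical skeleton of the six-module chain shown in Fig. 1."""

    def process_frame(self, frame):
        mask = self.model_background(frame)          # 1. background modelling
        tracks = self.track_objects(mask)            # 2. object tracking
        tracks = self.remove_artefacts(tracks)       # 3. artefacts removal
        labels = self.classify(tracks)               # 4. object classification
        events = self.detect_events(tracks, labels)  # 5. event detection
        return self.respond(events)                  # 6. system response

    # Stage stubs; candidate algorithms for each stage are discussed in Section 3.
    def model_background(self, frame): ...
    def track_objects(self, mask): ...
    def remove_artefacts(self, tracks): ...
    def classify(self, tracks): ...
    def detect_events(self, tracks, labels): ...
    def respond(self, events): ...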
3. Employed Methods and Algorithms
For each module we investigated the existing approaches and modified them to apply the best solution for the system. Below we present a brief description and explanation of this.

Background modelling includes models that utilize static background images [3], background images averaged in time [6] and background images built adaptively, e.g. using Gaussian Mixture Models (GMM) [7, 8]. Since the backgrounds of real videos tend to be extremely variable in time, we decided to use a model based on GMM. This builds a per-pixel background image that is updated with every frame, but it is also sensitive to sudden changes in lighting, which can cause false detections, mainly by shadows. It was stated in [9] that shadows affect only the image brightness and not the hue. By comparing foreground images constructed using both the Y component of the YIQ colour scheme and the H component of the HSV colour scheme, it is possible to exclude false detections that are caused by shadows. Following this, morphological operations are applied to the resulting binary mask. Erosion allows for the elimination of small objects composed of one or a few pixels (such as noise) and the reduction of the region. Later, the dilation process fills in the gaps.
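A minimal sketch of this chain, assuming OpenCV: MOG2's built-in shadow detection stands in here for the paper's Y/H channel comparison, which OpenCV does not provide directly; the kernel size and history length are illustrative assumptions.

import cv2
import numpy as np

# GMM background subtraction with shadow suppression and morphological
# clean-up. The MOG2 shadow flag replaces the Y/H comparison described above.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)
kernel = np.ones((3, 3), np.uint8)

def extract_foreground(frame):
    mask = subtractor.apply(frame)                 # per-pixel GMM, updated each frame
    mask[mask == 127] = 0                          # MOG2 marks shadow pixels as 127
    mask = cv2.erode(mask, kernel)                 # remove one-or-few-pixel noise
    mask = cv2.dilate(mask, kernel, iterations=2)  # fill gaps in the objects
    return mask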
For the object tracking stage we investigated three possible implementations, namely the Kalman filter [10] and the Mean Shift and Camshift [11, 12] algorithms. The Mean Shift algorithm is simple and appearance-based. It requires one or more features, such as colour or edge data, to be selected for tracking purposes. This can cause several problems with object localization when particular features change. The Camshift algorithm is a version of the Mean Shift algorithm that continuously adapts to the variable size of tracked objects. Unfortunately, this solution is not optimal since it increases the number of computations. Moreover, both methods are effective only when certain assumptions are met, such as that tracked objects differ from the background (e.g. through variations in colour). The Kalman filter algorithm was therefore selected to overcome these drawbacks. It constitutes a set of mathematical equations that define a predictor-corrector type estimator. The main task is to estimate future values in two steps: prediction based on known values, and correction based on new measurements. It is assumed that objects can move uniformly and in any direction but will not change direction suddenly and unpredictably.
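The predictor-corrector cycle can be written in a few lines. The sketch below is a generic constant-velocity Kalman filter for 2D positions, not the project's implementation; the noise matrices are illustrative assumptions.

import numpy as np

# State: [x, y, vx, vy]; uniform-motion model with position-only measurements.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # state transition (uniform motion)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # we measure position only
Q = 0.01 * np.eye(4)                         # process noise (assumed)
R = np.eye(2)                                # measurement noise (assumed)
x, P = np.zeros(4), np.eye(4)                # initial state and covariance

def kalman_step(z):
    """One predict-correct cycle; z is the measured (x, y) object position."""
    global x, P
    x = F @ x                                      # prediction from known values
    P = F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    x = x + K @ (z - H @ x)                        # correction with the new measurement
    P = (np.eye(4) - K @ H) @ P
    return x[:2]                                   # filtered position estimate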
After tracking, the objects are classified (labelled) as either human or not human. A boosted cascade of Haar-like features [13] connected using the AdaBoost algorithm [14] can be utilized. However, at this stage, we replaced the AdaBoost classification with a simpler one. Objects can now be classified using their binary masks and the threshold values of two of their properties: area size and minimum bounding rectangle aspect ratio.
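In code, this reduces to two threshold tests. The values below are illustrative assumptions, not the tuned thresholds used by the system.

def is_human(mask_area, bbox_width, bbox_height,
             min_area=1500, aspect_range=(0.2, 0.6)):
    """Classify a blob as human from its binary-mask area and the
    width/height aspect ratio of its minimum bounding rectangle."""
    aspect = bbox_width / bbox_height
    return mask_area >= min_area and aspect_range[0] <= aspect <= aspect_range[1]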
A specific and detailed classification can be performed using a Histogram of Oriented Gradients (HOG) [15]. A HOG descriptor localises and extracts objects from static scenes through the use of specified patterns. Despite its high computational complexity, the HOG algorithm can be applied to a system under several conditions, such as those with limited regions or time intervals.
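A sketch of this template matching, assuming scikit-image's HOG implementation: the frame is scanned horizontally and each window's descriptor is compared with the template's by Euclidean distance, mirroring the depth maps of Figs. 3 and 4 (zero distance = identical region). The parameter values are assumptions.

import numpy as np
from skimage.feature import hog

def scan_row(frame_gray, template_gray, y, step=8):
    """Slide the template across one row of the frame and return the
    Euclidean distances between HOG vectors (smaller = more similar)."""
    h, w = template_gray.shape
    ref = hog(template_gray, orientations=9,
              pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    distances = []
    for x in range(0, frame_gray.shape[1] - w, step):
        window = frame_gray[y:y + h, x:x + w]
        vec = hog(window, orientations=9,
                  pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        distances.append(np.linalg.norm(vec - ref))
    return distances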
4. Experimental Conditions and Results
In this section we present some experimental results from employing the algorithms for object localization, extraction and tracking that have given the best results so far. In order to ensure the experiments were performed under realistic conditions, a set of test video sequences corresponding to certain system scenarios was prepared. These include scenes recorded both inside and outside buildings, with different types of moving objects. A database also had to be created due to the lack of free, universal video databases that matched the planned scenarios.
The results of employing both the GMM algorithm and the methods for removing false objects are presented in Fig. 2. The first row contains the sample frame and background images for the Y and H components. The second row shows the respective foreground images for the Y and H components alongside the foreground object's binary mask after false objects removal. It is noticeable that the foregrounds constructed using the different colour components strongly differ and that, by subtracting one image from the other, we can eliminate false detections.
Figure 2. Results of employing the GMM algorithm and false objects removal methods
Specific objects can be localised and extracted using the HOG descriptor. This detects objects using predefined patterns and extracted feature vectors. Below we present the results of the experiments utilizing the HOG descriptor. The first experiment was performed using a fixed template size and two sample frames; the second one utilized various template sizes and one sample frame.
The results of the first experiment are pictured in Fig. 3. The figure contains a sample frame with a chosen template (left column) and two frames (middle column) from the same video sequence which were scanned horizontally in an attempt to identify the matching regions. The depth maps (right column) show the results of the HOG algorithm — the darker the colour, the more similar the region is. Black regions indicate a Euclidean distance of zero between two feature vectors.
Figure 3. Results of the experiment utilizing the HOG descriptor with a fixed template size
In the next experiment, devoted to an investigation of the HOG descriptor, various template sizes were tested. The left column of Fig. 4 presents a frame with a chosen template marked by a white rectangle, the central column contains a frame that was scanned horizontally using two different template sizes (dark rectangles in the top left corners define the size of the rescaled template) and the right column provides the respective results of the HOG algorithm. Clearly, the closer the template size is to the object size, the more accurate the depth map is.
Figure 4. Results of the experiment utilizing the HOG descriptor with a variable template size
As mentioned in the previous section, we investigated three tracking methods. The first one, the Mean Shift algorithm, uses part of an image to create a fixed template model. In this case we converted images to the HSV colour scheme. Fig. 5 presents three sample frames from the tracking process (first row) and their corresponding binary masks (second row). The white masked regions indicate those regions that are similar to the template, the dark rectangle determines the template and the light points within the rectangle create the object's trajectory.
Figure 5. Results of the experiment utilizing the Mean Shift algorithm

Camshift was the second tracking method investigated. This uses the HSV colour scheme and a variable template model. The first row in Fig. 6 presents sample frames from the tracking process: the starting frame with the chosen template, the central frame with an enlarged template and the finishing frame where the moving object leaves the scene. The second row in Fig. 6 shows the corresponding binary masks for each frame. Both tracking methods, thanks to their local application, were effective despite the presence of many regions similar to the template.
Figure 6. Results of the experiment utilizing the Camshift algorithm
Fig. 7 shows the result of employing the third algorithm, the Kalman filter, to track a person walking in a garden. Light asterisks mark object positions estimated using a moving object detection algorithm and dark circles mark positions predicted by the Kalman filter.

Figure 7. Results of the experiment utilizing the Kalman filter

5. Summary and Conclusions
In this paper, recent results achieved during the development of the SmartMonitor system were described. We provided basic information about the system's characteristics, properties and modules. The investigated methods and algorithms were briefly described, and selected experimental results of utilizing various solutions were presented.

SmartMonitor will be an innovative surveillance system based on video content analysis and targeted at individual customers. It will operate in four independent modes which are fully customizable (and will also be combinable into custom modes). This allows individual safety rules to be set based on different degrees of system sensitivity. Moreover, SmartMonitor will utilize only commonly available hardware. It will almost eliminate human involvement, which will be required only for the calibration process. Our system will analyse a small number of moving objects over a limited region, which could additionally improve its effectiveness.
Currently, there are no similar systems on the market. Modern surveillance systems are usually expensive, specific and need to be operated by a qualified employee. SmartMonitor will eliminate these factors by offering less expensive software, making it more affordable for personal use and requiring less effort to use.
Acknowledgements
The project Innovative security system based on image analysis — SmartMonitor prototype construction (original title: Budowa prototypu innowacyjnego systemu bezpieczeństwa opartego o analizę obrazu — SmartMonitor) is co-financed by the European Union (project number PL: UDA-POIG.01.04.00-32-008/10-01, value: 9,996,604 PLN, EU contribution: 5,848,800 PLN, realization period: 07.2011–04.2013). European Funds — for the development of innovative economy (Fundusze Europejskie — dla rozwoju innowacyjnej gospodarki).
References
[1] Bosch IVA 4.0: Commercial Brochure enUS 1558886539.pdf
[2] Robertson N., Reid I.: A general method for human activity recognition in video. Computer Vision and Image Understanding 104, 232–248 (2006)
[3] Gurwicz Y., Yehezkel R., Lachover B.: Multiclass object classification for real-time video surveillance systems. Pattern Recognition Letters 32, 805–815 (2011)
[4] Frejlichowski D., Forczmański P., Nowosielski A., Gościewska K., Hofman R.: SmartMonitor: An Approach to Simple, Intelligent and Affordable Visual Surveillance System. In: Bolc, L. et al. (eds.) ICCVG 2012. LNCS, vol. 7594, pp. 726–734. Springer, Heidelberg (2012)
[5] Forczmański P., Frejlichowski D., Nowosielski A., Hofman R.: Current trends in the development of intelligent visual monitoring systems (in Polish). Methods of Applied Computer Science 4/2011(29), 19–32 (2011)
[6] Frejlichowski D.: Automatic Localisation of Moving Vehicles in Image Sequences Using Morphological Operations. 1st IEEE International Conference on Information Technology, 439–442 (2008)
[7] Stauffer C., Grimson W. E. L.: Adaptive background mixture models for real-time tracking. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2, 246–252 (1999)
[8] Zivkovic Z.: Improved adaptive Gaussian mixture model for background subtraction. Proceedings of the 17th International Conference on Pattern Recognition 2, 28–31 (2004)
[9] Forczmański P., Seweryn M.: Surveillance Video Stream Analysis Using Adaptive Background Model and Object Recognition. In: Bolc, L. et al. (eds.) ICCVG 2010, Part I. LNCS, vol. 6374, pp. 114–121. Springer, Heidelberg (2010)
[10] Welch G., Bishop G.: An Introduction to the Kalman Filter. UNC-Chapel Hill, TR 95-041 (24 July 2006)
[11] Cheng Y.: Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995)
[12] Comaniciu D., Meer P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
[13] Viola P., Jones M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1, 511–518 (2001)
[14] Avidan S.: Ensemble Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 261–271 (2007)
[15] Dalal N., Triggs B.: Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1, 886–893 (2005)
Journal of Theoretical and Applied Computer Science Vol. 6, No. 3, 2012, pp. 36-49
ISSN 2299-2634 http://www.jtacs.org
Nonlinearity of human multi-criteria in decision-making
Andrzej Piegat, Wojciech Sałabun
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology,
Szczecin, Poland
{apiegat, wsalabun}@wi.zut.edu.pl
Abstract: In most cases, known methods of multi-criteria decision-making are used to make a linear aggregation of human preferences. The authors of these methods seem not to take into account the fact that linear functional dependences occur rather rarely in real systems. Linear functions also imply a global character of multi-criteria. This paper shows several examples of human nonlinear multi-criteria that are purely local. In these examples, a nonlinear approach based on fuzzy logic is used. It allows for a better understanding of how important the nonlinear aggregation of human multi-criteria is. The paper also contains a proposal of an indicator of the nonlinearity degree of criteria. The presented results are based on investigations and experiments carried out by the authors.
1. Introduction
On a daily basis and in professional life we frequently have to make decisions. We then use criteria that depend on our individual preferences or, in the case of group decisions, on the preferences of the group. Further on, criteria representing the preferences of a single person will be called individual criteria and criteria representing a group will be called group-criteria. Group-criteria can be achieved by aggregation of individual ones. Therefore, further on, the nonlinearity problem of criteria will be analyzed on examples of individual criteria, because the properties of individual criteria are transferred to the group ones. Individual human multi-criteria are "programmed" in our brains, and special methods for their elicitation and mathematical formulation are necessary. The multi-criteria (M-Cr for short) of different persons are more or less different, and therefore it would not be reasonable to assume one and the same type of mathematical formula for a certain criterion representing thousands of different people, e.g. for the individual criterion of car attractiveness. However, the most frequently used criterion type is the linear M-Cr of form (1):
K = w_1 K_1 + w_2 K_2 + … + w_n K_n, (1)

where: w_i – the weight coefficients of the particular component criteria, with \sum_{i=1}^{n} w_i = 1; K_i – the component criteria aggregated by the M-Cr, i = 1, …, n. They are mostly used, also in this paper, in a form normalized to the interval [0,1]. The linear criterion function is represented in 2D space by a straight line, in 3D space by a plane (Fig. 1), and in nD space by a hyperplane.
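As a quick numeric illustration of (1), with arbitrarily chosen weights and criterion values (our example, not the paper's):

% Hypothetical example: n = 3, w = (0.5, 0.3, 0.2), K_1 = 0.8, K_2 = 0.4, K_3 = 0.6
K = 0.5 \cdot 0.8 + 0.3 \cdot 0.4 + 0.2 \cdot 0.6 = 0.40 + 0.12 + 0.12 = 0.64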
Figure 1. A linear criterion function in 2D space (Fig. 1a) and in 3D space (Fig. 1b)
Let us notice that in the linear criterion function K, the particular component criteria K_i influence the superior criterion in a mutually independent and uncorrelated way. Apart from this, the influence strength of the particular component criteria K_i has a global, constant and unchanging character over the full criterion domain. Both features above are great disadvantages of the linear M-Cr, because human M-Cr are in most cases nonlinear, and the significance of a component criterion K_i is not constant, is not independent of the other criteria, and varies in particular local sub-domains of the global M-Cr. Unfortunately, linear multi-criteria are used in many world-known methods of multi-criteria decision analysis. The following examples illustrate the above statement: the SAW method (Simple Additive Weighting) [4,15], the well-known and widely used AHP method of Saaty (the Analytic Hierarchy Process) [11,15,18], and the ANP method (Analytic Network Process) [12,13]. Other known M-Cr methods, such as TOPSIS [15,16], ELECTRE [2] and PROMETHEE [1,2], are not strictly linear ones. However, they assume global weight coefficients w_i, constant for the full M-Cr domain, and in certain steps of their algorithms they also use the linear, weighted aggregation of alternatives. The next part will present the simplest examples of nonlinear criterion functions in 2D space.
2. Nonlinear human criterion-functions in 2D-space
An example of a very simple human nonlinear criterion-function is the dependence between the coffee taste (CT), CT ∈ [0,1], and the sugar quantity S, S ∈ [0,5], expressed in the number of sugar spoons (Fig. 2). The coffee taste represents an inner human preference.
The criterion function of the coffee taste can be identified by interviewing a given person or, more exactly, experimentally, by giving the person coffees with different amounts of sugar and asking him/her to evaluate the coffee taste or to compare the tastes of pairs of coffees with different amounts of sugar. The achieved taste evaluations can be processed with the various M-Cr methods previously cited or with the method of characteristic objects proposed by one of the paper's authors. However, even without scientific investigations it is easy to understand that the criterion function shown in Fig. 2 is qualitatively correct. This function represents the preferences of the author AP. He does not like coffee with too great an amount of sugar (more than 3 coffee spoons) and evaluates its taste as CT ≈ 0. The taste of coffee without sugar (S = 0) he also evaluates as a poor one. He feels the best taste when a cup of coffee contains 2 spoons of sugar (S_opt = 2). For other persons the optimal sugar amount will be different. Thus, this criterion function is not an "objective" (whatever that would mean) function of all people in the world but an individual criterion function of the author AP. It is very important to differentiate between individual criteria and group-criteria, which represent a small or greater group of people. Similar in character to the function in Fig. 2 is another one-component human criterion function: e.g. the dependence of text-reading easiness on the light intensity.
Figure 2. Criterion function representing the dependence of the coffee taste CT on the number of sugar spoons S (felt by an individual person, the paper author AP)
3. Nonlinear, human, multi-criterion function in 3D-space and a method of its identification
Already in the 1960s and 1970s, the American scientists D. Kahneman (Nobel prize winner of 2002) and A. Tversky drew the attention of the scientific community to the nonlinearity of human multi-criteria [5] with their investigation results on human decisions based on an M-Cr. In their experiment, several component criteria were aggregated: the value of a possible profit, the probability of that profit value, the value of a possible loss, and the probability of that loss value. Further on, a similar but simplified problem will be presented: the evaluation of the individual play acceptability degree K in dependence on a possible winnings value K1 [$] and a possible loss value K2 [$]. Both values are not great. The interviewed person has to make decisions in the problem described below.

Among the 25 plays shown in Table 1, with different winnings K1 [$] and losses K2 [$] (if you don't win, you will have to pay a sum equal to the loss K2), first find all plays (K1, K2) which are certainly not accepted by you (K = 0), and next all plays which are certainly accepted by you (K = 1). For the rest of the plays determine a rank with the method of pair tournament (pair comparisons). The probabilities of winning and losing are the same and equal to 0.5.
Table 1 gives the values of possible winnings and losses (K1, K2) in the particular plays. It also shows for which plays the author AP declares full acceptance (full readiness to take up the game), which means K = 1, and which plays he does not accept at all (zero readiness to take up the game), which means K = 0. The acceptability degree plays the role of the multi-criterion in the shown decision problem.

The acceptability degree of the plays marked with a question mark will be determined with the tournament-rank method. The investigated person chooses from each play-pair the more acceptable play (inserting the value 1 in the table for this play), which means a win. If the person is not able to decide which of the two plays is better, then he/she inserts the value 0.5 for both plays of the pair, which means a draw.

Summarized scores from Table 2 are shown in Table 3 for the particular plays (K1, K2).
Table 1. Winnings K1 [$] and losses K2 [$] in the particular 25 plays and the first decisions of the interviewed person: determining the unacceptable plays (acceptance degree K = 0) and the fully acceptable plays (K = 1) which would certainly be played by the person. Plays with question marks are plays of a partial (fractional) acceptance that is to be determined.

The value of losses K2 [$] \ The value of winnings K1 [$]
         0.0   2.5   5.0   7.5   10.0
 0.0      0     1     1     1     1
 2.5      0     0     ?     ?     ?
 5.0      0     0     0     ?     ?
 7.5      0     0     0     0     ?
10.0      0     0     0     0     0
Table 2. Tournament results of the particular play-pairs. The value 1 means the win of a play, the value 0.5 means a draw. A single play is marked by (K1, K2).
Table 3. Scores of the particular plays (K1, K2) and rank places assigned to the particular plays with a fractional acceptance degree K (multi-criterion) of the investigated person

Play (K1, K2):   (10.0, 2.5)  (10.0, 5.0)  (7.5, 2.5)  (10.0, 7.5)  (5.0, 2.5)  (7.5, 5.0)
Score(K1, K2):        5           3.5          3.5          1            1           1
Rank(K1, K2):         I           II           II          III          III         III
Analysis of Table 3 shows that in the end we have 3 play types with differentiated values of the multi-criterion K. Apart from the 6 plays with fractional acceptance given in Table 3, we also have 15 plays with zero acceptability (K = 0) and 4 plays with full acceptability (K = 1), see Table 1. Applying the indifference principle of Laplace [2], we can assume that the full difference of the acceptance value relating to the plays from Table 3, Kmax − Kmin = 1 − 0 = 1, should be partitioned into 4 equal differences ΔK = 1/4. The plays (5, 2.5), (7.5, 5) and (10, 7.5) achieve the M-Cr value K = 1/4 (the third place in the rank). The plays (7.5, 2.5) and (10, 5) achieve K = 2/4 (the second place in the rank). The play (10, 2.5) achieves K = 3/4 (the first place in the rank of fractional acceptability of plays). The resulting values of the M-Cr K determined for the particular plays with the tournament-rank method are given in Table 4.
Table 4. Resulting values of the multi-criterion K = f(K1, K2), which represents the acceptability degree of the particular plays (K1, K2) for the investigated person

The value of losses K2 [$] \ The value of winnings K1 [$]
         0.0   2.5   5.0   7.5   10.0
 0.0      0     1     1     1     1
 2.5      0     0    1/4   2/4   3/4
 5.0      0     0     0    1/4   2/4
 7.5      0     0     0     0    1/4
10.0      0     0     0     0     0
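The step from the tournament scores (Table 3) to the acceptability degrees (Table 4) can be reproduced in a few lines. This is our illustrative sketch of the described procedure, not the authors' code; the play scores are copied from Table 3.

# Tournament scores from Table 3; plays with equal scores share a rank, and
# the interval [0, 1] between the K=0 and K=1 anchor plays is split into
# equal steps following Laplace's indifference principle.
scores = {(10.0, 2.5): 5.0, (10.0, 5.0): 3.5, (7.5, 2.5): 3.5,
          (10.0, 7.5): 1.0, (5.0, 2.5): 1.0, (7.5, 5.0): 1.0}

distinct = sorted(set(scores.values()), reverse=True)  # one rank per distinct score
steps = len(distinct) + 1                              # here: 3 ranks -> 4 steps of 1/4
K = {play: (len(distinct) - distinct.index(s)) / steps
     for play, s in scores.items()}
# K[(10.0, 2.5)] == 3/4, K[(7.5, 2.5)] == 2/4, K[(5.0, 2.5)] == 1/4, as in Table 4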
On the basis of Table 4, a visualization of the investigated multi-criterion K of the play acceptability degree can be realized, see Figs. 3 and 4.
Figure 3. Visualization of the 25 analyzed plays (K1, K2) as 25 characteristic objects regularly placed in the decisional domain K1 × K2 of the problem
Each of the 25 characteristic plays (decisional objects) can be interpreted as a crisp rule, e.g.:

IF (K1 = 7.5) AND (K2 = 5) THEN (K = 1/4) (2)

However, if K1 is not exactly equal to 7.5 and K2 is not exactly equal to 5.0, then rule (2) can be transformed into a fuzzy rule (3) based on the Modus Ponens tautology [8, 9]:

IF (K1 close to 7.5) AND (K2 close to 5.0) THEN (K close to 1/4) (3)
In this way 25 fuzzy rules of type (4) were achieved, one on the basis of each characteristic object (play) given in Table 4. The rules enable calculating the values of the nonlinear multi-criterion K for any values of the component criteria K1i and K2j, i, j = 1, …, 5:

IF (K1 close to K1i) AND (K2 close to K2j) THEN (K close to Kij) (4)

The complete rule base is given in Table 4. To enable calculation of the fuzzy M-Cr function K, it is necessary to define the membership functions µK1i (close to K1i), µK2j (close to K2j) and µKij (close to Kij). These functions are shown in Fig. 4.
Figure 4. Membership functions µK1i (close to K1i), µK2j (close to K2j) of the component criteria and µKij (close to Kij) of the aggregating multi-criterion K
On the basis of the rule base (Table 4) and of the membership functions from Fig. 4, it is easy to visualize the function surface K = f(K1, K2) of the individual multi-criterion of play acceptance. As a visualization tool one can use the fuzzy logic toolbox of MATLAB or one's own knowledge about fuzzy modeling [8, 9]. The functional surface is shown in Fig. 5.
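A minimal sketch of such an inference, assuming triangular "close to" memberships centred on the five grid values and a weighted average of the rule consequents (a common defuzzification choice; the paper does not fix one):

import numpy as np

levels = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
# Rule consequents from Table 4; rows index K2 (losses), columns K1 (winnings).
K_table = np.array([[0, 1,    1,    1,    1   ],
                    [0, 0, 0.25, 0.50, 0.75],
                    [0, 0,    0, 0.25, 0.50],
                    [0, 0,    0,    0, 0.25],
                    [0, 0,    0,    0,    0]])

def close_to(x, c, width=2.5):
    """Triangular membership of 'x close to c' with assumed support width 2.5."""
    return max(0.0, 1.0 - abs(x - c) / width)

def K(k1, k2):
    """Evaluate the 25-rule base: AND as product, weighted-average aggregation."""
    num = den = 0.0
    for i, c2 in enumerate(levels):
        for j, c1 in enumerate(levels):
            w = close_to(k1, c1) * close_to(k2, c2)
            num += w * K_table[i, j]
            den += w
    return num / den

print(K(7.5, 5.0))  # 0.25: reproduces rule (2)/(3) at a grid point
print(K(8.0, 4.0))  # 0.40: interpolated acceptability between grid points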
As Fig. 5 shows, the functional surface of the human multi-criterion K = f(K1, K2) is strongly nonlinear. This surface represents the M-Cr of one person. However, in the case of other persons the surfaces of this multi-criterion are qualitatively very similar (an investigation was carried out on approximately 100 students of the Faculty of Computer Science of the West Pomeranian University of Technology in Szczecin and of the Faculty of Management and Economics of the University of Szczecin). Quantitative differences of the multi-criterion K between the particular investigated persons were mostly not considerable. All identified surfaces were strongly nonlinear.
The second co-author of the paper, WS, used the method of characteristic objects in an investigation of the attractiveness degree of colors. In the experiment two attributes occur:
• the degree of brightness of green (G for short),
• the degree of brightness of blue (B for short).
Figure 5. Functional surface of the individual multi-criterion K = f(K1, K2) of the play acceptability with possible winnings K1 [$] and losses K2 [$]; the probabilities of winning and losing are identical and equal to 0.5. This particular surface represents the author AP of the paper.
The degree of red was fixed at a constant brightness level of 50%. The brightness level of each component was normalized to the range [0,1]. The first step was to define linguistic values for the G and B components, presented in Figs. 6 and 7.
Figure 6. Definitions of linguistic values for the component G
Figure 7. Definitions of linguistic values for the component B
Membership functions presented in Fig. 6 are described by formula (5):

µ_L(G) = (0.5 − G)/0.5, µ_ML(G) = (G − 0)/0.5, µ_MR(G) = (1 − G)/0.5, µ_H(G) = (G − 0.5)/0.5, (5)

where: L – low, ML – medium left, MR – medium right, H – high, G – the level of brightness of green.
Membership functions presented in Fig. 7 are described by formula (6):

µ_L(B) = (0.5 − B)/0.5, µ_ML(B) = (B − 0)/0.5, µ_MR(B) = (1 − B)/0.5, µ_H(B) = (B − 0.5)/0.5, (6)

where: L – low, ML – medium left, MR – medium right, H – high, B – the level of brightness of blue.
The linguistic values of the attributes generate 9 characteristic objects. Their distribution in the problem space is presented in Fig. 8.
Figure 8. Characteristic objects Ri in the space of the problem
The attribute values of the characteristic objects Ri, their names and colors are given in Table 5.
Table 5. Complex colors and their rules
Rule [R, G, B] Color
R1 [0.5, 0.0, 0.0]
R2 [0.5, 0.0, 0.5]
R3 [0.5, 0.0, 1.0]
R4 [0.5, 0.5, 0.0]
R5 [0.5, 0.5, 0.5]
R6 [0.5, 0.5, 1.0]
R7 [0.5, 1.0, 0.0]
R8 [0.5, 1.0, 0.5]
R9 [0.5, 1.0, 1.0]
The interviewed person has to make the decisions described below.

In the survey, please indicate which color of each pair of colors is more attractive (please mark this color with an X). If both colors have a similar or identical level of attractiveness, please mark a draw. The attractiveness of a color tells you which color of the pair you prefer more.
Evaluation of the characteristic objects is determined with the tournament-rank method. If one color of a pair is preferred, then this color receives 1 point and the second color receives 0 points. If the interviewed person marks a draw, both colors receive 0.5 point. Next, all the points assigned to each object are added up. On the basis of the sums, the ranking of the objects is established. Applying the indifference principle of Laplace, we can assume that the full difference value K_max − K_min = 1 − 0 = 1 should be partitioned into m − 1 equal differences (K_max − K_min)/(m − 1), where m is the number of places in the ranking. Experimental identification of the surfaces of the multi-criterion showed that, for all interviewed people, these surfaces were strongly nonlinear. Fig. 9 shows the multi-criterion surface for a randomly chosen person.

For comparison, Fig. 10 shows the multi-criterion surface for the co-author WS of the article.
Figure 9. Functional surface of the individual multi-criterion of the resulting color attractiveness achieved by mixing 2 component colors with different proportion rates

Figure 10. Functional surface of the individual multi-criterion of attractiveness of the resulting color achieved by mixing 2 component colors with different proportion rates (WS)
The realized investigation also showed that the functional surfaces of the multi-criterion of all persons were strongly nonlinear. Fig. 9 presents the functional M-Cr surface of one of the persons taking part in the investigation. For the other interviewed people, these M-Cr surfaces were also highly nonlinear. (Identification of the M-Cr surfaces has been performed for a group of 307 selected people.)
4. Nonlinearity indicator of the functional surface of a multi-criterion
In the case of the 2-component multi-criterion K = f(K1, K2) there exists the possibility of visualizing the functional surface of the M-Cr and of an approximate, visual evaluation of its nonlinearity degree or, at least, of evaluating whether the surface is linear or nonlinear. However, in the case of higher-dimensional multi-criteria K = f(K1, K2, …, Kn), visualization and visual evaluation of nonlinearity become more and more difficult, though they can be realized e.g. with the method of lower-dimension cuts [7]. Therefore it would be very useful to construct a quantitative indicator of nonlinearity N-IndK of a model of the multi-criterion K. First, for a better understanding of the problem, let us analyze the simplest criterion model K = f(K1), the criterion of the lowest dimension, identified with the method of characteristic objects (Ch-Ob method). Let us assume that after the realized investigations we have at our disposal m objects, each of them described by a pair (K1, K) of coordinate values, which can be interpreted as a measurement sample usable for the identification of a functional dependence. Let us assume that the characteristic objects are distributed in the coordinate-system space as shown in Fig. 11a.
Figure 11. An example placement of characteristic objects (K1i, Ki), i = 1, …, m, in the space K1 × K (Fig. 11a), and a nonlinear, fuzzy model approximating the characteristic objects (Fig. 11b)
The nonlinearity of the fuzzy model approximating the criterion function K = f(K1) will be the smaller, the smaller the sum of the differences (Ki − KLi) between corresponding points lying on the fuzzy and on the linear approximation of the criterion function. Information about this sum is delivered by the proposed nonlinearity indicator N-IndK, formula (7).
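Formula (7) itself falls outside this excerpt, so the following sketch only illustrates the idea described above: compare the fuzzy model's values at the characteristic objects with a least-squares linear fit and aggregate the differences. The function name and the normalization are our assumptions, not the authors' definition.

import numpy as np

def nonlinearity_indicator(k1, k_fuzzy):
    """Illustrative N-Ind-style measure: mean absolute difference between the
    fuzzy model values K_i and a linear fit K_L at the characteristic objects,
    scaled by the criterion range (the normalization is an assumption)."""
    slope, intercept = np.polyfit(k1, k_fuzzy, deg=1)  # linear approximation K_L
    k_linear = slope * k1 + intercept
    return np.mean(np.abs(k_fuzzy - k_linear)) / (k_fuzzy.max() - k_fuzzy.min())

k1 = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
print(nonlinearity_indicator(k1, 0.1 * k1))                 # ~0 for a linear criterion
print(nonlinearity_indicator(k1, np.sin(np.pi * k1 / 10)))  # > 0 for a nonlinear one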
[6] Kahneman D., Tversky A.: Choices, values and frames. Cambridge University Press, Cambridge, New York, 2000.
[7] Lu J. et al.: Multi-objective group decision-making. Imperial College Press, London, Singapore, 2007.
[8] Piegat A.: Materials for the lecture "Methods of Artificial Intelligence". Faculty of Computer Science, West Pomeranian University of Technology, Szczecin, Poland, unpublished.
[9] Piegat A.: Fuzzy modeling and control. Springer-Verlag, Heidelberg, New York, 2001.
[10] Rao C.R.: Linear Models: Least Squares and Alternatives. Springer Series in Statistics, 1999.
[11] Rutkowski L.: Metody i techniki sztucznej inteligencji (Methods and techniques of artificial intelligence).
[12] Saaty T.L.: How to make a decision: the analytic hierarchy process. European Journal of Operational Research, vol. 48, no. 1, pp. 9–26, 1990.
[13] Saaty T.L.: Decision making with dependence and feedback: the analytic network process. RWS Publications, Pittsburgh, Pennsylvania, 1996.
[14] Saaty T.L., Brady C.: The encyclicon, volume 2: a dictionary of complex decisions using the analytic network process. RWS Publications, Pittsburgh, Pennsylvania, 2009.
[15] Stadnicki J.: Teoria i praktyka rozwiązywania zadań optymalizacji (Theory and practice of solving optimization problems). Wydawnictwo Naukowo-Techniczne, Warszawa, 2006.
[16] Zarghami M., Szidarovszky F.: Multicriteria analysis. Springer, Heidelberg, New York.
Donor limited hot deck imputation: effects on parameter estimation
Dieter William Joenssen, Udo Bankhofer

Abstract: Methods for dealing with missing data in the context of large surveys or data mining projects are limited by the computational complexity that they may exhibit. Hot deck imputation methods are computationally simple, yet effective for creating complete data sets from which correct inferences may be drawn. All hot deck methods draw values for the imputation of missing values from the data matrix that will later be analyzed. The object from which these available values are taken for imputation within another is called the donor. This duplication of values may lead to the problem that using any donor "too often" will induce incorrect estimates. To mitigate this dilemma, some hot deck methods limit the number of times any one donor may be selected. This study answers which conditions influence whether or not any such limitation is sensible for six different hot deck methods. In addition, five factors that influence the strength of any such advantage are identified, and possibilities for further research are discussed.
Keywords: hot deck imputation, missing data, non-response, imputation, simulation
1. Introduction
Dealing with missing observations when estimating parameters or extracting information from empirical data remains a challenge for scientists and practitioners alike. Failures in either manual or automated data collection or editing, such as aggregating information from different sources [18] or outlier removal [22], cause missing observations. Some missing data may be resolved through manual or automatic logical inference, when values may be inferred directly from existing data (e.g. a missing passport number when the respondent has no passport, or missing age when the date of birth is known). If missing data cannot be resolved in this way (e.g. due to cost restraints or lack of domain knowledge), it must be compensated for in light of the missingness mechanism.

Rubin [25] first treated missing data indicators as random variables. Based on the indicators' distribution, he defined three basic mechanisms, MCAR, MAR, and NMAR, that govern which missing data methods are appropriate. With MCAR (missing completely at random), missingness is independent of any data values, missing or observed. Thus under MCAR, the observed values represent a subsample of the intended sample. Under MAR (missing at random), whether or not data is missing depends on some observed data's values. A MAR mechanism would be present if response rates for an item differ between two groups of respondents, e.g. survey respondents with a higher education level are less likely to answer a question on income than respondents exhibiting a lower education level. Finally, under NMAR (not missing at random), the presence of missing data depends on the variable's values, which is itself subject to missingness. NMAR missingness is present when, for example, data is less likely to be transmitted by a temperature sensor if the temperature rises above a certain threshold.
With missingness present, conventional methods cannot simply be applied to the data. Explicit provisions must be made before or within the analysis, and the provisions chosen to deal with the missing data must be based on the identified missingness mechanism. Principally, two strategies to deal with missing data in the data mining or large-survey context are appropriate: elimination and imputation. Elimination procedures remove objects or attributes with missingness from the analysis. These only lead to a data set from which accurate inferences may be made if the missingness mechanism is MCAR, and correctly identified as such. But even if the mechanism is MCAR, eliminating records with missing values denotes an inferior strategy, especially when many records need to be eliminated due to unfavorable missingness patterns or data collection schemes (e.g. asynchronous sampling). Imputation methods replace missing values with estimates ([17], [1]) and can be suitable under the less stringent assumptions of MAR. Some techniques can even lead to correct inferences under the non-ignorable NMAR mechanism ([3], [19]). Replacing missing values with reasonable ones not only assures that all information gathered can be used, but also broadens the spectrum of available analyses. Imputation methods differ in how they define these reasonable values. The simplest imputation techniques, and so far the state of the art for data mining [18], replace missing values with eligible location parameters. Beyond that, multivariate methods, such as regression or classification methods, may be used to identify imputation values. The interested reader may find a more complete description of missingness mechanisms and methods for dealing with missing data in [3], [19], or [14].
A category of imputation techniques appropriate for imputation in the context of mining large amounts of data and large surveys, due to its computational simplicity (cf. [22], [14], [20]), is hot deck imputation. Ford [11] defines a hot deck procedure as one where missing items are replaced by using values from one or more similar records within the same classification group. Partitioning records into disjoint, homogeneous groups is done so that the selected good records that supply the imputation values (the donors) follow the same distribution as the bad records (the recipients). Due to this, and the replication property, all hot deck imputed data sets contain only plausible values, which cannot be guaranteed by most other methods. Traditionally, a donor is chosen at random, but other methods, such as ordering by covariate when sequentially imputing records, or nearest neighbor techniques utilizing distance metrics, are possible; these improve estimates at the expense of computational simplicity (cf. [11], [19]).
The replication of values leads to the central problem in question here. Any donor may, fundamentally, be chosen to accommodate multiple recipients. This poses the inherent risk that "too many" or even all recipients are imputed with the same value or values from a single donor. Due to this, some variants of hot deck procedures limit the number of times any one donor may be selected for donating its values. This inevitably leads to the question under which conditions a limitation is sensible and whether or not some appropriate limit value exists. This study aims to answer these questions. An overview of the basic mechanics of hot deck methods is presented in chapter 2. Chapter 3 discusses current empirical and theoretical research on this topic. Chapter 4 highlights the simulation study design, while results are reported and discussed in chapter 5. A conclusion and possibilities for further research are presented in chapter 6.
2. Overview of Hot Deck Methods
Ford [11] describes hot deck methods as processes in which a reported value is duplicated to represent a value missing from the sample. Sande [26] extends this to define hot deck imputation procedures as methods for completing incomplete responses using values from one or more records in the same file. Thus, from a procedural standpoint, hot deck methods clearly match donors and recipients within the same data matrix, whereby observations are duplicated to resolve either all of the recipient's missingness simultaneously or on an attribute-sequential basis. Simultaneous resolution of all of the recipient's missing data may better preserve the associations between the variables, while sequential resolution ensures a larger donor pool. Since, theoretically, any procedure may be iteratively applied to all attributes exhibiting missing values, hot deck methods are better classified by how donors and recipients are matched. The two primary possibilities for donor matching are the following (a code sketch of both strategies follows this list):
— Randomly. A donor is selected at random to accommodate any recipient. This method is, computationally, the simplest. It preserves the overall distribution of the data and leads to correct mean and variance estimation [2] under the MCAR mechanism. When data is not missing MCAR, this method can be modified in various ways. Most often, imputation classes are formed by stratifying auxiliary variables or by applying common clustering procedures to the data, in an effort to achieve MCAR missingness within the classes. The random matching of donor and recipient is then performed within these classes.
Another variant of the random hot deck applies weights to the selection probabilities [27]. This guarantees that donors more similar to the recipient have a higher chance of being selected.
The last and most widely used (random) method is the so-called sequential hot deck, a procedure developed by the U.S. Census Bureau [7]. Based on partitioning the data into imputation classes, each record in the data set is considered in turn. If a record is missing a value, this value is replaced by one saved in a register. If the record is complete, the register's value is updated. Initial values for this register are taken either from a previous survey, from the class, or randomly from the variable's domain. The sequential hot deck yields results equivalent to the random hot deck if the data set's ordering is random. An advantage may be attained when the ordering is nonrandom, such as when the data set is sorted by covariates. This, however, is seldom done purposefully, as sorting not only is computationally intensive but also requires the identification of strong covariates. Usually, in any sequential hot deck application, any order in the data set is due to data entry procedures and thus is unlikely to ensure substantially better results.
— Deterministically. This class of hot decks matches recipients to their respective donors. These procedures, usually of the nearest neighbor type, are state of the art for many statistical institutes and bureaus around the world. For example, nearest neighbor hot decks are used by the US Bureau of the Census in the CPS, SIPP, and ACS surveys, the UK Office for National Statistics used them for the 2001/2011 Censuses, and Statistics Canada utilizes nearest neighbor hot decks in 45% of all active surveys exhibiting missing data, such as the SLID and LFS.
The nearest neighbor is usually defined by minimizing simple distance functions such as the Manhattan or Chebyshev distances. These hot decks guarantee that the same donor is always chosen, given a static data set, ensuring consistency when multiple independent analyses are performed on the data after a public release. While distance matrix computation tends to become prohibitively expensive for large amounts of data, this limit is reached later for the nearest neighbor hot deck methods, as neither the simultaneous nor the sequential version requires a full distance matrix. Rather, only distances between all donors and all recipients need to be calculated.

All hot deck methods guarantee, by virtue of the duplication property, that the imputed data set contains only naturally occurring values, without the need to round or transform categorical values. Hot decks also conserve unique distribution features, such as discontinuities or spikes. Their low cost of implementation and execution is, however, offset by the fact that little is known about their theoretical properties.
Literature further detailing the mechanics of hot deck imputation methods includes [11], [19], [14], [15], and [6].
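The sketch below is our own compact illustration of the two matching strategies described above — a sequential hot deck with a value register, and a nearest neighbor hot deck with an optional donor limit (the paper's central question). All names are hypothetical; records are dicts with None marking missing values.

def sequential_hot_deck(records, attribute, initial_value):
    """Within one imputation class: impute each missing value from a register
    holding the last complete value seen; update the register otherwise."""
    register = initial_value
    for record in records:
        if record[attribute] is None:
            record[attribute] = register
        else:
            register = record[attribute]
    return records

def nearest_neighbor_hot_deck(donors, recipients, distance, donor_limit=None):
    """Impute each recipient from its nearest eligible donor; with a donor
    limit, a donor becomes ineligible after donating donor_limit times."""
    uses = {id(d): 0 for d in donors}
    for r in recipients:
        eligible = [d for d in donors
                    if donor_limit is None or uses[id(d)] < donor_limit]
        best = min(eligible, key=lambda d: distance(d, r))
        uses[id(best)] += 1
        for key, value in best.items():   # duplicate the donor's values
            if r.get(key) is None:
                r[key] = value
    return recipients

def manhattan(donor, recipient):
    """Manhattan distance over the attributes observed in both records."""
    return sum(abs(donor[k] - recipient[k]) for k in donor
               if donor[k] is not None and recipient.get(k) is not None)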
3. Review of LiteratureThe theoretical effects of a donor limit were first investigated by Kalton and Kish [16].
Based on combinatorics, they come to the conclusion that selecting a donor from the donorpool without replacement leads to a reduction in the imputation variance, the precision withwhich any parameter is estimated from the post-imputation data matrix. A possible effect on animputation introduced bias was not discussed. Two more arguments in favor of a donor limit aremade. First, the risk of exclusively using one donor for all imputations is removed [26]. Second,the probability of using one donor with an extreme value or values “too often” is reduced ([3],[28]). Based on these arguments and sources, recommendations are made in [15], [21], [28],and [10].
In contrast, Andridge and Little [2] reason that imposing a donor limit inherently reduces the ability to choose the most similar, and therefore most appropriate, donor for imputation. Not limiting the number of times a donor can be chosen may thus increase data quality. Generally speaking, a donor limit makes results dependent on the order in which objects are imputed, which usually corresponds to the sequence of the objects in the data set. This property is undesirable, especially in deterministic hot decks. Thus, from a theoretical point of view, it is not clear whether a donor limit has a positive or negative impact on the post-imputation data's quality.
The literature on this subject provides only studies that compare hot deck imputation methods with other imputation methods. These studies consider either only donor selection with replacement ([4], [24], [29]) or only donor selection without replacement ([13]).
It becomes apparent, based on this review of the literature, that the consequences of imposing a donor limit have not been sufficiently examined.
4. Study Design

Considering the possible theoretical advantages of a donor limit, and possible effects that have not been investigated to date, the following questions will be answered by this study:

1. Are the true parameters of any hot deck imputed data matrix estimated with higher precision when a donor limit is used?
2. Does a donor limit lead to less biased post-imputation parameter estimation?
3. What factors influence whether a hot deck with a donor limit creates better results?
A series of factors that might influence whether a donor limit affects parameter estimates is identified by considering papers whose authors chose similar approaches ([23], [24], [28]) and through further deliberation. The factors varied are the following:
— Imputation class count: Imputation classes are assumed to be given prior to imputation, and data is generated as determined by the class structure. Factor levels are two and seven imputation classes.
— Objects per imputation class: The number of objects characterizing each imputation class is varied. Factor levels of 50 and 250 objects per class are considered.
— Class structure: To differentiate between well- and ill-chosen imputation classes, data are generated with a relatively strong and a relatively weak class structure. A strong class structure is achieved by having classes overlap by 5% with an inner-class correlation of 0.5. A weak class structure is achieved by a class overlap of 30% and no inner-class correlation.
— Data matrices: Data matrices of nine multivariate normal variables are generated dependent on the given class structure. Three of these variables are then transformed to a discrete uniform distribution with either five or seven possible values, simulating an ordinal scale. The next three variables are converted to a nominal scale so that 60% of all objects are expected to take the value one, with the remaining values being set to zero. General details on this NORTA-type transformation are described by Cario and Nelson [8].
— Portion of missing data: Factor levels include 5, 10, and 20% missing data points, and every object is assured to have at least one data point available (no subject non-response).
— Missingness mechanism: The missingness mechanisms considered are MCAR, MAR, and NMAR. These are generated as follows: under MCAR, a set amount of values is chosen without replacement to be missing. Under MAR, missing data is generated as under MCAR but using two different rates based on the value of one binary variable, which is not itself subject to missingness. The different rates of missingness are either 10% higher or lower than the rates under MCAR. NMAR modifies the MAR mechanism to also allow missingness of the binary variable. To forgo possible problems with the simultaneous imputation methods and the donor limit of once, it was guaranteed that at least 50% of all objects within one class were complete in all attributes.
— Hot deck methods: The six hot deck methods considered are named "SeqR," "SeqDW," "SeqDM," "SimR," "SimDW," and "SimDM" according to the three properties that they exhibit. The prefixes denote whether attributes are considered sequentially (Seq) or simultaneously (Sim) for imputation. The postfixes indicate a random (R) or a distance-based deterministic (D) hot deck and the type of adjustment made to compensate for missingness when computing the distances. "W" indicates a reweighting type of compensation, which assumes that the missing components contribute an average deviation to the distance. "M" denotes that an imputation of relevant location estimates is performed before distance calculation, which assumes that the missing component is close to the average for this attribute (a sketch of both adjustments follows this list). To account for variability and importance, variables are weighted with the inverse of their range prior to aggregating the Manhattan distances.
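The following sketch contrasts the "W" and "M" distance adjustments named in the last item. The inverse-range weights follow the study design above; the function names and the treatment of records with no jointly observed components are illustrative assumptions.

    def weighted_manhattan_W(a, b, weights):
        """'W' reweighting: aggregate over jointly observed components only,
        then scale up, assuming missing components contribute the average
        deviation to the distance."""
        triples = [(x, y, w) for x, y, w in zip(a, b, weights)
                   if x is not None and y is not None]
        if not triples:
            return float("inf")  # assumption: no overlap, no comparison
        partial = sum(w * abs(x - y) for x, y, w in triples)
        return partial * len(weights) / len(triples)

    def weighted_manhattan_M(a, b, weights, col_means):
        """'M' mean imputation: replace missing components by the attribute's
        location estimate, then compute the full weighted distance."""
        a = [m if x is None else x for x, m in zip(a, col_means)]
        b = [m if y is None else y for y, m in zip(b, col_means)]
        return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))

    ranges = [10.0, 4.0, 2.0]
    weights = [1.0 / r for r in ranges]  # inverse-range variable weights
    col_means = [5.0, 2.0, 1.0]
    print(weighted_manhattan_W([1.0, None, 0.5], [3.0, 1.0, None], weights))
    print(weighted_manhattan_M([1.0, None, 0.5], [3.0, 1.0, 1.5], weights, col_means))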
Next to the previously mentioned factors, two static and two dynamic donor limits are evaluated. The two static donor limits allow a donor to be chosen either only once or an unlimited number of times. For the dynamic cases, the limit is set to either 25% or 50% of the recipient count.
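Enforcing these limits amounts to simple bookkeeping during donor selection. The sketch below, with illustrative names, shows how a static or dynamic limit removes over-used donors from consideration:

    import math

    def usage_cap(limit, n_recipients):
        """Translate a donor limit setting into a maximum usage count."""
        if limit == "once":
            return 1
        if limit == "unlimited":
            return math.inf
        return math.ceil(limit * n_recipients)  # dynamic: 0.25 or 0.50

    def select_donor(ranked_candidates, usage, cap):
        """Pick the best-ranked candidate not yet used cap times."""
        for donor_id in ranked_candidates:      # e.g. sorted by distance
            if usage.get(donor_id, 0) < cap:
                usage[donor_id] = usage.get(donor_id, 0) + 1
                return donor_id
        raise RuntimeError("donor pool exhausted under this limit")

    usage = {}
    cap = usage_cap(0.5, 4)                     # 4 recipients: 2 uses per donor
    for ranked in ([7, 3], [7, 3], [7, 3], [7, 3]):
        print(select_donor(ranked, usage, cap)) # prints 7, 7, 3, 3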
To evaluate imputation quality, a set of location, variability, and contingency measures is considered (cf. [21]). For the quantitative variables, the mean, variance, and correlation are computed; for the ordinal variables, the median, quartile distance, and rank correlation; and for the binary variables, the relative frequency of the value one and the normalized coefficient of contingency.
100 data matrices are simulated for every factor level combination of imputation class count, object count per imputation class, class structure, and ordinal variable scale. For every complete data matrix, the set of true parameters is computed. Each of these 1,600 data matrices is then subjected to each missingness mechanism, generating three different amounts of missing data. All of the matrices with missing data are then imputed by all six hot deck methods using all four donor limits. Repeating this process ten times creates 3,456,000 imputed data matrices (1,600 × 3 mechanisms × 3 missingness portions × 6 hot deck methods × 4 donor limits × 10 repetitions), for each of which the parameter set is calculated again.
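A compact sketch of the resulting factorial design shows how the 3,456,000 matrices arise; the generation, imputation, and evaluation steps are left out, and only the MCAR deletion step is spelled out, as an illustrative assumption.

    import random

    def make_mcar(matrix, rate):
        """MCAR: delete a fixed share of cells chosen uniformly at random."""
        cells = [(i, j) for i in range(len(matrix))
                 for j in range(len(matrix[0]))]
        for i, j in random.sample(cells, round(rate * len(cells))):
            matrix[i][j] = None
        return matrix

    print(make_mcar([[1, 2, 3], [4, 5, 6]], 0.2))  # one of six cells deleted

    mechanisms = ["MCAR", "MAR", "NMAR"]
    portions = [0.05, 0.10, 0.20]
    methods = ["SeqR", "SeqDW", "SeqDM", "SimR", "SimDW", "SimDM"]
    limits = ["once", 0.25, 0.50, "unlimited"]
    repetitions = 10

    # 1,600 complete matrices x 3 mechanisms x 3 portions x 6 methods
    # x 4 donor limits x 10 repetitions = 3,456,000 imputed matrices
    runs = (1600 * len(mechanisms) * len(portions)
            * len(methods) * len(limits) * repetitions)
    print(f"{runs:,}")  # 3,456,000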
Considering every parameter in the set, the relative deviation Δp between the true parameter value p_T and the estimated parameter value p_I, based on the imputed data matrix, is calculated as follows:

    \Delta p = \frac{p_I - p_T}{p_T} \qquad (1)
To analyze the impact of different donor limits on the quality of imputation, the differences in the absolute values of Δp that can be attributed to the change in donor limitation are considered. Due to the large amounts of data generated in this simulation, statistical significance tests on these absolute relative deviations are not considered appropriate. As an alternative, Cohen's d measure of effect size ([9], [5]) is chosen as a qualitative criterion. The calculation of Cohen's d for this case is as follows:
    d = \frac{\left|\overline{\Delta p}_1\right| - \left|\overline{\Delta p}_2\right|}{\sqrt{\frac{s_1^2 + s_2^2}{2}}} \qquad (2)
Here, the numerator terms are the means of all relative deviations calculated via (1) for two different donor limits, and s₁² and s₂² are the corresponding variances of the relative deviations. Taking the absolute values of the means allows interpreting the sign of d: a positive sign means that the second case of donor limitation performed better than the first, while a negative sign means the converse. As with any qualitative interpretation of results, thresholds are somewhat arbitrary and dependent on the investigator's frame of reference. Recommendations ([9], [12]) are to consider deviations larger than 10% of a standard deviation as meaningful, and the threshold for nontrivial effects is thus set to |d| ≥ 0.1.
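For concreteness, equation (2) can be computed as in the following sketch; the sample data and variable names are illustrative.

    from statistics import mean, variance

    def cohens_d(dev1, dev2):
        """Effect size between two sets of relative deviations, as in (2).

        dev1, dev2: relative deviations (1) under two donor limits.
        A positive d means the second donor limit performed better;
        |d| >= 0.1 is treated as a nontrivial effect in this study.
        """
        numerator = abs(mean(dev1)) - abs(mean(dev2))
        pooled_sd = ((variance(dev1) + variance(dev2)) / 2) ** 0.5
        return numerator / pooled_sd

    # Example: deviations shrink under the second donor limit, so d > 0.
    print(cohens_d([0.10, 0.14, 0.12], [0.05, 0.09, 0.07]))  # 2.5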
5. Results

Based on the simulation's results, the research questions formulated in section 4 are now answered. Section 5.1 deals with whether minimum imputation variance is always achieved, independent of the data and the chosen hot deck procedure, when the most stringent donor limit is applied. Section 5.2 deals with whether a donor limit will introduce a bias. Influencing factors are analyzed for each hot deck method separately in section 5.3.
5.1. Donor Limitation Impact on Precision

The theoretical reduction in imputation variance through donor selection without replacement, as put forth by Kalton and Kish [16], is investigated empirically at this point. Table 1 shows how often, over all simulated situations, a certain donor limit leads to the least imputation variance in the parameter estimate.

Table 1. Frequency distribution of minimum imputation variance (rows: evaluated parameter; columns: donor limit of once, 25%, 50%, unlimited)
Clearly, a donor limit of one leads to minimal imputation variance in most cases and thus can be expected to yield the highest precision in parameter estimation. Estimation precision also tends to increase with the stringency of the donor limit. Variables with lower scale types, binary and ordinal, favor donor selection without replacement even more strongly than the quantitative variables. Nonetheless, this recommendation does not hold for all situations: some situations demand using donors more often, while others require protection against donor overusage.
5.2. Donor Limitation Impact on Imputation Bias

To answer the question, more pressing for the practitioner, of whether implementing a donor limit also leads to a reduction in imputation bias, the recorded data was evaluated in a fashion similar to the previous analysis. Table 2 shows the percentage of situations in which a certain donor limit yields the least bias, as measured by the mean relative deviations.
The values indicate, just as with the imputation variance previously discussed, that in most cases donor selection without replacement leads to the best expected parameter estimation. Minimal imputation bias is mostly achieved by limiting donor usage to a single time but, even more so than for the imputation variance, there are situations where other donor limits improve hot deck performance. Measures of variability are more strongly affected than those of location,
which means that in some cases donor limitation will lead to less accurate confidence intervals. Contingency measures are affected less than both location and variability measures, signifying that the choice of donor limit is less critical if only the association between variables is of interest.
5.3. Analysis of Donor Limit Influencing Factors

Cohen's d is used to analyze which of the factors influence whether a donor limitation is beneficial. The tables in the following sections first highlight main effects, followed by between-factor effects on any donor limit advantages. Effect sizes are calculated between the two extreme cases, donor selection without and with replacement. Effects exceeding the threshold value of 0.1 are set in bold, with negative values indicating an advantage for the most stringent donor limit.
5.3.1. Analysis of Main Effects
Table 3 (below) shows the cross-classification between all factors and factor levels with all parameters analyzed.
The first conclusion that can be reached upon investigation of the results is that, independent of the chosen factors, there are no meaningful differences between using a donor limit and using no donor limit in mean and median estimation. This result is congruent with the results of the previous section. In contrast, parameters measuring variability are more heavily influenced by the variation of the chosen factors. Especially data matrices with a high proportion of missing data, as well as those imputed with SimDM, will profit significantly from a donor limitation. Correlation measures are influenced mainly by the number of objects per imputation class. All effects related to the binary variables are negative, indicating that especially these types of variables profit from donor selection without replacement. A high number of imputation classes also tends to speak for a limit on donor usage.
The class structure, the random hot deck procedures, and SeqDW have no influence on whether a donor limit is advantageous. Fairly conspicuous is the fact that SimDW leads to partially positive effect sizes, meaning that leaving donor usage unlimited is favorable. This leads to interesting higher order effects, detailed in the following section.
5.3.2. Analysis of Interactions
Based on the findings in the previous section, effects are investigated stratified by the hot deck methods SimDW, SimDM, and SeqDM. Results for the parameters mean and median, for the quantitative and ordinal variables respectively, are omitted because no circumstance considered yielded meaningful differences. The values for the remaining parameters are shown in Table 4.

Table 4. Interactions between imputation method and other factors
As in the analysis of main effects, this table clearly shows that using SimDW with no donor limit is advantageous in most cases. If solely the estimation of the association between binary variables is of interest, limiting donor usage to once is always appropriate. Furthermore, the other two methods, SimDM and SeqDM, show only negative values. Thus, the advantage of using a hot deck with a donor limit is strongly dependent upon the imputation method used.
For all three portrayed methods, a high number of imputation classes and a high percentage of missing data show meaningful effects, indicating an increased tendency toward whichever donor limit strategy is advantageous. The number of objects per imputation class shows no homogeneous effect on the parameters; rather, it seems to strengthen the advantage that donor limitation or non-limitation has, with the parameters variance and quartile distance reacting inversely to the other four.
The other factors seemingly do not influence the effects, as their variation does not lead to great differences in the effect sizes, making the effects' absolute level dependent only on the variable's scale or the imputation method.
Besides the results shown in Table 4, further cross-classifications between factors may be calculated. These effect sizes further highlight the additive nature of the factors systematically varied in this study. Some strikingly large effects arise when considering large amounts of missingness and many imputation classes. For example, the factor level combination of 20% missing data, a high number of imputation classes, and a low number of objects per imputation class leads to effects of up to -1.7 in variance, up to -1.9 in quartile distance, -3.6 in correlation, -2.9 in rank correlation, and -2.5 in the coefficient of contingency when imputing with the SimDM algorithm. Maximum effects when imputing with the SimDW method are reached with 20% missing data, seven imputation classes, and a low number of objects per imputation class.
Effect sizes of up to -3 are calculated for the relative frequency of the binary variable when the number of imputation classes is large, each class contains many objects, and many values are missing. This signifies a large advantage for donor selection without replacement when using SimDM. On the other hand, when using SimDW, the largest effects are calculated when the number of classes is high but the number of objects is low, combined with a high rate of missingness. Even though this only leads to effects of up to 0.6 and 0.34 for variance and quartile distance respectively, the effect is noticeable and relevant for donor selection with replacement. Conspicuous nonetheless is the fact that especially the combination of hot deck variant, number of imputation classes, objects per imputation class, and portion of missing data leads to strong effects, indicating strong advantages both for and against donor limitation.
6. Conclusions

The simulation conducted shows distinct differences between different levels of donor limits. Contrary to what Kalton and Kish [16] suggested, the smallest imputation variance is not always achieved when donors are selected without replacement from the pool. Their suggestion thus holds only for a subset of the many possible combinations of situations and hot deck types. When imputation bias is taken into account, it becomes apparent that there are many more situations where overly limiting donor usage is ill advised. For most parameters, the chances are less than 50/50 that the most extreme donor limit is advisable.
Further, there are some subsets of situations in which both imputation variance and bias are minimal when one of the two dynamic donor limits is chosen. This indicates that neither donor selection with nor without replacement is always superior, but that there is indeed a trade-off between protection from donor overusage and the ability to choose the most similar donor. Thus, the truth lies between the arguments presented in section 3.
These findings show that the most influential factor in deciding whether to impute using donor selection without replacement is the hot deck method used. When using random hot deck methods, the question of choosing a donor limit is of little consequence; implementing a donor limit in an existing system would not be worth the effort. When considering nearest neighbor hot decks, not only the method of compensating for the missing data prior to calculating the dissimilarity measure is influential, but also whether variables are processed simultaneously or sequentially. With distance calculation assisted by mean imputation, donor selection without replacement is always the superior strategy. If a reweighting scheme is chosen, parameter estimation (excluding the contingency coefficient for binary variables) is never worse when donors may be chosen an unlimited number of times; sequential processing of variables leads to trivial differences, but simultaneous processing leads to noticeable advantages when allowing unlimited donor usage. Beyond that, the overall magnitude of the advantage for any donor usage tactic is determined by the factors objects per imputation class, number of imputation classes, and proportion of missing data. These results, in conjunction with the intended post-imputation analyses, dictate which donor limit, with or without replacement, is most suitable. For example, if a decision tree is to be constructed with a CHAID algorithm, donor selection without replacement should be used for imputation, because the coefficient of contingency is best estimated under donor selection without replacement.
In conclusion, some interesting questions can be answered with this research, while others remain open. Results from sections 5.1 and 5.2 indicate that there may be a situation-dependent optimal donor limit, which may be dynamically determined from the data at hand. Hence, a hot deck method with a data-driven donor limit selection may have desirable properties. Finally, the large number of situations under which donor selection without replacement is the superior strategy raises questions. Since imputing without donor replacement generally makes the results dependent on the sequence of the recipients, results of hot deck imputation could be improved further if donor selection was performed not to minimize the distance at each step, but to minimize the sum of distances between all donors and recipients. Thus, further research pertaining to hot deck imputation and donor selection schemes remains worthwhile.
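As a sketch of the globally optimal matching suggested above, donor selection with a donor limit of once can be posed as a linear assignment problem; the distance matrix here is illustrative, and SciPy's solver is assumed to be available.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # dist[i, j]: distance between recipient i and donor j; with more donors
    # than recipients, each donor is used at most once (a donor limit of one)
    dist = np.array([[0.2, 4.8, 1.0],
                     [3.1, 0.4, 2.2]])

    # Minimizes the sum of matched distances over all recipients at once,
    # instead of greedily minimizing each recipient's distance in sequence.
    rec_idx, don_idx = linear_sum_assignment(dist)
    print([(int(r), int(d)) for r, d in zip(rec_idx, don_idx)])  # [(0, 0), (1, 1)]
    print(float(dist[rec_idx, don_idx].sum()))                   # ~0.6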
References
[1] Allison P.D.: Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, Thousand Oaks, 2001.
[2] Andridge R.R., Little R.J.A.: A Review of Hot Deck Imputation for Survey Non-Response. International Statistical Review, 78, 1, pp. 40–64, 2010.
[3] Bankhofer U.: Unvollständige Daten- und Distanzmatrizen in der Multivariaten Datenanalyse. Eul, Bergisch Gladbach, 1995.
[4] Barzi F., Woodward M.: Imputations of Missing Values in Practice: Results from Imputations of Serum Cholesterol in 28 Cohort Studies. American Journal of Epidemiology, 160, pp. 34–45, 2004.
[5] Bortz J., Döring N.: Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. Springer, Berlin, 2009.
[6] Brick J.M., Kalton G.: Handling Missing Data in Survey Research. Statistical Methods in Medical Research, 5, pp. 215–238, 1996.
[7] Brooks C.A., Bailar B.A.: An Error Profile: Employment as Measured by the Current Population Survey. Statistical Policy Working Paper 3, U.S. Government Printing Office, Washington, D.C., 1978.
[8] Cario M.C., Nelson B.L.: Modeling and Generating Random Vectors with Arbitrary Marginal Distributions and Correlation Matrix. Northwestern University, IEMS Technical Report, 50, pp. 100–150, 1997.
[9] Cohen J.: A Power Primer. Quantitative Methods in Psychology, 112, pp. 155–159, 1992.
[10] Durrant G.B.: Imputation Methods for Handling Item-Nonresponse in Practice: Methodological Issues and Recent Debates. International Journal of Social Research Methodology, 12, pp. 293–304, 2009.
[11] Ford B.: An Overview of Hot-Deck Procedures. In: W. Madow, H. Nisselson, I. Olkin (Eds.): Incomplete Data in Sample Surveys, 2, Theory and Bibliographies. Academic Press, pp. 185–207, 1983.
[12] Fröhlich M., Pieter A.: Cohen's Effektstärken als Maß der Bewertung von praktischer Relevanz – Implikationen für die Praxis. Schweizerische Zeitschrift für Sportmedizin und Sporttraumatologie, 57, 4, pp. 139–142, 2009.
[13] Kaiser J.: The Effectiveness of Hot-Deck Procedures in Small Samples. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 523–528, 1983.
[14] Kalton G., Kasprzyk D.: Imputing for Missing Survey Responses. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 22–31, 1982.
[15] Kalton G., Kasprzyk D.: The Treatment of Missing Survey Data. Survey Methodology, 12, pp. 1–16, 1986.
[16] Kalton G., Kish L.: Two Efficient Random Imputation Procedures. Proceedings of the Survey Research Methods Section 1981, pp. 146–151, 1981.
[17] Kim J.O., Curry J.: The Treatment of Missing Data in Multivariate Analysis. Sociological Methods and Research, 6, pp. 215–240, 1977.
[18] Kim W., Choi B., Hong E., Kim S., Lee D.: A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery, 7, 1, pp. 81–99, 2003.
[19] Little R.J., Rubin D.B.: Statistical Analysis with Missing Data. Wiley, New York, 1987.
[20] Marker D.A., Judkins D.R., Winglee M.: Large-Scale Imputation for Complex Surveys. In: Groves R.M., Dillman D.A., Eltinge J.L., Little R.J.A. (Eds.): Survey Nonresponse. John Wiley and Sons, pp. 329–341, 2001.
[21] Nordholt E.S.: Imputation: Methods, Simulation Experiments and Practical Examples. International Statistical Review, 66, pp. 157–180, 1998.
[22] Pearson R.: Mining Imperfect Data. Society for Industrial and Applied Mathematics, Philadelphia, 2005.
[23] Roth P.L.: Missing Data in Multiple Item Scales: A Monte Carlo Analysis of Missing Data Techniques. Organizational Research Methods, 2, pp. 211–232, 1999.
[24] Roth P.L., Switzer III F.S.: A Monte Carlo Analysis of Missing Data Techniques in a HRM Setting. Journal of Management, 21, pp. 1003–1023, 1995.
[25] Rubin D.B.: Inference and Missing Data (with discussion). Biometrika, 63, pp. 581–592, 1976.
[26] Sande I.: Hot-Deck Imputation Procedures. In: W. Madow, H. Nisselson, I. Olkin (Eds.): Incomplete Data in Sample Surveys, 3, Theory and Bibliographies. Academic Press, pp. 339–349, 1983.
[27] Siddique J., Belin T.R.: Multiple Imputation Using an Iterative Hot-Deck with Distance-Based Donor Selection. Statistics in Medicine, 27, 1, pp. 83–102, 2008.
[28] Strike K., Emam K.E., Madhavji N.: Software Cost Estimation with Incomplete Data. IEEE Transactions on Software Engineering, 27, pp. 890–908, 2001.
[29] Yenduri S., Iyengar S.S.: Performance Evaluation of Imputation Methods for Incomplete Datasets. International Journal of Software Engineering and Knowledge Engineering, 17, pp. 127–152, 2007.