arXiv:1710.07954v3 [math.ST] 27 Aug 2018
Bayesian Cluster Enumeration Criterion for Unsupervised Learning
Freweyni K. Teklehaymanot, Student Member, IEEE, Michael Muma, Member, IEEE, and Abdelhak M. Zoubir, Fellow, IEEE
Abstract—We derive a new Bayesian Information Criterion (BIC) by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as a starting point when deriving the BIC for specific distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed variables. We show that incorporating the data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models. Subsequently, the number of clusters is determined as the one associated with the model for which the proposed BIC is maximal. The performance of the proposed two-step algorithm is tested using synthetic and real data sets.
Index Terms—model selection, Bayesian information criterion, cluster enumeration, cluster analysis, unsupervised learning, multivariate Gaussian distribution
I. INTRODUCTION
STATISTICAL model selection is concerned with choosing a model that adequately explains the observations from a family of candidate models. Many methods have been proposed in the literature, see for example [1]–[25] and the review in [26]. Model selection problems arise in various applications, such as the estimation of the number of signal components [15], [18]–[20], [23]–[25], the selection of the number of non-zero regression parameters in regression analysis [4]–[6], [11], [12], [14], [21], [22], and the estimation of the number of data clusters in unsupervised learning problems [27]–[45]. In this paper, our focus lies on the derivation of a Bayesian model selection criterion for cluster analysis.

The estimation of the number of clusters, also called cluster enumeration, has been intensively researched for decades [27]–[45] and a popular approach is to apply the Bayesian Information Criterion (BIC) [29], [31]–[33], [37]–[41], [44]. The BIC finds the large sample limit of the Bayes estimator, which leads to the selection of a model that is a posteriori most probable. It is consistent if the true data generating model belongs to the family of candidate models under investigation.
F. K. Teklehaymanot and A. M. Zoubir are with the Signal Processing Group and the Graduate School of Computational Engineering, Technische Universität Darmstadt, Darmstadt, Germany (e-mail: [email protected]; [email protected]).
M. Muma is with the Signal Processing Group, Technische Universität Darmstadt, Darmstadt, Germany (e-mail: [email protected]).
The BIC was originally derived by Schwarz in [8] assuming that (i) the observations are independent and identically distributed (iid), (ii) they arise from an exponential family of distributions, and (iii) the candidate models are linear in parameters. Ignoring these rather restrictive assumptions, the BIC has been used in a much larger scope of model selection problems. A justification of the widespread applicability of the BIC was provided in [16] by generalizing Schwarz's derivation. In [16], the authors drop the first two assumptions made by Schwarz, given that some regularity conditions are satisfied. The BIC is a generic criterion in the sense that it does not incorporate information regarding the specific model selection problem at hand. As a result, it penalizes two structurally different models the same way if they have the same number of unknown parameters.

The works in [15], [46] have shown that model selection rules that penalize for model complexity have to be examined carefully before they are applied to specific model selection problems. Nevertheless, despite the widespread use of the BIC for cluster enumeration [29], [31]–[33], [37]–[41], [44], very little effort has been made to check the appropriateness of the original BIC formulation [16] for cluster analysis. One noticeable work in this direction was made in [38] by providing a more accurate approximation to the marginal likelihood for small sample sizes. This derivation was made specifically for mixture models under the assumption that the mixture components are well separated. The resulting expression contains the original BIC term plus additional terms that are based on the mixing probability and the Fisher Information Matrix (FIM) of each partition. The method proposed in [38] requires the calculation of the FIM for each cluster in each candidate model, which is computationally very expensive and impractical in real-world applications with high-dimensional data. This greatly limits the applicability of the cluster enumeration method proposed in [38]. Other than the above-mentioned work, to the best of our knowledge, no one has thoroughly investigated the derivation of the BIC for cluster analysis using large sample approximations.
We derive a new BIC by formulating the problem of estimating the number of partitions (clusters) in an observed data set as maximization of the posterior probability of the candidate models. Under some mild assumptions, we provide a general expression for the BIC, $\text{BIC}_G(\cdot)$, which is applicable to a broad class of data distributions. This serves as a starting point when deriving the BIC for specific data distributions in cluster analysis. Along this line, we simplify $\text{BIC}_G(\cdot)$ by imposing an assumption on the data distribution. A closed-form expression, $\text{BIC}_N(\cdot)$, is then obtained for multivariate Gaussian distributed data.
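To illustrate the two-step procedure in code, the following minimal Python sketch partitions the data with an EM-based Gaussian mixture fit for each candidate number of clusters and then evaluates a criterion of the form of Eq. (58) below, i.e., the hard-assignment Gaussian log-likelihood minus the penalty $(q/2)\sum_m \log N_m$. This is only a sketch under our own simplifying assumptions, not the paper's reference implementation: hard assignments are taken from scikit-learn's GaussianMixture, per-cluster sample means and (regularized) covariances are used as the ML estimates, $q = r + r(r+1)/2$ parameters per cluster are assumed, and the names `proposed_bic` and `enumerate_clusters` are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def proposed_bic(X, labels):
    """Criterion of the form of Eq. (58) for a hard partition of X (assumed Gaussian clusters)."""
    N, r = X.shape
    q = r + r * (r + 1) // 2                 # assumed parameters per cluster: mean + symmetric covariance
    log_lik, penalty = 0.0, 0.0
    for m in np.unique(labels):
        Xm = X[labels == m]
        Nm = len(Xm)
        mu_m = Xm.mean(axis=0)
        Sigma_m = np.cov(Xm, rowvar=False, bias=True) + 1e-6 * np.eye(r)  # ML covariance, lightly regularized
        log_lik += Nm * np.log(Nm / N)                                    # cluster-assignment (mixing) term
        log_lik += multivariate_normal.logpdf(Xm, mu_m, Sigma_m).sum()    # per-cluster Gaussian log-likelihood
        penalty += 0.5 * q * np.log(Nm)
    return log_lik - penalty

def enumerate_clusters(X, l_min=1, l_max=10):
    """Two-step enumeration: EM partitioning per candidate model, then argmax of the criterion."""
    scores = {}
    for l in range(l_min, l_max + 1):
        labels = GaussianMixture(n_components=l, n_init=5, random_state=0).fit_predict(X)
        scores[l] = proposed_bic(X, labels)
    return max(scores, key=scores.get), scores

# toy usage: three well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
l_hat, scores = enumerate_clusters(X, 1, 6)
print("estimated number of clusters:", l_hat)
```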
APPENDIX A

(A.7) $p(\mathcal{M}_l)$ and $f(\Theta_l|\mathcal{M}_l)$ are independent of the data length $N$.

Then, ignoring the terms in Eq. (57) that do not grow as $N \to \infty$ results in

$$\text{BIC}_N(\mathcal{M}_l) \triangleq \log p(\mathcal{M}_l|\mathcal{X}) \approx \log\mathcal{L}(\hat{\Theta}_l|\mathcal{X}) - \frac{q}{2}\sum_{m=1}^{l}\log N_m + \rho. \qquad (58)$$
Since $\mathcal{X}$ is composed of multivariate Gaussian distributed data, $\text{BIC}_N(\mathcal{M}_l)$ can be further simplified as follows:

$$\begin{aligned}
\text{BIC}_N(\mathcal{M}_l) &= \log\mathcal{L}(\hat{\Theta}_l|\mathcal{X}) + p_l \\
&= \sum_{m=1}^{l}\left( N_m \log\frac{N_m}{N} - \frac{rN_m}{2}\log 2\pi - \frac{N_m}{2}\log\big|\hat{\Sigma}_m\big| - \frac{1}{2}\operatorname{Tr}\!\big(N_m\hat{\Sigma}_m^{-1}\hat{\Sigma}_m\big) \right) + p_l \\
&= \sum_{m=1}^{l} N_m \log N_m - N\log N - \frac{rN}{2}\log 2\pi - \sum_{m=1}^{l}\frac{N_m}{2}\log\big|\hat{\Sigma}_m\big| - \frac{rN}{2} + p_l, \qquad (59)
\end{aligned}$$
where

$$p_l \triangleq -\frac{q}{2}\sum_{m=1}^{l}\log N_m + \rho. \qquad (60)$$
Finally, ignoring the model-independent terms in Eq. (59) results in Eq. (18), which concludes the proof.
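As a numerical sanity check of the algebra between the second and third lines of Eq. (59), the following snippet (a small sketch, not part of the paper) draws random cluster sizes and positive-definite covariance matrices and verifies that the two expressions coincide, using $\operatorname{Tr}(N_m\hat{\Sigma}_m^{-1}\hat{\Sigma}_m) = rN_m$ and $\sum_m N_m = N$; the common term $p_l$ is omitted since it appears on both sides.

```python
import numpy as np

rng = np.random.default_rng(42)
r, l = 3, 4                                     # feature dimension and number of clusters
N_m = rng.integers(20, 200, size=l)             # random cluster sizes
N = N_m.sum()
Sigmas = []                                     # random positive-definite "estimated" covariances
for _ in range(l):
    M = rng.standard_normal((r, r))
    Sigmas.append(M @ M.T + np.eye(r))

# second line of Eq. (59): per-cluster Gaussian log-likelihood terms (p_l dropped)
line2 = sum(
    Nm * np.log(Nm / N)
    - r * Nm / 2 * np.log(2 * np.pi)
    - Nm / 2 * np.linalg.slogdet(S)[1]
    - 0.5 * np.trace(Nm * np.linalg.solve(S, S))
    for Nm, S in zip(N_m, Sigmas)
)

# third line of Eq. (59): the simplified form (p_l dropped)
line3 = (
    sum(Nm * np.log(Nm) for Nm in N_m)
    - N * np.log(N)
    - r * N / 2 * np.log(2 * np.pi)
    - sum(Nm / 2 * np.linalg.slogdet(S)[1] for Nm, S in zip(N_m, Sigmas))
    - r * N / 2
)

print(np.isclose(line2, line3))   # True: the two lines of Eq. (59) agree
```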
APPENDIX B
VECTOR AND MATRIX DIFFERENTIATION RULES
Here, we describe the vector and matrix differentiation rules used in this paper (see [63] for details). Let $\mu \in \mathbb{R}^{r\times 1}$ be the mean and $\Sigma \in \mathbb{R}^{r\times r}$ be the covariance matrix of a multivariate Gaussian random variable $x$. Assuming that the covariance matrix $\Sigma$ has no special structure, the following vector and matrix differentiation rules hold.
$$\frac{d}{d\Sigma}\log|\Sigma| = \operatorname{Tr}\!\left(\Sigma^{-1}\frac{d\Sigma}{d\Sigma}\right) \qquad (61)$$

$$\frac{d}{d\Sigma}\operatorname{Tr}(\Sigma) = \operatorname{Tr}\!\left(\frac{d\Sigma}{d\Sigma}\right) \qquad (62)$$

$$\frac{d}{d\Sigma}\Sigma^{-1} = -\Sigma^{-1}\frac{d\Sigma}{d\Sigma}\Sigma^{-1} \qquad (63)$$

$$\frac{d}{d\mu}\mu^{\top}\mu = 2\mu^{\top} \qquad (64)$$
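These differentials can be checked numerically. The snippet below (a small sketch, not from the paper) compares finite-difference approximations of $\log|\Sigma|$ and $\Sigma^{-1}$ along a symmetric perturbation direction $E$ against the closed forms implied by rules (61) and (63), namely $\operatorname{Tr}(\Sigma^{-1}E)$ and $-\Sigma^{-1}E\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 3
M = rng.standard_normal((r, r))
Sigma = M @ M.T + r * np.eye(r)                  # positive-definite covariance matrix
E = rng.standard_normal((r, r)); E = E + E.T     # symmetric perturbation direction
t = 1e-6

# Rule (61): the differential of log|Sigma| in direction E equals Tr(Sigma^{-1} E)
fd_logdet = (np.linalg.slogdet(Sigma + t * E)[1] - np.linalg.slogdet(Sigma)[1]) / t
print(fd_logdet, np.trace(np.linalg.solve(Sigma, E)))          # should agree closely

# Rule (63): the differential of Sigma^{-1} in direction E equals -Sigma^{-1} E Sigma^{-1}
fd_inv = (np.linalg.inv(Sigma + t * E) - np.linalg.inv(Sigma)) / t
Sinv = np.linalg.inv(Sigma)
print(np.max(np.abs(fd_inv + Sinv @ E @ Sinv)))                # close to zero
```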
Given three arbitrary symmetric matrices $A$, $B$, and $Y$ with matching dimensions,

$$\operatorname{Tr}(AYBY) = \operatorname{vec}(Y)^{\top}(A\otimes B)\operatorname{vec}(Y) \qquad (65)$$
$$= u^{\top}D^{\top}(A\otimes B)Du, \qquad (66)$$

where $u$ contains the unique elements of the symmetric matrix $Y$ and $D$ denotes the duplication matrix of $Y$. In Eq. (50) we used the relation $\operatorname{vec}\!\left(\frac{dY}{du}\right) = D\,\frac{du}{du}$.
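As an illustration (a sketch under our own conventions, not code from the paper), the following verifies Eqs. (65) and (66) numerically. The helper `duplication_matrix` is a hypothetical name and builds $D$ under a column-major ordering of the lower-triangular elements, one common convention for vech and the duplication matrix.

```python
import numpy as np

def duplication_matrix(r):
    """D such that vec(Y) = D @ vech(Y) for a symmetric r x r matrix Y."""
    D = np.zeros((r * r, r * (r + 1) // 2))
    k = 0
    for j in range(r):               # column of Y
        for i in range(j, r):        # row of Y, lower triangle incl. diagonal
            D[j * r + i, k] = 1.0    # position of Y[i, j] in column-major vec(Y)
            D[i * r + j, k] = 1.0    # position of the mirrored element Y[j, i]
            k += 1
    return D

def vech(Y):
    """Stack the lower-triangular elements of Y column by column (the unique elements u)."""
    return np.concatenate([Y[j:, j] for j in range(Y.shape[0])])

rng = np.random.default_rng(0)
r = 4
A = rng.standard_normal((r, r)); A = A + A.T    # arbitrary symmetric matrices
B = rng.standard_normal((r, r)); B = B + B.T
Y = rng.standard_normal((r, r)); Y = Y + Y.T

vecY, u, D = Y.flatten(order="F"), vech(Y), duplication_matrix(r)
assert np.allclose(D @ u, vecY)                 # vec(Y) = D u

lhs = np.trace(A @ Y @ B @ Y)                   # left-hand side of Eq. (65)
mid = vecY @ np.kron(A, B) @ vecY               # right-hand side of Eq. (65)
rhs = u @ D.T @ np.kron(A, B) @ D @ u           # right-hand side of Eq. (66)
print(np.allclose(lhs, mid), np.allclose(mid, rhs))
```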
ACKNOWLEDGMENT
We thank the anonymous reviewers for their insightful comments and suggestions. Further, we would like to thank Dr. Benjamin Bejar Haro for providing us with the multi-object multi-camera data set which was created as a benchmark within the project HANDiCAMS. HANDiCAMS acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number: 323944. The work of F. K. Teklehaymanot is supported by the 'Excellence Initiative' of the German Federal and State Governments and the Graduate School of Computational Engineering at Technische Universität Darmstadt and by the LOEWE initiative (Hessen, Germany) within the NICER project. The work of M. Muma is supported by the 'Athene Young Investigator Programme' of Technische Universität Darmstadt.
REFERENCES
[1] H. Jeffreys, The Theory of Probability (3 ed.). New York, USA: Oxford University Press, 1961.
[2] H. Akaike, "Fitting autoregressive models for prediction," Ann. Inst. Statist. Math., vol. 21, pp. 243–247, 1969.
[3] ——, "Statistical predictor identification," Ann. Inst. Statist. Math., vol. 22, pp. 203–217, 1970.
[4] ——, "Information theory and an extension of the maximum likelihood principle," in 2nd Int. Symp. Inf. Theory, 1973, pp. 267–281.
[5] D. M. Allen, "The relationship between variable selection and data augmentation and a method for prediction," Technometrics, vol. 16, no. 1, pp. 125–127, Feb. 1974.
[6] M. Stone, "Cross-validatory choice and assessment of statistical prediction," J. R. Statist. Soc. B, vol. 36, no. 2, pp. 111–133, 1974.
[7] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465–471, 1978.
[8] G. Schwarz, "Estimating the dimension of a model," Ann. Stat., vol. 6, no. 2, pp. 461–464, 1978.
[9] E. J. Hannan and B. G. Quinn, "The determination of the order of an autoregression," J. R. Statist. Soc. B, vol. 41, no. 2, pp. 190–195, 1979.
[10] R. Shibata, "Asymptotically efficient selection of the order of the model for estimating parameters of a linear process," Ann. Stat., vol. 8, no. 1, pp. 147–164, 1980.
[11] C. R. Rao and Y. Wu, "A strongly consistent procedure for model selection in a regression problem," Biometrika, vol. 76, no. 2, pp. 369–374, 1989.
[12] L. Breiman, "The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error," J. Am. Stat. Assoc., vol. 87, no. 419, pp. 738–754, Sept. 1992.
[13] R. E. Kass and A. E. Raftery, "Bayes factors," J. Am. Stat. Assoc., vol. 90, no. 430, pp. 773–795, June 1995.
[14] J. Shao, "Bootstrap model selection," J. Am. Stat. Assoc., vol. 91, no. 434, pp. 655–665, June 1996.
[15] P. M. Djuric, "Asymptotic MAP criteria for model selection," IEEE Trans. Signal Process., vol. 46, no. 10, pp. 2726–2735, Oct. 1998.
[16] J. E. Cavanaugh and A. A. Neath, "Generalizing the derivation of the Schwarz information criterion," Commun. Statist.-Theory Meth., vol. 28, no. 1, pp. 49–66, 1999.
[17] A. M. Zoubir, "Bootstrap methods for model selection," Int. J. Electron. Commun., vol. 53, no. 6, pp. 386–392, 1999.
[18] A. M. Zoubir and D. R. Iskander, "Bootstrap modeling of a class of nonstationary signals," IEEE Trans. Signal Process., vol. 48, no. 2, pp. 399–408, Feb. 2000.
[19] R. F. Brcich, A. M. Zoubir, and P. Pelin, "Detection of sources using bootstrap techniques," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 206–215, Feb. 2002.
[20] M. R. Morelande and A. M. Zoubir, "Model selection of random amplitude polynomial phase signals," IEEE Trans. Signal Process., vol. 50, no. 3, pp. 578–589, Mar. 2002.
[21] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde, "Bayesian measures of model complexity and fit," J. R. Statist. Soc. B, vol. 64, no. 4, pp. 583–639, 2002.
[22] G. Claeskens and N. L. Hjort, "The focused information criterion," J. Am. Stat. Assoc., vol. 98, no. 464, pp. 900–916, 2003.
[23] Z. Lu and A. M. Zoubir, "Generalized Bayesian information criterion for source enumeration in array processing," IEEE Trans. Signal Process., vol. 61, no. 6, pp. 1470–1480, Mar. 2013.
[24] ——, "Flexible detection criterion for source enumeration in array processing," IEEE Trans. Signal Process., vol. 61, no. 6, pp. 1303–1314, Mar. 2013.
[25] ——, "Source enumeration in array processing using a two-step test," IEEE Trans. Signal Process., vol. 63, no. 10, pp. 2718–2727, May 2015.
[26] C. R. Rao and Y. Wu, "On model selection," IMS Lecture Notes - Monograph Series, pp. 1–57, 2001.
[27] A. Kalogeratos and A. Likas, "Dip-means: an incremental clustering method for estimating the number of clusters," in Proc. Adv. Neural Inf. Process. Syst. 25, 2012, pp. 2402–2410.
[28] G. Hamerly and E. Charles, "Learning the K in K-Means," in Proc. 16th Int. Conf. Neural Inf. Process. Syst. (NIPS), Whistler, Canada, 2003, pp. 281–288.
[29] D. Pelleg and A. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proc. 17th Int. Conf. Mach. Learn. (ICML), 2000, pp. 727–734.
[30] M. Shahbaba and S. Beheshti, "Improving X-means clustering with MNDL," in Proc. 11th Int. Conf. Inf. Sci., Signal Process. and Appl. (ISSPA), Montreal, Canada, 2012, pp. 1298–1302.
[31] T. Ishioka, "An expansion of X-means for automatically determining the optimal number of clusters," in Proc. 4th IASTED Int. Conf. Comput. Intell., Calgary, Canada, 2005, pp. 91–96.
[32] Q. Zhao, V. Hautamaki, and P. Franti, "Knee point detection in BIC for detecting the number of clusters," in Proc. 10th Int. Conf. Adv. Concepts Intell. Vis. Syst. (ACIVS), Juan-les-Pins, France, 2008, pp. 664–673.
[33] Q. Zhao, M. Xu, and P. Franti, "Knee point detection on Bayesian information criterion," in Proc. 20th IEEE Int. Conf. Tools with Artificial Intell., Dayton, USA, 2008, pp. 431–438.
[34] Y. Feng and G. Hamerly, "PG-means: learning the number of clusters in data," in Proc. Conf. Adv. Neural Inf. Process. Syst. 19 (NIPS), 2006, pp. 393–400.
[35] C. Constantinopoulos, M. K. Titsias, and A. Likas, "Bayesian feature and model selection for Gaussian mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, June 2006.
[36] T. Huang, H. Peng, and K. Zhang, "Model selection for Gaussian mixture models," Statistica Sinica, vol. 27, no. 1, pp. 147–169, 2017.
[37] C. Fraley and A. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," Comput. J., vol. 41, no. 8, 1998.
[38] A. Mehrjou, R. Hosseini, and B. N. Araabi, "Improved Bayesian information criterion for mixture model selection," Pattern Recognit. Lett., vol. 69, pp. 22–27, Jan. 2016.
[39] A. Dasgupta and A. E. Raftery, "Detecting features in spatial point processes with clutter via model-based clustering," J. Am. Stat. Assoc., vol. 93, no. 441, pp. 294–302, Mar. 1998.
[40] J. G. Campbell, C. Fraley, F. Murtagh, and A. E. Raftery, "Linear flaw detection in woven textiles using model-based clustering," Pattern Recognit. Lett., vol. 18, pp. 1539–1548, Aug. 1997.
[41] S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery, "Three types of Gamma-ray bursts," Astrophysical J., vol. 508, pp. 314–327, Nov. 1998.
[42] W. J. Krzanowski and Y. T. Lai, "A criterion for determining the number of groups in a data set using sum-of-squares clustering," Biometrics, vol. 44, no. 1, pp. 23–34, Mar. 1988.
[43] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a dataset via the gap statistic," J. R. Statist. Soc. B, vol. 63, pp. 411–423, 2001.
[44] F. K. Teklehaymanot, M. Muma, J. Liu, and A. M. Zoubir, "In-network adaptive cluster enumeration for distributed classification/labeling," in Proc. 24th Eur. Signal Process. Conf. (EUSIPCO), Budapest, Hungary, 2016, pp. 448–452.
[45] P. Binder, M. Muma, and A. M. Zoubir, "Gravitational clustering: a simple, robust and adaptive approach for distributed networks," Signal Process., vol. 149, pp. 36–48, Aug. 2018.
[46] P. Stoica and Y. Selen, "Model-order selection: a review of information criterion rules," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, July 2004.
[47] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[48] T. Ando, Bayesian Model Selection and Statistical Modeling, ser. Statistics: Textbooks and Monographs. Florida, USA: Taylor and Francis Group, LLC, 2010.
[49] C. M. Bishop, Pattern Recognition and Machine Learning. New York, USA: Springer Science+Business Media, LLC, 2006.
[50] J. Blomer and K. Bujna, "Adaptive seeding for Gaussian mixture models," in Proc. 20th Pacific Asia Conf. Adv. Knowl. Discovery and Data Mining (PAKDD), vol. 9652, Auckland, New Zealand, 2016, pp. 296–308.
[51] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms, New Orleans, USA, 2007, pp. 1027–1035.
[52] P. Franti, "Efficiency of random swap clustering," J. Big Data, vol. 5, no. 13, 2018.
[53] Q. Zhao, V. Hautamaki, I. Karkkainen, and P. Franti, "Random swap EM algorithm for Gaussian mixture models," Pattern Recognit. Lett., vol. 33, pp. 2120–2126, 2012.
[54] P. Franti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recognit., vol. 39, no. 5, pp. 761–765, 2006. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2005.09.012
[55] I. Karkkainen and P. Franti, "Dynamic local search algorithm for the clustering problem," Department of Computer Science, University of Joensuu, Joensuu, Finland, Tech. Rep. A-2002-6, 2002.
[56] P. Franti, R. Mariescu-Istodor, and C. Zhong, "Xnn graph," Joint Int. Workshop on Structural, Syntactic, and Statist. Pattern Recognit., vol. LNCS 10029, pp. 207–217, 2016.
[57] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, pp. 179–188, 1936.
[58] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[59] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-Up Robust Features (SURF)," Comput. Vis. Image Underst., vol. 110, pp. 346–359, June 2008.
[60] F. K. Teklehaymanot, A.-K. Seifert, M. Muma, M. G. Amin, and A. M. Zoubir, "Bayesian target enumeration and labeling using radar data of human gait," in 26th Eur. Signal Process. Conf. (EUSIPCO) (accepted), 2018.
[61] A. M. Zoubir, V. Koivunen, Y. Chakhchoukh, and M. Muma, "Robust estimation in signal processing," IEEE Signal Process. Mag., vol. 29, no. 4, pp. 61–80, July 2012.
[62] A. M. Zoubir, V. Koivunen, E. Ollila, and M. Muma, Robust Statistics for Signal Processing. Cambridge University Press, 2018.
[63] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (3 ed.), ser. Wiley Series in Probability and Statistics. Chichester, England: John Wiley & Sons Ltd, 2007.