7/24/2019 Feature Extraction From Big Data
1/13
International Journal of Advance Foundation and Research in Computer (IJAFRC)
Volume 2, Issue 10, October - 2015. ISSN 2348 - 4853, Impact Factor - 1.317
© 2015, IJAFRC All Rights Reserved www.ijafrc.org
Feature Extraction From Big Data
Aarti.B.Sahitya*, DR.Mrs.M.Vijayalakshmi.
PG Scholar, V.E.S.I.T, Chembur, Professor, V.E.S.I.T, Chembur
aarti.sahitya@ves.ac.in, m.vijayalakshmi@ves.ac.in
ABSTRACT
Dimensionality reduction, feature extraction and feature selection are key concepts in data mining whenever the size of high dimensional data has to be reduced. The aim of this research is to find an applicable methodology that reduces the ever increasing volume of data. In this paper we describe various feature extraction and feature selection methods and propose a scheme that selects a subset of the original features based on some evaluation criteria and thereby reduces the volume of a high dimensional dataset.
Index Terms : Feature Extraction, Feature Selection, Dimensionality Reduction, Big Data, Data Visualization, High Dimensional Data.
I. INTRODUCTION
A dataset is a collection of homogeneous objects. An object is an instance in the dataset. A dimension is a property which defines an object. Dimensionality reduction is the process where, at each step, the irrelevant dimensions are reduced without substantial loss of information and without affecting the final output. Feature extraction is the process of extracting a new set of reduced features from the original features based on some attribute transformation. Feature selection is a process that chooses an optimal subset of features according to an objective function. As real world large datasets consist of irrelevant, redundant and noisy dimensions, one must consider dimensionality reduction as a pre-processing step before applying clustering, classification or regression algorithms to these datasets.
A. High Dimensionality Data Reduction Challenges In Big Data
Big data is relentless. It is continuously generated on a massive scale by online interactions between people and systems and by sensor enabled devices. It can be related, linked and integrated to provide highly detailed information, and such detail makes it possible for banks, health care and public safety to provide specific services. It is creating new businesses and transforming traditional markets to create new business trends, so it is a challenge to the statistical community. Additional information can be obtained from a single large dataset as opposed to separate smaller sets. This allows correlations to be found, for instance to spot business trends. Big data involves increasing volume, that is the amount of data; velocity, that is the speed at which data flows in and out; and variety, that is the range of data types and sources. This requires new forms of processing for decision making. It produces massive sample sizes that allow us to discover hidden patterns associated with small subsets of a big dataset. High dimensional big data has special features such as noise accumulation and spurious correlation. Spurious correlation occurs because many uncorrelated random variables may have high sample correlation coefficients in high dimensions. Such correlation leads to wrong inferences. High dimensional data can be generated from sectors like biotech, finance, satellite imagery and customer financial data. Such data can be stored in the form of a data matrix: web term document data, sensor array data, consumer financial data etc. It is computationally infeasible to make inferences directly from the raw data. To handle big data from both the statistical and the computational views, dimension reduction is an important step before processing of big data starts. High dimensional data can be analyzed through classification,
clustering and regression, but these methods alone are not sufficient for processing of big data. The challenges for reducing high dimensional data are as follows:
1. No formal mathematical models are available.
2. Even when models are available, a proper derivation often is not, which creates confusion among developers about how to reduce the features of a high dimensional dataset.
B. Need of Dimensionality Reduction
It is required because most machine learning and data mining techniques may not be effective for high dimensional data: query accuracy and efficiency degrade rapidly as the dimension increases. It is also required for
Visualization - projection of high dimensional data on to 2D or 3D.
Data compression - efficient storage and retrieval.
Noise removal - positive effect on query accuracy.
II.
procedures not suitable for high dimensional data. Feature selection / feature extraction methods have been used to select a subset of original features based on some evaluation criteria[?].
Data is growing at a huge speed, making it difficult to handle such large amounts of data (exabytes). The main difficulty is that the volume is increasing rapidly in comparison to the computing resources. Big data can be defined by properties such as variety, volume, velocity, variability, value and complexity. Big data is different from the data stored in traditional warehouses. Data stored in warehouses has to undergo a process of ETL (extraction, transformation and loading), but this is not the case with big data, as this data is not suitable to be stored in data warehouses[?].
Big data is also referred to as data intensive technologies, with a long tradition of working with a constantly increasing volume of data in sectors like business, social media, insurance, health care etc. Modern industry is therefore trying to develop advanced big data technologies and tools[?].
Data explosion is an inevitable trend as the world is connected more than ever. Data are generated faster than ever, and to date about 2.5 quintillion bytes of data are created daily. This speed of data generation will continue in the coming years and is expected to increase at an exponential level, according to IDC's recent survey. The above fact gives birth to the widely circulated concept called big data. But turning big data into insights demands an in depth extraction of its value, which heavily relies upon, and hence boosts, deployments of massive big data systems[1].
One of the strongest new presences in contemporary life is big data: very large data sets that may be big in volume, velocity, variability, variety and veracity. High volumes of data are generated in four areas: scientific, governmental, corporate and personal data. One implication of big data is that humans are having a wholly different concept and new way of relating to data, where formerly everything was signal, now 99% is noise, which can possibly lead to overwhelm, especially if there is a failure to adequately filter the information[11].
The emerging big data paradigm, owing to its broader impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the public in general. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, representing a doubling every two years[12].
III. EXISTING WORK
To reduce high dimensionality, various methods of feature selection / extraction have been proposed, but these methods have been utilized for extracting features from textual data, either to reduce the high dimensionality of textual data or for document clustering, and not for extracting features from a high dimensional dataset which comprises mixture data. So here we propose an approach which takes high dimensional data comprising mixture data such as textual data, numerical data, noisy data etc. and, based on some evaluation criteria, chooses a subset of features which reduces the dimensionality of the high dimensional dataset. Feature extraction and feature selection methods are also used as a preprocessing step for high dimensional data, which is not possible with tools already available in the market like
=ma0s a ata (ector 3%rom an ori-inal s0ace o%p(aria#les
to a ne1 s0ace o%p (aria#les 1hich are $ncorrelate o(er the ataset. /o1e(er, not all the 0rinci0al
com0onents nee to #e re5$ire only the %irst L0rinci0al com0onents are re5$ire, 0ro$ce #y $sin-
only the %irst Lloain- (ectors an that -i(es the tr$ncate trans%ormation as 1here the matri+ 6Lno1
has n ro1s #$t only Lcol$mns. S$ch imensionality re$ction can #e a (ery $se%$l ste0 %or (is$aliin-
an 0rocessin- hi-h4imensional atasets. *or e+am0le, selectin- LH 2 col$mns an kee0in- only the
%irst t1o 0rinci0al com0onents %ins the t1o4imensional 0lane thro$-h the hi-h4imensional ataset in
1hich the ata is most s0rea o$t, so i% the ata contains cl$sters 1hich s0rea o$t, an there%ore most
(isi#le on a t1o4imensional ia-ramI 1hereas i% t1o irections thro$-h the ata are chosen at ranom,
the clusters may be much less spread apart from each other, and may in fact substantially overlay each other, which makes them indistinguishable.
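The truncated transformation described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes plain Python lists, uses power iteration with deflation in place of a full eigendecomposition or SVD, and the function names (`center`, `covariance`, `top_eigenpair`, `truncated_pca`) and the sample data are hypothetical.

```python
import math

def center(rows):
    """Subtract the column means so every variable has zero mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(rows):
    """Sample covariance matrix (p x p) of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(mat, iters=500):
    """Dominant eigenvalue/eigenvector of a symmetric matrix (power iteration)."""
    p = len(mat)
    v = [float(j + 1) for j in range(p)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

def truncated_pca(rows, L):
    """Return (W_L, T_L): the first L loading vectors and the n x L scores."""
    X = center(rows)
    C = covariance(X)
    p = len(C)
    loadings = []
    for _ in range(L):
        lam, w = top_eigenpair(C)
        loadings.append(w)
        # Deflation: remove the component just found from the covariance.
        C = [[C[i][j] - lam * w[i] * w[j] for j in range(p)] for i in range(p)]
    # T_L = X W_L: project every centered row onto the L loading vectors.
    T = [[sum(x[j] * w[j] for j in range(p)) for w in loadings] for x in X]
    return loadings, T

# Made-up data: 5 observations of 3 variables, reduced to L = 2 columns.
data = [[2.0, 1.9, 0.1], [4.0, 4.1, -0.1], [6.0, 6.2, 0.0],
        [8.0, 7.8, 0.2], [10.0, 10.0, -0.2]]
W_L, T_L = truncated_pca(data, 2)
```

Here `T_L` has as many rows as `data` but only two columns, which is the property the text relies on for two-dimensional visualization.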
2. Multifactor Dimensionality Reduction (MDR)
Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression. The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or non-additive interactions among the attributes, such that prediction of the class variable is improved over that of the original representation of the data. Consider the following simple example using the exclusive OR (XOR) function. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.
Table 1. XOR Function
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
If the above example were solved using any data mining algorithm, that algorithm would need to approximate the XOR function in order to accurately predict Y. An alternative is to use the MDR constructive induction algorithm, which changes the representation of the data. The MDR algorithm changes the representation by selecting two attributes, here X1 and X2 from the above example. Each combination of values for X1 and X2 is examined, and the number of times Y=1 and/or Y=0 occurs is counted.
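The counting step can be sketched on the XOR table as follows. The text that followed this paragraph did not survive, so the thresholding rule used here (encode a new attribute Z as 1 when the ratio of Y=1 to Y=0 counts exceeds a fixed threshold of 1) is an assumption based on the usual description of MDR, and `mdr_construct` is a hypothetical name.

```python
from collections import Counter

# Truth table from Table 1: each row is (X1, X2, Y) with Y = X1 XOR X2.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def mdr_construct(rows, threshold=1.0):
    """Map every observed (X1, X2) combination to a new binary attribute Z.

    Z is 1 when count(Y=1) / count(Y=0) exceeds the threshold, else 0.
    The threshold rule is an assumption, not taken from the surviving text.
    """
    ones, zeros = Counter(), Counter()
    for x1, x2, y in rows:
        (ones if y == 1 else zeros)[(x1, x2)] += 1
    z = {}
    for cell in set(ones) | set(zeros):
        n1, n0 = ones[cell], zeros[cell]
        # n1 / n0 > threshold, rewritten to avoid dividing by zero.
        z[cell] = 1 if n1 > threshold * n0 else 0
    return z

z_map = mdr_construct(xor_rows)
for cell in sorted(z_map):
    print(cell, "->", z_map[cell])
```

On this table the constructed attribute Z coincides with Y, so a classifier working on Z no longer has to approximate XOR itself.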
PCA: It uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
MDR: MDR is a data mining approach which detects and characterizes the combinations of attributes or independent variables that interact to influence a class variable.
ICA: ICA finds the latent variables by maximizing the statistical independence of the estimated components.
NN: It uses multiple hidden layers for the dimensionality reduction operation, where each dimension is a logistic function of the input.
3. Merits
Table 4
PCA: Basis vectors are less expensive to compute.
MDR: It changes the representation of the data to accurately predict the class variable.
ICA: Vectors are spatially localized and statistically independent.
NN: It uses the gradient descent method to locally minimize the squared output error.
4. Demerits
Table 5
PCA: Vectors are less spatially localized.
MDR: Mining patterns with MDR from real data is computationally complex.
ICA: Vectors are neither orthogonal nor in order.
NN: Neural networks are difficult to model because a small change in a single input will affect the entire network.
B. Feature Selection Techniques
Feature selection is a process that selects a subset of original features by rejecting irrelevant and/or redundant features according to certain criteria. The relevancy of a feature is typically measured by its discriminating ability to enhance the predictive accuracy of a classifier, or the cluster goodness of a clustering algorithm. Generally, feature redundancy is defined by correlation; two features are redundant to each other if their values are correlated[2][3][5].
The feature selection process comprises four steps and can be explained through the diagram
Fig 3. Feature Selection process
1. Generation = select feature subset candidate.
2. Evaluation = compute relevancy value of the subset.
3. Stopping criterion = determine whether subset is relevant.
4. Validation = verify subset validity.
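The four steps can be read as a search loop. The sketch below is one possible reading, not the paper's procedure: it assumes a greedy forward search, and `select_features`, `evaluate`, `validate` and the toy relevance weights are all hypothetical.

```python
def select_features(features, evaluate, validate, min_gain=1e-9):
    """Greedy forward search organized around the four steps listed above."""
    selected, best_score = [], float("-inf")
    while True:
        # 1. Generation: candidates extend the current subset by one feature.
        candidates = [selected + [f] for f in features if f not in selected]
        if not candidates:
            break
        # 2. Evaluation: compute a relevancy value for every candidate subset.
        scored = [(evaluate(c), c) for c in candidates]
        score, subset = max(scored, key=lambda sc: sc[0])
        # 3. Stopping criterion: stop once no candidate improves the score.
        if score - best_score <= min_gain:
            break
        selected, best_score = subset, score
    # 4. Validation: check the chosen subset with an independent test.
    return selected, validate(selected)

# Toy evaluation: summed per-feature relevance minus a size penalty.
weights = {"a": 0.9, "b": 0.5, "c": 0.05, "d": 0.0}

def toy_evaluate(subset):
    return sum(weights[f] for f in subset) - 0.1 * len(subset)

chosen, is_valid = select_features(list(weights), toy_evaluate,
                                   lambda s: len(s) > 0)
print(chosen, is_valid)
```

With these weights the search keeps features `a` and `b` and stops when adding `c` would lower the score.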
The feature selection methods are described as follows.
1. Filter Model
It separates feature selection from classifier learning. It relies on criteria such as information, distance, dependence and consistency to evaluate features from any dataset without using any data mining algorithm. The methods of the filter model are as follows.
1. Information Gain (IG)
Information measures typically determine the information gain, or reduction in entropy, when the dataset is split on a feature.
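As an illustration of "reduction in entropy when the dataset is split on a feature", the following minimal sketch computes information gain for discrete features. It assumes base-2 Shannon entropy; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [0, 0, 1, 1]
print(information_gain(["a", "a", "b", "b"], labels))  # feature separates the classes
print(information_gain(["a", "b", "a", "b"], labels))  # feature carries no information
```

A feature that separates the classes perfectly removes all the entropy of the labels, while a feature unrelated to the labels removes none.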
2. Correlation Coefficient (CC)
The correlation coefficient is a numerical way to quantify the relationship between two features.
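The text does not say which correlation coefficient is meant; the sketch below assumes the Pearson coefficient for numeric features, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly correlated features
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly anti-correlated features
```

Values close to +1 or -1 indicate that one of the two features is largely redundant given the other.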
3. Symmetric Uncertainty (SU)
Features are selected based on the highest symmetric uncertainty values between the feature and the target classes.
Fig 4. Filter model process
2. Wrapper model
where the entropy of variable x is found by
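The formula that followed this sentence did not survive extraction. For reference, the standard definitions of entropy, conditional entropy, information gain and symmetric uncertainty are given below; that the paper used exactly this notation is an assumption.

```latex
\begin{align}
H(X) &= -\sum_{i} P(x_i)\,\log_2 P(x_i) \\
H(X \mid Y) &= -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j)\,\log_2 P(x_i \mid y_j) \\
IG(X \mid Y) &= H(X) - H(X \mid Y) \\
SU(X, Y) &= \frac{2\, IG(X \mid Y)}{H(X) + H(Y)}
\end{align}
```

SU lies between 0 and 1: a value of 1 means that knowing one variable completely determines the other, and 0 means the two are independent.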
A. Algorithm can be described as follows
Step 1 - Input the dataset, which contains the features and their values.
Step 2 - Calculate the relevance value for each feature by using the formula
Step 3 - Take the average of all relevance values.
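A minimal sketch of the three surviving steps follows. The relevance formula and the later steps are missing from the text, so two things here are assumptions: relevance is taken to be the symmetric uncertainty between a feature and the target, and a feature is kept when its relevance is at or above the average. The names `sufs`, `symmetric_uncertainty` and the toy columns are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG / (H(X) + H(Y)), with IG = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))
    return 2.0 * (hx + hy - joint) / (hx + hy)

def sufs(dataset, target):
    """Steps 1 to 3, plus an ASSUMED selection rule (keep relevance >= average).

    dataset maps a feature name to its list of discrete values.
    """
    # Step 2: relevance value of every feature with respect to the target.
    relevance = {name: symmetric_uncertainty(col, target)
                 for name, col in dataset.items()}
    # Step 3: average of all relevance values.
    average = sum(relevance.values()) / len(relevance)
    # Assumed rule: features at or above the average count as relevant.
    relevant = [name for name, r in relevance.items() if r >= average]
    irrelevant = [name for name in relevance if name not in relevant]
    return relevance, average, relevant, irrelevant

# Step 1: a made-up dataset with three discrete features and a binary target.
target = [0, 0, 1, 1, 0, 1]
toy = {
    "f_good": [0, 0, 1, 1, 0, 1],   # copies the target
    "f_const": [7, 7, 7, 7, 7, 7],  # constant, carries no information
    "f_noise": [0, 1, 0, 1, 0, 1],  # nearly independent of the target
}
relevance, average, relevant, irrelevant = sufs(toy, target)
print(relevant, irrelevant)
```

On this toy input only `f_good` clears the average, so the other two columns fall into the irrelevant set.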
Fig 7
2. Agent branch short description
Fig 8
3. Agent Client Id
Fig 9
4. Agent Full Name
Fig 10
The above dataset can be given as input to the SUFS algorithm described above, and the output would be the reduced set of relevant features which satisfy the criteria.
B. Output
Fig 11. Relevant Features
Fig 12. Irrelevant Features
VII. CONCLUSION
presented a symmetric uncertainty feature selection (SUFS) method which is suitable for reducing a high dimensional dataset, and we obtained satisfactory results.
As further work, the relevant features output shown above can be given to any clustering / classification / regression algorithm for further processing.