
Feature Extraction From Big Data

Feb 24, 2018


International Journal of Advance Foundation and Research in Computer (IJAFRC)
Volume 2, Issue 10, October - 2015. ISSN 2348 - 4853, Impact Factor - 1.317
© 2015, IJAFRC All Rights Reserved www.ijafrc.org

Aarti B. Sahitya*, Dr. Mrs. M. Vijayalakshmi
PG Scholar, V.E.S.I.T, Chembur; Professor, V.E.S.I.T, Chembur
aarti.sahitya@ves.ac.in, m.vijayalakshmi@ves.ac.in

ABSTRACT

Dimensionality reduction, feature extraction and feature selection are important concepts in data mining whenever the size of high-dimensional data has to be reduced. The main aim of this research is to find an applicable methodology that reduces the ever-increasing volume of data. In this paper we describe various feature extraction and feature selection methods and propose a scheme that selects a subset of the original features based on some evaluation criteria and thereby reduces the volume of a high-dimensional dataset.

Index Terms: Feature Extraction, Feature Selection, Dimensionality Reduction, Big Data, Data Visualization, High Dimensional Data.

I. INTRODUCTION

A dataset is a collection of homogeneous objects. An object is an instance in the dataset, and a dimension is a property that defines an object. Dimensionality reduction is the process in which, at each step, irrelevant dimensions are removed without substantial loss of information and without affecting the final output. Feature extraction is the process of deriving a new, reduced set of features from the original features through some transformation of the attributes. Feature selection is a process that chooses an optimal subset of features according to an objective function. Because large real-world datasets contain irrelevant, redundant and noisy dimensions, dimensionality reduction should be considered as a pre-processing step before clustering, classification or regression algorithms are applied to them.

A. High Dimensionality Data Reduction Challenges In Big Data

Big data is relentless. It is generated continuously and on a massive scale by online interactions between people and systems and by sensor-enabled devices. It can be related, linked and integrated to provide highly detailed information, and that level of detail makes it possible for banks, health care and public safety organisations to provide specific services. It is creating new businesses and transforming new and traditional markets, which creates new business trends, and it therefore poses a challenge to the statistical community. Additional information can be obtained from a single large dataset as opposed to separate smaller sets; this allows correlations to be found, for instance to spot business trends. Big data involves increasing volume (the amount of data), velocity (the speed at which data flows in and out) and variety (the range of data types and sources), and this requires new forms of processing for decision making. It produces massive sample sizes that allow us to discover hidden patterns associated with small subsets of a big dataset.

High-dimensional big data has special features such as noise accumulation and spurious correlation. Spurious correlation occurs because many uncorrelated random variables may have high sample correlation coefficients in high dimensions, and such correlation leads to wrong inferences. High-dimensional data is generated in sectors such as biotechnology, finance and satellite imagery, and as customer financial data. Such data can be stored in the form of a data matrix: web term-document data, sensor array data, consumer financial data, etc. It is computationally infeasible to make inferences directly from the raw data. To handle big data from both the statistical and the computational points of view, dimension reduction is an important step before processing begins. High-dimensional data can be analyzed through classification,


clustering and regression, but these methods alone are not sufficient for processing big data. The challenges in reducing high-dimensional data are as follows:

1. No formal mathematical models are available.
2. Where models are available, a proper derivation often is not, which leaves developers unsure how to reduce the features of a high-dimensional dataset.
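
The spurious correlation effect mentioned in this subsection can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not part of the original study: it draws independent Gaussian variables with Python's standard library and reports the largest absolute sample correlation between a random target and any of the unrelated variables. The function names, the sample size of 30 and the seed are hypothetical choices.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def max_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a random target and independent random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is generated independently of the target.
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, feature)))
    return best

if __name__ == "__main__":
    # More unrelated dimensions give more chances for a large sample correlation.
    for p in (10, 100, 1000):
        print(p, round(max_spurious_correlation(30, p), 3))
```

With a fixed seed the first variables are identical across runs, so the reported maximum can only stay the same or grow as the number of dimensions increases, even though no variable is truly related to the target.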

B. Need for Dimensionality Reduction

Dimensionality reduction is required because most machine learning and data mining techniques may not be effective for high-dimensional data: query accuracy and efficiency degrade rapidly as the dimension increases. It is also required for:

Visualization - projection of high-dimensional data onto 2D or 3D.
Data compression - efficient storage and retrieval.
Noise removal - positive effect on query accuracy.

II.


procedures not suitable for high dimensional data. Feature selection and feature extraction methods have been used to select a subset of the original features based on some evaluation criteria [?].

Data is growing at a huge speed, making it difficult to handle such large amounts of data (exabytes). The main difficulty is that the volume is increasing rapidly in comparison to the computing resources. Big data can be defined by properties such as variety, volume, velocity, variability, value and complexity. Big data is different from the data stored in traditional warehouses. Data stored in warehouses has to undergo a process of ETL (extraction, transformation and loading), but this is not the case with big data, as this data is not suitable to be stored in data warehouses [?].

Big data is also referred to as data-intensive technologies, with a long tradition of working with constantly increasing volumes of data in sectors like business, social media, insurance, health care, etc. Modern industry is therefore trying to develop advanced big data technologies and tools [?].

Data explosion is an inevitable trend as the world is more connected than ever. Data are generated faster than ever, and to date about 2.5 quintillion bytes of data are created daily. This speed of data generation will continue in the coming years and is expected to increase at an exponential level, according to IDC's recent survey. This fact gives birth to the widely circulated concept called big data. But turning big data into insights demands an in-depth extraction of its value, which heavily relies upon, and hence boosts, deployments of massive big data systems [1].

One of the strongest new presences in contemporary life is big data: very large data sets that may be big in volume, velocity, variability, variety and veracity. High volumes of data are generated in four areas: scientific, governmental, corporate and personal data. One implication of big data is that humans are acquiring a wholly different concept and a new way of relating to data. Where formerly everything was signal, now 99% is noise, which can possibly lead to overwhelm, especially if there is a failure to adequately filter the information [11].

The emerging big data paradigm, owing to its broader impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the public in general. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, representing a doubling every two years [12].

III. EXISTING WORK

Various feature selection and extraction methods have been proposed to reduce high dimensionality, but they have been used to extract features from textual data, to reduce the high dimensionality of textual data, or for document clustering, and not to extract features from a high-dimensional dataset that comprises mixed data. We therefore propose an approach that takes high-dimensional data comprising a mixture of textual data, numerical data, noisy data, etc. and, based on some evaluation criteria, chooses a subset of features that reduces the dimensionality of the dataset. Feature extraction and feature selection methods are also used as a pre-processing step for high-dimensional data, which is not possible with tools already available in the market like


The PCA transformation $T = XW$ maps a data vector $x$ from an original space of $p$ variables to a new space of $p$ variables which are uncorrelated over the dataset. However, not all the principal components are required. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L = X W_L$, where $X$ is the $n \times p$ data matrix, $W_L$ holds the first $L$ loading vectors as columns, and the matrix $T_L$ now has $n$ rows but only $L$ columns. Such dimensionality reduction can be a very useful step for visualizing and processing high-dimensional datasets. For example, selecting $L = 2$ and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible on a two-dimensional diagram; whereas if two directions through the data are chosen at random,


the clusters may be much less spread apart from each other, and may in fact substantially overlay each other, which makes them indistinguishable.
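
The truncated transformation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses only the Python standard library and finds the first L loading vectors by power iteration with deflation on the sample covariance matrix, where a production system would more likely use an SVD routine. All function names, the iteration count and the seed are hypothetical.

```python
import math
import random

def center(X):
    """Subtract column means so that every variable has zero mean."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix (p x p) of already centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def loading_vectors(C, L, iters=500, seed=0):
    """First L loading vectors of C via power iteration with deflation."""
    p = len(C)
    C = [row[:] for row in C]          # work on a copy
    rng = random.Random(seed)
    W = []
    for _ in range(L):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:           # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * C[i][j] * v[j]
                         for i in range(p) for j in range(p))
        # Deflation: remove the direction just found before the next search.
        for i in range(p):
            for j in range(p):
                C[i][j] -= eigenvalue * v[i] * v[j]
        W.append(v)
    return W

def pca_scores(X, L):
    """Truncated transformation T_L = X W_L (n rows, L columns) on centered X."""
    Xc = center(X)
    W = loading_vectors(covariance(Xc), L)
    return [[sum(x * w for x, w in zip(row, vec)) for vec in W] for row in Xc]
```

Calling `pca_scores(X, 2)` on a list of rows returns one two-element score per row, which is the two-dimensional view discussed in the text. The sign of each component is arbitrary, as in any PCA.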

2. Multifactor Dimensionality Reduction (MDR)

Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression. The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or non-additive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data. Consider the following simple example using the exclusive OR (XOR) function. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1. XOR Function

X1   X2   Y
0    0    0
0    1    1
1    0    1
1    1    0

If the above example is given to any data mining algorithm, that algorithm needs to approximate the XOR function in order to predict Y accurately. An alternative is to use the MDR constructive induction algorithm, which changes the representation of the data. The MDR algorithm does so by selecting two attributes, here X1 and X2. Each combination of values for X1 and X2 is examined, and the number of times Y = 1 and/or Y = 0 is counted.
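
The counting step above can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the usual MDR rule that a combination of values is encoded as 1 when the ratio of Y = 1 counts to Y = 0 counts exceeds a fixed threshold (1 here) and as 0 otherwise. The threshold, the function name `mdr_construct` and the variable names are assumptions.

```python
from collections import defaultdict

def mdr_construct(rows, threshold=1.0):
    """Map each (x1, x2) combination to a single new binary attribute.

    rows is an iterable of (x1, x2, y) tuples with y in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])   # cell -> [count of Y=0, count of Y=1]
    for x1, x2, y in rows:
        counts[(x1, x2)][y] += 1
    mapping = {}
    for cell, (n0, n1) in counts.items():
        # Ratio of Y=1 to Y=0 counts; treated as infinite when Y=0 never occurs.
        ratio = n1 / n0 if n0 else float("inf")
        mapping[cell] = 1 if ratio > threshold else 0
    return mapping

# The XOR dataset from Table 1.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
z = mdr_construct(xor_rows)
```

For the XOR data the constructed attribute `z` reproduces Y for every combination of X1 and X2, so a single attribute now carries the interaction that neither X1 nor X2 shows alone.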


Method   Description
PCA      It uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
MDR      A data mining approach that detects and characterizes the combinations of attributes or independent variables that interact to influence a class variable.
ICA      It finds the latent variables by maximizing the statistical independence of the estimated components.
NN       It uses multiple hidden layers for the dimensionality reduction operation, where each dimension is a logistic function of the input.

3. Merits

Table 4

Method   Merit
PCA      Basis vectors are less expensive to compute.
MDR      It changes the representation of the data to accurately predict the class variable.
ICA      Vectors are spatially localized and statistically independent.
NN       It uses the gradient descent method to locally minimize the squared output error.

4. Demerits

Table 5

Method   Demerit
PCA      Vectors are less spatially localized.
MDR      Mining patterns with MDR from real data is computationally complex to implement.
ICA      Vectors are neither orthogonal nor in order.
NN       Neural networks are difficult to model because a small change in a single input will affect the entire network.

B. Feature Selection Techniques

Feature selection is a process that selects a subset of the original features by rejecting irrelevant and/or redundant features according to certain criteria. The relevancy of a feature is typically measured by its discriminating ability to enhance the predictive accuracy of a classifier or the cluster goodness of a clustering algorithm. Generally, feature redundancy is defined by correlation; two features are redundant to each other if their values are correlated [2][3][5].

The feature selection process comprises four steps, which can be explained through a diagram.


Fig 3. Feature Selection process

1. Generation = select feature subset candidate.
2. Evaluation = compute relevancy value of the subset.
3. Stopping criterion = determine whether subset is relevant.
4. Validation = verify subset validity.
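
The four steps above can be sketched as a generic loop. The sketch makes assumptions the paper does not state: candidates are generated by greedy forward search, the evaluation function is supplied by the caller, and validation is a simple check against a minimum score on held-out data. The names `select_features`, `validate`, `evaluate` and `evaluate_holdout` are hypothetical.

```python
def select_features(features, evaluate, max_size=None):
    """Greedy forward search following generation, evaluation and stopping."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = len(remaining) if max_size is None else max_size
    while remaining and len(selected) < limit:
        # 1. Generation: propose every subset obtained by adding one feature.
        candidates = [selected + [f] for f in remaining]
        # 2. Evaluation: compute the relevancy value of each candidate subset.
        scored = [(evaluate(c), c) for c in candidates]
        score, subset = max(scored, key=lambda pair: pair[0])
        # 3. Stopping criterion: stop when no candidate improves the score.
        if score <= best_score:
            break
        selected, best_score = subset, score
        remaining = [f for f in remaining if f not in selected]
    return selected, best_score

def validate(subset, evaluate_holdout, minimum):
    """4. Validation: accept the subset only if it scores well on held-out data."""
    return evaluate_holdout(subset) >= minimum
```

In a filter model `evaluate` would be a data-only criterion such as the measures described next, while in a wrapper model it would train and score a learning algorithm on the candidate subset.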

The feature selection methods are described as follows.

1. Filter Model

It separates feature selection from classifier learning. It relies on four types of criteria (information, distance, dependence and consistency) to evaluate the features of any dataset without using any data mining algorithm. Methods of the filter model are as follows.

1. Information Gain (IG)

Information measures typically determine the information gain, or reduction in entropy, when the dataset is split on a feature.
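
As an illustration, information gain for a discrete feature can be computed as below. This is a minimal sketch assuming Shannon entropy with base-2 logarithms and categorical feature values; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy when the data is split on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that separates the classes perfectly gains the full entropy of the labels, while a feature unrelated to the classes gains nothing.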

2. Correlation Coefficient (CC)

The correlation coefficient is a numerical way to quantify the relationship between two features.
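
One way to use the correlation coefficient in a filter is to flag pairs of features that are strongly correlated and therefore redundant. The sketch below assumes the Pearson coefficient and a cut-off of 0.9; both the cut-off and the function names are illustrative assumptions.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def redundant_pairs(columns, threshold=0.9):
    """Pairs of feature names whose absolute correlation reaches the threshold.

    columns maps a feature name to its list of numeric values.
    """
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(correlation(columns[a], columns[b])) >= threshold]
```

For each flagged pair, one of the two features can be dropped with little loss of information.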

3. Symmetric Uncertainty (SU)

Features are selected based on the highest symmetric uncertainty values between the feature and the target classes.
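
A minimal sketch of the measure follows, assuming the standard definition SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)) for discrete variables, with the information gain obtained from the joint entropy. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """Symmetric uncertainty between two discrete sequences, in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0                       # both variables are constant
    hxy = entropy(list(zip(x, y)))       # joint entropy H(X, Y)
    gain = hx + hy - hxy                 # information gain (mutual information)
    return 2.0 * gain / (hx + hy)
```

A value of 1 means that either variable completely predicts the other, and 0 means that the two are independent.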

Fig 4. Filter model process

2. Wrapper model


where the entropy of variable x is found by
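
For reference, the standard definitions of entropy, information gain and symmetric uncertainty are given below. They are assumed here, and the paper's own notation may differ.

```latex
\[
H(X) = -\sum_{i} P(x_i)\,\log_2 P(x_i), \qquad
H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j)\,\log_2 P(x_i \mid y_j)
\]
\[
IG(X \mid Y) = H(X) - H(X \mid Y), \qquad
SU(X, Y) = \frac{2\, IG(X \mid Y)}{H(X) + H(Y)}
\]
```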

A. Algorithm can be described as follows

Step 1 - Input the dataset, which contains features and their values.
Step 2 - Calculate the relevance value for each feature by using the formula
Step 3 - Take the average of all relevance values.
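
The steps listed above can be sketched as follows. The remaining steps of the algorithm are not available here, so the final split is an assumption: features whose relevance is at least the average are treated as relevant and the rest as irrelevant. The relevance measure is passed in as a function (for example, symmetric uncertainty with the class), and the names `select_by_average` and `relevance_fn` are hypothetical.

```python
def select_by_average(columns, target, relevance_fn):
    """Split features into relevant and irrelevant around the mean relevance.

    columns maps a feature name to its list of values (Step 1),
    target holds the class value of each row, and
    relevance_fn(values, target) returns a relevance score.
    """
    # Step 2: relevance value of each feature.
    relevance = {name: relevance_fn(values, target)
                 for name, values in columns.items()}
    # Step 3: average of all relevance values.
    average = sum(relevance.values()) / len(relevance)
    # Assumed continuation: compare each feature with the average.
    relevant = [name for name in columns if relevance[name] >= average]
    irrelevant = [name for name in columns if relevance[name] < average]
    return relevant, irrelevant, average
```

Using the average as the cut-off avoids a hand-tuned threshold, at the cost of always discarding some features whenever the relevance values differ.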


Fig 7

2. Agent branch short description

Fig 8

3. Agent Client Id

Fig 9

4. Agent Full Name

Fig 10

The above dataset can be given as input to the SUFS algorithm described above; the output is the reduced set of relevant features that satisfy the criteria.

B. Output


Fig 11. Relevant Features

Fig 12. Irrelevant Features

VII. CONCLUSION


presented a Symmetric Uncertainty Feature Selection (SUFS) method, which is suitable for reducing high-dimensional datasets, and we obtained satisfactory results.

Further, the relevant features shown above can be given to any clustering, classification or regression algorithm for further processing.