
Project Deliverable

D5.1 Visualization System Infrastructure and Requirement Analysis

Project Number: 700692
Project Title: DiSIEM – Diversity-enhancements for SIEMs
Programme: H2020-DS-04-2015
Deliverable type: Report
Dissemination level: PU
Submission date: August 31, 2017 (M12)
Responsible partner: City
Editor: Cagatay Turkay
Revision: 1.0

The DiSIEM project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700692.


Editor
Cagatay Turkay, City
Phong Nguyen, City

Contributors
Cagatay Turkay, City
Phong Nguyen, City
Gennady Andrienko, Fraunhofer IAIS
Natalia Andrienko, Fraunhofer IAIS
Michael Kamp, Fraunhofer IAIS


Executive Summary

This report documents the results of the activities carried out within WP5 to define the scope of the visualisation design and visualisation-related development that will take place for the rest of the project. As its primary focus, the deliverable presents the results of the systematic investigation conducted to understand domain-specific problems, data characteristics, user needs, analytical tasks, and high-level goals within the core use cases in DiSIEM that are best served by a visual analytics approach. A review of existing visualisation architectures from a technical capability and compatibility perspective is also presented. Moreover, an in-depth review of related visualisation literature that can inform the design and development activities is reported in conjunction with the presented use cases. Finally, a series of preliminary prototypes and initial design sketches produced as part of the iterative, prototype-led, and user-centred design and development activities is presented.

The work carried out so far within this work package has resulted in a number of outcomes. The first is an in-depth investigation and documentation of the requirements and the domain characterisation involving two core use cases: user behaviour modelling and visual diversity analysis. The second outcome is a set of analytical methods that help to computationally model and analyse the data involved in these use cases. A third outcome involves prototypes and design iterations built to provide initial solutions to the problems and tasks identified in the use cases. The report includes screenshots and analysis examples from the interactive prototype software as first iterations of the tools we plan to develop further within DiSIEM.

To summarise, the core focus of this report is to:

• Identify well-scoped use cases where visualisation and visual analytics can enhance the overall analytical process;

• Systematically discuss and present the characteristics of the problem domains identified through well-defined analytical tasks, goals, and usage scenarios;

• Document the results of the visualisation architecture and technology review;

• Present the results of the initial prototyping stages focused on the identified use cases;

• Discuss the related works and literature on relevant areas in visualisation and data analysis.

By documenting the results listed above, the report will act as guidance for the further visualisation-related development work taking place in the project. The report will set the scope and serve as a framework to inform further design decisions and development.


Table of Contents

1 Introduction
  1.1 Objectives of the Document
  1.2 Organization of the Document
2 Visualisation Architecture
  2.1 Visualisation Libraries
  2.2 Web Frameworks
  2.3 Integration Plan
3 Understanding User Requirements
  3.1 Methodology
  3.2 Use Cases
    3.2.1 User Behaviour Modelling
    3.2.2 Visual Diversity Analysis
4 Initial Designs
  4.1 User Behaviour Modelling
    4.1.1 State of the Art
    4.1.2 Visualisation Designs
    4.1.3 Analytical Approaches
  4.2 Diversity Visual Analysis
    4.2.1 State of the Art
    4.2.2 Initial Design Investigations and Sketches
5 Summary and Conclusions
References


List of Figures

Figure 1 – Nested model for visualisation design process by Munzner [3]
Figure 2 – Stages of the full design study process as modelled by Sedlmair et al. [4]
Figure 3 – Integration plan for SIEM systems
Figure 4 – Our “prototype-led” “iterative” approach to the requirement gathering and development process
Figure 5 – Different sources of signals and data utilised in the detection of anomalous behaviour within the SKEPTIC framework by Amadeus
Figure 6 – Distribution of session anomaly scores for a duration of 24 hours as visualized by the existing Kibana-based dashboard
Figure 7 – Screenshot of a table widget used to list the sessions with high scores. Notice how quickly this representation becomes hard to read with increased numbers of sessions to investigate
Figure 8 – Existing Kibana-based dashboard providing the details and break-down of features related to a single user/session
Figure 9 – A demonstration by analysts
Figure 10 – TimeSets technique by Nguyen et al. [13] to display temporal events grouped according to their semantic similarity over time
Figure 11 – Icicle-plot like view in Lifeflow [14] to provide overviews of event sequences
Figure 12 – EventFlow by Monroe et al. [15] incorporates interactive visuals for searching and aligning event sequences to support explorative analysis more effectively
Figure 13 – Wang et al. [24] investigate the use of linked representations of patterns and sequences
Figure 14 – Sankey diagrams are already widely adopted to highlight flows – from the Google Analytics dashboard (http://www.sankey-diagrams.com/google-analytics-use-sankey-diagrams/)
Figure 15 – Branching events in sequences can be explicitly visualized to indicate high-level changes in patterns as exemplified here by the CoreFlow technique [25]
Figure 16 – Representation of a sequence of actions
Figure 17 – A sequence tree showing relationship of sequences
Figure 18 – Sequences starting with 'SearchUser'
Figure 19 – An interesting pattern about 'refresh' actions
Figure 20 – A simple bar chart of users
Figure 21 – A basic timeline of actions
Figure 22 – Timeline with overlapping actions
Figure 23 – Timeline with actions located in different rows to avoid overlapping
Figure 24 – Timeline with relative temporal order to avoid overlapping of actions
Figure 25 – Timeline with repeating actions shown in a normal way
Figure 26 – Timeline with repeating actions shown in an aggregated way to highlight the patterns
Figure 27 – Visualisation design that displays the mined patterns along with their statistics
Figure 28 – A 'ring' representation of a sequence of actions. Each ring represents an action type, ordering outward
Figure 29 – The scatterplot of all the sessions indicates an “interesting” session with high levels of activity
Figure 30 – Switching to different axes, the plot now reveals sessions that might require further investigation – short but very active sessions and long sessions with limited levels of activity
Figure 31 – Contrasting the frequency of sequences to scores reveals patterns that are consistently rated as anomalous despite being frequent
Figure 32 – Timeline with highest level of detail. Actions are shown individually
Figure 33 – Timeline with the second highest level of detail. Consecutive actions having the same type are aggregated
Figure 34 – Timeline with the third highest level of detail. Activities are shown instead of actions
Figure 35 – Timeline with the lowest level of detail. Consecutive activities are aggregated
Figure 36 – An overview of all the sessions executed by a single user using our simplification scheme to fit several sessions in a limited amount of space
Figure 37 – Overview of sessions that can be colour-coded using lightness and grouped according to any criteria set interactively by the user according to the analytical needs
Figure 38 – Comparison to “expected” behaviour is performed algorithmically according to an internal formula and the results are colour coded to highlight interesting sections of sessions
Figure 39 – Clustering results are viewed and explored using a two-dimensional projection of the set of actions based on the semantic distances between them. The images present the results of three steps of progressive density-based clustering with increasing values of the neighbourhood radius R
Figure 40 – The full set of action clusters obtained in 8 steps of the progressive clustering procedure; 37 clusters in total. The grey dots are the “noise”, i.e., the actions that were not included in any cluster. The other colours represent cluster membership
Figure 41 – Combination of Cluster Assignment Model (CAM) and Task Transition Model (TTM) on a novel user’s task sequence. The tasks are obtained by replacing actions with the task assigned to the cluster the action belongs to
Figure 42 – Confusion matrix for the prediction of 19458 task transitions of normal users from the LSS datasets and an equal number of anomalous users generated by assuming random task transitions
Figure 43 – Small multiples employed in this work by Turkay et al. [33] in communicating the variations in the fingerprint matching performance within biometric device and matching algorithm combinations
Figure 44 – Advanced interaction mechanisms enable analysts to dynamically generate visual representations that make use of underlying computational tools. In this example, statistical summaries are dynamically computed and rendered in response to two different interaction patterns for two different cities, where each small multiple depicts the variation in a single data feature [38]
Figure 45 – A taxonomy of visual variables used in the depiction of uncertainty in visual representations [42]
Figure 46 – Alternative chart designs have been offered and evaluated for the depiction of variation and uncertainty in aggregated statistics, challenging the ways that such information is communicated conventionally [43]
Figure 47 – Importance of a large number of features within classifier models is depicted following an ensemble run and cross-validation of models to inform feature selection tasks in this work by Krause et al. [48]
Figure 48 – Sketchy prototype to address the common tasks identified within Stage-1 of the diversity analysis process
Figure 49 – Linearised representations have been adopted to represent set membership relations where accompanying statistics on sets are visualized in integration in this Upset technique by Lex et al. [49] and we’ll consider such representations within the analysis for Stage-1 with the aim of extending for temporal variations
Figure 50 – A depiction of a confusion matrix
Figure 51 – An example ROC curve that depicts the results of the diversity analysis activities held within WP3
Figure 52 – Sketchy prototype to address the common tasks identified within Stage-2 of the diversity analysis process
Figure 53 – Alternative representations such as Time Curves [50] and Connected Scatterplots [51] will be considered as alternatives for displaying variation over time within more conventional ROC curves
Figure 54 – Sketchy prototype to address the common tasks identified within Stage-3 of the diversity analysis process


1 Introduction

The goal of the visualisation work-package (WP5) is to design and develop visualisation solutions for supporting better decision processes for security analysts. Visualisation systems support analysts with representations that enable them to carry out particular tasks more effectively [1]. This is often done by exploiting the perceptual, cognitive, and creative capabilities of humans, which helps to interpret and augment the insights gained through computational analysis [2]. Given the human-oriented nature of visualisation systems, delivering effective visualisation-based approaches relies on an in-depth understanding of users’ needs, in particular those requirements that can best be catered for by a visualisation solution [1].

1.1 Objectives of the Document

In her seminal paper describing the Nested Model for visualisation design [3], Munzner states that “a central tenet of human-centred design is that the problems of the target audience need to be clearly understood by the designer of a tool for that audience”. Motivated by this widely accepted view on visualization design, we adopt such a human-centred approach whilst designing and developing the visualisation solutions in this project. The Nested Model (as summarised by Figure 1) approaches the design and validation of a visualisation system in four nested stages.

This structured approach considers the development of a visualisation system from the early requirement analysis phase all the way to the deployment/validation phase: domain characterisation → abstraction → design → implementation. The process starts with understanding domain problems, then abstracting the derived knowledge using a domain-independent language so that a solution can be designed effectively and implemented as a working system. Within the context of this report, however, we primarily focus on the earlier domain understanding phases (Chapter 3) but also present several initial design investigations (Chapter 4). To summarise this goal, our first objective with the activities leading to this report is to characterise the domain in terms of its needs, tasks, and aspirations with a thorough understanding of the users involved (Objective 1).

Figure 1 – Nested model for visualisation design process by Munzner [3]


Our overall approach in delivering this objective and designing the visualisation-based solutions within DiSIEM can be, at a high level, put into the context of the Nested Model [3]. In this nested design approach (Figure 1), the first stage – characterisation – is where the designer/developer learns about the target domain (in the case of DiSIEM, this domain can be broadly considered as cybersecurity or SIEMs). The outcome of this stage is usually a list of questions/needs/tasks that the domain experts aspire to accomplish. These questions are often stated at a low level using the vocabulary of the target domain and relate to the critical tasks the experts do and/or need to do more effectively. The second stage – abstraction – is where designers (in DiSIEM, these are the visualisation researchers from CITY and Fraunhofer IAIS) map the domain-specific vocabulary to a higher, generic level that resonates with computer science terminology and relates to particular operations that need to be accomplished. The resulting abstractions inform the development of particular features and tools, and enable designers to justify their design decisions. The third and fourth stages move into the actual design and implementation phases where algorithmic and visual approaches are developed. The results of the earlier phases play a critical role in shaping and informing the design and implementation phases. One important characteristic to highlight regarding the design and implementation phases is that the human-centred nature of the approach is still at the core of the activities. In later parts of the report (Chapter 4), we present a series of early attempts on the two use cases, user-behaviour modelling and diversity analysis, and document how we adopt a number of methods to accomplish such user-centred thinking within DiSIEM.

Other works in visualisation research have gone further in developing the Nested Model and in characterising and describing the intermediate stages (as summarised in Figure 2) that designers and their collaborators (i.e., users) go through [4]–[6], or have devised extended taxonomies of tasks [5], [7] to help build new solutions. Wherever possible, and to the extent it helps us to frame the design and development activities, we will inform and position our efforts within such theoretical design frameworks to ensure successful and effective end results within the context of DiSIEM.

Figure 2 – Stages of the full design study process as modelled by Sedlmair et al. [4].


A second objective in this first phase of WP5, in addition to understanding the user-side requirements, is to carry out an investigation on the available visualisation technologies and software infrastructures (Objective 2). Such an investigation not only enables us to develop the visualisation prototypes effectively but also ensures that the resulting tools can work in harmony with the other components developed within DiSIEM and the SIEM systems we investigate in this project. To accomplish this, we review existing technologies for building visualisation systems and evaluate them in terms of their capability, agility and suitability for integration with other components and existing SIEM systems.

A final objective (Objective 3) we have in the initial planning phase of the visualisation development is to identify algorithmic and analytical methods that enable us to effectively deliver the tasks we identify and to address the various technical challenges we face through automated algorithmic approaches. This exercise is of great importance given that we are developing visual analytics [2] tools that integrate techniques from fields such as statistics, machine learning, or data mining [8]. When designing solutions and developing prototypes for a particular task or requirement, we consider what parts of this task, if any, can be addressed algorithmically and what parts can best benefit from the involvement of a human expert through interaction and visualisation. We will be going through this process mostly during the prototype development as part of Tasks 5.2 and 5.3. However, within the context of the use cases identified in this report (Section 3.2), we highlight and list those analytical approaches.

Framed by the aforementioned objectives, this report documents the results of the activities carried out within WP5 (primarily as part of Task 5.1) to define the scope of the visualisation design and development that is going to take place for the rest of the project. In addition to documenting the results of the visualisation architecture review and the domain understanding activity, we present preliminary designs and initial prototypes whose development has started following the user-centred approach described above.

1.2 Organization of the Document

The rest of the report is organised as follows:

• In Chapter 2, we review and evaluate the available visualisation architectures and web frameworks, and discuss these systems within the context of our integration plan (as also presented within Deliverable 6.1).

• Chapter 3 documents the domain understanding and characterisation activities. We first introduce the methodology applied through a discussion of widely-used user-centred requirement gathering techniques and then present two core use cases that have been identified as the main streams of visualisation design activity that will take place in this work package.

• Chapter 4 presents the preliminary design attempts and the results of the initial iterations of user-centred prototype-led design activities. These cases are in line with the visualisation-heavy components as documented within WP2 as part of Deliverable 2.2.

• The report closes with discussions and conclusions in Chapter 5.


2 Visualisation Architecture

In this project, we use web-based technologies (JavaScript, HTML5, CSS3) to develop visualisation tools because of the popularity of both web-based visualisations and web-based dashboards in notable SIEM systems (McAfee, QRadar, OSSIM/USM, XL-SIEM, Splunk and Elastic-based). In order to make an informed decision on the visualisation architecture, we review visualisation libraries and web frameworks that streamline the development process and handle complexity.

2.1 Visualisation Libraries

Numerous approaches are available to generate (web-based) visualisations, ranging from low-level Graphics APIs to interactive charting software. The former allow developing visualisations of very high complexity but at the cost of an enormous coding effort. The latter does not require writing any code but only provides a limited number of popular visualisation options. We review notable visualisation libraries along this spectrum.

Graphics APIs. Graphics Application Programming Interfaces provide a set of low-level commands for rendering graphics, such as pixel and texture manipulation. OpenGL¹, one of the most common and high-performing libraries, is often used in developing 3D games and movies. WebGL², derived from OpenGL and designed to run in modern browsers, can be a good choice for developing graphically complex visualisation tools. Processing³ (Processing.js⁴ for the web version) specialises in visualisation and allows rapid prototyping. Two native web-based graphics APIs are SVG⁵ and canvas⁶. SVG is an XML file format designed to create vector graphics and is therefore well suited to scalable and highly detailed visualisation. Canvas, introduced in HTML5, is more suitable if the visualisation is complex.

Core libraries. This class provides wrappers of low-level graphics APIs designed for faster and easier development. Paper.js⁷ is a popular library based on canvas, whereas raphael.js is its counterpart based on SVG. However, the most notable core library is D3.js⁸. Besides the excellent concept of mapping between data and DOM elements, D3 provides a rich set of layout algorithms and flexibility in developing highly interactive visualisations. Besides these general-purpose libraries, specific-purpose ones are also available, such as for maps (polymaps⁹, leaflet¹⁰), network/graph (sigma¹¹, cytoscape¹²) and high-dimensional data (crossfilter¹³).

1 "OpenGL - The Industry Standard for High Performance Graphics." https://www.opengl.org/. Accessed 31 Jul. 2017.
2 "WebGL - Web APIs | MDN." 14 Jun. 2017, https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API. Accessed 31 Jul. 2017.
3 "Processing.org." https://processing.org/. Accessed 31 Jul. 2017.
4 "Processing.js." http://processingjs.org/. Accessed 31 Jul. 2017.
5 "W3C SVG Working Group." https://www.w3.org/Graphics/SVG/. Accessed 31 Jul. 2017.
6 "Canvas API - Web APIs | MDN." 15 Jun. 2017, https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API. Accessed 31 Jul. 2017.
7 "Paper.js." http://paperjs.org/. Accessed 31 Jul. 2017.
8 "D3.js - Data-Driven Documents." https://d3js.org/. Accessed 31 Jul. 2017.
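To illustrate the mapping between data and DOM elements mentioned above, the following is a minimal plain-JavaScript sketch of a keyed data join, the idea that underlies D3's enter/update/exit model. It deliberately does not use D3 itself; the function `dataJoin`, the session records, and the field names `id` and `score` are hypothetical and chosen only for illustration.

```javascript
// Plain-JavaScript illustration of a keyed data join (no D3 required).
// `existing` stands in for elements already on the page,
// `data` for the new dataset, and `key` extracts a stable identifier.
function dataJoin(existing, data, key) {
  const existingByKey = new Map(existing.map((el) => [key(el.datum), el]));
  const seen = new Set();
  const enter = [];  // data items that have no matching element yet
  const update = []; // existing elements paired with their refreshed datum
  for (const d of data) {
    const k = key(d);
    seen.add(k);
    const el = existingByKey.get(k);
    if (el) {
      update.push({ ...el, datum: d });
    } else {
      enter.push(d);
    }
  }
  // Elements whose key no longer appears in the data.
  const exit = existing.filter((el) => !seen.has(key(el.datum)));
  return { enter, update, exit };
}

// Hypothetical session records (assumed shape, not taken from the project data).
const onPage = [
  { tag: "circle", datum: { id: "s1", score: 0.2 } },
  { tag: "circle", datum: { id: "s2", score: 0.9 } },
];
const incoming = [
  { id: "s2", score: 0.7 },
  { id: "s3", score: 0.4 },
];

const result = dataJoin(onPage, incoming, (d) => d.id);
console.log(result.enter.map((d) => d.id));          // ids that need new elements
console.log(result.update.map((el) => el.datum.id)); // ids whose elements are refreshed
console.log(result.exit.map((el) => el.datum.id));   // ids whose elements are removed
```

In a D3-based tool the three groups would typically drive element creation, attribute updates, and removal respectively; here they are returned as plain arrays so the idea can be inspected without a browser.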

Charting libraries. This class provides a set of out-of-the-box yet customisable charts. They can be built on the aforementioned core libraries, such as nvd3¹⁴ (based on D3) and dc.js¹⁵ (based on D3 and crossfilter), or directly on low-level Graphics APIs, such as Google charting tools¹⁶ (based directly on SVG). Vega¹⁷ can be considered somewhere in between core and charting libraries. It wraps D3 in the form of a declarative language and thus requires less coding than D3 (it actually takes a JSON input) while offering more flexibility than out-of-the-box charts.

Charting software. This class includes commercial data visualisation and analysis software packages such as Tableau¹⁸, Qlik¹⁹ and Kibana²⁰. They allow generating visualisations interactively through their graphical interface. However, they offer very limited options for customising the visualisations.

A comparative evaluation: These libraries are presented in increasing order of ease-of-use but decreasing order of expressiveness. In this project, we will develop novel, highly interactive visualisations that require fine-grained designs and customisations. Thus, out-of-the-box charting libraries and software packages are not appropriate. We also want to follow an agile approach in development, so writing code from scratch using low-level Graphics APIs is also not ideal. As a result, core libraries are the most suitable choice. We decide to adopt D3.js primarily because it is now largely the standard library in data visualization for both industrial and academic use. This high level of popularity brings the advantage that D3.js has healthy community support and a high likelihood of future development and enhancements, making it the obvious choice for visualization development within DiSIEM.

9 "Polymaps." http://polymaps.org/. Accessed 31 Jul. 2017.
10 "Leaflet." http://leafletjs.com/. Accessed 31 Jul. 2017.
11 "Sigma js." http://sigmajs.org/. Accessed 31 Jul. 2017.
12 "Cytoscape." http://www.cytoscape.org/. Accessed 31 Jul. 2017.
13 "Crossfilter." 28 Feb. 2001, http://square.github.io/crossfilter/. Accessed 31 Jul. 2017.
14 "NVD3." http://nvd3.org/. Accessed 31 Jul. 2017.
15 "dc.js - Dimensional Charting Javascript...." https://dc-js.github.io/dc.js/. Accessed 31 Jul. 2017.
16 "Charts | Google Developers." https://developers.google.com/chart/. Accessed 31 Jul. 2017.
17 "Vega: A Visualization Grammar." https://vega.github.io/vega/. Accessed 31 Jul. 2017.
18 "Tableau Software." https://www.tableau.com/. Accessed 31 Jul. 2017.
19 "Qlik: Business Intelligence | Data Visualization Tools." https://www.qlik.com/. Accessed 31 Jul. 2017.
20 "Kibana: Explore, Visualize, Discover Data | Elastic." http://www.elastic.co/products/kibana. Accessed 31 Jul. 2017.

2.2 Web Frameworks

Web frameworks provide libraries for implementing common tasks in web application development. They also help organise and maintain the code more effectively. Many notable JavaScript web frameworks such as Backbone.js²¹ and AngularJS²² follow the Model-View-Controller (MVC) architectural pattern that separates an application into three main logical components: the model, the view, and the controller. The model is responsible for data processing, model computation, and other underlying tasks related to the application logic. The view takes care of displaying information to users. The controller coordinates between the model and the view.

The separation of concerns that characterises the MVC pattern provides many advantages, such as:

- Simultaneous development: all three components can be developed in parallel.

- Ease of modification: changes in one component (e.g., a sorting algorithm for an array of data items in the model) do not affect other components (e.g., the visual representation of data in the view).

- Multiple views for one model: different view classes can be created for the same data model, and the view selection can be done in the controller.

The downside of using web frameworks is compatibility across SIEMs. Code written for one framework differs substantially from code written for another, which makes the integration into SIEM systems more difficult. Therefore, we decide not to employ any web framework, to avoid integration conflicts. Instead, we follow the MVC pattern conceptually by organising the code into unit modules and developing them in such a way that they can be reused in integration.
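As an illustration of following the MVC pattern conceptually without a web framework, the sketch below organises plain JavaScript into a model, two interchangeable views, and a controller. It is only a sketch under stated assumptions: all names (`createSessionModel`, `listView`, `barView`, `createController`) and the session records are hypothetical, and the views return strings where a real tool would produce DOM or SVG output.

```javascript
// Model: holds data and application logic; knows nothing about rendering.
function createSessionModel(sessions) {
  const listeners = [];
  const items = sessions.slice();
  return {
    getSessions: () => items.slice(),
    // Sort sessions by descending score and notify subscribers.
    sortByScore() {
      items.sort((a, b) => b.score - a.score);
      listeners.forEach((fn) => fn());
    },
    onChange: (fn) => listeners.push(fn),
  };
}

// Views: turn model data into a presentation (strings stand in for DOM/SVG).
const listView = {
  render: (sessions) => sessions.map((s) => `${s.id}: ${s.score}`).join("\n"),
};
const barView = {
  render: (sessions) =>
    sessions
      .map((s) => `${s.id} ${"#".repeat(Math.round(s.score * 10))}`)
      .join("\n"),
};

// Controller: wires one model to the selected view and re-renders on change.
function createController(model, views) {
  let current = views.list;
  let output = current.render(model.getSessions());
  const refresh = () => {
    output = current.render(model.getSessions());
  };
  model.onChange(refresh);
  return {
    selectView(name) {
      current = views[name];
      refresh();
    },
    sort() {
      model.sortByScore();
    },
    getOutput: () => output,
  };
}

// Hypothetical data: two sessions with anomaly-like scores.
const model = createSessionModel([
  { id: "s1", score: 0.2 },
  { id: "s2", score: 0.9 },
]);
const controller = createController(model, { list: listView, bar: barView });
controller.sort();            // a model change; the views are untouched
controller.selectView("bar"); // a view change; the model is untouched
console.log(controller.getOutput());
```

Changing the sorting logic touches only the model, and adding a further view touches only the `views` object passed to the controller, which mirrors the ease-of-modification and multiple-views advantages listed above.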

2.3 Integration Plan

During the project, we will develop a few visualisation tools, each addressing a specific use case discussed later in Section 3.2. The development will start with plain JavaScript (without using any web frameworks), simple stylesheets, and static data input such as JSON files. This eliminates any framework-dependent backend and frontend development, allowing rapid prototyping and minimal development cost. We refer to this as the minimal prototype. In the integration phase, depending on each SIEM system, extra wrapping modules for the Model and the View will be implemented as summarised in Figure 3. More specifically, we have the following plans for integration with the three SIEM systems that our partners in the project are using.

Elastic-based. Kibana uses a plugin architecture and allows custom plugins. The idea is to convert the visualisation tool to a Kibana plugin, which should be technically straightforward because Kibana also uses the D3.js library that we employ in DiSIEM. In order to operate within Elastic-based systems, the data input module needs to be changed to Elasticsearch as well.

XL-SIEM. The dashboard in XL-SIEM is also implemented using web-based technologies. Therefore, it will be possible to integrate the visualisation tool as a dashboard view. More work needs to be done to enable communication with other views belonging to the dashboard. The data input module needs to be changed to MySQL, as it is currently used in XL-SIEM.

21 "Backbone.js." http://backbonejs.org/. Accessed 31 Jul. 2017.
22 "AngularJS – Superheroic JavaScript MVW Framework." https://angularjs.org/. Accessed 31 Jul. 2017.

ArcSight. The dashboard in ArcSight is implemented using Java, thus it will not be possible to integrate the visualisation tool tightly as a dashboard view in ArcSight. Instead, the tool will read and visualise data from a central database that ArcSight exports to. This configuration allows ArcSight and the visualisation tool to work on the same dataset.

Figure 3 – Integration plan for SIEM systems.
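The idea of swapping only the data input module per SIEM can be sketched as follows. This is an assumption-laden illustration in Python: DataSource, JsonTextSource, InMemorySource and highest_score are hypothetical names, and InMemorySource merely stands in for a SIEM-specific wrapper (no real Elasticsearch or MySQL calls are shown).

```python
import json
from typing import Dict, List, Protocol


class DataSource(Protocol):
    """Hypothetical interface that every data input module would satisfy."""

    def load_sessions(self) -> List[Dict]: ...


class JsonTextSource:
    """Minimal-prototype input: static JSON text, e.g. the content of a JSON file."""

    def __init__(self, text: str) -> None:
        self._text = text

    def load_sessions(self) -> List[Dict]:
        return json.loads(self._text)


class InMemorySource:
    """Stand-in for a SIEM-specific wrapper written during the integration phase."""

    def __init__(self, rows: List[Dict]) -> None:
        self._rows = list(rows)

    def load_sessions(self) -> List[Dict]:
        return list(self._rows)


def highest_score(source: DataSource) -> float:
    # Code downstream of the input module depends only on the interface,
    # so it stays unchanged when the data source is replaced.
    return max(session["score"] for session in source.load_sessions())
```

Under this design, moving from the minimal prototype to a given SIEM would only require a new class with a `load_sessions` method.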


3 Understanding User Requirements

3.1 Methodology

As discussed in the introduction, we follow a user-centred approach while designing the visualisation solutions. End users play a central role in all stages of the design process (requirement elicitation, interface design, implementation and evaluation). This approach helps the visualisation tools to reach the right audience and to address the problems that users actually have. This section gives an overview of the methods we use for the first stage of the design process: eliciting and understanding user requirements.

Interview. This is a technique to elicit requirements explicitly through question-answer conversations with end users. Interviews can be done with a single user or a group of users. Interviews can range from unstructured or open-ended conversations, in which no questions are predetermined, to highly structured conversations, in which specific questions occur in a specified order. Interviews can also be semi-structured: a discussion theme and a set of initial questions are predetermined to kick off and scope the interview; however, new questions can be added to follow up on the answers to previous questions.

Observation. This technique elicits requirements through observation of end users performing their daily tasks. It helps the observers to understand the work process, the challenges faced, and the opportunities for improvement. Observation can be passive or active. In the passive option, the observer does not interact with the end user during the observation at all, to avoid interfering with the user’s process. The observer can take notes and ask questions later. In the active option, however, the observer can intervene and ask questions to get a deeper understanding of what is happening. Observation may help uncover implicit requirements that interviews may overlook, because what users say about their work process may not be exactly what they actually do.

Brainstorming. A group of end users think creatively about the problems they want to solve. What are the bottlenecks? Why is it difficult? How can it be improved? The goal is to first generate as many ideas as possible. Then, ideas can be scoped and ranked to form a set of initial requirements.

Prototyping. It is usually impractical to elicit a complete set of requirements in the early stage of the design process. The idea of this technique is to quickly implement a low-fidelity prototype based on initial requirements or needs and use it in discussions with end users to gather feedback and new requirements. This technique is useful when the end users are not clear about what exactly they want to include in their systems, which is not uncommon. However, they are typically good at assessing whether the implemented features are helpful or not. In addition, after being exposed to what visualisation can offer within their domain and with their own data, the end users are more likely to provide informative and valuable feedback and suggestions.

Within the project, we employ the aforementioned techniques in an agile way and perform them in various stages within the development process. Our design and development strategy is user-centred, prototype-led and highly agile, with several iterations of prototype development where we converge to effective solutions by incorporating user feedback within the whole development cycle.

3.2 Use Cases

As a result of the earlier discussions with all the partners during the proposal development stage and at the consortium meetings, we performed a critical investigation of the requirements and the character of the problems in terms of their suitability for visualisation development. As often discussed in the visualisation literature [1], not all problems are suitable for a visualisation-based approach. Problems that are well-defined, with clear objectives and definite metrics to optimise against, are best catered for by automated computational solutions which do not necessarily need to involve a human expert, such as estimating trends from signals, automated algorithmic trading tools, or computer vision algorithms that are used to, for instance, detect faces. Visualisation-based methods make a real difference in those cases where the problem is ill-defined, i.e., where the questions being investigated are not fully formulated, where there are no clear objectives to deliver, or where there are only partially automatable solutions to an investigated question. These kinds of problems benefit the most from involving a human expert in the problem-solving process and hence are most suitable for visualisation development.

Whilst evaluating the visualisation suitability of the problems we are aiming to address within DiSIEM, this critical perspective on the nature of the problem has been our primary criterion. As a consortium, we identified two challenging problems that visualisation can best address. For each problem, we work with the corresponding partners to understand and elicit user requirements, as discussed in detail in the following sections.

3.2.1 User Behaviour Modelling

User and Entity Behaviour Analytics (UEBA)23 has become an important feature of SIEM systems. It captures user actions, infers normal behaviours, and detects abnormal deviations from the standard. Some of the benefits of UEBA include detecting insider threats, compromised accounts, and privileged account abuse and misuse. However, such user behaviour analysis capability in existing SIEM systems is limited (see Deliverable 2.1, In-depth analysis of SIEMs extensibility, for a detailed analysis). In this project, we investigate the use of visualisation in facilitating such user behaviour analysis.

The Requirement Gathering Process

The process started with a presentation from our partner Amadeus to introduce the problems they have, their importance, and their existing work in progress. A demonstration followed to help us understand the current workflow and to identify its pain points. At the end of the session, Amadeus and the visualisation researchers agreed on an initial set of high-level requirements for the first visualisation prototype. In subsequent meetings, we demonstrated the prototype, received feedback from Amadeus members, refined the feature set, and implemented the changes for the next meeting. Figure 4 summarises this process.

23 https://en.wikipedia.org/wiki/User_behavior_analytics

Figure 4 – Our “prototype-led”, “iterative” approach to the requirement gathering and development process.

Description of Analysis Problem

Amadeus has been focusing on the detection of insider threats through the analysis of user behaviour. They develop the SKEPTIC framework, which uses unsupervised statistical learning methods to generate user profiles based on user actions with monitoring applications. A number of different types of data are collected while users access an application: meta information (username, IP address, browser, etc.) and activity-focused information detailing what happened in a recorded session (see Figure 5).

Figure 5 – Different sources of signals and data utilised in the detection of anomalous behaviour within the SKEPTIC framework by Amadeus.

First, a statistical model is built for each category of information. For each user session, those models are used to generate individual anomaly scores. Then, those scores are combined in a weighted manner to produce an aggregated score. This final score is used to assess the anomaly status of a user session. When the computed score is high, an investigation into that session is required to validate the score and search for an explanation. However, such an investigation is currently highly manual and time-consuming.
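The weighted combination step can be illustrated with a small sketch. The report does not specify the weighting scheme used by SKEPTIC, so a normalised weighted mean is assumed here purely for illustration; the function name and the category names in the usage example are hypothetical.

```python
from typing import Dict


def aggregate_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-category anomaly scores into one score.

    Assumption: a weighted mean, normalised by the sum of the weights,
    so that the result stays within the range of the input scores.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight
```

For example, with scores `{"meta": 0.2, "actions": 0.8}` and weights `{"meta": 1.0, "actions": 3.0}`, the action model dominates and the aggregated score is approximately 0.65.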


Current Manual Investigation

This section describes a demonstration by an analyst from Amadeus of using their existing tools to analyse user sessions. The process is currently supported by a Kibana-based dashboard and is executed in two stages.

Session Selection. An analyst starts by looking at an overview of a set of sessions; for example, those that happened in the last 24 hours. In the Kibana dashboard (Figure 6), the sessions are shown as a histogram binned by time interval and colour-coded by anomaly score. This provides a distribution of scored sessions over time and score ranges.

Figure 6 – Distribution of session anomaly scores for a duration of 24 hours as visualised by the existing Kibana-based dashboard.

The analyst can then decide to keep only sessions with scores greater than a specific threshold, such as 0.8, and investigate the selection in a table format (Figure 7). Sessions are then sorted in descending order by score, and the one with the highest score is selected for further investigation. In the current practice, only a single attribute (the model score) can be considered in session selection.
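The selection step just described amounts to a filter followed by a sort. A minimal sketch, assuming each session is a dictionary with a `score` key (the function name and record layout are hypothetical):

```python
from typing import Dict, List


def select_sessions(sessions: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """Keep sessions scoring strictly above the threshold, highest score first."""
    kept = [s for s in sessions if s["score"] > threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)
```

The first element of the returned list is the session an analyst would investigate first under the current practice.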

Figure 7 – Screenshot of a table widget used to list the sessions with high scores. Notice how quickly this representation becomes hard to read with increased numbers of sessions to investigate.

Session Investigation. The analyst then examines the actions that took place in that session through another data table. Several pie charts display summary statistics of the action types, such as the 10 most common ones (Figure 8). To understand what was going on in the session, the analyst must go through all actions listed in the table, which can be time-consuming and error-prone, and makes it difficult to detect any patterns. More challengingly, to make an informed decision, the analyst often needs to compare one session with other sessions performed by the same user in the past. Both the large number of sessions to analyse and the fact that the sessions are often long and complex, with several activities, make the investigation challenging to execute.

Figure 8 – Existing Kibana-based dashboard providing the details and breakdown of features related to a single user/session.

High-level Analysis Goals

After the observation process, a semi-structured interview was conducted to gain a deeper understanding of the user analysis process. We asked the analysts a set of pre-prepared questions to elaborate on a number of aspects. The questions are as follows.

1. Which tools are commonly used and what are the key steps in the analysis?
2. Which information is relevant and which relationship is important to discover?
3. How to decide which sessions to investigate?
4. How to decide if a session is fraudulent or not?

Figure 9 illustrates a demonstration by analysts. We (analysts and visualisation researchers) identified the following high-level goals that successful analyses aim to achieve.

Sessions Overview. Help analysts gain an understanding of what has just happened in the monitoring system and identify highly suspicious sessions for further investigation. In the current process, the analysts explore sessions that happened in the last 24 hours through a time histogram stacked by different severity levels of score, such as low, medium and high. Then, they filter to keep only sessions with high scores and look at them in a table format. The table allows sorting sessions by score value, and sessions are selected for investigation in decreasing order of score (i.e., highest scores first). This process has two limitations. First, it relies entirely on the modelling score, which is still in the early stage of development and has shown imperfections. Second, analysts lack prior knowledge of the context of a session before diving into in-depth investigation.

In-depth Analysis. Support the analysis of potentially suspicious sessions selected in the previous step. This investigation is currently completed through manual examination of actions displayed in a data table, which is time-consuming, error-prone and inefficient. Analysts indicated a significant interest in efficient ways to gain insight into the activities happening in user sessions. Another significant challenge highlighted is the lack of fit-for-purpose solutions to enable effective comparisons across different sessions.

Figure 9 – A demonstration by analysts.

Data Characterisation

Through the observation and the follow-up interview, we understand that to achieve the two previous goals, the analysts mostly analyse the sequence of actions in user sessions. Therefore, we initiated our design and development process with a preferred focus on the data related to the sequence of actions. It is important to note that the other data attributes (client browser, operating system, country, etc.) are not left out but are of secondary priority within our initial investigations. The action sequence data already poses many challenges. First, the raw actions contain little semantics, but they represent multiple levels of higher semantic concepts such as user tasks and roles. The number of actions is high and the sequences contain noise. Also, multiple facets exist in the data, such as time, semantics and users.

In the initial phases, to help us get acquainted with the problem domain, carry out preliminary analysis, and design prototypes, we decided to start working with a small sample dataset provided by Amadeus, which contains data spanning 31 days on approximately 15,000 sessions performed by 1,400 users with 300 different action types. This dataset serves the purpose of rapid prototyping for further feedback and detailed requirements very well. We detail here the information associated with a user session.

meta-information:
- time: when the session began and ended
- user: who performed the session, and their office and organisation
- IP address: where the session is performed

actions: an ordered list of what actually happened, including
- time: when an action was performed
- type: providing meaning to the action, such as SearchUser and DisplayOneUser

score: an anomaly measurement (0 → 1) computed by a SKEPTIC model

derived attributes:
- length: the number of actions
- duration: the covering time range
- action rate: the ratio of duration to length, showing the average time between two actions
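The session structure listed above can be written down as a small record type with the derived attributes computed on demand. This is only a sketch: the field names, the use of seconds for duration, and the choice to return 0.0 for an empty session are assumptions, not the format of the Amadeus dataset.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Session:
    # Meta-information
    user: str
    ip_address: str
    start: datetime
    end: datetime
    # Ordered list of (time, action type) pairs
    actions: List[Tuple[datetime, str]]
    # Anomaly score in [0, 1] produced by a model
    score: float
    office: str = ""
    organisation: str = ""

    @property
    def length(self) -> int:
        """Number of actions in the session."""
        return len(self.actions)

    @property
    def duration(self) -> float:
        """Covered time range, in seconds (unit is an assumption)."""
        return (self.end - self.start).total_seconds()

    @property
    def action_rate(self) -> float:
        """Ratio of duration to length, as defined in the list above."""
        return self.duration / self.length if self.length else 0.0
```

A one-minute session containing three actions would therefore have a length of 3, a duration of 60 seconds and an action rate of 20 seconds per action.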

Specific Requirements from Iterative Prototypes

The prototype-led requirement gathering process continued with a series of meetings to discuss prototypes and requirements. The meetings were both online and offline (project meetings and an on-site visit to the Amadeus office). A typical meeting lasted about 2 to 3 hours and involved 3 to 5 analysts. It started with us demonstrating the new features and changes in the prototype. The analysts then gave feedback and suggested improvements or additional features. They also had hands-on experience with the prototype, either immediately in the offline meetings or later in the online meetings. At the current stage of this project, we have identified the following main analysis tasks or requirements (note that minor interface requirements are not listed).

Goal 1: Sessions Overview

- Task 1.1 – Relationship between anomaly score and other session attributes. Because the score is imperfect, it is necessary to complement it with other information, such as duration and length, to increase accuracy in the identification of potentially suspicious sessions.

- Task 1.2 – Higher-level semantic summary of atomic action sequences. Currently, the analysts can see the frequency of action types through a pie chart. However, an individual action may not carry the meaning that may be revealed in a sequence of several actions.

Goal 2: In-Depth Analysis

- Task 2.1 – Multi-scale exploration. Help analysts quickly understand what happened in a single session, in all sessions performed by the same user, and in all user sessions assigned to the same office.

- Task 2.2 – Current vs. past comparison. Help analysts quickly identify both the similarities and the differences between a given session and other sessions previously performed by the same user.

These goals and tasks provide the scope and focus of the visualisation design and development activities that we are planning to undertake within this first use case on behaviour analysis.

3.2.2 Visual Diversity Analysis

A second general and highly applicable use case we have within DiSIEM is the analysis of diverse systems and the analysis of diverse information. Large organisations usually make use of multiple detection or monitoring tools to protect their network infrastructures. The data generated by those tools (stored in SIEM systems) provides an opportunity for exploration and assessment of the performance of different combinations of tools. Existing SIEM systems lack this capability. Work from WP-3 (the Diversity Assessment and Prediction component, or DAP in short) will provide algorithms to support this diversity assessment and prediction. We will combine our work with that component to provide visual analysis capability to SIEM users, allowing them to explore and analyse diverse configurations of monitoring tools more effectively. The ultimate goal of the activities under this heading is to build solutions that help SOC operators and cyber security analysts make better informed decisions both during the design of a defence strategy and during the evaluation of ongoing attacks and anomalies. Given the volume and variety of data that are currently available within existing SIEMs and will be made available during the project, the activities under this use case have wide applicability both for the user-side partners of the project (i.e., Amadeus, ATOS, EDP) and for the wider SIEM user community.

The Requirement Gathering Process

The primary driver for this use case is the continuous exchange between the two research centres at City, University of London, giCentre and CSR, which are physically located in close proximity.

In order to gather the requirements and needs, we first organised a scoping meeting, in which CSR researchers presented the available data sources and the existing problems within the current analysis, and we investigated potential areas where visualisation support for facilitating the process is needed. A brainstorming session was carried out following the initial scoping meeting, in which both CSR and giCentre researchers discussed the problem creatively and openly, identified possible key points and requirements to guide the development in visualisation, produced initial design sketches, and identified promising use cases to guide further development.

Data Characterisation

Within the visual analysis of diversity, we consider data that is produced as a result of the diverse modelling, i.e., in the diverse setups we have within DiSIEM, the raw data (which usually comes in the form of PCAP24 files) is fed into various configurations of defence systems, which produce alerts. Such alert data can be produced either retrospectively, i.e., using logged data, or in a streaming/live fashion when deployed in a system.

Within this use case, for an initial investigation of the potential solutions, we consider a general data structure to characterise the data available for analysis and visualisation, and describe an alert data object which comprises the following fields:

Alert Data Object:

- Time: Time-stamp that indicates when the alert is raised.
- Type: Categorical information that indicates the type of alert, e.g., whether it relates to a particular blacklisted IP or a known vulnerability, etc.
- Sensor: An indication of the configuration of monitoring tools used; can be any combination of Antivirus, IDS, Firewall, or any other defence mechanism deployed.
- Meta-data: This is a placeholder for any further data that is available related to an alert. Examples could be unstructured information about the rule which led to an alert, the probability/certainty associated with the alert, etc.
- Label: In certain cases, in particular for the purposes of training and simulation, the alerts will have labels indicating whether they are true attacks or not.
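The alert data object above maps naturally onto a small record type. The sketch below is an illustration only: the field types (a frozen set of tool names for the sensor configuration, a free-form dictionary for meta-data, an optional boolean label) are assumptions and not a format agreed within the project.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Alert:
    time: datetime                 # when the alert was raised
    type: str                      # categorical alert type
    sensor: FrozenSet[str]         # combination of monitoring tools that produced it
    meta: Dict[str, Any] = field(default_factory=dict)  # any further data
    label: Optional[bool] = None   # True/False when ground truth exists, else None
```

Representing the sensor configuration as a set makes alerts from the same combination of tools directly comparable, regardless of the order in which the tools are listed.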

Note that within this context, we deal with the data at an alert level and do not consider the underlying raw data (i.e., PCAP files) that goes into the models.

OSINT Related Data and Modelling Results:

In addition to the data emanating from the diversity analysis process, we will also consider raw data and threat predictions based on the OSINT data gathering and modelling performed within WP4. In order to incorporate an OSINT-based perspective along with the other diversity analysis results, we will consider incorporating both the raw OSINT data (Tweets, forum discussions, open databases as listed in the report D4.1) and the threat modelling results within the visualisations developed in this use case. The details and structure of these data are currently being developed by WP4, and the visualisation work package will coordinate with WP4 on incorporating these data sources as they are being gathered and modelled.

Analysis Goals and Tasks

The DAP module will label the data, support exploration of diverse configurations, and predict the next anomaly event. The overall objective within all these activities is to understand the variations in the distributions of alerts in response to the combinations of monitoring tools, with the ultimate goal of making better decisions both in evaluating signals and in setting up the infrastructure.

During the brainstorming sessions, we investigated the existing workflow and identified that the analysis happens in three stages, as described below. In the following, we describe each stage briefly, list the core analytical questions involved in the stage, and present the list of features discussed during the workshops as the basis of further development.

24 https://en.wikipedia.org/wiki/Pcap

Stage 1 – Overview of alerts as opposed to the combination of sensors

At this stage, no information on whether the alerts are real attacks or not is yet available. Due to this fact, the nature of the analysis is purely exploratory at this stage; the goal is to gain an overview of how the alerts are distributed over time and over different configurations, and to identify trends and outliers.

Core Analytical Questions:

- How are the alerts distributed over infrastructure combinations over time?
- Have there been significant changes over time in the volume and character of alerts, and can we derive any potential causes driving these changes?
- Are there common trends or outliers in the alert distributions, and is there any systematic structure in the system configuration that might help explain them?

Features mentioned:
- Display the overall distribution of alerts broken down by time/configuration.
- Filter and focus on a particular time period.
- Ability to consider alerts within the context of all the traffic.
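The first feature, a distribution of alerts broken down by time and configuration, reduces to counting alerts per (time bin, sensor combination) pair. A minimal sketch follows, assuming hourly bins and alerts given as (timestamp, set of sensor names) pairs; both choices, and the function name, are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Set, Tuple


def alert_distribution(alerts: Iterable[Tuple[datetime, Set[str]]]) -> Counter:
    """Count alerts per (hour, sensor configuration)."""
    counts: Counter = Counter()
    for timestamp, sensors in alerts:
        # Truncate the timestamp to the start of its hour to form the time bin.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(hour, frozenset(sensors))] += 1
    return counts
```

Filtering on a particular time period then corresponds to keeping only the keys whose hour falls inside the period of interest.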

Stage 2 – Analysing (labelled) alerts for model investigation

In this stage of the investigation, analysts consider data that also has the labels associated with the alerts, i.e., data on whether the alert is associated with a real attack or not. In this stage, ROC curves are usually employed by analysts, and limitations in existing tools were discussed.

Core Analytical Questions:

- How do the various system configurations relate to each other in terms of their performance?

- How do the performances of system configurations vary over time, and how can the changes inform an analyst on stability?

- What are the best configurations when multiple optimisation criteria are considered?

Features mentioned:
- Manually adjust what the performance metrics are.
- Filter the alerts to focus on a particular subset (e.g., only False Positives).
- Filter time periods.
- Observe changes over time and/or performance during a particular instance such as an attack.
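To make the comparison of configurations under multiple criteria concrete, the sketch below computes a true-positive rate and a false-positive rate per configuration (one point in ROC space each) and keeps the configurations that no other configuration beats on both criteria. Several assumptions are made for illustration only: each labelled traffic item records which sensors fired, a configuration alerts when any of its sensors fires, and "best" is interpreted as Pareto-optimal. None of this is prescribed by the report.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Config = FrozenSet[str]


def rates(items: Iterable[Tuple[bool, Set[str]]], config: Config) -> Tuple[float, float]:
    """Return (true-positive rate, false-positive rate) for one configuration."""
    tp = fp = positives = negatives = 0
    for is_attack, fired in items:
        alerted = bool(config & fired)  # assumption: alert if any sensor fires
        if is_attack:
            positives += 1
            tp += alerted
        else:
            negatives += 1
            fp += alerted
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def pareto_front(perf: Dict[Config, Tuple[float, float]]) -> List[Config]:
    """Keep configurations not dominated on (higher TPR, lower FPR)."""
    front = []
    for config, (tpr, fpr) in perf.items():
        dominated = any(
            t >= tpr and f <= fpr and (t > tpr or f < fpr)
            for other, (t, f) in perf.items()
            if other != config
        )
        if not dominated:
            front.append(config)
    return front
```

An interactive tool could let the analyst swap the two rates for other performance metrics, which corresponds to the "manually adjust what the performance metrics are" feature.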

Stage 3 – Prediction (Uncertainty visualisation for models)

This stage of the investigation involves a combination of the analysis modelling outputs with the aim of evaluating the forecasts for future potential vulnerabilities. The primary input to this stage will be the forecasts and the parameter spaces of the probabilistic models. In addition to these modelling outputs, we will also incorporate other signals generated from the analysis of OSINT data.

Core Analytical Questions:

- How are the various models related to each other in terms of their predictions?

- What are the similarities and differences between models, and to what extent can this help the evaluation of the forecasts?

- To what extent can the changes in the frequencies, the trends in the models, and the predictions be explained and supported by signals coming from OSINT data?

Features mentioned:
- Visually investigate several models with their forecasts in a synoptic way.
- Visualise the uncertainty in the predictions.
- Relate the predicted models to past raw data to provide context to the predictions.
- Relate the predicted models and the raw data leading to them to the gathered OSINT data and the threat predictions.

The above characterisation of the workflow and the core analytical questions provide us with the domain-specific requirements at a low level. In the following, we abstract from these low-level questions and present high-level goals and associated high-level tasks:

Goal 1: Overview of alerts

- Task 1.1 – Alerts vs. non-alerts in the traffic. Exploring how often alerts were raised.

- Task 1.2 – Temporal analysis of alert distributions. Exploring the relationship between different attributes: time, alert type and sensor. How do different sensors raise alerts over time?

Goal 2: Interactive exploration of sensor configurations

- Task 2.1 – Overview of all possible configurations. Looking at all configurations at the same time, where the visualisation can change over time.

- Task 2.2 – Interactive optimisation. Supporting interactive means for the users to identify the best configuration based on their needs and strategy.

Goal 3: Analysis and evaluation of model ensembles

- Task 3.1 – Visual summary of models. Showing the prediction value and how reliable it is.

- Task 3.2 – Visual interaction with models. Interactively selecting particular subsets of data and/or parameters that go into the modelling and observing variations.

- Task 3.3 – Visual correlation with OSINT data and models. Visually comparing the signals being raised from the OSINT threat prediction models and explaining/evaluating models with diverse data.


Notice that we organised these goals and tasks in a structure similar to the existing workflow to help plan our activities within this use case. We revisit these goals and tasks in Section 4.2 and present a series of initial sketches and document promising prototype ideas.


4 Initial Designs

This chapter reviews work related to the two use cases discussed above and describes early design attempts at addressing the elicited analysis goals and tasks. Note that these are not the final designs of our visualisation tools. Instead, they were included in the prototypes demonstrated to end users to investigate potential designs and elicit requirements. Under each section, we also review the relevant state of the art that is needed to help inform our designs.

4.1 User Behaviour Modelling

4.1.1 State of the Art

Temporal Event Sequence Visualisation

Time is an essential aspect of life because everything contains inherent temporal attributes, such as the time when a person was born and the time when an event happens. Long before computers were invented, information graphics were used to represent temporal relationships in data. One of the oldest documented timelines, entitled A Chart of Biography, was created back in 1765 by Joseph Priestley25. It shows the lifespans of two thousand famous names along a horizontal time axis, spanning from 1200 BC to 1800 AD. He uses a horizontal line segment to depict a lifespan, and adds dots to either end to indicate the uncertainty of the reported values. Since then, many visualisation techniques have been developed to effectively reveal the temporal relationships in data. The book by [9] provides a comprehensive review of this topic. In this section, we focus on visualisation techniques for event sequence data.

The timeline [10], [11] is the most common method to visualise temporal event data. Events are displayed along a horizontal axis at the time they happen. To avoid overlap, events can be placed at different vertical locations [12]. Event types can be encoded using coloured icons or spatial grouping [13], as shown in Figure 10. However, these methods can only support a limited number of events and event types.

Figure 10 – TimeSets technique by Nguyen et al. [13] to display temporal events grouped according to their semantic similarity over time.

25 "A Chart of Biography - Wikipedia." https://en.wikipedia.org/wiki/A_Chart_of_Biography. Accessed 31 Jul. 2017.


There have been several approaches to achieve higher scalability. LifeFlow [14] provides an overview of event sequences through an aggregation rather than showing all individual ones. Events of sequences are aggregated if they have the same type and are at the same step in their sequences (Figure 11). LifeFlow is good at summarising similar and short sequences. However, when the sequences are long and diverse, the visualisation often becomes highly cluttered.

Figure 11 – Icicle-plot-like view in LifeFlow [14] to provide overviews of event sequences.

EventFlow [15] provides a set of interactive user-driven data simplifications to address the limitations of LifeFlow. Events can be filtered out or merged based on interval or event types (Figure 12). A more advanced ‘search and replace’ mechanism is provided through a friendly interface. As a result, EventFlow greatly simplifies large event datasets to reveal high-level patterns.

As an alternative to providing an overview, DecisionFlow [16] takes a different approach to simplification: filtering for a more focused analysis. The technique starts by asking users to form a query to search for relevant event sequences. DecisionFlow allows users to specify multiple ‘milestones’ in a sequence and presents a visual summary of all sequences having such milestones. An example of such a query is “show me all sequences that have three milestones A, B and C in that order, and there should be a gap of 6 months between A and B”. This approach is suitable for users who roughly know what they want to investigate rather than for exploring interesting patterns in the whole dataset.


Figure 12 – EventFlow by Monroe et al. [15] incorporates interactive visuals for searching and aligning event sequences to support explorative analysis more effectively.

Sequential Pattern Mining of Events

Sequential patterns are ordered lists of events that co-occur frequently in a sequence dataset. Events that satisfy a pattern are not necessarily consecutive. Such patterns can be mined using classic algorithms such as AprioriAll [17] and GSP [18], as well as more recent, improved approaches [19], [20]. A pattern is considered frequent if it appears in more than a particular number of sequences. This threshold value is called the support and can be measured as an absolute value or a percentage.

One problem of frequent pattern mining is the high number of patterns returned by the algorithms. Besides increasing the support, different constraints can be included in the algorithms [21]–[23] to help express more specific patterns, namely:

- Maximal sequence. A sequence is maximal if there is no other sequence that contains it.
- Type. Find patterns having specific types.
- Length. Find patterns having at least 20 events.
- Time gap. Find patterns such that the gap between consecutive events should be less than 1 month.
- Regular expression. Find patterns ‘starting from a homepage then search for a particular keyword’.
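The notions of containment, support and maximality used above can be pinned down with a short sketch. It is not an implementation of AprioriAll or GSP; it only shows how a candidate pattern would be checked against a dataset, and the function names are hypothetical.

```python
from typing import List, Sequence


def contains_pattern(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the pattern's events occur in order, not necessarily consecutively."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched event,
    # so later pattern events must appear after earlier ones.
    return all(event in remaining for event in pattern)


def support(sequences: List[Sequence[str]], pattern: Sequence[str],
            relative: bool = False) -> float:
    """Number (or fraction, if relative) of sequences that contain the pattern."""
    count = sum(contains_pattern(s, pattern) for s in sequences)
    if relative:
        return count / len(sequences) if sequences else 0.0
    return count


def maximal(patterns: List[Sequence[str]]) -> List[Sequence[str]]:
    """Keep only patterns that are not contained in another pattern of the list."""
    return [
        p for p in patterns
        if not any(p != q and contains_pattern(q, p) for q in patterns)
    ]
```

For instance, the pattern A, B, C is contained in the sequence A, X, B, C but not in B, A, C, because no B follows the A in the latter.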

Event Sequence Pattern Visualisation

A pattern is basically an ordered list of event types, so standard temporal event sequence visualisation techniques can be used to visualise frequent event patterns. More specifically, a pattern also includes information about all sequences supporting or containing it. Those supporting sequences are usually summarised and displayed together with the patterns. Patterns can be displayed individually, each consisting of its event types [24]. Event types are displayed vertically and the distance indicates the average time gap between two consecutive events, considering all instances of the pattern (Figure 13). When a pattern is selected, all sequences having the pattern will be displayed in the other view, allowing further exploration.


Figure 13 – Wang et al. [24] investigate the use of linked representations of patterns and sequences.

Patterns can also be displayed in an aggregated manner, as in the Sankey diagram used by Google Analytics26. In Figure 14, a Sankey diagram is used to show common visit paths in a particular website. For each step in a pattern, identical events are aggregated and the size of the flow corresponds to the number of individual events. Transitions between events are also aggregated and reflected through flow size.

CoreFlow [25] extracts branching patterns based on ranked key events. The algorithm has been shown to be more scalable than standard sequential pattern mining algorithms. The patterns produced are also of smaller size, which makes the exploration more manageable. The patterns are visualised using an icicle plot [26], a node-link diagram, or a combination of both, as shown in Figure 15. The size of a segment between two consecutive events indicates the frequency of such segments in the dataset.
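The Sankey-style aggregation described above (identical events merged per step, transitions merged between consecutive steps) can be sketched as two counters. This is a generic illustration with a hypothetical function name, not the algorithm used by Google Analytics or CoreFlow.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def sankey_flows(sequences: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Aggregate sequences into node sizes and link sizes.

    nodes[(step, event)]            -> number of sequences with that event at that step
    links[(step, event, next_event)] -> number of transitions leaving that step
    """
    nodes: Counter = Counter()
    links: Counter = Counter()
    for seq in sequences:
        for step, event in enumerate(seq):
            nodes[(step, event)] += 1
            if step + 1 < len(seq):
                links[(step, event, seq[step + 1])] += 1
    return nodes, links
```

The node counts would drive the height of each block in the diagram and the link counts the width of each flow.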

26 "Google Analytics use Sankey Diagrams | Sankey Diagrams." 2 Nov. 2011, http://www.sankey-diagrams.com/google-analytics-use-sankey-diagrams/. Accessed 31 Jul. 2017.


Figure 14 – Sankey diagrams are already widely adopted to highlight flows – from the Google Analytics dashboard (http://www.sankey-diagrams.com/google-analytics-use-sankey-diagrams/).

Figure 15 – Branching events in sequences can be explicitly visualised to indicate high-level changes in patterns, as exemplified here by the CoreFlow technique [25].


4.1.2 Visualisation Designs

In this section, we present the designs in chronological order, so that readers can follow how the designs evolved over each iteration in response to the emerging analysis requirements, and how the visualisation designers gained a deeper understanding of the problem domain whilst analysts were also getting exposure to the capabilities of what visualisation-based systems can achieve.

Iteration 1 – High-level Analysis Goals

This iteration involves the generation of the initial prototypes that were built to address the two high-level analysis goals identified within the demonstration sessions and the interviews, as discussed in Section 3.2.1.

Goal 1: Sessions Overview

We provide two perspectives to look at session data.

Sequenceofactions.Asequenceisanorderedlistofactionshappenedinasession.Thishelpsrevealcommonanduncommonlinearcombinationsofactionsthatusersperformedwiththeapplicationsystem.Thisalsohelpstoexaminethefollowingortrailing relationship; for instance,howlikelysequenceABCwillbefollowedbyD(becomingABCD)orE(becomingABCE)?Asequenceisrepresentedbyarectanglewithmultipleequal-sizedcolumns,eachrepresentinganactioninthesequence(Figure16).Theactionsarereadfromlefttoright.Theheightoftherectangleindicatesitsnumberofoccurrencesintheentiredataset;i.e.,thetotalnumberoftimesthatsequenceisfoundinallsessions.

Figure16–Representationofasequenceofactions.

Sequencesarethenaggregatedinatreemetaphorrepresentationsothattheonessharing the same ancestor are placed next together (Figure 17). For instance,ABCDisplacedrightaboveorbelowABCEsothatABCcanbeseenastheirparent.Asaresult,thesequencetreeconsistsofmultipleequal-sizedcolumns(5inFigure17),eachrepresentingacolor-codedaction.The first columnshowssequenceswithoneaction,whichactuallyarejusttheactionsthemselves.Thecombinationofthefirstandthesecondcolumnsshowssequenceswithtwoactions,andsoon.Whenhoveringacell, thecorrespondingsequenceendingwiththatcellwillbehighlightedandatooltipisdisplayedtogivedetailedinformation.
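The aggregation behind the sequence tree can be illustrated with a small counting routine. This is only a sketch under stated assumptions, not the prototype's code: the names `count_sequences` and `children`, the `max_depth` parameter (mirroring the 5 columns of Figure 17), and the toy sessions are all hypothetical.

```python
from collections import Counter

def count_sequences(sessions, max_depth=5):
    """Count every contiguous action sequence of length 1..max_depth."""
    counts = Counter()
    for actions in sessions:
        for start in range(len(actions)):
            for length in range(1, max_depth + 1):
                if start + length > len(actions):
                    break
                counts[tuple(actions[start:start + length])] += 1
    return counts

def children(counts, prefix):
    """Sequences extending `prefix` by exactly one action, most frequent first."""
    prefix = tuple(prefix)
    kids = {seq: n for seq, n in counts.items()
            if len(seq) == len(prefix) + 1 and seq[:len(prefix)] == prefix}
    return sorted(kids.items(), key=lambda item: -item[1])

# Toy data (hypothetical): three sessions sharing the prefix A, B, C.
sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["A", "B", "C", "D"],
]
counts = count_sequences(sessions)
followers = children(counts, ("A", "B", "C"))
```

Here `counts[seq]` would drive the rectangle height, and `followers` answers the question of how often ABC is followed by D versus E.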


Figure 17 – A sequence tree showing the relationship of sequences.

We can observe some interesting patterns in Figure 18. SearchUser is the most common action, but what do users commonly do after a search? The actions following are DisplayOneUser or SearchUser again – i.e., after displaying a user, they usually search again or update details.

Figure 18 – Sequences starting with 'SearchUser'.


Another interesting pattern is that some actions are just followed by themselves, as shown in Figure 19. These are RefreshReport and RefreshTable.

Figure 19 – An interesting pattern about 'refresh' actions.

User View. The ultimate goal of this work is to analyse user behaviour based on their actions. Therefore, it is necessary to examine data at the user level besides diving into each individual user session. We provide a simple bar chart with each bar showing the average score of all sessions performed by the user (Figure 20). It helps reveal suspicious users that need more in-depth investigation.

Figure 20 – A simple bar chart of users.

Goal 2: In-depth Analysis

This is to support examining a selected session. As discussed previously, we focus on the recorded sequence of actions because such information tells us what actually happened in the session. For each action, two important pieces of information are displayed: the type of the action and the time when it happened.


An action is shown as a diamond along the time axis at the instant it occurs, with its colour indicating the type of action (Figure 21). On the left-hand side, a bar shows the model score with its length mapped to the score value. The score bar colour also indicates its value with 3 different hues: red for high (>0.75), yellow for medium (0.5–0.75) and green for low (<=0.5).
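The three-hue encoding of the score bar amounts to a simple threshold mapping. The function name `score_colour` is hypothetical; the thresholds are the ones stated above, with a score of exactly 0.75 treated as medium and exactly 0.5 as low.

```python
def score_colour(score):
    """Map a model score to the hue of the score bar."""
    if score > 0.75:
        return "red"      # high
    if score > 0.5:
        return "yellow"   # medium: above 0.5, up to and including 0.75
    return "green"        # low: 0.5 and below

example_hues = [score_colour(s) for s in (0.9, 0.75, 0.5, 0.1)]
```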

Figure 21 – A basic timeline of actions.

Action shapes can overlap each other if they appear close enough (Figure 22). In that case, we provide an option to avoid overlapping by displaying action shapes in separate rows (Figure 23). This method sacrifices space for readability.

Figure 22 – Timeline with overlapping actions.

Figure 23 – Timeline with actions located in different rows to avoid overlapping.

Another method to handle overlapping is to place actions next to each other based on their temporal order (Figure 24). This preserves the relative temporal order between actions but not the absolute timestamp information. This method is also space-efficient.

Figure 24 – Timeline with relative temporal order to avoid overlapping of actions.

While examining the data with our timeline, we noticed that actions do not appear randomly (Figure 25). They often appear together as a sequence to perform some activity, such as SearchUser → DisplayOneUser. Therefore, we combine them and show the sequence as a stretched diamond with the width indicating the duration between the first and the last action in the sequence (Figure 26).

Figure 25 – Timeline with repeating actions shown in a normal way.

Figure 26 – Timeline with repeating actions shown in an aggregated way to highlight the patterns.


Notes on Visualisation Interaction

This section briefly overviews a powerful technique in visualisation that is implemented in all the prototypes: coordination of linked views. An interaction in one view is reflected in all other views, enabling users to examine data from different perspectives in different views. For instance, when an analyst selects a subset of sequences in the ‘sequence of actions’ view, all users that performed those sequences are highlighted and all sessions having those sequences are also displayed in the timeline view. In another example, an analyst who wants to focus on the user perspective can select a particular user in the ‘user view’ to investigate; the data in the ‘sequence of actions’ view is then filtered to keep only the sequences performed by the selected user. The types of interaction are not limited to selection; coordination can be applied to hovering as well. Mouse hovering allows a faster exploration of data before making a more committed selection.

Iteration 2 – Analysis Task 1.2 and 2.1

This iteration addressed the two new tasks identified in the previous iteration.

Task 1.2 – Higher-level semantic summary of atomic action sequences

Activity Mining. The previous iteration showed the benefit of displaying common sequences of actions. In this iteration, we made an attempt to systematically extract all common sequences using data mining techniques. A session contains an ordered list of timestamped and labelled actions, with labels determined by the developers of the application. Even though each action is associated with a meaningful label indicating its purpose (such as SearchUser and DisplayOneUser), it is still challenging to understand the nature of a session due to the large number of actions (many sessions contain more than 100 actions). Moreover, early investigations reveal that actions do not appear randomly. They often appear together as short “patterns” where a higher-level activity is carried out, such as SearchUser → DisplayOneUser to retrieve the details of a user.

In order to both simplify the action space and represent the session data at a higher semantic level, we mine the activities from raw user actions. More specifically, given an ordered list of actions in a user session, we split the list into contiguous and disjoint sequences. Each sequence (an ordered sub-list of actions) represents a meaningful activity that the user performed (we refer to these artefacts as activities from now on in the text). It is reasonable to assume that these activities are a subset of frequent action sequences because small activities are supposed to be repeated many times in many different tasks, which are carried out in many different sessions (as similarly evidenced in other analysis settings [27]). Extracting frequent action sequences can be implemented using classic sequential pattern mining algorithms such as AprioriAll [17] and GSP [18]. However, the number of sequences produced by these algorithms can be high and the majority of them may not represent meaningful activities. To exclude non-activity sequences, we apply several constraints such as the maximum time gap between two adjacent actions in a sequence.
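The effect of the time-gap constraint can be illustrated with a deliberately simplified routine that cuts a session wherever two adjacent actions are too far apart and then counts the resulting segments. It is a sketch only and does not reproduce AprioriAll or GSP; the function name `split_by_gap`, the 10-second limit, and the toy timestamps are assumptions made for the example.

```python
from collections import Counter

def split_by_gap(actions, max_gap_seconds):
    """Split (timestamp, label) pairs into contiguous, disjoint label sequences.

    A new sequence starts whenever the gap to the previous action
    exceeds max_gap_seconds.
    """
    segments, current = [], []
    previous_time = None
    for timestamp, label in actions:
        if previous_time is not None and timestamp - previous_time > max_gap_seconds:
            segments.append(tuple(current))
            current = []
        current.append(label)
        previous_time = timestamp
    if current:
        segments.append(tuple(current))
    return segments

# Hypothetical session: timestamps in seconds since the session started.
session = [
    (0, "SearchUser"), (2, "DisplayOneUser"),
    (60, "SearchUser"), (63, "DisplayOneUser"),
    (200, "RefreshReport"),
]
segments = split_by_gap(session, max_gap_seconds=10)
frequency = Counter(segments)  # candidate activities and how often they occur
```

In this toy session the pair SearchUser, DisplayOneUser appears twice as a segment, which makes it a candidate activity, while the isolated RefreshReport appears once.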


Activity Visualisation. We visualise the sequences produced by the mining process to output the frequent activities performed, as shown in Figure 27. The visualisation consists of multiple rows, where each row represents an activity and is split into two parts: the right part visualising the actions in an activity and the left part listing statistics on these actions. Each activity is represented as a contiguous sequence of colour-coded squares, where each square represents an action. To characterise the “frequency” of an activity, three statistics are visualised in nested bars (three grey bars in the figure): the number of times the activity appears (biggest bar), the number of sessions having that activity (medium bar), and the number of users performing it (smallest bar). For instance, comparing the activity second from the top with the activity second from the bottom, the former is repeated many more times than the latter but takes place in a few sessions by a few users, whereas the latter is spread more evenly across several sessions and users.

Figure 27 – Visualisation design that displays the mined patterns along with their statistics.

Visual Exploration of Activities. This is to explore sequences with multiple dimensions, namely:

- the number of occurrences
- the number of sessions having the sequence
- the number of users performing the sequence
- the median score of all sessions having the sequence (the action frequency score from the model)
- the median length (number of actions) of all sessions having the sequence
- the median duration (time) of all sessions having the sequence

A sequence is represented by a circle (Figure 28) with its size mapped to the number of its occurrences in the entire dataset. Its location is determined by the two configurable axes of the scatterplot. The circle is split into multiple colour-coded rings, each one representing an action. The sequence is read outward, from the centre ring.


Figure 28 – A 'ring' representation of a sequence of actions. Each ring represents an action type, ordered outward.

Using a scatterplot (as represented by Figure 29) to explore activities or common sequences, we can discover interesting patterns. Setting the x-axis to ‘Occurrences’ and the y-axis to ‘Sessions’, we expect these two dimensions to be positively correlated and the circles to be placed along a diagonal line. Most sequences follow this, except the green ones. Hovering over them reveals that a single user refreshed 66 times in one session!

Figure 29 – The scatterplot of all the sessions indicates an “interesting” session with high levels of activity.

Changing the x-axis to ‘Median duration’ and the y-axis to ‘Median length’, as shown in Figure 30, we still expect a positive correlation; nonetheless, we can spot some exceptions.


Figure 30 – Switching to different axes, the plot now reveals sessions that might require further investigation – short but very active sessions and long sessions with limited levels of activity.

Changing the x-axis to ‘Median score’ and the y-axis to ‘Sessions’, we can see a fairly normal distribution of scores (Figure 31). Frequent sequences (big circles) seem to have average scores and some less frequent sequences have higher scores. An exception is a pink circle (DisplayOrgaDetails) that occurred 20 times in total across 12 sessions (all by different users) and received a median score of 0.87.

Figure 31 – Contrasting the frequency of sequences with scores reveals patterns that are consistently rated as anomalous despite being frequent.


Task 2.1 – Multi-scale exploration

Single Session. Using the result from the activity mining process discussed above, we provide four levels of action aggregation to allow analysts to explore the session from different perspectives. At the first (highest) level of detail, each action is shown separately, providing a sense of the session length and facilitating detailed examination of individual actions (Figure 32).

Figure 32 – Timeline with highest level of detail. Actions are shown individually.

At the second aggregation level (Figure 33), consecutive actions having the same type are aggregated into one, with a subtle horizontal line (such as the one in the middle of the figure) indicating the size of the aggregation.

Figure 33 – Timeline with the second highest level of detail. Consecutive actions having the same type are aggregated.

At the third aggregation level, actions are replaced by mined activities whenever possible. This greatly simplifies the visual summary and helps to quickly understand the entire session. An activity is represented as a contiguous sequence of colour-coded actions, leaving no white padding between them. The height of individual actions is reduced to half to visually distinguish them from activities (Figure 34).

Figure 34 – Timeline with the third highest level of detail. Activities are shown instead of actions.

At the highest aggregation level (lowest detail), consecutively repeated activities are combined into one, similarly to the second aggregation level (Figure 35).

Figure 35 – Timeline with the lowest level of detail. Consecutive activities are aggregated.
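The second and fourth aggregation levels both collapse consecutive repeats, which is a run-length encoding. A minimal sketch follows; the function name `collapse_runs` and the toy data are hypothetical, and activities are represented here as tuples of action labels purely for illustration.

```python
from itertools import groupby

def collapse_runs(items):
    """Collapse consecutive equal items into (item, run_length) pairs."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(items)]

# Second level: consecutive actions of the same type.
actions = ["RefreshTable", "RefreshTable", "RefreshTable", "SearchUser", "DisplayOneUser"]
aggregated_actions = collapse_runs(actions)

# Fourth level: consecutively repeated activities.
lookup = ("SearchUser", "DisplayOneUser")
activities = [lookup, lookup, ("RefreshReport",)]
aggregated_activities = collapse_runs(activities)
```

The run length is what the subtle horizontal line in Figure 33 would encode as the size of the aggregation.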

These four representations (actions → aggregated actions → activities → aggregated activities) allow analysts to investigate a session at different levels of detail and for different purposes: a high-level summary or a detailed examination.

Multiple Sessions. This new requirement emerged from observing the benefit of a compact representation of a single session. An analyst may want to compare multiple sessions such as those having really high scores, those coming from the same user or those using the same IP address. We support this by providing small multiples of visual summaries of sessions, each shown as a separate row (Figure 36). Specifically, when an analyst selects a user to investigate, all of his or her sessions are displayed in the timeline using small multiples. Sessions are ordered by the time that they begin, allowing the analyst to understand the activities over time. The analyst can highlight the session that needs to be investigated and visually compare it with other sessions. When an activity or action is hovered over, all other occurrences are highlighted. Sessions from an office are grouped by user, facilitating comparison between users in the same office.


Figure 36 – An overview of all the sessions executed by a single user using our simplification scheme to fit several sessions in a limited amount of space.

Iteration 3 – Analysis Task 1.1 and 2.2

This iteration addresses the two new tasks identified in the previous iteration.

Task 1.1 – Exploration of relationship between anomaly score and other session attributes

We introduce a design that targets about 1000 sessions (in this dataset, the maximum number of sessions in one day is 806). This shows an overview of sessions that happened recently, e.g., in the past 24 hours. Each session is shown as a small coloured rectangle and ordered sequentially from the top row to the bottom row, and from left to right in each row.

Figure 37 – Overview of sessions that can be colour-coded using lightness and grouped according to any criteria set interactively by the user according to the analytical needs.

Score is the crucial attribute for analysts to assess a session; therefore, it is shown using colour lightness (darker means higher value and more suspicious). An important requirement here is to relate score with other attributes of a session such as its length, duration and action rate. One of these attributes is mapped to the rectangle height and is changeable through a drop-down menu.

There is a class of attributes that can split sessions into different groups based on the attribute value, such as user or IP address. We visualise such categorical attributes using spatial grouping. Each group of sessions is placed spatially separated and surrounded by a border.

Besides anomaly score, this visualisation allows exploration of sessions from different perspectives and helps answer different analysis questions. Who are the users with consistently high session scores? Which IP address has the highest average score? How does session length or duration relate to anomaly score?


Task 2.2 – Current vs. Past comparison

Amadeus analysts mentioned that this is one of the most common activities in analysing suspicious sessions. It is essential to compare one session with other sessions performed by the same user in the past. Therefore, we decided to support such comparison automatically. Each action in a given session is compared against a set of actions performed in the past and assigned an ‘expectedness’ score. The score ranges from -1 (highly unexpected) to 1 (highly expected) and is computed as follows.

- If the action happens in a given session x times and happened before y times: $score = \frac{\min(x, y)}{\max(x, y)}$.

- If the action happens in a given session x times and never happened before: $score = \frac{1}{x} - 1$.
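The two cases can be sketched as a single function. The sketch reads the extraction-damaged formulas as min(x, y) / max(x, y) for a previously seen action and 1/x - 1 for a new one, a reading that matches the stated range from -1 to 1 but remains an assumption. The function name `expectedness` is hypothetical.

```python
def expectedness(x, y):
    """Expectedness of an action seen x times now and y times in the past.

    Assumed reading of the source formulas:
      y > 0:  min(x, y) / max(x, y)   (in (0, 1], 1 when the counts agree)
      y == 0: 1 / x - 1               (0 for a single occurrence, towards -1 as x grows)
    """
    if x <= 0:
        raise ValueError("the action must occur at least once in the session")
    if y == 0:
        return 1 / x - 1
    return min(x, y) / max(x, y)

examples = [expectedness(3, 3), expectedness(2, 8), expectedness(1, 0), expectedness(4, 0)]
```

Under this reading, an action repeated as often as in the past scores 1, while a never-before-seen action repeated many times approaches -1.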

The score is then visually encoded in the action representation to allow rapid comparison. In the figure below, dark red indicates highly unexpected while dark blue indicates highly expected.

Figure 38 – Comparison to “expected” behaviour is performed algorithmically according to an internal formula and the results are colour coded to highlight interesting sections of sessions.

Additional interface features and planned capabilities: In addition to the presented techniques that are motivated by the tasks above, we will also develop and incorporate more conventional interface features and capabilities common in visual analytics systems. These include search and filtering interfaces (which can be either visual or text-based), interfaces to amend the data being displayed, interfaces to add or remove views, features to collapse/maximise views, and features to export particular data or visuals to use in reports or to communicate with other systems. These features will be scoped and evaluated during the final evaluation phase of the project, where we will deploy the visualisations in live systems.


4.1.3 Analytical Approaches

In this part of our research, we combine interactive visualisations with computational methods, particularly data mining techniques. Focusing on sequences of actions or events, we address two problems:

1. Detect anomalous sequences, which may be suspicious.
2. Support examination of detected anomalous sessions by human analysts, so that they can easily understand what is unusual.

The approach we take aims at identifying transitions between actions that can be treated as semantic disruptions in action sequences. A normal session is expected to consist of semantically related actions that are usually performed together. An occurrence of an action that is not related to the previously performed actions may signify a possible system misuse. We call such occurrences “semantic disruptions”. To detect them automatically, it is necessary to find a way to measure the semantic relatedness of the actions, or, in other words, the “semantic distance” between any two actions. A large “semantic distance” of some action to the preceding actions in a session may be considered as an anomaly requiring the attention of a human analyst.

In the following, the approach is presented using the example of the session logs of the users of the system LSS (Logon and Security Server). The set of possible user actions consists of 296 different actions.

Measuring the semantic distances between actions

Given the large number of distinct actions, it is infeasible to obtain expert estimates of the degrees of semantic relatedness of all possible action pairs. The semantic distances between the actions need to be estimated in an automated or semi-automated way. We can adapt the approaches from text analysis, where semantic relatedness of words is assessed based on the frequencies of their co-occurrence in the same texts and/or the distances between them in texts. A session can be treated as a text, and the actions within the session as words.

Based on this reasoning, we obtain the semantic distances between the actions from the distances between the action occurrences in the sessions and the frequencies of their co-occurrences. The distance between the occurrences of two actions in one session is the number of other actions that were performed between them plus one. Hence, when one action immediately follows the other, the distance is 1; when there is one action between them, the distance is 2, and so on. This is the same approach as is used for measuring the distances between words in a text.

It is reasonable to ignore action co-occurrences with many other actions between them. High separation between occurrences of actions A and B decreases the probability of semantic relatedness of these occurrences. In text analysis, distances between words are typically measured within a paragraph rather than within a whole document, which may be quite long. Unlike text documents, action logs lack structure that could be used for defining the scope of action neighbourhood. Instead, we use a threshold on the number of other actions separating the occurrences of two actions. We experimented with thresholds of 10 and 15 and did not detect noticeable differences between the results.


Our automatic tool scans the action sequence of each session, takes all pairs of distinct actions (A, B) separated by 0, 1, …, N other actions (where N is the separation threshold), measures the distances between A and B, and counts the total number of co-occurrences of A and B in all sessions. The relative order of the actions A and B is irrelevant, since we need a distance measure that is symmetric, i.e., D(A,B) = D(B,A). The semantic distance between A and B is computed as the median of the distances from all co-occurrences (including both (A,B) and (B,A)) inversely weighted by the number of the co-occurrences. The idea is that frequently co-occurring actions are semantically more related than actions co-occurring rarely. Hence, the more frequently two actions co-occur, the smaller the semantic distance between them should be. Accordingly, we obtain a numeric measure of the semantic distance between actions A and B using the formula:

The use of the logarithm function is motivated by the properties of the statistical distribution of the action co-occurrence counts, which is an extreme case of a long-tail distribution. The maximal value is 37,377, the second largest value is 15,976, whereas 99% of the co-occurrence counts are below 1000, and the third quartile and median are only 12 and 3, respectively. The application of the logarithmic transformation gives a more even distribution within the range from 0 to 10.53, where the median is 1.1 and the third quartile is 2.48.

The semantic distances have been computed for 8,873 action pairs, while there are 43,660 possible pairs that can be composed from 296 actions. The absence of a distance for a pair of actions means that these actions never co-occurred in the same session with at most N separating actions between them (here, N=10). This, in turn, means that the actions are, most probably, not semantically related. The computed semantic distances range from 0.87 to 11; the median is 1.95 and the third quartile is 4.
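The scanning step and a possible weighting can be sketched as follows. The collection of pairwise distances follows the description above. The weighting in `semantic_distance`, however, is only an assumption: the text states that the median distance is inversely weighted by a logarithm of the co-occurrence count, and the form median / (1 + ln(count)) used here is one hypothetical way to do that, not the project's formula. All names and the toy sessions are likewise hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def pair_distances(sessions, max_separation=10):
    """Collect distances between co-occurring distinct actions, ignoring order."""
    distances = defaultdict(list)
    for actions in sessions:
        for i, first in enumerate(actions):
            # j - i equals the number of separating actions plus one,
            # so j runs up to i + max_separation + 1.
            for j in range(i + 1, min(i + max_separation + 2, len(actions))):
                second = actions[j]
                if first != second:
                    distances[frozenset((first, second))].append(j - i)
    return distances

def semantic_distance(observed):
    """ASSUMED weighting: median distance divided by (1 + ln(co-occurrence count))."""
    return median(observed) / (1.0 + math.log(len(observed)))

sessions = [["A", "B", "C"], ["B", "A"], ["A", "X", "B"]]
distances = pair_distances(sessions)
d_ab = semantic_distance(distances[frozenset(("A", "B"))])
d_ac = semantic_distance(distances[frozenset(("A", "C"))])
```

Using a `frozenset` as the key makes the measure symmetric by construction, so (A, B) and (B, A) co-occurrences land in the same list.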

Validating the approach to measuring the semantic relatedness

To validate our approach to quantifying the semantic relatedness of the actions, we applied the following idea. By means of density-based clustering, we put the actions in groups according to their proximity in terms of the computed semantic distances. We present the action clusters so obtained to a domain expert for a judgement of their semantic coherence. If a cluster is perceived as coherent and well interpretable, the expert is asked to give a common name to this group of actions. The ability of the domain expert to interpret the clusters and name them would signify that the distance measure can serve as a good numeric expression of the degree of semantic relatedness between the actions.

Clustering actions based on semantic proximity

For grouping the actions based on the semantic distances, we used the following procedure. We created a matrix of pairwise distances between actions. For the action pairs for which no distances had been computed, a very large constant value was put in the matrix. We used this distance matrix as an input to the density-based clustering algorithm OPTICS [28]. The purpose of the clustering was to find groups of close actions in terms of our distance measure. According to the idea of progressive clustering [29], we applied the clustering algorithm iteratively with different parameter settings, specifically, the neighbourhood radius R and the minimal number of neighbours Nmin for a core object of a cluster. The use of this procedure was motivated by the large variation of the inter-action distances caused by the large differences between the frequencies of the actions. In a case of large variation of distances between objects, a set of objects may contain clusters that greatly differ in their density and compactness. A single run of the clustering algorithm applying the same parameters throughout the whole object set either extracts only the densest clusters and misses important clusters that are less compact, or unites too many objects in a single cluster. The first type of outcome happens when R is small and/or the required Nmin is large. The second type of outcome happens when the parameters are more relaxed.

For this reason, we applied the progressive clustering procedure, by which we first extracted the densest clusters using a small value of R and then iteratively applied clustering to the remaining actions, increasing the neighbourhood radius in each step. We used the same value Nmin = 3 throughout the process as the total number of actions is relatively small (from the clustering perspective), and the clusters were not expected to contain many actions.

To be able to see, interpret, and evaluate the clustering results, we used a display with the actions represented by dots arranged in a two-dimensional layout using a projection algorithm, specifically, Sammon’s mapping [30], which was applied to the same matrix of the distances between the actions as the clustering algorithm. A result of clustering was represented by colouring the dots according to the cluster membership. Figure 39 shows the action projection display with the results of the first three steps of the progressive clustering procedure. The dots coloured in grey are the “noise”, i.e., the actions that were not put in any cluster by the algorithm. The other colours correspond to the clusters detected by the algorithm. After each step, we interactively filtered out the actions belonging to the clusters and then applied the clustering algorithm to the “noise” with a slightly larger value of the neighbourhood radius R than in the previous step.
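The progressive procedure can be sketched in a few lines. Note the substitution: instead of OPTICS, the sketch uses a minimal DBSCAN-style routine written inline, because it exposes the same two parameters (radius R and minimal neighbour count Nmin) without an external library. The function names, the one-dimensional toy points, and the chosen radii are assumptions for illustration only.

```python
def dbscan(points, dist, radius, min_neighbours):
    """Minimal DBSCAN-style clustering; `points` must be distinct, hashable values."""
    def neighbours(p):
        return [q for q in points if q != p and dist(p, q) <= radius]

    labels, clusters = {}, []
    for p in points:
        if p in labels:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_neighbours:
            continue  # not a core object; may still join a cluster as a border point
        cluster = [p]
        labels[p] = len(clusters)
        clusters.append(cluster)
        queue = list(seeds)
        while queue:
            q = queue.pop()
            if q in labels:
                continue
            labels[q] = labels[p]
            cluster.append(q)
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_neighbours:  # q is itself a core object
                queue.extend(q_neighbours)
    return clusters

def progressive_clustering(points, dist, radii, min_neighbours=3):
    """Cluster with increasing radii, each time only on the remaining noise."""
    remaining, found = list(points), []
    for radius in radii:
        clusters = dbscan(remaining, dist, radius, min_neighbours)
        found.extend(clusters)
        clustered = {p for cluster in clusters for p in cluster}
        remaining = [p for p in remaining if p not in clustered]
    return found, remaining

# Toy data: one dense group, one loose group, one outlier.
points = [0.0, 0.1, 0.2, 0.3, 10.0, 11.0, 12.0, 13.0, 50.0]
found, noise = progressive_clustering(points, lambda a, b: abs(a - b), radii=[0.35, 3.5])
```

With the small radius only the dense group is extracted; the loose group is found in the second step on the remaining points, and the outlier stays as noise.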

Figure 39 – Clustering results are viewed and explored using a two-dimensional projection of the set of actions based on the semantic distances between them. The images present the results of three steps of progressive density-based clustering with increasing values of the neighbourhood radius R.

For the assessment of the clustering results, we checked whether the cluster members are sufficiently close in the projection display. We could not expect the clusters to be well separated in the two-dimensional projection space because the projection stress (measure of the error in representing the original distances) is quite high (0.425). Still, we could expect that members of a cluster should not be disorderly scattered over the projection space but should be located relatively close to each other. We also assessed the similarity between the names of the actions included in a cluster. The names can be displayed on demand as dot labels in the projection display (the labels are not shown in Figure 39 for the sake of legibility). To see a cluster more clearly, we used interactive filtering to temporarily hide the noise and the other clusters. We also used the projection display for deciding whether the progressive clustering procedure should be continued. Thus, the presence of compact concentrations of non-clustered dots was an indicator that some clusters had not been detected yet and, hence, the procedure should be continued.

Figure 40 presents the final result of the progressive clustering procedure, in which we performed 8 successful clustering steps. A few steps were not successful, i.e., no clusters could be detected for the chosen value of R. In total, 233 actions (78.7%) were put in 37 clusters, and 63 actions (21.3%) remained out of the clusters.

Figure 40 – The full set of action clusters obtained in 8 steps of the progressive clustering procedure; 37 clusters in total. The grey dots are the “noise”, i.e., the actions that were not included in any cluster. The other colours represent cluster membership.


Domain expert judgement

The clusters of actions were presented to a domain expert with a good knowledge of the functionality of the LSS system, in which the actions are performed. Each cluster was presented as a simple list of actions. The expert was asked to judge whether the listed actions fit together and the group can be considered as a meaningful category. If so, the expert was asked to name this category.

The expert reviewed the clusters one by one. She judged all clusters as valid and coherent and gave meaningful names to all of them. The expert was very interested in where the action groups came from, because no categorisation of the actions existed before. She was quite surprised that the groups were obtained by means of automated analysis of the sessions.

There were three cases of two clusters having received the same name (since the expert viewed each cluster independently of the others, she might not have remembered what names had already been used). We presented these three pairs of clusters to the expert and asked whether the two clusters in each pair should be joined or considered separately. The expert reviewed the clusters again and replied that the pairs should be joined since the respective actions fit well together.

In this way, we have validated our measure of the semantic relatedness between the actions. This measure can now be used for detection of semantic disruptions in action sequences, i.e., cases when an action performed by the user is semantically unrelated to the previously performed actions. This kind of semantic anomaly may be related to an unforeseen use of the LSS system and thus may require specific examination by a human analyst.

Modelling User Behaviour

In order to model user behaviour, we investigated a Markov model describing action sequences. However, because of the structure of the tasks performed by executing subsequences of the 296 distinct actions, many action transitions are not present in the LSS sequence dataset containing 14360 sessions. This leads to sparse Markov matrices with around 96% of the matrix being 0, and thus to less meaningful models.

To circumvent this problem, we used the action clusters described above, assuming (in accordance with the expert feedback) that each cluster corresponds to a specific task. Transforming the action sequences to task sequences results in a reduction of sparsity to 75%. With this, a simple Markov model is built:
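A first-order Markov model over task sequences, together with the sparsity measure mentioned above, can be estimated from transition counts. This is a generic sketch, assuming a first-order model with maximum-likelihood estimates; the function names and the toy task labels are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov(sequences):
    """Maximum-likelihood first-order transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for state, row in counts.items()}

def sparsity(model, states):
    """Fraction of the |states| x |states| transition matrix that is zero."""
    non_zero = sum(len(row) for row in model.values())
    return 1.0 - non_zero / (len(states) ** 2)

# Toy task sequences (hypothetical task names).
sequences = [["search", "display", "search"], ["search", "update"]]
states = {"search", "display", "update"}
model = fit_markov(sequences)
zero_fraction = sparsity(model, states)
```

Replacing 296 actions by a few dozen tasks shrinks the matrix, which is why the share of unobserved transitions drops.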

However, this model does not take into account information about the user. Thus, we propose to use the user cluster as a feature, resulting in the Task Transition Model (TTM):

For novel users, the cluster is not known in advance. However, the user clustering is based on the user’s action sequences. Our idea is to use the actions of a novel user to predict the most probable cluster he or she belongs to. This Cluster Assignment Model (CAM) is again realised as a Markov model

predicting the cluster based on the user’s past tasks/actions. By looking back 4 tasks, the CAM achieves a top-2 accuracy of 91.9%. As a first approach, we combine the two models by first applying the CAM to the last 4 tasks observed:

We then use the cluster prediction thus obtained together with the current task observed to obtain a probability of the current task transition using the TTM (Figure 41).

Figure 41 – Combination of Cluster Assignment Model (CAM) and Task Transition Model (TTM) on a novel user’s task sequence. The tasks are obtained by replacing actions with the task assigned to the cluster the action belongs to.

In order to also take into account the uncertainty in cluster prediction, we propose a second variant. This variant does not use the most probable cluster but all clusters with non-zero probability and computes the likelihood of a task transition for all of them, weighting these probabilities with the respective cluster probabilities:
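The difference between the two variants can be sketched as follows. The data layout is an assumption: `cluster_probs` stands for the CAM output (cluster to probability) and `ttm` for per-cluster transition probabilities keyed by (previous task, task). Function names and numbers are hypothetical.

```python
def transition_probability_hard(cluster_probs, ttm, previous_task, task):
    """First variant: use only the most probable cluster."""
    best = max(cluster_probs, key=cluster_probs.get)
    return ttm[best].get((previous_task, task), 0.0)

def transition_probability_soft(cluster_probs, ttm, previous_task, task):
    """Second variant: weight each cluster's TTM probability by the cluster probability."""
    return sum(p * ttm[cluster].get((previous_task, task), 0.0)
               for cluster, p in cluster_probs.items() if p > 0)

# Hypothetical CAM output and per-cluster transition tables.
cluster_probs = {"c1": 0.7, "c2": 0.3}
ttm = {
    "c1": {("search", "display"): 0.8},
    "c2": {("search", "display"): 0.2},
}
hard = transition_probability_hard(cluster_probs, ttm, "search", "display")
soft = transition_probability_soft(cluster_probs, ttm, "search", "display")
```

In this toy case the hard variant reports 0.8, while the soft variant lowers the estimate to reflect the 30% chance that the user belongs to the other cluster.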

In order to test the models in terms of anomaly and attack prediction, we define an anomalous user as a user with randomly generated task sequences. The assumption is that the probability of task transitions is substantially higher for normal users than for anomalous users. By determining a threshold on the probability, anomalous users can be classified. In a first test using the hard cluster prediction and a threshold of 0, we were able to classify anomalous users from a balanced dataset with an accuracy of 92%, detailed in Figure 42.

actual/predicted      P        N
P                 18227     1231
N                  1998    17460

Figure 42 – Confusion matrix for the prediction of 19458 task transitions of normal users from the LSS datasets and an equal number of anomalous users generated by assuming random task transitions.
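The reported accuracy can be checked directly from the four cells of the confusion matrix. The variable names are hypothetical, and precision and recall are computed for the class labelled P, whatever that label denotes in the original experiment.

```python
# Rows are actual classes, columns are predicted classes (as in Figure 42).
tp, fn = 18227, 1231   # actual P: predicted P, predicted N
fp, tn = 1998, 17460   # actual N: predicted P, predicted N

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision_p = tp / (tp + fp)
recall_p = tp / (tp + fn)
```

Both rows sum to 19458, confirming the balanced dataset, and the accuracy comes out at roughly 0.917, which rounds to the 92% quoted in the text.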

Using the Task Transition Model, one can assign to each task transition a probability value for being a normal interaction. Thus, the TTM can be used to annotate actions or action sequences with an anomaly value, which can then be integrated into the visualisation. For example, anomalous sessions can be flagged and the transitions within an anomalous session responsible for the classification can be highlighted. The TTM can be integrated as a plug-in component into the visualisation architecture. This also makes it easy to exchange it with more advanced approaches that will be developed within the project.

4.2 Diversity Visual Analysis

4.2.1 State of the Art

In this section, we present the results of the literature review of visualisation techniques that we consider relevant for an effective visualisation of diversity information. We start our review with relevant general visualisation techniques, review a number of alternatives to display uncertainty and variation in data, and finally review work on the visual exploration of model variability and ensemble simulation results.

General visualisation techniques for displaying diversity

Although diversity analysis (as understood within DiSIEM as described in the review by Littlewood et al. [52]) has not been subject to many investigations in visualisation, the analysis of variety and the depiction of diversity, at a more generic level, is a problem considered frequently by visualisation researchers.

One common technique to display variety is to use small multiples, a technique that Jacques Bertin [31] describes as “series of graphics, showing the same combination of variables, indexed by changes in another variable”. They have been popularised by Edward Tufte, who refers to them as representations “... to present data in a dense fashion that supports comparison and enquiry” [32]. One of the most relevant examples of the use of small multiples to display diversity is within the visual analysis of data gathered from diverse biometric device setups [33], as illustrated in Figure 43. In this work, small multiples have been utilised to present the variation within different biometric device setups and the way they are evaluated through different evaluation metrics. The technique also incorporates interactivity to support decision makers in generating visual configurations that are tailored according to their analytical needs. Each visual display corresponds to a particular biometric device and numeric configuration (i.e., the source of diversity) and indicates how well that configuration supports the classification of the observations. The gridded, linear layout of the small multiples and the fact that all instances share the same visual representation facilitate effective comparison and a synoptic reading of overall variation in response to the encoded diversity.

Figure 43 – Small multiples employed in this work by Turkay et al. [33] in communicating the variations in the fingerprint matching performance within biometric device and matching algorithm combinations.

Small multiples have been used further as a visual exploration scheme in the “Small Multiples, Large Singles” approach [34]. In their work, the authors describe interaction techniques and visualisation regimes to structure the analysis process through the use of small multiples as the main visualisation technique. A further formalisation of the use of small multiples, and of how they can be incorporated through careful faceting of information, is given by Kehrer et al. [35]. We will incorporate these ideas and interaction frameworks when designing visualisations for the interactive exploration of diversity in our work.

Another general technique to analyse diversity is the incorporation of interaction and coordinated multiple view setups. Multiple linked views have been widely adopted in visualisation research [36]. Such systems often involve multiple displays of the different facets of the data, for example, scatterplots of data with different dimensions/features as the axes. Almost always, such systems also incorporate a linking & brushing mechanism [37], which refers to the capability that an interactive selection made in one view is immediately highlighted in the other views within the display, enabling analysts to visually relate multiple concepts and investigate their covariation in ways that algorithmic approaches might not immediately reveal. The combination of interaction and multiple views has recently seen further advances, such as the creation of visual summaries of data based on interactive user input [38], as illustrated in Figure 44.

Figure 44 – Advanced interaction mechanisms enable analysts to dynamically generate visual representations that make use of underlying computational tools. In this example, statistical summaries are dynamically computed and rendered in response to two different interaction patterns for two different cities, where each small multiple depicts the variation in a single data feature [38].

We will make use of such interactive techniques combined with multiple concurrent views of information within our visualisation solutions.
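To make the linking & brushing mechanism described above more concrete, the following is a minimal sketch of how a shared selection could be propagated between views. The `LinkedViews` class, its method names, and the sample records are hypothetical and introduced only for illustration; they are not part of any DiSIEM component.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class LinkedViews:
    """Hypothetical coordinator that shares one selection across several views."""
    records: List[Dict[str, float]]
    selection: Set[int] = field(default_factory=set)
    listeners: List[Callable[[Set[int]], None]] = field(default_factory=list)

    def register(self, on_change: Callable[[Set[int]], None]) -> None:
        # Each view registers a callback that re-highlights its own marks.
        self.listeners.append(on_change)

    def brush(self, x: str, y: str,
              x_range: Tuple[float, float],
              y_range: Tuple[float, float]) -> Set[int]:
        # Select the records that fall inside a rectangular brush drawn in one view.
        self.selection = {
            i for i, r in enumerate(self.records)
            if x_range[0] <= r[x] <= x_range[1] and y_range[0] <= r[y] <= y_range[1]
        }
        # Propagate the same record indices to every registered view.
        for notify in self.listeners:
            notify(self.selection)
        return self.selection


# Illustrative data: three records with three features each (made-up values).
records = [
    {"a": 1.0, "b": 5.0, "c": 0.2},
    {"a": 2.0, "b": 1.0, "c": 0.9},
    {"a": 3.0, "b": 4.0, "c": 0.4},
]
views = LinkedViews(records)
highlighted_in_other_view = []
views.register(lambda sel: highlighted_in_other_view.append(sorted(sel)))

# Brushing in the (a, b) scatterplot highlights the same records elsewhere.
selected = views.brush("a", "b", (0.5, 3.5), (3.0, 6.0))
```

The design choice in this sketch is that views exchange record indices rather than coordinates, so any view, whatever features it shows, can highlight the same items.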

Uncertainty and variability visualisation

Within the visual display of diverse information, the effective communication of uncertainty is of paramount importance. Uncertainty in the data and the results of a particular analysis can emanate from various sources and can manifest itself in different forms. MacEachren et al. [39] note that “(uncertainty) covers a broader range of doubt or inconsistency than error alone” and discuss it as a critical concept with significant impact on decision-making in several fields. The paper by Thomson et al. [40] on a typology of uncertainty and the review of techniques for the visualisation of uncertainty by Brodlie et al. [41] provide good high-level guidance in approaching the uncertainty visualisation problem.

When the kinds of data are considered, the recent empirical work by MacEachren et al. [42], which investigates the scope and effectiveness of different visual variables to encode uncertain information, provides valuable guidance in designing views to represent the variation and uncertainty in the diverse data we analyse within DiSIEM.


Figure 45 – A taxonomy of visual variables used in the depiction of uncertainty in visual representations [42].

Recent advances in visualisation research inform us about the available techniques for displaying uncertainty. Even the widely adopted conventional representations of uncertainty in data and/or statistics are being questioned by researchers. Correll and Gleicher [43] present empirical evidence that alternative representations of conventional “error bars” (as depicted in Figure 46) are more effective in communicating uncertainty and variance in the data.

Figure 46 – Alternative chart designs have been offered and evaluated for the depiction of variation and uncertainty in aggregated statistics, challenging the ways that such information is communicated conventionally [43].

These investigations into the alternative representations of uncertain information and the reviews of visualisation techniques will inform our designs when implementing representations of diversity within this work package. Given that we will have several conditions and alternatives to be compared which are ranked by probabilistic models (that are inherently “uncertain” by nature), such representations will play a critical role in decision making for SIEM users.

Model visualisation / parameter space analysis

One strand of work with significant relevance is the investigation of model outcomes and the analysis of parameter spaces. As detailed above under the Diversity Visualisation requirement analysis, we will be generating estimations of effectiveness using several models that take several inputs as parameters.

Within the context of numerical/statistical modelling, there are several examples where visualisation and visual analytics have been applied successfully. Sedlmair et al. [44] generalise a subset of these works and present a framework for visual parameter analysis, in which they describe an elaborate data flow strategy and suggest strategies to navigate the space of parameters with guidance from visualisation. Afzal et al. [45] use a combination of interactive spatio-temporal visualisations and a decision history representation to support epidemiology model building. In their work, visualisation is a critical element to compare and evaluate different models and the responses given to them. In Vismon [46], the authors designed a visualisation system to help fishery managers better model and better understand the uncertainties in the data and the computations. Torsney-Weir et al. [47] suggest a systematic parameter investigation process through visualisation in order to improve image segmentation models.

There are also methods that support feature selection tasks through visualisation, a particular kind of task that we aim to address within DiSIEM when exploring the large parameter spaces of the probabilistic models. Krause et al.'s method [48] visually represents several cross-validation runs and gives an indication of how important particular features are for classification purposes, as depicted in Figure 47.

Figure 47 – The importance of a large number of features within classifier models is depicted following an ensemble run and cross-validation of models to inform feature selection tasks in this work by Krause et al. [48].

Notice here that small, glyph-like visual representations, each mapping to a single feature, create a visual overview of model performance and its interaction with the features. The authors observed that involving the user in the model building process leads to easier-to-interpret models.

The body of work investigated within this section will inform the design and development work that involves the incorporation of modelling within the visual analysis cycle.

4.2.2 Initial Design Investigations and Sketches

During and following the requirement analysis and design activities described in Chapter 3, we developed a number of design sketches and initial investigations that will guide further development within Diversity Visualisation. Note, however, that the design prototypes are in a much earlier state compared to the prototypes presented in the previous section. Hence, in the following we present early discussions and paper/sketch prototype ideas we have developed so far.

As discussed earlier, the focus within these activities is on understanding the variations in the distributions of alerts in response to the combinations of monitoring tools, with the ultimate goal of making better decisions both in evaluating signals and in setting up infrastructure.

Initial Design Sketches

As a result of the discussions at the brainstorming and workshop sessions, we identified a workflow that takes place over three stages. In the following, we list these stages and document some of the early sketches. Wherever suitable, we also describe potential prototypes under this section to highlight some of the promising areas of further investigation we will consider within the continuation of the project.

Stage 1: Analysis of alert distributions and sensor configurations

At this stage, no information on whether the alerts are real attacks or not is yet available. Because of this, the analysis is purely exploratory at this stage and has the goal of gaining an overview of how the alerts are distributed over time and over different configurations, and of identifying trends and outliers. The primary objective of this stage is in line with “Goal 1: Overview of alerts” as presented above in Section 3.2.2.

Figure 48 – Sketch prototype to address the common tasks identified within Stage 1 of the diversity analysis process.
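As an illustration of the Stage 1 overview, the sketch below counts alerts per time bin and splits them into one panel per facet value, which is the data a small-multiples display would need, and applies an optional filter of the kind an analyst might set interactively. The record layout `(hour, sensor, alert_type)`, the sensor names, and the `facet_counts` helper are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical alert records: (hour, sensor, alert_type).
alerts = [
    (0, "ids_a", "scan"), (0, "ids_b", "scan"), (1, "ids_a", "login"),
    (1, "ids_a", "scan"), (2, "ids_b", "login"), (2, "ids_b", "login"),
]


def facet_counts(alerts, facet_index, bin_index=0, keep=lambda alert: True):
    """Count alerts per time bin, with one series (panel) per facet value."""
    panels = defaultdict(Counter)
    for alert in alerts:
        if keep(alert):  # stands in for an interactive filter
            panels[alert[facet_index]][alert[bin_index]] += 1
    # Sort bins so that each panel can be drawn along a shared time axis.
    return {facet: dict(sorted(counts.items())) for facet, counts in panels.items()}


# One panel per sensor, then the same view restricted to "scan" alerts.
by_sensor = facet_counts(alerts, facet_index=1)
scans_by_sensor = facet_counts(alerts, facet_index=1, keep=lambda a: a[2] == "scan")
```

Changing `facet_index` to 2 would facet by alert type instead, and re-running the function with a new `keep` predicate corresponds to the immediate visual update after a filter change.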


Initial design sketches involve more conventional visual representations such as stacked bar charts and histograms. One aspect we decided to incorporate is to provide additional faceting options, such as over time, over sensor types, and over alert types, using primarily small multiples as discussed in the literature review (Section 4.2.1). A very early sketch prototype can be found in Figure 48. A highly critical functionality that this design will entail is the ability to interactively filter according to various conditions, such as time, sensor type, etc., and to get immediate feedback through a visualisation update.

Potential prototype: Temporal Venn Diagrams. One recurring theme in this stage is the requirement to investigate the alert distributions over time and, in particular, their distribution into the different alert configurations, which can be considered as sets. Conventional visualisations of sets often deal with static set membership setups. However, in this context, an important factor that feeds into how the sensor combinations are evaluated is their behaviour over time. Within this prototype, we will investigate how membership changes over time can best be visualised. We plan to use existing visualisations of sets as a starting basis, such as the UpSet technique [49] (see Figure 49), which is effective in displaying visual overviews of sets, albeit static ones.

Figure 49 – Linearised representations have been adopted to represent set membership relations, with accompanying statistics on sets visualised in an integrated manner, in the UpSet technique by Lex et al. [49]. We will consider such representations within the analysis for Stage 1 with the aim of extending them for temporal variations.
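One possible way to prepare data for a temporal extension of such set visualisations is to count, for each time window, how many events were flagged by each exact combination of sensors. The sketch below does this under stated assumptions: the observation layout `(time_window, event_id, sensor)` and the sample values are hypothetical, and the grouping into exclusive intersections is our reading of the linearised set layout rather than a reimplementation of UpSet.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (time_window, event_id, sensor that raised an alert).
observations = [
    ("w1", "e1", "A"), ("w1", "e1", "B"), ("w1", "e2", "A"),
    ("w2", "e3", "A"), ("w2", "e3", "B"), ("w2", "e4", "B"), ("w2", "e5", "B"),
]


def exclusive_intersections(observations):
    """Per time window, count events by the exact set of sensors that flagged them."""
    sensors_per_event = defaultdict(set)
    for window, event, sensor in observations:
        sensors_per_event[(window, event)].add(sensor)

    counts = defaultdict(Counter)
    for (window, _event), sensors in sensors_per_event.items():
        # frozenset makes the sensor combination usable as a dictionary key.
        counts[window][frozenset(sensors)] += 1
    return counts


counts = exclusive_intersections(observations)
```

Each window then yields one row of intersection sizes, so comparing rows across windows shows how set membership shifts over time.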

Stage 2: Analysing (labelled) alerts for model investigation

In this stage of the investigation, we will be considering data that also has labels associated with the alerts, i.e., data on whether the alert is associated with a real attack or not. The primary objective of this stage is in line with “Goal 2: Interactive exploration of sensor configurations” as presented above in Section 3.2.2. This data enables us to generate confusion matrices (additional statistics on the performance of the configurations, see Figure 50), which we will associate with the visualisations of the distributions and with more conventional visual representations of model performance such as receiver operating characteristic (ROC) curves (see footnote 27), as exemplified by Figure 51. Within the visual representations under this stage, we will look for designs that enable an enhanced comparison of several different combinations of sensors.

Figure 51 – An example ROC curve that depicts the results of the diversity analysis activities held within WP3.

An initial sketch of a potential prototype is shown in Figure 52. Here, we envision a combination of confusion matrices that are interactively generated in response to the interactive filtering operations carried out by an analyst. The interactive ROC curve will display the selected configurations; however, in contrast to more conventional ROC curves, it will incorporate information on the temporal variation.
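The idea of confusion matrices that are regenerated after each filtering operation can be sketched as follows. The record layout `(day, alerted_by_configuration, is_real_attack)`, the sample values, and the `confusion_matrix` helper are assumptions for illustration; an actual prototype would read labelled alerts from the project data instead.

```python
from collections import Counter

# Hypothetical labelled records: (day, alerted_by_configuration, is_real_attack).
records = [
    (1, True, True), (1, True, False), (1, False, False),
    (2, False, True), (2, True, True), (2, False, False),
]


def confusion_matrix(records, keep=lambda record: True):
    """Return TP/FP/FN/TN counts for the records that pass the filter."""
    cells = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for record in records:
        if not keep(record):
            continue  # record excluded by the analyst's current filter
        _day, alerted, attack = record
        if alerted:
            cells["TP" if attack else "FP"] += 1
        else:
            cells["FN" if attack else "TN"] += 1
    return dict(cells)


overall = confusion_matrix(records)
# Re-computed in response to a filter, here "only day 2".
day_two = confusion_matrix(records, keep=lambda r: r[0] == 2)
```

Computing one such matrix per sensor configuration under the same filter would give the side-by-side comparison that this stage aims for.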

27 https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Figure 50 – A depiction of a confusion matrix.


Figure 52 – Sketch prototype to address the common tasks identified within Stage 2 of the diversity analysis process.

A potential idea to investigate here is to adapt novel visual representations of time such as Time Curves [50] or connected scatterplots [51], as shown in Figure 53.

Figure 53 – Alternative representations such as Time Curves [50] and Connected Scatterplots [51] will be considered as alternatives for displaying variation over time within more conventional ROC curves.

Potential prototype: Dynamic interactive ROC curves. One potential idea for an influential prototype emerged during the discussions: a highly interactive, enhanced ROC curve that will transform the existing analysis done with the help of ROC curves. Some key features will include:

- Display temporal variation: investigating whether the optimal configuration remains stable, i.e., whether it stays optimal over a period of time

- Enable the dynamic generation of ROC curves: curves generated based on a subset of cases, for instance, data/samples that are associated with a specific attack, or only data from a time period (e.g., “last week”), or maybe those cases where there is most disagreement within the configurations
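The two features listed above can be sketched by computing one ROC curve per subset of the data, here one per time window. The sample layout `(week, score, is_attack)`, the assumption that a configuration produces a numeric score per sample, and the helper names are all hypothetical; the ROC construction itself is the standard threshold sweep, and the area is estimated with the trapezoidal rule.

```python
def roc_points(scored):
    """ROC points (FPR, TPR) from (score, is_attack) pairs, one per distinct threshold.

    Assumes the subset contains at least one attack and one non-attack;
    otherwise the rates are undefined and the divisions below would fail.
    """
    positives = sum(1 for _score, label in scored if label)
    negatives = len(scored) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted({score for score, _label in scored}, reverse=True):
        tp = sum(1 for score, label in scored if score >= threshold and label)
        fp = sum(1 for score, label in scored if score >= threshold and not label)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def roc_by_window(samples):
    """Group samples by time window and build one ROC curve per window."""
    windows = {}
    for week, score, label in samples:
        windows.setdefault(week, []).append((score, label))
    return {week: roc_points(scored) for week, scored in windows.items()}


# Hypothetical samples: (week, score, is_attack).
samples = [
    ("w1", 0.9, True), ("w1", 0.8, False), ("w1", 0.7, True), ("w1", 0.1, False),
    ("w2", 0.9, True), ("w2", 0.6, True), ("w2", 0.4, False), ("w2", 0.2, False),
]
curves = roc_by_window(samples)
auc_by_week = {week: auc(points) for week, points in curves.items()}
```

Grouping by attack type, or by the cases on which configurations disagree, would only require a different grouping key in `roc_by_window`; comparing `auc_by_week` across windows gives a first indication of whether a configuration stays optimal over time.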

Stage 3: Prediction (Uncertainty visualisation for models)

This stage of the investigation involves combining the modelling outputs of the analysis with the aim of evaluating the forecasts for future potential vulnerabilities. The primary input to this stage will be the forecasts and the parameter spaces of the probabilistic models. The goal is to present several models and their forecasts in a comparative manner by communicating the variation within the parameter space and the uncertainty associated with the forecasts. The primary objective of this stage is in line with “Goal 3: Analysis and evaluation of model ensembles” as presented above in Section 3.2.2.

The following rough sketch (Figure 54) highlights some of the basic functionality we envision as a potential first step. In this prototype (as developed during the brainstorming sessions), we display a number of alternatives of the modelling outputs and the associated data that led to these models in a joint interface. This approach investigates ways of visualising the variability and the uncertainty in the models, drawing design ideas from the literature review on uncertainty visualisation in Section 4.2.1.

Figure 54 – Sketch prototype to address the common tasks identified within Stage 3 of the diversity analysis process.


Potential prototype: Dual analysis for integrated interactive modelling. The investigation of the parameter space along with the modelling outputs presents an interesting challenge and opportunity for visualisation. One of the visions we have within DiSIEM is to develop visual analytics methods to enhance and improve the model building process (one of our tasks within Task 5.2 in WP5). In order to achieve this, we will investigate two alternatives within this prototype: 1) parameter space exploration of ensemble modelling and 2) integrated interactive modelling.

Within the first approach, the workflow will involve a post-hoc visual investigation of a variety of model parameters and the results of the models. In this scheme, the goal is to run the models with as many parameter combinations as possible and visualise this parameter space along with the results space. Such dual-analysis approaches are effective in understanding the variations in the model outcomes and their relationships with the model parameters.

In the second, more interactive analysis scheme, our approach will be to integrate the computational models as interactive processes that can run in an immediate mode in response to the analysts’ inputs. This is a much more explorative and flexible approach compared to the first scheme and enables analysts to “craft” parameter and/or data domain combinations while getting the model results immediately.

We will also investigate ways in which these two schemes can work in harmony, and we will aim to provide a more comprehensive approach.

Potential prototype: Visually correlating model results with other diverse data, e.g., OSINT. In addition to the diversity analysis data, DiSIEM will be gathering and modelling several other data sources. One of the innovations we envision is the gathering and modelling of OSINT data (as described in D4.1), and we will develop visualisation methods to integrate the data and the threat prediction results coming from WP4. The goal here is to visually correlate the changes in the threat predictions and changes in the underlying raw data (e.g., changes in the topics of discussions on Twitter or the forums) with the predictions and models of alerts emanating from the diversity analysis. Our vision is to investigate whether analysts can observe pointers in OSINT-driven signals that can help in understanding and evaluating the models better, thus improving the decision-making process both in the evaluation of forecasts and in making infrastructure changes. To address these, we will design integrated views that communicate changes in OSINT-driven data in combination with past trends in alerts and the modelling outputs.
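Returning to the first scheme of the dual-analysis prototype (running the models over many parameter combinations and viewing the parameter space next to the results space), a minimal sketch could look as follows. The `toy_model` function, its parameter names `rate` and `threshold`, and the grid values are placeholders invented for this example; they do not correspond to the probabilistic models developed in the project.

```python
from itertools import product


def toy_model(rate, threshold):
    """Placeholder for a probabilistic model; returns a made-up score."""
    return round(rate * (1.0 - threshold), 3)


# Hypothetical parameter grid: every combination of these values is evaluated.
parameter_grid = {"rate": [0.1, 0.5], "threshold": [0.2, 0.8]}


def sweep(model, grid):
    """Run the model for every parameter combination and pair inputs with outputs."""
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append({"params": params, "result": model(**params)})
    return runs


runs = sweep(toy_model, parameter_grid)

# Two linked tables for a dual view: one row per run in each space.
parameter_space = [run["params"] for run in runs]
result_space = [run["result"] for run in runs]
best = max(runs, key=lambda run: run["result"])
```

Because row `i` of `parameter_space` and row `i` of `result_space` describe the same run, a selection made in one space can be highlighted in the other, which is the linkage a dual-analysis view relies on. In the second, interactive scheme, `model(**params)` would instead be called directly in response to an analyst's input.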


5 Summary and Conclusions

In this report, we documented the result of the activities carried out within WP5 (primarily as part of Task 5.1) to define the scope of the visualisation design and development that is going to take place for the rest of the project. We started our investigations with a review of available visualisation technologies and discussed them in terms of their suitability to DiSIEM. We followed this technical review with a report of the in-depth user requirement and domain understanding activities we have conducted. We are following a user-centred, agile, prototype-led iterative design and development methodology, and we introduced the various user-centred methods we used in this project. As a result of the discussions with the whole consortium, we identified a number of use cases where visualisation can be most influential and important; we discussed these use cases in depth and presented our findings from the user-centred design activities. The results of these activities are well-defined domain-specific tasks, their high-level abstractions, and analytical goals, which we are using, and will continue to use, as guidelines for development and implementation. In the final part of the report, we documented our initial design activities and presented the first prototypes and design sketches.

The report will serve as a guideline document to drive the further development that is happening within WP5. The presented analytical tasks and goals will inform our designs, the analytical tools used, and the functionalities implemented in the second year of the project, and will provide us with a scope for evaluation and validation when we move into the final, deployment phase of the project.


References

[1] T. Munzner, Visualization analysis and design. books.google.com, 2014.

[2] G. Ellis and F. Mansmann, “Mastering the information age solving problems with visual analytics,” Eurographics, 2010.

[3] T. Munzner, “A nested model for visualization design and validation,” IEEE Trans Vis Comput Graph, vol. 15, no. 6, pp. 921–928, Dec. 2009.

[4] M. Sedlmair, M. Meyer, and T. Munzner, “Design study methodology: Reflections from the trenches and the stacks,” IEEE Transactions on Visualization and Computer Graphics, 2012.

[5] M. Brehmer and T. Munzner, “A multi-level typology of abstract visualization tasks,” IEEE Transactions on Visualization and Computer Graphics, 2013.

[6] M. Meyer, M. Sedlmair, and T. Munzner, “The four-level nested model revisited: blocks and guidelines,” Proceedings of the 2012 BELIV Workshop: Beyond Time and Errors - Novel Evaluation Methods for Visualization, 2012.

[7] J. S. Yi, Y. A. Kang, J. Stasko, and J. Jacko, “Toward a deeper understanding of the role of interaction in information visualization,” IEEE Trans Vis Comput Graph, vol. 13, no. 6, pp. 1224–1231, Dec. 2007.

[8] A. Endert, W. Ribarsky, C. Turkay, B. L. W. Wong, I. Nabney, I. D. Blanco, and F. Rossi, “The State of the Art in Integrating Machine Learning into Visual Analytics,” Computer Graphics Forum, Mar. 2017.

[9] W. Aigner, S. Miksch, H. Schumann, and C. Tominski, Visualization of Time-Oriented Data. London: Springer London, 2011.

[10] C. Plaisant, R. Mushlin, A. Snyder, J. Li, D. Heller, and B. Shneiderman, “LifeLines: using visualization to enhance navigation and analysis of patient records,” Proc AMIA Symp, pp. 76–80, 1998.

[11] C. Plaisant, B. Milash, A. Rose, S. Widoff, and B. Shneiderman, “LifeLines: visualizing personal histories,” Proceedings of the SIGCHI conference on Human factors in computing systems, 1996.

[12] P. H. Nguyen, K. Xu, R. Walker, and B. W. Wong, “SchemaLine: timeline visualization for sensemaking,” in Information Visualisation (IV), 2014 18th International Conference on, 2014, pp. 225–233.

[13] P. H. Nguyen, K. Xu, R. Walker, and B. W. Wong, “TimeSets: Timeline visualization with set relations,” Information Visualization, vol. 15, no. 3, pp. 253–269, 2016.

[14] K. Wongsuphasawat, J. A. Guerra Gómez, C. Plaisant, T. D. Wang, M. Taieb-Maimon, and B. Shneiderman, “LifeFlow: visualizing an overview of event sequences,” in Proceedings of the SIGCHI conference on human factors in computing systems, 2011, pp. 1747–1756.

[15] M. Monroe, R. Lan, H. Lee, C. Plaisant, and B. Shneiderman, “Temporal event sequence simplification,” IEEE Trans Vis Comput Graph, vol. 19, no. 12, pp. 2227–2236, Dec. 2013.

[16] D. Gotz and H. Stavropoulos, “DecisionFlow: Visual Analytics for High-Dimensional Temporal Event Sequence Data,” IEEE Trans Vis Comput Graph, vol. 20, no. 12, pp. 1783–1792, Dec. 2014.

[17] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proceedings of the Eleventh International Conference on Data Engineering, 1995, pp. 3–14.

[18] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” in International Conference on Extending Database Technology, 1996, pp. 1–17.

[19] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern mining using a bitmap representation,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’02, New York, New York, USA, 2002, p. 429.

[20] Jian Pei, Jiawei Han, B. Mortazavi-Asl, Jianyong Wang, H. Pinto, Qiming Chen, U. Dayal, and Mei-Chun Hsu, “Mining sequential patterns by pattern-growth: the PrefixSpan approach,” IEEE Trans Knowl Data Eng, vol. 16, no. 11, pp. 1424–1440, Nov. 2004.

[21] M. N. Garofalakis, R. Rastogi, and K. Shim, “SPIRIT: Sequential pattern mining with regular expression constraints,” VLDB, 1999.

[22] J. Pei, J. Han, and W. Wang, “Constraint-based sequential pattern mining in large databases.”

[23] J. Pei, J. W. Han, and W. Wang, “Constraint-based sequential pattern mining in large databases,” Proc. 2002 Int’l Conf. Information and Knowledge Management (CIKM ’02), 2002.

[24] Z. Liu, Y. Wang, M. Dontcheva, M. Hoffman, S. Walker, and A. Wilson, “Patterns and sequences: interactive exploration of clickstreams to understand common visitor paths,” IEEE Trans Vis Comput Graph, vol. 23, no. 1, pp. 321–330, Jan. 2017.

[25] Z. Liu, B. Kerr, M. Dontcheva, J. Grover, M. Hoffman, and A. Wilson, “CoreFlow: Extracting and Visualizing Branching Patterns from Event Sequences,” 2017.

[26] J. B. Kruskal and J. M. Landwehr, “Icicle plots: Better displays for hierarchical clustering,” The American Statistician, vol. 37, no. 2, pp. 162–168, 1983.

[27] H. Guo, S. R. Gomez, C. Ziemkiewicz, and D. H. Laidlaw, “A case study using visualization interaction logs and insight metrics to understand how analysts arrive at insights,” IEEE Trans Vis Comput Graph, vol. 22, no. 1, pp. 51–60, Jan. 2016.

[28] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure,” ACM Sigmod record, 1999.

[29] S. Rinzivillo, D. Pedreschi, M. Nanni, F. Giannotti, N. Andrienko, and G. Andrienko, “Visually driven analysis of movement data by progressive clustering,” Inf Vis, vol. 7, no. 3–4, pp. 225–239, 2008.


[30] J. W. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Transactions on computers, 1969.

[31] J. Bertin, Graphics and graphic information processing. books.google.com, 1981.

[32] E. Tufte and P. Graves-Morris, “The visual display of quantitative information.; 1983,” 2014.

[33] C. Turkay, S. Mason, I. Gashi, and B. Cukic, “Supporting Decision-Making for Biometric System Deployment through Visual Analysis,” Proceedings of the 2014 IEEE International Symposium on Software Reliability Engineering Workshops, 2014.

[34] S. van den Elzen and J. J. van Wijk, “Small Multiples, Large Singles: A New Approach for Visual Data Exploration,” Computer Graphics Forum, vol. 32, no. 3pt2, pp. 191–200, Jun. 2013.

[35] J. Kehrer, H. Piringer, W. Berger, and M. E. Gröller, “A model for structure-based comparison of many categories in small-multiple displays,” IEEE Trans Vis Comput Graph, vol. 19, no. 12, pp. 2287–2296, Dec. 2013.

[36] J. C. Roberts, “State of the art: Coordinated & multiple views in exploratory visualization,” Coordinated and Multiple Views in Exploratory Visualization, 2007. CMV '07. Fifth International Conference on, 2007.

[37] H. Piringer, R. Kosara, and H. Hauser, “Interactive focus+context visualization with linked 2d/3d scatterplots,” Proceedings of the Second International Conference on Coordinated and Multiple Views in Exploratory Visualization, 2004.

[38] C. Turkay, A. Slingsby, H. Hauser, J. Wood, and J. Dykes, “Attribute signatures: dynamic visual summaries for analyzing multivariate geographical data,” IEEE Trans Vis Comput Graph, vol. 20, no. 12, pp. 2033–2042, Dec. 2014.

[39] A. M. MacEachren, A. Robinson, S. Hopper, et al., “Visualizing geospatial information uncertainty: What we know and what we need to know,” Cartography and Geographic Information Science, 2005.

[40] J. Thomson, E. Hetzler, A. MacEachren, et al., “A typology for visualizing uncertainty,” SPIE Proceedings, 2005.

[41] K. Brodlie, R. A. Osorio, and A. Lopes, “A review of uncertainty in data visualization,” Expanding the frontiers of visual analytics and visualization, 2012.

[42] A. M. MacEachren, R. E. Roth, J. O’Brien, et al., “Visual semiotics & uncertainty visualization: An empirical study,” IEEE Transactions on Visualization and Computer Graphics, 2012.

[43] M. Correll and M. Gleicher, “Error bars considered harmful: exploring alternate encodings for mean and error,” IEEE Trans Vis Comput Graph, vol. 20, no. 12, pp. 2142–2151, Dec. 2014.

[44] M. Sedlmair, C. Heinzl, S. Bruckner, H. Piringer, and T. Möller, “Visual parameter space analysis: A conceptual framework,” IEEE Trans Vis Comput Graph, vol. 20, no. 12, pp. 2161–2170, Dec. 2014.

[45] S. Afzal, R. Maciejewski, and D. S. Ebert, “Visual analytics decision support environment for epidemic modeling and response evaluation,” 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), 2011.

[46] M. Booshehrian, T. Möller, R. M. Peterman, et al., “Vismon: Facilitating Analysis of Trade-Offs, Uncertainty, and Sensitivity In Fisheries Management Decision Making,” Computer Graphics Forum, 2012.

[47] T. Torsney-Weir, A. Saad, T. Möller, et al., “Tuner: Principled parameter finding for image segmentation algorithms using visual response surface exploration,” IEEE Transactions on Visualization and Computer Graphics, 2011.

[48] J. Krause, A. Perer, and E. Bertini, “INFUSE: interactive feature selection for predictive modeling of high dimensional data,” IEEE Transactions on Visualization and Computer Graphics, 2014.

[49] A. Lex, N. Gehlenborg, H. Strobelt, R. Vuillemot, and H. Pfister, “UpSet: visualization of intersecting sets,” IEEE Trans Vis Comput Graph, vol. 20, no. 12, pp. 1983–1992, Dec. 2014.

[50] B. Bach, C. Shi, N. Heulot, T. Madhyastha, T. Grabowski, and P. Dragicevic, “Time curves: folding time to visualize patterns of temporal evolution in data,” IEEE Trans Vis Comput Graph, vol. 22, no. 1, pp. 559–568, Jan. 2016.

[51] S. Haroz, R. Kosara, and S. L. Franconeri, “The connected scatterplot for presenting paired time series,” IEEE Trans Vis Comput Graph, vol. 22, no. 9, pp. 2174–2186, 2016.

[52] B. Littlewood, P. Popov, and L. Strigini, “Modeling software design diversity: a review,” ACM Computing Surveys (CSUR), vol. 33, no. 2, pp. 177–208, 2001.