Apr 14, 2017
Oracle Big Data
THE INTELLIGENCE LIFE-CYCLE
and Schema-Last Approach
Dr Neil Brittliff PhD
A little about myself
Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
Currently employed as Data Scientist within the Australian Government
Have been employed by 5 law enforcement agencies
Developed Cryptographic Software to support the Australian Medicare System
First used Oracle products back in 1986
Worked in the IT industry since 1982
Resides in Canberra (capital of Australia)
Canberra is the only capital city in Australia that is not named after a person
Interests
Tennis (play) / Cricket (watch)
Bushwalking and camping
Piano Playing (very bad)
Making stuff out of wood
Enjoys the art of Programming (prefers the C language)
Pushing the limits of the Raspberry Pi
Thanks
Vladimir Videnovic
Richard Foote
Vicky Faulkner
Dharmendra Sharma (my PhD supervisor)
Introduction
About myself: worked for 5 law enforcement agencies
University of Canberra - 2015
Talk Structure
Motivation
Principles and Constraints
Intelligence Life-Cycle
Collect & Collate
Analyse & Produce
Report & Disseminate
Motivation
Research
What is a Schema
The Problem with ETL
Data Cleansing versus Data Triage
A New Architecture Oracle Big Data
The Schema-Last Approach
Indexing Technologies and Exploitation
User Reaction
Observations and Opportunities
The AICM (The Australian Intelligence Criminal Model)
These are the components we will focus on:
Collect & Collate
Analyse & Produce
National Criminal Intelligence
The Law Enforcement community is also in the business of collecting and analysing criminal intelligence and data and, where possible, sharing the resulting information.
To do this, they need rich, contemporary, and comprehensive criminal intelligence.
The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.
The Fusion capability identifies threats and vulnerabilities through the use of data. It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.
Intelligence is an integral part of the ACC remit and is used to identify new criminal targets and monitor existing known targets. The intelligence cycle is the process of developing unrefined data from multiple data sources and then analysing the fused data. The ACC and many other law enforcement agencies see that Big Data enables them to collect, store and process data at an unprecedented rate that is only going to increase. An integral part of the intelligence cycle is the collection and processing of raw data. In addition, the scale, complexity and changing nature of intelligence data can make it impossible to stay in front without the aid of technology to collect, process and analyse big data.
Australian Institute of Criminology
While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.
Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.
Darren Quick and Kim-Kwang Raymond Choo
Australian Institute of Criminology, September 2014
The Australian Institute of Criminology is Australia's national research and knowledge centre on crime and justice.
Objectives
Support the Australian Intelligence Criminal Model
Simple interface to exploit the data
Data ingestion must be simple to do and minimise transformation
Support the large variety of data sources
Fast ingestion and retrieval times
Enable exact and fuzzy searching
Support Identity Resolution
Support metadata
Maintain the data's integrity
Preserve Data-Lineage/Provenance
Reproduce the ingested data source exactly!
We don't want this!
The Skywhale is a hot air balloon designed by the sculptor Patricia Piccinini as part of a commission to mark the centenary of the city of Canberra. It was built by Cameron Balloons in Bristol, United Kingdom, and first flew in Australia in 2013. The balloon's design received a mixed response after it was publicly unveiled in May 2013. The cost of the balloon and the arrangements under which it was funded also attracted criticism. The executive director of culture for the ACT Chief Minister's directorate informed the media on 9 May that the balloon and its supporting website cost about $170,000. Documents released the next day showed that the total cost to the government of commissioning and operating The Skywhale over its lifespan will be $300,000, and the philanthropic Aranday Foundation will provide a further $50,000. Moreover, the balloon will remain the property of the Melbourne-based company Global Ballooning and only one flight was scheduled for Canberra.
The Intelligence Life-Cycle
Plan, prioritise & direct
Collect & collate
Analyse & produce
Report & disseminate
Evaluate & review
The central focus of the intelligence life-cycle is data and data exploitation. The life-cycle begins with the identification of possible data sources and the collection and collation of the data. Next come the analysis of this data and the application of models to it, then the production and dissemination of situation reports, and finally an evaluation and review of the entire life-cycle. However, as will be shown, the life-cycle must deal with:
messy and noisy data
structured, semi-structured, and unstructured data
tabular and highly linked data
Cross-Industry Standard Process for Data Mining (CRISP-DM)
According to CRISP-DM, a given data mining project has a life cycle consisting of six phases. The phase sequence is adaptive; that is, the next phase in the sequence often depends on the outcomes associated with the preceding phase. There may be a further data preparation phase for refinement before moving forward to the model evaluation phase. The six phases are as follows:
Business understanding phase
The first phase in the CRISP-DM standard process may also be termed the research understanding phase.
Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole.
Translate these goals and restrictions into the formulation of a data mining problem definition.
Prepare a preliminary strategy for achieving these objectives.
Data understanding phase
Collect the data.
Use exploratory data analysis to familiarize yourself with the data, and discover initial insights.
Intelligence Data Source Classification
Collect & collate
Analyse & produce
Low Signal: usually list data that has no criminal significance
High Signal: the opposite, list data that may be significant to an investigation
Variety was the real problem
Some Definitions:
A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data.
Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience.
Collect & Collate
Schema is from the Greek word meaning `form' or `figure' and is a formal representation of a data model which has integrity constraints controlling permissible data values.
Data munging, sometimes referred to as data wrangling, means taking data that's stored in one format and changing it into another format.
Schema is used to describe relational tabular, hierarchical or graph structures. Usually, schema is used to identify how the data is to be stored or transported. For sources without schema, such as files, there are few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (for example: the relational approach requires simple attribute values, referential integrity, et cetera) as well as application-specific integrity constraints.
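As a rough illustration of the contrast above, the following Python sketch loads the same rows into a schema-less sink (CSV text) and into a relational sink (an in-memory SQLite database). The table names, columns, sample rows, and function names are all hypothetical and do not come from the talk. SQLite is used only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical records: the second has an impossible age and the third
# refers to an agency that does not exist.
ROWS = [
    ("Alice", 34, 1),
    ("Bob", -5, 1),
    ("Carol", 41, 99),
]


def load_into_file(rows):
    """A schema-less sink: every row is accepted, errors included."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def load_into_database(rows):
    """A relational sink: the schema rejects rows that break a constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.execute("CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE person ("
        " name TEXT NOT NULL,"
        " age INTEGER NOT NULL CHECK (age >= 0),"
        " agency_id INTEGER NOT NULL REFERENCES agency(id))"
    )
    conn.execute("INSERT INTO agency (id, name) VALUES (1, 'Example Agency')")
    accepted, rejected = [], []
    for row in rows:
        try:
            conn.execute("INSERT INTO person VALUES (?, ?, ?)", row)
            accepted.append(row)
        except sqlite3.IntegrityError as error:
            # Application-specific (CHECK) and referential constraints land here.
            rejected.append((row, str(error)))
    conn.close()
    return accepted, rejected


if __name__ == "__main__":
    print(load_into_file(ROWS))
    print(load_into_database(ROWS))
```

Under these assumptions the file keeps all three rows, errors included, while the database accepts only the first row and rejects the other two: one on the CHECK constraint and one on referential integrity.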
Data munging, sometimes referred to as data wrangling, means taking data that's stored in one format and changing it into another format. Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. Although there are tools, both interactive and command-line, that can assist data transformation, analysts must still conceptualize the desired output state, formulate a transformation strategy, and specify complex transforms.
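A minimal sketch of one such wrangling step, assuming the input is well-formed CSV text with a header row and the desired output is JSON. The function name and the sample data are hypothetical. Real intelligence sources would need far more handling than this.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Wrangle CSV text into a JSON array of objects, one object per row.

    Assumes a header row and no row with more fields than the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Trim stray whitespace, a typical small clean-up during wrangling.
        # Short rows yield None for missing fields, stored here as "".
        records.append(
            {key.strip(): (value or "").strip() for key, value in row.items()}
        )
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    sample = "name, city\nAlice , Canberra\nBob, Sydney\n"
    print(csv_to_json(sample))
```

Even in this small case the analyst has to decide the output shape (a list of objects keyed by column name) and the transform (whitespace trimming) up front, which is the conceptual burden described above.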
The `Schema-First' approach may mean a loss of data quality at any one of the following stages, reducing the applicability of the data. The stages include (Chapman, 2005):
Data capture and recording at the time of gathering.
Data manipulation prior to digitization (label preparation), identification of the collection and its recording.
Digitization of the data.
Documentation of the data (capturing and recording the meta-data).
Data storage and archiving.
Data presentation and dissemination (paper and electronic publications, web-enabled databases, et cetera).
Data use (analysis and manipulation).
All these have an input into the final quality or `fitness for use' of the data, and all apply to all aspects of the data: the taxonomic or nomenclature portion of the data (the `what'), the spatial portion (the `where') and other data such as the `who' and the temporal `when'.