CHALLENGES IN DATA MODELING
WEB ANALYTICS
AGENDA
• Introduction to Web Analytics
  • Data Sources, Data Capture
  • Vocabulary
• Data Modeling Basics
  • Relational vs. Dimensional
  • Normalization, De-normalization, Aggregation
• Web Analytics + Data Modeling
  • Four-tiered Data Model for Web data
  • Challenges
• Q & A
INTRODUCTION
• Anne Marie Macek
  • Senior Manager, Data Strategy
  • Consumer Insight and Revenue Strategy
  • Marriott International
• 30+ years Data Modeling and Reporting
• 14+ years Data Warehousing and Business Intelligence
• 4+ years Web Analytics Data and Reporting
• MBA, Management Information Systems
• BS, Mathematics and Computer Science
EXPERIENCE
• Data Modeling:
  • Flat Files, IMS/DB, DB2, Oracle, Netezza
  • MS Access, Borland Paradox
  • Cognos Powerplay, MS Analysis Services, Cognos 10.2 Dynamic Cubes
• Reporting:
  • COBOL, Focus, SAS, Actuate
  • Cognos BI Suite
• Business Functions:
  • eCommerce, Revenue Management, Sales & Marketing
  • Human Resources, Finance
DEFINITION
• Web analytics is the measurement, collection, analysis and reporting of internet data for purposes of understanding and optimizing web usage.
Source: Wikipedia
OBJECTIVES
• Website Performance
  • Conversion Rate ($ sales / # visits)
  • Trends over time
  • In Response to Campaigns
• Website Optimization
  • Customer Behavior
  • Technological Trends
• Integration
  • Customer Lifetime Value / Segmentation
• Personalization
  • Proactive display of pertinent information
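The conversion rate defined above ($ sales / # visits) can be trended over time with very little code. The following is a minimal sketch only: the `PeriodTotals` type, the `conversion_rate` helper, and the monthly figures are made up for illustration and do not come from any real site.

```python
from dataclasses import dataclass


@dataclass
class PeriodTotals:
    """Hypothetical per-period totals; field names are assumptions."""
    period: str
    sales_dollars: float
    visits: int


def conversion_rate(totals: PeriodTotals) -> float:
    """Dollars of sales per visit, as defined on the slide; 0.0 if no visits."""
    if totals.visits == 0:
        return 0.0
    return totals.sales_dollars / totals.visits


# Made-up sample data, used only to show the trend calculation.
monthly = [
    PeriodTotals("2014-01", 50_000.0, 20_000),
    PeriodTotals("2014-02", 66_000.0, 22_000),
]

trend = {t.period: conversion_rate(t) for t in monthly}
for period, rate in trend.items():
    print(f"{period}: ${rate:.2f} per visit")
```

Guarding against zero visits matters in practice, because a new page or campaign may have no traffic in some periods.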
DATA SOURCES
• Click-stream Data
• Search Engine Optimization (SEO)
• Campaign Classification
• Email Campaigns
• Advertising Impressions
• 3rd Party Marketing Data
• IP Geolocation
• Competitive Analysis
• Customer Information
• Multi-channel Analysis
• Outcome Data
CLICKSTREAM COLLECTION
• Web Log Files
  • Rudimentary data collected on company’s web server
  • Page name, IP address, browser, date/time
  • Does not screen out search engine robots
• JavaScript Tagging (Google Analytics, Omniture, WebTrends)
  • As page loads, data is sent to 3rd party for collection
  • Assigns a cookie to the user
  • Can implement custom tags on specific pages
  • Does not count pages served from cache
• Packet Sniffers (Cloudmeter Pion, Tealeaf CX Connect)
  • Software or hardware layer installed on web servers
  • Parsing raw data and ensuring PII is protected can be complex
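To make the web log option concrete, here is a rough sketch of pulling page, IP address, browser, and date/time out of one log line and flagging likely robots. It assumes the server writes the common "combined" log format, which the slides do not specify; `LOG_PATTERN`, `ROBOT_HINTS`, `parse_line`, and the sample line are all hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: the widely used "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Naive robot screen: a substring check on the user agent (an assumption;
# real robot filtering is usually far more involved).
ROBOT_HINTS = ("bot", "crawler", "spider")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    record["is_robot"] = any(h in record["agent"].lower() for h in ROBOT_HINTS)
    return record


# Made-up sample line for illustration.
sample = ('203.0.113.7 - - [10/Oct/2014:13:55:36 -0700] '
          '"GET /hotels/search HTTP/1.1" 200 2326 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1)"')
hit = parse_line(sample)
print(hit["ip"], hit["page"], hit["ts"].isoformat(), hit["is_robot"])
```

Because the raw log does not screen out robots, some filter of this kind has to be applied downstream before visits are counted.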
CLICKSTREAM ANALYSIS
• Number of Visitors
  • Total vs. Unique
  • New vs. Repeat
• Source of Visit (Session)
  • External Link (Campaign Analysis / Attribution)
  • Direct
• Searches Performed On Site
  • Keywords
  • Sort Order of Results
• Page Analysis
  • Specific Actions Performed
  • Order (Booking)
  • Signup for Membership, Credit Card, Event
• Abandonment (Bounce Rate)
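Several of the measures above can be derived from the same click rows. The sketch below is illustrative only: the `(visitor_id, session_id, page)` layout and the rows are made up, and bounce rate is taken to mean single-page visits divided by all visits, which is a common definition but an assumption here.

```python
from collections import Counter, defaultdict

# Made-up click rows: (visitor_id, session_id, page).
clicks = [
    ("v1", "s1", "/home"),
    ("v1", "s1", "/search"),
    ("v1", "s1", "/booking"),
    ("v2", "s2", "/home"),
    ("v1", "s3", "/home"),
    ("v3", "s4", "/offers"),
    ("v3", "s4", "/signup"),
]

# Pages viewed per visit (session).
pages_per_session = Counter(session for _, session, _ in clicks)

# Distinct sessions per visitor, to separate repeat from one-time visitors.
sessions_per_visitor = defaultdict(set)
for visitor, session, _ in clicks:
    sessions_per_visitor[visitor].add(session)

total_visits = len(pages_per_session)
unique_visitors = len(sessions_per_visitor)
repeat_visitors = sum(1 for s in sessions_per_visitor.values() if len(s) > 1)

# A bounce is assumed to be a visit with exactly one page view.
bounces = sum(1 for n in pages_per_session.values() if n == 1)
bounce_rate = bounces / total_visits

print(total_visits, unique_visitors, repeat_visitors, bounce_rate)
```

Note that every figure except the raw click count depends on de-duplicating visitors or sessions first.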
BRINGING CLICKSTREAM IN-HOUSE
• Control/Consolidate Business Rules
  • Integration with Corporate Systems of Record
  • Single Version of the Truth
• Integration with Other Web Data Sources
  • Enable more “intelligent” metrics
  • Not all visits are a conversion opportunity
• Shift from “visit analysis” to “customer analysis”
  • Enable advanced statistical and predictive modeling
  • Multi-touch Attribution
  • Pay Per Click (PPC) Keyword Bid Optimization
CLICKSTREAM CHALLENGES
• “Clickstream data … is delightfully complex, ever changing, and full of mysterious occurrences.” Avinash Kaushik, Web Analytics: An Hour a Day
• Volume
  • Cons: It’s big
  • Pros: It’s incremental
• Fairly Unstructured
  • Exceptions to every rule
  • Mobile App vs. Mobile Web vs. Desktop
• Rapidly Changing
• Most queries require trending YTD + 2 years’ history
• Few “natural” metrics; most require count (distinct)
• How do I model this data??
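The count (distinct) point is worth a small illustration: unique-visitor counts are not additive, so daily figures cannot simply be summed into a longer period. The visitor IDs and days below are made up.

```python
# Made-up sets of visitor IDs seen on each day.
daily_visitors = {
    "Mon": {"v1", "v2", "v3"},
    "Tue": {"v2", "v3", "v4"},
    "Wed": {"v1", "v4"},
}

# Adding up the daily unique counts double-counts returning visitors.
sum_of_daily_uniques = sum(len(ids) for ids in daily_visitors.values())

# The true figure needs the underlying IDs, not the daily totals.
true_uniques = len(set().union(*daily_visitors.values()))

print(sum_of_daily_uniques, true_uniques)  # 8 4
```

This is why trending over long periods tends to require access to granular rows, or distinct counts pre-computed for each period of interest.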
DATA WAREHOUSE APPROACHES
Bill Inmon
• DW is Central Repository of all Enterprise Data
• “Top Down”
• Relational Model (3NF)
• Feeds Functional Data Marts
• Huge Undertaking
Ralph Kimball
• DW is the “Virtual” Integration of Various Functional Data Marts
• “Bottom Up”
• Dimensional Model
• Quicker to Develop
• Silo-ed and Redundant
RELATIONAL MODEL
Source: sqlservercentral.com
DIMENSIONAL MODELS
[Diagrams: Star Schema and Snowflake Schema]
Source: Wikipedia
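As a rough illustration of a star schema, the sketch below builds one fact table surrounded by two dimension tables in an in-memory SQLite database and runs a typical join-and-aggregate query. All table names, column names, and rows are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table keyed to small, de-normalized dimension tables.
conn.executescript("""
CREATE TABLE dim_page   (page_key INTEGER PRIMARY KEY, page_name TEXT);
CREATE TABLE dim_device (device_key INTEGER PRIMARY KEY, device_type TEXT);
CREATE TABLE fact_click (
    page_key    INTEGER REFERENCES dim_page(page_key),
    device_key  INTEGER REFERENCES dim_device(device_key),
    click_count INTEGER
);
INSERT INTO dim_page   VALUES (1, 'Home'), (2, 'Search');
INSERT INTO dim_device VALUES (1, 'Desktop'), (2, 'Mobile');
INSERT INTO fact_click VALUES (1, 1, 5), (1, 2, 3), (2, 1, 4);
""")

# Typical dimensional query: join the fact to a dimension, then aggregate.
clicks_by_page = conn.execute("""
    SELECT p.page_name, SUM(f.click_count)
    FROM fact_click f
    JOIN dim_page p ON p.page_key = f.page_key
    GROUP BY p.page_name
    ORDER BY p.page_name
""").fetchall()

print(clicks_by_page)  # [('Home', 8), ('Search', 4)]
```

A snowflake schema would go one step further and normalize the dimensions themselves, for example by moving a page's section into its own table referenced from `dim_page`.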
NORMALIZATION
• Removes redundancy and dependency from data structures.
• 1NF: Remove Repeating Groups
• 2NF: Remove Partial Key Dependencies
• 3NF: Remove Dependencies Among Attributes
• Tutorial: http://phlonx.com/resources/nf3/
• Data Warehouses require some De-Normalization to improve query performance
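A small sketch of the 3NF step and the de-normalization that reverses it for querying, using plain Python structures. The column names and rows are made up: `page_name` and `page_section` depend on `page_id` rather than on the click key, so they move to their own table.

```python
# Made-up flat rows: page attributes are repeated on every click.
flat_clicks = [
    {"click_id": 1, "page_id": 10, "page_name": "Home", "page_section": "Landing"},
    {"click_id": 2, "page_id": 20, "page_name": "Search", "page_section": "Booking"},
    {"click_id": 3, "page_id": 10, "page_name": "Home", "page_section": "Landing"},
]

# Normalize: attributes that depend on page_id move to a page table,
# and each click keeps only the key.
pages = {}
clicks = []
for row in flat_clicks:
    pages[row["page_id"]] = {
        "page_name": row["page_name"],
        "page_section": row["page_section"],
    }
    clicks.append({"click_id": row["click_id"], "page_id": row["page_id"]})

# De-normalize for query: pre-join the page attributes back onto each click.
rejoined = [{**click, **pages[click["page_id"]]} for click in clicks]

print(len(pages), rejoined == flat_clicks)  # 2 True
```

The normalized form stores each page once, which is cheaper to load and correct; the pre-joined form avoids the join at query time, which is the trade-off a warehouse makes when it de-normalizes.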
ECOMMERCE DATA WAREHOUSE
• Native Source Model
• Fact Model
• BI Model
• Aggregate Model
NATIVE SOURCE MODEL
Plus
• In-database copy of the source data
• Stores data elements we are not yet ready to model further
• Maintains details for research purposes
• Prevents repeating historical conversion
Minus
• Huge
• Unstructured
• Not normalized (at all)
• Not useful for analysis or reporting
FACT MODEL
Plus
• “Snow-relational”
• Nearly Normalized (optimized for load)
• Multiple Fact & Extension Tables (manage I/O)
• Granular (click row)
• Contains keys to integrate with enterprise data
Minus
• Complex load including propagation and look-back
• Use requires non-filtered joins of massive tables
• Difficult to use for analysis, cannot be used for reporting
BI MODEL
Plus
• “Star-flake” Model
• De-normalized (optimized for query)
• Pre-joined
• Granular (click row)
• Integrated with enterprise data at load time
• Useful for detailed analysis
Minus
• Complex load process
• It’s still big!
• Corrections to Fact Model data issues require re-build or complex conversion processes
• Difficult to use for reporting
AGGREGATE MODEL
Plus
• Star Schema (simple)
• De-normalized (optimized for query)
• Aggregated
• Fast query performance
• Great for pre-determined reports
Minus
• Corrections to Fact Model data issues and embedded dimensions require re-build
• Count distincts only available for pre-determined dimensions
• Limited use for analysis
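The count-distinct limitation of an aggregate can be sketched with an in-memory SQLite database. The `bi_click` and `agg_daily_device` tables, their columns, and the rows are hypothetical; the point is only that additive measures roll up from the aggregate while distinct counts do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical granular click rows standing in for the BI Model.
conn.execute(
    "CREATE TABLE bi_click (visit_date TEXT, device TEXT, visitor_id TEXT)"
)
conn.executemany(
    "INSERT INTO bi_click VALUES (?, ?, ?)",
    [
        ("2014-06-01", "desktop", "v1"),
        ("2014-06-01", "desktop", "v1"),
        ("2014-06-01", "mobile", "v1"),
        ("2014-06-01", "mobile", "v2"),
        ("2014-06-01", "desktop", "v3"),
    ],
)

# Aggregate built for one pre-determined dimension set: date and device.
conn.execute("""
    CREATE TABLE agg_daily_device AS
    SELECT visit_date, device,
           COUNT(*) AS clicks,
           COUNT(DISTINCT visitor_id) AS unique_visitors
    FROM bi_click
    GROUP BY visit_date, device
""")

# Rolling the aggregate up across devices: clicks add up, uniques do not.
rolled_clicks, rolled_uniques = conn.execute(
    "SELECT SUM(clicks), SUM(unique_visitors) FROM agg_daily_device"
).fetchone()
(true_uniques,) = conn.execute(
    "SELECT COUNT(DISTINCT visitor_id) FROM bi_click"
).fetchone()

print(rolled_clicks, rolled_uniques, true_uniques)  # 5 4 3
```

Visitor `v1` appears on both devices, so the rolled-up figure over-counts; a correct daily total needs either its own aggregate or a query against the granular rows.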
QUESTIONS?
• Thank You!