01/05/2023 CEGID3
Michel Caradec• [email protected]
Project Manager, Software/Data Engineer at Cegid Background
• Business Intelligence, ETL, OLAP, Data Manipulation• C#, R, Python
About Me
01/05/2023 CEGID6
Cegid web sites armed with tracking solutions• Extend web analytics data
Data Engineer: collect and prepare data Data Scientists: consume data in Azure ML Studio
• Visitors usage knowledge (browsing)• Provide better experience (recommendations)
Business Case
01/05/2023 CEGID7
Business Case
1 single visitBounce rate: 63%
Session length: 2 minMainly on Homepage
Average of 4.5 sessionsHigh screen resolutionWindows OSUse of internal SearchDuring the day
Average of 2.5 sessions1.5 pages / sessionBounce rate: 80%iOS (iPhone, iPad)In the evening and WENo conversion
Average of 7 sessions5 pages / session
Session length: 6 minMainly Solutions pages
More conversions
Tactilevisitor
3.5%
Returningvisitor
7%
Addictvisitor3.5%
One shotvisitor86%
Visitors clustering
Metrics built from a subset of the dataset.Do not represent real traffic.
01/05/2023 CEGID9
ADL Store: repository, schema-on-read, Web HDFS ADL Analytics: distributed processing using U-SQL
Azure Data Lake = Big Data as a Service
© Microsoft Azure
01/05/2023 CEGID10
REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];
@data = EXTRACT user string, timestamp DateTime, heart int FROM "quantified-{user}.tsv" USING Extractors.Tsv();
@agg = SELECT *, timestamp.Date AS date, timestamp.Hour AS hour FROM @data;
@agg = SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min, MAX(heart) AS max FROM @agg GROUP BY user, date, hour;
OUTPUT @agg TO "quantified.csv" USING Outputters.Csv();
U-SQL = SQL + C#
Extract Transform OutputADLS, Azure Storage Blobs, Azure SQL
File schema-on-read SQL-like data manipulation C# data types C# integration Can store as relational
And many more… User-defined aggregators (C#) User-defined operators (C#) Custom Extractors, Outputters (C#) File sets for multiple input files access
patterns Credentials
Inspired by Michael Rys presentationat SQL Server PASS Deutschland 2016
Extr
act
Out
put
Tran
sfor
m
CREATE TABLE quantified(user string, date DateTime, hour int, avg long?, min int?, max int?, INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
INSERT INTO quantified SELECT * FROM @agg;
Out
put’
01/05/2023 CEGID13
U-SQL Pipeline
JSON / TSV conversion
• cegid-<site>-raw.tsv
Sessions aggregation
• cegid-<site>-
sessions.tsv
01/05/2023 CEGID14
U-SQL Script 1 - JSON / TSV Conversion
*.json files extraction
Records to JSON objects
conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
01/05/2023 CEGID15
Data extracted as raw text@json_records_raw =EXTRACT json_raw string FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"USING Extractors.Text(delimiter : Char.MinValue, quoting : false);
JSON / TSV Conversion - Step 1/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
Azure Storage Blobs 20160901/20160901.json 20160902/20160902.json 20160903/20160903.json 20160904/20160904.json …
@json_records_raw
01/05/2023 CEGID16
DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId", "customerId", ... };@json_records =SELECT JsonFunctions.JsonTuple(
json_raw, @fields.ToArray()) AS json_objectFROM @json_records_raw;
JSON / TSV Conversion - Step 2/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
01/05/2023 CEGID17
@json_records =SELECT json_object["ipAddress"] AS ipAddress, json_objectFROM @json_records;
@json_records =SELECT *FROM @json_recordsLEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ipsON ipAddress == ips.ip;
JSON / TSV Conversion - Step 3/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
01/05/2023 CEGID18
@events =SELECT json_object["timestamp"] AS timestamp,json_object["eventId"] AS eventId,json_object["customerId"] AS customerId,// ...FROM @json_records;
JSON / TSV Conversion - Step 4/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
01/05/2023 CEGID19
@events =SELECT *, CustomDimensions.ParseFromJson(cd) AS cd_arrayFROM @events;
JSON / TSV Conversion - Step 5/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
index value
3 bigFan
4 powerBuyer
2 1443654335461
1 ComkzbXvn8g
value (ordered)ComkzbXvn8g
1443654335461
bigFan
powerBuyer
01/05/2023 CEGID20
@events =SELECT timestamp,
eventId,customerId,verb,// ...cd_array[0] AS cd_01,cd_array[1] AS cd_02,// ...cd_array[19] AS cd_20
FROM @events;
JSON / TSV Conversion - Step 6/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
valueComkzbXvn8g
1443654335461
bigFan
powerBuyer
cd_01 cd_02 cd_03 cd_04ComkzbXvn8g 1443654335461 bigFan powerBuyer
01/05/2023 CEGID21
OUTPUT( SELECT * FROM @events WHERE fullUrl.StartsWith("http://www.cegid.com/uk/"))TO "cegid-uk-raw.tsv"ORDER BY clientId, timestamp ASCUSING Outputters.Tsv(quoting : false);
JSON / TSV Conversion - Step 7/7
*.json files extraction
Records to JSON objects conversion
Cegid IPS filtering
JSON objects fields extraction
Custom dimensions extraction
Custom dimensions pivot
Output to TSV files
01/05/2023 CEGID24
N events M sessions (M >= 1)• New session if no activity for 30 minutes
Aggregation using U-SQL REDUCE• Microsoft.Analytics.Interfaces.IReducer
User Agent parsing: UAParser (C#) Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
U-SQL Script 2 - Sessions Aggregation
cegid-<site>-raw.tsv files extraction
Events to sessions
aggregation
Output to TSV
files
01/05/2023 CEGID25
N sessions 1 visitor Aggregation using U-SQL REDUCE
• Microsoft.Analytics.Interfaces.IReducer
U-SQL Script 3 - Visitors Aggregation
cegid-<site>-sessions.tsv
files extraction
Sessions to visitors
aggregation
Output to TSV
files
01/05/2023 CEGID26
Custom dimensions aggregation (C#)
Visitors Aggregation
visitor_id cd_01 cd_02A homepage articlesA solutions detailA homepage articlesA form newsB homepage articlesB blog news
visitor_id cat_homepage cat_solutions cat_blog cat_form typ_articles typ_detail typ_newsA 2 1 1 2 1 1B 1 1 1 1
cd_01 = page categorycd_02 = content type...cd_20 = …
01/05/2023 CEGID27
Keep C# code centralized in dedicated assemblies• Code-behind good for prototyping
Prefer U-SQL to C# for better optimization• Distributed execution built from U-SQL script, not C#
Properly use parallelism Know your data
• Statistics (cardinality, distribution, skewness)• Growth
Understand U-SQL concepts (vertex, partitioning, etc.) and MapReduce design patterns (will help)
Local mode is good, but do not forget to test on ADL
Lessons Learned
01/05/2023 CEGID28
Pros Packaged solution, zero deployment Cloud agility (scalability, elasticity) Web HDFS storage = Hadoop
compatible (sqoop, etc.) Integrated development
(Visual Studio, local/debug mode) U-SQL = SQL + C# (business code
reuse)
Azure Data Lake Review
Cons Proprietary solution Not on-premises Learning curve