Phoenix
James [email protected]
We put the SQL back in NoSQL
Agenda
What is Phoenix?
Why SQL?
What is next?
Q&A
What is Phoenix?
SQL layer on top of HBase
Delivered as an embedded JDBC driver
Targets low latency queries over HBase data
Columns modeled as multi-part row key and key values
Versioned schema repository
Query engine transforms SQL into puts, deletes, and scans
Uses native HBase APIs instead of Map/Reduce
Brings the computation to the data:
  Aggregate, insert, and delete data through coprocessors
  Push predicates through custom filters
100% Java
Open source here: https://github.com/forcedotcom/phoenix
Why SQL?
Broaden HBase adoption
  Give folks an API they already know
Reduce the amount of code users need to write
  SELECT TRUNC(date,'DAY'), AVG(cpu_usage)
  FROM web_stat
  WHERE domain LIKE 'Salesforce%'
  GROUP BY TRUNC(date,'DAY')
Performance optimizations transparent to the user
  Aggregation
  Stats gathering
  Secondary indexing
Leverage existing tooling
  SQL client
But I can’t surface x,y,z in SQL…
Define multi-part row keys
  CREATE TABLE web_stat (
    domain VARCHAR NOT NULL,
    feature VARCHAR NOT NULL,
    date DATE NOT NULL,
    usage BIGINT,
    active_visitor INTEGER,
    CONSTRAINT pk PRIMARY KEY (domain, feature, date));
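One way to picture why a multi-part primary key fits a single HBase row key: the key columns can be concatenated into one byte string whose byte order follows the column order. The sketch below is only an illustration. The encoding (a zero byte after each VARCHAR, 8-byte big-endian milliseconds for DATE) and the helper name `encode_row_key` are assumptions made for this example, not a description of Phoenix's actual serialization.

```python
import struct
from datetime import datetime, timezone

def encode_row_key(domain: str, feature: str, date: datetime) -> bytes:
    """Hypothetical encoding of (domain, feature, date) into one row key."""
    millis = int(date.timestamp() * 1000)
    return (
        domain.encode("utf-8") + b"\x00"     # variable-length part, zero-terminated
        + feature.encode("utf-8") + b"\x00"  # second variable-length part
        + struct.pack(">q", millis)          # fixed-width, big-endian; byte order
                                             # matches time order for dates >= 1970
    )

# Keys for the same domain and feature sort by date, so a date range
# becomes a contiguous range of row keys.
jan1 = encode_row_key("Salesforce.com", "Login", datetime(2013, 1, 1, tzinfo=timezone.utc))
jan2 = encode_row_key("Salesforce.com", "Login", datetime(2013, 1, 2, tzinfo=timezone.utc))
```

Because the leading columns come first in the byte string, a predicate on domain alone, or on domain and feature, also maps to a contiguous key range.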
Implement my whizz-bang custom function
  Derive class from ScalarFunction
  Add annotation to define name, args, and types
  Implement evaluate method
  Register function
  (blog on this coming soon: http://phoenix-hbase.blogspot.com/)
Run snapshot in time queries
  Set CURRENT_SCN property on connection to earlier timestamp
  Queries will see only rows before timestamp
  Schema in-place at that point in time will be used
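The snapshot behavior can be sketched with a small in-memory model: every cell keeps all of its written versions, and a read at a given timestamp returns the newest version written strictly before it. The `read_as_of` helper and the dictionary layout are hypothetical and exist only for this illustration. The real mechanism relies on HBase cell timestamps.

```python
from typing import Dict, List, Tuple

# Each row keeps every written version as (timestamp, value).
Versions = List[Tuple[int, str]]

def read_as_of(cells: Dict[str, Versions], scn: int) -> Dict[str, str]:
    """Return, per row, the newest value written strictly before `scn`."""
    snapshot: Dict[str, str] = {}
    for key, versions in cells.items():
        visible = [(ts, value) for ts, value in versions if ts < scn]
        if visible:
            # Pick the version with the largest timestamp among the visible ones.
            snapshot[key] = max(visible, key=lambda tv: tv[0])[1]
    return snapshot

cells = {
    "row1": [(10, "a"), (20, "b")],
    "row2": [(30, "c")],
}
earlier = read_as_of(cells, 25)   # row2 was written later, so it is not visible
latest = read_as_of(cells, 100)
```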
Nest child entities inside of a row
  Declare new child entity as nested table
  Prefix column qualifier of nested entities with:
    table name + child primary key + child column name
  Restrict join to be only through parent/child relation
  Execute query by scanning nested child rows
  TBD: https://github.com/forcedotcom/phoenix/issues/19
Prevent hot spotting on writes
  “Salt” row key on upsert by mod-ing with cluster size
  Query for fully qualified key by inserting salt byte
  Range scan by concatenating results of scan over all possible salt bytes
  Or alternately
    Define column used for hash to derive row key prefix
  TBD: https://github.com/forcedotcom/phoenix/issues/74
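The salting idea above can be sketched against a plain dictionary standing in for a table. Everything here is an assumption made for illustration: the bucket count, the use of CRC32 as the hash, and the helper names. The feature itself was still marked TBD at the time of this talk.

```python
import zlib
from typing import Dict, List, Optional, Tuple

NUM_BUCKETS = 4  # assumed to track the cluster size

def salt_byte(row_key: bytes) -> bytes:
    """Derive a one-byte salt by hashing the key and mod-ing by the bucket count."""
    return bytes([zlib.crc32(row_key) % NUM_BUCKETS])

def upsert(table: Dict[bytes, str], row_key: bytes, value: str) -> None:
    table[salt_byte(row_key) + row_key] = value

def get(table: Dict[bytes, str], row_key: bytes) -> Optional[str]:
    # A fully qualified key lets us recompute the salt and do a point lookup.
    return table.get(salt_byte(row_key) + row_key)

def range_scan(table: Dict[bytes, str], start: bytes, stop: bytes) -> List[Tuple[bytes, str]]:
    """Scan [start, stop) once per possible salt byte and combine the results."""
    results = []
    for bucket in range(NUM_BUCKETS):
        prefix = bytes([bucket])
        lo, hi = prefix + start, prefix + stop
        for key in sorted(table):
            if lo <= key < hi:
                results.append((key[1:], table[key]))  # strip the salt byte
    return sorted(results)  # restore unsalted key order across buckets

table: Dict[bytes, str] = {}
for i in range(20):
    upsert(table, b"key%02d" % i, str(i))
```

Sequential keys such as `key00`, `key01`, ... land in different buckets, which spreads writes out, at the cost of one scan per bucket for range queries.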
Increment atomic counter
  Surface the HBase put-and-increment functionality through the standard SQL sequence support
  TBD: https://github.com/forcedotcom/phoenix/issues/18
Sample table data
  Support the standard SQL TABLESAMPLE clause
  Implement filter that uses a skip next hint
  Base next key on the table stats “guide posts”
  TBD: https://github.com/forcedotcom/phoenix/issues/22
Declare columns at query time
  SELECT col1,col2,col3
  FROM my_table(col2 VARCHAR, col3 INTEGER)
  WHERE col3 > 10
  TBD: https://github.com/forcedotcom/phoenix/issues/9
Conclusion
Phoenix fits the 80/20 use case rule
Let us know what you’d like to see added
Get involved – we need your help!
Think about how your new feature can be surfaced in SQL
Thank you!
Questions/comments?
Query Processing
Product Metrics HTable
  Row Key: ORG_ID, DATE, FEATURE
  Key Values: TXNS, IO_TIME, RESPONSE_TIME
Scan
  Start key: ORG_ID (:1) + DATE (:2)
  End key: ORG_ID (:1) + DATE (:3)
Filter
  IO_TIME > 100
Aggregation
  Intercepts scan on region server
  Builds map of distinct FEATURE values
  Returns one row per distinct group
  Client does final merge
SELECT feature, SUM(txns)
FROM product_metrics
WHERE org_id = :1
AND date >= :2 AND date <= :3
AND io_time > 100
GROUP BY feature
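The scan, filter, and aggregation stages for this query can be mimicked in a few lines. This is a toy model, not Phoenix code: regions are plain lists of tuples, dates are small integers, and `region_aggregate` and `client_merge` are hypothetical names standing in for the coprocessor and the client-side merge.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (org_id, date, feature, txns, io_time); dates are plain integers here.
Row = Tuple[str, int, str, int, int]

def region_aggregate(rows: Iterable[Row], org_id: str, start: int, end: int) -> Counter:
    """Region-side work: restrict to the key range, filter, and group by feature."""
    partial: Counter = Counter()
    for r_org, r_date, feature, txns, io_time in rows:
        in_key_range = r_org == org_id and start <= r_date <= end  # scan start/end key
        if in_key_range and io_time > 100:                          # pushed-down filter
            partial[feature] += txns                                # one entry per group
    return partial

def client_merge(partials: Iterable[Counter]) -> Dict[str, int]:
    """Client-side work: add up the per-region partial sums."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

region_a: List[Row] = [("org1", 1, "login", 5, 150), ("org1", 2, "search", 3, 90),
                       ("org2", 1, "login", 7, 200)]
region_b: List[Row] = [("org1", 3, "login", 4, 120), ("org1", 9, "login", 100, 500),
                       ("org1", 2, "search", 2, 101)]
result = client_merge(region_aggregate(r, "org1", 1, 3) for r in (region_a, region_b))
```

Each region returns at most one row per distinct feature, so the client merges a handful of partial sums instead of every matching row.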
Phoenix Query Optimizations
Start/stop key of scan based on AND-ed columns
  Through SUBSTR, ROUND, TRUNC, LIKE
Parallelized on client by chunking over start/stop key of scan
Aggregation on region-servers through coprocessor
  Inline for GROUP BY over row key ordered columns
  In memory map per group otherwise
WHERE clause executed through custom filters
  Incremental evaluation with early termination
  Evaluated through byte pointers
IN and OR over same column (in progress)
  Becomes batched get or filter with next row hint
Top N queries (future)
  Through coprocessor keeping top N rows
TABLESAMPLE (future)
  Becomes filter with next row hint
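To make the first optimization concrete, here is a sketch of how a predicate such as `domain LIKE 'Salesforce%'` on a leading row key column could be turned into a scan start/stop key. The function name `prefix_scan_range` is hypothetical, and the logic only handles the simple trailing-wildcard case. It is not taken from the Phoenix source.

```python
from typing import Optional, Tuple

def prefix_scan_range(pattern: str) -> Optional[Tuple[bytes, bytes]]:
    """Derive [start, stop) for LIKE 'prefix%' patterns; None if not applicable."""
    if not pattern.endswith("%"):
        return None
    prefix = pattern[:-1]
    if not prefix or "%" in prefix or "_" in prefix:
        return None  # wildcards inside the prefix would need a filter instead
    start = prefix.encode("utf-8")
    # The stop key is the smallest byte string greater than every key that
    # begins with the prefix: drop trailing 0xFF bytes, then bump the last byte.
    stop = bytearray(start)
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return None  # no finite stop key exists
    stop[-1] += 1
    return start, bytes(stop)

key_range = prefix_scan_range("Salesforce%")
```

With the range in hand, the scan only touches rows whose key falls in `[start, stop)`, and any remaining predicates can still be applied as filters.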
Phoenix Performance