HBase powered Merchant Lookup Service at Intuit. Vrushali Channapattan, Intuit. Lightning Talk @ HBaseCon2012 (May 22nd, 2012)


Jul 13, 2015


Technology

Cloudera, Inc.
Transcript
Page 1: HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit

HBase powered Merchant Lookup Service at Intuit

Vrushali Channapattan, Intuit

Lightning Talk @ HBaseCon2012 (May 22nd, 2012)

Page 2

About Intuit


Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.

Page 3

Problem: Duplicate Merchants

Vendor records from Company ABC and Company PQR:

name: The Windsor Press, Inc.
address: PO Box 465 6 North Third Street
city: Hamburg
state: PA
zip: 19526
phone: (610) 562-2267

name: The Windsor Press
address: P.O. Box 465 6 North 3rd St.
city: Hamburg
state: PA
zip: 19526-0465
phone: (610) 562-2267

Both of the above vendor records map to the D&B (Dun & Bradstreet) business:

ID: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
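
To see why the Windsor Press records on this slide count as duplicates, here is a minimal sketch, assuming a naive normalization: lowercase the name, drop punctuation and a trailing "inc", keep only the 5-digit ZIP and the phone digits. The class MerchantKey and its rules are hypothetical illustrations; the talk does not describe how its matchers work internally.

```java
import java.util.Locale;

// Hypothetical helper, not from the talk: builds a crude comparison key.
final class MerchantKey {

    // Keep digits only, so "(610) 562-2267" and "(610)-562-2267" compare equal.
    static String normalizePhone(String phone) {
        return phone.replaceAll("[^0-9]", "");
    }

    // Keep the first five digits, so ZIP+4 variants collapse to the base ZIP.
    static String normalizeZip(String zip) {
        String digits = zip.replaceAll("[^0-9]", "");
        return digits.length() >= 5 ? digits.substring(0, 5) : digits;
    }

    // Lowercase, replace runs of punctuation/whitespace with one space,
    // then drop a trailing "inc".
    static String normalizeName(String name) {
        String s = name.toLowerCase(Locale.ROOT)
                       .replaceAll("[^a-z0-9]+", " ")
                       .trim();
        return s.replaceAll("\\s+inc$", "");
    }

    static String key(String name, String zip, String phone) {
        return normalizeName(name) + "|" + normalizeZip(zip) + "|" + normalizePhone(phone);
    }

    public static void main(String[] args) {
        // All three records from the slide produce the same key:
        // the windsor press|19526|6105622267
        System.out.println(key("The Windsor Press, Inc.", "19526", "(610) 562-2267"));
        System.out.println(key("The Windsor Press", "19526-0465", "(610) 562-2267"));
        System.out.println(key("The Windsor-Press Inc", "19526-1502", "(610)-562-2267"));
    }
}
```

Exact-key matching like this is brittle (it ignores the street address and misspellings), which is presumably why the service relies on several matchers and a score rather than a single key.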

Page 4

Applications of Merchant Lookup

Page 5

Applications of Merchant Lookup

Page 6

Backend Architecture

[Architecture diagram; recoverable labels: Input Data; Loader; Name, Address, Phone; Various Matchers; Individual Matcher Scores; Score Combiner; Final Match Score; Merchant Splicer; Update; Full table Scan; Applications; Internal Research Projects]
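
The architecture labels mention individual matcher scores that a score combiner turns into a final match score. The talk does not say how the combination is done, so the following is only a sketch under the assumption of a weighted average; the class name ScoreCombiner, the matcher names and the weights are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical combiner: weighted average of per-matcher scores in [0, 1].
final class ScoreCombiner {

    // Matchers without a configured weight are ignored.
    static double combine(Map<String, Double> scores, Map<String, Double> weights) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            Double w = weights.get(e.getKey());
            if (w == null) {
                continue;
            }
            weightedSum += w * e.getValue();
            totalWeight += w;
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("name", 2.0);     // assumed weights, for illustration only
        weights.put("address", 1.0);
        weights.put("phone", 1.0);

        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("name", 1.0);
        scores.put("address", 0.5);
        scores.put("phone", 1.0);

        // (2*1.0 + 1*0.5 + 1*1.0) / 4 = 0.875
        System.out.println(combine(scores, weights));
    }
}
```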

Page 7

Data Model - Tables in HBase

Merchants: Master dataset of merchants
Sangria_id: Unique id generation coordination across mapper processes
Duplicates: Noting duplicate merchants after deduplication
SnapshotMerchants: Merging into master dataset
NewMerchants: The new merchant set that is to be added to the master dataset of merchants

Page 8

Schema

Merchants

Row key: 25204939

Info (column family):
  name: Crepevine
  street: 367 University Avenue
  city: Palo Alto
  state: CA
  zip: 94031
  county: Santa Clara County
  country: United States of America
  website: www.crepevine.com
  phoneNumber: 16503233900
  latitude: 37.430211
  longitude: -122.098221
  source: internet
  mint_category: Food & Dining
  qbo_category: Restaurants
  NAICS: 722110
  SIC: 5182

Mapping (column family):
  sourcename: 10000048, 10000075

Page 9

Schema

Duplicates

Row key     Info (column family)
10000043    25204921:0.998
10000048    25204939:0.78
10000075    25204939:0.95

Sangria_id

Row key     Info (column family)
default     seed:30000, comment:initial seed by vc of 1000
qbo         seed:20550000, comment:initial seed by kf of 20000000
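
The Sangria_id rows hold a per-source seed used to coordinate unique id generation across mapper processes. One common way to do this is for each mapper to atomically advance the seed and claim a block of ids; the talk does not confirm this is the scheme used. The sketch below assumes the seed is the next unused id and uses an in-memory AtomicLong as a stand-in for an atomic counter increment in HBase. The class SeedTable and the block size are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for the Sangria_id table (hypothetical, for illustration).
final class SeedTable {
    private final ConcurrentHashMap<String, AtomicLong> seeds = new ConcurrentHashMap<>();

    void putSeed(String rowKey, long seed) {
        seeds.put(rowKey, new AtomicLong(seed));
    }

    // Atomically claims the ids [first, first + blockSize) and returns first.
    long reserveBlock(String rowKey, long blockSize) {
        AtomicLong seed = seeds.get(rowKey);
        if (seed == null) {
            throw new IllegalArgumentException("no seed row: " + rowKey);
        }
        return seed.getAndAdd(blockSize);
    }

    long currentSeed(String rowKey) {
        return seeds.get(rowKey).get();
    }

    public static void main(String[] args) {
        SeedTable table = new SeedTable();
        table.putSeed("qbo", 20550000L);               // seed value from the slide

        long mapperA = table.reserveBlock("qbo", 1000); // 20550000
        long mapperB = table.reserveBlock("qbo", 1000); // 20551000
        System.out.println(mapperA + " " + mapperB + " " + table.currentSeed("qbo"));
        // prints "20550000 20551000 20552000"
    }
}
```

Reserving ids in blocks keeps the number of round trips to the shared counter low, at the cost of gaps in the id space when a mapper does not use its whole block.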

Page 10

Optimizations (job level)

• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil

• Emitted a ‘put’ from the Mapper or Reducer instead of a regular HTable put

– Use context.write(rowKey, put)

• To make the full table scan faster (HBase read-only Hadoop jobs: deduping matchers, Solr index generator)

    scan.setCaching(500);
    scan.setCacheBlocks(false);

• Used a customized TableInputFormat while scanning (custom number of splits for map tasks)

    job.setInputFormatClass(CustomizedTableInputFormat.class);

  CustomizedTableInputFormat extends the TableInputFormat class and overrides the getSplits method
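
The custom number of splits comes down to cutting a key range into more (or fewer) pieces than the default. The sketch below shows only that range arithmetic in plain Java, with no HBase classes; it assumes row keys can be treated as numbers, which the talk does not state. The class RangeSplitter is hypothetical and is not the talk's CustomizedTableInputFormat.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the arithmetic a custom getSplits might perform.
final class RangeSplitter {

    // Splits [start, end) into at most n contiguous, non-overlapping sub-ranges.
    // Each element is {fromInclusive, toExclusive}. Empty sub-ranges are skipped.
    // Note: width * n must fit in a long; fine for a sketch, not for huge ranges.
    static List<long[]> split(long start, long end, int n) {
        if (n <= 0 || end <= start) {
            throw new IllegalArgumentException("need n > 0 and end > start");
        }
        long width = end - start;
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            long from = start + width * i / n;
            long to = start + width * (i + 1) / n;
            if (to > from) {
                ranges.add(new long[] {from, to});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Four map tasks over the keys 10000000 (inclusive) to 10000100 (exclusive).
        for (long[] r : split(10000000L, 10000100L, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

More splits mean more map tasks scanning smaller key ranges in parallel, which is the usual reason to override the default of one split per region.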

Page 11

Optimizations (code level)

• Storing frequently used column family and column names as byte arrays in a public interface

    public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
    public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");

• Utility class for getting values from hbase.client.Result

    // Call site: read the "name" column from the "info" column family.
    HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);

    // Returns the cell value as a String.
    public static String getColumnValue(Result result, byte[] family, byte[] columnName) {
        return Bytes.toString(result.getValue(family, columnName));
    }

• Writing a sample set of 31 million records into the HBase cluster went from 4 hours 37 mins 47 secs to 32 mins 18 secs (roughly 8.6 times faster)

Page 12

Vrushali Channapattan, Intuit Data Group (BIO)

[email protected]


Thank You!

Page 13

Schema

SnapshotMerchants

Row key: merge

Info (column family):
  first: 1336813613
  start: 1337029113
  end: 1337120100
  comments: merging qbo against dandbmerchants initiated on May 14th 2012
  outcome: started (or) merge run successful

NewMerchants - same as Merchants
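
The start and end values in the SnapshotMerchants row look like Unix epoch seconds; that reading is an assumption, though it agrees with the "May 14th 2012" comment. Under that assumption, this small sketch decodes them and computes the elapsed time between them. The class MergeWindow is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: interprets the slide's start/end values as epoch seconds.
final class MergeWindow {

    static Duration between(long startEpochSec, long endEpochSec) {
        return Duration.between(Instant.ofEpochSecond(startEpochSec),
                                Instant.ofEpochSecond(endEpochSec));
    }

    public static void main(String[] args) {
        long start = 1337029113L; // "start" value from the slide
        long end = 1337120100L;   // "end" value from the slide

        System.out.println(Instant.ofEpochSecond(start)); // 2012-05-14T20:58:33Z
        System.out.println(Instant.ofEpochSecond(end));   // 2012-05-15T22:15:00Z
        System.out.println(between(start, end));          // PT25H16M27S
    }
}
```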