Apache NiFi Record Processing Bryan Bende / @bbende Staff Software Engineer September 8th 2017

Jan 22, 2018

Page 1: Apache NiFi Record Processing

Apache NiFi Record Processing
Bryan Bende / @bbende
Staff Software Engineer
September 8th 2017

Page 2: Apache NiFi Record Processing

© Hortonworks Inc. 2011–2016. All Rights Reserved

Background

→ FlowFile
   – Unit of work that moves through the dataflow
   – Made up of attributes + content

→ Attributes are a map of key/value pairs
   – Available in memory as strings
   – Accessible from Expression Language
   – Useful for quick decision-making/routing

→ Content is arbitrary bytes
   – FlowFile is a pointer to the content in the content repository
   – Content is only accessed if the processor needs to operate on it
   – Could pass through many processors without ever accessing the content
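
The split between in-memory attributes and repository-backed content can be pictured with a small Python sketch. This is not NiFi's implementation: `ContentRepository`, `FlowFile`, `claim_id`, and `route_on_attribute` are hypothetical names chosen only to illustrate the idea that routing can happen without reading content.

```python
from dataclasses import dataclass, field
from typing import Dict


class ContentRepository:
    """Hypothetical stand-in for the content repository: maps claim ids to bytes."""

    def __init__(self) -> None:
        self._claims: Dict[str, bytes] = {}
        self.reads = 0  # counts how often content is actually read

    def write(self, claim_id: str, data: bytes) -> None:
        self._claims[claim_id] = data

    def read(self, claim_id: str) -> bytes:
        self.reads += 1
        return self._claims[claim_id]


@dataclass
class FlowFile:
    """Attributes live in memory as strings; content is only a pointer (claim id)."""
    claim_id: str
    attributes: Dict[str, str] = field(default_factory=dict)


def route_on_attribute(flow_file: FlowFile, key: str, expected: str) -> str:
    """Make a routing decision from attributes alone, without touching content."""
    return "matched" if flow_file.attributes.get(key) == expected else "unmatched"


repo = ContentRepository()
repo.write("claim-1", b"first_name,last_name\nJohn,Smith\n")
ff = FlowFile(claim_id="claim-1", attributes={"mime.type": "text/csv"})

print(route_on_attribute(ff, "mime.type", "text/csv"))  # uses attributes only
print(repo.reads)  # number of content reads so far
```

In this sketch the content is read only when something calls `repo.read(ff.claim_id)`, which mirrors the point that a FlowFile can pass through many processors without its content being accessed.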

Page 3: Apache NiFi Record Processing

The Problem

→ Specialized processors to operate on different data types
   – SplitJson, EvaluateJsonPath, ConvertJsonToAvro
   – SplitAvro, ExtractAvroMetadata, ConvertAvroToJson
   – SplitText, ExtractText, RouteText

→ Sometimes missing conversions
   – No ConvertCsvToJson, so ConvertCsvToAvro then ConvertAvroToJson

→ Sometimes missing a specific function for a data type
   – No EvaluateAvroPath, so ConvertAvroToJson then EvaluateJsonPath

→ Sometimes implemented with different libraries, causing inconsistencies
   – Some Avro processors implemented with Kite, others with Apache Avro libraries
   – Each library may have different features/error handling

Page 4: Apache NiFi Record Processing

The Solution

→ Introduce the concept of a "record"

→ Centralize the logic for reading/writing records into controller services

→ Provide standard processors that operate on records

→ Can still handle arbitrary data, but process records when appropriate

Page 5: Apache NiFi Record Processing

Record Readers & Writers

→ Readers
   – AvroReader
   – CsvReader
   – GrokReader
   – JsonPathReader
   – JsonTreeReader
   – ScriptedReader

→ Writers
   – AvroRecordSetWriter
   – CsvRecordSetWriter
   – JsonRecordSetWriter
   – FreeFormTextRecordSetWriter
   – ScriptedRecordSetWriter

Page 6: Apache NiFi Record Processing

But how is data turned into a record?

→ A record has fields, and fields have information like a name and type

→ Schemas define the fields of a record and give meaning to the data

→ Apache Avro already utilizes schemas, widely used & supported by many tools

→ We can use Avro schemas to define a schema for any type of data

→ Each reader & writer needs a way to obtain a schema

Page 7: Apache NiFi Record Processing

Schema Access Strategy

→ Schema Name
   – Provide the name of a schema to look up in a Schema Registry, can use EL to obtain the name

→ Schema Text
   – Provide the text of a schema in reader/writer, can use EL to obtain the text

→ HWX Content-Encoded Schema Reference
   – Content of the FlowFile contains special header referencing a schema in a Schema Registry

→ HWX Schema Reference Attributes
   – FlowFile contains three attributes that will be used to look up a schema from the configured Schema Registry: 'schema.identifier', 'schema.version', and 'schema.protocol.version'

→ Readers & writers may have additional options specific to the data type
   – Ex: CsvReader can make a schema on the fly from the column names
   – Ex: AvroReader can use the schema embedded in the Avro data file
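
To make the difference between these strategies concrete, here is a minimal Python sketch of how a reader or writer might resolve a schema under three of them. It is an illustration, not NiFi code: the strategy labels, the `resolve_schema` function, and the two in-memory dictionaries standing in for a Schema Registry are all hypothetical, and `schema.protocol.version` is ignored for brevity.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical in-memory registry: schemas by name, and by (identifier, version).
SCHEMAS_BY_NAME: Dict[str, str] = {
    "person": '{"type": "record", "name": "person", "fields": []}',
}
SCHEMAS_BY_ID: Dict[Tuple[int, int], str] = {
    (1, 1): SCHEMAS_BY_NAME["person"],
}


def resolve_schema(strategy: str,
                   attributes: Dict[str, str],
                   schema_name: Optional[str] = None,
                   schema_text: Optional[str] = None) -> dict:
    """Return the parsed schema for one of three access strategies."""
    if strategy == "schema-name":
        # Look the schema up by name in the registry.
        return json.loads(SCHEMAS_BY_NAME[schema_name])
    if strategy == "schema-text":
        # The schema text is supplied directly on the reader/writer.
        return json.loads(schema_text)
    if strategy == "hwx-schema-ref-attributes":
        # Identifier and version come from FlowFile attributes.
        key = (int(attributes["schema.identifier"]),
               int(attributes["schema.version"]))
        return json.loads(SCHEMAS_BY_ID[key])
    raise ValueError(f"unknown strategy: {strategy}")


by_name = resolve_schema("schema-name", {}, schema_name="person")
by_attrs = resolve_schema(
    "hwx-schema-ref-attributes",
    {"schema.identifier": "1", "schema.version": "1"},
)
print(by_name["name"], by_attrs["name"])
```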

Page 8: Apache NiFi Record Processing

Schema Registries

→ AvroSchemaRegistry
   – Access schema by name
   – Only accessible within NiFi

→ Hortonworks Schema Registry
   – Access schema by name and/or version
   – Accessible across systems in the enterprise
   – https://github.com/hortonworks/registry

→ Confluent Schema Registry
   – Access schema by name and/or version
   – Accessible across systems in the enterprise
   – https://github.com/confluentinc/schema-registry
   – Not in an official Apache NiFi release yet, available in master branch (1.4.0-snapshot)

Page 9: Apache NiFi Record Processing

Full Picture

[Class diagram. Top-level types: AbstractControllerService, SchemaRegistryService. RecordReaderFactory is shown with AvroReader, CsvReader, GrokReader, JsonPathReader, and JsonReader. RecordSetWriterFactory is shown with AvroRecordSetWriter, CsvRecordSetWriter, JsonRecordSetWriter, and FreeFormTextWriter. SchemaRegistry is shown with AvroSchemaRegistry, HWXSchemaRegistry, and ConfluentSchemaRegistry. Edges are labeled Implements, Extends, and Uses.]

Page 10: Apache NiFi Record Processing

RecordPath

→ Domain-specific language (DSL) for specifying/accessing fields of a record

→ Similar to JSONPath or XPath

→ Examples:
   – Child: /details/address/zip
   – Descendant: //zip
   – Arrays: /addresses[1]
   – Maps: /details/address['zip']
   – Predicates: /*[./state != 'NY']

→ More info…
   – https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
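
The child and descendant forms above can be mimicked on plain nested dictionaries. The following Python sketch is only an analogy for how such paths select fields; it is not the RecordPath implementation, it covers only the `/a/b/c` and `//name` forms, and the `child` and `descendants` helpers and the sample record are hypothetical.

```python
from typing import Any, Iterator


def child(record: dict, path: str) -> Any:
    """Follow a child path such as /details/address/zip through nested dicts."""
    value: Any = record
    for name in path.strip("/").split("/"):
        value = value[name]
    return value


def descendants(value: Any, name: str) -> Iterator[Any]:
    """Yield every field called `name` at any depth, in the spirit of //zip."""
    if isinstance(value, dict):
        for key, inner in value.items():
            if key == name:
                yield inner
            yield from descendants(inner, name)
    elif isinstance(value, list):
        for item in value:
            yield from descendants(item, name)


record = {
    "details": {"address": {"zip": "10001", "state": "NY"}},
    "addresses": [{"zip": "10001"}, {"zip": "94105"}],
}

print(child(record, "/details/address/zip"))
print(list(descendants(record, "zip")))
```

The child path returns a single field, while the descendant search collects every matching field regardless of nesting depth.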

Page 11: Apache NiFi Record Processing

Record Processors

→ Many processors for operating on records
   – ConvertRecord
   – LookupRecord
   – PartitionRecord
   – QueryRecord
   – SplitRecord
   – UpdateRecord
   – ConsumeKafkaRecord_0_10
   – PublishKafkaRecord_0_10

→ Goal is to keep many records per flow file and avoid splitting if possible

→ Check the latest docs for usage details and other record processors
   – https://nifi.apache.org/docs.html

Page 12: Apache NiFi Record Processing

Example – CSV to JSON w/ Local Schema Registry

Page 13: Apache NiFi Record Processing

Example - CSV to JSON

→ Incoming CSV that looks like:

first_name, last_name
John, Smith
Mike, Jones

→ Want JSON that looks like:

[
  {"first_name" : "John", "last_name" : "Smith"},
  {"first_name" : "Mike", "last_name" : "Jones"}
]
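
For orientation, the target transformation itself can be written in a few lines of standalone Python using only the standard library. This is not how NiFi performs the conversion (the following slides do it with a CsvReader, a JsonRecordSetWriter, and ConvertRecord); the `csv_to_json` helper is a hypothetical name used to show the expected input and output shape.

```python
import csv
import io
import json

CSV_TEXT = "first_name, last_name\nJohn, Smith\nMike, Jones\n"


def csv_to_json(text: str) -> str:
    """Read CSV with a header row and write the rows as a JSON array of objects."""
    # skipinitialspace drops the blank that follows each comma in the sample data.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return json.dumps([dict(row) for row in reader])


print(csv_to_json(CSV_TEXT))
```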

Page 14: Apache NiFi Record Processing

Step 1 – Define an Avro Schema

{
  "name": "person",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" }
  ]
}
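
As a rough illustration of what the schema expresses, the Python sketch below parses the schema text and checks whether a record has exactly the declared fields with matching types. It is a deliberately simplified check, not Avro validation: it handles only a few primitive types, ignores unions, defaults, and nested records, and `conforms` and `PYTHON_TYPES` are hypothetical names.

```python
import json

SCHEMA_TEXT = """
{
  "name": "person",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" }
  ]
}
"""

# Only the Avro primitive types needed for this example.
PYTHON_TYPES = {"string": str, "int": int, "long": int, "boolean": bool}


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields with matching types."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], PYTHON_TYPES[t])
               for name, t in fields.items())


schema = json.loads(SCHEMA_TEXT)
print(conforms({"first_name": "John", "last_name": "Smith"}, schema))
print(conforms({"first_name": "John"}, schema))
```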

Page 15: Apache NiFi Record Processing

Step 2 - Create a Local Schema Registry & Add Schema

Page 16: Apache NiFi Record Processing

Step 3 - Create a CsvReader

Page 17: Apache NiFi Record Processing

Step 4 – Create a JsonRecordSetWriter

Page 18: Apache NiFi Record Processing

Step 5 – GenerateFlowFile Processor

→ Set Run Schedule to something like 10 seconds

→ Put example CSV data in Custom Text property

→ The reader & writer had their 'Schema Name' set to ${schema.name}

→ Add a property called 'schema.name' with the value of 'person', since this is the name in the schema registry
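
The link between the `${schema.name}` setting and the `schema.name` attribute can be sketched as a simple substitution. The Python below is a toy model, not NiFi's Expression Language: it handles only plain `${attribute}` references, none of the EL functions, and it assumes a missing attribute becomes an empty string. The `evaluate` helper is a hypothetical name.

```python
import re
from typing import Dict

# Matches a plain ${attribute} reference.
_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def evaluate(expression: str, attributes: Dict[str, str]) -> str:
    """Replace each ${attribute} reference with that attribute's value."""
    return _REFERENCE.sub(lambda m: attributes.get(m.group(1), ""), expression)


# The reader and writer are configured with ${schema.name};
# GenerateFlowFile adds the attribute schema.name = person.
print(evaluate("${schema.name}", {"schema.name": "person"}))
```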

Page 19: Apache NiFi Record Processing

Step 6 – ConvertRecord Processor

→ Select the appropriate reader and writer

Page 20: Apache NiFi Record Processing

Step 7 - LogAttribute

→ Set Log Payload to true

Page 21: Apache NiFi Record Processing

Step 8 – Connect Processors & Run Flow

Page 22: Apache NiFi Record Processing

Step 9 – Check nifi-app.log for JSON

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'lineageStartDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'fileSize' Value: '137'
FlowFile Attribute Map Content
Key: 'filename' Value: '326844487150210'
Key: 'mime.type' Value: 'application/json'
Key: 'path' Value: './'
Key: 'record.count' Value: '2'
Key: 'schema.name' Value: 'person'
Key: 'uuid' Value: 'e9198166-0cff-400b-a39d-9c8c9c565f85'
--------------------------------------------------
[{"first_name":"John","last_name":"Smith"},{"first_name":"Mike","last_name":"Jones"}]

Page 23: Apache NiFi Record Processing

Example – CSV to JSON w/ Hortonworks Schema Registry

Page 24: Apache NiFi Record Processing

Step 1 – Run the Hortonworks Schema Registry

→ Download the latest release
   – https://github.com/hortonworks/registry/releases/download/v0.2.1/hortonworks-registry-0.2.1.tar.gz

→ Extract the tar and run the application
   – tar xzvf hortonworks-registry-0.2.1.tar.gz
   – cd hortonworks-registry-0.2.1
   – ./bin/registry-server-start.sh conf/registry-dev.yaml

→ Navigate to registry UI in your browser
   – http://localhost:9090

Page 25: Apache NiFi Record Processing

Step 2 – Add Schema

Page 26: Apache NiFi Record Processing

Step 3 – Create HortonworksSchemaRegistry Service

Page 27: Apache NiFi Record Processing

Step 4 – Reconfigure CsvReader

Page 28: Apache NiFi Record Processing

Step 5 – Reconfigure JsonRecordSetWriter

Page 29: Apache NiFi Record Processing

Step 6 – Run the same flow with same results

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'lineageStartDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'fileSize' Value: '137'
FlowFile Attribute Map Content
Key: 'filename' Value: '326844487150210'
Key: 'mime.type' Value: 'application/json'
Key: 'path' Value: './'
Key: 'record.count' Value: '2'
Key: 'schema.name' Value: 'person'
Key: 'uuid' Value: 'e9198166-0cff-400b-a39d-9c8c9c565f85'
--------------------------------------------------
[{"first_name":"John","last_name":"Smith"},{"first_name":"Mike","last_name":"Jones"}]

Page 30: Apache NiFi Record Processing

Example – Use Specific Schema from HWX Schema Registry

Page 31: Apache NiFi Record Processing

Specifying a Schema Version

→ Previous example used "Schema Name" for "Schema Access Strategy"
   – NiFi retrieved latest version of schema for name
   – Cached schema based on configuration in controller service

→ We can also use "HWX Schema Reference Attributes" to be more specific
   – schema.identifier
   – schema.version
   – schema.protocol.version

Page 32: Apache NiFi Record Processing

Add New Version of Schema

Page 33: Apache NiFi Record Processing

Obtaining Identifier, Version, Protocol

→ We can get these values from the schema registry REST API
   – http://localhost:9090/api/v1/schemaregistry/schemas/person
   – http://localhost:9090/api/v1/schemaregistry/schemas/person/versions
   – Protocol Version is always '1' for now
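
A minimal Python sketch for querying these two endpoints is shown below, using only the standard library. The base URL and paths are the ones listed above; the helper names (`schema_url`, `versions_url`, `fetch_json`) are hypothetical. The shape of the JSON responses is not shown in these slides, so the sketch leaves the actual request as a commented-out call rather than assuming any field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090/api/v1/schemaregistry"


def schema_url(name: str) -> str:
    """URL for a schema's metadata (where the identifier can be found)."""
    return f"{BASE_URL}/schemas/{name}"


def versions_url(name: str) -> str:
    """URL listing the versions of a schema."""
    return f"{schema_url(name)}/versions"


def fetch_json(url: str):
    """GET a URL and parse the body as JSON; needs the registry to be running."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


print(schema_url("person"))
print(versions_url("person"))

# With a registry running locally, inspect the responses to find the values:
# print(json.dumps(fetch_json(versions_url("person")), indent=2))
```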

Page 34: Apache NiFi Record Processing

Update Flow to Specify Attributes

→ Remove schema.name and add additional attributes in GenerateFlowFile

Page 35: Apache NiFi Record Processing

Update CsvReader with new Schema Access Strategy

Page 36: Apache NiFi Record Processing

Update JsonRecordSetWriter with new Schema Access Strategy

Page 37: Apache NiFi Record Processing

Run the Flow Again

→ Using v2 of the schema we should only see first_name:

Key: 'schema.identifier' Value: '1'
Key: 'schema.name' Value: 'person'
Key: 'schema.protocol.version' Value: '1'
Key: 'schema.version' Value: '2'
Key: 'uuid' Value: '34407f4e-3bf1-46d5-a6d4-6da5ba197eb8'
--------------------------------------------------
[{"first_name":"John"},{"first_name":"Mike"}]
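
The effect seen in this output can be pictured as projecting each record onto the fields of the selected schema version. The Python sketch below assumes that v2 of the schema declares only `first_name`; the v2 schema text is not included in this transcript, so that field list is inferred from the output above, and `project` is a hypothetical helper rather than a NiFi API.

```python
import json
from typing import Dict, List


def project(record: Dict[str, str], field_names: List[str]) -> Dict[str, str]:
    """Keep only the fields named by the schema, in schema order."""
    return {name: record[name] for name in field_names if name in record}


records = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "Mike", "last_name": "Jones"},
]

# Assumed field list for v2 of the 'person' schema (inferred, not from the slides).
v2_fields = ["first_name"]

print(json.dumps([project(r, v2_fields) for r in records], separators=(",", ":")))
```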

Page 38: Apache NiFi Record Processing

Apache NiFi + Apache Kafka + HWX Schema Registry

Page 39: Apache NiFi Record Processing

Publishing

→ PublishKafkaRecord_0_10
   – Streams incoming flow file as records using configured Record Reader
   – Serializes each record to bytes using configured Record Set Writer

→ Generally don't want to publish schema on every message
   – "Schema Write Strategy" of Record Set Writer controls where schema ends up
   – "HWX Content-Encoded Schema Reference" encodes schema info at beginning of content
   – Single record published as encoded schema reference + bytes of a record

Protocol (1 byte) | Identifier (8 bytes) | Version (4 bytes) | Record Bytes
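
The encoded schema reference can be sketched with Python's `struct` module. This sketch assumes a 1-byte protocol version, an 8-byte identifier, and a 4-byte version (the width NiFi's documentation gives for this strategy), all in big-endian order; the byte order and the `encode`/`decode` helper names are assumptions for illustration, not taken from the slides.

```python
import struct
from typing import Tuple

# Assumed layout: 1-byte protocol, 8-byte identifier, 4-byte version, big-endian.
HEADER = struct.Struct(">bqi")


def encode(protocol: int, identifier: int, version: int, record: bytes) -> bytes:
    """Prefix the serialized record with the encoded schema reference."""
    return HEADER.pack(protocol, identifier, version) + record


def decode(message: bytes) -> Tuple[int, int, int, bytes]:
    """Split a message into (protocol, identifier, version, record bytes)."""
    protocol, identifier, version = HEADER.unpack_from(message)
    return protocol, identifier, version, message[HEADER.size:]


message = encode(1, 1, 2, b'{"first_name":"John"}')
print(HEADER.size)      # length of the schema reference in bytes
print(decode(message))
```

A consumer that knows this layout can read the header first, look up the schema by identifier and version, and only then deserialize the remaining bytes.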

Page 40: Apache NiFi Record Processing

Consuming

→ ConsumeKafkaRecord_0_10
   – Reads messages from Kafka into records using configured Record Reader
   – Writes records to a flow file using configured Record Set Writer

→ If publisher used "HWX Content-Encoded Schema Reference" as the Schema Write Strategy, then consumer needs to use "HWX Content-Encoded Schema Reference" as the Schema Access Strategy

Page 41: Apache NiFi Record Processing

Publish & Consume

[Diagram: PublishKafkaRecord_0_10, Kafka, ConsumeKafkaRecord_0_10, and the HWX Schema Registry. Each message is shown as [schema ref][record].]

1. Publish
2. Consume
3. Read encoded schema info from message
4. Retrieve schema for encoded protocol, id, and version

Page 42: Apache NiFi Record Processing

Additional Resources

→ https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi

→ https://blogs.apache.org/nifi/entry/real-time-sql-on-event

→ https://community.hortonworks.com/content/kbentry/119766/installing-a-local-hortonworks-registry-to-use-wit.html

→ https://community.hortonworks.com/articles/131320/using-partitionrecord-grokreaderjsonwriter-to-pars.html

→ https://community.hortonworks.com/articles/115311/convert-csv-to-json-avro-xml-using-convertrecord-p.html