Reactive Feature Generation with Spark and MLlib
Jeff Smith x.ai
x.ai is a personal assistant who schedules meetings for you
[Book cover: Reactive Machine Learning, Jeff Smith, Manning]
Reactive Machine Learning
Machine Learning Systems
Traits of Reactive Systems
Responsive
Resilient
Elastic
Message-Driven
Reactive Strategies
Machine Learning Data
Machine Learning Systems
INTRODUCING FEATURES
Microblogging Data
Squawks
Squawkers
Super
Feature Generation
[Diagram: Raw Data -> Feature Generation Pipeline -> Features]
FEATURE TRANSFORMS
Extract Transform
Features
trait FeatureType[V] {
  val name: String
}

trait Feature[V] extends FeatureType[V] {
  val value: V
}
Features
case class IntFeature(name: String, value: Int) extends Feature[Int]
case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]
Named Transforms
def binarize(feature: IntFeature, threshold: Double) = {
  BooleanFeature("binarized-" + feature.name, feature.value > threshold)
}
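A minimal usage sketch of the named transform above, assuming the `Feature` traits and case classes from the earlier slides. The feature name `squawkLength`, its value, and the threshold are made up for illustration.

```scala
object BinarizeSketch {
  // Feature types as defined on the earlier slides.
  trait FeatureType[V] { val name: String }
  trait Feature[V] extends FeatureType[V] { val value: V }

  case class IntFeature(name: String, value: Int) extends Feature[Int]
  case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

  // The named transform from the slide: the output name records the transform.
  def binarize(feature: IntFeature, threshold: Double): BooleanFeature =
    BooleanFeature("binarized-" + feature.name, feature.value > threshold)

  // Hypothetical input: a squawk that is 140 characters long.
  val squawkLength = IntFeature("squawkLength", 140)

  // 140 > 100.0, so this is BooleanFeature("binarized-squawkLength", true).
  val isLong = binarize(squawkLength, 100.0)
}
```

Because the transform returns a new feature with a derived name, the raw feature and the binarized feature can coexist without being confused.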
Non-Trivial Transforms
def categorize(thresholds: List[Double]) = {
  (rawFeature: DoubleFeature) => {
    IntFeature("categorized-" + rawFeature.name,
      thresholds.sorted
        .zipWithIndex
        .find { case (threshold, i) => rawFeature.value < threshold }
        .getOrElse((None, -1))
        ._2)
  }
}
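To make the bucketing behavior concrete, here is a small worked sketch of `categorize`. `DoubleFeature` is used on the slide but never defined, so the definition below is an assumption by analogy with `IntFeature`. The thresholds and input values are invented.

```scala
object CategorizeSketch {
  trait FeatureType[V] { val name: String }
  trait Feature[V] extends FeatureType[V] { val value: V }
  case class IntFeature(name: String, value: Int) extends Feature[Int]
  // Assumed definition: the slides use DoubleFeature without showing it.
  case class DoubleFeature(name: String, value: Double) extends Feature[Double]

  // Returns a function that maps a raw value to the index of the first
  // (sorted) threshold it falls below, or -1 if it is below none of them.
  def categorize(thresholds: List[Double]): DoubleFeature => IntFeature = {
    val indexed = thresholds.sorted.zipWithIndex
    (rawFeature: DoubleFeature) =>
      IntFeature(
        "categorized-" + rawFeature.name,
        indexed
          .find { case (threshold, _) => rawFeature.value < threshold }
          .map { case (_, index) => index }
          .getOrElse(-1))
  }

  // Thresholds may be given in any order; they are sorted to 1.0, 5.0, 10.0.
  val bucket = categorize(List(10.0, 1.0, 5.0))

  val low  = bucket(DoubleFeature("interactions", 0.5))  // below 1.0  -> category 0
  val mid  = bucket(DoubleFeature("interactions", 3.0))  // below 5.0  -> category 1
  val over = bucket(DoubleFeature("interactions", 12.0)) // above all  -> category -1
}
```

Returning a function rather than a value lets the thresholds be fixed once and the resulting transform be reused across many features.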
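Standardizing Naming
trait Named {
  // Append the calling transform's method name to the input feature's name.
  // Feature[_] (rather than Feature[Any]) accepts any feature, since Feature is invariant.
  def name(inputFeature: Feature[_]): String = {
    inputFeature.name + "-" +
      Thread.currentThread.getStackTrace()(3).getMethodName
  }
}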
object Binarizer extends Named {
  def binarize(feature: IntFeature, threshold: Double) = {
    BooleanFeature(name(feature), feature.value > threshold)
  }
}
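Looking up the method name from a fixed stack-frame index keeps call sites short, but the position of the calling frame may vary with the Scala version and runtime. As one possible alternative, the sketch below has each transform declare its name explicitly. `NamedTransform` and `transformName` are hypothetical names introduced here, not part of the original slides.

```scala
object NamingSketch {
  trait FeatureType[V] { val name: String }
  trait Feature[V] extends FeatureType[V] { val value: V }
  case class IntFeature(name: String, value: Int) extends Feature[Int]
  case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

  // Hypothetical variant of Named: the transform states its own name
  // instead of recovering it from the call stack.
  trait NamedTransform {
    def transformName: String

    def name(inputFeature: Feature[_]): String =
      inputFeature.name + "-" + transformName
  }

  object Binarizer extends NamedTransform {
    val transformName = "binarize"

    def binarize(feature: IntFeature, threshold: Double): BooleanFeature =
      BooleanFeature(name(feature), feature.value > threshold)
  }

  // Produces BooleanFeature("squawkLength-binarize", true).
  val example = Binarizer.binarize(IntFeature("squawkLength", 140), 100.0)
}
```

The trade-off is one extra line per transform in exchange for names that do not depend on how the compiler lays out stack frames.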
Lineages
"cleaned-normalized-categorized-interactions"
interactions -> categorized -> normalized -> cleaned
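A lineage encoded in a name can be read back mechanically. The sketch below assumes that each transform prepends its name with a `-` separator and that individual names never contain `-` themselves; the `lineage` helper is hypothetical.

```scala
object LineageSketch {
  // Split the name on "-" and reverse it, so the raw feature comes first
  // and the most recently applied transform comes last.
  def lineage(featureName: String): List[String] =
    featureName.split("-").toList.reverse

  // List("interactions", "categorized", "normalized", "cleaned")
  val steps = lineage("cleaned-normalized-categorized-interactions")
}
```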
PIPELINES
Multi-Stage Generation
[Diagram: Raw Data -> Extract -> Extracted -> Transform -> Features]
Pipeline Composition
def extract(rawSquawks: RDD[JsonDocument]): RDD[IntFeature] = {
  ???
}

// Takes RDD[IntFeature] so that it composes directly with extract (RDD is invariant).
def transform(inputFeatures: RDD[IntFeature]): RDD[BooleanFeature] = {
  ???
}

val trainableFeatures = transform(extract(rawSquawks))
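The same composition can be tried without a cluster. The sketch below swaps `RDD` for a plain `Seq` and `JsonDocument` for a hypothetical `RawSquawk` case class; the stage bodies (squawk length, a threshold of 10) are invented stand-ins for the `???` placeholders above.

```scala
object PipelineSketch {
  case class IntFeature(name: String, value: Int)
  case class BooleanFeature(name: String, value: Boolean)

  // Hypothetical stand-in for a raw JSON document.
  case class RawSquawk(id: Int, text: String)

  // Stage 1: extract a raw feature (here, the length of the squawk text).
  val extract: Seq[RawSquawk] => Seq[IntFeature] =
    _.map(squawk => IntFeature("squawkLength", squawk.text.length))

  // Stage 2: transform the extracted feature into a trainable one.
  val transform: Seq[IntFeature] => Seq[BooleanFeature] =
    _.map(f => BooleanFeature("binarized-" + f.name, f.value > 10))

  // The pipeline is just function composition; there is no orchestrator.
  val pipeline: Seq[RawSquawk] => Seq[BooleanFeature] = extract.andThen(transform)

  val rawSquawks = Seq(RawSquawk(1, "hi"), RawSquawk(2, "squawk squawk squawk"))
  val trainableFeatures = pipeline(rawSquawks)
}
```

Because each stage is an ordinary function, stages can be tested in isolation and recombined without any scheduling machinery.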
Pipelines
Don’t orchestrate when you can compose
Pipeline Failure
[Diagram: Raw Data -> Feature Generation Pipeline -> Features]
Supervising Feature Generation
[Diagram: Raw Data -> Feature Generation Pipeline -> Features]
Supervision
Reactive DB Drivers
Cluster Managers
Feature Validation
Reactive Database Drivers
Couchbase
MongoDB
Cassandra
Reactive Database Drivers
val rawSquawks: RDD[JsonDocument] =
  sc.couchbaseView(ViewQuery.from("squawks", "by_squawk_id"))
    .map(_.id)
    .couchbaseGet[JsonDocument]()
Cluster Managers
Spark Standalone
Mesos
YARN
FEATURE COLLECTIONS
Original Features
object SquawkLength extends FeatureType[Int]
object Super extends LabelType[Boolean]
val originalFeatures: Set[FeatureType[_]] = Set(SquawkLength)
val label = Super
Basic Features
object PastSquawks extends FeatureType[Int]
val basicFeatures = originalFeatures + PastSquawks
More Features
object MobileSquawker extends FeatureType[Boolean]
val moreFeatures = basicFeatures + MobileSquawker
Feature Collections
case class FeatureCollection(id: Int,
                             createdAt: DateTime,
                             features: Set[_ <: FeatureType[_]],
                             label: LabelType[_])
Feature Collections
val earlierCollection = FeatureCollection(101, earlier, basicFeatures, label)
val latestCollection = FeatureCollection(202, now, moreFeatures, label)
val featureCollections = sc.parallelize(
  Seq(earlierCollection, latestCollection))
Fallback Collections
val FallbackCollection = FeatureCollection(404, beginningOfTime, originalFeatures, label)
Fallback Collections
def validCollection(collections: RDD[FeatureCollection],
                    invalidFeatures: Set[FeatureType[_]]) = {
  val validCollections = collections
    .filter(fc => !fc.features.exists(invalidFeatures.contains))
    .sortBy(collection => collection.id)
  if (validCollections.count() > 0) {
    validCollections.first()
  } else FallbackCollection
}
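The fallback logic can be exercised locally. This sketch mirrors `validCollection` on a plain `Seq` instead of an `RDD`, with `java.time.Instant` standing in for `DateTime`. The shape of `LabelType`, the `name` values, and the timestamps are assumptions made for illustration.

```scala
import java.time.Instant

object FallbackSketch {
  trait FeatureType[V] { val name: String }
  trait LabelType[V] { val name: String } // assumed shape; not shown on the slides

  object SquawkLength extends FeatureType[Int] { val name = "squawkLength" }
  object PastSquawks extends FeatureType[Int] { val name = "pastSquawks" }
  object MobileSquawker extends FeatureType[Boolean] { val name = "mobileSquawker" }
  object Super extends LabelType[Boolean] { val name = "super" }

  case class FeatureCollection(id: Int,
                               createdAt: Instant,
                               features: Set[FeatureType[_]],
                               label: LabelType[_])

  val originalFeatures: Set[FeatureType[_]] = Set(SquawkLength)
  val basicFeatures = originalFeatures + PastSquawks
  val moreFeatures = basicFeatures + MobileSquawker

  val FallbackCollection =
    FeatureCollection(404, Instant.EPOCH, originalFeatures, Super)

  val collections = Seq(
    FeatureCollection(101, Instant.parse("2016-01-01T00:00:00Z"), basicFeatures, Super),
    FeatureCollection(202, Instant.parse("2016-06-01T00:00:00Z"), moreFeatures, Super))

  // Keep collections with no invalid feature, take the one with the lowest id,
  // and fall back to the original features if nothing survives the filter.
  def validCollection(collections: Seq[FeatureCollection],
                      invalidFeatures: Set[FeatureType[_]]): FeatureCollection =
    collections
      .filter(fc => !fc.features.exists(f => invalidFeatures.contains(f)))
      .sortBy(_.id)
      .headOption
      .getOrElse(FallbackCollection)

  // Only collection 202 uses MobileSquawker, so 101 is still valid.
  val withoutMobile = validCollection(collections, Set(MobileSquawker))
  // Both collections use PastSquawks, so the fallback (404) is returned.
  val withoutPast = validCollection(collections, Set(PastSquawks))
}
```

The fallback contains only the original features, so the pipeline always has something valid to generate even when every newer collection is rejected.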
VALIDATING FEATURES
Predicting Super Squawkers
Training Instances
val instances = Seq(
  (123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0),
  (456, Vectors.dense(0.1, 1.3, 11.3, 1.2), 1.0),
  (789, Vectors.dense(1.2, 0.8, 14.5, 0.5), 0.0))
Selection Setup
val featuresName = "features"
val labelName = "isSuper"
val instancesDF = sqlContext.createDataFrame(instances)
  .toDF("id", featuresName, labelName)
val K = 2
Feature Selection
val selector = new ChiSqSelector()
  .setNumTopFeatures(K)
  .setFeaturesCol(featuresName)
  .setLabelCol(labelName)
  .setOutputCol("selectedFeatures")
val selectedFeatures = selector.fit(instancesDF)
  .transform(instancesDF)
Back to RDDs
val labeledPoints = sc.parallelize(instances.map({
  case (id, features, label) =>
    LabeledPoint(label = label, features = features)
}))
Validating Features
def validateSelection(labeledPoints: RDD[LabeledPoint],
                      topK: Int,
                      cutoff: Double) = {
  val pValues = Statistics.chiSqTest(labeledPoints)
    .map(result => result.pValue)
    .sorted
  pValues(topK) < cutoff
}
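The cutoff check itself can be illustrated without Spark. The sketch below applies the same comparison to a plain sequence of p-values; `passesCutoff` is a hypothetical helper and the p-values are invented. Note that indexing is zero-based, so `pValues(topK)` reads the (topK + 1)-th smallest p-value, and `topK` must be smaller than the number of features.

```scala
object CutoffSketch {
  // Same comparison as validateSelection, on plain p-values rather than
  // the test results returned by Statistics.chiSqTest.
  def passesCutoff(pValues: Seq[Double], topK: Int, cutoff: Double): Boolean =
    pValues.sorted.apply(topK) < cutoff

  // Made-up p-values, one per feature; sorted they are 0.01, 0.03, 0.20, 0.50.
  val pValues = Seq(0.20, 0.01, 0.03, 0.50)

  // Index 2 of the sorted values is 0.20, which is not below 0.05.
  val strict = passesCutoff(pValues, topK = 2, cutoff = 0.05)
  // The same p-value does fall below a looser cutoff of 0.25.
  val loose = passesCutoff(pValues, topK = 2, cutoff = 0.25)
}
```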
Persisting Validations
case class ValidatedFeatureCollection(id: Int,
                                      createdAt: DateTime,
                                      features: Set[_ <: FeatureType[_]],
                                      label: LabelType[_],
                                      passedValidation: Boolean,
                                      cutoff: Double)
SUMMARY
Machine Learning Systems
Traits of Reactive Systems
Reactive Strategies
Machine Learning Data
Feature Generation
[Diagram: Raw Data -> Feature Generation Pipeline -> Features]
Reactive Feature Generation with Spark and MLlib
reactivemachinelearning.com @jeffksmithjr @xdotai