® 2011 Dachis Group. dachisgroup.com
Dachis Group Las Vegas 2012
Intermediate Pig Know How
Timothy Potter (Twitter: thelabdude)
Pigout Hackday, Austin TX
May 11, 2012
Jan 27, 2015
UFO Sightings Data Set
1. Which US city has the most UFO sightings overall?
2. What is the most common UFO shape within a 100 mile radius of your answer for #1?
Pig Mahout Example: Training 20 Newsgroups Classifier
• Loading messages using a custom loader
• Hashed Feature Vectors
• Train Logistic Regression Model
• Evaluate Model on held-out Data
Agenda
1. What US city has the most UFO sightings overall?
2. What is the most common UFO shape within a 100 mile radius of your answer for #1?
Using Two Data Sets:
• UFO sightings data set available from Infochimps
• US city / states with geo-codes available from US Census
UFO Sightings
19930809 19990816 Westminster, CO triangle 1 minute A white puffy cottonball appeared and then a triangle ...
20010111 20010113 Pueblo, CO fireball 30 sec Blue fireball lights up the skies of colorado and nebraska ...
20001026 20030920 Aurora, CO triangle 10 Minutes Triangular craft (two footbal fields in size)As reported to Art Bell ...
ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
sighted_at: chararray, reported_at: chararray,
location: chararray, shape: chararray,
duration: chararray, description: chararray
);
ufo_sightings_split_loc = FOREACH (
FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
) {
split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
GENERATE city_lc AS city, state_lc AS state, ...
Load Sightings Data
Pig provides functions for doing basic text munging tasks, or you can use a UDF ...
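The city / state split done with REGEX_EXTRACT can be tried outside Pig. The following is a minimal Java sketch that reuses the same pattern; the class name LocationSplitSketch and the split helper are hypothetical names, not part of the slides, and the use of find() (rather than a full match) is an assumption about how the Pig function behaves in your version.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class LocationSplitSketch {
    // Same pattern as in the Pig script: group 1 = city, group 3 = two-letter state.
    private static final Pattern CITY_STATE =
        Pattern.compile("([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})");

    /** Returns {city, state} in lower case, or null when the location does not match. */
    static String[] split(String location) {
        if (location == null) {
            return null;
        }
        Matcher m = CITY_STATE.matcher(location.trim());
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1).toLowerCase(Locale.ROOT),
            m.group(3).toLowerCase(Locale.ROOT)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("Westminster, CO");
        System.out.println(parts[0] + " / " + parts[1]); // prints "westminster / co"
    }
}
```

Locations that do not look like "City, ST" yield null, which mirrors the null handling in the Pig script.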
CO 0862000 02411501 Pueblo city 138930097 2034229 53.641 0.785 38.273147 -104.612378
CO 0883835 02412237 Westminster city 81715203 5954681 31.550 2.299 39.882190 -105.064426
CO 0804000 02409757 Aurora city 400759192 1806832 154.734 0.698 39.688002 -104.689740
us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
state: chararray, geoid: chararray,
ansicode: chararray, name: chararray,
....
latitude: double, longitude: double
);
us_cities_w_geo = FOREACH us_cities {
city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
GENERATE TRIM(city_name) as city, TRIM(LOWER(state)) AS state, latitude, longitude;
};
Load US Cities Data with geo-codes
Use projection to select only the fields you want to work with: city, state, latitude, longitude
What US city has the most UFO sightings overall?
Things to consider ...
1. Need to select only sightings from US cities
2. Need to count sightings for each city
3. Need to do a TOP to get the city with the most sightings
Join sightings data with US city data
Group results from step 1 by state/city and count
Descending sort on count and choose the top.
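The group, count, and pick-the-top steps listed above can be sketched in plain Java to make the logic concrete before writing the Pig version. This is only an illustration on in-memory data; the class name TopCitySketch, the "state,city" key format, and the toy input are assumptions, not taken from the slides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only.
class TopCitySketch {
    /** Counts sightings per "state,city" key and returns the entry with the highest count. */
    static Optional<Map.Entry<String, Long>> topCity(List<String[]> sightings) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String[] s : sightings) {
            // s[0] = state, s[1] = city; input is assumed to be already joined to US cities
            counts.merge(s[0] + "," + s[1], 1L, Long::sum);
        }
        // Equivalent in spirit to ORDER ... DESC followed by LIMIT 1
        return counts.entrySet().stream().max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        // Made-up toy rows, not real counts from the data set
        List<String[]> rows = Arrays.asList(
            new String[] {"wa", "seattle"},
            new String[] {"co", "aurora"},
            new String[] {"wa", "seattle"});
        topCity(rows).ifPresent(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```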
ufo_sightings_with_geo = FOREACH (
JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
) GENERATE
ufo_sightings_by_city::state AS state,
ufo_sightings_by_city::city AS city,
ufo_sightings_by_city::sighted_at AS sighted_at,
ufo_sightings_by_city::sighted_year AS sighted_year,
ufo_sightings_by_city::shape AS shape,
us_cities_w_geo::latitude AS latitude,
us_cities_w_geo::longitude AS longitude;
What US city has the most UFO sightings overall?
Inner JOIN by (state,city) to attach geo-codes to sightings
Group by (state,city) to get number of sightings for each city
Poor man’s TOP
grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
COUNT($1) AS the_count;
most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;
DUMP top_city_state;
(seattle,wa,446,light,47.620499,-122.350876)
Seattle only averages 58 sunny days a year. Coincidence?
Maybe all the UFOs are coming to look at the Space Needle?
What US city has the most UFO sightings overall?
ufo_sightings_with_geo = FOREACH (
JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
) GENERATE
ufo_sightings_by_city::state AS state,
ufo_sightings_by_city::city AS city,
ufo_sightings_by_city::sighted_at AS sighted_at,
ufo_sightings_by_city::sighted_year AS sighted_year,
ufo_sightings_by_city::shape AS shape,
us_cities_w_geo::latitude AS latitude,
us_cities_w_geo::longitude AS longitude;
grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
COUNT($1) AS the_count;
most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;
DUMP top_city_state;
Pig Explain: Pull back the covers ...
Job 1 - Mapper
Job 1 - Reducer
Job 2 – Full Map/Reduce
pig -x local -e 'explain -script ufo.pig'
Things we need to solve this ...
1) Some way to calculate geographical distance from a geographical location (lat / lng)
2) Iterate over all cities that have sightings to get the distance from our centroid
3) Filter by distance and count shapes
What is the most common UFO shape within a 100 mile radius of your answer for #1?
REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
...
with_distance = FOREACH calc_dist {
GENERATE city, state,
CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
};
Let’s build a UDF that uses the Haversine Formula to calculate the distance between two points
See: http://en.wikipedia.org/wiki/Haversine_formula
UDF: User Defined Function
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GeoDistance extends EvalFunc<Double> {
    @Override
    public Double exec(Tuple input) throws IOException {
        // Expect four non-null values: from_lat, from_lng, to_lat, to_lng
        if (input == null || input.size() < 4 || input.isNull(0) ||
                input.isNull(1) || input.isNull(2) || input.isNull(3)) {
            return null;
        }
        Double dist = null;
        try {
            Double fromLat = (Double)input.get(0);
            Double fromLng = (Double)input.get(1);
            Double toLat = (Double)input.get(2);
            Double toLng = (Double)input.get(3);
            dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
        } catch (Exception exc) {
            // better to return null than to throw exception
        }
        return dist;
    }

    protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        // details excluded for brevity - see http://www.movable-type.co.uk/scripts/latlong.html
        return dist;
    }
}
UDF: User Defined Function
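The slide leaves out the body of the distance calculation. For reference, here is a minimal standalone sketch of the Haversine formula in Java. It is not the author's implementation: the class name HaversineSketch, the Earth radius constant, and the (lat1, lon1, lat2, lon2) argument order are assumptions chosen for this illustration (note that the UDF above passes its arguments in a different order).

```java
// Hypothetical helper, for illustration only.
class HaversineSketch {
    // Mean Earth radius in miles; an assumed constant, not taken from the slides.
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between two points given in decimal degrees. */
    static double distanceInMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // c is the central angle in radians
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Westminster, CO to Aurora, CO, using coordinates from the census sample rows
        double d = distanceInMiles(39.882190, -105.064426, 39.688002, -104.689740);
        System.out.println("miles: " + d);
    }
}
```

Inside the UDF, a method like this would be called once per row produced by the CROSS, so it should stay free of object allocation and I/O.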
What is the most common UFO shape ...
top_city = FOREACH top_city_state GENERATE city, state, latitude AS from_lat, longitude AS from_lng;
sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
GENERATE FLATTEN($0) AS (state,city,latitude,longitude);
Including lat / lng in the group-by key to help reduce the number of records I’m crossing
Pig only supports equi-joins, so we need to use CROSS to get the lat / lng of the two points to calculate distance using our UDF
When joining, list the largest relation first and the smallest last, and optimize if possible, such as by using 'replicated'
calc_dist = FOREACH (CROSS sighting_cities, top_city) GENERATE
sighting_cities::city AS city,
sighting_cities::state AS state,
sighting_cities::latitude AS to_lat,
sighting_cities::longitude AS to_lng,
CalcGeoDistance(top_city::from_lat, top_city::from_lng, sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;
near = FILTER calc_dist BY dist_in_miles < 100;
shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city), near BY (state,city) USING 'replicated')
GENERATE ufo_sightings_with_geo::shape AS shape;
count_shapes = FOREACH (GROUP shapes BY shape) GENERATE $0 AS shape, COUNT($1) AS the_count;
sorted_counts = ORDER count_shapes BY the_count DESC;
In Pig:
fs -getmerge sorted_counts sorted_counts.txt
In R:
shapes <- read.table("sorted_counts.txt",
header=F, sep="\t", col.names=c("shape","occurs"), stringsAsFactors=F)
barplot(c(shapes$occurs),
main="UFO Sightings (Shapes)",
ylab="Number of Sightings",
ylim=c(0,500),
cex.names=0.8,
las=2,
names.arg=c(shapes$shape))
Visualize Results
Use Pig’s IsEmpty function to isolate records that only occur in one of the relations ... such as sightings in cities not in the US census list:
city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
us_cities_w_geo BY (state,city);
outside_us_sightings =
FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo)) GENERATE
FLATTEN(ufo_sightings_by_city);
Set Logic in Pig
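The COGROUP plus IsEmpty pattern is effectively an anti-join: keep the rows on one side whose key has no match on the other side. A small Java sketch of the same idea on in-memory keys follows; the class name AntiJoinSketch, the "state,city" key format, and the toy keys are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class AntiJoinSketch {
    /** Keeps the sighting keys ("state,city") that have no match in the census key set. */
    static List<String> notInCensus(List<String> sightingKeys, Set<String> censusKeys) {
        List<String> result = new ArrayList<>();
        for (String key : sightingKeys) {
            // An absent key plays the role of IsEmpty(us_cities_w_geo) in the Pig script
            if (!censusKeys.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up toy keys
        Set<String> census = new HashSet<>(Arrays.asList("co,aurora", "wa,seattle"));
        List<String> sightings = Arrays.asList("co,aurora", "on,toronto", "wa,seattle");
        System.out.println(notInCensus(sightings, census));
    }
}
```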
Example Integration: Pig-Vector
GitHub project by Ted Dunning, Mahout Committer
https://github.com/tdunning/pig-vector
Use Case:
Train Logistic Regression Model from Pig
Hello World of ML – 20 Newsgroups
Mahout and Pig
Load 20-newsgroups messages using custom Pig LoadFunc:
docs = LOAD '20news-bydate-train/*/*' USING
org.apache.mahout.pig.MessageLoader()
AS (newsgroup, id:int, subject, body);
In Java:
public class MessageLoader extends LoadFunc {
public void setLocation(String location, Job job) throws IOException {
// setup where we're reading data from
}
public InputFormat getInputFormat() throws IOException {
return new TextInputFormat() {
// ...
};
}
public Tuple getNext() throws IOException {
// parse message and build Tuple that matches the schema
}
}
Mahout and Pig
Step 1: Load the Training Data
-- Import UDF, define vectorizing strategy and fixed size of feature vector
DEFINE encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000', 'subject+body', 'group:word, article:numeric, subject:text, body:text');
vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) as v;
Result is a hashed feature vector where features
are mapped to indexes in a fixed size sparse vector
(from Mahout)
Fixed sized vectors are needed to train
Mahout’s SGD-based logistic regression model
Mahout and Pig
Step 2: Vectorize using Pig-Vector UDF
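To make the idea of a hashed feature vector concrete, here is a deliberately simplified Java sketch of the hashing trick: each token is hashed to an index in a fixed-size vector and its count is accumulated there. This is not how Mahout's or Pig-Vector's encoders are implemented; the class name HashedVectorSketch, the whitespace tokenization, and the use of String.hashCode() are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HashedVectorSketch {
    /** Maps each whitespace-separated token to an index in [0, size) and counts occurrences. */
    static Map<Integer, Double> encode(String text, int size) {
        Map<Integer, Double> vector = new HashMap<>(); // sparse: only non-zero indexes are stored
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // floorMod keeps the index non-negative even when hashCode() is negative
            int index = Math.floorMod(token.hashCode(), size);
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        // 100000 matches the vector size used in the DEFINE above
        System.out.println(encode("Pig meets Mahout and Pig wins", 100000));
    }
}
```

Because the vector size is fixed up front, distinct tokens can collide on the same index; a larger size makes collisions less likely at the cost of a larger model.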
DEFINE train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
/* put the training data in a single bag. We could train multiple models this way */
grouped = group vectors all;
/* train the actual model. The key is bogus to satisfy the sequence vector format. */
model = foreach grouped generate 1 as key, train(vectors) as model;
store model into 'pv-tmp/news_model' using PigModelStorage();
Mahout and Pig
Step 3: Train the Model
DEFINE evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');
test = load '20news-bydate-test/*/*' using org.apache.mahout.pig.MessageLoader()
as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
evalvecs = foreach testvecs generate evaluate(v);
Mahout and Pig
Step 4: Evaluate the Model
For Slides and Pig script email me at: [email protected]
Twitter: thelabdude
Questions?