Big Data Analysis with Crate and Python

Post on 11-Aug-2014

DESCRIPTION

Analysing any huge dataset with the Crate datastore, using either the bare Crate Python client or SQLAlchemy.

Transcript

Big Data Analysis with Crate and Python

Matthias Wahl - developer @ crate.io

Email: matthias@crate.io

Crate

shared nothing massively scalable datastore

standing on the shoulders of giants

Crate

get it at: https://crate.io/download

# bash -c "$(curl -L try.crate.io)"

Crate

automatic sharding and replication

(semi-) structured models

single table only

SQL query language

Crate

all common SQL types (and more)

powerful aggregations (‘GROUP BY’)

linear scalability - data and query execution is distributed

basic arithmetic (next release 0.39)

Crate

Aggregation Execution

SELECT station_name, max(temp), avg(temp), min(temp), count(distinct date)
FROM weather_de
WHERE temp != -999
GROUP BY station_name
ORDER BY station_name ASC;
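To make the shape of this query's result concrete, here is a minimal sketch that runs the same statement against an in-memory SQLite table instead of a Crate cluster. SQLite is only a stand-in so the example runs without a server; the sample rows are made up, and treating -999 as a marker for a missing reading is an assumption drawn from the WHERE clause.

```python
import sqlite3

# Stand-in for a Crate table: SQLite in memory, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather_de (station_name TEXT, date INTEGER, temp REAL)"
)
conn.executemany(
    "INSERT INTO weather_de VALUES (?, ?, ?)",
    [
        ("Konstanz", 1, 10.0),
        ("Konstanz", 2, 20.0),
        ("Konstanz", 3, -999),  # assumed marker for a missing reading
        ("Berlin", 1, 5.0),
        ("Berlin", 1, 7.0),     # second reading on the same date
    ],
)

# Same statement as on the slide.
query = """
    SELECT station_name, max(temp), avg(temp), min(temp), count(DISTINCT date)
    FROM weather_de
    WHERE temp != -999
    GROUP BY station_name
    ORDER BY station_name ASC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row holds one station with its maximum, average and minimum temperature plus the number of distinct dates that had a valid reading.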

Aggregation Execution

[Diagram, step 1: nodes H, M, M, M, R, R, R. Labels: "collect", "Request".]

[Diagram, step 2: same nodes. Labels: "collect", "hash based distribution".]

[Diagram, step 3: same nodes. Label: "group results".]

[Diagram, step 4: same nodes. Labels: "final reduce", "Response".]
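The four steps named on these slides (collect, hash based distribution, group results, final reduce) can be sketched in plain Python for an avg(temp) GROUP BY station_name query. This is a toy, single-process illustration and not Crate's implementation: the function names, the (sum, count) partial state and the sample shards are all assumptions made for the example.

```python
from collections import defaultdict


def collect(shard_rows):
    """Collect step on one node: partial [sum, count] per station."""
    partial = defaultdict(lambda: [0.0, 0])
    for station, temp in shard_rows:
        if temp == -999:  # same filter as the WHERE clause
            continue
        partial[station][0] += temp
        partial[station][1] += 1
    return partial


def distribute(partials, num_reducers):
    """Hash based distribution: every key is routed to exactly one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for station, state in partial.items():
            buckets[hash(station) % num_reducers][station].append(state)
    return buckets


def group(bucket):
    """Group results on one reducer: merge the partial states it received."""
    merged = {}
    for station, states in bucket.items():
        total = sum(s[0] for s in states)
        count = sum(s[1] for s in states)
        merged[station] = total / count
    return merged


def final_reduce(reduced_parts):
    """Final reduce on the handling node: combine and sort reducer outputs."""
    result = {}
    for part in reduced_parts:
        result.update(part)
    return sorted(result.items())


# Made-up rows, split across three hypothetical shards.
shards = [
    [("Konstanz", 10.0), ("Berlin", 5.0)],
    [("Konstanz", 20.0), ("Berlin", -999)],
    [("Berlin", 7.0)],
]
partials = [collect(rows) for rows in shards]
buckets = distribute(partials, num_reducers=3)
result = final_reduce(group(bucket) for bucket in buckets)
print(result)
```

Because all partial states for a given station land on the same reducer, each group can be finished independently, and the last step only has to concatenate and sort.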

Using the python client

>>> from crate.client.http import Client
>>> client = Client(["127.0.0.1:4200"])
>>> response = client.sql("select * from weather_de limit 1")
>>> print(response)
{u'duration': 659,
 u'rowcount': 1,
 u'rows': [[1303365600000, 82.0, None, None, None, 0, u'954', 54.1667, 7.45, u'UFS Deutsche Bucht', 60.0, 10.9, 100, 5.2]],
 u'cols': [u'date', ...]}
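The response shown above keeps column names ('cols') and row values ('rows') in separate lists. A small helper can pair them up for easier analysis; the sketch below works on a made-up response dict of the same shape, so it runs without a server. The helper name rows_as_dicts and the sample values are hypothetical.

```python
def rows_as_dicts(response):
    """Pair each row with the column names from a 'cols'/'rows' response."""
    cols = response["cols"]
    return [dict(zip(cols, row)) for row in response["rows"]]


# Made-up response with the same keys as the one printed on the slide.
sample_response = {
    "duration": 1,
    "rowcount": 2,
    "cols": ["station_name", "temp"],
    "rows": [["Konstanz", 10.0], ["Berlin", 5.0]],
}

records = rows_as_dicts(sample_response)
for record in records:
    print(record["station_name"], record["temp"])
```

With a live cluster, the dict returned by client.sql(...) could presumably be passed in place of sample_response.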

Using SQLAlchemy

>>> import sqlalchemy as sa
>>> from sqlalchemy.ext.declarative import declarative_base
>>> from sqlalchemy.orm import sessionmaker
>>> engine = sa.create_engine("crate://localhost:4200")
>>> Base = declarative_base()

Using SQLAlchemy

>>> from sqlalchemy import Column, DateTime, Float, Integer, String
>>> class Weather(Base):
...
...     __tablename__ = 'weather_de'
...
...     station_id = Column('station_id', String, primary_key=True)
...     station_name = Column('station_name', String)
...     station_lat = Column('station_lat', Float)
...     station_long = Column('station_lon', Float)
...     station_height = Column('station_height', Integer)
...     date = Column('date', DateTime, primary_key=True)
...     temp = Column('temp', Float)
...     humility = Column(Float)
...     sunshine_hours = Column(Float)
...     wind_speed = Column(Float)
...     wind_direction = Column(Integer)
...     rainfall_fallen = Column(Integer)
...     rainfall_height = Column(Float)
...     rainfall_form = Column(Integer)

Using SQLAlchemy

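>>> from sqlalchemy import func
>>> # Assumed setup: DBSession is used on the slides but never defined.
>>> DBSession = sessionmaker(bind=engine)()
>>> res = (DBSession.query(
...         Weather.station_name,
...         func.avg(Weather.temp))
...     .group_by(Weather.station_name)
...     .order_by(Weather.station_name)
...     .limit(10)
...     .all())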

SELECT station_name, avg(temp) FROM weather_de GROUP BY station_name ORDER BY station_name LIMIT 10;

Using SQLAlchemy

# Average sunshine hours
from sqlalchemy.sql import func
DBSession.query(func.avg(Weather.sunshine_hours)).scalar()

# Average sunshine hours in Konstanz
DBSession.query(func.avg(Weather.sunshine_hours)).filter(Weather.station_name == 'Konstanz').scalar()

Feature Requests

Please tell us what you would like to see in crate.

I’m no data scientist

CRATE

Thank you

web: https://crate.io/

github: https://github.com/crate

twitter: @cratedata

IRC: #crate

stackoverflow tag: cratedata
