TOPICS
databases and data architectures
databases in the real world
scaling, data quality, distributed
machine learning/data mining/statistics
information retrieval
Data Science is currently a popular interest of employers
our Industrial Affiliates Partners say there is high demand for students trained in Data Science databases, warehousing, data architectures data analytics – statistics, machine learning
Big Data – gigabytes/day or more Examples:
Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft...
supports “Business Intelligence” quantitative decision-making and control finance, inventory, pricing/marketing, advertising need data for identifying risks, opportunities, conducting
“what-if” analyses
DATA ARCHITECTURES
traditional databases (CSCE 310/608)
tables, fields
tuples = records or rows
<yellowstone,WY,6000000 acres,geysers>
key = field with unique values
can be used as a reference from one table into another
important for avoiding redundancy (normalization), which risks inconsistency
join – combining 2 tables using a key
metadata – data about the data
names of the fields, types (string, int, real, mpeg...)
also things like source, date, size, completeness/sampling
Name HomeTown Grad school PhD teaches title
John Flaherty Houston, TX Rice 2005 CSCE 411 Design and Analysis of Algorithms
Susan Jenkins Omaha, NE Univ of Michigan 2004 CSCE 121 Introduction to Computing in C++
Susan Jenkins Omaha, NE Univ of Michigan 2004 CSCE 206 Programming in C
Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 314 Programming Languages
Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 206 Programming in C
Name teaches
John Flaherty CSCE 411
Susan Jenkins CSCE 121
Susan Jenkins CSCE 206
Bill Jones CSCE 314
Bill Jones CSCE 206
course title
CSCE 411 Design and Analysis of Algorithms
CSCE 121 Introduction to Computing in C++
CSCE 314 Programming Languages
CSCE 206 Programming in C
Name HomeTown Grad school PhD
John Flaherty Houston, TX Rice 2005
Susan Jenkins Omaha, NE Univ of Michigan 2004
Bill Jones Pittsburgh, PA Carnegie Mellon 1999
Instructors:
TeachingAssignments:
Courses:
SQL: Structured Query Language>SELECT Name,HomeTown FROM Instructors WHERE PhD<2000;
Bill Jones Pittsburgh, PA
>SELECT Course,Title FROM Courses ORDER BY Course;
CSCE 121 Introduction to Computing in C++
CSCE 206 Programming in C
CSCE 314 Programming Languages
CSCE 411 Design and Analysis of Algorithms
can also compute sums, counts, means, etc.
example of JOIN: find all courses taught by someone from CMU:
>SELECT TeachingAssignments.Course
FROM Instructors JOIN TeachingAssignments
ON Instructors.Name=TeachingAssigmnents.Name
WHERE Instructor.PhD=“Carnegie Mellon”
CSCE 314
CSCE 206
because they were both taught by Bill Jones
SQL servers
centralized database, required for concurrent
access by multiple users
ODBC: Open DataBase Connectivity – protocol to
connect to servers and do queries, updates from
languages like Java, C, Python
Oracle, IBM DB2 - industrial strength SQL
databases
some efficiency issues with real databases
indexing
how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries?
data structures for representing sorted order on fields
disk management
databases are often too big to fit in RAM, leave most of it on disk and swap in blocks of records as needed – could be slow
concurrency
transaction semantics: either all updates happen en batch or none (commit or rollback)
like delete one record and simultaneously add another but guarantee not to leave in an inconsistent state
other users might be blocked till done
query optimization
the order in which you JOIN tables can drastically affect the size of the intermediate tables
Unstructured data
raw text
documents, digital libraries
grep, substring indexing, regular expressions
like find all instances of “[aA]g+ies” including “agggggies”
Information Retrieval (CSCE 470)
look for synonyms, similar words (like “car” and “auto”)
tfIdf (term frequency/inverse doc frequency) – weighting for
important words
LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to ‘canines’
because they are used similarly (both near ‘bark’ and ‘bite’)
Natural Language parsing
extracting requirements from jobs postings
Unstructured data
images, video (BLOBs=binary large objects)
how to extract features? index them? search them?
color histograms
convolutions/transforms for pattern matching
looking for ICBM missiles in aerial photos of Cuba
streams
sports ticker, radio, stock quotes...
XML files
with tags indicating field names<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>
Object databases
CHEM 102
Intro to Chemistry
TR, 3:00-4:00
prereq: CHEM 101
Texas A&M
College Station, TX
Div 1A
53,299 students
Dr. Frank Smith
302 Miller St.
PhD, Cornell
13 years experience
ClassOfferedAt
TaughtBy
Instructor/Employee
In a database with millions of objects,
how do you efficiently do queries (i.e. follow pointers)
and retrieve information?
Real-world issues with databases
it’s all about scaling up to many records (and many
users)
data warehousing:
full database is stored in secure, off-site location
slices, snapshots, or views are put on interactive query
servers for fast user access (“staging”)
might be processed or summarized data
databases are often distributed
different parts of the data held in different sites
some queries are local, others are “corporate-wide”
how to do distributed queries?
how to keep the databases synchronized?
CSCE 438 – Distributed Object Programming
OLAP: OnLine Analytical Processing
data warehouse:
every transaction
ever recordedOLAP server
nightly updates
and summaries
http://technet.microsoft.com/en-us/
library/ms174587.aspx– multi-dimensional tables of
aggregated sales in
different regions in recent
quarters, rather than “every
transaction”
– users can still look at
seasonal or geographic
trends in different product
categories
– project data onto 2D
spreadsheets, graphs
data integrity
missing values
how to interpret? not available? 0? use the mean?
duplicated values
including partial matches (Jon Smith=John Smith?)
inconsistency:
multiple addresses for person
out-of-date data
inconsistent usage:
does “destination” mean of first leg or whole flight?
outliers:
salaries that are negative, or in the trillions
most database allow “integrity constraints” to be defined that validate newly entered data
Interoperability
how can data from one database be compared or combined with another?
what if fields are not the same, or not present, or used differently?
think of medical or insurance records
translation/mapping of terms
standards
units like ft/s, or gallons, etc.
identifiers like SSN, UIN, ISBN
“federated” databases – queries that combine information across multiple servers
“Data cleansing”
filling in missing data (imputing values)
detecting and removing outliers
smoothing
removing noise by averaging values together
filtering, sampling
keeping only selected representative values
feature extraction
e.g. in a photo database, which people are wearing
glasses? which have more than one person? which
are outdoors?
DATA MINING/DATA ANALYTICS
finding patterns in the data
statistics
machine learning
(CSCE 633)
Numerical data
correlations
multivariate regression
fitting “models”
predictive equations that fit the data
from a real estate database of home sales, we get
housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood
ANOVA for testing differences between groups
R is one of the most commonly used software packages for doing statistical analysis
can load a data table, calculate means and correlations, fit distributions, estimate parameters, test hypotheses, generate graphs and histograms
clustering
similar photos, documents, cases
discovery of “structure” in the data
example: accident database
some clusters might be identified with “accidents
involving a tractor trailer” or “accidents at night”
top-down vs. bottom-up clustering methods
granularity: how many clusters?
Decision trees (classifiers)
what factors, decisions, or treatments led to different
outcomes?
recursive partitioning algorithms
related methods
“discriminant” analysis
what factors lead to return of product?
extract “association rules”
boxers dogs tend to have congenital defects
covers 5% of patients with 80% confidence
Veterinary database - dogs treated for disease
breed gender age drug sibsp outcome
terrier F 10 methotrexate 4.0 died
spaniel M 5 cytarabine 2.3 survived
doberman F 7 doxorubicin 0.1 died
other types of data
time series and forecasting:
model the price of gas using autoregression
a function of recent prices, demand, geopolitics...
de-trend: factor out seasonal trends
GIS (geographic information systems)
longitude/latitude coordinates in the database
objects: city/state boundaries, river locations, roads
find regions in B/CS with an excess of coffee shops
from: Basic Statistics for Business and Economics, Lind et al (2009), Ch 16.
Toy Sales
credit: Frank Curriero
FOR MORE INFORMATION:
VISIT US AT: www.kellytechno.com
ADDRESS: Flat no : 212, 2nd floor,
AnnapurnaBlock,
Aditya Enclave,
Ameerpet, Hyderabad-16.