Data Science Training in Hyderabad,Data Science training institutes in Hyde...

DATA SCIENCE

http://www.kellytechno.com/


TOPICS

databases and data architectures

databases in the real world

scaling, data quality, distributed

machine learning/data mining/statistics

information retrieval



Data Science is currently a popular interest of employers

our Industrial Affiliates Partners say there is high demand for students trained in Data Science

databases, warehousing, data architectures

data analytics – statistics, machine learning

Big Data – gigabytes/day or more

Examples: Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft...

supports “Business Intelligence”

quantitative decision-making and control

finance, inventory, pricing/marketing, advertising

need data for identifying risks, opportunities, conducting “what-if” analyses



DATA ARCHITECTURES

traditional databases (CSCE 310/608)

tables, fields

tuples = records or rows

<yellowstone,WY,6000000 acres,geysers>

key = field with unique values

can be used as a reference from one table into another

important for avoiding redundancy, which risks inconsistency (removing it is called normalization)

join – combining 2 tables using a key

metadata – data about the data

names of the fields, types (string, int, real, mpeg...)

also things like source, date, size, completeness/sampling



one table with redundant data:

Name           HomeTown        GradSchool        PhD   Course    Title
John Flaherty  Houston, TX     Rice              2005  CSCE 411  Design and Analysis of Algorithms
Susan Jenkins  Omaha, NE       Univ of Michigan  2004  CSCE 121  Introduction to Computing in C++
Susan Jenkins  Omaha, NE       Univ of Michigan  2004  CSCE 206  Programming in C
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999  CSCE 314  Programming Languages
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999  CSCE 206  Programming in C

the same data normalized into three tables:

Instructors:

Name           HomeTown        GradSchool        PhD
John Flaherty  Houston, TX     Rice              2005
Susan Jenkins  Omaha, NE       Univ of Michigan  2004
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999

TeachingAssignments:

Name           Course
John Flaherty  CSCE 411
Susan Jenkins  CSCE 121
Susan Jenkins  CSCE 206
Bill Jones     CSCE 314
Bill Jones     CSCE 206

Courses:

Course    Title
CSCE 411  Design and Analysis of Algorithms
CSCE 121  Introduction to Computing in C++
CSCE 314  Programming Languages
CSCE 206  Programming in C



SQL: Structured Query Language

>SELECT Name,HomeTown FROM Instructors WHERE PhD<2000;

Bill Jones Pittsburgh, PA

>SELECT Course,Title FROM Courses ORDER BY Course;

CSCE 121 Introduction to Computing in C++

CSCE 206 Programming in C

CSCE 314 Programming Languages

CSCE 411 Design and Analysis of Algorithms

can also compute sums, counts, means, etc.

example of JOIN: find all courses taught by someone from CMU:

>SELECT TeachingAssignments.Course

FROM Instructors JOIN TeachingAssignments

ON Instructors.Name=TeachingAssignments.Name

WHERE Instructors.GradSchool='Carnegie Mellon';

CSCE 314

CSCE 206

because both are taught by Bill Jones
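The queries above can be tried without a database server. The sketch below is one way to do that, assuming Python's built-in sqlite3 module as a stand-in for a SQL server. The column names GradSchool and Course are assumed spellings of the slide's headers, and the COUNT query is an added illustration of "counts".

```python
import sqlite3

# In-memory database holding the three normalized tables from the slides.
# GradSchool and Course are assumed column names (the slides show headers only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Instructors (
    Name TEXT PRIMARY KEY, HomeTown TEXT, GradSchool TEXT, PhD INTEGER);
CREATE TABLE Courses (Course TEXT PRIMARY KEY, Title TEXT);
CREATE TABLE TeachingAssignments (
    Name   TEXT REFERENCES Instructors(Name),
    Course TEXT REFERENCES Courses(Course));
""")
conn.executemany("INSERT INTO Instructors VALUES (?, ?, ?, ?)", [
    ("John Flaherty", "Houston, TX", "Rice", 2005),
    ("Susan Jenkins", "Omaha, NE", "Univ of Michigan", 2004),
    ("Bill Jones", "Pittsburgh, PA", "Carnegie Mellon", 1999),
])
conn.executemany("INSERT INTO Courses VALUES (?, ?)", [
    ("CSCE 411", "Design and Analysis of Algorithms"),
    ("CSCE 121", "Introduction to Computing in C++"),
    ("CSCE 314", "Programming Languages"),
    ("CSCE 206", "Programming in C"),
])
conn.executemany("INSERT INTO TeachingAssignments VALUES (?, ?)", [
    ("John Flaherty", "CSCE 411"),
    ("Susan Jenkins", "CSCE 121"),
    ("Susan Jenkins", "CSCE 206"),
    ("Bill Jones", "CSCE 314"),
    ("Bill Jones", "CSCE 206"),
])

# Simple selection: instructors whose PhD year is before 2000.
early_phds = conn.execute(
    "SELECT Name, HomeTown FROM Instructors WHERE PhD < 2000").fetchall()

# JOIN on the shared key (Name) to find courses taught by CMU graduates.
cmu_courses = conn.execute("""
    SELECT TeachingAssignments.Course
    FROM Instructors JOIN TeachingAssignments
      ON Instructors.Name = TeachingAssignments.Name
    WHERE Instructors.GradSchool = 'Carnegie Mellon'
    ORDER BY TeachingAssignments.Course
""").fetchall()

# Aggregate: number of courses per instructor.
courses_per_instructor = conn.execute("""
    SELECT Name, COUNT(*) FROM TeachingAssignments
    GROUP BY Name ORDER BY Name
""").fetchall()
```

Because each instructor's details are stored once in Instructors, a change of home town touches a single row rather than one row per course taught.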



SQL servers

centralized database, required for concurrent access by multiple users

ODBC: Open Database Connectivity – standard API to connect to servers and do queries, updates from languages like Java, C, Python

Oracle, IBM DB2 - industrial strength SQL databases



some efficiency issues with real databases

indexing

how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries?

data structures for representing sorted order on fields

disk management

databases are often too big to fit in RAM, so most of the data stays on disk and blocks of records are swapped in as needed, which can be slow

concurrency

transaction semantics: either all updates in a batch happen or none do (commit or rollback)

e.g. delete one record and add another as a single step, with a guarantee that the database is never left in an inconsistent state

other users might be blocked until the transaction is done

query optimization

the order in which you JOIN tables can drastically affect the size of the intermediate tables
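Two of the points above, indexing and all-or-nothing transactions, can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not how an industrial server is configured. The Songs table, its sample titles, and the helper names replace_song and titles are hypothetical.

```python
import sqlite3

# Hypothetical Songs table; the titles are sample data, not from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Songs (Title TEXT PRIMARY KEY, Writer TEXT)")
# An index on Writer lets the engine look up one writer's songs
# without scanning every row.
conn.execute("CREATE INDEX idx_songs_writer ON Songs (Writer)")
conn.executemany("INSERT INTO Songs VALUES (?, ?)",
                 [("The Boxer", "Paul Simon"), ("Cecilia", "Paul Simon")])
conn.commit()


def replace_song(old_title, new_title, writer):
    """Delete one record and add another as a single transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute("DELETE FROM Songs WHERE Title = ?", (old_title,))
            conn.execute("INSERT INTO Songs VALUES (?, ?)", (new_title, writer))
        return True
    except sqlite3.IntegrityError:
        return False


def titles():
    """Return all song titles in alphabetical order."""
    return [row[0] for row in
            conn.execute("SELECT Title FROM Songs ORDER BY Title")]


# The insert collides with an existing primary key, so the delete is undone too.
failed = replace_song("The Boxer", "Cecilia", "Paul Simon")
after_failure = titles()

# No collision this time, so both updates are committed together.
succeeded = replace_song("The Boxer", "Graceland", "Paul Simon")
after_success = titles()
```

After the failed call the table still contains "The Boxer": the delete was rolled back along with the rejected insert, so no half-finished update is visible.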



Unstructured data

raw text

documents, digital libraries

grep, substring indexing, regular expressions

like find all instances of “[aA]g+ies” including “agggggies”

Information Retrieval (CSCE 470)

look for synonyms, similar words (like “car” and “auto”)

TF-IDF (term frequency/inverse document frequency) – weighting for important words

LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to ‘canines’ because they are used similarly (both near ‘bark’ and ‘bite’)

Natural Language parsing

extracting requirements from job postings
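The TF-IDF weighting mentioned above can be sketched in a few lines. Several variants of the formula exist; this one assumes tf = (count of the term in the document) / (terms in the document) and idf = log(number of documents / documents containing the term). The three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length,
    idf = log(total documents / documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = [
    "the car is fast",
    "the auto is red",
    "the dog can bark",
]
weights = tf_idf(docs)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("car") gets the highest weight there, which is the sense in which TF-IDF picks out important words.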



Unstructured data

images, video (BLOBs=binary large objects)

how to extract features? index them? search them?

color histograms

convolutions/transforms for pattern matching

looking for missile sites in aerial photos of Cuba

streams

sports ticker, radio, stock quotes...

XML files

with tags indicating field names

<course>

<name>CSCE 411</name>

<title>Design and Analysis of Algorithms</title>

</course>
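Because the tags carry the field names, a record like the one above can be turned back into named fields mechanically. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The course record from the slide, as a string.
xml_text = """
<course>
  <name>CSCE 411</name>
  <title>Design and Analysis of Algorithms</title>
</course>
"""

root = ET.fromstring(xml_text)

# Each child element becomes one field: tag -> text content.
record = {child.tag: child.text for child in root}
```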



Object databases

objects with attributes, linked by pointers, e.g.

Course: CHEM 102, Intro to Chemistry, TR 3:00-4:00, prereq: CHEM 101

ClassOfferedAt -> University: Texas A&M, College Station, TX, Div 1A, 53,299 students

TaughtBy -> Instructor/Employee: Dr. Frank Smith, 302 Miller St., PhD, Cornell, 13 years experience

In a database with millions of objects, how do you efficiently do queries (i.e. follow pointers) and retrieve information?
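The idea of following pointers between objects can be sketched with plain Python objects. This is not an object database, only an in-memory illustration; the class and field names are assumptions based on the example data, and the dictionary keyed by course code stands in for an index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class University:
    name: str
    city: str
    enrollment: int


@dataclass
class Instructor:
    name: str
    degree: str
    years_experience: int


@dataclass
class Course:
    code: str
    title: str
    offered_at: University      # the ClassOfferedAt pointer
    taught_by: Instructor       # the TaughtBy pointer
    prereq: Optional[str] = None


tamu = University("Texas A&M", "College Station, TX", 53299)
smith = Instructor("Dr. Frank Smith", "PhD, Cornell", 13)
chem102 = Course("CHEM 102", "Intro to Chemistry", tamu, smith,
                 prereq="CHEM 101")

# A dictionary keyed by course code plays the role of an index:
# it finds the starting object without scanning all objects.
courses_by_code = {chem102.code: chem102}


def instructor_and_school(code):
    """Look up a course, then follow its two pointers."""
    course = courses_by_code[code]
    return course.taught_by.name, course.offered_at.name
```

Once the starting object is found, each pointer hop is a direct reference rather than a join, which is the main appeal of the object model; finding the starting object efficiently is where indexing is still needed.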



Real-world issues with databases

it’s all about scaling up to many records (and many users)

data warehousing:

full database is stored in secure, off-site location

slices, snapshots, or views are put on interactive query servers for fast user access (“staging”)

might be processed or summarized data

databases are often distributed

different parts of the data held in different sites

some queries are local, others are “corporate-wide”

how to do distributed queries?

how to keep the databases synchronized?

CSCE 438 – Distributed Object Programming



OLAP: OnLine Analytical Processing

data warehouse: every transaction ever recorded

OLAP server: nightly updates and summaries

multi-dimensional tables of aggregated sales in different regions in recent quarters, rather than “every transaction”

users can still look at seasonal or geographic trends in different product categories

project data onto 2D spreadsheets, graphs

source: http://technet.microsoft.com/en-us/library/ms174587.aspx
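The step from "every transaction" to a multi-dimensional summary table, and then to a 2D projection, can be sketched as follows. The transaction rows, the dimension order (region, quarter, category), and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (region, quarter, category, amount).
transactions = [
    ("East", "Q1", "toys", 100.0),
    ("East", "Q1", "toys", 50.0),
    ("East", "Q2", "games", 70.0),
    ("West", "Q1", "toys", 30.0),
    ("West", "Q2", "toys", 20.0),
]


def build_cube(rows):
    """Aggregate raw transactions into totals per (region, quarter, category)."""
    cube = defaultdict(float)
    for region, quarter, category, amount in rows:
        cube[(region, quarter, category)] += amount
    return dict(cube)


def project(cube, keep):
    """Sum the cube down to the dimensions whose positions are listed in keep."""
    table = defaultdict(float)
    for key, total in cube.items():
        table[tuple(key[i] for i in keep)] += total
    return dict(table)


cube = build_cube(transactions)                 # the nightly summary
by_region_quarter = project(cube, keep=(0, 1))  # a 2D region x quarter table
by_region = project(cube, keep=(0,))            # a 1D roll-up
```

Queries against the small summary table do not need to touch the full transaction history, which is why the summaries can be served interactively.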



data integrity

missing values

how to interpret? not available? 0? use the mean?

duplicated values

including partial matches (Jon Smith=John Smith?)

inconsistency:

multiple addresses for the same person

out-of-date data

inconsistent usage:

does “destination” mean the end of the first leg or of the whole flight?

outliers:

salaries that are negative, or in the trillions

most databases allow “integrity constraints” to be defined that validate newly entered data
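An integrity constraint of this kind can be sketched with SQLite's CHECK and NOT NULL clauses through Python's sqlite3 module. The Employees table, the salary bounds, and the helper try_insert are hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: salaries must be present, non-negative, and below 1e9.
conn.execute("""
    CREATE TABLE Employees (
        Name   TEXT NOT NULL,
        Salary REAL NOT NULL CHECK (Salary >= 0 AND Salary < 1e9)
    )
""")


def try_insert(name, salary):
    """Insert a row; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO Employees VALUES (?, ?)", (name, salary))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("Ann", 52000.0),  # valid
    try_insert("Bob", -10.0),    # negative salary
    try_insert("Cy", 3e12),      # salary in the trillions
    try_insert("Dee", None),     # missing value
]
row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
```

Only the valid row is stored; the outliers and the missing value are rejected at entry time instead of being discovered later during analysis.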



Interoperability

how can data from one database be compared or combined with another?

what if fields are not the same, or not present, or used differently?

think of medical or insurance records

translation/mapping of terms

standards

units like ft/s, or gallons, etc.

identifiers like SSN, UIN, ISBN

“federated” databases – queries that combine information across multiple servers



“Data cleansing”

filling in missing data (imputing values)

detecting and removing outliers

smoothing

removing noise by averaging values together

filtering, sampling

keeping only selected representative values

feature extraction

e.g. in a photo database, which people are wearing glasses? which have more than one person? which are outdoors?
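Three of the cleansing steps above (imputing, outlier removal, smoothing) can be sketched on a small list of numbers. Mean imputation, fixed bounds for outliers, and a moving average are assumed choices here; other methods are equally common. The sample values are made up.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def drop_outliers(values, low, high):
    """Keep only values inside the plausible range [low, high]."""
    return [v for v in values if low <= v <= high]


def moving_average(values, window=3):
    """Smooth by averaging each run of `window` consecutive values."""
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


# Hypothetical data.
filled = impute_mean([10.0, None, 14.0, 12.0])
salaries = drop_outliers([50000.0, -5.0, 3e12, 62000.0], low=0.0, high=1e7)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0])
```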



DATA MINING/DATA ANALYTICS

finding patterns in the data

statistics

machine learning

(CSCE 633)



Numerical data

correlations

multivariate regression

fitting “models”

predictive equations that fit the data

from a real estate database of home sales, we get

housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood

ANOVA for testing differences between groups

R is one of the most commonly used software packages for doing statistical analysis

can load a data table, calculate means and correlations, fit distributions, estimate parameters, test hypotheses, generate graphs and histograms
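Fitting a predictive equation can be sketched with ordinary least squares. The housing-price equation above is multivariate; the sketch below is deliberately simplified to one predictor (square footage) so it fits in plain Python, and the sample sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical home sales: square feet and sale price.
sq_ft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [105000.0, 148000.0, 203000.0, 247000.0]

slope, intercept = fit_line(sq_ft, price)


def predict(x):
    """Predicted price for a home of x square feet."""
    return slope * x + intercept
```

The fitted slope plays the role of the coefficient on SqFt in the equation above: the estimated price change per additional square foot.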



clustering

similar photos, documents, cases

discovery of “structure” in the data

example: accident database

some clusters might be identified with “accidents involving a tractor trailer” or “accidents at night”

top-down vs. bottom-up clustering methods

granularity: how many clusters?
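A bottom-up method starts with every item in its own cluster and repeatedly merges the two closest clusters until the chosen number remains. The sketch below assumes one-dimensional data and single-linkage distance (the gap between the nearest members), which keeps it short; the sample values are hypothetical.

```python
def agglomerative(points, k):
    """Bottom-up clustering of 1-D points into k clusters (1 <= k <= len(points))."""
    # Start with one singleton cluster per point, in sorted order.
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        # In sorted 1-D data the closest pair of clusters is always adjacent,
        # so only gaps between neighbouring clusters need checking.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest neighbours into one cluster.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


values = [1, 2, 3, 10, 11, 12, 25]
three_clusters = agglomerative(values, k=3)
two_clusters = agglomerative(values, k=2)
```

The parameter k is the granularity question from the slide: the same data yields different structure depending on how many clusters are requested.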



Decision trees (classifiers)

what factors, decisions, or treatments led to different outcomes?

recursive partitioning algorithms

related methods

“discriminant” analysis

what factors lead to return of product?

extract “association rules”

e.g. boxer dogs tend to have congenital defects

the rule covers 5% of patients, with 80% confidence

Veterinary database - dogs treated for disease

breed     gender  age  drug          sibsp  outcome
terrier   F       10   methotrexate  4.0    died
spaniel   M       5    cytarabine    2.3    survived
doberman  F       7    doxorubicin   0.1    died
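The two numbers attached to an association rule can be computed directly from records. In this sketch "coverage" is read as the share of records matching the "if" part of the rule and "confidence" as the share of those that also match the "then" part; that reading of "covers" is an assumption. The 100 records are synthetic, built to reproduce the 5% and 80% figures.

```python
def rule_stats(records, if_part, then_part):
    """Return (coverage, confidence) for the rule: if_part -> then_part."""
    matching_if = [r for r in records if if_part(r)]
    if not matching_if:
        return 0.0, 0.0
    matching_both = [r for r in matching_if if then_part(r)]
    coverage = len(matching_if) / len(records)
    confidence = len(matching_both) / len(matching_if)
    return coverage, confidence


# Synthetic patients: 5 boxers (4 with a defect) among 100 dogs.
records = (
    [{"breed": "boxer", "defect": True}] * 4
    + [{"breed": "boxer", "defect": False}]
    + [{"breed": "terrier", "defect": False}] * 95
)

coverage, confidence = rule_stats(
    records,
    if_part=lambda r: r["breed"] == "boxer",
    then_part=lambda r: r["defect"],
)
```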



other types of data

time series and forecasting:

model the price of gas using autoregression

a function of recent prices, demand, geopolitics...

de-trend: factor out seasonal trends

GIS (geographic information systems)

longitude/latitude coordinates in the database

objects: city/state boundaries, river locations, roads

find regions in B/CS (Bryan/College Station) with an excess of coffee shops

figure: Toy Sales (from: Basic Statistics for Business and Economics, Lind et al (2009), Ch 16.)

figure credit: Frank Curriero
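Autoregression models each value as a function of recent values. The sketch below assumes the simplest case, a first-order model x[t] = phi * x[t-1] with no intercept and no outside factors such as demand or geopolitics; the price series is hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1]."""
    prev = series[:-1]
    curr = series[1:]
    return sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)


def forecast(series, phi, steps):
    """Extend the series by repeatedly applying x[t] = phi * x[t-1]."""
    out = []
    last = series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out


# Hypothetical series in which each value is half the previous one.
prices = [8.0, 4.0, 2.0, 1.0]
phi = fit_ar1(prices)
next_two = forecast(prices, phi, steps=2)
```

A realistic model would add an intercept, more lagged terms, and a de-trending step before fitting, but the fitting principle is the same.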



FOR MORE INFORMATION:

VISIT US AT: www.kellytechno.com

ADDRESS: Flat no : 212, 2nd floor,

AnnapurnaBlock,

Aditya Enclave,

Ameerpet, Hyderabad-16.



