
Inferring Rooted Species Trees

Dec 27, 2021


dariahiddleston
Page 1: Inferring Rooted Species Trees

PROC SQL

PROC SQL is part of Base SAS and can combine some features of data steps and procs.

SQL stands for Structured Query Language, and is considered the standard language for relational databases. Database management systems that use SQL include Access, Ingres, Oracle, Sybase, Microsoft SQL Server, etc.

Inferring rooted species trees December 5, 2018 1 / 45


There is a whole book on PROC SQL (300 pages):


PROC SQL

Apparently, PROC SQL can replicate much of the functionality of the data step. PROC SQL has a different flavor than the rest of SAS, so it feels like a different language. One approach is to do almost all analysis using PROC SQL. Another is to go back and forth, such as reading in data using a traditional data step and then using PROC SQL to manipulate the data. The syntax in PROC SQL should be similar to SQL as implemented in other languages, and so is a good way to learn SQL.


PROC SQL

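One difference between PROC SQL and the data step is that items in a series are usually separated by spaces in the data step but by commas in SQL. There are also terminological differences: what SAS calls a data set, an observation, and a variable, SQL calls a table, a row, and a column.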


PROC SQL

Just as in the data step, PROC SQL runs statements that are terminated by semicolons. However, the SQL block is terminated by QUIT; instead of RUN;

Technically, SAS is considered a procedural language, meaning you tell the computer what to do. SQL is considered declarative, meaning you tell the computer what to produce, and the software determines what algorithm will produce the result. Other declarative languages include Prolog (which implements logic and is used in AI) and Mathematica.


PROC SQL

The basic syntax of PROC SQL is as follows (where TABLE refers to a SAS data set):

PROC SQL <options>;
    SELECT columns
        FROM table
        WHERE condition
        GROUP BY columns
    ;
QUIT;


DATA TEMP;
    INPUT ID $ NAME $ SALARY DEPARTMENT $;
    DATALINES;
1 Rick 623.3 IT
2 Dan 515.2 Operations
3 Michelle 611 IT
4 Ryan 729 HR
5 Gary 843.25 Finance
6 Nina 578 IT
7 Simon 632.8 Operations
8 Guru 722.5 Finance
;
RUN;

PROC SQL;
    CREATE TABLE EMPLOYEES AS
    SELECT * FROM TEMP;
QUIT;

PROC PRINT DATA=EMPLOYEES; RUN;


In the previous code, both the TEMP and EMPLOYEES data sets show up in the WORK library in SAS. Note that the DEPARTMENT value “Operations” gets truncated to 8 characters (“Operatio”), because list input gives character variables a default length of 8.
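One way to avoid the truncation is to give the variable an explicit informat with the colon modifier. The following is a minimal sketch, not part of the original slides; the data set name TEMP2 and the width of 10 are assumptions chosen for illustration.

```sas
DATA TEMP2;
    /* :$10. still reads up to the next blank, but stores up to 10 characters */
    INPUT ID $ NAME $ SALARY DEPARTMENT :$10.;
    DATALINES;
2 Dan 515.2 Operations
7 Simon 632.8 Operations
;
RUN;

PROC PRINT DATA=TEMP2; RUN;
```

With this informat, DEPARTMENT is stored with length 10, so the full value fits.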


PROC SQL

The previous example basically used PROC SQL to duplicate the data set created in the data step. You can also get a subset of rows using WHERE clauses, and update a data set by inserting or deleting rows. You can also create new variables using mathematical operations and the “AS” keyword.

Another feature is using the UPDATE statement to change a value in the data set.
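These operations might look as follows on the EMPLOYEES table. This is a hedged sketch rather than the code from the original slides: the copy EMP_COPY, the new column RAISED_SALARY, and the inserted employee are all hypothetical, and the work is done on a copy so that EMPLOYEES itself is left unchanged.

```sas
PROC SQL;
    /* Work on a copy so the original table is untouched */
    CREATE TABLE EMP_COPY AS
    SELECT * FROM EMPLOYEES;

    /* Subset rows with WHERE and create a new column with AS */
    SELECT NAME, SALARY, SALARY * 1.1 AS RAISED_SALARY
        FROM EMP_COPY
        WHERE DEPARTMENT = 'IT';

    /* Insert a row (values follow the column order ID, NAME, SALARY, DEPARTMENT) */
    INSERT INTO EMP_COPY
        VALUES ('9', 'Ana', 600, 'HR');

    /* Change a value in an existing row */
    UPDATE EMP_COPY
        SET SALARY = 650
        WHERE NAME = 'Dan';

    /* Delete rows that match a condition */
    DELETE FROM EMP_COPY
        WHERE DEPARTMENT = 'HR';
QUIT;
```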


SQL ideas

Now that we’ve seen a couple of examples of PROC SQL, we’ll discuss some of the ideas behind it.

Usually, we think of a data set as being like a matrix with ordered observations and columns. PROC SQL is designed for multisets of data. A multiset (just like a set) has no order, but can allow repeated entries. However, for a good database design, there is a desire to have no repeated entries.

Here is an example of working with sets versus multisets:

Sets: {1, 2, 3} ∪ {3, 4, 5} = {1, 2, 3, 4, 5}

Multisets: {1, 2, 3} ∪ {3, 4, 5} = {1, 2, 3, 3, 4, 5}
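In PROC SQL the same distinction shows up as UNION versus UNION ALL: UNION removes duplicate rows, while UNION ALL keeps them. The sketch below is an illustration only; the tables A and B and the column X are hypothetical names.

```sas
DATA A;
    INPUT X @@;
    DATALINES;
1 2 3
;
RUN;

DATA B;
    INPUT X @@;
    DATALINES;
3 4 5
;
RUN;

PROC SQL;
    /* Set-like behavior: the value 3 appears once */
    SELECT X FROM A
    UNION
    SELECT X FROM B;

    /* Multiset-like behavior: the value 3 appears twice */
    SELECT X FROM A
    UNION ALL
    SELECT X FROM B;
QUIT;
```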


SQL ideas

A good table for databases should also avoid columns that are largely empty (have lots of missing or NULL values). This can happen, for example, in repeated measures recorded in wide format, where most individuals are not observed at later time points.

In SQL, the term normalization refers to organizing data to save space (memory) and to eliminate duplication or repetition of data. (For example, instead of having duplicate rows, you could count how often each distinct row of values occurs.)


SQL: first normal form (1NF)

In first normal form, the data is in rectangular format and there is a column (such as the subject ID) that uniquely identifies each row. However, there still might be some redundant information in the table.


SQL: second normal form (2NF)

Note that in the previous example, the customer number, Smith, and San Diego are repeated several times. Also, if the customer number uniquely identifies the customer, then “Smithe” must be a misspelling.

The data can be rearranged into two tables, say one with customer number, last name, and city, and the other with the remaining variables that are linked to the first table. This reduces the total number of entries.


SQL: second normal form (2NF)

Note that the original table had 7 rows and 7 columns (49 entries). The new tables are 4 × 3 and 7 × 5. The new tables have a total of 12 + 35 = 47 entries. This is not a huge savings, but in larger data sets with more repetition the savings could be considerable.

In a sense, what has happened is that the original table is “unmerged”. You could merge the two tables in 2NF form to create the table in 1NF form.

One advantage of 2NF form here is that if a customer has updated information (such as moving to a new city), that information only has to be updated in one location. For the table in 1NF form, that information has to be updated in multiple locations.
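The merge and the single-location update can be sketched in PROC SQL. The tables and values below are hypothetical stand-ins, not the tables from the slides: CUST_2NF holds one row per customer, ORDERS_2NF holds one row per purchase, and the column names, the second customer, and the new city are assumptions for illustration.

```sas
PROC SQL;
    /* Hypothetical customer table: one row per customer */
    CREATE TABLE CUST_2NF
        (CUSTNUM NUM, LASTNAME CHAR(10), CITY CHAR(12));
    INSERT INTO CUST_2NF
        VALUES (101, 'Smith', 'San Diego')
        VALUES (102, 'Jones', 'Denver');

    /* Hypothetical purchase table: one row per purchase */
    CREATE TABLE ORDERS_2NF
        (CUSTNUM NUM, ITEM CHAR(10), UNITS NUM);
    INSERT INTO ORDERS_2NF
        VALUES (101, 'Laptop', 1)
        VALUES (101, 'Phone', 2)
        VALUES (102, 'Software', 1);

    /* Re-merge the two 2NF tables on the shared customer number */
    CREATE TABLE ORDERS_1NF AS
    SELECT O.CUSTNUM, C.LASTNAME, C.CITY, O.ITEM, O.UNITS
        FROM ORDERS_2NF AS O
             INNER JOIN CUST_2NF AS C
             ON O.CUSTNUM = C.CUSTNUM;

    /* A customer moves: only one row in the customer table changes */
    UPDATE CUST_2NF
        SET CITY = 'Albuquerque'
        WHERE CUSTNUM = 101;
QUIT;
```

In the merged table ORDERS_1NF, the same change would have to be applied to every row belonging to customer 101.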


SQL: third normal form (3NF)

The variable MANUCITY refers to the location of the company where the item is manufactured (or company location). This is not related to the customer number, which is considered the key column for the data. For 3NF form, each column should depend on the key. A phrase that is used is that each column should “depend on the key, the whole key, and nothing but the key”. Consequently, the previous two tables violate 3NF form.

To satisfy 3NF form, you need a new key for the manufacturer and purchase number. Then the MANUCITY column will depend on this new key instead of the customer number.


SQL normal forms

Usually 3NF form is considered good enough for database design, but there are also 4NF and 5NF forms possible in an attempt to eliminate all redundant information.

Note that the 3NF form in this example actually uses more memory (has more total cells) than the 2NF form.


Keywords in SQL

There are several keywords in SQL that are normally not allowed to be column names, although this is not enforced in SAS SQL. (SAS SQL does not strictly follow ANSI SQL. ANSI is the American National Standards Institute.) The keywords are


SQL ideas

Tables often have a primary key used to identify rows. I think the multiset idea here is that observations are referred to by a key rather than a row number. A foreign key can be used to link one table (data set) to another. (Think of this as something that could be used for merging the two data sets/tables.)
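PROC SQL lets you declare such keys as integrity constraints when a table is created. The following is a sketch under assumptions: the table names CUSTOMERS_DEMO and PURCHASES_DEMO, their columns, and the constraint names are hypothetical and are not the tables used in the book.

```sas
PROC SQL;
    /* Parent table: CUSTNUM is the primary key, so values must be unique and non-missing */
    CREATE TABLE CUSTOMERS_DEMO
        (CUSTNUM  NUM,
         CUSTNAME CHAR(25),
         CUSTCITY CHAR(20),
         CONSTRAINT CUST_PK PRIMARY KEY (CUSTNUM));

    /* Child table: CUSTNUM is a foreign key that refers back to CUSTOMERS_DEMO */
    CREATE TABLE PURCHASES_DEMO
        (CUSTNUM NUM,
         PRODNUM NUM,
         UNITS   NUM,
         CONSTRAINT PURCH_FK FOREIGN KEY (CUSTNUM)
             REFERENCES CUSTOMERS_DEMO
             ON DELETE RESTRICT ON UPDATE RESTRICT);
QUIT;
```

With these constraints in place, inserting a purchase whose CUSTNUM does not exist in CUSTOMERS_DEMO should be rejected.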


SQL example

The book uses a database example with 6 linked tables. They are: Customers, Inventory, Invoice, Manufacturers, Products, and Purchases.


PROC CONTENTS is useful for getting an overview of the different tables.
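As a brief sketch, assuming the tables live in the WORK library and that the EMPLOYEES table from the earlier example exists, an overview can be requested like this. DESCRIBE TABLE is the PROC SQL counterpart and writes the table definition to the log.

```sas
/* List every data set in the WORK library without per-table details */
PROC CONTENTS DATA=WORK._ALL_ NODS;
RUN;

/* Show the column definitions of one table in the log */
PROC SQL;
    DESCRIBE TABLE EMPLOYEES;
QUIT;
```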


Data step versus PROC SQL

Here’s an example to compare creating an empty data set with four variables with user-defined lengths and labels.

DATA PURCHASES;
    LENGTH CUSTNUM 4
           PRODNUM 3
           UNITS 3
           UNITCOST 4;
    LABEL CUSTNUM = 'Customer Number'
          PRODNUM = 'Product Purchased'
          UNITS = '# Units Purchased'
          UNITCOST = 'Unit Cost';
    FORMAT UNITCOST DOLLAR12.2;
    STOP;  /* without STOP, the step would write one observation of missing values */
RUN;

PROC CONTENTS DATA=PURCHASES;
RUN;


Data step versus PROC SQL

Equivalent PROC SQL code:

PROC SQL;
    CREATE TABLE PURCHASES
        (CUSTNUM NUM LENGTH=4
            LABEL='Customer Number',
         PRODNUM NUM LENGTH=3
            LABEL='Product Purchased',
         UNITS NUM LENGTH=3
            LABEL='# Units Purchased',
         UNITCOST NUM LENGTH=4
            LABEL='Unit Cost' FORMAT=DOLLAR12.2);
QUIT;


Data step versus PROC SQL

Many data step functions and formats are also available in PROC SQL, including date/time formats and string functions.
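For instance, the following sketch applies a few familiar data step functions and formats to the EMPLOYEES table from the earlier example. The output column names are hypothetical.

```sas
PROC SQL;
    SELECT UPCASE(NAME)             AS NAME_UPPER,
           SUBSTR(DEPARTMENT, 1, 3) AS DEPT_CODE,
           SALARY                   FORMAT=DOLLAR10.2,
           TODAY()                  AS REPORT_DATE FORMAT=DATE9.
        FROM EMPLOYEES;
QUIT;
```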


Data step versus PROC SQL

To get a list of unique values in a particular column, you can use the UNIQUE keyword in PROC SQL (a synonym for the standard DISTINCT keyword). How would you do this using regular data step programming?
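One possible answer, sketched on the EMPLOYEES table from the earlier example, is to sort and then keep the first row of each BY group. The data set names SORTED and DEPTS are hypothetical, and this is only one of several ways to do it.

```sas
/* PROC SQL version */
PROC SQL;
    SELECT DISTINCT DEPARTMENT
        FROM EMPLOYEES;
QUIT;

/* Data step version: sort, then keep the first row of each BY group */
PROC SORT DATA=EMPLOYEES OUT=SORTED;
    BY DEPARTMENT;
RUN;

DATA DEPTS;
    SET SORTED;
    BY DEPARTMENT;
    IF FIRST.DEPARTMENT;
    KEEP DEPARTMENT;
RUN;
```

The PROC SQL version needs no explicit sort, which is part of its declarative flavor.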


Operators in SQL

Operators available in SQL include the usual comparisons in SAS, each with a symbol and a mnemonic form:

=    EQ
^=   NE
<    LT
<=   LE
>    GT
>=   GE


Operators in SQL

You can also compare two strings conveniently by truncating the strings to the length of the shorter string.

EQT  equal to
GTT  greater than
LTT  less than
GET  greater than or equal to
LET  less than or equal to
NET  not equal to


Operators in SQL

Logical operators can be done using AND, OR, and NOT, such as

WHERE X GT 30 AND POSITION EQT 'SALES'

Arithmetic operators are the same as in SAS, such as +, *, and ** for exponentiation.
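A small sketch combining these operators on the EMPLOYEES table from the earlier example: the truncated comparison EQT matches any DEPARTMENT value that begins with 'Op', and ** squares the salary. The column name SALARY_SQ is hypothetical.

```sas
PROC SQL;
    SELECT NAME,
           SALARY,
           DEPARTMENT,
           SALARY ** 2 AS SALARY_SQ
        FROM EMPLOYEES
        WHERE SALARY GT 600 AND DEPARTMENT EQT 'Op';
QUIT;
```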


Concatenation in SQL

Strings can be concatenated using the usual concatenation operator in SAS, ||, or using the CAT function in a SELECT clause, which works very similarly to the paste() function in R:

SELECT CAT(VAR1, "-", VAR2, "-", VAR3)

concatenates three columns with hyphens between them in order to make the combination a single string. This can be useful when combining two keys to make a new key.
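One detail worth knowing is that CAT keeps the trailing blanks of fixed-length character columns, whereas CATX strips leading and trailing blanks and inserts the separator for you. The sketch below compares the two on the EMPLOYEES table from the earlier example; the output column names are hypothetical.

```sas
PROC SQL;
    SELECT CAT(DEPARTMENT, '-', ID)  AS KEY_CAT,   /* keeps the padding blanks in DEPARTMENT */
           CATX('-', DEPARTMENT, ID) AS KEY_CATX   /* trims blanks, then joins with '-' */
        FROM EMPLOYEES;
QUIT;
```

For building a combined key, CATX is usually the more convenient of the two.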


Row numbers SQL

Although part of the philosophy of SQL is that tables are sets of rows rather than ordered matrices, you can cheat to generate a row number using the function MONOTONIC(). Apparently this is undocumented, and I think it is not an official part of SQL. Here is an example:

PROC SQL;
    SELECT MONOTONIC() AS Row_Number FORMAT=COMMA6.,
           PRODNUM,
           UNITS,
           UNITCOST
        FROM PURCHASES;
QUIT;


Summary stats available in SQL
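Summary functions available in PROC SQL include COUNT, AVG (or MEAN), SUM, MIN, and MAX, and they are typically combined with GROUP BY. The following sketch, which is not taken from the original slide, summarizes salaries by department in the EMPLOYEES table from the earlier example; the output column names are hypothetical.

```sas
PROC SQL;
    SELECT DEPARTMENT,
           COUNT(*)    AS N,
           AVG(SALARY) AS MEAN_SALARY FORMAT=8.2,
           MIN(SALARY) AS MIN_SALARY,
           MAX(SALARY) AS MAX_SALARY
        FROM EMPLOYEES
        GROUP BY DEPARTMENT
        HAVING COUNT(*) >= 2;   /* keep only departments with at least two employees */
QUIT;
```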


Logic in SQL

Note that for logic programming in SQL, for loops are not supported. There are ways around this, but they are generally not recommended and not in the spirit of SQL. Conditional processing can be done normally using WHERE clauses and CASE expressions. Here is an example, basically using WHEN...THEN instead of IF...THEN.

PROC SQL;
    SELECT PRODNAME,
           CASE PRODTYPE
               WHEN 'Laptop' THEN 'Hardware'
               WHEN 'Phone' THEN 'Hardware'
               WHEN 'Software' THEN 'Software'
               WHEN 'Workstation' THEN 'Hardware'
               ELSE 'Unknown'
           END AS Product_Classification
        FROM PRODUCTS;
QUIT;


Logic in SQL

PROC SQL is also frequently combined with using the ODS and the macro language.
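A common pattern with the macro language is to store a query result in a macro variable using INTO and reuse it later. The sketch below assumes the EMPLOYEES table from the earlier example; the macro variable name AVG_SAL is hypothetical.

```sas
/* Store the average salary in a macro variable; NOPRINT suppresses the listing */
PROC SQL NOPRINT;
    SELECT AVG(SALARY) INTO :AVG_SAL
        FROM EMPLOYEES;
QUIT;

%PUT Average salary: &AVG_SAL;

/* Reuse the macro variable in a later query */
PROC SQL;
    SELECT NAME, SALARY
        FROM EMPLOYEES
        WHERE SALARY > &AVG_SAL;
QUIT;
```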
