Addressing Student Difficulties in Using Numeric Data Downloaded from the Web Marilyn K. Pelosi & David L. Russell Western New England College Springfield, Massachusetts
Jan 13, 2016
Addressing Student Difficulties in Using
Numeric Data Downloaded from the
Web
Marilyn K. Pelosi & David L. RussellWestern New England College
Springfield, Massachusetts
Overview Three Circumstances under which
a student’s ability to download data can be complicated
Requires a level of skill beyond that of “casual student users”
Assumption is that the data is to be subsequently analyzed in a spreadsheet tool.
Steps a student must take Find appropriate data using a
search engine Download the data in such a
manner that they can be converted for analysis
Analyze the data correctly
Challenges Some data are in a form that is not
conducive to direct and effective use in a data analysis tool
Even data made available in a standard format cannot be used without editing
Some data are presented in non-tabular layout that causes their transformation to a standard file format to be very challenging
Data must frequently be downloaded in a text format which students then import the text format into the analysis tool.
Objectives To describe situations in which
some downloaded data is in a form that is not conducive to effective use
To address situations in which data to be downloaded is not available in any standard file format
To address situations in which data must frequently be downloaded in text format and the consequences of this action
Objectives (cont’d) To address converting the text
format into a usable file format To address the data structures and
editing actions required to make the converted data file usable
Three Scenarios Scenario 1: Web sites which make
data available in file formats that can be used directly by data analysis tools, the large majority of which are sites making files available in Excel’s native file format (*.xls)
Scenario 2: Web sites that make data available in text format
Scenario 3: Web sites whose data is not in tabular structure
Scenario 1: Data Available in Standard File Formats Least challenging scenario Often However … Data are not in true tabular form or Data contain control breaks in the
form of sub-totals or total lines or Data requires editing chores
Example of Scenario 1 Education Professor has assigned
students a project in which they must analyze per-pupil expenditures in MA
Investigate the relationship of per-pupil expenditures and population in the 351 cities and towns in MA.
Challenges to the Student-
Tab sheet is titled Per-Pupil Expenditures FY01
Data is given for LEA’s (Local Education Agency) rather than by city/town
LEA’s are listed in rough alphabetic order but there are a number of exceptions
The LEA field holds code numbers with many gaps
Data contain several layers of summarization
Information Student Seeks Per-pupil expenditure is on another
worksheet Labeled “incostlist” and titled FY 01
Integrated Cost Per Pupil Integrated means the cost of multi-
town regional school districts are spread or integrated across the constituent towns.
No indication of which LEA’s are composed of an individual city or town and which are part of a regional district
Integrating Population data Import from another information
source Student finds this data in which
1999 estimated population data for the 351 cities and towns is listed
Further Challenges Data are not fully tabular Contain frequent summarization
lines and frequent blank lines for readability
Data are sorted by county and sub-sorted by city or town (unlike the Dept of Education data)
Contain a plethora of data files which are not relevant to the investigation of per-pupil expenditures
An experienced Excel user would Delete unneeded rows and
columns Make the reference to county a
categorical variable Resort the data by city or town Copy the data to the “intcostlist”
worksheet Is it reasonable to expect a casual
student user to be able to do this?
Here’s is what would need to be done:
1. Using the Edit – Move or Copy Sheet function, create a temporary copy of the ‘State,County & MCD Totals’ worksheets;
2. In this worksheet, manually eliminate all spacing lines, county headers, unneeded columns and summary data rows;
3. Sort the data by city or town, so that they are in the same order as the ‘intcostlist’ worksheet.
4. Copy these data onto a copy of the ‘intcostlist’ worksheet.
5. In the resulting merged worksheet, the city or town names from ‘intcostlist’ are in column B while the city or town names imported from MISER are in column G.
6. However, the values from the Department of Education and those from MISER often do not match. Thus, the student must create a logical test column (here, column J) to determine where there are any mismatches in the data
7. This logical test reveals an immediate problem: since the IF function’s logical test is case-specific, no values in the original data in column B (which is all in upper-case format, e.g., ‘ABINGTON’) match the values in column G (which is in title format, e.g., ‘Abington’). Therefore the student must modify the logical test in column J to force column G into upper case:
8. Unfortunately, the student soon discovers that even this will not work, because column B’s values contain literal blanks at the end of the entered values. For example the first value is actually 'ABINGTON' followed by 21 blanks.
9. This in turn requires a re-write of the logical test in column J to incorporate the length of the city or town name. For example, row 10 would be:
10. This results in 36 records that are “out of sequence”. These are dominated by sorting errors, that is, an error of incorrect placement of values under the collating sequence, indicating that one or both of the original tables was sorted incorrectly. An example of sorting errors can be seen in values 85-87 (spreadsheet lines 94-96):
Now the student has what he/she needs
Not a criticism of sites Often not created with analysis in mind Often originated as an internal document
and thus there is no motivation to avoid internal jargon
Web site’s owner’s often have a strong desire to present data in a highly readable format - hence the blank lines, summarization's and other factors
Thus, even a site that offers data in Excel format cannot be used directly by a student
Scenario 2: Data Available in Text Format
Scenario 2a: Text Data available in Tabular Format
Scenario 2b: Data Available in Text Format But with a Layout that Inhibits Downloading
Sites that provide data in text format
1. Capture the data by selecting and copy-pasting the data to an intermediate holding area such an Microsoft Office’s Notepad™;
2. Advisedly, saving the data in Notepad’s native format, *.txt;
3. Take advantage of the analysis tool’s text import functionality, such as Microsoft Excel’s Text Import Wizard;
4. Edit the imported data as necessary; and
5. Save the data in Excel’s native format, *.xls.
Example: Suppose a student taking a course on forecasting must analyze
time series data. Home page for the BLS is
http://stats.bls.gov
Student selects “get detailed statistics”
For price data - CPI- Average Price Data
Suppose the student selects the price of a pound of bread in the Northeast
www.bls.gov Search |
A-Z Index
BLS Home | Programs & Surveys | Get
Detailed Statistics | Glossary | What's New | Find I t! I n DOL
Data extracted on: June 24, 2003 (03:02 PM)
Average Price Data
Series Catalog: Series ID : APU0100702111 Area : Northeast urban I tem : Bread, white, pan, per lb. (453.6 gm)
Data:
Series ID Yea
r Perio
d Valu
e
APU0100702111
2001
M01 1.039
APU0100702111
2001
M02 1.053
Scenario 2b: Data Available in Text Format but with a Layout that Inhibits Downloading
All of issues of Scenario 2a apply here In addition, the numeric information
posted to the Web is solely for the user to read, with no obvious intention of allowing or facilitating the user to download the data
Select carefully the portion of the data to be used on the screen
Copy it and then paste it to the worksheet.
Example - A professor has assigned a student to prepare a graphical representation of population flows in MA by county.
Student finds data at the following site: http://eire.census.gov/popest/data/counties/tables/CO-EST2002-04-25.php
To Download these data: Student must first highlight only the data portion
of the table, avoiding header information (column headers, however, are acceptable)
Longer column headers are “wrapped” over several lines, so the first line should be edited to incorporate the full column name
This assumes that the student can: 1. go into the subsequent lines; 2. initiate an edit (F2); 3. cut the data in the line; 4. move to the first line; initiate
an edit in that cell; and 5. paste the data on the clipboard to
the end of the data in the first line.
Handling of a calculated field In this example: “Natural Increase
(births- deaths) and “Net migration” (International + Internal)
These columns get downloaded as entered values
If the data they are based on gets updated, the calculated columns will not get updated and it is unlikely that a student would notice this.
Better if the two columns were eliminated and re-created by the student so they are hot linked to the data
Also note that the first data line is a sum of the detail lines below but is not labeled as such. Likely to through off analysis if not noticed.
Scenario 3: Data not in text format or tabular layout Websites that have data available Presentation of data is intended
only for reading Data are not in tabular form
Example: Professor has assigned a paper on the “Deaths Caused by Alcohol”
Student has found the Department of Justice site: http://www.ojp.usdoj.gov/bjs/pubalp2.htm
Selects a report on “Alcohol and Crime”
Selects .txt format
Rates of death caused by alcohol per 100,000 persons, adjusted for age Whites Blacks Year All groups Males Females Males Females 1980 8.4 10.8 3.5 32.4 10.6 1985 7.0 9.2 2.8 27.7 8.0 1990 7.2 9.9 2.8 26.6 7.7 1992 6.8 9.9 2.6 22.3 6.3 1993 6.7 9.7 2.7 21.3 5.5 1994 6.8 9.9 2.7 20.4 5.6 Percent change 1980-94 -19.0% -8.3% -22.9% -37.0% -47.2% Source: National Center for Health Statistics, Monthly Vital Statistics Reports.
How will the student extract the needed data and move it to Excel for subsequent analysis?
Consider: Occurrence of Alcohol-related deaths Data of interest in left most
numeric column Copy data to a text editor such as
Notepad Edit out extraneous data, if any Saved as its own data file Opened in Excel Use the Text Import Wizard
Another example: Traffic fatalities and data related to alcohol involvement for the 50 states and DC
Presented in two columns
Challenge Data file with four fields that is
double horizontally Must be downloaded as a data file
of 8 fields First task - save left hand data as a
text file and import it to Excel Space delimiter works best except
for when there is a space in the state name such as New Hampshire
Next Student must repeat the process for
the 26 states to the right Cut and paste these data to the
bottom of the 25 states to the left of the table.
Integer field “Number of fatalities” is fine, but remaining two columns are %’s
Students must store %’s as proportions and then format them in percentage style - otherwise could make serious calculation errors
Implications for Faculty and Conclusions
Faculty, particularly outside IS, often assign students to “get data from the web”
Demonstrated that accomplishing this seemingly straightforward task can be quite challenging
Some faculty may not realize that this task represents more of a challenge to their students than they had in mind
Time spent importing and editing data prior to analysis reduces time spend on the assignment itself
Recommendation #1 Professors make a habit to test
how difficult it is to “get data from the web”
We suspect that many profs have limited experience in doing this
Performing the necessary downloads prior to giving the assignment will allow the prof to assess how realistic it is for students to download the data themselves
Recommendation #2
Limit the number of assignments that require downloads to a few
After that, we recommend the prof do the downloading and make the data available to students
Implications for colleges and universities Faculty must educate
administrators of the difficulties students face when downloading data from the web
Administrators must make available to students support in using the web to find and download data
Implications for IS community Data warehouse issues have thus
far been taught conceptually We predict that technical
instruction will soon be available When that occurs, IS faculty will
face much the same difficulties
Thank You!