Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=ucha20 Download by: [University of Illinois at Urbana-Champaign] Date: 11 September 2015, At: 10:01 CHANCE ISSN: 0933-2480 (Print) 1867-2280 (Online) Journal homepage: http://www.tandfonline.com/loi/ucha20 Estimating Rates at Which Books Are Mis-Shelved Hongmei Liu, Jay Parker & Wei Sun To cite this article: Hongmei Liu, Jay Parker & Wei Sun (2011) Estimating Rates at Which Books Are Mis-Shelved, CHANCE, 24:1, 36-43, DOI: 10.1080/09332480.2011.10739850 To link to this article: http://dx.doi.org/10.1080/09332480.2011.10739850 Published online: 02 Aug 2013. Submit your article to this journal Article views: 22 View related articles
Estimating Rates at Which Books Are Mis-Shelved
Hongmei Liu, Jay Parker, and Wei Sun
The basic goal of survey sampling is to draw inferences on selected parameters, or characteristics, of a population. Usually, it is impractical or impossible to examine all the individuals in an entire population. By using a controlled random sampling strategy, one can examine a sample selected from a population, which takes less time, costs less money, and is operationally simpler than doing a census. It also achieves acceptably accurate results.
Basic ideas of survey sampling were used in a project for an introductory survey sampling course at the University of Illinois-Chicago, taught and supervised by Samad Hedayat, in the fall of 2009. The task was to estimate the mis-shelving rate of books at the university library. The following relates the story of that exercise.
Survey Design
To draw a representative sample that reflects the population with reasonable accuracy, we need to select a suitable sampling design. Let U = {1, 2, …, N} index a finite population of N distinct and identifiable units. We refer to N as the size of the finite population. The elements of U, namely 1, 2, …, N, also are known as the sampling units. Here, identifiability means there is a 1:1 correspondence between the units and the indexes 1, 2, …, N. The sampling frame
is a list of all the N population units from which a sample is drawn. There is a study variable Y that assumes values $Y_1, Y_2, \ldots, Y_N$ on the N units in the population. For example, in the study of mis-shelved books, we are interested in the mis-shelving rate. In this case, the total number of books is N, and $Y_i = 1$ if the ith book is mis-shelved, $Y_i = 0$ otherwise; then,

$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$

refers to the mis-shelving rate per book. Here, all the N books are distinct and identifiable.
A duplicate copy of a title counts as a separate book for this application. In reality, the mis-shelving rate per book is small. Therefore, it is often expressed in terms of a suitably defined larger collection (of books). Herein, we will work with the shelves as the reference population units, since these define natural collections of books.
For a moderately sized collection of identifiable reference units, such as books or shelves, simple random sampling without replacement (SRSWOR) is recommended over taking a convenience or opportunistic sample. Once the sample size n (in terms of the number of reference units to be sampled) has been determined, the SRSWOR method ensures an equal chance for all possible combinations of n units out of the totality of all reference units. SRSWOR can be implemented by putting index cards with identifying information on each reference unit into a box, or by putting electronic records into a computer, and randomly selecting n indexes sequentially without replacement with equal probability at each draw.
Most often, however, more convenient and more accurate sampling methods are available for a large collection of units. One method is referred to as stratified simple random sampling (STRSRS). According to this method, the whole population of reference units is divided into a number of subpopulations, called strata, and independent simple random samples are drawn in each stratum. We denote by L the number of strata and assume there are $M_h$ population units in the hth stratum, h = 1, 2, …, L.
We obtained preliminary information on mis-shelving rates and frequency of use on different book collections by consulting librarians and reading literature. Because different collections have different frequency of use and some collections naturally result in a low number of mis-shelved books, we divided the library into six collections by physical location and frequency of use. Each such collection served as a stratum in our study.
The collections for our study comprised 1st floor south, 2nd floor south, 2nd floor north, 3rd floor north, 4th floor south, and 4th floor north. Those collections covered all the areas in the library except the noncirculated items and government documents, which we excluded from our study. In the above notation, we have L = 6 strata. If each stratum were amenable to conducting a SRSWOR, we could have done so and combined the estimates from each stratum to estimate the overall mis-shelving rate. Stratified sampling often is more efficient than simple random sampling, because it explicitly removes the between-strata variation from the estimation.
At this stage, we realized the stratum sizes were large and still not easy to sample. The books, however, were conveniently located in stacks of bookshelves, running into hundreds and thousands of bookshelves. Moreover, the bookshelves were easy to list and contained a more-or-less constant number of books (30) per shelf. Although there was slight stratum-to-stratum variation in this number, this was approximately true for each stratum. Also, it is not difficult for a single person to check the shelving order of 30 books.
Within a stratum, in survey sampling terminology, the population of books is organized into clusters of books. From within each stratum, we decided to select a few clusters and inspect the status of all the books in each selected cluster. Cluster sampling is useful in
Table 1. Precision Test of Sample Size

| Samples | Total Number of Books | Mis-shelving Rate per Shelf ($\hat{\phi}_h$) | s.e. of $\hat{\phi}_h$ |
|---|---|---|---|
| SRSWOR(9945, 100) | 3,110 | 0.80 | 0.0891 |
| SRSWOR(9945, 1000) | 29,886 | 0.71 | 0.0250 |
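Table 2(a). Mis-Shelving Rate by Column and Row Positions of the Shelf in Book Stacks

| Data | Row \ Column | Bottom | Middle | Top | Total |
|---|---|---|---|---|---|
| # of Shelves | End | 87 | 130 | 104 | 321 |
| # of Shelves | Middle | 151 | 260 | 168 | 579 |
| # of Books per Shelf | End | 28 | 29 | 30 | 29 |
| # of Books per Shelf | Middle | 31 | 30 | 30 | 30 |
| # of Mis-Shelved Books per Shelf | End | 0.67 | 0.63 | 0.74 | 0.68 |
| # of Mis-Shelved Books per Shelf | Middle | 0.86 | 0.66 | 0.68 | 0.72 |
| Mis-Shelving Rate per Book | End | 2.36% | 2.18% | 2.50% | 2.33% |
| Mis-Shelving Rate per Book | Middle | 2.78% | 2.20% | 2.30% | 2.38% |
| Total # of Shelves | All rows | 238 | 390 | 272 | 900 |
| Average # of Books per Shelf | All rows | 30 | 30 | 30 | 30 |
| Average # of Mis-Shelved Books per Shelf | All rows | 0.79 | 0.65 | 0.70 | 0.70 |
| Mis-Shelving Rate per Book | All rows | 2.63% | 2.19% | 2.37% | 2.36% |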
Table 2(b). Chi-Square Tests Comparing Mis-Shelving Rate of the Shelves

| Row | Column | # of Mis-Shelved Books | Total # of Books | p-Value |
|---|---|---|---|---|
| End | Bottom | 58 | 2456 | Reference |
| End | Middle | 82 | 3768 | 0.6006 |
| End | Top | 77 | 3081 | 0.7263 |
| Middle | Bottom | 130 | 4679 | 0.3127 |
| Middle | Middle | 172 | 7825 | 0.6893 |
| Middle | Top | 114 | 4967 | 0.8059 |
situations in which the frame of the population under study is either not readily available or largely inadequate for the purpose of sampling.
Indeed, we could not be sure of the exact shelf location of each book based on its electronic record. Instead, the population units were found to be conveniently grouped into several natural clusters. In such cases, splitting the population into representative clusters can make sampling more practical. Then, we could simply select one or a few clusters at random and perform a census within each of the clusters.
A potential disadvantage of cluster sampling is that the units within a cluster can be quite similar. For example, a remote shelf with old volumes on it might be relatively undisturbed compared with the rest of the library. One could potentially learn more about the overall mis-shelving rate by taking 30 randomly selected books, rather than 30 from one shelf. Still, cluster sampling is often used due to its practical advantages.
In summary, the design used in the project was a stratified single-stage cluster sampling design. That is, first a STRSRS design was used to select clusters (the bookshelves in our study) within each stratum (book collection). Then, a census of books in the selected clusters was performed to determine the accuracy of shelving. In effect, we used the shelves as clusters of books and used the shelves as reference units for sampling and inference.
Definition of Outcome Variable
Recall that the parameter of interest is the mis-shelving rate. The understanding was that a mis-shelved book would be one whose call number was larger than the call number of the adjacent books on its right or smaller than the call number of the adjacent books on its left. That is, if {1, 2, 3, 4, 5} is the correct order, then for the sequence {1, 2, 5, 3, 4}, only unit {5} is considered mis-shelved.
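One way to turn this definition into a count is sketched below. It is an assumption of this sketch, not the authors' stated procedure: a literal neighbor-by-neighbor check on {1, 2, 5, 3, 4} would also flag book 3 (it is smaller than its left neighbor 5), so the sketch instead counts the fewest books that would have to be moved to restore call-number order, which is the shelf length minus the longest non-decreasing subsequence. The function name `count_misshelved` is hypothetical.

```python
from bisect import bisect_right


def count_misshelved(call_numbers):
    """Fewest books to move so the shelf reads in non-decreasing order.

    Assumption of this sketch: call numbers are already comparable values
    (e.g. integers or normalized sort keys).
    """
    # tails[k] is the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 seen so far (patience-sorting idea).
    tails = []
    for value in call_numbers:
        pos = bisect_right(tails, value)
        if pos == len(tails):
            tails.append(value)   # extends the longest subsequence
        else:
            tails[pos] = value    # keeps a smaller tail for that length
    return len(call_numbers) - len(tails)


# The article's example: only book 5 is out of place.
print(count_misshelved([1, 2, 5, 3, 4]))
```

Under this reading, duplicate call numbers (for example, two copies of one title) are treated as correctly ordered next to each other.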
Our definition included identifiers such as volume numbers in a series of bound volumes of a periodical. We also accounted for books located on the floor close to bookshelves or on top of other books. We treated a shelf of books as forming a natural cluster. Normally, a book stack consisted of seven rows and 13 columns of shelves on both sides. A shelf is a natural cluster that is easy to identify and locate.
Also, one does not have to account for borrowed, circulating, or lost books when selecting a sample of nonempty shelves, as one does when selecting a sample of individual books. In many contexts, clearly defining variables to be measured is an important component of study development.
Sample Size and Pilot Study
To check whether it is worthwhile to allocate sample size (i.e., number of shelves to be selected from each stratum of shelves) proportional to the stratum size and, further, for example, whether 100 shelves make up a suitable sample size, we conducted a precision test on the 4th floor south collection. This collection had 9,945 shelves in total. We obtained a sample of 100 shelves using an SRSWOR(9945, 100) design. We then sampled another 900 shelves using an SRSWOR design from the remaining subpopulation of 9,845 shelves. Combining these two samples produces a sample of 1,000 shelves belonging to the class of SRSWOR(9945, 1000) designs. We recorded both the number of books per shelf and the number of mis-shelved books per shelf. Table 1 summarizes the comparison between the samples.
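The two-phase draw described above can be sketched as follows. This is an illustrative sketch only: the function name `two_phase_srswor`, the use of shelf indexes 1 to 9,945 as the frame, and the seed are assumptions, not details reported in the article.

```python
import random


def two_phase_srswor(population_size, first_n, extra_n, seed=None):
    """Draw first_n units by SRSWOR, then extra_n more from the remainder.

    Returns (first_sample, combined_sample). Units are the integer
    indexes 1..population_size (an assumed shelf numbering).
    """
    rng = random.Random(seed)
    frame = range(1, population_size + 1)
    first = rng.sample(frame, first_n)        # SRSWOR(N, first_n)
    drawn = set(first)
    remaining = [u for u in frame if u not in drawn]
    extra = rng.sample(remaining, extra_n)    # SRSWOR from the remainder
    return first, first + extra


first, combined = two_phase_srswor(9945, 100, 900, seed=2009)
print(len(first), len(combined), len(set(combined)))
```

Because the second draw excludes the shelves already chosen, the combined set contains no repeats, which is what lets it be treated as a single SRSWOR sample of 1,000 shelves.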
As seen in Table 1, the standard error with 100 shelves was about $\sqrt{10}$ times that of the sample with 1,000 shelves (as expected), but it still was quite low. The standard error (s.e.) of $\hat{\phi}_h$ is approximately the square root of

$$\frac{\hat{\phi}_h\,(30 - \hat{\phi}_h)}{30\, m_h}\left(1 - \frac{m_h}{M_{oh}}\right),$$

which is a formula from survey sampling that accounts for sampling units from a finite population; here $m_h$ is the number of sampled shelves and $M_{oh}$ is the total number of shelves in the stratum. Therefore, the decision was made to opt for a sample of 100 shelves from each stratum, as it meets our needs of reducing data collection time while keeping a satisfactory precision level.
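As a rough check, the standard-error approximation can be evaluated for the two pilot samples in Table 1. The sketch assumes the expression is read as sqrt( phi(30 - phi) / (30 m) * (1 - m/M) ), with phi the estimated mis-shelved books per shelf, m the number of sampled shelves, M the number of shelves in the stratum, and 30 the assumed number of books per shelf; the function name is hypothetical.

```python
import math


def approx_se_per_shelf(phi_hat, m_sampled, m_total, books_per_shelf=30):
    """Approximate s.e. of the per-shelf mis-shelving rate.

    Assumed reading of the article's formula: a binomial-type variance
    phi(30 - phi) / (30 m), multiplied by the finite population
    correction (1 - m / M).
    """
    variance = phi_hat * (books_per_shelf - phi_hat) / (books_per_shelf * m_sampled)
    fpc = 1 - m_sampled / m_total
    return math.sqrt(variance * fpc)


# Pilot samples from the 4th floor south collection (9,945 shelves).
print(approx_se_per_shelf(0.80, 100, 9945))
print(approx_se_per_shelf(0.71, 1000, 9945))
```

Under these assumptions the values come out close to the 0.0891 and 0.0250 reported in Table 1; the match is approximate, not exact, since the published figures may rest on the observed shelf sizes.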
Moreover, we decided to work out an estimate of the mis-shelving rate per shelf, rather than per book, since the latter was small. In every stratum, an estimate of the exact mis-shelving rate per shelf is the ratio of the total number of mis-shelved books arising out of the sampled shelves to the number of sampled shelves. Under the assumption that, on average, the book volume is 30 books per shelf, we also can work out an estimate of the exact mis-shelving rate per book, once the rate is computed per shelf. Alternatively, using the exact sizes of the sampled shelves would result in a ratio of the form: total number of mis-shelved books arising out of the sampled shelves over total number of books arising out of the sampled shelves. This would provide an estimate of the rate per book only in an approximate sense. The estimate would behave like a ratio estimate. We pursue this latter computation below only to check the variation in the final results on rate per book.
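The three quantities described in this paragraph can be computed side by side, as in the minimal sketch below. The shelf counts are made-up illustration values, not data from the study, and the function name is hypothetical.

```python
def misshelving_estimates(misshelved_counts, book_counts, assumed_books_per_shelf=30):
    """Return (rate per shelf, rate per book assuming 30 books, ratio estimate).

    misshelved_counts[i] and book_counts[i] describe the i-th sampled shelf.
    """
    n_shelves = len(misshelved_counts)
    total_misshelved = sum(misshelved_counts)
    per_shelf = total_misshelved / n_shelves
    # Per-book rate under the assumed constant shelf size.
    per_book_assumed = per_shelf / assumed_books_per_shelf
    # Ratio estimate using the books actually counted on the shelves.
    per_book_ratio = total_misshelved / sum(book_counts)
    return per_shelf, per_book_assumed, per_book_ratio


# Hypothetical sample of five shelves.
print(misshelving_estimates([0, 1, 2, 0, 1], [30, 28, 32, 25, 40]))
```

The two per-book figures agree only when the sampled shelves average 30 books, which is why the ratio form serves as a check on the assumed shelf size.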
Data Description
Table 2(a) and Table 2(b) show comparisons of shelves with different column and row positions on book stacks to check whether physical locations of the shelves make any difference, using the 900 shelves sampled from the 4th floor south. Column position was determined by proximity to the top or bottom. Those shelves within two levels from the top were considered column-top, and those within two levels from the bottom were considered column-bottom. Otherwise, the shelves were considered column-middle. Position was similarly defined for rows. Those shelves within two of the left end or right end of the row were considered row-end and everything else was considered row-middle.
Statistical tests show there was no significant difference in terms of mis-shelving rates among shelves of different row and column positions, although column-middle seemed to have a lower mis-shelving rate than column-top and column-bottom. Therefore, a shelf of books seemed like a reasonable cluster. The shelves were used as the primary sampling units.
Table 3. Frequency of Mis-Shelved Books by Collection
Figure 1. Frequency of mis-shelved books by collection (number of mis-shelved books on a shelf : between 0 and 7)
Figure 2. Boxplot of phi by collection Y-axis: Phi = Number of mis-shelved books on a shelf / Total number of books on the same shelf
For each collection, we applied an SRSWOR design to select 100 shelves. For each selected shelf, the mis-shelving rate (the ratio of the number of mis-shelved books to the total number of books on the same shelf) was computed.
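The selection step can be sketched as an independent SRSWOR draw in each stratum. Only the 4th floor south size (9,945 shelves) is taken from the article; the other stratum size, the shelf numbering, and the function names are hypothetical.

```python
import random


def select_shelves_by_stratum(stratum_sizes, shelves_per_stratum=100, seed=None):
    """Independently draw an SRSWOR of shelf indexes within each stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(range(1, size + 1), shelves_per_stratum))
        for name, size in stratum_sizes.items()
    }


def shelf_rate(misshelved, books_on_shelf):
    """Per-shelf rate phi = mis-shelved books / books on the same shelf."""
    return misshelved / books_on_shelf


sizes = {"4th floor south": 9945, "hypothetical stratum": 5000}
chosen = select_shelves_by_stratum(sizes, seed=7)
print({name: len(shelves) for name, shelves in chosen.items()})
print(shelf_rate(3, 7))  # a shelf with three of its seven books mis-shelved
```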
Analysis Phase
Table 3 and Figure 1 show that most of the shelves had fewer than two books mis-shelved across all the collections. Overall, 49.2% of the shelves had no books mis-shelved, 27% had only one book mis-shelved, and 13% had two books mis-shelved.
As we can see from Figure 1, 2nd floor north and 2nd floor south had the lowest percentages of shelves with fewer than two books mis-shelved, along with a high percentage of shelves with more than two books mis-shelved. The 1st floor south and 4th floor south had the highest percentage of shelves with fewer than two books mis-shelved.
Figure 2 shows the mis-shelving rate on single shelves by collection. The mean level of the 4th floor south collection was much lower than those of the other collections, and this collection also had the smallest spread. One can clearly see that the distributions were right skewed and the 3rd floor north collection had an extremely high outlier. Checking the data, we found that the 3rd floor north collection had one shelf with a high mis-shelving rate because it had three out of its seven books mis-shelved.
Table 4 shows distribution of book volume of the shelves by the six strata. It indicates there is no strong linear
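Table 4. Data Summary by Collection and Book Volume per Shelf

Columns 3 to 7 give the book volume per shelf; "/" marks a cell with no sampled shelves.

| Data | Collection (stratum) | ≤10 | (10,20] | (20,30] | (30,40] | >40 | Overall |
|---|---|---|---|---|---|---|---|
| # of Shelves | 1st Floor S. | 7 | 27 | 38 | 22 | 6 | 100 |
| # of Shelves | 2nd Floor S. | 0 | 16 | 56 | 26 | 2 | 100 |
| # of Shelves | 2nd Floor N. | 0 | 9 | 43 | 46 | 2 | 100 |
| # of Shelves | 3rd Floor N. | 2 | 31 | 57 | 9 | 1 | 100 |
| # of Shelves | 4th Floor S. | 1 | 7 | 43 | 37 | 12 | 100 |
| # of Shelves | 4th Floor N. | 1 | 14 | 52 | 25 | 8 | 100 |
| # of Books per Shelf | 1st Floor S. | 9 | 17 | 25 | 35 | 50 | 25 |
| # of Books per Shelf | 2nd Floor S. | / | 18 | 26 | 34 | 47 | 27 |
| # of Books per Shelf | 2nd Floor N. | / | 18 | 26 | 35 | 44 | 30 |
| # of Books per Shelf | 3rd Floor N. | 8 | 17 | 26 | 34 | 41 | 23 |
| # of Books per Shelf | 4th Floor S. | 8 | 17 | 27 | 35 | 46 | 31 |
| # of Books per Shelf | 4th Floor N. | 8 | 17 | 26 | 34 | 44 | 28 |
| # of Mis-shelved Books per Shelf | 1st Floor S. | 0.57 | 0.19 | 0.76 | 1.05 | 2.17 | 0.74 |
| # of Mis-shelved Books per Shelf | 2nd Floor S. | / | 0.75 | 0.91 | 1.19 | 1.50 | 0.97 |
| # of Mis-shelved Books per Shelf | 2nd Floor N. | / | 0.33 | 1.30 | 1.13 | 0.50 | 1.12 |
| # of Mis-shelved Books per Shelf | 3rd Floor N. | 1.50 | 0.68 | 0.86 | 1.11 | 3.00 | 0.86 |
| # of Mis-shelved Books per Shelf | 4th Floor S. | 0.00 | 0.29 | 0.58 | 0.89 | 1.67 | 0.80 |
| # of Mis-shelved Books per Shelf | 4th Floor N. | 0.00 | 0.14 | 0.85 | 1.32 | 2.63 | 1.00 |
| Mis-shelving Rate per Book | 1st Floor S. | 6.06% | 1.12% | 3.05% | 2.95% | 4.38% | 2.91% |
| Mis-shelving Rate per Book | 2nd Floor S. | / | 4.20% | 3.50% | 3.56% | 3.19% | 3.58% |
| Mis-shelving Rate per Book | 2nd Floor N. | / | 1.90% | 5.00% | 3.27% | 1.15% | 3.79% |
| Mis-shelving Rate per Book | 3rd Floor N. | 18.75% | 4.06% | 3.37% | 3.27% | 7.32% | 3.68% |
| Mis-shelving Rate per Book | 4th Floor S. | 0.00% | 1.71% | 2.16% | 2.58% | 3.64% | 2.57% |
| Mis-shelving Rate per Book | 4th Floor N. | 0.00% | 0.82% | 3.28% | 3.83% | 5.92% | 3.56% |
| # of Shelves | All collections | 11 | 104 | 289 | 165 | 31 | 600 |
| Avg. # of Books per Shelf | All collections | 9 | 17 | 26 | 34 | 46 | 27 |
| Avg. # of Mis-shelved Books per Shelf | All collections | 0.64 | 0.43 | 0.88 | 1.10 | 1.97 | 0.92 |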
relationship between book volume and mis-shelving rate.
To further investigate the impact of book volume per shelf on mis-shelving rate, we used the Wald chi-square test, conducted with the SAS SURVEYFREQ procedure, to analyze the data. The shelves were divided into five groups according to their book volume, as we did in Table 4, and we had six strata. According to the chi-square test results, the association between book volume of the shelves and mis-shelving rate is not significant (p-value = 0.1750, $X^2_4$ = 6.3653, adjusted F = 1.5833). This analysis procedure took into consideration the complex survey design with stratification and clustering.
Accounting for the Complex Sample Design: Statistical Inference
Now we look at statistical inference for mis-shelving rates. When a sample is greater than 5% of the population from which it is being selected and the sample is chosen without replacement, the finite population correction factor should be used. The central limit theorem and standard errors of the mean and of the proportion are based on the premise that the samples selected are chosen with replacement. However, in virtually all survey research, sampling is conducted without replacement from populations that are of a finite size N. In these cases, particularly when the sample size n is not small in comparison with the population size N (i.e., more than 5% of the population is sampled) so that n/N > 0.05, a finite population correction factor is used to define both the standard error of the mean and the standard error of the proportion. If n denotes the sample size and N denotes the population size under SRSWOR sampling, then f = n/N is known as the "sampling fraction." Under simple random sampling with replacement (SRSWR), the sampled units behave as independently and identically drawn units and the expression for the variance of the sample mean assumes a very simple form,

$$\frac{\sigma^2}{n}.$$

However, under SRSWOR sampling, there is an intrinsic dependence of the sampled units among themselves. In that case, the expression for the variance of the sample mean assumes the form

$$\frac{N(1-f)}{(N-1)}\,\frac{\sigma^2}{n},$$

where f is the sampling fraction. For large N and appreciable value of the sampling fraction (of the order of 5% or more), the formula simplifies to

$$\frac{\sigma^2}{n}(1-f).$$

The factor (1 − f) is known as the finite population correction factor (abbreviated as fpc). When f is small, we can drop the fpc and the formula reduces to

$$\frac{\sigma^2}{n},$$

the same as that under SRSWR sampling. In the above, $\sigma^2$ refers to the population variance of the values of the study variable Y.
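The SRSWOR variance expression can be checked by brute force on a small population: enumerate every possible sample, compute the variance of the sample means, and compare with N(1 − f)σ²/((N − 1)n). The sketch below assumes σ² is the population variance with divisor N; the toy population of 0/1 values is made up for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance


def srswor_variance_by_enumeration(population, n):
    """Variance of the sample mean over all equally likely SRSWOR samples."""
    sample_means = [mean(sample) for sample in combinations(population, n)]
    return pvariance(sample_means)


def srswor_variance_formula(population, n):
    """N(1 - f) / (N - 1) * sigma^2 / n, with sigma^2 using divisor N."""
    size = len(population)
    f = n / size                      # sampling fraction
    sigma2 = pvariance(population)    # population variance (divisor N)
    return size * (1 - f) / (size - 1) * sigma2 / n


toy_population = [0, 0, 0, 1, 1, 0, 1, 0]   # hypothetical mis-shelved flags
print(srswor_variance_by_enumeration(toy_population, 3))
print(srswor_variance_formula(toy_population, 3))
print(pvariance(toy_population) / 3)         # with-replacement value, for contrast
```

The first two printed values should agree up to floating-point error, while the with-replacement value is larger, which is the effect the finite population correction captures.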
Table 5 summarizes statistical estimation of mis-shelved books and other parameters. $M_{oh}$ is the number of shelves, or clusters. $n_h$ is the average number of books per shelf. $M_h$ is the number of shelves rescaled by a factor of $(n_h/30)$. $W_h$ is the weight of the hth stratum, $W_h \approx M_h / \sum_h M_h$. $\hat{\phi}_h$ is the estimated mis-shelving rate per shelf. $\hat{T}_h$ is the estimated total number of books, $\hat{T}_h = n_h M_{oh}$.
The collection on 2nd floor north had the highest mis-shelving rate among the six collections we investigated, and the collection on 1st floor south had the lowest mis-shelving rate. The mean number of mis-shelved books per shelf was estimated to be 0.88, with an estimated s.e. of 0.1588. We have assumed the average shelf length in terms of book volume per shelf is 30. Therefore, an estimate of the number of mis-shelved books per 1,000 books would be 29, obtained from that per shelf by multiplying the latter by 1,000/30. Further, the estimated standard error would also have the same inflation factor. The mis-shelving rate we obtained is 2.9%.
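The weighting and rescaling steps can be sketched as follows. The stratified estimator is written here in its usual form, the sum of stratum weights times stratum rates, which is an assumption about how the stratum estimates were combined; the stratum sizes in the first example are hypothetical, while 0.88 and 0.1588 are the figures quoted above.

```python
def stratified_rate(rates_per_shelf, stratum_sizes):
    """Weighted mean of stratum rates with weights W_h = M_h / sum(M_h)."""
    total = sum(stratum_sizes)
    return sum(size / total * rate for rate, size in zip(rates_per_shelf, stratum_sizes))


def per_thousand_books(value_per_shelf, books_per_shelf=30):
    """Rescale a per-shelf quantity to a per-1,000-books quantity."""
    return value_per_shelf * 1000 / books_per_shelf


# Hypothetical two-stratum illustration of the weighting step.
print(stratified_rate([0.5, 1.0], [100, 300]))

# Rescaling the article's overall estimate and its standard error.
print(per_thousand_books(0.88))     # about 29 mis-shelved books per 1,000
print(per_thousand_books(0.1588))   # s.e. on the same per-1,000 scale
```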
Table 5. Statistical Estimation of Mis-Shelved Books
In the collection on 2nd floor north, the low frequency of use played a different role than expected. While many of the shelves had fewer than two mis-shelved books, there were 10 shelves with four, five, or even seven mis-shelved books. These high numbers were due to books being stacked after the bookend or on top of the books, instead of being placed somewhere the librarian would notice them and reshelve them properly.
Discussion
Our librarians did a good job in keeping the overall mis-shelving rate low compared to other public and university libraries, though there were certain areas that needed more effort (e.g., 2nd floor north). Due to time constraints, we did not do a census to compare strategies. There are other strategies we can consider and compare to STRSRS, such as systematic sampling, since books are arranged in arrays, and STRSRS with proportional allocation or optimal allocation if we can define a cost structure for the sampling design. We also can look into time effect by inspecting the mis-shelving rate at different periods of the year.
We used stratified single-stage cluster sampling. The drawback of this approach is that our strata were placement based. The amount of gain due to stratification largely depends on the degree of homogeneity within each stratum and heterogeneity among the strata means with respect to the survey variable. If improving precision of the sampling design is the primary concern, then we can do a subject-based stratified sampling. One possible approach would be to define the strata differently or to incorporate a second-stage cluster sampling. In a second-stage cluster sampling, one would subsample books from each cluster. If the shelves held 300 books instead of 30, it would be time consuming to check all 300 books. Rather, one could take a random sample of 30 books from each shelf and examine those books. This second stage of sampling is common in many applications.
Another option with more advance information would be to combine homogeneous collections of similar mis-shelving rates into one stratum and reduce the number of strata in the sampling design.
Another approach to investigating mis-shelved books and predicting their counts is to use a Binomial-Poisson mixture model with each shelf as the specified interval, if information about more covariates is available.
There are many factors that may have an impact on the mis-shelving rate, including frequency of use, subjects, timing and flux of users, and the shape of the books. Perhaps mis-shelving is also a result of inexperienced personnel, lack of attention, fatigue by personnel, or the complexity of some book indexes.
Further Reading
Cochran, W. G. 1977. Sampling techniques. Wiley, NY.
Edwardy, J. M., and J. S. Pontius. 2001. Monitoring book reshelving in libraries using statistical sampling and control charts. Library Resources & Technical Services 45(2):90–94.
Groves, R. M., F. J. Fowler Jr., M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2009. Survey methodology. 2nd edition. Wiley.
Hedayat, A. S., and B. K. Sinha. 1991. Design and inference in finite population sampling. New York: Wiley Series in Probability and Mathematical Statistics.
Heeringa, S. G., B. T. West, and P. A. Berglund. 2010. Applied survey data analysis. Chapman & Hall.
Jan, S. S., A. W. John, and S. Nackil. 2009. A cost-benefit analysis of a collections inventory project: A statistical analysis of inventory data from a medium-sized academic library. The Journal of Academic Librarianship 35(4):314–323.
Kish, L. 1965. Survey sampling. Wiley.
Lohr, S. 2009. Sampling: Design and analysis. Duxbury.