A latent semantic analysis methodology for the identification and creation of personas

A Latent Semantic Analysis Methodology for the Identification and Creation of Personas

Tomasz Miaskiewicz University of Colorado / Boulder

Leeds School of Business [email protected]

Tamara Sumner University of Colorado / Boulder

Institute of Cognitive Science [email protected]

Kenneth A. Kozar University of Colorado / Boulder

Leeds School of Business [email protected]

ABSTRACT

A persona represents a group of target users that share common behavioral characteristics. By using a narrative, picture, and name, a persona provides HCI practitioners with a vivid and specific design target. This research develops a new methodology for the identification and creation of personas through the application of Latent Semantic Analysis (LSA). An application of the LSA methodology is provided in the context of the design of an Institutional Repository system. The LSA methodology helps overcome some of the drawbacks of current methods for the identification and creation of personas, and makes the process less subjective, more efficient, and less reliant on specialized skills.

Author Keywords Personas, latent semantic analysis, cluster analysis, user abstraction, institutional repository

ACM Classification Keywords H.5.2 Information Interfaces and Presentation: User Interfaces: Theory and Methods

INTRODUCTION Human-computer interaction (HCI) practitioners utilize user abstractions to summarize the needs of a group of target users with similar characteristics. Personas, user abstractions that present vivid and holistic representations of a group of users [5], were developed by Cooper [4]. Personas have been gaining popularity among practitioners [8, 24] – large companies such as Discover Financial Services, Staples, Microsoft, SAP, Sovereign Bank, and FedEx have adopted personas as part of their design processes [18, 19]. Personas are beneficial when incorporated into a design process because they facilitate the communication of information about the target users

among the project team members [22]. Personas also build greater empathy for the target users – they develop a greater understanding of and identification with the user audience [21, 22].

A persona represents a group of users who share common behavioral characteristics. Even though a persona represents a group of users, it is written in the form of a detailed narrative about a specific, fictitious individual. To make the persona seem like a real person, the narrative contains details about family members, friends, possessions, socio economic status, gender, and so on [8]. More importantly, the narrative also describes the goals, needs, and frustrations of the persona that are relevant to the product being designed.

With the exception of the name, face, and specific details that make the persona more believable, personas should be developed based on the findings of user research [4, 7]. The commonly used methods for the identification and creation of personas from user research require individuals to manually analyze an extensive amount of (usually qualitative) data. Individuals analyzing the data attempt to find common characteristics that are shared by multiple users – these groups of similar users constitute the resulting personas. However, these “manual” methods can be problematic because they can be perceived to be subjective [22, 24], require the commitment of substantial resources [19], and rely on specialized skills [19].

In an attempt to address the drawbacks of the manual methods, this research incorporates a technique called Latent Semantic Analysis (LSA) to develop a new methodology for identifying and creating personas directly from interview transcripts. The methodology then is illustrated and validated in the context of the design of an Institutional Repository (IR) system at a large midwestern university. The LSA methodology overcomes some of the drawbacks of current methods, and provides a first step in the evolution of personas into a more broadly applicable design method.

PERSONAS BACKGROUND For those readers unfamiliar with personas, we first provide an example persona that helps to illustrate their essential characteristics. Next, we review the current methods that

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise,or republish, to post on servers or to redistribute to lists, requires priorspecific permission and/or a fee. CHI 2008, April 5–10, 2008, Florence, Italy. Copyright 2008 ACM 978-1-60558-011-1/08/04…$5.00.

CHI 2008 Proceedings · Character Development April 5-10, 2008 · Florence, Italy

1501

are used to identify and create personas, and argue that these methods possess three drawbacks.

Key Characteristics of Personas The characteristics of a persona are illustrated with an example persona in Figure 1. The example was written by two of the authors, and represents a persona that could be used to redesign ACM’s Digital Library. Due to the extensive length of personas (typically one to two pages long), we only provide excerpts for this example.

MEET PAT OWENS Pat is a 23 year-old graduate student at North Carolina University. He lives off-campus in a modest one-bedroom apartment in a quiet area of Chapel Hill...During his free time, he enjoys playing online computer games (Age of Empires is his favorite game), and spending time outside running and bicycling to stay in shape...

Pat recently started to work as a research assistant for Dr. Justin Roberts, a well-known HCI researcher...Dr. Roberts has asked Pat to write a draft of a literature review for an upcoming paper on the topic of heuristic evaluation. Dr. Roberts purchased an ACM membership for Pat, and suggested reviewing a series of ACM’s publications such as the Communications of the ACM and Interactions. Pat has never conducted a literature review so he is excited by the opportunity, but is also afraid that he will not do an adequate job.

Specifically, Pat’s goals and needs are: (1) to not look stupid in front of Dr. Roberts, (2) to complete an exhaustive literature review, (3) to not waste time while looking for papers, and (4) to acquire a better understanding of HCI literature.

Figure 1. Example persona

The example persona illustrates the narrative that is needed to describe a persona. Pat Owens is intended to seem like a believable person through detailed information about his apartment and hobbies. Also, the specific goals that Pat wants to accomplish are explicitly defined, which guide the design of specific features. For example, a search engine on the ACM web site that effectively suggests related articles and terms is essential for Pat’s literature review because Pat is unfamiliar with HCI literature and terminology.

Current Methods for Identifying and Creating Personas Methods for persona development are composed of two phases. First, persona identification involves finding distinct groups of users that constitute personas. Users that are judged to be similar to each other (and distinct from other group(s) of users) are grouped together. Once the personas are identified, persona creation involves writing a detailed narrative about a persona.

While there is no universally advocated method for persona identification and creation, a few principles exist. First of all, data about user needs is gathered during the pre-design phase, so that subsequent design decisions are based on the personas. While other sources of data could be used (e.g., surveys, web logs), the use of rich qualitative data from interviews and/or observations is advocated [5, 7]. Once the user research is complete, HCI practitioners analyze the findings. Significant observations are identified for each individual that was interviewed and/or observed. Once the important observations are identified, similar observations are clustered and labeled [22]. After patterns are found across multiple individuals, these individuals become the basis for a persona [7]. Depending on the complexity of the product, typically 3 to 12 personas are identified [4]. Once the personas are identified, the original observations are used to create the personas.

Sinha [24] recognized the subjectivity of identifying personas manually from user research findings, and developed a new approach that uses Principal Components Analysis (PCA). In this study, participants filled out a 32-item questionnaire and rated the importance of certain dimensions of a restaurant experience (e.g., wine selection, parking). Through the use of PCA, three components (i.e., groups) of user needs were identified. However, this PCA approach is problematic because the personas are identified by asking the individual about what they want from a restaurant. Goodwin [7] explains, “Rather than asking users what they want, it is more effective to focus on what users do, what frustrates them, and what gives them satisfaction.” Therefore, personas ideally are identified from interviews and/or observations because they allow individuals to discuss their goals and frustrations more openly [7]. Sinha [24] acknowledges this limitation – after the PCA results were provided to information architects additional interviews were still needed to further inform the personas.

Drawbacks of Current Methods

Subjectivity and Perceived Lack of Rigor Manually identifying the right set of personas from the user research data is difficult [22]. The process of identifying important observations can be a daunting task – when 20 interviews are conducted and each resulting transcript is 20 pages long, then HCI practitioners will need to determine which observations were important and which were not in about 400 pages of text. Due to the difficult judgments that need to be made when reviewing the data, two practitioners


1502

can identify very different personas even when reviewing the same data [24].

As a result, personas can be perceived as “artsy” and subjective [22, 24]. If personas are not based on a methodology that closely associates the personas with the user research findings, they can be perceived as lacking rigor [22]. In effect, they can lose credibility among the project team members and senior management, and will not be actively used in the design process [8, 22].

Resource Intensive The effort needed to identify and create personas for a specific project can be significant. A Forrester Research report surveyed persona vendors and found that an average persona investment is about $47,000 [19]. This cost is composed of the time needed to conduct interviews, identify the personas from the data, and create the resulting personas. As the complexity of a project and the diversity of the target user audience increases, the investment also increases – one project for which eight personas were created cost about half a million dollars [19]. Clearly, organizations that decide to use personas as part of their design process need to commit substantial resources.

Require Specialized Skills Conducting interviews and extracting personas from data requires specialized skills. A team responsible for the persona effort needs to possess an ability to conduct qualitative data analyses and extract information from the resulting data [19]. Therefore, personas are not accessible to all – organizations that do not possess employees with the requisite skills and cannot afford a persona vendor are less likely to use personas.

METHODOLOGY In response to the three identified drawbacks of personas, we created a new methodology for persona identification and creation. The following sections first introduce LSA, a computational technique at the heart of the proposed methodology. Next, each of the five steps in the methodology is described.

Latent Semantic Analysis Overview Our methodology uses LSA to identify personas directly from textual data. LSA is a technique for representing the similarity of meaning of terms and documents through the analysis of a large amount of text [12, 13]. LSA only uses the contexts in which words appear and do not appear to determine similarity in meaning.

LSA starts by creating a word-by-document matrix – each unique word in the text is a row, and each document (usually in the form of a paragraph) is a column. The power of LSA exists in the technique that is used to reduce the number of dimensions in this word-by-document matrix [27]. By applying singular value decomposition (SVD), typically 100 to 300 of the most important dimensions are extracted [11]. More importantly, SVD also produces the

semantic relations between words [27]. For example, if words “car” and “truck” co-occur with words “clutch,” “tires,” “driver,” and the words “car” and “truck” do not co-occur with other words such as “science” and “printer,” then LSA will conclude that “car” and “truck” are similar. Typically, a cosine measure is used to represent the degree of similarity between words and documents. A cosine of zero indicates no similarity, whereas a cosine of one indicates maximal similarity [17]. Cosines have been shown to reliably match human similarity judgments [11].

LSA was chosen for this study due to its proven track record of providing reliable similarity judgments in a wide variety of application areas. For example, LSA has been successfully used to judge the quality of student essays [16, 25], study the cohesiveness of design team communication [6], and assess learning [14].

Additionally, the LSA group at the University of Colorado at Boulder has created a web site (http://lsa.colorado.edu), allowing researchers to easily analyze textual documents through a web-based interface. The group also has provided a series of semantic spaces – an LSA representation of large bodies of text. For example, the psychology semantic space is composed of over 13,000 documents of psychology-related text. A commonly used semantic space is called “tasaALL” – a general semantic space that is composed of over 37,000 documents from a variety of texts, novels, and newspaper articles spanning numerous fields such as business, language arts, science, and so on. By selecting one of the semantic spaces, researchers are able to use the web site to perform cosine calculations within the semantic space.

Five Step LSA Methodology

Step 1: Conduct Interviews Interviews are essential for informing the resulting personas [5, 7]. Thus, our approach starts with interviews, which provide the textual data that is used during the subsequent steps. A consistent protocol is used during each of the interviews, so that each interviewee is asked the same set of questions. The interviews are audio recorded, and then transcribed.

Step 2: Calculate Cosines Once the interviews are transcribed, LSA is used to calculate the similarity of the interviewees’ answers to specific questions. The answers to each question need to be first extracted from the interview transcripts – we found that a spreadsheet with questions as rows and interviewees as columns was helpful for organizing the text from the interviews. The resulting spreadsheet allows for an easier comparison of the answers across the interviewees for a specific question.

Next, the cosines are calculated among the answers of the interviewees to specific questions. For example, if 12 interviews were conducted, this would result in a series of 12-by-12 matrices, with each of the matrices providing an


1503

objective measure of how similar each interviewee’s answer was to the answers of all other individuals for a specific question. Finally, one aggregate matrix is calculated, which averages the cosines across all of the questions. This matrix provides an objective representation of the overall similarities among the interviewees.

Step 3: Use Cluster Analysis to Identify Personas After the aggregate matrix is calculated, cluster analysis is employed. Cluster analysis classifies observations into clusters (i.e., groups), which are constructed to be as internally homogenous as possible, while as different as possible from all other clusters [26]. Cluster analysis represents a group of multivariate techniques [9] – the researcher is responsible for choosing an appropriate clustering approach.

While other clustering approaches could be used, we incorporated a hierarchical, agglomerative approach. Hierarchical clustering approaches cluster the observations into tree-like structures, where results in earlier steps are nested within clusters in later steps [9]. Agglomerative methods are the most popular type of hierarchical clustering. These methods start with each observation as its own cluster and continue until only one cluster exists [9].

The proposed agglomerative approach can be carried out without any specialized software – we conducted the necessary calculations in a spreadsheet program. Our proposed clustering approach proceeds as follows:

• First, the aggregate matrix is examined for the observations with the highest overall cosine. Either two interviewees, an existing cluster of interviewees and an individual interviewee, or two existing clusters of interviewees with the highest overall cosines are clustered.

• After the observations are clustered, the aggregate matrix needs to be updated. For example, when clustering two interviewees into a new cluster, the cosine between interviewees not in this cluster (e.g., person1) and the interviewees in the new cluster (e.g., person2 and person3) is updated to be the average cosine between the interviewee not in the cluster and the interviewees in the cluster (i.e., cosine(person1, person2) + cosine(person1, person3) / 2). This updated cosine represents the similarity between the interviewee and the cluster of interviewees. The matrix is updated in a similar manner when interviewees are added to an existing cluster or two existing clusters are combined.

The clustering continues until (1) each of the interviewees is grouped with at least one other interviewee, or (2) a large drop-off in the highest overall cosines occurs. The rationale behind the stopping rules is that (1) a persona should represent a group of users and not a single user, and (2) personas should be composed of users with significant similarities. Once each of the interviewees is clustered,

conceptual considerations such as the interpretability of the clustering solution also can be taken into account. In cluster analysis, conceptual considerations are commonly considered because the final clustering solution should be understandable [9].

Step 4: Use LSA Cosine Matrices for Specific Questions to Create Personas Once the persona groupings are established, the personas are created based on the cosine matrices to specific questions. For each identified persona, these matrices can be examined to determine where the most significant similarities existed among the interviewees. One cannot expect all of the interviewees that compose a persona to be identical, so the most significant similarities should be the focus of the resulting persona narrative. We suggest focusing on four to six questions with the highest overall cosines when writing the narrative. The answers of the interviewees to these questions are then examined, so that a determination can be made for the basis of the similarities.

At this point, the personas’ narratives are written. First, we suggest writing the section of the narrative that summarizes the four to six key similarities of each of the personas. Next, each of the personas needs to be brought to life – a name, picture, and personal details are added to the narrative. While the interview data can provide some guidance, this part of the narrative should contain fictitious information to make the persona seem life-like [4].

Step 5 (optional): Compare Additional Interviewees to Personas The primary goal of the final step is to verify the personas that were created in step 4. For example, this step can be carried out when an organization identifies a set of personas, and the project leaders decide to verify that these personas represent the needs of additional users. To verify the personas, the organization could conduct additional interviews, transcribe the recordings, and determine how similar the needs of these new individuals are to the existing personas.

While various strategies could be employed during this step, we suggest the following strategy for verifying the existing personas:

1. Conduct additional interviews that ask focused questions about the needs, goals, and frustrations of the individuals when using the specific product or system.

2. Identify the section of the persona narrative that describes the persona’s goals, needs, and frustrations when using this product or system.

3. Calculate cosines between this section of the narrative, and the answers of each of the additional interviewees.

4. Repeat with the remaining personas.

As a result, the cosines will determine whether the needs of the additional interviewees are reflected by the personas.


1504

When “low” cosines are found between the new interviewees and all of the personas, the organization should consider creating additional persona(s).

METHOD APPLICATION This study was undertaken in conjunction with the library of a large midwestern university. The library was starting an initiative to develop an Institutional Repository (IR) – a system that consolidates the intellectual output (e.g., research papers, teaching resources, source code, research data) of the academic community into one centralized and searchable location. Before designing the IR, the library staff was trying to better understand how graduate students and faculty would use and benefit from this system. More broadly, the library also was looking to identify how individuals use the library’s other online resources, and to understand how individuals search for and share resources in the academic environment.

The library’s goals focused the interview questions and the content of the resulting personas, but a broader goal was to provide an illustration and preliminary validation of the LSA methodology. The following sections describe the application of our methodology through each of the five steps.

Step 1: Conduct Interviews 21 interviews were conducted with graduate students and faculty from a wide range of disciplines at the university. The individuals were recruited via e-mail, and were offered a $15 gift certificate for participating. The interviews typically lasted 30 to 45 minutes, and were recorded using a digital audio recorder. The 21 audio recordings were transcribed by a professional transcription service. One of the recordings contained poor audio quality and was not included in the subsequent analysis.

A protocol – created in conjunction with the library’s staff – was used during the interviews. The protocol contained questions that addressed each of the library’s goals: (1) to discover how the interviewees find and share information (e.g., “can you briefly describe the steps that you went through the last time that you searched for a research resource?”), (2) to capture opinions concerning the library’s other online resources (e.g., “do you find the online resources and systems provided by the library helpful for finding useful and relevant information?”), and (3) to identify how the interviewees would utilize the IR once it was available (e.g., “do you envision yourself sharing information with your students or colleagues through the institutional repository once it is available to you?”).

Step 2: Calculate Cosines LSA then was used to calculate the similarity among the answers of 16 of the interviewees. 4 of the 20 transcripts were not included because they were used in step 5. The general semantic space (i.e., tasaALL) was used for all of the cosine calculations.

Table 1 provides an excerpt from one of the matrices for answers to a specific question (i.e., how the interviewees would utilize the IR for sharing information). (In Table 1 and subsequent figures and tables, students are identified by an “s” and faculty members are by an “f”.) After we calculated similar matrices for the remaining questions, we then created one final aggregate matrix that averaged the cosines across all of the matrices for specific questions.

s1 s2 s3 s4 s5 s6 s7 ...

s1 1 0.50 0.47 0.53 0.55 0.50 0.38 ...

s2 1 0.66 0.74 0.78 0.71 0.56 ...

... 1 ... ... ... ... ...

Table 1. Example matrix containing cosines between interviewees' answers to a specific question

Step 3: Use Cluster Analysis to Identify Personas As summarized in Table 2, the clustering of the 16 interviewees progressed through 12 total steps. During the first five steps, six of the interviewees were combined into a cluster (clu1). In the sixth step, a second cluster (clu2) was generated because these two interviewees were more similar to each other than the first cluster. During the subsequent steps, two more clusters were created and several more interviewees were added to the existing clusters.

Step Clustered

Observations and Cosine

Cluster Membership

1 s4-s6 (0.59) clu1(s4,s6)

2 s3-clu1 (0.56) clu1(s3,s4,s6)

3 s1-clu1 (0.54) clu1(s1,s3,s4,s6)

4 f8-clu1 (0.52) clu1(s1,s3,s4,s6,f8)

5 f7-clu1 (0.51) clu1(s1,s3,s4,s6,f7,f8)

6 f3-f9 (0.50) clu1(s1,s3,s4,s6,f7,f8) clu2(f3,f9)

... ... ...

12 s2-clu1 (0.43)

clu1(s1,s2,s3,s4,s6,f2,f7,f8) clu2(f1,f3,f5,f9) clu3(s5,s7) clu4(f4,f6)

Table 2. 12 steps of the clustering approach

The cosine of the combined observations steadily decreased, so we did not use the 2nd stopping rule (i.e., a significant drop-off in the cosine of the clustered observations). However, in the 12th step, each of interviewees was clustered with at least one other


1505

interviewee (i.e., the 1st stopping rule), and we took into account conceptual considerations to decide whether to continue the clustering. In the 13th step, the two largest clusters (clu1 and clu2) would be combined into a single cluster containing 12 of the 16 interviewees. Creating one persona that was representative of the 12 interviewees was problematic because of the difficulty of finding common similarities across all of the interviewees, so we decided to retain the four persona clustering solution from step 12.

Figure 2 illustrates the resulting grouping of the interviewees into four personas. Persona 1 is composed of half of the interviewees, and combined both student and faculty interviewees. Persona 2 is composed of a group of four faculty members, while personas 3 and 4 each group two interviewees – persona 3 groups two students and persona 4 two faculty members.

Figure 2. LSA grouping of interviewees into personas

Step 4: Use LSA Cosine Matrices for Specific Questions to Create Personas For each of the four personas, we then identified the questions for which the interviewees provided the most similar answers. Each of the answers that we considered as similar contained a cosine of least 0.50, while answers to other questions possessed cosines as low as 0.11.

One of the library’s four personas (“Charles Williams”) is provided in Figure 3. We are not able to share the persona picture due to copyright agreements, and information identifying the institution also was removed. The narrative of the Charles Williams persona (persona 4 in Figure 2) was guided by answers to five questions each with a cosine of 0.51 or higher. Foremost, as summarized in the 2nd

paragraph of the persona, interviewees f4 and f6 both explained that they had very specific needs (usually finding books), and the library’s current online systems served these needs sufficiently. Also, as described in the 3rd paragraph, the two interviewees explained that they did not share research within the local university community, so they would be unlikely to share their research through the IR. Finally, as summarized in the 3rd and 4th paragraphs, the interviewees explained that they currently do not have an outlet for sharing and finding teaching resources, and they both believed that the IR potentially could serve these needs.

PROFESSOR CHARLES WILLIAMS Professor of History

Age: 61

Research: English history in the Anglo-Saxon era

Teaching: Usually one class per year

Service: Faculty search committees, advising doctoral students, and faculty evaluation committees

Meet Professor Charles Williams

(¶1) Charles is a professor at the Department of History, and has been a faculty member at the ___ for 34 years. He is still actively involved in his research on English history in the Anglo-Saxon era, but after many years of hard work, he also is trying to spend more time away from the university. Charles is an avid fisherman and enjoys spending time at his cabin near Carbondale, Colorado. He finds the cabin a peaceful retreat that helps him concentrate on finishing his latest book on Alfred the Great. He also spends his free time with his wife, Megan, and their two daughters – Monica and Ashley – who live in the area and visit often on the weekends to help out around the house and with their garden.

(¶2) Professor Williams has never been able to catch up with the breadth of resources that are available to him on the internet today. However, he does not feel like he is missing out on much. For his research, the library offers him the books that he needs and he knows how to look up their availability using the __ catalog search (he has a direct link saved on his computer in his office). Research in his field is primarily shared through books, so he rarely has to worry about looking for journal articles in any of the electronic databases.

(¶3) Professor Williams is the only person at __ that specializes in his area of research (and one of a few in the world), so he never collaborates on his research with other individuals at __. However, he occasionally does discuss the content of his courses with interested departmental faculty, and due to his faculty committee work he often reviews the syllabi and teaching materials of the faculty he is evaluating.

What does Charles want from the ____ IR?

(¶4) Charles would not mind sharing some of his teaching resources (e.g., class handouts, syllabi) with other faculty through the IR, but he is afraid that it would be too hard for him to learn how to do this. Also, when evaluating the teaching of other faculty members, the IR would help Charles if the teaching resources of these faculty members were available through the repository.

Figure 3. Library's Charles Williams persona

After we wrote the part of the narrative that summarized the answers to the five questions (i.e., paragraphs 2, 3, and 4), we brought the persona to life by including specific,


1506

fictitious information (i.e., name, face, and details in paragraph 1). The description of who Charles Williams is was not guided by the interview data, but we made sure that the description fit the rest of the persona’s narrative. We followed an identical procedure for the remaining three personas.

Step 5: Compare Additional Interviewees to Personas Finally, we compared the needs of the four interviewees whose transcripts were not included in the clustering to the four personas that we created. We calculated cosines between only a section of the transcripts and a section of the personas. We focused on the part of the transcript that contained answers to questions concerning the IR system, and a cosine was calculated between this text and the section of the persona narrative that discussed the persona’s usage of the IR (i.e., the last paragraph in the persona in Figure 3). These cosines allowed for a determination of whether the four personas were capturing the IR-specific needs of the additional interviewees. The resulting cosines are provided in Table 3.

Interviewee Persona 1

Persona 2

Persona 3

Persona 4

s8 0.40 0.33 0.51 0.31

f10 0.29 0.15 0.27 0.24

f11 0.43 0.36 0.51 0.38

f12 0.48 0.33 0.42 0.39

Table 3. Cosines between remaining interviewees and the personas

As summarized in Table 3, interviewees s8 and f11 possess IR-specific needs that are most similar to persona 3. Both of the interviewees focused on course-related needs that the IR could serve (e.g., pointing students to introductory materials, reviewing old course notes), which reflected the needs of persona 3. On the other hand, none of the personas were able to capture the needs of interviewee f10. This interviewee focused on how his/her field is very concerned with copyrights and how s/he would be reluctant to post any materials to the IR. The interviewee explained, “I don't know what would happen if I or someone posted a manuscript [to the IR] and then they [the journal] published it. I'd be nervous about it.” The four persona narratives did not discuss the reservations of sharing materials due to copyrights, and the low cosines between this interviewee and the four personas are warranted. Finally, the needs of interviewee f12 span across multiple personas (especially personas 1 and 3), and therefore this interviewee’s needs are a combination of the existing personas. In summary, the comparison of the remaining interviewees to the personas concluded that the library’s personas are representing the needs of three of the four interviewees. The library should consider conducting additional interviews to determine whether a fifth persona should be

added to reflect the needs of interviewee f10, or if this interviewee is an outlier with needs that are not representative of the broader university community.

DISCUSSION The application of the LSA methodology illustrated how LSA can be effectively utilized to identify and create personas. In the discussion that follows, we first show that the grouping created by the LSA methodology has significant overlap with the manual grouping of the two experts. We then revisit the three drawbacks of personas, and elaborate upon how the LSA methodology addresses each of the drawbacks. We also acknowledge the limitations of this research, and offer opportunities for future research.

Comparison to Expert Identification of Personas To provide a baseline for comparison, two HCI researchers that possessed significant experience identifying, creating, and using personas (i.e., the “persona experts”) manually grouped the 16 interviewees into 4 personas. These two experts first individually identified observations in each of the interviews. Typically, 8 to 12 observations were identified in each of the transcripts. After the observations were identified, patterns were created. These patterns summarized the findings of multiple observations through a more general statement such as “part of community” or “access to teaching resources.” Three to four patterns were usually identified for each of the interviewees. Interviewees with overlapping patterns then were grouped into personas.

Next, the experts worked on resolving the differences between their individual groupings. After reviewing their observations and discussing the rationale for their groupings, the experts were able to reach agreement on one final grouping of the individuals into four personas. Figure 4 presents the four personas that were identified by the two experts.

Figure 4. Expert grouping of interviewees into personas

The agreement between the two groupings of personas was calculated using the Kappa max statistic [23]. Kappa max extends Cohen’s Kappa statistic – commonly used to measure inter-rater agreement – to a comparison of the agreement between two groupings where preexisting categories do not exist.

Considerable overlap between the LSA and expert groupings existed. A comparison of the two groupings


1507

resulted in a Kappa max of 0.67, which represents substantial agreement [15]. For example, in both groupings persona 4 is composed of only interviewees f4 and f6. Additionally, interviewees f1, f3, f5, and f9 overlap between the two groupings for persona 2, while only the expert grouping contains interviewees f7 and f8.

While a claim cannot be made that either of the groupings is correct or the “gold standard,” the substantial agreement does provide evidence that the LSA grouping is making comparable similarity judgments to the experts. The LSA methodology is making similarity judgments through cosines that have some overlap with the observations and patterns identified by the experts.

Drawbacks Revisited

Subjectivity and Perceived Lack of Rigor In response to the perception that personas are “artsy,” the LSA methodology provides a rigorous methodology for persona identification. Through the application of cluster analysis and LSA, persona identification requires less human judgment – the cosines determine which interviewees constitute personas.

The LSA methodology also provides greater traceability between the interview data and the personas that are written. Through the use of the cosine measures, any skeptics of the resulting personas can be shown how the interviewees were grouped (i.e., step 3), and why they were grouped together into personas (i.e., step 4). If the personas still lack credibility, they can be objectively verified after additional interviews are conducted (i.e., step 5).

Resource Intensive While the LSA methodology still requires considerable effort, improvements are made in certain areas. First, the process of identifying personas is made more automatic. Once interviews are conducted, an individual only has to organize the text contained in the transcripts, feed the text to an LSA application, and then perform a simple clustering technique. Developing a custom LSA routine and not relying on the LSA group’s web site could further automate this process. On the other hand, each of the persona experts spent over 20 hours of effort to identify the observations and patterns, and then group the interviewees.

Secondly, the LSA methodology is more efficient because the personas are less likely in need of immediate verification. When personas are identified through a manual approach, their validity often needs to be determined via additional activities such as surveys and interviews [22]. Due to the greater objectivity in the LSA methodology, this verification step initially could be skipped by organizations.

Finally, as the number of interviews increases, the LSA methodology provides even greater savings. The persona identification within the LSA methodology is very scalable – identifying personas from 40 interviews as opposed to 20

interviews would require less additional effort than with a manual approach. Without LSA, HCI practitioners responsible for identifying personas from interview data would have to read and process roughly twice as much text. On the other hand, additional interviews in the LSA methodology would only cause an increase in the dimensionality of the matrices, and require clustering additional observations.

Require Specialized Skills While individuals with specialized qualitative data analysis skills are still needed, the LSA methodology overcomes the reliance on these individuals in some key areas. With the LSA methodology, HCI practitioners with qualitative research experience are still needed to conduct interviews. Organizations without in-house competencies would still have to seek help from vendors. However, once the interviews are conducted, the LSA methodology decreases the reliance on qualitative data analysis skills. As previously explained, the identification of personas is made significantly more automatic. While the creation of personas still requires some expertise, the cosine matrices for specific questions guide the creation. Practitioner(s) still must be able to write a believable and accurate persona narrative, but the writing process is made less equivocal through the objective representation of why these individuals constitute a persona.

Limitations This study has several limitations that need to be addressed. First of all, the generalizability of the LSA methodology still needs to be examined. In this research, we applied the methodology on only one project, and the validity of the methodology in different contexts needs to be verified.

Secondly, interpreting similarities using the cosine measure presents another limitation because of the lack of research that explicitly defines high and low cosines. Cosines have been used as a measure of similarity in numerous studies that have incorporated LSA (e.g., [6, 20]), but none of the studies that we are aware of have provided reasoning for their interpretation of cosine measures. For example, without providing an explicit rationale for the interpretation, Britt et al. [2] represent a high cosine as 0.70 or greater and a low cosine as 0.20 or lower.

Third, the use of the general semantic space presents another limitation. Even though the general semantic space represents similarities among words in a variety of domains (92,409 total words), some wording used by the interviewees could not be “understood” by the semantic space. For example, many technical terms and names of specific databases and web sites are not contained in the general space. A semantic space that captures all of the language that was used by the interviewees would have provided more accurate results. However, the goals of this study were to provide a general understanding of the similar needs, goals, and frustrations of the interviewees, and


1508

comprehending technical or discipline-specific terminology was not deemed essential to accomplishing this goal.

Fourth, while the LSA methodology does provide greater objectivity, it is not completely free from human judgment. Most significantly, when the personas are created in step 4, an individual still must write the persona narrative. The cosines matrices for specific questions provide guidance for what makes the interviewees similar, but the practitioner(s) writing the persona ultimately determines what will be included in the persona narrative.

Fifth, our methodology introduces new requisite skills into the persona development process. Before employing the LSA methodology, HCI practitioners need to possess an understanding of LSA and cluster analysis – the persona identification and creation process could be error-prone when the project team is inexperienced with the techniques at the core of our methodology.

Sixth, the LSA methodology removes some of the personal interaction that many individuals perceive as valuable during the persona development process. Most significantly, HCI practitioners are less likely to collaboratively discuss and analyze the needs, goals, and frustrations of the interviewees that make up the resulting personas.

Finally, the personas that are identified through our LSA methodology are not guaranteed to be representative of the user community. This limitation is not inherent in our LSA methodology, but in how interviewees are selected. Before the LSA methodology is used, HCI practitioners should devise a sampling strategy that will allow for generalizing to the broader population of target users.

Future Research and Extensions There are several prospects for future research that extend this LSA methodology. Foremost, there are opportunities to improve the objectivity of the LSA methodology. Specifically, the persona creation in step 4 could be further informed by the application of additional text analysis techniques. For example, in addition to determining for which questions the interviewees provided similar answers, a text analysis technique could be applied to summarize the similarities among these answers with a series of representative terms. The current LSA methodology requires an individual to review the answers of the interviewees in their entirety when writing the persona narrative.

Additionally, the ability of the LSA methodology to identify accurate personas from other commonly available sources of textual data also could be investigated. For example, organizations frequently use online questionnaires with open-ended questions to gather general opinions from their customers and users. The ability of the LSA methodology to identify personas directly from these answers could be examined.

Finally, the efficiency and usefulness of the LSA methodology could be further improved through the development of an application that automates each of the steps. The development of applications for specific HCI techniques and methods has been the focus of many recent projects (e.g., [1] [3]). For example, LSA has been used as part of the AutoCWW project at the University of Colorado for the automation of the cognitive walkthrough for the web (a usability inspection technique) [1]. A software application that implements and potentially improves the five steps of our methodology would hopefully facilitate the adoption of our LSA methodology within practice.

CONCLUSIONS Personas help organizations overcome biased (and often incorrect) assumptions about their users, and facilitate the communication of the actual user needs and goals [22]. When we presented the personas to the project sponsors and a broader group of librarians, they were enlightened by the personas. These individuals assumed that an IR would be primarily used for sharing research resources. However, the primary needs of three of the identified personas are related to course-related content. The three personas already possess satisfactory outlets for research resources, but do not have access to any useful systems for finding and sharing course-related content. For example, one of the personas wants to use the IR primarily for accessing lecture notes for specific courses, so that she could evaluate the instructor before taking a course. The personas that were identified through the use of the LSA methodology provided a more accurate representation of the needs of the university’s user community and are reshaping the design of the university’s IR system.

The LSA methodology presented in this study also provides an initial step in the evolution of personas. The utility of personas will be constrained until the process of identifying and creating personas becomes simpler, quicker, and more objective. As history has shown, many methods in user interface design evolve when they are introduced into practice. These methods need to be made more durable, so they can be used effectively by a wide-variety of practitioners. Personas are still in their infancy, and modifications may need to be made to the personas method. With today’s rapid product development lifecycles, personas need to be streamlined to be useful because project leaders are usually not willing to wait multiple weeks for personas to be identified and created. The process of moving from user research to accurate personas needs to be more automated, so that the utility of personas is not marginalized – persona identification and creation should not be a bottleneck in the development lifecycle. While further research still needs to be performed, we have provided an initial step in the evolution of personas into a more streamlined method that has the potential to be useful in a broader range of real-world projects.


1509

REFERENCES 1. Blackmon, M.H., Polson, P.G., Kitajima, M. and Lewis,

C. Cognitive walkthrough for the web. Proc. CHI 2002, ACM Press (2002), 463-470.

2. Britt, M.A., Wiemer-Hastings, P., Larson, A.A. and Perfetti, C. Using intelligent feedback to improve sourcing and integration in students’ essays. International Journal of Artificial Intelligence in Education, 14, 3-4, 2004, 359-374.

3. Chi, E.H., Rosien, A., Supattanasiri, G., Williams, A., Royer, C., Chow, C., Robles, E., Dalal, B., Chen, J. and Cousins, S. The Bloodhound Project: automating discovering of web usability issues using the InfoScent Simulator. Proc. CHI 2003, ACM Press (2003), 505-512.

4. Cooper, A. The Inmates are Running the Asylum. Morgan Kaufmann Publishers, Indianapolis, IN, 1999.

5. Cooper, A. and Reimann, R.M. About Face 2.0. Wiley Publishing, Indianapolis, IN, 2003.

6. Dong, A. The latent semantic approach to studying design team communication. Design Studies, 26, 5, 2005, 445-461.

7. Goodwin, K. Getting from research to personas: harnessing the power of data. Cooper Newsletters, 2002, http://www.cooper.com/content/insights/newsletters/2002_11/getting_from_research_to_personas.asp

8. Grudin, J. and Pruitt, J. Personas, participatory design and product development: an infrastructure for engagement. Proc. of the Participatory Design Conference, CPSR (2002), 144-161.

9. Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E. and Tatham, R.L. Multivariate Data Analysis. Pearson Prentice Hall, Upper Saddle River, NJ, 2006.

10. Hanna, P. Customer storytelling at the heart of business success. Boxes and Arrows, 2005, http://www.boxesandarrows.com/view/customer_storytelling_at_the_heart_of_business_success

11. Landauer, T.K. and Dumais, S.T. A solution to Plato’s problem: the Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 2, 1997, 211-240.

12. Landauer, T.K., Foltz, P.W. and Laham, D. Introduction to latent semantic analysis. Discourse Processes, 25, 2-3, 1998, 259-284.

13. Landauer, T.K., McNamara, D.S., Dennis, S.J. and Kintsch W. Handbook of Latent Semantic Analysis. Erlbaum, Mahwah, NJ, 2007.

14. Landauer, T.K. and Psotka, J. Simulating text understanding for educational applications with latent

semantic analysis: introduction to LSA. Interactive Learning Environments, 8, 2, 2000, 73-86.

15. Landis, J.R. and Koch, G.G. The measurement of observer agreement for categorical data. Biometrics, 33, 1, 1977, 159-174.

16. Lemaire, B. and Dessus, P. A system to assess the semantic content of student essays. Journal of Educational Computing Research, 24, 3, 2001, 305-320.

17. Magliano, J.P., Wiemer-Hastings, K., Millis, K.K., Munoz, B.D. and McNamara, D. Using latent semantic analysis to assess reader strategies. Behavior Research Methods, Instruments & Computers, 34, 2, 2002, 181-188

18. Manning, H., Temkin, B. and Belanger, N. The power of design personas. Forrester Research, 2003, http://www.forrester.com/ER/Research/Report/0,1338,33033,00.html

19. Manning, H., Root, N.L. and Backer, E. Site design personas: how many, how much. Forrester Research, 2005, http://www.forrester.com/Research/Document/Excerpt/0,7211,37110,00.html

20. Moss, J., Kotovsky, K. and Cagan, J. The role of functionality in the mental representations of engineering students: some differences in the early stages of expertise. Cognitive Science, 30, 1, 2006, 65-93.

21. Norman, D. Ad-hoc personas and empathetic focus. Jnd.org, 2004, http://www.jnd.org/dn.mss/personas_empath.html

22. Pruitt, J., and Adlin, T. The Persona Lifecycle: Keeping People in Mind Throughout Product Design. Morgan Kaufmann, San Francisco, CA, 2006.

23. Reilly, C., Wang, C. and Rutherford, M. A rapid method for the comparison of cluster analyses. Statistica Sinica, 15, 2005, 19-33

24. Sinha, R. Persona development for information-rich domains. Extended Abstracts CHI 2003, ACM Press (2003), 830-831.

25. Srihari, S., Collins, J., Srihari, R., Babu, P. and Srinivasan, H. Automated scoring of handwritten essays based on Latent Semantic Analysis. Proc. of Document Analysis Systems, Springer (2006), 71-83.

26. Tan, P., Steinbach, M. and Kumar, V. Introduction to Data Mining. Pearson, Boston, MA, 2006.

27. Zampa, V. and Lemaire, B. Latent semantic analysis for user modeling. Journal of Intelligent Information Systems, 18, 1, 2002, 15-30.


1510

A latent semantic analysis methodology for the identification and creation of personas

Documents