Sexual Violence Risk Assessment: An Investigation of the Interrater Reliability of Professional Judgments Made Using the Risk for Sexual Violence Protocol

Publisher: Routledge. Informa Ltd, registered in England and Wales, registered number 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

International Journal of Forensic Mental Health
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/ufmh20

Alan A. Sutherland a b, Lorraine Johnstone b c, Kate M. Davidson a, Stephen D. Hart d e, David J. Cooke c e, P. Randall Kropp f, Caroline Logan g, Christine Michie c & Ruth Stocks b
a University of Glasgow, Glasgow, UK
b NHS Greater Glasgow & Clyde, Glasgow, UK
c Glasgow Caledonian University, Glasgow, UK
d Simon Fraser University, Burnaby, British Columbia, Canada
e University of Bergen, Norway
f Forensic Psychiatric Services Commission of British Columbia, British Columbia, Canada
g Greater Manchester West Mental Health NHS Foundation Trust, Manchester, UK

To cite this article: Alan A. Sutherland, Lorraine Johnstone, Kate M. Davidson, Stephen D. Hart, David J. Cooke, P. Randall Kropp, Caroline Logan, Christine Michie & Ruth Stocks (2012): Sexual Violence Risk Assessment: An Investigation of the Interrater Reliability of Professional Judgments Made Using the Risk for Sexual Violence Protocol, International Journal of Forensic Mental Health, 11:2, 119-133

To link to this article: http://dx.doi.org/10.1080/14999013.2012.690020


INTERNATIONAL JOURNAL OF FORENSIC MENTAL HEALTH, 11: 119–133, 2012
Copyright © International Association of Forensic Mental Health Services
ISSN: 1499-9013 print / 1932-9903 online
DOI: 10.1080/14999013.2012.690020


Address correspondence to Alan A. Sutherland, Leverndale Hospital, 510 Crookston Road, Glasgow, G53 7TU United Kingdom. E-mail: [email protected]

The RSVP is a set of structured professional judgment guidelines for assessing risk of sexual violence. We investigated the interrater reliability (IRR) of judgments made using the RSVP in a multidisciplinary forensic-clinical context. Raters were 28 forensic mental health and intellectual disability professionals with diverse training and experience. They used the RSVP to evaluate six case vignettes that varied with respect to offense characteristics, clinical complexity, and level of risk. The IRR of ratings for individual risk factors was generally fair. There was a good level of interrater reliability on Summary Judgments and Supervision Recommendations. Interrater reliability was highest when the RSVP was used by professionals who were highly trained in forensic risk assessment. On average, professionals with lower levels of specialist training agreed less with their colleagues and experts, and provided higher estimations of sexual violence risk. Lower levels of agreement were found in cases with moderate levels of complexity and risk. The RSVP can be used to make judgments of risk with adequate levels of interrater reliability. However, this is dependent on the training and expertise of professionals who use the tool. Methodological strengths and limitations are considered, followed by a discussion of implications for training, practice, and future research.

Keywords: Forensic, offenders, risk assessment, sexual violence, structured professional judgment, interrater reliability, Risk for Sexual Violence Protocol

The empirical and professional literature supports the use of the structured professional judgment approach to violence risk assessment. This approach is used to develop comprehensive risk assessments that are based on scientific and expert opinion. It also allows freedom of professional decision making, while aiming to enhance consistency, transparency, and objectivity. The first structured professional judgment approach to violence risk assessment was developed by Kropp and colleagues. They produced the Spousal Assault Risk Assessment Guide (SARA; Kropp, Hart, Webster, & Eaves, 1995). Shortly afterwards, the HCR-20 (Webster, Eaves, Douglas, & Wintrup, 1995; Webster, Douglas, Eaves, & Hart, 1997) was published to aid professionals in the assessment of risk of interpersonal violence. Thereafter followed the Sexual Violence Risk–20 (SVR-20; Boer, Hart, Kropp, & Webster, 1997), the Brief Spousal Assault Form for the Evaluation of Risk (Kropp, Hart, & Belfrage, 2005), and the Guidelines for Stalking Assessment and Management (SAM; Kropp, Hart, & Lyon, 2008).

Using these tools, evaluators are systematically guided through the process of risk assessment, formulation, and management. These methods place an emphasis on improving professionals’ ability to understand and manage risk. The disadvantages of these tools are that they are often time consuming to complete and are susceptible to a greater degree of individual bias than actuarial instruments. Ultimately, they require professionals to make difficult judgments, albeit with the support of evidence-based practice guidelines.

Risk of Sexual Violence

Sexual violence has been defined by Hart and colleagues (2003) as “the actual, attempted, or threatened sexual contact with another person that is non-consensual” (p. 2). This definition includes such acts as rape, sexual touching, exhibitionism, obscene communications, and voyeurism. Broader definitions such as that suggested by the World Health Organization (Krug, Dahlberg, Mercy, Zwi, & Lozano, 2002) would also include acts such as forced abortion and exposure to pornography.

The most widely used structured professional judgment tool for assessing risk of sexual violence is the Risk for Sexual Violence Protocol (RSVP; Hart et al., 2003). The RSVP has evolved from the use of earlier structured professional judgment tools such as the HCR-20 and SVR-20. The RSVP involves six steps in the assessment process which, as well as facilitating the assessment of risk, include a set of guidelines for producing risk management interventions. These steps are: 1) data collection, 2) evaluation of risk factor presence, 3) evaluation of risk factor relevance, 4) identification and description of most likely future risk scenarios, 5) recommendations for case management, and 6) summary judgments of the case.

The RSVP is based on a systematic review of the research evidence and aims to guide professionals in providing risk assessments that are evidence-based and comprehensive. It also aims to help professionals characterize risks and make judgments relevant to risk management. Important features of the RSVP manual are that it provides an evidence-based rationale for each item, clear assessment guidelines, and detailed definitions of terms and ratings.

Reliability of the RSVP

Anecdotal evidence suggests that the RSVP is widely used internationally amongst forensic mental health professionals. Hart and Boer (2009) reported that the RSVP and its predecessor, the SVR-20, have sold several thousand copies worldwide and are published in over seven different languages. It is therefore vital that the reliability and validity of judgments made by professionals using this tool are investigated.

Various methods of assessing the utility of risk assessment are available. For example, predictive validity is a commonly used method. However, there is a growing awareness of the limitations of this approach to validation. Predictive judgments are meaningful when applied to groups of offenders. However, at an individual level, predictions are considered by many to be imprecise (Hart, Michie, & Cooke, 2007).

Although we cannot predict the future with precision, we can work out what might go wrong and take steps to prevent this, that is, risk management. We can also attempt to ensure that risk assessment tools are used fairly and consistently by assessors and that risk management is proportionate to the risks posed. Professional judgment is central to risk assessment, and therefore the evaluation of interrater reliability should be a particular focus of research.


Some published studies have evaluated the interrater reliability of structured professional judgment approaches to sexual violence risk, namely the SVR-20 (Barbaree, Langton, Blanchard, & Boer, 2008; de Vogel & de Ruiter, 2004; Hildebrand, de Ruiter, & de Vogel, 2004; Hill, Habermann, Klusmann, Berner, & Briken, 2008; Rettenberger & Ehler, 2007; Sjostedt & Langstrom, 2002) and the Structured Assessment of Risk and Needs (SARN; Webster et al., 2006). These studies used different statistical indexes of reliability and samples of varying sizes and characteristics (level of experience, profession, training, etc.). With the exception of Webster et al. (2006), studies used only two independent raters to evaluate interrater reliability. Overall these studies achieved very high reliability, with reliability coefficients ranging from “good” to “excellent,” with the majority “excellent.” However, Webster et al. (2006) in one of two studies found “moderate” reliability for the SARN when a large sample (n = 88) of assessors was used. This was notably higher when a smaller sample of expert evaluators (n = 7) was used. Sjostedt and Langstrom (2002) found “poor” reliability (kappa = .36) for summary judgments of risk made using the SVR-20, and this was attributed to variation in the experience of the two raters used. Following further training of the raters, the study was repeated and found “fair” interrater reliability (kappa = .50).

A recent literature review (Hart & Boer, 2009) summarized three very similar unpublished studies that investigated the interrater reliability of the RSVP (Hart, 2003; Watt et al., 2006; Watt & Jackson, 2008). All of the studies were conducted in Canada and were based on file review data from convicted sex offenders. Two of the studies comprised serious offenders (Hart, 2003; Watt et al., 2006). All studies indexed interrater reliability using Case 1 Intraclass Correlation Coefficients (ICC1) calculated for a mixed effects model and absolute agreement. All the studies used two experienced evaluators and a large number of offenders (n ≥ 50).

All studies found that interrater reliability of ratings for individual presence and relevance factors was “good” (ICC1 between .50 and .74) to “excellent” (ICC1 ≥ .75), with the majority “excellent.” Domain ratings for individual factors were derived and interrater reliability was found to be “excellent,” although in the most recent study by Watt and Jackson (2008), domain ICCs for Sexual Violence History and Mental Disorder were only considered “good.” Agreement for summary judgments including Case Prioritization ratings (coded low, moderate, or high) was also “excellent.”

These studies indicate that the RSVP can be used to make reliable judgments using structured professional judgment. However, there are some limitations to the methods used. In particular, the use of highly experienced raters and data from groups of serious offenders is not representative of the heterogeneous characteristics of risk assessors and offenders in many forensic-clinical settings. In many settings, offenders present varying levels of risk and risk assessors have varying professional backgrounds, experiences, and levels of training. It is also important to consider that these studies did not include data from the scenario planning and case management steps of the RSVP. Although not as amenable to statistical analysis, the qualitative data from these steps include important and complicated judgments that should be included in further interrater reliability studies. In addition, none of the above-mentioned studies has investigated potential case or professional factors that are associated with variance in interrater agreement or any other aspects of risk assessment judgments. Yet, these are important questions that may help to develop our insight into the risk assessment process and identify targets for improving interrater reliability. One study of the HCR-20 (de Vogel & de Ruiter, 2004) found that treatment supervisors had made more “low risk” judgments than researchers, and perceived risk was associated with assessors’ attitudes to the offender (feeling relaxed versus feeling controlled and manipulated).

Studies in other domains of clinical practice have investigated predictors of accuracy and interrater reliability of professional judgments. By calculating an index of interrater reliability for individual professionals, it is then possible to explore associations between this index and characteristics of those professionals. For example, Persons and Bertagnolli (1999) used this method and found that professional variables (including experience and level of training) did not predict level of interrater reliability in a CBT case-formulation exercise. However, they did find that level of professional training (PhD trained or not) was the only predictor of professional accuracy in judgments.

Formal training in the use of the RSVP is recommended (Hart et al., 2003) and there is considerable evidence that such user training programs enhance interrater reliability of assessment measures (Muller & Wetzel, 1998; Reichelt, James, & Blackburn, 2003). Consensus decision making is a method of enhancing assessment reliability. However, anecdotal evidence suggests that there is variation in the extent to which risk assessment is a shared process.

Key questions exist about the interrater reliability of the RSVP, thus setting a clear rationale for this study. The main aim of this study is to evaluate the interrater reliability of risk judgments made using the RSVP by trained multidisciplinary professionals within forensic mental health settings. The study also aims to investigate professional agreement with judgments developed in consultation with experts in forensic risk assessment. Other aims are to explore and identify any professional and case-specific associations (risk level and clinical complexity) with variability in reliability and estimation of risk.

Research Questions

Question 1: What level of interrater reliability do multidisciplinary forensic mental health professionals achieve when using the RSVP to make judgments of risk?



Question 2: To what extent do professionals agree with expert ratings when using the RSVP?

Question 3: Is the interrater reliability of risk judgments made using the RSVP associated with characteristics of raters such as profession, years of experience, amount of formal training received, and self-reported variables (e.g., objectivity, adherence to manual, and confidence in accuracy)?

Question 4: Is the interrater reliability of risk judgments made using the RSVP associated with characteristics of cases, such as level of risk and clinical complexity?

METHOD

This study employs a “fully crossed” design, also called a factorial or “rater × subject” design. Qualified forensic mental health professionals (n = 28) provided brief mock risk assessments for six fictitious case vignettes illustrating varying offense characteristics, levels of risk of sexual violence, and clinical complexity. There were three sets of response variables:

1. Standard RSVP items: Items from Step 2 (evaluation of risk factor presence), Step 3 (evaluation of risk factor relevance), and Step 6 (summary judgments) were administered as published in the RSVP manual.

2. Research items: Forced-choice questions were developed for this study to capture key items from RSVP Step 4 (risk scenario planning), Step 5 (risk management strategies), and an additional research item: overall estimation of risk.

3. Professional information: Self-reported professional variables were: profession, length of clinical experience, length of forensic experience, number of days of formal RSVP training received, perceived confidence in accuracy of judgments, perceived objectivity in decision making, and perceived level of adherence to the RSVP manual.

Questions 1 and 2 were addressed using Case 2 Intraclass Correlation Coefficients (ICC2) and percentage agreement statistics. Data analyses also included evaluation of agreement with expert judgments that were developed in consultation with international experts in forensic risk assessment. Questions 3 and 4 were addressed using correlations between professional variables and participant agreement on standard RSVP items, and comparison of average rates of agreement across cases.
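The correlational step described for Questions 3 and 4 can be illustrated with a minimal sketch. The text does not state which correlation coefficient was used, so Pearson's r is assumed here; the function name `pearson_r` and all values are hypothetical and serve only to show the shape of the analysis (one professional variable and one agreement index per rater).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, one entry per rater (not taken from the study):
# days of formal training and percentage agreement with colleagues.
training_days = [0, 1, 2, 3, 4]
agreement_pct = [50, 60, 65, 80, 85]

r = pearson_r(training_days, agreement_pct)
print(round(r, 2))
```

A rank-based coefficient could be substituted if the professional variables are ordinal or heavily skewed; that choice is not specified in the text.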

Participants

A total of 28 professionals volunteered to participate in this study. All were fully qualified in their profession and were employed by National Health Service Boards throughout Scotland. Participants were recruited through a circular invitation and through attendance at an RSVP training event.

RSVP training event

A training workshop in using the RSVP was delivered by one of the authors (L.J.). The one-and-a-half-day event was attended by 35 professionals from across NHS Scotland Health Boards (Greater Glasgow & Clyde, The State Hospital, Tayside and Grampian). Professionals attended as part of their continuing professional development. Clinical Psychologists and Psychiatrists requested to attend the training directly and Nurse Managers nominated Nurses to attend on the basis that the RSVP was/would become relevant to their clinical work.

The training workshop used didactic teaching, group exercises, and discussion to build client familiarization with the RSVP (background, rationale, and guidelines) and competency in completing each step and item of the RSVP. Following the workshop, 21 attendees agreed to participate in the study. Fifteen sets of study materials were completed following the training event. All participants worked at individual stations in the training center, with each taking approximately four hours to complete the materials. The remaining six participants returned completed materials by post. Because of participant anonymity, it was not possible to obtain data on attendees who did not participate.

The sample consisted of Nurses (n = 13), Clinical Psychologists (n = 8), and Psychiatrists (n = 7) who worked in forensic mental health (n = 23) or intellectual disability (n = 5) settings. Professionals were qualified in their profession for a mean of 11 years and had worked in forensic settings for a mean of seven years. They were recruited from community (n = 7) and secure inpatient (n = 21) settings and had a broad range of experience and familiarity in using the RSVP and other risk assessment tools.

Almost half of the sample (n = 13), comprising Nurses (n = 8), Clinical Psychologists (n = 2), and Psychiatrists (n = 3), had no previous experience of using the RSVP. Five Nurses reported contributing to multidisciplinary risk assessments using the RSVP, whereas the remaining Psychiatrists (n = 4) and Clinical Psychologists (n = 6) had experience of taking a lead role in undertaking numerous risk assessments using the tool. All but one of the participants, a Psychiatrist with five years of forensic experience, had received formal RSVP training. Four of the participants had received formal RSVP training on more than one occasion.

This study did not require participants to make diagnoses of mental disorder as this information was provided in case vignettes. For this reason, all participants (including staff without expertise in diagnosis) met the RSVP user qualifications.

Ethical Approval

Ethical approval was sought and granted by NHS Glasgow and Lothian Research Ethics Committees. Research & Development approval was granted by NHS Glasgow and NHS State Hospital Health Boards. All participants were given a detailed study information sheet and gave informed consent to participate in the study.

Measures

Participants were given the following materials: a complete published version of the RSVP manual (Hart et al., 2003), six fictitious case vignettes, six data collection workbooks, and a purpose-designed professional information questionnaire.

Case vignettes

High-quality case vignettes, loosely based on cases from the research team’s clinical experience, were developed. Details of any actual individuals were significantly altered and anonymized. Vignettes were designed to represent the broad range of clinical complexity, risk of sexual violence, and offense characteristics that are encountered in NHS forensic mental health settings.

To enhance authenticity, cases were written in a standard clinical assessment format that provided both risk-relevant and contextual information. Vignettes were two to four pages long and were structured under the following headings: Sources of Information, Background History (including family, forensic, romantic/sexual, social, psychiatric, employment, and education histories), Index Offense (including witness, victim, police and offender accounts), and Current Presentation (including reports of psychiatric, social, behavioral, and attitudinal presentation). Fictitious names, dates, and other details were specified throughout. Cases were as follows: Bill, low risk level and low-medium clinical complexity; Mathew, low risk level and medium-high clinical complexity; Simon, medium risk level and low-medium clinical complexity; Mark, medium risk level and medium-high clinical complexity; Donald, high risk level and low-medium clinical complexity; and Stuart, high risk level and medium-high clinical complexity.

Expert judgments

Once the vignettes were completed, a panel of six highly experienced expert evaluators was asked to review the cases. This stage was included to provide expert item ratings, and to verify the quality and authenticity of the vignettes. All of the experts are experienced assessors in the RSVP, and all train and supervise others in the use of the RSVP and other structured professional judgment guidelines. Three of the experts (S.H., R.K., & C.L.) were co-authors of the RSVP manual.

Each case was randomly assigned to an expert rater. To minimize demands on the experts’ time, each case was pre-evaluated by the members of the research team. Experts provided detailed evaluation of each case and confirmed their agreement with the majority of pre-ratings. All experts made several amendments to the suggested ratings and requested clarification in relation to some items. In each instance, further information was added to the vignette to facilitate unambiguous rating where possible. A further review of the final evaluations confirmed that the expert judgments adhered to the guidelines set out in the RSVP manual.

Experts also completed a feedback questionnaire that asked about their perceptions of the authenticity, quality, risk level, and complexity of cases. In general, experts were highly approving of the case quality and authenticity. They agreed that cases were consistent with the level of risk and complexity that they were designed to portray. One of the raters raised concerns that many of the additional research items (Steps 4 and 5) were forced-choice, and this did not reflect the complexity of judgments in clinical practice. This rater also expressed the view that scores of “present” and “partially” are both effectively scores of “yes,” and might therefore be dichotomized in analysis. This method would have excluded data from analysis and would not have been amenable to statistical analysis using ICC. Therefore, the decision was made not to use this method.

Data collection workbook and Professional Information Questionnaire

For each case, a nine-page data collection workbook collected forced-choice judgments to gather data on response variables. Tables 1 and 2 provide brief descriptions of the data collection items used. At the end of each case, participants were also asked to provide an overall estimation of sexual violence risk (responses: very low, low, moderate, high, and very high).

Data Analysis

Interrater reliability of individual RSVP items was indexed using ICC. Percentage agreement was used to index interrater agreement and agreement with expert judgments. Analyses were conducted separately for standard RSVP risk factors in Steps 2, 3 and 6; and for the additional items used in Steps 4 and 5.
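To make the percentage agreement indexes concrete, the sketch below shows one plausible way to compute them for forced-choice ratings. The exact computation used in the study is not described in the text, so the pairwise definition, the function names `pairwise_agreement` and `agreement_with_expert`, and the example ratings are all assumptions made for illustration.

```python
from itertools import combinations

def pairwise_agreement(ratings_by_rater):
    """Percentage of identical ratings across all pairs of raters.

    `ratings_by_rater` holds one list per rater; position i in every
    list is that rater's rating of the same item.
    """
    agree = 0
    total = 0
    for a, b in combinations(ratings_by_rater, 2):
        for x, y in zip(a, b):
            agree += (x == y)
            total += 1
    return 100.0 * agree / total

def agreement_with_expert(rater, expert):
    """Percentage of items on which a rater matches the expert rating."""
    matches = sum(r == e for r, e in zip(rater, expert))
    return 100.0 * matches / len(expert)

# Hypothetical ratings of four items by three raters (not study data).
raters = [
    ["Yes", "No", "Possibly", "Yes"],
    ["Yes", "No", "Yes", "Yes"],
    ["Yes", "Possibly", "Possibly", "Yes"],
]
expert = ["Yes", "No", "Possibly", "Yes"]

print(round(pairwise_agreement(raters), 1))
print(agreement_with_expert(raters[1], expert))
```

Unlike the ICC, this index is not corrected for chance and treats a Yes/Possibly disagreement the same as a Yes/No disagreement, which is one reason to report both statistics side by side.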

Intraclass Correlation Coefficient (ICC)

ICC models are based on estimates of mean variability andcan be conceptualized as the ratio of between-groups vari-ance vs. total variance. ICCs provide a measure of chance-corrected agreement by comparing the variability of differentjudges of the same test item to the total variation across alljudges and items. It is the recommended statistic for mea-suring reliability when there are more than two raters and

Dow

nloa

ded

by [

The

Uni

vers

ity o

f M

anch

este

r L

ibra

ry],

[ca

rolin

e lo

gan]

at 1

0:19

26

Oct

ober

201

2


TABLE 1Brief Descriptions of Items from RSVP Steps 2 (Presence), 3 (Relevance) and 6 (Summary Judgments)

STEPS 2-3: IDENTIFICATION OF RISK ITEM PRESENCE AND RELEVANCEClinicians are required to rate the presence (past and recent) and relevance of each of these risk items (Yes, Possibly and No).A. Sexual Violence History

1. Chronicity of Sexual Violence Persistence and frequency of sexual violence (e.g. early onset).2. Diversity of Sexual Violence Diversity in the nature of offending (e.g. offence and victim characteristics).3. Escalation of Sexual Violence Pattern of escalation in offending severity, frequency or diversity over time.4. Physical Coercion in Sexual Violence Actual, attempted or threatened physical harm during the course of sexual violence, or to

further the commission of sexual violence.5. Psychological Coercion in Sexual Violence Acts committed involving either threatened loss or promised gain of status, privilege, favor or

affection.

B. Psychological Adjustment6. Extreme Minimization or Denial of Sexual Violence Failure to admit or accept responsibility for acts of sexual violence and consequences.7. Attitudes That Support or Condone Sexual Violence Beliefs and values that either directly or indirectly encourage or excuse sexual violence.8. Problems with Self-Awareness Lack of self-appraisal of factors or processes that increase the risk of sexual violence.9. Problems with Stress or Coping Unstable psychosocial adjustment and susceptibility to external stressors.

10. Problems Resulting from Child Abuse Serious problems in psychosocial adjustment that are the result of abuse experiences inchildhood or adolescence and that are associated with increased risk of sexual violence.

C. Mental Disorder11. Sexual Deviance Stable pattern of deviant sexual arousal.12. Psychopathic Personality Dis. As defined and assessed by the PCL-R Psychopathy Checklist-Revised (Hare, 1991, 2003).13. Major Mental Illness Substantial impairment in the person’s cognition affect or behaviour.14. Problems with Substance Use Use of legal and illegal substances that cause significant psycho-social impairment.15. Violent or Suicidal Ideation Thoughts, impulses, and fantasies of harming one’s self or others.

D. Social Adjustment16. Problems with Intimate Rels. Failure to establish or maintain stable intimate relationships.17. Problems with Non-Intimate Rels. Failure to establish or maintain positive (pro-social) non intimate relationships. Refers to

conflict, isolation and sexualisation of non-intimate relationships.18. Problems with Employment Failure to establish and maintain stable legal employment or education.19. Non-Sexual Criminality Serious non-sexual criminality.

E. Manageability20. Problems with Planning Failure forming or implementing realistic pro-social life plans.21. Problems with Treatment Failure to benefit from rehabilitative services to address psychosocial difficulties.22. Problems with Supervision Failure to co-operate with supervision services.

STEP 6: SUMMARY JUDGMENTSClinicians are required to make summary judgments in relation to the following:

1. Case Prioritization The degree of effort or intervention that it will require to prevent the person from committingsexual violence: Low /Routine (person is not considered in need of special intervention),Moderate / Elevated (person requires some management strategies), High / Urgent (there isan urgent need to develop a risk management plan for the person).

2. Risk of Serious Physical Harm Severity and imminence of sexual violence that the person might commit: Low, Medium andHigh.

3. Immediate Action Required Need for Immediate Action: Yes, Possibly and No.4. Other Risks Indicated Risk of non-sexual criminality: Yes, Possibly and No.

data are ordered categories (Uebersax, 2009). The weighted kappa statistic is also applicable to this situation. However, the ICC was selected as it yields mathematically equivalent results to weighted kappa (Cicchetti & Sparrow, 1981; Landis & Koch, 1977) and is more common in the field of clinical-forensic research, thus allowing comparability of results to other studies. There are several variations of the ICC statistic (McGraw & Wong, 1996) and each is suitable for different study designs. The Case 2 ICC (two-way random effects) is appropriate here because all judges rate all cases and both can be considered as being drawn from a larger population.

While true randomization of cases and raters was not carried out, random effects are assumed because the study is concerned with subject-specific effects, as opposed to population average effects (or fixed effects). Also, although the assumptions and interpretations are different, numerical values for different two-way ICC models are identical (Nichols, 1998). ICCs were calculated for “absolute agreement” and “single measures.”
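As an illustration of the statistic described above, the following is a minimal sketch of a two-way random effects, absolute agreement, single measures ICC computed from ANOVA mean squares. The function name `icc_a1` and the small ratings matrix are illustrative assumptions, not the study's data or the software actually used, and the sketch assumes a complete matrix with no missing ratings.

```python
from typing import List


def icc_a1(ratings: List[List[float]]) -> float:
    """Two-way random effects, absolute agreement, single measures ICC.

    ratings[i][j] is the rating given to case i by judge j.
    The matrix must be complete (no missing values).
    """
    n = len(ratings)       # number of cases (rows)
    k = len(ratings[0])    # number of judges (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for cases, judges, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Judge variance stays in the denominator, so systematic differences
    # between judges lower the coefficient (absolute agreement).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings: 6 cases (rows) rated by 4 judges (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_a1(example), 2))
```

In this hypothetical matrix the judges rank the cases similarly but differ in overall leniency, which is the situation an absolute agreement coefficient penalizes.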

Following previous studies evaluating the interrater reliability of the RSVP, ICC interpretation guidelines are taken from Fleiss (1981). These are as follows: ICC < .40 = “poor”,



TABLE 2
Brief Description of Additional Research Items Designed to Capture Aspects of RSVP Step 4 (Scenario Planning), Step 5 (Case Management) and the Professional Information Questionnaire

STEP 4: SCENARIO PLANNING
Clinicians were asked to identify plausible ‘repeat’ and ‘escalation’ offense scenarios. In order to quantify characteristics of scenarios, they were required to respond to the following forced choice items:

1. Nature: The type of offense: sexual breach of peace (e.g., harassment), indecent exposure, indecent assault, rape (without serious violence), rape (with serious violence) and sexual homicide.
2. Victim: The likely victim of scenario: prepubescent female, prepubescent male, adolescent female, adolescent male, adult female or adult male.
3. Level of Psychological Harm: Level of psychological harm: none/negligible, minor (short term/mild emotional distress), moderate (medium term/moderate emotional distress) or severe (significant/long-term distress and psychological disturbance incl. PTSD).
4. Level of Physical Harm: Level of physical harm: none/negligible, minor (e.g., grazing), moderate (e.g., cuts and bruises), major (e.g., serious cuts and bruises) or fatal/near fatal.
5. Imminence: Estimated imminence of scenario (from having the opportunity to offend): 1–4 weeks, 6 months, 12 months, or several years.
6. Frequency: Estimated frequency of scenario: unlikely/never, once/twice, several times or habitually/repeatedly.
7. Likelihood: Estimated likelihood of scenario: very low probability, low probability, moderate probability, high probability and very high probability.

STEP 5: CASE MANAGEMENT
Clinicians were required to make forced choice recommendations about the most appropriate supervision and monitoring strategies.

Recommended Level of Supervision: Level of supervision that should be implemented: community outpatient (no supervision in place), community outpatient (supervision in place), inpatient (non-forensic), inpatient (low-secure), inpatient (forensic medium secure), inpatient (forensic high secure).
Recommended Level of Monitoring: Level of monitoring that should be implemented: regular professional contact or mid-appointment telephone calls with relevant professionals.

PROFESSIONAL INFORMATION QUESTIONNAIRE
The questionnaire required participants to provide self-report information in relation to the following:

Profession: E.g., Clinical Psychology, Nursing, Psychiatry.
Work Setting (Current and Previous): E.g., Community, Low Secure, Medium Secure, High Secure.
Client Group (Current and Previous): E.g., Adults, Children & Adolescents, Learning Disability.
Number of Years Qualified in Profession
Number of Years Working in Forensic Settings
Number of Days Formal RSVP Training Received
Perceived Level of Confidence in Accuracy of Judgements: 10 point visual analogue scale (very unconfident – very confident).
Perceived Level of Objectivity when Using the RSVP: 10 point visual analogue scale (very subjective – very objective).
Perceived Level of Adherence to RSVP Manual: 10 point visual analogue scale (manual not consulted at all – manual consulted at all stages).

ICC .40 to .59 = “fair”, ICC .60 to .74 = “good” and ICC ≥ .75 = “excellent.”
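The interpretation bands can be expressed as a simple lookup. The sketch below is illustrative only: the function name `interpret_icc` is hypothetical, and it assumes that .40, .60 and .75 are the lower bounds of the “fair”, “good” and “excellent” bands, so that every coefficient falls into exactly one band.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC value to a descriptive band (assumed cut-offs: .40, .60, .75)."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"


# Example values chosen for illustration.
for value in (0.05, 0.51, 0.60, 0.78):
    print(value, interpret_icc(value))
```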

ICC sample size justification

Walter, Eliasziw, and Donner (1998) provide an equation for calculating the number of raters and subjects required to use ICCs. PASS (Power and Sample Size for Windows; Hintze, 2008) includes this equation and was used to calculate the number of raters and subjects required for this study. Six cases and a minimum of 22 raters are required based on power being set at .80, a null hypothesis of ICC .30 (fair agreement), an alternative hypothesis of ICC .70 (substantial agreement), and significance level of .05. Null and alternative hypotheses of fair and substantial agreement were based on ICC interpretation guidelines by Landis and Koch (1977).¹ A sensitivity analysis revealed that having more than 22 judges has a negligible impact on power.

Percentage agreement

Despite often being neglected, percentage agreement provides essential information about raw agreement at a practical level (Uebersax, 2009). For the purposes of this study,

¹ These interpretation guidelines were used in the research proposal and sample-size estimation. However, guidelines by Fleiss (1981) are used in the analysis to allow comparability with other studies (Hart, 2003; Watt et al., 2006; Watt & Jackson, 2008).



three measures of percentage agreement were used: agreement with expert, agreement with item mode and agreement with item mean (rounded to the nearest integer). Each agreement measure represents the proportion of observations that were in agreement with expert, mode and mean respectively. Percentage agreement was calculated for each item and over all cases and raters. Calculations were adjusted to account for missing ratings.

In this study, the mode and mean reflect the outcomes of different, but equally plausible methods of multidisciplinary decision making. Although the mode for each item represents the most popular rating, the mean might be regarded as the rating that would be reached through a process of negotiation and consensus.
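The three agreement measures can be sketched for a single item and case as follows. This is a hedged illustration rather than the authors' procedure: the function name `percentage_agreement` is hypothetical, missing ratings are represented as `None` and simply dropped from the denominator, ties for the mode are resolved by taking the lowest tied rating, and the mean is rounded half up. The tie and rounding rules in particular are assumptions not stated in the text.

```python
import math
from statistics import multimode
from typing import Dict, List, Optional


def percentage_agreement(
    ratings: List[Optional[int]], expert: int
) -> Dict[str, float]:
    """Percentage of available ratings equal to the expert, mode and rounded mean."""
    valid = [r for r in ratings if r is not None]  # adjust for missing ratings
    if not valid:
        raise ValueError("no ratings available for this item")

    mode = min(multimode(valid))                       # assumed tie rule: lowest mode
    mean = math.floor(sum(valid) / len(valid) + 0.5)   # assumed rule: round half up

    def share(target: int) -> float:
        return 100.0 * sum(r == target for r in valid) / len(valid)

    return {"expert": share(expert), "mode": share(mode), "mean": share(mean)}


# Hypothetical item scored 0, 1 or 2 by six judges, one rating missing.
result = percentage_agreement([2, 2, 1, None, 0, 2], expert=1)
print(result)
```

With skewed ratings such as these, the mode and the rounded mean can point to different values, which is why the two measures can diverge for the same item.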

Missing data

Of the RSVP standard items (Steps 2, 3 and 6) there were 267 missing observations (2% of total 11,760 expected). Of the additional research items, there were 99 missing observations (<3% of total 3,864 expected). Visual inspection revealed that common causes of missing data appeared to be participant error (e.g., a page of responses not completed or two responses given when only one required) and inconclusive scoring (e.g., items not scored with comment from judges stating they are unsure how to respond).

When computing ICCs, it was noted that a single missing observation resulted in the exclusion of the entire data series for that case. The least biased method of resolving this problem was to exclude judges with incomplete data on an item-by-item basis. As has been mentioned, the exclusion of judges has a negligible impact on power whereas the removal of case data substantially reduces power.
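The item-by-item exclusion of judges can be sketched as a filter applied to each item's cases-by-judges matrix before the ICC is computed. The function name `drop_incomplete_judges` and the `None` marker for a missing rating are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def drop_incomplete_judges(
    ratings: List[List[Optional[int]]],
) -> Tuple[List[List[int]], int]:
    """Remove every judge (column) with a missing rating on this item.

    ratings[i][j] is the rating of case i by judge j; None marks a missing value.
    Returns the complete matrix and the number of judges excluded.
    """
    n_judges = len(ratings[0])
    keep = [
        j for j in range(n_judges)
        if all(row[j] is not None for row in ratings)
    ]
    complete = [[row[j] for j in keep] for row in ratings]
    return complete, n_judges - len(keep)


# Hypothetical item: 3 cases, 3 judges, the third judge missed one case.
item = [
    [1, 2, None],
    [2, 2, 1],
    [0, 1, 1],
]
complete, n_excluded = drop_incomplete_judges(item)
print(complete, n_excluded)
```

All cases are retained under this approach; only the number of judges contributing to a given item changes.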

Questions 3 and 4

To address research question 3, the following agreement indices were calculated for each individual participant: 1) average level of perceived risk of sexual violence (average of all cases); 2) average percentage agreement with mode (for standard items Steps 2, 3, and 6); and 3) average percentage agreement with expert judgments (for standard items Steps 2, 3, and 6). Spearman’s Rho (rs) correlations were calculated for the associations between these indices and the continuous variables gathered in the Professional Information Questionnaire.

For all professional variables, assumptions of normality and equal variance were tested. For the majority of variables, the Kolmogorov-Smirnov test indicated that the distribution of data significantly deviated from normal (p < .01). Levene’s test of homogeneity of variance revealed that sample variances were not significantly different (p > .10). As data did not meet these assumptions, and a small number of cases were used, a nonparametric statistic (Spearman’s Rho) was used to address question 3.
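For readers unfamiliar with the statistic, Spearman's Rho is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below uses only the standard library; the function names and the example values (training days and agreement percentages) are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed variables."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: days of training and % agreement with expert for 5 raters.
training_days = [1, 2, 3, 4, 5]
agreement = [50, 60, 70, 80, 70]
print(round(spearman_rho(training_days, agreement), 2))
```

Because only ranks enter the calculation, the coefficient does not depend on the normality of the underlying variables.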

To address question 4, the following agreement indices were calculated for each case: 1) average level of perceived risk of sexual violence (average across judges); 2) average percentage agreement with mode (for standard items Steps 2, 3, and 6); and 3) average percentage agreement with expert judgments (for standard items Steps 2, 3, and 6). Descriptive statistics allow cases to be compared with respect to these variables.

RESULTS

Standard RSVP Items

Question 1: What level of interrater reliability do multi-disciplinary forensic mental health professionals achieve when using the RSVP to make judgments of risk?

Tables 3–5 show the range of percentage agreements (expert, mean, and mode) and ICCs achieved for individual standard RSVP items and domains. The number of professionals excluded due to missing data in the analysis of each item is also shown.

As can be seen, the average percentage agreements across Steps 2, 3, and 6 were as follows: agreement with mode, 71% and agreement with mean, 65%.

The average ICC2 was .51, a fair level of reliability. This level of overall agreement was corroborated by an analysis of all items taken together, which revealed ICC2 = .53, 95% CI [.49, .58]. This supplementary analysis involved stacking all items as independent cases. Missing observations were replaced by the scale midpoint value. This method of replacing missing data may have introduced bias and therefore a sensitivity analysis applying random data imputation was performed. This method revealed an almost identical finding of fair agreement and greater sensitivity, ICC2 = .53, 95% CI [.49, .56].

ICCs ranged from poor (ICC2 = .05) to excellent (ICC2 = .78) with these values correlating positively with values for percentage agreement with expert, rs = .40 (p < .001). Summary Judgments had the highest interrater reliability (ICC2 = .60, good) whereas the Sexual Violence History domain had the lowest mean (ICC2 = .45, fair). Overall, 4% of items achieved excellent reliability, 26% good, 47% fair and 23% poor.

Risk items with good interrater reliability (ICC2 ≥ .60 for past, recent, and future) were Attitudes Supportive of Sexual Offending, Problems Resulting from Child Abuse, Major Mental Illness and Problems with Treatment. Items achieving poor reliability (ICC2 ≤ .39 for past, recent and future) were Psychological Coercion, Problems with Stress or Coping, and Problems with Planning.

Table 5 shows that Summary Judgments about Case Prioritization, Risk of Serious Physical Harm, and Other Risks Indicated all had good interrater reliability (ICC2 all ≥ .60). However, the Immediate Action Required rating had only fair reliability (ICC2 = .43, 95% CI: .20–.82).



TABLE 3
Intraclass Correlation Coefficients and Percentage Agreement Statistics for Sexual Violence History and Psychological Adjustment Domains

                        % in agreement with               ICC
                        Expert  Mode  Mean  n    Excl. n  ICC2  ICC2 95% CI

Domain A: Sexual Violence History
1. Chronicity
  Presence (Past)       60%     76%   62%   25   3        .59   .34 – .90
  Presence (Recent)     60%     66%   60%   23   5        .49   .25 – .86
  Relevance (Future)    72%     72%   68%   24   4        .54   .30 – .88
2. Diversity
  Presence (Past)       58%     74%   63%   23   5        .49   .25 – .86
  Presence (Recent)     60%     69%   50%   22   6        .53   .29 – .88
  Relevance (Future)    69%     69%   62%   23   5        .58   .33 – .89
3. Escalation
  Presence (Past)       54%     76%   67%   24   4        .64   .39 – .92
  Presence (Recent)     44%     63%   48%   23   5        .28   .11 – .72
  Relevance (Future)    58%     66%   59%   25   3        .47   .24 – .85
4. Physical Coercion
  Presence (Past)       80%     80%   80%   25   3        .74   .51 – .95
  Presence (Recent)     54%     65%   48%   24   4        .21   .07 – .65
  Relevance (Future)    57%     63%   50%   25   3        .44   .22 – .83
5. Psychological Coercion
  Presence (Past)       51%     59%   58%   23   5        .39   .18 – .81
  Presence (Recent)     46%     58%   47%   24   4        .09   .01 – .45
  Relevance (Future)    51%     58%   40%   24   4        .21   .07 – .65
Domain Average          58%     68%   58%   –    –        .45   –

Domain B: Psychological Adjustment
6. Minimization/Denial
7. Attitudes
  Presence (Past)       68%     76%   73%   26   2        .63   .38 – .91
  Presence (Recent)     68%     79%   79%   24   4        .71   .47 – .94
  Relevance (Future)    67%     73%   56%   25   3        .60   .35 – .90
8. Self-Awareness
  Presence (Past)       64%     72%   62%   26   2        .54   .29 – .88
  Presence (Recent)     66%     74%   65%   24   4        .58   .34 – .90
  Relevance (Future)    65%     68%   62%   23   5        .42   .20 – .82
9. Stress or Coping
  Presence (Past)       58%     68%   57%   27   1        .13   .03 – .52
  Presence (Recent)     60%     60%   47%   25   3        .13   .03 – .52
  Relevance (Future)    76%     76%   68%   27   1        .05   .00 – .35
10. Child Abuse
  Presence (Past)       83%     83%   83%   27   1        .76   .54 – .95
  Presence (Recent)     73%     80%   80%   26   2        .70   .46 – .93
  Relevance (Future)    75%     78%   78%   27   1        .69   .45 – .93
Domain Average          70%     75%   68%   –    –        .52   –

Question 2: To what extent do professionals agree with expert ratings when using the RSVP?

The average level of agreement with expert was 64% overall, and ranged from 44% to 87% for individual items. As can be seen in Tables 3–5, risk items achieving the highest levels of agreement (≥ 70% for past, recent, and future) were Extreme Minimization/Denial of Sexual Violence, Problems Resulting from Child Abuse, Major Mental Illness, Non-Sexual Criminality, and Problems with Treatment. Risk items achieving the lowest agreement (≤ 55% for past, recent, and future) were Psychological Coercion, Violent/Suicidal Ideation, and Problems with Planning.

Table 5 shows that there was 67% agreement with expert on Case Prioritization. However, the other Summary Judgments had among the lowest agreement with expert (all < 50%). Further exploration of data revealed that, in comparison to experts, participants had overrated risk of Serious



TABLE 4
Intraclass Correlation Coefficients and Percentage Agreement Statistics for Mental Disorder and Social Adjustment Domains

                        % in agreement with               ICC
                        Expert  Mode  Mean  n    Excl. n  ICC2  ICC2 95% CI

Domain C: Mental Disorder
11. Sexual Deviance
12. Psychopathy
  Presence (Past)       79%     79%   79%   25   3        .53   .29 – .88
  Presence (Recent)     80%     80%   80%   26   2        .51   .26 – .87
  Relevance (Future)    56%     68%   63%   25   3        .56   .31 – .89
13. Major Mental Illness
  Presence (Past)       70%     79%   79%   26   2        .71   .48 – .94
  Presence (Recent)     80%     81%   80%   25   3        .78   .57 – .96
  Relevance (Future)    71%     79%   75%   26   2        .67   .42 – .93
14. Substance Misuse
  Presence (Past)       74%     77%   74%   26   2        .20   .06 – .62
  Presence (Recent)     66%     68%   57%   25   3        .45   .22 – .84
  Relevance (Future)    77%     81%   81%   25   3        .30   .12 – .74
15. Violent/Suicidal Ideation
  Presence (Past)       47%     74%   67%   26   2        .59   .34 – .90
  Presence (Recent)     52%     62%   52%   25   3        .45   .22 – .84
  Relevance (Future)    45%     74%   70%   26   2        .59   .34 – .90
Domain Average          65%     74%   70%   –    –        .51   –

Domain D: Social Adjustment
16. Intimate Relationships
  Presence (Past)       70%     70%   67%   27   1        .15   .04 – .55
  Presence (Recent)     60%     71%   54%   25   3        .58   0 – .33
  Relevance (Future)    87%     87%   84%   25   3        .20   .06 – .63
17. Non-Intimate Relationships
  Presence (Past)       65%     74%   72%   27   1        .67   .43 – .93
  Presence (Recent)     58%     66%   55%   24   4        .57   .32 – .89
  Relevance (Future)    62%     70%   69%   26   2        .55   .31 – .89
18. Employment
  Presence (Past)       71%     71%   69%   27   1        .57   .32 – .89
  Presence (Recent)     64%     64%   64%   25   3        .43   .21 – .83
  Relevance (Future)    66%     71%   62%   27   1        .55   .30 – .88
19. Non-Sexual Criminality
  Presence (Past)       81%     85%   81%   28   0        .77   .56 – .96
  Presence (Recent)     80%     82%   68%   23   5        .43   .21 – .83
  Relevance (Future)    73%     81%   81%   25   3        .73   .50 – .94
Domain Average          70%     74%   68%   –    –        .52   –

Physical Harm, underrated Immediate Action Required and underrated Other Risks Indicated.

Research Items

Question 1: What level of interrater reliability do multi-disciplinary forensic mental health professionals achieve when using the RSVP to make judgments of risk?

The average percentage agreements across research items (for Steps 4 and 5) were as follows: agreement with mode, 62% and agreement with mean, 50%. The average ICC2 was .62, a good level of agreement.

As can be seen in Table 6, ICCs ranged from poor (ICC2 = .25) to excellent (ICC2 = .87). Overall, 13% of items achieved excellent reliability, 7% good, 60% fair, and 20% poor. There was poorer agreement on characteristics of escalation scenarios (mean ICC2 = .46) compared to characteristics of repeat scenarios (mean ICC2 = .59). Items achieving good or excellent interrater reliability (ICC2 ≥ .60) were Nature of Scenario (repeat), Victim in Scenario (repeat and escalation), and Recommended Level of Supervision. There appeared to be excellent percentage agreement with mean and mode in relation to Monitoring Recommendations. However, there was insufficient variance in order to calculate an ICC for this item. Items achieving poor reliability were Level of Psychological Harm (escalation scenario), Estimated Imminence of



TABLE 5
Intraclass Correlation Coefficients and Percentage Agreement Statistics for Manageability and Summary Judgment Domains

                              % in agreement with               ICC
                              Expert  Mode  Mean  n    Excl. n  ICC2  ICC2 95% CI

Domain E: Manageability
20. Planning
21. Treatment
  Presence (Past)             74%     74%   68%   26   2        .65   .41 – .92
  Presence (Recent)           71%     73%   73%   26   2        .69   .45 – .93
  Relevance (Future)          74%     77%   75%   27   1        .59   .34 – .90
22. Supervision
  Presence (Past)             63%     68%   58%   26   2        .52   .28 – .87
  Presence (Recent)           66%     66%   55%   26   2        .46   .23 – .84
  Relevance (Future)          65%     68%   61%   27   1        .43   .21 – .83
Domain Average                63%     67%   61%   –    –        .48   –

Step 6: Summary Judgments
  Case Prioritization         67%     67%   65%   24   4        .62   .37 – .91
  Risk of Serious Phys. Harm  49%     68%   68%   24   4        .69   .45 – .93
  Immediate Action Req.       44%     55%   52%   24   4        .43   .20 – .82
  Other Risks Indicated       45%     74%   73%   24   4        .66   .41 – .92
Domain Average                51%     66%   65%   –    –        .60   –

TABLE 6
Intraclass Correlation Coefficients and Percentage Agreement Statistics for Research Items Capturing Aspects of Scenario Planning and Case Management Steps

                                    % in agreement with               ICC
                                    Expert  Mode  Mean  n    Excl. n  ICC2  ICC2 95% CI

Step 4: Scenario Planning
Repeat Scenario
  Nature of Scenario                63%     72%   66%   25   3        .67   .42 – .93
  Victim in Scenario                81%     81%   46%   25   3        .85   .67 – .97
  Level of Psychological Harm       61%     65%   65%   25   3        .56   .31 – .89
  Level of Physical Harm            48%     59%   38%   25   3        .58   .33 – .90
  Estimated Imminence of Scenario   41%     55%   30%   21   7        .46   .22 – .84
  Estimated Frequency of Scenario   46%     61%   61%   25   3        .49   .26 – .86
  Likelihood                        30%     44%   44%   25   3        .52   .28 – .87
Domain Average                      53%     62%   49%   –    –        .59   –

Escalation Scenario
  Nature of Scenario                36%     60%   48%   25   3        .57   .32 – .89
  Victim in Scenario                68%     69%   51%   25   3        .78   .57 – .96
  Level of Psychological Harm       54%     76%   74%   25   3        .25   .10 – .69
  Level of Physical Harm            27%     53%   45%   25   3        .53   .29 – .88
  Estimated Imminence of Scenario   28%     57%   17%   22   6        .32   .13 – .75
  Estimated Frequency of Scenario   32%     47%   37%   25   3        .29   .12 – .72
  Likelihood                        36%     43%   38%   25   3        .48   .24 – .85
Domain Average                      40%     58%   45%   –    –        .46   –

Step 5: Case Management
  Recommended Level of Supervision  71%     71%   71%   24   4        .87   .71 – .98
  Recommended Level of Monitoring   67%     89%   89%   25   3        –     –
Domain Average                      68%     80%   80%   –    –        –     –



Scenario (escalation), and Estimated Frequency of Scenario (escalation).

Question 2: To what extent do professionals agree with expert ratings when using the RSVP?

The average level of agreement with expert was 49% and ranged from 27% to 81%. Items achieving the highest agreement with expert (≥ 70%) were Victim (in repeat scenario) and Recommendations for Supervision. Items achieving the least agreement with expert (≤ 30%) were Likelihood (of repeat scenario), Level of Physical Harm (escalation scenario), and Imminence (of escalation scenario). Further exploration of data revealed that, in comparison to experts, participants had on average underrated Likelihood (of escalation scenario), overrated Level of Physical Harm (in escalation scenario) and underrated Imminence (of escalation scenario).

Question 3: Is the interrater reliability of risk judgments made using the RSVP associated with characteristics of raters, such as profession, years experience, amount of formal training received and self-reported variables?

Spearman’s Rho correlation coefficients were calculated between continuous professional variables and: average agreement with expert, average agreement with mode, and mean estimation of sexual violence risk. Correlations were calculated from the data of all judges (n = 28) and are reported in Table 7 below.

As can be seen, this analysis revealed only three significant correlations, each involving the number of formal RSVP training days attended. There were significant positive correlations between number of days formal RSVP training received and: average agreement with expert (rs = .50, p < .01) and average agreement with mode (rs = .46, p < .05). There was also a significant negative correlation (rs = –.56, p < .01) between number of days formal RSVP training received and mean estimation of sexual violence risk.

TABLE 7
rs (2-tailed): Associations Between Professional Variables and Average Percentage Agreement (with Gold-Standard and Mode) and Overall Estimation of Sexual Violence Risk

                                                Average % in agreement with
                                                Expert   Mode     Mean estimation of sexual violence risk
Number of years qualified in profession         −.23     .03      −.16
Number of years in forensic setting             −.30     .10      −.13
Number of days RSVP training received           .50∗∗    .46∗     −.56∗∗
Self-reported adherence to manual               −.03     .24      −.33
Self-reported confidence in judgment accuracy   .18      −.23     −.28
Self-reported objectivity of judgment process   −.23     −.25     .43

∗∗ denotes a correlation that is significant at the .01 level. ∗ denotes a correlation that is significant at the .05 level.

Question 4: Is the interrater reliability of risk judgments made using the RSVP associated with characteristics of cases, such as level of risk and clinical complexity?

Table 8 compares the average percentage agreements (expert, mean, and mode) across cases for key Summary Judgments (Case Prioritization, Risk of Serious Physical Harm, and Immediate Action Required).

For each of the key Summary Judgments, the highest average percentage agreements with mode and mean were for the cases at either extreme of the risk/complexity spectrum (Bill and Stuart). Bill was the lowest risk / lowest clinical complexity case whereas Stuart was the highest risk / highest clinical complexity case. For Case Prioritization, the case achieving the lowest average agreement with mode, mean and expert was Donald (high risk, low-medium clinical complexity). It may be relevant that Donald was unique in the important aspect that he denied his sexual offenses.

For Risk of Serious Physical Harm, Simon (medium risk, low-medium clinical complexity) and Mark (medium risk, medium-high clinical complexity) achieved the lowest level of agreement with mean and mode. Both of the high risk cases had a very low level of percentage agreement with expert judgment. In both instances, experts had estimated a moderate risk of serious physical harm, whereas the vast majority of judges had estimated this as high.

DISCUSSION

Interrater Reliability and Agreement with ExpertJudgments

The finding that there was a good level of interrater reliability on Summary Judgments and Supervision Recommendations is very encouraging as these domains have important implications for case management.

From the available information there does not appear to be any specific reason why some items achieved greater interrater reliability than did other items. Differences may be attributable to aspects of case vignettes, the formal training provided, or professional expertise. It is noteworthy that a similar study by Watt and colleagues (2006)² did not identify the same set of items as achieving superior interrater reliability.

² Previous studies investigating the interrater reliability of the RSVP were all unpublished. Watt and colleagues kindly forwarded the results of their conference posters but the other studies (Hart, 2003; Watt & Jackson, 2008) were not available.



The finding that participants were considerably more reliable in their judgments about repeat scenarios than escalation scenarios is understandable given that professionals can refer to previous behavior in making judgments about repeat scenarios as opposed to scenarios that have not previously occurred. It highlights the difficulty and subjectivity involved in speculating about possible future offense scenarios.

Differences between experts’ and professionals’ ratings may be attributable to a number of factors associated with expertise in risk assessment. However, the present study design does not allow parameters to be attributed to the variability in responses between the two groups.

Professional and Case-Specific Factors

The positive relationship between the amount of formal RSVP training received and professional agreement (with mode and expert) would suggest that formal RSVP training has an important positive influence on improving interrater reliability and concordance with expert opinion. However, it may be indicative of other important factors such as professional background, qualifications, and other specialist training.

The finding that other professional variables such as experience and confidence did not correlate with agreement indices is consistent with a well-established literature suggesting that evaluators tend to be overconfident in their judgments (Harvey, 1997), including regarding forensic risk (Desmarais, Nicholls, Read & Brink, 2010).

It is perhaps not surprising that there was a negative correlation between the number of days training received and average estimation of sexual violence risk. This suggests that evaluators who have had less training were, in addition to being less reliable, also more likely to overestimate risk. It is possible that this relationship might be mediated by the effect of evaluators with less training being more cautious in their decision making regarding risk.

Analysis of agreement across cases revealed an interesting pattern suggesting that it is more difficult to achieve adequate interrater reliability on cases where there are middling levels of case complexity and risk. It is understandable that there may be greater ambiguity and confusion in relation to the middle “grey area” cases.

Methodological Strengths and Limitations

A strength of this study is the recruitment of professionals representing a breadth of skills, training, and experience found in forensic mental health and intellectual disability services. The diversity of this sample enhances the ecological validity of the study and has allowed for the further exploration of professional variables that might predict the reliability of individual professionals. However, this variability may explain why this study achieved lower levels of interrater reliability than did previous studies. This explanation has also been given by authors of studies evaluating the interrater reliability of the SVR-20 (Sjostedt & Langstrom, 2002) and SARN (Webster et al., 2006). These two studies also used many raters of varying expertise and similarly found lower levels of interrater reliability than would have been predicted by studies using fewer highly experienced raters.

It might also be argued that use of a large proportion of Nurses is not adequately representative of current clinical practice. At present, in the NHS context, Forensic Nurses would be expected to contribute to aspects of the risk assessment process, with Clinical Psychologists and Psychiatrists taking overall responsibility and carrying out risk assessments more frequently. Therefore, this study would have been more representative had it included a larger proportion of Clinical Psychologists and Psychiatrists. It is likely that the inclusion of a greater number of Clinical Psychologists and Psychiatrists would have enhanced the interrater reliability of the tool.

TABLE 8
Average Percentage Agreement (‘Gold-Standard’, Mode and Mean) for Key Summary Judgments for Each Case Vignette

                                  Bill          Mathew        Simon         Mark          Donald        Stuart
Risk Level                        Low           Low           Med.          Med.          High          High
Clinical Complexity               Low - Med.    Med. - High   Low - Med.    Med. - High   Low - Med.    Med. - High

Summary Judgment 1: Case Prioritization
% in agreement with Expert        81%           61%           61%           57%           56%           88%
% in agreement with sample Mode   81%           61%           61%           57%           56%           88%
% in agreement with sample Mean   81%           61%           61%           57%           41%           88%

Summary Judgment 2: Risk of Serious Physical Harm
% in agreement with Expert        81%           79%           46%           46%           22%           20%
% in agreement with sample Mode   81%           79%           46%           46%           78%           80%
% in agreement with sample Mean   81%           79%           46%           46%           78%           80%

Summary Judgment 3: Immediate Action Required
% in agreement with Expert        37%           57%           11%           46%           44%           68%
% in agreement with sample Mode   59%           57%           54%           46%           48%           68%
% in agreement with sample Mean   59%           57%           36%           46%           48%           68%



In response to these potential criticisms it is important to recognize that all of the study participants met the RSVP user requirements and, except for one experienced Psychiatrist, had received formal accredited training in using the RSVP. Professionals participated in this study because they either currently used, or were increasingly being required to use, the RSVP in their clinical work.

Although case vignettes provided sufficient information for the purposes of completing the RSVP, the validity of this study would have been strengthened by the use of complete case files, perhaps accompanied by audio or video recordings of clinical interviews. The present design is not able to evaluate interrater agreement during the information-gathering phase of risk assessment. If the participant time and resources were available to repeat this study using such materials, a greater level of participant familiarization with cases, and thus reliability in judgments, might be expected. Alternatively, it is possible that more information might have led to greater difficulty and inconsistency in judgments.

Although 28 participants is more than a sufficient sample size for this study, the use of six case vignettes achieves only the minimum level of statistical power that is adequate. By using a process of expert review, this study has attempted to maximize the validity and authenticity of the cases used. Nevertheless, a greater number of cases would have improved the representativeness and statistical power of this study.

Although an improvement over previous studies that did not evaluate Scenario Planning or Case Management steps, the forced-choice method used here to evaluate these domains has some limitations. It is not a valid reflection of the published RSVP manual and it yields data that are limited in comparison to the qualitative feedback normally given for these steps. It was for these reasons that these research items were analyzed and discussed separately from the standard RSVP items.

With the selection of cases and raters, it would have enhanced the generalizability of these results had procedures been used to randomize or match samples from the population settings. It would also have been useful to have interpreted results in the context of information about the population settings from which samples were derived, such as in comparison to the ratio of offender typologies found in clinical practice.

Recommendations for Further Research

Future research into the psychometric properties of the RSVP and other structured professional judgment approaches should seek to address the methodological issues described above. First, studies will be strengthened by use of medium-large samples of professionals who are fully qualified to use the RSVP and are clinically representative of professionals using the tool in clinical practice. Second, cases of varying offense characteristics, risk, and complexity should be used. Third, it is desirable to use authentic case information (including file review and audio-visual recordings). Fourth, it may be valuable for future studies to use qualitative research methodologies to more comprehensively evaluate qualitative judgments of the RSVP. Finally, it will be important to further investigate professional and case-specific associations with interrater agreement. Such studies may help to identify targets for the improvement of risk assessment training programs and ongoing supervision. In addition, it will be interesting to replicate this study in other settings where the RSVP is routinely used (e.g., criminal justice). Some of these recommendations have also been outlined by Hart and Boer (2009).

Implications for Practice

Results suggest that caution should be exercised when using the RSVP, especially when it is used by less qualified assessors, when rating particular items, and in cases of middling complexity and risk.

Trainers in the RSVP may wish to pay particular attention to certain items identified in this study and practitioners may wish to seek consultation when making these judgments. This study highlights the value of formal training and suggests that there is a need to provide supervision, training and support to professionals who have limited background training and competency in risk assessment.

Throughout the course of this study, several participants commented on the value of using case vignettes as an adjunct to training. Although case vignettes are used in formal training, submission of further mock risk assessments as ‘homework’ could be a prerequisite of training certification or accreditation to use the tool. Online/electronic surveys may offer cost-effective ways of gathering and analyzing these data. Similarly, mock risk assessments could be used to audit existing risk assessment practice and to provide individual feedback on agreement with colleagues and experts. Other instrument training programs have used such feedback to monitor and calibrate user rating standards (Reichelt et al., 2003; Muller & Wetzel, 1998).
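As an illustration of the kind of individual feedback described above, the following minimal sketch compares one trainee's summary risk ratings on a set of mock cases with expert consensus ratings. The three-level coding, the function name `agreement_feedback`, and all ratings are hypothetical assumptions for the example; they are not taken from the RSVP manual or from the study data.

```python
from statistics import mean

# Ordinal coding of summary risk ratings (an assumption for this sketch).
LEVELS = {"low": 0, "moderate": 1, "high": 2}


def agreement_feedback(trainee, expert):
    """Return the exact-agreement rate and the mean signed difference.

    A positive mean signed difference means the trainee tends to rate
    risk higher than the expert consensus (overestimation).
    """
    if len(trainee) != len(expert) or not trainee:
        raise ValueError("rating lists must be non-empty and of equal length")
    t = [LEVELS[r] for r in trainee]
    e = [LEVELS[r] for r in expert]
    exact = mean(1 if a == b else 0 for a, b in zip(t, e))
    bias = mean(a - b for a, b in zip(t, e))
    return {"exact_agreement": exact, "mean_signed_difference": bias}


# Hypothetical ratings for six mock cases.
trainee_ratings = ["high", "moderate", "high", "low", "high", "moderate"]
expert_ratings = ["moderate", "moderate", "high", "low", "moderate", "low"]

feedback = agreement_feedback(trainee_ratings, expert_ratings)
print(feedback)
```

In this invented example the trainee matches the expert rating in half of the cases and, where the two differ, always rates risk higher. A report of this kind could be returned to individual trainees, although a formal audit would more plausibly rely on chance-corrected agreement statistics.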

CONCLUSION

This study indicates that the RSVP can be used to attain adequate levels of interrater reliability. However, this is dependent on the training and expertise of professionals who use the tool. There is a need to provide supervision and training to professionals who are less competent in risk assessment. This is particularly in relation to specific items of the RSVP and in cases with middling levels of complexity and risk, where lower levels of agreement were found. This need for training and supervision is further highlighted when one considers that professionals with lower levels of specialist training agreed less with their colleagues and experts, and also overestimated sexual violence risk. This study has relevance to professionals using this tool and has revealed important findings for the development of training, practice, and future research.

REFERENCES

Barbaree, H. E., Langton, C. M., Blanchard, R., & Boer, D. P. (2008). Predicting recidivism in sex offenders using the SVR-20: The contribution of age-at-release. International Journal of Forensic Mental Health, 7, 47–64. doi:10.1080/14999013.2008.9914403

Boer, D. P., Hart, S. D., Kropp, P. R., & Webster, C. D. (1997). Manual for the Sexual Violence Risk - 20: Professional guidelines for assessing risk of sexual violence. Vancouver, Canada: British Columbia Institute Against Family Violence.

Cicchetti, D. V., & Sparrow, S. A. (1981). Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 86, 127–137.

de Vogel, V., & de Ruiter, C. (2004). Differences between clinicians and researchers in assessing risk of violence in forensic psychiatric patients. Journal of Forensic Psychiatry and Psychology, 15, 145–164. doi:10.1080/14788940410001655916

Desmarais, S. L., Nicholls, T. L., Read, J. D., & Brink, J. (2010). Confidence and accuracy in assessments of short-term risks presented by forensic psychiatric patients. Journal of Forensic Psychiatry and Psychology, 21, 1–22. doi:10.1080/14789940903183932

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley & Sons.

Hare, R. D. (1991). The Hare Psychopathy Checklist-Revised Manual. North Tonawanda, NY: Multi-Health Systems.

Hare, R. D. (2003). The Hare Psychopathy Checklist-Revised Manual (2nd ed.). North Tonawanda, NY: Multi-Health Systems.

Hart, S. D. (2003, April). Assessing risk for sexual violence: The Risk for Sexual Violence Protocol (RSVP). Paper presented at the annual meeting of the International Association of Forensic Mental Health Services, Vienna, Austria.

Hart, S. D., & Boer, D. P. (2009). Structured professional judgment approaches for sexual violence risk assessment. In R. K. Otto & K. S. Douglas (Eds.), Handbook of violence risk assessment (pp. 269–294). New York: Routledge.

Hart, S. D., Kropp, P. R., Laws, D. R., Klaver, J., Logan, C., & Watt, K. A. (2003). The Risk for Sexual Violence Protocol (RSVP): Structured professional guidelines for assessing risk of sexual violence. Burnaby, Canada: Mental Health, Law, & Policy Institute, Simon Fraser University.

Hart, S. D., Michie, C., & Cooke, D. J. (2007). Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group vs. individual predictions of violence. British Journal of Psychiatry, 190(49), 60–65. doi:10.1192/bjp.190.5.s60

Harvey, N. (1997). Confidence in judgement. Trends in Cognitive Sciences, 1, 78–82. doi:10.1016/S1364-6613(97)01014-0

Hildebrand, M., de Ruiter, C., & de Vogel, V. (2004). Psychopathy and sexual deviance in treated rapists: Association with sexual and nonsexual recidivism. Sexual Abuse: A Journal of Research and Treatment, 16, 1–24. doi:10.1177/107906320401600101

Hill, A., Habermann, N., Klusmann, D., Berner, W., & Briken, P. (2008). Criminal recidivism in sexual homicide perpetrators. International Journal of Offender Therapy and Comparative Criminology, 52, 5–20. doi:10.1177/0306624X07307450

Hintze, J. (2008). PASS (Power and Sample Size for Windows). Kaysville, Utah: NCSS, LLC.

Kropp, P. R., Hart, S. D., & Belfrage, H. (2005). The Brief Spousal Assault Form for the Evaluation of Risk (B-SAFER): User manual. Vancouver, Canada: ProActive ReSolutions Inc.

Kropp, P. R., Hart, S. D., & Lyon, D. R. (2008). The Stalking Assessment and Management Guidelines (SAM): User manual. Vancouver, Canada: ProActive ReSolutions Inc.

Kropp, P. R., Hart, S. D., Webster, C. D., & Eaves, D. (1995). Manual for the Spousal Assault Risk Assessment Guide (2nd ed.). Vancouver, Canada: British Columbia Institute on Family Violence.

Krug, E. G., Dahlberg, L. L., Mercy, J. A., Zwi, A. B., & Lozano, R. (2002). World report on violence and health. Geneva, Switzerland: World Health Organization.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. doi:10.2307/2529310

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46. doi:10.1037//1082-989X.1.1.30

Muller, M. J., & Wetzel, H. (1998). Improvement of interrater reliability of PANSS items and subscales by a standardized rater training. Acta Psychiatrica Scandinavica, 98, 135–139. doi:10.1111/j.1600-0447.1998.tb10055.x

Nichols, D. P. (1998). Choosing an Intraclass Correlation Coefficient. SPSS Keywords, 97. Retrieved March 2012 from University of California Los Angeles, Academic Technology, SPSS Library: www.ats.ucla.edu/stat/spss/library/whichicc.htm

Persons, J. B., & Bertagnolli, A. (1999). Interrater reliability of cognitive-behavioral case formulations of depression: A replication. Cognitive Therapy and Research, 23, 271–283. doi:10.1023/A:1018791531158

Reichelt, F. K., James, I. A., & Blackburn, I. (2003). Impact of training on rating competence in cognitive therapy. Journal of Behavior Therapy and Experimental Psychiatry, 34, 87–99. doi:10.1016/S0005-7916(03)00022-3

Rettenberger, M., & Eher, R. (2007). Predicting re-offense in sexual offender subtypes: A prospective validation study of the German version of the Sexual Offender Risk Appraisal Guide (SORAG). Sexual Offender Treatment, 2, 1–12.

Sjostedt, G., & Langstrom, N. (2002). Assessment of risk for criminal recidivism among rapists: A comparison of four different measures. Psychology, Crime & Law, 8, 25–40. doi:10.1080/10683160208401807

Uebersax, J. (2009). Statistical methods for rater and diagnostic agreement. Retrieved May, 2009, from www.john-uebersax.com/stat/agree.htm

Walter, S. D., Eliasziw, M., & Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine, 17, 101–110. doi:10.1002/(SICI)1097-0258(19980115)17:1<101::AID-SIM727>3.0.CO;2-E

Watt, K. A., Hart, S. D., Wilson, C., Guy, L., & Douglas, K. S. (2006, March). An evaluation of the Risk for Sexual Violence Protocol (RSVP) in high risk offenders: Interrater reliability and concurrent validity. Paper presented at the annual meeting of the American Psychology-Law Society, St. Petersburg, FL.

Watt, K. A., & Jackson, K. (2008, July). Interrater and structural reliabilities of the Risk for Sexual Violence Protocol (RSVP). Paper presented at the annual meeting of the International Association of Forensic Mental Health Services, Vienna, Austria.

Webster, C., Douglas, K., Eaves, D., & Hart, S. D. (1997). HCR-20: Assessing risk for violence, Version 2. Burnaby, Canada: Mental Health, Law & Policy Institute, Simon Fraser University.

Webster, C., Eaves, D., Douglas, K., & Wintrup, A. (1995). The HCR-20 scheme: The assessment of dangerousness and risk. Version 1. Burnaby, BC, Canada: Simon Fraser University and Forensic Psychiatric Services Commission of British Columbia.

Webster, S. D., Mann, R. E., Carter, A. J., Long, R., Milner, R. J., O’Brien, M. D., Wakeling, H. C., & Ray, N. L. (2006). Interrater reliability of dynamic risk assessment with sexual offenders. Psychology, Crime & Law, 439–452. doi:10.1080/10683160500036889
