Exploring Statistical Approaches to Auditory
Brainstem Response Testing
Mohammad Khan
Student ID: 25544209
A dissertation submitted in partial fulfilment of the requirements for the
degree of Bachelor of Science in Audiology
University of Southampton
Faculty of Engineering and the Environment
Institute of Sound and Vibration Research
Supervisor: Dr Steven Bell
January 2015
Word Count: 18,507
(Excluding front matter, tables, graphs, references and appendices)
Declaration
I, Mohammad Khan, declare that this thesis is my own work, except where acknowledged, and that
the research reported was conducted in accordance with the principles and regulations outlined by
the University of Southampton, Institute of Sound and Vibration Research. Ethics approval was not
required for this study as no human participants were tested.
Acknowledgements
Thanks go to my supervisor, Dr Steven Bell, for his time, guidance and ongoing support throughout
the course of this research project. I would also like to express my gratitude to all the lecturers of
Audiology at the University of Southampton who have assisted and guided me throughout the degree.
Additionally, I would like to thank Guy Lightfoot and John Stevens for allowing me to use the original
data that they collected. Lastly, thanks go to my family and friends for their continual support and
patience throughout this project and the entire BSc degree.
Abstract
Objective: The Auditory Brainstem Response (ABR) is an important test that is used primarily for the
detection of a hearing loss in newborn infants. Currently, the method of interpreting an ABR is purely
by visual inspection in accordance with the Newborn Hearing Screening Programme (NHSP) protocol, which allows for variability in the interpretation of a waveform due to its subjective nature. Therefore,
the implementation of an objective statistical measure to interpret an ABR is highly desirable.
Method: Comparisons of experts’ interpretations of ABR waveforms were made to measure the level of variability present. 93 averaged ABR traces obtained from 26 babies who failed newborn screening were used. The sensitivity and specificity of two objective parameters, Fsp and Autratio (a parameter which uses the 3:1 NHSP rule), were also explored using the experts’ detection as the gold standard. Additionally, data
were simulated to propose critical values for Fsp and Autratio. Bootstrap analysis was applied
throughout testing to indicate significance levels for values produced by the parameters.
Results: A high level of variability was found between four experts when interpreting ABRs (Kappa
<0.9). Overall, the application of the bootstrap method produced very advantageous effects in terms
of sensitivity and specificity levels for both Autratio and Fsp. Critical values of 3.0 and 5.2 were found
for Fsp and Autratio respectively for the bootstrap distribution α=5% (p≤0.05) for the detection of a wave V.
Conclusion: The high level of variability between clinicians is of great concern and should be addressed
by the NHSP. The application of an automatic version of the 3:1 rule combined with bootstrapping is
still not a viable option due to its poor specificity levels. However, the application of bootstrapping
will allow comparisons to be made across studies. Future work should explore the critical values
proposed in this report and address the many limitations mentioned.
An Auditory Evoked Potential (AEP) is electrical activity within the auditory system which is evoked
using an acoustic stimulus. The three main types of AEPs often recorded by audiologists are the
Auditory Brainstem Response (ABR), the Middle Latency Response (MLR) and the Slow Vertex Response (SVR). The
different responses vary in the site of anatomical generation and in latency of onset. The ABR occurs
about 0 – 10 ms after the stimulus (Coats, 1978) (Hecox & Galambos, 1974) (Picton, et al., 1977)
(Gorga, et al., 1985) (Pratt & Sohmer, 1977), MLR occurs after 10 – 50 ms and the SVR occurs about
50 – 500 ms after the stimulus (Hall, 2006).
These activities are electrical potentials (brain waves) which are represented graphically. The on-going
electrical activity is recorded whilst an acoustic stimulus, such as a tone pip or a click, is presented to the patient’s ear. This causes the activity of interest to arise from the ear, brain and nerves, which travels through various tissues and structures until it is finally detected by surface electrodes (Hall, 2006). AEPs can be categorised according to the different latencies at which the peaks arise in
the waveform. There are short, middle and long latency waveforms. Short latency waveforms are
known to be generated by the auditory nerve and the inner ear structures. As time increases,
responses from the auditory brainstem can be seen along with activity from higher auditory structures
such as the cerebral cortex (Hall, 2006).
AEPs date back to the early 20th century, with the rise of their clinical applications starting from the
1970s due to the growth in availability of powerful computers. Some clinical applications of AEPs are
neonatal screening, threshold detection and detection of neurological disorders along the auditory
pathway.
1.2 Auditory Brainstem Response - ABR
The first description of the human ABR was by Dr. Don Jewett and John Williston in 1970. Since then,
the ABR has been widely implemented in clinics around the world.
The human ABR produces seven distinguishable peaks which are universally labelled using Roman
numerals. Evidence highlights that wave I is from the distal portion of the auditory nerve; wave II from
the proximal portion of the auditory nerve; wave III from the cochlear nucleus and wave IV from the
superior olivary complex and cochlear nucleus. Wave V is thought to be generated from the inferior
colliculus and lateral lemniscus. Wave V is the most commonly used element of the waveform which
determines hearing thresholds using the ABR. Lastly, waves VI and VII are generated by the inferior
colliculus (Weinstein, 2000).
The main application of the ABR in the UK is to test the paediatric population, specifically the newborn
population as behavioural testing is often unachievable. The baby firstly undergoes an otoacoustic
emission and an automated auditory brainstem response as part of the Newborn Hearing Screening
Programme (NHSP) (NHS, 2013). The failure of these two tests would require the patient to visit their
local audiology clinic for an ABR test. This is conducted in order to gain frequency specific information
regarding the child’s hearing status (procedure outlined in section 1.3).
The ABR in newborn hearing testing is often described as objective as this method does not require
patients to produce a conscious response in order for clinicians to determine hearing threshold levels.
However, clinicians visually interpret the waveform using a 3:1 rule as recommended by the NHSP
(Sutton, et al., 2013), which is subjective and gives rise to variability (Hall, 2006) (Vidler & Parker,
2004). Therefore, an objective method to interpret the ABR is highly desirable (Elberling & Don, 1984).
It is crucial that an ABR is interpreted correctly and as accurately as possible as its main application is
to assess hearing threshold levels and to give frequency specific information to identify whether a hearing
impairment is present (Sutton, et al., 2013). This information is very important because the detection
of a hearing loss, especially in paediatrics, offers children the chance to adequately develop their
speech and language skills by using interventions such as hearing aids or cochlear implants. Many
theories proposed by linguists and psychologists, such as Noam Chomsky and B.F. Skinner, hold that it is during childhood that language is acquired (Smith, 2004) (Maltby, et al., 2013). At this critical
time, deprivation of sound will result in unstable development. Accurate interpretations and even
recordings require a skilled clinician as many variables must be taken into account when testing. For
example, the amount of background noise present and the state of arousal of the patient can greatly
influence the quality of a recording. Despite this, the ABR is still currently deemed a very powerful
method of determining the hearing threshold levels for the newborn population (Warren, 1989)
Currently, there are no objective methods that use the 3:1 criterion, where the signal of the response
must be three times that of the background noise in an averaged waveform. Limited research has
been conducted to address this issue, as many researchers have proposed various other objective
methods such as the Fsp (Elberling & Don, 1984), Fmp (a modified version of Fsp) and the ± difference
(Wong & Bickford, 1980). The NHSP recently recommended that the Fsp or Fmp method should be used alongside visual inspection in order to support clinicians in making a decision. However,
research needs to be conducted to address the questions of how accurate the objective methods are
at correctly identifying a response. It is currently not known how these objective parameters compare
to the recommendation of a 3:1 SNR. Therefore, this report aims to synthesise and evaluate the
current literature that is focused on both objective and subjective methods of ABR analysis. In
addition, the implementation of an automatic objective version of the 3:1 rule will be investigated.
1.3 Procedure for testing ABRs in newborn hearing threshold detection
ABR testing is widely used by clinicians to obtain objective hearing thresholds. The main application
of ABR testing in the UK is testing newborn infants. However, it can also be conducted on patients of
all ages if behavioural testing is unavailable. In order to obtain threshold levels using ABR, a systematic
procedure is adhered to as recommended by the NHSP (Sutton, et al., 2013).
Firstly, it is recommended that stage A checks should be conducted at the beginning of each session
(NHS, 2008). Testing should ideally be performed in a sound proofed room, as the presence of
extraneous sound may cause interference with the recordings. Furthermore, all equipment should be
positioned a suitable distance away from the patient in order to reduce the level of electrical
interference. The NHSP also recommend the use of single use Ag/AgCl surface electrodes. Prior to this,
the skin should be gently abraded to allow for adequate impedance levels. Currently an impedance
level of 5kΩ or below is recommended (Sutton, et al., 2013).
A single-channel recording montage is recommended for AC and BC ABR (Sutton, et al., 2013) (Stevens, et al., 2013b). The electrodes should be located as follows:
Positive electrode: High forehead (as near to Cz as possible and midline)
Negative electrode: Ipsilateral mastoid
Common electrode: Contralateral mastoid
Stevens et al., (2013a) report obtaining larger amplitudes of wave V when applying the negative
electrode on the nape of the neck rather than the mastoid. However they report that there is little
difference in test efficacy. When conducting BC ABR, two-channel recordings may be considered to
determine which cochlea is generating the ABR. If used, the montage should be as follows:
Positive electrode: High forehead (as near to Cz as possible and midline)
Negative electrodes: Ipsilateral mastoid and contralateral mastoid
Common electrode: Forehead
However, limitations of this montage for wave V detection are identified by Foxe & Stapells (1993), Small & Stapells (2008) and Stapells & Ruben (1989): it does not identify the generating cochlea correctly in 100% of cases and may falsely label some unilateral conductive losses as sensorineural.
Table 1a. Parameter recommendations by the NHSP (Sutton, et al., 2013) to obtain optimal recordings.
Tone pip ABR, click ABR and narrow-band chirp ABR can be used to measure hearing thresholds.
Studies by Ferm, et al., (2013) and Elberling & Don, (2010) report an advantage of using narrow-band
chirp ABR where the ABR response is usually found to be larger, which in effect reduces test time.
However, as the method of using narrow-band chirp ABR is still relatively new, there is little experience of using it in more severe hearing impairments. Thus it is recommended to use
tone pip ABR as it provides more frequency specific responses (Stevens, et al., 2013b) (Sutton, et al.,
2013).
To present the acoustic stimulus, supra-aural headphones such as the TDH39/49 models or insert earphones (e.g. ER-3A) are suitable. Care should be taken when using insert earphones as they may deliver higher levels to babies due to their smaller ear canals (Sutton, et al., 2013). A B-71
bone vibrator that can present up to 60 dB nHL should be used for BC ABR and should be placed on
the mastoid as it produces a greater level of stimulus compared to being placed on the forehead in
paediatric audiology (Sutton, et al., 2013). Furthermore, Sutton, et al., (2013) state that a sleeping baby is key to a successful test, as interference is kept to a minimum.
An artefact rejection level between 3 μV and 10 μV is recommended, and an initial value of no more
than 5 μV should be set up in test protocols. If the background activity remains above 5 μV for long periods, it is suggested to wait until the activity reduces. If no reduction is observed, then the
rejection level can be raised to a maximum of 10 μV. Filter settings of 30 Hz and 1500 Hz are suggested
by Sutton et al. (2013) as these produce the best SNR of wave V. The use of digital filters is not recommended by the NHSP as they cause changes in the waveform which may give rise to difficulties during interpretation.
Testing is recommended by the NHSP to be started at 60/40 dB nHL (Stevens, et al., 2013b) and the level should be changed in 10 dB steps. However, larger steps may be necessary if the clinician believes that the baby may not be asleep for very long. In this case, steps of 20 dB should be used. Two traces
should ideally be recorded at each stimulus intensity. The clinician needs to label the response as a clear response (CR), result absent (RA) or inconclusive (details of which are discussed later in section
1.4). If a CR is present, then a reduction by 10 dB or 20 dB should be made. If the result at the lower
stimulus level is RA, the clinician should increase by 10 dB (Stevens, et al., 2013b). This down in 20/10
dB, up in 10 dB method should be used and two repeatable traces at each level should be obtained.
The gold standard of threshold is defined as the lowest level at which a CR is present, with a RA
recording at a level 5 or 10 dB below the threshold, and a CR at 5/10 dB above threshold (Sutton, et
al., 2013).
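One step of this ‘down in 20/10 dB, up in 10 dB’ search can be sketched as a small function. This is an illustrative sketch only; the function name and arguments are assumptions, not part of any NHSP software.

```python
def next_stimulus_level(current_db, result, large_steps=False):
    """Suggest the next stimulus level (dB nHL) from the result at the
    current level. result: 'CR' (clear response), 'RA' (result absent)
    or 'Inc' (inconclusive)."""
    if result == "CR":
        # Clear response: step down, by 20 dB if the baby may wake soon
        return current_db - (20 if large_steps else 10)
    if result == "RA":
        # Result absent: step back up by 10 dB
        return current_db + 10
    # Inconclusive: stay at this level and record further traces
    return current_db
```

Iterating this rule brackets the threshold: the lowest level giving a CR, with an RA recorded 5/10 dB below it, matches the gold-standard definition above.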
The minimum criterion to be considered for discharge is 30 dB nHL in both ears using AC tone pip ABR at 4 kHz for newborn patients. This frequency is used as it is the most sensitive to sensorineural hearing losses (SNHL) and is usually the easiest tone pip ABR to record (Stevens, et al., 2013b). After the 4 kHz threshold is obtained, 1 kHz, 2 kHz and 0.5 kHz should be tested, in that order. Both ears must be tested for the newborn baby, giving priority to starting with the better ear.
AC click ABR may be considered only if it is not possible to conduct tone pip ABR or if quick results
need to be found as click stimuli cover a wider range of frequencies (Hall, 2006). In addition, if AC
thresholds are raised, BC thresholds should be measured to 15 dB nHL to check for a conductive
element (Stevens, et al., 2013b).
1.4 Interpretation of the ABR – NHSP Recommendations
When interpreting ABR waveforms, clinicians are less interested in whether a response is normal, and more interested in whether a response is definitely present or not. The detection of a CR
is based on the signal to noise ratio (SNR) of the averaged waveform. For each stimulus level, the result
is marked in one of three ways:
Decision criteria for the result at each stimulus level
CR: Clear response present
RA: Response absent
Inc: Inconclusive
The definitions of CR, RA and Inc are reported by Sutton, et al., (2013):
Definition of a clear response
1. Does the waveform have good morphology, latency and amplitude?
2. Is the wave V peak-to-SN10 amplitude at least 3:1 of the average noise level?
3. Is the wave V peak to SN10 trough > 0.05 µV?
If the answer is yes to all 3 points = CLEAR RESPONSE (CR)
Definition of a response absent
1. Does it not fit the criteria of a clear response?
2. Is the average difference between the 2 traces ≤ 0.025 µV?
If the answer is yes to both points = RESULT ABSENT (RA)
Definition of an inconclusive response
1. Does it not fit the criteria of a clear response?
2. Does it not fit the criteria of a result absent?
If the answer is yes to both points = INCONCLUSIVE
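The three decision rules above can be collected into a single function. This is a hypothetical sketch: the input names (a morphology flag, an SNR, a wave V amplitude and an inter-trace difference) are assumptions about how the measurements would be supplied, not part of the NHSP guidance itself.

```python
def classify_result(good_morphology, snr, wave_v_amp_uv, trace_diff_uv):
    """Sketch of the NHSP decision criteria.
    snr:           wave V peak-to-SN10 amplitude over the average noise level
    wave_v_amp_uv: wave V peak to SN10 trough amplitude, in microvolts
    trace_diff_uv: average difference between the two replicate traces, in uV
    """
    if good_morphology and snr >= 3.0 and wave_v_amp_uv > 0.05:
        return "CR"    # clear response: all three criteria met
    if trace_diff_uv <= 0.025:
        return "RA"    # result absent: not a CR, and the replicates agree
    return "Inc"       # inconclusive: neither definition fits
```

For example, a well-formed trace with an SNR of 3.5 and a 0.1 µV wave V would be labelled CR, while an SNR of 2.0 with replicates differing by 0.05 µV would be labelled Inc.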
The analysis is recommended to be performed by a skilled clinician as experience is key to the
successful interpretation of an ABR as the wave V is sometimes difficult to identify. Some clinicians
use the highest amplitude on the wave to define the peak, whereas others use the shoulder of the wave. The shoulder method is commonly used as waves IV and V are often combined, which does not
result in an obvious peak for each wave. These dissimilarities in methods highlight the variations that
can occur during the interpretation of an ABR between different clinicians, which may cause different
results.
Furthermore, when interpreting an ABR waveform, an important factor to consider is the latency of the wave. Latency is the time between the stimulus and a peak, or between two peaks, and it changes as a function of frequency and intensity. It is important to consider latencies during ABR interpretation as an increase in stimulus intensity decreases the absolute latency of the ABR (Hall, 2006). Lastly, and most importantly, the
interpretation of an ABR waveform includes the identification of a wave V response. As wave V is used
to determine whether a response is present or not, it is vital that this is correctly identified. Incorrect
identification or misinterpretation can lead to problems in language development as sufficient
amplification may not be provided by a hearing aid or a hearing loss may not be detected.
Furthermore, over-amplification is another issue that may arise and can lead to cochlear damage. There are two approaches employed by clinicians when attempting to identify a wave V: they either choose the final point on the waveform before the negative slope that follows (SN10), or select the peak. As waveform morphology may sometimes be irregular, causing issues with
interpretations and repeatability, it is recommended that parameters should be adjusted accordingly
in order to achieve clearer responses which are repeatable (Hall, 2006) (Sutton, et al., 2013).
Evidently, there are many factors to take into consideration during the interpretation of an ABR
waveform. The combination of which results in considerable variability between different clinicians,
emphasising the desirability of an objective approach.
1.5 Signal to Noise Ratio (SNR)
ABR responses that are detected by surface electrodes are extremely small in voltage, and are thus measured in microvolts (µV), one-millionth of a volt. Hence these signals must be amplified up to 100,000 times to be distinguishable. In addition, the signal of interest is concealed
within other brain activity (electroencephalogram - EEG) and also other extraneous electrical signals
from outside the range of the auditory system. Electrical activity which is not of interest and is
unrelated to the auditory stimulus is known as noise. The SNR value thus defines the quality of a
recording by taking into account the level of noise that is present. The higher the SNR, the better the
quality of the recording (Hall, 2006).
There are several factors that may influence the SNR when recording an ABR. For example, if the
patient is not relaxed during the recording and their head is not supported, this may increase the
volume of myogenic noise that is present (Hall, 2006). Problems with myogenic noise are common when conducting ABR testing in infants and children, resulting in difficulties obtaining adequate SNRs. Hence, only if deemed necessary, a sedative such as chloral hydrate is occasionally used to induce sleep, although this method is not routine in the UK (Reich & Wiatrak, 1996). Another way of improving the SNR is to ensure that all unnecessary electrical equipment is turned off. This includes
lights, fans, computers and similar equipment. Additionally, using a quiet (sound-proofed) room will ensure that extraneous background noise does not mask the evoking stimulus or itself generate AEPs which may interfere with the resultant traces. Another factor to consider is the electrical impedance between electrodes. As mentioned previously, an impedance level of 5 kΩ or below is recommended (Sutton, et al., 2013). Higher impedances do not affect the ABR itself, but pick up an increased volume of external electromagnetic interference and of artefacts from movement of the electrode (Bremner, et al., 2012).
Several noise removal techniques are used to improve SNRs. One of the main methods, perhaps the
most important, is signal averaging. This method works on several assumptions (Hall, 2006) (Rice
University, 2014):
1 The evoked response to each stimulus is repetitive and time locked
2 The noise is a randomly varying AC wave (uncorrelated to the stimulus)
3 The temporal position of each stimulus and response waveform are accurately known
The recorded waveform is split up between the evoked response (signal) and the noise. The noise is a
random AC waveform, therefore it is equally likely to be positive as it is negative at a given point in
time and varies from epoch to epoch. As a result, noise will tend to cancel out in the Long Term
Average (LTA). Signal averaging works by presenting the auditory stimulus hundreds of times one after
another, in a systematic fashion, causing evoked responses to be detected. As the evoked activity is
repetitive and time locked to the stimulus, this means that it will always produce electrical activity of
very similar voltage at a specific time after the stimulus. Noise, which is also detected by the skin
electrodes, is presumed to be random; i.e. not time locked to the auditory stimulus nor does it produce
a specific reaction to the auditory stimulus. Thus the process of averaging reduces the amount of noise
present in the recording, increasing the SNR (Rice University, 2014).
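The noise-cancellation argument above can be illustrated with a small simulation. This is synthetic data, not a clinical recording: the sweep count, epoch length, sampling rate and the shape and size of the "wave V"-like signal are all assumed values chosen only to show the effect of averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sweeps = 2000    # number of stimulus presentations (assumed)
n_samples = 240    # samples per 10 ms epoch at an assumed 24 kHz rate
t = np.arange(n_samples)

# Hypothetical time-locked "wave V"-like bump of about 0.3 uV
signal = 0.3 * np.exp(-0.5 * ((t - 140) / 12.0) ** 2)

# Each sweep = the same time-locked signal plus independent random
# noise of ~5 uV RMS (EEG and other interference)
sweeps = signal + rng.normal(0.0, 5.0, size=(n_sweeps, n_samples))

average = sweeps.mean(axis=0)

# Residual noise in the average falls roughly as 1/sqrt(N), so
# averaging 2000 sweeps improves the SNR by about sqrt(2000) ~ 45x.
single_noise = np.std(sweeps[0] - signal)
averaged_noise = np.std(average - signal)
print(single_noise / averaged_noise)   # on the order of sqrt(n_sweeps)
```

A single sweep here has the 0.3 µV response buried in 5 µV of noise; after averaging, the residual noise drops to roughly 0.1 µV and the response becomes visible.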
Bayesian averaging is another type of averaging that can be used to increase SNR. This method works
by weighting individual sweeps inversely with the estimated power of background noise. However,
underestimations of the signal amplitude can sometimes occur (Hall, 2006). Furthermore, sorted
averaging is another method that can be used to reduce the effect of interference. All sweeps are
sorted according to their estimated background noise and then weighted accordingly. Evidence suggests that the use of sorted averaging produces significantly higher SNR levels compared to using Bayesian and standard averaging methods (Mühler & Specht, 1999).
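These weighting ideas can be sketched minimally as follows, assuming the sweeps are the rows of a NumPy array. The per-sweep variance used as the noise estimate, and the fixed fraction of quiet sweeps kept, are illustrative simplifications of the published Bayesian and sorted-averaging methods, not faithful implementations of them.

```python
import numpy as np

def weighted_average(sweeps):
    """Weight each sweep inversely with its estimated noise power,
    so noisier sweeps contribute less to the final average."""
    sweeps = np.asarray(sweeps, dtype=float)
    noise_power = sweeps.var(axis=1)     # crude per-sweep noise estimate
    weights = 1.0 / noise_power
    weights /= weights.sum()             # normalise so weights sum to 1
    return weights @ sweeps              # weighted sum across sweeps

def sorted_average(sweeps, keep_fraction=0.8):
    """Sort sweeps by estimated noise power and average only the
    quietest fraction: a simplified reading of sorted averaging."""
    sweeps = np.asarray(sweeps, dtype=float)
    order = np.argsort(sweeps.var(axis=1))
    n_keep = max(1, int(len(sweeps) * keep_fraction))
    return sweeps[order[:n_keep]].mean(axis=0)
```

With a handful of very noisy sweeps mixed in, both variants produce a quieter average than the plain mean, because the noisy sweeps are down-weighted or excluded rather than averaged in at full strength.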
Artefact rejection is also used alongside signal averaging in order to reduce the levels of noise by
rejecting sweeps that consist of noise above a certain level (Hall, 2006). A rejection level between 3
μV and 10 μV is recommended (Sutton, et al., 2013), meaning that if a signal exceeds this
predetermined limit, it will not be included in the averaged sweep as it is deemed to have excess
interference (Hall, 2006).
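The rejection rule can be sketched as below. This sketch assumes the sweeps are already expressed in µV and, for simplicity, applies the check offline to a stored block of sweeps, whereas clinical systems reject each sweep online before it enters the average.

```python
import numpy as np

def reject_and_average(sweeps_uv, limit_uv=5.0):
    """Discard any sweep whose peak absolute amplitude exceeds the
    rejection level (initially 5 uV, raised to at most 10 uV if the
    background activity stays high), then average the survivors."""
    sweeps_uv = np.asarray(sweeps_uv, dtype=float)
    peaks = np.abs(sweeps_uv).max(axis=1)      # peak |amplitude| per sweep
    kept = sweeps_uv[peaks <= limit_uv]
    return kept.mean(axis=0), len(kept)
```

A sweep containing, say, a 12 µV movement artefact would be dropped entirely rather than allowed to contaminate the averaged trace.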
Another method to increase the SNR is the use of filters. A notch filter at 50 Hz is used to exclude mains hum (Morelle, 2012). Additionally, using a band-pass filter allows traces to be recorded at selected frequency bands, helping to increase SNR levels (Stockard, et al., 1978) (Laukli & Mair, 1981) (Ruth, et al., 1982).
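These filter settings can be sketched with SciPy. The sampling rate and filter orders here are assumed values; `filtfilt` (zero-phase, forward-backward filtering) is chosen because a causal filter would shift peak latencies, which matter in ABR interpretation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 24000.0   # assumed sampling rate in Hz

# Band-pass 30 - 1500 Hz, the setting suggested by Sutton et al. (2013)
b_bp, a_bp = butter(2, [30.0, 1500.0], btype="bandpass", fs=fs)

# Notch at 50 Hz to suppress mains hum
b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=fs)

def condition(trace):
    """Notch out the mains component, then band-pass the trace,
    using zero-phase filtering to avoid latency shifts."""
    return filtfilt(b_bp, a_bp, filtfilt(b_n, a_n, trace))
```

Applied to a trace containing both a 50 Hz interference component and in-band activity at, say, 500 Hz, this removes the former while leaving the latter essentially untouched.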
Despite the various methods used to decrease the level of interference in a recording, clinicians are
still required to be cautious and should not depend solely on these methods. It is essential for clinicians
to apply their skills and knowledge in order to successfully reduce the level of background noise in
order to meet the 3:1 criteria and correctly identify a wave V. Therefore, the goal still remains to put
in place an alternative method to determine the presence of a significant ABR without considerable
variability.
1.6 Methods of ABR analysis
1.6.1 Subjective Measures
The conventional method used by audiologists in the UK to analyse and interpret ABR waveforms is visual inspection, following the NHSP protocol (Sutton, et al., 2013). The
NHSP recommendations were provided in order to help implement a standardised procedure to
reduce the variability that may be present between different clinicians and departments.
Furthermore, the NHSP has provided examples of ‘gold standard’ responses, where the wave V is
labelled by the NHSP for all clinicians to view and recognise (Sutton, et al., 2013). The criteria for a
gold standard response are outlined in section 1.4. There is still no certain way to ensure that all clinicians will identify a wave V in the same place as other clinicians; thus the introduction of a gold standard method does not fully eliminate the high level of variability.
Vidler & Parker (2004) designed an experiment to test the level of variability between professionals
when interpreting waveforms. The subjects were 16 professionals who had a mean experience of 8.41
years with ABR testing, ranging from 1.5 to 25 years. The employment of subjects with a wide range
of experience increases this study’s external validity, as this provides a good representation of the population of audiologists. 15 subjects worked as audiological scientists, while one subject worked as a post-doctoral researcher on AEPs. This may mean that this individual received different training in interpreting ABRs, and may not be representative of a conventional audiologist.
The subjects used a simulator which produced 12 traces that had to be interpreted. The computer
simulator allowed the subjects to have control over obtaining the responses. For example, the
stimulus level at which to start testing, terminate averaging and to record further traces were options
available to the subjects. These options were available to the subjects as an attempt to replicate a real
clinical environment. However, it did not truly represent what a clinician may be presented with in
clinic. The subjects were not given knowledge of previous test results such as failed screening tests,
results of behavioural tests or history taken at the time of acquiring the traces. This may present as a
limitation to this study as the exclusion of this information would not occur in real world scenarios.
Audiologists may have used this information to assist their interpretation of ABRs in clinics, thus the
ecological validity of this study is threatened. However, in order to replicate real clinical situations,
Vidler & Parker included a range of thresholds for which the balance between noise and response
varied.
The results of this test indicate that considerable variability is present between subjects’
interpretations of ABRs. No consistent agreement of thresholds between the 16 subjects was found for any trace. For nine traces, differences of 35 dB or greater were found between estimated
threshold levels. This wide range of estimated thresholds suggests that large differences in patient
management may likely occur.
The crucial question to address is if this level of variability exists due to the nature of the experimental
procedure. If so, it is unlikely to be representative of real clinical practice. The simulation contained
pre-set high and low frequency filter values which presents a limitation as it was reported by some
participants that different filter settings are used in their clinic. It was also reported that SN10
information was lost as a result of the filter setting which may have resulted in an increased difficulty
during the interpretation of waveforms as many clinicians use SN10 to identify wave V. Additionally,
subjects requested features such as traces to be displayed with negative peaks and peak labelling,
however these features were not available. The authors also note that the test material may have
been unrepresentative of clinical practice as it may have been biased towards difficult ABR cases.
There was a limitation as to the number of traces that could be acquired at each stimulus intensity, which may have affected judgements of threshold. There were 30 instances in total where clinicians indicated that recording a third trace would be their next step in data acquisition.
Lastly, data were recorded from the adult population and not the paediatric population. As the main
application of the ABR in the UK is for newborn testing, the adult population does not provide an
adequate representation, reducing the study’s external validity. The combination of the limitations
mentioned above suggest that this test interface did not fully allow the replication of real world
scenarios; a major weakness is present in terms of the ecological validity of this study.
A study was conducted by Gans et al. in 1992 (cited in Vidler and Parker, 2004) who also found
significant levels of variability between clinicians when determining threshold levels using ABR testing.
Additionally, they found that the more experienced testers were more accurate in their identifications
of a wave V. However, the significance of their results is limited as Gans et al. used a sample of nine students, which is unrepresentative of the conventional clinicians who interpret ABRs.
Kuttva et al., (2009) also aimed to investigate the level of variability between clinicians when reporting
ABR thresholds. They looked at the effect of peer review on threshold detection in 76 babies who failed the NHSP screen and were referred to the audiology department for ABR testing. A major advantage of this
study is the use of the paediatric population, who are the main group of patients that undergo ABR
testing. Babies were split into two groups, group A consisted of 38 babies who were tested when no
formal peer review was in place. Group B consisted of another 38 babies where peer review had been
in operation for at least 6 months. It was not stated whether the same or different clinicians examined
groups A and B.
The babies were then tested by individual audiologists and then peer review was conducted by
experts. The experts were two audiologists with at least five years of experience with ABRs. A
limitation to this study is that the details of the initial audiologists were not stated. This means that
critical information, such as the years of experience with ABRs or the number of audiologists used, is
not known, which weakens the study’s external validity as it is harder to generalise
these findings to a larger population of audiologists. Another limitation to this study is that only AC
click stimulus traces were considered for the audit. This does not replicate real world scenarios where
tone pip and bone conduction ABR testing may be used which may yield different resultant waveforms
and thus different interpretations.
Kuttva et al. found differences of up to 20 dB in group A between the threshold levels reported by the tester and the experts. A Wilcoxon signed-rank test revealed a significant difference (p<0.00) between the experts and the testers in group A. Similar findings were present for group B, where differences of up to 35 dB were found between testers and experts. This study supports the findings of Vidler and Parker (2004) and Gans et al. (1992), highlighting the variability that arises from the subjective nature of interpretation between clinicians and therefore emphasising the desirability of an objective method.
1.6.2 Objective Measures
Several researchers have proposed objective methods for analysing ABR waveforms in order to reduce subjectivity and variability. However, these methods do not take into account the 3:1 rule of analysis recommended by the NHSP. This thesis will now focus on the ± difference method proposed by Wong and Bickford (1980), the Fsp parameter proposed by Elberling and Don (1984), and variations of the Fsp.
± Difference
Wong and Bickford (1980) conducted an experiment to determine the presence of a signal in background noise, using the ± difference technique to add objectivity to the analysis. This method provides an alternative technique for estimating the SNR in a recording and allows instant results to be obtained regarding the signal size. The even-numbered stimulus responses are assigned to one group and the odd-numbered responses to another, and their coherent averages are calculated accordingly (Wong & Bickford, 1980). Applying this method during averaging yields an estimate of the variance of the background noise, which aids the removal of runs with poor SNRs. Variances are calculated over intervals of interest in order to find p-values that determine empirically acceptable conditions. This is conducted by first selecting areas of interest or the whole array (excluding stimulus artefacts).
Var(A) = (1/180) Σₜ [A(t) − mean(A(t))]²

Var(A′) = (1/m) Σₜ [A′(t) − mean(A′(t))]²

P = Var(A) / Var(A′)

A = average; A(t) and A′(t) = the noise average samples stored in different arrays
P = p-value (a variance ratio in Wong & Bickford's terminology)
m = number of points in each epoch of 10 ms, represented by 256 words
The 180 points correspond to a 1.17–8.2 ms region of interest

Table 1.6.2a. Formulae for the calculation of the ± difference method (Wong & Bickford, 1980).
The formulae in Table 1.6.2a allow the calculation of the ± difference, which is used to determine whether a significant response is present. If P < 20, this may suggest that the recording is contaminated by artefacts and that no significant response is present. In contrast, if P > 30, the response is significant relative to the background noise.
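To make the calculation concrete, a minimal sketch of the ± difference variance ratio is given below in Python with NumPy. This sketch is not from Wong and Bickford; the array layout, the analysis window and the function name are illustrative assumptions, though the 180-point region mirrors their region of interest.

```python
import numpy as np

def plus_minus_difference_ratio(sweeps, region=slice(30, 210)):
    """Sketch of the +/- difference variance ratio (after Wong & Bickford, 1980).

    sweeps: (n_sweeps, n_samples) array of single-sweep recordings.
    region: analysis window; the default 180-point slice is a placeholder
    standing in for the 1.17-8.2 ms region of interest.
    """
    even_avg = sweeps[0::2].mean(axis=0)   # coherent average of even sweeps
    odd_avg = sweeps[1::2].mean(axis=0)    # coherent average of odd sweeps

    # '+' average contains signal plus residual noise; '-' average
    # cancels any time-locked signal, leaving a noise-only estimate.
    plus_avg = (even_avg + odd_avg) / 2
    minus_avg = (even_avg - odd_avg) / 2

    a = plus_avg[region]
    a_noise = minus_avg[region]
    var_signal = np.sum((a - a.mean()) ** 2) / a.size          # Var(A)
    var_noise = np.sum((a_noise - a_noise.mean()) ** 2) / a_noise.size  # Var(A')
    return var_signal / var_noise   # a large ratio suggests a response above the noise
```

A recording with a genuine time-locked response should produce a much larger ratio than a noise-only recording, since the response survives in the '+' average but cancels in the '−' average.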
Wong and Bickford used two different methods to obtain their data. The first was computer-simulated waveforms, used to compare statistical analysis with visual analysis in the detection of a response. The second was two participants whose data were recorded twice while restless, to imitate background noise, and once while relaxed. The use of only two subjects presents a limitation in terms of generalizability, as a greater sample size would be needed to represent the target population. No further information was given regarding the subjects' ages or sex; as a result, we cannot identify whether a paediatric or adult population was used, threatening the study's external validity. Furthermore, dissimilar testing conditions were used for the two subjects, weakening the study's internal validity by introducing confounding variables that may affect the ability to draw comparisons between the two results.
The results from the human participants revealed that as the intensity decreased from 80 dB nHL to 5 dB nHL, p-values also decreased. However, when additional averaging was conducted, a rise in p-values was observed. A relationship between p-values and body movements was also observed: as body movement increased, p-values decreased.
Wong and Bickford also investigated the effects of noise on the amplitude, peaks and latency of a waveform using the data collected from the two participants. They found that poor SNR conditions give rise to miscalculations in the amplitudes, peaks and latencies of a waveform. Thus, if low p-values are observed, this should indicate that the testing conditions need to be improved to reduce the noise, rather than simply carrying out further signal averaging.
The results of the testing using simulated data revealed that the ± difference method was in good
agreement with visual analysis when determining the presence of a signal. However, Wong and
Bickford acknowledge that visual analysis should be used along with the objective technique as p-
values alone could not be fully relied upon for asymmetrical waveforms. As a result, the exclusion of
subjective analysis has not been achieved by the use of this objective method.
From their study, Wong and Bickford concluded that the ± difference technique indicates to clinicians when further averaging is necessary, when test conditions need to be improved, and whether a significant signal is present. However, several limitations are present in this study, which could be addressed by, for example, using a larger number of participants who vary in age and gender, so that the results can be applied with greater confidence.
1.6.3 Fixed Single Point (Fsp)
A commonly applied quality estimator for the ABR response was proposed by Elberling and Don (1984). Recognising the difficulty of determining actual SNR levels, they proposed a statistical method that estimates the quality of the SNR in the averaged recording: the Fsp, an algorithm that measures a variance ratio to statistically calculate post-average SNR levels. Residual noise is calculated by measuring the differences in the noise values obtained from a single point in each sweep within a fixed analysis time window. This technique provides an estimate of the ratio of the response to the noise present in recordings.
Signal averaging reduces the amount of noise present in a recording as the magnitude of noise varies
widely compared to the magnitude of the signal of interest. Thus the level of noise is reduced as signal
averaging increases, increasing SNR, which in turn yields a greater Fsp value. The purpose of this
method was to reduce the reliance on subjective measures and allow the ability to make comparisons
of data across studies.
Fsp = Var(ABR) / Var(SP)

Var(ABR) = variance of the averaged ABR
Var(SP) = estimated variance of the background noise, from a single point across sweeps

Table 1.6.3a. Fsp equation to determine the magnitude of the background noise (Elberling & Don, 1984).
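As an illustration of how this ratio might be computed when single sweeps are stored, the following Python sketch is offered. The analysis window and the choice of single point are placeholder assumptions of this sketch, not Elberling and Don's exact settings.

```python
import numpy as np

def fsp(sweeps, window=slice(30, 210), single_point=100):
    """Sketch of the Fsp quality estimator (after Elberling & Don, 1984).

    sweeps: (n_sweeps, n_samples) array of stored single-sweep ABR
    recordings.  The window and single-point index are illustrative
    placeholders.
    """
    n = sweeps.shape[0]
    averaged = sweeps.mean(axis=0)
    var_abr = averaged[window].var()   # Var(ABR): variance of the averaged waveform
    # Var(SP): sweep-to-sweep variance at one fixed time point, divided
    # by n to estimate the residual noise variance left in the average.
    var_sp = sweeps[:, single_point].var(ddof=1) / n
    return var_abr / var_sp
```

With a time-locked response present, the variance of the average stays large while the residual-noise estimate shrinks as 1/n, so the Fsp grows with continued averaging; with noise alone, the ratio hovers near unity.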
Elberling and Don (1984) conducted an experiment to determine the SNR in averaged ABR recordings consisting of both the response and background noise. They used 10 subjects. No information was given regarding their age or gender, which may cause issues when extrapolating the findings, as these subjects may not provide a true representation of the population; a greater sample size should also be used to test the usefulness of the parameter effectively. The authors noted that the subjects were either members of the laboratory or paid volunteers, which may result in recruitment bias. Furthermore, the authors mention that there was considerable variation between subjects in the levels of background noise; they therefore considered that a reasonable sample had been established and did not deem it necessary to use data from the newborn population. It was also reported that, during the recording of ABR responses, variations were present in the background noise over time and changes were observed in waveform morphology. However, these deviations were considered unimportant as they had no clinical significance.
The degrees of freedom (V1) were calculated for the physiological background noise using a no-stimulus condition. By choosing the worst-case design, the value was calculated to be V1 = 5. Using this information, the authors report that the minimum SNR associated with any Fsp value is 2.25 (p≤0.05) or 3.10 (p≤0.01). This means that when detecting thresholds using ABRs, where the quality of waveform morphology is less important, achieving an Fsp value of 3.1 indicates that a response is positively detected (p<0.01). Furthermore, the authors suggested that because the Fsp method produces a simple positive or negative detection of a response, test times for threshold evaluation should be shortened. The criterion value for the Fsp can also be altered to any value, making it an advantageous addition to newborn testing.
They concluded that the Fsp adds objectivity when determining whether a response is present, reducing the reliance on subjective analysis. Additionally, it allows a shortened test time for threshold detection, as a response can be identified immediately with 99% confidence when an Fsp of 3.1 is met, making it a useful addition to ABR testing. Elberling and Don compared the Fsp method with Wong and Bickford's method of objective analysis and found that the ± difference technique was more prone to Type II errors; they state that this was the prime reason they sought another method for the estimation of noise. The authors also added that if the Fsp method were combined with time-varying filtering, significant improvements would be observed in the quality of ABR recordings, allowing a superior form of response detection.
This study presented several limitations. For example, impedance levels were not reported, which may cause issues with reliability as it is harder to replicate the findings. Impedance levels also affect the amount of external electromagnetic interference present in the ABR, along with the amount of artefact from electrode movement (Bremner et al., 2012). Furthermore, Elberling and Don did not mention any use of a sound-treated room, which raises questions about the level of noise present and whether the experimental setup replicates real-world scenarios. Future work should address these limitations and, more importantly, use a larger sample size and ideally a newborn population, as this is the most common patient group in ABR testing.
Shortly after Elberling and Don (1984), Don et al. (1984) conducted a further study. It applies the Fsp method to demonstrate automatic threshold detection and to estimate the number of sweeps required to reach the detection criterion. Don et al. claim that this method can theoretically reduce test time and help reduce variability in test interpretation.
Six normal-hearing adults were used for this study. However, it was not stated whether the hearing statuses of the participants were actually determined using a test protocol such as PTA, or simply assumed. This lack of information may make the experiment harder to replicate and may raise doubts about its accuracy. Furthermore, as in the previous study, a newborn population was not used, threatening external validity since the main application of the ABR is newborn testing.
Of the six subjects, four were female and two were male, with ages ranging from 20 to 32 years. The use of so few participants is a weakness, as additional participants may have provided data that better represent the population. Furthermore, an uneven ratio of males to females was used, which may also cause problems with generalizability, as results may be biased towards a specific gender. Additionally, the restricted age range of 20 to 32 years weakens the study's external validity, as it does not accurately represent the entire population, which includes younger and older people. The reason why most studies do not use a paediatric population is unstated; it may be due to ethical considerations. However, this results in a vital gap in the literature, as most studies use an adult population to test the effectiveness of objective analytical parameters such as the Fsp.
Comparing their results with those of Wong and Bickford (1980), Don et al. (1984) found that fewer sweeps were needed to achieve 99% significance with the Fsp method than with the ± difference method under the same conditions. In addition, a greater risk of Type II errors was found with the ± difference method than with the Fsp, supporting the findings of the previous study by Elberling and Don (1984). Furthermore, Don et al. state that care should be taken when determining the position of the analysis time window: when they used a 0–10 ms time window the criterion was not reached, but it was reached by 2500 sweeps when using a 4–14 ms window.
Don et al. predicted linear growth of the Fsp with the number of sweeps averaged, as SNR improves, in order to predict the number of sweeps necessary to reach the threshold criterion. If the Fsp criterion is not achieved by 1500–2000 sweeps, the number of sweeps required to achieve significance can be calculated by linear extrapolation of the data. If the calculated number of sweeps is excessively high, this may indicate to the clinician that the intensity should be increased, as there may be no signal present or the SNR may be very poor. They found that this held for 90% of cases when the maximum number of sweeps reached 5000.
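The extrapolation idea can be sketched as follows. Because averaging reduces residual noise variance roughly in proportion to 1/N, the Fsp is expected to grow roughly linearly with the sweep count N; this simple proportional form and the function name are illustrative assumptions of the sketch, not the authors' exact procedure.

```python
def predict_sweeps_to_criterion(current_fsp, current_sweeps, criterion=3.1):
    """Extrapolate the sweep count needed to reach the Fsp detection
    criterion, assuming Fsp grows linearly with the number of sweeps.

    A very large prediction suggests either that no response is present
    or that the SNR is very poor, prompting the clinician to act rather
    than simply continue averaging.
    """
    if current_fsp <= 0:
        raise ValueError("Fsp must be positive")
    return int(round(current_sweeps * criterion / current_fsp))
```

For example, under this assumption an Fsp of 1.55 after 2000 sweeps would extrapolate to roughly 4000 sweeps to reach the 3.1 criterion.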
Don et al. provided evidence to strengthen the possibility of adding objectivity to the analysis of ABRs.
Thus the implementation of the Fsp may possibly allow for the automatic detection of a response
during threshold detection, reducing the need for subjective analysis. In addition, it would result in a
standardised method which will allow comparisons to be made between results at different clinics.
Furthermore, compared to Wong and Bickford’s method, the Fsp proved to be superior with respect
to accuracy and type II errors.
This study presents several limitations, the most important one being the limited sample size and no
use of the newborn population. Further research should address these issues in order to fully test the
effectiveness of the parameter, increasing the internal and external validity of the study.
Another study was later conducted by Elberling and Don (1987), who used the Fsp criterion of 3.1 to detect the presence of a response (Elberling & Don, 1984). Psychoacoustic behavioural thresholds were found using a modified block up-down method (Wetherill & Levitt, 1965), allowing comparisons between ABR (Fsp) and psychoacoustic thresholds. They used 10 normal-hearing subjects to record the ABR data; this small sample size threatens the generalizability of the findings.
The results indicate that, across the 10 subjects, a slightly higher median value was found using the Fsp method than using the modified up-down method, while the ranges were very similar for the two parameters. The results provide evidence that ABR thresholds detected with the Fsp criterion of 3.1 are, on average, elevated compared to the psychoacoustic thresholds determined by the modified block up-down method. This study presents several limitations: no information was given regarding impedance levels, the use of a soundproof room, or the location of the analysis time window on the ABR. Perhaps the use of a more conventional style of behavioural threshold detection, such as PTA, would have yielded more accurate information and allowed more accurate comparisons to be made.
1.6.3.1 Objective Measures Continued – Variations of Fsp
Similarly, the Fmp method can help clinicians estimate the quality of a recording by the analysis of ‘multiple points’ – hence ‘mp’. The Fmp analysis tool produces a value based on the statistical confidence of the repeated detection of a response. Like the Fsp method, it uses time-locked points for the response size and residual noise in order to provide a confidence level. The repeated presence of time-locked points is analysed for each sweep, where less variation in the measures produces higher confidence in the response. Likewise, residual noise is determined by calculating the differences in the noise values obtained from the multiple points in each sweep. The amplitude of the noise is then measured; the lower the variation, the less noise in the trace (Sauter et al., 2012).
Silva (2009) conducted research into the Fmp method of analysis in AEPs, using a modified non-stationary fixed multiple point method (NS Fmp). This method accounts for a discrete number of noise sources, which may prove beneficial since, in real-world situations, different sources of noise may be present. Monte-Carlo simulations were conducted alongside data from 5 normal-hearing subjects. ABR measurements were conducted at 0, 20, 30, 40, 60 and 80 dB SPL, with two recordings made at each level except 0 dB SPL. A total of 4000 sweeps were obtained at each level using rarefaction clicks. Perhaps the use of tone pips would have allowed better generalizability of the findings, as threshold detection is recommended to be carried out using tone pips (Sutton et al., 2013).
Silva's (2009) study contained two stages to analyse the quality of different SNR estimators. First, the Fsp and NS Fmp methods were evaluated to compare their mean square errors. Second, Silva compared the receiver operating characteristic (ROC) curves for the Fsp and Fmp using real data obtained from the 5 participants. The results showed that the NS Fmp method yielded lower mean square errors than the Fsp parameter. It was also found that the weighted-averaging Fmp has a greater ROC curve area than the Fsp method. In addition, NS Fmp proved superior to Fsp on artefact rejection levels, as did weighted averaging compared to conventional averaging.
Another method of objective analysis was introduced by Mocks et al. (1984), in which single sweeps are used to estimate SNR based on power-estimate ratios of the signal and noise. Ozdamar and Delgado (1996) conducted an experiment to investigate the relationship between the Fsp method of Elberling and Don and the method of Mocks et al. (1984). They developed a computationally efficient method of calculating the SNR estimate, along with the signal and noise estimates, during ongoing averaging. Both parameters (Elberling & Don, 1984; Mocks et al., 1984) were evaluated and compared using this technique.
Ozdamar and Delgado recorded four sets of data at a given stimulus level from each subject. Storing the single-sweep responses allowed the inspection of recording characteristics and the off-line execution of conventional or other averaging techniques, as well as different signal and noise power calculations. It also allowed different signal-processing techniques to be compared directly, as no variations in EEG, external or internal noise were present. Only four young subjects with normal hearing were used in this study; consequently, the external validity is weakened. Recordings were conducted without artefact rejection in order to obtain both noisy and clean sweeps with which to test and compare the various processing and SNR estimation techniques.
Analysis of the parameters described by Elberling and Don (1984) and Mocks et al. (1984) showed that they are in fact very similar. The Fsp can be generalised to cover multiple points in time (Fmp) and was found to be practically equivalent to the SNR estimate of Mocks et al. (1984), differing only by unity. The authors noted that both parameters, developed for the measurement of signal power, residual noise and SNR estimates, proved very useful not only for monitoring the averaging process but also for implementing various noise-reduction algorithms. In addition, both methods can be readily implemented in clinical applications for online averaging. The authors also state that a total time saving of 65% was achieved compared to standard averaging with a fixed sweep count of 2048. The methods analysed in this study are widely applicable to any averaging technique and are especially important for hearing screening and threshold detection, where reduced test times are favoured and an objective means of detection is desired. However, this study presented several limitations, the main one being the small sample size. Further research should address these limitations in order to strengthen the external validity and allow the findings to be generalised with greater confidence.
1.6.4 Bootstrap Analysis
The bootstrap technique was first introduced by Efron (1979) and falls under the broad umbrella of resampling methods. It provides a means of testing the statistical significance of a particular parameter, such as the Fsp, strengthening its accuracy for threshold detection. It allows measures of accuracy (such as confidence intervals or variances) to be assigned to test parameters where conventional methods cannot be used (Tibshirani & Efron, 1993). Bootstrapping is carried out by constructing a number of resamples, drawn with replacement from the original dataset at random points in time, which in effect regenerates averages. The bootstrap averages differ each time, and this process is repeated a large number of times (typically a few thousand); p-values are then generated from the resulting distribution. The bootstrap method provides a way to control and check the stability of results. Although for the majority of problems it is difficult to know the true confidence intervals, the bootstrap is more accurate than intervals obtained using simple variance calculations and assumptions of normality (Efron, 1979). However, although it is consistent under some conditions, important assumptions are made when undertaking bootstrap analysis, such as the independence of samples, which would otherwise be stated formally in conventional statistical tests.
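As an illustration of the general idea, the following Python sketch builds a null distribution by resampling sweeps with replacement and applying random circular time shifts to destroy any time-locked response. The shifting scheme, the function name and the statistic interface are assumptions of this sketch, not necessarily the exact resampling procedure of any study discussed here.

```python
import numpy as np

def bootstrap_p_value(sweeps, statistic, n_boot=1000, rng=None):
    """Illustrative bootstrap significance test for an ABR statistic.

    Each replicate draws sweeps with replacement and circularly shifts
    them by random offsets, preserving the noise characteristics while
    removing time-locking.  The p-value is the fraction of null
    replicates at least as large as the observed statistic.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_sweeps, n_samples = sweeps.shape
    observed = statistic(sweeps)
    null_values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_sweeps, size=n_sweeps)      # resample with replacement
        shifts = rng.integers(0, n_samples, size=n_sweeps)  # random time offsets
        resampled = np.array([np.roll(sweeps[i], s)
                              for i, s in zip(idx, shifts)])
        null_values[b] = statistic(resampled)
    # the +1 correction keeps the p-value away from exactly zero
    return (1 + np.sum(null_values >= observed)) / (1 + n_boot)
```

Any detection statistic (for instance the variance of the averaged waveform, or an Fsp-style ratio) can be passed in as `statistic`, which is how the bootstrap attaches a significance level to values produced by the objective parameters.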
A study by Lv et al. (2007) looked at the objective detection of evoked potentials using a bootstrap technique that provided p-values to establish the statistical significance of a response. This technique had not previously been used in the detection of evoked potentials, so the study fills a vital gap in knowledge.
Monte-Carlo simulations were first carried out with no stimulus response in order to confirm that the expected false-positive rate (α = 5%) was obtained. All four parameters – diff, power, Fsp and ± difference – were applied to 500 EEG signals, and a time window of 5–15 ms was used to ensure that wave V was included in all recordings, as this window accounts for the effects of stimulus intensity on wave V latencies (Coats, 1978; Hecox & Galambos, 1974; Picton et al., 1977; Gorga et al., 1985; Pratt & Sohmer, 1977). Significance was determined by the bootstrap if p≤0.05. The use of Monte-Carlo simulations increased the study's internal validity, as it allowed the authors to check the power of the methods before applying them to data recorded from normal-hearing subjects, giving the opportunity to modify the research design or methods prior to the experiment.
Lv et al. then applied the bootstrap method to data obtained from 12 normal-hearing adults whose ages ranged from only 18 to 30. This may be a limitation of the study, given the use of only 12 subjects and the limited age range. Conventional audiometry confirmed the normal hearing thresholds of the subjects, eliminating any doubt. Hearing thresholds using all four parameters were then acquired and compared to the visual inspection of three experienced audiologists.
Lv et al. investigated the accuracy of the proposed method by simulating a response of known SNR. They found that when responses were added to the simulations, the percentage of detections by the objective parameters increased consistently as SNR increased. The subjective interpretations of hearing thresholds by the testers showed large variations. A Cohen's Kappa statistic was used to measure inter-observer reliability, and values of 0.70, 0.63 and 0.81 were found. As these are less than 0.90, they cannot be regarded as high (Seigel et al., 1992; iSixSigma, 2014); thus good agreement between the audiologists was not found. These results further reinforce the findings of Vidler and Parker (2004), Gans et al. (1992) and Kuttva et al. (2009) regarding the variability in interpretations between clinicians, and highlight the desirability of an objective method of analysis.
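For reference, Cohen's kappa for a pair of raters can be computed from observed agreement and chance agreement; the following is a sketch of the standard formula in Python, not code from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from each
    rater's marginal category frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 1 indicates perfect agreement, 0 indicates chance-level agreement, and values such as 0.63–0.81 indicate agreement that is substantial but short of the 0.90 level regarded as high.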
Furthermore, the authors found that the Fsp method produced lower mean threshold values for the 12 subjects than subjective analysis did. However, data are available from only three audiologists, and more testers would need to be employed to make accurate comparisons.
Lv et al. found an Fsp critical value of 1.81 for the bootstrap distribution at α = 5%, compared with the value of 2.25 found by Elberling and Don (1984) using 250 sweeps. As the number of stimuli was increased to 2000, Lv et al. found an Fsp critical value of 1.75. Thus the value determined by Elberling and Don (1984) appears too high for α = 5%, in accordance with the worst-case assumptions made in deriving it. Furthermore, the bootstrap analysis shows that the threshold values for the parameters depend on the number of stimuli used (for a given level α) and vary quite considerably between individuals. Any fixed threshold values for parameters such as the Fsp or ± difference would therefore lead to false-positive rates that differ between subjects, indicating that universally valid critical values for such parameters probably cannot be justified. This study has shown that the application of the bootstrap method can provide statistically significant results at chosen confidence levels.
The authors suggest that this method should not supersede other objective methods used for detecting responses; rather, it should be applied when testing the significance of values determined by other objective parameters such as the Fsp. The bootstrap would also enable ABR data to be compared between different clinics, which is currently difficult due to the subjective nature of analysis.
1.7 Study Rationale
ABR testing is one of the most important objective methods used in paediatric audiology to determine hearing levels in the newborn population. Misinterpretation of results can lead to significant changes in the management of a patient, resulting in patient dissatisfaction. It can also cause problems in language development, as sufficient amplification may not be provided or a hearing loss may go undetected. Conversely, over-amplification is another possible issue and can lead to cochlear damage.
An extensive review of the literature provides consistent evidence that significant differences are present between clinicians when interpreting ABRs (Vidler & Parker, 2004; Kuttva et al., 2009; Gans et al., 1992; Lv et al., 2007). Inconsistencies between clinicians arise from the varying levels of background noise present during an ABR recording. In some departments, peer review is conducted to identify and reduce uncertainty. However, as this is not carried out in all departments and still does not fully eliminate subjectivity (Kuttva et al., 2009), the ultimate goal is to remove any doubt present during waveform analysis.
Currently the NHSP suggests a 3:1 rule for the detection of a CR in ABR testing, where the signal of interest must be 3 times the noise present (Sutton et al., 2013). However, the statistical rationale behind this rule is not documented. Several noise-reduction techniques, such as filtering, signal averaging and artefact rejection, are used to keep noise minimal, increasing the likelihood of achieving the 3:1 criterion determined by the NHSP protocol (Sutton et al., 2013).
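A minimal sketch of how an automated 3:1 ratio check might look is given below; the peak-to-peak amplitude measure, the noise-estimate input and the function name are assumptions of this sketch, not the NHSP's prescribed measures.

```python
import numpy as np

def three_to_one_check(averaged, response_window, noise_estimate):
    """Sketch of an automated 3:1 response check.

    averaged: the averaged ABR waveform.
    response_window: slice containing the candidate wave V.
    noise_estimate: an estimate of the residual noise amplitude
    (illustrative; how it is obtained is left open here).
    """
    segment = averaged[response_window]
    response_amplitude = segment.max() - segment.min()  # peak-to-peak amplitude
    ratio = response_amplitude / noise_estimate
    return ratio, ratio >= 3.0   # NHSP 3:1 criterion for a clear response
```

Automating the check in this way is what would allow the 3:1 criterion to be treated as an objective parameter and, in turn, bootstrapped like the Fsp.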
Several objective methods of analysis have been developed to reduce subjectivity. Elberling and Don (1984) and Wong and Bickford (1980) proposed the Fsp and ± difference methods respectively in order to introduce objectivity to ABR analysis. Additionally, Lv et al. (2007) introduced the application of the bootstrap method, which allows significance values to be associated with values generated by objective parameters (such as the Fsp). However, these objective methods do not use the 3:1 rule recommended by the NHSP protocol. The implementation of an objective parameter that takes the 3:1 rule into account may therefore prove useful, as it would combine the subjective NHSP criterion with an objective method. Combined with the newly introduced bootstrap method, this should in theory be a powerful tool for identifying a response.
The Fsp has been recommended by Sutton et al. (2013) as an acceptable parameter for strengthening a clinician's decision during ABR analysis. However, little research has compared the objective automated 3:1 rule with the Fsp with the addition of the bootstrap. A comparison of the two objective methods is therefore desirable to determine which produces the more accurate results, supporting its implementation in clinics. Furthermore, comparing the objective methods with experts' subjective interpretations is also desirable, as it will allow further investigation of the usefulness of the objective methods and the differences between objective and subjective results. In addition, direct comparisons between known SNRs and the significance values (p-values) generated by the bootstrap have not been investigated. This would prove useful, as it would allow comparison of the SNR levels at which significance is reached by the bootstrap for each objective parameter.
For the purposes of this research, ± difference and Fmp will not be investigated further due to the
limited research present and their uncommon use in clinics. The objective parameters that will be
explored in detail in this thesis are the Fsp and the automated version of the 3:1 rule (which will now
be referred to as ‘Autratio’). Furthermore, the bootstrap method will be applied to the objective
parameters in order to allow for a detailed investigation regarding its applications and relationship
with SNR. Lastly, subjective analysis using the NHSP protocol (Sutton, et al., 2013) will be explored in
order to make comparisons.
1.8 Research Questions
By conducting an extensive review of the available literature, this thesis aims to answer the following
research questions:
1. Is there a significant level of variability between experts when interpreting ABRs?
2. Are the objective parameters (Fsp and Autratio) equally sensitive at correctly identifying a
wave V response?
3. Is there a significant correlation between the parameters Fsp and Autratio when detecting
ABRs?
4. Does SNR have a significant effect on Fsp and Autratio values?
a. Does SNR have a significant effect on p-values generated by the bootstrapping of
Fsp and Autratio?
5. At a 95% confidence level (p≤0.05), is ‘3.0’ the critical value of an objective parameter
(Autratio) which uses the 3:1 ratio rule to detect a signal?*
*Value 3.0 is based on the 3:1 signal to noise ratio rule proposed by the NHSP for
subjective analysis.
Experimental Hypotheses
1. There will be a significant level of variability between experts when interpreting ABRs.
2. When detecting a wave V response in ABR analysis, there will be a significant difference in
detection rates between the parameters Fsp and Autratio.
3. There will be a significant positive correlation between the parameters Fsp and Autratio.
4. As the level of SNR increases, Fsp and Autratio values will increase.
a. As SNR increases, p-values generated by the bootstrapping of Fsp and Autratio will
decrease towards 0.
5. The critical value of Autratio at (p≤0.05), for a response detection, will be 3.0 as Autratio is
calculated according to the 3:1 signal to noise ratio rule.
Null Hypotheses
1. There will not be a significant level of variability between experts when interpreting ABRs.
2. There will not be a significant difference in detection rates between the parameters Fsp and
Autratio.
3. There will not be a significant correlation between the parameters Fsp and Autratio.
4. The level of SNR will have no effect on the values generated by Fsp and Autratio.
a. The level of SNR will have no effect on the p-values generated by the bootstrapping
of Fsp and Autratio.
5. The critical value of Autratio at (p≤0.05), for a response detection, will not be 3.0.
2.0 Method
The data used in this study were primarily collected by Lightfoot & Stevens for their research article titled "Effects of Artefact Rejection and Bayesian Weighted Averaging on the Efficiency of Recording", published in the journal Ear and Hearing (Lightfoot & Stevens, 2014). The data collected
by Lightfoot & Stevens will be discussed in section 2.1. This study has also generated original data,
details of which are outlined in section 2.2.
2.1 Data Collected by Lightfoot & Stevens (2014)
Experimental Variables
This study contained several independent, dependent and confounding variables, which are detailed below.
Independent Variables: Frequency and the intensity of the signal*
*The independent variables mentioned above were manipulated by Lightfoot & Stevens for
their study.
Dependent Variables: Values generated by: Fsp, Autratio, Bootstrapping of Fsp and Autratio, and the
expert analysis of the ABRs*
*The dependent variables mentioned above are primarily for this study only.
Confounding Variables: Variance between testers and environmental noise interference.
Variance between testers was minimised as far as possible by instructing the experts to follow the NHSP guidance for interpreting an ABR response (see sections 1.3 and 1.4). This ensured that a standardised method of analysis was performed by all the experts*
*The confounding variable mentioned above was primarily present for this study only.
Environmental noise interference was kept to a minimum by using a sound-proofed room. In addition, the position of electrode attachment was kept consistent for all participants and impedance levels were kept below 5 kΩ. Lastly, the same high- and low-pass filter settings were used throughout testing, along with approximately 3000 sweeps per waveform*
*The confounding variable mentioned above was controlled by Lightfoot and Stevens
for their study.
Participants
Lightfoot and Stevens’ data consisted of a total of 26 babies, referred in one or both ears from the
NHSP for failing a transient evoked otoacoustic emission and an automated ABR. Participants
underwent routine ABR diagnostic testing at Arrowe Park Hospital, Wirral, United Kingdom. The mean
corrected age of the 26 participants was 3.5 weeks with a range of -1 to +12 weeks. Differing intensities
and frequencies were used along with differing number of repeats, depending on the judgement of
the present clinicians (Lightfoot & Stevens, 2014). An inclusion criterion was applied: recordings of one or more repeats at the same frequency and intensity had to have been carried out for a waveform to be included in the analysis. This resulted in a total of 93 averaged waveforms being used (see Appendix A for full details).
Equipment and Software
The analysis of the ABR recordings required the following equipment and software:
Laptop/desktop computer
Matlab 2014a (Mathworks, 2014)
Audacity version 1.2.6 (2012)
IBM SPSS 22 (IBM, 2014)
The following equipment was required for the initial testing by Lightfoot and Stevens:
Otoscope
Abrasive gel
Alcohol rub
Single use electrodes (Ag/AgCl)
Audio recording software ClimaxDigital (2012) and Audacity version 1.2.6 (2012)
TDH-39 Supra-aural earphones
Modified Interacoustics Eclipse ABR system
Testing Conditions
At the time of testing, electrical noise was reduced by switching off all unnecessary electrical equipment and seating the participant as far as possible from any remaining equipment. The testing room was a double-walled, insulated, sound-proof room which ensured optimal acoustic isolation. Artefacts
were reduced by ensuring that the participants’ state of arousal was as calm as possible. Note that
some data were deliberately recorded when participants’ state of arousal was not calm, allowing for
variations with higher noise content. These conditions were controlled by Lightfoot and Stevens.
Test Method
Initial ABR Testing by Lightfoot and Stevens
Lightfoot & Stevens (2014) recorded their data at Arrowe Park Hospital, Wirral, UK. They followed closely the procedures outlined in the NHSP guidance for ABR testing in babies (Sutton, et al., 2013) and the English NHSP guidelines for early audiological assessment (Stevens, et al., 2013b).
A total of 26 babies were tested, consisting of 17 male and 8 female participants; further participant details, including ages and test parameters, are given in the Participants section above. Lightfoot & Stevens note that the majority of the babies were from the well population, with 19% having risk factors. Sixteen babies satisfied the English discharge criteria and so achieved thresholds at 40 dB nHL and 10 dB above.
To detect the ABR, Lightfoot and Stevens used single use Ag/AgCl electrodes which were attached on
the non-test ear mastoid (common), test ear mastoid (negative) and forehead (positive). Electrodes
were attached after applying an abrasive gel which ensured good contact between the electrodes and
the skin. All impedances were kept below 5kΩ with similar impedances across each electrode. The
majority of the participants were tested in good conditions where noise levels were kept to a minimum
along with the baby being asleep or relaxed. However, some participants did not settle and EEG was
still recorded in order to acquire data of slightly higher noise content.
Amplified EEG was recorded using Climax Digital ACAP100 (2012) and Audacity version 1.2.6 (2012).
Participants underwent standard ABR testing which involved the use of a 4 kHz 5-cycle Blackman
envelope tone burst stimuli which was presented at 49.1/sec via TDH-39 transducers. The incoming
EEG was filtered between 33 Hz and 1500 Hz and typically 3000 sweeps contributed to each waveform.
Recordings (a total of n=93) were terminated after 61-second intervals and then saved to be analysed
later, which allowed the inclusion of both noisy and quiet conditions. When re-averaging using conventional and Bayesian averaging, a wider filter bandwidth (3.3 to 3000 Hz) was used; had the filter settings been the same as those used when recording the EEG, this would have effectively doubled the high- and low-pass filter slopes.
Analysis of ABR Data
This study analysed the initial data collected by Lightfoot & Stevens. The data were converted into 'wav' format using Audacity (2012), and analysis of the waveforms was conducted using Matlab R2014a (Mathworks, 2014). Code was devised by Dr Bell (see Appendix B) which allowed the following features to be displayed and parameters to be calculated for each participant, ear, intensity and frequency:
Fsp
Autratio
P-value arising from bootstrapping of Fsp
P-value arising from bootstrapping of Autratio
Graphical representation of initial ABR recording and repeated recording (see figure 2.1a)
Graphical representation of averaged ABR (see figure 2.1a)
Figure 2.1a. Graph displaying two initial ABR waveforms (left) and the corresponding averaged waveform (right) produced on Matlab, from an example subject.
For this study, the 93 waveforms collected by Lightfoot and Stevens were given to four experts for
visual analysis. The experts had the option to label each waveform as wave V present (Y) or wave V absent (N), or to request further sweeps or a repeat (R). These labels were then compared with the objective parameter
values for analysis. Figure 2.1b below is an example of the graphical representation of the waveform
given to the experts.
Figure 2.1b. Graph displaying example ABR data in the form given to the experts for analysis.
The Fsp parameter was calculated according to the formula proposed by Elberling & Don (1984)
mentioned in section 1.6.3. The parameter Autratio was calculated according to a 3:1 SNR rule based
on the recommendation from the NHSP. An analysis time window of 6-16 ms was used. The peak was taken as the maximum of the averaged waveform in the time window, and the trough as the minimum of the averaged waveform in the same window. The size of the wave is defined as the magnitude from peak to trough, and the noise as the mean absolute difference between the repeat waveforms over the same window. Hence, ratio = (size of wave)/(noise). Fsp and
Autratio were then bootstrapped in order to generate p-values to allocate significance levels to the
determined values.
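The Autratio calculation described above can be sketched as follows. This is an illustrative Python re-implementation, not Dr Bell's actual Matlab code (Appendix B); the function name and the sampling-rate argument are assumptions introduced here for clarity.

```python
import numpy as np

def autratio(rep1, rep2, fs, t0=0.006, t1=0.016):
    """Automated 3:1 ratio: wave size over noise within a 6-16 ms window.

    rep1, rep2 : the two repeat averaged waveforms (same length)
    fs         : sampling rate in Hz (assumed input, not stated per recording here)
    """
    i0, i1 = int(round(t0 * fs)), int(round(t1 * fs))   # analysis window in samples
    avg = (rep1 + rep2) / 2.0                            # grand average of the two repeats
    win = avg[i0:i1]
    size = win.max() - win.min()                         # peak-to-trough magnitude
    # Noise: mean absolute difference between the two repeats over the same window
    noise = np.mean(np.abs(rep1[i0:i1] - rep2[i0:i1]))
    return size / noise
```

For two repeats sampled at an assumed 1 kHz, `autratio(rep1, rep2, fs=1000)` returns the ratio; values of 3.0 or above would meet the NHSP-style 3:1 criterion.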
Bootstrapping was carried out by constructing a number of resamples, with replacement, of the original dataset at random points in time, which in effect regenerates the average with the time-locking of any response destroyed. Each resampled average differs, and this process was repeated approximately 499 times to build a distribution of parameter values expected in the absence of a response. P-values were then generated from this distribution, indicating a significant response present if p≤0.05.
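One common way to realise this resampling scheme, assuming the single-sweep epochs are available as a matrix, is sketched below in Python. The circular-shift step is an illustrative way of destroying the stimulus time-locking while preserving the noise statistics; it is an assumption for this sketch, not a description of the Appendix B code.

```python
import numpy as np

def bootstrap_p(sweeps, statistic, n_boot=499, rng=None):
    """Bootstrap significance test for an ABR detection statistic.

    sweeps    : (n_sweeps, n_samples) array of single-sweep epochs
    statistic : function mapping a sweep matrix to a scalar (e.g. Fsp or Autratio)
    """
    rng = np.random.default_rng(rng)
    n, m = sweeps.shape
    observed = statistic(sweeps)
    null = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample sweeps with replacement
        shifts = rng.integers(0, m, size=n)     # random time offsets break time-locking
        resampled = np.array([np.roll(sweeps[i], s) for i, s in zip(idx, shifts)])
        null[b] = statistic(resampled)
    # One-sided p-value: chance of the observed statistic under 'noise only'
    return (1 + np.sum(null >= observed)) / (n_boot + 1)
```

The observed statistic is compared against the resulting 'no-response' distribution; the `(1 + count)/(n_boot + 1)` form avoids ever reporting an exact p of zero.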
2.2 Simulated Data
Experimental Variables
Whilst conducting simulations, several variables were present in this study:
Independent Variables: SNR level before averaging
Dependent Variables: Values generated by: Fsp, Autratio and by the Bootstrapping of Fsp and Autratio
Confounding Variables: As simulations were carried out in this study, there were no confounding variables; all variables were tightly controlled.
Equipment and Software
The following equipment was required in order to conduct simulations:
Laptop/desktop computer
Matlab 2014a (Mathworks, 2014)
IBM SPSS 22 (IBM, 2014)
Testing Conditions
At the time of testing, all unnecessary programmes were terminated in order to reduce the load on the CPU. This also ensured that the chance of a computer error occurring was kept to a minimum.
Test Method
Generating Simulated Data
An autoregressive model of EEG noise from an example subject was generated using the Yule Walker
method. Sections of simulated EEG noise were generated by putting white noise into the
autoregressive model.
An example ABR response from a subject was embedded into the noise at different levels to generate different SNR levels. Ten repeats were conducted at each SNR level, and means were calculated and then analysed.
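The simulation steps above can be sketched as follows. This is illustrative Python rather than the study's Matlab code (Appendix C); the AR model order, the definition of SNR as a ratio of standard deviations used for scaling, and the function names are all assumptions made for this sketch.

```python
import numpy as np

def yule_walker(x, order=8):
    """Fit AR coefficients to a recording by solving the Yule-Walker equations."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])        # AR coefficients
    sigma2 = r[0] - np.dot(a, r[1:order + 1])     # innovation (white-noise) variance
    return a, sigma2

def simulate_eeg(a, sigma2, n_samples, rng):
    """Drive the fitted AR model with white noise to synthesise EEG-like noise."""
    order = len(a)
    e = rng.normal(0.0, np.sqrt(sigma2), n_samples + order)
    y = np.zeros(n_samples + order)
    for t in range(order, n_samples + order):
        y[t] = np.dot(a, y[t - order:t][::-1]) + e[t]
    return y[order:]

def embed_response(template, snr, n_sweeps, a, sigma2, rng):
    """Add a template ABR to AR-simulated noise sweeps at a given pre-averaging SNR."""
    template = np.asarray(template, dtype=float)
    sweeps = []
    for _ in range(n_sweeps):
        noise = simulate_eeg(a, sigma2, len(template), rng)
        scale = snr * np.std(noise) / np.std(template)   # set signal/noise amplitude ratio
        sweeps.append(scale * template + noise)
    return np.array(sweeps)
```

A higher SNR scales the embedded template up relative to the simulated noise; averaging the resulting sweeps then mimics the recordings analysed in section 2.1.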
Figure 2.2a. ABR response from subject 19 which was embedded into the noise at different levels to generate different SNR levels. (Sound presented to the subject's right ear at a frequency of 4 kHz at 50 dB.)
Analysis of Simulated Data
The simulated data were analysed using Matlab R2014a (Mathworks, 2014). Code was devised by Dr Bell (see Appendix C) which allowed the following features to be displayed and parameters to be calculated for different, user-assigned SNR levels (set before averaging):
Fsp
Autratio
P values arising from the bootstrapping of Fsp
P values arising from the bootstrapping of Autratio
Graphical representation of simulated ABR waveform
The data was analysed and different objective parameters were calculated using the same approach
as for real patient data (see section 2.1 – ‘Analysis of ABR data’).
2.3 Risk Assessment and Ethics
Prior to commencing analysis, a risk assessment form was completed and submitted to the University of Southampton. This was approved shortly after submission; details can be found in Appendix D.
Ethics approval from the Institute of Sound and Vibration Research was not necessary, as advised by Dr Bell, because the two sets of data used in this study were either simulated or originally recorded by Lightfoot & Stevens (2014) in their study. It should be noted, however, that Lightfoot and Stevens obtained NHS ethics approval from the Department of Medical Physics & Clinical Engineering, Royal Liverpool University Hospital.
2.4 Statistical Analysis
Note: All statistical testing will be carried out using IBM SPSS 22 (IBM, 2014) software.
A Shapiro-Wilk test will first be conducted to check the distribution of the data. Appropriate parametric or non-parametric tests will then be carried out based on whether the data are normally distributed. This will allow comparisons to be made between the data and variable effects to be identified.
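As a sketch of this plan, the normality-gated choice between a parametric and a non-parametric correlation test might look as follows in Python with SciPy. This is illustrative only; the study itself used SPSS, and the function name is hypothetical.

```python
from scipy import stats

def correlation_test(x, y, alpha=0.05):
    """Choose Pearson's r or Spearman's rho from Shapiro-Wilk normality checks.

    A parametric test is used only if both variables pass the normality check;
    otherwise the non-parametric equivalent is used, mirroring the plan above.
    """
    _, px = stats.shapiro(x)
    _, py = stats.shapiro(y)
    if px > alpha and py > alpha:
        r, p = stats.pearsonr(x, y)
        return "Pearson", r, p
    rho, p = stats.spearmanr(x, y)
    return "Spearman", rho, p
```

The same gate applies to group comparisons (e.g. t-test versus a rank-based alternative); only the correlation case is shown here.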
3.0 Results
3.1 Analysis of Data Collected by Lightfoot & Stevens
3.1.1 Comparison between Experts’ Interpretations
The experts who analysed the data were either from the University of Southampton Institute of Sound
and Vibration Research or the Southampton Auditory Implant Service. As shown in Appendix A,
analysis was conducted in two groups. Group one consisted of expert 1 only, where group two
consisted of experts 2, 3 and 4 who made their decisions together.
Table 3.1.1a demonstrates the percentage of agreement and disagreement between the experts when analysing waveforms. For the detection of a present wave V, a high level of agreement (86.96%) was observed between the experts. In contrast, high levels of disagreement were found between experts when identifying a no response or when requesting further sweeps (85.19% and 72.73% respectively). It is evident that there is a higher level of reliability when identifying a present wave V (Y) than when identifying a no response (N) or requesting further sweeps (R).
In a private meeting, expert one noted that he should have identified some waveforms as ‘R’ instead
of ‘N’, as the levels of noise were too large to meet the criterion described in the NHSP procedure. On
two occasions, experts 2, 3 and 4 identified a waveform as ‘Y’ where expert one marked ‘N’. This
provides further evidence for the variability that is present between clinicians when interpreting
waveforms. In addition, a Cohen's Kappa test was conducted, which provides a measure of inter-observer agreement (Jean, 1996). Kappa is defined as the 'proportion of observed agreement after correction for chance agreement'. The calculation typically produces a value between 0 and 1 (negative values are possible but indicate agreement worse than chance), where 0 indicates poor reliability and 1 excellent reliability; a kappa value of 0.90 is regarded as high, i.e. as good agreement between the experts (Arnold, 1985). A value of 0.334 was found, which was highly significant (p=0.000). This indicates that a very poor level of reliability was present between the two groups of experts.
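Cohen's kappa itself is straightforward to compute from the two raters' label sequences; a minimal sketch follows (the rating lists in the usage example are hypothetical, not the experts' actual labels, which were computed in SPSS in this study).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    # Observed proportion of items given the same label by both raters
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    fa, fb = Counter(rater_a), Counter(rater_b)
    p_exp = sum((fa[label] / n) * (fb[label] / n) for label in set(fa) | set(fb))
    return (p_obs - p_exp) / (1 - p_exp)
```

For example, `cohens_kappa(list('YYNR'), list('YYNN'))` compares two raters over four waveforms and comes out at about 0.6.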
As expert one indicated that he should have identified some waveforms as ‘R’ instead of ‘N’, it may
prove beneficial to analyse the results using only two categories; category 1 as ‘Y’ (yes) and category
2 as ‘N or R’ (not yes) (see table 3.1.1b below). For a present detection of a wave V, the same level of
86.96% was observed as this category did not change. When ‘N’ and ‘R’ were categorised as one, there
was a marked reduction in the level of disagreement found between experts, who agreed in 78.87% of cases where a response was either not present or where further sweeps would be requested.
In addition, a Cohen’s kappa test was carried out revealing a value of 0.663 which was highly significant
(p=0.000). A kappa value of 0.663 is considerably higher than that found when using three groups,
indicating a higher level of reliability between testers when using two categories of waveform
identification. However, as the kappa value is still considerably lower than 0.90, which is regarded as
high (i.e. for there to be good agreement) (Arnold, 1985), significant inconsistency is still present
between both groups of experts.
              Y (Yes)   N (No)    R (Further Sweeps)
Agreement     86.96%    14.81%    27.27%
Disagreement  13.04%    85.19%    72.73%
Table 3.1.1a. Percentage of agreement between experts using three categories of waveform identification.
              Y (Yes)   N or R (Not Yes)
Agreement     86.96%    78.87%
Disagreement  13.04%    21.13%
Table 3.1.1b. Percentage of agreement between experts using only two categories of waveform identification.
3.1.2 Sensitivity and Specificity of Objective Parameters
Expert one is an experienced lecturer in Audiology at the Institute of Sound and Vibration Research, and experts two, three and four are experienced clinicians from the Southampton Auditory Implant Service who all regularly inspect ABR data. For this section, therefore, this study will assume a gold standard based
on the agreement of all testers for each waveform in order to analyse the rate of detection of a wave
V using Fsp and Autratio. Table 3.1.2a below shows the number of correct and incorrect detections of
a wave V. A value of 3.1 was chosen for Fsp as the critical value/threshold in order to test the accuracy
of the proposed findings by Elberling & Don (1984) when using a worst case scenario. A critical value
of 3.0 was assigned for the parameter Autratio in order to replicate the criterion assigned by the NHSP
of obtaining a 3:1 SNR for a signal present (Sutton, et al., 2013).
If values produced by Fsp or Autratio met or exceeded their critical values when the experts agreed
on a response present, this would be deemed as a correct detection by the objective parameter. If
critical values were met or exceeded by the Fsp or Autratio, but experts requested further sweeps or
indicated a no response present, then this would be deemed as an incorrect identification by the
objective parameters.
Furthermore, bootstrapping was applied to Autratio and Fsp, where a wave V response was deemed
present only if the p-value was found to be less than or equal to 0.05 which was determined by the
bootstrap. Bootstrapping was used to determine whether the parameter value derived from the coherent average is significantly different from the distribution of the non-coherent averages at p≤0.05. From
these values, the corresponding sensitivity and specificity levels were calculated by using the equation
found below on table 3.1.2b.
Table 3.1.2a indicates that the parameter Autratio produced the highest sensitivity level of 96% using
a critical value of 3.0. Autratio sensitivity dropped to 92% when only examining Autratio values which
were deemed significant by the application of bootstrapping. In addition, a specificity of 35.71% was
found for Autratio, meaning that this parameter has a high chance of producing a false positive result.
Furthermore, when looking only at Autratio values deemed significant by the bootstrap, a considerably higher specificity of 64.29% was found.
When using a critical value of 3.1 as proposed by Elberling & Don (1984), the Fsp parameter produced
the lowest sensitivity level of 70%. In contrast, a considerably higher sensitivity of 88% was achieved when only analysing Fsp values which were deemed significant by the bootstrap. However, the specificity dropped from 82.14% to 67.88% when only analysing Fsp values deemed significant by the bootstrap.
Table 3.1.2a. Sensitivity and specificity levels of Fsp and Autratio when using experts’ answers as gold standard. Note: Substitute ‘x’ according to assigned parameter value
Table 3.1.2b. Calculations of sensitivity and specificity
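The calculations in table 3.1.2b follow the standard definitions, which can be sketched as follows; the counts used in the usage note below are hypothetical, chosen only to illustrate the arithmetic, not the study's actual contingency table.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of expert-agreed responses that the parameter also detected."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of expert-agreed non-responses that the parameter also rejected."""
    return true_neg / (true_neg + false_pos)
```

For example, 48 correct detections out of 50 expert-agreed responses would give `sensitivity(48, 2)` = 0.96, and 18 correct rejections out of 28 expert-agreed non-responses would give `specificity(18, 10)` ≈ 0.643.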
3.1.3 Correlational Analysis
A correlational test was performed on the values produced by the parameters Fsp and Autratio for all
data (n=93, see Appendix A) in order to determine whether any association was present. Firstly, a Shapiro-Wilk test determined that the data were not normally distributed. Therefore a non-parametric Spearman's rho test was conducted to identify correlation between the data for Fsp and Autratio. The results of the test show
that a significant correlation with a coefficient of 0.752 exists between Fsp and Autratio (p=0.000).
Additionally, Figure 3.1.3a shows the relationship between Fsp and Autratio in a graphical manner. As
the value of Autratio increases, Fsp value increases too. Furthermore, it can be clearly deduced that
Autratio values were higher on most occasions compared to the Fsp.
The same procedure was adhered to when examining Fsp and Autratio values which were deemed
significant (p≤0.05) by the application of bootstrapping (n=62, see Appendix A). Values determined
significant by the bootstrapping of Fsp and Autratio will now be referred as SigFsp and SigAutratio
respectively for this section only.
A test for normality revealed that both sets of data were not normally distributed. A Spearman's rho test
indicated that a significant correlation (p=0.000) was present between SigAutratio and SigFsp with a
correlation coefficient of 0.533. However, this correlation was not as strong as that between Fsp and
Autratio. Additionally, figure 3.1.3b displays this information in a graphical manner.
Figure 3.1.3a. Graph displaying the relationship between Fsp and Autratio when using all data (n=93).
Figure 3.1.3b. Graph displaying the relationship between Fsp and Autratio when only using significant values (n=62), as determined by bootstrapping the objective parameters (p≤0.05).
3.2 Analysis of Simulated Data
SNR and P-value Analysis
The use of Monte Carlo simulations has also allowed analysis of how the bootstrap method functions with respect to the two parameters Fsp and Autratio. This was done by retesting ten times at each SNR and then averaging the p-values generated from the bootstrapping of Autratio and Fsp. Table 3.2a below displays the mean p-values at each SNR.
Table of Means
SNR (BA)   Fsp p-value   Autratio p-value
0.002      0.28537       0.61763
0.004      0.12901       0.16533
0.006      0.0172*       0.02566*
0.008      0.0004*       0.00501*
0.01       0*            0.00401*
Table 3.2a. Mean p-values generated by using the bootstrap method at each SNR. *Significant (p≤0.05)

A Shapiro-Wilk test indicated that all data were normally distributed (p>0.05), thus a Pearson's r test was carried out in order to analyse the relationship between each parameter and SNR. The test revealed a significant correlation coefficient of -0.895 (p=0.04) between SNR and the p-values generated for Fsp by the bootstrap. Similarly, a correlation coefficient of -0.836 was found between SNR and the
p-values generated by the bootstrapping of Autratio. However, statistical significance was not
achieved (p=0.078) for this correlation. Figure 3.2b displays this relationship graphically. By analysing table 3.2a and figure 3.2b, it is clear that the p-values generated for Autratio are on average higher than those generated for Fsp across all SNRs, particularly at the lower SNRs. Averaged p-values for both parameters first reach statistical significance (defined as p≤0.05) at an SNR of 0.006, after which they are very similar to one another. This indicates that p-values produced by the bootstrapping of both objective parameters tend towards 0 as SNR increases.
A repeated measures ANOVA test was conducted in order to measure the effect of SNR on p-values
generated by the bootstrapping of both objective parameters. The test indicated that SNR has a
significant effect on p-values generated by the bootstrapping of Fsp (p=0.000). In addition, statistical
significance was also achieved for the effect of SNR on p-values arising from the bootstrapping of
Autratio (p=0.000).
Figure 3.2b. Graph displaying the relationship between SNR and Fsp/Autratio p-values as determined by the bootstrap.
SNR and Parameter Value Analysis
The use of Monte Carlo simulations has enabled the investigation of how the parameter values differ according to changes in SNR. Each SNR was retested ten times in order to generate average Fsp and Autratio values (all data can be found in Appendix E). Table 3.2c displays the means at each SNR.
Table of Means
SNR (BA)   Fsp       Autratio
0.002      1.63723   2.0826
0.004      2.23355   3.87623
0.006      3.96289   6.57598
0.008      6.27492   7.29712
0.01       7.58759   8.3473
Table 3.2c. Table displaying the mean parameter values at each SNR.
A Shapiro-Wilk test revealed that all data were normally distributed (p>0.05), thus a Pearson's r test was carried out. The test revealed a very highly significant correlation coefficient of 0.985 (p=0.002) between Fsp and SNR. Similarly, a highly significant correlation coefficient of 0.976 (p=0.004) was found for the relationship between Autratio and SNR. The values produced by Autratio were found to be higher on
average compared to that produced by the Fsp parameter, which is also supported by the findings
discussed in section 3.1.3. Figure 3.2d displays this relationship in a graphical manner.
A repeated measures ANOVA test was conducted in order to investigate the effect of SNR on Autratio and Fsp values. However, statistical significance was not achieved (p=0.316 and p=0.104 for Fsp and Autratio respectively).
Figure 3.2d. Graph displaying the relationship between SNR and Fsp/Autratio parameter values.

Parameter and P-value Analysis
By conducting simulations, the relationship between parameter values and their corresponding p-values, as determined by the bootstrap, can be analysed. The values from tables 3.2c and 3.2a have been imported to create table 3.2e (see below), which will be used for the upcoming calculations.
Table of Means
SNR     Fsp       Fsp p-value   Autratio   Autratio p-value
0.002   1.63723   0.28537       2.0826     0.61763
0.004   2.23355   0.12901       3.87623    0.16533
0.006   3.96289   0.0172        6.57598    0.02566
0.008   6.27492   0.0004        7.29712    0.00501
0.01    7.58759   0             8.34730    0.00401
Table 3.2e. Mean parameter values and their corresponding mean p-values at different SNR levels.
A Shapiro-Wilk test indicated that all data were normally distributed (p>0.05). A Pearson's r test revealed a strong negative correlation coefficient of -0.829 between Fsp values and p-values arising from the bootstrapping of Fsp (p=0.082). A significant difference was found between the values for Fsp (M=4.34, SD=2.56) and the p-values arising from the bootstrapping of Fsp (M=0.086, SD=0.12); t(4)=3.573, p=0.023.
Furthermore, a stronger negative correlation coefficient of -0.900 (p=0.038) was found between
Autratio values and p-values arising from the bootstrapping of Autratio. Similarly, a repeated
measures t-test determined that a significant difference was present for values of Autratio (M=5.63,
SD=2.58) and p-values arising from the bootstrapping of Autratio (M=0.16, SD=0.26); t(4)=4.335,
p=0.012.
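The Autratio correlation reported above can be reproduced directly from the means in table 3.2e; a short SciPy sketch using only the tabulated values (not the underlying simulation runs):

```python
from scipy import stats

# Mean Autratio values and their mean bootstrap p-values, from table 3.2e
autratio = [2.0826, 3.87623, 6.57598, 7.29712, 8.3473]
autratio_p = [0.61763, 0.16533, 0.02566, 0.00501, 0.00401]

# As parameter values rise across SNR levels, the bootstrap p-values fall
r, p = stats.pearsonr(autratio, autratio_p)
print(f"r = {r:.3f}, p = {p:.3f}")
```

With these five pairs the coefficient comes out near the -0.900 (p ≈ 0.04) reported above.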
By looking at point ‘x’ on figures 3.2f and 3.2g, it can be deduced that an Fsp critical value of 3.0 and
an Autratio critical value of 5.2 are found for the bootstrap distribution α=5% (p≤0.05). These critical
values are the lowest values to be deemed as significant when applying the bootstrap method to the
objective parameters in order to detect a significant response present.
Figure 3.2f. Graph displaying the relationship between Fsp values and their corresponding p-values at different SNR levels.
Figure 3.2g. Graph displaying the relationship between Autratio values and their corresponding p-values at different SNR levels.
4.0 Discussion
4.1 Comparison between Experts’ Interpretations
The results of this study support the findings of Vidler & Parker (2004), Gans et al. (1992), Kuttva et al. (2009) and Lv et al. (2007) that a high level of variability is present during the subjective interpretation of an ABR using the 3:1 rule proposed by the NHSP.
Agreement was found between the experts when they believed a response was present in the
recording (Y) (86.96%). In contrast, a high level of disagreement was found when reporting a response
as absent (N) (85.19%) and when requesting additional sweeps or a repeat (R) (72.73%). On two
occasions, experts 2, 3 and 4 labelled a waveform as signal present (Y), where expert 1 identified them
as a no response present (N). In addition to this, a Cohen’s Kappa test, which provides a measure of
inter-observer agreement (Jean, 1996), produced a value of 0.334 (p = 0.000). This relatively low Kappa
value strengthens the findings as a low value indicates a poor level of agreement between the experts.
This high level of disagreement may have arisen because experts 2, 3 and 4 were reluctant to identify a response as not present, labelling waveforms 'N' only 2 times and instead labelling them 'R' 30 times; in contrast, expert 1 identified no response (N) 25 times and requested further sweeps (R) 14 times. This inconsistency may have been present because the experts had no control over the acquisition of the data, and therefore may not have wanted to rule out a response as absent (N), making 'R' a popular choice.
Further analysis was conducted in order to address the matter that expert 1 introduced. Expert 1
indicated that he should have marked some waveforms as ‘R’ instead of ‘N’, which may have produced
different findings. By analysing the results again, this time with only two categories, 'Yes' (Y) and 'Not yes' (N or R), a marked reduction in the level of disagreement between experts was found. Experts agreed in 78.87% of cases for a 'Not yes', compared to agreement of only 14.81% and 27.27% with respect to 'N' and 'R'. A higher kappa value was found, 0.663 (p = 0.000), indicating that the level of agreement between the experts had increased compared to using three categories.
However, these findings still express the potential differences between clinicians’ interpretations of
ABR waveforms. Major differences in management options may result as a consequence of the
variability that is present between clinicians. Hence even the smallest differences in results highlight
the need for and desirability of an objective method in order to reduce the level of uncertainty and
variability that may occur during subjective analysis. The results provide evidence that the null
hypothesis; ‘there will not be a significant level of variability between experts when interpreting ABRs’
can be rejected, and the experimental hypothesis accepted.
However, several limitations were present in the method of obtaining the data. Firstly, the experts
may not have been used to analysing graphical ABR data in the form given to them (see figure 2.1b).
In addition, the graphs did not contain labelled axes, which may also have affected their
interpretations. However, none of the experts mentioned that this issue had affected their ability to
interpret results.
Another major limitation is the method of data collection. Ideally, if all experts had collected the data
themselves, this would have produced greater external validity by better replicating a real-world
scenario. Additional information regarding the number of sweeps, arousal of the patient, test
conditions and test parameters would also have better represented real-world practice. As a result of
this lack of information, experts 2, 3 and 4 reported difficulty in ruling out a response as absent.
Further research could address these limitations by allowing the clinicians who collected the data to
analyse it themselves. If this is not possible, the experts should be given full information about the
number of sweeps, arousal of the patient and conditions of testing.
Further research may also address the limitation of comparing interpretations from only four experts.
In addition, the experts were all based in the Southampton area: expert 1 was a lecturer in Audiology
and experts 2, 3 and 4 were experienced clinicians from the Southampton Auditory Implant Service
Centre. This may not accurately represent a typical audiologist working in a paediatric ABR clinic
elsewhere in the UK. Together with the small sample size (n=4), this threatens the external validity of
the findings. It may prove beneficial to include several clinicians from clinics around the UK in order
to examine the variability between experts, and perhaps even between clinics. Furthermore, a greater
sample of ABR data would allow more accurate predictions to be made with greater confidence in the
findings.
4.2 Sensitivity and Specificity of Objective Parameters
A primary aim of this project was to investigate the accuracy of the objective methods by comparing
them with the experts' analysis. To do this, the waveforms for which experts 1, 2, 3 and 4 agreed that
a response was present (n=50) were used as the gold standard.
Firstly, sensitivity and specificity levels were calculated according to the formulae in table 3.1.2b.
When using a critical value of 3.1, Fsp produced a sensitivity of 70%, meaning that 30% of wave V
signals would be missed (not labelled as present). In addition, a specificity of 82.14% was found,
meaning that the Fsp produced only 5 false positives. A considerably higher sensitivity of 88% was
found when only analysing Fsp values deemed significant by the bootstrapping of Fsp, indicating that
applying the bootstrap improved the correct identification rate for wave V. However, a lower
specificity of 67.86% was found, meaning that the bootstrapping of Fsp produced more false positives
(n=9). On balance, these results imply that the advantages of applying the bootstrap method to Fsp
outweigh the disadvantages, as a substantial gain in sensitivity is achieved with only a slight
deterioration in the specificity level.
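The sensitivity and specificity arithmetic can be checked with a short Python sketch. The raw counts below are back-calculated from the percentages reported here (50 expert-agreed responses; 28 non-responses implied by 5 false positives at 82.14% specificity) and should be treated as illustrative rather than as the study's tabulated data.

```python
def sensitivity(tp, fn):
    """True positive rate: responses correctly flagged as present."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: absent responses correctly left unflagged."""
    return tn / (tn + fp)

# 50 expert-agreed responses; 28 non-responses (implied by 5 FP at 82.14%).
fsp_sens  = sensitivity(35, 15)      # Fsp at 3.1: 70% of 50 detected
fsp_spec  = specificity(23, 5)       # 23/28, i.e. 5 false positives
boot_sens = sensitivity(44, 6)       # bootstrapped Fsp: 88% of 50 detected
boot_spec = specificity(19, 9)       # 19/28, i.e. 9 false positives
```

These counts reproduce every percentage quoted in this section, which is a useful internal-consistency check on the reported figures.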
A very high sensitivity of 96% was found for Autratio when using a critical value of 3.0. When
bootstrapping was applied to Autratio, the sensitivity fell by only 4%. However, a considerably higher
specificity was produced by the bootstrapping of Autratio (64.29%) than by analysing Autratio against
a critical value of 3.0 (35.71%). As with Fsp, this means that the advantages of applying the bootstrap
method to Autratio outweigh the disadvantages, as a large gain in specificity is achieved with only a
slight decrease in sensitivity.
Although Fsp and Autratio are calculated very differently from one another, applying bootstrapping
to both parameters produced strikingly similar sensitivity and specificity levels. A sensitivity of 92%
was achieved by the bootstrapping of Autratio, only 4% higher than that found by the bootstrapping
of Fsp. In addition, a specificity of 67.86% was found for the bootstrapping of Fsp, only 3.57% greater
than that produced by the bootstrapping of Autratio. These results imply that the bootstrap method
is applicable to a range of objective parameters and could be used to compare results across clinics.
These findings suggest that the application of the bootstrap method is advantageous for the
sensitivity of Fsp and the specificity of Autratio. However, although improvements were seen in
sensitivity and specificity for both Fsp and Autratio, neither parameter produced an adequate
specificity, the highest being 82.14%, achieved by Fsp using a 3.1 critical value. The highest sensitivity
of 96% was achieved by Autratio using a critical value of 3.0, which, however, yielded a poor specificity
of 35.71%. With the application of the bootstrap, a good balance between sensitivity and specificity
was found for both Fsp and Autratio, which is very desirable. Even so, the sensitivity and specificity
values are still too low for the objective methods to supersede the subjective analysis performed by
clinicians. The results thus provide evidence that the null hypothesis, 'there will not be a significant
difference in detection rates between the parameters Fsp and Autratio', can be rejected.
However, the specificity values may not be fully accurate because of a limitation in the gold standard.
Using the experts as the gold standard may not provide fully accurate information, as in reality the
objective parameters could have correctly indicated a response as present at lower stimulation levels
than the experts. For example, if an objective parameter detected a signal as present but the experts
requested further sweeps or a repetition ('R'), this report treated it as an incorrect detection by the
objective parameter. It is important to note that because a worst-case design was employed when
calculating specificity, the true values could be considerably better. However, labelling waveforms as
'R' was necessary because the experts did not want to rule out a response definitively. As mentioned
previously, if the data had been acquired by the experts themselves, more definite decisions might
have been made about whether a response was present, allowing a better measurement of the
accuracy of the objective parameters.
Furthermore, the relatively low Kappa values and the percentages of agreement and disagreement
discussed in section 3.1.1 provide evidence of considerable inconsistency between the experts when
interpreting waveforms. The experts may have marked a signal as not present when, in reality, it was
present. Thus using the experts as a gold standard may not provide the best benchmark against which
to compare the accuracy of the objective parameters.
Further research should address these limitations by allowing the experts to acquire the data with full
control of all parameters such as the number of sweeps etc. This would result in more accurate
calculations of sensitivity and specificity levels for both objective parameters which may support their
implementation into a clinical environment.
4.3 Correlational Analysis
The Fsp and Autratio values calculated using Matlab R2014a (MathWorks, 2014) were investigated to
identify whether a correlation was present, as the two parameters are independent of each other and
calculated very differently. A Spearman's rho test produced a highly significant result: a correlation
coefficient of 0.752 was found between Fsp and Autratio (p < 0.001). In addition, when comparing Fsp
and Autratio values deemed significant (p≤0.05) by the bootstrap method (n=62), a Spearman's rho
test revealed a correlation coefficient of 0.533 (p < 0.001).
A higher mean value of 5.19 was observed for SigFsp (Fsp values determined significant by the
bootstrap method) compared to 3.8 for Fsp. Similarly, a higher mean of 7.57 was found for SigAutratio
(Autratio values determined significant by the bootstrap method) compared to 6.2 for Autratio. This
indicates that the use of bootstrapping will, on average, select higher parameter values. Furthermore,
the results suggest that the parameter Autratio produced higher values on average compared to those
produced by Fsp.
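Spearman's rho is simply the Pearson correlation of the ranks, which is what makes it appropriate for two parameters measured on very different scales. A self-contained Python sketch (with average ranks for ties, as SPSS-style implementations use) is given below; the four-point demo data are hypothetical.

```python
def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical Fsp-like and Autratio-like values for four waveforms.
rho = spearman_rho([3.1, 0.4, 2.2, 8.9], [5.4, 3.7, 3.5, 11.2])
```

Applied to the Fsp and Autratio columns of Appendix A, a function like this should approximate the correlation reported above.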
Examining the line of best fit in figure 3.1.3a, an Autratio value of 3.0 (the NHSP 3:1 rule) corresponds
to an Fsp value close to 2.0, while an Fsp value of 3.1 (as proposed by Elberling and Don, 1984)
corresponds to an Autratio value close to 5.0. This supports the earlier finding that the Fsp parameter
produces lower values on average than Autratio. If these objective parameters were used in
conjunction in clinics, a significant Autratio value of 3.0 would not always be accompanied by a
significant Fsp value of 3.1. This suggests that the combined use of both parameters may not be viable
with their current critical values.
Furthermore, the results indicate that the null hypothesis; ‘there will not be a significant correlation
between the parameters Fsp and Autratio’ can be rejected, and the experimental hypothesis
accepted.
4.4 Simulated Data
The simulations revealed strong negative correlations between SNR and the p-values of both
parameters. As a value of -1 represents a perfect negative correlation, the results indicate that as SNR
increases, the Fsp and Autratio p-values decrease and significance arising from the bootstrapping
(p≤0.05) is achieved at a greater rate.
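The way a simulation can control the SNR before averaging is sketched below in Python. The sinusoidal template, sweep count and epoch length are hypothetical stand-ins, not the values used in this study, and SNR is defined here as signal RMS divided by noise standard deviation, which may differ from the study's exact definition.

```python
import math
import random

def simulate_sweeps(snr, n_sweeps=500, epoch_len=90, seed=0):
    """Generate sweeps as a fixed template plus Gaussian noise, scaled so that
    signal RMS / noise standard deviation (before any averaging) equals snr.
    Template and counts are illustrative stand-ins."""
    rng = random.Random(seed)
    template = [math.sin(2 * math.pi * t / epoch_len) for t in range(epoch_len)]
    rms = (sum(v * v for v in template) / epoch_len) ** 0.5
    noise_sd = rms / snr                 # e.g. snr = 0.01 -> noise 100x the signal
    return [[v + rng.gauss(0, noise_sd) for v in template]
            for _ in range(n_sweeps)]

sweeps = simulate_sweeps(0.01)           # a low, thesis-like pre-averaging SNR
```

Averaging n such sweeps reduces the noise standard deviation by a factor of sqrt(n), which is why the response only emerges after many sweeps at these SNRs.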
For Autratio p-values at an SNR of 0.002, significance was reached only once (1/10), compared with
10/10 at an SNR of 0.01. This provides further evidence that the bootstrap method is sensitive to
changes in SNR. A repeated-measures ANOVA showed that the effect of SNR on the p-values arising
from the bootstrapping of Fsp and Autratio was statistically significant (p < 0.001), strengthening the
conclusion that the level of SNR has a significant effect on p-values. Thus the null hypothesis, 'the
level of SNR will have no effect on the p-values generated by the bootstrapping of Autratio and Fsp',
can be rejected.
Furthermore, inspection of figure 3.2b shows that lower p-values are produced for Fsp across all SNRs.
This means that using the Fsp parameter will, on average, produce values of greater confidence
(p<0.05) than Autratio, supporting the combined use of Fsp and bootstrapping in clinics.
Significant correlations were found between SNR and both parameters, suggesting a clear relationship
in which parameter values increase with SNR. However, significance was not achieved by an ANOVA
testing the effect of SNR on the values produced by Autratio and Fsp. This indicates that the null
hypothesis, 'the level of SNR will have no effect on the values generated by Fsp and Autratio', cannot
be fully rejected. Furthermore, across all SNRs, Autratio produced higher values than Fsp, indicating
that the critical value for detecting a wave V response may be considerably higher for Autratio than
for Fsp (see figure 3.2d).
Negative correlations were found between parameter values and their corresponding p-values,
suggesting that as parameter values increase, p-values tend towards 0 and significance is achieved
more often. This further supports the use of bootstrapping in clinics, as a stronger parameter value
results in significance being achieved more frequently.
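The single-point bootstrap behind these p-values (Appendix B gives Dr Bell's Matlab version) works by circularly rotating each sweep by a random offset, which destroys the stimulus-locked response while preserving the noise statistics. A minimal Python sketch for Fsp follows; the window positions and the demo data are hypothetical.

```python
import math
import random

def variance(xs):
    """Sample variance (n-1 denominator), matching Matlab's var."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def fsp(sweeps, a_start, a_end):
    """Fsp = N * var(averaged waveform over window) / var(one point across sweeps)."""
    n = len(sweeps)
    mid = (a_start + a_end) // 2
    mean_wave = [sum(s[t] for s in sweeps) / n for t in range(a_start, a_end)]
    return n * variance(mean_wave) / variance([s[mid] for s in sweeps])

def bootstrap_p(sweeps, a_start, a_end, n_boot=99, seed=1):
    """Rotate every sweep by a random circular shift, recompute Fsp, and
    return the fraction of bootstrap values reaching the observed one."""
    rng = random.Random(seed)
    observed = fsp(sweeps, a_start, a_end)
    L = len(sweeps[0])
    exceed = 0
    for _ in range(n_boot):
        rotated = []
        for s in sweeps:
            r = rng.randrange(1, L)          # avoid the zero (identity) shift
            rotated.append(s[r:] + s[:r])
        if fsp(rotated, a_start, a_end) >= observed:
            exceed += 1
    return exceed / n_boot

# Demo on hypothetical data: a sinusoidal "wave V" buried in unit noise.
rng = random.Random(0)
template = [math.sin(2 * math.pi * t / 30) for t in range(60)]
sweeps = [[v + rng.gauss(0, 1.0) for v in template] for _ in range(120)]
p = bootstrap_p(sweeps, 20, 50)   # a small p suggests the response is not noise
```

Because the rotated surrogates keep each sweep's amplitude distribution, the resulting null distribution is specific to the recording, which is exactly the property that makes a single fixed critical value hard to justify.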
Analysing figure 3.2d, specifically point 'y', the 3:1 rule recommended by the NHSP, which should in
theory correspond to an Autratio value of 3.0, equates to an SNR (before averaging) of approximately
0.003. Point 'x' indicates that an SNR of 0.005 equates to the Fsp critical value of 3.1 proposed by
Elberling and Don (1984). However, the lowest SNR at which the p-values generated from the
bootstrapping of Autratio and Fsp fell below 0.05 was 0.006, which is noticeably higher than 0.003
and 0.005. This suggests that critical values of 3.1 for Fsp and 3.0 for Autratio may be too low.
However, only five SNR levels were used in this study, which may explain why these values appear
too low.
From point 'x' marked on figures 3.2f and 3.2g, the minimum value associated with a significant
response (i.e. the critical value) for Fsp is 3.0 for the bootstrap distribution at α=5% (p≤0.05). This
suggested critical value is very similar to the value of 3.1 proposed by Elberling and Don (1984) at
p<0.01. However, it is higher than the value of 1.75 suggested by Lv et al. (2007), which was also
determined from the bootstrap distribution at α=5% (p≤0.05). The differences in Fsp critical values
between this study and those of Lv et al. (2007) and Elberling and Don (1984) support the finding of
Lv et al. that critical values for Fsp vary considerably between recordings, and the idea that universally
valid threshold values for Fsp probably cannot be justified.
This study also found an Autratio critical value of 5.2 for the bootstrap distribution at α=5% (p≤0.05).
Evidently, this is higher than the subjective criterion of a 3:1 signal-to-noise ratio proposed by the
NHSP (which should equate to an Autratio value of 3.0). Thus the null hypothesis, 'the critical value
of Autratio at p≤0.05, for a response detection, will not be 3.0', has to be accepted. Although the
Autratio parameter has been coded to calculate values using the subjective 3:1 criterion, there are
many reasons why a critical value of 3.0 may not have been achieved. One plausible explanation is
that the noise estimate from visual inspection differs from what the Autratio algorithm produces. For
example, if visual inspection tends to overestimate noise compared to the 'true' statistical value, then
for a given recording the ratio (signal divided by noise) will be lower for visual inspection than for the
objective method (assuming the signal peak-to-trough difference is similar for both). So where
subjective visual inspection suggests that clear responses start with ratios around 3.0, the objective
method may calculate the ratio to be around 5.0 because of the difference in noise-level estimates.
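The discrepancy between visual and algorithmic noise estimates is easier to see from the ratio's definition. Below is a minimal Python sketch of the Autratio calculation, mirroring the Appendix B Matlab code: the signal is the peak-to-trough of the grand average over the analysis window, and the noise is the mean absolute difference between two independent sub-averages. The two five-sample sub-averages are hypothetical.

```python
def autratio(avg_a, avg_b, a_start, a_end):
    """SNR in the style of the NHSP 3:1 rule: peak-to-trough of the grand
    average over the analysis window, divided by the mean absolute
    difference between two independent sub-averages (the noise estimate)."""
    window = range(a_start, a_end)
    grand = [(avg_a[t] + avg_b[t]) / 2 for t in window]
    signal = max(grand) - min(grand)                         # peak-to-trough
    noise = sum(abs(avg_a[t] - avg_b[t]) for t in window) / len(grand)
    return signal / noise

# Hypothetical sub-averages: identical wave shape, slightly different noise.
a = [0.0, 1.0, 0.0, -1.0, 0.0]
b = [0.0, 0.8, 0.0, -1.2, 0.0]
r = autratio(a, b, 0, 5)
```

Note how sensitive the result is to the denominator: halving the noise estimate doubles the ratio, which is exactly the mechanism proposed above for why a visually judged 3:1 response can correspond to an algorithmic ratio near 5.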
It is also important to note that the simulations had several limitations. Most importantly, this study
used only five different SNR levels, set before averaging. Future research should use a wider range of
SNRs in order to obtain more accurate information about the relationships and effects between
variables, and to enable more accurate estimates of the critical values of the objective parameters.
The limited range of SNRs may also explain why significance was not achieved for various relationships
and effects between variables. Another limitation is that each SNR was retested only ten times; a
greater number of repetitions would have yielded more accurate and reliable results.
Another limitation was that no artefacts were present in the simulated waveforms. In reality, when
recording an ABR in clinic, several artefacts are present which may influence the interpretation of a
waveform. Consequently, the ecological validity of the simulated findings is weakened.
Lastly, the simulations did not explore the false positive rates of the objective parameters. Future
research should address this by generating waveforms with no wave V present (no-stimulation data)
in order to determine which parameter has the higher false positive rate. Although this study
attempted to explore false positive rates in section 3.1.2, the use of experts as the gold standard
presents a major limitation, as discussed previously.
5.0 Conclusion
To conclude, this report has found results which agree with the reviewed literature regarding the high
level of variability present in ABR analysis. This inconsistency is of great concern, as it may lead to
major differences in the management of a patient. It should be addressed by the NHSP promptly,
perhaps by providing mandatory standardised training programmes to audiologists across the UK.
Neither objective parameter detected responses at a rate of 100% when using the experts as the gold
standard. Fsp produced relatively good sensitivity and specificity with a critical value of 3.1 (Elberling
& Don, 1984), whereas Autratio produced a very low specificity using the critical value of 3.0 based
on the NHSP criterion. This highlights that an Autratio critical value of 3.0 is unsuitable and should be
revised. Furthermore, the advantages of applying the bootstrap method to Fsp and Autratio greatly
outweighed the disadvantages, supporting the combined use of the bootstrap with both parameters.
Further analysis using simulations provided evidence that the bootstrap technique is sensitive to
changes in SNR. SNR also correlated significantly with both objective parameters, and on average
Autratio produced greater values across all SNRs. Based on the findings, this report proposes an Fsp
critical value of 3.0, very similar to previous work (Elberling & Don, 1984). An Autratio critical value of
5.2 was also found, which is considerably higher than the subjective 3:1 criterion proposed by the
NHSP; however, inaccurate estimation of noise by visual inspection may underlie this difference.
Correlational analysis revealed that although Fsp and Autratio are calculated very differently, they are
related to one another. Simulations explored this further and indicated that SNR is the underlying
variable driving the relationship: Fsp and Autratio values both increase with SNR.
The findings of this report suggest that implementing an objective approach to supersede the
subjective method of analysis is not yet a viable option: the Fsp parameter did not achieve an
adequate level of sensitivity and both parameters produced poor specificity levels. Nevertheless, the
use of bootstrapping brought greater advantages than disadvantages in terms of the accuracy of the
objective parameters, which strengthens the case for implementing the bootstrap in ABR clinics and
would allow results to be compared across clinics. This report therefore suggests that the objective
methods should be used alongside subjective analysis to provide confidence to clinicians.
Further research should test the accuracy of the critical values proposed in this report, and should
address the limitations highlighted here so that better comparisons can be made between the
objective parameters. Applying the bootstrap method to other objective parameters, such as Fmp or
the ± difference, may also prove beneficial, as it would allow better comparisons of its usefulness and
inform the decision as to whether an objective method can viably supersede the current subjective
method of analysis.
Works Cited
Arlinger, S. D., 1981. Technical aspects of stimulation, recording, and signal processing. Scandinavian
Audiology, Volume 13, pp. 41-53.
Arnold, S. A., 1985. Objective versus visual detection of the auditory brain stem response. Ear and
Hearing, 6(3), pp. 144-150.
Audacity, 2012. Free Audio Editor and Recorder. [Online]
Available at: http://audacity.sourceforge.net/
[Accessed 04 December 2014].
Besouw, R. V., 2012. Physiological Measurement: Auditory evoked potentials and synchronous
averaging. [Online]
Available at: https://blackboard.soton.ac.uk/bbcswebdav/pid-1704748-dt-content-rid-
859574_1/xid-859574_1
[Accessed 22 November 2014].
Bremner, D. et al., 2012. Audiology Assessment Protocol: Version 4.1, s.l.: BC Early Hearing Program.
Wetherill, G. B. & Levitt, H., 1965. Sequential estimation of points on a psychometric function. British
Journal of Mathematical and Statistical Psychology, 18(1), pp. 1-10.
Wong, P. K. H. & Bickford, R. G., 1980. Brain stem auditory evoked potentials: the use of noise
estimate. Electroencephalogr Clin Neurophysiol, 50(1), pp. 25-34.
7.0 Appendices
7.1 Appendix A: Table displaying all data used from Lightfoot & Stevens' study, the calculated
parameter values, and the experts' interpretations of the data.
Key
y           Agreement of 'yes'
n           Agreement of 'no'
r           Agreement of 'result inconclusive'
'n' or 'r'  Agreement of 'not yes'
p≤0.05      Significant (bootstrapped significance values)

The expert columns are given twice: first using three categories, Yes (y), No (n) and Result
inconclusive (r); then using two categories only, Yes (y) and Not yes (n or r).

Baby  Ear  AC/BC  Freq (kHz)  Intensity (dB nHL)  Fsp  Ratio  Sig Fsp  Sig Ratio  Expert 1 (3-cat)  Experts 2,3,4 (3-cat)  Expert 1 (2-cat)  Experts 2,3,4 (2-cat)
1 Lt A 4 40 4.4251 14.7561 0.000 0.000 y y y y
2 Lt A 4 50 0.364 3.7872 0.599 0.058 n r n r
2 Lt A 4 60 2.221 3.5414 0.002 0.010 r r r r
2 Lt A 4 70 0.8769 4.8766 0.188 0.010 y y y y
2 Lt A 4 80 2.0503 3.0688 0.002 0.094 y r y r
2 Rt A 4 50 0.6172 3.7816 0.273 0.076 r r r r
2 Rt A 4 60 0.6723 4.288 0.237 0.054 n r n r
2 Rt A 4 70 1.5558 5.4421 0.002 0.004 y r y r
3 Lt A 4 40 2.498 5.2218 0.000 0.008 y y y y
3 Lt A 4 50 2.8156 9.5915 0.004 0.000 y y y y
3 Rt A 4 40 4.5966 4.2667 0.000 0.002 y y y y
3 Rt A 4 50 3.0648 16.7977 0.002 0.000 y y y y
4 Lt A 4 80 0.5658 1.629 0.297 0.427 n n n n
4 Rt A 4 70 0.3305 0.764 0.868 0.918 n n n n
4 Rt A 4 80 0.7165 2.7933 0.479 0.305 n r n r
5 Lt A 4 40 7.2178 8.2583 0.000 0.000 y y y y
5 Rt A 4 50 4.8535 12.8006 0.000 0.000 y r y r
6 Lt A 4 40 6.717 9.3483 0.000 0.000 y y y y
6 Rt A 4 40 6.9995 5.8313 0.000 0.000 y y y y
6 Rt A 4 50 14.3761 9.3689 0.000 0.000 y y y y
7 Lt A 1 45 3.1515 7.9225 0.000 0.000 y y y y
7 Lt A 1 55 9.2722 11.2221 0.000 0.000 y y y y
7 Lt A 4 40 16.2599 8.5218 0.000 0.000 y y y y
7 Rt A 4 40 4.2809 6.9783 0.002 0.002 y y y y
7 Rt A 4 50 1.7277 4.113 0.068 0.070 y y y y
8 Lt A 4 50 0.2731 0.5102 0.321 0.100 n r n r
8 Lt A 4 60 2.0884 4.1589 0.006 0.016 n r n r
8 Lt A 4 70 1.941 5.4221 0.008 0.000 n y n y
8 Rt A 4 50 1.1445 7.3446 0.090 0.002 y y y y
8 Rt B 4 30 4.3725 8.4473 0.000 0.000 r y r y
8 Rt A 4 60 6.3957 3.8134 0.000 0.086 y y y y
9 Lt A 4 50 0.3024 3.7901 0.553 0.010 r r r r
9 Lt A 4 60 0.9976 3.7925 0.010 0.006 y y y y
9 Rt A 4 60 1.3887 3.7559 0.014 0.012 r y r y
9 Rt A 4 70 1.9527 4.3338 0.008 0.008 r y r y
9 Rt B 4 30 2.6884 4.7687 0.000 0.044 y y y y
10 Lt A 4 40 5.7471 6.0921 0.000 0.000 y y y y
10 Rt A 4 40 6.8229 10.0162 0.000 0.000 y y y y
10 Rt A 4 50 10.6075 9.7659 0.000 0.000 y y y y
11 Lt A 4 40 2.728 12.8851 0.012 0.000 y y y y
11 Lt A 4 50 3.9165 7.7218 0.000 0.000 y y y y
11 Rt A 4 40 5.2349 7.9004 0.000 0.000 y y y y
11 Rt A 4 50 9.7042 10.5996 0.000 0.000 n r n r
12 Lt A 4 50 0.3048 1.026 0.705 0.643 r r r r
12 Lt A 4 60 0.6246 2.7254 0.311 0.078 y y y y
12 Rt A 4 60 0.9664 4.4995 0.054 0.040 n r n r
12 Rt A 4 65 1.265 3.5409 0.078 0.068 n r n r
13 Lt A 4 50 0.2688 2.9172 0.776 0.249 y y y y
13 Lt A 4 60 1.4746 3.5494 0.000 0.000 y y y y
13 Lt A 4 70 7.0925 11.2886 0.000 0.000 y y y y
13 Rt A 4 40 4.6711 7.4281 0.000 0.000 y y y y
13 Rt A 4 50 6.547 11.9604 0.000 0.000 y y y y
13 Rt A 4 60 10.117 6.7241 0.000 0.000 n r n r
14 Lt A 4 40 2.9965 4.6614 0.002 0.012 n y n y
14 Lt A 4 50 5.5169 5.0416 0.000 0.000 r y r y
14 Rt A 4 40 5.558 6.5864 0.000 0.000 y y y y
14 Rt A 4 50 5.1072 9.4276 0.000 0.000 y y y y
15 Lt A 4 40 7.8613 10.543 0.000 0.000 y y y y
15 Lt A 4 50 7.7214 6.5855 0.000 0.000 y y y y
15 Rt A 4 40 3.0514 9.0503 0.006 0.000 y y y y
15 Rt A 4 50 6.2835 8.0596 0.000 0.000 n r n r
16 Lt A 4 40 0.296 3.1622 0.796 0.164 n r n r
16 Lt A 4 50 0.3806 3.4996 0.669 0.130 n r n r
16 Rt A 4 40 3.4048 2.268 0.006 0.419 n r n r
16 Rt A 4 50 0.9739 2.6267 0.204 0.561 n r n r
16 Rt A 4 60 1.912 3.1217 0.036 0.212 n r n r
17 Lt A 4 40 4.7684 7.0083 0.000 0.000 y y y y
17 Rt A 4 50 8.5826 8.6365 0.000 0.000 y y y y
18 Lt A 4 40 2.5921 5.4859 0.000 0.008 y y y y
18 Lt A 4 50 3.3808 5.5447 0.000 0.002 r r r r
18 Rt A 4 40 1.3159 2.4657 0.022 0.363 n r n r
18 Rt A 4 50 0.4065 1.8605 0.800 0.509 n r n r
19 Lt A 4 30 7.043 8.4987 0.000 0.000 y y y y
19 Rt A 4 50 12.0276 8.1472 0.000 0.000 y y y y
20 Rt A 4 40 0.4241 3.8028 0.721 0.078 n r n r
20 Rt A 4 50 1.163 3.2583 0.028 0.084 n y n y
21 Lt A 4 40 2.8643 6.5816 0.000 0.000 y y y y
21 Lt A 4 50 3.6222 5.7806 0.000 0.000 y y y y
21 Rt A 4 40 2.3649 4.321 0.000 0.014 r y r y
21 Rt A 4 50 2.2128 6.3634 0.000 0.000 r y r y
23 Lt A 4 40 6.3384 4.696 0.000 0.008 y y y y
23 Lt A 4 50 0.6384 2.3934 0.505 0.128 n r n r
23 Rt A 4 40 1.5449 5.7374 0.024 0.026 r y r y
23 Rt A 4 50 4.2412 3.7149 0.000 0.016 y r y r
24 Lt A 4 60 2.8683 10.2549 0.000 0.000 y y y y
24 Rt A 4 65 1.2955 3.8627 0.078 0.048 n r n r
25 Lt A 4 40 1.2817 3.0344 0.126 0.008 r r r r
25 Lt A 4 50 0.7545 4.6986 0.359 0.006 y y y y
25 Rt A 4 40 4.3085 5.7578 0.000 0.008 r y r y
25 Rt A 4 50 4.5883 11.7267 0.000 0.000 y y y y
26 Lt A 4 50 7.0774 7.7962 0.000 0.000 y y y y
26 Rt A 4 40 4.4908 6.6928 0.000 0.000 y y y y
26 Rt A 4 50 8.3415 6.3562 0.000 0.000 y y y y
7.2 Appendix B: Code devised by Dr Bell for use on Matlab. The code calculates Fsp, Autratio and p-values arising from bootstrapping. Graphical representations of ABR waveforms are also produced.
%% 24 7 13. Deleted analysis using random index bootstrap
%% used random rotation from Kimberley - seems to work Ok
%% something funny is going on - why is this showing a drift???
%% INCLUDE INVERSION OF DATA - SEEMS TO BE UPSIDEDOWN
%% 12 8 14 correction added so that mean abs difference used on line 96
%% (3:1 rule) instead of abs of mean difference
% triggers at 0.02s separation
clear;close all
X=wavread('Baby 26 Rt 4k 50dB.wav');  %% ADD FILENAME 1 HERE
x=resample(X,5000,44100);             % resample to 5k
x1=resample(X1,5000,44100);           % resample to 5k (second recording; filter of 1500)
fs=5000;
analysisstart=6;                      % start of 3:1/fsp window (ms)
analysisend=16;                       % end of 3:1/fsp window (ms)
AS=analysisstart/1000*fs;
AE=analysisend/1000*fs;
Amid=fix((AS+AE)/2);

% run recording 1
data=-x(:,1);
triggers=x(:,2);
trigdelay=0.002;                      % length to pause after a trigger
TD=trigdelay*fs;                      % pause in samples
triggerthreshold=0.2;
epoch=0.018*fs;                       % 18 ms window
%rejectT=0.1;                         % level to reject epochs. 0.2 is with no scaling.
rejectT=0.15;                         % seems Ok, about 37.5 microV
%rejectT=5;
N=0;
for i=1:length(triggers)-epoch;
    if triggers(i)>triggerthreshold
        temp=-data(i:i+epoch-1);
        % artefact rejection
        if max(temp)<rejectT;
            if min(temp)>-rejectT;
                N=N+1;
                array(N,:)=temp;
            end
        end
        for n=i+1:i+TD
            triggers(n)=0;            % if a trigger is detected, set next triggers to zero
        end
    end
end
array=array/4000;                     % scale?
scale=linspace(0,18,epoch);           % time axis for plot
abr=-mean(array);
figure;plot(scale,abr);title('Average of 2 responses')

% run recording 2 (same procedure, applied to x1)
data1=-x1(:,1);
triggers1=x1(:,2);
N1=0;
for i=1:length(triggers1)-epoch;
    if triggers1(i)>triggerthreshold
        temp1=-data1(i:i+epoch-1);
        % artefact rejection
        if max(temp1)<rejectT;
            if min(temp1)>-rejectT;
                N1=N1+1;
                array1(N1,:)=temp1;
            end
        end
        for n=i+1:i+TD
            triggers1(n)=0;
        end
    end
end
array1=array1/4000;
abr1=-mean(array1);
hold on
plot(scale,abr1,'c');title('ABR overlay plot')

array2=[array' array1']';             % combine arrays of abrs
abr2=(abr+abr1)/2;                    % average abr and abr1 - gives abr2
figure;plot(scale,abr2)
top=max(abr2(AS:AE));                 % peak in ABR 2
low=min(abr2(AS:AE));                 % trough in ABR 2
diff=mean(abs((abr(AS:AE)-abr1(AS:AE))));  % average difference over window
ratio=(top-low)/diff;                 % 3:1 ratio (assignment implied by its use below)

% FSP calc
upper=var(abr(AS:AE));                % power from 5 to 15 ms; NHSP for click (.005*5000:.015*5000)
lower=var(array(:,Amid));             % power of SP at ms .008*5000
Fsp=upper*N/lower

% Bootstrap Fsp
for n=1:499
    temp=array;
    for i=1:N
        rotate=fix(rand*(epoch-1))+1; % avoid zero by adding 1
        temp(i,:)=[temp(i,rotate+1:epoch) temp(i,1:rotate)];
    end
    bootstrap(n,:)=mean(temp);
    BootFsp(n)=N*var(bootstrap(n,AS:AE))/var(temp(:,Amid));
end
Bootvar=sort(BootFsp);
BootFsp(475);
count1=0;
for i=1:length(BootFsp)
    if Fsp>BootFsp(i)
        count1=count1+1;
    end
end
SIGFSP=1-(count1/length(BootFsp))

% Bootstrap 3:1
for n=1:499
    temp=array;
    for i=1:N
        rotate=fix(rand*(epoch-1))+1;
        temp(i,:)=[temp(i,rotate+1:epoch) temp(i,1:rotate)];
    end
    bootstrap(n,:)=mean(temp);
    abr3=mean(temp);                  % call 'abr3'
    temp1=array1;                     % 2nd array
    for i=1:N1
        rotate=fix(rand*(epoch-1))+1;
        temp1(i,:)=[temp1(i,rotate+1:epoch) temp1(i,1:rotate)];
    end
    bootstrap1(n,:)=mean(temp1);
    abr4=mean(temp1);                 % call 'abr4'
    abr5=(abr3+abr4)/2;
    top=max(abr5(AS:AE));             % peak in ABR 5
    low=min(abr5(AS:AE));             % trough in ABR 5
    % correction to mean abs, not abs mean 5/12/14
    diff=mean(abs((abr3(AS:AE)-abr4(AS:AE))));  % average difference over window
    Bootratio(n)=(top-low)/diff;
end
count2=0;
for i=1:length(Bootratio)
    if ratio>Bootratio(i)
        count2=count2+1;
    end
end
SIGRATIO=1-(count2/length(Bootratio))
break
title('mean response with upper and lower 5th, 1st percentile amplitudes from bootstrap and upper and lower .2%')
7.3 Appendix C: Code devised by Dr Bell for use on Matlab software to analyse simulations. This code allows the user to input an SNR level (before averaging) and then Fsp, Autratio and p-values arising from bootstrapping are calculated. Graphical representations of the waveform are also produced.
%% 24 7 13. Deleted analysis using random index bootstrap
%% used random rotation from Kimberley - seems to work Ok
%% something funny is going on - why is this showing a drift???
%% INCLUDE INVERSION OF DATA - SEEMS TO BE UPSIDEDOWN
%% 12 8 14 correction added so that mean abs difference used on line 96
%% (3:1 rule) instead of abs of mean difference
% triggers at 0.02s separation
% clear;close all
% X=wavread('Baby 19 Rt 4k 50dB.wav'); %% ADD FILENAME 1 HERE
% x=resample(X,5000,44100); % resample to 5k = guy used filter of 1500
% X1=wavread('...'); %% ADD FILENAME 2 HERE
% x1=resample(X1,5000,44100); % resample to 5k = guy used filter of 1500
%% 28 Nov. Read in x and x1 here
fs=5000;
analysisstart=6; % start of 3:1/fsp window (ms)
analysisend=16;  % end of 3:1/fsp window (ms)
AS=analysisstart/1000*fs;
AE=analysisend/1000*fs;
Amid=fix((AS+AE)/2);

% run recording 1
data1=-x1(:,1);
triggers1=x1(:,2);
trigdelay=0.002; % length to pause after a trigger
TD=trigdelay*fs; % pause in samples
triggerthreshold=0.2;
epoch=0.018*fs; % 18 ms window
%rejectT=0.1;  % level to reject epochs. 0.2 is with no scaling.
%rejectT=0.15; % seems Ok about 37.5 microV
%rejectT=5;
rejectT=10000; % no rejection
N1=0;
for i=1:length(triggers1)-epoch;
%for i=1:20000
    if triggers1(i)>triggerthreshold
        temp1=-data1(i:i+epoch-1);
        % artefact rejection
        if max(temp1)<rejectT;
            if min(temp1)>-rejectT;
                N1=N1+1;
                array1(N1,:)=temp1;
            end
        end
        for n=i+1:i+TD
            triggers1(n)=0; % if a trigger is detected, set next triggers to zero
        end
    end
end
array1=array1/4000; % scale?
scale=linspace(0,18,epoch); % time axis for plot
abr1=-mean(array1);
hold on
plot(scale,abr1,'c');title('ABR overlay plot')
array2=[array' array1']'; % combine arrays of abrs
abr2=(abr+abr1)/2; % average abr and abr1 - gives abr2
figure;plot(scale,abr2)
top=max(abr2(AS:AE)); % peak in ABR 2
low=min(abr2(AS:AE)); % trough in ABR 2
diff=mean(abs(abr(AS:AE)-abr1(AS:AE))); % average difference over window
ratio=(top-low)/diff % 3:1 amplitude ratio

% FSP calc
upper=var(abr(AS:AE)); % power from 5 to 15 ms % NHSP for click (.005*5000:.015*5000))
lower=var(array(:,Amid)); % power of SP at ms .008*5000)
Fsp=upper*N/lower

% Bootstrap Fsp
for n=1:499
    temp=array;
    for i=1:N
        rotate=fix(rand*(epoch-1))+1; % avoid zero by adding 1
        temp(i,:)=[temp(i,rotate+1:epoch) temp(i,1:rotate)];
    end
    bootstrap(n,:)=mean(temp);
    BootFsp(n)=N*var(bootstrap(n,AS:AE))/var(temp(:,Amid));
end
Bootvar=sort(BootFsp);
BootFsp(475);
sort(BootFsp);
count1=0;
for i=1:length(BootFsp)
    if Fsp>BootFsp(i)
        count1=count1+1;
    end
end
SIGFSP=1-(count1/length(BootFsp))
% Bootstrap 3:1
for n=1:499
    temp=array;
    for i=1:N
        rotate=fix(rand*(epoch-1))+1; % avoid zero by adding 1
        temp(i,:)=[temp(i,rotate+1:epoch) temp(i,1:rotate)];
    end
    bootstrap(n,:)=mean(temp);
    abr3=mean(temp); % call 'abr3'
    temp1=array1; % 2nd array
    for i=1:N1
        rotate=fix(rand*(epoch-1))+1; % avoid zero by adding 1
        temp1(i,:)=[temp1(i,rotate+1:epoch) temp1(i,1:rotate)];
    end
    bootstrap1(n,:)=mean(temp1);
    abr4=mean(temp1); % call 'abr4'
    abr5=(abr3+abr4)/2;
    top=max(abr5(AS:AE)); % peak in ABR 5
    low=min(abr5(AS:AE)); % trough in ABR 5
    % diff=abs(mean(abr3(AS:AE)-abr4(AS:AE))); % average difference over window
    diff=mean(abs(abr3(AS:AE)-abr4(AS:AE))); % average difference over window
    Bootratio(n)=(top-low)/diff;
end

count2=0;
for i=1:length(Bootratio)
    if ratio>Bootratio(i)
        count2=count2+1;
    end
end
SIGRATIO=1-(count2/length(Bootratio))
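For reference, the Fsp statistic in the listing is the variance of the averaged response over the analysis window, divided by the variance of a single sample point across sweeps, scaled by the number of sweeps. A Python sketch of that calculation follows; the function and argument names are illustrative, and Matlab's var (which normalises by N-1) corresponds to ddof=1 in NumPy.

```python
import numpy as np

def fsp(array, a_start, a_end, a_mid):
    """Fsp as computed in the Matlab listing:
        Fsp = N * var(mean ABR over window) / var(single point across sweeps)
    array is (sweeps x samples); a_start:a_end is the analysis window,
    a_mid the single-point index (names are assumptions, not original)."""
    n_sweeps = array.shape[0]
    abr = array.mean(axis=0)                   # averaged response
    upper = np.var(abr[a_start:a_end], ddof=1) # signal(+residual noise) power
    lower = np.var(array[:, a_mid], ddof=1)    # noise power at one sample point
    return n_sweeps * upper / lower
```

For two sweeps [1, 2, 3] and [2, 4, 6] with the full window and mid-point index 1, this gives Fsp = 2 * 2.25 / 2 = 2.25.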