ICUB TRIES TO PLAY AN ELECTRONIC KEYBOARD
BY
PEIXIN CHANG
THESIS
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electrical and Computer Engineering
in the Undergraduate College of the University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Professor Stephen E. Levinson
Abstract
Can a robot understand natural language? This thesis discusses the first stage of a project that tries to
answer this question through the approach of music performance. Music performance, which is closely related to language acquisition, is one of the most demanding cognitive challenges for the human mind. The
first stage of the project aims at exploring the potential of an iCub humanoid robot, the platform used in
this project, to play an electronic keyboard. The robot is expected to analyze the electronic keyboard with
its visual ability, listen to a sequence of musical notes a human played on the keyboard, and press the
same musical notes in the same order on the keyboard. The three modules built in this project will be
introduced and discussed, and the next stage of the project will be revealed.
Keywords: humanoid robot, music.
Acknowledgments
I would like to thank my advisor, Professor Stephen Levinson, who has supported me throughout my
research and given a lot of constructive advice. I would like to thank Felix Wang, a Ph.D. student, who has always been supportive of me and helped me schedule every experiment on the robot. He always gave me many useful and insightful suggestions, and he was enthusiastic about making my project work. I would like to thank Luke Wendt, who gave me advice on my Motor Control Module. I would also like to
thank Yuchen He, who taught me about YARP and ICUB-main, software for controlling the robot, and
gave me advice for my Computer Vision Module.
Contents
1. Introduction
1.1 Motivation and Aim
1.2 Basic Music Theory and Notation
1.3 Platform
1.4 Keyboard Playing Procedure
2. The Computer Vision Module
2.1 Step 1: Pre-processing
2.2 Step 2: Contours
2.3 Step 3: Post-processing
3. The Pitch Detector Module
3.1 Onset Detection
3.2 Pitch Detection
4. The Motor Control Module
4.1 Arm and Hand Movement
4.2 Cartesian Control
5. Performance and Discussion
6. Conclusion
References
1. Introduction
1.1 Motivation and Aim
Can a robot understand natural language? We believe that if it is going to acquire human language, it has
to build a mental model of the world by physically interacting with the environment around itself. The
interaction enriches the robot’s sensorimotor experiences and helps the robot develop its associative
memory, which is the central component of its cognition. Along with its other cognitive abilities, the robot can then acquire linguistic competence [1]. The members of the Language Acquisition and Robotics Group at UIUC have successfully made the iCub, a humanoid robot, finish tasks such as ball balancing [2] and solving 3D mazes [3]. In both cases, the robot learned fine-motor skills and gained sensorimotor experiences. However, no one in the group has tried a music-related project, which in fact has potential
value. Music performance is one of the most demanding cognitive challenges for the human mind and
basic motor control functions, such as timing, sequencing, and spatial organization of movement, are
required. Fulfilling the requirements of music performance leads to an extraordinary and unique sensory-motor interplay that simple actions, such as reaching and grasping, do not capture [4]. Music is also entangled with the language acquisition process in early life. Brandt et al. [5] describe language as a special type of music and reject the viewpoint that music is a by-product of cognitive and behavioral “technology” adapted for language.
A project with the aim of teaching the iCub robot to play an electronic keyboard was proposed by the
author of this thesis. The robot is expected to analyze the electronic keyboard with its visual ability, listen
to a sequence of musical notes a human played on the keyboard, and perform the same musical notes in
the same order on the keyboard. This project will provide the iCub with a unique opportunity to interact
with the environment in a new way. It is expected that music learning can reveal some clues for the
question of robot language acquisition and maybe provide an analogy to language acquisition. It is also a
challenging project. That is why this project should be divided into several stages. In the first stage, the
stage this paper is about, the potential of the robot to play an electronic keyboard is explored. It is also a good opportunity for its human advisor, the author of this thesis, to become familiar with his student, iCub. No learning techniques are applied in this stage of the project, so that the limits of hard-coded methods can be clearly seen. In the next stage, however, learning algorithms will be applied to the system built in the first stage: iCub will begin learning music, and the skills it learns will be incorporated into an associative memory.
1.2 Basic Music Theory and Notation
This section informally introduces to the readers the basic music theory and music-related notations used
in this paper.
Figure 1: An example keyboard labelled with scientific pitch notation and integer numbers on the pitch axis. C4 is used as the origin.
All the keys in an electronic keyboard are divided into different octaves. The dashed rectangle in Figure 1
identifies one of these octaves. An octave consists of 7 white keys and 5 black keys. The next 12 keys to
the right of the dashed rectangle form another octave. Each white key in an octave can be denoted as a
capital letter A through G with an integer subscript. C4 is the first key in the octave numbered 4. C4 is called middle C, which corresponds to a frequency of about 261.6 Hz. The sharp symbol # means that a black key has a pitch, or frequency, a half step higher than the white key to its left and a half step lower than the white key to its right. In general, a key always has a higher pitch than all the keys to its left and a lower pitch than all the keys to its right. Each key, or pitch, is modeled as a point on the integer number line with middle C as the origin. The key number increases by 1 with each transposition up by a half step. F3, for example, has key number -7 because it is 7 keys to the left of middle C. The speed at which musical notes are played is called the tempo, measured in beats per minute (bpm).
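As an illustration, the mapping between key numbers on this integer pitch axis and scientific pitch notation can be sketched as follows (a minimal sketch, not part of the thesis software; the function name is hypothetical):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(k):
    # k is the key number on the integer pitch axis, with middle C (C4) = 0;
    # floor division handles keys to the left of middle C correctly
    octave = 4 + k // 12
    return NOTE_NAMES[k % 12] + str(octave)

print(note_name(0), note_name(-7), note_name(12))  # → C4 F3 C5
```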
1.3 Platform
The iCub robot was chosen as the platform for this project. It is a child-like humanoid robot designed for
research in embodied cognition. It is expected to interact with humans and the environment and perform
tasks useful for learning as a child does [6]. iCub has 53 degrees of freedom (DOF) allocated to its head,
torso, right and left arms, and right and left legs. Each of its arms including the hand has 16 controlled
DOF, and its head has 6 controlled DOF [7]. iCub is equipped with 2 PointGrey Dragonfly cameras. The
images taken by these cameras have a resolution of 320x240 pixels in RGB color mode. It also has two microphones for ears. All the software for iCub is written in the C++ programming language, and an embedded PC104 card handles all sensory and motor-state information [6]. The researchers in the Language Acquisition and Robotics Group at UIUC call their iCub Bert, short for Bertrand Russell, a great mind who contributed to the early days of cybernetics. Figure 2 shows Bert.
Figure 2: A photo of Bert.
The electronic keyboard Bert is going to play is a Casio CTK-2400 61-key portable keyboard.
1.4 Keyboard Playing Procedure
This project has three modules: the Computer Vision Module, the Pitch Detector Module, and the Motor Control Module. The Computer Vision Module processes visual input from Bert’s eyes,
while the Pitch Detector Module processes auditory input. The Motor Control Module sends commands
to Bert’s joints such that Bert can move his arms and hands to the desired places. All these three modules
interact with each other and with the environment around Bert. The process for Bert to play the sequence
of musical notes specified by the users is illustrated in the process flowchart in Figure 3. Only one of the three modules works in each step; the other two are either waiting for a reply from the working module or for user input. The name of the working module is indicated before the colon in each step.
Figure 3: The process flowchart for playing the electronic keyboard. Reading the flowchart in order, the steps are:

1. Motor Control: Go to the initial pose
2. Motor Control: An image of the keyboard is saved; preparation pose achieved
3. Motor Control: Notify the Computer Vision Module of the saved image
4. Computer Vision: Analyze the image for the 2D pixel coordinates of all keys
5. Computer Vision: Query the Motor Control Module for 3D locations based on the 2D coordinates
6. Motor Control: Reply to the Computer Vision Module with the calculated 3D locations
7. Pitch Detector: Analyze the musical notes a user played on the keyboard
8. Pitch Detector: Inform the Computer Vision Module of the keys played by the user
9. Computer Vision: Send the Motor Control Module the 3D locations of the played keys, in order
10. Motor Control: The right arm reaches the location of the first key the user played
11. Motor Control: Perform a key press
12. Motor Control: The right arm goes back to the preparation pose
13. Motor Control: Repeat the last 3 steps for the rest of the musical notes played by the user

In Chapter 2, the Computer Vision Module will be introduced. It will show the tasks that the Computer Vision Module needs to finish and will reveal the methods used to complete these tasks. Chapter 3 will
introduce the Pitch Detector Module and the algorithm used for pitch detection. The Motor Control Module will be presented in Chapter 4, which will state the expectations for this module and explain how Bert moves his arm to the keys he should play. The methods used in this project will be discussed in Chapter 5. Finally, Chapter 6 concludes this thesis.
2. The Computer Vision Module
The task of this module is to help Bert obtain the 3D location of each electronic keyboard key in his sight. The method used in this module should be able to finish the following tasks, autonomously or with the help of a human:
(a) Filter out irrelevant background objects in sight and keep only electronic keyboard keys
(b) Successfully identify each key that Bert can see on the keyboard
(c) Distinguish between white keys and black keys
(d) Identify which keys are musical C notes (C3, C4, C5, etc.), following piano teaching conventions
(e) Approximate the 3D locations of those keys based on what Bert sees
Bert is expected to be a skilled keyboard player, so no additional markers such as colored tape, symbols,
or characters are attached to a key. Figure 4 shows an image taken by the right-eye camera of Bert. This
image has been converted to grayscale.
Figure 4: A scene that Bert may see when he is going to play the electronic keyboard.
This section presents the method step by step to show how the tasks mentioned above are achieved. The
main idea is to find the contour for each key. The input images must be pre-processed to obtain plausible
contours. These contours also need to be post-processed to achieve all the tasks mentioned earlier.
2.1 Step 1: Pre-processing
As the simplest segmentation process, gray-level thresholding is the transformation of an input image to a
segmented binary image [8]. A threshold value ranges from 0 to 255. For each pixel in the picture, if the
intensity of that pixel is greater than the threshold value, this pixel becomes black. Otherwise, it becomes
white. Thresholding can be effective if the keyboard is evenly lit. The task of choosing a reasonable
threshold value is left to users.
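The inverted gray-level thresholding described above can be sketched as follows (a minimal, pure-Python illustration; the actual module presumably operates on camera images rather than nested lists):

```python
def threshold(image, t):
    # image: 2D list of grayscale intensities in [0, 255].
    # Pixels brighter than t become 0 (black); the rest become 255 (white),
    # matching the inverted binary thresholding described in the text.
    return [[0 if p > t else 255 for p in row] for row in image]

# tiny 2x2 example with the threshold value 218 from Figure 5(a)
binary = threshold([[230, 40], [200, 250]], 218)
print(binary)  # → [[0, 255], [255, 0]]
```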
(a)
(b)
Figure 5: (a) A successful thresholding with a threshold value 218. (b) An unsuccessful thresholding with a threshold value 170.
Figure 5(a) is an example of successful thresholding because each key is clearly separated by black lines and a complete octave is kept. Preserving a complete octave is important for the later steps. Users need to specify two threshold values to draw two white lines and one black line, based on the figure obtained after thresholding (see Figure 5). These two white lines isolate the black-key region from the rest of the black pixels belonging to the background, so that the contour generated later for each black key will not encompass those unwanted black pixels. For convenience and practicability, the percentage of white pixels is calculated for each row of pixels in the image; this prevents the lines from moving freely in the vertical direction and helps attract them to the desired places. Similarly, the black line clearly delimits the white-key region so that no unwanted white pixels will be included.
Figure 6: Based on the binary image obtained earlier, one white line at the top of the black keys, one white line just below the bottom of the black keys, and a black line 5 pixels below the lower white line should be drawn to separate the black-key region from the white-key region.
2.2 Step 2: Contours
Contours for each black key and white key can be generated with the border-following algorithm described in [9]. The algorithm performs a TV raster scan of the input binary image and tries to find a starting point that satisfies certain conditions for border following. From that starting point, it marks the pixels on the border between a connected component of density-0 pixels and a connected component of density-1 pixels. After the entire border is marked, the raster scan resumes and looks for the next starting point. The algorithm stops when the raster scan reaches the lower-right corner of the image. Since the pre-processing steps group the connected components well, the generated contours encompass each key. After the contours are generated from Figure 6, many unwanted contours remain. They come from the residues of keys whose shapes are incomplete due to thresholding. By restricting the minimum contour area, these noisy contours can easily be erased: only the contours with areas larger than the restriction are kept. There is also a restriction on the maximum contour area, which ensures that contours encompassing parts of the background are erased. As shown in Figure 7, there is then a one-to-one correspondence between contours and keys, and each contour can be processed individually.
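The area-based contour filtering can be sketched as follows (an illustrative version that assumes contours are given as lists of pixel vertices; the area is computed with the shoelace formula, which is one way, though not necessarily the thesis's way, of measuring contour area):

```python
def contour_area(points):
    # shoelace formula for the area enclosed by a closed polygonal contour
    area = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def keep_plausible(contours, min_area, max_area):
    # discard residue (too small) and background-sized blobs (too large)
    return [c for c in contours if min_area <= contour_area(c) <= max_area]

key = [(0, 0), (20, 0), (20, 60), (0, 60)]    # a 20x60 white-key-like box
speck = [(0, 0), (2, 0), (2, 2), (0, 2)]      # tiny thresholding residue
print(len(keep_plausible([key, speck], 100, 5000)))  # → 1
```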
(a)
(b)
(c)
Figure 7: After the generation of contours for each key, unwanted contours can be erased if their areas do not satisfy the restriction set by the parameters. The parameters chosen in (a) keep all the contours. In (b), unwanted white key contours are erased by choosing a proper minimum area restriction. (c) shows a plausible processing result. Each red contour corresponds to a white key, and each cyan contour corresponds to a black key.
2.3 Step 3: Post-processing
Bert needs to know at least where he should press within these contours. The mass center of each contour can be calculated using the image moments described in [10]. The result is shown in Figure 8.
Figure 8: Center of mass for each contour marked by purple dots.
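The mass-center computation from image moments can be sketched as follows (a minimal version assuming a key's contour region is given as a list of pixel coordinates; the raw moments M00, M10, and M01 are all that is needed for the centroid):

```python
def mass_center(pixels):
    # raw image moments of a binary region: M00 = area (pixel count),
    # M10 = sum of x coordinates, M01 = sum of y coordinates;
    # the centroid is (M10/M00, M01/M00)
    m00 = len(pixels)
    m10 = sum(x for x, _ in pixels)
    m01 = sum(y for _, y in pixels)
    return (m10 / m00, m01 / m00)

# a 3x3 block of pixels centered at (1, 1)
block = [(x, y) for x in range(3) for y in range(3)]
print(mass_center(block))  # → (1.0, 1.0)
```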
These purple dots in Figure 8 represent the desired positions where Bert needs to press within the
contours. Their 2D pixel coordinates should be saved for later use. The next step is to identify the musical C notes among these dots. The method is to mark every point inside a black-key contour as 0 and every point inside a white-key contour as 1, then sort all the points by their distance from the left edge of the image and number them according to the sorting result. The leftmost point thus has number 1 after the sort. Then, for an octave consisting of 12 keys, the binary sequence must be “101011010101”, and the leftmost “1” in the sequence must be a musical C note.
(a) (b)
Figure 9: (a) White keys are marked as 1 and black keys are marked as 0. The binary sequence for an octave must be "101011010101". (b) Musical C notes are successfully found and marked by white points.
The C note in the incomplete octave in Figure 9(b) can also be easily detected because it is 12 contours away from the first white point. In this project, it is assumed that the first marked white point represents middle C. Thus, the method identifies which contour among all the contours corresponds to middle C.
The final task to finish is to approximate the 3D locations (𝑋, 𝑌, 𝑍) in robot root frame of mass centers for
those contours given their 2D pixel coordinates (𝑢, 𝑣). Each 3D location corresponds to a keyboard key position that Bert is going to press. A pixel coordinate (𝑢, 𝑣) corresponds to a 3D point (𝑥′, 𝑦′, 𝑧′) in the camera frame. The relationship between the 3D point (𝑥′, 𝑦′, 𝑧′) in the camera frame and the 2D pixel coordinates (𝑢, 𝑣) can be expressed as
𝑢 = 𝑓𝑥 𝑥′/𝑧′ + 𝑜𝑥 (2.1)

𝑣 = 𝑓𝑦 𝑦′/𝑧′ + 𝑜𝑦 (2.2)
Equation (2.1) and Equation (2.2) can also be expressed in matrix form,
𝑧′ [𝑢, 𝑣, 1]ᵀ = [𝑓𝑥 0 𝑜𝑥 0; 0 𝑓𝑦 𝑜𝑦 0; 0 0 1 0] [𝑥′, 𝑦′, 𝑧′, 1]ᵀ (2.3)
where 𝑓𝑥, 𝑓𝑦, 𝑜𝑥, and 𝑜𝑦 are camera intrinsic parameters, which can be obtained from the camera calibration files for the iCub robot. The relationship between the camera frame and the robot root frame is described by Equation (2.4),
𝑯root→camera [𝑋, 𝑌, 𝑍, 1]ᵀ = [𝑥′, 𝑦′, 𝑧′, 1]ᵀ (2.4)
where 𝑯root→camera is the homogeneous transformation from the robot root frame to the camera frame. With the help of the iCub Gaze Interface [11], which combines the information of torso joint angles, head joint angles, and other internal robot parameters, 𝑯root→camera can be calculated. Figure 10 shows these two coordinate systems, as provided by [7].
(a)
(b)
Figure 10: (a) Robot camera coordinates. The x-axis is in red. The y-axis is in green. The z-axis is in blue. (b) Robot root reference frame.
The electronic keyboard that Bert is going to play is usually higher than his root reference frame origin, but Equation (2.4) does not consider this fact, so the 𝑋, 𝑌, and 𝑍 values obtained from Equation (2.4) are not 3D coordinates located on the keyboard. This problem can be easily solved if the surface containing all the white keys on the electronic keyboard is modeled as a plane with equation 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 = 0, where 𝑎 = 𝑏 = 0, 𝑐 = 1, and 𝑑 is the height difference between the root reference frame origin and the surface where the white keys are located. The plane for the surface containing all the black keys has a larger height offset 𝑑. The 3D location of each key, projected onto one of these two planes, can be obtained by applying simple geometric properties of triangles and vector addition and subtraction, provided the value of 𝑑 is measured and supplied by the users. At this point, all the tasks listed at the beginning of this chapter are done.
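The projection onto a key plane can be sketched as a ray-plane intersection (a simplified illustration of the triangle-geometry step: it assumes the camera origin and the back-projected ray direction are already expressed in the root frame, and that the key surface is the horizontal plane z = d; the numbers are made up):

```python
def project_to_plane(origin, direction, d):
    # intersect the ray origin + t*direction with the horizontal plane z = d;
    # by similar triangles, t = (d - origin_z) / direction_z
    ox, oy, oz = origin
    dx, dy, dz = direction
    t = (d - oz) / dz
    return (ox + t * dx, oy + t * dy, d)

# a camera 0.4 m above the key plane, ray pointing down and forward
print(project_to_plane((0.0, 0.0, 0.5), (0.5, 0.1, -1.0), 0.1))
```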
Middle C is usually used as a reference in music. It is also the origin of the pitch axis shown in Figure 1. Identifying which of these mass centers corresponds to middle C is a necessary step, because it embeds musical meaning into each mass center point and makes the pitch relationships between the points well defined. Otherwise, each point would only be associated with a 3D location, without any other musical information. Besides, this step simplifies the communication between the Computer Vision Module and the Pitch Detector Module. For example, if a musical note C5 is detected, the Pitch Detector only needs to report a 12, because C5 is 12 keys to the right of middle C; the Computer Vision Module can then efficiently find the mass center corresponding to C5, collect its 3D Cartesian coordinates (𝑈C5, 𝑉C5, 𝑊C5), and send the information to the Motor Control Module.
3. The Pitch Detector Module
In this stage of the project, iCub was not expected to autonomously learn the auditory-motor relationship. Instead, a pitch detector was used. Given a recording, in Microsoft WAVE PCM format, of several musical notes played on an electronic keyboard, this module should inform the Computer Vision Module which musical notes were played and in what order. The input to this module goes through 3 processing steps. In the first step, the module decodes the WAVE format and extracts the sample data. In the second step, it analyzes the sample data and decides when, or at which sample, a new note has been played. In the final step, a pitch detection algorithm is applied to the samples between every two neighboring onset events to compute the frequency of the corresponding musical note.
For testing purposes, a sequence of musical notes was played on the electronic keyboard and recorded as a WAVE sound file with a sampling frequency of 44,100 Hz. The musical notes are C4, D4, E4, F4, and G4. Because the microphones mounted inside the robot are seriously affected by CPU fan noise, an external microphone was used for recording.
Figure 11: A time-domain input that consists of five musical notes with frequencies 261.6 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), from left to right along the timeline. The onset events happen at about the 44,070th, 85,250th, 126,100th, 164,100th, and 205,200th samples. These estimated values were manually selected and are usually greater than the true values. The tempo is about 69 bpm.
The discussion in this chapter will be focused on onset detection, the second step, and pitch detection, the
third step.
3.1 Onset Detection
The aim of onset detection is to find out when each musical note starts [12]. There are many approaches to onset detection, such as the spectral difference method, the phase deviation method, and the negative log-probability method. The procedure used in most of these methods is similar: the original signal is pre-processed first; then a detection function is obtained, whose peaks should coincide with the samples at which musical notes were played; finally, the locations of the onset events are found by applying a peak-picking algorithm [13]. Based on Dixon's experimental results [12], spectral flux turns out to be the method that best satisfies the project's needs.
Short-time Fourier transform (STFT) will be applied first to the time-domain input signals. The discrete-
time STFT can be expressed as
𝑋𝑘(𝑛) = Σ_{𝑚=−𝑁/2}^{𝑁/2−1} 𝑤[𝑚] 𝑥[ℎ𝑛 + 𝑚] 𝑒^(−2𝑗𝜋𝑚𝑘/𝑁) (3.1)
where 𝑛 represents the nth frame, 𝑘 the kth frequency bin, 𝑤[𝑚] the window applied to the input, 𝑁 the window length, and ℎ the hop size, signifying how much 𝑤[𝑚] shifts for a given 𝑛. Spectral flux calculates the difference in magnitude between each frequency bin in the current frame and the corresponding frequency bin in the previous frame, emphasizing only frequencies where the energy increases, and sums the differences over all frequency bins in a frame. It is given by [12]

𝑆𝐹(𝑛) = Σ_{𝑘=−𝑁/2}^{𝑁/2−1} 𝐻(|𝑋(𝑛, 𝑘)| − |𝑋(𝑛 − 1, 𝑘)|) (3.2)
where 𝐻(𝑥) = (𝑥 + |𝑥|)/2. 𝐻(𝑥) makes sure that frequencies where there is a decrease in energy have no effect. Bello et al. [13] suggest the spectral difference method, which is similar to Equation (3.2) and is given by

𝑆𝐷(𝑛) = Σ_{𝑘=−𝑁/2}^{𝑁/2−1} [𝐻(|𝑋(𝑛, 𝑘)| − |𝑋(𝑛 − 1, 𝑘)|)]² (3.3)
The square in Equation (3.3) magnifies the magnitude differences, which is helpful for the peak-picking process, so Equation (3.3) was chosen to implement the onset detection. The result of processing the input mentioned at the beginning of this chapter is shown in Figure 12. A Hamming window of length 𝑁 = 2048 is used, with hop size ℎ = 882.
Figure 12: Detection function obtained by following Equation (3.1) and Equation (3.2) for input signal mentioned at the beginning of the chapter. Peaks are manually selected. The horizontal dash line represents the threshold value.
Because other learning programs will probably replace the pitch detector, and limited time was available for this project, the peak-picking algorithm suggested in [12] was not followed. Instead, a simpler threshold method was applied. The method accepts any data point around the real peak with a magnitude higher than the threshold as a candidate, and always selects the first candidate as the peak, ignoring the next several frames after it. It was found that onset detection gave acceptable results when the threshold value was set to 20,000, as long as the musical notes were played at an adagio tempo (around 70 bpm) or slower. The sample 𝑠 at which an onset event happens can be computed approximately as

𝑠 = ℎ ∗ 𝐹 (3.4)

where ℎ is the hop size and 𝐹 is the frame number. The program reports onset events at approximately the 44,100th, 84,672nd, 125,244th, 164,052nd, and 204,642nd samples. Although these results may differ from the true onset samples, the estimates are enough for dividing the entire recording into five pieces, each containing the samples of only one musical note.
3.2 Pitch Detection
The remaining task after onset detection is to analyze the samples within each piece, as delimited by the estimated onset samples, and to find the fundamental frequency of the samples within each piece. A wide variety of algorithms for pitch detection have been proposed in the literature. Noll [14] suggests the cepstrum pitch determination method. Ross et al. [15] describe the average magnitude difference function (AMDF), a variation of the autocorrelation function (ACF), for pitch detection. The pitch detection algorithm chosen for this project is YIN [16], an improved method based on the classic ACF. This algorithm divides the process of pitch detection into 6 steps. In the first step, the autocorrelation function is calculated. The autocorrelation function 𝑎𝑡(𝜏) of lag 𝜏 at time 𝑡 is defined as
𝑎𝑡(𝜏) = Σ_{𝑛=𝑡+1}^{𝑡+𝑁} 𝑥𝑛 𝑥𝑛+𝜏 (3.5)
where 𝑁 is the integration window size and 𝑥 represents a sample. In this project, the value of 𝑁 is again 882, and the maximum lag is 588, which is 2/3 of 𝑁. The sampling frequency is 44,100 Hz. The pitch period is the delay between zero lag and a strong peak [15]. The fundamental frequency is then the inverse of the period. However, the ACF method alone makes many errors. In the second step, a difference function is calculated, defined as
𝑑𝑡(𝜏) = Σ_{𝑛=1}^{𝑁} (𝑥𝑛 − 𝑥𝑛+𝜏)² = 𝑎𝑡(0) + 𝑎𝑡+𝜏(0) − 2𝑎𝑡(𝜏) (3.6)
which follows, assuming the signal is periodic, from the definition of a periodic function with period 𝑇: 𝑥𝑡 − 𝑥𝑡+𝑇 = 0 for all 𝑡. Multiples of the pitch period will be values of 𝜏 that make 𝑑𝑡(𝜏) zero, or nearly zero because of imperfect periodicity (see Figure 13). The error rate is reduced when the difference function is used, because the ACF is sensitive to amplitude changes. Figure 13 shows the difference function for the F4 in the input file.
Figure 13: The difference function for the F4 in the input file. The two data tips show two dips where 𝑑𝑡(𝜏) is nearly zero, at lag values 126 and 252, respectively. The zero-lag dip has a 𝑑𝑡(𝜏) value of exactly zero. Because of imperfect periodicity, 𝑑𝑡(𝜏) is non-zero at the period of F4.
In step 3, the error rate is further reduced by replacing the difference function with the cumulative mean normalized difference function, which is defined as

𝑑′𝑡(𝜏) = 1 if 𝜏 = 0; 𝑑′𝑡(𝜏) = 𝑑𝑡(𝜏) / [(1/𝜏) Σ_{𝑛=1}^{𝜏} 𝑑𝑡(𝑛)] otherwise (3.7)
The cumulative mean normalized difference function prevents the algorithm from choosing the zero-lag dip. The zero-lag dip also has 𝑑𝑡(𝜏) equal to zero, but using it would not give the correct fundamental frequency. Step 3 also benefits the next step.
Figure 14: The cumulative mean normalized difference function for F4 in the input file. The dashed line indicates the absolute threshold value, 0.2.
Step 4 mainly solves a well-known problem of autocorrelation-based pitch detection. An “octave error” usually happens when one of the higher-order dips is deeper than the dip that indicates the period. In Figure 14, for example, 252 instead of 126 may be chosen as the lag value; the fundamental frequency is then incorrectly estimated as 175 Hz, which is F3, a musical note one octave below F4. In step 4, an absolute threshold is set: all dips below the threshold are considered candidates, and the algorithm chooses the first dip below the threshold. So 126 instead of 252 is chosen as the lag value. The threshold value is chosen to be 0.2 in this project. Step 5, parabolic interpolation, and step 6, best local estimate, suggested in [16], give the algorithm an even lower error rate. It was found that the first four steps already produce results with error rates lower than 5%, which satisfies the project needs; therefore, only the first four steps were implemented. The same steps are applied to each musical note in the input file, and the result is a set containing the fundamental frequencies of all the musical notes played by the users, in order.
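The first four steps, as used in this project, can be sketched as follows (a simplified illustration: steps 1 and 2 are collapsed into the direct form of the difference function, and the parameter values follow the ones quoted above):

```python
import math

def yin_pitch(x, fs, w=882, max_lag=588, threshold=0.2):
    # Step 2: difference function d_t(tau), Equation (3.6), over a window of size w
    d = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        d[tau] = sum((x[n] - x[n + tau]) ** 2 for n in range(w))
    # Step 3: cumulative mean normalized difference function, Equation (3.7)
    dp = [1.0] * (max_lag + 1)
    cum = 0.0
    for tau in range(1, max_lag + 1):
        cum += d[tau]
        dp[tau] = d[tau] * tau / cum if cum > 0 else 1.0
    # Step 4: absolute threshold -- take the first dip below the threshold
    for tau in range(1, max_lag + 1):
        if dp[tau] < threshold:
            while tau + 1 <= max_lag and dp[tau + 1] < dp[tau]:
                tau += 1  # slide down to the bottom of this dip
            return fs / tau
    return 0.0  # no dip found below the threshold

fs = 44100
x = [math.sin(2 * math.pi * 349.23 * n / fs) for n in range(1600)]  # synthetic F4
print(round(yin_pitch(x, fs), 1))
```

On this synthetic F4, the detected lag is 126 samples, matching the dip discussed around Figure 14, so the estimated frequency is close to 349.23 Hz.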
As mentioned in the previous chapter, the Pitch Detector Module should report to the Computer Vision Module which musical notes were played, as integer numbers. This requirement can be met with a lookup
table. The keys for the entries are the true fundamental frequencies of each piano key. The true frequencies 𝑓(𝑛) can be calculated by

𝑓(𝑛) = (2^(1/12))^(𝑛−49) ∗ 440 Hz (3.8)

where 𝑛 is the nth key on an 88-key piano, because pianos are tuned in twelve-tone equal temperament. The content
under each key in that table is the integer number indicating how far left or how far right they are to the
middle C, the origin for the integer number line in Figure 1. Each detected frequency will be compared
with the keys in the lookup table. The content under the key that has the closest value with the detected
frequency will be sent to the Computer Vision Module. Because of this lookup table, the errors produced
in onset detection and pitch detection are in some degree tolerated.
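The lookup can be sketched as below. Only Eq. (3.8) and the closest-key rule come from the text; the function names are illustrative. Middle C (C4) is key 40 on an 88-key piano, so the stored integer is n − 40.

```python
# Hypothetical sketch of the lookup table described above.
def build_table():
    """True fundamental frequency of each of the 88 piano keys, Eq. (3.8)."""
    return {n: 440.0 * (2 ** (1 / 12)) ** (n - 49) for n in range(1, 89)}

def key_offset(detected_hz, table=None):
    """Integer offset from middle C (key 40) of the key whose true
    frequency is closest to the detected frequency."""
    table = table or build_table()
    nearest = min(table, key=lambda n: abs(table[n] - detected_hz))
    return nearest - 40
```

Applied to the detected frequencies reported by the module (262.5 Hz, 294 Hz, 331.5 Hz, 350 Hz, and 393.7 Hz), this mapping yields 0, 2, 4, 5, and 7, illustrating how detection errors of a few hertz are absorbed.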
The Pitch Detector Module reports that there are 5 musical notes in the input file and that their fundamental frequencies are 262.5 Hz, 294 Hz, 331.5 Hz, 350 Hz, and 393.7 Hz. Compared with the true values of the notes recorded in the input file, namely 261.63 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), the results are close enough. Finally, the Pitch Detector Module reports to the Computer Vision Module that the keys played are 0, 2, 4, 5, and 7.
4. The Motor Control Module
This module is designed to accomplish all motor-related tasks. In this project, Bert is expected to play the electronic keyboard with only his right hand; his left hand, left arm, and torso do not move during the performance. This chapter focuses on describing in detail how the last four steps in Figure 3 are carried out. One way to move his arm and hand to the right position and perform a key press is to manually move his arm to an electronic keyboard key, sample all joint angles of his right arm along the trajectory, and do the same for every electronic keyboard key in Bert's sight. When Bert is asked to play a certain key, he then only needs to replay the corresponding trajectory recorded earlier. The resulting arm movement would be human-like and graceful. However, this method fails if the relative position between Bert and the keyboard changes, and this project expects Bert to play the keyboard even if the keyboard's position is different every time. In addition to the difficulties this expectation raises, the fact that iCub v1.0 and v1.1 have no touch sensors on the fingertips further complicates the problem. Moreover, as planned for the next stage of the project, his vision may be primarily occupied by reading musical scores. All these facts suggest that his motions have to depend on accurate position control.
4.1 Arm and Hand Movement
When Bert is going to show his skill at playing an electronic keyboard, he first moves from his initial start-up pose to a preparation pose. He does this by going through a sequence of intermediate poses, following the joint angles specified for his head and both arms at each step. As shown in Figure 15 (b), he moves his left arm backward and his right arm upward so that neither arm blocks his sight when he looks downward at the electronic keyboard in front of him. What he sees could be similar to Figure 4 and is saved as an image during one of the intermediate steps. After he reaches the preparation pose, his end effector is changed from its default, the palm of his right hand, to the tip of his right middle finger; the iCub Cartesian Interface [17] helps achieve this. Because the motor at the proximal end of his index finger does not respond, the middle finger is used instead to perform key presses.
Figure 15: (a) The initial pose after Bert's start-up. (b) The preparation pose, which Bert returns to before playing each note on the keyboard.
Bert should be in his preparation pose every time he is about to play a note (see Figure 15 (b)). He is first asked to reach with his right middle finger to a position several centimeters directly above the key he will play. While moving his right arm and hand to that position, he keeps the hand pose of his preparation pose, although his fingers may point in different directions. Because the palm of his right hand always faces down, he can press the desired key with his right middle finger once he has reached the position above it. While his middle finger presses down, the other fingers do not move. After finishing the key press, he raises his middle finger and moves his arm and hand back to the preparation pose. He repeats the same procedure from the preparation pose for the next musical note until all the notes are played.
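The press cycle above can be summarized as a waypoint list. The sketch below is illustrative only: the `press_waypoints` helper is hypothetical, the actual commands to the iCub Cartesian Interface are abstracted away, and the preparation-pose coordinates are taken from the C5 example in Section 4.2.

```python
# Preparation-pose fingertip position (m) in the robot root frame (Sec. 4.2).
PREP = (-0.3469, 0.2734, 0.1749)
HOVER = 0.05  # hover height above the key (m)

def press_waypoints(key_xyz, prep=PREP, hover=HOVER):
    """Fingertip waypoints for one key press: preparation pose ->
    above the key -> key press -> back above the key -> preparation pose."""
    x, y, z = key_xyz
    above = (x, y, z + hover)
    return [prep, above, key_xyz, above, prep]
```

On the real robot, each waypoint would be handed to the Cartesian controller as a reaching target, with the palm constrained to face downward as described above.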
4.2 Cartesian Control
Given the 3D Cartesian coordinates of the point several centimeters above the key to be played and the 3D Cartesian coordinates of his end effector in the preparation pose, Bert needs to perform a point-to-point motion for each musical note. Since the 3D Cartesian coordinates of the target are known, the joint configuration for that position can be approximated by inverse kinematics, provided that the coordinates of the keys lie in the workspace and that the inverse kinematics calculations have solutions. Bert can then move his arm and hand from the starting position to the target position by changing the joint angles from those at the starting point to those at the end point.
However, this type of movement is well known to suffer from problems such as multiple solutions and unsafe paths. Fortunately, even though inverse kinematics may return multiple joint configurations, the resulting arm and hand movements are still reasonable enough for Bert to finish the reaching task, especially since the palm of his right hand is required to face downward. Besides, the workspace is almost free of obstacles except for the electronic keyboard, so the safety of the path his arm and hand follow is a secondary issue.
Figure 16 shows the paths that Bert's right middle finger followed to play four musical notes on the electronic keyboard in an experiment. The four notes were played in the order C4, G4, C5, and A4.
Figure 16: Paths that Bert's right middle finger followed to play the four musical notes C4, G4, C5, and A4 on the electronic keyboard. The samples were recorded in the robot root reference frame (see Figure 10 (b)). The three data tips show the coordinates of three significant points along the path for playing C5.
The path for C5, the curve farthest from the X axis, is used as an example for illustrating the figure. In his preparation pose, Bert's middle fingertip had 3D Cartesian coordinates (-0.3469 m, 0.2734 m, 0.1749 m) in the robot root reference frame. The end effector moved toward a position about 5 centimeters above the key C5, at coordinates (-0.312 m, 0.1821 m, 0.06803 m). A key press was performed after the end effector had reached just above the desired key, and the coordinates of the end effector changed to (-0.2866 m, 0.1784 m, 0.01876 m). Finally, the robot moved back to the preparation pose along the same path.
5. Performance and Discussion
This short chapter discusses the methods used in this project, especially the potential problems of the three modules. The method used in the Computer Vision Module depends heavily on the relative positions between the dots inside the contours of black keys and the dots inside the contours of white keys. Since Bert does not see the keyboard from a perfect orthographic view, this method can return a wrong result due to the convergence effect of parallel lines. Because Bert sees only one entire octave after thresholding, the problem did not occur in the experiments. When it does occur, however, it makes Bert fail to recognize C notes and, more importantly, play the wrong key.
Figure 17: The white points shown in the right image are believed to be the musical C notes. The method gives a wrong result because of the convergence effect. In the yellow circles in the left image, the black key, which should be to the left of a white key, has its center of mass to the right of that white key. In addition, the order of the pitches differs from Figure 1 and is scrambled.
One solution could be to calculate a homography matrix 𝐻 [18]; at least four corresponding points must be provided for this method to work. Another solution is to use a C note that is detected correctly. The second white point, for example, is correct because the center part of the image is less distorted by the effect. Representing the black and white keys as a binary number eases the process of error checking: the correct pattern of an octave, "101011010101", together with a correctly detected C note can be used as a reference to fix the wrong pattern.
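The binary-pattern check can be sketched as a cyclic pattern match. The minimum-mismatch rule below is one plausible realization of the correction described above, not the module's implementation, and `c_positions` is an illustrative name.

```python
# One octave starting at C: 1 = white key, 0 = black key.
REF = "101011010101"

def c_positions(detected):
    """Slide the reference pattern over the detected white/black pattern
    and return the indices that must be C notes, using the cyclic shift
    with the fewest mismatches."""
    best_shift, best_err = 0, len(detected) + 1
    for shift in range(12):
        err = sum(bit != REF[(i + shift) % 12] for i, bit in enumerate(detected))
        if err < best_err:
            best_shift, best_err = shift, err
    # Index i is a C exactly when it aligns with position 0 of REF.
    return [i for i in range(len(detected)) if (i + best_shift) % 12 == 0]
```

Because the correct shift is chosen globally over the whole visible pattern, a single corrupted bit caused by the convergence effect does not change which keys are identified as C notes.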
It was found that the problems in the Pitch Detector Module usually relate to onset detection, which has low accuracy in finding the onset samples when the notes are played above 90 bpm. The problem probably stems from the naïve peak-picking method used for onset detection, so a more reliable peak-picking algorithm may solve it. In Figure 18, for example, the method believes that frame 142 has a higher magnitude than frame 143 does.
Figure 18: An enlarged view of the portion of the detection function corresponding to the fourth musical note, F4. Data points near the peak may have very close magnitude values.
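A more robust peak picker of the kind suggested above might require each reported onset frame to be both a local maximum and sufficiently above the local mean. The function and its parameters below are illustrative assumptions, not the module's implementation.

```python
def pick_peaks(det, w=3, delta=0.1):
    """Report frame n as an onset only if det[n] is the maximum over a
    +/- w neighborhood AND exceeds the local mean by delta, so that
    near-equal neighboring frames are not spuriously reported."""
    peaks = []
    for n in range(w, len(det) - w):
        window = det[n - w:n + w + 1]
        if det[n] == max(window) and det[n] >= sum(window) / len(window) + delta:
            peaks.append(n)
    return peaks
```

The local-maximum test resolves the frame 142 vs. frame 143 ambiguity in favor of the single highest frame, and the margin over the local mean suppresses flat, noisy regions of the detection function.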
In the Motor Control Module, although inverse kinematics generates plausible joint configurations and the path taken by Bert's arm and hand is safe and natural, this simple method may not lead to natural and reasonable arm and hand movements when the starting point is not fixed at certain Cartesian coordinates. After all, pianists do not return to a preparation pose after pressing each key. More sophisticated motion and trajectory planning algorithms should be used so that Bert can give a more human-like music performance.
6. Conclusion
This paper discusses the first stage of a project that aims at teaching a humanoid robot, Bert, to play an electronic keyboard. The project is expected to provide the iCub with a unique opportunity to interact with the environment around him. The tasks of the first stage were marked as accomplished when Bert accurately pressed the desired electronic keyboard key for the first time, which proved that the system described in this paper works. However, as suggested in Chapter 5, the potential issues of the system also indicate its limitations. In the next stage of the project, Bert is expected to perform better because more modern methods will replace the classic ones. The skills he has learned will be incorporated into an associative memory. The ultimate task will be to map the world around Bert with an associative memory so that symbolic and abstract ideas can be constructed into a language framework.
References
[1] S.E. Levinson, K. Squire, R.S. Lin, and M. McClain, "Automatic language acquisition by an
autonomous robot," in AAAI Spring Symposium on Developmental Robotics, 2005.
[2] A. F. Silver, "Multi-dimensional pre-programmed and learned fine motor control for a humanoid
robot," M.S. Thesis, Urbana, 2012.
[3] V. N. Kamalnath, "Usage of computer vision and machine learning to solve 3D mazes," M.S. Thesis,
Urbana, 2013.
[4] R. J. Zatorre, J. L. Chen and V. B. Penhune, "When the brain plays music: auditory-motor
interactions in music perception and production," Nature Reviews Neuroscience, vol. 8, no. 7, pp.
547-558, 2007.
[5] A. Brandt, M. Gebrian, and L. R. Slevc, "Music and early language acquisition," Frontiers in
psychology, vol. 3, p. 327, 2012.
[6] G. Metta, G. Sandini, D. Vernon, L. Natale, F. Nori, "The iCub humanoid robot: an open platform for
research in embodied cognition," in Proceedings of the 8th Workshop on Performance Metrics for
Intelligent Systems, Gaithersburg, Maryland, 2008.
[7] "Wiki for iCub and Friends," [Online]. Available: http://wiki.icub.org/wiki/Main_Page. [Accessed
March 2017].
[8] M. Sonka, V. Hlavac, and R. Boyle, Image processing, analysis, and machine vision, Cengage
Learning, 2014.
[9] S. Suzuki, "Topological structural analysis of digitized binary images by border following,"
Computer vision, graphics, and image processing, vol. 30, no. 1, pp. 32-46, 1985.
[10] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information
Theory, vol. 8, no. 2, pp. 179-187, 1962.
[11] A. Roncone, U. Pattacini, and L. Natale, "A cartesian 6-DoF gaze controller for humanoid robots," in
Proceedings of Robotics: Science and Systems, Ann Arbor, MI, 2016.
[12] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital
Audio Effects, Montreal, Quebec, Canada, 2006.
[13] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset
detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp.
1035-1047, 2005.
[14] A. M. Noll, "Cepstrum pitch determination," The Journal of the Acoustical Society of America, vol.
41, no. 2, pp. 293-309, 1967.
[15] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference
function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22,
no. 5, pp. 353-362, 1974.
[16] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music,"
The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[17] U. Pattacini, F. Nori, L. Natale, G. Metta, and G. Sandini, "An experimental evaluation of a novel
minimum-jerk cartesian controller for humanoid robots," in IEEE/RSJ International Conference on
Intelligent Robots and Systems, Taipei, Taiwan, 2010.
[18] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library, O'Reilly
Media, Inc., 2008.