ICUB TRIES TO PLAY AN ELECTRONIC KEYBOARD
BY
PEIXIN CHANG
THESIS
Submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electrical and Computer Engineering
in the Undergraduate College of the University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Professor Stephen E. Levinson
Abstract
Can a robot understand natural language? This thesis discusses the first stage of a project that tries to
answer this question through the approach of music performance. Music performance, which is closely related to language acquisition, is one of the most demanding cognitive challenges for the human mind. The
first stage of the project aims at exploring the potential of an iCub humanoid robot, the platform used in
this project, to play an electronic keyboard. The robot is expected to analyze the electronic keyboard with
its visual ability, listen to a sequence of musical notes a human played on the keyboard, and press the
same musical notes in the same order on the keyboard. The three modules built in this project will be
introduced and discussed, and the next stage of the project will be revealed.
Keywords: humanoid robot, music.
Acknowledgments
I would like to thank my advisor, Professor Stephen Levinson, who has supported me throughout my
research and given a lot of constructive advice. I would like to thank Felix Wang, a Ph.D. student, who has always been supportive of me and helped me schedule every experiment on the robot. He always gave me many useful and insightful suggestions, and he was enthusiastic about making my project work. I would like to thank Luke Wendt, who gave me advice on my Motor Control Module. I would also like to
thank Yuchen He, who taught me about YARP and ICUB-main, software for controlling the robot, and
gave me advice for my Computer Vision Module.
Contents
1. Introduction
1.1 Motivation and Aim
1.2 Basic Music Theory and Notation
1.3 Platform
1.4 Keyboard Playing Procedure
2. The Computer Vision Module
2.1 Step 1: Pre-processing
2.2 Step 2: Contours
2.3 Step 3: Post-processing
3. The Pitch Detector Module
3.1 Onset Detection
3.2 Pitch Detection
4. The Motor Control Module
4.1 Arm and Hand Movement
4.2 Cartesian Control
5. Performance and Discussion
6. Conclusion
References
1. Introduction
1.1 Motivation and Aim
Can a robot understand natural language? We believe that if it is going to acquire human language, it has
to build a mental model of the world by physically interacting with the environment around itself. The
interaction enriches the robot’s sensorimotor experiences and helps the robot develop its associative
memory, which is the central component of its cognition. Along with its other cognitive abilities, the robot can then acquire linguistic competence [1]. The members of the Language Acquisition and Robotics Group at UIUC have successfully made the iCub, a humanoid robot, finish tasks such as ball balancing [2] and solving 3D mazes [3]. In both cases, the robot learned fine-motor skills and gained sensorimotor experiences. However, no one in the group has tried a music-related project, which in fact has potential
value. Music performance is one of the most demanding cognitive challenges for the human mind and
basic motor control functions, such as timing, sequencing, and spatial organization of movement, are
required. Fulfilling the requirements of music performance leads to an extraordinary and unique sensory-motor interplay that simple actions, such as reaching and grasping, do not capture [4]. Music is also entangled with the language acquisition process in early life. Brandt et al. [5] describe language as a special type of music and reject the viewpoint that music is a by-product of cognitive and behavioral “technology” adapted for language.
A project with the aim of teaching the iCub robot to play an electronic keyboard was proposed by the
author of this thesis. The robot is expected to analyze the electronic keyboard with its visual ability, listen
to a sequence of musical notes a human played on the keyboard, and perform the same musical notes in
the same order on the keyboard. This project will provide the iCub with a unique opportunity to interact
with the environment in a new way. It is expected that music learning can reveal some clues for the
question of robot language acquisition and maybe provide an analogy to language acquisition. It is also a
challenging project. That is why this project should be divided into several stages. In the first stage, the
stage this paper is about, the potential of the robot to play an electronic keyboard is explored. It is also a good opportunity for its human advisor, the author of this thesis, to become familiar with his student, iCub. No learning techniques are applied in this stage of the project, so that the limits of hard-coded methods can be clearly seen. In the next stage, however, learning algorithms will be applied to the system built in the first stage: iCub will begin learning music, and the skills it learns will be incorporated into an associative memory.
1.2 Basic Music Theory and Notation
This section informally introduces to the readers the basic music theory and music-related notations used
in this paper.
Figure 1: An example keyboard labelled with scientific pitch notation and integer numbers on the pitch axis. C4 is used as the origin.
All the keys in an electronic keyboard are divided into different octaves. The dashed rectangle in Figure 1
identifies one of these octaves. An octave consists of 7 white keys and 5 black keys. The next 12 keys to
the right of the dashed rectangle form another octave. Each white key in an octave can be denoted as a
capital letter A through G with an integer subscript. C4 is the first key in the octave numbered 4. C4 is called middle C, which corresponds to a frequency of about 261.6 Hz. The sharp symbol # means that a black key has a pitch, or frequency, a half step higher than the white key to its left and a half step lower than the white key to its right. In general, a key always has a higher pitch than all the keys to its left and a lower pitch than all the keys to its right. Each key, or pitch, is modeled as a point on the integer number line with middle C as the origin. The key number increases by 1 with each transposition up by a half step. F3, for example, has key number -7 because it is 7 keys to the left of middle C. The speed at which musical notes are played is called the tempo, measured in beats per minute (bpm).
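As an illustration, the mapping between key numbers on this integer pitch axis and scientific pitch notation can be sketched as follows (a minimal sketch, not part of the thesis software; the function name is hypothetical):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(k):
    # k is the key number on the integer pitch axis, with middle C (C4) = 0;
    # floor division handles keys to the left of middle C correctly
    octave = 4 + k // 12
    return NOTE_NAMES[k % 12] + str(octave)

print(note_name(0), note_name(-7), note_name(12))  # → C4 F3 C5
```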
1.3 Platform
The iCub robot was chosen as the platform for this project. It is a child-like humanoid robot designed for
research in embodied cognition. It is expected to interact with humans and the environment and perform
tasks useful for learning as a child does [6]. iCub has 53 degrees of freedom (DOF) allocated to its head,
torso, right and left arms, and right and left legs. Each of its arms including the hand has 16 controlled
DOF, and its head has 6 controlled DOF [7]. iCub is equipped with 2 PointGrey Dragonfly cameras. The
images taken by these cameras have a resolution of 320x240 pixels in RGB color mode. It also has two microphones for ears. All the software for iCub is written in the C++ programming language, and an embedded PC104 card handles all sensory and motor-state information [6]. The researchers in the Language Acquisition and Robotics Group at UIUC call their iCub Bert, short for Bertrand Russell, a great mind who contributed to the early days of cybernetics. Figure 2 shows Bert.
Figure 2: A photo of Bert.
The electronic keyboard Bert is going to play is a Casio CTK-2400 61-key portable keyboard.
1.4 Keyboard Playing Procedure
This project has three modules: the Computer Vision Module, the Pitch Detector Module, and the Motor Control Module. The Computer Vision Module processes visual input from Bert’s eyes,
while the Pitch Detector Module processes auditory input. The Motor Control Module sends commands
to Bert’s joints such that Bert can move his arms and hands to the desired places. All these three modules
interact with each other and with the environment around Bert. The process for Bert to play the sequence
of musical notes specified by the users is illustrated in the process flowchart in Figure 3. Only one of the three modules works in each step; the other two are either waiting for a reply from the working module or for user input. The name of the working module is indicated before the colon in each step.
Figure 3: The process flowchart for playing the electronic keyboard. Reading the flowchart in order, the steps are:

1. Motor Control: Go to the initial pose
2. Motor Control: An image of the keyboard is saved; preparation pose achieved
3. Motor Control: Notify the Computer Vision Module of the saved image
4. Computer Vision: Analyze the image for the 2D pixel coordinates of all keys
5. Computer Vision: Query the Motor Control Module for 3D locations based on the 2D coordinates
6. Motor Control: Reply to the Computer Vision Module with the calculated 3D locations
7. Pitch Detector: Analyze the musical notes a user played on the keyboard
8. Pitch Detector: Inform the Computer Vision Module of the keys played by the user
9. Computer Vision: Send the Motor Control Module the 3D locations of the played keys, in order
10. Motor Control: The right arm reaches the location of the first key the user played
11. Motor Control: Perform a key press
12. Motor Control: The right arm goes back to the preparation pose
13. Motor Control: Repeat the last 3 steps for the rest of the musical notes played by the user

In Chapter 2, the Computer Vision Module will be introduced. It will show the tasks that the Computer Vision Module needs to finish and will reveal the methods used to complete these tasks. Chapter 3 will
introduce the Pitch Detector Module and the algorithm used for pitch detection. The Motor Control Module will be presented in Chapter 4, which will state the expectations for this module and explain how Bert moves his arm to the keys he should play. The methods used in this project will be discussed in Chapter 5. Finally, Chapter 6 concludes this thesis.
2. The Computer Vision Module
The task of this module is to help Bert obtain the 3D location of each electronic keyboard key in his sight. The method used in this module should be able to finish the following tasks, autonomously or with the help of a human:
(a) Filter out irrelevant background objects in sight and keep only electronic keyboard keys
(b) Successfully identify each key that Bert can see on the keyboard
(c) Distinguish between white keys and black keys
(d) Identify which keys are musical C notes (C3, C4, C5, etc.), following piano teaching conventions
(e) Approximate the 3D locations of those keys based on what Bert sees
Bert is expected to be a skilled keyboard player, so no additional markers such as colored tape, symbols,
or characters are attached to a key. Figure 4 shows an image taken by the right-eye camera of Bert. This
image has been converted to grayscale.
Figure 4: A scene that Bert may see when he is going to play the electronic keyboard.
This section presents the method step by step to show how the tasks mentioned above are achieved. The
main idea is to find the contour for each key. The input images must be pre-processed to obtain plausible
contours. These contours also need to be post-processed to achieve all the tasks mentioned earlier.
2.1 Step 1: Pre-processing
As the simplest segmentation process, gray-level thresholding is the transformation of an input image to a
segmented binary image [8]. A threshold value ranges from 0 to 255. For each pixel in the picture, if the
intensity of that pixel is greater than the threshold value, this pixel becomes black. Otherwise, it becomes
white. Thresholding can be effective if the keyboard is evenly lit. The task of choosing a reasonable
threshold value is left to users.
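The inverted gray-level thresholding described above can be sketched as follows (a minimal, pure-Python illustration; the actual module presumably operates on camera images rather than nested lists):

```python
def threshold(image, t):
    # image: 2D list of grayscale intensities in [0, 255].
    # Pixels brighter than t become 0 (black); the rest become 255 (white),
    # matching the inverted binary thresholding described in the text.
    return [[0 if p > t else 255 for p in row] for row in image]

# tiny 2x2 example with the threshold value 218 from Figure 5(a)
binary = threshold([[230, 40], [200, 250]], 218)
print(binary)  # → [[0, 255], [255, 0]]
```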
(a)
(b)
Figure 5: (a) A successful thresholding with a threshold value 218. (b) An unsuccessful thresholding with a threshold value 170.
Figure 5(a) is an example of successful thresholding because each key is clearly separated by black lines and a complete octave is kept. Preserving a complete octave is important for the later steps. Users need to specify two threshold values to draw two white lines and one black line, based on the figure obtained after thresholding (see Figure 5). These two white lines isolate the black-key region from the rest of the black pixels belonging to the background, so that the contour generated later for each black key will not encompass those unwanted black pixels. For convenience and practicability, the percentage of white pixels is calculated for each row of pixels in the image; this prevents the lines from moving freely in the vertical direction and helps attract them to the desired places. Similarly, the black line clearly delimits the white-key region so that no unwanted white pixels will be included.
Figure 6: Based on the binary image obtained earlier, one white line at the top of the black keys, one white line just below the bottom of the black keys, and a black line 5 pixels below the lower white line should be drawn to separate the black-key region from the white-key region.
2.2 Step 2: Contours
Contours for each black key and white key can be generated with the border-following algorithm described in [9]. The algorithm performs a TV raster scan of the input binary image and tries to find a starting point that satisfies certain conditions for border following. From that starting point, it marks the pixels on the border between a connected component of density-0 pixels and a connected component of density-1 pixels. After the entire border is marked, the raster scan resumes and looks for the next starting point. The algorithm stops when the raster scan reaches the lower-right corner of the image. Since the pre-processing steps group the connected components well, the generated contours encompass each key. After the contours are generated from Figure 6, many unwanted contours remain. They come from the residues of keys whose shapes are incomplete due to thresholding. By restricting the minimum contour area, these noisy contours can easily be erased: only the contours with areas larger than the restriction are kept. There is also a restriction on the maximum contour area, which ensures that contours encompassing parts of the background are erased. As shown in Figure 7, there is then a one-to-one correspondence between contours and keys, and each contour can be processed individually.
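The area-based contour filtering can be sketched as follows (an illustrative version that assumes contours are given as lists of pixel vertices; the area is computed with the shoelace formula, which is one way, though not necessarily the thesis's way, of measuring contour area):

```python
def contour_area(points):
    # shoelace formula for the area enclosed by a closed polygonal contour
    area = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def keep_plausible(contours, min_area, max_area):
    # discard residue (too small) and background-sized blobs (too large)
    return [c for c in contours if min_area <= contour_area(c) <= max_area]

key = [(0, 0), (20, 0), (20, 60), (0, 60)]    # a 20x60 white-key-like box
speck = [(0, 0), (2, 0), (2, 2), (0, 2)]      # tiny thresholding residue
print(len(keep_plausible([key, speck], 100, 5000)))  # → 1
```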
(a)
(b)
(c)
Figure 7: After the generation of contours for each key, unwanted contours can be erased if their areas do not satisfy the restriction set by the parameters. The parameters chosen in (a) keep all the contours. In (b), unwanted white key contours are erased by choosing a proper minimum area restriction. (c) shows a plausible processing result. Each red contour corresponds to a white key, and each cyan contour corresponds to a black key.
2.3 Step 3: Post-processing
Bert needs to know at least where he should press within these contours. The mass center of each contour can be calculated using the image moments described in [10]. The result is shown in Figure 8.
Figure 8: Center of mass for each contour marked by purple dots.
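The mass-center computation from image moments can be sketched as follows (a minimal version assuming a key's contour region is given as a list of pixel coordinates; the raw moments M00, M10, and M01 are all that is needed for the centroid):

```python
def mass_center(pixels):
    # raw image moments of a binary region: M00 = area (pixel count),
    # M10 = sum of x coordinates, M01 = sum of y coordinates;
    # the centroid is (M10/M00, M01/M00)
    m00 = len(pixels)
    m10 = sum(x for x, _ in pixels)
    m01 = sum(y for _, y in pixels)
    return (m10 / m00, m01 / m00)

# a 3x3 block of pixels centered at (1, 1)
block = [(x, y) for x in range(3) for y in range(3)]
print(mass_center(block))  # → (1.0, 1.0)
```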
These purple dots in Figure 8 represent the desired positions where Bert needs to press within the
contours. Their 2D pixel coordinates should be saved for later use. The next step is to identify the musical C notes among these dots. The method is to mark every point inside a black-key contour as 0 and every point inside a white-key contour as 1, then sort all the points by their distance from the left edge of the image and number them according to the sorting result. The leftmost point thus has number 1 after the sort. Then, for an octave consisting of 12 keys, the binary sequence must be “101011010101”, and the leftmost “1” in the sequence must be a musical C note.
(a) (b)
Figure 9: (a) White keys are marked as 1 and black keys are marked as 0. The binary sequence for an octave must be "101011010101". (b) Musical C notes are successfully found and marked by white points.
The C note in the incomplete octave in Figure 9(b) can also be easily detected because it is 12 contours away from the first white point. In this project, it is assumed that the first marked white point represents middle C. Thus, the method identifies which contour among all the contours corresponds to middle C.
The final task to finish is to approximate the 3D locations (𝑋, 𝑌, 𝑍) in robot root frame of mass centers for
those contours given their 2D pixel coordinates (𝑢, 𝑣). Each 3D location corresponds to a keyboard key position that Bert is going to press. A pixel coordinate (𝑢, 𝑣) corresponds to a 3D point (𝑥′, 𝑦′, 𝑧′) in the camera frame. The relationship between the 3D point (𝑥′, 𝑦′, 𝑧′) in the camera frame and the 2D pixel coordinates (𝑢, 𝑣) can be expressed as
𝑢 = 𝑓𝑥 𝑥′/𝑧′ + 𝑜𝑥 (2.1)

𝑣 = 𝑓𝑦 𝑦′/𝑧′ + 𝑜𝑦 (2.2)
Equation (2.1) and Equation (2.2) can also be expressed in matrix form,
𝑧′ [𝑢, 𝑣, 1]ᵀ = [𝑓𝑥 0 𝑜𝑥 0; 0 𝑓𝑦 𝑜𝑦 0; 0 0 1 0] [𝑥′, 𝑦′, 𝑧′, 1]ᵀ (2.3)
where 𝑓𝑥, 𝑓𝑦, 𝑜𝑥, and 𝑜𝑦 are camera intrinsic parameters, which can be obtained from the camera calibration files for the iCub robot. The relationship between the camera frame and the robot root frame is described by Equation (2.4),
𝑯root→camera [𝑋, 𝑌, 𝑍, 1]ᵀ = [𝑥′, 𝑦′, 𝑧′, 1]ᵀ (2.4)
where 𝑯root→camera is the homogeneous transformation from the robot root frame to the camera frame. With the help of the iCub Gaze Interface [11], which combines the information of torso joint angles, head joint angles, and other internal robot parameters, 𝑯root→camera can be calculated. Figure 10 shows these two coordinate systems, as provided by [7].
(a)
(b)
Figure 10: (a) Robot camera coordinates. The x-axis is in red. The y-axis is in green. The z-axis is in blue. (b) Robot root reference frame.
The electronic keyboard that Bert is going to play is usually higher than his root reference frame origin, but Equation (2.4) does not consider this fact, so the 𝑋, 𝑌, and 𝑍 values obtained from Equation (2.4) are not 3D coordinates located on the keyboard. This problem can be easily solved if the surface containing all the white keys on the electronic keyboard is modeled as a plane with equation 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 = 0, where 𝑎 = 𝑏 = 0, 𝑐 = 1, and 𝑑 is the height difference between the root reference frame origin and the surface where the white keys are located. The plane for the surface containing all the black keys has a larger height offset 𝑑. The 3D location of each key, projected onto one of these two planes, can be obtained by applying simple geometric properties of triangles and vector addition and subtraction, provided the value of 𝑑 is measured and supplied by the users. At this point, all the tasks listed at the beginning of this chapter are done.
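The projection onto a key plane can be sketched as a ray-plane intersection (a simplified illustration of the triangle-geometry step: it assumes the camera origin and the back-projected ray direction are already expressed in the root frame, and that the key surface is the horizontal plane z = d; the numbers are made up):

```python
def project_to_plane(origin, direction, d):
    # intersect the ray origin + t*direction with the horizontal plane z = d;
    # by similar triangles, t = (d - origin_z) / direction_z
    ox, oy, oz = origin
    dx, dy, dz = direction
    t = (d - oz) / dz
    return (ox + t * dx, oy + t * dy, d)

# a camera 0.4 m above the key plane, ray pointing down and forward
print(project_to_plane((0.0, 0.0, 0.5), (0.5, 0.1, -1.0), 0.1))
```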
Middle C is usually used as a reference in music. It is also the origin of the pitch axis shown in Figure 1. Identifying which of these mass centers corresponds to middle C is a necessary step, because it embeds musical meaning into each mass center point and makes the pitch relationships between the points well defined. Otherwise, each point would only be associated with a 3D location, without any other musical information. Besides, this step simplifies the communication between the Computer Vision Module and the Pitch Detector Module. For example, if a musical note C5 is detected, the Pitch Detector only needs to report a 12, because C5 is 12 keys to the right of middle C; the Computer Vision Module can then efficiently find the mass center corresponding to C5, collect its 3D Cartesian coordinates (𝑈C5, 𝑉C5, 𝑊C5), and send the information to the Motor Control Module.
3. The Pitch Detector Module
In this stage of the project, iCub was not expected to autonomously learn the auditory-motor relationship. Instead, a pitch detector was used. Given a recording, in Microsoft WAVE PCM format, of several musical notes played on an electronic keyboard, this module should inform the Computer Vision Module which musical notes were played and in what order. The input to this module goes through 3 processing steps. In the first step, the module decodes the WAVE format and extracts the sample data. In the second step, it analyzes the sample data and decides when, or at which sample, a new note has been played. In the final step, a pitch detection algorithm is applied to the samples between every two neighboring onset events to compute the frequency of the corresponding musical note.
For testing purposes, a sequence of musical notes was played on the electronic keyboard and recorded as a WAVE sound file with a sampling frequency of 44,100 Hz. The musical notes are C4, D4, E4, F4, and G4. Because the microphones mounted inside the robot are seriously affected by CPU fan noise, an external microphone was used for recording.
Figure 11: A time-domain input that consists of five musical notes with frequencies 261.6 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), from left to right along the timeline. The onset events happen at about the 44,070th, 85,250th, 126,100th, 164,100th, and 205,200th samples. These estimated values were manually selected and are usually greater than the true values. The tempo is about 69 bpm.
The discussion in this chapter will be focused on onset detection, the second step, and pitch detection, the
third step.
3.1 Onset Detection
The aim of onset detection is to find out when each musical note starts [12]. There are many approaches to onset detection, such as the spectral difference method, the phase deviation method, and the negative log-probability method. The procedure used in most of these methods is similar: the original signal is pre-processed first; then a detection function is obtained, whose peaks should coincide with the samples at which musical notes were played; finally, the locations of the onset events are found by applying a peak-picking algorithm [13]. Based on Dixon's experimental results [12], spectral flux turns out to be the method that best satisfies the project's needs.
Short-time Fourier transform (STFT) will be applied first to the time-domain input signals. The discrete-
time STFT can be expressed as
𝑋𝑘(𝑛) = Σ_{𝑚=−𝑁/2}^{𝑁/2−1} 𝑤[𝑚] 𝑥[ℎ𝑛 + 𝑚] 𝑒^(−2𝑗𝜋𝑚𝑘/𝑁) (3.1)
where 𝑛 represents the nth frame, 𝑘 the kth frequency bin, 𝑤[𝑚] the window applied to the input, 𝑁 the window length, and ℎ the hop size, signifying how much 𝑤[𝑚] shifts for a given 𝑛. Spectral flux calculates the difference in magnitude between each frequency bin in the current frame and the corresponding frequency bin in the previous frame, emphasizing only frequencies where the energy increases, and sums the differences over all frequency bins in a frame. It is given by [12]

𝑆𝐹(𝑛) = Σ_{𝑘=−𝑁/2}^{𝑁/2−1} 𝐻(|𝑋(𝑛, 𝑘)| − |𝑋(𝑛 − 1, 𝑘)|) (3.2)
where 𝐻(𝑥) = (𝑥 + |𝑥|)/2. 𝐻(𝑥) makes sure that frequencies where there is a decrease in energy have no effect. Bello et al. [13] suggest the spectral difference method, which is similar to Equation (3.2) and is given by

𝑆𝐷(𝑛) = Σ_{𝑘=−𝑁/2}^{𝑁/2−1} [𝐻(|𝑋(𝑛, 𝑘)| − |𝑋(𝑛 − 1, 𝑘)|)]² (3.3)
The square in Equation (3.3) magnifies the magnitude differences, which is helpful for the peak-picking process, so Equation (3.3) was chosen to implement the onset detection. The result of processing the input mentioned at the beginning of this chapter is shown in Figure 12. A Hamming window of length 𝑁 = 2048 is used, with hop size ℎ = 882.
Figure 12: Detection function obtained by following Equation (3.1) and Equation (3.2) for input signal mentioned at the beginning of the chapter. Peaks are manually selected. The horizontal dash line represents the threshold value.
Because other learning programs will probably replace the pitch detector, and limited time was available for this project, the peak-picking algorithm suggested in [12] was not followed. Instead, a simpler threshold method was applied. The method accepts any data point around the real peak with a magnitude higher than the threshold as a candidate, and always selects the first candidate as the peak, ignoring the next several frames after it. It was found that onset detection gave acceptable results when the threshold value was set to 20,000, as long as the musical notes were played at an adagio tempo (around 70 bpm) or slower. The sample 𝑠 at which an onset event happens can be computed approximately as

𝑠 = ℎ ∗ 𝐹 (3.4)

where ℎ is the hop size and 𝐹 is the frame number. The program reports onset events at approximately the 44,100th, 84,672nd, 125,244th, 164,052nd, and 204,642nd samples. Although these results may differ from the true onset samples, the estimates are enough for dividing the entire recording into five pieces, each containing the samples of only one musical note.
3.2 Pitch Detection
The remaining task after onset detection is to analyze the samples within each piece, as delimited by the estimated onset samples, and to find the fundamental frequency of the samples within each piece. A wide variety of algorithms for pitch detection have been proposed in the literature. Noll [14] suggests the cepstrum pitch determination method. Ross et al. [15] describe the average magnitude difference function (AMDF), a variation of the autocorrelation function (ACF), for pitch detection. The pitch detection algorithm chosen for this project is YIN [16], an improved method based on the classic ACF. This algorithm divides the process of pitch detection into 6 steps. In the first step, the autocorrelation function is calculated. The autocorrelation function 𝑎𝑡(𝜏) of lag 𝜏 at time 𝑡 is defined as
𝑎𝑡(𝜏) = Σ_{𝑛=𝑡+1}^{𝑡+𝑁} 𝑥𝑛 𝑥𝑛+𝜏 (3.5)
where 𝑁 is the integration window size and 𝑥 represents a sample. In this project, the value of 𝑁 is again 882, and the maximum lag is 588, which is 2/3 of 𝑁. The sampling frequency is 44,100 Hz. The pitch period is the delay between zero lag and a strong peak [15]. The fundamental frequency is then the inverse of the period. However, the ACF method alone makes many errors. In the second step, a difference function is calculated, defined as
𝑑𝑡(𝜏) = Σ_{𝑛=1}^{𝑁} (𝑥𝑛 − 𝑥𝑛+𝜏)² = 𝑎𝑡(0) + 𝑎𝑡+𝜏(0) − 2𝑎𝑡(𝜏) (3.6)
which follows, assuming the signal is periodic, from the definition of a periodic function with period 𝑇: 𝑥𝑡 − 𝑥𝑡+𝑇 = 0 for all 𝑡. Multiples of the pitch period will be values of 𝜏 that make 𝑑𝑡(𝜏) zero, or nearly zero because of imperfect periodicity (see Figure 13). The error rate is reduced when the difference function is used, because the ACF is sensitive to amplitude changes. Figure 13 shows the difference function for the F4 in the input file.
Figure 13: The difference function for the F4 in the input file. The two data tips show two dips where 𝑑𝑡(𝜏) is nearly zero, at lag values 126 and 252, respectively. The zero-lag dip has a 𝑑𝑡(𝜏) value of exactly zero. Because of imperfect periodicity, 𝑑𝑡(𝜏) is non-zero at the period of F4.
In step 3, the error rate is further reduced by replacing the difference function with the cumulative mean normalized difference function, which is defined as

𝑑′𝑡(𝜏) = 1 if 𝜏 = 0; 𝑑′𝑡(𝜏) = 𝑑𝑡(𝜏) / [(1/𝜏) Σ_{𝑛=1}^{𝜏} 𝑑𝑡(𝑛)] otherwise (3.7)
The cumulative mean normalized difference function prevents the algorithm from choosing the zero-lag dip. The zero-lag dip also has 𝑑𝑡(𝜏) equal to zero, but using it would not give the correct fundamental frequency. Step 3 also benefits the next step.
Figure 14: The cumulative mean normalized difference function for F4 in the input file. The dashed line indicates the absolute threshold value, 0.2.
Step 4 mainly solves a well-known problem of autocorrelation-based pitch detection. An “octave error” usually happens when one of the higher-order dips is deeper than the dip that indicates the period. In Figure 14, for example, 252 instead of 126 may be chosen as the lag value; the fundamental frequency is then incorrectly estimated as 175 Hz, which is F3, a musical note one octave below F4. In step 4, an absolute threshold is set: all dips below the threshold are considered candidates, and the algorithm chooses the first dip below the threshold. So 126 instead of 252 is chosen as the lag value. The threshold value is chosen to be 0.2 in this project. Step 5, parabolic interpolation, and step 6, best local estimate, suggested in [16], give the algorithm an even lower error rate. It was found that the first four steps already produce results with error rates lower than 5%, which satisfies the project needs; therefore, only the first four steps were implemented. The same steps are applied to each musical note in the input file, and the result is a set containing the fundamental frequencies of all the musical notes played by the users, in order.
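The first four steps, as used in this project, can be sketched as follows (a simplified illustration: steps 1 and 2 are collapsed into the direct form of the difference function, and the parameter values follow the ones quoted above):

```python
import math

def yin_pitch(x, fs, w=882, max_lag=588, threshold=0.2):
    # Step 2: difference function d_t(tau), Equation (3.6), over a window of size w
    d = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        d[tau] = sum((x[n] - x[n + tau]) ** 2 for n in range(w))
    # Step 3: cumulative mean normalized difference function, Equation (3.7)
    dp = [1.0] * (max_lag + 1)
    cum = 0.0
    for tau in range(1, max_lag + 1):
        cum += d[tau]
        dp[tau] = d[tau] * tau / cum if cum > 0 else 1.0
    # Step 4: absolute threshold -- take the first dip below the threshold
    for tau in range(1, max_lag + 1):
        if dp[tau] < threshold:
            while tau + 1 <= max_lag and dp[tau + 1] < dp[tau]:
                tau += 1  # slide down to the bottom of this dip
            return fs / tau
    return 0.0  # no dip found below the threshold

fs = 44100
x = [math.sin(2 * math.pi * 349.23 * n / fs) for n in range(1600)]  # synthetic F4
print(round(yin_pitch(x, fs), 1))
```

On this synthetic F4, the detected lag is 126 samples, matching the dip discussed around Figure 14, so the estimated frequency is close to 349.23 Hz.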
As mentioned in the previous chapter, the Pitch Detector Module should report to the Computer Vision Module which musical notes were played, as integer numbers. This requirement can be met with a lookup
table. The keys for the entries are the true fundamental frequencies of each piano key. The true frequencies 𝑓(𝑛) can be calculated by

𝑓(𝑛) = (2^(1/12))^(𝑛−49) ∗ 440 Hz (3.8)

where 𝑛 is the nth key on an 88-key piano, because pianos are tuned in twelve-tone equal temperament. The content
under each key in that table is the integer number indicating how far left or how far right they are to the
middle C, the origin for the integer number line in Figure 1. Each detected frequency will be compared
with the keys in the lookup table. The content under the key that has the closest value with the detected
frequency will be sent to the Computer Vision Module. Because of this lookup table, the errors produced
in onset detection and pitch detection are in some degree tolerated.
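The lookup can be sketched as below. Only Eq. (3.8) and the closest-key rule come from the text; the function names are illustrative. Middle C (C4) is key 40 on an 88-key piano, so the stored integer is n − 40.

```python
# Hypothetical sketch of the lookup table described above.
def build_table():
    """True fundamental frequency of each of the 88 piano keys, Eq. (3.8)."""
    return {n: 440.0 * (2 ** (1 / 12)) ** (n - 49) for n in range(1, 89)}

def key_offset(detected_hz, table=None):
    """Integer offset from middle C (key 40) of the key whose true
    frequency is closest to the detected frequency."""
    table = table or build_table()
    nearest = min(table, key=lambda n: abs(table[n] - detected_hz))
    return nearest - 40
```

Applied to the detected frequencies reported by the module (262.5 Hz, 294 Hz, 331.5 Hz, 350 Hz, and 393.7 Hz), this mapping yields 0, 2, 4, 5, and 7, illustrating how detection errors of a few hertz are absorbed.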
The Pitch Detector Module reports that there are 5 musical notes in the input file and that their fundamental frequencies are 262.5 Hz, 294 Hz, 331.5 Hz, 350 Hz, and 393.7 Hz. Compared with the true values of the notes recorded in the input file, namely 261.63 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), the results are close enough. Finally, the Pitch Detector Module reports to the Computer Vision Module that the keys played are 0, 2, 4, 5, and 7.
4. The Motor Control Module
This module is designed to accomplish all motor-related tasks. In this project, Bert is expected to play the electronic keyboard with only his right hand; his left hand, left arm, and torso do not move during the performance. This chapter focuses on describing in detail how the last four steps in Figure 3 are carried out. One way to move his arm and hand to the right position and perform a key press is to manually move his arm to an electronic keyboard key, sample all joint angles of his right arm along the trajectory, and do the same for every electronic keyboard key in Bert's sight. When Bert is asked to play a certain key, he then only needs to replay the corresponding trajectory recorded earlier. The resulting arm movement would be human-like and graceful. However, this method fails if the relative position between Bert and the keyboard changes, and this project expects Bert to play the keyboard even if the keyboard's position is different every time. In addition to the difficulties this expectation raises, the fact that iCub v1.0 and v1.1 have no touch sensors on the fingertips further complicates the problem. Moreover, as planned for the next stage of the project, his vision may be primarily occupied by reading musical scores. All these facts suggest that his motions have to depend on accurate position control.
4.1 Arm and Hand Movement
When Bert is going to show his skill at playing an electronic keyboard, he first moves from his initial start-up pose to a preparation pose. He does this by going through a sequence of intermediate poses, following the joint angles specified for his head and both arms at each step. As shown in Figure 15 (b), he moves his left arm backward and his right arm upward so that neither arm blocks his sight when he looks downward at the electronic keyboard in front of him. What he sees could be similar to Figure 4 and is saved as an image during one of the intermediate steps. After he reaches the preparation pose, his end effector is changed from its default, the palm of his right hand, to the tip of his right middle finger; the iCub Cartesian Interface [17] helps achieve this. Because the motor at the proximal end of his index finger does not respond, the middle finger is used instead to perform key presses.
Figure 15: (a) The initial pose after Bert's start-up. (b) The preparation pose, which Bert returns to before playing each note on the keyboard.
Bert should be in his preparation pose every time he is about to play a note (see Figure 15 (b)). He is first asked to reach with his right middle finger to a position several centimeters directly above the key he will play. While moving his right arm and hand to that position, he keeps the hand pose of his preparation pose, although his fingers may point in different directions. Because the palm of his right hand always faces down, he can press the desired key with his right middle finger once he has reached the position above it. While his middle finger presses down, the other fingers do not move. After finishing the key press, he raises his middle finger and moves his arm and hand back to the preparation pose. He repeats the same procedure from the preparation pose for the next musical note until all the notes are played.
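The press cycle above can be summarized as a waypoint list. The sketch below is illustrative only: the `press_waypoints` helper is hypothetical, the actual commands to the iCub Cartesian Interface are abstracted away, and the preparation-pose coordinates are taken from the C5 example in Section 4.2.

```python
# Preparation-pose fingertip position (m) in the robot root frame (Sec. 4.2).
PREP = (-0.3469, 0.2734, 0.1749)
HOVER = 0.05  # hover height above the key (m)

def press_waypoints(key_xyz, prep=PREP, hover=HOVER):
    """Fingertip waypoints for one key press: preparation pose ->
    above the key -> key press -> back above the key -> preparation pose."""
    x, y, z = key_xyz
    above = (x, y, z + hover)
    return [prep, above, key_xyz, above, prep]
```

On the real robot, each waypoint would be handed to the Cartesian controller as a reaching target, with the palm constrained to face downward as described above.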
4.2 Cartesian Control
Given the 3D Cartesian coordinates of the point several centimeters above the key to be played and the 3D Cartesian coordinates of his end effector in the preparation pose, Bert needs to perform a point-to-point motion for each musical note. Since the 3D Cartesian coordinates of the target are known, the joint configuration for that position can be approximated by inverse kinematics, provided that the coordinates of the keys lie in the workspace and that the inverse kinematics calculations have solutions. Bert can then move his arm and hand from the starting position to the target position by changing the joint angles from those at the starting point to those at the end point.
However, this type of movement is well known to suffer from problems such as multiple solutions and unsafe paths. Fortunately, even though inverse kinematics may return multiple joint configurations, the resulting arm and hand movements are still reasonable enough for Bert to finish the reaching task, especially since the palm of his right hand is required to face downward. Besides, the workspace is almost free of obstacles except for the electronic keyboard, so the safety of the path his arm and hand follow is a secondary issue.
Figure 16 shows the paths that Bert's right middle finger followed to play four musical notes on the electronic keyboard in an experiment. The four notes were played in the order C4, G4, C5, and A4.
Figure 16: Paths that Bert's right middle finger followed to play the four musical notes C4, G4, C5, and A4 on the electronic keyboard. The samples were recorded in the robot root reference frame (see Figure 10 (b)). The three data tips show the coordinates of three significant points along the path for playing C5.
The path for C5, the curve farthest from the X axis, is used as an example for illustrating the figure. In his preparation pose, Bert's middle fingertip had 3D Cartesian coordinates (-0.3469 m, 0.2734 m, 0.1749 m) in the robot root reference frame. The end effector moved toward a position about 5 centimeters above the key C5, at coordinates (-0.312 m, 0.1821 m, 0.06803 m). A key press was performed after the end effector had reached just above the desired key, and the coordinates of the end effector changed to (-0.2866 m, 0.1784 m, 0.01876 m). Finally, the robot moved back to the preparation pose along the same path.
5. Performance and Discussion
This short chapter discusses the methods used in this project, especially the potential problems of the three modules. The method used in the Computer Vision Module depends heavily on the relative positions between the dots inside the contours of black keys and the dots inside the contours of white keys. Since Bert does not see the keyboard from a perfect orthographic view, this method can return a wrong result due to the convergence effect of parallel lines. Because Bert sees only one entire octave after thresholding, the problem did not occur in the experiments. When it does occur, however, it makes Bert fail to recognize C notes and, more importantly, play the wrong key.
Figure 17: The white points shown in the right image are believed to be the musical C notes. The method gives a wrong result because of the convergence effect. In the yellow circles in the left image, the black key, which should be to the left of a white key, has its center of mass to the right of that white key. In addition, the order of the pitches differs from Figure 1 and is scrambled.
One solution could be to calculate a homography matrix 𝐻 [18]; at least four corresponding points must be provided for this method to work. Another solution is to use a C note that is detected correctly. The second white point, for example, is correct because the center part of the image is less distorted by the effect. Representing the black and white keys as a binary number eases the process of error checking: the correct pattern of an octave, "101011010101", together with a correctly detected C note can be used as a reference to fix the wrong pattern.
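The binary-pattern check can be sketched as a cyclic pattern match. The minimum-mismatch rule below is one plausible realization of the correction described above, not the module's implementation, and `c_positions` is an illustrative name.

```python
# One octave starting at C: 1 = white key, 0 = black key.
REF = "101011010101"

def c_positions(detected):
    """Slide the reference pattern over the detected white/black pattern
    and return the indices that must be C notes, using the cyclic shift
    with the fewest mismatches."""
    best_shift, best_err = 0, len(detected) + 1
    for shift in range(12):
        err = sum(bit != REF[(i + shift) % 12] for i, bit in enumerate(detected))
        if err < best_err:
            best_shift, best_err = shift, err
    # Index i is a C exactly when it aligns with position 0 of REF.
    return [i for i in range(len(detected)) if (i + best_shift) % 12 == 0]
```

Because the correct shift is chosen globally over the whole visible pattern, a single corrupted bit caused by the convergence effect does not change which keys are identified as C notes.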
It was found that the problems in the Pitch Detector Module usually relate to onset detection, which has low accuracy in finding the onset samples when the notes are played above 90 bpm. The problem probably stems from the naïve peak-picking method used for onset detection, so a more reliable peak-picking algorithm may solve it. In Figure 18, for example, the method believes that frame 142 has a higher magnitude than frame 143 does.
Figure 18: An enlarged view of the portion of the detection function corresponding to the fourth musical note, F4. Data points near the peak may have very close magnitude values.
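A more robust peak picker of the kind suggested above might require each reported onset frame to be both a local maximum and sufficiently above the local mean. The function and its parameters below are illustrative assumptions, not the module's implementation.

```python
def pick_peaks(det, w=3, delta=0.1):
    """Report frame n as an onset only if det[n] is the maximum over a
    +/- w neighborhood AND exceeds the local mean by delta, so that
    near-equal neighboring frames are not spuriously reported."""
    peaks = []
    for n in range(w, len(det) - w):
        window = det[n - w:n + w + 1]
        if det[n] == max(window) and det[n] >= sum(window) / len(window) + delta:
            peaks.append(n)
    return peaks
```

The local-maximum test resolves the frame 142 vs. frame 143 ambiguity in favor of the single highest frame, and the margin over the local mean suppresses flat, noisy regions of the detection function.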
In the Motor Control Module, although inverse kinematics generates plausible joint configurations and the path taken by Bert's arm and hand is safe and natural, this simple method may not lead to natural and reasonable arm and hand movements when the starting point is not fixed at certain Cartesian coordinates. After all, pianists do not return to a preparation pose after pressing each key. More sophisticated motion and trajectory planning algorithms should be used so that Bert can give a more human-like music performance.
6. Conclusion
This paper discusses the first stage of a project that aims at teaching a humanoid robot, Bert, to play an electronic keyboard. The project is expected to provide the iCub with a unique opportunity to interact with the environment around him. The tasks of the first stage were marked as accomplished when Bert accurately pressed the desired electronic keyboard key for the first time, which proved that the system described in this paper works. However, as suggested in Chapter 5, the potential issues of the system also indicate its limitations. In the next stage of the project, Bert is expected to perform better because more modern methods will replace the classic ones. The skills he has learned will be incorporated into an associative memory. The ultimate task will be to map the world around Bert with an associative memory so that symbolic and abstract ideas can be constructed into a language framework.
References
[1] S.E. Levinson, K. Squire, R.S. Lin, and M. McClain, "Automatic language acquisition by an
autonomous robot," in AAAI Spring Symposium on Developmental Robotics, 2005.
[2] A. F. Silver, "Multi-dimensional pre-programmed and learned fine motor control for a humanoid
robot," M.S. Thesis, Urbana, 2012.
[3] V. N. Kamalnath, "Usage of computer vision and machine learning to solve 3D mazes," M.S. Thesis,
Urbana, 2013.
[4] R. J. Zatorre, J. L. Chen and V. B. Penhune, "When the brain plays music: auditory-motor
interactions in music perception and production," Nature Reviews Neuroscience, vol. 8, no. 7, pp.
547-558, 2007.
[5] A. Brandt, M. Gebrian, and L. R. Slevc, "Music and early language acquisition," Frontiers in
psychology, vol. 3, p. 327, 2012.
[6] G. Metta, G. Sandini, D. Vernon, L. Natale, F. Nori, "The iCub humanoid robot: an open platform for
research in embodied cognition," in Proceedings of the 8th Workshop on Performance Metrics for
Intelligent Systems, Gaithersburg, Maryland, 2008.
[7] "Wiki for iCub and Friends," [Online]. Available: http://wiki.icub.org/wiki/Main_Page. [Accessed
March 2017].
[8] M. Sonka, V. Hlavac, and R. Boyle, Image processing, analysis, and machine vision, Cengage
Learning, 2014.
[9] S. Suzuki, "Topological structural analysis of digitized binary images by border following,"
Computer vision, graphics, and image processing, vol. 30, no. 1, pp. 32-46, 1985.
[10] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information
Theory, vol. 8, no. 2, pp. 179-187, 1962.
[11] A. Roncone, U. Pattacini, and L. Natale, "A cartesian 6-DoF gaze controller for humanoid robots," in
Proceedings of Robotics: Science and Systems, Ann Arbor, MI, 2016.
[12] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital
Audio Effects, Montreal, Quebec, Canada, 2006.
[13] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset
detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp.
1035-1047, 2005.
[14] A. M. Noll, "Cepstrum pitch determination," The Journal of the Acoustical Society of America, vol.
41, no. 2, pp. 293-309, 1967.
[15] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference
function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22,
no. 5, pp. 353-362, 1974.
[16] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music,"
The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.
[17] U. Pattacini, F. Nori, L. Natale, G. Metta, and G. Sandini, "An experimental evaluation of a novel
minimum-jerk cartesian controller for humanoid robots," in IEEE/RSJ International Conference on
Intelligent Robots and Systems, Taipei, Taiwan, 2010.
[18] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library, O'Reilly
Media, Inc., 2008.