
Grand digital piano: multimodal transfer of learning of sound and touch

Ernesto Evgeniy Sanches Shayda ([email protected]), Ilkyu Lee ([email protected])

I. MOTIVATION

Musical instruments have evolved over thousands of years, allowing performers to produce almost all imaginable musical sounds of various timbre, pitch, loudness and duration. Among keyboard instruments, the current grand piano mechanism is the result of a series of inventions and improvements and is a very complex mechanical system. During the past decades, a new area of research on creating digital instruments that have the same feel and sound as the original acoustic instruments has gained popularity [1]. There has been partial success in developing a digital piano with realistic sound and touch: most devices recreate the basic properties of the sound, but the models remain far from perfect for the following reasons.

Most digital pianos use sample-based synthesis, meaning that for every piano key a finite number of "sample tones" of a real acoustic instrument are recorded, and the sound produced when a key is pressed is just one of the samples or a combination of them. As a result, the infinite variety of timbres of a real instrument is reduced to just a few, corresponding to the number of samples per key that are pre-recorded.

There are also pianos that use mathematical models of the acoustic instrument to produce sounds without the use of "sample tones" [2]. Theoretically this allows recreating an infinite variety of sounds of the original instrument. However, these models require many assumptions about the internal physics of the piano action mechanism and the sound production process, and they use hand-designed parts that are not guaranteed to be equivalent to the mechanics of a real instrument.

We aim to improve on the methods described above by using machine learning to train a physically-based model of the piano, which will make it possible to produce new sounds, not present in the training data, depending on how the performer plays the instrument.

II. PRIOR WORK

Generating realistic sound data in the form of exact waveforms, without the use of hand-crafted aggregate sound features (such as spectrograms or mel-frequency cepstral coefficients), was made possible by a line of research started by the WaveNet model [3]. Recently, the very detailed NSynth dataset [4] of notes played on different musical instruments was collected, which makes it possible to use models such as WaveNet to generate realistic musical instrument sound. An improved model for musical sound generation was recently proposed [5], utilizing generative adversarial networks to make generated sounds even closer to real ones.

While sound generation algorithms have advanced significantly in recent years, it is difficult to use them in actual digital instruments, because they are concerned with only one aspect: generating realistic sound. However, a very noticeable part of playing an instrument is the connection between the performer's actions and the actual sound that is generated. While existing models support simple conditioning, allowing sounds of varying loudness and timbre to be generated, the exact connection between the performer's physical actions and the resulting sound has not been studied in the context of artificial intelligence models of sound generation.

In this work we close this gap by introducing a multi-modal model that analyzes the performer's actions and the resulting sound jointly, allowing appropriate sound to be generated based on how the performer plays the instrument. Additionally, we develop a transfer of learning method that allows the model to be learned once on very precise equipment and high-quality acoustic instruments, and then applied to different variants of target instruments of different price and quality, aiming to obtain the closest possible sound and feel to the original instrument in all cases.

While the methods can be applied to any musical instrument where the performer's actions and the resulting sound can be captured, in this work we focus on creating a realistic digital piano, which is one of the most used digital instruments.

III. PROBLEM FORMULATION

We consider multiple modalities that are always present in a musical performance. We start from "touch", the way the performer plays, and "sound", the resulting music that is heard. Each modality can be represented as a time series containing information about what is happening at each moment of time. For piano, the "touch" modality consists of the physical state of all keys, namely positions or velocities as a function of time. Additionally, other controls can be recorded as well, such as positions or velocities of the piano pedals.

Considering just these two modalities, it is possible to build a generative model, as long as it is possible to do the following:

1) Precisely record touch and sound during the training phase on a reference instrument.

2) Precisely record touch during the performance or testing phase.

3) Ensure that the meaning of the recorded touch (which can be described by the sound expected to be heard from such a touch) is the same on the reference instrument and on the performance instrument.

Fig. 1. Digital piano sensors setup

We notice that a number of difficulties arise in this setting:

1) First, for recording data from the real instrument, we want to apply non-invasive measurement to prevent destruction or degradation of the instruments while conducting research. This means that we are not going to modify the acoustic instrument and record data from the inside, attach devices to the instrument using adhesives, or use other possibly irreversible techniques.

2) There are multiple ways to measure touch using sensors, which can be grouped into contact-based and non-contact methods. According to the previously stated requirements, for collecting training data on the real instrument we are going to use non-contact methods. The most efficient method in this group is laser distance sensing [6]. The method is very precise, but is usable only in laboratory conditions: arbitrary hand movements can block the laser light and interfere with measurements. As a conclusion, we are going to use laser distance measurement, but only for collecting training data for the "touch" modality.

3) For the testing or performance phase, we want to apply a robust method for measuring the data, so that performers need no special instructions and can use the instrument as they normally would. This prevents the use of non-contact measurement techniques from outside of the instrument. Because of that, for the test phase we consider applying contact-based measurement methods and also modifying the digital instruments for easier measurement.

These considerations show that there will be a difference in the "touch" data between the training and testing phases, requiring the use of transfer of learning methods. We propose to solve this problem by adding new modalities: instead of creating a model "Touch → Sound", we will create a conditional model "Touch → Intermediate modality 1 → Intermediate modality 2 → ... → Sound". Either touch or one of the intermediate modalities may not be available during the training or testing phase; in this case we will train auxiliary models to fill in the missing modalities. After filling in the missing data, we will be able to do a complete transfer of learning from the input of the training stage to whatever input is available in the testing stage, allowing equivalent sounds to be produced.

TABLE I
COMPARISON OF DISTANCE MEASUREMENT SOLUTIONS

Sensor          Standard deviation, mm    Sampling rate, Hz
VL53L0X         < 2.0                     33
RPLidar A1M8    < 0.5                     2000

IV. PRACTICAL IMPLEMENTATION

We consider the following input time series, which are possible to record in training, in testing, or in both phases:

1) Laser distance sensor data. It is possible to record such data both for the source (acoustic) and for the target (digital) instrument. However, during the testing phase the data will not be available for the digital instrument. We are evaluating two sensors for touch modeling:

a) Semiconductor miniature laser distance sensor VL53L0X. It has a default sampling rate of 33 measurements per second and a measurement deviation of around 2% of the measured value. There is also an updated version, the VL53L1X, that allows sampling rates of up to 50 Hz.

b) Simple LIDAR: RPLIDAR A1M8, which can also be used as a fixed-position laser distance sensor by manually disabling rotation of the sensor head. It has a sampling rate of 2000 measurements per second and a measurement deviation of 1% of the measured value.

2) Accelerometer and gyroscope sensors: miniature sensors that can be attached to the keys of a digital piano and measure their positions at a rate of up to 1000 measurements per second.

3) Hall effect magnetic sensors: such miniature sensors can also be attached to digital piano keys, allowing measurement of the distance to a magnet that needs to be attached to the digital piano case.

4) Ambient light proximity sensors: such sensors provide non-contact measurement of small distances below 10 millimeters. In contrast to laser sensors, they have a lower cost and can be mounted inside the digital piano for measuring the distance to each key.

5) MIDI sensors (recording approximate key velocity at the moment of the key strike), which are already present in most digital piano instruments. This provides reference sensor data. However, MIDI sensors do not record data continuously; they integrate the data and provide a single velocity measurement for each key press, and therefore do not allow precise modeling of the touch.

We have compared the two measurement solutions for use with the acoustic piano in Table I. Because grand piano measurements are done only once and can be reused on multiple digital devices, it is beneficial to use the highest-precision equipment for this part of the work. Therefore we continue the remaining experiments using the RPLidar A1M8 distance sensor. For the intermediate modality we choose to narrow our selection to the accelerometer and MIDI sensors for the following reasons: MIDI sensors are present in almost all digital pianos that have MIDI output and can be used without adding external hardware; and accelerometer sensors usually have a built-in gyroscope, which allows capturing angular velocity, the exact mechanical parameter required to model the dynamics of the key. In contrast, light and magnetic sensors only indirectly measure the mechanics: there is a non-linear relationship between the angular position and velocity and the resulting measurements of magnetic field and light intensity.

The final arrangement of sensors for modeling a single piano key is shown in Figure 1, consisting of a lidar that is also used with the grand piano, an accelerometer attached to a key, and the internal MIDI sensors of a digital keyboard.

V. TRANSFER OF LEARNING MODELS

The physical setup described in the previous section shows a need to model the following sequence of dependencies:

Laser sensor data (available on the digital and acoustic piano during training; not available during testing) → {Accelerometer, MIDI sensor data} (available only on the digital piano during training and testing) → Sound (available only on the real piano during training and testing; must be re-created on the digital piano)

While ideally we would generate sound from laser sensor data during the performance, this most precise measurement method is not available at the testing stage. Therefore we first transform the laser sensor data into the corresponding (predicted) data from the other sensors that will be available in the testing stage. This sensor data is available only on the digital piano, because it involves contact measurement. At the same time, real sound data is available only on the acoustic piano. This dictates a need to create two large datasets and to train two classes of machine learning models:

1) On the digital piano: take the laser sensor data as input and predict the other sensor data, including MIDI and accelerometer data.

2) Apply this model on the acoustic piano to predict the other sensor data, which is unavailable there.

3) On the acoustic piano, take the predicted other sensor data as input and train a machine learning model to predict the sound from it.

4) Apply this model on the digital piano to generate realistic sound from the other sensors, present only on the digital piano.

The overall dataset will be highly multi-modal, containing at least two main time series (laser sensor data and the resulting sound) and many auxiliary time series (intermediate sensors, present only on the digital piano).

VI. MACHINE LEARNING MODELS

A. Sound models

We start by examining the largest part of the data: the musical instrument sound. We have recorded the sound of two notes (the middle C and G keys) on the digital and acoustic piano, trying to record as many different ways of touching the piano keys as possible. This resulted in an initial dataset that can be further extended to an octave (8-13 keys) or the full piano range (88 keys). In the following two sections we conduct experiments on the sound data only, for initial analysis and understanding of the data.

Fig. 2. T-SNE visualization of sound of piano notes

B. Supervised classification between digital and acoustic instrument

We considered the task of predicting whether a sounded note is from the acoustic or the digital piano. We used a raw representation of 2 seconds of sound, consisting of recordings of the C4 and G4 notes on both a grand piano and a Yamaha YDP-141 digital piano, pre-processed to normalize the audio and trim silence. There were 132 recordings in total, recorded at a sampling rate of 44100 Hz, giving a feature space of 132 samples of dimension 2*44100.

Then, applying binary labels for digital and acoustic, we utilized a multi-layer perceptron (MLP) neural network with a hidden layer of 100 units and ReLU as the activation function for classification. Given an 80-20 training-testing split, the resulting testing accuracy was 81%. This gives confidence in the ability of models to differentiate acoustic and digital sounds, allowing for further data analysis.
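This classifier can be reproduced with a short scikit-learn sketch. The load_clips() helper below is hypothetical and stands in for the actual loading of the 132 normalized, silence-trimmed recordings; the network and split mirror the description above (100 ReLU hidden units, 80-20 split).

```python
# Minimal sketch of the acoustic-vs-digital classifier, assuming a hypothetical
# load_clips() helper returning 2-second mono clips at 44100 Hz and their labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def load_clips():
    # Placeholder: replace with actual reading of the 132 recordings.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((132, 2 * 44100)).astype(np.float32)
    y = rng.integers(0, 2, size=132)  # 0 = acoustic, 1 = digital
    return X, y

X, y = load_clips()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu", max_iter=500)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```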

C. Unsupervised learning of sound representation

We have applied the t-SNE algorithm [7] to the collected initial dataset of two notes on two instruments (Figure 2). The learned representation indeed shows that the digital and acoustic instruments are different and separable in the latent space, and also that different notes have distinguishable characteristics. These initial results show that the larger task of this project is also feasible, and we are going to proceed with collecting the remaining sensor data and training the main models for sound generation.
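A corresponding sketch of the t-SNE visualization in Figure 2, reusing the X and y arrays from the previous sketch; the perplexity value is an assumption, not taken from the report.

```python
# Sketch of the t-SNE embedding of the raw note recordings (cf. Figure 2).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=15)
plt.title("t-SNE of piano note recordings (acoustic vs digital)")
plt.show()
```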

D. Feed-forward neural network vs. regression model for intermediate modality prediction

We start the actual task of transforming acoustic piano touch data into a common format by modeling the Laser → MIDI sensor relationship using an MLP neural network and a regression model. First, on the digital piano, the C4 and G4 notes were pressed with varying force and hence key velocity, with the laser sensor measuring the displacement of the key over time. The corresponding MIDI data of the key velocity was also recorded over time and then normalized by dividing by 127 (only for the regression model, as we want to keep the discrete values for MLP multinomial classification).

Then, after combining the C and G data, the 1-D time series for both the laser position and MIDI velocity were reshaped into 2-D arrays, wherein each row represents 20 milliseconds of data, with a shift of 10 milliseconds between the first entry of one row and the first entry of the next. The resulting feature space was further processed to remove any portions in which the MIDI value does not change within a 20 millisecond frame, as prediction is impossible otherwise. Finally, the last MIDI velocity value per row was used as the output, such that the prediction goes from L(t−n : t) → M(t), where L(t) and M(t) are the laser and MIDI data, respectively, and n represents 20 milliseconds (882 features).
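The windowing scheme can be summarized by the following sketch. The 44100 Hz rate is inferred from the 882-feature window mentioned above and is therefore an assumption about how the laser signal was resampled.

```python
# Sketch of the windowing scheme: 20 ms frames of the laser signal with a
# 10 ms hop, labelled with the MIDI velocity at the end of each frame.
import numpy as np

def make_windows(laser, midi, sr=44100, win_ms=20, hop_ms=10):
    win = int(sr * win_ms / 1000)   # 882 samples per frame
    hop = int(sr * hop_ms / 1000)   # 441 samples between frame starts
    X, y = [], []
    for start in range(0, len(laser) - win, hop):
        m = midi[start:start + win]
        if m.max() == m.min():      # drop frames where MIDI does not change
            continue
        X.append(laser[start:start + win])
        y.append(m[-1])             # velocity at the end of the frame
    return np.asarray(X), np.asarray(y)
```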

The dataset was then fed into two models: an MLP neural network with a single hidden layer of 100 units and a simple ridge regression model with mean squared error loss. We would expect the ridge regression model to perform better than the MLP, since given a true label (velocity), the multinomial softmax loss of the MLP penalizes a wrong classification of velocity regardless of how close the predicted velocity is to the true one. A regression model, on the other hand, not only allows prediction of velocity values that may be absent from the training data, but also penalizes a prediction close to the true value much less than one that is further away.

As expected, the MLP network gave a testing accuracy of 45%, while the ridge regression gave an error of 26%. Even though the regression model performs better, its error is too high; we therefore proceed with a recurrent neural network (RNN) for the prediction of MIDI given laser data.
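For completeness, a sketch of the two baselines, reusing make_windows() from the previous sketch on synthetic stand-in signals; the discrete velocities are fed to the MLP as class labels, while the ridge regression receives velocities scaled to [0, 1].

```python
# Sketch of the two baselines: MLP on discrete velocity classes vs ridge
# regression on normalized velocities. The laser/midi arrays are synthetic
# stand-ins for the recorded signals.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
laser = rng.standard_normal(44100 * 2)                       # 2 s of laser positions
midi = rng.integers(0, 128, size=laser.size).astype(float)   # stand-in MIDI track

X_w, y_w = make_windows(laser, midi)

# MLP: discrete velocity values treated as classes (multinomial softmax loss).
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)
mlp.fit(X_w, y_w.astype(int))

# Ridge regression: velocities normalized to [0, 1], squared-error loss.
ridge = Ridge(alpha=1.0).fit(X_w, y_w / 127.0)
```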

E. Recurrent network for intermediate modality prediction

Unlike feed-forward networks, RNNs contain a feedback loop that takes in the previous output y_{t-1} at time t-1 as input at time t along with the original input x_t. In this manner, one can think of RNNs as having memory, wherein sequential information is utilized to capture correlations in time-dependent events and aid prediction. Moreover, RNNs can perform many-to-many prediction, which is desired here as we are predicting a full time series.

In our case, we use an RNN model with two stacked long short-term memory (LSTM) layers with 10 neurons in each layer, with mean squared error loss and rmsprop as the optimizer. Rather than using a shifting-window separation of the data as the feature space, we simply divide the data into 2-second lengths, giving us 128 input vectors. In addition, in order to preserve the sequential nature of the predictions, we do not split our training and testing data randomly. In this manner, the loss was able to converge to about 0.086 within only a few epochs (4-6).
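A minimal Keras sketch of this recurrent model follows. The 2 kHz time base (and hence the number of timesteps per 2-second chunk) is an assumption based on the lidar sampling rate in Table I, not a value stated in the report.

```python
# Sketch of the recurrent model: two stacked LSTM layers with 10 units each,
# mean squared error loss, rmsprop optimizer, trained on 2-second chunks.
import tensorflow as tf

timesteps = 2 * 2000  # 2 s of laser data, assuming the 2 kHz lidar rate
model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, 1)),          # laser position per time step
    tf.keras.layers.LSTM(10, return_sequences=True),
    tf.keras.layers.LSTM(10, return_sequences=True),
    tf.keras.layers.Dense(1),                      # predicted MIDI value per step
])
model.compile(optimizer="rmsprop", loss="mse")

# X: (128, timesteps, 1) laser chunks, Y: (128, timesteps, 1) target MIDI track.
# model.fit(X, Y, epochs=6, shuffle=False)  # sequential order preserved
```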

Fig. 3. RNN architecture with intermediate tasks

However, we notice that the RNN cannot exactly predict such a rapidly changing signal as the square-wave-like MIDI signal, which is difficult to approximate by a smooth function. We therefore divide the problem into two parts: predicting where MIDI events start, and predicting the actual value of each MIDI event.

Therefore, we have implemented the architecture from Figure 3 to provide additional supervision to the network. The network still has one or more LSTM layers, which are then connected to two separate output layers: the "binary events" prediction layer has a sigmoid activation function and outputs the probability that an event is currently happening; it is trained using cross-entropy loss to maximize the likelihood of correctly predicting event start times. The "event properties" layer has a linear activation and is trained with mean squared error loss, to minimize deviations in the predicted event properties. The two outputs are then manually combined in a post-processing layer that outputs the "event properties" only if the binary model has predicted that an event is currently happening. Such separation of the task into two subtasks allows faster convergence.
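A sketch of this two-headed architecture in the Keras functional API; the layer size and equal loss weights are assumptions, as the report does not specify them.

```python
# Sketch of the two-headed architecture of Figure 3: a shared LSTM trunk with a
# sigmoid "event" head (binary cross-entropy) and a linear "properties" head (MSE).
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 1))                  # laser signal
trunk = tf.keras.layers.LSTM(10, return_sequences=True)(inputs)

event = tf.keras.layers.Dense(1, activation="sigmoid", name="event")(trunk)
props = tf.keras.layers.Dense(1, activation="linear", name="props")(trunk)

model = tf.keras.Model(inputs, [event, props])
model.compile(
    optimizer="rmsprop",
    loss={"event": "binary_crossentropy", "props": "mse"},
    loss_weights={"event": 1.0, "props": 1.0},
)

# Post-processing: emit the predicted property only where an event is detected.
# event_pred, props_pred = model.predict(x)
# midi_pred = props_pred * (event_pred > 0.5)
```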

In Figure 4 we present the resulting time series generated by the model. The model is able to correctly time the events, but still has difficulties producing exact values, which we address by smoothing in the following section.

F. Kalman filter for feature extraction

Despite all the information being available in the raw data, in the previous sections we have seen a number of difficulties in training neural networks. To aid the networks in solving the final task, we incorporate prior knowledge about the basic physical model of a piano key.

A piano key is a system with only one degree of freedom and can be fully characterized by the angular position of the key. Further, we can compute the respective velocity and acceleration to model its dynamic behavior. Because the sensors provide noisy data, and also do not provide all three parameters simultaneously, we can use a Kalman filter [8] to obtain a reliable estimation of the key dynamics. This transforms a single raw feature, position, into three smoothed features: position, velocity and acceleration.

Fig. 4. Test set prediction of MIDI signal using RNN

Fig. 5. Kalman filter for MIDI velocity prediction

We notice that the velocity modeled by the Kalman filter (Figure 5) allows predicting MIDI values at the start of MIDI events with just a linear transformation (which is expected, because both sensors now measure velocity), allowing the system to operate without referring to the RNN, or to use the RNN only to correct possible non-linearities.
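A minimal sketch of such a Kalman filter with a constant-acceleration state model, estimating position, velocity and acceleration from noisy position measurements; the process and measurement noise values are illustrative and would need tuning on real sensor data.

```python
# Minimal Kalman filter sketch for one key: constant-acceleration state model,
# position-only measurements, returning smoothed position/velocity/acceleration.
import numpy as np

def kalman_smooth(positions, dt, q=1e-3, r=1e-2):
    # State: [position, velocity, acceleration]
    F = np.array([[1.0, dt, 0.5 * dt * dt],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    H = np.array([[1.0, 0.0, 0.0]])   # only position is observed
    Q = q * np.eye(3)                 # process noise (illustrative)
    R = np.array([[r]])               # measurement noise (illustrative)

    x = np.zeros(3)
    P = np.eye(3)
    out = np.zeros((len(positions), 3))
    for i, z in enumerate(positions):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ (np.array([z]) - H @ x)).ravel()
        P = (np.eye(3) - K @ H) @ P
        out[i] = x
    return out  # columns: position, velocity, acceleration
```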

We further notice that we now have a more advanced sensor system than the synthesizer's simple MIDI sensors. Therefore, one reason for the low accuracy in predicting the MIDI signal may be that the ground truth itself is far from perfect. In particular, there is a subjective difference in what key velocity actually means on the digital and acoustic piano. Performing a subjective test measuring Mean Opinion Scores, comparing different ways of interpreting velocity, should answer the question of whether the digital piano's MIDI sensors can be used as a reference.

VII. SOUND GENERATION

The full implementation of the described ideas requires two models: one for touch, described in the previous sections, and one for sound. In this work we use an existing model for sound generation, conditioned on the predicted touch data. We use the TensorFlow implementation of WaveNet [9] and train it initially on the acoustic piano recordings of G4 and C4.

The model generates a new sample at time t by maximizing the log-likelihood of the joint probability distribution of the data stream up to time t−1, factorized as a product of conditional distributions p(x) = ∏_t p(x_t | x_1, ..., x_{t−1}). Each conditional is modeled as a softmax distribution after treating the data values with mu-law companding, which preserves the sound's dynamic range while decreasing the required bits per sample to 8.
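For reference, a short sketch of the 8-bit mu-law companding step (mu = 255) used to quantize waveform samples before the softmax output.

```python
# Sketch of 8-bit mu-law companding: maps samples in [-1, 1] to 256 classes
# while roughly preserving perceived dynamic range, and back.
import numpy as np

def mu_law_encode(x, mu=255):
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)   # integer class in [0, mu]

def mu_law_decode(q, mu=255):
    y = 2 * (q.astype(np.float32) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```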

In this implementation, the model has 9 hidden dilated convolutional layers, with the dilation increasing by a factor of 2 per layer, allowing exponential receptive field growth with depth. After 1200 training steps, the training loss (cross entropy of the output at each timestep) was reduced from 5.677 to 1.1. We note that the loss can be reduced further with more training time, and that other implementations supporting parallelization exist that have not been tested at the time of writing.
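As a quick check of the resulting receptive field, assuming a kernel size of 2 as in the original WaveNet:

```python
# Receptive field of 9 dilated causal convolution layers with kernel size 2
# and dilation doubling per layer (an assumption consistent with WaveNet).
layers, kernel = 9, 2
receptive_field = 1 + sum((kernel - 1) * 2 ** i for i in range(layers))
print(receptive_field)  # 512 samples of context
```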

In terms of actual audio generation, conditioning on the note can be done manually via differentiation in the original data name itself; however, we have not yet been able to successfully condition on the MIDI velocity. This is a good direction for future work to improve sound generation with conditioning included.

VIII. EVALUATION AND CONCLUSIONS

We have performed objective evaluation by computing loss values on the hold-out testing set, with the results presented in the corresponding sections. The Kalman filter showed good performance in conjunction with the RNN for intermediate modality prediction, and WaveNet provided initial results on sound generation, which should however be improved for generating realistic sound.

Subjectively, the system is at the initial stage of showing results that can be used by musicians. We believe the presented idea is novel and practically useful, and that additional work is definitely required to make the models applicable on production devices.

Source code of this work and collected datasets are available at https://github.com/ernestosanches/AIPiano

IX. ACKNOWLEDGEMENTS

We would like to thank Anand Avati from Stanford University for helpful discussions regarding transfer of learning, and Professor Elena Abalyan from University of Suwon for guidance on musical aspects of the work.

X. CONTRIBUTIONS

1) Ernesto Evgeniy Sanches Shayda - idea, data collection, supervised modeling of touch data.

2) Ilkyu Lee - initial data analysis and sound generation modeling.


REFERENCES

[1] Andy Hunt and Marcelo M. Wanderley. Mapping performer parameters to synthesis engines. Organised Sound, 7(2):97–108, 2002.

[2] Balazs Bank and Juliette Chabassier. Model-based digital pianos: from physics to sound synthesis. IEEE Signal Processing Magazine, 36(1):103–114, 2019.

[3] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. SSW, 125, 2016.

[4] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural audio synthesis of musical notes with WaveNet autoencoders, 2017.

[5] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710, 2019.

[6] Markus-Christian Amann, Thierry M. Bosch, Marc Lescure, Risto A. Myllylae, and Marc Rioux. Laser ranging: a critical review of unusual techniques for distance measurement. Optical Engineering, 40(1):10–20, 2001.

[7] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[8] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[9] https://github.com/ibab/tensorflow-wavenet.