Audio Networking Technical challenges and possibilities for distributed digital sound production at Otago CSIS Seminar series, July 2010 Chris Edwards Department of Information Science University of Otago
Jan 15, 2016
Audio Networking: Technical challenges and possibilities for distributed digital sound production at Otago
CSIS Seminar series, July 2010
Chris Edwards
Department of Information Science, University of Otago
Motivation and Background
Music Department’s new $1M SSL mixing console
New Zealand Music Industry Centre (NZMiC)
KAREN high-capacity network connectivity
Interesting technical and creative possibilities:
Remote (live) mixing
Remote recording (live or layered multi-track)
Distributed real-time performance (live and recorded)
Internet broadcast/multicast/streaming
Asynchronous production tasks, e.g. (re)mixing, mastering, film score composition, with very short turnaround
The SSL Console
Solid State Logic model C200 HD
Digital control surface, array of common per-channel controls
Signal level metering
Transport control, timecode display
Full automation (programmable fader (etc.) motion)
DAW control (mouse, keyboard, display)
64 dedicated control strips, pageable for even more
(and it can play “Pong”!)
Not just a pretty face: Behind the console
Dual C-SB Stageboxes
48 high-quality mic inputs each
Gain and pre-amp behaviour remotely controllable
~2 km reach over single-mode optical fibre
Portable; plans for Marama Hall, Stadium, Town Hall
Centuri core
Signal routing and control surface I/O
Storage
DSP modules
Outboard hardware effects processors
DAW (a Power Macintosh with MADI card)
KAREN connectivity
Albany Street Installation
[Diagram labels: SSL Console; SSL Centuri core; SSL DSP Units; two C-SB Stage Units (48 mic input channels per stage unit) on optical fibre (2 km reach); DAW (Mac) on MADI (Multi-Channel Audio Digital Interface); O/B Effects Units; Network]
KAREN
Kiwi Advanced Research and Education Network
Operated by REANNZ (Research and Education Advanced Network New Zealand), Ltd. (Crown-owned company)
10 Gb/s generally available between participating institutions
16 national points-of-presence (PoPs)
International links to Australia, North America
and, via these, to Asia and Europe
KAREN NZ Network Map
Source: http://www.karen.net.nz/topology/
Digital Audio Basics
Discretised, quantised representation of continuous analog signal
Signal is represented as a stream of numbers
Driven by hardware clock (typically a crystal oscillator)
One sample recorded/played every clock cycle
Fs is sampling frequency
Typ. transmitted digitally using PCM (pulse-code modulation)
[Figure: sampled signal; axes: Amplitude vs. Time]
Digital Audio Basics (2)
Typical audio sampling frequencies are 10s–100s of kHz
Human hearing tops out around 16–20 kHz
Nyquist limit (highest reproducible frequency) is Fs/2
Fs ∈ {44.1 kHz (CD-DA), 48 kHz, 96 kHz, 192 kHz}
Higher Fs means more resampling options, better resampling quality
Higher Fs also makes life easier for the low-pass filters
Oversampling is also commonly used
24-bit signed integer precision common in studio work
~144 dB theoretical dynamic range (about 6 dB per bit); limited by analog noise floor in practice
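The Nyquist limit above can be illustrated with a minimal Python sketch. The function names are illustrative only (not from any library), and the folding rule assumes an ideally sampled sine with no anti-aliasing filter in front of the converter.

```python
def nyquist_limit(fs_hz):
    """Highest frequency representable at sampling frequency fs_hz."""
    return fs_hz / 2.0


def alias_frequency(f_hz, fs_hz):
    """Frequency at which a tone of f_hz appears after sampling at fs_hz."""
    f = f_hz % fs_hz  # sampling cannot distinguish f from f + k*fs
    return fs_hz - f if f > fs_hz / 2.0 else f


for fs in (44_100, 48_000, 96_000, 192_000):
    print(fs, "Hz sampling ->", nyquist_limit(fs), "Hz Nyquist limit")
```

For example, a 30 kHz tone sampled at 48 kHz folds back to 18 kHz, which is why the low-pass filters mentioned above matter.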
Different ways of being wrong: Quantisation error, jitter et al.
Use of quantisation means approximation, error
Amplitude (bits per sample)
Time (sampling frequency)
More bits and/or faster clock: better approximation
Often a worthwhile trade-off (no generation loss, DSP, transmission)
Hardware clocks are not perfectly stable
Non-uniformity classified as follows:
• Jitter (short-term variations: cycle-to-cycle)
• Wander (medium-term)
• Drift (long-term)
Measurement: eye diagrams, Allan variance
A stable clock is not necessarily an accurate clock!
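The "more bits: better approximation" point can be checked numerically. The sketch below is an illustration, not part of the talk: it rounds a full-scale sine to a given word length and measures the resulting signal-to-noise ratio, which should land close to the textbook estimate of 6.02 dB per bit plus 1.76 dB. The 997 Hz test tone and 48 kHz rate are arbitrary assumed values.

```python
import math


def quantisation_snr_db(bits, n_samples=48_000, tone_hz=997.0, fs_hz=48_000.0):
    """Measure the SNR of a full-scale sine rounded to `bits`-bit steps."""
    step = 2.0 / (2 ** bits)  # full scale spans -1.0..+1.0
    signal_power = 0.0
    noise_power = 0.0
    for n in range(n_samples):
        x = math.sin(2.0 * math.pi * tone_hz * n / fs_hz)
        q = round(x / step) * step  # uniform quantiser
        signal_power += x * x
        noise_power += (q - x) ** 2
    return 10.0 * math.log10(signal_power / noise_power)


def ideal_snr_db(bits):
    """Textbook estimate for a full-scale sine with uniform quantisation noise."""
    return 6.02 * bits + 1.76
```

Calling quantisation_snr_db(16) and ideal_snr_db(16) side by side shows how closely the measurement tracks the estimate.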
Transmitting Digital Audio (within the studio)
Example protocols:
AES3 (AES/EBU)
MADI
ADAT Optical Interface
S/PDIF
Local digital audio transmission is generally synchronous.
Must avoid clock drift to avoid buffer over-/under-runs
A word about word clock
Synchronous operation means having a common reference clock
Word clock is a dedicated digital signal operating at Fs
Word clock ≠ timecode, it’s a frequency reference
Master clock signal fanned out to slave devices via dedicated co-axial cable
Clock can also be sent as part of audio data
In-band bit clock (“self-clocking” signal)
Used in AES3, ADAT, S/PDIF
Typically uses bi-phase mark coding or similar
Both assume complete control of physical medium.
Digital audio is surprisingly demanding on clock quality.
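As a rough illustration of how a self-clocking signal carries its own bit clock, here is a minimal sketch of bi-phase mark coding. It models only the line code (two half-cell levels per data bit), not the framing or preambles of AES3, ADAT or S/PDIF; the function names are illustrative.

```python
def biphase_mark_encode(bits, level=0):
    """Return two half-cell line levels (0/1) per data bit."""
    out = []
    for bit in bits:
        level ^= 1  # every bit cell starts with a transition
        out.append(level)
        if bit:
            level ^= 1  # a 1 adds a second transition mid-cell
        out.append(level)
    return out


def biphase_mark_decode(halves):
    """A cell whose two halves differ carried a 1; equal halves carried a 0."""
    return [int(a != b) for a, b in zip(halves[0::2], halves[1::2])]
```

Because there is a transition at every cell boundary, the receiver can recover the clock from the data, and the decoded bits do not depend on signal polarity.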
Transmitting Digital Audio (beyond the studio)
“Pro” networked-audio protocols generally operate at OSI Layer 2
Data Link layer, i.e. not routeable, local area only
Ethernet is popular, unsurprisingly
Examples:
• AVB (IEEE 1722)
• AES47 (IEC 62365, AES3 over ATM)
• AES51 (AES3 over ATM over Ethernet)
• CobraNet
• EtherSound
Some OSI Layer 3/4 protocols do exist:
NetJACK (Open Source)
Livewire
OSI layers (with examples):
4. Transport: TCP, UDP
3. Network: IP
2. Data Link: Ethernet
1. Physical: UTP
(Some) challenges for distributed digital audio production
1. Audio hardware clock synchronisation
2. Audio data delivery (network service quality)
Network capacity (“bandwidth”)
Latency (packet delivery time, i.e. delay)
Trade-offs between these
Quality of Service (QoS) assurance (per-packet priority)
Network out[r]ages
3. Timecode and transport control
4. Interoperability in general
Protocols
Framing
Data representation and encoding
Challenge 1: Hardware Clock Drift
Unsynchronised audio hardware clocks will drift
Drifting too far will lead to buffer over-/under-runs
Unacceptable audio glitches (drop-outs, pops/clicks)
Word-clock operates at physical level
Running co-ax to Auckland, London, Seattle not feasible!
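To get a feel for how quickly drift becomes a problem, the following sketch estimates the time until two free-running clocks differ by a given number of frames. The 50 ppm offset and 256-frame slack used in the usage note are assumed illustrative values, not measurements from the talk.

```python
def seconds_until_xrun(fs_hz, offset_ppm, slack_frames):
    """Time for two free-running clocks to drift apart by slack_frames."""
    frames_per_second = fs_hz * offset_ppm * 1e-6  # relative drift rate
    return slack_frames / frames_per_second
```

With seconds_until_xrun(96_000, 50, 256) the result is a little under a minute, so even a small frequency mismatch would produce regular glitches in a long session.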
Hardware clock synchronisation:Some possible solutions
Discipline the audio clock using a common external source
Internet Network Time Protocol (NTP) (Mills, 1980s–)
Timecode embedded in application-level packets
GPS PPS timekeeping signal
Dynamically resample audio at each node
Solution should be low-jitter:
More than a few hundred picoseconds may be unacceptable
Jitter may manifest as white noise or more complex distortions
At best, jitter undermines SNR
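The effect of jitter on SNR can be estimated with a commonly used approximation for a full-scale sine sampled with random clock jitter: SNR = -20·log10(2π·f·σ), where f is the signal frequency and σ the RMS jitter. This formula is not from the slides; it is offered here as a rough sketch for sizing the problem.

```python
import math


def jitter_limited_snr_db(signal_hz, rms_jitter_s):
    """Upper bound on SNR set by sampling-clock jitter alone."""
    return -20.0 * math.log10(2.0 * math.pi * signal_hz * rms_jitter_s)
```

Under this approximation, 200 ps of RMS jitter on a 20 kHz signal limits SNR to about 92 dB, and 1 ns limits it to about 78 dB.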
Reasons for leaning towards GPS
Resampling degrades quality; avoid if possible.
Pro audio hardware generally has word-clock input anyway.
A hardware solution would be convenient.
NTP doesn’t claim especially high accuracy
Approx. ±10 ms for general use on the Internet
Personal computer hardware clocks are not especially accurate or stable
NTP is primarily concerned with absolute timekeeping.
We care more about consistent frequency.
NTP assumes symmetric network paths (not a problem for frequency reference only?)
NTP’s clock slewing behaviour might be disruptive if applied to audio AD/DA converters?
NTP experts recommend using GPS anyway! (Shalunov, 2005)
GPS is globally available and uses dedicated radio signalling.
GPS satellite network is guaranteed to keep closely in sync; ideal for single master clock approach.
Pros and cons to be investigated further…!
GPS: The Global Positioning System
“Wherever you go, there you are.”—anon.
The Global Positioning System (GPS)
Basically a distributed high-precision time-keeping and message broadcasting system
24 satellites (plus spares!) in medium Earth orbit (20,000 km altitude)
6 orbital planes with 4 satellites each
4 must be “visible” to receiver to get precise position.
True position of each satellite is known/predictable (the ephemeris).
Satellites broadcast time-stamped messages.
Delay in receiving timestamped message determines distance from satellite.
Intersection of distances pinpoints location in space
GPS is also used to help other satellites know where (and when) they are.
GPS uses distance (from time) rather than direction:
Receiver uses delay in receiving each message to calculate distance to the satellite that sent it.
Requires very precise timekeeping, as messages travel at/near light speed.
Relativistic effects must be accounted for!
1D position (i.e. on a line) requires two distance measurements.
2D (on a plane) requires three distance measurements (circles).
3D (in space) requires four distance measurements (spheres).
Earth’s sphere could be used to provide the fourth distance (provided you are on the surface).
Would still require four readings for altitude.
(essential if flying or in space)
Using four measurements improves accuracy as well.
How GPS location works
[Figure: GPS in one dimension. Satellite 1 and Satellite 2, at distances r1 and r2 from "You Are Here"]
• Satellite positions are known
• Messages are time-stamped, so time of sending is known
• Delay in receiving message can be measured
• Distance is proportional to delay
• Intersection of distances determines actual position
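The one-dimensional case can be worked through in a few lines. This sketch assumes (as an interpretation, not something the slides state) that the second measurement is needed because the receiver's own clock has an unknown offset, so two measured delays are solved for two unknowns: position and clock bias. The function and parameter names are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s


def locate_1d(s1_m, s2_m, delay1_s, delay2_s):
    """Position on the line between satellites at s1_m < s2_m, plus clock bias.

    Each measured delay includes the unknown receiver clock bias, so
    C * delay is a pseudorange rather than a true distance.
    """
    p1 = C * delay1_s  # pseudorange to satellite 1
    p2 = C * delay2_s  # pseudorange to satellite 2
    x = (s1_m + s2_m + p1 - p2) / 2.0
    bias_s = (p1 + p2 - (s2_m - s1_m)) / (2.0 * C)
    return x, bias_s
```

The same idea extends to three dimensions, where three coordinates plus the clock bias give four unknowns and hence four measurements.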
GPS for time-keeping
PPS (pulse per second, i.e. 1 Hz) signal is available externally on many GPS receivers.
Can be used for precise timekeeping, even in remote areas.
Once location is determined and locked in, even higher timing accuracy is possible.
Can derive higher frequencies (for word clock) using frequency synthesis.
Proposed Scheme
Use globally available GPS PPS signal to discipline local audio hardware clocks
Uniform frequency (not absolute time) is the critical thing.
Avoid clock drift across sites, to avoid buffering errors.
Already been done! Shera (1998):
Ham radio application, originally
Voltage-controlled crystal oscillator (VCXO)
PLL-based regulation (phase-locked control feedback loop, de Bellescize (1932))
Temperature-sensitive (even with thermostatic “oven”)
27 MHz master clock is common in multimedia systems
Because of NTSC television timings, AFAICT
Video sync input required for SSL Centuri (implications?)
Shera (1998): block diagram
u-blox LEA-6T
GPS receiver module for precision timing applications
Position-lock for greater timekeeping accuracy
Programmable output clock pulse, 1/60 Hz to 10 MHz
High sensitivity; useable indoors
15 ns accuracy achievable
Ideally would simply connect LEA-6T clock output to audio word clock input
Innovative Integration PCIe-Timing card
PCIe expansion card
GPS receiver for clock discipline
Multiple programmable digital clocks
1560 kHz .. 1 GHz output
0.2 ps jitter specification
How a PLL works (analogy: two cars on a race track)
1 lap = 1 clock cycle
“Master” reference car and following “slave” car
Lead or lag is phase difference
Measure once per lap or continuously
Constant phase difference means same frequency
If gaining, slow down slightly
If lagging, speed up slightly
Frequency is the derivative of phase!
PLL Demo in Pure Data (if time)
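The race-track analogy can be simulated in a few lines. This is a minimal sketch of a discrete-time loop with a proportional-plus-integral controller, measuring the phase difference once per step; the gains kp and ki and the normalised frequencies are assumed values chosen for illustration, not parameters from Shera's design or any real hardware.

```python
def simulate_pll(f_ref, f_nominal, steps=1000, kp=0.1, ki=0.01):
    """Steer a local oscillator towards f_ref using the phase error only.

    Frequencies are in cycles per step; phases are in cycles.
    Returns the final local frequency and the final phase error.
    """
    ref_phase = 0.0
    local_phase = 0.0
    integral = 0.0
    error = 0.0
    f_local = f_nominal
    for _ in range(steps):
        ref_phase += f_ref        # the "master" car completes part of a lap
        local_phase += f_local    # so does the "slave" car
        error = ref_phase - local_phase  # lead or lag, in laps
        integral += ki * error    # remembers the long-term frequency offset
        f_local = f_nominal + kp * error + integral
    return f_local, error
```

For example, simulate_pll(1.0, 1.00005) starts 50 ppm fast and settles with the local frequency matching the reference and the phase error close to zero.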
The Software Side (in case you don’t know JACK) ☻
JACK = JACK Audio Connection Kit (Paul Davis, ~2000?)
Audio server program providing low latency and sample-accurate sync
Like an Open Source combination of ReWire (inter-process audio), ASIO (low-latency audio I/O) and VST (software plug-ins)
Provides audio routing among software clients and hardware
Clients may be ordinary processes or in-process plug-ins
Originated on Linux, now also runs on Mac OS X, Windows, BSDs
Also can provide network transport over IP (NetJACK)!
Probably an ideal platform for research software development
JACK details (1)
Runs at real-time priority where possible
No additional latency due to JACK itself
mmap()s to system audio buffers
Provides a high-level audio API
Client software requires no audio hardware access code
Various audio back-ends: ALSA, FFADO, Core Audio, PortAudio, etc.
Enables rapid development and portability of audio apps
Client connects to server, registers audio input/output port(s)
Registered clients have process() callback invoked on demand by JACK server
Synchronous execution of all clients
Supports MIDI data streams too; may support video etc. in future
JACK details (2)
All audio data represented uniformly as 32-bit IEEE floating-point, normalised to -1.0..+1.0
Provides global transport control and timecode
No multiplexing/interleaving (e.g. stereo, 5.1, etc.) at the JACK level
One port: one channel
Use whatever channel configuration you need
Buffer over-/under-runs (“xruns”) detected and reported by JACK server
Server can disconnect misbehaving clients
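To illustrate the normalised floating-point representation, the sketch below maps signed 24-bit integer samples to the -1.0..+1.0 range and back. Scaling by 2^23 is one common convention and is assumed here; the actual conversion used by a given JACK back-end may differ.

```python
FULL_SCALE_24 = 1 << 23  # 8388608, magnitude of the most negative 24-bit value


def int24_to_float(sample):
    """Map a signed 24-bit integer sample onto the -1.0..+1.0 range."""
    return sample / float(FULL_SCALE_24)


def float_to_int24(value):
    """Inverse mapping, clipped to the representable 24-bit range."""
    scaled = round(value * FULL_SCALE_24)
    return max(-FULL_SCALE_24, min(FULL_SCALE_24 - 1, scaled))
```

A 32-bit float has a 24-bit significand, so 24-bit integer samples survive the round trip without loss under this scaling.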
JACK details (3)
Audio processing driven by audio hardware
Hardware buffer typically divided in two (double-buffering):
• Software reads from one buffer, writes to the other
One interrupt period to receive input
Two interrupt periods to process and deliver (input and output)
Example timing: 256 frames/period × 2 periods/buffer @ 96 kHz:
(1 frame is all samples across channels taken at one sampling interval)
375 Hz interrupt rate
~5 ms “through” latency
Comparable to sound delay from monitor speakers
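The example timing above follows directly from the buffer geometry. A minimal sketch of the arithmetic, with illustrative function names:

```python
def interrupt_rate_hz(fs_hz, frames_per_period):
    """How often the audio hardware interrupts, once per period."""
    return fs_hz / frames_per_period


def buffer_latency_ms(fs_hz, frames_per_period, nperiods):
    """Time represented by the whole buffer (frames/period x nperiods)."""
    return 1000.0 * frames_per_period * nperiods / fs_hz


print(interrupt_rate_hz(96_000, 256), "Hz interrupt rate")
print(round(buffer_latency_ms(96_000, 256, 2), 2), "ms buffer latency")
```

Halving the period size halves the latency but doubles the interrupt rate, which is the trade-off the real-time requirements have to absorb.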
Buffer (frames/period × nperiods)
JACK buffer management
[Figure: buffer periods exchanged between Audio Hardware and Software; labels: Period 1 (512 frames), Period 2 (512 frames)]
JACK: Implementation Challenges
(Hard) real-time processing requirements
Also, want non-root users to be able to run JACK and clients
May have only hundreds of microseconds to run all client process() callbacks
Overhead of context switches (e.g. CPU cache invalidation) is significant!
Linux signals proved too slow to be used for JACK IPC.
Current design uses FIFOs.
Client callbacks must of course be RT-safe.
Recording/streaming software must do I/O!
NetJACK
Networking extension to JACK
Technically just another audio back-end
Allows multiple JACK instances to communicate via UDP/IP
Remote (slave) JACK instances run inside the master JACK loop
BUT!: slave instances are generally deaf and mute
No audio clock available; driven by reception of network packets instead
Processing only; no audio I/O (DSP farming)
However:
• Sample-rate conversion exists in code-base for local audio I/O
• CELT lossy codec with packet-loss concealment also available
Might be suitable for use/adaptation for distributed studio work
Large buffer period sizes to handle latency (4096 frames for 96 kHz within NZ?)
NetJACK: Possible modifications
Allow normal audio I/O on NetJACK slave instances
No resampling, so no loss of quality
Could be feasible if hardware clock synch scheme works
Would it require/experience some extra buffering?
• Jitter buffer
• I/O still triggered by audio hardware
Facility to measure and record network latencies
(Local) JACK already accounts for latency throughout the call graph
JACK transport pre-roll can compensate for playback latency
Challenge 2: Data Delivery Quality
Long-range Internet transport is highly variable:
Non-uniform delivery time of packets
Variable bandwidth available
Congestion, traffic-shaping, etc.
Live audio data must be delivered as fast as possible
Buffering generally increases throughput, robustness and jitter-immunity at the expense of latency
Network performance on KAREN
KAREN should provide a good starting point for feasibility studies
Bandwidth aplenty:
Up to 10 Gb/s generally available
• ~10,000 × typical home DSL
Typically under 5% utilisation
Audio: ~600 Mb/s raw for 96 channels of 32-bit audio at 192 kHz
Whole-session transfer in < 10 s (in theory)
• 4 minutes × 24 tracks of 24-bit audio @ 96 kHz ≈ 1.7 GB
• Endpoint disk I/O is probably the bottleneck in practice
Interestingly, no QoS facilities
Latency is the big problem
Audio signals must be kept within ~15 ms to seem musically simultaneous
Acoustic and electromagnetic signal propagation is not instantaneous
~3 ms/m for sound waves in air (~330 m/s)
Light (fibre-optic) and electrical signal propagation is typically around 0.7c
~5 ms/1000 km
20–30 ms RTT (round-trip time) observed between Otago and Auckland via KAREN (so 10–15 ms each way)
Worst-case latency is really the important case
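The bandwidth and propagation figures on this slide can be reproduced with some simple arithmetic. The sketch below assumes uncompressed PCM with no protocol overhead and a velocity factor of 0.7; the function names are illustrative.

```python
def raw_audio_rate_mbps(channels, bits_per_sample, fs_hz):
    """Raw PCM data rate in Mb/s, ignoring packet and protocol overhead."""
    return channels * bits_per_sample * fs_hz / 1e6


def session_size_bytes(minutes, tracks, bits_per_sample, fs_hz):
    """Size of an uncompressed multi-track session."""
    return minutes * 60 * tracks * (bits_per_sample // 8) * fs_hz


def transfer_time_s(size_bytes, link_gbps):
    """Ideal transfer time over a link of link_gbps, with no overhead."""
    return size_bytes * 8 / (link_gbps * 1e9)


def propagation_delay_ms(distance_km, velocity_factor=0.7):
    """One-way delay for a signal travelling at velocity_factor times c."""
    c_km_per_s = 299_792.458
    return 1000.0 * distance_km / (velocity_factor * c_km_per_s)
```

session_size_bytes and transfer_time_s can be combined to check the whole-session transfer estimate for any track count and sample format, and propagation_delay_ms(1000) gives a little under 5 ms.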
The Latency Problem
Drums @ Dunedin
Guitar @ Auckland
15 ms delay
15 ms delay
1. Drum part provides reference timing
2. Guitar plays in sync with heard drum sound
3. Guitar part sounds late by 30 ms
“I canna change the Laws of Physics”
Observed latencies to international locations via KAREN (source: https://kmeasure.karen.ac.nz/cgi-bin/smokeping.cgi?target=INTERNATIONAL_LOCATIONS)
Sydney: ~40 ms RTT
Perth: ~80 ms RTT
Seattle: ~160 ms RTT
North America generally: 200..300 ms RTT
Asia: 300..500 ms RTT
Europe: 300..400 ms RTT
Note: these are averages (“show me the histograms!”)
Equivalent Approx. Distances (cf. propagation of sound waves in air)
Dunedin to Auckland: 5 m
Dunedin to Sydney: 10 m
Dunedin to Seattle: 30 m
Dunedin to Europe: 60 m
d = v × t
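The equivalent distances can be derived from the observed round-trip times. This sketch assumes the one-way delay is half the RTT (a symmetric path) and uses the 330 m/s speed of sound quoted earlier; the slide's figures are rounded.

```python
SPEED_OF_SOUND_M_PER_S = 330.0  # figure used earlier in the slides


def equivalent_distance_m(rtt_ms):
    """Distance sound travels in air during the one-way network delay."""
    one_way_s = rtt_ms / 2.0 / 1000.0  # assumes a symmetric path
    return SPEED_OF_SOUND_M_PER_S * one_way_s
```

For instance, a 30 ms RTT corresponds to standing about 5 m from the sound source, and a 160 ms RTT to about 26 m.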
International latencies, in musical terms
At 120 BPM tempo (beat = 120 per minute):
2 beats/s (500 ms per beat)
Asia round-trip ≈ one beat
North America round-trip ≈ r
Australia round-trip ≈ y
Network latency will be a problem for certain applications.
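To compare a round-trip time with musical durations at a given tempo, a sketch along these lines may help. It assumes the beat is a quarter note, which is the usual convention but is an assumption here; the note names and function names are illustrative.

```python
def note_durations_ms(bpm):
    """Durations of common note values, taking the beat as a quarter note."""
    beat_ms = 60_000.0 / bpm
    return {
        "quarter": beat_ms,
        "eighth": beat_ms / 2,
        "sixteenth": beat_ms / 4,
        "thirty-second": beat_ms / 8,
    }


def nearest_note(rtt_ms, bpm=120):
    """Name of the note value whose duration is closest to rtt_ms."""
    durations = note_durations_ms(bpm)
    return min(durations, key=lambda name: abs(durations[name] - rtt_ms))
```

At 120 BPM a beat lasts 500 ms, so a round trip of a few hundred milliseconds is a musically significant fraction of a beat.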
Acoustic demonstrations of delay
Phasing (comb filtering) 0.02..15 ms
Stereophonic (Haas effect) shifts ~10-50 ms
Distinct echoes ~50+ ms
Synchronisation vs. delay
Synchronisation and delay are two different problems.
For some applications, delay is largely irrelevant
e.g. mixing a band from 20 m away can still be done
Synchronisation, however, is generally critical
esp. if the same audio is split across multiple paths and recombined
• comb filtering, changes in comb filtering
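Comb filtering from recombined paths can be quantified. When a signal is summed with an equal-level copy of itself delayed by τ, cancellations occur at odd multiples of 1/(2τ). The sketch below lists the first few notch frequencies; the equal-level assumption is an idealisation.

```python
def comb_notch_frequencies_hz(delay_ms, count=3):
    """First `count` cancellation frequencies for a given path delay."""
    delay_s = delay_ms / 1000.0
    return [(2 * k + 1) / (2.0 * delay_s) for k in range(count)]
```

A 1 ms path difference puts notches at roughly 500 Hz, 1.5 kHz and 2.5 kHz, well inside the musically important range, and any change in delay moves them audibly.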
What Might Be Feasible?
Mixing can be considered part of a live performance, but latency requirement is less stringent
Remote recording is one-directional; high latency is quite acceptable. Internet streaming ditto.
Pre-scored performance is easier than fully live
E.g. Sibelius score, sequenced backing, metronome/click-track
Pre-roll to compensate for latency
Layered multi-track recording generally doable
Latency requirements can be relaxed considerably under certain conditions:
In particular, if nodes don’t need to hear all other nodes
• Acyclic audio processing graph
Sync is more important than absolute delay in many situations
Better read up on some graph theory...!
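The acyclic-graph condition can be tested mechanically. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to check whether a "who needs to hear whom" relation has a valid ordering; the node names and the helper function are hypothetical examples.

```python
from graphlib import CycleError, TopologicalSorter


def monitoring_order(hears):
    """Return a recording/processing order, or None if the graph has a cycle.

    `hears` maps each node to the set of nodes whose audio it must hear.
    """
    try:
        return list(TopologicalSorter(hears).static_order())
    except CycleError:
        return None  # some nodes must hear each other: latency-critical


layered = {"guitar": {"drums"}, "vocals": {"drums", "guitar"}}
live_duet = {"guitar": {"drums"}, "drums": {"guitar"}}
print(monitoring_order(layered))    # an ordering exists
print(monitoring_order(live_duet))  # no ordering: mutual monitoring
```

When an ordering exists, each node only waits on upstream audio, so delay accumulates along the chain but never has to be hidden inside a feedback loop between performers.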
For further investigation
Determine required audio hardware clock quality (jitter, drift, etc.)
Trial the GPS hardware clock sync idea
Is variable satellite visibility a problem?
Test feasibility of NTP for hardware clock sync
Determine latency requirements for potential applications
Develop/co-opt network analysis framework for distributed studio
Delve into the JACK code (ZOMG! Real-time C code!)
Investigate network tuning parameters
Investigate use of the Internet for longer-haul transport
Moving to the Internet
KAREN provides many benefits over a normal consumer Internet connection
Long-haul Internet would mean significantly lower connection quality (bandwidth, latency, packet jitter, reliability)
Potential hassles:
QoS and traffic-shaping
Firewalls and NAT
CELT for lower data rate and concealment of packet loss?
Only if necessary (it is lossy)
References and further reading
Stereophile magazine article on digital audio clock jitter
http://www.stereophile.com/reference/193jitter/
Sound on Sound article on digital studio clocks
http://www.soundonsound.com/sos/jun10/articles/masterclocks.htm
Brooks Shera's GPS-Controlled Frequency Standard
http://www.rt66.com/~shera/index_fs.htm
Phase-Locked Loop (PLL) overview
http://en.wikipedia.org/wiki/Phase-locked_loop
NTP overview
http://www.eecis.udel.edu/~mills/exec.html
Shalunov, 2005: NTP Cookbook
http://www.internet2.edu/workshops/npw/binder-docs/ntp-cookbook.pdf
NTP RFC document
http://www.eecis.udel.edu/~mills/database/rfc/rfc1059.txt
NetJACK2 architectural overview
http://trac.jackaudio.org/wiki/WalkThrough/User/NetJack2
KAREN timing statistics
https://kmeasure.karen.ac.nz/cgi-bin/smokeping.cgi?target=INTERNATIONAL_LOCATIONS
Allan clock variance measurement
http://en.wikipedia.org/wiki/Allan_variance
To find this document, go to: http://eprints.otago.ac.nz/
?Questions?
Suggestions:
What about Skype? How does it manage?
What about MIDI?
How do musicians deal with latency normally?