# A little fixed point math for embedded audio

2022-04-26Recently, I wanted to generate some sounds for MnemOS, and I wanted it to go fast.

This is a little mini-blog post describing how I did that using fixed point math, and reduced the total CPU usage to about 20% of the original amount.

I wanted to generate a sine wave to play out of a speaker on my small embedded device. My end goal was to generate 44.1kHz PCM samples, which meant I needed to generate 44100 `i16`

s per second.

I also needed to generate stereo audio, but for now, I just play the same sine wave on both channels.

In embedded Rust, the core library doesn't actually link in routines for things like `sin`

operations.

This means that typically you have two choices to "provide" these operations:

- The
`libm`

crate, which provides very accurate implementations of math operations like`sin`

, but is often (relatively) slow, and the code size is (relatively) large, as these are fit for very low error rate calculations. - The
`micromath`

crate, which provides reasonable approximations of many operations like`sin`

, to some respectable precision (but less precise than`libm`

).

Since I knew I didn't need *crazy* accurate data, I didn't even try libm for this. My initial code using `micromath`

looked like this, for generating my sine wave data:

```
// Generate N samples, interleaving left and right i16 PCM samples
for (i, dat) in idata.chunks_exact_mut(2).enumerate() {
use micromath::F32Ext;
let mut value = (i as f32);
value *= (2.0 * core::f32::consts::PI * 441.0);
value /= 44100.0;
let ival = (value.sin() * (i16::MAX as f32)) as i16;
dat.iter_mut().for_each(|i| *i = ival);
}
```

I profiled this on my 64MHz Cortex-M4F CPU, and found it took an acceptable, but not impressive, average of 114 CPU cycles per loop iteration.

**This meant that to generate 44100 samples per second, I would spend a total of 5.02M cycles/second generating the audio, out of a total of 64M cycles I had total, which was 7.9% of my entire CPU time.**

So instead, I decided to use a slightly different approach, using a phase accumulator, and fixed point math, and a pre-calculated 256-entry look up table.

Generating the sine table was pretty straightforward, I ran this on my desktop, and copy and pasted the output into an array:

```
// Generate a 256 point sine look up table
for i in 0..256 {
let val = 2.0 * core::f32::consts::PI * ((i as f32) / 256.0);
let val = (i16::MAX as f32) * val.sin();
print!("{}, ", val.round() as i16);
}
println!();
```

That code looked like this:

```
// A pre-calculated Sine Look Up Table, or LUT.
use crate::SINE_TABLE; // : [i16; 256];
// We have a sample rate at 44.1kHz. Figure out how many samples it
// will take to play a single loop of our sine wave. For example,
// at 441hz, it will take 100 samples to "traverse" one
// whole sine wave.
let samp_per_cyc: f32 = 44100.0 / 441.0;
// The increment is (how many items are in our look up table)
// divided by (how long it takes to go through one look up table)
let fincr = 256.0 / samp_per_cyc;
// Convert this number to an i32, which we use as a fixed point
// decimal, basically 8bits.24bits, where the 8 bits are which LUT
// position we are currently in, and the 24 bits are like a "decimal"
// describing where we are between the LUT positions
let incr: i32 = (((1 << 24) as f32) * fincr) as i32;
// This calculation is based on the current "phase angle" of the sine
// table. This works because we increment our progress through the
// 256 value in our 8.24 fixed point number, and it will wrap around
// correctly whenever we "overflow" (this is a good thing! our sine
// table is the same loop over and over and over and...)
let mut cur_offset = 0i32;
// generate the next N samples...
idata.chunks_exact_mut(2).for_each(|i| {
let val = cur_offset as u32;
// Mask off the top 8 bits. This tells us which LUT position
// we are in
let idx_now = ((val >> 24) & 0xFF) as u8;
// Add one (wrapping), to tell us what the NEXT LUT position is.
let idx_nxt = idx_now.wrapping_add(1);
// Get the i16 value of the two LUT data points
let base_val = SINE_TABLE[idx_now as usize] as i32;
let next_val = SINE_TABLE[idx_nxt as usize] as i32;
// Distance to next value - perform 256 slot linear interpolation
// Here, I take the top 8 bits of the 24 bit "decimal" part of
// the number. This will be used to interpolate one of
// 256 positions between the two LUT positions.
//
// This is to reduce error between the "steps" of each LUT
// value in our table
let off = ((val >> 16) & 0xFF) as i32; // 0..=255
// Here, we "weight" each sample based on how close we are at. We
// multiply this to a total of 256x larger than our original
// sample, split between how close we are between the two samples.
// We multiply each, then add them back together
let cur_weight = base_val.wrapping_mul(256i32.wrapping_sub(off));
let nxt_weight = next_val.wrapping_mul(off);
let ttl_weight = cur_weight.wrapping_add(nxt_weight);
// Our number is now the weighted average of the two LUT samples,
// but 256x too big. Reduce it down with a right shift operation.
let ttl_val = ttl_weight >> 8; // div 256
// Un-sign-extend this back to an i16, to use as a sample
let ttl_val = ttl_val as i16;
// Set the linearly interpolated value to the left and
// right channel
i.iter_mut().for_each(|i| *i = ttl_val);
// Adjust our phase angle by the "increment" we calculated earlier
cur_offset = cur_offset.wrapping_add(incr);
});
```

**When profiling this approach, My average loop time was now down to 22 cycles per iteration, meaning it would now only take me 970.2k CPU cycles per second, or 1.5% of my total CPU time!**

I also checked my approximation against the "real" floating point sine operation, and found a maximum error of 0.012% for any `i16`

value, which is more than close enough for my ears!

There's no long term lesson here, I just wanted to share the technique I used, and explain it a little bit. These approaches are widely used in high performance audio equipment, but I hadn't found too many clear examples to use when writing my code.