Friday, 14 December 2012

Nasty nasty ATOMIC

Ooo what an insidious bug, let me set the scene.

I have a function that wants to transmit data on serial line, it's actually the start of a simple packetising feature I'm writing. So far there are two functions used as follows.


The names are fairly self-explanatory I think except that packetSend() wraps the string in "<>" characters, so the end result is this


being transmitted.

It worked just fine straight out of the box and lower bit rates, but then I upped the bit rate to 500kbps and it failed miserably, just transmitting "<A" or "<AB" or maybe  "<ABC" depending on where I placed debugging code.

The LARD serial code works like most others I guess, some code puts bytes into a buffer and some other code takes them out and puts them into the UART with the "taking out" part being interrupt driven. Both these processes involve updating a counter to indicate the number of bytes in the buffer, very important information that tells both sides how many bytes there are and where they should go.

For this reason any accesses to the buffer should be "atomic", meaning that when process A is accessing the buffer process B (or anybody else for that matter) must not also access the buffer. An atomic operation is indivisible, there must be no interrupts during the course of that operation.

In an interrupt-driven system this can be done by using semaphores, mutexes etc but in a simple system like this the easiest way is to just disable the interrupts when putting bytes into the buffer so the taking out code (which being interrupt-driven can occur at any time, even half-way through updating a counter) cannot try to take a byte out while the putting in code is putting one in. Got it?

So to facilitate this I have two nice macros called ATOMIC_START and ATOMIC_END, and to allow them to be nested they maintain their own counter and ATOMIC_END will only disable interrupts if that counter is 0.

So far so good, but the symptoms I mentioned above really smacked of a race condition on the buffer's nItems counter, the transmit ISR was reading 0 items when in fact there was 1. It would therefore take no further action and even though the buffer continued to fill no more bytes would be transmitted.

Time for the logic analyser and some deep thought. It's very hard to fix a problem if you can't see it and that's where a logic analyser comes into its own because this sort of problem cannot be seen with a debugger or "printf" debugging. I find the best way is to use 2-3 spare IO pins and toggle them at critical parts of the code, this has almost no affect on the real-time nature of the program and with the pulses properly placed they can tell you a lot.

So here's the obligatory logic analyser trace pic.

The packetSend()function has 28 bytes to send so it starts the ball rolling by writing the first of them directly into the UART ('A'), after that it writes the bytes into the buffer ('B').

The first write then causes a byte to be transmitted and when that's complete an interrupt ('C') is triggered. This checks to see if there are any bytes in the buffer and if so reads one and writes it to the UART ('D'). This process continues until there are no bytes left in the buffer and as you can see the As, Bs, Cs and Ds are nicely interspersed and everything works well for 4 bytes.

Now look at what happens at around the 200uS mark, the B that has been taking about 12uS blows out to nearly 38uS and smack in the middle of it we see a ISR call (with a negative pulse I used to see which path the code took).

This is the crux of the bug. If ATOMIC_START worked properly it should not be possible to service an interrupt in the middle of packetSend(). This means that potentially both functions are trying to access the buffer's byte counter at the same time and the results are indeterminate.

In this case the ISR obviously reads the counter just before it was incremented from 0 to 1, it therefore got a value of 0 and that folks was the end of any transmission, despite the fact that packetSend() continued to write bytes into the buffer.

I replace ATOMIC_START with the standard _disable_irq() and the whole shebang bursts into life with all 28 bytes being transmitted correctly.

So, another bug squashed, tomorrow I'll be having a long hard look at ATOMIC_START but for now it's 3AM so I'm off to bed.


  1. You don't need a counter to implement nested disabling of interrupts on Cortex chips. Here are my equivalents to your ATOMIC_START/END (using CMSIS):

    #define EnterCritical() \
    register uint32_t primask__ = __get_PRIMASK(); \

    #define ExitCritical() __set_PRIMASK(primask__)

  2. Thanks ToBeFrank (I assume), I'll use that in future.