Finding bugs in the WiSense firmware

I am a big fan of the “assert()” macro. I put a lot of asserts in my code to catch bugs, race conditions and corner cases. I use “assert()” whenever the code detects a condition which should never happen. It is better to assert than to let the system run further. Running on will either hide an obvious bug or, worse, let the system get trashed and present you with strange symptoms which almost always take a long time to debug.

When we have a distributed system such as a wireless mesh network, “assert()” is not going to help if the asserting node is not being debugged (running under JTAG, SPY-BI-WIRE etc). I started off by defining a function “SYS_fatal()” in the WiSense node firmware. I call this function instead of the “assert()” macro. This function has an infinite loop which simply blinks the two on board LEDs alternately, giving me a visual indication that a particular node has asserted and there is now at least one bug to fix. I have been using this method for a long time. I never fail to use an expletive when I find some node giving me the “blink”.

Later I added a 16 bit argument to “SYS_fatal()”. This argument identifies the location in the code which called “SYS_fatal()”. I have around 200 “SYS_fatal()” calls sprinkled throughout the code including the MAC layer, the network layer, the APP layer, driver code etc. I added a unique 16 bit identifier to each of these calls. For example:

if (condition)  
    SYS_fatal(SYS_FATAL_ERR_X);
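One way to keep around 200 identifiers unique is a single list with explicit values. The sketch below is only an illustration of that idea: the SYS_FATAL_ERR_<n> naming comes from the firmware, but the numeric values, the SYS_FATAL_ERR_NONE name and the fatal_id_to_u16() helper are assumptions made here, not the actual WiSense definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical id list. Only the SYS_FATAL_ERR_<n> naming is from the post;
 * the values and SYS_FATAL_ERR_NONE are assumptions for illustration. */
typedef enum {
    SYS_FATAL_ERR_NONE = 0x0000,  /* the post reserves 0x0000 for "no fault" */
    SYS_FATAL_ERR_17   = 17,
    SYS_FATAL_ERR_18   = 18,
    SYS_FATAL_ERR_19   = 19
} sys_fatal_id_t;

/* Convert an id to the 16 bit value that would be stored and transmitted.
 * Rejects 0x0000 (reserved) and anything that does not fit in 16 bits. */
static uint16_t fatal_id_to_u16(sys_fatal_id_t id)
{
    assert(id != SYS_FATAL_ERR_NONE && (unsigned long)id <= 0xFFFFul);
    return (uint16_t)id;
}
```

Giving each entry an explicit value (rather than letting the compiler number them) means an id keeps its meaning even when entries are added or removed in later firmware versions.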

I added code (to SYS_fatal()) to store the passed 16 bit argument in the 256 byte EEPROM (at byte offset 0) of the AT24MAC602 on board every WiSense node. The function then blinks the two LEDs as before to indicate that the node has asserted. On reset/power up, a WiSense node reads the last saved 16 bit assert id from the EEPROM and saves it in RAM. Then the EEPROM location containing the 16 bit assert id is cleared (set to 0x0000). Assert id 0x0000 indicates that the node was not in a fault/asserted condition before it was reset or power cycled. The 16 bit assert id (read from the EEPROM) then gets sent to the LPWMN coordinator within the ASSOC_REQUEST message (as part of the network joining process).
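The save-then-fetch-and-clear sequence described above can be sketched as follows. This is not the WiSense code: a RAM array stands in for the AT24MAC602 (a real driver would go over I2C), and every function name, the macro names and the byte order are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define EEPROM_SIZE      256u     /* the post says the EEPROM holds 256 bytes */
#define FATAL_ID_OFFSET  0u       /* the post stores the id at byte offset 0 */
#define FATAL_ID_NONE    0x0000u  /* "no fault before the last reset" */

/* RAM stand-in for the EEPROM (assumption, for a host-side sketch only). */
static uint8_t sim_eeprom[EEPROM_SIZE];

static void sim_eeprom_write16(uint16_t off, uint16_t val)
{
    assert(off + 1u < EEPROM_SIZE);
    sim_eeprom[off]     = (uint8_t)(val >> 8);    /* byte order is an assumption */
    sim_eeprom[off + 1] = (uint8_t)(val & 0xFFu);
}

static uint16_t sim_eeprom_read16(uint16_t off)
{
    assert(off + 1u < EEPROM_SIZE);
    return (uint16_t)(((uint16_t)sim_eeprom[off] << 8) | sim_eeprom[off + 1]);
}

/* What a fatal handler might do before entering its LED blink loop. */
static void fatal_save_id(uint16_t fatal_id)
{
    sim_eeprom_write16(FATAL_ID_OFFSET, fatal_id);
}

/* Boot-time step: fetch the last saved id into RAM, then clear the slot so
 * that a later clean reset reports FATAL_ID_NONE. */
static uint16_t boot_fetch_and_clear_fatal_id(void)
{
    uint16_t last = sim_eeprom_read16(FATAL_ID_OFFSET);
    sim_eeprom_write16(FATAL_ID_OFFSET, FATAL_ID_NONE);
    return last;
}
```

The value returned by boot_fetch_and_clear_fatal_id() is what would be kept in RAM and later copied into the join request. Clearing the slot right after reading it is what makes a single reset report the fault exactly once.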

When the LPWMN coordinator gets an ASSOC_REQUEST from a node, it copies the 16 bit last saved assert id received in the message into the node’s entry in the list of registered nodes maintained in RAM.
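On the coordinator side, the bookkeeping might look roughly like the sketch below. The table layout, its size, the use of a 16 bit short address as the key and all function names are hypothetical; the post only says that the id is copied into the node's entry in a RAM list of registered nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 4u   /* hypothetical table size */

typedef struct {
    uint8_t  in_use;
    uint16_t short_addr;     /* hypothetical key identifying the node */
    uint16_t last_fatal_id;  /* id carried in the node's join request */
} node_entry_t;

static node_entry_t node_table[MAX_NODES];

static node_entry_t *node_find(uint16_t short_addr)
{
    for (size_t i = 0; i < MAX_NODES; i++)
        if (node_table[i].in_use && node_table[i].short_addr == short_addr)
            return &node_table[i];
    return NULL;
}

static node_entry_t *node_alloc(uint16_t short_addr)
{
    for (size_t i = 0; i < MAX_NODES; i++) {
        if (!node_table[i].in_use) {
            node_table[i].in_use = 1;
            node_table[i].short_addr = short_addr;
            return &node_table[i];
        }
    }
    return NULL;
}

/* Called when a join request arrives; returns 0 on success, -1 if full. */
static int coord_on_assoc_request(uint16_t short_addr, uint16_t last_fatal_id)
{
    node_entry_t *e = node_find(short_addr);
    if (e == NULL)
        e = node_alloc(short_addr);
    if (e == NULL)
        return -1;
    e->last_fatal_id = last_fatal_id;
    return 0;
}

/* Lookup a user-facing command could call; returns -1 for unknown nodes. */
static int coord_get_last_fatal_id(uint16_t short_addr, uint16_t *out)
{
    const node_entry_t *e;
    assert(out != NULL);
    e = node_find(short_addr);
    if (e == NULL)
        return -1;
    *out = e->last_fatal_id;
    return 0;
}
```

In this sketch a node that rejoins after a clean reset overwrites its earlier entry with 0x0000, so the table always reflects the most recent reset only.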

The gateway UI (which runs under Linux/Cygwin) has a command “gnlfe” which allows a user to retrieve any registered node’s last saved assert/fault id.

The firmware thus has a mechanism to determine the location of the code which asserted without having to reproduce the scenario with the node in question under JTAG/SPY-BI-WIRE control. This is a big time saver.

Let me end this post with an example of WiSense firmware code which calls SYS_fatal().

sts = PLTFRM_startTimerA0(ackTmoMilliSecs, 0, MAC_ackTmoHndlr);
if (sts != PLTFRM_STS_SUCCESS)  
    SYS_fatal(SYS_FATAL_ERR_18);

MAC_cntxt.txModState = MAC_TX_MOD_STATE_WAIT_ACK;

In the example above, the code is calling SYS_fatal( ) if it fails to start a timer. There is no reason why this timer API should fail unless there is a bug in the code. This API fails if the timer “A0” is already running or the timeout is not within the allowed range. Both conditions point to buggy code.
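To make the two failure conditions concrete, here is a small host-side model of a one-shot timer that refuses to start when it is already running or when the timeout is out of range. It is not PLTFRM_startTimerA0: the names, the status codes and the allowed range of 1 to 2000 milliseconds are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_MIN_MS 1u     /* hypothetical allowed range */
#define TIMER_MAX_MS 2000u

typedef enum {
    TMR_STS_SUCCESS = 0,
    TMR_STS_BUSY,           /* timer already running */
    TMR_STS_BAD_TIMEOUT     /* timeout outside the allowed range */
} tmr_sts_t;

typedef void (*tmr_cb_t)(void);

static struct {
    uint8_t  running;
    tmr_cb_t cb;
} sim_timer;

static unsigned ack_timeouts;   /* counts expiries seen by the handler */

static void on_ack_timeout(void)
{
    ack_timeouts++;
}

/* Start the simulated timer; both failure cases indicate a caller bug. */
static tmr_sts_t sim_timer_start(uint16_t timeout_ms, tmr_cb_t cb)
{
    assert(cb != 0);
    if (sim_timer.running)
        return TMR_STS_BUSY;
    if (timeout_ms < TIMER_MIN_MS || timeout_ms > TIMER_MAX_MS)
        return TMR_STS_BAD_TIMEOUT;
    sim_timer.running = 1;
    sim_timer.cb = cb;
    return TMR_STS_SUCCESS;
}

/* Simulate expiry: stop the timer, then run the handler. */
static void sim_timer_expire(void)
{
    if (sim_timer.running) {
        sim_timer.running = 0;
        sim_timer.cb();
    }
}
```

A caller that gets anything other than TMR_STS_SUCCESS here has either double-started the timer or computed a bad timeout, which is exactly the kind of "should never happen" condition worth a fatal call rather than a silent retry.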

I could have used the C pre-processor macros “__FILE__” and “__LINE__” to identify each invocation of SYS_fatal() instead of using a 16 bit unique identifier. The downside is that with time, as code changes, __LINE__ info from a node running an older version of the firmware may not correspond to the latest code. In addition, __FILE__ and __LINE__ info would increase flash usage (read only data section) and also increase the size of the ASSOC_REQUEST message.

Posted on December 29, 2014, in Uncategorized.
