Skip to main content

Oscilloscope assisted debugging

Concept

External test equipment is a powerful but often overlooked tool for debugging particular kinds of software problems. In particular, test equipment like oscilloscopes and logic analyzers are well suited to investigating timing related issues on embedded systems.

Why is test equipment suited to helping with issues related to timing? Oscilloscopes and logic analyzers can easily provide microsecond or better measurements and, unlike software based monitoring approaches, they aren't affected by other software running on the system.

Embedded systems are ideal for this approach as they both provide the io outputs required for timing measurement, and often lack the advanced debugging facilities found in desktop/laptop processors and operating systems that would make this approach less necessary.

At the conclusion of debugging some 1-Wire bus contention issues I was still seeing CRC errors reported periodically. Here is how I used an oscilloscope to track down the likely cause of the CRC errors.

External measurement

The CRC errors I was seeing were periodic, occurring on perhaps 20% of the temperature reads from the DS18B20 digital temperature sensor. From the previous 1-Wire debugging the DS18B20 timing and bus signals looked clean and correct. They also looked consistent across all of the scope samples that I observed. That lead me to wondering if the CRC issue wasn't the DS18B20's timing on the bus but rather when the microprocessor was reading the value that the DS18B20 was signalling.

The esp32-owb 1-Wire driver uses a gpio line directly to communicate with devices. This means that the driver has to select whether the gpio is in output mode vs. input mode, and more precisely the software has to determine when to read the value of the gpio signal in order to receive data from the DS18B20 device.

Let's use an oscilloscope and another gpio signal to look at when the driver is reading data from the 1-Wire bus.

Results

Here is a scope capture during a communication where the 1-Wire driver reported a CRC error. The yellow signal is the 1-Wire bus. The blue is the additional gpio line that is used to provide visibility into when the 1-Wire bus was read. The driver was modified to set this gpio line high immediately before the 1-Wire bus was read and set low immediately after the gpio read is completed.



In the above capture you can see that the left most blue pulse is right on the edge of the rising yellow line. It is almost always undesirable to sample the state of a communication bus while the state of the bus is changing. This points to a timing issue during bus sampling as a potential cause of the CRC errors.

Normal bus timing of a bit read

Here is a scope trace that shows a typical bit read and its timing. The timing is 20us even though the driver is calling delay functions that should result in a cumulative delay of 15us. While the timing is off by a significant amount, 5us or ~30%, it still falls within the range of acceptable timings as defined by the 1-Wire specification.


Bad read timing

This scope trace shows a situation where the bus read was too late and the wrong value was read from the bus. The dashed cursor shows where the sample would have been taken if the sample was at the normal 20us.



Based on the sample time, the blue pulse, the value sampled was incorrect and would represent a bit error in the communication with the DS18B20 device. A single bit error like this would result in a CRC error during message validation.

The variation in the sample timing may be due to the processor handing other operations causing the driver's software based delays to be extended.

Timing offset

The difference between the intended 15us and the actual 20us could be compensated for by reducing the  the software delay values.

The below trace shows another capture of the nominal timing, measuring at 20us.



Using cursors we can look at the time between when the bus becomes stable and the current sampling point, 6.2us.


Knowledge that the 1-Wire specification indicates 15us, along with seeing that the bus is at a valid state 6.2us before the 20us sampling point give us a high confidence that we could back off the sample time by 5us without an issue.

Can we reduce the software delay in the driver, re-measuring with the oscilloscope to arrive at an actual delay of 15us? Would the timing be consistent in these cases or could we see an issues where the delay timing would be too short if the target cpu load was low?

Delay function measurement

Let's check the software delay function to see if it is working as intended by creating a minimal application that performs a range of gpio pulses across a range of delay values. The test function used was:

void gpio_output_pulses(int count, int delay_us)
{
    for(int i = 0; i < count; i++)
    {
        gpio_set_level(PHY_DEBUG_GPIO, 1);
        ets_delay_us(delay_us);
        gpio_set_level(PHY_DEBUG_GPIO, 0);

        ets_delay_us(20);
    }
}

The application used the test routine with delays ranging from 5us to 400us. The application was intentionally minimal to reduce the impact of other software systems, such as wifi communication, led blinking etc from altering the behavior of the delay function.

The timing of the pulses was measured on the oscilloscope and matched very closely with the target delay passed into the test function. This is evidence that the delay function is working as desired.

What else could explain the difference between the expected 15us and the 20us seen in the full application? The gpio operations to set input/output and to read the value could be responsible for some of that time. Another possibility is that event handling for things like wifi communication is causing the timing delay.

Timing measurement including gpio functions

Let's measure a more representative test case with the full set of gpio functions to see what their impact is on overall timing.

Here is the read bit routine from the esp32-owb library slightly modified to insert the delay times and the interaction with the PHY_DEBUG_GPIO for external timing measurement:

void gpio_simulate_read()
{
        gpio_set_direction(OWB_GPIO, GPIO_MODE_OUTPUT);
        gpio_set_level(OWB_GPIO, 0);  // Drive DQ low
        ets_delay_us(6);
        gpio_set_direction(OWB_GPIO, GPIO_MODE_INPUT); // Release the bus
        gpio_set_level(OWB_GPIO, 1);  // Reset the output level for the next output
        ets_delay_us(9);

        gpio_set_level(PHY_DEBUG_GPIO, 1);
        int level = gpio_get_level(OWB_GPIO);
        gpio_set_level(PHY_DEBUG_GPIO, 0);
        ets_delay_us(55);   // Complete the timeslot and 10us recovery
}



Here are some scope captures from the minimal application.



From these captures we can see that the timing is very close to the intended 15us of delay, its consistently between 16 and 16.8us.

These results add evidence that system processing is responsible for the remainder of the 5us delay, approximately 3.2us.

Why is cpu load affecting timing?

The delays involved are short, in the microseconds range. To achieve such short intervals a busy loop is used in the delay routine. The delay function will calculate, based on the desired delay, the number of loops it should perform before exiting.

How can other software affect the duration of the delay? Embedded processors, even if they aren't running multi-tasking operating systems, have to respond to external events such as wifi radio packets. This event handling can take time. While the event is being handled the delay function is suspended and it isn't looping. When the delay function resumes looping some amount of time has elapsed but the delay function isn't aware or able to know or compensate for that additional time. Introduce several of these interruptions and its possible to extend the already small delay interval by a significant amount of time.

In my case I'm running a multi-tasking operating system, FreeRTOS. But the 1-Wire driver is explicitly disabling task switching during these bus operations. This makes it more likely that the timing delay is a result of the processor handling interrupts or high priority events such as wifi packets.

To summarize, because a software loop is used to achieve the very small delay amounts required for the 1-Wire bus operation, it is possible that interrupts to handle external events can affect the 1-Wire bus timing. Because the timing was very accurate in the minimal test application it seems even more likely that wifi and other interrupts are the cause of the issue in the full application.

Potential next steps

As a next step it should be possible to disable interrupts during the 1-Wire gpio communications. If this resolved the timing variation then we could be confident that interrupt processing was the cause. At this point I'm not going to look into the issue any further. I'm pretty confident that interrupts are the cause of the issue but its always possible that I missed something so its tough to say with certainty.

Conclusion

The software delays used in the 1-Wire gpio based driver are resulting in significant timing variations. Some of these timing variations are outside of the 1-Wire timing specification and are resulting in the wrong values being read from the bus. These read issues are causing CRC errors, but may not be the only cause of CRC errors. Depending on your use case and cpu activity these periodic bit errors may or may not affect your use of the esp32-owb library. In my case I'm not particularly concerned about occasional missed temperature readings but plan to continue working to improve the library to resolve these issues.

A natural evolution of the 1-Wire driver is to look at hardware that can help maintain more precise timing. In particular the ESP32 has a number of RMT (remote control modules) that are designed to generate and read signals such as those used in IR communications. This RMT functionality is well suited to interacting with a 1-Wire bus.

I've already started working on an RMT driver for esp32-owb and hope to have it integrated in the next few weeks.

Comments

Popular posts from this blog

Debugging an imprecise bus access fault on a Cortex-M3

This information may apply to other cortex series processors but is written from practical experience with the Cortex-M3. Imprecise bus access faults are ambiguous, as noted by the term "imprecise". Compared to precise bus errors, imprecise errors are much trickier to debug and especially so without a deep understanding of arm processors and assembly language. Imprecise and precise flags are found in the BusFault status register, a byte in the CFSR (Configurable Fault Status Register). BusFault status register bits The definition for imprecise and precise bits is: [2] IMPRECISERR Imprecise data bus error: 0 = no imprecise data bus error 1 = a data bus error has occurred, but the return address in the stack frame is not related to the instruction that caused the error. When the processor sets this bit to 1, it does not write a fault address to the BFAR. This is an asynchronous fault. Therefore, if it is detected when the priority of the current pr

Graco Swing By Me - Battery to AC wall adapter modification

If you have one of these Graco battery powered swings you are probably familiar with the cost of C batteries! The swing takes four of them and they only last a handful of days. I'm not sure if the newer models support being plugged into the wall but ours didn't. If you are a little familiar with electronics and soldering, here is a rough guide on how you can modify yours to plug in! I wasn't sure how exactly to disassemble the swing side where the batteries were. I was able to open up the clamshell a bit but throughout this mod I was unable to determine how to fully separate the pieces. I suspect that there is some kind of a slip plate on the moving arm portion. The two parts of the plastic are assembled and the moving arm portion with the slip plate is slid onto the shaft. Because of the tension in that slip plate it doesn't want to back away, and because of the mechanicals that portion of the assembly doesn't appear accessible in order to free it. I was

Memory efficient queuing of variable length elements

In embedded environments memory can be a critical driver of the design of data structures and containers. Computing resources have been expanding steadily each year but there are still a wide range of systems with far less than a megabyte of memory. On systems with tens of kilobytes of memory, structures are often designed to be compact to maximize data density. Rather than splurging on memory aligned elements that would be faster for the processor to access, a developer will typically use types with minimal sizes based on the known range of values that the element is intending to hold. Fixed sized buffers At my day job a fixed size pool of messages was implemented to hold message data. While this achieved one design goal of using statically allocated buffers, avoiding dynamic allocations that might fail at runtime, it isn't efficient if there is a wide range of message sizes. It isn't efficient because each message uses a message buffer. With small message sizes the buff