Concept
External test equipment is a powerful but often overlooked tool for debugging particular kinds of software problems. In particular, test equipment like oscilloscopes and logic analyzers are well suited to investigating timing related issues on embedded systems.Why is test equipment suited to helping with issues related to timing? Oscilloscopes and logic analyzers can easily provide microsecond or better measurements and, unlike software based monitoring approaches, they aren't affected by other software running on the system.
Embedded systems are ideal for this approach as they both provide the io outputs required for timing measurement, and often lack the advanced debugging facilities found in desktop/laptop processors and operating systems that would make this approach less necessary.
At the conclusion of debugging some 1-Wire bus contention issues I was still seeing CRC errors reported periodically. Here is how I used an oscilloscope to track down the likely cause of the CRC errors.
External measurement
The CRC errors I was seeing were periodic, occurring on perhaps 20% of the temperature reads from the DS18B20 digital temperature sensor. From the previous 1-Wire debugging the DS18B20 timing and bus signals looked clean and correct. They also looked consistent across all of the scope samples that I observed. That lead me to wondering if the CRC issue wasn't the DS18B20's timing on the bus but rather when the microprocessor was reading the value that the DS18B20 was signalling.
The esp32-owb 1-Wire driver uses a gpio line directly to communicate with devices. This means that the driver has to select whether the gpio is in output mode vs. input mode, and more precisely the software has to determine when to read the value of the gpio signal in order to receive data from the DS18B20 device.
Let's use an oscilloscope and another gpio signal to look at when the driver is reading data from the 1-Wire bus.
Results
Here is a scope capture during a communication where the 1-Wire driver reported a CRC error. The yellow signal is the 1-Wire bus. The blue is the additional gpio line that is used to provide visibility into when the 1-Wire bus was read. The driver was modified to set this gpio line high immediately before the 1-Wire bus was read and set low immediately after the gpio read is completed.
In the above capture you can see that the left most blue pulse is right on the edge of the rising yellow line. It is almost always undesirable to sample the state of a communication bus while the state of the bus is changing. This points to a timing issue during bus sampling as a potential cause of the CRC errors.
Normal bus timing of a bit read
Here is a scope trace that shows a typical bit read and its timing. The timing is 20us even though the driver is calling delay functions that should result in a cumulative delay of 15us. While the timing is off by a significant amount, 5us or ~30%, it still falls within the range of acceptable timings as defined by the 1-Wire specification.Bad read timing
This scope trace shows a situation where the bus read was too late and the wrong value was read from the bus. The dashed cursor shows where the sample would have been taken if the sample was at the normal 20us.Based on the sample time, the blue pulse, the value sampled was incorrect and would represent a bit error in the communication with the DS18B20 device. A single bit error like this would result in a CRC error during message validation.
The variation in the sample timing may be due to the processor handing other operations causing the driver's software based delays to be extended.
Timing offset
The difference between the intended 15us and the actual 20us could be compensated for by reducing the the software delay values.
The below trace shows another capture of the nominal timing, measuring at 20us.
Using cursors we can look at the time between when the bus becomes stable and the current sampling point, 6.2us.
Knowledge that the 1-Wire specification indicates 15us, along with seeing that the bus is at a valid state 6.2us before the 20us sampling point give us a high confidence that we could back off the sample time by 5us without an issue.
Can we reduce the software delay in the driver, re-measuring with the oscilloscope to arrive at an actual delay of 15us? Would the timing be consistent in these cases or could we see an issues where the delay timing would be too short if the target cpu load was low?
The application used the test routine with delays ranging from 5us to 400us. The application was intentionally minimal to reduce the impact of other software systems, such as wifi communication, led blinking etc from altering the behavior of the delay function.
The timing of the pulses was measured on the oscilloscope and matched very closely with the target delay passed into the test function. This is evidence that the delay function is working as desired.
What else could explain the difference between the expected 15us and the 20us seen in the full application? The gpio operations to set input/output and to read the value could be responsible for some of that time. Another possibility is that event handling for things like wifi communication is causing the timing delay.
Here is the read bit routine from the esp32-owb library slightly modified to insert the delay times and the interaction with the PHY_DEBUG_GPIO for external timing measurement:
Here are some scope captures from the minimal application.
From these captures we can see that the timing is very close to the intended 15us of delay, its consistently between 16 and 16.8us.
These results add evidence that system processing is responsible for the remainder of the 5us delay, approximately 3.2us.
How can other software affect the duration of the delay? Embedded processors, even if they aren't running multi-tasking operating systems, have to respond to external events such as wifi radio packets. This event handling can take time. While the event is being handled the delay function is suspended and it isn't looping. When the delay function resumes looping some amount of time has elapsed but the delay function isn't aware or able to know or compensate for that additional time. Introduce several of these interruptions and its possible to extend the already small delay interval by a significant amount of time.
In my case I'm running a multi-tasking operating system, FreeRTOS. But the 1-Wire driver is explicitly disabling task switching during these bus operations. This makes it more likely that the timing delay is a result of the processor handling interrupts or high priority events such as wifi packets.
To summarize, because a software loop is used to achieve the very small delay amounts required for the 1-Wire bus operation, it is possible that interrupts to handle external events can affect the 1-Wire bus timing. Because the timing was very accurate in the minimal test application it seems even more likely that wifi and other interrupts are the cause of the issue in the full application.
Delay function measurement
Let's check the software delay function to see if it is working as intended by creating a minimal application that performs a range of gpio pulses across a range of delay values. The test function used was:void gpio_output_pulses(int count, int delay_us) { for(int i = 0; i < count; i++) { gpio_set_level(PHY_DEBUG_GPIO, 1); ets_delay_us(delay_us); gpio_set_level(PHY_DEBUG_GPIO, 0); ets_delay_us(20); } }
The application used the test routine with delays ranging from 5us to 400us. The application was intentionally minimal to reduce the impact of other software systems, such as wifi communication, led blinking etc from altering the behavior of the delay function.
The timing of the pulses was measured on the oscilloscope and matched very closely with the target delay passed into the test function. This is evidence that the delay function is working as desired.
What else could explain the difference between the expected 15us and the 20us seen in the full application? The gpio operations to set input/output and to read the value could be responsible for some of that time. Another possibility is that event handling for things like wifi communication is causing the timing delay.
Timing measurement including gpio functions
Let's measure a more representative test case with the full set of gpio functions to see what their impact is on overall timing.Here is the read bit routine from the esp32-owb library slightly modified to insert the delay times and the interaction with the PHY_DEBUG_GPIO for external timing measurement:
void gpio_simulate_read() { gpio_set_direction(OWB_GPIO, GPIO_MODE_OUTPUT); gpio_set_level(OWB_GPIO, 0); // Drive DQ low ets_delay_us(6); gpio_set_direction(OWB_GPIO, GPIO_MODE_INPUT); // Release the bus gpio_set_level(OWB_GPIO, 1); // Reset the output level for the next output ets_delay_us(9); gpio_set_level(PHY_DEBUG_GPIO, 1); int level = gpio_get_level(OWB_GPIO); gpio_set_level(PHY_DEBUG_GPIO, 0); ets_delay_us(55); // Complete the timeslot and 10us recovery }
Here are some scope captures from the minimal application.
From these captures we can see that the timing is very close to the intended 15us of delay, its consistently between 16 and 16.8us.
These results add evidence that system processing is responsible for the remainder of the 5us delay, approximately 3.2us.
Why is cpu load affecting timing?
The delays involved are short, in the microseconds range. To achieve such short intervals a busy loop is used in the delay routine. The delay function will calculate, based on the desired delay, the number of loops it should perform before exiting.How can other software affect the duration of the delay? Embedded processors, even if they aren't running multi-tasking operating systems, have to respond to external events such as wifi radio packets. This event handling can take time. While the event is being handled the delay function is suspended and it isn't looping. When the delay function resumes looping some amount of time has elapsed but the delay function isn't aware or able to know or compensate for that additional time. Introduce several of these interruptions and its possible to extend the already small delay interval by a significant amount of time.
In my case I'm running a multi-tasking operating system, FreeRTOS. But the 1-Wire driver is explicitly disabling task switching during these bus operations. This makes it more likely that the timing delay is a result of the processor handling interrupts or high priority events such as wifi packets.
To summarize, because a software loop is used to achieve the very small delay amounts required for the 1-Wire bus operation, it is possible that interrupts to handle external events can affect the 1-Wire bus timing. Because the timing was very accurate in the minimal test application it seems even more likely that wifi and other interrupts are the cause of the issue in the full application.
Potential next steps
As a next step it should be possible to disable interrupts during the 1-Wire gpio communications. If this resolved the timing variation then we could be confident that interrupt processing was the cause. At this point I'm not going to look into the issue any further. I'm pretty confident that interrupts are the cause of the issue but its always possible that I missed something so its tough to say with certainty.
Conclusion
The software delays used in the 1-Wire gpio based driver are resulting in significant timing variations. Some of these timing variations are outside of the 1-Wire timing specification and are resulting in the wrong values being read from the bus. These read issues are causing CRC errors, but may not be the only cause of CRC errors. Depending on your use case and cpu activity these periodic bit errors may or may not affect your use of the esp32-owb library. In my case I'm not particularly concerned about occasional missed temperature readings but plan to continue working to improve the library to resolve these issues.A natural evolution of the 1-Wire driver is to look at hardware that can help maintain more precise timing. In particular the ESP32 has a number of RMT (remote control modules) that are designed to generate and read signals such as those used in IR communications. This RMT functionality is well suited to interacting with a 1-Wire bus.
I've already started working on an RMT driver for esp32-owb and hope to have it integrated in the next few weeks.
Comments
Post a Comment