Multiple DS18B20 dallas.sensor: Scratch pad checksum invalid!

I finally managed to set up a development environment so I will start looking into this, but it will probably take time since there are so many things that are new to me and my C++ skills are kind of old.

So approaching the original developer is still a very good option.

I will try to submit pull requests to the original code whenever I find things that can be improved.

It will be interesting to see what you find. From my work, the scratchpad errors are not linked to specific sensors on a string but move randomly between them, which does indicate a timing issue. The sensor timing also changes with temperature, so heating one up to 70C can also generate more / less errors. I did wonder, but did not investigate, the impact of having the radio running: could a WiFi event cause the timing to shift slightly and cause the error? I mentioned earlier that I have definitely found that having each sensor (up to 4 on a string) at a different resolution improves matters but doesn’t eliminate it. Of the 20 sensors I tried, none failed to be detected properly, provided the pull-up was correct and the power supply good. Some boards have a poor quality 5V to 3.3V LDO, so generating your own 3.3V using a separate regulator is a good idea, as is not using parasitic power, to eliminate another potential gotcha.

Yes, I also expect to find some timing issues. E.g. when I tested my (back then) home-made SW by reading the ROM code from a single sensor, for one sensor I got a difference in just one bit (and yes, then the checksum will be incorrect). E.g. if I knew the ROM code was 0xA0 0xB0 0xC0 etc., I got the result 0xA1 0xB1 0xC1 (or the other way around, I don’t remember).
Note that the error this time was in my code, but the fact that the first bit (or last, I don’t remember) was not detected correctly while the remaining 7 bits were (and this being repeatable for each byte for one sensor but not for others) indicates that sensors from the same batch behave differently.
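To illustrate why a single wrong bit is enough to fail the checksum, here is a minimal sketch of the Dallas/Maxim 1-Wire CRC-8 (polynomial x^8 + x^5 + x^4 + 1). The function names are made up for this example and are not taken from the ESPHome component; the ROM bytes used when trying it out are arbitrary.

```cpp
#include <cstddef>
#include <cstdint>

// Dallas/Maxim 1-Wire CRC-8, bits processed LSB first.
// 0x8C is the reflected form of the polynomial x^8 + x^5 + x^4 + 1.
uint8_t crc8_1wire(const uint8_t *data, std::size_t len) {
  uint8_t crc = 0;
  for (std::size_t i = 0; i < len; i++) {
    uint8_t byte = data[i];
    for (int bit = 0; bit < 8; bit++) {
      const bool mix = ((crc ^ byte) & 0x01) != 0;
      crc >>= 1;
      if (mix) crc ^= 0x8C;
      byte >>= 1;
    }
  }
  return crc;
}

// A ROM code is 7 data bytes followed by their CRC. Running the CRC over
// all 8 bytes gives 0 when nothing was corrupted on the way.
bool rom_code_valid(const uint8_t rom[8]) { return crc8_1wire(rom, 8) == 0; }
```

Because this CRC catches every single-bit error, a ROM code or scratchpad that arrives with just one misread bit is always rejected, which matches the 0xA0 vs 0xA1 observation above.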

Concerning electrical shortcomings I didn’t see any in my setup monitoring the signals with an oscilloscope (60 MHz Analogue with delayed trigger function).
I was expecting to see capacitive load issues in my setup with 17 sensors and cables in a mess, with sensors in both star and chain config and only using the built-in power supply in my Sonoff Basic relay, but that was not the case. The rise time on the data bus was just fine and there were no issues on VCC.

The 17 sensors are placed on a wood boiler system with temperatures ranging from -30 to +95 degrees Celsius, and when starting the season now I realized (when finally removing the sensors with faulty ROM codes) that only 4 out of 17 were working. The faulty ones even included the outdoor sensor, so heat was not the problem. And I forgot to mention, my sensors were delivered by AZ Delivery and the price indicates that they are not original.

Another thing I noticed was that my old sensor that no longer responded with a correct ROM code took about 15% longer to do a temperature conversion than an unused sensor from the same batch. Considering that all unused sensors from the same batch had the same conversion time, this also indicates that the characteristics change over time when in use.

Running slow on my side since my embedded programming skills are way too old, but I will give a heads-up on my findings.

I’m using an ESP8266 but I guess others are kind of the same. I have found some things in the original code but I have not concluded yet whether they will have any real effect. There might be compatibility reasons, but just considering the ESP8266 I’ve found these things in the code:

  • The bus is in some (at least one) cases actively driven high, and that should instead be left to tri-state mode so that the pull-up resistor pulls the bus high. The main reason for this is explained in the next two points.
  • The code both changes pin mode (to input or output) and also sets the pin high and low. This is not necessary: the pin can be configured as low once in the initial phase, and then we just need to configure the pin mode as output (for a low signal, since it’s already set to low in the initial phase) and input (for a high signal, since this activates tri-state and the pull-up resistor then drives the bus high).
    If the bus is driven high from the ESP and low from the sensor (or vice versa), the bus will be in an undefined state and this by itself can cause instability.
  • Considering the problem I will describe later in this post, executing more code than needed makes these issues worse. Having said this, I haven’t changed the code and can’t say if my findings will help; so far I’m struggling to catch up on my programming skills and to understand and restructure the code.
  • The code waits a predefined time for the temperature conversion. When not using parasite mode this is not needed; a better solution is to check the bus, since devices will drive the line low while doing the temperature conversion.
  • The code does not support parasitic mode and that is not the intention of the code either. But it doesn’t check whether any sensors are using parasitic mode, and this could lead to instability. This is not mentioned in the documentation either.
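
The "set the latch low once, then only switch pin mode" idea from the first two points could look roughly like the sketch below. SimulatedPin and OpenDrainBus are hypothetical stand-ins written for this example so it can run on a PC; they are not the ESPHome or Arduino GPIO API.

```cpp
#include <cstdint>

enum class PinMode { Input, Output };

// Hypothetical model of a GPIO pin on a pulled-up 1-Wire bus.
struct SimulatedPin {
  PinMode mode = PinMode::Input;
  bool output_latch = true;       // level driven when mode == Output
  bool sensor_pulls_low = false;  // true while a sensor holds the bus low
  int actively_driven_high = 0;   // how often the master drove the bus high

  void set_mode(PinMode m) {
    mode = m;
    if (mode == PinMode::Output && output_latch) actively_driven_high++;
  }
  void write(bool level) {
    output_latch = level;
    if (mode == PinMode::Output && level) actively_driven_high++;
  }
  // Bus level: low if anyone pulls low, otherwise high via the pull-up.
  // Contention (master high, sensor low) is modelled as low here.
  bool read() const {
    if (sensor_pulls_low) return false;
    if (mode == PinMode::Output) return output_latch;
    return true;
  }
};

// The master only ever pulls low or releases; it never drives high.
class OpenDrainBus {
 public:
  explicit OpenDrainBus(SimulatedPin &pin) : pin_(pin) {
    pin_.set_mode(PinMode::Input);  // release first, so the latch change is not visible
    pin_.write(false);              // latch low once, never touched again
  }
  void drive_low() { pin_.set_mode(PinMode::Output); }
  void release() { pin_.set_mode(PinMode::Input); }
  bool read() const { return pin_.read(); }

 private:
  SimulatedPin &pin_;
};
```

With this structure each bus transition is a single mode change, and the counter in the simulated pin stays at zero, i.e. the master and a sensor can never fight over the line.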

My main findings are that it’s very hard to create a reliable timing on the 1-wire bus with the ESP8266 using the code framework that the ESPHome Dallas Component is based on. I will attach some pictures below showing what I mean

I started by testing the accuracy of the delayMicroseconds() function used in the code, and since I couldn’t get a stable reading on my oscilloscope I decided to create my own delay using a loop of instructions, and I went for the read-pin function. Repeating the read instruction 200 times gave an average of ~0.67 us per read instruction. I then decreased the count to 1 and got an average of 5 and sometimes 6 us per read. My thinking was that this depended on some overhead in my time measurement and loop, but just to verify I decided to check the average read time twice in a row, and then it was down to 1 or 2 us.

Doing the same test with pulling the bus low and releasing the bus into tri-state, the average was ~1.3 us for both of them. Decreasing to 5 repetitions gives an average of 1.4 us per instruction but sometimes an average of 2.4. Decreasing to 1 repetition gives an average of 1 to 2 us but sometimes as high as 7 us.
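
For anyone who wants to repeat this kind of measurement, a minimal sketch of the averaging is shown below. It uses std::chrono so it runs on a PC; on the ESP the clock would be something like micros() instead, and the function name is just an assumption for this example.

```cpp
#include <chrono>
#include <cstdint>

// Runs `op` `repeats` times and returns the average time per call in
// nanoseconds. The two clock reads and the loop itself are included in the
// total, so with very few repetitions that fixed overhead dominates.
template <typename Op>
double average_ns_per_call(Op op, uint32_t repeats) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (uint32_t i = 0; i < repeats; i++) op();
  const auto stop = clock::now();
  const double total_ns =
      std::chrono::duration<double, std::nano>(stop - start).count();
  return repeats == 0 ? 0.0 : total_ns / repeats;
}
```

Comparing the result for 200 repetitions with the result for 1 repetition gives a rough idea of how much of a single measurement is overhead rather than the instruction itself.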

Now look at the oscilloscope pictures below. Ignore the pattern generated but compare the two pictures. Note that the time scale (x-axis) is quite coarse (20 us per division), so even small visible differences matter.
The delays here are generated by repeating the instruction, e.g. if I pull the bus low I repeat the same instruction 4 times. The first bus-low instruction happens at the very left of the screen. The first release of the line happens at the first division mark in both pictures, but then the second bus-low happens 4 us later in the upper picture compared to the lower picture. In the upper image we also see some “shadows” indicating that the instructions sometimes take longer.

And before someone asks: yes, I have turned off the interrupts. But… I haven’t seen the low-level implementation of the code that is supposed to stop the interrupts (just the virtual constructors), so I’m not sure it works.


Also attaching this picture below (now time scale is 5 us per div) showing a shadow sample being 9 us away from its average

I don’t think it’s the ESP-chip that is the problem here (and @gedger indicates the same), I rather suspect that it’s in the low-level part of the code (i.e. not the part of the code that was written by the developer of the Dallas component for ESPHome)

So does this matter??? I would say yes, at least looking at the timing in the current code where an additional 5 us sometimes could be a disaster. Please note that I haven’t verified if any of this will help though…

Looking at the timing below:


To determine if the reading is a 1 or 0, we have 15 us to read the bus after the bus has been pulled low by the ESP. In ideal conditions (my bread-board) the pull-up resistor drives the bus high in 2 us but in real situations having capacitance added by sensors and cables this can be much longer. The spec allows this to take 15-1=14 us.

The dilemma:
If the sensor wants to deliver a zero and we (due to extra delays) wait too long, the sensor will already have released the bus into tri-state when we read, and hence we will read a 1. But on the other hand, we can’t read too fast either, because if the sensor wants to deliver a 1 we might (if we read too early) read while the pull-up resistor is still driving the bus high (but before the signal is high enough to register as a 1).

So what can we do??? I guess the best would be to read the line as fast as possible but not faster than the rise time on the 1-wire bus. So how can we know the rise time of the 1-wire bus, since it changes depending on the number of sensors, the length and characteristics of the cables, etc.? My idea is to use the reset pulse described below to calculate the rise time:


If we initiate the reset pulse and measure the time it takes until we read a high value, and then add some microseconds just to be safe, we probably get a good indication of the system. Let’s say we do this in the startup phase 200 times, calculate the average and then add (let’s say) 5 us. If the value is lower than 15 us we are good to go, and I expect that we will get a much lower value than 15 us. We can also use other methods, like broadcasting for the ROM code and for each bit received calculating the average time for a 0 and the average time for a 1. The second solution is probably best since we can get timings from more than one sensor (kind of random but anyway…), but it requires that we have the correct timing to even write to the bus. So the two methods probably need to be combined.
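
The first method (average the measured rise times, add a margin, compare against the 15 us window) can be sketched as below. The constants are the numbers mentioned in the post; the struct and function names are hypothetical, and the rise-time samples are assumed to come from some measurement routine that is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr uint32_t kReadWindowUs = 15;   // the bit must be sampled within 15 us
constexpr uint32_t kSafetyMarginUs = 5;  // the "let's say 5 us" margin

struct Calibration {
  bool usable;               // false when the bus is too slow (or no samples)
  uint32_t sample_delay_us;  // delay before sampling, valid when usable
};

// Averages the measured rise times (rounded up), adds the margin and checks
// that the result still fits inside the read window.
Calibration calibrate_sample_delay(const std::vector<uint32_t> &rise_times_us) {
  if (rise_times_us.empty()) return {false, 0};
  const uint64_t sum =
      std::accumulate(rise_times_us.begin(), rise_times_us.end(), uint64_t{0});
  const uint64_t count = rise_times_us.size();
  const uint32_t average = static_cast<uint32_t>((sum + count - 1) / count);
  const uint32_t delay = average + kSafetyMarginUs;
  return {delay < kReadWindowUs, delay};
}
```

A result with `usable == false` would be the natural place to log a warning about the bus being near or past its limits.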

But the best method (if available using ESPHome code) would probably be to fix the base code for the pin configurations so that we can control the bus with the accuracy that is required and that @gedger indicated that we could do using Arduino.

Let’s see what some evenings, nights, and early mornings will lead to…

For reference. Average execution time for Reading the bus, Setting the bus to low by setting the bus to Output mode, Setting the bus to high by setting the bus to input (tri-state) mode.

Some interesting findings. It shouldn’t be a surprise that a software loop timing mechanism is inconsistent, but it’s good to see it actually measured. The other question: is the ESP doing anything else in the background, even with interrupts turned off? I suspect it is, especially when WiFi / Bluetooth is being used; perhaps one of the developers can comment? Measuring the timing characteristics of the sensor string in use and then optimising the sample time seems a sound approach; it would definitely provide the best immunity to the inevitable software timing loop errors. It would also allow a warning to be thrown up if the timing was near the limits, with perhaps a suggestion to change the pull-up resistor value (if it would help?).

My prototype with the scratchpad errors is currently in transit via the USA to the developer so hopefully we may get another view at some point in the future, unless you solve it first…

He/she would probably solve it faster than me, but I will continue anyway, at a slow pace. The timing is only important during a very short time (60 us), i.e. the time to read or write a bit. This is also something that could be corrected in the original code, since the interrupt protection is done on byte and even word level.
I don’t think that other activities should interfere during the very short time we need to transmit a bit on the bus, but I have no knowledge about the scheduling in the ESPHome platform. An alternative would be to use the ESP’s built-in UART, which I asked about in the thread “Dallas ds18b20 using UART instead of GPIO”, but I didn’t get any positive response on that.

Did some improvements yesterday.

I made some changes to the reset pulse in the original code and now I’m strictly following the Dallas specification as well as I can. I have not yet started to use my home-made timings; this code just takes a different approach and verifies that each state in the timing is reached.

To stress the setup I added capacitors over Data-VCC and Data-Gnd to simulate long cables. Adding a 1 nF capacitance gives a substantial rise time of ~14 us.
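
As a rough cross-check of such numbers, the bus can be treated as a first-order RC circuit: the pull-up resistor charges the total bus capacitance, and the time to reach a given fraction of the supply voltage is t = -R * C * ln(1 - fraction). This is only a sketch; the pull-up value and the input threshold are not stated in this thread, so any values plugged in (for example 4.7 kOhm) are assumptions.

```cpp
#include <cmath>

// Time in microseconds for an RC-charged bus to rise from 0 V to `fraction`
// of the supply voltage (0 < fraction < 1), using t = -R * C * ln(1 - fraction).
double rise_time_us(double pullup_ohms, double capacitance_farads,
                    double fraction) {
  return -pullup_ohms * capacitance_farads * std::log(1.0 - fraction) * 1e6;
}
```

For example, `rise_time_us(4700.0, 1e-9, 0.9)` gives the time to reach 90% of VCC with an assumed 4.7 kOhm pull-up and 1 nF on the bus. Capacitors to VCC and to Gnd both load the data line, so their values add up in this model.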

Using the original code, the sensor was not identified and no ROM code was given. But using my new code the ROM code was identified :)
There is still a problem with reading temperatures, but I will look into that as well.

But high rise times are just one of the problems we can face. Reflections are another, so I will look into simulating reflections later. I also need to get an ESP32 chip to make sure it works for those as well, since I could see in the code that the developer had made some comments about low-level code differences between ESP32 and ESP8266 etc.


Thank you for taking your time to dig into this :)

I found this thread yesterday when I discovered that a freezer that I am controlling with a Sonoff THD16D, flashed with ESPHome, no longer reported any temperature, and thus was no longer controlling the temperature in the freezer. In the logs I could see the “Scratch pad checksum invalid!” error being logged, and searching for that led me here.

The device is basically this one, but with a display, and is based on esp32. The cable to the Dallas sensor is 50 cm long.

Just performing a restart of the device did nothing to solve the problem. But I managed to “restore” it by re-flashing it with the same FW that it was already using.
So I can no longer reproduce the fault. But it will probably resurface at some point in the future, unless something is changed in ESPHome.

Details about my device, for reference

  • ESPHome: 2023.2.4
  • Board type: nodemcu-32s
  • Connected to WiFi
  • Running the Climate component to control the built-in relay
  • Working flawlessly from ~March -23 until 24/11 -23 when it stopped reading the temperature

I’m too new to this to understand how a re-flash could help, but was it just a re-flash or did it also include a build, where you would probably have a lot of libraries updated, including the Dallas component?

It was a re-flash with the same version of ESPHome. I use the ESPHome docker container to manage, and build, all the FW’s for my devices. I find it convenient because I can access it from anywhere, and I will always have the same ESPHome version. So the code was re-compiled, but the SW version of ESPHome, and the yaml-file defining the content, was the same as what was used before, that stopped working.

I too find it strange that a re-flash could help. But then, I do not know what the internal ESPHome code looks like. I only know it from a user perspective.

You can test my version now if you want, I have attached the code needed at the end.

If it’s better than the original code… Well at the moment I can’t tell since also the original code now works even though I stress the bus with capacitors.

It’s kind of at your own risk and I have only tested this on an 8266. If things go wrong (as they did for me a couple of times when I created an internal loop :) ), you can be forced to flash using a cable.

So what have I done? The original code followed a best practice document from Maxim: “1-Wire Communication Through Software” (Analog Devices). It’s probably a good document but… it also says: “The system must be capable of generating an accurate and repeatable 1µs delay for standard speed and 0.25µs delay for overdrive speed”. The problem I have seen is that we don’t get 1 us timing accuracy on the ESPHome platform, and hence it’s not that reliable to follow this document. The Arduino documentation says this: “This function works very accurately in the range 3 microseconds and up to 16383. We cannot assure that delayMicroseconds will perform precisely for smaller delay-times.”

So instead of timing exactly when I should read and write, I have tried to follow the specification from Analog Devices: DS18B20 - Programmable Resolution 1-Wire Digital Thermometer (analog.com). Here I have used another approach: I listen on the bus to determine when it has reached a specific state. I have also assumed that the 1-wire devices react very fast to slope changes from the master, so there will be no delay until a device drives the bus low if it should be low, and vice versa.
I know my approach might be sensitive to reflections on the bus, but these can easily be handled with some micro delays (repeats) in the code.

For the specific bus commands, this is what I have done:

  • For reading, I just check that the ESP has driven the pin low and then check at what time the sensor has released the bus to tri-state. If the time is more than 15us I consider it to be a 0; if the bus is released to tri-state earlier than 15us I consider it to be a 1.

  • For the reset pulse I did kind of the same. After pulling the bus low for 480us I released the bus into tri-state and waited until I could read the bus as high. I then check if the bus goes low again within a time frame of 240us from when I detected that the bus was high. If it goes low within that time frame there’s at least one sensor present. I then check that the sensor releases the bus into tri-state and wait out the second 480us period.

  • For the write_bit()…
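
The read-bit rule from the first point (released before 15 us means 1, otherwise 0) can be sketched as follows. This is not the code from the fork; SimulatedSlot is a hypothetical bus model where each poll takes 1 us, and the real code would measure elapsed time with a microsecond clock instead of counting polls.

```cpp
#include <cstdint>

// Hypothetical simulated read slot: time advances 1 us per poll and the
// sensor holds the line low until `release_at_us`.
struct SimulatedSlot {
  uint32_t now_us = 0;
  uint32_t release_at_us;
  explicit SimulatedSlot(uint32_t release) : release_at_us(release) {}
  bool read() { return now_us++ >= release_at_us; }
};

constexpr uint32_t kThresholdUs = 15;  // released earlier than this: bit is 1
constexpr uint32_t kSlotUs = 60;       // give up after one full time slot

// Assumes the master has just pulled the line low and released it again.
// Polls until the line is seen high, then classifies the bit by how long
// that took. A line that stays low for the whole slot reads as 0.
template <typename Bus>
bool read_bit_by_release_time(Bus &bus) {
  uint32_t elapsed_us = 0;
  while (elapsed_us < kSlotUs && !bus.read()) elapsed_us++;
  return elapsed_us < kThresholdUs;
}
```

Since the decision depends on when the line actually goes high rather than on one sample at a fixed instant, a few microseconds of jitter in the master only matter when the release happens very close to the 15 us threshold.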

Since it’s my first C++ work in 15 years the code could be better and there are still things to fix. And I haven’t done the write-bit part yet either.

Other things I have done: I removed some unneeded instructions, and I also moved the interrupt protection from a high level down to just the 3 bus control functions read_bit(), write_bit(), and reset().
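
A minimal sketch of what per-bit interrupt protection means in practice is shown below. FakeInterrupts, ScopedInterruptLock and LoopbackBus are hypothetical stand-ins that only count lock/unlock calls; on real hardware the lock would disable and re-enable interrupts, and write_bit() would generate the actual time slot.

```cpp
#include <cstdint>

// Hypothetical stand-in for the platform's interrupt control; it only counts.
struct FakeInterrupts {
  int depth = 0;       // > 0 while "interrupts are disabled"
  int lock_count = 0;  // how many times interrupts were disabled
};

// RAII guard: "disables interrupts" on construction, restores on destruction.
class ScopedInterruptLock {
 public:
  explicit ScopedInterruptLock(FakeInterrupts &irq) : irq_(irq) {
    irq_.depth++;
    irq_.lock_count++;
  }
  ~ScopedInterruptLock() { irq_.depth--; }
  ScopedInterruptLock(const ScopedInterruptLock &) = delete;
  ScopedInterruptLock &operator=(const ScopedInterruptLock &) = delete;

 private:
  FakeInterrupts &irq_;
};

// Records written bits instead of driving a real bus.
struct LoopbackBus {
  FakeInterrupts irq;
  uint8_t received = 0;
  int bit_index = 0;

  void write_bit(bool bit) {
    ScopedInterruptLock lock(irq);  // held only for this one time slot
    if (bit) received |= static_cast<uint8_t>(1u << bit_index);
    bit_index = (bit_index + 1) % 8;
  }
  // No lock here: interrupts may run between the individual bits.
  void write_byte(uint8_t value) {
    received = 0;
    for (int i = 0; i < 8; i++) write_bit(((value >> i) & 1) != 0);  // LSB first
  }
};
```

The design choice is that interrupts stay off for roughly one 60 us slot at a time instead of for a whole byte or word, which keeps the timing-critical part protected while giving the rest of the system a chance to run in between.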

Next:

  • Listen to the sensors to see when they have finished the temperature conversion instead of using the predefined time as of now.
  • Confirm that all sensors are in non-parasitic mode.
  • Try to find an ESP32 I can test this on. I have one but haven’t been able to flash it.
  • Understand if there’s a more accurate way to measure time or if a timer or HW interrupt could be used instead.
  • Create a test setup introducing reflections. Will probably do this with a 100m cable and some high-ohm and low-ohm terminators. This will be fun since it was a long time since I did this.
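
For the first item, the DS18B20 datasheet describes that an externally powered sensor answers read time slots with 0 while a conversion is in progress and with 1 when it is done. A polling loop with a timeout could be sketched as below. SimulatedSensor is a hypothetical model used so the sketch runs on a PC, and the timeout and poll interval are left to the caller.

```cpp
#include <cstdint>

// Hypothetical simulated sensor: read slots return 0 (false) until the
// conversion has finished at `done_at_ms`.
struct SimulatedSensor {
  uint32_t now_ms = 0;
  uint32_t done_at_ms;
  explicit SimulatedSensor(uint32_t done_at) : done_at_ms(done_at) {}
  bool read_bit() const { return now_ms >= done_at_ms; }
  void delay_ms(uint32_t ms) { now_ms += ms; }
};

// Polls the bus until the sensor signals "done". Returns the time waited in
// milliseconds, or -1 when the timeout is reached first.
template <typename Sensor>
int32_t wait_for_conversion(Sensor &sensor, uint32_t timeout_ms,
                            uint32_t poll_interval_ms) {
  uint32_t waited_ms = 0;
  while (!sensor.read_bit()) {
    if (waited_ms >= timeout_ms) return -1;
    sensor.delay_ms(poll_interval_ms);
    waited_ms += poll_interval_ms;
  }
  return static_cast<int32_t>(waited_ms);
}
```

In a real component the polling would more likely be spread over several main-loop iterations than done in one blocking loop, and it only applies when no sensor on the bus uses parasitic power.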

Code needed in the Yaml to run my test-code below.

external_components:
  - source: 
      type: git
      url: https://github.com/erapade-forks/esphome
      ref: Testable_in_ESP8266
    components: [dallas]

I have been using nrandell/dallasng (GitHub) for a year.
For me it works on various ESP8266/ESP32 devices, incl. Sonoff devices, D1 mini & Shelly Uni.


Just tried on ESP32. Compiled, installed - not booting …

Thanks for testing. My device is in a place not suitable for easy re-flashing. So I will not dare to try it until someone confirms that it works on ESP32.

Thanks for testing, I hope it didn’t give you any problems.

I will for sure try this dallasng, a totally different piece of code I would say

I updated my device to use this library too, for now, to see if it holds up better.
It compiled ok, the device started ok, and it found the sensor without problem.
So far so good :)

I’ve had this problem for a while and I was just reading to see if there is anything new about the error and then spotted this page. Tried the code from @erapade and everything seems to work. Installation on a Wemos D1 mini went smoothly and the checksum error is gone. I have 10 sensors running, each with a 2-5m cable. Thanks to @erapade

Thanks for informing. I was going to give this up but now I got some energy.

I think you should also try this instead of my mod, and please report back in this thread how it works. Some posts up in this thread there is also a reference to this:

external_components:
  - source: github://nrandell/dallasng

Good work, unfortunately I only have access to ESP32 variants at the moment so can’t test, sorry.