Intermittent but major waves of unreachable lights

Hey all,

For the past year and a half I’ve been having intermittent yet major problems with the Zigbee lights setup in my house, and for the life of me I cannot find the culprit, let alone a way to fix it. I’m really hoping you guys have some ideas.

Setup

  • I’ve got a single-family home with 3 floors.
  • The chain of hardware / software is as follows: Intel NUC server (on the top (3rd) floor) → Home Assistant (docker container) → deCONZ (docker container with image: deconzcommunity/deconz) → ConBee II stick (passed to deCONZ via its unique UUID) → Zigbee lights, remote controls and sockets.
  • Per floor I’ve got multiple lights and sockets. Mostly Philips Hue (lights and remote controls) and some Ikea Tradfri sockets. I made sure that there are enough nodes between the first floor and the third so that the mesh network is well covered.

Symptoms

  • About 80% of the time, everything works fine and quickly. Even the lights on the first floor communicate quickly via the mesh network with the third floor (the ConBee stick). When looking at the network in the Phoscon VNC viewer, a lot of links exist and many of them are green, indicating good connections.
  • On average once a week, out of the blue, a lot of lights suddenly become unavailable in both Home Assistant and deCONZ. I can see this in the Phoscon web interface and in the VNC viewer: the lights are suddenly unreachable. Sometimes one or three lights keep working, but most fail. I have since monitored the situation and let Home Assistant count the unavailable lights and plot them in a graph. Sometimes there are small peaks which work themselves out, sometimes it lasts for hours on end. Most of the time the situation resolves itself, but only after those hours have passed. The Philips Hue remotes also show a solid red light when a button is pressed. And when this happens, I haven’t made any changes to the network: no lights that might serve as a critical link have been powered off.
  • Yesterday, a completely new event happened: all the lights on the first floor started flashing between 0% and 100% brightness. And the weirdest part is: when I disconnected the ConBee stick (the controller) from the mesh network, they continued! Even after power cycling the lights, with no controller present. How is this possible, what is giving the lights those commands? It was even a bit scary. After leaving them without power for the entire night and plugging them back in, they acted normally. However, even the Ikea Tradfri sockets behaved this way, and after plugging them in in the morning they didn’t come back at all; I needed to pair them again.
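For anyone who wants to do the same counting outside Home Assistant, here is a minimal Python sketch. It is not my actual setup: it assumes you already have the parsed JSON of the deCONZ REST API lights listing (an object keyed by light id, where each light has a state.reachable flag). The sample payload and the function name are made up for illustration.

```python
from typing import Any, Mapping


def count_unreachable(lights: Mapping[str, Mapping[str, Any]]) -> int:
    """Count lights whose state.reachable flag is missing or not true."""
    return sum(
        1
        for light in lights.values()
        if not light.get("state", {}).get("reachable", False)
    )


# Hypothetical payload in the assumed shape of a lights listing.
sample = {
    "1": {"name": "Hallway", "state": {"on": True, "reachable": True}},
    "2": {"name": "Kitchen", "state": {"on": False, "reachable": False}},
    "3": {"name": "Socket", "state": {}},  # no flag: treated as unreachable
}

print(count_unreachable(sample))
```

Logging that number every minute gives the same kind of graph as the one in the attachments.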

Things I’ve tried
When there was another peak of unavailable lights, I tried the following things:

  • I power cycled a light, in the hope that it would reconnect. Nothing happened, it stayed unavailable.
  • I put an extension cable of around 1.5 meters between the server and the ConBee. The symptoms kept happening.
  • I updated the firmware of the ConBee II. No effect.
  • I set my WiFi channels and the Zigbee channel as far apart as possible. No effect. I have neighbors, no clue what channels they use.
  • During a peak of unavailable lights, I yanked the ConBee II stick for a minute and put it back in. Magically, lights started coming back almost immediately. Hmm, interesting.
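To make the "as far apart as possible" point about WiFi and Zigbee channels concrete, here is a small Python sketch that estimates whether a Zigbee channel sits inside a 2.4 GHz WiFi channel. The center frequency formulas are the standard ones; the channel widths (22 MHz for WiFi, 2 MHz for Zigbee) are rough assumptions, and the channel numbers in the example are hypothetical, not my actual settings.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11..26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11..26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz WiFi channel, 1..13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only handles WiFi channels 1..13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int,
             wifi_width_mhz: float = 22.0,
             zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two channels overlap, using the assumed widths."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


# Hypothetical example: WiFi on channel 1, Zigbee on channel 11 or 25.
print(overlaps(11, 1))  # Zigbee 2405 MHz vs WiFi 2412 MHz
print(overlaps(25, 1))  # Zigbee 2475 MHz vs WiFi 2412 MHz
```

This only covers my own access points, of course; it says nothing about what the neighbors are using.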

Since that last thing actually helps but doesn’t prevent it from happening again, I suspect that there is something wrong with either the deCONZ image, the Phoscon software, the ConBee stick, the mesh network or the radio spectrum itself. Maybe the neighbors have a crappy WiFi setup? But I cannot explain the size and duration of these outages; I would hope that the Zigbee network can heal itself in a shorter amount of time. However, I have no clue how to diagnose it from this point on. I really hope someone can help me with this.

Attachments

  • Graph of unavailable lights over a longer time and a video of the spooky flashing lights without any controller on the network present: Conbee II issues - Album on Imgur

-edit-
Switched to Imgur

Can you share some logs? See “How to get logs? - #4”. Just to see if there’s anything going on at this moment.

It would also be beneficial to have logs from when this happens.

Can you also share the versions?

I happened to have saved the logs from when the lights were mysteriously flashing yesterday, but I didn’t check all the right log levels. I’ll post them anyway, maybe they’ll be of some help?

I’ll let the logs run now for a couple of minutes and will post them shortly, to see what is happening now without an outage.

When another outage occurs, I will save the logs. I can’t reproduce it, so I’ll do that as soon as I see one.

The versions are as follows:

  • Gateway Version: 2.25.0
  • Firmware version: 26780700

-edit-
Switched to pastebin

Here are the logs from the past couple of minutes with the indicated log levels: 15:57:15:250 APS-DATA.indication request id: 89 -> finished15:57:15:250 APS-DA - Pastebin.com

-edit-
Switched to pastebin

It just happened again.

So I tried to go into VNC to look at the logs, but I only get a blank black screen in the VNC viewer. So it connects, it just doesn’t show anything. Then I went to Portainer to look at the deCONZ logging, and it turns out it stopped logging at exactly the time I made my last post with the logs. Not a single line has been added to the console since. So I thought it had crashed, but I can open the console in the running docker container and run commands.

top - 19:04:52 up 54 days, 29 min,  0 user,  load average: 1.15, 1.35, 1.08
Tasks:   9 total,   1 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.2 us,  3.0 sy,  0.8 ni, 89.5 id,  0.2 wa,  0.0 hi,  0.3 si,  0.0 st 
MiB Mem :   7823.9 total,    216.2 free,   4378.9 used,   3208.5 buff/cache     
MiB Swap:   4096.0 total,   2448.4 free,   1647.6 used.   3445.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                
      1 deconz    20   0  808552 107548  39408 S   1.0   1.3  39:48.59 deCONZ                                                                                                 
     49 deconz    20   0   16036   2692      0 S   0.0   0.0   0:00.00 tigervncserver                                                                                         
     50 deconz    20   0  279676  54604  16924 S   0.0   0.7   2:04.12 Xtigervnc                                                                                              
     53 deconz    20   0    2576     92      0 S   0.0   0.0   0:00.00 Xtigervnc-sessi                                                                                        
     54 deconz    20   0    9732    616    184 S   0.0   0.0   0:00.00 tigervncconfig                                                                                         
     55 deconz    20   0  221144   9800   5768 S   0.0   0.1   0:00.19 openbox                                                                                                
    117 deconz    20   0   56216   4848      0 S   0.0   0.1   0:15.66 websockify                                                                                             
 109267 root      20   0    4188   3328   2792 S   0.0   0.0   0:00.03 bash                                                                                                   
 109455 root      20   0    8620   4596   2708 R   0.0   0.1   0:00.02 top 

By the way, usually when this happened I was able to go into VNC and see what was going on. It just keeps getting weirder and weirder.

Hi

Getting a bad gateway. Can you use pastebin?

Here it is:

And here, during a small outage wave just happening now. The logs are from around 10:50 hours and at 11:55 I found out it was happening and I increased the loglevels. I hope it reveals something to you.

(Can you perhaps unhide all my previous posts? I edited the URLs so that they shouldn’t cause any bad gateways anymore.)

For others reading along:
Had a chat with @pimmeh and spoke about the errors. He’s going to include an extension cable and get rid of at least the 0xE1 errors. Not sure why the 0xE9 happens. Hope @manup can help out with that.

The flashing lights are kind of weird, deCONZ doesn’t send light control commands on its own. Is there some automation running in the background which may be involved?

He’s going to include an extension cable and get rid of at least the 0xE1 errors.

Indeed always a good idea; usually this improves routing / reachability issues significantly.
I’d propose observing whether that already improves the reachability problems.

The 0xE9 means no MAC ACK was received for a command being sent; without an extension cable it’s also more likely to see those. In deCONZ Menu → Edit → Network Settings there is a checkbox APS ACKs which can be enabled to force a heavier retry approach, which can help in complex “more hops” networks. So for a 3-floor setup this is worth a try.


If all that, especially the USB extension cable, still doesn’t solve the reachability problems, I’d recommend enabling application level source routes with the latest version v2.25.1 (which has a few improvements here). Please have a look at Source Routing · dresden-elektronik/deconz-rest-plugin Wiki · GitHub to see how to enable it (just go with the automatic settings, manually setting routes isn’t needed).

It takes quite some time until the routes (blue/red lines in the GUI) are figured out, but after a while (30 minutes to 1 hour) this should stabilize. Routers that have been reachable for a long time are then selected as “good” hops to jump through the network, while routers which are sometimes switched off are considered less often as in-between hops.

Ok, so. I just got the actual extension cable in, before this, it was a powered USB 2.0 hub. In my chat with @Mimiix , he indicated that that might actually interfere with the Zigbee frequency. I’ll try this first and see for a couple of days how that behaves before I try anything else, such as enabling the application level source routing.

Regarding the flashing light: no it was no automation. Home Assistant didn’t have any record of it actually sending those commands and such a weird automation was never programmed. I’m quite happy to focus on the no connectivity problem for now, maybe it’s a very weird symptom of it.

@Mimiix also asked me to post the LQI values, but when I enable the button via the VNC GUI, nothing happens. The button goes on and off, but I don’t see anything happening in the network topology; the neighbor links button doesn’t seem to toggle anything either, it looks broken in the GUI. Restarting the deCONZ container doesn’t help. Anyway, the links do show with their colors, so maybe that is something. I posted it here: Zigbee topology before USB extension cable - Album on Imgur
Mind you, that is from before I moved to the USB extension cable, so we’ll see how it goes afterwards.
What’s interesting to see here is that I had hoped the Ikea Tradfri Wireless Outlet 2, between the 2nd and 3rd floor, would act as a good router between those floors, but judging from the line colors, it’s not really doing a good job.

So, I’m going to put on the USB extension cable now and wait a few hours / days, see what’s what.

Looks like this is a bug: LQI display no longer works after V2.24.00 · Issue #7457 · dresden-elektronik/deconz-rest-plugin · GitHub

Good to see you managed to get a screenshot. Judging from that, you really need to get better connections. I think that is the main reason for the dropouts.

Can you also share some logs to see if the 0xE1 is gone now?

I’ll get those shortly. Can you perhaps clarify why it sometimes takes hours on end for the network to find new routes during those outages? I understand that if a critical-path node becomes unreachable, it might take a while for the Zigbee network to find a new route, but should it really take this long? I would have hoped it would be in the range of minutes, not hours. Or are the connections in my house really that bad (judging by colors only until the LQIs are back :wink: )?

Here are the logs from just now: 17:12:00:125 saved node state in 0 ms17:12:00:126 sync() in 0 ms17:12:00:418 - Pastebin.com
I saw 1 instance of the 0xE1 error. There are still a lot of instances of the 0xE9 error.
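For counting these instead of eyeballing the log, something like the following Python sketch could work. It only searches saved log text for two-digit hex tokens such as 0xE1 and 0xE9; it assumes nothing about the actual deCONZ log line format, and the sample lines below are made up, not real log output.

```python
import re
from collections import Counter
from typing import Iterable

# Matches a standalone two-digit hex token such as 0xE1 or 0xe9.
HEX_BYTE_RE = re.compile(r"\b0[xX][0-9A-Fa-f]{2}\b")


def normalize(code: str) -> str:
    """Bring a token like 0xe9 or 0XE9 into the form 0xE9."""
    return "0x" + code[2:].upper()


def count_status_codes(log_text: str,
                       codes: Iterable[str] = ("0xE1", "0xE9")) -> Counter:
    """Count how often each of the given hex codes appears in the text."""
    wanted = {normalize(code) for code in codes}
    counts: Counter = Counter()
    for match in HEX_BYTE_RE.finditer(log_text):
        code = normalize(match.group())
        if code in wanted:
            counts[code] += 1
    return counts


# Made-up sample text, only to show the counting.
sample_log = "\n".join([
    "line A status 0xE9",
    "line B status 0xE1",
    "line C status 0xe9",
    "line D value 0x00E9",  # longer hex value, deliberately not counted
])

counts = count_status_codes(sample_log)
print(counts["0xE1"], counts["0xE9"])
```

Running this on the log before and after a change (extension cable, APS ACKs) gives a rough before/after comparison, as long as both logs cover a similar time span.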

Things I’ve changed since last time:

  • Got a 2 meter extension cable between the server and the Conbee II.
  • Re-paired all the Ikea Tradfri Wireless Outlets.
  • Moved Ikea Tradfri Wireless Outlet 2 about 3 meters: it’s down dangling on an extension cord in the staircase between the 2nd and 3rd floor.

And what do you know, just as I finished the logging, I saw a small drop in available lights and quickly started up the logging again: 17:19:24:804 Websocket 172.20.0.17:45880 send message: {"attr":{"id":"28","lasta - Pastebin.com
I also made a new screenshot of the topology: imgur. com /a/XuzXxAs (copy paste without the spaces, I’m not allowed to paste more than 2 links in a post, not very handy).
You can clearly see some lights / outlets not routing anymore, so the list above is not doing the trick on its own.

So next is your suggestion to try setting the APS ACKs checkbox. I will do that now and see what happens.

Ok, that didn’t really work. So I’ve listened to you guys about not having good enough connections, and while I hung some Ikea outlets as repeaters at strategic locations, that didn’t really make the connection lines green enough. So I bought a couple of Sonoff dongles, flashed them with the router firmware and included them in the network. I chose these bad boys because of their huge external antenna.

Tadaa: way better! Zigbee network with 2x Sonoff repeaters - Album on Imgur

So now I have a clear route from the 3rd floor to the 1st, but @Mimiix suggested that this is still a single point of failure, so I’m going to place a third Sonoff on the second floor (I bought 3, just in case). Hopefully that creates a backup path, should it be necessary.

For now, I seem to have a more stable network (not backed up by enough evidence yet) and I’m going to monitor it for a couple of days to see how it holds up. But, progress! :smiley:


So I’m ready to say: I’m happy and I don’t see any more dropouts. The placement of 3 Sonoff repeaters did the trick. The connection lines are green(er) and my automations are lightning fast. Thank you very much @Mimiix and @manup ! :smiley: