Tagging @manup because we’ve talked about my giant production system a few times.
I’m running the latest software (via Docker) and latest firmware. I have (correction: had) 342 well-operating bulbs (98% Hue, 2% dresden elektronik strip controllers). A few months ago the power went out in a horrible way: for 2 hours it came on and off again in bursts, and after that everything was totally fine. A few weeks ago, however, I had a nice, clean power outage for 2 to 4 minutes, and when it came back on, 80%-ish of my bulbs were unreachable. My Docker setup on my NUC is on battery backup and I never lost power there.
I updated everything, restored from a backup, and scoured the forums for ideas. I’m out of them and super frustrated and really could use some advice.
I can provide whatever information might be helpful.
The Hue lights are usually quite good at keeping the connection even after long network loss times.
I think next we should check what the logs say (E1 errors etc.), and perhaps also examine the zll.db database to confirm that no coordinator configuration parameter has changed in a way that could cause this.
I agree that Hue lights are great at keeping connection. I’ve had bulbs offline for months and they came back perfectly. And as I mentioned, a previous “crazy” power off/on/off/on/etc. situation ended up impacting 0 bulbs. It’s just this last time.
Also, there have been “moments” when restarting the system where I could, for a few seconds, successfully control lights that a few seconds later became “unavailable” so I don’t think it’s the bulbs.
Let me get the DB file for you and upload here. Sorry for the late reply, been down with COVID (first time).
Thanks, I had a quick look. While almost all parameters like PANID seem stable, the NWK Update ID is lowered from 2 to 0 in the last entry. I’m not sure how this happened, but it should only increase, for example when a channel change is made.
So a quick check if this is the problem:
deCONZ → Edit → Network Settings
NWK Update ID should be 2
If it isn’t:
Set it to 2 and press Save and Done to close the settings
Finally in the top toolbar press Leave and after a few seconds Join (this actually activates the changed configuration).
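The "should only increase" property of the NWK Update ID can also be checked mechanically once the values have been pulled out of the configuration history. The following is only a minimal sketch: it assumes the update IDs have already been extracted by hand into a list ordered oldest to newest, the function name `find_update_id_regressions` is hypothetical, and the sample values are made up for illustration. It does not read zll.db itself and it ignores any wrap-around of the field.

```python
from typing import List, Tuple


def find_update_id_regressions(update_ids: List[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for every entry whose value dropped.

    update_ids is assumed to be ordered oldest to newest.
    """
    regressions = []
    for i in range(1, len(update_ids)):
        prev, cur = update_ids[i - 1], update_ids[i]
        if cur < prev:  # the ID went backwards, which should not happen
            regressions.append((i, prev, cur))
    return regressions


# Hypothetical history: the last entry falls back from 2 to 0.
history = [0, 1, 2, 2, 0]
print(find_update_id_regressions(history))  # [(4, 2, 0)]
```

An empty result means the history is consistent; any tuple points at the entry where the value went backwards.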
Unfortunately, that didn’t work. I think I know what happened - I was desperate for “something” to try so I did that CTRL+Advanced click option that brought up some things to fall back to in the web app. So this particular issue was self-inflicted. But regardless, I changed it to 2 via the steps above - no go.
I read somewhere, and I can’t find it now, about a person who went on and on in a GitHub ticket about a very similar issue (previous versions of deCONZ and firmware) and finally got it to work with clean hardware and the latest (at that time) firmware. Open to trying that - I have an unused ConBee II here.
I used to see these somewhat frequently (I’m 95% sure, anyway) when every bulb was working as expected. So I don’t believe this is going to lead us to why, after a single/clean power outage, I can no longer communicate with 280-ish bulbs.
That said: You’re the expert. I’m happy to track down whatever seems reasonable.
I had another thought: can you please share the config.ini so we can check the NWK frame counter value?
There is a 32-bit NWK counter which overflows after roughly 4 billion messages, at which point another counter needs to increase while the NWK counter restarts from 0. If that isn’t done properly, routers might think there is a problem (“we are under attack”) and ignore incoming messages. I’m not sure if this could be the problem here, but since the network is large, there are a lot of messages flying around. I’ll extend the firmware to query and modify the second counter, so we can check if this is the issue.
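To make the overflow arithmetic concrete, here is a rough sketch of a 32-bit counter that rolls over into a second counter, plus an estimate of how long a given message rate takes to exhaust it. This is not the firmware's implementation: the class name `ExtendedFrameCounter`, its fields `low` and `high`, and the rate of 100 messages per second are all assumptions chosen for illustration.

```python
MAX_U32 = 2**32  # a 32-bit counter holds 4,294,967,296 distinct values


def seconds_until_overflow(current: int, msgs_per_second: float) -> float:
    """Estimate the seconds left before a 32-bit counter wraps to 0."""
    return (MAX_U32 - current) / msgs_per_second


class ExtendedFrameCounter:
    """Hypothetical model: a 32-bit counter plus a second rollover counter."""

    def __init__(self, low: int = 0, high: int = 0) -> None:
        self.low = low    # the 32-bit counter
        self.high = high  # incremented each time `low` wraps

    def increment(self) -> None:
        self.low = (self.low + 1) % MAX_U32
        if self.low == 0:  # wrapped: the second counter must move forward
            self.high += 1


counter = ExtendedFrameCounter(low=MAX_U32 - 1)
counter.increment()
print(counter.low, counter.high)  # 0 1

# At an assumed 100 messages per second, a fresh counter lasts about 497 days.
print(seconds_until_overflow(0, 100) / 86400)
```

If the second counter were not incremented on wrap, the pair `(high, low)` would appear to jump backwards, which is the situation described above where routers may start ignoring messages.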