Tagging @manup because we’ve talked about my giant production system a few times.
I’m running the latest software (via Docker) and latest firmware. I have (correction: had) 342 well-operating bulbs (98% Hue, 2% dresden elektronik strip controllers). A few months ago the power went out in a horrible way: for 2 hours it came on and off again in bursts, and after that everything was totally fine. A few weeks ago, however, I had a nice and clean power outage for 2 to 4 minutes, and when power came back on, 80%-ish of my bulbs were unreachable. My Docker setup on my NUC is on battery backup and I never lost power there.
I updated everything, restored from a backup, and scoured the forums for ideas. I’m out of them and super frustrated and really could use some advice.
I can provide whatever information might be helpful.
I agree that Hue lights are great at keeping connection. I’ve had bulbs offline for months and they came back perfectly. And as I mentioned, a previous “crazy” power off/on/off/on/etc. situation ended up impacting 0 bulbs. It’s just this last time.
Also, there have been “moments” when restarting the system where I could, for a few seconds, successfully control lights that a few seconds later became “unavailable” so I don’t think it’s the bulbs.
Let me get the DB file for you and upload here. Sorry for the late reply, been down with COVID (first time).
Thanks, I had a quick look. While almost all parameters like PANID seem stable, the NWK Update ID was lowered from 2 to 0 in the last entry. I’m not sure how this happened, but it should only increase, for example when a channel change is made.
So a quick check if this is the problem:
deCONZ → Edit → Network Settings
NWK Update ID should be 2
If it isn’t:
Set it to 2 and press Save and Done to close the settings
Finally in the top toolbar press Leave and after a few seconds Join (this actually activates the changed configuration).
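For anyone who wants to run the same sanity check on their own history of network settings, here is a minimal sketch in Python. It only encodes the rule stated above (the NWK Update ID should never go down between entries). The function name and the sample list are hypothetical, only the final drop from 2 to 0 comes from this thread, and reading the values out of the actual DB file is left out. It also ignores the possibility of the ID wrapping around.

```python
def first_update_id_regression(history):
    """Return the index of the first entry whose NWK Update ID is lower
    than the entry before it, or None if the sequence never decreases."""
    for i in range(1, len(history)):
        if history[i] < history[i - 1]:
            return i
    return None


# Illustrative values: only the last step (2 -> 0) matches what was reported.
history = [1, 2, 2, 0]
print(first_update_id_regression(history))  # -> 3
```

A result of None means the ID only ever stayed the same or increased, which is what is expected on a healthy network.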
Unfortunately, that didn’t work. I think I know what happened - I was desperate for “something” to try so I did that CTRL+Advanced click option that brought up some things to fall back to in the web app. So this particular issue was self-inflicted. But regardless, I changed it to 2 via the steps above - no go.
I read somewhere, and I can’t find it now, a GitHub ticket where a person went on and on about a very similar issue (previous versions of deCONZ and firmware) and finally got it to work with clean hardware and the latest (at that time) firmware. I’m open to trying that - I have an unused ConBee II here.
I used to see these somewhat frequently (I’m 95% sure, anyway) when every bulb was working as expected. So I don’t believe this is going to lead us to why, after a single/clean power outage, I can no longer communicate with 280-ish bulbs.
That said: You’re the expert. I’m happy to track down whatever seems reasonable.
I had another thought: can you please share the config.ini so we can check the NWK frame counter value?
There is a 32-bit NWK counter which overflows after roughly 4 billion messages, in which case another counter needs to be increased while the NWK counter starts again from 0. If that isn’t done properly, routers might think there is a problem (“we are under attack”) and ignore incoming messages. Not sure if this could be the problem here, but since the network is large, there are a lot of messages flying around. I’ll extend the firmware to query and modify the second counter, so we can check if this is the issue.
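To illustrate why a wrapped counter could make routers go quiet, here is a rough Python sketch. It is a simplified model of the idea described above, not the actual firmware or deCONZ logic: the `Router` class, `next_counter`, and the sender name are all made up for the example, and the second counter is left out entirely, which is exactly the failure case being modelled.

```python
MAX_COUNTER = 0xFFFFFFFF  # largest value a 32-bit counter can hold


def next_counter(counter):
    """Increment a 32-bit counter, wrapping to 0 after MAX_COUNTER."""
    return (counter + 1) & MAX_COUNTER


class Router:
    """Toy router that remembers the last frame counter seen per sender."""

    def __init__(self):
        self.last_seen = {}

    def accept(self, sender, counter):
        """Accept a frame only if its counter is higher than the last one
        seen from this sender; otherwise treat it as a possible replay."""
        last = self.last_seen.get(sender)
        if last is not None and counter <= last:
            return False
        self.last_seen[sender] = counter
        return True


router = Router()
counter = MAX_COUNTER - 1
print(router.accept("coordinator", counter))  # True
counter = next_counter(counter)               # now MAX_COUNTER
print(router.accept("coordinator", counter))  # True
counter = next_counter(counter)               # wraps to 0
print(router.accept("coordinator", counter))  # False: looks like an old frame
```

In this model every frame after the wrap is dropped until something tells the router that a new counter epoch has started, which would match the symptom of lights that are powered and healthy but ignore commands.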