Lost 80% of my 342-bulb Setup After Power Outage

Fun times.

Tagging @manup because we’ve talked about my giant production system a few times. :slight_smile:

I’m running the latest software (via Docker) and the latest firmware. I have (correction: had) 342 well-operating bulbs (98% Hue, 2% dresden elektronik strip controllers). A few months ago the power went out in a horrible way: for 2 hours it came on and off in bursts, yet afterwards everything was totally fine. A few weeks ago, however, I had a nice, clean power outage lasting 2 to 4 minutes, and when power came back, 80%-ish of my bulbs were unreachable. My Docker setup on my NUC is on battery backup and never lost power.

I updated everything, restored from a backup, and scoured the forums for ideas. I’m out of them and super frustrated and really could use some advice.

I can provide whatever information might be helpful.

Hello, I assume you have already tried power cycling the bulbs?

Oh yes. No change.

Pinged manup to check in.

The Hue lights are usually quite good at keeping the connection even after long network loss times.

I think next we should check what the logs say (E1 errors etc.), and perhaps also examine the zll.db database to verify that no coordinator configuration parameter has changed in a way that could cause this.

I agree that Hue lights are great at keeping connection. I’ve had bulbs offline for months and they came back perfectly. And as I mentioned, a previous “crazy” power off/on/off/on/etc. situation ended up impacting 0 bulbs. It’s just this last time.

Also, there have been “moments” when restarting the system where I could, for a few seconds, successfully control lights that a few seconds later became “unavailable” so I don’t think it’s the bulbs.

Let me get the DB file for you and upload here. Sorry for the late reply, been down with COVID (first time).

Here you go: https://drive.google.com/file/d/1iBEQvyiQida9sCJwR5UUzFaWQ66G7Uei/view?usp=sharing

That’s the db file.

Let me know how much of the logs you’d like to see, if you want x lines after a fresh container restart, etc., etc.

Can you share the logs? In #deconz you can find out how to make logs :slight_smile:

Here are about 3 minutes of logs in INFO, INFO_L2, ERROR, ERROR_L2, APS, APS_L2: logs - Google Docs

Can you use pastebin?

Thanks, I had a quick look. While almost all parameters, like the PANID, seem stable, the NWK Update ID was lowered from 2 to 0 in the last entry. I’m not sure how this happened, but it should only increase, for example when a channel change is made.

So a quick check if this is the problem:

  • deCONZ → Edit → Network Settings
  • NWK Update ID should be 2

If it isn’t:

  • Set it to 2 and press Save and Done to close the settings
  • Finally in the top toolbar press Leave and after a few seconds Join (this actually activates the changed configuration).
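
For anyone wanting to run the same check on their own configuration history, here is a minimal sketch of the "should only increase" rule. The snapshot labels and the list layout are hypothetical (they are not the zll.db schema); only the 2 → 0 drop comes from this thread. As a simplification it also ignores that the NWK Update ID is an 8-bit value that can wrap around.

```python
def find_update_id_regressions(history):
    """Return every step where the NWK Update ID went down.

    `history` is a chronological list of (label, nwk_update_id) pairs.
    The layout is an assumption for illustration, not the zll.db format.
    """
    regressions = []
    for (prev_label, prev_id), (cur_label, cur_id) in zip(history, history[1:]):
        if cur_id < prev_id:
            regressions.append((prev_label, prev_id, cur_label, cur_id))
    return regressions


# Hypothetical labels; the values mirror the drop described above.
history = [("entry 1", 2), ("entry 2", 2), ("entry 3", 0)]

for prev_label, prev_id, cur_label, cur_id in find_update_id_regressions(history):
    print(f"NWK Update ID dropped from {prev_id} ({prev_label}) to {cur_id} ({cur_label})")
```

An empty result means the value never decreased between consecutive snapshots.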

Unfortunately, that didn’t work. I think I know what happened: I was desperate for “something” to try, so I used that CTRL+Advanced click option that brought up some fallback options in the web app. So this particular issue was self-inflicted. But regardless, I changed it to 2 via the steps above, and no go.

FWIW, @Mimiix, here’s the pastebin version: Lots - Fisher - Pastebin.com

Any other thoughts?

I read somewhere, and I can’t find it now, about a person who went on and on in a GitHub ticket about a very similar issue (previous versions of deCONZ and firmware) and finally got it to work with clean hardware and the latest (at that time) firmware. I’m open to trying that; I have an unused ConBee II here.

Thanks for the pastebin. I’d rather not download files to my PC or use the Google site, as it’s rather slow.

I checked the log and didn’t see anything obviously wrong other than this:

  12:58:55:729 0x001788010618E68A error APSDE-DATA.confirm: 0xE1 on task

That node reports 0xE1 (network busy). It isn’t widespread or frequent, but you could look at what that device is doing in your network.
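
To judge whether errors like this are isolated or spread across many nodes, a log can be tallied per address. The following is a rough sketch only: the regular expression is modelled on the single line quoted above, and other deCONZ log lines may well be formatted differently.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted above (an assumption, not a
# documented log format): timestamp, 64-bit address, "error", status code.
CONFIRM_ERROR = re.compile(
    r"\d{2}:\d{2}:\d{2}:\d{3}\s+"
    r"0x(?P<addr>[0-9A-Fa-f]{16})\s+error\s+APSDE-DATA\.confirm:\s+"
    r"0x(?P<status>[0-9A-Fa-f]{2})"
)


def count_confirm_errors(lines):
    """Count APSDE-DATA.confirm errors per (address, status) pair."""
    counts = Counter()
    for line in lines:
        match = CONFIRM_ERROR.search(line)
        if match:
            key = (match.group("addr").upper(), match.group("status").upper())
            counts[key] += 1
    return counts


sample = [
    "12:58:55:729 0x001788010618E68A error APSDE-DATA.confirm: 0xE1 on task",
    "12:58:56:001 some unrelated log line",  # made-up filler line
]

for (addr, status), n in count_confirm_errors(sample).most_common():
    print(f"0x{addr}: status 0x{status} seen {n} time(s)")
```

If one or two addresses dominate the tally, those devices are worth a closer look; an even spread across hundreds of nodes would point at the coordinator or the channel instead.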

I used to see these somewhat frequently (I’m 95% sure, anyway) when every bulb was working as expected. So I don’t believe this is going to lead us to why, after a single/clean power outage, I can no longer communicate with 280-ish bulbs.

That said: You’re the expert. I’m happy to track down whatever seems reasonable.

I just wanted to point it out as the only “wrong” thing I could find in the logs :sweat_smile:

Thanks for the compliment, however, this is where my expertise ends and where @de_employees continue :sweat_smile:

LOL! Anytime. :slight_smile:

Yeah, this is a weird one. It happened so suddenly. There’s a solution here somewhere.

Did you activate the light search while power cycling the lights? That sometimes helped me when I lost a few lights.

I had another thought: can you please share the config.ini so we can check the NWK frame counter value?

There is a 32-bit NWK frame counter which overflows after roughly 4 billion messages, at which point another counter needs to increase while the NWK counter restarts from 0. If that isn’t done properly, routers might think there is a problem (“we are under attack”) and ignore incoming messages. I’m not sure this is the problem here, but since the network is large, there are a lot of messages flying around. I’ll extend the firmware to query and modify the second counter, so we can check if this is the issue.
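
To make the overflow scenario concrete, here is a toy model of it. This is not how the firmware or the routers are actually implemented; `epoch` is a made-up name for the "second counter" mentioned above, and the freshness rule is simply "newer pairs must compare greater".

```python
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS  # 4_294_967_296, roughly 4 billion


def next_counter(epoch, counter):
    """Advance the frame counter; bump the (hypothetical) epoch on overflow."""
    counter += 1
    if counter == COUNTER_MODULUS:
        return epoch + 1, 0
    return epoch, counter


def is_fresh(last_seen, incoming):
    """A receiver accepts a frame only if (epoch, counter) moved forward."""
    return incoming > last_seen  # tuples compare element by element


last_seen = (0, COUNTER_MODULUS - 1)      # sender is at the very top of the range

good = next_counter(*last_seen)           # overflow handled: epoch goes up
bad = (0, 0)                              # overflow mishandled: epoch unchanged

print(is_fresh(last_seen, good))          # accepted
print(is_fresh(last_seen, bad))           # looks like a replay, so ignored
```

In this model the mishandled case is indistinguishable from an old frame being replayed, which is why a receiver would drop it.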

Sounds a bit like the issue I have with my Hue bulbs; following from the sidelines here. :slight_smile:

Here’s the config.ini contents:

[N00212effff0618b1]
framecounter=41205450

[controller]
apsAcksEnabled=false
autoFetchFFD=true
autoFetchRFD=true
max-aps-busy-per-node=1

[discovery]
zdp\mgmtLqiInterval=180
zdp\nwkAddrInterval=0

[http]
appcache=true
port=7100

[nodelist]
geometry=@ByteArray(\x1\xd9\xd0\xcb\0\x3\0\0\0\0\0\x3\0\0\0\x14\0\0\x2\xf4\0\0\0&\0\0\0\0\0\0\0\0\xff\xff\xff\xff\xff\xff\xff\xff\0\0\0\0\0\0\0\0\a\x80\0\0\0\x3\0\0\0\x14\0\0\x2\xf4\0\0\0&)
state=@ByteArray(\0\0\0\xff\0\0\0\0\0\0\0\x1\0\0\0\0\0\0\0\x2\x1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x3\x16\0\0\0\a\0\x1\x1\x1\0\0\0\0\0\0\0\0\0\0\0\0\x64\xff\xff\xff\xff\0\0\0\x84\0\0\0\0\0\0\0\a\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\0\xbe\0\0\0\x1\0\0\0\0\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\0\x64\0\0\0\x1\0\0\0\0\0\0\x3\xe8\0\0\0\0\x64)

[nodeview]
sceneRect=@Variant(\0\0\0\x14\xc0\xe8j\0\0\0\0\0\xc0\xe8j\0\0\0\0\0@\xf8j\0\0\0\0\0@\xf8j\0\0\0\0\0)

[otau]
fast-page-spacing=25
sensor-dont-start=true
sensor-restart=true
sensor-slowdown=10
slow-page-spacing=250

[remote]
default\ip=127.0.0.1
default\port=8080

[source-routing]
enabled=false
max-hops=5
min-lqi=150
min-lqi-display=0

[window]
geometry=@ByteArray(\x1\xd9\xd0\xcb\0\x3\0\0\0\0\0\0\0\0\0\0\0\0\a\x7f\0\0\x4\xaf\0\0\x1@\0\0\0X\0\0\x6?\0\0\x4W\0\0\0\0\x2\0\0\0\a\x80\0\0\0\0\0\0\0\x13\0\0\a\x7f\0\0\x4\xaf)
state="@ByteArray(\0\0\0\xff\0\0\0\0\xfd\0\0\0\x1\0\0\0\0\0\0\x3\x4\0\0\x4L\xfc\x2\0\0\0\x1\xfc\0\0\0;\0\0\x4L\0\0\0\x8d\x1\0\0\x1a\xfa\0\0\0\x5\x2\0\0\0\a\xfb\0\0\0\"\0S\0o\0u\0r\0\x63\0\x65\0R\0o\0u\0t\0i\0n\0g\0\x44\0o\0\x63\0k\0\0\0\0\0\xff\xff\xff\xff\0\0\x1\x6\0\xff\xff\xff\xfb\0\0\0\x1a\0R\0\x45\0S\0T\0\x41\0P\0I\0P\0l\0u\0g\0i\0n\0\0\0\0\0\xff\xff\xff\xff\0\0\x1\x92\0\xff\xff\xff\xfb\0\0\0\x1a\0S\0T\0\x44\0O\0T\0\x41\0U\0P\0l\0u\0g\0i\0n\0\0\0\0\0\xff\xff\xff\xff\0\0\x1\xa3\0\xff\xff\xff\xfb\0\0\0\x18\0N\0o\0\x64\0\x65\0L\0i\0s\0t\0V\0i\0\x65\0w\0\0\0\0\0\xff\xff\xff\xff\0\0\0Y\0\xff\xff\xff\xfb\0\0\0\x16\0\x42\0i\0n\0\x64\0\x44\0r\0o\0p\0\x62\0o\0x\0\0\0\0\0\xff\xff\xff\xff\0\0\x1\x30\0\xff\xff\xff\xfb\0\0\0\x18\0N\0o\0\x64\0\x65\0I\0n\0\x66\0o\0\x44\0o\0\x63\0k\x1\0\0\0\0\xff\xff\xff\xff\0\0\0r\0\xff\xff\xff\xfb\0\0\0\x1e\0\x43\0l\0u\0s\0t\0\x65\0r\0I\0n\0\x66\0o\0\x44\0o\0\x63\0k\x1\0\0\0\0\xff\xff\xff\xff\0\0\0Y\0\xff\xff\xff\0\0\x4v\0\0\x4L\0\0\0\x4\0\0\0\x4\0\0\0\b\0\0\0\b\xfc\0\0\0\x1\0\0\0\x2\0\0\0\x1\0\0\0\x16\0m\0\x61\0i\0n\0T\0o\0o\0l\0\x42\0\x61\0r\x1\0\0\0\0\xff\xff\xff\xff\0\0\0\0\0\0\0\0)"

Thanks for looking!
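
For reference, the frame counter in a config.ini like the one above can be pulled out and compared against the 32-bit range with a few lines of Python. This is only a sketch: it embeds a short excerpt of the posted file rather than reading it from disk, and it assumes the `framecounter` key always holds a plain integer.

```python
import configparser

# Excerpt of the config.ini posted above.
CONFIG_EXCERPT = """\
[N00212effff0618b1]
framecounter=41205450

[source-routing]
enabled=false
"""


def read_frame_counters(text):
    """Return {section: framecounter} for every section that has the key."""
    # interpolation=None so stray '%' characters in other values are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        section: parser.getint(section, "framecounter")
        for section in parser.sections()
        if parser.has_option(section, "framecounter")
    }


for section, value in read_frame_counters(CONFIG_EXCERPT).items():
    share = value / 2**32
    print(f"{section}: framecounter={value} ({share:.2%} of the 32-bit range)")
```

With the value posted here the counter sits at just under 1% of the 32-bit range, assuming the stored value reflects the counter actually in use.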