Burst of "None of the device-sessions for dev_addr resulted in valid MIC" failures after a gateway came back online

Question

Context

Platform: ChirpStack v4.3.0 (NS + AS), EU868 region
Integrations: Redis (sessions + streams), MQTT enabled
MQTT replay: max_inflight_messages = 1 and qos = 1 on the gateway side, plus a patch to gateway/backend/mqtt.rs:

// get message stream: custom build using
let mut stream = client.get_stream(None);

Radio topology: 2 gateways (GW1 primary, GW2 secondary)
Fleet: 3 devices, all with FrameCounter Validation disabled (to accept data that is not in chronological order)
Period: 17/10 → 22/10 (UTC/local doesn’t matter, same window)

Symptoms

On 22/10 between 13:13 and 13:37, the NS emits 433 warnings for one device only (Device A):

WARN ... chirpstack::uplink::data: None of the device-sessions for dev_addr resulted in valid MIC dev_addr=0x0183b587

From 13:46 onward, uplinks from the same device are OK again and things return to normal.

Topology (ASCII)

                              (Internet)
                                   │
                     ┌──────────────┴──────────────┐
                     │       Gateway 1 (GW1)       │
                     │          primary            │
                     │ 17→18/10 : OK               │
                     │ 18→22/10 : OFFLINE          │
                     │ 22/10 : buffer flush        │
                     └──────────────┬──────────────┘
                                    │
               ┌────────────────────┼────────────────────┐
               │                    │                    │
           ┌───┴────────────┐   ┌───┴─────────┐      ┌───┴─────────┐
           │ Dev A          │   │ Dev B       │      │ Dev C       │
           │ (EUIFA)        │   │ (EUIxx)     │      │ (EUIyy)     │
           │                │   │             │      │             │
           │→ GW1 (mostly)  │   │→ GW1 (~60%) │      │→ GW1 (100%) │
           │→ GW2 (2 frames)│   │→ GW2 (~40%) │      │             │
           │FCnt 1031/1032  │   │             │      │             │
           │last_fcnt32↑    │   │             │      │             │
           │Stored on GW1   │   │Stored on GW1│      │Stored on GW1│
           └────────┬───────┘   └─┬───────────┘      └─────────────┘
                    │             │
                    │             │
              ┌─────┴─────────────┴─────┐
              │       Gateway 2 (GW2)    │
              │        secondary         │
              │ 17→22/10 : online        │
              │ receives A (2 frames),   │
              │ B (~40%)                 │
              └──────────────────────────┘


                         ▼ 22/10 ~13:13 → 13:37
         ┌──────────────────────────────────────────────────────┐
         │   GW1 comes back online and flushes its buffer:      │
         │   • Dev A → 433 uplinks rejected (warn session)      │
         │   • Dev B → OK 100% (out-of-order replay, valid MIC) │
         │   • Dev C → OK 100% (in order, valid MIC via GW1)    │
         └──────────────────────────────────────────────────────┘

                         ▼ 22/10 ~13:46+
         ┌──────────────────────────────────────────────────────┐
         │   Normal operation resumes:                          │
         │   • Live uplinks received: FCnt 1155, 1156, 1157...  │
         │   • All devices A, B, C back to 100% OK              │
         └──────────────────────────────────────────────────────┘

Scope

Device A: dev_eui=00800000000255fa, dev_addr=0183b587 (impacted)
- Not all uplinks that triggered a warning were later recovered as data.
- Rate: 1 message / 15 min (no 16→32‑bit rollover within the window)
- Uplink: OTAA (keys/sessions OK), no join detected during the incident
- Uniqueness: DevAddr unique in the database (verified in Postgres)
Device B: no issue, 100% of the data recovered in the end (mostly received by GW1, also by GW2)
- Rate: 1 message / 15 min (no 16→32‑bit rollover)
Device C: no issue, 100% of the data recovered in the end, received only by GW1 (never by GW2)

Timeline (observed in logs)

17/10 14:18 → 18/10 03:33: regular uplinks OK (GW1).
18/10: GW1 loses its Internet connection and starts buffering. GW2 stays online and delivers 2 uplinks from Device A. Device B goes through GW2 for about 40% of its uplinks. Device C never goes through GW2.
18/10 → 22/10: GW1, still offline, keeps storing the uplinks of its local devices (A, B, C). During this period, Device B goes through GW2 about 30% of the time; Device C never goes through GW2.
22/10 13:13–13:37: GW1 comes back and flushes its queues → burst of device-session MIC warnings for dev_addr=0183b587 (Device A, 433 lines).
- Device B: no warning (replayed out of order, but all validated).
- Device C: no warning (never via GW2, all validated).
22/10 13:46+: reception of live frames (e.g., FCnt 1155, 1156, 1157) → 100% OK for A, B, and C.

Interpretation / hypothesis

The 433 rejections for Device A cover ≈ 4d08h at 1 message / 15 min, so Device A is left with gaps in its data.
Devices B and C were also replayed out of order, yet 100% of their data was recovered.

Items already checked

DevAddr unique (Postgres).
No Join/Rejoin visible in the window.
No counter rollover (15‑min cadence).
Logs: gateway rx-info is saved for the other DevEUIs at flush time → GW1 did flush its queues.
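
The "no counter rollover" check above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, the 5-day window is rounded from the 17/10 → 22/10 period, and the starting value 1031 is taken from the topology diagram.

```rust
/// Uplinks sent during `window_min` minutes at one uplink every `period_min` minutes.
fn uplinks_in_window(window_min: u32, period_min: u32) -> u32 {
    window_min / period_min
}

/// True if a 16-bit counter starting at `lsb_start` wraps past 0xFFFF
/// within `uplinks` further increments.
fn wraps_within(lsb_start: u16, uplinks: u32) -> bool {
    lsb_start as u32 + uplinks > 0xFFFF
}
```

With a 5-day window and a 15-minute cadence, `uplinks_in_window(5 * 24 * 60, 15)` gives 480 uplinks, and `wraps_within(1031, 480)` is false. A wrap inside the window would only be possible for a counter whose 16 LSB already sit within a few hundred frames of 0xFFFF.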

Questions for the forum

Why does Device A have issues while B and C do not?
Is there a link with the fact that Device A briefly went through GW2 (with valid frames)?
Any configuration recommendations to prevent this case (out‑of‑order → MIC failure) from happening again?

Log excerpts

Burst on 22/10 13:13–13:37:

WARN ... chirpstack::uplink::data: None of the device-sessions for dev_addr resulted in valid MIC dev_addr=0183b587

(repeated ~433 times)

Recovery around 13:46+:

INFO ... chirpstack::storage::device_session: Device-session saved dev_eui=00800000000255fa dev_addr=0183b587

Orne Brocaar · Answer

The problem is that ChirpStack expects the messages to arrive in order. The FCnt validation derives the full 32-bit frame-counter from the 16-bit frame-counter in the PHYPayload: when the received 16-bit (LSB) value is lower than the previous value, it assumes that the 16 LSB have rolled over and therefore increments the 16 MSB part of the frame-counter. For a frame that simply arrives late, that assumption is wrong and the resulting 32-bit frame-counter is incorrect. An incorrect frame-counter means the MIC cannot be validated.

Any ideas on how this can be handled better are welcome, but I don't think there is a secure way to do this on the network-server side, unless you keep track of all the used frame-counters, which is not realistic.
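
To make the mechanism described above concrete, here is a minimal sketch of a 16-bit to 32-bit reconstruction. It is not ChirpStack's actual code: the function name `full_fcnt_up` and the "next expected counter" parameter are assumptions made for illustration, and the replayed frame used in the example (true counter 78000) is hypothetical.

```rust
/// Rebuild a 32-bit uplink frame-counter from the 16-bit value carried in
/// the frame, given the next counter value the server expects.
/// Illustrative sketch only, not ChirpStack's implementation.
fn full_fcnt_up(next_expected: u32, fcnt16: u16) -> u32 {
    let expected_lsb = (next_expected & 0xFFFF) as u16;
    // Forward distance from the expected LSB to the received LSB, modulo 2^16.
    // A received value below the expected one wraps to a large gap, which
    // amounts to assuming that the 16 LSB have rolled over.
    let gap = fcnt16.wrapping_sub(expected_lsb) as u32;
    next_expected.wrapping_add(gap)
}
```

If the server expects 78619 and the next live frame carries 13083, `full_fcnt_up(78_619, 13_083)` returns 78619 as intended. If an older buffered frame with true counter 78000 (16 LSB = 12464) arrives instead, `full_fcnt_up(78_619, 12_464)` returns 143536, i.e. the MSB part was incremented by one, and a MIC computed over that value cannot match.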

guillaume fernandez · Answer

I think I need to improve my understanding of the FrameCounter and its validation.
I checked the stack’s pseudo-code and, indeed, there are quite a few subtleties.

Looking at the counters, there’s a difference between the 16-bit FCnt in the frame and the server-side FCnt,
which suggests a change of “epoch” (MSB) on the faulty device:

Device | FCnt 16 (frame) | FCnt (server)
Device A ("problem") | 13082 | 78618
Device B (OK, 100%) | 9703 | 9703

This epoch shift (server-side FCnt above 65535, i.e. 78618 = 65536 + 13082) on the faulty device is probably the visible trace of the problem here.
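
The relation between the two columns of the table can be checked with a short sketch that splits a 32-bit counter into its "epoch" (16 MSB) and the 16 LSB carried in the frame. The helper name `split_fcnt` is hypothetical.

```rust
/// Split a 32-bit frame counter into (16 MSB "epoch", 16 LSB sent in the frame).
fn split_fcnt(fcnt32: u32) -> (u16, u16) {
    ((fcnt32 >> 16) as u16, (fcnt32 & 0xFFFF) as u16)
}
```

`split_fcnt(78_618)` gives `(1, 13082)` for Device A, while `split_fcnt(9_703)` gives `(0, 9703)` for Device B: only Device A has a non-zero MSB part on the server side.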

So, which step in the scenario actually made Device A go wrong?
Reminder: Device B was going through GW2 ~40% of the time during the GW1 blackout.
There were gaps, and when GW1 came back it replayed its buffer, so the frames did not arrive in the right order, yet 100% of the data was restored.

Device A went through GW2 only twice during GW1’s Internet outage
(two messages out of ~380 messages, ~4 days).
We’re trying to reconstruct what happened on our side, but we can’t figure it out yet…

rimelink · Answer

Solution

Try to keep the frame counters of the LoRa Device and ChirpStack in sync as much as possible. For example, restarting the device (which forces FCnt = 0) can resolve the issue.

Principle

FCnt is a 32-bit integer in both the Device and ChirpStack, while the LoRaWAN frame carries only a 16-bit integer (the 16 least-significant bits).

Whenever FCnt exceeds 0xFFFF (65535), both the Device and ChirpStack need to record the upper 16 bits in order to synchronize the FCnt value. For example:

Device | LoRaWAN | ChirpStack
0 | 0 | 0
1 | 1 | 1
... | ... | ...
65535 | 65535 | 65535
65536 | 0 | 65536
65537 | 1 | 65537
... | ... | ...

When ChirpStack suddenly starts receiving from a Device that has already sent more than 0xFFFF (65535) LoRa messages, it cannot know the upper 16 bits of FCnt.

This means the FCnt synchronization between the two sides fails.

Since FCnt is used to calculate the MIC and encrypt the Payload, this leads to a MIC error and also causes Payload decryption errors.
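
To show why the upper 16 bits matter for the MIC, here is a sketch of the B0 block that prefixes the MIC input for an uplink, assuming LoRaWAN 1.0.x (the thread does not state the MAC version). The AES-CMAC step itself is left out, and the function name `b0_uplink` and the 23-byte message length are assumptions for illustration.

```rust
/// Build the 16-byte B0 block for an uplink MIC (LoRaWAN 1.0.x layout):
/// 0x49 | 4 x 0x00 | Dir | DevAddr (LE) | FCnt (32-bit, LE) | 0x00 | len(msg)
fn b0_uplink(dev_addr: u32, fcnt32: u32, msg_len: u8) -> [u8; 16] {
    let mut b0 = [0u8; 16];
    b0[0] = 0x49;
    // b0[1..5] stay 0x00; b0[5] = 0x00 marks the uplink direction.
    b0[6..10].copy_from_slice(&dev_addr.to_le_bytes());
    // The full 32-bit counter goes in here, not just the 16 bits sent on air.
    b0[10..14].copy_from_slice(&fcnt32.to_le_bytes());
    // b0[14] stays 0x00.
    b0[15] = msg_len;
    b0
}
```

For dev_addr 0x0183b587, `b0_uplink(0x0183b587, 78_618, 23)` and `b0_uplink(0x0183b587, 13_082, 23)` differ in byte 12 (0x01 versus 0x00), even though both counters share the same 16 LSB. Any disagreement on the upper 16 bits therefore changes the MIC input and the check fails.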

Reference link: http://lora.timeddd.com/forum.php?mod=viewthread&tid=1063