Burst of "None of the device-sessions for dev_addr resulted in valid MIC" failures after a gateway came back online

Context

  • Platform: ChirpStack v4.3.0 (NS + AS), EU868 region
  • Integrations: Redis (sessions + streams), MQTT enabled
  • MQTT replay: max_inflight_messages = 1 and qos = 1 on the gateway side, plus a custom build with a patch to gateway/backend/mqtt.rs:
      // get the message stream (line changed in the custom build)
      let mut stream = client.get_stream(None);
  • Radio topology: 2 gateways (GW1 primary, GW2 secondary)
  • Fleet: 3 devices, all with FrameCounter Validation disabled (to accept data that is not in chronological order)
  • Period: 17/10 → 22/10 (UTC/local doesn’t matter, same window)

Symptoms

On 22/10 between 13:13 and 13:37, the NS emits 433 warnings for a single device, Device A:

WARN ... chirpstack::uplink::data: None of the device-sessions for dev_addr resulted in valid MIC dev_addr=0x0183b587

From 13:46 onward, uplinks from the same device are OK again and things return to normal.



Topology (ASCII)

                              (Internet)
                                   │
                     ┌──────────────┴──────────────┐
                     │       Gateway 1 (GW1)       │
                     │          primary            │
                     │ 17→18/10 : OK               │
                     │ 18→22/10 : OFFLINE          │
                     │ 22/10 : buffer flush        │
                     └──────────────┬──────────────┘
                                    │
               ┌────────────────────┼────────────────────┐
               │                    │                    │
           ┌───┴───────────┐    ┌───┴─────────┐      ┌───┴───┐
           │ Dev A         │    │ Dev B       │      │ Dev C │
           │ (EUIFA)       │    │ (EUIxx)     │      │ (EUIyy)
           │               │    │             │      │       │
           │→ GW1 (mostly) │    │→ GW1 (~60%) │      │→ GW1 (100%)
           │→ GW2 (2 frames)│   │→ GW2 (~40%) │      │       │
           │FCnt1031/1032   │   │             │      │       │
           │last_fcnt32↑    │   │             │      │       │
           │Stored on GW1   │   │Stored on GW1│      │Stored on GW1
           └────────┬──────┴──────┬───────────┘      └───────┘
                    │             │
                    │             │
              ┌─────┴─────────────┴─────┐
              │       Gateway 2 (GW2)    │
              │        secondary         │
              │ 17→22/10 : online        │
              │ receives A (2 frames),   │
              │ B (~40%)                 │
              └──────────────────────────┘


                         ▼ 22/10 ~13:13 → 13:37
         ┌──────────────────────────────────────────────────────┐
         │   GW1 comes back online and flushes its buffer:      │
         │   • Dev A → 433 uplinks rejected (warn session)      │
         │   • Dev B → OK 100% (out-of-order replay, valid MIC) │
         │   • Dev C → OK 100% (in order, valid MIC via GW1)    │
         └──────────────────────────────────────────────────────┘

                         ▼ 22/10 ~13:46+
         ┌──────────────────────────────────────────────────────┐
         │   Normal operation resumes:                          │
         │   • Live uplinks received: FCnt 1155, 1156, 1157...  │
         │   • All devices A, B, C back to 100% OK              │
         └──────────────────────────────────────────────────────┘

Scope

  • Device A: dev_eui=00800000000255fa, dev_addr=0183b587 (impacted)

    • Not every uplink that raised a warning was later recovered as data.
    • Rate: 1 message / 15 min (no 16→32‑bit rollover within the window)
    • Uplink: OTAA (keys/sessions OK), no join detected during the incident
    • Uniqueness: DevAddr unique in the database (verified in Postgres)
  • Device B: no issue, 100% of data recovered in the end (mostly received by GW1, also by GW2)

    • Rate: 1 message / 15 min (no 16→32‑bit rollover)
  • Device C: no issue, 100% of data recovered in the end, received only by GW1 (never by GW2)

    • No issue.

Timeline (observed in logs)

  • 17/10 14:18 → 18/10 03:33: regular uplinks OK (GW1).
  • 18/10: GW1 loses Internet and starts buffering. GW2 stays online and delivers 2 uplinks from Device A. Device B goes through GW2 for about 40% of its cadence. Device C never goes through GW2.
  • 18/10 → 22/10: GW1, still offline, keeps buffering uplinks from its local devices (A, B, C). During this period, Device B goes through GW2 about 30% of the time; Device C never goes through GW2.
  • 22/10 13:13–13:37: GW1 comes back and flushes its queues → burst of device-session MIC warnings for dev_addr=0183b587 (Device A, 433 lines).
    • Device B: no warning (replayed out of order, but all validated).
    • Device C: no warning (never via GW2, all validated).
  • 22/10 13:46+: reception of live frames (e.g., FCnt 1155, 1156, 1157) → 100% OK for A, B, and C.

Interpretation / hypothesis

  • The 433 rejections for Device A correspond to ≈ 4d12h of traffic at 1 message / 15 min (433 × 15 min ≈ 108 h), i.e. roughly the whole GW1 outage, so Device A ends up with gaps in its data.
  • Devices B and C were also replayed, out of order, but everything was recovered at 100%.
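
As a quick sanity check of the order of magnitude, the time span covered by a given number of uplinks at a fixed cadence can be computed as below. This is only an arithmetic sketch; the function name span_of is made up for illustration, and the inputs (433 uplinks, 15 min period) are the figures from this post.

```rust
/// Convert a number of uplinks sent at a fixed period (in minutes)
/// into a (days, hours, minutes) span.
/// `span_of` is a hypothetical helper, not part of ChirpStack.
fn span_of(uplinks: u32, period_min: u32) -> (u32, u32, u32) {
    let total_min = uplinks * period_min;
    let days = total_min / 1440; // 1440 minutes per day
    let hours = (total_min % 1440) / 60;
    let minutes = total_min % 60;
    (days, hours, minutes)
}
```

Calling span_of(433, 15) gives (4, 12, 15), i.e. about four and a half days of traffic.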

Items already checked

  • DevAddr unique (Postgres).
  • No Join/Rejoin visible in the window.
  • No counter rollover (15‑min cadence).
  • Logs: Gateway rx-info saved for other devEUI at flush time → GW1 did flush its queues.

Questions for the forum

  1. Why does Device A have issues while B and C do not?
  2. Is there a link with the fact that Device A briefly went through GW2 (with valid frames)?
  3. Any configuration recommendations to prevent this case (out‑of‑order → MIC failure) from happening again?

Log excerpts

Burst on 22/10 13:13–13:37:

WARN ... chirpstack::uplink::data: None of the device-sessions for dev_addr resulted in valid MIC dev_addr=0183b587

(repeated ~433 times)

Recovery around 13:46+:

INFO ... chirpstack::storage::device_session: Device-session saved dev_eui=00800000000255fa dev_addr=0183b587

2 Answers

The problem is that ChirpStack expects the messages to arrive in order. The FCnt validation derives the full 32-bit frame-counter from the 16-bit frame-counter in the PHYPayload: when the 16-bit (LSB) value is less than the previous value, it assumes the 16 LSB have rolled over, and thus it increases the 16 MSB part of the frame-counter. For a frame that simply arrived out of order, this results in an incorrect 32-bit frame-counter, and an incorrect frame-counter means the MIC cannot be validated.
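
The rule described above can be pictured with a small Rust sketch. It is a simplification under stated assumptions, not ChirpStack's actual code: the name full_fcnt and the parameter last_full (the last full counter the server accepted) are made up for illustration, and a real implementation also has to deal with retransmissions and other edge cases.

```rust
/// Rebuild a 32-bit frame counter from the 16-bit value carried in the frame.
/// Simplified rule: a 16-bit value lower than the previous one is treated
/// as a rollover, so the upper 16 bits (MSB) are incremented.
/// `full_fcnt` and `last_full` are hypothetical names.
fn full_fcnt(last_full: u32, fcnt16: u16) -> u32 {
    let last_msb = last_full >> 16;
    let last_lsb = (last_full & 0xFFFF) as u16;
    let msb = if fcnt16 < last_lsb {
        // Assumed rollover: bump the MSB part (kept within 16 bits).
        (last_msb + 1) & 0xFFFF
    } else {
        last_msb
    };
    (msb << 16) | u32::from(fcnt16)
}
```

With a last accepted counter of 1032 (the FCnt shown in the diagram), a live frame with FCnt 1033 maps to 1033. A replayed older frame, say FCnt 1000 (a made-up value), maps to 66536 because the MSB is bumped to 1, so the MIC is checked against the wrong counter and fails.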

Any ideas how this can be handled better are welcome, but I don't think there is a secure way to do this at the network-server side, unless you keep track of all the used frame-counters, which is not realistic.

I think I need to upgrade my understanding of the FrameCounter and its validation.
I checked the stack’s pseudo-code and, indeed, there are quite a few subtleties.

Looking at the counters, there’s a difference between the 16-bit FCnt in the frame and the server-side FCnt,
which suggests a change of “epoch” (MSB) on the faulty device:

Device               FCnt 16 (frame)   FCnt (server)
Device A “problem”   13082             78618
Device B OK (100%)   9703              9703

This epoch shift (server FCnt above 65535, i.e. MSB = 1, since 78618 = 65536 + 13082) on the faulty device is probably the visible trace of the problem here.
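
To read the two counters side by side, the server-side value can be split into its upper and lower 16 bits. A minimal sketch (split_fcnt is a made-up helper name, and the inputs are the two server-side values quoted above):

```rust
/// Split a 32-bit frame counter into (MSB "epoch", 16-bit LSB as sent in the frame).
/// `split_fcnt` is a hypothetical helper for illustration only.
fn split_fcnt(full: u32) -> (u16, u16) {
    let msb = (full >> 16) as u16;
    let lsb = (full & 0xFFFF) as u16;
    (msb, lsb)
}
```

split_fcnt(78618) gives (1, 13082) for Device A, while split_fcnt(9703) gives (0, 9703) for Device B: same kind of 16-bit value in the frame, but Device A's session sits one epoch higher on the server.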

So, which step in the scenario actually made Device A go wrong?
Reminder: Device B was going through GW2 ~40% of the time during the GW1 blackout.
There were gaps, and when GW1 came back it replayed its buffer, so not in the right order, yet 100% of the data was restored.

For Device A, it went through GW2 only twice during GW1’s internet outage
(two messages out of ~380 messages, ~4 days).
We’re trying to reconstruct what happened on our side, but we can’t figure it out yet…