nRF52832 + RFM95W - Tx sometimes gets stuck

Hi,
I’m using a custom board based on nRF52, with Adafruit Arduino Core, running LMIC LoRaWAN library v4.1.1. I have successfully connected it to the Helium network (EU 868), and most of the time I can send data without any problems (simple 0/1 status, sent when input changes).

However, every now and then (let’s say after a day or multiple days, not consistent), it doesn’t get to EV_TXCOMPLETE within 60 seconds. The function LMIC_setTxData2 runs without error, but then it can take minutes (up to 20 minutes in one occasion) for the transmit to actually happen.

This is part of the log when the issue happens (before that if was working 2+ days, as seen in millis()). Here it times out and needs 3.5 minutes to send data:

07:56:13.154 -> Send. os:6C248EB2 milis:153583415 op=900 txrx=20 txchn=0 duty=65756954
07:56:13.201 -> Packet queued
07:56:15.465 -> EV_TXCOMPLETE (includes waiting for RX windows)
07:56:15.465 -> os:6C26BEBE milis:153585708 op=900 txrx=20 txchn=1 duty=6C27379F
07:56:16.491 -> Send. os:6C27B450 milis:153586714 op=900 txrx=20 txchn=1 duty=6C27379F
07:56:16.491 -> Packet queued
07:56:20.599 -> EV_TXCOMPLETE (includes waiting for RX windows)
07:56:20.599 -> os:6C2BA459 milis:153590843 op=900 txrx=20 txchn=2 duty=6C2C1D3A
07:56:21.623 -> os:6C2C9A28 milis:153591850 op=900 txrx=20 txchn=2 duty=6C2C1D3A
17:25:29.948 -> Send. os:6B62060B milis:187739169 op=900 txrx=20 txchn=2 duty=6C2C1D3A
17:25:29.948 -> Packet queued
17:26:29.916 -> LMIC timeout!
17:26:29.916 -> os:6B9B4063 milis:187799177 op=108 txrx=20 txchn=0 duty=6C2C1D3A
17:26:30.947 -> Send. os:6B9C35F5 milis:187800183 op=108 txrx=20 txchn=0 duty=6C2C1D3A
17:26:30.947 -> LMIC opmode err: 108
17:27:30.958 -> LMIC timeout!
17:27:30.958 -> os:6BD5704D milis:187860191 op=108 txrx=20 txchn=0 duty=6C2C1D3A
17:28:30.958 -> LMIC timeout!
17:28:30.958 -> os:6C0EA937 milis:187920193 op=108 txrx=20 txchn=0 duty=6C2C1D3A
17:29:06.452 -> EV_TXCOMPLETE (includes waiting for RX windows)

Here is a code snippet for sending and debug log:

void do_send(osjob_t* j){
    // Check if there is not a current TX/RX job running
    if (LMIC.opmode & (OP_TXRXPEND | OP_POLL | OP_TXDATA)) 
    {
      Serial.print("LMIC opmode err: ");
      Serial.println(LMIC.opmode, HEX);     
    }
    else 
    {
      // prepare data
      mr_status = digitalRead(GPIO_IN_MR);
      last_status = mr_status; // potentially override status in main loop

      // save LED to show status during sending
      if (mr_status == 0) used_led = LED_GREEN;
      else used_led = LED_RED;
      
      // Clear Payload
      lpp.reset();
      lpp.addDigitalInput(1, mr_status);
      
      // reset tx_complete flag, to continuously poll lmic until tx is finished
      tx_complete = 0;               

      // Prepare upstream data transmission at the next possible time.
      // transmit on port 1 (the first parameter); you can use any value from 1 to 223 (others are reserved).
      // don't request an ack (the last parameter, if not zero, requests an ack from the network).
      // Remember, acks consume a lot of network resources; don't ask for an ack unless you really need it.
      lmic_tx_error_t ret = LMIC_setTxData2(1, lpp.getBuffer(), lpp.getSize(), 0);

      Serial.println("Packet queued");
    }
}

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);

  // Uncomment to blocking wait for Serial connection
  // while ( !Serial ) delay(10);

  Serial.println("GarageNotifier Init");
  Serial.println("--------------------------");

  Bluefruit.begin();
  Bluefruit.autoConnLed(false);
  //Bluefruit.setTxPower(4);    // Check bluefruit.h for supported values

  pinMode(GPIO_IN_MR, INPUT);

  ledOff(LED_GREEN);
  ledOff(LED_RED);
  
  // LMIC init
  os_init();
  // Reset the MAC state. Session and pending data transfers will be discarded.
  LMIC_reset();

  // ZB: latest LMIC limits max error to 0.4%, 
  // define LMIC_ENABLE_arbitrary_clock_error in lmic_project_config.h if more is needed
  // disabled here, no benefit was visible (defaults to 0.1%)
  // LMIC_setClockError(1 * MAX_CLOCK_ERROR / 100);
  
  LMIC_setLinkCheckMode(0); // Disable link-check mode and ADR, because ADR tends to complicate testing.
  LMIC_setDrTxpow(DR_SF7, 14); // SF7 = shortest time on air. 14 dBm output (max for EU)
  // LMIC_selectSubBand(1); // ZB: Sub band 2 only for US, not relevant in EU (and will result in errors)

  // Start job (sending automatically starts OTAA too)
  do_send(&sendjob); 
}

void loop() {
  // run LMIC loop continuously until tx is finished
  do 
  {
    os_runloop_once();
    delay(1);
    
    if (tx_complete == 0 && ++led_prescale > 50)
    {
      // current state (green/red LED) as tx status info, until sent
      digitalToggle(used_led);
      led_prescale = 0;
    }
    
    if ((millis() - start_time > 60000) && tx_complete == 0)
    {
      Serial.println("LMIC timeout!");
      start_time = millis();

      if (LMIC.opmode & OP_JOINING)
      {
        // if joining, don't reset LMIC
        Serial.println("Joining, LMIC reset skipped!");
        break;
      }

      // reset/rejoin disabled for debug
      /*LMIC_reset();
      LMIC_setLinkCheckMode(0);
      do_send(&sendjob); */
      break;
    }
  }
  while (tx_complete == 0);

  if (cnt < 2)
  {
    Serial.print("os:");
    Serial.print(os_getTime(), HEX);
    Serial.print(" milis:");
    Serial.print(millis());

    Serial.print(" op=");  Serial.print(LMIC.opmode, HEX);
    Serial.print(" txrx="); Serial.print(LMIC.txrxFlags, HEX);  
    Serial.print(" txchn=");  Serial.print(LMIC.txChnl, HEX);
    Serial.print(" duty=");  Serial.println(LMIC.globalDutyAvail, HEX);
    cnt++;    
  }

  ledOff(LED_GREEN);
  ledOff(LED_RED);
  // loop delay when no tx in progress, to save power (goes automatically to sleep in this period)
  delay(1000);
  
  mr_status = digitalRead(GPIO_IN_MR);
  if (last_status != mr_status)
  {   
    last_status = mr_status;

    Serial.print("Send. os:");
    Serial.print(os_getTime(), HEX);
    Serial.print(" milis:");
    Serial.print(millis());

    Serial.print(" op=");  Serial.print(LMIC.opmode, HEX);
    Serial.print(" txrx="); Serial.print(LMIC.txrxFlags, HEX);  
    Serial.print(" txchn=");  Serial.print(LMIC.txChnl, HEX);
    Serial.print(" duty=");  Serial.println(LMIC.globalDutyAvail, HEX);
   
    os_runloop_once();
    tx_complete = 0; 
    start_time = millis();
    // request new transmission
    os_setCallback(&sendjob, do_send);    
    
    cnt=0;   
  }  
}

Event handler code is taken from standard example, all Serial prints disabled except for EV_TXCOMPLETE, where flag tx_complete is set to 1 so the MCU doesn’t need to loop continuously.

Any ideas where to start to debug this?

print out the value of ret after LMIC_setTxData2() and see if it’s failing. If so, the LMIC is busy handling previous commands.

I suspect (since you’re in EU868) that you’re running out of available “time on air” and the LMIC is blocked on a previous transmit, waiting for available time.

I will do that, but note that there was a 10 hour gap between transmits (problematic tx starts at 17:25:29.948). How is it possible to run out of time on air?

Also, you can ignore the Send at 17:26:30.947, that one exits for the reason you mentioned, LMIC is busy with previous transmit that still didn’t finish (“timeout” won’t reset or retry anything in this debug code, it just prints to log every minute until message is sent).

I have added printing of LMIC_setTxData2() return value, but it’s 0, so no problem visible there:

16:59:45.014 -> Send. os:3F5DD98 milis:35422854 op=900 txrx=20 txchn=0 duty=667CD8D
16:59:45.014 -> LMIC setTxData ret: 0
16:59:45.014 -> Packet queued
17:00:44.997 -> LMIC timeout!
17:00:44.997 -> os:42F17F1 milis:35482862 op=108 txrx=20 txchn=1 duty=667CD8D
17:00:45.991 -> Send. os:4300D83 milis:35483868 op=108 txrx=20 txchn=1 duty=667CD8D
17:00:46.039 -> LMIC opmode err: 108
17:01:46.037 -> LMIC timeout!
17:01:46.037 -> os:46947DB milis:35543875 op=108 txrx=20 txchn=1 duty=667CD8D
17:02:46.005 -> LMIC timeout!
17:02:46.005 -> os:4A28102 milis:35603878 op=108 txrx=20 txchn=1 duty=667CD8D
17:03:46.011 -> LMIC timeout!

Looks like I fixed it, works for more than a week now without a glitch. If somebody has similar issues in the future, here is my workaround:

  1. In my initial post it is visible that in almost 10 hours between packets, the OS time doesn’t advance, it actually takes a step back. This makes LMIC wait until OS time advances at least to the time of previous packet (related to duty limiting variables), hence the delay when trying to send a packet after 10 hours. To fix it, I have added a call to os_getTime() in my main loop, which is run once per second. This calls hal_ticks() which can now properly take care of micros() overflow, and OS time is therefore correct.

  2. Even with OS time being correct, the problem with packet send delay still exists when OS time wraps around and becomes negative (after cca 9h time difference between packets). I didn’t debug this, just added a “keepalive” packet after 8h of inactivity, as this is more than acceptable for me. Works good.

@tmm , If you want me to reproduce the 2nd issue again and provide some more details, let me know. Seems like this would be a problem for very low power devices, which send data rarely.

Yes, reproing and documenting would be good. We actually, coincidentally, ran into the same thing at about the same time that you sent your message. Your message prompted me to look at the code that converts micros() to LMIC os time, and it’s obvious that it won’t tolerate long delays without polling. If you can file an issue on the LMIC repo, we’ll get it fixed shortly. For now, the workaround has to be “don’t sleep very long without pollilng the LMIC”, where “very long” is measured in minutes. I calculate (but haven’t verified) that 2^30/1e6 seconds is the critical threshold. That’s a little less than 18 minutes. I wouldn’t sleep longer than half of that without doing a much more careful analysis and some simulations; so that says “don’t sleep more than 8 minutes” as a rule of thumb.