Heltec Wireless Stick ESP32 - LMIC problem - radio.c:1065

Sorry you’re having problems.

We start by looking at the assert that’s failing. It is:

    ASSERT( (readReg(RegOpMode) & OPMODE_MASK) == OPMODE_SLEEP );

This means that the radio is not in sleep mode when starting RX. It means therefore that something has corrupted the radio state, because the LMIC is always careful to leave things in sleep mode.

This could happen for any number of reasons.

Please post your config, and also please turn off debug prints, as they always cause problems. They are not helpful in debugging assert failures; they’re only helpful in debugging the logic of the LMIC, which is not the issue in this case.

You might also post the schematic for the device, including anything attached externally.

Best regards,
–Terry

Hello Terry,

following the LoRa code:

#include "userdefines.h"

// Compile this only if we have a LoRa capable hardware

// loraWan hardware related code
// code based on free-to-use / do-anything-what-you-want-with-it example code
// by Thomas Telkamp, Matthijs Kooijman and Terry Moore, MCCI.

// IOs for optional LoRa
//
// SX1276 (pin) => ESP32 (pin)
// ===========================
// SCK = GPIO5
// MISO = GPIO19
// MOSI = GPIO27
// CS = GPIO18
// RESET = GPIO14
// DIO0 (8) = GPIO26 (15)
// DIO1 (9) = GPIO33 (13)
// DIO2 (10) = GPIO32 (12)

// Send a valid LoRaWAN packet using frequency and encryption settings matching
// those of the The Things Network.
//
// This uses OTAA (Over-the-air activation), where where a DevEUI and
// application key is configured, which are used in an over-the-air
// activation procedure where a DevAddr and session keys are
// assigned/generated for use with all further communication.
//
// Note: LoRaWAN per sub-band duty-cycle limitation is enforced (1% in
// g1, 0.1% in g2), but not the TTN fair usage policy.
//
// To use this code, first register your application and device with
// the things network, to set or generate an AppEUI, DevEUI and AppKey.
// Multiple devices can use the same AppEUI, but each device has its own
// DevEUI and AppKey.
//
// Do not forget to define the radio type correctly in
// arduino-lmic/project_config/lmic_project_config.h or from your BOARDS.txt.

#include <Arduino.h>
#include <lmic.h>
#include <hal/hal.h>
#include "hal/heltecv2.h"
#include <SPI.h>
#include "webconf.h"
#include "utils.h"
#include "loraWan.h"
#include "log.h"

// All these 3 LoRa parameters (DEVEUI, APPEUI and APPKEY) may be copied literally from the TTN console window.
// The necessary reversal on DEVEUI and APPEUI is done by hex2data.
void os_getArtEui(u1_t *buf) {
  (void) hex2data(buf, (const char *) appeui, 8);
  reverseByteArray(buf, 8);
}

void os_getDevEui(u1_t *buf) {
  (void) hex2data(buf, (const char *) deveui, 8);
  reverseByteArray(buf, 8);
}

void os_getDevKey(u1_t *buf) {
  (void) hex2data(buf, (const char *) appkey, 16);
}

// Schedule TX every this many seconds (might become longer due to duty cycle limitations).
const unsigned TX_INTERVAL = 10;

// Pin mapping
const lmic_pinmap lmic_pins = {
  .nss = LORA_CS,
  .rxtx = LMIC_UNUSED_PIN,
  .rst = LORA_RST,
  .dio = {LORA_IRQ, LORA_IO1, LORA_IO2 },
};

static volatile transmissionStatus_t txStatus;
static uint8_t *__rxPort;
static uint8_t *__rxBuffer;
static uint8_t *__rxSz;

void event_log(const char *event) {
  log(DEBUG, "%09d: %s", os_getTime(), event);
}

void onEvent(ev_t ev) {
  switch (ev) {
  case EV_SCAN_TIMEOUT:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_SCAN_TIMEOUT");
    break;
  case EV_BEACON_FOUND:
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_BEACON_FOUND");
    break;
  case EV_BEACON_MISSED:
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_BEACON_MISSED");
    break;
  case EV_BEACON_TRACKED:
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_BEACON_TRACKED");
    break;
  case EV_JOINING:
    txStatus = TX_STATUS_JOINING;
    event_log("EV_JOINING");
    break;
  case EV_JOINED:
    txStatus = TX_STATUS_JOINED;
    event_log("EV_JOINED");
    {
      u4_t netid = 0;
      devaddr_t devaddr = 0;
      u1_t nwkKey[16];
      u1_t artKey[16];
      LMIC_getSessionKeys(&netid, &devaddr, nwkKey, artKey);
      String ak, nk;
      for (int i = 0; i < sizeof(artKey); ++i) {
        ak += String(artKey[i], 16);
      }
      for (int i = 0; i < sizeof(nwkKey); ++i) {
        nk += String(nwkKey[i], 16);
      }
      log(DEBUG, "netid: %d devaddr: %x artKey: %s nwkKey: %s", netid, devaddr, ak.c_str(), nk.c_str());
    }
    // Disable link check validation (automatically enabled
    // during join, but because slow data rates change max TX
    // size, we don't use it in this example.
    LMIC_setLinkCheckMode(0);
    break;
  // This event is defined but not used in the code.
  // No point in wasting codespace on it.
  // case EV_RFU1:
  //   event_log("EV_RFU1");
  //   break;
  case EV_JOIN_FAILED:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_JOIN_FAILED");
    break;
  case EV_REJOIN_FAILED:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_REJOIN_FAILED");
    break;
  case EV_TXCOMPLETE:
    event_log("EV_TXCOMPLETE (includes waiting for RX windows)");
    txStatus =   TX_STATUS_UPLINK_SUCCESS;
    if (LMIC.txrxFlags & TXRX_ACK) {
      txStatus = TX_STATUS_UPLINK_ACKED;
      event_log("Received ack");
    }
    if (LMIC.dataLen) {
      log(DEBUG, "Received %d bytes of payload", LMIC.dataLen);
      if (__rxPort != NULL) *__rxPort = LMIC.frame[LMIC.dataBeg - 1];
      if (__rxSz != NULL) *__rxSz = LMIC.dataLen;
      if (__rxBuffer != NULL) memcpy(__rxBuffer, &LMIC.frame[LMIC.dataBeg], LMIC.dataLen);
      txStatus = TX_STATUS_UPLINK_ACKED_WITHDOWNLINK;
    }
    break;
  case EV_LOST_TSYNC:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_LOST_TSYNC");
    break;
  case EV_RESET:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_RESET");
    break;
  case EV_RXCOMPLETE:
    // data received in ping slot
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_RXCOMPLETE");
    break;
  case EV_LINK_DEAD:
    txStatus = TX_STATUS_ENDING_ERROR;
    event_log("EV_LINK_DEAD");
    break;
  case EV_LINK_ALIVE:
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_LINK_ALIVE");
    break;
  // This event is defined but not used in the code.
  // No point in wasting codespace on it.
  // case EV_SCAN_FOUND:
  //   event_log("EV_SCAN_FOUND");
  //   break;
  case EV_TXSTART:
    txStatus = TX_STATUS_UNKNOWN;
    event_log("EV_TXSTART");
    break;
  default:
    txStatus = TX_STATUS_UNKNOWN;
    log(DEBUG, "Unknown event: %u", (unsigned int) ev);
    break;
  }
}


void setup_lorawan() {
  txStatus = TX_STATUS_UNKNOWN;
  // LMIC init
  os_init();
  // Reset the MAC state. Session and pending data transfers will be discarded.
  LMIC_reset();
  LMIC_setClockError(MAX_CLOCK_ERROR * 1 / 100);

  // Setup the LoRaWan stack for TTN Europe
  LMIC_setupChannel(0, 868100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(1, 868300000, DR_RANGE_MAP(DR_SF12, DR_SF7B), BAND_CENTI);  // g-band
  LMIC_setupChannel(2, 868500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(3, 867100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(4, 867300000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(5, 867500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(6, 867700000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(7, 867900000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);  // g-band
  LMIC_setupChannel(8, 868800000, DR_RANGE_MAP(DR_FSK,  DR_FSK),  BAND_MILLI);  // g2-band
  LMIC_setLinkCheckMode(0);
  LMIC.dn2Dr = SF9;
  LMIC_setDrTxpow(DR_SF7, 14);
}

// Send LoRaWan frame with ack or not
// - txPort : port to transmit
// - txBuffer : message to transmit
// - txSz : size of the message to transmit
// - ack : true for message ack & downlink / false for pure uplink
//   When Ack is false, the downlink buffer can be set to NULL as rxSz and rPort
// - rxPort : where to write the port where downlink has been received
// - rxBuffer : where the downlinked data will be stored
// - rxSz : size of received data
transmissionStatus_t lorawan_send(uint8_t txPort, uint8_t *txBuffer, uint8_t txSz, bool ack, uint8_t *rxPort, uint8_t *rxBuffer, uint8_t *rxSz) {
  // Check if there is not a current TX/RX job running
  if (LMIC.opmode & OP_TXRXPEND) {
    log(DEBUG, "OP_TXRXPEND, not sending");
    return TX_STATUS_ENDING_ERROR;
  } else {
    txStatus = TX_STATUS_UNKNOWN;
    __rxPort = rxPort;
    __rxBuffer = rxBuffer;
    __rxSz = rxSz;
    // Prepare upstream data transmission at the next possible time.
    LMIC_setTxData2(txPort, txBuffer, txSz, ((ack) ? 1 : 0));
    // wait for completion
    uint64_t start = millis();
    while (true) {
      switch (txStatus) {
      case TX_STATUS_UNKNOWN:
      case TX_STATUS_JOINING:
      case TX_STATUS_JOINED:
        os_runloop_once();
        break;
      case TX_STATUS_UPLINK_SUCCESS:
      case TX_STATUS_UPLINK_ACKED:
      case TX_STATUS_UPLINK_ACKED_WITHDOWNLINK:
      case TX_STATUS_UPLINK_ACKED_WITHDOWNLINK_PENDING:
        return txStatus;
      case TX_STATUS_ENDING_ERROR:
      case TX_STATUS_TIMEOUT:
        break;
      }
      if (millis() - start > LORA_TIMEOUT_MS) {
        setup_lorawan();
        return TX_STATUS_TIMEOUT;
      }
    }
  }
}

And here the schematic of the device:

This is based on a project, designed by Jürgen Böhringer and his friends.
More details are available under https://github.com/ecocurious2/MultiGeiger.

Best regards

Paul.

(Edited markdown to get it to show the C++ code literally.)

Hi again,

Do you have a link to the schematics of the ESP32 board?

Looking at the sketch, I would say that you can remove the calls to LMIC_setupChanne(); they’re not needed, and will be overridden by TTN during join. Also, you can’t change channels 0-2 in EU868; the LMIC ignores this – it certainly won’t let you change frequencies or bands, and at least one of 0-2 is BAND_MILLI. But this is not your problem.

LMIC_setTxData2() returns errors (it returns lmic_tx_error_t); please check the docs, and use its results. The logic in lorawan_send() is trying to duplicate work that the LMIC already does, and therefore is risky. In older versions of the LMIC, it might have been required, but with v3.3.0 it’s not advisable or required. In my experience, the LMIC is now very good about always either transmitting your data or reporting an error status.

I would not call LMIC_setLinkCheckMode(0) – in most cases the device and the network works better if linkcheck mode is enabled. In addition, it’s very important for handling the TTN V2 -> V3 transition. Furthermore, the comment “slow data rates change max TX size” misses the key point about v3.3.0 – the LMIC automatically increases the DR on transmit if the message is too big for the current DR. (If this is not desirable, you must call LMIC_setTxData2_strict().)

Setting LMIC.dn2Dr is normally not required.

If you’re going to call LMIC_setDrTxpow(), it should be called after join complete; and it’s not a bad idea to do so.

However, the key point is this:

Always call os_runloop_once() in your loop() routine. This must be done without fail, because the MAC can keep going (and will keep going) even after a transmit completes. As far as I can tell, the code only does this while a TX is pending. Not a safe optimization because the network can keep the LMIC busy arbitrarily long. (A lot longer, I must add, in TTN V3, because the RX window is now 5 seconds – each mac downlink can cause a mac uplink, which means things can go on for quite a while.)

Best regards,
–Terry

Hello Terry,

thanks for analyzing the code. I have implemented the adjustments and test again.
Following the Pinout Diagram of the board:

Best regards

Paul.

Aside: using discourse seems much more efficient for me than the GitHub equivalent.

I see that the opmode is 0x118 at the point of failure; this is a rare case where the debug prints are helpful (though it’s better to print these outside the LMIC). This means that a poll was pending (bit 4 is set, OP_POLL).

I think this means that the state tracking (in onEvent()) is getting confused.

This can happen if you try to send while the LMIC is busy dealing with downlink messages.

This can in turn happen after a lot of uplinks, because TTN may decide to send MAC downlinks. But the LMIC sends EV_TXCOMPLETE as soon as it knows your message has been sent and the RX slots have been process. This may be before mac processing is complete. So EV_TXCOMPLETE can’t be used as evidence that the mac is ready for a new message.

The comment says that you’re not honoring TTN fair use, so I gather that you might be transmitting fairly often?

I think you may have found a bug in the LMIC; it appears to me at the moment that LMIC_setTxData() needs to check whether any of (OP_POLL | OP_TXDATA | OP_TXRXPEND) are set.

So you should try:

if (LMIC.opmode & (OP_POLL | OP_TXDATA | OP_TXRXPEND)) {
   // don't transmit, LMIC is probably busy

If adding those to your safety check in lorawan_send() causes the problem to go away, let me know. I’ll need to study all the paths through the code and see if it still seems likely that this is a problem. It’s been clear to me for a while that we need queuing rather than flags, just because of these kinds of issues.

You may also take a look in my lora send code, as an example. I implemented queuing around the LMIC API. In long term tests this code proved to run pretty stable. We have some devices with uplinks every minute (with fair use, since on private TTI network), running hundreds of of days without any disruption.

Can we have this patch

if (LMIC.opmode & (OP_POLL | OP_TXDATA | OP_TXPEND)) {
// don’t transmit, LMIC is probably busy

included in the LMIC API, to keep the application away from using the LMIC state flags?

@Klaus, now’s a good time (since we’re ramping up for TTN V3 which means I have to do a release – a release is not strictly necessary for anybody other than MCCI, but we need it for best device operation).

But we should confirm that this change is right thing. I’ve opened issue #677 for this. Hope others can do some testing to confirm. My LoRaWAN tester is out on loan, and anyway I don’t have a good repro case.

I could try to contribute to tests, if i had a repro case for ESP32.

1 Like

short interim report: with the check (OP_POLL | OP_TXDATA | OP_TXRXPEND) the device has been running for 13 hours without failure. I have increased the transmission cycle to 5 minutes.

1 Like

Bad news, the device stopped sending to TTN after nearly 20 hours. But unlike before, all other functions (measurements, WiFi, web server) are still active. Since debug is off, I can’t see what happened.
The device is locatet in TTN V3. I’ll go back to TTN V2 with it and start the test again

How often is lmic runloop called by your code? Sure that calls are not blocked by other code, and there a no race conditions?

TTN V2 makes no difference except for RX1 window time – better to test with V3, actually, as it stresses the interfaces more.

Lmic is called every 5 minutes for sending data. Airtime is 61,7 ms for every transmission.

The test yesterday showed that the transmission will stop at some point. What option do I have to log what happens?

I don’t have your full code, but I suggest you replace all calls to os_runloop_once() by calls to the following wrapper function:

void my_runloop_once(void) {
    // declare a type, so we don't care about the type of millis() (32 or 64 bit)
    using millis_t = decltype(millis());

   // state variables
    static milis_t tLastPrint;
    static bool fInit = false;

    // get the current time
    const  milis_t now = millis();

    // once each minute, print the current high-frequency time.
    // this will give us an indication that the LMIC is up, and also let us check
    // that the clock is working properly. Also print opmode and txrx
    if (! fInit || (now - tLastPrint) >= 60 * 1000) {
      fInit = true;
      tLastPrint = now;
      Serial.print(os_getTime(), HEX);
      Serial.print(" op=");  Serial.print(LMIC.opmode, HEX);
      Serial.print(" txrx="); Serial.print(LMIC.txrxFlags, HEX);
   }
  // finally: invoke the routine we really want to call.
  os_runloop_once();
}

This will let us know immediately if the Arduino thread is stuck (messages will stop, and we’ll know the value of os_time()). If the Arduino thread is not stuck, we can see some of the LMIC state, and this will give us a clue for more hunting.

Since you saved millis() as a uint64_t in your sample code, I want to confirm that we don’t have an integration problem at the HAL layer. We’ll be able to rule that out with the above log.

–Terry

During compilation I got following error:

sketch\loraWan.cpp: In function ‘void my_runloop_once()’:
loraWan.cpp:201:27: error: invalid conversion from ‘uint ()() {aka unsigned int ()()}’ to ‘ostime_t {aka int}’ [-fpermissive]

 const  ostime_t now = os_getTimeSecs;

Probably not the only error – it’s hard to compile a code fragment. Not impossible with things like repl.it, but…

I’ve edited to fix the two(!) problems on that line.

BTW, depending on taste, you might want to add a Serial.println() at the end of the clause doing the prints. It will make a more readable log, but will take more vertical space. If you’re capturing output to a file with something like Tera Term, I’d add the newline; if just using a “peephole” into the data, I’d add the newlines afterwards, with an editor.