Troubleshooting

Common issues across the RIoT platform — firmware, backend, frontend, deployment, and device connectivity — with direct solutions.

Firmware

SD Card Mount Failure

Symptom: Device fails to read config.json on boot. Serial monitor shows SD mount failed or similar.

Verify the SD card is formatted as FAT32. Cards formatted as exFAT or NTFS are not supported by the ESP32 SD library.
Power off the device and reseat the SD card. Inspect the card slot for bent pins or debris.
Check the SPI wiring between the ESP32 and the SD card module — confirm CS, MOSI, MISO, and CLK connections.
Test with a different SD card. Cards larger than 32 GB often ship as exFAT and must be reformatted.
If the issue persists on multiple cards, replace the SD card module.

I2C Communication Errors

Symptom: Sensor reads return 0xFF or timeout. Serial monitor shows I2C NACK or Wire.endTransmission() error codes.

Confirm 4.7 kOhm pull-up resistors are present on both SDA and SCL lines. Missing or too-weak pull-ups are a common cause of I2C failures.
Run an I2C scanner sketch to verify the sensor responds at the expected address. Cross-reference the datasheet for the correct 7-bit address.
Keep I2C cable length under 1 meter. Longer runs degrade signal integrity — use an I2C extender or switch to RS-485 for distant sensors.
Check for address conflicts. Two devices sharing the same address will corrupt the bus. Use address-select pins or a TCA9548A multiplexer.
Ensure VCC and GND are shared between the ESP32 and all I2C peripherals.

CAN Bus Issues

Symptom: No messages received, intermittent packet loss, or CAN controller enters error-passive/bus-off state.

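Verify the CAN hardware is powered and wired correctly. An external controller such as the MCP2515 connects over SPI and still needs its own transceiver, while a bare transceiver such as the SN65HVD230 connects to the ESP32 TWAI TX/RX pins.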
Confirm 120 Ohm termination resistors are installed at both physical ends of the CAN bus. Measure resistance between CAN_H and CAN_L with the bus powered off — expect ~60 Ohm for two terminators in parallel.
Ensure all nodes on the bus use the same bitrate (e.g., 500 kbps). A single misconfigured node will disrupt the entire bus.
Check for ground loops. All CAN nodes must share a common ground reference. Run a dedicated ground wire alongside CAN_H and CAN_L.
Inspect cable for damage, especially in industrial environments with vibration or heat exposure.
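
The ~60 Ohm figure is simply two 120 Ohm terminators in parallel, and other readings tend to point at specific faults. A minimal Python sketch of that arithmetic follows. The function names and the 10% tolerance are assumptions made for illustration, not values from a CAN specification.

```python
def parallel_resistance(*resistors_ohm: float) -> float:
    """Equivalent resistance of resistors wired in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def diagnose_termination(measured_ohm: float, tolerance: float = 0.10) -> str:
    """Map a CAN_H-to-CAN_L reading (bus powered off) to a likely terminator count.

    The 10% tolerance is an illustrative assumption.
    """
    expected = {
        1: parallel_resistance(120),            # one terminator missing
        2: parallel_resistance(120, 120),       # correctly terminated bus
        3: parallel_resistance(120, 120, 120),  # one terminator too many
    }
    for count, ohms in expected.items():
        if abs(measured_ohm - ohms) <= ohms * tolerance:
            return f"{count} terminator(s), about {ohms:.0f} Ohm expected"
    return "no match: check for an open or shorted bus"
```

A reading near 120 Ohm suggests one terminator is missing, and a reading near 40 Ohm suggests a third terminator somewhere on the bus.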

RTC Not Responding

Symptom: Device logs show RTC read failed or timestamps default to 2000-01-01.

Run an I2C scan and confirm the DS3231 responds at address 0x68. If absent, check wiring.
Measure the CR2032 battery voltage. Replace if below 2.5 V — a weak battery causes intermittent failures even when the device is externally powered.
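Verify VCC is 3.3 V. The DS3231 operates at 2.3–5.5 V, but many breakout modules pull SDA and SCL up to VCC, so powering the module from 5 V can put 5 V on ESP32 pins that are rated for 3.3 V logic.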
Check that the RTC module does not conflict with another device on address 0x68 (e.g., some MPU-6050 modules share this address).

Sensor Reading Failures

Symptom: Sensor values are zero, NaN, or static. Device status reports normally but sensor data is missing or invalid.

Open config.json on the SD card and verify sensor configuration parameters — sensor type, GPIO pin, I2C address, and polling interval.
Check for GPIO conflicts. Pins used by the SD card SPI bus, CAN transceiver, or onboard LED cannot be reused for sensor input. Consult the pin mapping in the firmware README.
Connect a USB cable and open the serial monitor at 115200 baud. Sensor initialization errors and raw readings are logged at boot and on each poll cycle.
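For analog sensors, verify the ADC attenuation setting matches the sensor output range. On the original ESP32 at 11 dB attenuation, Espressif suggests keeping inputs between roughly 150 mV and 2450 mV; readings become increasingly nonlinear toward the top of the scale.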
Power-cycle the sensor. Some I2C sensors (e.g., SHT3x, BME280) enter a fault state after bus errors and require a full power reset.
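
The GPIO conflict check can be automated on a workstation before the SD card goes back into the device. The sketch below is only an illustration: the config.json layout (a "sensors" list with "name" and "pin" keys) and the reserved pin table are hypothetical, so substitute the real schema and the pin mapping from the firmware README. Sensors on the shared I2C bus are expected to have no dedicated "pin" entry and are ignored.

```python
import json
from collections import defaultdict

# Hypothetical reserved pins; take the real values from the firmware README.
RESERVED_PINS = {5: "SD CS", 18: "SD CLK", 19: "SD MISO", 23: "SD MOSI"}


def find_pin_conflicts(config_text: str) -> list[str]:
    """Report sensors that reuse a reserved pin or share a pin with another sensor.

    Assumes a hypothetical layout: {"sensors": [{"name": ..., "pin": ...}, ...]}.
    """
    config = json.loads(config_text)
    users = defaultdict(list)
    for sensor in config.get("sensors", []):
        if "pin" in sensor:
            users[sensor["pin"]].append(sensor["name"])

    problems = []
    for pin, names in sorted(users.items()):
        if pin in RESERVED_PINS:
            problems.append(
                f"GPIO {pin} ({RESERVED_PINS[pin]}) used by {', '.join(names)}"
            )
        elif len(names) > 1:
            problems.append(f"GPIO {pin} shared by {', '.join(names)}")
    return problems
```

An empty list means no conflicts were found under these assumptions; it does not prove the wiring is correct.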

Backend

Database Connection

Symptom: Backend fails to start with OperationalError: could not connect to server or FATAL: database does not exist.

Verify DATABASE_URL in the backend environment. Expected format:

postgresql+asyncpg://riot:<password>@riot-postgres:5432/riot

Confirm PostgreSQL is running and healthy:

docker exec riot-postgres pg_isready -U riot -d riot

Check that the PostGIS extension is installed. The backend requires it for spatial queries:

docker exec riot-postgres psql -U riot -d riot -c "SELECT PostGIS_Version();"

If missing, create it:

docker exec riot-postgres psql -U riot -d riot -c "CREATE EXTENSION IF NOT EXISTS postgis;"

Inspect backend logs for connection pool exhaustion (QueuePool limit). Increase SQLALCHEMY_POOL_SIZE if the worker count exceeds the default pool.
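
A malformed DATABASE_URL is easy to miss by eye. The following sketch checks a URL against the expected format shown above using only the Python standard library. The function name is hypothetical, and the check covers structure only; it does not test that the server is reachable.

```python
from urllib.parse import urlsplit


def check_database_url(url: str) -> list[str]:
    """Return a list of problems found in a DATABASE_URL; empty means it looks fine."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme != "postgresql+asyncpg":
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if not parts.username or not parts.password:
        problems.append("missing username or password")
    if not parts.hostname:
        problems.append("missing host")
    if parts.port is None:
        problems.append("missing port")
    if not parts.path.lstrip("/"):
        problems.append("missing database name")
    return problems
```

Passwords containing characters such as @, / or # need to be percent-encoded in the URL, otherwise the host and path are parsed incorrectly.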

Migration Failures

Symptom: alembic upgrade head fails with revision conflicts, missing tables, or lock timeouts.

Run migrations from inside the backend container:
```
docker exec riot-backend alembic upgrade head
```

If Alembic reports multiple heads, resolve the branch conflict:

docker exec riot-backend alembic heads
docker exec riot-backend alembic merge -m "merge heads" <rev1> <rev2>

For lock timeouts, check for long-running transactions blocking DDL:

docker exec riot-postgres psql -U riot -d riot -c "SELECT pid, state, query FROM pg_stat_activity WHERE state != 'idle';"

If the migration history is corrupted, verify the current revision matches the database state:
```
docker exec riot-backend alembic current
```

Celery Workers Not Processing

Symptom: Export tasks stay in PENDING state. No worker activity in logs.

Verify CELERY_BROKER_URL points to the Valkey instance:
```
redis://riot-valkey:6379/0
```

Confirm Valkey is running:

docker exec riot-valkey valkey-cli ping

Expect PONG.

Check worker logs for connection errors or task registration issues:
```
docker logs riot-celery-worker --tail 100
```

Verify workers are registered and responsive:

docker exec riot-celery-worker celery -A celery_app.celery inspect ping --timeout 5

If workers are connected but tasks are not executing, confirm the task names in the codebase match the registered task list:
```
docker exec riot-celery-worker celery -A celery_app.celery inspect registered
```

Frontend

Build Failures

Symptom: npm run build fails with syntax errors, module-not-found errors, or out-of-memory crashes.

Verify the Node.js version matches the requirement in .nvmrc or package.json engines. The frontend requires Node.js 20+.
Delete existing dependencies and reinstall:
```
rm -rf node_modules .next
npm install
```
Clear the Next.js build cache explicitly:
```
rm -rf .next
npm run build
```
For out-of-memory errors during build, increase the Node.js heap:
```
NODE_OPTIONS="--max-old-space-size=4096" npm run build
```
Check for TypeScript errors that block production builds but not dev mode:
```
npx tsc --noEmit
```
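
The Node.js version check can be scripted for CI or a pre-build hook. This is a minimal sketch with a hypothetical function name; it takes the text printed by node --version and compares only the major version against the Node.js 20+ requirement stated above.

```python
def node_meets_requirement(version_output: str, minimum_major: int = 20) -> bool:
    """Check `node --version` output (for example "v20.11.1") against a minimum major."""
    major = int(version_output.strip().lstrip("v").split(".")[0])
    return major >= minimum_major
```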

BFF Proxy Errors

Symptom: API calls from the frontend return 502, CORS errors, or ECONNREFUSED in the browser console.

Verify the backend URL configured in the frontend environment. In production, the Next.js server-side BFF proxies requests to the backend over the Docker network:
```
BACKEND_URL=http://riot-backend:8000
```
Confirm the backend container is running and reachable from the frontend container:
```
docker exec riot-frontend wget -q -O - http://riot-backend:8000/health
```
For CORS errors in the browser, check that the backend CORS_ORIGINS environment variable includes the frontend domain.
Open the browser Network tab and inspect the failing request. Note the response status code, headers, and body. Together they distinguish proxy errors (502/504) from auth errors (401/403) and from application errors (other 4xx/5xx codes).
Check frontend server logs for upstream connection failures:
```
docker logs riot-frontend --tail 100
```
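
The status-code triage described above can be written down as a small lookup, which is handy when scanning exported request logs. The category names in this sketch are illustrative labels, not identifiers used by the platform.

```python
def classify_status(status: int) -> str:
    """Bucket an HTTP status code the way the triage step above describes."""
    if status in (502, 504):
        return "proxy"        # BFF or Nginx could not reach the upstream, or it timed out
    if status in (401, 403):
        return "auth"         # missing, expired, or insufficient credentials
    if 400 <= status <= 599:
        return "application"  # any other client or server error
    return "ok"
```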

Deployment

Docker Networking

Symptom: Containers cannot reach each other by service name. Requests between services return ECONNREFUSED or DNS resolution failures.

Verify all services are on the same Docker Compose network. Run:
```
docker network inspect riot_default
```
Confirm each expected container appears in the Containers section.
Check for port conflicts on the host. If another process binds the same port, Compose will fail to start the conflicting service:
```
ss -tlnp | grep -E ':(443|8443|8000|3000|5432) '
```
Review service logs for startup failures that prevent the container from joining the network:
```
docker compose -f deploy/docker-compose.prod.yml logs --tail 50
```
If a service was recreated, dependent services may still reference a stale container IP. Restart the dependent services:
```
docker compose -f deploy/docker-compose.prod.yml restart nginx haproxy
```
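
Reading the Containers section by eye gets tedious with many services. The sketch below parses the JSON printed by docker network inspect and reports which expected containers are not attached. The function name and the list of expected container names are assumptions; adjust them to the services in your Compose file.

```python
import json


def missing_containers(inspect_json: str, expected: set[str]) -> set[str]:
    """Return expected container names absent from `docker network inspect` output."""
    networks = json.loads(inspect_json)  # the CLI prints a JSON array of networks
    attached = {
        entry["Name"]
        for network in networks
        for entry in (network.get("Containers") or {}).values()
    }
    return expected - attached
```

An empty result means every expected container is attached to the inspected network.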

SSL Certificates

Symptom: Browsers show NET::ERR_CERT_DATE_INVALID or Nginx fails to start with cannot load certificate.

Check certificate expiry:

openssl x509 -in /etc/letsencrypt/live/$DOMAIN/fullchain.pem -noout -enddate

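Renew with Certbot if expired. The --force-renewal flag renews regardless of remaining validity and counts against Let's Encrypt rate limits, so drop it for routine renewals: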
```
certbot renew --force-renewal
```

Verify Nginx config references the correct certificate paths:

ssl_certificate     /etc/letsencrypt/live/$DOMAIN/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/$DOMAIN/privkey.pem;

After renewal, reload Nginx to pick up the new certificate:
```
docker exec riot-nginx nginx -s reload
```
Confirm the Certbot renewal timer is active:
```
systemctl status certbot.timer
```
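
To alert before a certificate lapses rather than after, the -enddate output can be turned into a days-remaining number. This is a minimal sketch with a hypothetical function name; it assumes the usual notAfter=Mon DD HH:MM:SS YYYY GMT line printed by openssl and an English (C) locale for month names.

```python
from datetime import datetime, timezone


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Parse an `openssl x509 -noout -enddate` line and return whole days left.

    The result is floored, so any negative value means the certificate has expired.
    `now` must be timezone-aware.
    """
    value = enddate_line.strip().removeprefix("notAfter=")
    not_after = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    not_after = not_after.replace(tzinfo=timezone.utc)  # openssl prints GMT
    return (not_after - now).days
```

Pass datetime.now(timezone.utc) as now in normal use; taking it as a parameter keeps the function easy to test.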

HAProxy mTLS

Symptom: ESP32 devices receive TLS handshake failures. HAProxy logs show SSL handshake failure or unknown CA.

Verify the CA certificate path in haproxy.cfg. The CA used to sign device client certificates must be referenced in the ca-file directive.
Confirm the client certificate presented by the device was signed by the expected CA:
```
openssl verify -CAfile /path/to/ca.pem /path/to/client.pem
```

Check that the client certificate has not expired:

openssl x509 -in /path/to/client.pem -noout -dates

Inspect HAProxy logs for detailed handshake errors:
```
docker logs riot-haproxy --tail 100
```
If certificates were recently rotated in Vault, restart HAProxy to reload them:
```
docker compose -f deploy/docker-compose.prod.yml restart haproxy
```

Nginx Routing

Symptom: Requests return 502 Bad Gateway, 504 Gateway Timeout, or are routed to the wrong upstream.

Verify the upstream blocks in the Nginx config point to the correct container names and ports:

upstream backend {
    server riot-backend:8000;
}
upstream frontend {
    server riot-frontend:3000;
}

Test the Nginx configuration for syntax errors:
```
docker exec riot-nginx nginx -t
```
For 502 errors, the upstream service is unreachable. Confirm the target container is running and healthy.
For 504 errors, the upstream is too slow. Increase proxy_read_timeout for long-running endpoints (e.g., exports):
```
proxy_read_timeout 120s;
```
Check Nginx error logs for the specific upstream failure:
```
docker logs riot-nginx --tail 100
```

Device Connectivity

WiFi Connection

Symptom: Device fails to connect to WiFi. Serial monitor shows WiFi: DISCONNECTED or repeated association attempts.

Verify the SSID and password in config.json on the SD card. Check for trailing whitespace or encoding issues.
Confirm the WiFi network is 2.4 GHz. The ESP32 does not support 5 GHz bands.
Check signal strength. Place the device within reasonable range of the access point. RSSI below -80 dBm will cause frequent disconnects.
Verify the firmware retry logic. The device should implement exponential backoff with a maximum retry interval. Check serial output for retry timing.
On enterprise networks (WPA2-Enterprise), confirm the device firmware supports the required EAP method and that credentials are correct.
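
When checking retry timing in the serial output, it helps to know what a capped exponential backoff schedule should look like. The firmware itself is not written in Python; this sketch only illustrates the expected pattern, and the 1 s base and 30 s cap in the test values are assumptions rather than the firmware's actual settings.

```python
def backoff_delays(base_s: float, cap_s: float, attempts: int) -> list[float]:
    """Delay before each retry: base_s doubled per attempt, clamped to cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts)]
```

Gaps between attempts that stay constant, or that grow without ever levelling off, suggest the retry logic is not applying the cap as intended.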

mTLS Handshake

Symptom: Device connects to WiFi but cannot reach the backend. Serial monitor shows TLS handshake failed or certificate verify failed.

Check certificate expiry on the device. An expired client certificate will be rejected by HAProxy:
```
openssl x509 -in client.pem -noout -enddate
```
Verify the device's CA certificate matches the one configured in HAProxy. A CA mismatch occurs when certificates are rotated on the server but not on the device.
Confirm the device's real-time clock is synchronized. TLS validation fails if the device clock is set to a date outside the certificate's validity window. Check the RTC battery and NTP sync logic.
Ensure the device is connecting to the correct hostname and port (port 8443 for mTLS). The device must use the production domain, not localhost.

Test the handshake from a workstation to isolate whether the issue is device-side or server-side:

openssl s_client -connect riot.example.com:8443 -cert client.pem -key client-key.pem -CAfile ca.pem

Upload Failures

Symptom: Device connects and authenticates but sensor data uploads fail. Serial monitor shows HTTP 4xx/5xx or timeout errors.

Verify the endpoint URL in config.json. The upload path should be:
```
https://riot.example.com:8443/v1/device/readings
```
For timeout errors, check network stability. Intermittent WiFi or high-latency links cause upload failures under default timeout settings. Increase the HTTP client timeout in the firmware if needed.
Confirm the device implements exponential backoff on failed uploads. Without backoff, rapid retries can exhaust memory or trigger rate limiting.
For HTTP 401 errors, verify the device's API key and secret have not been revoked. Re-check credentials in config.json against the platform dashboard.
For HTTP 413 errors, the upload payload is too large. Reduce the batch size in the device configuration to send fewer readings per request.
Check backend logs to confirm whether the request is arriving:
```
docker logs riot-backend --tail 100
```
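
For the HTTP 413 case, reducing the batch size amounts to slicing the pending readings into smaller requests. The sketch below shows that logic in Python for clarity; the device firmware would implement the equivalent in its own language, and the function name and the shape of a reading are hypothetical.

```python
from typing import Iterator


def split_batches(readings: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size readings, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(readings), batch_size):
        yield readings[start:start + batch_size]
```

Each yielded batch would be sent as its own upload request, so a failure affects only the readings in that batch.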