Part 7 by Muhammad

Anomaly Detection on MCUs: Possible or Overhyped?


Anomaly detection on MCUs: possible or overhyped? This post compares what you can realistically run on microcontrollers (MCUs) today, what tends to fail in production, and how to choose between on-device, edge-gateway and cloud approaches. It is for embedded and IoT engineers who want practical guidance on memory, compute, latency and maintainability tradeoffs.

What you will get: clear decision criteria, a comparison table, model options (from simple statistics to TinyML), deployment patterns, and working examples you can compile for embedded Linux or adapt to Cortex-M projects.

What “anomaly detection” means on MCUs

In embedded systems, anomaly detection usually means you compute a score from sensor data and trigger an event when that score crosses a threshold. The anomaly could be mechanical (bearing wear, imbalance), electrical (overcurrent, overheating), environmental (unexpected temperature rise), behavioral (unusual vibration signature) or operational (tamper, misuse).

On MCUs, the phrase “anomaly detection” often gets overloaded. To compare approaches fairly, separate the problem into layers:

  • Signal conditioning: filtering, decimation, RMS, FFT (Fast Fourier Transform) and feature extraction.
  • Baseline definition: what “normal” looks like for this device in this context.
  • Scoring: distance from baseline, reconstruction error, classification probability or rule-based heuristics.
  • Decision logic: thresholds, hysteresis, holdoff timers and rate limiting to avoid alert storms.
  • Feedback loop: how you tune thresholds, update models, label events and prevent drift.

If you only do scoring and decision logic but skip baseline management, you will ship something that looks impressive in demos but degrades quickly in the field.
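The decision-logic layer above can be sketched as a small state machine. This is an illustrative example, not code from any particular SDK; the enter/exit thresholds and holdoff length are placeholders you would tune per device:

```cpp
#include <cstdint>

// Illustrative decision logic: an enter-alarm threshold, a lower exit
// threshold (hysteresis), and a holdoff counter that rate-limits alarms.
class AlarmGate {
public:
  AlarmGate(float enter, float exit, uint32_t holdoff_windows)
    : enter_(enter), exit_(exit), holdoff_(holdoff_windows) {}

  // Feed one anomaly score per window; returns true only when a NEW alarm fires.
  bool update(float score) {
    if (holdoff_left_ > 0) holdoff_left_--;

    if (!active_) {
      if (score >= enter_ && holdoff_left_ == 0) {
        active_ = true;
        holdoff_left_ = holdoff_;  // suppress re-triggering for a while
        return true;               // new alarm event
      }
    } else if (score < exit_) {
      active_ = false;  // hysteresis: exit only below the lower threshold
    }
    return false;
  }

  bool active() const { return active_; }

private:
  float enter_, exit_;
  uint32_t holdoff_;
  uint32_t holdoff_left_ = 0;
  bool active_ = false;
};
```

The two-threshold design prevents a score hovering near one threshold from toggling the alarm on every window, and the holdoff bounds the event rate even on a noisy sensor.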

Anomaly detection on MCUs: possible or overhyped?

It is possible, but only when you constrain the problem and treat “MCU anomaly detection” as a systems engineering task, not just a model choice. The hype usually comes from assuming you can drop a generic neural network on a Cortex-M0, feed raw sensor streams and get robust results across devices, environments and lifetimes. In practice, you win when you combine:

  • Careful feature design so the model sees stable, informative inputs.
  • Small, explainable models that fit flash and RAM and are easy to validate.
  • Operational guardrails like calibration, drift detection and safe fallbacks.

When people ask “Anomaly detection on MCUs: possible or overhyped?”, a useful framing is: Do you need on-device anomaly detection (latency, privacy, connectivity gaps) or do you just want on-device pre-filtering and ship features upstream?

Where MCU anomaly detection actually works

MCU-based anomaly detection tends to succeed in these scenarios:

1) Tight latency or safety constraints

If you need sub-100 ms reaction times, relying on cloud inference is risky. Examples include motor protection, overcurrent events, thermal runaway detection or mechanical shock detection.

2) Intermittent connectivity or high data cost

If you cannot stream raw vibration or audio, on-device detection lets you transmit only events, snippets or compact features. This is a strong fit for cellular IoT and battery devices.

3) Stable operating envelope

The fewer modes your system has, the easier it is to define “normal.” A pump with fixed RPM and consistent mounting is easier than a consumer wearable that experiences endless user behaviors.

4) Clear anomalies with high signal-to-noise

Hard faults (stalled rotor, disconnected sensor, sudden imbalance) often show up as obvious changes in RMS, peak-to-peak, spectral energy or temperature slope. You do not need deep learning for these.
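Features like RMS and peak-to-peak are cheap enough to compute once per window even on small cores. A minimal sketch (function and struct names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

struct WindowFeatures {
  float rms;
  float peak_to_peak;
};

// Compute RMS and peak-to-peak over one window of samples in a single pass.
WindowFeatures compute_features(const float* x, size_t n) {
  float sumsq = 0.0f;
  float mn = x[0], mx = x[0];
  for (size_t i = 0; i < n; i++) {
    sumsq += x[i] * x[i];
    if (x[i] < mn) mn = x[i];
    if (x[i] > mx) mx = x[i];
  }
  return {std::sqrt(sumsq / static_cast<float>(n)), mx - mn};
}
```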

What makes it hard on MCUs

The constraints are not just flash and RAM. The hard parts often live in data quality and lifecycle management.

Compute and memory ceilings

  • RAM: feature windows, FFT buffers and model tensors compete for a few KB to a few hundred KB.
  • Flash: model weights, DSP libraries, logging and OTA (Over-The-Air) update partitions compete for space.
  • CPU budget: you might have milliseconds per window at low clock speeds to save power.

Concept drift and device-to-device variability

A model trained on one motor may fail on another because of mounting, tolerances, aging and temperature. This is why many “one model for all devices” anomaly projects quietly die after pilots.

Label scarcity and ambiguous ground truth

Anomalies are rare, labels are expensive and “anomaly” might mean “unfamiliar but acceptable.” This makes purely supervised approaches hard to scale.

False positives are operationally expensive

A single noisy accelerometer can create constant alerts, ticket churn and device distrust. On-device detection must include robust debouncing, hysteresis and state-aware gating.

Approach comparison: statistics vs ML vs hybrid

Below is a practical comparison across common MCU-ready approaches. “ML” here includes TinyML (machine learning on microcontrollers) toolchains like TensorFlow Lite for Microcontrollers (TFLite Micro), CMSIS-NN and vendor SDKs.

| Approach | Typical inputs | Pros | Cons | Best fit |
| --- | --- | --- | --- | --- |
| Rules and thresholds (RMS, peak, slope) | Time-domain features | Small, explainable, easy to validate, fast | Needs per-device tuning, brittle across modes | Hard faults, safety cutoffs |
| Streaming z-score / EWMA (Exponentially Weighted Moving Average) | Feature streams | Adapts slowly to baseline changes, cheap compute | Can learn the fault if adaptation is too fast, needs guardrails | Drift-tolerant monitoring |
| Distance-to-centroid (Mahalanobis-lite) | Small feature vectors | Works well when normal is clustered, modest compute | Needs covariance estimate, sensitive to feature scaling | Multiple sensors, stable features |
| One-class models (one-class SVM offline, then port) | Feature vectors | Principled novelty detection | Harder to implement on MCU, memory heavy | Gateway or higher-end MCUs |
| Autoencoder reconstruction error (TinyML) | Feature vectors or small spectra | Unsupervised training on normal, flexible | Training is off-device, thresholding is tricky, drift issues | Vibration, acoustic anomalies with stable pipeline |
| Small classifier (TinyML, supervised) | Features or spectrogram bins | Can separate known fault types | Needs labeled fault data, risks overfitting | Known failure modes, controlled dataset |
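The "Mahalanobis-lite" row can be implemented with a diagonal covariance estimate, which avoids a full matrix inverse on the MCU. A sketch, assuming per-feature mean and variance were computed offline from normal data:

```cpp
#include <cstddef>

// Distance-to-centroid score with diagonal covariance ("Mahalanobis-lite"):
// the sum of squared per-feature z-scores. mean[] and var[] come from
// offline characterization of normal data; eps guards near-zero variances.
float centroid_distance2(const float* x, const float* mean,
                         const float* var, size_t n, float eps = 1e-6f) {
  float d2 = 0.0f;
  for (size_t i = 0; i < n; i++) {
    float diff = x[i] - mean[i];
    d2 += (diff * diff) / (var[i] + eps);
  }
  return d2;  // compare against a threshold calibrated on normal data
}
```

Because the score divides by per-feature variance, it is insensitive to feature scaling only to the extent that the offline variance estimates are trustworthy, which is exactly the "needs covariance estimate" caveat in the table.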

A recurring theme in “Anomaly detection on MCUs: possible or overhyped?” is that the simplest method that meets requirements usually wins. If your anomaly is obvious in a few robust features, statistics beat neural networks on cost, explainability and time-to-production.

End-to-end architecture options

Think in terms of where you compute features, where you score anomalies and where you learn or update baselines.

Option A: Pure MCU detection (score and decide on-device)

  • Pros: lowest latency, works offline, reduced data transmission.
  • Cons: updates are harder, limited observability, per-device calibration burden.
  • Common pattern: compute features per window, run a tiny model, store short “pre-event” ring buffer, transmit only on trigger.

Option B: MCU features, cloud scoring

  • Pros: better models, easier iteration, centralized threshold tuning.
  • Cons: latency and connectivity dependence, higher data cost.
  • Common pattern: device sends summary features every N seconds and raw snippets on demand.

Option C: MCU scoring, cloud learning (hybrid)

  • Pros: device remains responsive, cloud improves baseline and thresholds over time.
  • Cons: needs a robust update mechanism and compatibility checks.
  • Common pattern: ship initial thresholds, then periodically update per-device calibration parameters, not the whole model.

Option D: Gateway-based detection

If you have a Linux gateway or an ESP32-class device with more headroom, you can keep sensors cheap and push anomaly detection to the local edge. This often gives the best of both worlds: local low latency with more compute and storage.

Code example 1: streaming z-score on a ring buffer

This example implements a streaming anomaly detector using a rolling mean and standard deviation over a fixed window. It is a solid baseline because it is small, deterministic and easy to test. You can feed it any scalar feature (RMS vibration, current draw, temperature slope).

How to use it: compute a feature once per window (for example, once per 200 ms). Push the feature into the detector. If the z-score exceeds a threshold for K consecutive windows, trigger an event.

// Streaming z-score anomaly detector with rolling mean/stddev over a fixed window.
// Compile with: g++ -O2 zscore.cpp -o zscore

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

class RollingStats {
public:
  explicit RollingStats(size_t window)
    : window_(window), buf_(window, 0.0f) {}

  void push(float x) {
    if (count_ < window_) {
      buf_[count_] = x;
      sum_ += x;
      sumsq_ += x * x;
      count_++;
      idx_ = count_ % window_;
    } else {
      // Remove oldest, add newest
      float old = buf_[idx_];
      buf_[idx_] = x;
      sum_ += x - old;
      sumsq_ += x * x - old * old;
      idx_ = (idx_ + 1) % window_;
    }
  }

  bool ready() const { return count_ >= window_; }

  float mean() const { return sum_ / static_cast<float>(window_); }

  float stddev(float eps = 1e-6f) const {
    float m = mean();
    float var = (sumsq_ / static_cast<float>(window_)) - (m * m);
    if (var < 0.0f) var = 0.0f; // numeric guard
    return std::sqrt(var + eps);
  }

private:
  size_t window_;
  std::vector<float> buf_;
  size_t idx_ = 0;
  size_t count_ = 0;
  float sum_ = 0.0f;
  float sumsq_ = 0.0f;
};

struct ZScoreDetector {
  RollingStats stats;
  float z_threshold;
  uint32_t consecutive_needed;
  uint32_t consec = 0;

  ZScoreDetector(size_t window, float z_th, uint32_t k)
    : stats(window), z_threshold(z_th), consecutive_needed(k) {}

  // Returns true when anomaly condition is met.
  bool update(float x) {
    stats.push(x);
    if (!stats.ready()) return false;

    float z = std::fabs((x - stats.mean()) / stats.stddev());

    if (z >= z_threshold) {
      consec++;
    } else {
      consec = 0;
    }

    return consec >= consecutive_needed;
  }
};

int main() {
  // Example stream: mostly stable around 10, then a step anomaly to 16.
  ZScoreDetector det(/*window=*/32, /*z_threshold=*/3.0f, /*k=*/3);

  for (int i = 0; i < 200; i++) {
    float x = 10.0f + 0.2f * std::sin(i * 0.1f);
    if (i >= 120) x += 6.0f; // anomaly

    bool alarm = det.update(x);
    if (alarm) {
      std::printf("ALARM at i=%d, x=%.3f, mean=%.3f, std=%.3f\n",
                  i, x, det.stats.mean(), det.stats.stddev());
      break;
    }
  }
  return 0;
}

Engineering notes for MCUs:

  • Use fixed-point if you have no FPU (Floating Point Unit), but float32 is fine on Cortex-M4F and newer.
  • Add state gating: only run detection when the machine is in the right mode (for example, motor on, RPM stable).
  • Add adaptation limits: if you let the rolling window “learn” too fast, a slowly developing fault becomes the new normal.

Code example 2: TFLite Micro autoencoder inference

Autoencoders are a common TinyML anomaly approach: train an encoder-decoder network on “normal” feature vectors so reconstruction error increases on novel patterns. On MCUs, you almost always train off-device, then deploy the quantized model for inference only.

This example shows how you run inference with TFLite Micro and compute a reconstruction error score. It assumes you already have a quantized autoencoder_int8.tflite converted into a C array (the standard TFLite Micro workflow).

What runs on the MCU: feature vector in, model inference, reconstruction error out, threshold decision.

// TFLite Micro int8 autoencoder inference and reconstruction error scoring.
// This is a minimal example you can adapt to Cortex-M with TFLite Micro.

#include <cstdint>
#include <cmath>
#include <cstdio>

#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "tensorflow/lite/version.h"

// Provide your model as a C array, generated by xxd -i or similar.
#include "autoencoder_int8_model_data.h"  // defines: g_model[], g_model_len

// Tune based on your model needs. Keep it small.
constexpr int kTensorArenaSize = 20 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

static float dequantize_int8(int8_t v, float scale, int zero_point) {
  return (static_cast<int>(v) - zero_point) * scale;
}

int main() {
  const tflite::Model* model = tflite::GetModel(g_model);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    std::printf("Model schema mismatch\n");
    return 1;
  }

  tflite::AllOpsResolver resolver;
  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kTensorArenaSize);

  if (interpreter.AllocateTensors() != kTfLiteOk) {
    std::printf("AllocateTensors failed\n");
    return 1;
  }

  TfLiteTensor* input = interpreter.input(0);
  TfLiteTensor* output = interpreter.output(0);

  // Example: one feature vector of length N, quantized to int8.
  // In real use, compute features on-device then quantize them.
  const int N = input->bytes; // for int8, bytes == N
  for (int i = 0; i < N; i++) {
    input->data.int8[i] = 0; // replace with quantized features
  }

  if (interpreter.Invoke() != kTfLiteOk) {
    std::printf("Invoke failed\n");
    return 1;
  }

  // Reconstruction error in dequantized space (MSE).
  float mse = 0.0f;
  float in_scale = input->params.scale;
  int in_zero = input->params.zero_point;
  float out_scale = output->params.scale;
  int out_zero = output->params.zero_point;

  for (int i = 0; i < N; i++) {
    float x = dequantize_int8(input->data.int8[i], in_scale, in_zero);
    float y = dequantize_int8(output->data.int8[i], out_scale, out_zero);
    float e = x - y;
    mse += e * e;
  }
  mse /= static_cast<float>(N);

  // Threshold must be calibrated with real normal data.
  const float threshold = 0.05f;
  bool anomaly = (mse > threshold);

  std::printf("MSE=%.6f anomaly=%d\n", mse, anomaly ? 1 : 0);
  return 0;
}

Key comparison point: the autoencoder can detect subtle pattern shifts, but the operational cost is higher than simple z-score. You must manage feature normalization, quantization drift, model versioning and per-device threshold calibration.

Dataset, training and validation realities

Most failures in embedded anomaly projects come from training and validation shortcuts. If you want “Anomaly detection on MCUs: possible or overhyped?” to land on the “possible” side for your use case, focus on these realities:

Define “normal” by mode

Many devices have multiple normal modes: different RPM, loads, temperatures, or mounting conditions. A single baseline often creates false positives. Practical solutions include:

  • Separate detectors per mode (state machine controls which is active).
  • Include mode variables as inputs (RPM, duty cycle, valve state).
  • Compute mode-invariant features (orders in vibration analysis rather than absolute frequency bins).
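The "separate detectors per mode" option can be sketched as a table of per-mode baselines selected by the application's state machine. The mode names, baselines and thresholds below are illustrative:

```cpp
#include <cstddef>

// One baseline per operating mode; the application's state machine sets
// the active mode. Idle is ungated because "normal" is undefined there.
enum class Mode { kIdle = 0, kLowRpm = 1, kHighRpm = 2 };

struct ModeBaseline {
  float mean;       // per-mode baseline from characterization data
  float threshold;  // per-mode allowed deviation
};

class ModeGatedDetector {
public:
  explicit ModeGatedDetector(const ModeBaseline* table) : table_(table) {}

  // Score only in modes that have a defined baseline.
  bool update(Mode mode, float feature) {
    if (mode == Mode::kIdle) return false;  // no detection in idle
    const ModeBaseline& b = table_[static_cast<size_t>(mode)];
    float dev = feature - b.mean;
    if (dev < 0) dev = -dev;
    return dev > b.threshold;
  }

private:
  const ModeBaseline* table_;
};
```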

Collect boring data, lots of it

For unsupervised approaches (autoencoders, one-class methods), normal data volume matters more than anomaly data. You need to capture:

  • Temperature extremes
  • Sensor placement variation
  • Manufacturing tolerances
  • Aging and wear over time

Validate with “unknown unknowns”

Do not validate only on hand-picked faults. Validate on real field noise: loose cables, bumped enclosures, supply droop, EMI (Electromagnetic Interference), sensor saturation and user misuse. A detector that survives messy data is worth more than a model that wins on a curated test set.

Power, memory and latency budgeting

MCU anomaly detection lives or dies by budgets. A realistic budgeting process looks like this:

1) Choose the windowing strategy

  • Time window: for vibration, 256 to 2048 samples are common (depending on sample rate).
  • Overlap: overlap improves responsiveness but increases compute.
  • Trigger buffer: keep a circular buffer of raw samples so you can transmit a pre-fault snippet.
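The trigger-buffer bullet above is worth making concrete, since the indexing is easy to get wrong. A sketch of a pre-trigger circular buffer; on a real MCU you would use a fixed static array rather than std::vector:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Circular buffer of raw samples so a pre-fault snippet can be transmitted
// when an alarm fires. Oldest samples are overwritten once full.
class PreTriggerBuffer {
public:
  explicit PreTriggerBuffer(size_t capacity) : buf_(capacity, 0) {}

  void push(int16_t sample) {
    buf_[head_] = sample;
    head_ = (head_ + 1) % buf_.size();
    if (count_ < buf_.size()) count_++;
  }

  // Copy out stored samples, oldest first, for transmission on trigger.
  size_t snapshot(int16_t* out, size_t max_out) const {
    size_t n = count_ < max_out ? count_ : max_out;
    size_t start = (head_ + buf_.size() - count_) % buf_.size();
    for (size_t i = 0; i < n; i++) {
      out[i] = buf_[(start + i) % buf_.size()];
    }
    return n;
  }

private:
  std::vector<int16_t> buf_;
  size_t head_ = 0;
  size_t count_ = 0;
};
```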

2) Estimate feature cost

  • RMS, peak, kurtosis: cheap.
  • FFT: moderate to heavy, but feasible on Cortex-M4/M7 with DSP libs if you keep sizes small.
  • Mel-frequency features (audio): heavier, often better on higher-end MCUs or gateways.

3) Estimate inference cost

A small int8 network can run in a few milliseconds on Cortex-M4F/M7. On smaller cores, inference can dominate power. If you need long battery life, consider using statistics and only running ML when a cheap pre-check looks suspicious.
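The "cheap pre-check gates the expensive model" pattern can be sketched as two stages; run_model_score below is a hypothetical placeholder for whatever inference call your toolchain provides, and the thresholds are illustrative:

```cpp
#include <cmath>

// Two-stage detection: a cheap RMS deviation check gates the expensive
// model so ML inference runs only on suspicious windows.
struct TwoStageDetector {
  float precheck_threshold;  // cheap stage: allowed RMS deviation
  float model_threshold;     // expensive stage: model score cutoff
  float baseline_rms;

  // Returns true if the window is anomalous; 'ran_model' reports whether
  // the expensive stage executed (useful for power accounting).
  template <typename ModelFn>
  bool update(float rms, ModelFn run_model_score, bool* ran_model) {
    *ran_model = false;
    if (std::fabs(rms - baseline_rms) < precheck_threshold) {
      return false;  // clearly normal: skip inference, save power
    }
    *ran_model = true;
    return run_model_score() > model_threshold;
  }
};
```

The fraction of windows that reach the second stage is what determines average power, so it is worth logging `ran_model` statistics during pilots.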

4) Budget RAM explicitly

RAM often breaks TinyML deployments, not flash. Your RAM consumers typically include:

  • Sample buffers and FFT work buffers
  • Feature vectors
  • TFLite Micro tensor arena
  • Stacks for interrupts and RTOS (Real-Time Operating System) tasks

Deployment, operations and update strategy

If you only compare algorithms, you miss the biggest differentiator: operations.

Threshold management

Every approach needs thresholds. Plan how you will:

  • Set initial thresholds from factory characterization or pilot data.
  • Adjust thresholds per device based on early-life “burn-in” data.
  • Prevent runaway adaptation (for example, freeze baseline updates during suspected anomalies).
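The "freeze baseline updates during suspected anomalies" bullet can be sketched as an EWMA baseline with a freeze guard; alpha and the freeze margin are illustrative values you would calibrate:

```cpp
#include <cmath>

// EWMA baseline that freezes adaptation while a sample looks suspicious,
// so a developing fault cannot slowly become the new "normal".
class FrozenEwmaBaseline {
public:
  FrozenEwmaBaseline(float alpha, float freeze_margin)
    : alpha_(alpha), margin_(freeze_margin) {}

  void seed(float x) { baseline_ = x; seeded_ = true; }

  // Returns the absolute deviation; adapts only when the sample is close.
  float update(float x) {
    if (!seeded_) { seed(x); return 0.0f; }
    float dev = std::fabs(x - baseline_);
    if (dev <= margin_) {
      baseline_ += alpha_ * (x - baseline_);  // normal: adapt slowly
    }
    // Suspicious samples leave the baseline frozen.
    return dev;
  }

  float baseline() const { return baseline_; }

private:
  float alpha_, margin_;
  float baseline_ = 0.0f;
  bool seeded_ = false;
};
```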

Observability without streaming everything

To debug false positives, you need context. Common patterns:

  • Transmit feature traces around alarms (few dozen points).
  • Transmit raw snippets only when an alarm is triggered.
  • Log summary histograms (mean, variance, percentiles) periodically.
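The summary-histogram bullet can be sketched as a fixed-bin accumulator logged periodically instead of raw data; the bin count and range are placeholders:

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-bin histogram of a feature for periodic summary logging. Bin edges
// are linear between lo and hi; out-of-range values clamp to edge bins.
template <size_t kBins>
class FeatureHistogram {
public:
  FeatureHistogram(float lo, float hi) : lo_(lo), hi_(hi) {}

  void add(float x) {
    float t = (x - lo_) / (hi_ - lo_);
    int bin = static_cast<int>(t * kBins);
    if (bin < 0) bin = 0;
    if (bin >= static_cast<int>(kBins)) bin = kBins - 1;
    counts_[bin]++;
    total_++;
  }

  uint32_t count(size_t bin) const { return counts_[bin]; }
  uint32_t total() const { return total_; }
  void reset() {            // call after each transmission period
    for (size_t i = 0; i < kBins; i++) counts_[i] = 0;
    total_ = 0;
  }

private:
  float lo_, hi_;
  uint32_t counts_[kBins] = {};
  uint32_t total_ = 0;
};
```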

Model and firmware updates

  • Keep the model versioned independently from firmware if possible.
  • Use A/B partitions for safe rollbacks when deploying new thresholds or models.
  • Consider updating calibration parameters (scales, offsets, thresholds) more frequently than full models.

A decision checklist

Use this checklist to decide if your application is a good fit for on-device detection and which approach to choose.

Start with requirements

  • Latency: Do you need detection in under 100 ms?
  • Connectivity: Can you rely on always-on IP connectivity?
  • Power: Can you afford continuous feature extraction?
  • Cost: Can you move up to a higher-end MCU or add a gateway?

Then check data and variability

  • How many normal modes exist?
  • How different are devices unit-to-unit?
  • Will sensor placement vary?
  • Do you have at least weeks of normal data per environment?

Pick the smallest method that works

  • If anomalies are large and obvious: rules, RMS, slopes, z-score.
  • If anomalies are subtle patterns: hybrid (cheap pre-check plus autoencoder) or pure TinyML on higher-end MCUs.
  • If you need global learning across fleets: MCU features plus cloud scoring, or gateway inference.

Revisiting the core question, “Anomaly detection on MCUs: possible or overhyped?” is best answered by piloting with your real sensor, your real mounting, your real power budget and your real operational workflow. If you cannot support calibration and ongoing tuning, even a great model will look overhyped in production.

Conclusion

Anomaly detection on MCUs can be genuinely effective when you constrain the problem, engineer a stable feature pipeline and plan for calibration, drift and updates. Simple statistical detectors often outperform TinyML in time-to-market and reliability, while TinyML autoencoders and small classifiers pay off when you need sensitivity to subtle pattern changes and you can support the operational overhead. If you treat “Anomaly detection on MCUs: possible or overhyped?” as a system design decision rather than a model trend, you can ship something robust and maintainable.