Part 8 by Muhammad

ONNX vs TFLite vs TensorFlow Micro: What to Use for Edge AI

Choosing between ONNX, TFLite and TensorFlow Micro is a practical decision you make when you need to run machine learning models on edge devices, from Linux gateways to tiny microcontrollers. This comparison is for intermediate embedded and IoT engineers who already ship firmware or edge applications and need a clear, technical basis for choosing a model format and runtime.

TL;DR: ONNX is a portable interchange format with multiple runtimes (great for heterogeneous edge deployments). TensorFlow Lite (TFLite) is a production-ready mobile and edge runtime with strong tooling and accelerator support. TensorFlow Lite for Microcontrollers (TensorFlow Micro, often called TFLM) is a minimal interpreter for MCUs with tight RAM and flash budgets.

Quick comparison

| Category | ONNX (with ONNX Runtime) | TensorFlow Lite (TFLite) | TensorFlow Micro (TFLM) |
|---|---|---|---|
| What it is | Model interchange format + runtimes | Model format + runtime for mobile/edge | Minimal inference runtime for microcontrollers |
| Typical targets | Linux gateways, Windows, Android, server edge | Android/iOS, Linux, embedded Linux, some RTOS | MCUs (bare metal, FreeRTOS, Zephyr, etc.) |
| Model file | .onnx | .tflite | .tflite compiled/embedded, plus op resolver |
| Quantization | Supported via Q/DQ graphs, per-runtime behavior | Excellent tooling, full integer quantization is common | Primarily int8 and int16 flows, you curate ops tightly |
| Acceleration | NNAPI, TensorRT, OpenVINO, DirectML, vendor EPs | NNAPI, GPU delegate, XNNPACK, vendor delegates | Vendor kernels, CMSIS-NN, specialized accelerators |
| Strength | Cross-framework portability and multi-backend deployment | Production edge runtime, great conversion and profiling | Small footprint, deterministic embedded-friendly execution |
| Main tradeoff | Interchange complexity, opset quirks and runtime variance | Conversion constraints, delegate-specific performance | Limited ops, strict memory planning and integration effort |

What ONNX, TFLite and TensorFlow Micro actually are

ONNX in one paragraph

Open Neural Network Exchange (ONNX) is primarily a model interchange format with a standardized graph representation and versioned operator sets (opsets). In practice, you pair ONNX with a runtime such as ONNX Runtime (ORT) to execute models on CPU and accelerators via execution providers (EPs) like TensorRT, OpenVINO, DirectML or NNAPI. ONNX is most compelling when you want to train in one framework (PyTorch is common) but deploy on multiple targets without rewriting your inference stack.

TFLite in one paragraph

TensorFlow Lite (TFLite) is a deployment-focused runtime and flatbuffer model format optimized for mobile and edge inference. It supports CPU backends (including XNNPACK), NNAPI (Android), GPU delegates and a broad set of vendor delegates for NPUs. Most teams using TensorFlow training pipelines convert models into .tflite and then apply post-training quantization or quantization-aware training (QAT) for better performance and smaller binaries.

TensorFlow Micro (TFLM) in one paragraph

TensorFlow Lite for Microcontrollers (TensorFlow Micro or TFLM) is a tiny inference interpreter designed for microcontrollers with no operating system (or a small RTOS). You typically compile it into firmware, embed the model as a C array and include only the operators you need. TFLM targets deterministic memory usage and small flash size, so you manage an arena allocator for tensors and usually rely on int8 quantized models to fit RAM constraints.

ONNX vs TFLite vs TensorFlow Micro: architecture and ecosystem differences

The core difference is that ONNX is format-first (with many runtimes), TFLite is a runtime plus format for edge devices, and TensorFlow Micro is a microcontroller-first interpreter with aggressive footprint control.

Graph representation and execution model

  • ONNX: A protobuf graph with explicit tensor types and shapes, plus opset versioning. Execution depends on the runtime and selected EPs. Many optimizations happen via graph transformations and kernel selection.
  • TFLite: A flatbuffer graph optimized for fast loading and low overhead. TFLite uses delegates for offload and often partitions the graph (some nodes delegated, others on CPU).
  • TFLM: Interprets a subset of TFLite operations. You provide an op resolver that registers only needed kernels. Memory is planned into a single tensor arena for predictable allocation.

Who maintains what

  • ONNX: Specification is governed by the ONNX community, runtimes are maintained by different orgs (Microsoft maintains ONNX Runtime, others exist).
  • TFLite and TFLM: Both are part of the TensorFlow ecosystem (Google-led), with vendor contributions for delegates and optimized kernels.

What “portability” really means

  • ONNX portability: High at the file format level, but you must validate opset compatibility, supported kernels and numeric equivalence across runtimes and EPs.
  • TFLite portability: Strong across devices that can run the TFLite runtime, but model conversion constraints (supported ops, quantization requirements) can shape your architecture.
  • TFLM portability: Strong across MCUs, but you often need platform glue (timers, logging, memory alignment) and sometimes vendor-specific kernels for performance.

Runtime footprint, latency and memory behavior

For IoT deployments, you care about flash size, RAM usage, cold-start time, throughput and tail latency. These characteristics differ significantly across ONNX Runtime, TFLite and TensorFlow Micro.

Flash and binary size (rule of thumb)

  • ONNX Runtime: Typically the largest. Even the minimal builds tend to be heavier than TFLite due to broader kernel coverage and backend plumbing. Great for gateways, often too heavy for MCUs.
  • TFLite: Moderate. You can reduce size by building only required ops, but many deployments use prebuilt libraries that include more kernels than needed.
  • TFLM: Smallest when configured well. You include only kernels you register, plus a tiny interpreter and an arena allocator.

RAM model

  • ONNX Runtime and TFLite: Both size and allocate tensor buffers at runtime, usually from heap-backed arenas whose behavior depends on the backend. Memory peaks can surprise you when delegates create extra buffers (for example, GPU or NPU tensor layout conversions).
  • TFLM: You allocate a fixed tensor arena, then TFLM plans all intermediate tensors into that arena. This makes worst-case RAM predictable and testable.
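
To make the fixed-arena idea concrete, here is a minimal sketch of greedy offset planning in plain Python. It is not TFLM's actual planner; the function name plan_arena, the 16-byte alignment and the tensor names, sizes and lifetimes are all assumptions chosen for illustration. The point it shows is that tensors whose lifetimes do not overlap can share the same bytes, so the arena can be smaller than the sum of all tensor sizes.

```python
# Simplified greedy arena planner: assigns byte offsets so that tensors whose
# lifetimes overlap never share memory. Illustrative only, not TFLM's code.

def plan_arena(tensors, alignment=16):
    """tensors: list of (name, size_bytes, first_use, last_use).
    Returns ({name: offset}, total_arena_bytes)."""
    def align(n):
        return (n + alignment - 1) // alignment * alignment

    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Place large tensors first; they are the hardest to fit.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        size = align(size)
        # Only tensors alive at the same time can conflict.
        live = sorted(p for p in placed if not (p[3] < first or last < p[2]))
        offset = 0
        for p_off, p_size, _, _ in live:
            if offset + size <= p_off:
                break  # the gap before this block is big enough
            offset = max(offset, p_off + p_size)
        placed.append((offset, size, first, last))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total


# Hypothetical model: each tuple is (name, bytes, first op index, last op index).
tensors = [
    ("input", 1000, 0, 1),
    ("conv1", 4000, 1, 2),
    ("conv2", 2000, 2, 3),
    ("output", 16, 3, 4),
]
offsets, arena_bytes = plan_arena(tensors)
print("offsets:", offsets)
print("arena bytes:", arena_bytes)
```

On a real MCU you would not plan offsets yourself; the practical takeaway is that the required arena size is a property of the model graph, so it can be measured once on the bench and enforced in CI.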

Latency and determinism

  • ONNX Runtime: Can be very fast on CPU and accelerators, but performance varies with EP selection, threading and graph optimizations. Determinism depends on your backend.
  • TFLite: Often excellent on ARM CPUs via XNNPACK and on Android via NNAPI. Determinism is usually good on CPU, less predictable when delegates partition the graph.
  • TFLM: Designed for embedded determinism. You can get stable inference timing if you avoid dynamic behaviors and use fixed input shapes.
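
Tail latency is easy to talk about and easy to forget to measure. The sketch below is a small, runtime-agnostic timing harness in Python; measure_latency and fake_inference are hypothetical names, and the stand-in workload would be replaced by your own call into ONNX Runtime or TFLite on a gateway. On an MCU the same idea applies, using a hardware cycle counter instead of time.perf_counter.

```python
import math
import time


def measure_latency(run_once, warmup=10, iterations=200):
    """Call run_once repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm caches and trigger any lazy allocations
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted samples.
        rank = math.ceil(p * len(samples) / 100)
        return samples[max(0, rank - 1)]

    return {
        "p50": percentile(50),
        "p99": percentile(99),
        "max": samples[-1],
        "mean": sum(samples) / len(samples),
    }


def fake_inference():
    # Stand-in workload; replace with a real inference call.
    return sum(i * i for i in range(2000))


stats = measure_latency(fake_inference)
print({k: round(v, 4) for k, v in stats.items()})
```

Comparing p99 against p50 across builds is usually more informative than the mean, since delegate fallbacks and threading effects tend to show up in the tail first.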

Operator coverage and quantization

Operator support and quantization strategy decide feasibility more often than raw performance does. Across all three stacks, quantization also changes your memory footprint and determines whether you can use optimized kernels.

Operator coverage

  • ONNX: Broad operator sets, but the exact supported set depends on the runtime and EP. A model that runs on ORT CPU may not fully offload to TensorRT or OpenVINO without edits.
  • TFLite: Strong coverage of common vision, audio and NLP blocks, but some TensorFlow ops do not convert cleanly. Custom ops exist but complicate deployment.
  • TFLM: Small curated set. If your model uses uncommon ops (complex control flow, exotic normalizations) you either re-architect the model or implement kernels.
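
A cheap way to catch coverage problems early is to compare the operators a model uses against what the target runtime build provides. The sketch below assumes you have already extracted both lists (for example from a model visualizer and from your firmware's op resolver); missing_ops is a hypothetical helper and the op names are illustrative.

```python
def missing_ops(model_ops, registered_ops):
    """Return the ops the model needs that the runtime build does not provide."""
    return sorted(set(model_ops) - set(registered_ops))


# Illustrative lists; in practice, extract these from your model and firmware.
model_ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE", "SOFTMAX", "GELU"]
registered_ops = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "RESHAPE", "SOFTMAX"}

gaps = missing_ops(model_ops, registered_ops)
if gaps:
    print("Model uses ops the target does not register:", gaps)
else:
    print("All model ops are covered.")
```

Running a check like this in CI turns a late firmware integration failure into an early, readable message.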

Quantization types that matter on edge

  • Float32: Easiest to get right numerically, often too slow or too large for MCUs.
  • Float16: Useful on some mobile GPUs and NPUs, less common on MCUs.
  • Int8 (full integer): The workhorse for edge. Enables CMSIS-NN on Cortex-M and many NPUs on mobile and embedded Linux.
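
For readers who have not worked with int8 directly, the sketch below shows the affine mapping that full integer quantization relies on: real_value = scale * (q - zero_point). The scale and zero point used here are made-up example values; in a real model they come from calibration and are stored per tensor (or per channel).

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to int8 using an affine scheme, clamping to the int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value represented by an int8 code."""
    return (q - zero_point) * scale


# Example parameters (assumed, not taken from any real model).
scale, zero_point = 0.05, -10

for value in (1.0, 0.123, 10.0, -100.0):
    q = quantize(value, scale, zero_point)
    print(value, "->", q, "->", round(dequantize(q, scale, zero_point), 4))
```

Values outside the representable range saturate at -128 or 127, which is why a calibration range that is too narrow shows up as clipped activations and lost accuracy.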

Quantization workflow differences

  • TFLite: Best-in-class post-training quantization tooling. Representative datasets and calibration are first-class concepts in the converter.
  • ONNX: Quantization is commonly done via ONNX Runtime quantization tools or upstream training/export. Models often use QuantizeLinear/DequantizeLinear (Q/DQ) nodes, then EPs fuse them.
  • TFLM: Inherits TFLite’s quantized model format, but you must ensure all ops have corresponding TFLM kernels and that your model avoids unsupported patterns.
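
To show what a representative dataset is used for, here is a deliberately simple min/max calibration sketch. Real converters use more elaborate strategies (per-channel weights, different range estimators), so treat calibrate as a hypothetical helper that illustrates the idea rather than what any specific tool does.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the observed range of calibration samples."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    lo = min(0.0, min(samples))
    hi = max(0.0, max(samples))
    if hi == lo:
        return 1.0, 0  # degenerate case: all samples are zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


# Made-up activation values standing in for a representative dataset.
activations = [-1.0, 0.2, 0.9, 3.0]
scale, zero_point = calibrate(activations)
print("scale:", scale, "zero_point:", zero_point)
```

If the calibration samples do not cover the values seen in the field, the derived range is wrong and activations clip, which is why calibration data should match your real sensor distribution.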

Hardware acceleration options (NPU, DSP, GPU and MCU accelerators)

Acceleration can make or break your power budget. The tradeoff is that accelerators often impose constraints: supported ops, quantization requirements, tensor layouts and static shapes.

ONNX Runtime execution providers

  • TensorRT EP: Strong on NVIDIA Jetson and GPU edge servers, prefers static shapes and supported ops.
  • OpenVINO EP: Targets Intel CPUs, iGPUs and VPUs, helpful for industrial gateways.
  • DirectML EP: Useful on Windows devices with compatible GPUs.
  • NNAPI EP: Can offload to Android accelerators, but behavior depends on device drivers.

TFLite delegates

  • XNNPACK: High-performance CPU delegate for ARM and x86, often the default on mobile and Linux.
  • NNAPI delegate: Android accelerator path.
  • GPU delegate: OpenCL or OpenGL ES on Android and Linux, Metal on iOS, best for some vision workloads.
  • Vendor delegates: Coral Edge TPU, NXP, Qualcomm, MediaTek and others, each with constraints.

TensorFlow Micro acceleration

  • CMSIS-NN: Optimized int8 kernels for ARM Cortex-M, a common boost for conv and fully connected layers.
  • Vendor libraries: Many MCU vendors provide optimized DSP and NN kernels, sometimes integrated via TFLM kernel implementations.
  • Dedicated MCU accelerators: Some MCUs include NPUs (often requiring int8 and supported op subsets).

Tooling and workflows (training, conversion, profiling, debugging)

Your day-to-day experience often depends more on tooling than on runtime performance. You should pick the stack that matches your training framework, CI pipeline and debugging constraints.

Training origin and export

  • PyTorch-first teams: Often export to ONNX, then validate in ONNX Runtime. You can still end at TFLite, but it usually requires an intermediate conversion path and additional checks.
  • TensorFlow-first teams: Usually export directly to TFLite and optionally to TFLM, with quantization integrated into the training pipeline.

Profiling and benchmarking

  • ONNX Runtime: Profiling tools can show per-node timing, EP assignment and graph optimizations, helpful for diagnosing partial offload.
  • TFLite: Strong on-device profiling on Android and via tooling, plus visibility into delegate partitioning.
  • TFLM: You typically instrument with MCU timers and measure cycles, then iterate by pruning ops and increasing CMSIS-NN coverage.

Debugging numeric drift

  • ONNX: Drift often comes from opset differences, different kernel implementations or EP fusions.
  • TFLite and TFLM: Drift often comes from quantization calibration, per-channel quantization choices and different reference versus optimized kernels.
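
Whatever the source of drift, it helps to quantify it the same way every time. The sketch below compares a reference output (for example, the float model) against a candidate (a quantized model or a different execution provider) using plain Python lists. The function names and the example vectors are assumptions; acceptable thresholds depend on your application.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max absolute error, cosine similarity) for two flat output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("output lengths differ")
    max_abs = max(abs(r - c) for r, c in zip(reference, candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(sum(c * c for c in candidate))
    cosine = dot / norm if norm else 1.0
    return max_abs, cosine


def top1_agrees(reference, candidate):
    """True if both outputs pick the same class index."""
    return reference.index(max(reference)) == candidate.index(max(candidate))


# Made-up softmax outputs from a float model and a quantized model.
float_out = [0.70, 0.20, 0.10]
int8_out = [0.68, 0.22, 0.10]

max_abs, cosine = compare_outputs(float_out, int8_out)
print("max abs error:", round(max_abs, 4))
print("cosine similarity:", round(cosine, 6))
print("top-1 agrees:", top1_agrees(float_out, int8_out))
```

Running this over a fixed golden set and tracking the worst case per release makes regressions visible when you change converters, opsets or providers.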

Portability and deployment patterns for IoT fleets

In real fleets, you deploy across multiple device classes: cameras on embedded Linux, gateways with GPUs, MCUs for sensing. A common pattern is to standardize on a training framework and then compile different inference artifacts per class.

Common fleet patterns

  • Gateway standardization: Export to ONNX for x86 and ARM64 gateways, use ONNX Runtime with CPU EP by default and hardware EPs where available.
  • Mobile and embedded Linux apps: Use TFLite for consistent packaging, smaller runtime and easier delegate integration.
  • Always-on sensing MCUs: Use TFLM with int8 models and strict memory budgeting.

Model versioning and compatibility

  • ONNX opsets: Treat opset as an API version. Pin it in CI and re-validate when you upgrade exporters or runtimes.
  • TFLite converter versions: Pin TensorFlow version in CI. Small converter changes can alter quantization and accuracy.
  • TFLM kernels: Kernel availability differs by version. Keep your model architecture aligned with what your firmware includes.

Security, model updates and reproducibility

  • Signed updates: Regardless of runtime, treat model artifacts as executable content. Sign and verify models during over-the-air (OTA) update.
  • Reproducible builds: Pin toolchain versions (TensorFlow/ONNX exporter, converter, runtime) and keep representative datasets for quantization calibration.
  • Attack surface: Larger runtimes have more parsing and kernel code. If you run untrusted models, prefer sandboxing (Linux) and strict update controls (MCUs).
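
As a sketch of the verify-before-load step, the example below authenticates a model blob with HMAC-SHA256 from the Python standard library. This is a simplification: HMAC uses a shared secret, whereas production OTA systems typically use asymmetric signatures so that devices hold only a public key. The key and model bytes are placeholders.

```python
import hashlib
import hmac


def sign_model(model_bytes, key):
    """Produce an authentication tag for a model artifact (build-server side)."""
    return hmac.new(key, model_bytes, hashlib.sha256).digest()


def verify_model(model_bytes, tag, key):
    """Check the tag before the model is handed to the runtime (device side)."""
    expected = hmac.new(key, model_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison


# Placeholder values for illustration only.
key = b"example-shared-secret"
model_bytes = b"\x1c\x00\x00\x00TFL3 pretend model payload"

tag = sign_model(model_bytes, key)
print("untouched model accepted:", verify_model(model_bytes, tag, key))
print("tampered model accepted:", verify_model(model_bytes + b"\x00", tag, key))
```

The important design point is ordering: verification happens on the raw bytes before any parser in the runtime touches them.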

Working code examples

The examples below show two common workflows: (1) converting a TensorFlow Keras model to an int8 TFLite model and (2) running an ONNX model with ONNX Runtime on a Linux gateway. The first needs the tensorflow and numpy Python packages. The second needs onnxruntime and numpy, plus a model.onnx file of your own.

Example 1: Convert a Keras model to fully quantized int8 TFLite

This script trains a tiny model on random data (so accuracy is not the point), then converts it to an int8 TFLite model using a representative dataset generator. In a real project, you replace the dummy data with a small calibration dataset that matches your sensor distribution.

# Converts a simple Keras model to a fully quantized int8 .tflite model using a representative dataset.
import numpy as np
import tensorflow as tf

# 1) Build a tiny model (example input: 10 features)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# 2) Train briefly on synthetic data
x = np.random.randn(2000, 10).astype(np.float32)
y = np.random.randint(0, 3, size=(2000,)).astype(np.int32)
model.fit(x, y, epochs=2, batch_size=64, verbose=0)

# 3) Representative dataset for calibration (replace with real samples)
def rep_data():
    for i in range(200):
        yield [x[i:i+1]]

# 4) Convert to int8 TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

print("Wrote model_int8.tflite, bytes:", len(tflite_model))

How this maps to TensorFlow Micro

This model_int8.tflite is also the starting point for TensorFlow Micro. In firmware you typically convert it into a C array (for example, with xxd -i) and then ensure every operator used by the graph is registered in your TFLM op resolver. If you see allocator failures on MCU, your first lever is often reducing tensor sizes (smaller input, fewer channels) then ensuring kernels use optimized int8 paths.
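
If xxd is not available on your build machine, the same conversion can be sketched in a few lines of Python. The helper name to_c_array, the variable name g_model and the alignas(16) qualifier are assumptions (the latter presumes a C++ firmware build); adjust them to your project's conventions. The demo uses stand-in bytes rather than a real model file.

```python
def to_c_array(data, var_name="g_model", per_line=12):
    """Render raw model bytes as a C/C++ array definition plus a length constant."""
    rows = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (
        f"alignas(16) const unsigned char {var_name}[] = {{\n"
        f"{body}\n"
        f"}};\n"
        f"const unsigned int {var_name}_len = {len(data)};\n"
    )


# Stand-in bytes for the demo. In practice:
#   data = open("model_int8.tflite", "rb").read()
# Bytes 4..7 of a .tflite file spell "TFL3", the flatbuffer file identifier.
demo = bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33])
source = to_c_array(demo)
print(source)
```

Generating the array in the build script, instead of committing it, keeps the firmware image tied to a specific, reproducible model file.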

Example 2: Run ONNX inference with ONNX Runtime (CPU) on a gateway

This example loads an ONNX model and runs it with NumPy input. It prints input and output tensor names and shapes, which helps you wire the model into a sensor pipeline. It assumes you already have a valid model.onnx exported from your training framework.

# Loads an ONNX model and runs a single inference with ONNX Runtime on CPU.
import numpy as np
import onnxruntime as ort

# Create an inference session that explicitly requests the CPU execution provider
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Inspect inputs and outputs
inputs = sess.get_inputs()
outputs = sess.get_outputs()
print("Inputs:")
for i in inputs:
    print(" ", i.name, i.shape, i.type)
print("Outputs:")
for o in outputs:
    print(" ", o.name, o.shape, o.type)

# Prepare one input. Adjust name, dtype and shape to match your model.
input_name = inputs[0].name
shape = [d if isinstance(d, int) else 1 for d in inputs[0].shape]  # replace dynamic dims with 1
x = np.random.randn(*shape).astype(np.float32)

# Run inference
out = sess.run([outputs[0].name], {input_name: x})
print("Output[0] shape:", np.array(out[0]).shape)

Notes for acceleration

On real deployments you often switch to an accelerator EP (for example, TensorRT on Jetson). That changes both performance and sometimes which ops are supported. Validate correctness with a golden test set whenever you change providers.
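
One way to handle mixed hardware is to pick providers from a preference list and always keep the CPU provider as the last fallback. The sketch below is pure Python so it runs anywhere; choose_providers and the preference order are assumptions. On a device you would pass the result of onnxruntime.get_available_providers() as the available list and hand the returned list to InferenceSession.

```python
def choose_providers(available, preferred):
    """Keep preferred providers that exist on this device, then fall back to CPU."""
    chosen = [p for p in preferred if p in available and p != "CPUExecutionProvider"]
    chosen.append("CPUExecutionProvider")
    return chosen


# Assumed preference order for a fleet with NVIDIA and Intel SKUs.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",
]

# Hard-coded here for the demo; on a device use ort.get_available_providers().
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = choose_providers(available, PREFERRED)
print(providers)

# Usage with ONNX Runtime (not executed here):
#   sess = ort.InferenceSession("model.onnx", providers=providers)
```

Logging the chosen list at startup makes it much easier to explain performance differences between otherwise identical gateways.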

Decision guide and common scenarios

If you are still stuck on ONNX vs TFLite vs TensorFlow Micro, decide based on your device class first, then work backward to the training and conversion pipeline.

Scenario A: Cortex-M MCU doing keyword spotting

  • Pick: TensorFlow Micro
  • Why: You need predictable RAM usage, int8 kernels and no OS assumptions.
  • Watch: Ensure your model uses only TFLM-supported ops, budget the tensor arena early and measure cycles under real clock settings.

Scenario B: Battery camera on embedded Linux (ARM64) doing person detection

  • Pick: TFLite in many cases
  • Why: Good CPU performance via XNNPACK, delegates for NPUs, stable deployment story.
  • Watch: Delegate partitioning can cause extra copies. Prefer models designed for your target accelerator’s op set.

Scenario C: Industrial gateway with mixed hardware SKUs

  • Pick: ONNX with ONNX Runtime
  • Why: You can keep one artifact format and swap execution providers per SKU (CPU, OpenVINO, TensorRT).
  • Watch: Pin opset and EP versions in CI, test numeric consistency across providers.

Scenario D: You train in PyTorch but need to run on an MCU

  • Pick: Often you end at TensorFlow Micro, but your path matters
  • Why: MCU constraints push you toward TFLM regardless of training framework.
  • How: Consider designing the model with TFLite/TFLM operator constraints in mind, then export and validate early.

Gotchas and anti-patterns

Assuming “supports int8” means “fast int8”

Many stacks can run int8 models but still fall back to slow reference paths for some ops. Verify that your critical layers use optimized kernels (CMSIS-NN for TFLM, XNNPACK or delegate kernels for TFLite, EP-fused int8 kernels for ONNX Runtime).

Letting conversion drive your architecture too late

Whichever of the three stacks you target, conversion failures tend to surface at the end of a project. Instead, test export and conversion from day one with a minimal model, then evolve the architecture while keeping it deployable.

Ignoring tensor layout and pre-processing costs

On edge devices, pre-processing can dominate. Align your input pipeline (resize, normalization, MFCC features for audio) with your chosen runtime and accelerator to avoid extra copies or float conversions.

Not treating the model as part of your firmware or app API

Changing input shape, normalization constants or label ordering breaks downstream logic. Version your model with an explicit contract: input tensor metadata, expected ranges, pre-processing steps and output semantics.
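
A contract like this can live next to the model as a small, versioned data structure that both the build pipeline and the application check. The sketch below is one possible shape for it; the ModelContract class, its field names and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelContract:
    """Everything downstream code is allowed to assume about a model."""
    model_version: str
    input_shape: tuple
    input_dtype: str
    norm_mean: float
    norm_std: float
    labels: tuple

    def violations(self, shape, dtype, num_outputs):
        """Return a list of human-readable mismatches (empty means compatible)."""
        problems = []
        if tuple(shape) != self.input_shape:
            problems.append(f"input shape {tuple(shape)} != {self.input_shape}")
        if dtype != self.input_dtype:
            problems.append(f"input dtype {dtype} != {self.input_dtype}")
        if num_outputs != len(self.labels):
            problems.append(f"{num_outputs} outputs but {len(self.labels)} labels")
        return problems


# Example contract with made-up values.
contract = ModelContract(
    model_version="1.2.0",
    input_shape=(1, 10),
    input_dtype="int8",
    norm_mean=0.0,
    norm_std=1.0,
    labels=("idle", "walk", "run"),
)

ok = contract.violations((1, 10), "int8", 3)
bad = contract.violations((1, 12), "float32", 3)
print("compatible model:", ok)
print("incompatible model:", bad)
```

Checking the contract at load time means a mismatched model fails loudly at startup instead of producing plausible but wrong predictions.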

Conclusion

The choice between ONNX, TFLite and TensorFlow Micro comes down to your target hardware and how much control you need over footprint and acceleration. Use ONNX when you want cross-framework portability and flexible backends on gateways. Use TFLite when you want a mature edge runtime with strong quantization and delegate support. Use TensorFlow Micro when you need deterministic, tiny inference on microcontrollers with strict RAM and flash limits.