February 4, 2025 · Daniel Osei · engineering

Building Real-Time Transaction Classification Under 200ms

A technical deep dive into how we designed Spendaq's classification engine to hit median latency under 200ms — the threshold for synchronous API integration in banking apps.

Why 200ms is the threshold that matters

When we were designing Spendaq's classification pipeline, the first architectural question wasn't "how accurate can we get" — it was "what latency makes synchronous integration viable?" The answer is grounded in how banking apps actually call external APIs.

A synchronous enrichment call in a banking transaction feed has to complete before the response is returned to the user. If your app fetches recent transactions and enriches them in the same request path, the enrichment latency adds directly to the page load time. Acceptable synchronous API latency in a mobile banking app is typically under 300ms total for the enrichment hop — factoring in your own network overhead, you need the classification service to return in under 200ms for that to work comfortably. Above that threshold, you're forced into async architecture: fetch raw, display immediately, patch categories via webhook when enrichment finishes. Async works, but it introduces category flicker (the user sees the raw category, then watches it change) and complicates your state management considerably.
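The budget arithmetic above can be written down as a small check. This is a minimal sketch, not part of any Spendaq SDK: the function name is hypothetical, and the 100ms of caller-side overhead is only the gap implied by the 300ms and 200ms figures.

```python
def integration_mode(service_latency_ms: float,
                     network_overhead_ms: float,
                     total_budget_ms: float = 300.0) -> str:
    """Return "sync" if the whole enrichment hop fits the budget, else "async".

    total_budget_ms defaults to the ~300ms figure quoted in the text;
    network_overhead_ms is whatever the caller measures on its own side.
    """
    hop_ms = service_latency_ms + network_overhead_ms
    return "sync" if hop_ms <= total_budget_ms else "async"


# A 200ms service with ~100ms of overhead just fits the budget.
print(integration_mode(200, 100))
# A 450ms service does not, which forces the async pattern.
print(integration_mode(450, 100))
```

Running the same check against your own measured overhead is a quick way to see how much headroom the synchronous path leaves.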

The 200ms target isn't arbitrary — it's what separates "drop into your synchronous request path" from "requires a separate async pipeline." We built to the synchronous threshold because that's what makes integration a one-sprint task rather than a multi-month data architecture project.

Where the latency actually lives

Achieving sub-200ms median latency on a machine learning classification task requires understanding precisely where time is spent. For a classification service like this, the time budget roughly breaks down across four phases: network transit, model load and feature extraction, inference, and response serialization. Of these, inference is often assumed to be the bottleneck, but in practice, model load and feature extraction are more commonly the problem in naive implementations.
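One way to find out where the time goes is to wrap each server-side phase in a timer. The sketch below is illustrative only: the `PhaseTimer` class and the stand-in work inside each phase are assumptions, and network transit has to be measured from the client side rather than here.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock milliseconds per named phase."""

    def __init__(self):
        self.elapsed_ms = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            spent = (time.perf_counter() - start) * 1000.0
            self.elapsed_ms[name] = self.elapsed_ms.get(name, 0.0) + spent


timer = PhaseTimer()

with timer.phase("feature_extraction"):
    tokens = "STAPLES #1182 CHARLOTTE NC".lower().split()

with timer.phase("inference"):
    score = len(tokens) / 10.0  # stand-in for a real model call

with timer.phase("serialization"):
    payload = str({"score": score})

# The phase with the largest share of the budget is the one to optimize first.
slowest = max(timer.elapsed_ms, key=timer.elapsed_ms.get)
print(slowest, timer.elapsed_ms)
```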

Model loading: the warm-path requirement

In early prototypes, we observed that cold-starting a classification model per request was taking 40-80ms per call just on model initialization — before a single transaction was evaluated. The fix is well-understood in ML serving: keep models warm in memory and avoid per-request initialization. We run classification models in a persistent serving process with pre-warmed state. The models are small enough (quantized, not a large language model) that they fit comfortably in worker memory. First-request cold start is irrelevant once the process is running.
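The warm-path idea can be illustrated with a process-level cache around the loader. This is a sketch under assumptions: `_load_model` is a hypothetical stand-in that sleeps to simulate initialization cost, and a real serving process would load actual model weights instead.

```python
import functools
import time

LOAD_COUNT = 0  # counts how many times the expensive load actually runs


def _load_model():
    """Hypothetical stand-in for an expensive model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)  # simulate roughly 50ms of initialization
    return lambda features: "Office Supplies"


@functools.lru_cache(maxsize=1)
def get_model():
    """Load once per process; every later call returns the cached model."""
    return _load_model()


# Warm the model at process start so no request pays the load cost.
get_model()


def classify(features):
    return get_model()(features)


print(classify({"merchant_canonical": "staples"}))
print(classify({"merchant_canonical": "staples"}))
print("model loads:", LOAD_COUNT)
```

The per-request path only pays for a cache lookup; the initialization cost is paid once, before traffic arrives.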

Feature extraction: the merchant normalization step

The most computationally intensive step before inference isn't the model — it's merchant name normalization. Raw merchant descriptors from open banking feeds are noisy: uppercase, punctuation artifacts, location codes appended to business names, processor prefixes. A transaction from a regional office supply chain might arrive as STAPLES #1182 CHARLOTTE NC in one bank's feed and STAPLS STORE 1182 in another. Before we can classify, we need to resolve these to a canonical merchant identifier.

Our normalization pipeline combines three steps: a lookup cache of canonical merchant IDs for merchants we've seen before (the hit rate improves significantly over the first few weeks on a new feed), a fuzzy matching step for near-misses, and a fallback to token-level feature extraction for genuinely new merchants. The cache hit path is fast, under 5ms. The fallback path takes longer and also carries lower confidence, which is reflected in the confidence score we return.
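The three-step lookup order can be sketched with the standard library. Everything here is an assumption made for illustration: the cache contents, the stopword list, the longest-prefix cache lookup, and the use of `difflib` with a 0.8 cutoff are stand-ins, not the production matcher.

```python
import difflib
import re

# Hypothetical cache: cleaned descriptor -> canonical merchant ID.
CACHE = {"staples": "staples", "home depot": "home_depot"}

_NOISE = re.compile(r"[^a-z ]+")          # digits and punctuation
_STOPWORDS = {"store", "inc", "llc", "pos"}


def clean(descriptor: str) -> str:
    """Lowercase, strip digits/punctuation, and drop filler tokens."""
    text = _NOISE.sub(" ", descriptor.lower())
    return " ".join(t for t in text.split() if t not in _STOPWORDS)


def normalize(descriptor: str):
    """Return (canonical_id or None, path), path in {"cache", "fuzzy", "tokens"}."""
    cleaned = clean(descriptor)
    tokens = cleaned.split()
    # 1. Cache path: longest leading run of tokens that is a known merchant,
    #    which drops trailing location codes such as "charlotte nc".
    for end in range(len(tokens), 0, -1):
        key = " ".join(tokens[:end])
        if key in CACHE:
            return CACHE[key], "cache"
    # 2. Fuzzy path: near-misses such as dropped letters.
    close = difflib.get_close_matches(cleaned, list(CACHE), n=1, cutoff=0.8)
    if close:
        return CACHE[close[0]], "fuzzy"
    # 3. Fallback: no canonical ID; downstream code works from raw tokens.
    return None, "tokens"


print(normalize("STAPLES #1182 CHARLOTTE NC"))
print(normalize("STAPLS STORE 1182"))
print(normalize("ZQX HOLDINGS 44"))
```

Returning the path alongside the ID lets the caller lower the confidence score when the result came from the fuzzy or token fallback.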

Inference: keeping the model appropriate to the task

There is a temptation in ML systems design to use the most powerful available model for any classification task. For transaction categorization, this is wrong. A large foundation model running inference on every transaction would push latency into the 500ms-2s range, and wouldn't materially improve accuracy over a well-trained, smaller model for this structured task. Transaction categorization is not a hard language understanding problem. The signal is largely in the merchant identity, the MCC if available, the amount range, and the transaction pattern. A gradient-boosted classifier or a compact neural text classifier trained specifically on business transaction data outperforms general-purpose LLMs on this task while running inference in under 10ms.
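To make "structured task" concrete, here is a rough sketch of the kind of feature row a small classifier could consume. The field names on the input, the bucket boundaries, and the sign convention (negative amounts are debits) are all assumptions for illustration.

```python
import bisect

# Hypothetical bucket boundaries for the "amount range" signal.
AMOUNT_EDGES = [10, 50, 200, 1000, 5000]


def build_features(txn: dict) -> dict:
    """Turn one transaction into a flat feature row."""
    amount = abs(txn["amount"])
    return {
        "merchant_canonical": txn.get("merchant_canonical") or "<unknown>",
        "mcc": txn.get("mcc") or "<missing>",
        # Index of the bucket the amount falls into (0 = smallest).
        "amount_bucket": bisect.bisect_right(AMOUNT_EDGES, amount),
        "is_debit": txn["amount"] < 0,
    }


row = build_features(
    {"amount": -42.50, "merchant_canonical": "staples", "mcc": "5943"}
)
print(row)
```

A handful of categorical and bucketed fields like these is a very different input from free text, which is why a compact model is enough.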

We're not saying large models have no role in financial data — they're useful for tasks like intent classification from free-text transaction memos or generating narrative summaries. But for high-volume, latency-sensitive, structured classification of transaction records, the model has to be appropriate for the task, not impressive-sounding.

The batch-vs-streaming architecture decision

Spendaq supports both synchronous batch classification (POST an array of transaction objects, receive classified results) and streaming via webhook delivery for event-driven integrations. The latency guarantee applies to the synchronous batch path. For a batch of up to 200 transactions, median response time is under 200ms. Batching amortizes the per-call overhead — network round-trip, TLS handshake, request routing — across multiple transactions, so larger batches are more efficient per transaction than individual calls.
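On the client side, staying within the 200-transaction batch size is a matter of chunking before each POST. A minimal sketch, with a hypothetical helper name:

```python
from typing import Iterator, List

MAX_BATCH = 200  # batch size the latency figure in the text applies to


def batches(transactions: List[dict], size: int = MAX_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` transactions."""
    for start in range(0, len(transactions), size):
        yield transactions[start:start + size]


txns = [{"transaction_id": f"txn_{i}"} for i in range(450)]
sizes = [len(chunk) for chunk in batches(txns)]
print(sizes)
```

Each chunk pays the per-call overhead once, so 450 transactions cost three round-trips instead of 450.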

In a typical integration pattern, a neobank's backend receives a webhook from Plaid signaling new transactions, fetches the transaction batch, POSTs it to Spendaq's /v1/classify endpoint, and receives corrected categories. The round-trip is fast enough to run synchronously inside the Plaid webhook handler, before the neobank writes to its own database. The app never stores uncorrected categories at all, which is the cleanest integration architecture.
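The ordering of that handler (fetch, classify, then write) can be sketched with the I/O injected as plain callables. The function and parameter names are hypothetical, and the stubs below stand in for the Plaid fetch, the POST to /v1/classify, and the database write; only the response field names come from the sample response in this post.

```python
def handle_new_transactions(fetch_batch, classify_batch, write_rows):
    """Classify a batch before anything is persisted."""
    raw = fetch_batch()
    results = {r["transaction_id"]: r for r in classify_batch(raw)}
    rows = []
    for txn in raw:
        result = results.get(txn["transaction_id"])
        rows.append({
            "transaction_id": txn["transaction_id"],
            # Fall back to the raw category if a result is missing.
            "category": result["corrected_category"] if result else txn["raw_category"],
            "confidence_score": result["confidence_score"] if result else None,
        })
    write_rows(rows)  # the first and only write already has corrected data
    return rows


stored = []
handle_new_transactions(
    fetch_batch=lambda: [
        {"transaction_id": "txn_8a3f92c", "raw_category": "DEBIT_MISC"},
    ],
    classify_batch=lambda batch: [
        {"transaction_id": "txn_8a3f92c",
         "corrected_category": "Office Supplies",
         "confidence_score": 0.94},
    ],
    write_rows=stored.extend,
)
print(stored)
```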

Confidence scores and the fallback contract

Every classification response includes a confidence_score field, ranging from 0.0 to 1.0. This is not a cosmetic field — it is a contract signal that tells the consuming application how much to trust the category assignment.

{
  "transaction_id": "txn_8a3f92c",
  "raw_category": "DEBIT_MISC",
  "corrected_category": "Office Supplies",
  "confidence_score": 0.94,
  "merchant_canonical": "staples",
  "classification_ms": 138
}

Banking product teams can use the confidence score to make UI decisions: display the corrected category as-is when confidence is above 0.85, show the category with a "review" affordance when it's between 0.60 and 0.85, and fall through to a user-categorization prompt below 0.60. This three-tier pattern means you're never silently wrong on a low-confidence classification — you're surfacing uncertainty in the UI instead of presenting a wrong answer as fact.
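The three-tier pattern maps directly onto a small function. The thresholds are the ones given above; the function name and the returned labels are hypothetical, and treating exactly 0.85 and exactly 0.60 as "review" is one reading of the boundaries.

```python
def ui_treatment(confidence_score: float) -> str:
    """Map a confidence_score to one of three UI treatments."""
    if confidence_score > 0.85:
        return "display"       # show the corrected category as-is
    if confidence_score >= 0.60:
        return "review"        # show it with a "review" affordance
    return "prompt_user"       # ask the user to categorize


for score in (0.94, 0.72, 0.41):
    print(score, ui_treatment(score))
```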

The p50 latency of 138ms is what we cite. The p99 is higher, typically 180-220ms depending on the proportion of merchant name fallbacks in the batch. At the top of that range p99 exceeds the 200ms service target, although it still fits the roughly 300ms enrichment budget on most network configurations. A p99 under 200ms is an ongoing engineering target rather than a current guarantee. Any integration that requires hard p99 guarantees should use the async webhook path.
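If you log the classification_ms values you receive, you can track both percentiles on your own traffic. A minimal sketch using the standard library; the helper name is hypothetical and the sample values are made up for illustration.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50 and p99 from a list of latency samples (needs >= 2 values)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p99": cuts[98]}


# Made-up samples: mostly fast responses plus a few slow fallback-heavy batches.
samples = [130, 135, 138, 140, 142, 150, 155, 160, 185, 215]
stats = latency_percentiles(samples)
print(stats)
```

Watching p99 alongside p50 shows how often fallback-heavy batches occur on your feed.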

What the next latency improvements look like

The current bottleneck on p99 is the fuzzy merchant matching fallback for genuinely novel merchant descriptors. We're building a supplementary lookup from card network MCC data (available in the ISO 8583 payment authorization message for card transactions) to reduce fallback rate. When the MCC is available and high-signal (e.g., MCC 7372, Computer Programming, Data Processing, and Integrated Systems Design Services), we can assign a business-context category with high confidence even when merchant name normalization fails.
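The supplementary lookup amounts to one more step in the resolution order. The sketch below is an assumption about shape only: the function name, the small MCC table, and the category labels are illustrative, not the production mapping.

```python
# Illustrative MCC -> category table (a real one would be far larger).
MCC_CATEGORIES = {
    "5943": "Office Supplies",
    "4900": "Utilities",
    "5812": "Meals & Restaurants",
}


def categorize(merchant_category, mcc):
    """Prefer the merchant-based category; use the MCC when that fails.

    Returns (category or None, source), source in
    {"merchant", "mcc", "fallback"}.
    """
    if merchant_category is not None:
        return merchant_category, "merchant"
    if mcc in MCC_CATEGORIES:
        return MCC_CATEGORIES[mcc], "mcc"
    return None, "fallback"  # hand off to the slower fuzzy/token path


print(categorize("Office Supplies", "5943"))
print(categorize(None, "4900"))
print(categorize(None, None))
```

Only the last case still reaches the slow fallback, which is the path that drives p99.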

For ACH transactions, which carry no MCC, the problem is harder. ACH records carry a Standard Entry Class (SEC) code, an originator company name, and a short entry description in the NACHA batch header. Payroll, rent, vendor payments, loan repayments, and owner draws all arrive as ACH debits or credits with text that varies by originator. This is where the contextual pattern-matching layer (transaction history, counterparty recurrence, amount clustering) becomes essential, and where the latency budget is most at risk. That's the engineering work that will take the classification engine from good to genuinely comprehensive for SMB transaction populations.
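Counterparty recurrence and amount clustering can be expressed as a few summary features per counterparty. This is a sketch under assumptions: the input shape, the function name, and the choice of median gap and coefficient of variation are illustrative, not a description of the production layer.

```python
from datetime import date
from statistics import mean, median, pstdev


def recurrence_features(history):
    """Summarize one counterparty's history of (posted_date, amount) pairs."""
    if not history:
        return {"occurrences": 0, "median_gap_days": None, "amount_cv": None}
    rows = sorted(history)
    dates = [d for d, _ in rows]
    amounts = [abs(a) for _, a in rows]
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    avg = mean(amounts)
    return {
        "occurrences": len(rows),
        # A steady gap of about 14 or 30 days suggests payroll, rent, or a loan.
        "median_gap_days": median(gaps) if gaps else None,
        # Near-zero variation means the amounts cluster tightly.
        "amount_cv": pstdev(amounts) / avg if avg else None,
    }


payroll_like = [
    (date(2025, 1, 3), -4200.0),
    (date(2025, 1, 17), -4200.0),
    (date(2025, 1, 31), -4200.0),
]
features = recurrence_features(payroll_like)
print(features)
```

Features like these have to be computed or cached ahead of the request, since building them from raw history inside the request path is exactly where the latency budget is at risk.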
