The hardest part of building an accurate transaction categorization system isn't the machine learning. It's the data problem that precedes it. Every company that uses an accounting system has a chart of accounts, and almost no two charts of accounts look alike — even in the same industry, the same revenue bracket, or the same ERP. Account codes, account names, hierarchy structure, cost center conventions: all of it is bespoke, built up over years by Controllers and accounting managers making pragmatic decisions about how to organize their particular company's financial data.
If you build a categorization model that maps transactions to a generic taxonomy — "Travel," "Software," "Office Supplies" — you've solved the easy problem. The hard problem is mapping a transaction to your account 7412, not some abstract "Software Subscriptions" category that may or may not correspond to how your company actually thinks about software spend. This post describes how we approached that problem when building Spendaq's GL mapping engine.
Why Generic Taxonomies Don't Work
We spent time in early 2024 looking at existing categorization approaches — both what ERPs do natively and what some expense management tools offer as add-ons. The pattern we kept seeing was a two-stage process: first, classify the transaction into a generic category ("Meals & Entertainment," "Travel," "IT Services"), then do a secondary mapping step where someone configures which of their GL accounts corresponds to each generic category.
The problem with this approach is that the secondary mapping step is where most of the complexity lives, and most tools treat it as a simple lookup table. In practice, the relationship between generic categories and actual GL accounts is many-to-many, context-dependent, and subject to change. An AWS charge might go to 6550 Cloud Infrastructure for the engineering department or 7410 Software Subscriptions for marketing's analytics stack — same vendor, different account, depending on who's spending and on what. A generic classifier that outputs "Software" can't make that distinction because it doesn't know your cost center structure.
The more fundamental problem: every intermediate abstraction layer introduces another place for errors to accumulate. If the generic classifier misses "IT Services" and calls something "Professional Services," the downstream GL mapping will be wrong even if the mapping rules are perfect.
Training Directly on Customer Account Structures
When we started building the Spendaq engine, Daniel's core architectural decision was to eliminate the generic taxonomy layer entirely. We don't classify transactions into our categories. We classify transactions into your categories — meaning we train a mapping model whose output space is your actual chart of accounts.
This creates an onboarding requirement: we need your chart of accounts before we can do anything useful. The ingestion process reads your COA and builds an account embedding space — essentially a representation of what each account means based on its name, its position in the account hierarchy, any account descriptions you've defined, and the transaction history associated with it. Account 7412 SaaS Platform Costs is represented differently than 7410 Software Subscriptions and differently than 6550 Cloud Infrastructure, because those accounts have different names, different hierarchical parents, and (after onboarding) different transaction histories.
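To make the idea of an account representation concrete, here is a minimal sketch. It uses a bag-of-words vector as a stand-in for a learned embedding and leaves out transaction history entirely, so it illustrates the inputs rather than the engine's actual method. The Account type, the helper names, and the parent hierarchies shown are all assumptions made up for this example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    code: str
    name: str
    parent_path: tuple[str, ...]  # hierarchy from root to direct parent (illustrative)
    description: str = ""


def account_vector(account: Account) -> Counter:
    """Bag-of-words stand-in for a learned account embedding."""
    text = " ".join([account.name, *account.parent_path, account.description])
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical hierarchy: the parent names below are invented for the example.
coa = [
    Account("6550", "Cloud Infrastructure", ("Operating Expenses", "Engineering")),
    Account("7410", "Software Subscriptions", ("Operating Expenses", "Software")),
    Account("7412", "SaaS Platform Costs", ("Operating Expenses", "Software")),
]
vectors = {a.code: account_vector(a) for a in coa}
```

Even in this toy version, 7410 and 7412 sit closer to each other than either does to 6550 because they share a parent, while still remaining distinct accounts.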
The transaction classifier then maps incoming transaction data (vendor name, amount, date, any available merchant category code, card program metadata) to a probability distribution over your account space. The highest-probability account becomes the suggestion. Suggestions that score above a confidence threshold auto-approve; everything else routes to the reviewer queue.
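The suggest-then-route step can be sketched in a few lines. This assumes the classifier emits one raw score per account code, and it uses the top probability directly as the confidence, which is a simplification. The function names and the 0.90 default are illustrative assumptions, not values from the production engine.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw per-account scores into a probability distribution."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {code: math.exp(s - peak) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def route(scores: dict[str, float], threshold: float = 0.90) -> tuple[str, float, str]:
    """Pick the most probable account and decide whether it auto-approves."""
    probs = softmax(scores)
    account = max(probs, key=probs.get)
    confidence = probs[account]
    decision = "auto_approve" if confidence >= threshold else "review_queue"
    return account, confidence, decision


clear_case = route({"6550": 5.0, "7410": 1.0, "7412": 0.5})
close_call = route({"6550": 1.0, "7410": 0.9, "7412": 0.2})
```

Here clear_case auto-approves to 6550, while close_call still suggests 6550 but lands in the review queue because the top two accounts are nearly tied.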
The Vendor Normalization Layer
Before any classification happens, raw vendor strings go through a normalization step. This is unglamorous but critical. Corporate card transaction data in particular arrives with merchant descriptors that are highly inconsistent: AMZN MKTP US*1K8J3, AMAZON.COM*RT4GK, and AMAZON WEB SERV are three different strings that may refer to three different actual vendors (Amazon Marketplace, Amazon.com retail, Amazon Web Services) — or the first two may be the same vendor with different order codes appended.
We maintain a normalized vendor database that maps raw descriptor strings to canonical vendor names and, where available, vendor categories. The normalization layer runs before the classifier sees the transaction, so the classifier works with "Amazon Web Services" rather than AMAZON WEB SERV. This significantly improves the classifier's ability to match against your historical transaction patterns for that vendor.
The normalization database is not static — it learns from corrections. When a reviewer reclassifies a transaction and changes the vendor match, we update the normalization mapping for future transactions with that descriptor. This is one of the feedback loops that drives ongoing accuracy improvement.
Confidence Scoring and the Review Queue
Not every classification decision deserves the same level of automation. A charge from a vendor that has appeared in your books 40 times over the last 18 months, always coding to the same account, with an amount consistent with the historical range — that's a high-confidence suggestion. A charge from a vendor that appears for the first time, with a merchant descriptor that doesn't normalize cleanly, in an amount that doesn't match any obvious pattern — that's a low-confidence suggestion that should go to a human.
The confidence score we attach to each classification suggestion is calculated from several signals: vendor familiarity (how many times has this normalized vendor appeared in this account's history), amount consistency (does the amount fall within the observed range for this vendor-account pair), descriptor match quality (how cleanly did the raw string normalize), and contextual signals like department context from card program metadata or PO references in AP invoices.
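One simple way to combine those signals is a weighted average, sketched below. The weights, the saturation point of 40 occurrences, and every name in the block are assumptions chosen for illustration; the post does not specify how the production score blends its inputs.

```python
from dataclasses import dataclass


@dataclass
class VendorAccountHistory:
    occurrences: int
    min_amount: float
    max_amount: float


def vendor_familiarity(history: VendorAccountHistory, saturation: int = 40) -> float:
    """Fraction of a 'fully familiar' history, capped at 1.0."""
    return min(history.occurrences / saturation, 1.0)


def amount_consistency(amount: float, history: VendorAccountHistory) -> float:
    """1.0 inside the observed range, decaying linearly outside it."""
    if history.occurrences == 0:
        return 0.0
    if history.min_amount <= amount <= history.max_amount:
        return 1.0
    span = max(history.max_amount - history.min_amount, 1.0)
    distance = min(abs(amount - history.min_amount), abs(amount - history.max_amount))
    return max(0.0, 1.0 - distance / span)


# Hypothetical weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"familiarity": 0.4, "amount": 0.3, "descriptor": 0.2, "context": 0.1}


def confidence(amount: float, history: VendorAccountHistory,
               descriptor_match: float, context_match: float) -> float:
    signals = {
        "familiarity": vendor_familiarity(history),
        "amount": amount_consistency(amount, history),
        "descriptor": descriptor_match,  # 0..1, how cleanly the string normalized
        "context": context_match,        # 0..1, e.g. department or PO agreement
    }
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


established = VendorAccountHistory(occurrences=40, min_amount=100.0, max_amount=200.0)
first_seen = VendorAccountHistory(occurrences=0, min_amount=0.0, max_amount=0.0)
high = confidence(150.0, established, descriptor_match=1.0, context_match=1.0)
low = confidence(500.0, first_seen, descriptor_match=0.3, context_match=0.0)
```

The established vendor scores close to 1.0, while the first-time vendor with a messy descriptor scores near zero and would route to a reviewer under any sensible threshold.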
We calibrate the confidence threshold for each account, not globally. A high-volume, low-variance account like 6200 Payroll Tax should require very high confidence before auto-approving. A low-volume, high-variance account where manually reviewed exceptions are common should have a higher auto-approve threshold as well, so that more of its transactions surface to human review rather than auto-approving incorrectly.
What Happens When the Classifier Is Wrong
We're not saying this system is infallible — no classification system is. The goal isn't to eliminate human review; it's to reduce it to cases where human judgment adds real value. When a reviewer overrides a classification suggestion, that override is a training signal. We log the original suggestion, the confidence score, the reviewer's correction, and the transaction features that drove the original suggestion. That data feeds into the retraining cycle.
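As a sketch of what such an override record might look like, the block below captures the four logged items in a dataclass and serializes each one as a JSON line. The field names, the in-memory list used as a sink, and the helper functions are all hypothetical stand-ins for whatever storage the real pipeline uses.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class OverrideRecord:
    transaction_id: str
    suggested_account: str   # what the classifier proposed
    confidence: float        # score attached to that proposal
    corrected_account: str   # what the reviewer chose instead
    features: dict           # transaction features behind the suggestion
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_override(record: OverrideRecord, sink: list[str]) -> None:
    """Append the record as one JSON line; `sink` stands in for real storage."""
    sink.append(json.dumps(asdict(record), sort_keys=True))


def training_examples(sink: list[str]):
    """Yield (features, corrected label) pairs for the next retraining run."""
    for line in sink:
        row = json.loads(line)
        yield row["features"], row["corrected_account"]


sink: list[str] = []
log_override(
    OverrideRecord(
        transaction_id="txn-1",
        suggested_account="7410",
        confidence=0.82,
        corrected_account="6550",
        features={"vendor": "Amazon Web Services", "department": "Engineering"},
    ),
    sink,
)
examples = list(training_examples(sink))
```

Keeping the original suggestion and its confidence alongside the correction makes it possible to study where the model was confidently wrong, not just what the right answer was.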
Retraining runs on a rolling basis as correction data accumulates. In practice, new customers see meaningful accuracy improvement in the first 30–60 days as the retraining cycle incorporates their correction patterns. After 90 days of production use, most customers are operating at 95%+ accuracy on recurring vendor-account pairs, with the residual review queue consisting mostly of genuinely novel transactions that any system would route to human judgment.
The Multi-Entity Complexity
One design challenge we underestimated initially was multi-entity account structures. A company that runs multiple subsidiaries through a shared ERP instance may have charts of accounts that share some account codes but not others, or use the same account names with different codes, or map the same economic activity to different accounts by entity for reporting reasons. An AWS charge that belongs in 6550 for the parent entity might belong in 8120 for a subsidiary that uses a different COA structure.
We handle this by treating each entity's account structure as a separate classifier instance, with shared vendor normalization but entity-specific account embeddings and transaction history. The performance overhead of this approach is meaningful, since we're maintaining N classifier instances instead of one, but the accuracy benefit is significant enough that we consider it non-negotiable for multi-entity customers. A single shared classifier trained across entity boundaries introduces the exact kind of cross-entity coding errors that multi-entity Finance teams spend significant close time cleaning up.
What We're Still Working On
The current engine performs well on recurring vendor-account patterns and degrades gracefully on novel vendor-account combinations by routing to human review. Where it still struggles: transactions with genuinely ambiguous economic character — a catered lunch billed by a restaurant that could be coded to 6312 Meals & Entertainment, 7100 Sales Expenses, or 6410 Employee Relations depending on who attended and why. The accounting treatment for those is a policy question as much as a classification question, and getting it right requires context that isn't in the transaction data.
We're building toward a richer context model that incorporates approval metadata, expense report narratives where available, and calendar context (a restaurant charge during a known conference date is more likely client entertainment than internal team lunch). That's the next significant accuracy frontier for us.
The core architecture — train on your COA, not a generic taxonomy; normalize vendors before classification; calibrate confidence per account; retrain on corrections — is what gets you from ERP-native 65% accuracy to something in the high 90s. The remaining gap is context, and that's where we're focused.