I’ve spent two decades watching organizations measure what’s easy to count instead of what actually matters. Lines of code. Story points. Meeting attendance. Uptime percentages reported to the board with three decimal places of false precision.
Now we have tokens.
Amazon reportedly requires that more than 80% of its developers use AI tools weekly, and tracks token consumption on internal leaderboards. The response was predictable to anyone who has managed a tech organization: employees started running the company’s internal agentic platform, MeshClaw, on meaningless tasks (pasting irrelevant text, looping identical requests) solely to climb the ranking. At Meta, an employee built a dashboard called “Claudeonomics” that ranked the company’s nearly 85,000 workers by token consumption. In a 30-day window, total usage exceeded 60 trillion tokens before the leaderboard was quietly taken down.
Sixty trillion tokens. I want to pause on that number for a moment, not because it’s impressive, but because it tells us absolutely nothing about whether anyone at Meta delivered better software, resolved customer problems faster, or made a single decision they wouldn’t have made without AI assistance.
This is Goodhart’s Law in its purest form: when a measure becomes a target, it ceases to be a good measure. Economist Charles Goodhart identified this dynamic in monetary policy in the 1970s. Half a century later, the world’s most technologically sophisticated companies are rediscovering it the hard way, at a cost measured in wasted GPU cycles and organizational anxiety.
We’ve Been Here Before
The token leaderboard is new. The failure mode is ancient.
In the early days of software development, managers counted lines of code. More lines meant more productivity, right? Until developers discovered that verbose code scored better than elegant code, and the incentive silently optimized toward bloat. Then we moved to commits. Then story points, a metric so systematically manipulated that many engineering teams abandoned it as a productivity signal while keeping it for estimates. Then came “meeting attendance” and “responsiveness” as proxies for engagement, which produced the culture of performative availability that killed deep work at many companies.
Each time, the pattern is identical: a legitimate underlying concern (are they working? are they productive? are they adopting new tools?) gets translated into a proxy metric, the proxy becomes a target, and behavior reorganizes around optimizing the proxy instead of the underlying thing.
Tokens are the latest iteration. And they’re a particularly bad proxy for at least three reasons.
Why Tokens Are Especially Bad Metrics
First, token volume tends to be inversely correlated with skill. A junior developer still learning how to frame a problem will often generate far more tokens than a senior developer who writes a precise prompt and gets a useful answer in a single exchange. Rewarding token consumption is, structurally, a way of rewarding inefficiency.
Second, tokens measure input, not output. Every enterprise tech mandate I’ve seen fail confused activity with results. Whether someone used AI is not the same question as whether AI helped them accomplish something. Burning tokens on MeshClaw while watching the leaderboard is activity. Shipping a feature is a result. They’re not the same thing.
Third, gaming is trivially easy. You don’t have to be Machiavellian to manipulate a token leaderboard. You just have to be rational. If the system rewards token consumption and that consumption is cheap relative to the social cost of ranking low, any rational actor will generate tokens. This isn’t a character flaw in Amazon engineers — it’s a predictable response to a poorly designed incentive system.
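To make the second and third points concrete, here is a minimal sketch of a leaderboard that ranks by raw consumption. Everything in it is hypothetical: the `Engineer` record, the `features_shipped` field as a stand-in for real output, and all the numbers are illustrative assumptions, not data from Amazon or Meta.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    tokens_used: int       # raw consumption: the only thing the leaderboard sees
    features_shipped: int  # hypothetical stand-in for actual output


def rank_by(engineers, key):
    """Return names ordered from highest to lowest on the given attribute."""
    ordered = sorted(engineers, key=lambda e: getattr(e, key), reverse=True)
    return [e.name for e in ordered]


# Illustrative numbers only.
team = [
    Engineer("senior", tokens_used=40_000, features_shipped=5),
    Engineer("junior", tokens_used=120_000, features_shipped=2),
    Engineer("padder", tokens_used=30_000, features_shipped=1),
]

before = rank_by(team, "tokens_used")

# The "padder" loops identical requests: tokens go up, output does not.
team[2].tokens_used += 500_000

after = rank_by(team, "tokens_used")
by_output = rank_by(team, "features_shipped")

print(before)     # ['junior', 'senior', 'padder']
print(after)      # ['padder', 'junior', 'senior']
print(by_output)  # ['senior', 'junior', 'padder']
```

One line of padding moves the least productive person to the top of the token ranking, while the ranking by output does not move at all.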
What Should Be Measured Instead
I’m not against measuring AI adoption. Adoption doesn’t happen on its own, and there are legitimate organizational reasons to track whether expensive tools are being used. But measurement should be honest about what it’s trying to learn.
If the real concern is adoption, measure task completion rates with and without AI assistance. If it’s productivity, measure cycle time on comparable tasks. If it’s quality, measure defect rates or rework. These are harder to track than tokens. They require judgment to interpret. They can’t be reduced to a leaderboard the way a raw consumption number can.
That difficulty isn’t a bug — it’s the point. Easy metrics are easy to game. Metrics worth tracking are the ones that force you to look at the actual work.
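As a rough sketch of the comparisons suggested above, the snippet below summarizes cycle time and defect rate for tasks done with and without AI assistance. The record layout, the `summarize` helper, and every number are assumptions made for illustration; a real analysis would need comparable tasks and far more data than six rows.

```python
from statistics import median

# Hypothetical records:
# (task_id, used_ai, cycle_time_days, defects_found_after_release)
tasks = [
    ("T1", True, 3.0, 0),
    ("T2", True, 4.5, 1),
    ("T3", True, 2.5, 0),
    ("T4", False, 5.0, 1),
    ("T5", False, 4.0, 0),
    ("T6", False, 6.0, 2),
]


def summarize(records, used_ai):
    """Summarize cycle time and defect rate for one group of tasks."""
    group = [r for r in records if r[1] == used_ai]
    if not group:
        return None  # nothing to compare against
    return {
        "tasks": len(group),
        "median_cycle_days": median(r[2] for r in group),
        "defects_per_task": sum(r[3] for r in group) / len(group),
    }


with_ai = summarize(tasks, used_ai=True)
without_ai = summarize(tasks, used_ai=False)

print("with AI:   ", with_ai)
print("without AI:", without_ai)
```

Even this toy version shows why judgment is required: the output says nothing about whether the two groups of tasks were equally hard, which is exactly the question someone has to look at the actual work to answer.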
The Deeper Irony
There’s something almost poetic about the fact that Amazon, a company that bet over $100 billion on AI’s transformative potential, is struggling to measure whether that transformation is actually happening. The hyperscalers are telling investors that inference chips are consumed as fast as they’re deployed. The combined 2026 capital expenditure of the four big players is pushing toward $700 billion, with some projections exceeding a trillion for 2027.
In that context, the primary mechanism for tracking the internal value of AI is… a token leaderboard that employees are gaming with loop scripts.
Analyst Gil Luria put it diplomatically when he told Fortune: “You get the behavior you create incentives to generate.” What he didn’t say, but every CIO who survived an ERP rollout or DevOps transformation already knows, is that poor incentive design is a form of self-harm for organizations. You don’t just fail to measure what matters — you actively create noise that drowns out the signal.
Employees running MeshClaw in circles aren’t the problem. Executives who confused token consumption with AI adoption are.
A Note for Latin American Engineering Teams
For those of us building tech organizations in Latin America, where AI budgets are tighter and API costs matter more, this story carries a specific warning. The pressure to demonstrate AI adoption is real here too — from boards, investors, clients who want to see “AI strategy” in every presentation.
Don’t let that pressure push you toward vanity metrics. Tokens consumed are a cost, not an achievement. The question worth asking in every sprint review isn’t “How much AI did we use?” — it’s “What did we build that we couldn’t have built without it, and how long did it take?”
That’s the metric that survives Goodhart’s Law.
