14 Apr 2026, Tue

Anthropic Faces Growing Backlash as Developers Accuse Claude of Performance Degradation

An escalating wave of criticism is hitting AI firm Anthropic, as a growing number of developers and AI power users take to social media to accuse the company of degrading the performance of its flagship model, Claude Opus 4.6, and its agentic coding tool, Claude Code. These users allege that the products have become demonstrably less capable, less reliable, and far more wasteful with computational resources, particularly token usage, than they were just weeks prior. The accusations range from intentional throttling and performance tuning by Anthropic to unintended consequences of resource management and compute constraints.

These concerns have rapidly disseminated across major online forums and social media channels, including extensive discussions on GitHub, X (formerly Twitter), and Reddit. High-visibility posts across these platforms have detailed a perceived decline in Claude’s ability to engage in sustained reasoning, a marked increase in instances where the model abandons tasks mid-completion, and a heightened propensity for generating hallucinations or contradictory outputs. This collective user sentiment has led some to coin the term "AI shrinkflation," drawing a parallel to the practice of reducing product size while maintaining the same price, suggesting users are receiving a diminished product for their investment. More pointedly, some critics suspect Anthropic may be actively throttling or otherwise tuning down Claude’s performance during periods of peak demand.

While the claims of intentional degradation remain unproven, Anthropic employees have publicly denied deliberately degrading models to manage capacity. The company has, however, acknowledged recent, tangible changes to usage limits and default reasoning settings, which has fanned the controversy. VentureBeat reached out to Anthropic for clarification, asking specifically whether any recent adjustments to reasoning defaults, context handling, throttling mechanisms, inference parameters, or benchmark methodologies could account for the surge in user complaints. Anthropic was also asked to respond to the recent benchmark-related claims and to say whether it intends to release additional data to reassure customers. A company spokesperson did not address these questions individually, instead directing VentureBeat to X posts from Boris Cherny, the creator of Claude Code, and Thariq Shihipar, a member of the Claude Code team. Those posts, referenced below, address Opus 4.6 performance and usage limits, respectively.

Viral User Complaints, Including from an AMD Senior Director, Argue Claude Has Become Less Capable

One of the most detailed and consequential public critiques originated as a GitHub issue filed by Stella Laurenzo on April 2, 2026. Laurenzo's LinkedIn profile identifies her as a Senior Director in AMD's AI group, which lent her observations considerable weight. In the post, Laurenzo asserted that Claude Code had regressed to the point where it could no longer be trusted for complex engineering tasks. To substantiate the claim, she presented an analysis of 6,852 Claude Code session files, 17,871 "thinking blocks" (internal reasoning steps), and 234,760 tool calls.

Laurenzo’s analysis contended that starting in February, Claude’s estimated reasoning depth experienced a sharp decline, coinciding with a rise in indicators of poorer performance. These indicators included more instances of premature task abandonment, an increase in "simplest fix" behavior (prioritizing quick, superficial solutions over deeper analysis), a greater prevalence of reasoning loops, and a discernible shift from a research-first approach to an edit-first methodology. The core argument presented was that for advanced engineering workflows, extended reasoning is not merely an optional feature but a fundamental prerequisite for the model’s usability.

The GitHub thread quickly spread beyond its original platform, gaining traction on broader social media. X users such as @Hesamation amplified Laurenzo's findings by posting screenshots of her GitHub analysis on April 11, turning it into a widely discussed talking point. The amplification mattered because it gave the burgeoning "Claude is getting worse" narrative a tangible foundation beyond anecdotal frustration: a detailed, data-rich post from a senior AI leader at a major semiconductor company, arguing that the regression was observable in logs, tool-use patterns, and user corrections.

Anthropic’s initial public response focused on differentiating perceived changes from actual model degradation. In a pinned follow-up comment on the same GitHub issue, posted approximately a week prior to the latest reporting, Boris Cherny, the lead for Claude Code, acknowledged Laurenzo’s thorough analysis but contested its central conclusion. Cherny explained that the "redact-thinking-2026-02-12" header mentioned in the complaint was a user interface-only change designed to hide internal thought processes from the display and reduce latency, asserting that it "does not impact thinking itself," "thinking budgets," or the underlying mechanics of extended reasoning.

He further posited that two other product changes likely shaped user perceptions: Claude Opus 4.6's switch to adaptive thinking as the default on February 9, and a March 3 change that made a "medium effort" setting (effort level 85) the default for Opus 4.6. Cherny characterized that default as Anthropic's assessment of the best balance of intelligence, latency, and cost for most users, and noted that users who want more extended reasoning can manually raise the effort level by typing /effort high in a Claude Code terminal session. The exchange captures the crux of the controversy: critics like Laurenzo argue that Claude's performance on demanding coding tasks has objectively worsened, citing logs and usage patterns as evidence. Anthropic, for its part, does not deny that changes occurred, but maintains that the most significant recent modifications were product and interface choices affecting what users see and the system's default effort, not a clandestine downgrade of the core model. That distinction may be technically meaningful, but it offers little solace to power users experiencing a perceived decline in output quality.
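For users affected by the new default, Cherny's guidance amounts to a one-line adjustment. A minimal sketch of such a session (the slash command is the one Cherny cites; the surrounding comment is illustrative, and behavior may differ across Claude Code versions):

```
# Inside an interactive Claude Code terminal session:
/effort high    # raise reasoning effort from the "medium" (85) default
```

The setting applies to the session, so users doing sustained engineering work would need to rely on it rather than the shipped default.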

Laurenzo's post, and the wider chorus of agreement from power users, gained further visibility through external coverage from publications like TechRadar and PC Gamer. Another viral X post, from developer Om Patel on April 7, made a similar argument even more directly, claiming the perceived decline in Claude's intelligence had been "actually measured" at a reported 67% drop. That post was instrumental in popularizing the "AI shrinkflation" label and pushing the controversy beyond the dedicated Claude Code user base into the wider AI discourse on X. The claims have resonated with frustrated users who report more unfinished tasks, more backtracking, faster token consumption, and a pervasive sense that Claude is less willing to engage in deep, analytical reasoning on complex coding problems than it was earlier in the year.

Benchmark Posts Turned Anecdotal Frustration into a Public Controversy

The most prominent benchmark-based accusation came from BridgeMind, the entity behind the BridgeBench hallucination benchmark. On April 12, BridgeMind's X account reported that Claude Opus 4.6 had fallen from 83.3% accuracy and a No. 2 ranking in a prior evaluation to 68.3% accuracy and a No. 10 ranking in a recent retest, declaring it definitive proof that "Claude Opus 4.6 is nerfed." The post rapidly gained traction, becoming a primary anchor for the broader narrative that Anthropic had degraded the model's performance.

In parallel, other users circulated benchmark-related or test-based posts suggesting that Opus 4.6 was underperforming compared to its predecessor, Opus 4.5, in practical coding scenarios. Further analysis pointed to TerminalBench-related results as purported evidence of changes in the model’s behavior within specific testing harnesses or product contexts. The cumulative effect was a powerful reinforcement loop, where benchmark screenshots, side-by-side comparisons, and anecdotal frustrations coalesced and amplified each other in the public sphere. This amplification is significant because benchmark claims often carry more weight and wider reach than subjective user experiences. A developer’s assertion that a model "feels worse" is one thing; a screenshot depicting a ranking drop from No. 2 to No. 10, or a dramatic percentage swing in accuracy, creates the appearance of hard, irrefutable proof, even if the underlying comparison methodology is more complex.

Critics of the Benchmark Claims Say the Evidence is Weaker Than It Looks

The most critical rebuttal to the BridgeBench claim did not originate from Anthropic itself. Instead, it came from Paul Calcraft, an independent software and AI researcher, who posted on X arguing that the viral comparison was misleading. Calcraft pointed out that the earlier Opus 4.6 result was based on a mere six tasks, whereas the later evaluation involved 30 tasks. He emphatically stated it was a "DIFFERENT BENCHMARK." Furthermore, Calcraft noted that on the six tasks common to both runs, Claude’s score only shifted modestly, from 87.6% previously to 85.4% in the later run. The more significant swing, he contended, appeared to stem primarily from a single fabrication result without sufficient repeats, suggesting it could easily fall within the range of ordinary statistical noise.
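Calcraft's sample-size point can be made concrete with back-of-the-envelope arithmetic. The sketch below is purely illustrative, not BridgeBench's methodology, and the 85% pass rate is an assumed round number near the common-subset scores; it simply shows how far a headline score moves when one task outcome changes, or from ordinary binomial noise, at n=6 versus n=30.

```python
import math

# Illustrative only: how sensitive a headline accuracy number is to sample size.

def swing_per_task(n_tasks: int) -> float:
    """Percentage-point change in accuracy if a single task flips pass/fail."""
    return 100.0 / n_tasks

def binomial_se(p: float, n: int) -> float:
    """One-sigma standard error (in points) of an observed pass rate p over
    n independent tasks, treating each task as a Bernoulli trial."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# With 6 tasks, one flipped result moves the score by ~16.7 points;
# with 30 tasks, the same flip moves it by only ~3.3 points.
print(round(swing_per_task(6), 1), round(swing_per_task(30), 1))   # 16.7 3.3

# At an assumed true pass rate of 85%, the one-sigma noise band alone is
# ~14.6 points on a 6-task run versus ~6.5 points on a 30-task run.
print(round(binomial_se(0.85, 6), 1), round(binomial_se(0.85, 30), 1))  # 14.6 6.5
```

On numbers like these, a double-digit swing between two small runs is well within what chance alone can produce, which is the substance of both Calcraft's rebuttal and the later community note.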

This external rebuttal is significant because it undermines one of the most widely circulated and seemingly concrete pieces of evidence supporting the "nerfed" narrative. While it does not invalidate the concerns of users who feel Claude’s performance has degraded, it strongly suggests that at least some of the benchmark evidence fueling the story may be overstated, inadequately normalized, or not directly comparable. Notably, even the BridgeBench post itself later received a community note echoing similar sentiments. The note highlighted that the two benchmark runs covered different scopes – six tasks in one instance and thirty in the other – and that the subset of common tasks showed only a minor variation. While this doesn’t render the later result meaningless, it significantly weakens the most assertive version of the "BridgeBench proved it" argument.

This nuance is now a central characteristic of the controversy: the claims made are not uniformly robust. Some are rooted in direct user experiences. Others point to genuine product changes. Some rely on benchmark comparisons that appear to be methodologically flawed or lacking apples-to-apples comparability. And a portion hinges on inferences about underlying system behavior that are not directly verifiable by individuals outside Anthropic.

Earlier Capacity Limits Gave Users a Reason to Suspect More Changes Under the Hood

The current backlash is also occurring in the shadow of a real, confirmed policy change implemented by Anthropic in late March. On March 26, Thariq Shihipar, a technical staffer at Anthropic, posted that "To manage growing demand for Claude," the company was adjusting how 5-hour session limits function for Free, Pro, and Max subscribers during peak hours, while keeping weekly limits unchanged. He elaborated that during weekdays from 5 a.m. to 11 a.m. Pacific time, users would consume their 5-hour session limits at a faster rate than previously. In subsequent posts, Shihipar stated that Anthropic had achieved efficiency gains to mitigate some of the impact, but that approximately 7% of users would encounter session limits they would not have hit before, particularly those on Pro tiers.

In an email to VentureBeat on March 27, 2026, Anthropic clarified that Team and Enterprise customers were unaffected by these changes and that the adjustment was not dynamically optimized per user but rather applied to the pre-defined peak-hour window. Anthropic also reiterated its ongoing investment in scaling capacity. While these statements specifically addressed session limits rather than model downgrades, they provide crucial context. They establish two key points that users are now frequently connecting: first, Anthropic is actively managing surging demand for its services; and second, the company has already implemented changes to ration usage during busy periods. This does not constitute proof of Anthropic reducing model quality, but it helps explain why a significant number of users are predisposed to believe that other, less transparent, changes may also have occurred.

A Prompt Caching Change Extended the Dispute to Pricing and Quotas

A separate, more recent GitHub issue, #46829, broadens the dispute beyond mere model quality and delves into pricing and quota management. In this issue, user seanGSISG argued that Claude Code’s prompt-cache time-to-live (TTL) appeared to shift from a one-hour setting back to a five-minute setting in early March. This conclusion was based on an analysis of nearly 120,000 API calls extracted from Claude Code session logs across two machines. The complaint asserted that this change led to significant increases in cache-creation costs and quota consumption, especially for extended coding sessions where cached context expires rapidly and requires frequent rebuilding. The author claimed this shift could explain why some subscription users began encountering usage limits they had not previously experienced.

What makes this particular issue noteworthy is that Anthropic did not outright deny that a change had occurred. In a reply on the thread, Jarred Sumner confirmed the March 6 change was real and intentional but contested the framing that it represented a regression. He explained that Claude Code utilizes different cache durations for various request types and that a one-hour cache is not universally more cost-effective, as one-hour writes incur higher upfront costs and only yield savings when the same cached context is reused sufficiently to justify it. In his account, the change was part of ongoing cache optimization efforts, not a surreptitious downgrade, and the pre-March 6 behavior described in the issue "wasn’t the intended steady state."
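Sumner's break-even argument is easy to sketch numerically. The multipliers below are illustrative assumptions (cache writes priced above plain input, cache reads well below it), not Anthropic's published rates; the point is only the shape of the trade-off he describes.

```python
# Sketch of the 1-hour vs 5-minute cache TTL trade-off. All multipliers are
# assumed, normalized to the cost of one uncached input token.
BASE = 1.0        # plain input token
WRITE_5M = 1.25   # 5-minute-TTL cache write, per token (assumed premium)
WRITE_1H = 2.0    # 1-hour-TTL cache write, per token (assumed larger premium)
READ = 0.1        # cache read, per token (assumed discount)

def session_cost(write_mult: float, warm_hits: int, rebuilds: int) -> float:
    """Cost per context token: each rebuild pays a full cache write;
    each hit while the cache is still warm pays only the read rate."""
    return rebuilds * write_mult + warm_hits * READ

# Long session, sparse requests: a 5-minute cache keeps expiring, so the
# context is rewritten for each of 10 requests...
slow_5m = session_cost(WRITE_5M, warm_hits=0, rebuilds=10)
# ...while a 1-hour cache is written once and read 9 times.
slow_1h = session_cost(WRITE_1H, warm_hits=9, rebuilds=1)

# Rapid burst inside five minutes: both TTLs stay warm, so the cheaper
# 5-minute write wins.
burst_5m = session_cost(WRITE_5M, warm_hits=9, rebuilds=1)
burst_1h = session_cost(WRITE_1H, warm_hits=9, rebuilds=1)

print(slow_5m, slow_1h)    # sparse session: 1-hour TTL far cheaper
print(burst_5m, burst_1h)  # burst session: 5-minute TTL cheaper
```

Under these assumptions the one-hour cache is several times cheaper for the sparse session but more expensive for the burst, which is consistent with Sumner's claim that neither TTL is universally more cost-effective.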

The thread subsequently received a more detailed response from Anthropic’s Cherny, who described one-hour caching as "nuanced." He stated that the company has been testing heuristics to improve cache hit rates, token utilization, and latency for subscribers. Cherny noted that Anthropic maintains a five-minute cache for many queries, including subagents that are infrequently resumed, and that disabling telemetry can also deactivate experiment gates, potentially causing Claude Code to revert to a five-minute default in certain scenarios. He added that Anthropic plans to introduce environment variables that will allow users to directly enforce either one-hour or five-minute cache behavior. Collectively, these responses do not validate the issue author’s assertion that Anthropic silently made Claude Code more expensive overall. However, they do confirm that Anthropic has been actively experimenting with cache behavior behind the scenes during the same period users began voicing louder complaints about quota burn and evolving product behavior.

Anthropic Says User-Facing Changes, Not Secret Degradation, Explain Much of the Uproar

Employees affiliated with Anthropic have publicly pushed back against the broader accusations. In a widely circulated reply on X, Cherny directly addressed claims that Anthropic had secretly "nerfed" Claude Code by stating, "This is false." He explained that Claude Code had been defaulted to medium effort in response to user feedback that Claude was consuming excessive tokens, and that this change had been communicated in the changelog and via an on-screen dialog presented to users upon opening Claude Code. This response is significant because it acknowledges a substantial product change while refuting a more conspiratorial interpretation of it. Anthropic is not denying that alterations have been made; rather, it asserts that these changes were disclosed and aimed at balancing token usage, not at covertly diminishing model quality.

Public documentation also corroborates the ongoing adjustments to effort defaults. The changelog for Claude Code indicates that on April 7, Anthropic modified the default effort level from medium to high for API-key users, as well as for Bedrock, Vertex, Foundry, Team, and Enterprise users. This suggests Anthropic has been actively fine-tuning these settings across various user segments, which could plausibly influence user perceptions even if the core model weights remain unaltered.

Shihipar has also directly refuted the broader accusation of demand-management-driven degradation. In a reply on X posted April 11, he stated that Anthropic does not "degrade" its models to better serve demand. He also noted that changes to thinking summaries affected how some users were measuring Claude’s "thinking," and that the company had not found evidence to support the most severe qualitative claims circulating online.

The Real Issue May Be Trust as Much as Model Quality

What has become unequivocally clear is the emergence of a significant trust deficit between Anthropic and a segment of its most demanding users. For developers who rely on Claude Code extensively throughout their workday, subtle shifts in visible thinking output, default effort settings, token consumption patterns, latency trade-offs, or usage caps can feel indistinguishable from a genuine reduction in model capability. This holds true regardless of whether the root cause lies in a product setting, a user interface modification, an inference policy tweak, capacity pressures, or an actual quality regression.

This situation also suggests the two sides are talking past each other. Users are describing their lived experience: more friction, more frequent failures, and diminished confidence in the tool. Anthropic is responding in product terms: effort defaults, hidden thinking summaries, changelog disclosures, and denials of demand-driven degradation. These descriptions are not mutually exclusive; a model can feel worse to users even if the company believes it has not "nerfed" the underlying model in the way critics allege. The timing, however, is awkward. Anthropic's chief competitor, OpenAI, has recently pivoted to put more resources behind its rival enterprise and code-focused product, Codex, even introducing a new mid-range ChatGPT subscription to drive usage of the tool, and this controversy is not the kind of publicity that helps Anthropic's customer retention.

Concurrently, the public evidence remains mixed. Some of the most viral claims have originated from developers armed with detailed logs and strong convictions derived from extensive usage. Some of the benchmark evidence has faced scrutiny from external observers regarding its methodological soundness. And Anthropic’s own recent adjustments to limits and settings ensure that this debate is unfolding against a backdrop of genuine, confirmed operational changes, rather than being solely based on rumor.
