The research, led by Mesut Cicek, an associate professor in the Department of Marketing and International Business in WSU’s Carson College of Business, involved a rigorous and systematic evaluation. Cicek collaborated with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. Their collective expertise spanned marketing, business analytics, and information systems, bringing a multidisciplinary lens to the intricate task of assessing AI performance in a domain requiring nuanced interpretation.
In total, the team evaluated an extensive dataset of 719 hypotheses meticulously extracted from scientific studies published in prestigious business journals since 2021. This selection was deliberate: hypotheses in business research frequently involve multifaceted relationships, conditional factors, and subtle distinctions that preclude straightforward "true" or "false" answers without deep analytical reasoning. They are designed to explore complex phenomena, such as consumer behavior, market dynamics, or organizational strategies, where multiple variables interact. Reducing such complexity to a binary judgment demands a level of inference and contextual understanding that goes far beyond simple pattern matching or information retrieval.
To ensure a robust measure of consistency and minimize the impact of single-instance variability, the researchers adopted a comprehensive testing protocol. For each of the 719 hypotheses, they posed the same question to ChatGPT 10 distinct times. This repeated interrogation was crucial for gauging the AI’s reliability and its propensity for producing divergent outputs even under identical input conditions. The study utilized the free version of ChatGPT-3.5 during its initial phase in 2024, followed by an updated assessment in 2025 using the more recent ChatGPT-5 mini. This dual-phase testing allowed for an observation of performance evolution across different model iterations, although, as the results would show, the core limitations persisted.
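To make the protocol concrete, here is a minimal sketch of what one round of that repeated questioning could look like in code. The study itself used the free ChatGPT web interface, so this API-based approximation, including the model name, prompt wording, and answer parsing, is an assumption rather than the authors' actual procedure.

```python
# A rough, hypothetical approximation of the repeated-query protocol using
# the OpenAI Python client. The study used the free ChatGPT web interface;
# the model name, prompt wording, and parsing here are assumptions.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_true_false(hypothesis: str, repeats: int = 10) -> list[str]:
    """Pose the same true/false question `repeats` times and collect verdicts."""
    verdicts = []
    for _ in range(repeats):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed stand-in for the versions tested
            messages=[{
                "role": "user",
                "content": "Is the following hypothesis true or false? "
                           "Answer with a single word. Hypothesis: " + hypothesis,
            }],
        )
        verdicts.append(resp.choices[0].message.content.strip().lower())
    return verdicts

# Example: verdicts = ask_true_false("Higher ad spend increases brand recall.")
```

Collecting all ten verdicts per hypothesis is what makes the consistency analysis later in the article possible.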
Accuracy Results and Limits of AI Performance
The initial raw accuracy figures presented a seemingly promising picture. When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. These numbers, in isolation, might suggest a highly competent system capable of discerning truth with considerable precision. However, these raw percentages only tell part of the story and can be misleading without proper statistical adjustment.
The critical insight emerged once the researchers adjusted for random guessing. In a binary true/false task, a system answering at random would still be correct 50% of the time, so raw accuracy overstates how much genuine discrimination the AI adds. The standard correction subtracts that 50% baseline and divides by the 50% of headroom above it: a perfect system scores 100% better than chance, 75% raw accuracy scores only 50% better than chance, and ChatGPT's 80% raw accuracy works out to roughly 60% better than chance. The researchers' analogy, "a level closer to a low D than to strong reliability," conveys the actual interpretive value of the AI's output in a rigorous academic context: the system is not purely guessing, but its 'understanding' is far from robust or consistently reliable.
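The adjustment itself is simple arithmetic, and a short illustration may help. The sketch below is not the authors' code; it simply rescales raw accuracy so that 0% corresponds to pure guessing on a binary task and 100% to a perfect classifier, using the figures reported in the article.

```python
# Chance correction for a binary (true/false) task: subtract the 50%
# guessing baseline, then divide by the 50% of headroom above it.
# The accuracies tested below are the figures reported in the article.

def better_than_chance(accuracy: float, baseline: float = 0.5) -> float:
    """0.0 = no better than guessing, 1.0 = perfect."""
    return (accuracy - baseline) / (1.0 - baseline)

for raw in (0.765, 0.80, 0.75, 1.00):
    print(f"raw accuracy {raw:.1%} -> {better_than_chance(raw):.0%} better than chance")
# raw accuracy 76.5% -> 53% better than chance
# raw accuracy 80.0% -> 60% better than chance
# raw accuracy 75.0% -> 50% better than chance
# raw accuracy 100.0% -> 100% better than chance
```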
A particularly striking weakness identified in the study was the system’s profound difficulty in identifying false statements. ChatGPT correctly labeled false hypotheses only 16.4% of the time. This finding is significant because it reveals a potential bias or inherent limitation in how LLMs process and evaluate information, particularly when it contradicts pre-existing patterns or established knowledge. If an AI struggles to identify what is incorrect, its utility in critical domains like scientific validation, fact-checking, or decision support becomes severely compromised. The inability to robustly flag falsehoods means that even seemingly high overall accuracy might mask a fundamental flaw in its evaluative capabilities, leading to the propagation of misinformation or the endorsement of unsupported claims.
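The 16.4% figure is, in effect, a per-class recall: of all hypotheses whose ground truth was "false," the share the model actually labeled false. A minimal sketch of that computation, using invented labels rather than the study's data:

```python
# Per-class recall on invented data; only the metric mirrors the study's.

def recall_for_class(truths, preds, cls):
    """Fraction of items with ground truth `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(truths, preds) if t == cls and p == cls)
    total = sum(1 for t in truths if t == cls)
    return hits / total

truths = [False, False, False, True, True, True]   # hypothetical ground truth
preds  = [True,  True,  False, True, True, False]  # hypothetical model answers
print(f"recall on false hypotheses: {recall_for_class(truths, preds, False):.1%}")
# recall on false hypotheses: 33.3%
```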
Beyond mere accuracy, the study also exposed a notable inconsistency in ChatGPT’s responses. Even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time. This means that nearly one in four times, the AI would flip its answer from ‘true’ to ‘false’ or vice versa for the identical question.
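The article does not spell out the paper's exact definition of consistency, so the sketch below assumes one natural reading: a hypothesis counts as consistent when all ten repeated answers agree. The verdict lists themselves are invented.

```python
# One plausible consistency metric (an assumption, not necessarily the
# study's definition): the fraction of hypotheses whose 10 repeated
# answers all agree.

def is_consistent(answers) -> bool:
    """True when every repetition produced the same verdict."""
    return len(set(answers)) == 1

runs = {
    "H1": [True] * 10,               # stable: always 'true'
    "H2": [True, False] * 5,         # flip-flops between runs
    "H3": [False] * 10,              # stable: always 'false'
    "H4": [True] * 5 + [False] * 5,  # the five-true/five-false case Cicek describes
}
rate = sum(is_consistent(a) for a in runs.values()) / len(runs)
print(f"consistency rate: {rate:.0%}")  # consistency rate: 50%
```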
Inconsistent Answers Raise Concerns
"We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers," emphasized Cicek, highlighting the profound implications of this variability. He elaborated on the frustrating reality of the testing process: "We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false."
This level of inconsistency is perhaps more concerning than a lower, but consistent, accuracy rate. For any system intended for critical applications in scientific research, business strategy, or public policy, reliability and reproducibility of results are paramount. If an AI cannot provide a stable answer to the same question, its outputs cannot be trusted as definitive or even consistently indicative. This raises serious questions about its suitability for high-stakes tasks, where erratic responses could lead to misinformed decisions, wasted resources, or even adverse outcomes. The observed inconsistency challenges the very notion of an AI as a dependable source of information or reasoning.
AI Fluency vs. Real Understanding
The findings, published in the Rutgers Business Review, serve as a powerful cautionary tale, highlighting the critical importance of exercising extreme caution when relying on AI for significant decisions, especially those that demand nuanced interpretation or complex reasoning. The study vividly illustrates a fundamental dichotomy: while generative AI can produce remarkably smooth, coherent, and convincing language, it does not yet demonstrate the same level of conceptual understanding that underpins human intelligence.
Cicek’s analysis penetrates to the core of this distinction. According to him, these results strongly suggest that the advent of artificial general intelligence (AGI)—the hypothetical ability of an AI to understand, learn, and apply intelligence across a wide range of tasks at a human level—may still be significantly further away than many optimistic projections suggest. The current capabilities of LLMs, impressive as they are in terms of language generation, appear to be fundamentally different from genuine cognitive understanding.
"Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’" Cicek stated unequivocally. He further clarified, "They just memorize, and they can give you some insight, but they don’t understand what they’re talking about." This perspective aligns with a growing body of expert opinion that views LLMs as sophisticated pattern-matching engines rather than truly intelligent entities. They excel at identifying statistical relationships within vast datasets of human language and then generating text that mimics those patterns. They can synthesize information, summarize, and even "reason" in a superficial sense by stringing together logically plausible statements. However, this process does not necessarily imply a deep, causal understanding of the underlying concepts, the context, or the implications of the information they are processing. They lack what philosophers of mind call "qualia" or a subjective experience of the world, nor do they possess common sense reasoning in the human sense.
This distinction between fluency and understanding is crucial for setting realistic expectations for AI deployment. Users, particularly those in business and research, often project human-like intelligence onto AI systems due to their impressive linguistic output. Cicek’s study serves as a vital corrective, demonstrating that the ability to articulate plausible answers does not equate to a profound grasp of the subject matter, especially when that subject matter involves complex, scientific hypotheses requiring rigorous evaluation.
Key Weakness in AI Reasoning
The study’s results collectively point to a fundamental limitation inherent in current large language model AI systems. Although these models are adept at generating fluent and persuasive responses, their architecture, primarily based on statistical correlations and pattern recognition, often struggles when confronted with tasks requiring deep reasoning, causal inference, or the nuanced evaluation of complex, conditional statements. This struggle can manifest in answers that, while sounding entirely convincing and authoritative, are factually incorrect or logically flawed.
Cicek emphasized that this isn’t merely a bug to be fixed with more data or better training. It represents a core architectural challenge. LLMs are designed to predict the next most probable word in a sequence, not to construct a logical argument from first principles or to ascertain truth based on empirical evidence. When asked to evaluate a scientific hypothesis, the AI draws upon its vast training data to identify linguistic patterns associated with "supported" or "unsupported" claims. However, it doesn’t perform the kind of critical analysis a human researcher would: dissecting methodologies, assessing statistical significance, considering confounding variables, or understanding the theoretical underpinnings of the hypothesis. This limitation means that the AI can often produce outputs that mimic reasoning without actually engaging in it.
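As a toy illustration of that point (the numbers are invented, and no real model works at this scale), next-token generation amounts to ranking candidate continuations by learned probability and emitting the winner; whether the resulting sentence is true never enters the objective.

```python
# Toy next-token step: score candidate continuations, convert scores to
# probabilities with softmax, emit the most probable one. Nothing here
# checks truth; the logits are invented for illustration.

import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["supported", "rejected", "inconclusive"]
logits = [2.1, 0.3, -0.5]           # hypothetical pattern-derived scores
probs = softmax(logits)
best = max(zip(probs, candidates))  # highest probability wins
print({c: round(p, 2) for c, p in zip(candidates, probs)}, "->", best[1])
# {'supported': 0.81, 'rejected': 0.13, 'inconclusive': 0.06} -> supported
```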
This weakness has significant implications for various sectors. In medical diagnosis, an AI might fluently describe a condition and its treatment, but a subtle misinterpretation of patient data could lead to a severe error. In legal analysis, it might cite relevant statutes but fail to grasp the specific jurisdictional nuances that invalidate an argument. In financial forecasting, it could generate convincing market predictions without truly understanding the complex interplay of economic indicators. The potential for "plausible but incorrect" answers generated by these systems necessitates a robust human oversight layer, especially in fields where accuracy is paramount and errors carry high costs.
Why Experts Urge Caution With AI
Based on these compelling findings, the researchers issue clear and strong recommendations for business leaders and decision-makers: verify all AI-generated information and approach it with a healthy dose of skepticism. The ease with which AI can produce eloquent but inaccurate or inconsistent information demands a proactive and critical stance from users. Relying blindly on AI outputs, especially for strategic decisions, carries significant risks.
Moreover, the study emphasizes the critical need for comprehensive training to better understand the specific capabilities and, crucially, the inherent limitations of AI systems. This isn’t merely about learning how to use an AI tool; it’s about developing "AI literacy"—a deep understanding of what these technologies can effectively accomplish and where their current boundaries lie. Such training should equip users to formulate effective prompts, critically evaluate AI outputs, identify potential biases or inconsistencies, and understand when human expertise remains indispensable.
While this specific study focused on ChatGPT, Cicek noted that similar experiments conducted with other prominent AI tools have yielded comparable outcomes. This suggests that the identified limitations are not unique to a single model but rather reflect a broader characteristic of current-generation large language models. The research also builds upon and reinforces earlier work that urged caution against the overhyping of AI capabilities. For instance, a 2024 national survey revealed that consumers were less likely to purchase products when those products were marketed with an excessive focus on their AI components. This consumer skepticism likely stems from a combination of concerns about data privacy, job displacement, and perhaps an intuitive sense that AI, despite its allure, might not always deliver on exaggerated promises. The study’s findings provide empirical evidence that validates this caution, offering a scientific basis for the public’s nuanced perception of AI.
In conclusion, Cicek’s overarching message resonates with a pragmatic and balanced perspective on AI. "Always be skeptical," he advised. "I’m not against AI. I’m using it. But you need to be very careful." This sentiment encapsulates the evolving relationship between humans and artificial intelligence. AI tools offer unprecedented capabilities for automation, information synthesis, and creative generation. Their ability to process vast amounts of data and generate human-like text is undeniably revolutionary. However, as this rigorous study from Washington State University demonstrates, the current generation of AI still lacks the fundamental understanding, consistent reliability, and robust reasoning abilities required for unassisted critical evaluation, especially in complex scientific and business contexts. The future integration of AI must therefore be guided by a clear-eyed understanding of its strengths and weaknesses, ensuring that human oversight, critical thinking, and ethical considerations remain at the forefront. Only through such a balanced approach can the true potential of AI be harnessed responsibly and effectively, without succumbing to the pitfalls of uncritical reliance.