
GPTurk: Is AI Faking Human Feedback and Human Labor?
Imagine this: Late one night in 2023, a freelance crowd worker sits in front of her laptop, racing to finish a paid task. She’s been asked to summarize a dense 400-word science article in under 100 words.
The pay? A mere dollar or two. Rather than furiously typing, she calmly copies the text into another window—an AI chatbot. In seconds, it spits out a polished summary. She pastes it back, hits submit, and collects her fee. To the requester, it appears a human (her) completed the job. In truth, an AI ghostwriter just toiled in her place.
In 1770, a clockwork marvel called The Mechanical Turk wowed audiences by playing chess against human opponents. Inside its wooden cabinet, a hidden human chess master was pulling the strings.
Fast-forward to today—on Amazon’s Mechanical Turk (a crowdsourcing website named after that automaton), the magic trick has flipped. Now, the humans on the platform have a machine hidden inside their workflows, quietly using AI to generate the “human” output.
Recent revelations have pulled back the curtain on this next-gen Turk, and the findings are astonishing: in a new experiment, researchers discovered that around 33–46% of crowd workers quietly used AI (like ChatGPT) to complete a writing task.
This infiltration of AI-generated text into ostensibly human labor is raising profound questions about the integrity of our data, the nature of online work, and the future of trust.

Ghosts in the Crowdwork Machine
Crowdsourcing platforms like Amazon Mechanical Turk and Prolific have long been the go-to places to get human input on demand. Need thousands of product reviews, survey responses, or labels for training an AI model? Post the tasks online, and an army of remote workers (sometimes called “Turkers”) will do it for a few cents each.
The underlying assumption has always been that you’re paying for human effort – the platform even bills itself as “Artificial Artificial Intelligence,” meaning real people doing tasks that computers can’t do. But with the rise of powerful large language models, that assumption is crumbling. If crowd workers start delegating their tasks to AI, the whole premise of “human” data comes into doubt.
Are we truly getting insight into human opinions, creativity, and behavior – or just the regurgitated output of an algorithm faking a crowd of humans?
This is not a far-fetched hypothetical. The recent study by Veniamin Veselovsky and colleagues (cheekily titled “Artificial Artificial Artificial Intelligence”) set out to measure exactly this phenomenon. They suspected that many Turkers were secretly relying on AI, so they designed a real-world test: ask crowd workers to produce short summaries of research abstracts – a task that’s tedious for humans but easy for an AI – and then catch any telltale signs of machine involvement.
“LLM use by crowd workers compromises research on human behavior, preferences, and opinions,” Veselovsky and colleagues warn.
The problem they were tackling was both urgent and sneaky: if AI-written text is flooding platforms meant to collect human-generated content, it threatens the validity of studies and data that rely on those platforms.
For example, a psychology survey intended to gauge human opinion isn’t very useful if half the respondents just asked ChatGPT how to answer. As the researchers warned, this trend “raises serious concerns about the gradual dilution of the ‘human factor’ in crowdsourced text data.”
In other words, our supposedly human-sourced data might be losing its humanity.

Setting the Trap for an AI Ghostwriter
How do you catch a ghost in the machine? The research team turned to a mix of clever sleuthing and machine learning. First, they reran a known experiment: asking workers on Mechanical Turk to summarize academic abstracts (about 400 words) into short, 100-word summaries.
They didn’t forbid using AI, nor did they announce they were looking for AI usage – it was just presented as another routine task. Behind the scenes, however, the scientists set a trap. They instrumented the task environment to record keystroke data, looking for patterns like a worker pasting large chunks of text (rather than typing normally).
Think of this as installing a sort of digital surveillance camera on the typing process – if an entire paragraph appears in the text box in an instant, that’s a big red flag that it was generated elsewhere and pasted.
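For the curious, here is roughly what such a check could look like in code. This is a minimal sketch, not the authors’ actual instrumentation; the event format and the 100-character threshold are illustrative assumptions.

```python
# Minimal sketch of a paste-detection heuristic over keystroke logs.
# Assumption (not the study's actual pipeline): each logged input event
# records its type ("keypress" or "paste") and how many characters it added.
from dataclasses import dataclass

@dataclass
class InputEvent:
    event_type: str    # "keypress" or "paste"
    chars_added: int   # characters inserted by this event

def looks_like_llm_paste(events: list[InputEvent], min_paste_chars: int = 100) -> bool:
    """Flag a submission if a large block of text appeared in a single event."""
    return any(e.event_type == "paste" and e.chars_added >= min_paste_chars
               for e in events)

# Example: a worker types a handful of characters, then pastes a full summary.
log = [InputEvent("keypress", 1)] * 12 + [InputEvent("paste", 540)]
print(looks_like_llm_paste(log))  # True -> worth a closer look
```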
At the same time, they trained a text classifier (a machine-learning model) to distinguish between human-written summaries and AI-written ones. Using a collection of genuine human-written summaries from before the AI era and a set of machine-generated summaries (from GPT-4 and ChatGPT), they taught the classifier the subtle differences in style and word choice.
This is like a forensic linguist training to spot the fingerprints of AI in a piece of text.
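Purely as an illustration (the study’s actual classifier and training setup may differ), a bare-bones version of such a detector could be put together along these lines:

```python
# Bare-bones human-vs-AI summary classifier (illustrative only; not the
# model or features used in the study). Assumes two labeled collections of
# summaries gathered beforehand; the strings below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_summaries = ["a pre-LLM summary written by a crowd worker",
                   "another genuinely human-written summary"]
ai_summaries = ["a summary generated with a chatbot for training",
                "another machine-generated training summary"]

texts = human_summaries + ai_summaries
labels = [0] * len(human_summaries) + [1] * len(ai_summaries)  # 1 = AI-written

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Score an incoming submission: a probability near 1 suggests LLM involvement.
p_ai = detector.predict_proba(["the submitted summary text goes here"])[0][1]
```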
The trap was set; now it was time to spring it. As the crowdworker submissions rolled in, the system quietly analyzed each one. The results were startling.
By combining the keystroke evidence and the AI-text detector’s judgment, the researchers concluded that between 33% and 46% of the participants had used an LLM to help write their summaries. In raw numbers, about 15 to 20 out of 44 workers were likely letting ChatGPT (or a similar model) do the heavy lifting.
In other words, nearly half of the “crowd” was essentially an AI in disguise. One could almost hear echoes of the original Mechanical Turk: a machine was indeed doing the work, concealed behind a human facade.
What gave them away? Beyond the copy-paste keystrokes, the content of the AI-assisted summaries had subtle tells. They were often too perfect, too similar to one another. When you have dozens of humans summarize an article, you expect a range of writing styles and perspectives – some might focus on different aspects of the text, others might phrase things colloquially or inject a personal tone.
The AI-generated summaries, by contrast, came out uniform and homogeneous. As the authors noted, responses written with LLM help were high-quality but had a certain sameness, being “more homogeneous than those written without LLMs’ help.”
In fact, the team measured a homogeneity score (a metric of textual similarity) that was nearly twice as high for the AI-generated texts as for the human-written ones.
It’s as if many workers all handed in essays that, while fluent, read like they were written by the same ghostwriter. High quality, yes – one AI can produce grammatically sound, relevant summaries all day – but lacking the diversity and idiosyncrasy that a group of independent humans would show.
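How might one put a number on that sameness? The article above doesn’t spell out the exact metric, but a plausible stand-in for a homogeneity score is the average pairwise similarity between summaries of the same abstract. A rough sketch:

```python
# Illustrative "homogeneity" measure: mean pairwise cosine similarity among
# summaries of the same source abstract. A plausible stand-in, not necessarily
# the exact metric reported in the study.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def homogeneity(summaries: list[str]) -> float:
    """Average pairwise cosine similarity; higher means the texts read more alike."""
    vectors = TfidfVectorizer().fit_transform(summaries)
    sims = cosine_similarity(vectors)
    pairs = list(combinations(range(len(summaries)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)

# Comparing a batch of suspected LLM-assisted summaries against a purely human
# batch, a noticeably higher score for the former is the "sameness" signal.
```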
Importantly, the AI-assisted outputs weren’t bad in the traditional sense. In many cases, they were perfectly accurate and well-written summaries. This highlights a central tension: from a task requester’s perspective, an AI-written answer can look great (concise, no typos, on-point) and might even outperform a rushed human effort.
The researchers found that these AI-augmented workers produced summaries that met the requirements and often passed basic quality checks. It’s easy to see why neither the requesters nor the platform immediately caught on – there was no glaring error to expose the fakery. The ghosts did a good job. Too good, in fact, and too many of them doing the job in the exact same way.
Why It Matters: When “Human” Data Isn’t Human
Discovering that a significant fraction of crowd work might be done by AI isn’t just a quirky finding – it strikes at the heart of many fields that rely on real human data. Research in the social sciences, user experience testing, market surveys, and AI training all lean heavily on crowdsourced human input.
If that input is secretly machine-generated, the implications are wide-ranging:
- Compromised research on human behavior: Studies seeking genuine human opinions, preferences, or creativity could be drawing conclusions from AI-generated responses. A survey about political beliefs or an experiment on how people write under stress means little if ChatGPT wrote a chunk of the answers. We risk basing theories of human psychology or sociology on what a language model thinks a human would say.
- Contaminated training data for AI: Ironically, AI might be polluting the very datasets we use to train new AI. Crowdsourced labels and examples are often treated as gold-standard “human” data to teach models about the world. But as soon as those labels or examples are produced by an AI, we enter a hall of mirrors. Future models trained on this data would in part be learning from older AI outputs, potentially leading to a feedback loop of homogenization or errors. Researchers have dubbed this problem “model dementia,” noting that models trained on AI-generated data can actually forget or distort true human patterns over time. In short, feeding AI outputs back into AI training can degrade the performance of subsequent models – a worrying loop where the human reality in the data gets fainter and fainter.
- Loss of diversity and nuance: Human responses are naturally messy and diverse. Ten different people will phrase an answer in ten different ways, often revealing subtle nuances, cultural influences, or personal experiences. This diversity is a strength – it helps researchers and AI systems capture the full spectrum of human perspectives. But if many responses come from the same few AI models, we get a much narrower band of styles and viewpoints. Indeed, the study found the AI-written texts preserved key terms and phrasing very consistently, whereas humans paraphrased and varied more. That homogenization could erase minority viewpoints or creative outliers. Crowd wisdom turns into machine monotony.
- Erosion of trust in online content and systems: More broadly, if end-users (or companies) begin to suspect that ostensibly human-generated content is machine-made, trust takes a hit. Think of product reviews, forum answers, or even “human” customer service chats that might be farmed out to crowd workers – if those workers use AI, are we effectively getting AI-generated reviews and advice without knowing it? The credibility of crowdsourced content – already under scrutiny due to bots – faces a new kind of challenge. And for the workers who don’t cheat, there’s a risk too: a blanket of skepticism could devalue legitimate human work if requesters start assuming everyone might be using AI.
The stakes are high for science and society alike. As one commentator put it after learning of the 2023 experiment, “Mechanical Turk is just AI now.”
That’s an exaggeration, but it captures the alarm: if we can’t be sure crowdworkers are actually human (or at least behaving like independent humans), then any process built on the “wisdom of the crowd” needs to be rethought. In effect, the crowd might be partly an illusion.
It’s telling that around the same time this study came out, other researchers were exploring the flip side of the coin: using AI instead of crowd workers. In one notable analysis, ChatGPT not only matched crowd workers on several text annotation tasks, it beat them by a large margin – and at roughly one-thirtieth of the cost. For example, classifying the sentiment or topic of thousands of tweets, which might cost hundreds of dollars via Mechanical Turk, could be done by the AI for a few dollars in compute time, with even higher accuracy in some cases.
This isn’t to say humans are obsolete (AI has its own blind spots), but the economics create a strong pressure. If you’re a cash-strapped researcher or a business, why not use the AI directly? And if you’re a crowd worker trying to make a living off pennies per task, why not secretly enlist that same AI to boost your throughput?
Both sides are tempted by efficiency. The result is a kind of quiet convergence: crowd workers have incentive to use AI, and requesters have incentive to accept AI-like quality – as long as it meets the task criteria, who cares if a human or GPT-4 wrote it? Well, as we’ve outlined, in many cases we should care. But the incentives aren’t always aligned with maintaining a “pure” human touch.
Cat and Mouse: Detection and Deterrence
Platform operators and researchers now face a classic cat-and-mouse dynamic. How do we prevent or detect AI “cheating” without undermining the positive aspects of crowd work? It turns out this is not an easy task. The 2023 study didn’t stop at measuring the problem; the authors also experimented with mitigation strategies. In a follow-up, they ran a second study to see if they could deter workers from using AI by altering the task conditions.
One approach was straightforward: explicitly ask them not to use any AI assistance (essentially an honor-system request). Another approach was more tactical: introduce small hurdles that make it harder to copy-paste outputs directly from an AI, such as requiring workers to complete the task in a text box that doesn’t easily allow pasting, or to write in chunks.
The result? LLM use dropped, but it didn’t disappear. In the experiment, the estimated prevalence of AI usage roughly halved – from about 28% down to ~16% – when workers were warned and minor obstacles were added.
That’s a notable improvement, but it still means roughly 1 in 6 workers in the “please don’t use AI” condition went ahead and used it anyway. Moreover, when the researchers tried a more indirect approach (without an explicit no-AI rule, just hurdles), the reduction was only a few percentage points, not statistically significant. In some cases, workers simply ignored the request not to use AI if it was in their financial interest to do so.
After all, from the worker’s perspective, using ChatGPT can save time – and time is money when you’re getting paid per task.
This cat-and-mouse game highlights a key point: people respond to incentives. As one researcher wryly noted, “If you pay someone a below-market wage for a task (like $1 to produce high-quality text), don’t be surprised if they utilize labor-saving technologies (like ChatGPT).”
Underpaid crowd workers have always found ways to optimize or automate parts of their work – whether by writing scripts, sharing answers, or now, quietly asking an AI for help. In effect, some Turkers have become managers, orchestrating a team of one AI assistant to maximize their earnings. They might rationalize it as just using a tool, not so different from using spell-check or Google.
The ethical lines can blur: the platform’s rules may not clearly ban using external assistance, and there’s no personal data being faked (unlike someone pretending to be another person). So from the worker’s view, it can feel like a victimless shortcut. The big victim, however, is the integrity of the data and the trust between requesters and workers.
On the detection side, things aren’t much easier. While the study at hand succeeded with a custom detector and instrumentation, deploying such measures widely is challenging. Mechanical Turk and other platforms generally do not record detailed keystroke logs by default (for privacy and technical reasons), so one would have to build custom task interfaces to catch copy-paste events.
And even that isn’t foolproof – a crafty worker could still manually retype an AI’s output or use the AI on a second device to evade detection. Automated text classifiers that detect AI-written text are also fallible. OpenAI’s own attempt at a generic GPT-written text detector was recently shut down because of its “low rate of accuracy.”
It often failed to distinguish AI from human writing reliably. AI-generated text, especially from the latest models, is becoming indistinguishable from human text in many cases, as the models learn to mimic human quirks and avoid obvious tells. It’s a moving target; as detection improves, so do the AIs at masking their “machine scent.” This is reminiscent of spam detection or ad fraud – an endless back-and-forth evolution.
There is also a question of policy: should platforms explicitly ban AI usage by workers? Some have started updating guidelines, but enforcement is tough without detection tools. The study’s authors suggest that incorporating AI into the workflow might be inevitable – perhaps the solution is not to strictly prohibit AI, but to manage its use.
For instance, they muse about having crowd workers use AI in a transparent way, where the AI’s contributions are tagged or the human plays a curator role, so requesters know what they’re getting.
This would be a shift from the current model, essentially acknowledging that hybrid human-AI labor is a reality and trying to make it work for everyone. It’s akin to doping in sports – you either fight an endless battle to ban it, or you radically change the rules of the game (though unlike doping, AI assistance could be seen as a positive productivity tool if used openly).
From an economic angle, one could argue that if the AI is doing half the work, maybe the worker and the AI together should command a different kind of compensation or contract. These discussions are only beginning, but they will be crucial as AI becomes interwoven with human labor in more domains.
In the meantime, researchers who rely on crowd platforms are advised to take precautions: better task design, clearer instructions, and screening methods. For example, explicitly reminding participants that the study is about genuine human responses can dissuade some from using AI.
Including attention checks or requiring an explanation of how they arrived at an answer can trip up a pure copy-paste approach. And as a simple step, pay a fair wage – workers paid decently have less incentive to risk rejection by cheating.
Ultimately, maintaining a good relationship with crowd workers (treating them as partners rather than interchangeable “human CPUs”) can encourage honesty. In one sense, this upheaval is forcing requesters to engage more with the human side of crowd work – to ensure the human is actually in the loop, you might need to communicate more and build mutual trust.
The Blurring Line Between Human and Machine
Two and a half centuries ago, a clever hoax made people question whether a machine could truly think like a human. Today, we have to question whether the “humans” behind our machines are truly doing the thinking. The line between human and AI labor is blurring in a way that feels like a grand irony of history.
On Amazon’s modern Mechanical Turk, real people became the hidden hands inside a machine-like system of micro-tasks. Now those people have, in turn, hidden a machine inside their work. It’s a hall of mirrors: AI imitating humans, and humans leaning on AI.
What does it mean for our future when a large chunk of “human-generated” data might actually be AI-generated? For one, it challenges the foundations of how we teach AI and learn about ourselves. If we ask an AI to mimic human writing style, and we train it on a dataset where the “human” writing was itself produced by an AI, we’re caught in a loop of imitation of imitation.
The distinctive spark that comes from a human perspective – with all our inconsistencies, emotions, and irrationalities – could get lost in translation. We might end up with AI systems that are superlatively good at predicting the kind of vanilla, homogenized output that another AI would produce, while missing the messy truth of genuine human responses.
On the other hand, one might ask: if the AI outputs are indistinguishable from what a human would say, does it matter? This is a deeply philosophical and practical question. Certainly, if our goal was specifically to understand human cognition or behavior, then an AI-written answer is a false signal.
But if our goal was just to get a job done (summarize text, categorize an image, answer a trivia question), maybe we care more about the quality and less about who or what did it. Society will have to grapple with this distinction. We may have to start labelling the origin of our data, much like organic vs. non-organic produce: do you want the organically grown human insight, or are you okay with the cheaper, mass-produced AI substitute?
In scientific research, I suspect there will be a push for “AI transparency,” where participants might be asked to disclose AI assistance, or studies might be designed to minimize it.
The broader consequence is a matter of trust. Humans are social creatures; we ascribe meaning to knowing something came from a real person. Consider online advice communities, product reviews, or even creative works – it’s one thing to read a heartfelt personal story, and another to find out it was generated by an algorithm.
If AI can fake not just isolated content but entire networks of human feedback, we risk a kind of erosion of trust in digital communities. We might approach any content with skepticism: “Was this written by a person, or is this just a very good simulation?” The value of authenticity could soar, or conversely, people might become indifferent, caring only about usefulness. It’s a strange future to contemplate.
In the here and now, the infiltration of AI into crowd work is a wake-up call. It’s telling us that AI is no longer confined to shiny tech demos or sci-fi scenarios; it has quietly seeped into the mundane, day-to-day tasks of gig workers. It’s being used not by rogue supervillains or giant corporations alone, but by regular folks trying to make ends meet or save time.
In a sense, AI has become the invisible coworker in many workplaces. This realization should inspire both awe and caution. Awe, because it showcases how far AI’s capabilities have come – a language model can slip into a human role undetected and produce results good enough to fool us. Caution, because it reminds us that technology’s progress often outpaces our social agreements and oversight mechanisms.
As we move forward, we’ll likely develop new norms and tools to address this. Perhaps crowd platforms will evolve into collaborative human-AI marketplaces, where tasks are explicitly divided or shared between a person and an AI, and everyone knows who did what. Or perhaps entirely new verification methods (like cryptographic watermarks in AI output, or real-time video verification of work) will be employed to ensure a human is truly generating the content.
Each solution comes with trade-offs in cost, privacy, or convenience. We will have to ask: How much do we value preserving a zone of purely human-generated data? It might become a scarce resource, something to be protected like an endangered species in the data ecosystem.
In the end, the story of the next-gen Turk is more than a quirky anecdote about Turkers and chatbots. It’s a microcosm of the changing relationship between human labor and AI. We’re witnessing a new form of synergy and tension: humans and AIs intermingling in the production of knowledge and information.
This can lead to incredible productivity and perhaps free humans from drudgery – after all, who loves summarizing boring text for pennies? – but it also forces us to redefine what authentic human contribution means. As the boundaries blur, we might find ourselves reasserting the importance of that human element, precisely because it’s no longer a given.
The next time we read an online comment, a survey result, or even a heartfelt story that was supposedly crowdsourced, we might pause and wonder: Was there a ghost in this machine? It’s a question that, for now, doesn’t have an easy answer. But asking it drives us to ensure that we don’t lose sight of why we wanted human feedback in the first place.
In a world where AIs can mimic us so well, rediscovering and safeguarding the truly human – our creativity, our diversity, our unpredictability – becomes more important than ever. The challenge ahead is learning how to tell when our data is truly human, and deciding what to do when it’s not. The illusion of the Mechanical Turk fooled people for decades; its modern counterpart invites us not to be fooled again, but rather to confront how we coexist with our own creations. In that confrontation, we just might learn something new about ourselves.