Lead Software Testing Expert and CNCF Kubestronaut Iuliia Kozlova on Why a Slop Detector You Cannot Run Is a Slop Detector You Cannot Trust

Iuliia Kozlova explains why AI slop detection tools must be reproducible, runnable, well-documented, and backed by honest test results.

Joshua White

June 30, 2026

A tool that claims to measure quality is making a quality claim about itself. That is the trap most engineers walk into without noticing. They build a detector that scores the honesty of a pull request or the substance of a support article, and they never stop to ask whether a stranger can install it, run it, and confirm with their own hands that it does what the README promises. From a testing chair, that gap is the whole story. A claim you cannot reproduce is not a weak result. It is no result at all.

Iuliia Kozlova, a Lead Software Testing Expert and CNCF Kubestronaut, spends her working life inside that gap. Her job is not to admire an idea but to verify it: to set the thing up from scratch, follow the documented path, and watch whether the behavior matches the promise. She judged the AI Slop Scan Hackathon with that same discipline, and her refusal to score what she could not validate made her one of the most demanding readers on the panel.

AI Slop Scan, organized by Hackathon Raptors, asked 37 teams to build tools that detect, measure, or mitigate AI-generated low-quality content across code review, documentation, marketplace listings, and general writing. The brief carried a hidden second test. A tool that promises to separate signal from noise is itself a claim that must withstand scrutiny, which means the event was never solely about detection accuracy. It was about whether the work could be trusted by someone who had not built it. For an engineer whose career is built on reproducibility, it was close to a perfect assignment.

The tools that earned trust by being verifiable

The submissions Kozlova scored highest shared a single property that had nothing to do with cleverness: she could actually run them, and they behaved the way the team said they would.

Her top marks in the batch went to The Slop Slayers and their slop-shrink scanner, and her evaluation reads like a verification report that passed on every line. "A fully operational and essential tool," she wrote. "It works with URLs from different domains and with text in different languages." She noted that the product reached past the obvious goal: "an interesting product approach, it detects not just AI content but evaluates the veracity of the information." And then the part that, to her, sealed the score: "Excellent, detailed documentation, a demo, and a presentation are also included." Working tool, multiple inputs, a runnable demo, and documentation that let her confirm the claims herself. Nothing was asked of her on faith.

She gave similar credit to c0nfig, whose Slop Scan project she found genuinely substantial. "The project has high practical value, truly analyzes projects from multiple perspectives, and works," she wrote, and she was careful to separate that from novelty: "While not innovative, it is well-executed." What lifted it was that the team had made the work checkable in minutes rather than hours. "Comprehensive documentation is provided, including a detailed architecture description, a presentation, and a demo, allowing for a quick evaluation." That last phrase is a tester's compliment. The team had respected her time and removed the friction between her and the evidence.

Signal-OSS earned the same kind of nod for engineering that held up under inspection. "The project has clear practical value and tackles a real issue," she wrote. She was honest that the core idea was not unprecedented, "the idea itself is not entirely new," but she rewarded the execution and, again, the verifiability: "the solution is well implemented, especially the integration with git. The documentation and demo are good." Across all three of her highest scores, the pattern is identical. The idea mattered, but what converted a good idea into a high score was that she could stand it up and watch it work.

The verification wall

If the strongest entries handed Kozlova the evidence, the weakest ones put a wall between her and it. And as a tester, she treats that wall as a finding, not an inconvenience.

The clearest example was Mnkwar's Real Signal. She thought the product was good. "Not only is this product useful, but it's also original, combining the features of several similar products," she wrote, and she praised the pitch: "It has a good presentation." Then came the sentence that capped the score, written in the careful language of someone documenting a failed setup. "The project appears to be implemented, but I was not able to validate it fully because the setup process is not documented and there are configuration and dependency issues that prevent a straightforward setup." To a casual reviewer that is a footnote. To her it is the headline. An original tool she could not get running is, in the only test that counts, an unverified tool, and the missing demo meant there was no fallback path to the evidence either.

DINooo's Grounded failed an even more basic check: it did not do what it claimed when she ran it. "The project is useful, but not innovative," she allowed, before laying out the defects plainly. "Failed to fetch when trying to start crawling different URL sources. The scanner also doesn't work in text mode." Two of the tool's advertised capabilities did not function on contact. She added the structural gaps that left her nothing to fall back on: "No project justification, presentation, or demo, minimal documentation." When the live behavior breaks and there is no demo and no documentation to consult, there is no version of the product a tester can credit. The claim and the reality had separated, and she scored the reality.

The missing demo, again and again

The single most repeated note across Kozlova's batch was not about algorithms. It was about the absence of a demo, and she treated it as a serious omission every time, because a demo is the one artifact that lets a stranger verify a claim without trusting the author.

diffsniff's gatekeeper drew her shortest and bluntest review: "Limited functionality, very simple ruleset for the analyzer. No presentation or demo, minimal documentation." With nothing to run and little to read, there was nothing to validate, and the thin ruleset gave her no reason to extend the benefit of the doubt.

Error909's citationghost was a more interesting case, because she liked the idea and still could not get past the gap. "The product is useful, though not for a wide range of users. It's an idea with little competition given its combination of features and execution." She even praised the internals: "Good documentation with a well-described architecture." And then the recurring deduction: "no presentation or demo." A well-documented architecture tells her how the team intended the tool to behave. A demo would have told her how it actually behaved. She has learned not to confuse the two.

Even bisht's reviewradar, which she clearly admired, lost ground at the same spot. "A very good idea and excellent criteria for filtering out AI slop. The product can be useful for a wide range of people. The documentation is detailed, there's a widget," she wrote, and then: "but the presentation is unavailable." The pattern is consistent enough to be a rule. Good ideas and good documentation set the ceiling. A runnable demo is what lets a tester confirm the floor, and without it the score stays capped no matter how strong the concept.

Accuracy is a measurement, not an assertion

Where teams did give her something to verify, Kozlova went straight to the numbers, and she refused to let a confident presentation stand in for a confident result.

Beyond Horizon's PRISM is the cleanest illustration. She gave the team real credit for transparency: "The product really works, offers a good perspective on solving an existing problem, and presents test results very well." That is rare praise from her, because most teams never showed test results at all. But she would not round the figures up to match the polish. "Though the accuracy figures are rather modest," she added, in the same breath. The presentation was strong and the honesty was stronger, and she scored the actual numbers rather than the way they were framed. A tool that fights low-quality output, in her reading, has no license to oversell its own.

Gajadhar's paperlens drew the inverse note. "Good technical execution," she wrote, but the output did not earn its keep: "rather uninformative conclusions and recommendations after processing the input data." She added a usability flag that a pure code reviewer might have skipped: "the interface may not be entirely user-friendly or familiar to the average user." The pipeline ran. The verdict it produced was not useful enough to act on. For her, a quality tool that processes input and returns a vague conclusion has not finished the job, however clean the engineering underneath.
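One way to keep accuracy figures honest is to compute them from a labeled evaluation set and report precision and recall alongside the headline number. The sketch below is illustrative only: `DetectorReport`, `evaluate_detector`, and the sample labels are hypothetical names and data, not taken from any submission or from Kozlova's scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorReport:
    """Raw confusion counts plus derived rates for a binary slop detector."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int

    @property
    def total(self) -> int:
        return self.true_pos + self.false_pos + self.true_neg + self.false_neg

    @property
    def accuracy(self) -> float:
        # Share of all items classified correctly.
        return (self.true_pos + self.true_neg) / self.total if self.total else 0.0

    @property
    def precision(self) -> float:
        # Of the items flagged as slop, how many really were slop.
        flagged = self.true_pos + self.false_pos
        return self.true_pos / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        # Of the items that really were slop, how many were flagged.
        actual = self.true_pos + self.false_neg
        return self.true_pos / actual if actual else 0.0


def evaluate_detector(predictions: list[bool], labels: list[bool]) -> DetectorReport:
    """Compare detector output with ground truth (True means 'slop')."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return DetectorReport(true_pos=tp, false_pos=fp, true_neg=tn, false_neg=fn)


# Hypothetical evaluation data, invented for illustration.
labels = [True, True, True, False, False, False, False, True]
predictions = [True, False, True, False, True, False, False, True]

report = evaluate_detector(predictions, labels)
print(f"n={report.total} accuracy={report.accuracy:.2f} "
      f"precision={report.precision:.2f} recall={report.recall:.2f}")
```

Printing the sample size and the raw counts next to the rates lets a reader judge how much weight the figures can bear, which is the point of stating modest numbers plainly.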

A tester's checklist for any tool that asks to be believed

Read across the batch and Kozlova's scores resolve into a short set of questions she would put to any tool whose entire purpose is to be trusted. They are less a marketing rubric than a verification protocol.

Can a stranger set it up from the documented path, with no undocumented steps and no dependency archaeology? An original tool that will not install is an unverified tool.

Does it actually do, on contact, what it claims to do, across the inputs it advertises? Failed fetches and broken modes are not edge cases when they sit on the headline feature.

Is there a runnable demo, not just a described one? A demo is how a tester confirms behavior without trusting the author, which is why its absence capped score after score in her batch.

Are there real test results, and are the numbers reported honestly rather than dressed up? Modest accuracy stated plainly beats impressive accuracy implied.

Does the output earn action? A tool that runs and then returns a vague conclusion has executed without delivering. Practical usefulness is measured at the recommendation, not the pipeline.
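The five questions above can be recorded as a simple pass or fail checklist. This is a minimal sketch under stated assumptions: the class, the field names, and the all-or-nothing `is_verified` rule are hypothetical, and they are not the rubric or weighting Kozlova used when scoring.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VerificationChecklist:
    """One boolean per checklist question (field names are hypothetical)."""
    installs_from_docs: bool           # clean setup from the documented path
    works_on_advertised_inputs: bool   # headline features function on contact
    has_runnable_demo: bool            # a demo that runs, not just a described one
    reports_honest_test_results: bool  # real results, stated plainly
    output_is_actionable: bool         # conclusions specific enough to act on


def failed_checks(checklist: VerificationChecklist) -> list[str]:
    """Return the names of the checks that did not pass, in checklist order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_verified(checklist: VerificationChecklist) -> bool:
    """Assumed rule for this sketch: every check must pass."""
    return not failed_checks(checklist)


# Hypothetical tool that works and is documented but ships without a demo.
example = VerificationChecklist(
    installs_from_docs=True,
    works_on_advertised_inputs=True,
    has_runnable_demo=False,
    reports_honest_test_results=True,
    output_is_actionable=True,
)

print(failed_checks(example))
print(is_verified(example))
```

Returning the list of failed checks, instead of a single score, keeps the finding explicit: a reviewer sees which piece of evidence was missing and does not have to infer it from a number.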

The verdict, and where this is going

For Kozlova, the lesson of AI Slop Scan is the one she applies to every release she signs off on. The hardest and most valuable work is rarely in the component everyone is looking at. It is in whether the thing can be reproduced by someone who did not write it, on a clean machine, following only what is written down. That is the difference between a result and a story about a result.

"A tool that judges quality is asking me to trust its judgment," she reflects. "So the first thing I do is try to verify it. If I cannot set it up, if the demo is missing, if the feature breaks when I run it, then I have nothing to trust, no matter how good the idea sounds. The teams that scored well were not always the most original. They were the ones I could stand up and watch work."

She sees the same standard becoming more important, not less, as generation gets cheaper and tools multiply. When anyone can produce a plausible-looking detector in a weekend, the scarce thing stops being the idea and becomes the proof that it holds. "Soon every problem will have ten tools that claim to solve it," she says. "The ones that survive will be the ones a careful person can verify in an afternoon: clean setup, a working demo, honest numbers. Reproducibility is not a nice-to-have for these tools. For anything that asks to be trusted, it is the product."

AI Slop Scan was organized by Hackathon Raptors, a Community Interest Company supporting innovation in software development. The event challenged 37 teams to build tools that detect, measure, and mitigate AI-generated low-quality content across code review, documentation, marketplace reviews, and general writing. Iuliia Kozlova, a Lead Software Testing Expert and CNCF Kubestronaut, served as a judge evaluating submissions for detection accuracy, practical usefulness, technical execution, innovation, and presentation.
