Bobby Samuels, the CEO and co-founder of Protege, used a June 4 InfoWorld essay to make a sharp claim: the next AI breakthrough will come not from bigger models but from better data. Protege is the data startup that raised sixty-five million dollars from Andreessen Horowitz, Footwork and CRV and works with most of the Magnificent Seven on training and evaluation, so Samuels has visibility into where frontier projects actually get stuck. His thesis is that AI advances are uneven across domains, with software engineering excelling while healthcare, customer support and complex reasoning lag, because the differentiator is no longer architecture or hardware but the availability and quality of domain-specific data.
Samuels calls this the data gap. Software engineering has standardized languages, robust documentation, public code review and massive structured digital records, which is why coding models look so capable. Healthcare data is fragmented and privacy constrained. Enterprise workflow data was never designed for AI training. Multilingual speech data varies wildly in quality. He lists three forces driving AI: models, chips and data. The first two attract thousands of researchers and billions in capex. The third, he argues, lacks any equivalent institutional focus.
He identifies three structural problems behind that underinvestment. The first is capacity. Few specialized teams build domain-specific datasets to a high standard, because talent flows toward models and hardware, where the prestige and the salaries are. The second is design. Dataset construction is a distinct discipline. It requires experimental design, domain knowledge and statistical validation, not just labeling throughput. The third is translation. The researchers who need data are not the people sourcing it, and nuance gets lost through procurement and vendor layers that treat one dataset as interchangeable with another if it matches a spec sheet. Annotation and RLHF services help with bounded tasks, but the frontier problems need datasets built from real human activity: complex, multimodal, sensitive and not AI-ready by default.
The piece is sharpest on benchmarks. Samuels reminds us that benchmarks cannot be created from the same data used for training, because that hands the answers to the model in advance. Yet a version of that happens inside many enterprises today: a team splits its labeled corpus, trains on most of it, evaluates on the rest, and reports a number that overstates real-world performance, because the held-out slice shares the sources, near duplicates and labeling habits of the training set. He proposes treating dataset construction as experimental design with documented, validated protocols, building benchmarks that reflect real-world complexity, and developing standardized quality metrics for datasets, analogous to credit scores in finance.
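As a rough illustration of what a contamination check can look like, here is a minimal Python sketch that flags evaluation examples whose normalized text also appears in the training set. It is our own example, not something from Samuels's essay: the function names and the sample sentences are hypothetical, and exact matching after normalization only catches the crudest leaks. Paraphrases and near duplicates need fuzzier methods.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text (hypothetical helper for this sketch)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def contaminated_examples(train_texts, eval_texts):
    """Return eval examples whose normalized form also appears in the training set."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_prints]


# Made-up sample data, for illustration only.
train = ["Where is my order?", "How do I return a blender?"]
evaluation = ["where is my order", "Can I combine loyalty points?"]

leaks = contaminated_examples(train, evaluation)
leak_rate = len(leaks) / len(evaluation)
print(f"{len(leaks)} of {len(evaluation)} eval examples overlap with training data")
```

A check like this is cheap enough to run on every dataset release, which is what makes it usable as a gate rather than an occasional audit.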
His solution at the industry level is an ecosystem of AI data labs, specialized research institutions that work on dataset contamination, factuality and groundedness, de-identification, international representation, bias mitigation and real-world benchmark design. The line that lands is direct: AI models have their research labs; AI chip builders have their fabrication plants; AI data needs institutions of equal seriousness and ambition.
For technology leaders this is not abstract. We see the same pattern inside enterprises. Most internal fine-tuning and RAG programs we encounter at bruno.digital have a strong model and platform story and a thin data story. Teams will benchmark Claude, Gemini, GPT and an open-source contender against each other for weeks, then evaluate them on a few hundred examples a junior analyst labeled in a sprint. The variance in the eval data is larger than the variance between the models. That is the gap Samuels is pointing at.
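To put a rough number on that, here is a small Python sketch that estimates the sampling noise in an accuracy score using the standard normal approximation for a proportion. The 80 percent accuracy and the sample sizes are illustrative assumptions, not client figures, and the calculation ignores label errors, which only widen the uncertainty.

```python
import math


def accuracy_margin(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for an accuracy estimate.

    Uses the normal approximation for a binomial proportion; it covers
    sampling noise only, not mistakes in the labels themselves.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Illustrative numbers: a model scoring 80% on eval sets of two different sizes.
margin_300 = accuracy_margin(0.80, 300)
margin_5000 = accuracy_margin(0.80, 5000)

print(f"300 examples:  +/- {margin_300 * 100:.1f} points")
print(f"5000 examples: +/- {margin_5000 * 100:.1f} points")
```

Under these assumptions a 300-example eval set carries a margin of roughly four and a half points either way, so two models that differ by two or three points cannot be told apart on it.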
The retail and grocery examples bring it home. Retail customer support assistants live or die on whether the training and evaluation data reflects the actual language customers use about deliveries, returns, loyalty points and product compatibility. A model that scores well on a generic helpdesk benchmark can collapse on real store-level transcripts. The fix is not a bigger model; it is a deliberately designed dataset built from real conversations with documented inclusion criteria, annotation standards and validation. Likewise, for category management or supply chain forecasting AI, the binding constraint is whether the historical operational data has been curated, deduplicated and joined with the right context, not which transformer variant sits underneath it.
The operator agenda is concrete. First, fund dataset engineering as a real discipline inside the platform team, with senior people who own protocols, inclusion criteria, annotation standards and validation. Second, separate training data from evaluation data with the same rigor with which we separate production from test environments, and treat contamination as a release blocker. Third, build a small set of internal benchmarks that look like the real work, not generic Q&A, and version them. Fourth, when buying labeled data or RLHF services, write contracts that specify experimental design, sampling frames and inter-annotator agreement, not just headcount and turnaround. Fifth, ask vendors that pitch domain models how their training data was sourced, what is in it and how their evaluation set was constructed, and weight the answer at least as heavily as benchmark scores.
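Inter-annotator agreement is the easiest of those contract terms to make measurable. One common statistic is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The Python sketch below is a minimal illustration: the labels are invented, and the choice of kappa over other agreement measures is our assumption, not a recommendation from the essay.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for six support tickets, for illustration only.
annotator_1 = ["refund", "refund", "delivery", "delivery", "loyalty", "refund"]
annotator_2 = ["refund", "delivery", "delivery", "delivery", "loyalty", "refund"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the annotators agree on five of six tickets, yet kappa comes out lower than that raw rate because some of the agreement would have happened by chance. A contract can name a minimum value on a sampled subset instead of leaving quality to headcount and turnaround.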
If Samuels is right, and the early signs from our own client work suggest he is, the next wave of competitive advantage in AI sits with whichever teams take dataset construction seriously enough to staff and govern it. The good news is that this is a tractable problem. The work is unglamorous, the headcount is smaller than a model team's, and the payoff shows up directly in model behavior on the workflows that pay our bills.


