Hello and welcome to Eye on AI. In today’s edition: an exclusive on a new AI research platform; OpenAI kicks off 12 days of launches and demos; Google DeepMind’s new weather model outperforms existing systems; Spotify taps NotebookLM; and medical note-taking company Abridge scores another big health care contract.
Those watching the swarm of generative AI models and product launches this week can add another to the list: Corpora.ai.
The new platform, launching in a limited capacity today, is a “research engine” that scours academic papers, news articles, patents, and any other information available freely on the internet to create detailed research documents in response to user prompts in seconds. After a user inputs a topic, Corpora.ai creates an initial summary, which users can then request be expanded into a four- or eight-page report complete with citations—often, several hundred of them. Founder Mel Morris, an English entrepreneur and early backer of Candy Crush, says he’s not offering search like Google, or even a new AI-enabled search tool such as Perplexity, but instead aims to provide much more depth on a particular subject.
“It's not going to help you find the cheapest place to buy a TV. But it will help you understand a topic you know something about or nothing about,” Morris, who’s funded the company so far, told Eye on AI.
This makes Corpora.ai the latest in an emerging crop of AI tools aimed at research, including Elicit, Consensus, Scite, and ResearchRabbit. I got an early look at the new platform and found that while it offers something different from many of the popular generative AI tools people are currently using, it still faces many of the same challenges.
How it works
The Corpora.ai model was built by the company from scratch and is yet another tool using retrieval-augmented generation (RAG), an approach that’s swept the AI industry. After a user submits a query—for example, “bird watching in the New York Hudson Valley,” which is one I tried—it first deconstructs the prompt to understand what the topic or question involves and then breaks it into parts, Morris said.
Essentially, it creates an outline of what high-level information a research paper on bird watching in the Hudson Valley should cover, creates a bunch of related search terms, and starts making its way through Corpora.ai’s dataset to collect relevant information. Lastly, the platform uses a generative AI model to summarize the information collected, creating the text for the report. The four-page bird watching report I generated started with an introduction and had chapters on prime viewing locations, tips for identifying birds in the area, and conservation efforts.
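The outline-retrieve-summarize flow Morris describes can be sketched in a few lines of Python. This is a hypothetical illustration, not Corpora.ai’s actual implementation: the hardcoded outline, the toy corpus, and the keyword matching are stand-ins for what a real system would do with a language model and a full search index.

```python
# A minimal sketch of the retrieval-augmented generation (RAG) flow described
# above. All function names, the outline logic, and the corpus are illustrative
# assumptions; a production system would use an LLM for the outline and
# summarization steps and a proper search index for retrieval.

def build_outline(topic: str) -> list[str]:
    """Break the topic into sub-sections (stand-in for an LLM outlining step)."""
    return [f"{topic}: overview", f"{topic}: locations", f"{topic}: tips"]

def retrieve(query: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    """Naive keyword retrieval: return (source_id, passage) pairs sharing words with the query."""
    words = set(query.lower().split())
    return [(sid, text) for sid, text in corpus.items()
            if words & set(text.lower().split())]

def summarize(section: str, passages: list[tuple[str, str]]) -> str:
    """Stand-in for generative summarization: stitch passages together with citations."""
    body = " ".join(f"{text} [{sid}]" for sid, text in passages)
    return f"## {section}\n{body or '(no sources found)'}"

def generate_report(topic: str, corpus: dict[str, str]) -> str:
    """Outline the topic, retrieve sources per section, and summarize each one."""
    return "\n\n".join(summarize(s, retrieve(s, corpus)) for s in build_outline(topic))

# Toy corpus of two "sources"
corpus = {
    "src1": "Hudson Valley overview of birding hotspots.",
    "src2": "Tips for identifying warblers by song.",
}
print(generate_report("bird watching", corpus))
```

Each section of the resulting report cites only the sources it drew from, mirroring the per-chapter citations in the reports described below.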
Is sourcing enough to build trust and ensure accuracy?
The key to a product like this is the quality of the information and whether the report generated can be trusted. For this, Corpora.ai is relying on citations and an additional scoring system that indicates how much of the information was “extracted” directly from the source material as opposed to being written by the model. The sourcing is extensive—the bird watching report contained 31 sources, and an eight-page report I created about milestones in the history of AI development credited 323 sources. The sources are cited and linked throughout the text, as well as at the end of each chapter and at the end of the report as a whole.
The links direct you to the source material but not to the specific passages containing the relevant information, so you’re still on the hook for scouring the original source if you want to verify specific facts or figures. The source material I’ve seen so far mostly looks reputable, but it’s important to note opinion articles are included, so opinions may be presented as unbiased information.
There’s also the fact that at the final point in the process, a model is still generating the final text. It didn’t take me long to find differences between text in a Corpora.ai report and the text it cited. For example, my report about milestones in AI development at one point stated that “the integration of AI also prompted shifts in the employment landscape, with a significant percentage of jobs affected either through augmentation or replacement.” The source material actually states that nearly 40% of jobs worldwide face being impacted by AI—a projection of the future versus a statement that this impact has already occurred. Yet, the Corpora score for this chapter was 100%, indicating it was extracted directly rather than written by the model. While this kind of error has been observed in answers from other generative AI tools like Perplexity and Google’s AI summaries, it’s a fairly serious failing. And it’s not encouraging that it occurred on the very first fact I checked.
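Corpora.ai hasn’t published how its extraction score is computed, but one plausible way to approximate such a score is to check what fraction of a report’s sentences appear near-verbatim in any source. The sketch below, using Python’s standard-library `difflib`, is an assumption about the general idea, not the company’s method; the 40% jobs example shows why a paraphrase can still score as “extracted” under loose thresholds.

```python
# A hypothetical approximation of an "extraction" score: the fraction of report
# sentences that have high verbatim overlap with some source passage. This is
# an illustrative guess at the technique, not Corpora.ai's actual algorithm.
from difflib import SequenceMatcher

def extraction_score(report: str, sources: list[str]) -> float:
    """Return the fraction of report sentences mostly matched verbatim in a source."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    if not sentences:
        return 0.0
    extracted = 0
    for sent in sentences:
        # Longest common substring between the sentence and each source
        best = max(
            SequenceMatcher(None, sent.lower(), src.lower())
            .find_longest_match(0, len(sent), 0, len(src))
            .size
            for src in sources
        )
        # Count the sentence as "extracted" if most of it appears verbatim
        if best / len(sent) >= 0.8:
            extracted += 1
    return extracted / len(sentences)

sources = ["Nearly 40% of jobs worldwide face being impacted by AI."]
report = ("Nearly 40% of jobs worldwide face being impacted by AI. "
          "AI changed everything overnight.")
print(extraction_score(report, sources))  # first sentence matches, second doesn't
```

Even a score like this says nothing about whether the model preserved the source’s meaning, which is exactly the gap the jobs example exposes.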
Corpora.ai vs. ChatGPT vs. Wikipedia vs. humans
The bird watching report Corpora.ai created was informative and decently pleasant to read. But I also wanted to see how the platform would fare on more intricate and specific research quests.
My request for a timeline of the major milestones in AI development served as the perfect topic for comparison, since it’s one I know well and thus could easily evaluate for thoroughness and accuracy. Corpora.ai didn’t structure its report as a timeline as requested (OK, maybe timelines aren’t the platform’s strong suit). But it also offered a lot of superfluous information outside the scope of my request while omitting a significant number of the most pivotal events in AI development, such as AlphaGo’s victory over the human Go champion and the publication of the 2017 “Attention Is All You Need” paper. Responding to the same prompt, ChatGPT generated a timeline, but it was even less complete and lacked detail, even after I revised my prompt several times. I checked Wikipedia, and it had a handy timeline, but it was far too granular and didn’t contextualize the information. Last but not least, I Googled “timeline of AI development.” The first link was an article from TechTarget, written by a human. It hands-down fulfilled my request the best. (Maybe search isn’t dead after all.)
Lastly, since Corpora.ai is focused on deep research, I wanted to give it a chance to shine in this department. So I prompted it to research a high-level idea related to how technology impacts society that I’ve been kicking around for a while but have found difficult to research through traditional means. The platform seemed to understand my prompt, but the information delivered wasn’t any better (or worse) than the other methods I’ve used to research this topic. The report did, however, repeat itself often, and at times felt like just a collection of random facts strung together. In those moments, the way the model works (breaking a topic into parts, extracting information from various sources, and then combining and summarizing it) was palpable.
New model, same problems
From my early look at Corpora.ai, I can say it definitely adds something new and interesting to the landscape of AI tools. At the same time, it faces—and poses—many of the same problems as other generative AI products.
The text produced still feels slightly disjointed and soulless. It can’t be entirely trusted. Morris says the information is “extracted” from the source materials, which could draw the same copyright concerns affecting other generative AI platforms (he says the company is interested in revenue-sharing deals down the line). Also like other generative AI platforms, Corpora.ai relies on the availability of high-quality information. If Corpora.ai and tools like it succeed, eliminating the need for users to ever actually go to news sites or directly interact with the sources providing information that feeds the tool, what will happen to the business models that currently sustain those sources?
And with that, here’s more AI news.
Sage Lazzaro
sage.lazzaro@consultant.fortune.com
sagelazzaro.com