DatologyAI is building tech to automatically curate AI training datasets

chatbot training dataset

The issue, as several lawsuits argue, revolves around whether the bots make fair use of the material by transforming it into something new, or whether they just memorize it whole and regurgitate it without citation or permission. Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that is hard for humans to fathom but still just a drop of what is being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos. “A lot of the data that’s been used in AI training has not come from original sources,” said the data initiative’s executive director, Greg Leppert, who is also chief technologist at Harvard’s Berkman Klein Center for Internet & Society. This book collection goes “all the way back to the physical copy that was scanned by the institutions that actually collected those items,” he said.



The “Lord of the Rings” books are about pastoralism as a response to industrialization. “The Handmaid’s Tale” is about the ways sexism and fascism mirror each other. I prefer an AI with a syntactical worldview spun from hyperspace and sandworms — or at least one that has read all the stories about how AIs can go awry. That said, I’d sure like to see a more diverse canon represented. Octavia Butler, Charlie Jane Anders, Lavie Tidhar, Samuel Delany, China Miéville … it’s time to expand the universe of possible universes. The question of what’s on GPT-4’s reading list is more than academic.

Many bot companies assume that you know what your customers want and that, if you don’t, it is your problem to figure it out. However, a tool that analyzes customer questions in real time is a key piece of any bot solution. The analysis will tell you what your customers actually want to do with your bot and will let you focus your training efforts where they pay the biggest dividends. This is the first step in determining bot priorities relative to customer needs, and it is the most important part of any bot training framework. Many bot companies pitch it as the entire training program, when it is really only the first step of a real training plan.
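As a rough illustration of what analyzing customer questions might look like, here is a minimal Python sketch that tallies incoming questions by intent so the busiest intents can be trained first. The intent names, the keyword lists and the function names are all hypothetical, and a real bot platform would use a trained classifier rather than substring matching.

```python
from collections import Counter

# Hypothetical keyword-to-intent map, invented for illustration only.
INTENT_KEYWORDS = {
    "billing": ("bill", "charge", "refund"),
    "order_status": ("order", "delivery", "tracking"),
    "outage": ("outage", "down", "hurricane", "wildfire"),
}


def classify(question: str) -> str:
    """Return the first intent whose keyword appears in the question, else 'unknown'."""
    text = question.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        # Naive substring match: "charge" also matches "charged".
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"


def training_priorities(questions):
    """Count questions per intent and return (intent, count) pairs, busiest first."""
    counts = Counter(classify(question) for question in questions)
    return counts.most_common()
```

A large "unknown" bucket in the result is a signal in its own right: it suggests customers are asking for things the bot was never trained to handle.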

But some — for governance and compliance reasons or otherwise — are building models on custom data from scratch, and spending tens of thousands to millions of dollars in compute in order to train and run them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.

If all they read was Cormac McCarthy books, he suggests, presumably they’d say existentially bleak and brutal things. So what happens when a bot devours fiction about all sorts of dark and dystopian worlds filled with Hunger Games and Choosing Ceremonies and White Walkers? “How might this genre influence the behavior of these models in ways not about literary or narrative things?” Bamman says.

“That ability to rapidly build off of what other people have built was a really important component of the whole web. That doesn’t exist for bots yet,” he continues, which is why he founded Seed Vault — a decentralized marketplace for conversational user interface development. In addition to the modern public-school canon — Charles Dickens and Jack London, Frankenstein and Dracula — there are a few fun outliers. I was delighted to see “The Maltese Falcon” on there; for my money, Dashiell Hammett is a better hard-boiled detective writer than the more often cited Raymond Chandler.

ChatGPT’s secret reading list

Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works.


What is the proper name that fills in the MASK token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). One way to answer the question is to look for information that could have come from only one place. When prompted, for example, a GPT-3 writing aid called Sudowrite recognizes the specific sexual practices of a genre of fan-fiction writing called the Omegaverse. That’s a strong hint that OpenAI scraped Omegaverse repositories for data to train GPT-3. The chatbot’s GPT-4 version was amazingly accurate about the Bennet family tree.
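The masked-name question quoted above can be turned into a small test harness. The Python sketch below is one assumed way to do it, not the researchers' actual code: it hides a character name in a passage, wraps the passage in the question wording from the text, and scores a model's guesses. The function names and the "Passage:" label are hypothetical, and the call to the model itself is left out.

```python
import re

# Question wording taken from the text above; the "Passage:" label is an assumption.
PROMPT_TEMPLATE = (
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word).\n\n"
    "Passage: {passage}"
)


def make_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` with [MASK]."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    if len(pattern.findall(passage)) != 1:
        # Zero or several occurrences would make the test ambiguous.
        raise ValueError("passage must contain the name exactly once")
    return pattern.sub("[MASK]", passage)


def build_prompt(passage: str, name: str) -> str:
    """Return the full question to send to a model (the model call is not shown)."""
    return PROMPT_TEMPLATE.format(passage=make_cloze(passage, name))


def cloze_accuracy(guesses, answers) -> float:
    """Fraction of guesses that match the hidden names, ignoring case and whitespace."""
    if len(guesses) != len(answers):
        raise ValueError("guesses and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(g.strip().lower() == a.lower() for g, a in zip(guesses, answers))
    return hits / len(answers)
```

A high score on passages from one book, where the name cannot be guessed from context alone, is the kind of evidence that points to the book having been in the training data.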

The Silmarillion. Really?

The Supreme Court let stand lower court rulings that rejected copyright infringement claims. The case adds to a growing spate of incidents, from spreading misinformation to generating misleading, offensive or harmful outputs, and underscores the need for regulation and ethical guardrails in the dizzying race for AI-powered solutions. It’s an impressive list of AI luminaries, to say the least, and suggests that there might just be something to Morcos’ claims. Biases emerge from prejudicial patterns concealed in large datasets, like pictures of mostly white CEOs in an image classification set.


The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians. Wysa co-founder Aggarwal emphasizes the importance of creating a safe and trustworthy space for users, particularly in sensitive domains like mental health. “Each time an LLM generates a word, there is potential for error, and these errors auto-regress or compound, so when it gets it wrong, it doubles down on that error exponentially,” she says.

If you are delivering the wrong answers more than 3 percent of the time, your system should be taken back to the drawing board. The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.

“Hurricane” and “wildfire” might not be at the top of any cell phone provider’s training list, but had Verizon’s chatbot not recognized the words, the company would have seemed unfeeling and out of touch. Even if you have a simple pizza-ordering bot, you’re going to have to continuously learn from your customers how they want to order and add new platform support and new products. If you have a more complex business and are using your bot for customer service, you should plan to invest considerable effort in ongoing training. “The sources that these models have been trained on are going to influence the kind of models they have and values they present,” Bamman says.

Considering IVR is probably the most-hated piece of technology invented in the last 50 years, replicating that process is probably not a great idea. So you’re thinking of implementing a chatbot, like every other company on the planet. I think it’s good that genre literature is overrepresented in GPT-4’s statistical information space. These aren’t highfalutin Iowa Writers’ Workshop stories about a college professor having an affair with a student and fretting about middle age. Genre — sci-fi, mystery, romance, horror — is, broadly speaking, more interesting, partially because these books have plots where things actually happen. Bamman’s GPT-4 list is a Borgesian library of episodic connections, cliffhangers, third-act complications, and characters taking arms against seas of troubles (and whales).

by Philippe

January 30, 2025

