Scientists have developed a new type of machine learning model that can understand and design genetic instructions.
The model, dubbed Evo, can predict the effects of genetic mutations and generate new DNA sequences — although those DNA sequences do not closely match the DNA of living organisms.
With time and training, however, Evo and similar models could help scientists understand the functions of various DNA and RNA sequences and mitigate disease, researchers wrote in a new study published Nov. 15 in the journal Science.
Evo is a type of artificial intelligence (AI) system called a large language model (LLM), similar to OpenAI's GPT-4 or Google's Gemini. Researchers and developers train LLMs on vast amounts of data from publicly available resources, like the internet. The LLMs look for patterns, such as common phrases or typical sentence structures, and use those patterns to supply words in a sentence one by one.
Unlike more common LLMs, Evo isn’t trained on words. Instead, it’s trained on the genomes of millions of microbes (archaea, bacteria and the viruses that infect them), but not on eukaryotic organisms like plants and animals. Each base pair, one of the basic chemical units that make up DNA, acts as a "word" in the model. Evo then uses the patterns it learned from those sequences of base pairs to predict how a strand of DNA will work, or to generate new genetic material.
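To make the "one base at a time" idea concrete, here is a minimal sketch in Python. It is not Evo's architecture or code: it swaps in a simple lookup table of counts for the neural network, and the function names (`tokenize`, `train_counts`, `predict_next`, `generate`) and the toy sequence are invented for illustration. It only shows the general principle of treating each base as a token and predicting the next one from the bases that came before.

```python
from collections import Counter, defaultdict

BASES = "ACGT"  # the four DNA bases; each one is a single "word" (token)


def tokenize(seq):
    """Turn a DNA string into integer tokens, one token per base."""
    return [BASES.index(base) for base in seq.upper()]


def train_counts(sequences, context=3):
    """Count which base follows each run of `context` bases.

    This count table is a stand-in (an assumption for this sketch)
    for the patterns a real language model learns during training.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = tokenize(seq)
        for i in range(context, len(tokens)):
            counts[tuple(tokens[i - context:i])][tokens[i]] += 1
    return counts


def predict_next(counts, prefix, context=3):
    """Return the most frequently seen next base, or None if the context is new."""
    ctx = tuple(tokenize(prefix)[-context:])
    seen = counts.get(ctx)
    if not seen:
        return None
    return BASES[seen.most_common(1)[0][0]]


def generate(counts, seed, length, context=3):
    """Extend `seed` one base at a time, up to `length` new bases."""
    out = seed.upper()
    for _ in range(length):
        nxt = predict_next(counts, out, context)
        if nxt is None:
            break
        out += nxt
    return out


# Toy "training genome" (made up for this example): a repeating ATG motif.
toy_counts = train_counts(["ATGATGATGATG"])
print(predict_next(toy_counts, "ATG"))   # A
print(generate(toy_counts, "ATG", 6))    # ATGATGATG
```

A real model like Evo replaces the count table with a neural network trained on millions of genomes and looks much further back than three bases, but the loop of reading the sequence so far and supplying the next base is the same basic idea.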
Other models have already used machine learning and even LLMs to examine genetic information. But so far they have been limited to specialized functions or hampered by high computational cost, the scientists wrote in the study. Evo, by contrast, uses a fast, high-resolution model to process long strings of information, allowing it to analyze patterns at the genome scale and to capture information about large-scale interactions that more specialized models might miss.
The authors tested Evo on a series of tasks. Evo predicted how genetic mutations would affect protein structures, performing comparably to models trained specifically for that task. It also generated one set of protein and RNA components that protected against viral infection in laboratory tests.
Evo even generated sequences of DNA the size of entire genomes, but that DNA wouldn’t necessarily keep something alive. Some of the genetic instructions resembled DNA in existing organisms. Others looked plausible at first glance but didn’t make sense upon closer inspection, much like an AI-generated image of a person with too many fingers. For example, many of the protein structures encoded in the Evo-generated DNA don’t match naturally occurring proteins.
"These samples represent a 'blurry image' of a genome that contains key characteristics but lacks the finer-grained details typical of natural genomes," the researchers wrote in the study.
The researchers also trained Evo only on microbial genomes, so predicting the effects of human genetic mutations is still out of its grasp. Critically, the team emphasized the need for safety and ethics guidelines to prevent tools like Evo from being misused as their performance improves. In particular, the team excluded the genomes of viruses that infect eukaryotic hosts from the training data.
"A proactive discussion involving the scientific community, security experts and policy-makers is imperative to prevent misuse and to promote effective strategies for mitigating existing and emerging threats," the researchers wrote.