Lipsh-Sokolik developed an algorithm that combines physics-based protein design calculations with a new machine learning model. The algorithm broke each of the variant sequences of xylan-breaking enzymes into several fragments and then introduced dozens of mutations into those pieces, all in ways that maximized the potential compatibility of the different bits. It then assembled the fragments into different combinations and selected a million sequences encoding enzymes that were deemed to be stable.
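The assemble-and-select idea described above can be illustrated with a toy sketch. This is not the team's algorithm: the fragment strings, the `toy_stability_score` function and the `assemble_and_select` helper are all hypothetical stand-ins, and the scoring rule is arbitrary, where the real method relies on physics-based calculations and a machine learning model.

```python
from itertools import product
import heapq

# Hypothetical fragment variants for three consecutive segments of an enzyme.
# Real fragments are far longer and carry designed mutations.
FRAGMENTS = [
    ["MKT", "MRT", "MKS"],
    ["GLV", "GIV"],
    ["WEE", "WDE", "WEQ"],
]

def toy_stability_score(sequence: str) -> int:
    """Arbitrary placeholder score (assumption): counts K and E residues."""
    return sequence.count("K") + sequence.count("E")

def assemble_and_select(fragments, keep):
    """Join one variant per segment in every combination, keep the top scorers."""
    candidates = ("".join(parts) for parts in product(*fragments))
    return heapq.nlargest(keep, candidates, key=toy_stability_score)

# 3 * 2 * 3 = 18 possible combinations; keep the 5 highest-scoring ones.
library = assemble_and_select(FRAGMENTS, keep=5)
```

Even in this toy, the number of combinations is the product of the number of variants per segment, which is why a modest set of fragments can yield a library of a million designs.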
The next step for Lipsh-Sokolik and colleagues was to synthesize a million actual enzymes from these computer models and test them in the lab. To their surprise, 3,000 were confirmed to be active. “The first time we looked at the experimental results, we were amazed,” Fleishman says. “The 0.3 percent success rate is not high, but the sheer number of different active enzymes we got was staggering. In typical protein design and engineering studies you see maybe a dozen active enzymes.”
Armed with an extensive repertoire of enzymes, the researchers then asked a key question that interests protein researchers: What molecular features distinguish active enzymes from inactive ones?
Using machine learning tools, Lipsh-Sokolik examined about a hundred features that characterize enzymes and used the ten most promising ones to create an activity predictor. When she incorporated this activity predictor into her algorithm and repeated the design experiment with the xylan-breaking enzymes, this second-generation repertoire had as many as 9,000 enzymes that broke down xylan and another 3,000 that could break down cellulose, adding up to a total of 12,000 active enzymes. This was a tenfold increase in success rate over the initial experiment, and an unparalleled feat in the history of protein design: The team managed, in a single experiment, to design more potentially active enzymes than standard methods could produce in a decade.
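To give a feel for what "picking the ten most promising features and building a predictor" can look like, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the data are randomly generated, the features are ranked by a simple difference in means, and the predictor is a nearest-centroid rule. The article does not say which machine learning tools or features the researchers actually used.

```python
import random
from statistics import mean

random.seed(0)
N_FEATURES, N_INFORMATIVE, TOP_K = 100, 10, 10

def make_design(active):
    """Synthetic design: only the first N_INFORMATIVE features differ by activity."""
    return [
        random.gauss(1.0 if (active and j < N_INFORMATIVE) else 0.0, 1.0)
        for j in range(N_FEATURES)
    ]

# Synthetic labels and feature vectors (assumption: balanced classes).
labels = [i % 2 == 0 for i in range(400)]
designs = [make_design(a) for a in labels]
active_rows = [d for d, a in zip(designs, labels) if a]
inactive_rows = [d for d, a in zip(designs, labels) if not a]

def rank_features(top_k):
    """Rank features by the gap between active and inactive means; keep top_k."""
    gaps = [
        (abs(mean(r[j] for r in active_rows) - mean(r[j] for r in inactive_rows)), j)
        for j in range(N_FEATURES)
    ]
    return [j for _, j in sorted(gaps, reverse=True)[:top_k]]

selected = rank_features(TOP_K)

# Nearest-centroid predictor restricted to the selected features.
active_centroid = [mean(r[j] for r in active_rows) for j in selected]
inactive_centroid = [mean(r[j] for r in inactive_rows) for j in selected]

def predict_active(design):
    """Return True if the design lies closer to the active centroid."""
    def sq_dist(centroid):
        return sum((design[j] - c) ** 2 for j, c in zip(selected, centroid))
    return sq_dist(active_centroid) < sq_dist(inactive_centroid)

accuracy = mean(predict_active(d) == a for d, a in zip(designs, labels))
```

In a design pipeline, a predictor of this kind would act as an extra filter, so that only candidates predicted to be active are carried forward to synthesis.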