In a few years,
biologists will complete the momentous task of reading the entire
human genome, the sequence of more than three billion symbols (chemical
bases) that determine our biological natures. "That is when
the real work will begin," says Isidore Rigoutsos, Manager,
Computational Biology Center, IBM.
The genetic code is written in an alphabet of four symbols. It is a program
that directs the construction of proteins, the truly important molecules
of life. And that has significant implications for the pharmaceutical
industry. "When new drugs are developed, what they are targeting,
with few exceptions, are proteins," says Barry Robson, IBM
Distinguished Engineer and strategic adviser to the Computational
Biology Center.
The crucial fact about proteins is their shapes. Their nooks and crannies fit
into one another like keys into locks, controlling the whole range
of cellular processes. Every protein consists of some combination
of the 20 different kinds of amino acids. But identifying the purpose
of each protein sequence is a formidable task.
To gain this
knowledge, researchers take advantage of the fact that evolution
is parsimonious, using the same structures over and over. By looking
for proteins whose amino-acid sequence is homologous to that of
a protein whose structure is already known, scientists can make
educated guesses about the unknown structure. One of the latest
and most promising techniques for finding patterns came about as
a result of an accident.
"I fell off my bicycle and broke my back," says Rigoutsos. "Because
I was in bed for three months, I had a lot of time to read."
His review of the literature revealed that people had been trying
to solve the problem of finding recurrent patterns in the structures
of proteins or DNA by attempting to align sequences with one another.
If several sequences matched around a location, scientists would
take this as evidence for a pattern. Rigoutsos wondered if he could
turn the process around and find patterns directly and then use
the patterns to align sequences. He and Aris Floratos, another member
of IBM’s bioinformatics and pattern discovery group, devised a powerful
algorithm they dubbed Teiresias, after the blind seer of Greek mythology.
Teiresias finds patterns while making very few assumptions about what it is looking
for. It has found uses outside biology in such areas as identifying
attacks on computer systems and analyzing literary style. Using
Teiresias, Rigoutsos and Floratos have compiled a ‘Bio-Dictionary’
that may contain the key to understanding the language of the genes.
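
To make the idea of finding patterns directly a little more concrete, here is a minimal sketch in Python. It is not Teiresias, whose inner workings are not described here; it is a much simpler stand-in that counts fixed-length fragments and keeps those shared by several sequences. The function name, its parameters and the toy sequences are all invented for illustration.

```python
from collections import defaultdict


def recurring_patterns(sequences, length=4, min_support=2):
    """Return fixed-length substrings found in at least `min_support`
    distinct sequences, mapped to the indices of those sequences.

    Hypothetical helper for illustration only; a naive stand-in,
    not the Teiresias algorithm.
    """
    seen_in = defaultdict(set)
    for index, seq in enumerate(sequences):
        # Slide a window of the chosen length along the sequence.
        for start in range(len(seq) - length + 1):
            seen_in[seq[start:start + length]].add(index)
    # Keep only fragments shared by enough different sequences.
    return {
        pattern: sorted(indices)
        for pattern, indices in seen_in.items()
        if len(indices) >= min_support
    }


# Made-up strings over the amino-acid alphabet, not real proteins.
toy_sequences = ["MKTAYIAKQR", "GGTAYIAPLW", "TAYIMMKQRS"]
patterns = recurring_patterns(toy_sequences, length=4, min_support=3)
print(patterns)
```

Note that no alignment takes place: the shared fragment is found first, and its positions in each sequence could then be used to line the sequences up, which is the reversal described above.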
"Take a copy of the Wall Street Journal and remove all the spaces,"
Rigoutsos suggests to illustrate how the Bio-Dictionary was assembled.
"You know the paragraph, you know the symbols. The task is
to find the words, but you do not know the symbols. We have done
the same thing for proteins." The ‘words’ they have discovered
constitute the basic vocabulary of proteins. Like human words, they
link together according to rules to form sentences, that is, proteins.
The IBM researchers have begun to decipher the words in their Bio-Dictionary,
to interpret what structural and functional features they represent.
"The analogy to natural language appears to be deep,"
One of the biggest
riddles Deep Computing could answer is how a strand of amino acids
folds into a protein. "Nobody has yet simulated that process,"
Robson says. "It’s a deep, fundamental problem. Until it’s
solved, you can’t design interesting new proteins from scratch.
More important still, if we can crack this problem from first principles,
we can design new polymers and materials, and ultimately create
In the end, the researchers might be able to refine their algorithms
enough to predict the folding of any protein structure, not just
natural ones. This not only holds the promise of engineering new
drugs. It could also allow the creation of unique, self-assembling
molecular structures that could realize the dream of building molecular-scale
machines.
The idea of nanotechnologists was to build nanoscale robots called ‘assemblers’
that would construct molecular machinery. "Well, nature doesn’t
work by making these robots," Robson points out. Instead, it
specifies the linear sequence of amino acids in a protein and lets
the laws of physics do the building for free. "The problem
is how to do that on a general basis," Califano says. "If
I want to build a hammer, how do I make my protein fold into a hammer?
You can’t solve that problem unless you understand how proteins
fold." He and his colleagues are betting that the hammer of
Deep Computing will enable scientists to do just that.
In fact, Deep
Computing offers not just a hammer but an entire toolbox of techniques,
technologies and philosophies. By joining the raw computing power
and algorithmic virtuosity that were once the province of high-end
scientific computing with the vast oceans of data typical of business
computing, IBM scientists are forging a new discipline capable of
solving real-world problems in all their complexity and depth.
Excerpted from: Think Research, 1999