Cut Out that Jazz

DQI Bureau

It’s happened to the IT industry before. A wave festooned with the
trimmings of ‘big, happening, the thing of the future’ sweeps it off its
feet, engulfing in the process the small and big companies in its expanse. And
the jargon that it brings in its wake triggers a new rush to clamber on to the
new segment, before the initial hoopla dies out. The cinders of the software
services era of the early 90s are still glowing. The hot ash from the dot-com
furnace is a bitter reminder. Banking, insurance and telecom are still crackling
away.

That the biotechnology furnace is hot is an old story. Indian IT companies
have already been rolling toward it in right earnest. And the wannabes, other IT
companies and professionals hovering at the fringe, are just waiting to sneak
in. But scan the reams of information on biotechnology and you find proteins,
cells and DNA, which you thought you’d left behind for good in Class X and
XII. So, where’s the IT in BT?

To begin with, we need to differentiate between biotechnology and
bioinformatics. Biotechnology employs molecular biology and genetics to create
improved agricultural products, food, animal feed, industrial materials and
medicines. It deals with the experimental techniques and instrumentation in
biology. Bioinformatics, on the other hand, is about IT solutions to biological
problems and applying computer technology to the management of biological
information. So it is in bioinformatics that IT has a role, not in
biotechnology. As Zensar global CEO Ganesh Natarajan says, "Companies need
to stay out of biotech and focus on bioinformatics. Getting a few people trained
in biotechnology alone will not create another revolution."

Can You Become a Bioinformatics Professional?

Raw skills:
C, C++, Java, OOPS, J2EE
VRML
XML
RDBMS: Oracle PL/SQL
CGI, UNIX, Perl
Web tools: HTML, ASP, JSP, JDBC, Swing

Acquired skills:
BioXML, BioJava, BioPerl
EMBL, GenBank, Swiss-Prot
GEML
Biology concepts
Statistics, algorithms, mathematical models, probability theory
Bioinformatics software: GCG, BLAST, RasMol

You should be comfortable working in a command-line computing
environment. Working in Linux or Unix will provide this experience.

You should have experience with programming in computer languages such
as C or C++, as well as in scripting languages like Perl or Python.

You should have knowledge of a major molecular biology software package,
as well as a fairly deep background in some aspects of molecular biology.

Given the current hype surrounding biotechnology, and the eagerness displayed
by IT companies in this field, comparisons with the Internet wave are but
natural. Bioinformatics, however, is a different ballgame. To begin with, there
are no quick returns to be had. Not only do companies need huge investments,
patience is also an essential prerequisite. And unlike quick-fix diplomas in Web
designing and ‘A-to-Z of Java’ courses that powered many a dot-com, an IT
company would need to invest in at least six months of intensive training to get
computational biology through to its people. Here again, the program needs to be
designed and taught by experts. And given that bioinformatics cannot be tackled
by programming power alone, there’s a need to have life science experts with
strong domain expertise on board. After the training comes the designing of the
‘going-to-market’ strategy. IT companies need to figure out exactly where
they are headed–offering contract services, creating original intellectual
property in the form of algorithms and technologies, dabbling in database
services and product development, or conducting joint research in the drug
discovery area.

Jargon in the Jazz
Demystifying biotechnology

Proteomics
The number of proteins expressed from DNA changes constantly with
conditions such as eating, sleep, infection or other disease, the
phases of pregnancy, or a wound. Proteomics aims to understand protein
expression in response to such conditions.

Proteomics Software
Proteomic software provides scientists with the ability to conduct
database searches of known protein sequences utilizing batch or
real-time processing. This software is capable of controlling
automated hardware, i.e. robotics, as well as facilitating data
transfer operations.

GEML
The Gene Expression Markup Language is based on the Extensible Markup
Language (XML), the standard framework for storing and transferring
structured data over the Internet. Rosetta Inpharmatics, with the
assistance of others in the GEML community, has created the GEML
format to help standardize how DNA microarray and gene expression
data is stored and transferred between different gene expression
analysis systems and databases.

Microarray or biochip technology
This makes it possible to simultaneously study expressions of
thousands of genes or proteins in a single experiment in the
laboratory. The large volume of data generated by a series of such
experiments is expected to help understand the genetic basis of
diseases.

Chromosomes
These are discrete units of the genome carrying many genes, consisting
of (histone) proteins and a very long molecule of DNA. They are found
in the nucleus of every plant and animal cell.

DNA
Short for deoxyribonucleic acid. DNA is made up of four bases,
represented by A (adenine), C (cytosine), G (guanine) and T (thymine).
The order of these bases provides the cell with coded instructions
on everything it does. DNA sequencing involves determining the order
of these bases along a DNA strand.

Gene
The basic unit for transmitting hereditary traits. A gene is made up
of the DNA sequence that provides the code for a protein with a
known function in the cell.

Protein
A chain of amino acids. A protein is usually at least 50 amino acids
long. Proteins form structural elements in a cell and are also
involved in the chemical reactions that occur in a cell.

Amino Acid
The sub-unit that makes up a protein. There are 20 types of amino
acids. Protein sequencing involves determining the order of amino
acids in a protein.

Genome
The genome comprises all the genetic material in a cell or organism.
The human genome is made up of 23 pairs of chromosomes, each
containing hundreds of genes. There are approximately 30,000 genes
in the human genome.

Human Genome Project
A bioinformatics project that has identified the approximately 30,000
genes in human DNA. Coordinated by the US Department of Energy and the
National Institutes of Health, the US Human Genome Project started in
1990 and released its findings in February 2001 along with findings
from a separate project by Celera Genomics Group. There are similar
projects in other countries as well. The purpose is to store the 3
billion chemical base pairs (the DNA sequence) derived from these
analyses in databases for use in biomedical research.

"IT companies should be able to set up the necessary infrastructure with
high computing power. They should also be willing to diversify or partner with
someone to address the wet (actual experiments) side of biology. This is not a
field where things can be done in isolation without the active involvement of
the end-user," points out Patni Computer general manager Dilip
Dhanuka.

Five-fold jump in five years

"According to a CII report, the Indian biotechnology market was valued
at $2.5 billion in 2001, a five-fold increase since 1997. It is expected to
reach $10 billion by 2010. Against this, the biopharma market worldwide is
estimated at about $17 billion," says Tata Consultancy Services executive
vice-president (advanced technology) M Vidyasagar.

Compare this with a market size of $380 billion (conventional pharmaceutical
industry) and the nearly $800-billion IT industry. Even India’s IT industry
(both domestic and exports) of about $12 billion is more than half of the
worldwide biopharma market. Given the small market size, adds Dr Vidyasagar,
"there’s too much hype about bioinformatics" at present.

"The global bioinformatics opportunity is expected to be over $8 billion
by 2008 and Indian companies, which understand the nuances of areas like data
analysis for genomics and proteomics, can capture a share of this pie,"
says Zensar’s Natarajan.

Simple-speak

Numbers apart, where is it that IT contributes to biotechnology?

Think of the times when you could get away with being politically incorrect.
Conjure up an image of this absent-minded research genius who works tirelessly
but never knows what he keeps where. Here is this smart, efficient young lady
who, out of the mess that his lab is, brandishes just the right equation,
algorithm, sequence, at the very instant he needs it. And as is the case with
geniuses, he carelessly chucks that all-important slip of paper into the mayhem
after he’s done with it, basking in the knowledge that the efficient lady will
fish it out when he needs it again.

Well, today, IT has been called upon to don the mantle of this young lady, as
biotechnology works feverishly to create a ‘better’ tomorrow for all of us.
So does that mean IT in biotech is essentially low-end ‘secretarial’ work?
Surely not. For, the multitude of researchers working across the globe, spewing
fresh information by the hour, has created a classic case of ‘info-indigestion’.
This multitude of information has brought with it various sets of problems,
which can only be solved by the advances that IT has ushered in.

Explaining the scale of data that needs to be handled in biotechnology,
Oracle Corp general manager (ebiz) S Grover says, "There are 32,000 genomes
with 1.5 million proteins in them. Each genome requires approximately 300
terabytes of trace files. So 32,000 times 300 TB is massive. Medical imaging
generates 400 million GB of data annually. Each mass spectrometer generates 200
GB of data daily. Multiply this by the 1000s of mass spectrometers in use in the
world today and you get the picture..."
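
Grover's figures are easier to take in once multiplied out. The short Python sketch below does only that arithmetic. It assumes decimal units (1 TB = 1,000 GB) and, purely for illustration, takes "1000s of mass spectrometers" to mean exactly 1,000; that count is an assumption, not a figure from Oracle.

```python
# Figures as quoted by Oracle's S Grover; decimal units are assumed.
GENOMES = 32_000
TRACE_TB_PER_GENOME = 300
SPECTROMETER_GB_PER_DAY = 200
SPECTROMETERS = 1_000  # assumption: "1000s" taken as exactly 1,000

GB_PER_TB = 1_000
TB_PER_EB = 1_000_000

# Total trace-file storage across all genomes.
trace_total_tb = GENOMES * TRACE_TB_PER_GENOME
trace_total_eb = trace_total_tb / TB_PER_EB

# Daily output of the assumed fleet of mass spectrometers.
spectrometer_tb_per_day = SPECTROMETERS * SPECTROMETER_GB_PER_DAY / GB_PER_TB

print(f"Trace files: {trace_total_tb:,} TB (about {trace_total_eb:.1f} exabytes)")
print(f"Mass spectrometers: {spectrometer_tb_per_day:,.0f} TB per day")
```

On these assumptions the trace files alone come to 9.6 million terabytes, and every additional thousand spectrometers adds another 200 TB a day.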

This sheer volume of data calls for the creation of not just scalable, but
intelligent, databases. But can’t search engines, that most of us take for
granted now, do the trick? Try searching for the term ‘histamine’ using a
search engine. If the engine is robust, there’s no reason why it should not
throw up all references to ‘histamine’. Unless of course, some people using
histamine decide to spell it as ‘histamyne’ or ‘histemine’. If they do,
you will never know! Biologists sometimes can’t agree on the very definitions
and concepts the databases are supposed to manage.
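
The ‘histamyne’ problem can be illustrated with approximate string matching. The sketch below uses Python's standard `difflib` module to pull near-spellings of a term out of a small vocabulary. The function name `find_term_variants`, the sample vocabulary and the 0.85 similarity cutoff are all assumptions made for this example; a production system would more likely lean on a curated synonym list.

```python
import difflib

def find_term_variants(term, vocabulary, cutoff=0.85):
    """Return vocabulary entries that look like spellings of `term`."""
    # Compare in lower case, but hand back the original spelling.
    lowered = {word.lower(): word for word in vocabulary}
    matches = difflib.get_close_matches(
        term.lower(), list(lowered), n=max(1, len(lowered)), cutoff=cutoff
    )
    return [lowered[match] for match in matches]

# Illustrative vocabulary: two misspellings, one genuinely different
# compound (histidine) and one unrelated term.
vocabulary = ["histamine", "histamyne", "histemine", "histidine", "insulin"]
print(find_term_variants("histamine", vocabulary))
```

With this cutoff the two misspellings are caught while ‘histidine’, a different molecule altogether, is left out. Lower the cutoff far enough and it would creep in, which is precisely why spelling tolerance alone cannot replace agreed definitions.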

In genomics, for example, you can’t get more fundamental than the
definition of a gene. Yet, that definition could differ. One software solution
is to develop submission protocols that follow the rules strictly. Such a system
would check each submission to ensure that, before the data is entered, the full
genus/species name matches a corresponding entry in a list of known genus/species
names. Another solution could be to offer a drop-down selection.
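
A strict submission check of this kind might look like the following Python sketch. The three-entry species list, the record layout (a dictionary with an `organism` field) and the function names are hypothetical, chosen only to show the idea of rejecting a record before it reaches the database.

```python
# Illustrative controlled vocabulary; a real one would be far longer.
KNOWN_SPECIES = frozenset({"Homo sapiens", "Mus musculus", "Escherichia coli"})

def normalize_species(name):
    """Collapse whitespace and apply 'Genus species' capitalization."""
    parts = name.split()
    if len(parts) != 2:
        return None  # not a full two-part name
    genus, species = parts
    return f"{genus.capitalize()} {species.lower()}"

def validate_submission(record, known_species=KNOWN_SPECIES):
    """Return (ok, message) for a submission dict with an 'organism' field."""
    normalized = normalize_species(record.get("organism", ""))
    if normalized is None:
        return False, "organism must be a full 'Genus species' name"
    if normalized not in known_species:
        return False, f"unknown organism: {normalized}"
    return True, normalized

print(validate_submission({"organism": "homo  SAPIENS"}))
print(validate_submission({"organism": "H. sapiens"}))
```

The first record is accepted once its case and spacing are tidied; the abbreviated second one is turned away. A drop-down selection achieves the same end by never letting the free-text variant be typed in the first place.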

Who’s Doing What
Indian IT companies and their bioinformatics initiatives
TCS has partnered with the Centre for DNA Fingerprinting and
Diagnostics (CDFD), the Indian Statistical Institute as well as the
Government of India’s Department of Biotechnology. TCS has hired about
35 professionals who will undergo training for this initiative. The
company’s focus will be on the creation of original intellectual
property (IP) in the form of original algorithms and methodologies.
This work is being undertaken at the Advanced Technology Centre of
TCS in Hyderabad.

IBM India Research Labs (IRL), New Delhi, is working to generate
intellectual property in algorithms to support life sciences research,
apart from contributing to the development of the Blue Gene
supercomputer. IRL is engaged in a research project on gene structure
and function prediction in collaboration with the Department of
Biochemical Engineering and Biotechnology, IIT, New Delhi. IBM Global
Services India (IGSI) Exports is currently involved in building
prototypes and demonstrable pilot applications for emerging challenges
facing the life sciences industry. This group has embarked on
developing a tool that helps in data discovery from disparate data
sources. IGSI Exports further contributes to IBM’s life sciences
initiative by building competencies in developing, testing and
implementing bioinformatics databases, user tools, utilities and
applications. IRL and IGSI Exports are collaborating to build a
prototype bioinformatics system for microarrays.

Wipro is looking at providing IT solutions to life science companies,
outsourcing of R&D activities of major biotechnology companies and
embedded software solutions for the medical devices market. The
company is also ‘strengthening its relationship’ with Beckman
Coulter. In April 2002, Wipro launched Wipro Healthcare and Life
Science to address the requirements of the bio-IT market. This
business will offer solutions to hospitals and health insurance
companies for efficient and compliant workflows, bioinformatics
solutions to life science companies to significantly reduce drug
discovery cycle time, and IT solutions to medical and analytical
devices companies.

Compaq India has tied up with IIT, Mumbai, and TIFR for its foray into
bioinformatics, apart from its partnership with the National
University of Singapore. Compaq recently set up a Super Computing Lab
for the Center for Biotechnology in New Delhi, deploying eight
AlphaServers and a 1-terabyte SAN storage solution. Compaq has also
provided software tools for parallelization and will provide complete
maintenance of the lab for the next three years. It has also been
hosting multiple biotech forums.

Zensar is the technology partner for a bioinformatics venture focused
on the creation of a new model for virtual research. This is a
modeling environment that will combine research and data analysis
capabilities with artificial intelligence components. The initial
project is valued at half a million dollars and the company will
have teams of eight to 20 people working in this area.

Infosys and Patni Computer are assessing developments in the
bioinformatics arena before they decide to explore opportunities in
this sector.

Government
Bioinformatics and Genomics Centre (BGC)

Bioinformatics Centres under BTIS: Jawaharlal Nehru University, New
Delhi; Indian Institute of Science, Bangalore; Bose Institute,
Kolkata; Institute of Microbial Technology, Chandigarh; Madurai
Kamaraj University, Madurai; Indian Agricultural Research Institute,
New Delhi; University of Poona, Pune; Centre for Cellular and
Molecular Biology, Hyderabad; National Institute of Immunology, New
Delhi; Department of Biotechnology, Govt of India, New Delhi

Bioinformatics Centres: Advanced Diploma in Bioinformatics; Direct
Ph.D programme in Bioinformatics

Even assuming that the data entered is standardized, searching a colossal
database is no mean task. Here, software tools like the NCBI BLAST (National
Center for Biotechnology Information’s Basic Local Alignment Search Tool) run
several instances of the program, searching individual portions of the database.
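
The divide-and-search idea can be sketched in a few lines of Python. This is not how BLAST works internally: plain substring matching stands in for its alignment step, and threads stand in for separate program instances. The function names, the sample sequences and the choice of four shards are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(records, shard_count):
    """Deal records round-robin into `shard_count` lists."""
    return [records[i::shard_count] for i in range(shard_count)]

def search_shard(shard, query):
    """Return IDs of sequences in one shard that contain the query."""
    return [seq_id for seq_id, sequence in shard if query in sequence]

def parallel_search(records, query, shard_count=4):
    """Search every shard concurrently, then merge the partial hit lists."""
    shards = split_into_shards(records, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partial_hits = list(pool.map(search_shard, shards, [query] * len(shards)))
    return sorted(hit for hits in partial_hits for hit in hits)

# Made-up toy database of (identifier, sequence) pairs.
records = [
    ("seq1", "ACGTACGT"),
    ("seq2", "TTTTGGGG"),
    ("seq3", "GGACGTCC"),
    ("seq4", "CCCCAAAA"),
    ("seq5", "TACGTA"),
]
print(parallel_search(records, "ACGT"))
```

Each worker sees only its own slice of the database, so adding workers shortens the search without changing the answer. For genuinely compute-heavy matching in Python, separate processes or machines would replace the threads used here.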

IBM’s DiscoveryLink, for instance, understands the schema of different
databases and the kind of queries that can be handled. When a person or program
sends in an info-query, DiscoveryLink breaks it down and sends the parts off to
the various databases. The partial answers that come in are combined, and an
answer is returned.
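
The break-down-and-combine pattern described here is generally called a federated query. The Python sketch below shows the shape of it and nothing more: it is not IBM's API, and the three in-memory "databases", the field names and the protein IDs are invented toy data.

```python
# Hypothetical stand-ins for separate databases, keyed by protein ID.
SOURCES = {
    "sequence_db": {"P1": {"sequence": "MKT"}, "P2": {"sequence": "MAL"}},
    "structure_db": {"P1": {"fold": "alpha-helical"}},
    "literature_db": {"P1": {"papers": 12}, "P2": {"papers": 3}},
}

# Which source can answer which field: a toy version of "knowing the schema".
FIELD_TO_SOURCE = {
    "sequence": "sequence_db",
    "fold": "structure_db",
    "papers": "literature_db",
}

def federated_query(protein_id, fields):
    """Split a request by field, query each source, and merge the answers."""
    # Step 1: break the query into one sub-query per source.
    by_source = {}
    for field in fields:
        by_source.setdefault(FIELD_TO_SOURCE[field], []).append(field)

    # Step 2: send each part to its source and combine the partial answers.
    answer = {"id": protein_id}
    for source_name, wanted in by_source.items():
        row = SOURCES[source_name].get(protein_id, {})
        for field in wanted:
            answer[field] = row.get(field)  # None when a source has no entry
    return answer

print(federated_query("P1", ["sequence", "fold", "papers"]))
```

The caller asks one question and never needs to know that three stores were consulted, or that one of them had nothing to say about a given protein.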

The massive volume of data also calls for increased speed of computing. IBM’s
supercomputer Blue Gene, being built at an investment of $100 million and to be
operational by 2005, is expected to operate at about 200 teraflops, or 200
trillion operations per second, more than the total combined power of the top
500 supercomputers in operation today.

"With Blue Gene, IBM is trying to set a new supercomputer speed limit
– a petaflop, or a thousand trillion floating-point calculations per second,"
says Dr Manoj Kumar, director, IBM Research Labs, New Delhi. IRL, incidentally,
is part of the team working on Project Blue Gene.

Another problem is that databases created by different organizations store
information idiosyncratically, creating different file formats that cannot talk
to each other. Biological data is, to begin with, complex and interlinked.
A spot on a DNA array, for instance, is connected not only to immediate
information about its intensity, but to layers of information about genomic
location, DNA sequence, structure, function, and much more. Creating information
systems that allow biologists to seamlessly follow these links without getting
lost in a sea of information is a challenge for computer scientists.

Parallel computing

Parallel computing is a concept that has been around for a long time. Break
a problem down into computationally tractable components, and instead of solving
them one at a time, employ multiple processors to solve each sub-problem
simultaneously. The parallel approach has made its way into experimental
molecular biology with technologies such as the DNA microarray. Microarrays
allow researchers to conduct thousands of gene expression experiments
simultaneously on a tiny chip.

Much of what we currently think of as part of bioinformatics–sequence
comparison, sequence database searching, sequence analysis–is more complicated
than just designing and populating databases. Bioinformaticians are the
tool-builders, and it’s critical that they understand biological problems as
well as computational solutions in order to produce useful tools.

Developing analytical tools to discover knowledge in data is the second, and
more scientific, aspect of bioinformatics. There are many levels at which we use
biological information, whether it is in comparing sequences to develop a
hypothesis about the function of a newly discovered gene, breaking down known 3D
protein structures into bits to find patterns that help predict how the protein
folds, or modeling how proteins and metabolites in a cell work together to make
the cell function. The ultimate aim of analytical bioinformaticians is to
develop predictive methods that allow scientists to model the function and
phenotype of an organism based on genome sequence alone.
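
Sequence comparison, the first of the uses listed above, is the easiest to show in code. The article does not name a method, so the Python sketch below swaps in the textbook Smith-Waterman local alignment score, which rewards the best-matching stretch shared by two sequences. The scoring values (+2 for a match, -1 for a mismatch, -2 for a gap) are arbitrary assumptions, and the function returns only the score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the grid
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # shared stretch: ACGT
print(smith_waterman_score("AAAA", "TTTT"))          # nothing in common
```

Two sequences that share the stretch ACGT score 8 (four matches at 2 points each), while sequences with nothing in common score 0. A high score against a gene of known function is the kind of evidence from which a hypothesis about a newly discovered gene starts.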

Commenting on IBM’s role in bioinformatics, managing director Abraham
Thomas says, "Online collections of biomedical abstracts, papers and other
literature are used to produce annotated databases for easy access of
information. For example, in a protein database, the annotation for a protein
may include its properties, functions, structure, similarities with other
proteins, diseases associated with deficiencies in the protein etc."
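
The annotation Thomas describes maps naturally onto a simple record type. The Python sketch below is one hypothetical layout, with one field per item in his list; the class name, the field names and the two sample entries are illustrative and are not drawn from any real protein database.

```python
from dataclasses import dataclass, field

@dataclass
class ProteinAnnotation:
    """One annotated entry: the protein plus what is known about it."""
    name: str
    properties: dict = field(default_factory=dict)
    functions: list = field(default_factory=list)
    structure: str = ""
    similar_proteins: list = field(default_factory=list)
    associated_diseases: list = field(default_factory=list)

def proteins_linked_to(disease, annotations):
    """Return names of proteins whose annotation mentions the disease."""
    wanted = disease.lower()
    return [
        entry.name
        for entry in annotations
        if any(wanted in item.lower() for item in entry.associated_diseases)
    ]

# Two illustrative entries using well-known textbook associations.
annotations = [
    ProteinAnnotation(
        name="insulin",
        functions=["regulates blood glucose"],
        associated_diseases=["diabetes mellitus"],
    ),
    ProteinAnnotation(
        name="hemoglobin",
        functions=["oxygen transport"],
        associated_diseases=["sickle cell anemia"],
    ),
]
print(proteins_linked_to("diabetes", annotations))
```

Once annotations are held in a structured form like this, a question such as "which proteins are linked to diabetes?" becomes a one-line lookup rather than a literature search.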

RoI shapes the final decision

Apart from database management and data-mining solutions and services, there
are several other applications of IT within bioinformatics. "The challenge
is to obtain a return on the enormous investment required to obtain the
explosion of genomic data. This requires significant computational capabilities,
consisting of high-performance platforms, sophisticated and validated
algorithms, and the integration of these processes into the scientific work
process," says Dan Stevens, director, marketing (life sciences), Silicon
Graphics.

Bioinformatics tools can also be used in the analysis of genome sequences to
detect genes and their functions, of protein sequences to predict their
structure (either secondary or tertiary), and of clinical data to predict the
toxicity of drugs and/or molecules.
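
As a rough illustration of gene detection, the simplest possible approach is to scan DNA for open reading frames: stretches that begin with the start codon ATG and run, three bases at a time, to a stop codon. The Python sketch below does this on the forward strand only. Real gene finders are far more elaborate, and the function name, the minimum-length threshold and the sample sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of open reading frames in `dna`.

    Scans the three forward reading frames. Each ORF runs from an ATG to
    the first in-frame stop codon, with `end` just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Keep only ORFs with enough codons before the stop.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs

sequence = "CCATGAAACCCGGGTAACC"  # made-up example
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

For the made-up sequence above, the scan reports a single candidate, ATGAAACCCGGGTAA, starting at position 2.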

Teaming up is the best option

Given the specialized nature of bioinformatics, it makes sound business
sense for IT companies to partner with pharma and research companies.
"Information technology and its optimized use can qualitatively change the
nature of this collaboration, with tools like electronic product development
exchanges," says Oracle’s Grover.

"Build domain knowledge, partner with leading research institutes,
develop intellectual property and understand customer challenges and deliver
solutions which add value," says DA Prasanna, vice-chairman, Wipro (also
executive officer at Wipro Healthcare and Life Sciences), outlining the success
formula for a foray into bioinformatics. But even if IT companies follow
this formula, can they really re-invent themselves as end-to-end bioinformatics
solution providers?

"If any of the ‘complete solutions’ provided by such companies fails
repeatedly, customers will start doubting the ability of communities in the
information technology space to fulfill their commitments. To avoid this
undesirable outcome, IT companies and professionals must learn to work within
their areas of competence," says Stevens.

Ultimately, however, it is up to an IT company to determine how far it wants
to go along the road to biotechnology. Clearly, there are rich pickings along
this road, and the further it goes, the more money it will make. The stock
market gives much higher premia to drug discovery companies–and that’s just
the tip of the biotech berg–than to pure IT companies. The downside is that
the further an IT company walks along that road, the further it moves away from
its core competence. But then, if it rakes in the moolah in these cash-strapped
times, why not? After all, proteomics, genomics and pharmacogenomics are all
derived from the Latin root ‘-omics’, which means ‘give us money’!

Manjiri Kalghatgi in New Delhi

TechnoFunctionals

How can you build a biotech-savvy IT workforce?

As in the case of banking, insurance, and other verticals where IT plays a
role, domain knowledge is supreme for professionals working in the field of
bioinformatics. So what does it take to build a biotech-savvy IT workforce? IT
companies need to retrain and reposition their IT and systems teams in life
science-related projects. Getting someone who knows C++ to learn Java is one
thing, but getting someone who lost touch with biology at age 15 to understand
the complex functioning of the human body is another ballgame altogether.
Getting someone who has no IT training to write software code is no mean task
either. The latter could be a trifle easier! As Compaq India director
(enterprise products) Pallab Talukdar says, "We are looking at biologists
picking up IT, rather than the reverse." And this explains the evolution of
terminology like BioPerl and BioJava.

"These are actually extensions to existing IT technology," informs
Compaq’s Talukdar, explaining that the syntax of BioPerl is actually closer to
biology than to IT. "For instance, in specifying the name of an array, BioPerl
addresses a variable as an amino acid, making it far easier for a biologist to
use the language," he explains. While the evolution of such languages has
certainly contributed towards bringing biologists a step closer to IT, the need
for techno-functional professionals trained in both IT and biology is growing
rapidly. But the quickest and best way of building a team for bioinformatics
projects would be to have a mix of talent from both fields on board and re-train
them to achieve the level of expertise required for the project.

"About 80% of the people we have have a basic degree in electrical
engineering or computer science. They are taught the basics of molecular
biology, and then put through a rigorous training program on the various
mathematical algorithms that are used in bioinformatics. This includes exposure
to the latest techniques, software packages and databases. About 20% of our
chosen staff has a basic degree in life sciences. Their knowledge of biology is
brushed up. They are taught the basics of computer programming, as well as
computational techniques used in bioinformatics–though not to the same level
of rigor as the EE/CS staffers," says Tata Consultancy’s Vidyasagar. So
what are the raw skills that IT professionals need to have to qualify for
training in this area?

Dr Manoj Kumar, director, IBM India Research Labs, says competence in areas
like e-biz, data and storage management, data mining, parallel / distributed
computing, middleware and knowledge management would aid in the pursuit of
bioinformatics. "Then there’s stuff like probability theory, statistics,
design and analysis of algorithms, discrete mathematics, relational and spatial
databases," he says. "IT companies need to train staff on analysis and
interpretation of biological data through techniques of visualization, algorithm
development and mathematical models," says Arena Multimedia CTO NJ Rajaram.

"A DBA just manages databases, but the role of the database researcher
has certainly grown in importance," explains Compaq’s solutions architect
Balasubramanian. "Apart from handling a variety of data, he has to deal
with organizing, indexing and managing the storage of that data. This has more
to do with biological sets and defining an indexing mechanism, which enables
convenience in data retrieval. It also involves statistical analysis of
sequences and imaging of data," elaborates Balasubramanian.

Aptech wing Arena has already launched specific courses on bioinformatics and
bio-computing. These are primarily aimed at pharmaceutical / drug research
companies, consulting companies and IT firms. "Training programs are called
for in the areas of molecular biology, handling proteomic, genomic data, DNA
structures and sequences, pharmacology and dealing with patent and bibliography
data," says Rajaram. And apart from strong fundamental IT skills and
knowledge of biotechnology concepts, the need for professionals to constantly
update themselves has never been stronger. As SGI’s Dan Stevens says,
in the post-genomic era, professionals will have to read up and attend meetings
to learn about state-of-the-art solutions. "And apart from that, they will
have to learn how to filter through those solutions that are only trying to take
advantage of market share and the current market hype," he cautions.