Can journalists bring about innovation in technology? If the dot-com boom saw
journalists retrain themselves as content experts, the dot-com bust left them
with a feeling that they had reached the end of the road in innovating on the
Web. After all, one would think, the portals got their traffic because of those
big ads and hoardings, the free e-mail, and the dating services. What about
content? Content was a differentiator in some cases, but the advantage was not
sustained. The extent of innovation in managing content was limited.
Journalists knew what readers (read surfers) liked to read, while technology
made the content rich and interactive. But the problem was delivering the right
content to the right person at the right time. This is where the Rediff team
created its success story, which it calls ‘collaboration between journalists
and technology’.
There’s nothing new about personalization of sites. Personalization of content
based on preset reader preferences is not new either. What the team at Rediff
attempted was to use the ‘principle of exclusion’, which involves understanding
each reader’s preferences for content and using them to exclude him from the
class of readers into which he would otherwise have been classified.
That is, the reader is treated as an independent entity and given content
that he is likely to prefer. This is called implicit personalization. Once the
reader and his preferences are identified, the content has to be delivered in
whichever format the access device he uses requires.
Says Zaki Ansari, senior editor, Rediff, and a trained hand at print
journalism, "No portal has demographics like in print, portals can talk
about psychographics and therefore content is context." That is, the
content has to be extremely relevant in the context in which the Internet reader
is seeking it. This involves throwing up related content and giving the big
picture on a particular topic. It also involves a good deal of recommending,
along the lines of ‘people like you also like to read’ the following. As an
extension, it also involves ‘deep-linking’ into external sites and
suggesting commerce.
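The ‘people like you also like to read’ idea can be sketched in a few lines. What follows is a minimal illustration of user-based collaborative filtering, not Rediff's engine: the reader names, story ids, and the choice of Jaccard overlap as the similarity measure are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical reading histories: reader id -> set of story ids already read.
HISTORIES = {
    "reader_a": {"budget", "cricket", "monsoon"},
    "reader_b": {"budget", "cricket", "elections"},
    "reader_c": {"bollywood", "monsoon"},
}


def jaccard(a, b):
    """Overlap between two sets of stories, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(reader, histories, top_n=3):
    """Rank stories the reader has not read by how similar their readers are."""
    seen = histories[reader]
    scores = Counter()
    for other, stories in histories.items():
        if other == reader:
            continue
        similarity = jaccard(seen, stories)
        # Each unread story inherits the similarity of the reader who read it.
        for story in stories - seen:
            scores[story] += similarity
    return [story for story, score in scores.most_common(top_n) if score > 0]


print(recommend("reader_a", HISTORIES))
```

Here reader_b shares more history with reader_a than reader_c does, so reader_b's unread story is ranked first.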
The Building Blocks
Block | What It Does |
Rediff BackYard | Content management and editorial workflow system. A very fast system that automates publishing, creates a virtual newsroom, and stores content in XML form. |
XML Repository | Describes different parts of the news copy by headline, byline, dateline, actual copy, agency credits, etc. Does not have any formatting information such as font, color, alignment, or bold. Because XML is without formatting, it can be published in any design and form. All one needs to do is add a template. Templates for print, Web page, SMS, or handheld device exist. The approach is called single-source publishing. |
The Indexer | Holds six years of Rediff.com data indexed by individual words. Weights are assigned to each word depending upon the importance of the word to the copy. |
Categorizer | AI-based system that works like a human news editor. Sorts copies by subject categories, based on hundreds of rules. Uses weights calculated by the indexer to figure out which words in the copy are important. |
Related Content Engine | Considers copies within each category and looks for very strong relations between news stories there. Based on categorizer and indexer. |
Personalization Engine | Creates personalized editions. Studies user behavior using clickstream analysis. |
Recommendation Engine | Intelligently recommends content that the reader would like but has not read yet. Based on collaborative filtering. |
Secondly, the reader has vast Internet content resources at his command (other
global portals and information sites), and an Indian portal cannot afford to
disrupt that quality of experience. Therefore, the portal has to
function with the speed and elegance of a professional design agency, and the
design of the site has to change continually to keep up with design
trends. Third, content is all about navigability, and the copy has to be made
malleable. XML being the current publishing standard, the pages are XML-based
and function in a "publish once, push many" mode. That is,
content is separated from the form using a navigation bar. Lastly, access to
content is increasingly happening beyond the PC, and multiple devices such as
PDAs and cell phones have to be supported.
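To make "publish once, push many" concrete, here is a small sketch of one XML copy being rendered through two templates. The element names (story, headline, byline, dateline, body), the sample copy, and both render functions are assumptions for illustration; the article does not describe Rediff's actual schema or templates.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical story copy: structure only, no fonts, colors, or alignment.
STORY_XML = """<story>
  <headline>Monsoon reaches Kerala</headline>
  <byline>Staff Reporter</byline>
  <dateline>Kochi</dateline>
  <body>The monsoon arrived on the Kerala coast on Tuesday.</body>
</story>"""


def parse_story(xml_text):
    """Read the XML copy into a plain dict of field name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def render_web(story):
    """Web-page template: wraps each field in HTML markup."""
    return (
        f"<h1>{escape(story['headline'])}</h1>\n"
        f"<p class='byline'>{escape(story['byline'])}, "
        f"{escape(story['dateline'])}</p>\n"
        f"<p>{escape(story['body'])}</p>"
    )


def render_sms(story, limit=160):
    """SMS template: headline plus copy, truncated to the message limit."""
    text = f"{story['headline']}: {story['body']}"
    return text if len(text) <= limit else text[: limit - 3] + "..."


story = parse_story(STORY_XML)
print(render_web(story))
print(render_sms(story))
```

The copy is parsed once; supporting a new device means adding another render function, not touching the stored story.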
Says Ansari, "We had to engineer compelling content and
do what one may call object-oriented journalism." Having defined the goals,
the Rediff team also realized that there were no ready-made tools available that
could do all of what they desired. Hence the team got down to developing
the technology in-house. In a way, this was nothing new for Rediff, for it had
developed its own mail engine for the Rediff-Mail service and managed to
scale it up massively.
Building blocks
At the heart of this content exercise is the editorial
workflow system called the Rediff BackYard. The Rediff BackYard works out of
a browser and is the single central news desk. Regardless of where the reporters
are, whether in Kargil, Iraq, or the US, all content is deposited in the Rediff
BackYard. This throws up an XML copy into the XML repository, which goes to an
‘indexer’, a complex and huge piece of technology.
The indexer takes every word in the copy and assigns it a weight
depending on the importance of the word to the story. The indexer’s
job is to prepare data on the basis of which machine intelligence can figure out
which copies are closer to which words. The indexer is based on the mathematical
concept of Bayesian inference, a topic in probability theory. The ‘categorizer’
is the next most important part. It sorts copies by multiple subject categories.
The categorizer’s artificial intelligence works on the basis of hundreds of
rules written just once. The rules can take the form of simple or complex
Boolean expressions.
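The idea of weighting each word by its importance to a story can be illustrated with a short sketch. Note the swap: the article says Rediff's indexer rests on Bayesian inference, whose details are not given, so the example below uses the simpler and widely known TF-IDF weighting as a stand-in. The sample stories and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical copies: story id -> text.
STORIES = {
    "a": "bus accident in Kerala",
    "b": "film star in road accident",
    "c": "budget in parliament",
}


def tokenize(text):
    """Lowercase the copy and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(stories):
    """Return story id -> {word: weight} using TF-IDF (a stand-in technique)."""
    counts = {sid: Counter(tokenize(text)) for sid, text in stories.items()}
    # Document frequency: in how many stories each word appears.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    n_stories = len(stories)
    index = {}
    for sid, word_counts in counts.items():
        total = sum(word_counts.values())
        index[sid] = {
            # Frequent in this story, rare across stories -> high weight.
            word: (count / total) * math.log(n_stories / doc_freq[word])
            for word, count in word_counts.items()
        }
    return index


index = build_index(STORIES)
print(sorted(index["a"].items(), key=lambda item: -item[1]))
```

A word such as "in", which appears in every story, ends up with zero weight, while "kerala", which appears in only one, outweighs "accident", which appears in two.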
Added to that is the scheme of the International Press Telecommunications
Council (IPTC), a numbered schema to classify news. Indian terms such as BJP
and VHP are added to the IPTC scheme. The categorizer uses the weights
calculated by the indexer to figure out which words in the copy are important
enough to apply the rules to. It is an artificial intelligence-based algorithm
with a feedback loop that plays the role of an editor, a super-fast electronic
news editor at that, and it is capable of learning from experience and becoming
smarter over time.
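A rough sketch of how Boolean rules might be applied only to the words the indexer considers important is given below. The rules, category names, weights, and threshold are all invented for the example, and the feedback loop that lets the real categorizer learn is not modeled here.

```python
def important_words(weights, threshold=0.1):
    """Keep only the words whose indexer weight reaches the threshold."""
    return {word for word, weight in weights.items() if weight >= threshold}


# Hypothetical rules: category -> Boolean expression over the important words.
RULES = {
    "road accidents": lambda w: "accident" in w
    and ("bus" in w or "road" in w or "car" in w),
    "politics": lambda w: ("bjp" in w or "vhp" in w or "parliament" in w)
    and "film" not in w,
}


def categorize(weights, rules=RULES, threshold=0.1):
    """Return every category whose rule holds; a copy may get several."""
    words = important_words(weights, threshold)
    return sorted(category for category, rule in rules.items() if rule(words))


# "the" is too light to count, so it never reaches the rules.
print(categorize({"accident": 0.4, "bus": 0.3, "kerala": 0.2, "the": 0.01}))
```

Because each rule is evaluated independently, one copy can land in multiple subject categories, as the article describes.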
All this is then written back into XML and goes to the Related
Content Engine (RCE). This software considers copies within each category and
then looks for very strong relations between news stories there. For instance, a
‘Kerala bus accident’ story and a ‘Salman Khan road accident’ story
are both in the ‘road accidents’ category, but they are not related. The RCE
dissociates them and looks for an earlier ‘Salman Khan road accident’
story that can be associated. This is where the weights assigned by
the indexer come to its aid. The RCE is built on the foundation of the categorizer,
which in turn rests partly on the indexer.
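The accident example can be worked through in code. The sketch below compares stories inside one category by the cosine similarity of their word weights; the article does not say which measure the RCE uses, so cosine similarity, the weights, the story ids, and the 0.5 cut-off are all assumptions.

```python
import math

# Hypothetical indexer output for copies in the 'road accidents' category.
ROAD_ACCIDENTS = {
    "kerala_bus": {"kerala": 0.5, "bus": 0.4, "accident": 0.2},
    "salman_new": {"salman": 0.6, "khan": 0.5, "accident": 0.2},
    "salman_old": {"salman": 0.5, "khan": 0.5, "court": 0.3, "accident": 0.2},
}


def cosine(a, b):
    """Cosine similarity between two {word: weight} vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related(story_id, category_stories, min_score=0.5):
    """Return other stories in the category that are strongly related."""
    target = category_stories[story_id]
    scored = [
        (cosine(target, weights), other)
        for other, weights in category_stories.items()
        if other != story_id
    ]
    # Strongest matches first; weak matches are dropped entirely.
    return [other for score, other in sorted(scored, reverse=True)
            if score >= min_score]


print(related("salman_new", ROAD_ACCIDENTS))
print(related("kerala_bus", ROAD_ACCIDENTS))
```

The two Salman Khan stories share their heavily weighted words and are linked, while the Kerala bus story shares only the lightly weighted "accident" and is left unassociated.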
The team at Rediff took four sleepless months to develop the
systems, and it describes the work as a unique collaborative effort between the
journalists and the technologists. The team also had to draw on external
resources in the form of mathematical research and software adaptation. The
entire system was developed from the ground up and has been functioning very
well since it was rolled out this January.