The Rediff Overhaul

DQI Bureau

Can journalists bring about innovation in technology? If the dot-com boom saw journalists retrain themselves as content experts, the dot-com bust left them feeling that they had reached the end of the road in innovating on the Web. After all, one would think, the portals got their traffic from big ads and hoardings, free e-mail, and dating services. What about content? Content was a differentiator in some cases, but the advantage was not sustained, and innovation in managing content was limited. Journalists knew what readers (read surfers) liked to read, and technology made the content rich and interactive. The problem was delivering the right content to the right person at the right time. This is where the Rediff team created its success story, which it calls ‘collaboration between journalists and technology’.

There’s nothing new about personalizing sites. Personalizing content on the basis of preferences that the reader sets in advance is not new either. What the team at Rediff attempted was to use the ‘principle of exclusion’, which involves understanding each reader’s preferences for content and using them to exclude him from the class of readers into which he would otherwise have been classified.

That is, the reader is treated as an independent entity and given content that he is likely to prefer. This is called implicit personalization. Once the reader and his preferences are identified, the content has to be delivered in whichever format he requires, depending on the access device he uses.
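
The idea of implicit personalization can be illustrated with a small sketch. It is only an illustration under stated assumptions: the article does not describe how Rediff builds reader profiles, so the function names (`build_profile`, `personal_edition`), the click data, and the ranking rule here are all hypothetical.

```python
from collections import Counter

def build_profile(clicks):
    """Turn a reader's clicks (one category name per story opened)
    into a preference share per category. Hypothetical helper."""
    counts = Counter(clicks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}

def personal_edition(profile, stories, size=3):
    """Rank (headline, category) pairs by the reader's inferred preference.
    sorted() is stable, so ties keep the editors' original order."""
    ranked = sorted(stories, key=lambda story: -profile.get(story[1], 0.0))
    return [headline for headline, _ in ranked[:size]]

# Assumed sample data, not taken from the article.
clicks = ["cricket", "cricket", "business", "cricket"]
stories = [
    ("Sensex falls", "business"),
    ("India win", "cricket"),
    ("New film", "movies"),
]
edition = personal_edition(build_profile(clicks), stories)
print(edition)  # ['India win', 'Sensex falls', 'New film']
```

The reader never states a preference; the profile is inferred from what he opens, which is what separates this from personalization based on pre-set preferences.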

Says Zaki Ansari, senior editor, Rediff, and a trained hand at print journalism, "No portal has demographics like in print, portals can talk about psychographics and therefore content is context." That is, the content has to be extremely relevant to the context in which the Internet reader is seeking it. This involves throwing up related content and giving the big picture on a particular topic. It also involves a good deal of recommendation, along the lines of ‘people like you also like to read’ the following. By extension, it involves ‘deep-linking’ into external sites and suggesting commerce.
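
The ‘people like you also like to read’ idea is commonly implemented with collaborative filtering. The following is a minimal sketch of a user-based variant, assuming a hypothetical reading history keyed by reader; the article does not say which variant Rediff uses, and the names `history`, `similarity` and `recommend` are illustrative only.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical reading history: reader id -> set of story ids already read.
history = {
    "reader_a": {"budget", "cricket", "monsoon"},
    "reader_b": {"budget", "cricket", "elections"},
    "reader_c": {"film_review", "monsoon"},
}

def similarity(a, b):
    """Cosine similarity between two sets of read stories."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(reader, history, top_n=3):
    """Score each unread story by the similarity of the readers who read it."""
    seen = history[reader]
    scores = defaultdict(float)
    for other, stories in history.items():
        if other == reader:
            continue
        weight = similarity(seen, stories)
        for story in stories - seen:
            scores[story] += weight
    # Highest score first; story id breaks ties so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [story for story, score in ranked[:top_n] if score > 0]

print(recommend("reader_a", history))  # ['elections', 'film_review']
```

Here reader_b shares two stories with reader_a and reader_c shares one, so reader_b's unread story ranks first.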

The Building Blocks

Block: What It Does

Rediff BackYard: Content management and editorial workflow system. A very fast system that automates publishing, creates a virtual newsroom, stores content in XML form.

XML Repository: Describes different parts of the news copy by headline, byline, dateline, actual copy, agency credits, etc. Does not have any formatting information like font, color, alignment, bold, etc. Because XML is without formatting, it can be published in any design and form. All one needs to do is to add a template. Templates for print, Web page, SMS or handheld device exist. The approach is called single-source publishing.

The Indexer: Holds six years of Rediff.com data indexed by individual words. Weights are assigned to each word depending upon the importance of the word to the copy.

Categorizer: AI-based system that works like a human news editor. Sorts copies by subject categories, based on hundreds of rules. Uses weights calculated by the indexer to figure out which words in the copy are important.

Related Content Engine: Considers copies within each category and looks for very strong relations between news stories there. Based on categorizer and indexer.

Personalization Engine: Creates personalized editions. Studies user behavior using clickstream analysis.

Recommendation Engine: Intelligently recommends content that the reader would like but has not read yet. Based on collaborative filtering.

Secondly, the reader has vast Internet content resources at his command (call them other global portals and information sites), and an Indian portal cannot disrupt the quality of experience he is used to. The portal therefore has to function with the speed and elegance of a professional design agency, and the design of the site has to change continually to keep up with design trends. Third, content is all about navigability, and the copy has to be made malleable. XML being the current publishing standard, the pages are XML-based and function in a "publish once, push many" mode. That is, content is separated from the form using a navigation bar. Lastly, content is increasingly accessed beyond the PC, and multiple devices such as PDAs and cell phones have to be supported.
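
The "publish once, push many" mode can be sketched as one format-free XML story rendered through a separate template per device. This is an illustration only: the element names, the `publish` function and the two templates are assumptions, since the article does not show Rediff's actual XML schema or templates.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical story markup: structure only, no fonts, colors or alignment.
STORY_XML = """
<story>
  <headline>Monsoon reaches Kerala</headline>
  <byline>Staff Reporter</byline>
  <dateline>Kochi</dateline>
  <copy>The monsoon arrived on the coast on schedule this week.</copy>
</story>
"""

def parse_story(xml_text):
    """Read the story into a dict of element name -> text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}

def render_web(story):
    """Web-page template: adds HTML markup around the same content."""
    return (
        f"<h1>{escape(story['headline'])}</h1>\n"
        f"<p class=\"byline\">{escape(story['byline'])}, "
        f"{escape(story['dateline'])}</p>\n"
        f"<p>{escape(story['copy'])}</p>"
    )

def render_sms(story, limit=160):
    """SMS template: headline plus copy, cut to a single message."""
    text = f"{story['headline']}: {story['copy']}"
    return text if len(text) <= limit else text[:limit - 3] + "..."

TEMPLATES = {"web": render_web, "sms": render_sms}

def publish(xml_text, device):
    """Render the single XML source through the template for one device."""
    return TEMPLATES[device](parse_story(xml_text))

print(publish(STORY_XML, "sms"))
```

Supporting a new device then means adding one template to `TEMPLATES`; the stored copy is not touched.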

Says Ansari, "We had to engineer compelling content and do what one may call object-oriented journalism." Having defined the goals, the Rediff team realized that no ready-made tools were available that could do all of what it wanted. Hence the team got down to developing the technology in-house. In a way, this was nothing new for Rediff, which had developed its own mail engine for the Rediff-Mail service and managed to scale it up massively.

Building blocks

At the heart of this content exercise is the editorial workflow system called the Rediff BackYard. The Rediff BackYard works out of a browser and is the single central news desk. Regardless of where the reporters are, be it Kargil, Iraq, or the US, all content is filed into the Rediff BackYard. This produces an XML copy in the XML repository, which goes to an ‘indexer’, a complex and huge piece of technology.

The indexer takes every word in the copy and assigns it a weight depending on the importance of the word to the story. The indexer’s job is to prepare data on the basis of which machine intelligence can figure out which copies are closer to which words. The indexer is based on the mathematical concept of Bayesian inference, a topic in probability theory. The ‘categorizer’ is the next most important part. It sorts copies into multiple subject categories. The categorizer’s artificial intelligence works on the basis of hundreds of rules written just once. The rules can take the form of simple or complex Boolean expressions.
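
To make the idea of per-word weights concrete, here is a minimal sketch that uses TF-IDF, a common weighting scheme, as a stand-in. It is not Rediff's method: the article says the indexer is based on Bayesian inference and gives no formula, so the scheme, the function names and the sample stories below are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the copy and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return {story_id: {word: weight}} using TF-IDF weights.
    A word frequent in one story but rare across stories weighs more."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    n_docs = len(docs)
    index = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        index[doc_id] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return index

# Assumed sample copies, not taken from the article.
docs = {
    "s1": "Bus accident in Kerala injures ten on the highway",
    "s2": "Budget speech raises tax on the rich",
    "s3": "Kerala budget promises new highway",
}
index = build_index(docs)
top = sorted(index["s1"].items(), key=lambda kv: -kv[1])[:3]
print(top)
```

In this sample, "accident" appears in only one story and so outweighs "kerala", which appears in two.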

Added to that, there is the scheme of the International Press Telecommunications Council (IPTC), a numbered schema to classify news. Indian terms like BJP, VHP, etc. are added to the IPTC schema. The categorizer uses the weights calculated by the indexer to figure out which words in the copy are important enough to apply the rules to. It is an artificial intelligence-based algorithm with a feedback loop that plays the role of an editor, and a super-fast electronic news editor at that, with the capability of learning from experience and becoming smarter over time.
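
A rule-based categorizer of this kind might look like the sketch below, which applies Boolean rules only to words whose indexer weight clears a threshold. The rules, the threshold and the sample weights are hypothetical, and the feedback loop that lets the real system learn is not modeled here.

```python
# Hypothetical rules: category -> Boolean expression over the important words.
RULES = {
    "road accidents": lambda w: ("accident" in w or "collision" in w)
    and ("road" in w or "bus" in w or "car" in w),
    "politics": lambda w: "bjp" in w or "vhp" in w or "election" in w,
}

def important_words(weights, threshold=0.1):
    """Keep only the words whose indexer weight reaches the threshold."""
    return {word for word, weight in weights.items() if weight >= threshold}

def categorize(weights, rules=RULES, threshold=0.1):
    """Return every category whose rule holds; a copy may match several."""
    words = important_words(weights, threshold)
    return sorted(category for category, rule in rules.items() if rule(words))

# Assumed indexer output for one copy: word -> weight.
story_weights = {
    "bus": 0.4, "accident": 0.5, "kerala": 0.3, "the": 0.01, "election": 0.02,
}
print(categorize(story_weights))  # ['road accidents']
```

The word "election" occurs in the copy but falls below the threshold, so the politics rule does not fire; this is the sense in which the weights decide which words are important enough to apply the rules to.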

All this is then written back into XML and goes to the Related Content Engine (RCE). This software considers copies within each category and then looks for very strong relations between news stories there. For instance, a ‘Kerala bus accident’ story and a ‘Salman Khan road accident’ story are both in the ‘road accidents’ category, but they are not related. The RCE dissociates them and looks for an earlier ‘Salman Khan road accident’ story that can be associated. This is where the weights assigned by the indexer come to its aid. The RCE is built on the foundation of the categorizer, which in turn rests partly on the indexer.
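
One plausible way to find strong relations within a category is to compare the stories' word-weight vectors. The sketch below uses cosine similarity with a cut-off; the measure, the threshold, and the sample weights are assumptions, as the article does not say how the RCE scores a relation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def related(story_id, category_stories, threshold=0.5):
    """Return the other stories in the same category whose similarity
    to story_id reaches the threshold, strongest first."""
    target = category_stories[story_id]
    scored = [
        (cosine(target, weights), other)
        for other, weights in category_stories.items()
        if other != story_id
    ]
    return [other for score, other in sorted(scored, reverse=True)
            if score >= threshold]

# Assumed indexer weights for three copies in the 'road accidents' category.
road_accidents = {
    "salman_new": {"salman": 0.6, "khan": 0.6, "accident": 0.3, "road": 0.2},
    "salman_old": {"salman": 0.5, "khan": 0.5, "accident": 0.4, "court": 0.3},
    "kerala_bus": {"kerala": 0.6, "bus": 0.5, "accident": 0.4, "road": 0.2},
}
print(related("salman_new", road_accidents))  # ['salman_old']
```

The two Salman Khan stories share their heaviest words, while the Kerala story shares only the generic category words, so it is dissociated despite sitting in the same category.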

The team at Rediff took four sleepless months to develop the systems, and describes the work as a unique collaborative effort between the journalists and the technologists. The team also had to draw on external resources in the form of mathematical research and software adaptation. The entire system was developed from the ground up and has been functioning very well since it was rolled out this January.

Easwardas Satyan
