Uncategorized News

Demystifying Datawarehousing

DQI Bureau

15 Sep 1998 00:00 IST

New Update

For many years, companies have been working
to improve access to data for decision-making and analysis. To date, most of these efforts
have focused on enhancing or replacing the transaction processing systems that are the
entry point of most company data. Now that these operational systems contain vast amounts
of data, companies are refocusing their attention on making this information available to
end users through efficient organization and access management. The focus and technology
has shifted from data input to information availability. Datawarehousing is a
comprehensive, systematic approach that addresses this challenge for improved information
access.

Advertisment

In a datawarehouse, a company transforms
the data into a more useful resource by summarizing it (grouping it more conveniently for
end-users), transforming it (putting it into more usable formats), categorizing it (thus
providing the ability to 'slice and dice' information), and distributing it (dispersing it
to appropriate groups to increase its availability and accessibility). Datawarehouses
always have an explicit time dimension-that is, any data value stored is associated with a
particular time. This feature enables historical information to be maintained in the
datawarehouse. A key factor in designing a successful datawarehouse is utilizing a
high-level, enterprise-wide data working model.

The data is then organized for easy access
into categories (such as by subject area). The information in these subject-area
warehouses (or datamarts) is integrated through metadata. The datawarehouse is the
mechanism necessary to store and deliver the required data to a company's business users,
and it cannot be an effective mechanism without metadata.

Data cleansing is particularly important
because the results obtained from a datawarehouse are only as good as the data entered in
the first place. Many organizations focus closely on their datawarehouse implementation
projects but fail to give sufficient attention to data cleansing. In addition to affecting
the quality of results fundamentally, data cleansing is an ongoing activity throughout the
life of the datawarehouse and therefore represents both a significant cost factor and a
critical success factor. Therefore, organizations should consider appointing a data
quality manager to oversee this area of warehouse operation.

Advertisment

The Rationale

There are several ways that
datawarehousing can be used successfully by companies:

- face="Arial" size="2" color="#000000"> To provide business users with a 'customer-centric'
view of the company's heterogeneous data by helping to integrate data from sales, service
manufacturing and distribution, and other customer-related business systems.

- face="Arial" size="2" color="#000000"> To provide added value to the company's customers
by allowing them to access better information when datawarehousing is coupled with
Internet technology.

- face="Arial" size="2" color="#000000"> To consolidate data about individual customers and
provide a repository of all customer contacts for segmentation modeling, customer
retention planning, and cross-sales analysis.

- face="Arial" size="2" color="#000000"> To break down barriers between business departments
and functions by offering a way to consolidate and reconcile views from multiple 'islands
of automation', thus providing a look at activities that cross functional and departmental
lines.

- face="Arial" size="2" color="#000000"> To provide 'macro-level' information views of the
company's critical success factors from multiple data sources such as shareholder value
drivers and performance measures.

- face="Arial" size="2" color="#000000"> To enhance traditional financial reporting by
providing more timely and detailed access to information.

- face="Arial" size="2" color="#000000"> To report in global trends on group results across
multidivisional, multinational operating units, including trends or relationships in areas
such as merchandising, production planning and so on.

IDC estimates that datawarehousing is
currently in use at 20% of US companies, and that another 33% will implement
datawarehousing by the year 2000. Also, as users within those companies that have
datawarehouses begin to realize the benefits, datawarehouse usage will increase, and
companies will demand access to more data. As a result, Sentry Market Research estimates
that the raw data in datawarehouses will triple in the next two years. As company data
increases in both volume and complexity, the datawarehouse will become an even more
critical repository of timely and accurate data for decision-making.

Datawarehousing Versus Operational
Databases

One way to develop a better
understanding of datawarehousing is to contrast it with the databases that hold most of
the data from current business applications. OLTP describes systems that support
day-to-day business operations. They are transactional in nature, with many users
concurrently accessing a database to insert, retrieve, update and delete individual
records. In contrast, OLAP describes analytical systems that support business
decision-making. OLAP may be adhoc in nature and require access to data representing the
company's performance over an extended period of time. OLAP emerged due to the inability
of OLTP systems to deliver adhoc query support.

Advertisment

OLAP data is lightly to highly summarized,
unchanging and accessed as read only. OLTP data is at the most detailed level. A query
executed twice in an OLAP system will return the same results each time, but an OLTP
system may return different results because the underlying data can change. Another way to
contrast the two processing types (and associated databases) is that OLAP helps a company
decide what to do, and OLTP helps a company do it. OLTP focuses on efficient operational
management, while OLAP can be operational and focuses on value-added activities (speedy
decisions, customer service improvements and revenue increases). OLTP systems capture and
store the data needed to run the business on a daily basis, which also can be used to
create OLAP systems. The resulting OLAP systems are aggregations, transformations,
integrations and historical collections of OLTP data from one or more systems.

Operational Data Store

One component of the
datawarehouse process has emerged recently that combines some characteristics of OLAP
systems with OLTP systems. This component is generally referred to as an operational data
store (ODS). The ODS involves combining data from one or more operational databases into a
repository from which users can access granular information. As such, the ODS is similar
in many ways to an OLAP system. However, an ODS also shares several characteristics of an
OLTP system: it contains detailed operational data, it contains current data and it can be
accessed by many concurrent users.

The ODS is particularly popular as a means
of integrating various OLTP systems and supporting operational querying and reporting
while minimizing the impact on the performance of the OLTP system. Most ODSs are built to
support business operations such as sales and marketing, customer service, or report
generation from transactional systems.

Advertisment

As a component of the datawarehousing
process, the ODS is being integrated into the datawarehouse as a repository for
atomic-level detail generated by the extraction process. As such, companies are finding
that ODS provides the granular data source and the technology to support datamining
applications. In addition, if an ODS is developed, it can be used as the source from which
the datawarehouse is populated and regenerated over time.

Enabling Technologies

Datawarehousing is the latest
generation in DSSs. IT is closely related to and builds upon previous technological
approaches such as the information center, the executive information system (EIS) and
improvised spreadsheets.

Many technologies associated with
datawarehousing are not new. These technologies have been evolving continually and now
have reached the point where datawarehousing has become a feasible and cost-effective
solution for achieving the goals of earlier decision-support systems. Some of these key
supporting technologies are desktops, relational database management systems (RDBMs),
Internet/intranets, extraction/transformation products, and metadata repository tools.
Another important development has been the technological advancement (scalability) of
important components such as disk storage, hardware, networking technologies, relational
databases and so on.

Advertisment

The low cost of terabyte disk arrays as
well as the availability of powerful, easy-to-use, inexpensive desktop computers and
software (such as information access tools) are key enablers for the improved data access
provided by datawarehousing. The graphical user interface (GUI) and functionality
associated with today's desktop computers have made them much easier to use. In turn, they
can be connected via powerful networks to the company's datawarehouse, making the data
more accessible than was possible with preview DSSs.

Representative
Warehouse Generation Vendors

Advertisment

color="#FF0000">CompanyÂ Â Warehouse
Generation Product(s) Apertus
CarletonÂ Â MetaCenter is a suite of tools that
includes Apertus Enterprise/Integrator data cleansing and integration software,
Carletion's Passport data extraction and transformation product, Intellidex's Warehouse
Control Center end user information management, and Software AG's SourcePoint
high-performance scheduled data movement. D2K
Â Â D2K introduced its data migration and
transformation product, Tapestry, in April 1997. Tapestry uses BEA Systems Tuxedo
messaging middleware to handle the scheduling and execution of processes such as the
extraction of data from operational databases and the movement of that data into a
warehouse or mart. D2K says that an upgrade, Tapestry 2.0, will begin shipping in early
1998. Improvements in version 2.0 include support for more source databases (including
Teradata, Adabase, Rdb, and AS/400 databases) and modules to move data from PeopleSoft or
SAP applications into a data warehouse.

Advertisment

Evolutionary
Technologies International (ETI)Â Â ETI-EXTRACT Tool Suite automates data
collection, conversions, and migrations, and provides metadata management. Master Set and
Data Conversion Tool components interactively convert selected data from multiple sources
in any file or database format to any other format. IBM
Â Â Visual Warehouse, an application for the
quick development of datamarts, has been enhanced with data-loading agents that eliminate
size constraints and allow it to build and manage terabyte-sized datawarehouses. New
management capabilities allow users to generate code, set up automated scripts, and
monitor the extraction of data from sources to the warehouse. The foundation of Visual
Warehouse is Business Views, which defines the mapping of source data to target data and
the scheduling of extraction and load processing. The Visual Warehouse solution package
includes DataGuide, a metadata tool, Lotus' Approach for analyzing data from the
warehouse, IBM Net. Data for integrating datawarehouses and the Web, and DB2. Informatic
Â Â PowerMart Suite assists users with
designing, deploying, and managing enterprise scalable datamarts. It consists of PowerMart
Designer, which visually defines data maps and transformations: Repository, which stores
map, transformation, and datamart definitions: Server Manager, which configures and
monitors the warehousing process: PowerMart Server, which maps and transforms data from
the operational system and loads it into target datamarts; and PowerCapture which allows a
datamart to be refreshed incrementally, as changes occur in the operational system. Information
BuildersÂ Â SmartMart is a turnkey solution that
integrates tools for designing datamarts and datawarehouses, transforming and cleaning
data, and querying, analyzing and managing data. In additon to extraction and
transformation support, SmartMart components include data modeling, directories, analysis,
OLAP, datamining, distribution and publishing as well as multimensional database. Java and
OLE DB support have been being added to boost browser access and improve metadata
management. Platinum
TechnologyÂ Â InfoRefiner automates the extraction,
cleansing, and movement of legacy and mainframe data to client server databases. InfoHub
is an application development and database access tool that provides direct SQL access to
non-relational mainframe data from client server applications. InfoPump provides
bidirectional replication of data between heterogeneous databases. Praxis
InternationalÂ Â OmniEnterprise is a suite of enterprise
information movement and management products that enables companies to move, synchronize,
manage, and maintain updated database information locally, remotely, and via the Internet.
It includes OmniDirector, for modeling distributed information flow and monitoring
production information movement across an enterprise; OmniReplicator, for replicating data
on a transaction-by-transaction basis; OmniCopy, which combines cross-table mapping and
filtering with networked-routing and automated scheduling and OmniLoader, for extraction
and load synchronization. Price
WaterhouseÂ Â Geneva V/T exploits the high data
bandwidth, parallel processing, and shared memory/disk architecture of the S/390 platform
and MVS operating system to extract, cleanse, and reformat data from operational systems
to feed lightly and heavily summarized datawarehouses. It also feeds specialized
datamining processes from multiple sources. Prism
SolutionsÂ Â Prism Warehouse Executive provides
integrated capabilities for design, construction, and maintenance of datawarehouses and
datamarts. It generates code to extract, integrate, and transform data between a wide
variety of source and target databases on mainframe and client server platforms. The
metadata is stored in the Prism Warehouse Directory companion software. Sagent
TechnologyÂ Â The Sageent Data Mart Solution is a family
of integrated products for populating, managing, and accessing datamarts. It includes
Sagent Data Mart Server, Sagent Admin Sagent Information Studio, Sagent Design Studio,
Sagent WebLink, Sagent Analysis, and Sagent Reports. Vality
TechnologyÂ Â Integrity Data Reengineering Environment
is a data cleansing environment that reconditions and consolidates large volumes of legacy
data through pattern processing, lexical analysis and statistical matching. It can also
uncover data values that stray from their meta labels and business rules and standardize
different legacy file formats, data structures, and character and value representations
that vary from system to system.

The introduction of RDBMS and its
widespread adoption is another key enabling technology for datawarehousing. The RDBMS
introduced the concept of data independence and standardized data access via structured
query language (SQL). Subsequent improvements in the areas of database replication, OLAP
capabilities, improved query performance through parallelism and partitioning and greater
standardization and interoperability are also key contributors to datawarehousing.

Advances in extraction/transformation and
data cleansing products are also assisting with the development and ongoing management and
administration of datawarehouses. They simplify and automate the enormous task of
extracting and formatting the data from operational databases.

Datawarehousing Approaches

Datawarehousing approaches can
range from simple to complex, with many variations in between. There is no
one-size-fits-all solution. However, to be successful, a company's datawarehousing
strategy must be directly aligned with its overall business strategy. Once the enterprise
datawarehousing strategy has been agreed to, a company can choose a tactical plan that
uses one or more of the basic approaches that fit best with its current applications,
data, and technology architectures.

There are two major approaches to consider
within the overall architecture: the datamart and the enterprise datawarehouse. These
approaches differ greatly in scale and complexity.

Datamarts: A datamart is a
scaled-down version of a datawarehouse that focuses on a particular subject area. It is
usually designed to support the unique business requirements of a specific department or
business process. A company can have many datamarts, each focused on a subset of the
company. The trend toward greater use of the datamart strategy often is driven by
end-users and departments that want to build local datamarts focused on their specific
needs instead of waiting for a company-wide enterprise datawarehouse.

There are two major types of datamarts:
dependent and independent.

- face="Arial" size="2" color="#000000"> Dependent Datamart: Contains a
subset of the data from the datawarehouse, which acts as its authoritative source. The
data is fed from the central datawarehouse (the hub) to the datamarts (the spokes), which
are a means of accessing the data-sometimes referred to as 'data retailing' or the 'hub
and spoke' model. The dependent datamart is basically a distribution mechanism for data,
and as such, is tightly integrated into the enterprise datawarehouse.

- face="Arial" size="2" color="#000000"> Independent Datamart: Derives its
data directly from transaction-procesing systems and operates autonomously. An independent
datamart focuses on one subject or problem, and it may not be as easy to integrate with a
datawarehouse, particularly if an enterprise has multiple, independent datamarts.

The movement to datamarts is also being
accelerated by the availability of lower-priced software and hardware as well as packaged
datamart applications. For example, Informatica's PowerMart and Sagent Technology's
Datamart Builder are suites of tools to help a company design and build a datamart.

Because of its reduced scope, a datamart
takes less time to build, costs less, and is less complex than an enterprise
datawarehouse; it is easier to agree on common data definitions for a single subject area
than for an entire company. Therefore, a datamart is appropriate when a company needs to
improve data access in a targeted area, such as the marketing department. However, the
indiscriminate introduction of multiple datamarts with no linkage to each other or to an
enterprise datawarehouse will cause longer-term problems. As these datamarts proliferate
quickly, they may give rise to the same 'islands of information' problem that plagued
their earlier initiatives. Isolated datamarts are undesirable and should be developed and
deployed with the overall datawarehousing strategy so that eventually they can evolve into
an integrated decision-support environment if business circumstances require.

Datawarehouses: The
enterprise datawarehouse provides an enterprise-wide view and reconciles the various
departmental perspectives into a single, integrated corporate perspective (sometimes, the
datawarehouse can be the hub that feeds data to datamarts). This strategy is the most
complex because a datawarehouse is generally company-wide in scope, and therefore,
requires the company to establish a centralized, structured view of all its data.

A datawarehouse can be physical or virtual.
A physical datawarehouse is a central repository where all data is staged after it has
been gathered and transformed. A virtual datawarehouse contains pointers to the data that
is stored in the various business-area datamarts, creating a logical view of all the
company's data.

The enterprise datawarehouse provides a
consistent, comprehensive view of the company, with business users employing common
terminology and data standards throughout the enterprise. When put together, these two
strategies can represent stages in the evolution of a datawarehouse. In practice, few
organizations begin by implementing a datawarehouse because the cost and timeframe are
prohibitive. Instead, an organization often ends up with a cooperating collection of
datamarts forming a virtual datawarehouse. Organizations often find that a single large
datawarehouse is possible but not desirable for a number of reasons, including geography
and access to data, differing requirements, scale and scope of the organization's
activities and so on.

Data
Mart versus Datawarehouse

Typical
Characteristic Datamart Enterprise
Datawarehouse

size="2" color="#FF0000">EFFORT

Scope
Â Â A single subject
areaÂ Â Many subject
areas Time to build
Â Â Months
Â Â Years Cost to build
Â Â $hundreds of
thousandsÂ Â $millions Complexity to
buildÂ Â Low to medium
Â Â High

size="2" color="#FF0000">DATA

Requirements for
sharingÂ Â Shared (within a
business area)Â Â Shared (across
the company) Sources
Â Â Few operational
and external systems; a data warehouseÂ Â Multiple
operational and external systems SizeÂ Â Megabytes to low
gigabytesÂ Â Gigabytes to up
to 50 terabytes Time horizon
Â Â Near-current and
historical dataÂ Â Historical data,
usually more than 5 years worth of data Frequency of
updateÂ Â Daily, weekly
Â Â Daily, weekly,
monthly Number of
instancesÂ Â Multiple data
marts distributed across the companyÂ Â Single data
warehouse

size="2" color="#FF0000">TECHNOLOGY

Hardware
Â Â PCs, workgroup
servers, small to midsized Windows NT or Unix serversÂ Â Mid-sized to
large Unix servers, mainframes Typical operating
systemsÂ Â Windows NT, Unix,
As/400Â Â Unix, MVS Database
Â Â Workgroup
databaseÂ Â Large database

face="Arial" size="2" color="#FF0000">USAGE

Number of
concurrent users TensÂ Â Hundreds or
thousands Type of users
Â Â Business area
analysts and managersÂ Â Knowledge
workers, corporate analysts, and senior executives Business focus
Â Â Optimizing
activites within the business areaÂ Â Cross-functional
optimization and decision-making

Therefore, datamarts are implemented most
often, sometimes to test the waters and prove the usefulness of technologies and business
projects. Such activities are highly beneficial, but it is extremely important for an
organization to develop an overall coordinated strategy and policy to prevent the creation
of multiple, incompatible datamarts.

Whatever route is taken, it is critical to
ensure that development, implementation, and deployment are performed iteratively. Each
step should be of a short, fixed duration and offer real (measurable) benefits.

Datawarehouse Technologies

Data from operational systems
and other external systems are periodically extracted and transformed (cleansed and
summarized) by datawarehouse generation tools, and then loaded into a datawarehouse. The
datawarehouse is managed through the use of data management tools, and is accessed using
information access tools.

Datawarehouse Generation:
Datawarehouse generation is the complex process of extracting data from the operational
databases, transforming (and cleansing) the data, moving the data to the server on which
the datawarehouse is located, loading the data into the datawarehouse and managing the
datawarehouse. With most datawarehouses available for use 16 to 22 hours per day in a
read-only mode, there is usually only a 2-to-8 hour period in which to complete these
processes (often in the early morning) if daily updates are needed.

Preparing Data For The
Datawarehouse: Preparing data to be loaded into a datawarehouse involves
extraction, transportation and transformation. Extraction programs are run periodically to
extract and collect the relevant data from the operational systems (reading the
transactional system data formats and identifying the changed records). Customized
programs often perform the data extracts, but shrink-wrapped software tools are now
becoming available to assist in this process. These programs specify the source
operational systems or external system and the extraction criteria.

Database gateways that access different
database and file formats and database replication are often used in conjunction with
other software to extract data from operational systems. The software transports the
extracted data to an intermediate file.

After the data has been transferred to the
intermediate file (which can be an ODS), transformation programs or utilities are run to
prepare the data for the datawarehouse. The transformation process includes the following
steps:

- face="Arial" size="2" color="#000000"> Consolidating the data from multiple sources

- face="Arial" size="2" color="#000000"> Filtering the data to eliminate unnecessary details
or fields.

- face="Arial" size="2" color="#000000"> Cleansing the data to eliminate incorrect or
duplicate data.

- face="Arial" size="2" color="#000000"> Converting and translating the extracted
operational data

- face="Arial" size="2" color="#000000"> Aggregating the data.

Consolidating the data involves integrating
data from multiple sources (such as internal OLTP applications or external information
providers) to create a consolidated view of the data. The records from these various data
sources are combined into one file, which is further processed prior to being loaded into
the datawarehouse.

Filtering the data in the consolidated file
involves selecting only the data that the datawarehouse needs and identifying any
unnecessary data. Unnecessary records are removed from the consolidated file. The
remaining records are filtered to remove unnecessary or duplicate attributes.

Cleansing the data is a particularly
important and difficult process. Data such as customer or mailing information must be
accurate, but it is often difficult to recognize that multiple entries representing the
same entity exist in data sources (such as Bill Jones, 123 Willow Street; William Hones,
123 Willow St; and W Jones, 123 Willow Street). Stale, redundant or poor quality data must
be identified and corrected before it is introduced into the datawarehouse.

Sophisticated data-scrubbing tools have
emerged to help improve the quality of data before it is put into a datawarehouse. For
example, Vality Technology's Integrity Data Reengineering Environment and Apertus'
Enterprise/Integrator tools use pattern analysis, fuzzy logic, lexical analysis and
statistical matching to identify and consolidate logically redundant data such as names
and addresses, resolve value conflicts and validate the integrated result against business
rules. These tools can also help uncover data values that are outside the normal range and
can standardize the different legacy file formats and data structures.

Converting and translating the source data
involves mapping the source data to the target data structure and converting the source
data to the database format. The mapping process invokes rules that convert the values of
the data used in the source application into the values used in the decision-support
environment. The mapping process accomplishes the following:

- face="Arial" size="2" color="#000000"> Defines the logical association of data from the
source systems to the target database (for example, CUST_N field with the format FIRSTNAME
MI LASTNAME in the source data to CUSTOMER field with the format LASTNAME FIRSTNAME MI in
the target database)

- face="Arial" size="2" color="#000000"> Translate column, field, and table names which are
often cryptic technology names) into the business terminology of the large database.

- face="Arial" size="2" color="#000000"> Translates data codes (such as 1 for Male) to more
meaningful codes (such as M for Male ) or descriptions.

The availability of data models
(description of a database's organization, often created as entity relationship diagrams)
is a key contributor to success in the complex and critical mapping process. Converting
the data involves converting the data layout (data type, length, format mask and so on) of
the source system to the target database format. Data conversion engines typically accept
SQL data or flat or sequential files as input, and they output data in the appropriate
file format for loading into the target database.

Aggregating the data involves sorting and
summarizing the data, and creating generalized key for the aggregates so that they can be
easily queried. Data can be aggregated along any dimension, independent of any other
dimension. For example, if atomic data exists at the product, store, or day level, users
can aggregate it by product category, region or month. In general, it is possible to
create more levels of aggregates than are practical, given storage constraints and the
time required to compute these aggregates. The challenge is to determine where aggregates
are used frequently or where they are useful to a significant percentage of the user
queries. Incorrect summarizations can yield incorrect or misleading results, or impair the
efficiency of the datawarehouse.

Loading Data Into The Datawarehouse

After the data has been
transformed into the appropriate format, it is loaded into the datawarehouse, which
includes preparing the data for access. The loading process can be a bulk load (to
establish the datawarehouse), a trickle load (to replicate changes as they occur in the
operational system) or a periodic incremental load (to refresh the database with new
snapshots of the operational data on a regular basis). Errors and exceptions from the
loading process must also be identified and resolved.

Modeling The Data In The
Datawarehouse

The data must also be
organized properly for easy access. This process includes summarizing the data within
hierarchies; in addition, it may include denormalizing the data, time-stamping the data,
dimensioning the data, building appropriate indexes and updating models of the data. The
terms multidimensional, star schema and snowflake modeling are used to describe this
modeling technique.

Besides being highly summarized, OLAP data
is also denormalized (by including descriptive information in the summary table, typically
resulting in fewer but larger tables). This process reduces the number of joins necessary
when querying improve data integrity for updates. Another major difference in data
organization is the use of data partitioning: some OLAP applications use data partitioning
(breaking up a big table into several smaller tables) to speed data loading and improve
data access; however, OLTP applications rarely use this form of partitioning because of
data integrity concerns associated with updating the data.

Delivering Value

Operationally, datawarehousing
can allow organizations to spend significantly less time and fewer resources on finding,
cleaning and accumulating data to build reports that reconstruct a view of the past. More
important, datawarehousing delivers strategic business value when it enables management
and other workers that require business information to make faster and more effective
decisions. The implementation of datawarehousing can also facilitate synergistic thinking
and establish fact-based dialog about business performance across the enterprise.