
Demystifying Datawarehousing

DQI Bureau

For many years, companies have been working

to improve access to data for decision-making and analysis. To date, most of these efforts

have focused on enhancing or replacing the transaction processing systems that are the

entry point of most company data. Now that these operational systems contain vast amounts

of data, companies are refocusing their attention on making this information available to

end users through efficient organization and access management. The focus and technology

have shifted from data input to information availability. Datawarehousing is a

comprehensive, systematic approach that addresses this challenge for improved information

access.


In a datawarehouse, a company transforms

the data into a more useful resource by summarizing it (grouping it more conveniently for

end-users), transforming it (putting it into more usable formats), categorizing it (thus

providing the ability to 'slice and dice' information), and distributing it (dispersing it

to appropriate groups to increase its availability and accessibility). Datawarehouses

always have an explicit time dimension-that is, any data value stored is associated with a

particular time. This feature enables historical information to be maintained in the

datawarehouse. A key factor in designing a successful datawarehouse is utilizing a

high-level, enterprise-wide data model.

The data is then organized for easy access

into categories (such as by subject area). The information in these subject-area

warehouses (or datamarts) is integrated through metadata. The datawarehouse is the

mechanism necessary to store and deliver the required data to a company's business users,

and it cannot be an effective mechanism without metadata.

Data cleansing is particularly important

because the results obtained from a datawarehouse are only as good as the data entered in

the first place. Many organizations focus closely on their datawarehouse implementation

projects but fail to give sufficient attention to data cleansing. In addition to affecting

the quality of results fundamentally, data cleansing is an ongoing activity throughout the

life of the datawarehouse and therefore represents both a significant cost factor and a

critical success factor. Therefore, organizations should consider appointing a data

quality manager to oversee this area of warehouse operation.


The Rationale



There are several ways that

datawarehousing can be used successfully by companies:



- face="Arial" size="2" color="#000000"> To provide business users with a 'customer-centric'

view of the company's heterogeneous data by helping to integrate data from sales, service

manufacturing and distribution, and other customer-related business systems.



- face="Arial" size="2" color="#000000"> To provide added value to the company's customers
by allowing them to access better information when datawarehousing is coupled with

Internet technology.



- face="Arial" size="2" color="#000000"> To consolidate data about individual customers and
provide a repository of all customer contacts for segmentation modeling, customer

retention planning, and cross-sales analysis.



- face="Arial" size="2" color="#000000"> To break down barriers between business departments
and functions by offering a way to consolidate and reconcile views from multiple 'islands

of automation', thus providing a look at activities that cross functional and departmental

lines.



- face="Arial" size="2" color="#000000"> To provide 'macro-level' information views of the
company's critical success factors from multiple data sources such as shareholder value

drivers and performance measures.



- face="Arial" size="2" color="#000000"> To enhance traditional financial reporting by
providing more timely and detailed access to information.



- face="Arial" size="2" color="#000000"> To report in global trends on group results across
multidivisional, multinational operating units, including trends or relationships in areas

such as merchandising, production planning and so on.





IDC estimates that datawarehousing is

currently in use at 20% of US companies, and that another 33% will implement

datawarehousing by the year 2000. Also, as users within those companies that have

datawarehouses begin to realize the benefits, datawarehouse usage will increase, and

companies will demand access to more data. As a result, Sentry Market Research estimates

that the raw data in datawarehouses will triple in the next two years. As company data

increases in both volume and complexity, the datawarehouse will become an even more

critical repository of timely and accurate data for decision-making.

Datawarehousing Versus Operational

Databases



One way to develop a better

understanding of datawarehousing is to contrast it with the databases that hold most of

the data from current business applications. Online transaction processing (OLTP) describes systems that support day-to-day business operations. They are transactional in nature, with many users concurrently accessing a database to insert, retrieve, update and delete individual records. In contrast, online analytical processing (OLAP) describes analytical systems that support business decision-making. OLAP may be ad hoc in nature and require access to data representing the company's performance over an extended period of time. OLAP emerged due to the inability of OLTP systems to deliver ad hoc query support.


OLAP data is lightly to highly summarized,

unchanging and accessed as read only. OLTP data is at the most detailed level. A query

executed twice in an OLAP system will return the same results each time, but an OLTP

system may return different results because the underlying data can change. Another way to

contrast the two processing types (and associated databases) is that OLAP helps a company

decide what to do, and OLTP helps a company do it. OLTP focuses on efficient operational

management, while OLAP focuses on value-added activities (speedy

decisions, customer service improvements and revenue increases). OLTP systems capture and

store the data needed to run the business on a daily basis, which also can be used to

create OLAP systems. The resulting OLAP systems are aggregations, transformations,

integrations and historical collections of OLTP data from one or more systems.
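
To make the contrast concrete, the following minimal sketch (in Python, using the built-in sqlite3 module and a hypothetical orders table) shows OLTP-style work on individual records alongside the kind of read-only summary an OLAP system would hold.

```python
# Minimal sketch: an OLTP-style detail table versus an OLAP-style summary
# derived from it. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (          -- OLTP: one row per transaction
    order_id INTEGER PRIMARY KEY,
    customer TEXT, product TEXT,
    order_date TEXT, amount REAL)""")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [(1, "Jones", "Widget", "1998-01-05", 120.0),
     (2, "Smith", "Widget", "1998-01-20",  80.0),
     (3, "Jones", "Gadget", "1998-02-02", 200.0)])

# OLTP work: insert, update and read individual records.
con.execute("UPDATE orders SET amount = 90.0 WHERE order_id = 2")

# OLAP work: a read-only aggregation of the same detail data by month and
# product -- the kind of summary a datawarehouse would hold.
summary = con.execute("""
    SELECT substr(order_date, 1, 7) AS month, product,
           COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY month, product
    ORDER BY month, product""").fetchall()
for row in summary:
    print(row)   # e.g. ('1998-01', 'Widget', 2, 210.0)
```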

Operational Data Store



One component of the

datawarehouse process has emerged recently that combines some characteristics of OLAP

systems with OLTP systems. This component is generally referred to as an operational data

store (ODS). The ODS involves combining data from one or more operational databases into a

repository from which users can access granular information. As such, the ODS is similar

in many ways to an OLAP system. However, an ODS also shares several characteristics of an

OLTP system: it contains detailed operational data, it contains current data and it can be

accessed by many concurrent users.

The ODS is particularly popular as a means

of integrating various OLTP systems and supporting operational querying and reporting

while minimizing the impact on the performance of the OLTP system. Most ODSs are built to

support business operations such as sales and marketing, customer service, or report

generation from transactional systems.


As a component of the datawarehousing

process, the ODS is being integrated into the datawarehouse as a repository for

atomic-level detail generated by the extraction process. As such, companies are finding

that ODS provides the granular data source and the technology to support datamining

applications. In addition, if an ODS is developed, it can be used as the source from which

the datawarehouse is populated and regenerated over time.

Enabling Technologies



Datawarehousing is the latest generation of decision-support systems (DSSs). It is closely related to and builds upon previous technological approaches such as the information center, the executive information system (EIS) and improvised spreadsheets.

Many technologies associated with

datawarehousing are not new. These technologies have been evolving continually and now

have reached the point where datawarehousing has become a feasible and cost-effective

solution for achieving the goals of earlier decision-support systems. Some of these key

supporting technologies are desktops, relational database management systems (RDBMSs),

Internet/intranets, extraction/transformation products, and metadata repository tools.

Another important development has been the technological advancement (scalability) of

important components such as disk storage, hardware, networking technologies, relational

databases and so on.


The low cost of terabyte disk arrays as

well as the availability of powerful, easy-to-use, inexpensive desktop computers and

software (such as information access tools) are key enablers for the improved data access

provided by datawarehousing. The graphical user interface (GUI) and functionality

associated with today's desktop computers have made them much easier to use. In turn, they

can be connected via powerful networks to the company's datawarehouse, making the data

more accessible than was possible with previous DSSs.



Representative Warehouse Generation Vendors

Apertus Carleton: MetaCenter is a suite of tools that includes Apertus' Enterprise/Integrator data cleansing and integration software, Carleton's Passport data extraction and transformation product, Intellidex's Warehouse Control Center end-user information management, and Software AG's SourcePoint high-performance scheduled data movement.

D2K: D2K introduced its data migration and transformation product, Tapestry, in April 1997. Tapestry uses BEA Systems' Tuxedo messaging middleware to handle the scheduling and execution of processes such as the extraction of data from operational databases and the movement of that data into a warehouse or mart. D2K says that an upgrade, Tapestry 2.0, will begin shipping in early 1998. Improvements in version 2.0 include support for more source databases (including Teradata, Adabas, Rdb, and AS/400 databases) and modules to move data from PeopleSoft or SAP applications into a data warehouse.

Evolutionary Technologies International (ETI): ETI-EXTRACT Tool Suite automates data collection, conversions, and migrations, and provides metadata management. Master Set and Data Conversion Tool components interactively convert selected data from multiple sources in any file or database format to any other format.

IBM: Visual Warehouse, an application for the quick development of datamarts, has been enhanced with data-loading agents that eliminate size constraints and allow it to build and manage terabyte-sized datawarehouses. New management capabilities allow users to generate code, set up automated scripts, and monitor the extraction of data from sources to the warehouse. The foundation of Visual Warehouse is Business Views, which defines the mapping of source data to target data and the scheduling of extraction and load processing. The Visual Warehouse solution package includes DataGuide, a metadata tool; Lotus Approach for analyzing data from the warehouse; IBM Net.Data for integrating datawarehouses and the Web; and DB2.

Informatica: PowerMart Suite assists users with designing, deploying, and managing enterprise-scalable datamarts. It consists of PowerMart Designer, which visually defines data maps and transformations; Repository, which stores map, transformation, and datamart definitions; Server Manager, which configures and monitors the warehousing process; PowerMart Server, which maps and transforms data from the operational system and loads it into target datamarts; and PowerCapture, which allows a datamart to be refreshed incrementally as changes occur in the operational system.

Information Builders: SmartMart is a turnkey solution that integrates tools for designing datamarts and datawarehouses, transforming and cleaning data, and querying, analyzing and managing data. In addition to extraction and transformation support, SmartMart components include data modeling, directories, analysis, OLAP, datamining, distribution and publishing, as well as a multidimensional database. Java and OLE DB support are being added to boost browser access and improve metadata management.

Platinum Technology: InfoRefiner automates the extraction, cleansing, and movement of legacy and mainframe data to client-server databases. InfoHub is an application development and database access tool that provides direct SQL access to non-relational mainframe data from client-server applications. InfoPump provides bidirectional replication of data between heterogeneous databases.

Praxis International: OmniEnterprise is a suite of enterprise information movement and management products that enables companies to move, synchronize, manage, and maintain updated database information locally, remotely, and via the Internet. It includes OmniDirector, for modeling distributed information flow and monitoring production information movement across an enterprise; OmniReplicator, for replicating data on a transaction-by-transaction basis; OmniCopy, which combines cross-table mapping and filtering with networked routing and automated scheduling; and OmniLoader, for extraction and load synchronization.

Price Waterhouse: Geneva V/T exploits the high data bandwidth, parallel processing, and shared memory/disk architecture of the S/390 platform and MVS operating system to extract, cleanse, and reformat data from operational systems to feed lightly and heavily summarized datawarehouses. It also feeds specialized datamining processes from multiple sources.

Prism Solutions: Prism Warehouse Executive provides integrated capabilities for design, construction, and maintenance of datawarehouses and datamarts. It generates code to extract, integrate, and transform data between a wide variety of source and target databases on mainframe and client-server platforms. The metadata is stored in the Prism Warehouse Directory companion software.

Sagent Technology: The Sagent Data Mart Solution is a family of integrated products for populating, managing, and accessing datamarts. It includes Sagent Data Mart Server, Sagent Admin, Sagent Information Studio, Sagent Design Studio, Sagent WebLink, Sagent Analysis, and Sagent Reports.

Vality Technology: Integrity Data Reengineering Environment is a data cleansing environment that reconditions and consolidates large volumes of legacy data through pattern processing, lexical analysis and statistical matching. It can also uncover data values that stray from their meta labels and business rules, and standardize different legacy file formats, data structures, and character and value representations that vary from system to system.

The introduction of RDBMS and its

widespread adoption is another key enabling technology for datawarehousing. The RDBMS

introduced the concept of data independence and standardized data access via structured

query language (SQL). Subsequent improvements in the areas of database replication, OLAP

capabilities, improved query performance through parallelism and partitioning and greater

standardization and interoperability are also key contributors to datawarehousing.

Advances in extraction/transformation and

data cleansing products are also assisting with the development and ongoing management and

administration of datawarehouses. They simplify and automate the enormous task of

extracting and formatting the data from operational databases.

Datawarehousing Approaches



Datawarehousing approaches can

range from simple to complex, with many variations in between. There is no

one-size-fits-all solution. However, to be successful, a company's datawarehousing

strategy must be directly aligned with its overall business strategy. Once the enterprise

datawarehousing strategy has been agreed to, a company can choose a tactical plan that

uses one or more of the basic approaches that fit best with its current applications,

data, and technology architectures.

There are two major approaches to consider

within the overall architecture: the datamart and the enterprise datawarehouse. These

approaches differ greatly in scale and complexity.

Datamarts: A datamart is a

scaled-down version of a datawarehouse that focuses on a particular subject area. It is

usually designed to support the unique business requirements of a specific department or

business process. A company can have many datamarts, each focused on a subset of the

company. The trend toward greater use of the datamart strategy often is driven by

end-users and departments that want to build local datamarts focused on their specific

needs instead of waiting for a company-wide enterprise datawarehouse.

There are two major types of datamarts:

dependent and independent.



- face="Arial" size="2" color="#000000"> Dependent Datamart: Contains a

subset of the data from the datawarehouse, which acts as its authoritative source. The

data is fed from the central datawarehouse (the hub) to the datamarts (the spokes), which

are a means of accessing the data-sometimes referred to as 'data retailing' or the 'hub

and spoke' model. The dependent datamart is basically a distribution mechanism for data,

and as such, is tightly integrated into the enterprise datawarehouse.



- face="Arial" size="2" color="#000000"> Independent Datamart: Derives its
data directly from transaction-procesing systems and operates autonomously. An independent

datamart focuses on one subject or problem, and it may not be as easy to integrate with a

datawarehouse, particularly if an enterprise has multiple, independent datamarts.

The movement to datamarts is also being

accelerated by the availability of lower-priced software and hardware as well as packaged

datamart applications. For example, Informatica's PowerMart and Sagent Technology's

Datamart Builder are suites of tools to help a company design and build a datamart.

Because of its reduced scope, a datamart

takes less time to build, costs less, and is less complex than an enterprise

datawarehouse; it is easier to agree on common data definitions for a single subject area

than for an entire company. Therefore, a datamart is appropriate when a company needs to

improve data access in a targeted area, such as the marketing department. However, the

indiscriminate introduction of multiple datamarts with no linkage to each other or to an

enterprise datawarehouse will cause longer-term problems. As these datamarts proliferate

quickly, they may give rise to the same 'islands of information' problem that plagued

their earlier initiatives. Isolated datamarts are undesirable and should be developed and

deployed in line with the overall datawarehousing strategy so that eventually they can evolve into

an integrated decision-support environment if business circumstances require.

Datawarehouses: The

enterprise datawarehouse provides an enterprise-wide view and reconciles the various

departmental perspectives into a single, integrated corporate perspective (sometimes, the

datawarehouse can be the hub that feeds data to datamarts). This strategy is the most

complex because a datawarehouse is generally company-wide in scope, and therefore,

requires the company to establish a centralized, structured view of all its data.

A datawarehouse can be physical or virtual.

A physical datawarehouse is a central repository where all data is staged after it has

been gathered and transformed. A virtual datawarehouse contains pointers to the data that

is stored in the various business-area datamarts, creating a logical view of all the

company's data.

The enterprise datawarehouse provides a

consistent, comprehensive view of the company, with business users employing common

terminology and data standards throughout the enterprise. When put together, these two

strategies can represent stages in the evolution of a datawarehouse. In practice, few

organizations begin by implementing a datawarehouse because the cost and timeframe are

prohibitive. Instead, an organization often ends up with a cooperating collection of

datamarts forming a virtual datawarehouse. Organizations often find that a single large

datawarehouse is possible but not desirable for a number of reasons, including geography

and access to data, differing requirements, scale and scope of the organization's

activities and so on.



Datamart versus Datawarehouse

| Typical Characteristic | Datamart | Enterprise Datawarehouse |
| --- | --- | --- |
| EFFORT | | |
| Scope | A single subject area | Many subject areas |
| Time to build | Months | Years |
| Cost to build | $hundreds of thousands | $millions |
| Complexity to build | Low to medium | High |
| DATA | | |
| Requirements for sharing | Shared (within a business area) | Shared (across the company) |
| Sources | Few operational and external systems; a datawarehouse | Multiple operational and external systems |
| Size | Megabytes to low gigabytes | Gigabytes to up to 50 terabytes |
| Time horizon | Near-current and historical data | Historical data, usually more than 5 years' worth |
| Frequency of update | Daily, weekly | Daily, weekly, monthly |
| Number of instances | Multiple datamarts distributed across the company | Single datawarehouse |
| TECHNOLOGY | | |
| Hardware | PCs, workgroup servers, small to midsized Windows NT or Unix servers | Mid-sized to large Unix servers, mainframes |
| Typical operating systems | Windows NT, Unix, AS/400 | Unix, MVS |
| Database | Workgroup database | Large database |
| USAGE | | |
| Number of concurrent users | Tens | Hundreds or thousands |
| Type of users | Business area analysts and managers | Knowledge workers, corporate analysts, and senior executives |
| Business focus | Optimizing activities within the business area | Cross-functional optimization and decision-making |

Therefore, datamarts are implemented most

often, sometimes to test the waters and prove the usefulness of technologies and business

projects. Such activities are highly beneficial, but it is extremely important for an

organization to develop an overall coordinated strategy and policy to prevent the creation

of multiple, incompatible datamarts.

Whatever route is taken, it is critical to

ensure that development, implementation, and deployment are performed iteratively. Each

step should be of a short, fixed duration and offer real (measurable) benefits.

Datawarehouse Technologies



Data from operational systems

and other external systems are periodically extracted and transformed (cleansed and

summarized) by datawarehouse generation tools, and then loaded into a datawarehouse. The

datawarehouse is managed through the use of data management tools, and is accessed using

information access tools.

Datawarehouse Generation:

Datawarehouse generation is the complex process of extracting data from the operational

databases, transforming (and cleansing) the data, moving the data to the server on which

the datawarehouse is located, loading the data into the datawarehouse and managing the

datawarehouse. With most datawarehouses available for use 16 to 22 hours per day in a

read-only mode, there is usually only a 2-to-8 hour period in which to complete these

processes (often in the early morning) if daily updates are needed.

Preparing Data For The

Datawarehouse:
Preparing data to be loaded into a datawarehouse involves

extraction, transportation and transformation. Extraction programs are run periodically to

extract and collect the relevant data from the operational systems (reading the

transactional system data formats and identifying the changed records). Customized

programs often perform the data extracts, but shrink-wrapped software tools are now

becoming available to assist in this process. These programs specify the source

operational systems or external system and the extraction criteria.

Database gateways that access different

database and file formats and database replication are often used in conjunction with

other software to extract data from operational systems. The software transports the

extracted data to an intermediate file.
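
The sketch below illustrates one way such a periodic extract might work, assuming a hypothetical customers table with a last_modified column and a simple bookmark file that remembers the previous cutoff; real extraction tools are considerably more sophisticated.

```python
# Hedged sketch: periodic extraction of changed records from an operational
# database into an intermediate flat file. The "last_modified" column and
# the bookmark file are illustrative assumptions, not a standard mechanism.
import csv
import sqlite3
from pathlib import Path

BOOKMARK = Path("last_extract.txt")          # remembers the previous cutoff

def extract_changed_rows(db_path: str, out_path: str) -> int:
    since = BOOKMARK.read_text().strip() if BOOKMARK.exists() else "1900-01-01T00:00:00"
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT cust_id, name, address, last_modified "
        "FROM customers WHERE last_modified > ?", (since,)).fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["cust_id", "name", "address", "last_modified"])
        writer.writerows(rows)
    if rows:                                  # advance the bookmark
        BOOKMARK.write_text(max(r[3] for r in rows))
    con.close()
    return len(rows)
```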

After the data has been transferred to the

intermediate file (which can be an ODS), transformation programs or utilities are run to

prepare the data for the datawarehouse. The transformation process includes the following

steps:



- face="Arial" size="2" color="#000000"> Consolidating the data from multiple sources



- face="Arial" size="2" color="#000000"> Filtering the data to eliminate unnecessary details
or fields.



- face="Arial" size="2" color="#000000"> Cleansing the data to eliminate incorrect or
duplicate data.



- face="Arial" size="2" color="#000000"> Converting and translating the extracted
operational data



- face="Arial" size="2" color="#000000"> Aggregating the data.


Consolidating the data involves integrating

data from multiple sources (such as internal OLTP applications or external information

providers) to create a consolidated view of the data. The records from these various data

sources are combined into one file, which is further processed prior to being loaded into

the datawarehouse.

Filtering the data in the consolidated file

involves selecting only the data that the datawarehouse needs and identifying any

unnecessary data. Unnecessary records are removed from the consolidated file. The

remaining records are filtered to remove unnecessary or duplicate attributes.
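
A minimal sketch of the consolidation and filtering steps follows; the source file layouts, the status flag and the list of fields the warehouse keeps are illustrative assumptions.

```python
# Sketch of consolidation and filtering. The source record layouts and the
# list of fields the warehouse needs are hypothetical.
import csv

WANTED_FIELDS = ["cust_id", "name", "city", "order_total"]   # keep only these

def consolidate(*source_files: str) -> list[dict]:
    """Combine rows from several extract files into one in-memory list."""
    combined = []
    for path in source_files:
        with open(path, newline="") as f:
            combined.extend(csv.DictReader(f))
    return combined

def filter_rows(rows: list[dict]) -> list[dict]:
    """Drop records the warehouse does not need and unwanted attributes."""
    kept = []
    for row in rows:
        if row.get("status") == "CANCELLED":      # unnecessary records
            continue
        kept.append({k: row.get(k, "") for k in WANTED_FIELDS})
    return kept

# Usage (assuming the extract files exist):
#   rows = filter_rows(consolidate("sales_extract.csv", "service_extract.csv"))
```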

Cleansing the data is a particularly

important and difficult process. Data such as customer or mailing information must be

accurate, but it is often difficult to recognize that multiple entries representing the

same entity exist in data sources (such as Bill Jones, 123 Willow Street; William Hones,

123 Willow St; and W Jones, 123 Willow Street). Stale, redundant or poor quality data must

be identified and corrected before it is introduced into the datawarehouse.

Sophisticated data-scrubbing tools have

emerged to help improve the quality of data before it is put into a datawarehouse. For

example, Vality Technology's Integrity Data Reengineering Environment and Apertus'

Enterprise/Integrator tools use pattern analysis, fuzzy logic, lexical analysis and

statistical matching to identify and consolidate logically redundant data such as names

and addresses, resolve value conflicts and validate the integrated result against business

rules. These tools can also help uncover data values that are outside the normal range and

can standardize the different legacy file formats and data structures.
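
The following simplified sketch shows the flavor of duplicate detection, using name and address normalization plus a string-similarity score; the abbreviation table and the 0.85 threshold are assumptions, and commercial scrubbing tools apply far richer pattern and lexical analysis.

```python
# Simplified sketch of duplicate detection during data cleansing. Real
# scrubbing tools use far richer pattern and lexical analysis; the
# abbreviation table and 0.85 threshold here are illustrative assumptions.
from difflib import SequenceMatcher

ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue",
                 "w": "william", "bill": "william"}

def normalize(text: str) -> str:
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def likely_same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(likely_same_entity("Bill Jones, 123 Willow Street",
                         "W Jones, 123 Willow St"))        # True
print(likely_same_entity("Bill Jones, 123 Willow Street",
                         "Mary Smith, 9 Oak Road"))        # False
```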

Converting and translating the source data

involves mapping the source data to the target data structure and converting the source

data to the database format. The mapping process invokes rules that convert the values of

the data used in the source application into the values used in the decision-support

environment. The mapping process accomplishes the following:



- face="Arial" size="2" color="#000000"> Defines the logical association of data from the

source systems to the target database (for example, CUST_N field with the format FIRSTNAME

MI LASTNAME in the source data to CUSTOMER field with the format LASTNAME FIRSTNAME MI in

the target database)



- face="Arial" size="2" color="#000000"> Translate column, field, and table names which are
often cryptic technology names) into the business terminology of the large database.



- face="Arial" size="2" color="#000000"> Translates data codes (such as 1 for Male) to more
meaningful codes (such as M for Male ) or descriptions.
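
A small sketch of such a mapping rule is shown below, echoing the CUST_N/CUSTOMER example above; the gender-code translation and the rest of the record layout are assumed for illustration.

```python
# Sketch of the mapping step: rename/reorder a source name field and
# translate coded values into business terminology. The gender codes and
# the exact record layout beyond CUST_N/CUSTOMER are assumptions.
GENDER_CODES = {"1": "M", "2": "F"}   # source code -> warehouse code

def map_record(source: dict) -> dict:
    # CUST_N holds "FIRSTNAME MI LASTNAME"; CUSTOMER wants "LASTNAME FIRSTNAME MI"
    first, mi, last = source["CUST_N"].split()
    return {
        "CUSTOMER": f"{last} {first} {mi}",
        "GENDER": GENDER_CODES.get(source["GENDER_CD"], "U"),
    }

print(map_record({"CUST_N": "Bill J Jones", "GENDER_CD": "1"}))
# {'CUSTOMER': 'Jones Bill J', 'GENDER': 'M'}
```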

The availability of data models

(description of a database's organization, often created as entity relationship diagrams)

is a key contributor to success in the complex and critical mapping process. Converting

the data involves converting the data layout (data type, length, format mask and so on) of

the source system to the target database format. Data conversion engines typically accept

SQL data or flat or sequential files as input, and they output data in the appropriate

file format for loading into the target database.

Aggregating the data involves sorting and

summarizing the data, and creating generalized keys for the aggregates so that they can be

easily queried. Data can be aggregated along any dimension, independent of any other

dimension. For example, if atomic data exists at the product, store, or day level, users

can aggregate it by product category, region or month. In general, it is possible to

create more levels of aggregates than are practical, given storage constraints and the

time required to compute these aggregates. The challenge is to determine where aggregates

are used frequently or where they are useful to a significant percentage of the user

queries. Incorrect summarizations can yield incorrect or misleading results, or impair the

efficiency of the datawarehouse.
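
The sketch below aggregates hypothetical atomic facts at the product/store/day level up to category and month; the record layout and the category lookup are assumptions.

```python
# Sketch of the aggregation step: roll atomic (product, store, day) facts
# up to (category, month). Record layout and category lookup are assumed.
from collections import defaultdict

CATEGORY = {"Widget": "Hardware", "Gadget": "Hardware", "Manual": "Books"}

def aggregate(facts: list[dict]) -> dict:
    totals = defaultdict(float)
    for f in facts:
        key = (CATEGORY.get(f["product"], "Other"), f["date"][:7])  # month
        totals[key] += f["amount"]
    return dict(totals)

facts = [
    {"product": "Widget", "store": "S1", "date": "1998-01-05", "amount": 120.0},
    {"product": "Gadget", "store": "S2", "date": "1998-01-20", "amount": 200.0},
    {"product": "Manual", "store": "S1", "date": "1998-02-02", "amount": 15.0},
]
print(aggregate(facts))
# {('Hardware', '1998-01'): 320.0, ('Books', '1998-02'): 15.0}
```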

Loading Data Into The Datawarehouse



After the data has been

transformed into the appropriate format, it is loaded into the datawarehouse, which

includes preparing the data for access. The loading process can be a bulk load (to

establish the datawarehouse), a trickle load (to replicate changes as they occur in the

operational system) or a periodic incremental load (to refresh the database with new

snapshots of the operational data on a regular basis). Errors and exceptions from the

loading process must also be identified and resolved.
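
A minimal sketch of a periodic incremental load with simple exception handling follows; the warehouse table, its schema and the validation rule are assumed for illustration.

```python
# Sketch of a periodic incremental load into a warehouse table, with
# rejected rows set aside for review. Table and column names are assumed.
import sqlite3

def incremental_load(warehouse: sqlite3.Connection, rows: list[tuple]) -> list[tuple]:
    """Insert new snapshot rows; return the rows that failed validation."""
    rejects = []
    for row in rows:
        cust_id, month, revenue = row
        if revenue is None or revenue < 0:          # exception handling
            rejects.append(row)
            continue
        warehouse.execute(
            "INSERT OR REPLACE INTO monthly_revenue (cust_id, month, revenue) "
            "VALUES (?, ?, ?)", row)
    warehouse.commit()
    return rejects

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE monthly_revenue (cust_id TEXT, month TEXT, revenue REAL, "
           "PRIMARY KEY (cust_id, month))")
bad = incremental_load(wh, [("C1", "1998-01", 210.0), ("C2", "1998-01", -5.0)])
print(bad)   # [('C2', '1998-01', -5.0)] -- flagged for resolution
```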

Modeling The Data In The

Datawarehouse



The data must also be

organized properly for easy access. This process includes summarizing the data within

hierarchies; in addition, it may include denormalizing the data, time-stamping the data,

dimensioning the data, building appropriate indexes and updating models of the data. The

terms multidimensional, star schema and snowflake modeling are used to describe this

modeling technique.

Besides being highly summarized, OLAP data

is also denormalized (by including descriptive information in the summary table, typically

resulting in fewer but larger tables). This process reduces the number of joins necessary

when querying, though it can compromise data integrity for updates. Another major difference in data

organization is the use of data partitioning: some OLAP applications use data partitioning

(breaking up a big table into several smaller tables) to speed data loading and improve

data access; however, OLTP applications rarely use this form of partitioning because of

data integrity concerns associated with updating the data.
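
As an illustration of this kind of organization, the sketch below builds a tiny, hypothetical star schema (one fact table joined to product and time dimensions) and queries it along both dimensions; all table and column names are assumed.

```python
# Minimal star-schema sketch: a sales fact table surrounded by product and
# time dimensions, queried along both. All names here are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY,
                          day TEXT, month TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, time_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_time    VALUES (10, '1998-01-05', '1998-01'),
                               (11, '1998-02-02', '1998-02');
INSERT INTO fact_sales  VALUES (1, 10, 120.0), (1, 11, 90.0), (2, 11, 15.0);
""")

# 'Slice and dice': revenue by category and month via joins to the dimensions.
for row in con.execute("""
    SELECT p.category, t.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_time    t ON t.time_key    = f.time_key
    GROUP BY p.category, t.month"""):
    print(row)
```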

Delivering Value



Operationally, datawarehousing

can allow organizations to spend significantly less time and fewer resources on finding,

cleaning and accumulating data to build reports that reconstruct a view of the past. More

important, datawarehousing delivers strategic business value when it enables management

and other workers who require business information to make faster and more effective

decisions. The implementation of datawarehousing can also facilitate synergistic thinking

and establish fact-based dialog about business performance across the enterprise.

Excerpted with

permission from



Technology Forecast 1998


© Price Waterhouse Associates.
