Friday, April 25, 2014

Data warehousing Basic Concepts


Data warehousing::

Definition:
A data warehouse is Subject Oriented, Integrated, Time Variant and Nonvolatile collection of data that support management decision making process.


A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers (analysis and support). The integration layer is used to integrate data and to have a level of abstraction from users. The access layer is for getting data out for users.

1. Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
Definition as Per Ralph Kimball:  A data warehouse is a copy of transaction data specifically structured for query and analysis.
His Approach towards towards the Data warehouse Design is Bottom-Up. In the bottom-up approach data marts are first created to provide reporting and analytical capabilities for specific business processes.
  2.Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Definition as Per Bill Inmon:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Data warehouse Features

The key features of Data Warehouse such as Subject Oriented, Integrated, Nonvolatile and Time-Variant are are discussed below:

·        Subject Oriented - The Data Warehouse is Subject Oriented because it provide us the information around a subject rather the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue etc

·        Integrated - Data Warehouse is constructed by integration of data from heterogeneous sources such as relational databases, flat files etc.

·        Time-Variant - The Data in Data Warehouse is identified with a particular time period.

·        Non Volatile - Non volatile means that the previous data is not removed when new data is added to it. 

Data Warehouse Applications

Data Warehouse helps the business executives in organize, analyse and use their data for decision making.

·        financial services

·        Banking Services

·        Consumer goods

·        Retail sectors.

·        Controlled manufacturing

Data Warehouse Types

Information processing, Analytical processing and Data Mining are the three types of data warehouse applications that are discussed below:

·        Information processing - Data Warehouse allow us to process the information stored in it.The information can be processed by means of querying, basic statistical analysis, reporting using crosstabs, tables, charts, or graphs.

·       Analytical Processing - Data Warehouse supports analytical processing of the information stored in it.The data can be analysed by means of basic OLAP operations,including slice-and-dice,drill down,drill up, and pivoting.

·        Data Mining - Data Mining supports knowledge discovery by finding the hidden patterns and associations, constructing analytical models, performing classification and prediction.

Data warehousing architecture:





1. Operational Data Store : is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.

2. ERP : Enterprise resource planning integrates internal and external management information across an entire organization, embracing finance /accounting,manufacturing, sales and service, etc.
          
3. CRM : Customer relationship management is a widely-implemented strategy for managing a company’s interactions with customers, clients and sales prospects.    
         
4. ETL (Extract,Transform, And load) :
is a process in database usage and especially in data warehousing that involves: Extract data from outside sources
Transforming it to fit operational needs (which can include quality levels)
Loading it into the end target (database or data warehouse).

5. Data Marts: A data mart (DM) is the access layer of the data warehouse(DW) environment that is used to get data out to the users. The DM is a subset of the DW, usually oriented to a specific business line or team.
          
6. OLAP : OLAP (Online Analytical Processing) is a methodology to provide end users with access to large amounts of data in an intuitive and rapid manner to assist with deductions based on investigative reasoning.

7. OLTP : Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.

8. Data Mining: Is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligent giving an informational advantage.

Difference between (OLAP and OLTP):

SN
Data Warehouse (OLAP)
Operational Database(OLTP)
1
This involves historical processing of information.
This involves day to day processing.
2
OLAP systems are used by knowledge workers such as executive, manager and analyst.
OLTP system are used by clerk, DBA, or database professionals.
3
This is used to analysis the business.
This is used to run the business.
4
It focuses on Information out.
It focuses on Data in.
5
This is based on Star Schema, Snowflake Schema and Fact Constellation Schema.
This is based on Entity Relationship Model.
6
It focuses on Information out.
This is application oriented.
7
This contains historical data.
This contains current data.
8
This provides summarized and consolidated data.
This provide primitive and highly detailed data.
9
This provide summarized and multidimensional view of data.
This provides detailed and flat relational view of data.
10
The number of users are in Hundreds.
The number of users are in thousands.
11
The number of records accessed are in millions.
The number of records accessed are in tens.
12
 Data base size is 100 GB to TB Data base size GB

Data Warehouse Schemas:

A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways.

Star Schema:

The star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables

The facts that the data warehouse helps analyze are classified along different dimensions:
The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. There may be multiple fact tables in a star schema.

Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Often this is a separate table for each dimension. Dimension tables can be joined to the fact table(s) as needed.

Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make up a compound primary key consisting of a combination of relevant dimension keys.

Advantages :

Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.

Snow Flake Schema: 

The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized with each dimension represented by a single table.

          Snowflake schemas are often better with more sophisticated query tools that isolate users from the raw table structures and for environments having numerous queries with complex criteria.

Advantages :

Some OLAP multidimensional database modeling tools that use dimensional data marts as data sources are optimized for snowflake schemas.

A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.

Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. Dimensional modeling always uses the concepts of facts (measures), and dimensions (context).
Facts : Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.
Types of Facts :
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Types of Fact Tables

There are two types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Dimension : A dimension is a data elements that categorizes each item in a data set into non-overlapping regions. A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures.
Types of Dimension :

Conformed dimensions :

These dimensions are somethings that is built once in your model and can bne reused multiple times with different fact tables.

Example:

Consider a model containing multiple fact tables, representing different data marts. now look for a dimension that is common to these fact tables. In this example let's consider that the product dimesion is common and hence can be reused by creating short cuts and joining the different fact tables. some more examples like time dimension, customer dimension and product dimension.

Snapshot:


Junk dimension :
A junk dimension is grouping of low cardinality flags and indicators. This junk dimension helps in avoiding cluttered design of data warehouse. Provides an easy way to access the dimensions from a single point of entry and improves the performance of sql queries.

Example: For example, assume that there are two dimension tables (gender and marital status). The data of these two tables are shown below:

Code:
Table: Gender
Id Gender_status
----------------
1  Male
2  Females

Table: Marital Status
Id Marital_Status
----------------
1  Single
2  Married
Here both the dimensions have low cardinality flags. This will cause maintenance of two tables and decrease the performance of SQL queries.

We can combine these two dimensions into a single table by cross joining and can maintain a single dimension table. The result of cross join is shown below:

Code:
id gender mrg_status
--------------------
1  Male, Single
Male      Married
Female   Single
4  Female   Married
This new dimension table is called a junk dimension. This will improve the manageability and improves the sql queries performance.
Degenerate dimension :

A degenerate table does not have its own dimension table. It is derived from a fact table. The column (dimension) which is a part of fact table but does not map to any dimension.
E.g. employee_id
Dimensional modeling structure:
The dimensional model is built on a star-like schema, with dimensions surrounding the fact table. To build the schema, the following design model is used:
Choose the business process
Declare the Grain
Identify the dimensions
Identify the Fact
Benefits of dimensional modeling :
Benefits of the dimensional modeling are following:
Understandability
Query performance
Extensibility

Data modelling::


Data modeling is the process of creating a data model by applying formal data model descriptions using data modeling techniques.
In other words Data modeling can be defined as a method used to define and analyze data requirements needed to support the business processes of an organization.

Conceptual, logical and physical schemes :

Conceptual Schema: This consists of entity classes, representing kinds of things of significance in the domain, and relationships assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model.

Logical Schema:  This consists of descriptions of tables and columns, object oriented classes, and XML tags, among other things.

Physical Schema: describes the physical means by which data are stored. This is concerned with partitions, CPUs, table spaces, and the like.

Modeling methodologies :
Data models represent information areas of interest. While there are many ways to create data models, But only two modeling methodologies standard  top-down and bottom-up are used in real time environment :

Bottom-up models : are often the result of a reengineering effort. They usually start with existing data structures forms, fields on application screens, or reports. These models are usually physical, application-specific, and incomplete from an enterprise prospective. They may not promote data sharing, especially if they are built without reference to other parts of the organization.


Top-down logical data models, on the other hand, are created in an abstract way by getting information from people who know the subject area. A system may not implement all the entities in a logical model, but the model serves as a reference point or template.

Difference between  Datamart and Data warehouse:::


Datamart
Data Warehouse
1.Data mart is usually sponsored at the department level and developed with a specific issue or subject in mind, a data mart is a data warehouse with a focused objective.
1.Data warehouse is a “Subject-Oriented, Integrated, Time-Variant, Nonvolatile collection of data in support of decision making”.
2.A data mart is used on a business division/ department level.
2.A data warehouse is used on an enterprise level.
3.A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups.
3.A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
4.By providing decision makers with only   a subset of data from the Data Warehouse, Privacy, Performance and Clarity Objectives can be attained.
4.The main objective of Data Warehouse is to provide an integrated environment and coherent picture of the business at a point in


Staging area:

staging area is a place where you hold temporary tables on data warehouse server.
Staging tables are connected to work area or fact tables. we basically need staging area to hold the data, and perform data cleansing and merging, before loading the data into warehouse.

Data cleansing:

Weeding out unnecessary or unwanted things (characters and spaces etc) from incoming data to make it more meaningful and informative.


Data cleansing is also known as data scrubbing.

Data merging:

Data can be gathered from heterogeneous systems and put together.

Data scrubbing

Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to end user.