Skip to main content

Data Warehouse

  •  A Data Warehouse is a subject-oriented integrated time-varying non-volatile COLLECTION OF DATA THAT is used PRIMARILY IN ORGANIZATIONAL DECISION MAKING.
- Bill Inmon,Building the Data Warehouse 1996

  • A process of transforming data into information and making it available to users in a timely enough manner to make a difference
              • Forrester Research April 1996
  • Data Warehouse Architecture
 
 
  • OLTP vs Data Warehouse
 

                    OLTP

            Data Warehouse

  • Application Oriented

  • Used to Run Business

  • Detailed Data

  • Current Up to Date

  • Isolated Data

  • Repetitive Access

  • User Specific

  • Performance Sensitive

  • Less Records accessed at a time

  • Read/Update Access

  • No Data Redundancy (Normalization)


  • Subject Oriented

  • Used to Analyze Business

  • Summarized and Refined

  • Snapshot Data

  • Integrated Data

  • Ad-hoc Access

  • Business User

  • Performance Relaxed

  • Large Volumes at a time

  • Mostly Read(Batch Update)

  • Redundancy Present (De normalized)



  • ETL (Extraction ,Transformation and Loading) is a process by which data is integrated and transformed from the operational systems into the Data Warehouse environment

Operational Data Challenges

  • Data from Heterogeneous sources
  • Format Differences
  • Data Variations 
    • Across Locations same code represents different customers
    • Across period of time a product code have been re-used.

Data Transformation

  • Conversions of Data - Data Type change / Standardized to common units (currency/measurements)
  • Classification -Changing continuous values to discrete ranges (temperature to temperature ranges)
  • Splitting of Fields
  • Merging
  • Aggregations 
  • Derivations(Percentages,Ratios,Indicators)
  • Four Classes - Structure ,Format,Conversions,Classifications

Guiding Principles

  • Single Version of Truth
  • Integration of Data
  • Non Redundant Data
  • Established Standards
  • Methodology specific to data warehousing
  • Business Oriented data model with metadata
  • Detailed atomic data
  • Uniform data meanings
  • Time-dimensioned data with years of history
  • Adhoc Access
  • Scaled for growth of data,users and speed
  • Any Questions ,any data, at any time

ETL Methodologies

  • Kimball /Star Schema - the Right way to do it , takes longer to develop , use less space
  • Inmon / 3rd Normal Form - The wrong way ,keep the same structure as source , puts lot of work on Business Analysts.

Kimball vs Inmon

  • Ralph Kimball approch stressed the importance of data marts , which are repositories of data belonging to particular lines of Business .The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. the Kimball data warehouse uses a "bottom-up" approch.
  • Bill Inmon regarded the data warehouse as the centralized repository for all enterprise data. In this approch , an organization first creates a normalized data warehouse model.Dimensional data marts are then created based on the subjects , it uses "top-down" approch.

 

 

Comments

Popular posts from this blog

LookML

  What Is LookML? LookML is a language for describing dimensions, aggregates, calculations, and data relationships in a SQL database. Looker uses a model written in LookML to construct SQL queries against a particular database. LookML Projects A LookML Project is a collection of model, view, and dashboard files that are typically version controlled together via a Git repository. The model files contain information about which tables to use, and how they should be joined together. The view files contain information about how to calculate information about each table (or across multiple tables if the joins permit them). LookML separates structure from content, so the query structure (how tables are joined) is independent of the query content (the columns to access, derived fields, aggregate functions to compute, and filtering expressions to apply). LookML separates content of queries from structure of queries SQL Queries Generated by Looker For data analysts, LookML fosters DRY style (“d

CSV to HTML Converter Shell Script

#Generic Converter from CSV to HTML #!/bin/bash usage () { cat <<EOF Usage:$(basename $0)[OPTIONS] input_file > ouptut.html Explicit Delimiter can be specified if not then it default to comma as delimiter Options: -d specify delimiter , instead of comma --head specified then treats first line as column header , <thead> and <th> tags --foot last line , <tfoot> and <th> tags Samples: 1.$(basename $0) input.csv Parse 'input.csv' with comma as delimiter and send HTML table to STDOUT 2. $(basename $0) -d '|' < input.csv > out.html Parse 'input.csv' with PIPE as delimiter and send HTML table to out.html 3. $(basename $0) -d '\t' --head --foot < input.tsv > out.html Parse 'input.tsv' , tab as delimiter process first and last lines as header and footer write to out.html

A Deep Dive Into Google BigQuery Architecture

Introduction Google’s BigQuery is an enterprise-grade cloud-native data warehouse. BigQuery was first launched as a service in 2010 with general availability in November 2011. Since inception, BigQuery has evolved into a more economical and fully-managed data warehouse which can run blazing fast interactive and ad-hoc queries on datasets of petabyte-scale. In addition, BigQuery now integrates with a variety of Google Cloud Platform (GCP) services and third-party tools which makes it more useful. BigQuery is serverless, or more precisely data warehouse as a service. There are no servers to manage or database software to install. BigQuery service manages underlying software as well as infrastructure including scalability and high-availability. The pricing model is quite simple - for every 1 TB of data processed you pay $5. BigQuery exposes simple client interface which enables users to run interactive queries. Overall, you don’t need to know much about underlying BigQuery architecture or