
Data Warehouse

  • A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data that is used primarily in organizational decision making.
              • Bill Inmon, Building the Data Warehouse, 1996

  • A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
              • Forrester Research, April 1996
  • Data Warehouse Architecture
 
 
  • OLTP vs Data Warehouse

        OLTP                                Data Warehouse
        --------------------------------    ----------------------------------
        Application oriented                Subject oriented
        Used to run the business            Used to analyze the business
        Detailed data                       Summarized and refined data
        Current, up-to-date data            Snapshot data
        Isolated data                       Integrated data
        Repetitive access                   Ad hoc access
        User specific                       Business users
        Performance sensitive               Performance relaxed
        Few records accessed at a time      Large volumes accessed at a time
        Read/update access                  Mostly read (batch update)
        No data redundancy (normalized)     Redundancy present (denormalized)


  • ETL (Extraction, Transformation, and Loading) is the process by which data is extracted from the operational systems, integrated and transformed, and loaded into the data warehouse environment.
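
The three ETL stages can be sketched in a few lines of Python. This is only an illustration under assumptions: the source rows, the `fact_orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational rows: (order_id, customer, amount_in_cents, order_date)
SOURCE_ROWS = [
    (1, " alice ", 1250, "2024-01-05"),
    (2, "BOB", 300, "2024-01-05"),
    (3, "Alice", 4000, "2024-01-06"),
]


def extract():
    """Extract: read rows from the operational source (a list stands in for it)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: standardize customer names and convert cents to currency units."""
    return [
        (order_id, customer.strip().title(), cents / 100, order_date)
        for order_id, customer, cents, order_date in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order."""
    load(conn, transform(extract()))


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    run_etl(warehouse)
    query = (
        "SELECT customer, SUM(amount) FROM fact_orders "
        "GROUP BY customer ORDER BY customer"
    )
    for row in warehouse.execute(query):
        print(row)
```

Because the transform step standardizes the customer name before loading, the two differently formatted "alice" rows end up under one customer in the warehouse query.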

Operational Data Challenges

  • Data from Heterogeneous sources
  • Format Differences
  • Data Variations 
    • Across locations, the same code can represent different customers.
    • Over time, a product code may have been reused.
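
One common way to handle colliding source codes is to give each (location, code) pair its own warehouse key. The sketch below assumes that approach; the `SurrogateKeyMap` class and the sample locations and codes are hypothetical, not something the source systems provide.

```python
from itertools import count


class SurrogateKeyMap:
    """Assign one warehouse key per (location, source code) pair."""

    def __init__(self):
        self._keys = {}
        self._next_key = count(1)

    def key_for(self, location, customer_code):
        # The natural key includes the location, so the same code used at
        # two locations maps to two different warehouse keys.
        natural_key = (location, customer_code)
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next_key)
        return self._keys[natural_key]


if __name__ == "__main__":
    keys = SurrogateKeyMap()
    print(keys.key_for("NYC", "C100"))
    print(keys.key_for("LON", "C100"))
    print(keys.key_for("NYC", "C100"))
```

A reused product code could be handled the same way by adding a validity period to the natural key, though that is not shown here.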

Data Transformation

  • Conversion of data - data type changes; standardization to common units (currency, measurements)
  • Classification - changing continuous values to discrete ranges (temperature to temperature ranges)
  • Splitting of fields
  • Merging of fields
  • Aggregations
  • Derivations (percentages, ratios, indicators)
  • Four classes - structure, format, conversions, classifications
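
Each transformation type listed above can be illustrated with a small Python function. The function names, the temperature thresholds, and the sample fields are assumptions chosen for the example, not part of any particular ETL tool.

```python
from collections import defaultdict


def fahrenheit_to_celsius(temp_f):
    """Conversion: standardize a measurement to a common unit."""
    return (temp_f - 32) * 5 / 9


def temperature_band(temp_c):
    """Classification: map a continuous value to a discrete range.

    The 10 and 25 degree cut-offs are arbitrary example thresholds.
    """
    if temp_c < 10:
        return "cold"
    if temp_c < 25:
        return "mild"
    return "hot"


def split_name(full_name):
    """Splitting: break one field into two at the first space."""
    first, _, last = full_name.partition(" ")
    return first, last


def merge_address(city, country):
    """Merging: combine several fields into one."""
    return f"{city}, {country}"


def total_by_key(rows):
    """Aggregation: sum amounts per key from (key, amount) pairs."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


def percent_of_total(part, total):
    """Derivation: compute a percentage, returning 0.0 when the total is zero."""
    return 100.0 * part / total if total else 0.0
```

For instance, `temperature_band(fahrenheit_to_celsius(212))` chains a conversion and a classification, turning a Fahrenheit reading into a discrete band.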

Guiding Principles

  • Single Version of Truth
  • Integration of Data
  • Non Redundant Data
  • Established Standards
  • Methodology specific to data warehousing
  • Business Oriented data model with metadata
  • Detailed atomic data
  • Uniform data meanings
  • Time-dimensioned data with years of history
  • Ad hoc access
  • Scaled for growth in data, users, and speed
  • Any question, any data, at any time

ETL Methodologies

  • Kimball / Star Schema - the approach preferred here; it takes longer to develop but uses less space.
  • Inmon / 3rd Normal Form - less preferred here; it keeps the same structure as the source and puts a lot of work on business analysts.

Kimball vs Inmon

  • Ralph Kimball's approach stresses the importance of data marts, which are repositories of data belonging to particular lines of business. The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. The Kimball data warehouse uses a "bottom-up" approach.
  • Bill Inmon regards the data warehouse as the centralized repository for all enterprise data. In this approach, an organization first creates a normalized data warehouse model. Dimensional data marts are then created from it by subject area. This is a "top-down" approach.
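
To make the dimensional model concrete, here is a minimal sketch of a star schema: one fact table surrounded by dimension tables, queried with a single join per dimension. The table names, columns, and sample rows are hypothetical, and an in-memory SQLite database is used only for convenience.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: two dimension tables and one fact table."""
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
        );
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER
        );
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            quantity INTEGER,
            revenue REAL
        );
        INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
        INSERT INTO dim_product VALUES (2, 'Manual', 'Books');
        INSERT INTO dim_date VALUES (20240101, 2024, 1);
        INSERT INTO dim_date VALUES (20240201, 2024, 2);
        INSERT INTO fact_sales VALUES (1, 20240101, 3, 30.0);
        INSERT INTO fact_sales VALUES (1, 20240201, 1, 10.0);
        INSERT INTO fact_sales VALUES (2, 20240101, 2, 8.0);
        """
    )


def revenue_by_category(conn):
    """Join the fact table to one dimension and aggregate a measure."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    print(revenue_by_category(conn))
```

In an Inmon-style design, tables like these would typically be data marts derived from a normalized enterprise model rather than the warehouse itself.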

 

 
