
Experience Using Google BigQuery



Ingest files from a data landing pad to a Google Cloud Storage bucket using gsutil

$ gsutil cp file.txt gs://bucket/folder    # copy a single file to a GCS bucket
$ gsutil -m cp file* gs://bucket/folder    # copy all matching files in parallel
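
After a copy, you can sanity-check what landed in the bucket. A minimal check, assuming the same paths as above:

$ gsutil ls -l gs://bucket/folder    # list objects with size and upload time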

Load into BigQuery from a GCS bucket

# Load each complete line into a one-column table by using a field delimiter
# that never occurs in the data
$ bq load --field_delimiter=$(echo -en "\x01") --skip_leading_rows=1 --max_bad_records=1 \
    --noautodetect --source_format=CSV dataset.tableNAME gs://bucket/folder line:string

# Load a file with field delimiter ~ into a table
$ bq load -F "~" --skip_leading_rows=1 --allow_jagged_rows \
    --source_format=CSV dataset.tableNM \
    gs://bucket/folder col1:string,col2:date
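
Once whole lines are loaded into a one-column table, the fields can be split out in SQL. A hypothetical sketch, assuming ~-delimited lines were loaded into the single-column table above:

$ bq query --use_legacy_sql=false \
    'SELECT SPLIT(line, "~")[OFFSET(0)] AS col1,
            SPLIT(line, "~")[OFFSET(1)] AS col2
     FROM dataset.tableNAME'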

Uncompress files in Cloud Storage

# stream the compressed object through zcat and write the decompressed copy back
$ gsutil cat gs://bucket/file.txt.gz | zcat | gsutil cp - gs://bucket/file.txt

# -Z is a cp suboption that gzip-compresses local uploads (Content-Encoding: gzip);
# gsutil then decompresses such objects automatically when they are downloaded
$ gsutil cp -Z file.txt gs://bucket/file.txt
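
To confirm whether an object is stored compressed, inspect its metadata. A minimal check against the object above:

$ gsutil stat gs://bucket/file.txt.gz    # shows Content-Type, Content-Encoding, size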

Join files (up to 32 per operation) into a single file using compose

$ gsutil compose gs://bucket/file.* gs://bucket/file.txt

# Unix command to split a large file (split parts get the suffixes _aa, _ab, ...)
$ split -b 10G file.txt file.txt_
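
compose accepts at most 32 source objects per call, but larger sets can be joined in stages. A hypothetical sketch, assuming 52 split parts with suffixes _aa through _bz:

$ gsutil compose gs://bucket/file.txt_a* gs://bucket/part1    # parts _aa.._az
$ gsutil compose gs://bucket/file.txt_b* gs://bucket/part2    # parts _ba.._bz
$ gsutil compose gs://bucket/part1 gs://bucket/part2 gs://bucket/file.txt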


Show and update dataset metadata

# dump dataset metadata (including the access list) to a JSON file
$ bq show --format=prettyjson project:dataset > update.json

# edit update.json (e.g., replace a specialGroup entry with a userByEmail entry), then apply it
$ bq update --source update.json project:dataset
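
For reference, the access entries in the dumped JSON look roughly like this (a hypothetical excerpt with an example email):

$ cat update.json
{
  ...
  "access": [
    { "role": "READER", "specialGroup": "projectReaders" },
    { "role": "READER", "userByEmail": "analyst@example.com" }
  ]
}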

Change or authorize a project with gcloud

$ gcloud init
Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings
 [2] Create a new configuration
Please enter your numeric choice: 1
Do you have a network proxy you would like to set in gcloud (Y/n)? n
Would you like to continue anyway (y/N)? y
Choose the account you would like to use to perform operations for this configuration:
 [1] <existing email, if any>
 [2] Log in with a new account
Please enter your numeric choice: 2
<authenticate the account in the Google sign-in UI>
You are logged in as [email_id]
This account has a lot of projects! Listing them all can take a while.
 [1] Enter a project ID
 [2] Create a new project
 [3] List projects
Please enter your numeric choice: 1
Enter an existing project ID you would like to use: <project_id>
Your current project has been set to: [<project_id>]
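
If the account is already authenticated, switching projects does not need the full init dialog. A minimal alternative, assuming the same <project_id>:

$ gcloud auth login                        # only if not yet authenticated
$ gcloud config set project <project_id>   # switch the active project
$ gcloud config list                       # verify active account and project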

