
Dataproc

Accessing PySpark in Dataproc

To read gs:// paths, Hadoop must know the GCS connector's AbstractFileSystem implementation. On Dataproc images the connector ships preinstalled, so this property is normally already set; pass it explicitly only when it is missing:

-Dfs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
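
The same property can also be set per session from PySpark itself (Spark forwards anything under the spark.hadoop. prefix to the Hadoop configuration); a minimal sketch where the bucket and path are placeholders:

from pyspark.sql import SparkSession

# Register the GCS connector filesystem (usually already configured on Dataproc)
spark = (
    SparkSession.builder.appName("gcs-access")
    .config(
        "spark.hadoop.fs.AbstractFileSystem.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    )
    .getOrCreate()
)

df = spark.read.csv("gs://<bucket>/<path>/input.csv", header=True)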

gcloud dataproc jobs submit:

gcloud dataproc jobs submit pyspark ${local_py_file} --project=<project_id> --cluster=<cluster-name> --region=us-east4 --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar -- "${final_gcp_loc}" param1
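
Since the job ships the spark-bigquery connector jar via --jars, the submitted script can read BigQuery tables directly; a minimal sketch (the table name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# format("bigquery") is provided by the connector jar passed with --jars
df = (
    spark.read.format("bigquery")
    .option("table", "<project>.<dataset>.<table>")
    .load()
)
df.show()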


Autoscaling policy

autoscaling_policy = f"projects/{PROJECT_ID}/regions/us-east4/autoscalingPolicies/{dev}"


Sample autoscaling policy

Name:

Region:

Primary worker configuration
    Min instances: 2
    Max instances: 2
    Weight: 1

Secondary worker configuration
    Min instances: 0
    Max instances: 10
    Weight: 1

Cooldown duration: 2 minutes

Graceful decommission timeout: 0 seconds

Scale up factor: 0.5

Scale up min worker fraction: 0

Scale down factor: 0.25

Scale down min worker fraction: 0
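
The same settings can be created as a policy programmatically with the google-cloud-dataproc client; a minimal sketch, assuming the policy id is dev (to match the resource name built above) and a placeholder project id:

from google.cloud import dataproc_v1

PROJECT_ID = "<project_id>"  # placeholder
region = "us-east4"

# Autoscaling policies are regional resources, so use the regional endpoint
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "dev",  # assumed policy name
    "worker_config": {"min_instances": 2, "max_instances": 2, "weight": 1},
    "secondary_worker_config": {"min_instances": 0, "max_instances": 10, "weight": 1},
    "basic_algorithm": {
        "cooldown_period": {"seconds": 120},  # 2 minutes
        "yarn_config": {
            "scale_up_factor": 0.5,
            "scale_down_factor": 0.25,
            "scale_up_min_worker_fraction": 0.0,
            "scale_down_min_worker_fraction": 0.0,
            "graceful_decommission_timeout": {"seconds": 0},
        },
    },
}

client.create_autoscaling_policy(
    parent=f"projects/{PROJECT_ID}/regions/{region}", policy=policy
)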


Code Snippet for Cluster Configuration in Dataproc

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_generator = ClusterGenerator(
    project_id=<project_id>,
    zone="us-east4-a",
    # region for the cluster is supplied on the create operator below
    image_version="2.0.27-centos8",
    master_machine_type="n2-standard-4",
    worker_machine_type="n2-standard-8",
    num_workers=2,
    num_preemptible_workers=0,
    internal_ip_only=True,
    storage_bucket=config_bucket,
    service_account=<service_account>,
    service_account_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    tags=<tags>,
    subnetwork_uri=<subnetwork_uri>,
    autoscaling_policy=autoscaling_policy,
    idle_delete_ttl=3600,        # delete the cluster after 1 hour idle
    auto_delete_ttl=3600 * 2,    # delete the cluster after 2 hours regardless
    enable_component_gateway=True,
    optional_components=["HIVE_WEBHCAT", "JUPYTER"],
    # PIP_PACKAGES is read by the pip-install initialization action
    metadata={"PIP_PACKAGES": "pyyaml requests pandas openpyxl"},
).make()


Task in Composer for creation:

create_dataproc = DataprocCreateClusterOperator(
    task_id="create_dataproc",
    impersonation_chain=sa,
    cluster_name=<cluster_name>,
    region=region,
    cluster_config=cluster_generator,
    labels={
        "created-by": <user>,
    },
)
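
The gcloud submit from earlier can also run as a task in the same DAG; a minimal DataprocSubmitJobOperator sketch (the file URI and args are placeholders mirroring that command):

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark",
    region=region,
    project_id="<project_id>",
    job={
        "reference": {"project_id": "<project_id>"},
        "placement": {"cluster_name": "<cluster_name>"},
        "pyspark_job": {
            "main_python_file_uri": "gs://<bucket>/<path>/job.py",
            "args": ["<final_gcp_loc>", "param1"],
            "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
        },
    },
)

create_dataproc >> submit_pyspark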

