

Distcp

Command: hadoop distcp -libjars $jar1 -files $file1 -Dgoogle.cloud.auth.service.account.json.keyfile=SERVICE_ACCOUNT.json -Dfs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS -m 1 hdfs://${hadoop_loc} gs://${gcs_loc}
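As a minimal sketch, the same copy can also be driven from Python with subprocess, which is handy inside a scheduler task. All paths and locations below are hypothetical, and the -libjars/-files arguments are omitted for brevity:

import subprocess

def run_distcp(hadoop_loc: str, gcs_loc: str, keyfile: str = "SERVICE_ACCOUNT.json") -> None:
    # Build the same DistCp invocation as the command above (minus -libjars/-files).
    cmd = [
        "hadoop", "distcp",
        f"-Dgoogle.cloud.auth.service.account.json.keyfile={keyfile}",
        "-Dfs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "-m", "1",  # one map task, as in the command above
        f"hdfs://{hadoop_loc}",
        f"gs://{gcs_loc}",
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if DistCp fails

run_distcp("namenode:8020/data/source", "my-bucket/landing")  # hypothetical locations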

Dataproc

Accessing PySpark in Dataproc

Property for GCS access:

-Dfs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

Gcloud dataproc submit (an equivalent Airflow operator call is sketched at the end of this post):

gcloud dataproc jobs submit pyspark --project <project_id> --cluster=<cluster-name> --region us-east4 --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar ${local_py_file} -- "${final_gcp_loc}" param1

Autoscaling policy

auto_scaling_policy = f"projects/{PROJECT_ID}/regions/us-east4/autoscalingPolicies/{dev}"

Sample autoscaling policy

Name:
Region:
Primary worker configuration
    Min Instances 2
    Max Instances 2
    Weight 1
Secondary worker configuration
    Min Instances 0
    Max Instances 10
    Weight 1
Cooldown duration 2 minutes
Graceful decommission timeout 0 seconds
Scale up factor 0.5
Scale up min worker fraction 0
Scale down factor 0.25
Scale down min worker fraction 0

Code Snippet for Cluster Configuration in Dataproc

from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocCreateClusterOperator
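A minimal sketch of how ClusterGenerator and DataprocCreateClusterOperator fit together. The project ID, cluster name, machine types, and DAG ID are assumptions for illustration; the autoscaling policy URI mirrors the auto_scaling_policy string above:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

PROJECT_ID = "my-project"  # hypothetical project
REGION = "us-east4"
# Same shape as the auto_scaling_policy string above; "dev" is the policy name.
AUTOSCALING_POLICY = f"projects/{PROJECT_ID}/regions/{REGION}/autoscalingPolicies/dev"

# ClusterGenerator.make() returns the cluster config mapping that
# DataprocCreateClusterOperator expects in cluster_config.
CLUSTER_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    master_machine_type="n1-standard-4",   # assumed machine types
    worker_machine_type="n1-standard-4",
    num_workers=2,                         # matches the primary worker min/max of 2
    autoscaling_policy=AUTOSCALING_POLICY,
).make()

with DAG("dataproc_cluster_demo", start_date=datetime(2022, 5, 1), schedule_interval=None) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name="demo-cluster",       # hypothetical cluster name
        cluster_config=CLUSTER_CONFIG,
    )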
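The gcloud submit command above can also be expressed as an Airflow task. A minimal sketch using DataprocSubmitJobOperator, where the project, bucket, and file URIs are hypothetical stand-ins for ${local_py_file} and ${final_gcp_loc}:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},      # hypothetical project
    "placement": {"cluster_name": "demo-cluster"},  # hypothetical cluster
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/job.py",  # stand-in for ${local_py_file}
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
        "args": ["gs://my-bucket/output/", "param1"],  # stand-ins for "${final_gcp_loc}" param1
    },
}

submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark",
    project_id="my-project",
    region="us-east4",
    job=PYSPARK_JOB,
)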