Accessing PySpark in Dataproc
The GCS connector is preinstalled on Dataproc clusters; when running Spark against GCS elsewhere, the connector's AbstractFileSystem implementation is registered with a Hadoop property such as:
-Dfs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
gcloud dataproc jobs submit:
gcloud dataproc jobs submit pyspark ${local_py_file} \
    --project=<project_id> \
    --cluster=<cluster-name> \
    --region=us-east4 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- "${final_gcp_loc}" param1
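For reference, a minimal sketch of what the submitted PySpark script could look like; the script, table placeholder, and column name are assumptions and not part of the original job. It reads the two arguments passed after "--" and uses the spark-bigquery connector jar supplied with --jars.

# sample_job.py -- hypothetical PySpark script for the submit command above
import sys
from pyspark.sql import SparkSession

def main():
    output_path = sys.argv[1]   # "${final_gcp_loc}" from the gcloud command
    param1 = sys.argv[2]        # "param1" from the gcloud command

    spark = (
        SparkSession.builder
        .appName("sample-bigquery-job")
        # Only needed off-Dataproc; Dataproc images already register the GCS connector
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        .getOrCreate()
    )

    # Read a BigQuery table through the spark-bigquery connector
    df = spark.read.format("bigquery").option("table", "<project.dataset.table>").load()

    # Hypothetical filter on an assumed column, then write the result to GCS
    df.filter(df["some_column"] == param1).write.mode("overwrite").parquet(output_path)

if __name__ == "__main__":
    main()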
Autoscaling policy
The policy is attached to the cluster by its full resource name:
autoscaling_policy = f"projects/{PROJECT_ID}/regions/us-east4/autoscalingPolicies/{dev}"
Sample autoscaling policy (a client-library sketch mirroring these values follows below)
Name:
Region:
Primary worker configuration
  Min instances: 2
  Max instances: 2
  Weight: 1
Secondary worker configuration
  Min instances: 0
  Max instances: 10
  Weight: 1
Cooldown duration: 2 minutes
Graceful decommission timeout: 0 seconds
Scale-up factor: 0.5
Scale-up min worker fraction: 0
Scale-down factor: 0.25
Scale-down min worker fraction: 0
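The same policy can also be created programmatically. Below is a sketch using the google-cloud-dataproc client library; the policy id placeholder and regional endpoint are assumptions, so verify the field names against the installed library version.

from datetime import timedelta
from google.cloud import dataproc_v1

# Autoscaling policies are regional resources, so use the regional endpoint
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": "us-east4-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="<policy_name>",  # the value formatted into autoscaling_policy above
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=2, weight=1
    ),
    secondary_worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=0, max_instances=10, weight=1
    ),
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        cooldown_period=timedelta(minutes=2),
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            graceful_decommission_timeout=timedelta(seconds=0),
            scale_up_factor=0.5,
            scale_up_min_worker_fraction=0.0,
            scale_down_factor=0.25,
            scale_down_min_worker_fraction=0.0,
        ),
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/{PROJECT_ID}/regions/us-east4", policy=policy
)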
Code Snippet for Cluster Configuration in Dataproc
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_generator = ClusterGenerator(
    project_id=<project_id>,
    zone="us-east4-a",
    region="us-east4",
    image_version="2.0.27-centos8",
    master_machine_type="n2-standard-4",
    worker_machine_type="n2-standard-8",
    num_workers=2,
    num_preemptible_workers=0,
    internal_ip_only=True,
    storage_bucket=config_bucket,
    service_account=<service_account>,
    service_account_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    tags=<tags>,
    subnetwork_uri=<subnetwork_uri>,
    autoscaling_policy=autoscaling_policy,
    idle_delete_ttl=3600,        # delete the cluster after 1 hour of inactivity
    auto_delete_ttl=3600 * 2,    # delete the cluster 2 hours after creation
    enable_component_gateway=True,
    optional_components=["HIVE_WEBHCAT", "JUPYTER"],
    metadata={"PIP_PACKAGES": "pyyaml requests pandas openpyxl"},
).make()
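Note: the PIP_PACKAGES metadata key is only acted on when the pip-install initialization action is attached to the cluster. Assuming the standard public init-actions bucket for the region, that would be an extra argument to the ClusterGenerator call above:

    init_actions_uris=["gs://goog-dataproc-initialization-actions-us-east4/python/pip-install.sh"],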
Task in Composer for creation:
create_dataproc = DataprocCreateClusterOperator(
    task_id="create_dataproc",
    impersonation_chain=sa,
    cluster_name=<cluster_name>,
    region=region,
    cluster_config=cluster_generator,
    labels={
        "created-by": <user>,
    },
)
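The PySpark job submitted earlier with gcloud can also be submitted from Composer once the cluster exists. A sketch using DataprocSubmitJobOperator follows; the job file URI and argument values are assumptions mirroring the gcloud command above.

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

pyspark_job = {
    "reference": {"project_id": <project_id>},
    "placement": {"cluster_name": <cluster_name>},
    "pyspark_job": {
        "main_python_file_uri": "gs://<bucket>/path/to/job.py",  # the uploaded ${local_py_file}
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
        "args": [final_gcp_loc, "param1"],
    },
}

submit_pyspark = DataprocSubmitJobOperator(
    task_id="submit_pyspark",
    job=pyspark_job,
    region=region,
    project_id=<project_id>,
    impersonation_chain=sa,
)

create_dataproc >> submit_pyspark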