Spark

September 21, 2024

NOT PORTED YET

Spark-on-Kubernetes UI Tunnel

The quick way, using DevOps-Bash-tools (prompts you to choose if more than one Spark job is running):

kubectl_port_forward_spark.sh  # <namespace>

Manually:

Set the Kubernetes namespace where the Spark job is running:

NAMESPACE=prod

On the bastion itself, find the Spark driver (master) pod, which hosts the UI:

SPARK_DRIVER_POD="$(
  kubectl get pods -n "$NAMESPACE" \
                   -l spark-role=driver \
                   --field-selector=status.phase=Running \
                   -o name |
  tee /dev/stderr
)"

You should see a pod name output, which is also saved to $SPARK_DRIVER_POD for future commands.

If more than one pod name is output, you need to pick one explicitly, for example by running kubectl get pods -n "$NAMESPACE" and choosing the pod whose start time matches your job's start time.
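When several driver pods are running, one way to narrow the choice is by creation time. The following is a minimal sketch, not part of DevOps-Bash-tools: the function name newest_spark_driver_pod is hypothetical, and it assumes the most recently created driver pod is the one you want, which may not hold if jobs overlap.

```shell
# Hypothetical helper: print the newest running Spark driver pod in a namespace.
# Assumption: the most recently created driver pod belongs to your job.
newest_spark_driver_pod() {
  # --sort-by orders pods oldest first, so the last line is the newest pod
  kubectl get pods -n "$1" \
                   -l spark-role=driver \
                   --field-selector=status.phase=Running \
                   --sort-by=.metadata.creationTimestamp \
                   -o name |
  tail -n 1
}

# Usage:
#   SPARK_DRIVER_POD="$(newest_spark_driver_pod "$NAMESPACE")"
```

Check the result against your job's start time before forwarding, since the helper cannot tell which job a pod belongs to.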

Use kubectl to port-forward to that Spark driver pod's UI port:

kubectl port-forward --address 127.0.0.1 -n "$NAMESPACE" "$SPARK_DRIVER_POD" 4040:4040

Then open http://localhost:4040.
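To confirm the tunnel works without opening a browser, one option is to query Spark's monitoring REST API, which is served on the same port as the UI. This is a rough sketch under assumptions: the function name spark_ui_reachable is hypothetical, curl must be installed, and the port-forward is expected to be running in another terminal.

```shell
# Hypothetical helper: succeed if the Spark REST API answers on the forwarded port.
# Assumes curl is installed; the port defaults to 4040 to match the port-forward.
spark_ui_reachable() {
  port="${1:-4040}"
  # -f fails on HTTP errors, -sS hides the progress bar but keeps error messages
  curl -fsS --max-time 5 "http://localhost:${port}/api/v1/applications" >/dev/null
}

# Usage, once the port-forward is running in another terminal:
#   spark_ui_reachable 4040 && echo "Spark UI is up"
```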

Troubleshooting

JStack Thread Dumps

If the Docker image in use (e.g. Informatica's) ships a JRE instead of a JDK, and copied-in versions of the JDK don't work, then another workaround is to tunnel to the Spark driver (master) UI and get the thread dumps from there:

ssh bastion -L 4040:localhost:4040

On the bastion:

kubectl port-forward --address 127.0.0.1 "$(kubectl get pods -l spark-role=driver -o name | head -n1)" 4040
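With the tunnel up, thread dumps are available from the Executors tab of the UI. They can likely also be fetched from the monitoring REST API. The sketch below only builds the URL: the function name spark_thread_dump_url is hypothetical, and it assumes your Spark version exposes the executors/<id>/threads endpoint and that the UI is reachable on localhost:4040.

```shell
# Hypothetical helper: print the REST URL for an executor's thread dump.
# Assumes the UI is tunnelled to localhost:4040 and that the running Spark
# version exposes the /executors/<id>/threads endpoint.
spark_thread_dump_url() {
  app_id="$1"                 # application ID, as listed by /api/v1/applications
  executor_id="${2:-driver}"  # "driver" selects the driver's own JVM
  echo "http://localhost:4040/api/v1/applications/${app_id}/executors/${executor_id}/threads"
}

# Usage (the application ID below is a placeholder, not a real one):
#   curl -fsS "$(spark_thread_dump_url spark-application-1234567890)" > threads.json
```

The application ID can be read from the UI or from the /api/v1/applications listing.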