Batch Accounting Visualisation
Hadoop and Spark
Get started
Monitoring and Debugging
Kerberos
- When using Hadoop/YARN you need a Kerberos TGT. If the job is to be launched from a script, you can authenticate non-interactively with a keytab file, like this:
- Create a keytab for your account:
cern-get-keytab --keytab sparktest.keytab --login malandes --user
- Then get the Kerberos TGT with:
kinit -kt /path/to/sparktest.keytab malandes
- Then include the following lines in the script:
export KRB5CCNAME=FILE:$XDG_RUNTIME_DIR/krb5cc
kinit -kt /afs/cern.ch/user/m/malandes/spark/sparktest.keytab malandes
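The steps above can be combined into a small helper that aborts early if authentication failed. This is a sketch using the keytab path and username from the example above; the `klist -s` verification step is an addition, not part of the original recipe:

```shell
#!/bin/bash
# Use a private, per-session credential cache so that concurrent jobs
# do not clobber each other's tickets.
export KRB5CCNAME=FILE:$XDG_RUNTIME_DIR/krb5cc

# Obtain the TGT non-interactively from the keytab created with
# cern-get-keytab (path and user as in the example above).
kinit -kt /afs/cern.ch/user/m/malandes/spark/sparktest.keytab malandes

# Verify that a valid (non-expired) ticket is present before launching
# any Hadoop/Spark command; klist -s prints nothing and only sets the
# exit code.
if ! klist -s; then
    echo "No valid Kerberos ticket, aborting" >&2
    exit 1
fi
```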
Running on the Hadoop cluster
- From the client node, launch the Spark job with a script like the following:
#!/bin/bash

# Set up the Spark/Python environment from the LCG release on CVMFS
source /cvmfs/sft.cern.ch/lcg/views/LCG_105a_swan/x86_64-centos7-gcc11-opt/setup.sh
# Point the Hadoop client configuration at the analytix cluster
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.3 spark3

# Obtain a Kerberos TGT non-interactively from the keytab
export KRB5CCNAME=FILE:$XDG_RUNTIME_DIR/krb5cc
kinit -kt /afs/cern.ch/user/m/malandes/spark/sparktest.keytab malandes

spark_cmd="spark-submit \
--master yarn \
--deploy-mode client \
--keytab /afs/cern.ch/user/m/malandes/spark/sparktest.keytab \
--principal malandes@CERN.CH \
--packages org.apache.hadoop:hadoop-aws:3.3.2 \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
--conf spark.yarn.appMasterEnv.PYTHONPATH=$PYTHONPATH \
--conf spark.executor.memory=16g \
--conf spark.executor.instances=24 \
--conf spark.executor.cores=4 \
--conf spark.driver.memory=32g \
--conf spark.ui.showConsoleProgress=false \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
script-name.py"
eval "$spark_cmd"
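In client deploy mode the driver log is printed on the console, while executor logs stay on the cluster. A sketch of how they can be retrieved with the standard YARN commands (the application ID below is a placeholder; use the one printed by spark-submit):

```shell
# List applications currently running on the cluster
yarn application -list -appStates RUNNING

# Fetch the aggregated logs of a finished application
# (replace the ID with the one reported by spark-submit)
yarn logs -applicationId application_1234567890123_0001 | less
```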
Connecting to HDFS in Power BI
- Example of a WebHDFS URL to load a file stored in HDFS:
https://ithdp6013.cern.ch:50070/webhdfs/v1/user/malandes/hpc-2023-test.csv/part-00000-812eef77-4d73-46c6-8993-ec82c70584b9-c000.csv
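Before configuring the source in Power BI, the same WebHDFS endpoint can be checked from the command line with a Kerberos-authenticated curl. This is a sketch that assumes curl was built with SPNEGO/GSS-API support and that a valid TGT is already in the credential cache:

```shell
# ?op=OPEN reads the file contents via the WebHDFS REST API.
# --negotiate -u : enables SPNEGO (Kerberos) authentication and
# -L follows the redirect to the datanode that serves the data.
curl -L --negotiate -u : \
  "https://ithdp6013.cern.ch:50070/webhdfs/v1/user/malandes/hpc-2023-test.csv/part-00000-812eef77-4d73-46c6-8993-ec82c70584b9-c000.csv?op=OPEN"
```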
--
MariaALANDESPRADILLO - 2024-02-15