Batch Accounting Overview

Documentation

Configuration

Software Versions

| Component | OS | Software | Version | SLoC | Comments |
| Thooki | CC7 | Go | 1.0.9-1 | | |
| Thooki | EL9 | Go | | | |
| Spark | CC7 | Python 2.7.5 | 2.4-10 | | |
| Spark | EL9 | Python 3.9.18 | 2.7-11 | | |
| Rerun | CC7 | Go 1.13 | 0.0.1-7 | | |
| Rerun | EL9 | Go 1.19 | 1.0.0-9 | | |
| apel-ssm | CC7 | Python 2.7.5 | 3.1.1-1 | - | |
| apel-ssm | EL9 | Python 3.9.18 | - | - | |

Cronjobs

Cron field order: minute hour day(month) month day(week)

| Node | Job | Time | Bucket | Meaning |
| CC7 prod | condor_summaries | 0 3 * * * | s3://accountingdata and s3://accountingreports/batch/condor_thooki | Every day at 3 am |
| CC7 prod | monthly_summary | 10 8 * * * | s3://accountingreports/overall/monthly/ | Every day at 8:10 am |
| CC7 prod | apel | 20 8 * * * | | Every day at 8:20 am |
| CC7 dev | condor_summaries | 0 7 * * * | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at 7 am |
| CC7 dev | monthly_summary | 0 15 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 3 pm |
| CC7 dev | apel | NA | | |
| EL9 dev | condor_summaries | 0 12 * * * | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at noon |
| EL9 dev | monthly_summary | 0 16 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 4 pm |
| EL9 dev | apel | | | |
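All of the entries above are plain daily cron schedules, so only the minute and hour fields vary. As a minimal sketch (the helper function is illustrative, not part of the production setup), this is how a `minute hour * * *` entry maps to its next run time:

```python
from datetime import datetime, timedelta

def next_daily_run(minute: int, hour: int, now: datetime) -> datetime:
    """Next run time of a daily cron entry 'minute hour * * *'."""
    run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run <= now:
        run += timedelta(days=1)  # today's slot has already passed
    return run

# prod condor_summaries runs on "0 3 * * *":
now = datetime(2024, 2, 29, 12, 0)
print(next_daily_run(0, 3, now))  # 2024-03-01 03:00:00
```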

BC/DR

APEL

Certificates

  • X509 certificates needed for APEL are configured via Puppet.

Cephfs

  • Schedds' history files are stored in CephFS and pre-processed into JSON files by Thooki.
  • Openstack project: IT-Batch - Infrastructure
  • Share name: htcondor-accounting-data (20TB)
  • Cephfs docs
  • Quota:
    • In Openstack project page
    • To check used space, in Thooki Master: df -H
    • Quota can be changed via request ticket in the Openstack project page. More details here.
  • Share information:
    • eval $(ai-rc "IT-Batch - Infrastructure")
    • openstack share access list htcondor-accounting-data

DBoD

  • Spool jobs need to be reprocessed under the correct completion date. A database is used to track the dates that still need reprocessing.
  • Production Instance:
    • Host: dbod-batchacc.cern.ch
    • Port: 5501
    • User: admin
    • DB name: dirtyfiles
    • Password: tbag show --hg batchinfra thooki_db_pass
  • Useful commands:
select * from dirty_date;
select * from job_runs;
describe job_runs;
describe dirty_date;
select j.id, d.id, j.state, d.dirty from job_runs j join dirty_date d on j.date_id = d.id where j.state="PROCESSED";
update dirty_date set dirty="0" where dirty="1";
truncate table dirty_date; -- deletes the contents, but the table stays in the DB
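The shape of the tables behind these commands can be sketched locally. The schema below is inferred from the queries above (columns id, state, date_id, dirty); the real instance is a MySQL database on DBoD, and SQLite is used here only to make the join self-contained and runnable:

```python
import sqlite3

# In-memory sketch of the dirty-date tracking schema; column names are
# inferred from the queries in this section, not taken from the real DB.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dirty_date (id INTEGER PRIMARY KEY, dirty TEXT);
CREATE TABLE job_runs  (id INTEGER PRIMARY KEY, state TEXT,
                        date_id INTEGER REFERENCES dirty_date(id));
INSERT INTO dirty_date VALUES (1, '1'), (2, '0');
INSERT INTO job_runs  VALUES (10, 'PROCESSED', 1), (11, 'PENDING', 2);
""")

# Same shape as the join used to inspect processed runs:
rows = con.execute("""
SELECT j.id, d.id, j.state, d.dirty
FROM job_runs j JOIN dirty_date d ON j.date_id = d.id
WHERE j.state = 'PROCESSED'
""").fetchall()
print(rows)  # [(10, 1, 'PROCESSED', '1')]

# Clearing the dirty flag, as in the update command above:
con.execute("UPDATE dirty_date SET dirty='0' WHERE dirty='1'")
```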

S3

  • The JSON files produced by Thooki are stored in S3 buckets; Spark extracts the needed information from them and writes the resulting accounting reports back to S3.
  • Openstack project:
    • Prod and dev: mmarques (4TB)
      • s3://accountingdata
      • s3://accountingdatadev
      • s3://accountingreports
      • s3://accountingreportsdev
      • Credentials:
        • tbag show --hg batchinfra s3_access_key
        • tbag show --hg batchinfra s3_secret_key
    • Backup: None, in s3-fr-prevessin-1.cern.ch (account computeaccountingbackup, no quota set)
      • s3://accountingdata
      • Credentials:
        • tbag show --hg batchinfra backup_s3_access
        • tbag show --hg batchinfra backup_s3_secret
    • Test: IT-Batch test and development (100GB)
      • s3://accounting-testing
      • s3://accounting-testing-reports
      • s3://accountingdatatest
      • s3://accountingreportstest
      • Credentials:
        • tbag show --hg batchinfra test_s3_access
        • tbag show --hg batchinfra test_s3_secret
  • S3 Docs
  • Quota:
    • In Openstack project page
    • To check used space:
      • s3cmd -H du
      • s3cmd -H du -c s3/s3cfg-prod
      • s3cmd -H du -c s3/s3cfg-prod s3://accountingdata/batch/condor_thooki/2023
    • Quota can be changed via request ticket in the Openstack project page. More details here.
    • Bucket information:
      • eval $(ai-rc "IT-Batch test and development")
      • openstack container show accounting-testing
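The -H flag in the s3cmd du commands above prints human-readable totals. As a minimal sketch of that formatting (the helper function is illustrative, not part of s3cmd):

```python
def human_readable(num_bytes):
    """Format a byte count roughly the way s3cmd -H does (powers of 1024)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            # Bytes are printed as integers, larger units with one decimal.
            return f"{num_bytes} B" if unit == "B" else f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_readable(4 * 1024**4))  # 4.0 TiB  (the 4TB project quota)
```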

Secrets

  • Teigi docs
  • Example to add a new s3 configuration file: tbag set --hg batchinfra s3key_test --file s3cfg.testbucket.2023-05-12
  • Example to show defined secrets: tbag showkeys --hg batchinfra | grep s3key_test

Alarms

  • Show alarms: roger show accounting-spark-01
  • Update state: roger update accounting-spark-01 --appstate production
  • Enable alarm: roger update accounting-spark-01 --hw_alarmed true

E-groups and service accounts

  • svcbuild: used to build koji packages. See Variables: KOJICI_USER and KOJICI_PWD.

JIRA

Knowledge Transfer

Git

How to restart nodes

  • Thooki: service thooki restart
  • Spark is not a service but a set of cron jobs that run on a daily basis. Run crontab -l to know more.
  • Re-run:

How to install nodes from scratch

  • Spark nodes:
    • Use the Openstack environment of an existing node, e.g.: eval $(ai-rc --same-project-as accounting-spark-01.cern.ch)
    • The Openstack project is IT-Batch - Infrastructure
    • Check available flavours: openstack flavor list
    • Check available images: openstack image list
    • Create the machine:
      • Old puppet: ai-bs -g batchinfra/sparktest --foreman-environment qa --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-install-test
      • New puppet: ai-bs -g compute_accounting/test --foreman-environment qa --el9 --nova-flavor m2.large --nova-sshkey malandes_key accounting-spark-el9-test
    • Delete the machine: ai-kill accounting-install-test.cern.ch

How to apply a configuration change

  • Create a new branch in gitlab to apply the changes:
    • Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
  • Create a new environment to deploy a machine using the new branch, see Docs for more details:
    • Modify the yaml file. Example for hepscore.yaml:
 
default: qa
notifications: compute-accounting-sprint@cern.ch
overrides:
  hostgroups:
    batchinfra: hepscore
  • Deploy a testing machine using the new environment, e.g. ai-bs -g batchinfra/sparkdev --foreman-environment hepscore --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-test
  • Apply the configuration changes in the new branch
  • QA
  • Prod

How to apply a code change

  • Create a new branch in gitlab to apply the changes:
    • Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
  • When you have finished changing your code in git, tag the changes. This will start a new CI/CD pipeline and a new RPM version will be available in the testing repo.
  • Deploy a testing node as explained in the previous section
  • In the testing node run yum install --enablerepo=batch7-testing accountingjobs
  • QA
  • Prod

Jupyter Notebooks

  • Swan page
  • Configure Environment:
    • For heavy computations: 4 cores + 16 GB
    • Software stack: 104a
    • Environment Script: $CERNBOX_HOME/SWAN_projects/HPCAccounting/swan_s3_env.sh
    • Spark Cluster: General Purpose (Analytix)
  • Click on the Spark clusters connection icon
    • Tick on Include S3Filesystem options
    • spark.hadoop.fs.s3a.access.key {S3A_ACCESS_KEY}
    • spark.hadoop.fs.s3a.secret.key {S3A_SECRET_KEY}
    • S3Filesystem
      • spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
      • spark.hadoop.fs.s3a.endpoint: https://s3.cern.ch
      • spark.hadoop.fs.s3a.path.style.access: true
      • spark.hadoop.fs.s3a.fast.upload: true
      • spark.jars.packages: org.apache.hadoop:hadoop-aws:3.3.2
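The S3A settings above can be collected into one set of Spark config key/value pairs. This is a sketch only: in SWAN the credentials are filled in via the Spark clusters connection dialog, and the S3A_ACCESS_KEY/S3A_SECRET_KEY environment variable names here are an assumption for illustration:

```python
import os

# S3A settings from the list above, as Spark config key/value pairs.
# Reading the keys from environment variables is illustrative only.
s3a_conf = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.endpoint": "https://s3.cern.ch",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.fast.upload": "true",
    "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
    "spark.hadoop.fs.s3a.access.key": os.environ.get("S3A_ACCESS_KEY", ""),
    "spark.hadoop.fs.s3a.secret.key": os.environ.get("S3A_SECRET_KEY", ""),
}

# With pyspark available, these would be applied along the lines of:
#   builder = SparkSession.builder
#   for key, value in s3a_conf.items():
#       builder = builder.config(key, value)
```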

How to recalculate accounting records

See documentation for accounting jobs code

BEER jobs

Open questions

Trainings and Documentation

-- MariaALANDESPRADILLO - 2023-03-02

Topic revision: r51 - 2024-02-29 - MariaALANDESPRADILLO