Batch Accounting Overview
Documentation
Configuration
- List of Nodes
- Puppet configuration:
- HTCondor Schedulers
- Thooki nodes
- gitlab repo
- Puppet:
- Manifests:
- Templates:
- Variables:
- Files:
- Log rotate: /var/log/thooki/*.log and /var/log/accounting-historian-daemon/*.log
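- To verify the rotation rules on a node, dry-run logrotate against the deployed config (a sketch; the path /etc/logrotate.d/thooki is an assumption, check the Puppet files for the real name):
logrotate -d /etc/logrotate.d/thooki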
- Spark nodes
- gitlab repo
- New Puppet:
- Old Puppet:
- Manifests:
- spark.pp - Hostgroup: batchinfra/spark (accounting-spark-01)
- sparkdev.pp - Hostgroup: batchinfra/sparkdev (accounting-spark-dev)
- sparktest.pp - Hostgroup: batchinfra/sparktest (accounting-spark-test)
- Templates:
- Variables:
- Files:
- Log rotate: /var/log/spark-scripts/*.log and /var/log/accounting-share/*.log
- Rerun cron job
- gitlab repo
- Puppet:
- Manifests:
- rerun.pp - This is installed in prod only (accounting-spark-01)
- Templates:
- Backup node
Software Versions
| Component | OS  | Software      | Version | SLoC | Comments |
|-----------|-----|---------------|---------|------|----------|
| Thooki    | CC7 | Go            | 1.0.9-1 |      |          |
| Thooki    | EL9 | Go            |         |      |          |
| Spark     | CC7 | Python 2.7.5  | 2.4-10  |      |          |
| Spark     | EL9 | Python 3.9.18 | 2.7-11  |      |          |
| Rerun     | CC7 | Go 1.13       | 0.0.1-7 |      |          |
| Rerun     | EL9 | Go 1.19       | 1.0.0-9 |      |          |
| apel-ssm  | CC7 | Python 2.7.5  | 3.1.1-1 | -    |          |
| apel-ssm  | EL9 | Python 3.9.18 | -       | -    |          |
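To see which versions are actually installed on a given node (package names are assumptions based on this page, e.g. the yum command further down; adjust as needed):
rpm -q accountingjobs apel-ssm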
Cronjobs
Crontab fields: minute hour day(month) month day(week)
| Node     | Job              | Time       | Bucket | Meaning |
|----------|------------------|------------|--------|---------|
| CC7 prod | condor_summaries | 0 3 * * *  | s3://accountingdata and s3://accountingreports/batch/condor_thooki | Every day at 3 am |
| CC7 prod | monthly_summary  | 10 8 * * * | s3://accountingreports/overall/monthly/ | Every day at 8:10 am |
| CC7 prod | apel             | 20 8 * * * |        | Every day at 8:20 am |
| CC7 dev  | condor_summaries | 0 7 * * *  | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at 7 am |
| CC7 dev  | monthly_summary  | 0 15 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 3 pm |
| CC7 dev  | apel             | NA         |        |         |
| EL9 dev  | condor_summaries | 0 12 * * * | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at noon |
| EL9 dev  | monthly_summary  | 0 16 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 4 pm |
| EL9 dev  | apel             |            |        |         |
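The Time column is a plain crontab schedule. A hypothetical entry for the first row (the script path is illustrative; run crontab -l on the node for the real one):
0 3 * * * /usr/local/bin/condor_summaries >> /var/log/spark-scripts/condor_summaries.log 2>&1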
BC/DR
APEL
Certificates
- X509 certificates needed for APEL are configured via Puppet.
Cephfs
- The schedds' history files are stored in CephFS and pre-processed into JSON files by Thooki.
- Openstack project: IT-Batch - Infrastructure
- Share name: htcondor-accounting-data (20TB)
- Cephfs docs
- Quota:
- In Openstack project page
- To check used space, on the Thooki master:
df -H
- Quota can be changed via request ticket in the Openstack project page. More details here.
- Share information:
- eval $(ai-rc "IT-Batch - Infrastructure")
- openstack share access list htcondor-accounting-data
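- To also see the share's size and status (standard Manila CLI command):
openstack share show htcondor-accounting-data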
DBoD
- Spool jobs need to be reprocessed under their correct completion date. A DB is used to track the dates that need to be reprocessed.
- Production Instance:
- Host: dbod-batchacc.cern.ch
- Port: 5501
- User: admin
- DB name: dirtyfiles
- Password:
tbag show --hg batchinfra thooki_db_pass
- Useful commands:
select * from dirty_date;
select * from job_runs;
describe job_runs;
describe dirty_date;
-- List processed job runs together with their dirty-date flag:
select j.id,d.id,j.state,d.dirty from job_runs j join dirty_date d on j.date_id = d.id where j.state="PROCESSED";
-- Mark all dirty dates as clean:
update dirty_date set dirty="0" where dirty="1";
-- Empty the table (contents deleted, the table itself stays in the DB):
truncate table dirty_date;
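- The connection details above map onto a standard mysql client invocation (the describe/truncate syntax suggests a MySQL-compatible instance; the password comes from the tbag command above):
mysql -h dbod-batchacc.cern.ch -P 5501 -u admin -p dirtyfiles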
S3
- The JSON files produced by Thooki are stored in S3; Spark extracts the needed information from them and writes the accounting reports back to S3 buckets.
- Openstack project:
- Prod and dev: mmarques (4TB)
- s3://accountingdata
- s3://accountingdatadev
- s3://accountingreports
- s3://accountingreportsdev
- Credentials:
- tbag show --hg batchinfra s3_access_key
- tbag show --hg batchinfra s3_secret_key
- Backup: no Openstack project; hosted on s3-fr-prevessin-1.cern.ch (account computeaccountingbackup, no quota set)
- s3://accountingdata
- Credentials:
- tbag show --hg batchinfra backup_s3_access
- tbag show --hg batchinfra backup_s3_secret
- Test: IT-Batch test and development (100GB)
- s3://accounting-testing
- s3://accounting-testing-reports
- s3://accountingdatatest
- s3://accountingreportstest
- Credentials:
- tbag show --hg batchinfra test_s3_access
- tbag show --hg batchinfra test_s3_secret
- S3 Docs
- Quota:
- In Openstack project page
- To check used space:
- s3cmd -H du
- s3cmd -H du -c s3/s3cfg-prod
- s3cmd -H du -c s3/s3cfg-prod s3://accountingdata/batch/condor_thooki/2023
- Quota can be changed via request ticket in the Openstack project page. More details here.
- Bucket information:
- eval $(ai-rc "IT-Batch test and development")
- openstack container show accounting-testing
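- The s3cmd commands above read a config file such as s3/s3cfg-prod; a minimal sketch of one, assuming the standard CERN S3 endpoint and taking the keys from the tbag commands above:
cat > s3/s3cfg-prod <<'EOF'
[default]
host_base = s3.cern.ch
host_bucket = s3.cern.ch
access_key = <s3_access_key from tbag>
secret_key = <s3_secret_key from tbag>
EOF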
Secrets
- Teigi docs
- Example to add a new s3 configuration file:
tbag set --hg batchinfra s3key_test --file s3cfg.testbucket.2023-05-12
- Example to show defined secrets:
tbag showkeys --hg batchinfra | grep s3key_test
Alarms
- Show alarms:
roger show accounting-spark-01
- Update state:
roger update accounting-spark-01 --appstate production
- Enable alarm:
roger update accounting-spark-01 --hw_alarmed true
E-groups and service accounts
- svcbuild: used to build koji packages. See Variables: KOJICI_USER and KOJICI_PWD.
JIRA
Knowledge Transfer
Git
How to restart nodes
- Thooki:
service thooki restart
- Spark is not a service but a set of cron jobs that run on a daily basis. Run
crontab -l
to see them.
- Re-run:
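- To check that everything came back after a restart (service status is standard; the log paths come from the Configuration section):
service thooki status
tail -n 20 /var/log/thooki/*.log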
How to install nodes from scratch
- Spark nodes:
- Use Openstack environment of existing node, i.e.:
eval $(ai-rc --same-project-as accounting-spark-01.cern.ch)
- The Openstack project is
IT-Batch - Infrastructure
- Check available flavours:
openstack flavor list
- Check available images:
openstack image list
- Create the machine:
- Old puppet:
ai-bs -g batchinfra/sparktest --foreman-environment qa --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-install-test
- New puppet:
ai-bs -g compute_accounting/test --foreman-environment qa --el9 --nova-flavor m2.large --nova-sshkey malandes_key accounting-spark-el9-test
- Delete the machine:
ai-kill accounting-install-test.cern.ch
How to apply a configuration change
- Create a new branch in gitlab to apply the changes:
- Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
- Create a new environment to deploy a machine using the new branch, see Docs for more details:
- Modify the yaml file. Example for hepscore.yaml:
default: qa                                        # default branch
notifications: compute-accounting-sprint@cern.ch   # where notifications are sent
overrides:
  hostgroups:
    batchinfra: hepscore                           # batchinfra hostgroup follows the hepscore branch
- Deploy a testing machine using the new environment, i.e.
ai-bs -g batchinfra/sparkdev --foreman-environment hepscore --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-test
- Apply the configuration changes in the new branch (see the Puppet run sketch after this list)
- QA
- Prod
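To make the testing machine pick up changes pushed to the new branch, force a Puppet run on it (standard agent invocation):
puppet agent -t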
How to apply a code change
- Create a new branch in gitlab to apply the changes:
- Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
- When you have finished changing your code in git, tag the changes (see the sketch after this list):
- This starts a new CI/CD pipeline, and a new rpm version becomes available in the testing repo
- Deploy a testing node as explained in the previous section
- In the testing node, run
yum install --enablerepo=batch7-testing accountingjobs
- QA
- Prod
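A hypothetical tagging sequence for the step above (the tag name is illustrative; follow the repo's existing tag scheme):
git tag -a v1.0.10 -m "Release 1.0.10"
git push origin v1.0.10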
Jupyter Notebooks
- Swan page
- Configure Environment:
- For heavy computations: 4 cores + 16 GB
- Software stack:
104a
- Environment Script:
$CERNBOX_HOME/SWAN_projects/HPCAccounting/swan_s3_env.sh
- Spark Cluster:
General Purpose (Analytix)
- Click on the start icon (Spark clusters connection)
- Tick Include S3Filesystem options
- spark.hadoop.fs.s3a.access.key {S3A_ACCESS_KEY}
- spark.hadoop.fs.s3a.secret.key {S3A_SECRET_KEY}
- S3Filesystem
- spark.hadoop.fs.s3a.impl:
org.apache.hadoop.fs.s3a.S3AFileSystem
- spark.hadoop.fs.s3a.endpoint:
https://s3.cern.ch
- spark.hadoop.fs.s3a.path.style.access:
true
- spark.hadoop.fs.s3a.fast.upload:
true
- spark.jars.packages:
org.apache.hadoop:hadoop-aws:3.3.2
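- The same S3A options in command-line form, for running outside SWAN (a sketch; the script name is a placeholder and the keys come from the tbag commands in the S3 section):
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.2 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.cern.ch \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.access.key="$S3A_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3A_SECRET_KEY" \
  report_job.py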
How to recalculate accounting records
See the documentation for the accounting jobs code
BEER jobs
Open questions
Trainings and Documentation
--
MariaALANDESPRADILLO - 2023-03-02