Batch Accounting Overview
Documentation
Configuration
- List of Nodes
- Puppet configuration:
- HTCondor Schedulers
- Thooki nodes
- gitlab repo
- Puppet:
- Manifests:
- Templates:
- Variables:
- Files:
- Log rotate: /var/log/thooki/*.log and /var/log/accounting-historian-daemon/*.log
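- To verify the rotation rules on a node, dry-run logrotate against the deployed config (a sketch; the path /etc/logrotate.d/thooki is an assumption, check the Puppet files for the real name):
logrotate -d /etc/logrotate.d/thooki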
- Spark nodes
- gitlab repo
- New Puppet:
- Old Puppet:
- Manifests:
- spark.pp - Hostgroup: batchinfra/spark (accounting-spark-01)
- sparkdev.pp - Hostgroup: batchinfra/sparkdev (accounting-spark-dev)
- sparktest.pp - Hostgroup: batchinfra/sparktest (accounting-spark-test)
- Templates:
- Variables:
- Files:
- Log rotate: /var/log/spark-scripts/*.log and /var/log/accounting-share/*.log
- Rerun cron job
- gitlab repo
- Puppet:
- Manifests:
- rerun.pp - This is installed in prod only (accounting-spark-01)
- Templates:
- Backup node
Software Versions
| Component | OS  | Software      | Version | SLoC | Comments |
|-----------|-----|---------------|---------|------|----------|
| Thooki    | CC7 | Go            | 1.0.9-1 |      |          |
| Thooki    | EL9 | Go            |         |      |          |
| Spark     | CC7 | Python 2.7.5  | 2.4-10  |      |          |
| Spark     | EL9 | Python 3.9.18 | 2.7-11  |      |          |
| Rerun     | CC7 | Go 1.13       | 0.0.1-7 |      |          |
| Rerun     | EL9 | Go 1.19       | 1.0.0-9 |      |          |
| apel-ssm  | CC7 | Python 2.7.5  | 3.1.1-1 | -    |          |
| apel-ssm  | EL9 | Python 3.9.18 | -       | -    |          |
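To see which versions are actually installed on a given node (package names are assumptions based on this page, e.g. the yum command further down; adjust as needed):
rpm -q accountingjobs apel-ssm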
Cronjobs
Crontab fields: minute hour day(month) month day(week)
| Node     | Job              | Time       | Bucket | Meaning |
|----------|------------------|------------|--------|---------|
| CC7 prod | condor_summaries | 0 3 * * *  | s3://accountingdata and s3://accountingreports/batch/condor_thooki | Every day at 3 am |
| CC7 prod | monthly_summary  | 10 8 * * * | s3://accountingreports/overall/monthly/ | Every day at 8:10 am |
| CC7 prod | apel             | 20 8 * * * |        | Every day at 8:20 am |
| CC7 dev  | condor_summaries | 0 7 * * *  | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at 7 am |
| CC7 dev  | monthly_summary  | 0 15 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 3 pm |
| CC7 dev  | apel             | NA         |        |         |
| EL9 dev  | condor_summaries | 0 12 * * * | s3://accountingdatadev and s3://accountingreportsdev/batch/condor_thooki | Every day at noon |
| EL9 dev  | monthly_summary  | 0 16 * * * | s3://accountingreportsdev/overall/monthly/ | Every day at 4 pm |
| EL9 dev  | apel             |            |        |         |
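The Time column is a plain crontab schedule. A hypothetical entry for the first row (the script path is illustrative; run crontab -l on the node for the real one):
0 3 * * * /usr/local/bin/condor_summaries >> /var/log/spark-scripts/condor_summaries.log 2>&1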
BC/DR
APEL
Certificates
- X509 certificates needed for APEL are configured via Puppet.
Cephfs
- The schedds' history files are stored in CephFS and pre-processed into JSON files by Thooki.
- Openstack project: IT-Batch - Infrastructure
- Share name: htcondor-accounting-data (20TB)
- Cephfs docs
- Quota:
- In Openstack project page
- To check used space, on the Thooki master:
df -H
- Quota can be changed via request ticket in the Openstack project page. More details here.
- Share information:
- eval $(ai-rc "IT-Batch - Infrastructure")
- openstack share access list htcondor-accounting-data
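- To also see the share's size and status (standard Manila CLI command):
openstack share show htcondor-accounting-data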
DBoD
- Spool jobs need to be reprocessed under their correct completion date. A DB is used to track the dates that need to be reprocessed.
- Production Instance:
- Host: dbod-batchacc.cern.ch
- Port: 5501
- User: admin
- DB name: dirtyfiles
- Password:
tbag show --hg batchinfra thooki_db_pass
- Useful commands:
select * from dirty_date;
select * from job_runs;
describe job_runs;
describe dirty_date;
-- List processed job runs together with their dirty-date flag:
select j.id,d.id,j.state,d.dirty from job_runs j join dirty_date d on j.date_id = d.id where j.state="PROCESSED";
-- Mark all dirty dates as clean:
update dirty_date set dirty="0" where dirty="1";
-- Empty the table (contents deleted, the table itself stays in the DB):
truncate table dirty_date;
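- The connection details above map onto a standard mysql client invocation (the describe/truncate syntax suggests a MySQL-compatible instance; the password comes from the tbag command above):
mysql -h dbod-batchacc.cern.ch -P 5501 -u admin -p dirtyfiles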
S3
- The JSON files produced by Thooki are stored in S3; Spark extracts the needed information from them and writes the accounting reports back to S3 buckets.
- Openstack project:
- Prod and dev: mmarques (4TB)
- s3://accountingdata
- s3://accountingdatadev
- s3://accountingreports
- s3://accountingreportsdev
- Credentials:
- tbag show --hg batchinfra s3_access_key
- tbag show --hg batchinfra s3_secret_key
- Backup: no Openstack project; hosted on s3-fr-prevessin-1.cern.ch (account computeaccountingbackup, no quota set)
- s3://accountingdata
- Credentials:
- tbag show --hg batchinfra backup_s3_access
- tbag show --hg batchinfra backup_s3_secret
- Test: IT-Batch test and development (100GB)
- s3://accounting-testing
- s3://accounting-testing-reports
- s3://accountingdatatest
- s3://accountingreportstest
- Credentials:
- tbag show --hg batchinfra test_s3_access
- tbag show --hg batchinfra test_s3_secret
- S3 Docs
- Quota:
- In Openstack project page
- To check used space:
- s3cmd -H du
- s3cmd -H du -c s3/s3cfg-prod
- s3cmd -H du -c s3/s3cfg-prod s3://accountingdata/batch/condor_thooki/2023
- Quota can be changed via request ticket in the Openstack project page. More details here.
- Bucket information:
- eval $(ai-rc "IT-Batch test and development")
- openstack container show accounting-testing
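- The s3cmd commands above read a config file such as s3/s3cfg-prod; a minimal sketch of one, assuming the standard CERN S3 endpoint and taking the keys from the tbag commands above:
cat > s3/s3cfg-prod <<'EOF'
[default]
host_base = s3.cern.ch
host_bucket = s3.cern.ch
access_key = <s3_access_key from tbag>
secret_key = <s3_secret_key from tbag>
EOF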
Secrets
- Teigi docs
- Example to add a new s3 configuration file:
tbag set --hg batchinfra s3key_test --file s3cfg.testbucket.2023-05-12
- Example to show defined secrets:
tbag showkeys --hg batchinfra | grep s3key_test
Alarms
- Show alarms:
roger show accounting-spark-01
- Update state:
roger update accounting-spark-01 --appstate production
- Enable alarm:
roger update accounting-spark-01 --hw_alarmed true
E-groups and service accounts
- svcbuild: used to build koji packages. See Variables: KOJICI_USER and KOJICI_PWD.
JIRA
Knowledge Transfer
Git
How to restart nodes
- Thooki:
service thooki restart
- Spark is not a service but a set of cron jobs that run on a daily basis. Run
crontab -l
to see them.
- Re-run:
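- To check that everything came back after a restart (service status is standard; the log paths come from the Configuration section):
service thooki status
tail -n 20 /var/log/thooki/*.log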
How to install nodes from scratch
- Spark nodes:
- Use Openstack environment of existing node, i.e.:
eval $(ai-rc --same-project-as accounting-spark-01.cern.ch)
- The Openstack project is
IT-Batch - Infrastructure
- Check available flavours:
openstack flavor list
- Check available images:
openstack image list
- Create the machine:
- Old puppet:
ai-bs -g batchinfra/sparktest --foreman-environment qa --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-install-test
- New puppet:
ai-bs -g compute_accounting/test --foreman-environment qa --el9 --nova-flavor m2.large --nova-sshkey malandes_key accounting-spark-el9-test
- Delete the machine:
ai-kill accounting-install-test.cern.ch
How to apply a configuration change
- Create a new branch in gitlab to apply the changes:
- Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
- Create a new environment to deploy a machine using the new branch, see Docs for more details:
- Modify the yaml file. Example for hepscore.yaml:
default: qa                                        # default branch
notifications: compute-accounting-sprint@cern.ch   # where notifications are sent
overrides:
  hostgroups:
    batchinfra: hepscore                           # batchinfra hostgroup follows the hepscore branch
- Deploy a testing machine using the new environment, i.e.
ai-bs -g batchinfra/sparkdev --foreman-environment hepscore --cc7 --nova-flavor m2.large --nova-sshkey malandes_key accounting-test
- Apply the configuration changes in the new branch (see the Puppet run sketch after this list)
- QA
- Prod
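To make the testing machine pick up changes pushed to the new branch, force a Puppet run on it (standard agent invocation):
puppet agent -t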
How to apply a code change
- Create a new branch in gitlab to apply the changes:
- Go to the root of the repository and click on +, then New branch. See Docs for more details, if needed.
- When you have finished changing your code in git, tag the changes (see the sketch after this list):
- This starts a new CI/CD pipeline, and a new rpm version becomes available in the testing repo
- Deploy a testing node as explained in the previous section
- In the testing node, run
yum install --enablerepo=batch7-testing accountingjobs
- QA
- Prod
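A hypothetical tagging sequence for the step above (the tag name is illustrative; follow the repo's existing tag scheme):
git tag -a v1.0.10 -m "Release 1.0.10"
git push origin v1.0.10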
Jupyter Notebooks
- Swan page
- Configure Environment:
- For heavy computations: 4 cores + 16 GB
- Software stack:
104a
- Environment Script:
$CERNBOX_HOME/SWAN_projects/HPCAccounting/swan_s3_env.sh
- Spark Cluster:
General Purpose (Analytix)
- Click on the start icon (Spark clusters connection)
- Tick Include S3Filesystem options
- spark.hadoop.fs.s3a.access.key {S3A_ACCESS_KEY}
- spark.hadoop.fs.s3a.secret.key {S3A_SECRET_KEY}
- S3Filesystem
- spark.hadoop.fs.s3a.impl:
org.apache.hadoop.fs.s3a.S3AFileSystem
- spark.hadoop.fs.s3a.endpoint:
https://s3.cern.ch
- spark.hadoop.fs.s3a.path.style.access:
true
- spark.hadoop.fs.s3a.fast.upload:
true
- spark.jars.packages:
org.apache.hadoop:hadoop-aws:3.3.2
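- The same S3A options in command-line form, for running outside SWAN (a sketch; the script name is a placeholder and the keys come from the tbag commands in the S3 section):
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.2 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.cern.ch \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.access.key="$S3A_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3A_SECRET_KEY" \
  report_job.py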
How to recalculate accounting records
See the documentation for the accounting jobs code
BEER jobs
Open questions
Trainings and Documentation
--
MariaALANDESPRADILLO - 2023-03-02