How Accounting Data is Produced

This document uses as a running example the accounting data from 10 September 2023 produced by the scheduler bigbird10.

The schedulers

  • Schedulers rsync their history files to the CephFS folder /accounting/incoming once per hour.
  • However, only files with the format history.YYYYMMDDTXXXX are synced; the history file of the current day is simply called history. Only at midnight is this file copied to history.YYYYMMDDTXXXX, after which it can be synced to CephFS in the following hour (a rough sketch of the sync is shown after the listings below).
[root@bigbird10 ~]# ll /var/lib/condor/spool/history.20230911T000254
-rw-r--r--. 1 condor condor 166068912 Sep 10 23:57 /var/lib/condor/spool/history.20230911T000254

[root@bigbird10 ~]# more /var/log/accounting-rsync.log | grep history.20230911T000254
2023/09/11 01:01:04 [SENT] [bigbird10.cern.ch] var/lib/condor/spool/history.20230911T000254 7466079

[root@thooki-master ~]# ll /accounting/incoming/bigbird10.cern.ch/history.20230911T000254
-rw-r--r--. 1 root root 166068912 Sep 10 23:57 /accounting/incoming/bigbird10.cern.ch/history.20230911T000254
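
For illustration, the hourly sync on a scheduler amounts to something like the following. This is a minimal sketch, not the actual sync job: the rsync flags, the destination (assumed to be the CephFS area reachable from the scheduler) and the scheduling mechanism are assumptions; only the paths come from the example above.

# Minimal sketch, not the actual sync job: copy only the rotated
# history.YYYYMMDDTXXXX files (never the live "history" file) from the
# scheduler's spool to its folder under /accounting/incoming on CephFS.
rsync -av --include='history.*' --exclude='*' \
      /var/lib/condor/spool/ /accounting/incoming/bigbird10.cern.ch/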

Thooki

  • Watcher: For every scheduler directory in /accounting/incoming, a watcher process (code) is launched. Note that if a new scheduler starts syncing its files, Thooki needs to be restarted to pick up the new scheduler.
  • Then, every 30min (more often than strictly necessary, since history files are only synced from the schedulers once per day) (config), the watcher process (code) does the following:
  • Walker: Checks /accounting/incoming for new history files (code), then gzips them and copies them to /accounting/thooki/raw (code); see the sketch after the listing below
[root@thooki-master ~]# ll /accounting/thooki/raw/bigbird10.cern.ch/2023/09/11/history.20230911T000254.gz
-rw-r--r--. 1 root root 7360550 Sep 11 01:19 /accounting/thooki/raw/bigbird10.cern.ch/2023/09/11/history.20230911T000254.gz
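
The walker step amounts to roughly the following (a minimal sketch, not the actual Thooki code; the paths are taken from the example above):

# Minimal sketch, not the actual Thooki code: gzip a newly arrived history file
# from the incoming area into the raw area, under the day it was processed.
src=/accounting/incoming/bigbird10.cern.ch/history.20230911T000254
dst=/accounting/thooki/raw/bigbird10.cern.ch/$(date +%Y/%m/%d)
mkdir -p "$dst"
gzip -c "$src" > "$dst/$(basename "$src").gz"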

  • Encoder: Converts the history file to JSON (code), writing it in the /tmp directory as history.20230911T000254.json
  • Flattener: Parses the JSON file in /tmp and splits it into multiple files based on the completion date of the jobs (code). The job date is taken from the ClassAd attribute CompletionDate and, if it is zero, from EnteredCurrentStatus (sketched after the listings below).
    • CompletionDate: the time when the job completed, or undefined if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970). Note that older versions of HTCondor initialized CompletionDate to the integer 0, so job ads from older versions of HTCondor might have a 0 CompletionDate for jobs which haven't completed.
    • EnteredCurrentStatus: an integer containing the epoch time of when the job entered into its current status. So, for example, if the job is on hold, the ClassAd expression time() - EnteredCurrentStatus will equal the number of seconds the job has been on hold.

  • In the example below, we can see that the file contains jobs with different end dates: 31st August and 10th September.
[root@thooki-master bigbird10.cern.ch]# for i in `condor_history -file history.20230911T000254 -af CompletionDate | grep -v undefined`; do date --date=@$i; done | cut -d" " -f2-3 | sort | uniq -c
      8 Aug 31
   4767 Jan
  15336 Sep 10
[root@thooki-master bigbird10.cern.ch]# for i in `condor_history -file history.20230911T000254 -af EnteredCurrentStatus | grep -v undefined`; do date --date=@$i; done | cut -d" " -f2-3 | sort | uniq -c
      8 Aug 31
     40 Sep
  20063 Sep 10
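
Note that the 4767 entries shown simply as Jan correspond to epoch 0 (1 January 1970), i.e. jobs whose CompletionDate is 0 (the day number is dropped by the cut because single-digit days are padded with an extra space); this is exactly the case where the Flattener falls back to EnteredCurrentStatus. The encoding and splitting steps can be sketched roughly as follows. This is a minimal sketch, not the actual Thooki code: it assumes the JSON file is an array of job ClassAds as produced by condor_history -json, uses jq, writes one record per line, and ignores timezone subtleties.

# Minimal sketch, not the actual Thooki code.
# Encoder: convert the history file to a JSON array of job ClassAds.
condor_history -file history.20230911T000254 -json > /tmp/history.20230911T000254.json

# Flattener: split the records by completion date, falling back to
# EnteredCurrentStatus when CompletionDate is 0 or undefined.
jq -c '.[]' /tmp/history.20230911T000254.json | while read -r job; do
    ts=$(jq -r 'if (.CompletionDate // 0) > 0 then .CompletionDate else .EnteredCurrentStatus end' <<< "$job")
    out=/accounting/thooki/processed/bigbird10.cern.ch/$(date -u --date="@${ts}" +%Y/%m/%d)
    mkdir -p "$out"
    echo "$job" >> "$out/history.20230911T000254.json"
done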

  • This means Thooki has created two split history files (the .sync suffix is added at a later stage):
[root@thooki-master ~]# ll /accounting/thooki/processed/bigbird10.cern.ch/2023/09/10/
total 160966
-rw-------. 1 root root 164828233 Sep 11 01:20 history.20230911T000254.json.sync

[root@thooki-master ~]# ll /accounting/thooki/processed/bigbird10.cern.ch/2023/08/31/
total 148650
-rw-------. 1 root root 152128914 Sep  1 00:18 history.20230901T000000.json.sync
-rw-------. 1 root root     26152 Sep  2 00:30 history.20230902T000000.json.sync
-rw-------. 1 root root     61548 Sep 11 01:20 history.20230911T000254.json.sync --> This is the file we care about in this example

  • All the files in those directories are merged into a temporary file named after the scheduler (code), and the .sync suffix is added to the individual history files (code); see the sketch after the listings below:
Under /accounting/thooki/processed/bigbird10.cern.ch/2023/09/10/, we would have:
history.20230911T000254.json.sync
bigbird10.cern.ch.json --> temporary file that is deleted

Under /accounting/thooki/processed/bigbird10.cern.ch/2023/08/31/, we would have:
history.20230901T000000.json.sync
history.20230902T000000.json.sync
history.20230911T000254.json.sync
bigbird10.cern.ch.json --> temporary file that is deleted
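
The merge-and-rename step for one of these directories can be sketched as follows (a minimal sketch, not the actual Thooki code; the paths are taken from the example above):

# Minimal sketch, not the actual Thooki code: merge all pieces of that day into
# one temporary per-scheduler file, then mark the newly added piece as synced.
day=/accounting/thooki/processed/bigbird10.cern.ch/2023/08/31
cat "$day"/history.*.json.sync "$day"/history.20230911T000254.json > "$day/bigbird10.cern.ch.json"
mv "$day/history.20230911T000254.json" "$day/history.20230911T000254.json.sync"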

  • Zipper: Zips all the scheduler.json files and stores them in /accounting/thooki/zipped/ (code)
[root@thooki-master ~]# ll /accounting/thooki/zipped/2023/09/10/bigbird10.cern.ch.json.gz
-rw-r--r--. 1 root root 7175380 Sep 11 01:21 /accounting/thooki/zipped/2023/09/10/bigbird10.cern.ch.json.gz

[root@thooki-master ~]# ll /accounting/thooki/zipped/2023/08/31/bigbird10.cern.ch.json.gz
-rw-r--r--. 1 root root 5856073 Sep 11 01:21 /accounting/thooki/zipped/2023/08/31/bigbird10.cern.ch.json.gz

  • Uploader: Uploads the zipped scheduler.json files to S3 (code); sketched after the listings below. Note that the modification date of the files in S3 cannot be trusted: a daily cron job that sets ACLs also updates the date (RQF2404552). This affects only the prod bucket; the date is correct in the dev bucket (note that dates are in UTC, not CET).
-bash-4.2$ s3cmd ls -c s3/s3cfg-prod s3://accountingdata/batch/condor_thooki/2023/09/10/
2023-09-14 07:17      3602130  s3://accountingdata/batch/condor_thooki/2023/09/10/bigbird10.cern.ch.json.gz

-bash-4.2$ s3cmd ls -c s3/s3cfg-prod s3://accountingdata/batch/condor_thooki/2023/08/31/
2023-09-14 07:17     28748744  s3://accountingdata/batch/condor_thooki/2023/08/31/bigbird10.cern.ch.json.gz
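
The zipping and upload steps can be sketched as follows (a minimal sketch, not the actual Thooki code; the bucket path and s3cmd configuration are taken from the examples above):

# Minimal sketch, not the actual Thooki code.
# Zipper: gzip the merged per-scheduler file into the zipped area for that date.
src=/accounting/thooki/processed/bigbird10.cern.ch/2023/08/31/bigbird10.cern.ch.json
dst=/accounting/thooki/zipped/2023/08/31
mkdir -p "$dst"
gzip -c "$src" > "$dst/bigbird10.cern.ch.json.gz"

# Uploader: push the gzipped file to the corresponding prefix in S3.
s3cmd -c s3/s3cfg-prod put "$dst/bigbird10.cern.ch.json.gz" \
      s3://accountingdata/batch/condor_thooki/2023/08/31/bigbird10.cern.ch.json.gz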

  • Informer: For every date for which new job records have been identified, the DB is updated (sketched after the query output below). The DB is responsible for tracking which dates need their accounting records reprocessed (code)
MySQL [dirtyfiles]> select * from dirty_date where date="2023-09-10 00:00:00";
+--------+---------------------+-------+---------------------+---------------------+
| id     | date                | dirty | created_on          | last_modified_on    |
+--------+---------------------+-------+---------------------+---------------------+
| 102296 | 2023-09-10 00:00:00 |     1 | 2023-09-11 00:03:49 | 2023-09-11 00:03:49 |
+--------+---------------------+-------+---------------------+---------------------+
1 row in set (0.00 sec)

MySQL [dirtyfiles]> select * from dirty_date where date="2023-08-31 00:00:00";
+--------+---------------------+-------+---------------------+---------------------+
| id     | date                | dirty | created_on          | last_modified_on    |
+--------+---------------------+-------+---------------------+---------------------+
| 101419 | 2023-08-31 00:00:00 |     1 | 2023-09-01 00:05:44 | 2023-09-01 00:05:44 |
+--------+---------------------+-------+---------------------+---------------------+
1 row in set (0.00 sec)
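
The update performed by the Informer can be sketched as follows (a minimal sketch, not the actual Thooki code; it assumes the dirty_date schema shown above, a unique key on the date column, and local MySQL credentials):

# Minimal sketch, not the actual Thooki code: mark a completion date as dirty,
# creating the row if it does not exist yet (assumes a unique key on "date").
mysql dirtyfiles -e "
  INSERT INTO dirty_date (date, dirty, created_on, last_modified_on)
  VALUES ('2023-09-10 00:00:00', 1, NOW(), NOW())
  ON DUPLICATE KEY UPDATE dirty = 1, last_modified_on = NOW();"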

  • Note that for date 2023-08-31, the dirty flag was set to TRUE on 2023-09-01 and has remained set ever since because of a bug reported in the next section.

Rerun Job

  • The rerun job is responsible for recalculating accounting records for completion dates that are older than the date of the history file in which the jobs are reported. For that, two processes are used:
    • Scanner: Every 30min, it scans the table job_runs for all job ids with state PROCESSED (code)
      • BUG: The scanner process does not seem to be running; there are some errors about this in the logs.
      • It checks the logs in /var/log/spark-scripts/condor-summaries/rerun/ on the accounting-spark-01 node and looks for the output and error files corresponding to the job ids collected in the previous step.
      • If the error file is empty and the log file contains the DONE-Token, then it updates the JobID's State in job_runs to DONE (a sketch of this check follows this list). BUG: However, the rerun job never sets the dirty column in the dirty_date table back to FALSE. There is indeed a line in the code that says TODO: Update the dirty boolean flag to false (code)
      • If the error file is not empty, then it updates the State to ERROR and adds the error message to the comments field in the job_runs table. The dirty flag is still TRUE, so the job is relaunched in the next iteration.
    • Launcher: Every 6h, it scans the table dirty_date for all dates with the dirty flag set to TRUE. If the corresponding rerun job for a date is in state PROCESSED, it skips it; otherwise, it launches a new Spark job for that date (code)
      • Because of the scanner bug, every time the launcher runs, it reports more than 1,000 dates for reprocessing, as most dates have the dirty flag set to TRUE.
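
The check the Scanner is supposed to perform for each PROCESSED job id can be sketched as follows (a minimal sketch, not the actual scanner code: the log file names, the literal DONE-Token string and the job_runs column names are assumptions):

# Minimal sketch, not the actual scanner code: decide the final state of one
# rerun job from its Spark output and error files (file names and the
# state/id column names are assumed for illustration).
id=12345
logdir=/var/log/spark-scripts/condor-summaries/rerun
if [ ! -s "$logdir/$id.err" ] && grep -q 'DONE-Token' "$logdir/$id.out"; then
    mysql dirtyfiles -e "UPDATE job_runs SET state='DONE' WHERE id=$id;"
    # The real code should also set dirty_date.dirty back to 0 for the job's
    # date here; this is the missing TODO mentioned above.
else
    mysql dirtyfiles -e "UPDATE job_runs SET state='ERROR' WHERE id=$id;"
fi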

-- MariaALANDESPRADILLO - 2023-09-13
