Performance Reporting

What is it for?

The main goal of this system is to provide the following information:

  • sizes of all the objects that ATLAS stores in officially produced data files.
  • timings of all the algorithms, tools and services used during official data production, simulation and reconstruction.
  • CPU time, wall-clock time, memory leaks and other per-job information.

This information will be used:

  • by performance experts to identify places where optimization will be most beneficial
  • by management to determine current resource needs and to extrapolate for future running parameters
  • by reco shifters to continuously monitor changes in resource consumption
  • example 1: find all the persistent objects not written out in the last year and remove them from the code.
  • example 2: find all the unneeded persistent objects that are written out.
  • example 3: find all the algorithms, tools or services not needed in a particular transformation step.
  • example 4: estimate the resources needed for a particular MC/real-data production/reconstruction.

It is important to note that in some cases significantly fewer than 100% of the events are taken into account. Still, this should not matter for most applications.

How does it work?

Sources of information

While this may change, for now we have settled on the following:
  • stored object sizes are obtained from PoolFile.py, a ROOT-based tool (also used by checkFile.py) that opens a ROOT file and returns the names, sizes and numbers of entries of all the stored collections (see the sketch below).
  • algorithm timings are currently obtained from PerfMonSD (data collected before 1 April were obtained through ChronoSvc).
  • per-job information is obtained from PerfMonSD, which in turn obtains it from the operating system.
The code collecting the information lives in the doPostRunActions function of Tools/PyJobTransformsCore/trf.py.
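
For illustration, here is a minimal sketch of reading these numbers with the same machinery. It assumes an Athena environment is set up; the input file name is hypothetical and the PoolRecord attribute names used below are assumptions based on the checkFile.py printout, so check against your release before relying on them.

import PyUtils.PoolFile as PF

# Sketch: open a POOL file and list the stored collections with their
# on-disk sizes and entry counts. The attribute names (name, diskSize,
# nEntries) are assumptions based on what checkFile.py prints.
pool_file = PF.PoolFile("AOD.pool.root")  # hypothetical input file

for record in pool_file.data:
    # one record per stored collection
    print("%s size=%s entries=%s" % (record.name, record.diskSize, record.nEntries))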

Data transport

  • Tier-0 collected data are sent directly to the Oracle DB in Lyon. Authentication is done via the environment variable TZAMIPW. In case the default DB is unreachable, the Oracle DB at CERN is used to store the data temporarily, from where they are moved to Lyon automatically.
  • prodSys (grid) job data are sent using special AMI commands. Authentication is done via VOMS. Only jobs having '/atlas/Role=production' in the output of voms-proxy-info -fqan will try to send data.
Currently there is no backup solution in case of problems with AMI. In both cases it is guaranteed that if the upload fails for whatever reason, it fails quietly and the job finishes normally. There is a 60-second window in which all of the data collection and delivery has to be done. All the code concerning data transport is in Tools/PyJobTransforms; a sketch of the quiet-failure pattern follows.
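
This is a minimal sketch of that pattern, not the actual Tools/PyJobTransforms code: upload_performance_data is a hypothetical stand-in for the real transport call, and the alarm-based timeout is just one way to enforce the 60-second budget.

import signal

def upload_performance_data(payload):
    # hypothetical stand-in for the real AMI/Oracle upload
    raise NotImplementedError

class Timeout(Exception):
    pass

def alarm_handler(signum, frame):
    raise Timeout()

def send_monitoring_data(payload):
    # hard 60 s budget for collection and delivery; any failure is
    # swallowed so that the production job always finishes normally
    signal.signal(signal.SIGALRM, alarm_handler)
    signal.alarm(60)
    try:
        upload_performance_data(payload)
    except Exception:
        pass  # fail quietly
    finally:
        signal.alarm(0)  # cancel the pending alarm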

Data storage

The database schema may be found here. For now there is no documentation describing all the stored procedures that run on the DB and summarize the received data. To ease certain common tasks, a table with run information is included in the database. It contains:
  • mu - average mu, from plots like these
  • lumi - luminosity
  • time - duration of the run
  • use - a manually set flag saying whether this run is OK to use; it can be used to exclude runs whose information is corrupted.
  • ReadyForPhysics - a flag obtained from the AMI DB. All runs having this flag set to True are distributed.
This table is updated by the script perfStatusOfAmi.py, run by a cron job every day at 13:00.
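
For common selections one can in principle query this table directly, modelled on the amiCommand example in the Direct access section below. Note that only the column names come from the list above; the table name runInfo and the runNumber column are assumptions, so check the schema first.

amiCommand SearchQuery processingStep=utility project=coll_sizes_01 sql="SELECT runNumber, mu, lumi FROM runInfo WHERE use=1 AND ReadyForPhysics=1"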

How to use it?

Classify objects/algorithms

Independently of the type of visualization you will use, and unless you are interested only in job performance data, you should always first make sure that objects/algorithms are properly classified. This is done through the AMI interface. The path to it:
  • Start from the AMI home page (AMI home).
  • In the menu find: Applications->Atlas->AMI admin
  • The menu changes and now also contains a Database option
  • Click on it and you get a long list of databases. You need the one named COLL_SIZES_01
  • Click on it (not on the "+" sign) and then bookmark this page, as you will need it. This bookmark is not your browser's but AMI's (you will see it in the menu)
  • Clicking on ATLAS_AMI_COLL_SIZES_01_LAL in 'database name' gives a list of all the relevant tables. For details on a table schema, click on the table name. To look at the data in a table, click on the (Browse) link. For classification purposes the two important tables are AlgoRef and object.

Initial object classification was done by hand. Objects are classified separately for each data format (AOD, ESD, DPD, ...). Algorithms are classified by the script Tools/PerformanceReports/updateAlgoCategories.py, which is kept up to date by ThomasKittelmann. Currently the only persons having the rights necessary to change the classification are DavidRousseau, IlijaVukotic, ThomasKittelmann and RD Schaffer.

NEVER CHANGE INFO IN OTHER TABLES UNLESS YOU KNOW EXACTLY WHAT YOU ARE DOING - for example, removing a category or stream name will delete all the data ever classified in that category/stream.

Data mining

To ease investigative data mining, a program for easy browsing of the data is provided. It is available after setting up any recent ATLAS environment (asetup rel_0,noTest is always a good bet). The program is called performanceBrowser.py. It gives a prompt where one can issue commands. Just typing help and pressing enter gives output like this:
>>>help
Documented commands (type help <topic>):
========================================
add  addCut  debug  export  remove  removeCut  show  stat  table
Undocumented commands:
======================
exit  help

You can also get help on the individual commands:

>>>help table
        select the table you would like to browse.
        syntax: table <obj>
        possibilities:
        <object>    - object sizes
        <alg>       - algorithm performances
        <job>       - job performances

Here I describe how to extract timing information for all algorithms of one run in one data stream.

  1. Since we want info on algorithms, we do:
        >>>table alg 
        rows selected:  1797807 
    Some commands print out the number of rows (records) that satisfy all the imposed criteria. You can show this number at any time using the stat command. It is important to remember that AMI cannot return more than 30k rows, so before doing export make sure you do not exceed this value. It is also not a good idea to export that many rows, as the output file can quickly explode in size.
  2. to select the run we want, we do:
        >>>addCut runNumber=180664
        cut (runNumber=180664) added.
    If in doubt about a variable name, you can always use the add command without any arguments and it will print all the variables you can use.
  3. to see all the streams written for this run, we can do:
        >>>add stream
        column ( stream ) added
        >>>stat uni
        rows selected:  24092
        ----- unique values of stream ------ 7
        Muons
        Egamma
        JetTauEtmiss
        CosmicCalo
        Background
        MinBias
        ZeroBias
        
  4. Now we can select a stream:
        >>>addCut stream='Egamma'
        cut (stream='Egamma') added.
  5. Similarly we can select the processing step we want:
        >>>addCut algo_proc_step='RAWtoESD'
  6. We check that the number of rows selected is reasonable:
        >>>stat
        rows selected:  3250
  7. Now we add all the variables that we would like to see in the exported file:
        >>>add events cpuIni cpuFin cpuTime algoName algoCategoriesName
        column ( events ) added
        column ( cpuIni ) added
        column ( cpuFin ) added
        column ( cpuTime ) added
        column ( algoName ) added
        column ( algoCategoriesName ) added
  8. To check the values we use the show command, which by default prints the first 10 rows. For more rows do show some_number.
        algoName    |     algoCategoriesName    |     stream    |     cpuIni    |     cpuTime    |     events    |     cpuFin    |     
        ===================================
                  CmbTowerBldr.LArFCalCmbTwrBldr |           Other |               Egamma |       10.000 |        0.000 |     13 |        0.000 | 
                        EmTowerBldr.LArEmTwrBldr |           Other |               Egamma |       10.000 |        0.000 |     29 |        0.000 | 
                        EmTowerBldr.LArEmTwrBldr |           Other |               Egamma |       10.000 |        0.000 |      1 |        0.000 | 
                                TrigBSExtraction |         Trigger |               Egamma |        3.371 |       12.643 | 431829 |        0.000 | 
                     CmbTowerBldr.TileCmbTwrBldr |           Other |               Egamma |        9.286 |        0.000 |     14 |        0.714 | 
                  TopoTowerBldr.TopoTowerTwrBldr |           Other |               Egamma |       10.000 |        0.000 |      1 |        0.000 | 
                      CmbTowerBldr.LArCmbTwrBldr |           Other |               Egamma |       10.000 |        0.000 |      4 |        0.000 | 
                                TrigBSExtraction |         Trigger |               Egamma |        4.099 |       12.590 |  17977 |        0.000 | 
                            ManagedAthenaTileMon |   DQ monitoring |               Egamma |        4.000 |        5.805 | 431829 |        0.722 | 
  9. Two export formats are supported:
    1. a gzipped pickle file, produced by doing export. The file contains two dictionaries, one with the row data and one with the column names.
    2. a comma-separated-values file, convenient for import into Excel, produced by doing export csv
          >>>export run180225.Egamma.RAWtoESD csv 
          exporting the data to file: run180225.Egamma.RAWtoESD
          writing CSV format 
  10. Finally we can
    exit
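
Once exported, the CSV can be aggregated with a few lines of Python. This is a minimal sketch, assuming the export wrote a header row with the column names shown above (the exact output file name and whether an extension is appended may differ in practice). It computes event-weighted CPU time per algorithm category, in the spirit of the caveats listed below:

import csv
from collections import defaultdict

# Sketch: sum CPU time and events per algorithm category from the
# exported CSV, then report CPU time per event for each category.
cpu = defaultdict(float)
events = defaultdict(float)
with open("run180225.Egamma.RAWtoESD") as f:  # file name from step 9; may differ
    for row in csv.DictReader(f):
        cat = row["algoCategoriesName"]
        cpu[cat] += float(row["cpuTime"])
        events[cat] += float(row["events"])

for cat in sorted(cpu):
    if events[cat] > 0:
        print("%-20s %g cpuTime/event" % (cat, cpu[cat] / events[cat]))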

If more convenient, one can put the commands into a script like this one and execute it by doing performanceBrowser.py < myScript.txt (a sketch of such a script follows the list below). While one can also import performanceBrowser.py from a python script and directly call the functions corresponding to the commands (they always start with "do_", e.g. do_exit()), this is probably not a great idea; if you need a feature of general usability, just ask for it. Finally, a word of warning: when summing things up, keep in mind that:

  1. you often have to weight results by the number of events in the run (as in the aggregation sketch above)
  2. you need significant statistics - check the number of events seen
  3. you need to check that runs are not counted multiple times due to multiple re-processings with different tags
  4. you are not summing apples and oranges (for example, summing event sizes across different formats)
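
A sketch of such a script, simply reproducing the walkthrough above line by line (the attached myScript.txt is the authoritative example):

table alg
addCut runNumber=180664
addCut stream='Egamma'
addCut algo_proc_step='RAWtoESD'
add events cpuIni cpuFin cpuTime algoName algoCategoriesName
stat
export run180664.Egamma.RAWtoESD csv
exit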

Direct access

There is a way to use AMI commands directly to access the data. This is one very simple example:
amiCommand SearchQuery  processingStep=utility project=coll_sizes_01 sql="SELECT count(*) FROM objectSize  WHERE  runNumber='180124' "
In order to understand how to build a query it is indispensable to look at the db schema. If you are not an SQL expert you may find it helpful to do the following: start performanceBrowser.py and issue the command debug. The effect is that for every query made later in the browser you get a printout of everything needed to repeat the query using amiCommand. AMI also provides a python API; for that please read their manual or look at the examples given in Tools/PerformanceReports/share.
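
As a sketch, the same query can also be issued from Python by shelling out to amiCommand; this is just plumbing around the command line shown above (the native python API may be more convenient, see the AMI manual):

import subprocess

# Sketch: run the amiCommand example from this section and print its output.
cmd = ["amiCommand", "SearchQuery",
       "processingStep=utility",
       "project=coll_sizes_01",
       "sql=SELECT count(*) FROM objectSize WHERE runNumber='180124'"]
output = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]
print(output)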

Helper scripts

While the performanceBrowser.py program can be scripted to do most everyday tasks, there are cases where one may need to access the data at a deeper level, using SQL commands directly through the AMI interface. One example is the perfStatusOfAmi.py script. Doing perfStatusOfAmi.py -h gives help on its options. As an example of its usage, this is how one can get the percentage of events seen in a range of runs:
~ >perfStatusOfAmi.py -fj -r 182072-182100                                                                                            
Ami Command:  sql=SELECT distinct runnumber, stream, jobprocessingstep FROM jobperformance WHERE ( ( runnumber BETWEEN 182072 AND 182100) ) 
---------------------------------------------------
182072 
---------------
        stream:  CosmicCalo             ESDtoAOD         events:     110041       ami:   110840   fraction seen:  99.0
        stream:  CosmicCalo             RAWtoESD         events:     109361       ami:   110840   fraction seen:  99.0
182073 
---------------
        stream:  CosmicCalo             ESDtoAOD         events:     367092       ami:   367270   fraction seen:  100.0
        stream:  CosmicCalo             RAWtoESD         events:     367056       ami:   367270   fraction seen:  100.0
In case there is a script you find generally useful, please add it to this package.

Standard plots

Standard plots that should be monitored are all accessible from the web Performance Browser. The site has been made using HTML5, CSS 3.0, JavaScript and ASP. Please let me know if there are problems with your browser. Currently available:

  1. per-job info: for a selected repro step, every stored variable can be shown vs. run number, with the series corresponding to physics streams.
  2. disk size: for a selected format and stream, the disk size of each category is shown. (Caveat lector: the weighting makes no special accounting for objects whose number of entries differs from the number of events, i.e. objects from non-CollectionTree trees.)
  3. algo info: for a selected repro step and stream, the CPU timing of each category of algorithms (tools) is shown.
All the plots and tables can also be shown against quantities other than run number. All info is available in graph, table and csv formats. It is possible to select a subrange of runs.

Currently proposed additions are:

  1. time evolution of the n largest containers and the n slowest algorithms (it is not obvious how to define largest/slowest, as that depends on the moment in time)
  2. sometimes the same run is processed with two different f-tags; how the info should be reported in that case has to be fixed (currently the easiest way is to not show test reconstructions, using "show use")

In the future, a switch for selecting between the T0 and prodSys databases will be added.

Visualization

This part will be written when I finally get permission to connect the performance SharePoint site with the ReportServer.

Additional information

For any additional information please don't hesitate to contact me directly.

-- IlijaVukotic - 19-Apr-2011

Topic attachments
  • monitoring_-_proper_way.pdf (PDF, 5216.4 K, 2011-05-09, IlijaVukotic) - presentation of Performance Monitoring at the ADC meeting
  • myScript.txt (text, 0.2 K, 2011-06-22, IlijaVukotic) - example script to be used with performanceBrowser.py
  • performanceInAMIdesign.pdf (PDF, 9.5 K, 2011-06-22, IlijaVukotic) - AMI db schema