Investigation of T0 Reconstruction Jobs

ATLAS Reconstruction, OpenStack VM

In this example the data was stored at the Wigner Computing Centre. A Reconstruction Job ('RAW to AOD') with 100 RAW events as input was run. The total input dataset is 2.6 GB in size and contains 3025 events, leading to an average event size of ~860 kB.
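As a quick sanity check of these numbers, a minimal Python sketch (the constants are simply the figures quoted above):

    # Average event size implied by the quoted dataset size and event count.
    dataset_bytes = 2.6e9
    n_events = 3025
    print(dataset_bytes / n_events / 1e3, "kB per event")  # -> ~860 kB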

The monitoring examples below show the results of one Job; the profile looked similar in each test iteration. All tests were run on machines where similar Jobs had been run before, so the CVMFS cache was already filled and in use. The granularity of the monitoring data is 5 seconds.
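The specific monitoring tool is not described here. Purely for illustration, a minimal Python sketch of how such 5-second samples could be collected with the psutil package (an assumption - not the tool actually used for these tests):

    import time
    import psutil  # assumed to be installed; not the tool actually used here

    INTERVAL = 5  # seconds, matching the 5 s granularity of the monitoring data

    def sample_forever():
        # Report network-in, CPU, memory and disk-write figures once per
        # interval, roughly mirroring the quantities plotted below.
        prev_net = psutil.net_io_counters()
        prev_disk = psutil.disk_io_counters()
        while True:
            time.sleep(INTERVAL)
            net, disk = psutil.net_io_counters(), psutil.disk_io_counters()
            print(int(time.time()),
                  (net.bytes_recv - prev_net.bytes_recv) / INTERVAL,      # B/s received
                  psutil.cpu_percent(),                                   # % since last call
                  psutil.virtual_memory().used,                           # bytes in use
                  (disk.write_bytes - prev_disk.write_bytes) / INTERVAL)  # B/s written
            prev_net, prev_disk = net, disk

    sample_forever()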

Profile

The classification below is based on the monitoring data. Unfortunately this information is not yet included in the JobReport (see https://its.cern.ch/jira/browse/ATLASJT-291).

Overall, the reconstruction Job can be split into several processing steps that, taken together, explain the observed behaviour:

  1. High network activity, low CPU, memory activity, no disk write
  2. Medium network (bumps), high CPU activity, no memory activity, disk write
  3. No network, medium CPU activity, memory activity, no disk write
  4. Medium network, low CPU activity, no disk write
  5. No network, medium CPU activity (high/low/high), disk write
These 5 steps translate to the following Job processes:
  1. Initialisation or startup phase
  2. Reconstruction (RAW to ESD)
  3. Initialisation/startup second step
  4. Reconstruction (ESD to AOD)
  5. Finalisation or cleanup phase
These 5 steps are shown in the plot below, where the total Network I/O, the user CPU usage, the Memory utilisation and the Disk I/O are plotted (disk read is zero or almost zero throughout the Job). The Job begins and ends according to the time displayed on the x-axis; note that there are 3 different y-axes. The test reconstructed 100 events from RAW to AOD on a machine with 8 GB of memory - more than the Job needs or in general has available, but it gives a picture of an almost ideal scenario.

all_in_one.png
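For illustration only, the classification of a single monitoring sample into one of these 5 steps could be expressed as a simple rule table; the bucketing of the raw rates into 'none'/'medium'/'high' is hypothetical and would need tuning to the real data:

    # Illustration only - not the actual analysis code.
    def classify(net, cpu, mem_active, disk_write):
        # net: 'none' | 'medium' | 'high', cpu: 'low' | 'medium' | 'high'
        # mem_active, disk_write: booleans derived from the monitoring data
        if net == "high" and cpu == "low" and not disk_write:
            return 1  # initialisation / startup
        if net == "medium" and cpu == "high" and disk_write:
            return 2  # reconstruction RAW to ESD
        if net == "none" and cpu == "medium" and mem_active and not disk_write:
            return 3  # initialisation / startup of the second step
        if net == "medium" and cpu == "low" and not disk_write:
            return 4  # reconstruction ESD to AOD
        if net == "none" and cpu == "medium" and disk_write:
            return 5  # finalisation / cleanup
        return 0      # sample not matching any of the 5 patterns

    print(classify("high", "low", False, False))  # -> 1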

In step 1 the environment is set up by fetching all the necessary data. Most of the code is already in CVMFS, so only the detector geometry and the calibrations matching the particular dataset have to be fetched remotely.

In step 2 the data is read event by event and reconstructed from RAW to ESD. In the plots above this was shown for 100 events, where this step already dominates the consumed time. Typically it is better (less time consuming) to reconstruct even more events per Job, in order to minimise the relative time spent on the startup (and cleanup) phases, because their duration does not vary with the number of events.
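As a back-of-the-envelope illustration (not a measurement), the two run durations quoted in the network section below (a 1-event Job ending a bit after ~900 s, a 100-event Job ending at ~2550 s) can be used to separate a fixed startup/cleanup time from a per-event time:

    # Rough split of the Job time into a fixed part and a per-event part.
    t1, t100 = 900.0, 2550.0
    t_evt = (t100 - t1) / 99    # ~16.7 s per event
    t_fixed = t1 - t_evt        # ~880 s startup + cleanup

    for n in (1, 100, 1000):
        total = t_fixed + n * t_evt
        print(n, "events:", round(total), "s total,", round(total / n, 1), "s/event")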

Step 3 looks comparable to step 1, but because of its short duration it is harder to separate. It too starts up a new reconstruction process.

Step 4 then reconstructs ESD to AOD, the difference being that it does not read as much remote data.

In step 5 the results of step 4 are stored to disk, Job statistics/reports are generated and the Job exits.

In total, steps 3+4 are shorter than steps 1+2 by more than a factor of 2.

Components

Below, the individual components are looked at separately.

Network traffic

The baseline network traffic captured was very low and almost steady compared to the traffic caused by the Job (background peaks up to 2 kB/s, whereas the smallest peaks of interest are ~400 kB/s). For this reason the background is not considered any further in the following plots.
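For illustration, such a background cut could look as follows in Python; the 10 kB/s threshold is a hypothetical value sitting between the ~2 kB/s background and the ~400 kB/s smallest peaks of interest:

    THRESHOLD = 10e3  # bytes per second, hypothetical cut

    def suppress_background(samples):
        # samples: iterable of (timestamp_s, bytes_per_second) pairs
        return [(t, rate) for t, rate in samples if rate >= THRESHOLD]

    print(suppress_background([(0, 2e3), (5, 400e3)]))  # keeps only the peak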

The graph below shows a Job that reconstructed one event (beginning at t=0 and ending a bit after t=900 s).

tcp_1evt_1.png

The next plot shows a Job reconstructing 100 events. The Job does not end when the x-axis does, but continues until ~2550 s; note also that the scale is different.

tcp_100evt_filter.png

The beginnings of the two executions look similar up until 550 s (upper plot) or 650 s (lower plot), the major difference being that the lower plot seems to be delayed by ~100 s.

Afterwards, the difference between the 1-event and the 100-event graphs is clearly the ~53 little black bumps in the 100-event plot (the total number of bumps over the whole Job is ~100), indicating that the individual events are read (streamed) one after the other as input during Job execution. The 'little bumps' come from a different source than the rest of the traffic (see filter, port 1095: xroot): they originate from an EOS disk server at CERN. This corroborates that they are the physics events being read.

The red line shows all the traffic; apart from where it overlays the black line, the remaining relevant traffic all uses one port (3128) and goes to caproxylbp.cern.ch, as shown in the conversations list:

conversation_list_100evt_2.png

The list is sorted by bytes. The first item is the machine itself. The second item (p051....cern.ch) is the EOS disk server (84.7 MB divided by 100 events corresponds roughly to the expected ~850 kB/event); eosatlas.cern.ch is contacted directly with only little traffic. The following caproxylbp.cern.ch can actually be two separate things: CVMFS and the ATLAS Frontier service (for the detector calibrations, geometries, constants, ...). Since this Job was run with CVMFS already cached, that part is small. At the beginning of the Job ccwbvip14.in2p3.fr is contacted, which is the ATLAS Metadata Interface (AMI); each dataset has an AMI-tag. ip-dns-1.cern.ch is the domain name system. Everything else is very small and can be attributed to background. The following gives an overview of the relevant endpoints contacted in the setup phase and during Job execution:

connections.png

mylcgvoms2.cern.ch is contacted in order to create the proxy.

Furthermore, zooming in on a 'bump' shows a width of ~0.6 s, varying slightly with the height.

1_evt_broadnesspng1.png

CPU

The CPU is not used at maximum capacity, which is usually a result of the CPU waiting for input. Correlating the input source (network) with the CPU shows that this cannot explain all of it:

cpu_network.png
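One way to quantify this statement would be a simple correlation between the two 5-second time series; a sketch with numpy, assuming the two series had been exported as aligned one-column text files (the file names are hypothetical):

    import numpy as np

    net = np.loadtxt("network_bytes_per_s.txt")  # assumed export of the series
    cpu = np.loadtxt("cpu_user_percent.txt")     # assumed export of the series

    # Pearson correlation: strongly negative would mean the CPU idles exactly
    # while input arrives; a weak value supports the conclusion above that
    # the network alone cannot explain the idle time.
    print(np.corrcoef(net, cpu)[0, 1])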

The high network traffic peaks at the beginning can explain some of the idle CPU time. What happens towards the end of the Job is the following: the Job transforms RAW data to AOD not in a single step, but first RAW to ESD and afterwards ESD to AOD. The high CPU usage towards the end occurs when the second step of the Job starts. Part of the CPU time is then allocated to the system (to perform the transition; the same can be observed at the beginning of the Job) rather than being idle. Here the system percentage only goes up to 25%, because the Job was run on one core of a 4-core machine, which is accounted differently:

cpu_system.png
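The 25% figure is simply one fully used core out of four reported as a machine-wide aggregate, as this toy calculation illustrates (the per-core values are hypothetical):

    # One busy core on an otherwise idle 4-core machine.
    per_core = [100.0, 0.0, 0.0, 0.0]  # hypothetical per-core usage in %
    print(sum(per_core) / len(per_core), "%")  # -> 25.0 %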

Memory

The new second step in the process leads to a lot of faults in the paging activity (see below). The black curve of major faults follows the trend of the red curve, only on a smaller scale (see the second plot, which zooms in on the major faults). This indicates that data has to be loaded into memory (major faults) or reassigned (minor faults) when the new process starts.

paging.png

Major faults cost a lot more time than minor faults, which is why they are still relevant even though they are fewer by a factor of ~40.

major_faults.png
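A toy cost model illustrates why the major faults remain relevant despite being ~40 times rarer; the per-fault costs are hypothetical orders of magnitude only (a major fault has to go to disk, a minor fault involves only page-table work):

    minor_cost_s = 5e-6   # hypothetical cost per minor fault
    major_cost_s = 5e-3   # hypothetical cost per major fault
    n_minor = 40000       # hypothetical count
    n_major = n_minor // 40  # the ~40x ratio observed above

    print("minor faults:", n_minor * minor_cost_s, "s")  # 0.2 s
    print("major faults:", n_major * major_cost_s, "s")  # 5.0 s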

As explained at the beginning, the memory available to this Job was unusually large. That is why no swapping was observed; swapping will be covered in the next series of tests, where the memory will be limited. The profile of the memory usage nevertheless shows the transition between the RAW, ESD and AOD stages explained above:

memory.png

After the first part of the Job is finished, it looks like the memory is freed and has to be reloaded for the second part of the Job.

2 GB RAM

For a virtual machine with 2 GB of RAM, memory and swapping look as follows (note the huge difference in Job execution time for the same number of events; one contribution is that the last part is prolonged by the creation of histograms (TBD)).

memory_2gb_vm.png

Swapping:

swap_2gb_vm.png

Paging (note the difference, especially in pgpgin/s, compared to the previous plot with no memory limit):

paging_2gb_vm.png

CVMFS cache

At the beginning it was mentioned that all these Jobs were executed with an already filled CVMFS cache.

The CVMFS cache was cleared:

cvmfs2 4096000 125175 3970826 4% /cvmfs/cvmfs-config.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas-condb.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/sft.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas-nightlies.cern.ch

This shows the 'empty' CVMFS cache under /cvmfs/, which now contains ~125 MB.
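Assuming the listing follows the df convention of 1 kB blocks, the 'used' column reproduces the quoted numbers, including the ~1.2 GB growth reported further below:

    # 'Used' column of the two listings, interpreted as 1 kB blocks (an
    # assumption based on the df-like format).
    used_cleared = 125175     # after clearing the cache (listing above)
    used_after_job = 1317537  # after Job execution (listing further below)

    print(used_cleared / 1e6, "GB used after clearing")        # ~0.125 GB
    print((used_after_job - used_cleared) / 1e6, "GB growth")  # ~1.2 GB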

Clearing the cache before Job execution influences the network traffic in the following way (the Job is not plotted to the end, since the only area that differs from the network plots already shown is at the beginning):

io_cacheclear_split.png

As in the network plot above, the black curve depicts only the physics data, whereas the red curve shows the whole traffic. Compared to the plot where the cache was filled, the traffic before the actual reconstruction process is increased (a longer phase until the black peaks start, and higher red-only peaks beforehand), due to the fetching of data that is not available from the cache.

Looking at the cache after Job execution shows a difference of ~1.2 GB that is now occupying the cache:
cvmfs2 4096000 1317537 2778464 33% /cvmfs/cvmfs-config.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas-condb.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/sft.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas-nightlies.cern.ch

In comparison, the total network traffic of the Job only increased by about 300-400 MB (CVMFS transfers data compressed, while the cache stores it uncompressed, which can explain the difference).

AthenaMP

The above examples were Jobs that ran on one core and streamed the data from a remote location during reconstruction. The T0 use case, however, is different: the data is copied onto the local disk of the VM and processed with AthenaMP (the multi-process framework) on multiple cores. This leads to some differences in the Job profile.

First of all there is the RAW data download component, done here through xrdcp (from the same location as before):

download3.png

This took 172 s for the whole 2.6 GB, corresponding to an average transfer rate of ~15 MB/s.
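A quick check of the quoted rate (pure arithmetic on the numbers above):

    # Average transfer rate implied by the quoted numbers.
    size_bytes = 2.6e9
    duration_s = 172
    print(size_bytes / duration_s / 1e6, "MB/s")  # -> ~15.1 MB/s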

Afterwards the whole profile is shown (the data is now read from disk instead of over the network):

all_comments.png

The graph shows a Job reconstructing the full 3025 events on 4 cores, also creating histograms.

The Job is split into 6 parts as follows:

  1. Initialisation
  2. Reconstruction (RAW to ESD) + HIST
  3. Reconstruction (ESD to AOD) + HIST
  4. Merge AOD
  5. Merge HIST
  6. Validation
The previous plot did not discriminate between disk read and write; this can be seen here:

disks1.png

This indicates that the areas with lower CPU usage correspond to higher disk read activity.

For comparison, the result of the same Job run at a different point in time is shown next. In the following plot there was a lot of swapping during the periods when the CPU was not used and the disk usage went up:

figure_1.png

Note that this Job therefore took much longer. The corresponding paging statistics can be seen in the following:

cpu_vs_page.png

The next plot shows the same effect, but over a shorter period of time; this Job therefore finished much faster.

3025_atlascms.png

Surprisingly, processing the same 3025 events with 3 threads in parallel takes a similar amount of time, which may be explained by the periods during which the CPUs are idle in the 4-thread case:

1.png

In comparison the same workflow without histograms as output was run:

all_nohist_3025.png

The whole Job was significantly faster (by almost a factor of 2), which could also be a result of other users running processes on the host machine of the VM (the CPU efficiency looks higher in the second plot). The major difference is highlighted by the red arrow in the second plot: at this point in the profile, the first plot has a long (over 5000 s) stretch with low CPU usage and a high disk read rate (see also the plot below).

-- GerhardFerdinandRzehorz - 2016-02-24
