Data Popularity of the EOS files
The purpose of this working page is to collect feedback and requests about the popularity service, looking at the data accessed inside EOS at CERN.
Two major goals should be achieved:
- validation of the monitoring workflow
  - the monitoring tool must provide the expected answers for known aspects of the EOS system (read-only rates, number of pps test files accessed per hour, etc.)
- collection of the metrics needed to satisfy the monitoring purposes
  - this implies formulating well-defined questions to be converted into specific data aggregations and DB queries
Brief description of the monitoring workflow
- Collector of the Xrootd detailed monitoring data
  - Based on the UDP packet listener developed by M. Tadel and described in this talk: Tadel-UsCmsXrdMon-Lyon-20111122.pdf
  - xrootd monitor configuration: xrootd.monitor all flush 30s mbuff 1472 window 5s dest files (PLEASE CHECK)
- Messaging System for Grid (MSG)
  - Publish-subscribe model
  - Reduces the number of services collecting the UDP packets
  - Several consumers can access the MSG broker
- MSG consumer to Oracle DB
  - Collects the UDP packets cached in the MSG broker and uploads them to the DB.
  - NB: only read file information is uploaded
- Web frontend to expose query results
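The first two steps of the chain above (UDP packets flowing into a broker that buffers them for consumers) can be sketched very schematically with standard-library tools. This is an illustrative sketch, not the actual listener/MSG code; `run_listener` is a hypothetical helper, and a simple in-process queue stands in for the MSG broker:

```python
import socket
import queue

def run_listener(sock, msg_queue, max_packets):
    """Receive UDP datagrams and hand them to the queue (the 'broker' role)."""
    for _ in range(max_packets):
        data, _addr = sock.recvfrom(65535)
        msg_queue.put(data.decode())

# Demo on localhost: push one monitoring-style packet through the chain.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"unique_id=xrd-1326472302000000", ("127.0.0.1", port))

q = queue.Queue()
run_listener(recv, q, max_packets=1)
msg = q.get()
print(msg)

recv.close()
send.close()
```

In the real service the broker is a separate MSG instance and several consumers subscribe to it; the point of the sketch is only the datagram-to-queue hand-off.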
More details about the popularity service already developed for CMS, and about its extension to the EOS data, can be found in
Data collected for popularity purposes
Not all the data in the UDP packets are used for popularity.
The data currently received in the UDP packets are sent when a file is closed, and summarize the activity on that file from open time to close time.
UDP packet content
- unique_id=xrd-1326472302000000
- file_lfn=/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/TkAlCosmics0T-v4/000/166/065/58C1BF80-6C8E-E011-AB24-001D09F24024.root
- start_time=1326472302
- end_time=1326472302
- read_bytes=0
- read_operations=0
- read_min=0
- read_max=0
- read_average=0.000000
- read_sigma=0.000000
- write_bytes=0
- write_operations=0
- write_min=0
- write_max=0
- write_average=0.000000
- write_sigma=0.000000
- read_bytes_at_close=642348
- write_bytes_at_close=0
- user_dn=
- user_vo=
- user_role=
- user_fqan=
- client_domain=cern.ch
- client_host=lxbrg1204
- server_username=
- server_domain=cern.ch
- server_host=lxfsre07a02
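Since each record is a flat list of `key=value` pairs like the one above, turning it into a structured row is straightforward. A minimal sketch (the `parse_record` helper is hypothetical, not part of the service; field names are taken from the dump above):

```python
def parse_record(lines):
    """Parse one 'key=value' UDP summary record into a dict.

    Numeric values are converted to int or float; everything else
    (including empty values such as user_dn=) stays a string.
    """
    record = {}
    for line in lines:
        key, _, value = line.partition("=")
        try:
            record[key] = int(value)
        except ValueError:
            try:
                record[key] = float(value)
            except ValueError:
                record[key] = value
    return record

sample = [
    "unique_id=xrd-1326472302000000",
    "start_time=1326472302",
    "end_time=1326472302",
    "read_bytes_at_close=642348",
    "client_domain=cern.ch",
    "user_dn=",
]
rec = parse_record(sample)
print(rec["read_bytes_at_close"])
```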
popDB content
Only the following data are stored in the DB:
- unique_id=xrd-1326472302000000
- file_lfn=/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/TkAlCosmics0T-v4/000/166/065/58C1BF80-6C8E-E011-AB24-001D09F24024.root
- start_time=1326472302
- end_time=1326472302
- read_bytes=0
- read_bytes_at_close=642348
- write_bytes_at_close=0
- user_dn=
- user_vo=
- client_domain=cern.ch
- client_host=lxbrg1204
- server_username=
- server_domain=cern.ch
- server_host=lxfsre07a02
Fields removed
The data not included in the popDB are listed here:
- write_max
- read_sigma
- read_average
- read_bytes
- user_fqan
- user_role
- read_min
- write_operations
- write_average
- write_bytes
- write_min
- read_max
- write_sigma
- read_operations
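The reduction from the full packet to the popDB row is just a field projection. A sketch of it, with the kept-field set transcribed from the popDB content list above (note that read_bytes appears both in the kept and in the removed list above; the sketch follows the popDB content list; `to_popdb_row` is a hypothetical helper):

```python
# Fields kept in the popularity DB, transcribed from the popDB content list.
KEPT = {
    "unique_id", "file_lfn", "start_time", "end_time", "read_bytes",
    "read_bytes_at_close", "write_bytes_at_close", "user_dn", "user_vo",
    "client_domain", "client_host", "server_username", "server_domain",
    "server_host",
}

def to_popdb_row(packet):
    """Keep only the popDB fields from a full packet dict."""
    return {k: v for k, v in packet.items() if k in KEPT}

packet = {
    "unique_id": "xrd-1326472302000000",
    "read_bytes_at_close": 642348,
    "read_sigma": 0.0,        # dropped
    "write_operations": 0,    # dropped
}
row = to_popdb_row(packet)
print(sorted(row))
```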
Monitoring page
The entry point for the monitoring page of EOS Atlas is
http://dashboard28.cern.ch/eosatl
The entry point for the monitoring page of EOS CMS is
http://dashboard28.cern.ch/popdb/xrdpopularity/
We will mainly use EOS Atlas for this validation, but sometimes it will be useful to compare with EOS CMS as well.
Since the development of the GUIs for CMS and for our EOS Atlas test proceeds in parallel, the two GUIs will not necessarily evolve in the same way. Do not expect to find exactly the same information in both GUIs; I will try to keep them as symmetric as possible.
Since the GUI is still being deployed, it may not be accessible at all times.
Validation
DB content
First of all, we should make sure that the information collected in the popularity DB is exhaustive, or determine whether other fields should be added. Please have a look at the
removed fields and at the
kept fields, and comment on them. Keep in mind that currently only files accessed in read mode are stored in the popularity DB.
Efficiency of the collection workflow
Given that the data collection workflow consists of several steps (UDP packets -> MSG broker -> DB), with services running at each step to extract and handle the information, I am strongly interested in validating this workflow, to verify that we do not have any inefficiency.
To do that, we can compare the measured rate of read files with the expected rate, for the full set of files in EOS and/or for specific files that are systematically read for test purposes.
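One simple way to do this comparison is to bucket the file-open timestamps from the popDB into hours and flag every hour whose count falls below what the test schedule predicts. The helpers below are a hypothetical sketch of that check, not part of the service:

```python
from collections import Counter

def hourly_counts(access_times):
    """Bucket file-access timestamps (epoch seconds) into hour bins."""
    return Counter(t // 3600 for t in access_times)

def missing_hours(counts, start_hour, end_hour, expected_per_hour):
    """Hours in [start_hour, end_hour] whose count falls below expectation."""
    return [h for h in range(start_hour, end_hour + 1)
            if counts.get(h, 0) < expected_per_hour]

# Toy data: a test file expected to be read once per hour, with hour 2 missing,
# e.g. a packet lost somewhere in the udp -> MSG -> DB chain.
times = [3600 * h for h in (0, 1, 3, 4)]
counts = hourly_counts(times)
gaps = missing_hours(counts, 0, 4, expected_per_hour=1)
print(gaps)
```

A gap found this way still has to be cross-checked against EOS itself, since it could equally be a real pause in the test accesses or a loss in the collection chain.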
pps monitor
The number of pps test files accessed per hour in the time range [StartDate,
EndDate] by the user dteam001 in read-only mode is shown at
http://dashboard28.cern.ch/eosatl/xrdmonplotppstest
It would be interesting to know whether EOS people can confirm that the numbers shown are as expected.
In particular, are the glitches expected and known? Is there any way to check whether these glitches really happened on the EOS side, or whether they are a consequence of the popularity collection chain?
For CMS, two recursive pps accesses are found and shown in this plot:
http://dashboard28.cern.ch/popdb/xrdpopularity/xrdmonplotppstest
Do they correspond to different tests, as it appears?
In particular:
- pps_dteam accesses files: /eos/ppsscratch/test/slstest-eospps/test-eospps.cern.ch-from-srmmon04.cern.ch-xrdcp.static
- pps_srmmon accesses files: /eos/ppsscratch/test/slstest-eospps/test-srm-eospps.cern.ch-PPSEOSSCRATCHDISK-4f930089-cfe9-4bb7-83b9-4adf2869cf59
Monitoring Requests
Please put here your desiderata in terms of metrics and aggregations, so that we can extract the information you need:
- Access patterns - how much of each file has been read?
- Access patterns - which files were most popular? (idea: automatic suffix removal to get the top 10 most common prefixes; an algorithm may need to be developed).
See the plots on Phillip Zigann's page at
ZigannGeneralRequirementsECC and his initial presentation to the
IT-DSS group. Example: pie chart on the percentage of file read, by Phillip Zigann.
- File access count (or used bandwidth) grouped by client network: use DNS to group by DNS domain, and group non-resolvable hosts into subnets (class B or below). Ideally groups should be auto-detected (i.e. no preconfigured list; just show the top 10 automatically).
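For the "top prefixes" idea above, one possible algorithm is simply to strip the file-specific suffix by keeping only the first few path components of each LFN and counting accesses per resulting prefix. This is a hypothetical sketch (the `top_prefixes` helper and the `depth` cut-off are assumptions, not part of the service):

```python
from collections import Counter

def top_prefixes(lfns, depth, n):
    """Keep only the first `depth` path components of each LFN
    (dropping the file-specific suffix) and count accesses per prefix."""
    prefixes = ("/".join(lfn.split("/")[:depth + 1]) for lfn in lfns)
    return Counter(prefixes).most_common(n)

# LFN shapes modeled on the packet dump above.
lfns = [
    "/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/a.root",
    "/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/b.root",
    "/eos/atlas/store/data/RunX/file.root",
]
result = top_prefixes(lfns, depth=4, n=2)
print(result)
```

A fixed `depth` is the crude part; a real implementation would probably want to choose the cut adaptively, e.g. by descending the path tree while a prefix still covers enough accesses.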
--
DomenicoGiordano - 02-Mar-2012
--
SpigaDaniele - 09-Jun-2011