Data Popularity of the EOS files
The purpose of this working page is to collect feedback and requests about the popularity service, looking at the data accessed inside EOS at CERN.
Two major goals should be achieved:
- validation of the monitoring workflow
  - the monitoring tool must provide the expected answers for known aspects of the EOS system (read-only rates, number of pps test files accessed per hour, etc.)
- collection of the metrics needed to satisfy the monitoring purposes
  - this implies formulating well-defined questions to be converted into specific data aggregations and DB queries
Brief description of the monitoring workflow
- Collector of the Xrootd detailed monitoring data
  - Based on the UDP packet listener developed by M. Tadel and described in this talk: Tadel-UsCmsXrdMon-Lyon-20111122.pdf
  - xrootd monitor configuration: xrootd.monitor all flush 30s mbuff 1472 window 5s dest files (PLEASE CHECK)
- Messaging System for Grid (MSG)
  - Publish-subscribe model
  - Reduces the number of services collecting the UDP packets
  - Several consumers can access the MSG broker
- MSG consumer to Oracle DB
  - Collects the UDP packets cached in the MSG broker and uploads them to the DB.
  - NB: only read file information is uploaded
- Web frontend to expose query results
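The first two steps of the chain above (UDP packets flowing into a broker that buffers them for consumers) can be sketched very schematically with standard-library tools. This is an illustrative sketch, not the actual listener/MSG code; `run_listener` is a hypothetical helper, and a simple in-process queue stands in for the MSG broker:

```python
import socket
import queue

def run_listener(sock, msg_queue, max_packets):
    """Receive UDP datagrams and hand them to the queue (the 'broker' role)."""
    for _ in range(max_packets):
        data, _addr = sock.recvfrom(65535)
        msg_queue.put(data.decode())

# Demo on localhost: push one monitoring-style packet through the chain.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"unique_id=xrd-1326472302000000", ("127.0.0.1", port))

q = queue.Queue()
run_listener(recv, q, max_packets=1)
msg = q.get()
print(msg)

recv.close()
send.close()
```

In the real service the broker is a separate MSG instance and several consumers subscribe to it; the point of the sketch is only the datagram-to-queue hand-off.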
More details about the popularity service already developed for CMS, and about its extension to the EOS data, can be found in
Data collected for popularity purposes
Not all the data in the UDP packets are used for popularity.
The data currently received in the UDP packets are sent when a file is closed, and summarize the activity on that file from open time to close time.
UDP packet content
- unique_id=xrd-1326472302000000
- file_lfn=/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/TkAlCosmics0T-v4/000/166/065/58C1BF80-6C8E-E011-AB24-001D09F24024.root
- start_time=1326472302
- end_time=1326472302
- read_bytes=0
- read_operations=0
- read_min=0
- read_max=0
- read_average=0.000000
- read_sigma=0.000000
- write_bytes=0
- write_operations=0
- write_min=0
- write_max=0
- write_average=0.000000
- write_sigma=0.000000
- read_bytes_at_close=642348
- write_bytes_at_close=0
- user_dn=
- user_vo=
- user_role=
- user_fqan=
- client_domain=cern.ch
- client_host=lxbrg1204
- server_username=
- server_domain=cern.ch
- server_host=lxfsre07a02
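Since each record is a flat list of `key=value` pairs like the one above, turning it into a structured row is straightforward. A minimal sketch (the `parse_record` helper is hypothetical, not part of the service; field names are taken from the dump above):

```python
def parse_record(lines):
    """Parse one 'key=value' UDP summary record into a dict.

    Numeric values are converted to int or float; everything else
    (including empty values such as user_dn=) stays a string.
    """
    record = {}
    for line in lines:
        key, _, value = line.partition("=")
        try:
            record[key] = int(value)
        except ValueError:
            try:
                record[key] = float(value)
            except ValueError:
                record[key] = value
    return record

sample = [
    "unique_id=xrd-1326472302000000",
    "start_time=1326472302",
    "end_time=1326472302",
    "read_bytes_at_close=642348",
    "client_domain=cern.ch",
    "user_dn=",
]
rec = parse_record(sample)
print(rec["read_bytes_at_close"])
```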
popDB content
Only the following data are stored in the DB:
- unique_id=xrd-1326472302000000
- file_lfn=/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/TkAlCosmics0T-v4/000/166/065/58C1BF80-6C8E-E011-AB24-001D09F24024.root
- start_time=1326472302
- end_time=1326472302
- read_bytes=0
- read_bytes_at_close=642348
- write_bytes_at_close=0
- user_dn=
- user_vo=
- client_domain=cern.ch
- client_host=lxbrg1204
- server_username=
- server_domain=cern.ch
- server_host=lxfsre07a02
Fields removed
The data not included in the popDB are listed here:
- write_max
- read_sigma
- read_average
- read_bytes
- user_fqan
- user_role
- read_min
- write_operations
- write_average
- write_bytes
- write_min
- read_max
- write_sigma
- read_operations
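The reduction from the full packet to the popDB row is just a field projection. A sketch of it, with the kept-field set transcribed from the popDB content list above (note that read_bytes appears both in the kept and in the removed list above; the sketch follows the popDB content list; `to_popdb_row` is a hypothetical helper):

```python
# Fields kept in the popularity DB, transcribed from the popDB content list.
KEPT = {
    "unique_id", "file_lfn", "start_time", "end_time", "read_bytes",
    "read_bytes_at_close", "write_bytes_at_close", "user_dn", "user_vo",
    "client_domain", "client_host", "server_username", "server_domain",
    "server_host",
}

def to_popdb_row(packet):
    """Keep only the popDB fields from a full packet dict."""
    return {k: v for k, v in packet.items() if k in KEPT}

packet = {
    "unique_id": "xrd-1326472302000000",
    "read_bytes_at_close": 642348,
    "read_sigma": 0.0,        # dropped
    "write_operations": 0,    # dropped
}
row = to_popdb_row(packet)
print(sorted(row))
```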
Monitoring page
The entry point for the monitoring page of EOS Atlas is
http://dashboard28.cern.ch/eosatl
The entry point for the monitoring page of EOS CMS is
http://dashboard28.cern.ch/popdb/xrdpopularity/
We will mainly use EOS Atlas for this validation, but sometimes it will be useful to compare with EOS CMS as well.
Since the development of the GUIs for CMS and for our EOS Atlas test proceeds in parallel, the two GUIs will not necessarily evolve in the same way. Do not expect to find exactly the same information in both GUIs; I will try to keep them as symmetric as possible.
Since the GUI is still being deployed, it may not be accessible at all times.
Validation
DB content
First of all, we should make sure that the information collected in the popularity DB is exhaustive, or determine whether other fields should be added. Please have a look at the
removed fields and at the
kept fields, and comment on them. Keep in mind that currently only files accessed in read mode are stored in the popularity DB.
Efficiency of the collection workflow
Given that the data collection workflow consists of several steps (UDP packets -> MSG broker -> DB), with services running at each step to extract and handle the information, I am strongly interested in validating this workflow, to verify that we do not have any inefficiency.
To do that, we can compare the measured rate of read files with the expected rate, for the full set of files in EOS and/or for specific files that are systematically read for test purposes.
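One simple way to do this comparison is to bucket the file-open timestamps from the popDB into hours and flag every hour whose count falls below what the test schedule predicts. The helpers below are a hypothetical sketch of that check, not part of the service:

```python
from collections import Counter

def hourly_counts(access_times):
    """Bucket file-access timestamps (epoch seconds) into hour bins."""
    return Counter(t // 3600 for t in access_times)

def missing_hours(counts, start_hour, end_hour, expected_per_hour):
    """Hours in [start_hour, end_hour] whose count falls below expectation."""
    return [h for h in range(start_hour, end_hour + 1)
            if counts.get(h, 0) < expected_per_hour]

# Toy data: a test file expected to be read once per hour, with hour 2 missing,
# e.g. a packet lost somewhere in the udp -> MSG -> DB chain.
times = [3600 * h for h in (0, 1, 3, 4)]
counts = hourly_counts(times)
gaps = missing_hours(counts, 0, 4, expected_per_hour=1)
print(gaps)
```

A gap found this way still has to be cross-checked against EOS itself, since it could equally be a real pause in the test accesses or a loss in the collection chain.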
pps monitor
The number of pps test files accessed per hour in the time range [StartDate,
EndDate] by the user dteam001 in read-only mode is shown at
http://dashboard28.cern.ch/eosatl/xrdmonplotppstest
It would be interesting to know whether EOS people can confirm that the numbers shown are as expected.
In particular, are the glitches expected and known? Is there any way to check whether these glitches really happened on the EOS side, or whether they are a consequence of the popularity collection chain?
For CMS, two recursive pps accesses are found and shown in this plot:
http://dashboard28.cern.ch/popdb/xrdpopularity/xrdmonplotppstest
Do they correspond to different tests, as it appears?
In particular:
- pps_dteam accesses files: /eos/ppsscratch/test/slstest-eospps/test-eospps.cern.ch-from-srmmon04.cern.ch-xrdcp.static
- pps_srmmon accesses files: /eos/ppsscratch/test/slstest-eospps/test-srm-eospps.cern.ch-PPSEOSSCRATCHDISK-4f930089-cfe9-4bb7-83b9-4adf2869cf59
Monitoring Requests
Please put here your desiderata in terms of metrics and aggregations, so that we can extract the information you need:
- Access patterns - how much of each file has been read?
- Access patterns - which files were most popular? (idea: automatic suffix removal to get the top 10 most common prefixes; an algorithm may need to be developed).
See the plots on Phillip Zigann's page at
ZigannGeneralRequirementsECC and his initial presentation to the
IT-DSS group. Example: pie chart on the percentage of file read, by Phillip Zigann.
- File access count (or used bandwidth) grouped by client network: use DNS to group by DNS domain, and group non-resolvable hosts into subnets (class B or below). Ideally groups should be auto-detected (i.e. no preconfigured list; just show the top 10 automatically).
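For the "top prefixes" idea above, one possible algorithm is simply to strip the file-specific suffix by keeping only the first few path components of each LFN and counting accesses per resulting prefix. This is a hypothetical sketch (the `top_prefixes` helper and the `depth` cut-off are assumptions, not part of the service):

```python
from collections import Counter

def top_prefixes(lfns, depth, n):
    """Keep only the first `depth` path components of each LFN
    (dropping the file-specific suffix) and count accesses per prefix."""
    prefixes = ("/".join(lfn.split("/")[:depth + 1]) for lfn in lfns)
    return Counter(prefixes).most_common(n)

# LFN shapes modeled on the packet dump above.
lfns = [
    "/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/a.root",
    "/eos/cms/store/data/Run2011A/Cosmics/ALCARECO/b.root",
    "/eos/atlas/store/data/RunX/file.root",
]
result = top_prefixes(lfns, depth=4, n=2)
print(result)
```

A fixed `depth` is the crude part; a real implementation would probably want to choose the cut adaptively, e.g. by descending the path tree while a prefix still covers enough accesses.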
--
DomenicoGiordano - 02-Mar-2012
--
SpigaDaniele - 09-Jun-2011