Castor Monitoring
The purpose of this page is to collect all requirements for additional monitoring in CASTOR. It also covers the
GridFTP, FTS and SRM areas.
All monitoring information should be sampled at short time intervals, the visualization updated regularly (at most every 2 minutes), and all information stored indefinitely, to be able to compute trends.
One of the crucial units is the disk pool; the following questions are relative to a given disk pool:
1. what is the total input and output data rate to/from a pool ? --> this is already implemented in lemon
2. what is the input and output data rate to/from tape servers ?
3. what is the input and output data rate to/from the client nodes (e.g. lxbatch) ?
4. what is the input and output data rate to/from external sites ?
5. how are the rates split over the different protocols (rootd, rfiod, GridFTP, xrootd) ?
6. how many files of size X have been read/written during the last Y time interval ?
7. what percentage of files (number and total size) has not been read during time interval X ?
8. number of files staged into a pool more than once in time interval X ?
9. age distribution of files in time interval X ?
10. transfer performance histograms for the different protocols per time interval X ?
11. histograms of the achieved compression ratios on tape per time interval X ?
12. which files are currently on disk in a pool ?
13. LSF job throughput numbers: how many jobs have started/finished during a time interval, per pool queue and with user details, like the batch monitoring ?
14. 2D view of nodes and throughput over time (to identify non-working nodes and judge the load balancing)
15. monitoring presentation of the GB switches. CS enabled this already 9 months ago, but there are
no lemon plots yet; this is essential for congestion evaluation
16. throughput of SRM requests by type
17. backlog of SRM requests
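Questions 6-9 above are essentially aggregations over per-file access records. A minimal sketch, assuming a hypothetical record layout (file size, last-read timestamp, stage count) extracted from the stager catalogue; field names and the record source are assumptions, not the actual CASTOR schema:

```python
def access_stats(files, now, interval):
    """Aggregate per-file access records into the counters asked for above.

    files:    list of dicts with 'size' (bytes), 'last_read' (Unix timestamp,
              or None if the file was never read) and 'stage_count' (number
              of times the file was staged into the pool).
    now:      current Unix timestamp.
    interval: look-back window in seconds (the "time interval X").
    """
    cutoff = now - interval
    # Question 7: files not read during the interval (count and total size).
    unread = [f for f in files
              if f['last_read'] is None or f['last_read'] < cutoff]
    # Question 8: files staged into the pool more than once.
    restaged = sum(1 for f in files if f['stage_count'] > 1)
    return {
        'unread_fraction': len(unread) / len(files),
        'unread_bytes': sum(f['size'] for f in unread),
        'restaged_count': restaged,
    }
```

The same per-file records would also feed the size and age histograms of questions 6 and 9; only the grouping key changes.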
The CASTOR operations team currently has a "parallel" lemon facility where some of these questions are answered and
more details are presented:
http://castoradm1/new/
http://castoradm1/new/c2svc_class.php?cluster=c2atlast0%25t0perm&detailed=Get%20Detailed20Information&auto_update=Not%20auto%20update&time=0
This needs to be integrated into the official lemon system.
The iptables monitoring has started, but is still only local to the disk servers and needs to be accessible in lemon.
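Until those counters are published into lemon, they can at least be sampled locally on each disk server. A hedged sketch of extracting the per-rule packet and byte counters from the output of `iptables -L <chain> -v -x -n` (the name and layout of the accounting chain are assumptions specific to each setup):

```python
def parse_iptables_counters(output):
    """Sum the packet and byte counters over all rules in the output of
    `iptables -L <chain> -v -x -n`.

    With -v -x, each rule line starts with two exact integer columns
    (pkts, bytes); header and policy lines do not, so they are skipped.
    """
    total_pkts = total_bytes = 0
    for line in output.splitlines():
        fields = line.split()
        # Rule lines begin with two integer counter columns.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            total_pkts += int(fields[0])
            total_bytes += int(fields[1])
    return total_pkts, total_bytes
```

Sampling this periodically and differencing successive byte totals yields a rate that a lemon sensor could publish.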
A very important view for debugging and load-distribution analysis is the state of all servers in a collection over time in a 2D plot, which
does not yet exist.
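Such a 2D view boils down to binning per-node throughput samples into a node x time matrix that can then be rendered as a heat map; non-working nodes stand out as rows of zeros. A minimal sketch; the node names and the (node, timestamp, rate) sample format are hypothetical:

```python
def throughput_matrix(samples, nodes, t0, bin_seconds, nbins):
    """Bin (node, timestamp, MB/s) samples into a node x time matrix.

    samples:     iterable of (node_name, unix_timestamp, rate) tuples.
    nodes:       row order of the matrix (one row per server).
    t0:          start of the time window (Unix timestamp).
    bin_seconds: width of one time bin.
    nbins:       number of time bins (samples outside are ignored).

    Samples falling into the same bin are summed; an idle or dead node
    keeps a row of zeros, which is exactly what the 2D plot should expose.
    """
    matrix = [[0.0] * nbins for _ in nodes]
    index = {name: i for i, name in enumerate(nodes)}
    for node, ts, rate in samples:
        b = int((ts - t0) // bin_seconds)
        if node in index and 0 <= b < nbins:
            matrix[index[node]][b] += rate
    return matrix
```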
--
BerndPanzersteindel - 03 Apr 2007