Progress Report, 2 February 2011
Infrastructure Status
- Integration sites: UCSD, FNAL, Purdue, Wisconsin
  - UCSD passes all monitoring.
  - Purdue passes heartbeat, but not redirector tests (needs TFC change).
  - Wisconsin has problematic dCache servers. Mostly passes heartbeat. Fails redirector tests; needs TFC change.
  - FNAL passes neither heartbeat nor redirector tests.
- Production sites: Caltech, Nebraska
  - Nebraska passes all monitoring.
  - Caltech passes heartbeat, but not redirector. Needs TFC change.
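The per-site results above can also be kept as a small table, so that the list of sites still failing a given test comes from a lookup rather than a reread of the bullets. This is a minimal sketch, not part of the project's monitoring: the field names and the `failing` helper are illustrative assumptions, and the values are transcribed from the bullets above (Wisconsin's heartbeat is recorded as "mostly" because it only mostly passes).

```python
# Site status transcribed from the status bullets; field names are illustrative.
# "mostly" records Wisconsin's partial heartbeat result.
SITES = {
    "UCSD":      {"role": "integration", "heartbeat": "pass",   "redirector": "pass"},
    "Purdue":    {"role": "integration", "heartbeat": "pass",   "redirector": "fail"},
    "Wisconsin": {"role": "integration", "heartbeat": "mostly", "redirector": "fail"},
    "FNAL":      {"role": "integration", "heartbeat": "fail",   "redirector": "fail"},
    "Nebraska":  {"role": "production",  "heartbeat": "pass",   "redirector": "pass"},
    "Caltech":   {"role": "production",  "heartbeat": "pass",   "redirector": "fail"},
}


def failing(test):
    """Return the sites whose result for `test` is not a clean pass, sorted by name."""
    return sorted(name for name, status in SITES.items() if status[test] != "pass")


if __name__ == "__main__":
    for test in ("heartbeat", "redirector"):
        print(test, "not passing:", ", ".join(failing(test)))
```

The table records test results only. Which sites need a TFC change is stated separately in the bullets (Purdue, Wisconsin, Caltech) and is not inferred here.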
Action Item Status
Action items and progress from last week:
- Use Monalisa as a monitoring system: the repository/web interface will run at UCSD; we got a machine for that on Feb 9. We have a working session with one of the developers (Ramiro) on Feb 10.
- Write up deliverables and milestones: XrootdUscmsTimeline
- Start maintenance of UCSD xrootd/hdfs system: Done, but there seem to be issues with Hadoop at UCSD: very long delays in file access.
- Start work on converting a physicist's analysis to use Xrootd: Started, but we need to break this down into 2-week-sized chunks. [MT comment: well, I just got code from Ben.]
- Service monitoring: Using Nebraska instance for now. Have heartbeat and redirector-based monitoring, as described here. We still need random file monitoring, JobRobot monitoring, and alerts.
Other items not from last meeting:
- Writeup of CMSSW I/O needs: CmsRootIoIssues.
  - Brian (maybe Matevz?) will likely spend some serious time investigating the first 5 issues. Issue 1 is a concern for this project.
- Checklist for sites: XrootdProductionChecklist.
- Writeup of development items needed from Xrootd team: CmsRootIoIssues.
Action Items for Next Two Weeks
- UCSD UAF cluster. [MT comment: I thought Alja would be free after this week to help me with that, but we just loaded a bunch of Fireworks stuff on her. And partially on me, too.]
- Improved service monitoring (missing tests, alerts).
- Clarify plans for JobRobot with Andrea Sciaba.
- Fix dcap deadlock issues (Brian).
- CMSSW TTreeCache management for 4_2_0 (Brian).
- Upgrade release to 3.0.2; test cmsd throttling from Andy.
- Update project webpages: remove references to demonstrator, add information about architecture we're working on deploying.
- Continue Monalisa monitoring investigation.