ADCoSWIP

Requirements

  • ADCoS Shifter has to be an ATLAS member.
  • All current ADCoS Shifters are registered in the e-group https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=atlas-project-adc-operations-shifts
  • Grid Certificate requirements
    • Every ADCoS Shifter on duty (Trainee, Senior, or Expert) is required to have a valid grid certificate registered to VO atlas at the time of the shift, see WorkBookStartingGrid. If an ADCoS Shifter starts a shift without a valid grid certificate registered to VO atlas, the shift booking will be cancelled and the Shifter will not get any OTP credit for the shift. Recurring certificate issues may result in losing the possibility to sign up for ADCoS Shifts.
    • The Shifter's valid grid certificate must be in the /atlas/team VOMS group. ADCoS Coordinators add it at Step 2 of the Trainee Shifter setup procedure. If for some reason your certificate is not yet in the /atlas/team VOMS group, ask the ADCoS Coordinators in advance to add it. In particular, do not forget to do so when you get a completely new certificate.
    • Check whether you are able to submit an ATLAS TEAM GGUS ticket at https://ggus.eu/?mode=ticket_team (you should see the TEAM option on your screen). Opening a TEAM ticket requires a valid grid certificate (1) in your browser (2) with the /atlas/team VOMS group (3) known to GGUS (4).
  • OTP requirements
    • It is strictly forbidden to book more than 1 shift within 24 hours. Violating this rule may result in losing the possibility to sign up for ADCoS Shifts.
    • An ADCoS Shifter books shifts in her/his own name and takes the shift as the person who booked it. It is forbidden to book a shift in one person's name on behalf of a different person. The ADCoS Coordinator can book shifts on behalf of other persons; such a shift is booked in the name of the Shifter.
    • If an emergency makes it impossible to take a shift, the ADCoS Shifter immediately notifies the ADCoS Coordinator and the ADCoS Shift Captain of the shift timezone. The ADCoS Coordinator then cancels the shift booking in OTP. The ADCoS Coordinator or ADCoS Shift Captain may announce a Call for shifters to find a replacement Shifter. Contacts can be found on the Team page.
    • Booked shifts can only be cancelled at least 45 days in advance; after that, the shifter has to find a replacement.

CHECKLIST

NEW The For ADCOS Shifters page has useful CHECKLIST links on one page.

  • Before coming to your shift, make sure that you fulfill shift requirements!
  • Open Jabber and say hello in the Virtual Control Room to let the ADC/ADCoS community know you are on shift.
  • List of current shifters to check who covers each shift.
  • Check recent updates and additions to ADCoS procedures in this TWiki; they are marked with a NEW stamp.
  • If TEAM GGUS tickets do not work for you, please DO NOT SUBMIT non-TEAM tickets. Ask another shifter, ADCoS Coordination, or ADCoS experts to submit a TEAM ticket. When you open a TEAM GGUS ticket, please decide which priority you need, see How_to_Submit_GGUS_Team_Tickets. Top priority problems are rare; they arise for Tier-1s and especially for CERN-PROD.

  • At the beginning of your shift please have a look at the Known Problems and Daily_SHIFT_reports of previous shifters
  • Check ADC eLog
  • Check the status of the TEAM tickets in GGUS and hand-over. See TicketManagement
  • Remember to report every action in eLog. (To create a 'new' entry, click on an existing entry first. If you solve an issue, put [SOLVED] in the eLog subject.)
  • Remember to check if sites are in Scheduled Downtime before opening bugs.

Some of these tasks might require the shifter to contact the ADCoS Expert. If there is no ADCoS Expert on duty, the shifter can ask the ADCoS Coordinators.

Most Common Mistakes by Shifters

  • Opening a regular GGUS ticket rather than a GGUS Team ticket.
  • Opening a duplicate GGUS ticket. Please don't forget to check the list of open tickets before submitting a new GGUS ticket.
  • Reopening a closed GGUS ticket or updating an existing one, rather than opening a new GGUS ticket, when the issue/problem is different from the one in the existing GGUS ticket. Please check with the expert shifter if you are not sure whether it is a new problem or a different manifestation of one which has already been reported.
  • Submitting a GGUS ticket for a site in downtime. Please always check the AGIS Downtime Calendar before opening a new ticket.
  • Forgetting to write the site name in the subject of the GGUS ticket. It is strongly advised to start the subject with the site name. That makes browsing by ticket subject (GGUS/ELOG) much easier. If the site name is at the end of a lengthy subject line, it may not show in the summary list of GGUS tickets.
  • NEW Forgetting to add the cloud support address atlas-adc-cloud-[CLOUD]@cern.ch in the CC field of the GGUS ticket, or adding the address atlas-support-cloud-[CLOUD]@cern.ch, which GGUS cannot process. Please always add the cloud support address atlas-adc-cloud-[CLOUD]@cern.ch in GGUS CC.
  • NEW Coming to shift with an expired grid certificate, or with a new certificate that is not in the /atlas/team VOMS group, and hence having problems opening a GGUS Team ticket. Before coming to your shift, please verify that you are able to open a GGUS Team ticket.
  • Forgetting to add an ELOG entry after opening a new GGUS/Jira ticket. Major status updates, as well as closing the ticket, need an ELOG entry as well.
  • Opening a new ELOG thread on an evolving issue which already has one or more entries in ELOG. Please continue the existing thread instead.
  • Forgetting to submit an evaluation report for the participating trainee shifter.
  • Using an email address from the TWiki without editing it to exclude SPAMNOT. Remove the word SPAMNOT, otherwise the email will bounce back.
  • NEW Submitting a ticket about mcXX_valid tasks to the validation Jira. As this TWiki clearly states, only tasks starting with valid should be reported in the validation Jira. mcXX_valid tasks should be reported in the ADCSUPPORT Jira.
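
The cloud support address mentioned in the list above follows a fixed pattern, so it can be built and checked mechanically. The following is only a minimal sketch: the function names are hypothetical, the cloud names are the ones listed in the Missing files section of this page, and upper-case cloud names are assumed because that is how the pattern is written here.

```python
# Cloud names as listed on this page for atlas-adc-cloud-[CLOUD]@cern.ch.
CLOUDS = {"CA", "CERN", "DE", "ES", "FR", "IT", "ND", "NL", "RU", "TW", "UK", "US"}

PREFIX = "atlas-adc-cloud-"
SUFFIX = "@cern.ch"


def cloud_support_address(cloud: str) -> str:
    """Return the cloud support address to put in the GGUS CC field."""
    name = cloud.strip().upper()
    if name not in CLOUDS:
        raise ValueError(f"unknown cloud: {cloud!r}")
    return f"{PREFIX}{name}{SUFFIX}"


def is_valid_cc(address: str) -> bool:
    """True only for the atlas-adc-cloud-* form; atlas-support-cloud-* is rejected."""
    if not (address.startswith(PREFIX) and address.endswith(SUFFIX)):
        return False
    return address[len(PREFIX):-len(SUFFIX)].upper() in CLOUDS
```

For example, `cloud_support_address("de")` gives the address for the DE cloud, while `is_valid_cc` rejects the atlas-support-cloud form that GGUS cannot process.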

MC production

General guide-line (Fast Troubleshooting)

  • Spot sites with major problems
    • Start chasing major failures in both interfaces, for instance a cloud failing 100% of data import/export.

BigPanda monitor

  • Go to error distribution page
    • First look for sites with a lot of failing jobs
    • Then look for tasks with a lot of failing jobs
    • Compare them to know if problems are site related or task related
      • If a task is failing at several sites, it is probably a task problem: file an ADCO-Support Jira bug report for non-validation tasks or a Validation Jira bug report for validation tasks.
      • When a task is failing due to missing files, please follow the #Missing_files procedure.
      • If all jobs are failing at one single site, it is probably a site-related problem: file a GGUS team ticket.
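
The site-versus-task comparison above can be sketched in a few lines. This is an illustration only, not a BigPanda feature: the input format (one `(task_id, site)` pair per failed job), the function name and the `min_failures` threshold are all assumptions, and the site rule used here (several tasks failing at the same site) is one reading of the guideline.

```python
from collections import defaultdict


def classify_failures(failed_jobs, min_failures=10):
    """Guess whether failures look task related or site related.

    failed_jobs: iterable of (task_id, site) pairs, one per failed job.
    min_failures: hypothetical threshold below which a group is ignored.
    """
    sites_per_task = defaultdict(set)
    tasks_per_site = defaultdict(set)
    count_per_task = defaultdict(int)
    count_per_site = defaultdict(int)
    for task_id, site in failed_jobs:
        sites_per_task[task_id].add(site)
        tasks_per_site[site].add(task_id)
        count_per_task[task_id] += 1
        count_per_site[site] += 1

    suspects = {"task": [], "site": []}
    for task_id, sites in sites_per_task.items():
        # One task failing at several sites points at the task.
        if len(sites) > 1 and count_per_task[task_id] >= min_failures:
            suspects["task"].append(task_id)
    for site, tasks in tasks_per_site.items():
        # Several tasks failing at one site points at the site.
        if len(tasks) > 1 and count_per_site[site] >= min_failures:
            suspects["site"].append(site)
    return suspects
```

A task suspect would go to Jira and a site suspect to a GGUS team ticket, after the usual downtime and duplicate checks.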

Job-states definitions in Panda

There are 10 values in Panda describing the possible states of a job:

  • defined : job-record inserted in PandaDB
  • assigned : dispatchDBlock is subscribed to site
  • waiting : input files are not ready
  • activated: waiting for pilot requests
  • sent : sent to a worker node
  • running : running on a worker node
  • holding : adding output files to DQ2 datasets
  • transferring : output files are moving from T2 to BNL
  • finished : completed successfully
  • failed : failed due to errors

The normal sequence of job-states is the following:

 defined -> assigned -> activated -> sent -> running -> holding -> transferring -> finished/failed
If input files are not available:

 defined -> waiting
then, when files are ready
  -> assigned -> activated
And the workflow is:
  • defined -> assigned/waiting : automatic
  • assigned -> activated : received a callback for the dispatchDBlock. If jobs don't have input files, they get activated without a callback.
  • activated -> sent : sent the job to a pilot
  • sent -> running : the pilot received the job
  • waiting -> assigned : received a callback for the destinationDBlock of upstream jobs
  • running -> holding : received the final status report from the pilot
  • holding -> transferring : added the output files to destinationDBlocks
  • transferring -> finished/failed : received callbacks for the destinationDBlocks
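
The workflow above is a small state machine. The sketch below transcribes only the transitions listed on this page into a lookup table and checks a job history against it; the names `TRANSITIONS` and `is_valid_history` are hypothetical, and transitions not listed here (for example a job failing from an earlier state) are deliberately left out.

```python
# Allowed job-state transitions, transcribed from the workflow list above.
TRANSITIONS = {
    "defined": {"assigned", "waiting"},
    "waiting": {"assigned"},
    "assigned": {"activated"},
    "activated": {"sent"},
    "sent": {"running"},
    "running": {"holding"},
    "holding": {"transferring"},
    "transferring": {"finished", "failed"},
}


def is_valid_history(states):
    """Return True if each consecutive pair of states is a listed transition."""
    pairs = zip(states, states[1:])
    return all(nxt in TRANSITIONS.get(cur, set()) for cur, nxt in pairs)
```

A history that skips a step, such as defined followed directly by running, is reported as invalid.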

The job brokering for production is listed in PandaBrokerage#Special_brokerage_for_production

The delay for job rebrokering is listed in PandaBrokerage#Rebrokerage_policies_for_product

Task-states definitions in Panda

  • registered : the task information is inserted to the JEDI_Tasks table
  • defined : all task parameters are properly defined
  • assigning : the task brokerage is assigning the task to a cloud
  • ready : the task is ready to generate jobs
  • pending : the task has a temporary problem
  • scouting : the task is running scout jobs to collect job data
  • scouted : all scout jobs were successfully finished
  • running : the task is running jobs
  • prepared : outputs are ready for post-processing
  • done : all inputs of the task were successfully processed
  • failed : all inputs of the task failed
  • finished : some inputs of the task were successfully processed but others failed or were not processed since the task was terminated
  • aborting : the task is being killed
  • aborted : the task is killed
  • finishing : the task is forced to get finished
  • topreprocess : preprocess job is ready for the task
  • preprocessing : preprocess job is running for the task
  • tobroken : the task is going to be broken
  • broken : the task is broken, e.g., the task definition is wrong
  • toretry : the retry command was received for the task
  • toincexec : the incexec command was received for the task
  • rerefine : task parameters are going to be changed for incremental execution
For more details, see https://twiki.cern.ch/twiki/bin/view/PanDA/PandaJEDI#Transition_of_task_status

What to do when

A task is failing

  • If the task is assigned to the CERN cloud but to a queue other than CERN-PROD (i.e. CERN-BUILDS, CERN-RELEASE, CERN-UNVALID, CERNVM, CERN_8CORE), ignore it. If the failing task is running in the CERN-PROD queue, please follow the standard procedure.
  • Validation tasks. Submit an ATLAS validation Jira ticket for validation tasks (those beginning with valid). There is no need to file a bug for tasks with a small number of failures or to report bugs that have already been reported:
    • Make sure that the bug has not been reported before.
  • Other tasks (those not beginning with valid): use ADCO-Support for non-validation tasks with a high failure rate.

A task is not assigned to site

A site is heavily failing

  • If the burst of errors is restricted to less than a few hours and there are no more errors (in the Panglia plot, the number of failed jobs rises sharply and stays flat afterwards), no action is needed.
  • If jobs are continuously heavily failing:
    • If the site is in downtime, the site should have already been automatically set to test (check sites and incidents pages)
    • Make sure it is a true site issue (not an Athena issue, for example)
    • File a GGUS Team-Ticket as described in this section with the cloud responsible in CC.

A site is not getting jobs

  • Remember that if the site has no jobs assigned, it has nothing to run.
  • Check that software versions requested by current are installed at the site (monitoring).
  • Check if site/queues are online
    • If queue is offline and site not in downtime and if there's no incident ongoing, contact the cloud responsible, and file eLog
    • If queue is offline and there's no incident ongoing, contact the cloud responsible, ADCoS coordinators and file eLog.
  • Check if pilots run in the last hours at the site
    • If you find problems, file an eLog entry and contact the cloud responsible and the pilot factory responsible
  • More elaborate checks are to be done by the squad

Details about errors

Panda error codes (mapping of error code to diagnostics message) can be found here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/PandaErrorCodes

'Lost Heartbeat' error

The pilot sends an update to the Panda server every 30 minutes. If there is no update within 6 hours, the job is declared 'lostheartbeat'. When the job is finished, the pilot tries to update the Panda server 10 times, at 2-minute intervals. PandaPilot contains all details about pilots.

  • Most common reason: the local batch system has killed the job because it used more than the allowed resources (CpuTime, WallTime, memory). The ATLAS requirements are published in the VO ID Card. By comparing similar jobs on different sites, try to identify which variable is the problematic one. In this case, the failing jobs should be spread over time.

  • Site or batch system is broken: the failing jobs should be concentrated within a period of a few minutes.

  • CE has lost track of the job

If you ticket the site, provide an example to the site (jobID to be documented).
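
The heartbeat timing described at the start of this section can be expressed as a simple check. This is a sketch under the numbers given on this page (update every 30 minutes, timeout after 6 hours); the function names are hypothetical and this is not the Panda server's actual code.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=30)  # pilot update period (from this page)
HEARTBEAT_TIMEOUT = timedelta(hours=6)      # silence after which the job is lost


def is_lost_heartbeat(last_update: datetime, now: datetime) -> bool:
    """True if no pilot update arrived within the 6-hour timeout."""
    return now - last_update > HEARTBEAT_TIMEOUT


def missed_updates(last_update: datetime, now: datetime) -> int:
    """Number of expected 30-minute updates that did not arrive."""
    return max(0, (now - last_update) // HEARTBEAT_INTERVAL)
```

A job is therefore declared lost only after roughly twelve consecutive updates have been missed, which is why a short network glitch alone should not produce this error.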

'Job killed by signal 15'

The local batch system issued a warning to the pilot, informing it that the job would be killed soon. The job stopped itself to be able to report the log.

'Exception caught in runJob'

No documentation yet

'Pilot has decided to kill looping job'

No documentation yet

'Get error : open connection to sename' or 'Get error: dccp/rfcp failed'

The pilot is not able to copy the input file from the SE. It means that the SE is totally or partially broken. Cross-check with DDM transfers.

'Get error: with guid xxx not found at'

The dataset is supposed to be available at the site but the file catalog scan reveals that the file is not at the site. No action for the moment.

'GUID for xxxx not found in DQ2 '

It usually means that a file was removed from the dataset definition after the task defined the input files. In most cases, the file was found to be unavailable on the SE with no other replica to recover it from. For the moment, no action is required from the shifter. The jobs will fail quickly and it is up to GDP to treat the task properly.

'Put error'

No information yet

'/opt/lcg/bin/lcg-cr '

The job is not able to copy the file on SE or register the file in LFC. The pilot tries the command twice with file deletion in between (not correct for the moment)

'Transformation not installed in CE '

No documentation yet

'Transfer time out'

The output was not transferred in time. If you manage to find the transfer on the DDM dashboard, report it as any other failing transfer. If you cannot find it on the DDM dashboard, write an eLog entry with all the information you can get about the issue, then send an email with the link and an explanation to the cloud support so that they can check the activity of the FTS channel.

'No PFN found in catalogue for GUID'

No information yet

ATLAS Software Releases problem

  • Jobs may fail if the needed software release is not present or is badly installed. The procedure to follow is to notify the ATLAS SW managers and ask for re-installation of the release at the site.
  • Asking the GGUS team to add a special tag to TEAM tickets, so that this specific problem reaches the SW responsible, is still to be discussed. The interim solution to follow is:
    • Open a normal GGUS team ticket, then set the "type of problem" to "VO Specific Software" and select "VO Specific"="yes", providing the involved ATLAS release, site name and CE,
      and put atlas-grid-install@cernNOSPAMPLEASE.ch in CC

Missing input files on T2_PRODDISK

Missing files

  • For missing files as inputs for MC jobs taking inputs from T2 PRODDISK, please follow instructions at Missing input files on T2_PRODDISK.
  • Otherwise, follow these instructions:
  • Sometimes files look missing from the site but they were actually never registered or the task is misconfigured. Two simple steps that could help to understand what is happening (in case the errors are ambiguous):

If you have difficulty with the procedure, please contact the expert either in the #ADC_Virtual_Control_Room or via the ADCoS team ML (be aware: if there is no answer after 15 min in the chat, send an email to the list)

Follow the procedure (slides)

  1. From the panda job page where you found the missing file; click the file name
  2. Then on the page opened, you will find a SURL ( srm://... ) or a list of them
    • If a replica for the site investigated is not listed then it is not a site problem
    • If a replica is listed you should check the file with lcg-ls -l
      lcg-ls -l <SURL>
    • If it is ONLINE or AVAILABLE, then try to download it using lcg-cp
      lcg-cp --vo atlas <SURL> <LOCALFILE>
    • If it is accessible, it was a transient problem
    • If it is inaccessible, it is a site problem
  3. If it is a site problem
    1. Open a GGUS Team ticket to the site and report the files that have been lost.
    2. Notify the cloud contact about the missing files by adding the following address in the CC field of the GGUS ticket
      • atlas-adc-cloud-[CLOUD]@cern.ch [where [CLOUD] stands for CA, CERN, DE, ES, FR, IT, ND, NL, RU, TW, UK, US] (see #ContactingCloudSupport)
  4. In any case (either a site problem or not a site problem)
    1. Open DDM ops Jira ticket to notify data management team about the lost files with
      • the lost files information (SURL, file name and the associated dataset)
      • the panda job information (a link to the panda job and TaskID)
      • a link to the GGUS ticket
      • in Mail Notification Carbon-Copy List add cloud support and task owner
      • Important! If the reported issue is urgent, please explicitly state this in eLog. The ADCoS Expert will then escalate the issue to AMOD.
    2. When opening DDM ops Jira please follow the pattern
      • Task type Task xxxxx: task status in XX cloud
        eg
        • MC production Task xxxxx: waiting in XX cloud
        • Reprocessing Task yyyyy: input RAW file corrupted at SITE
        • Group production Task zzzzz: waiting for input in ZZ cloud
      • If you have additional information, eg. the input dataset has been deleted, add this information to Jira:
        • Dataset xxxx deleted from SITE
      • It is important for ddm-ops experts to know the category of the problem (task type, dataset deleted) and the location (cloud, site), so that they can start doing real work without navigating through panda pages to collect this information. TaskID is necessary for ProdSys.
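
Step 2 of the procedure above is a short decision tree, sketched below. The function and argument names are hypothetical; the inputs are what the shifter reads off the file page, the `lcg-ls -l` output and the `lcg-cp` attempt. The page does not say what to conclude when the file is neither ONLINE nor AVAILABLE, so that branch simply defers to the expert, which is an assumption of this sketch.

```python
from typing import Optional


def diagnose_missing_file(replica_listed: bool,
                          locality: Optional[str] = None,
                          copy_ok: Optional[bool] = None) -> str:
    """Walk step 2 of the missing-file procedure and name the outcome.

    replica_listed: a SURL for the investigated site appears on the file page.
    locality: the locality reported by `lcg-ls -l`.
    copy_ok: whether the `lcg-cp` download succeeded.
    """
    if not replica_listed:
        return "not a site problem"
    if locality not in ("ONLINE", "AVAILABLE"):
        # Not covered by the procedure text; deferring is an assumption.
        return "unclear, contact the expert"
    if copy_ok:
        return "transient problem"
    return "site problem"
```

Whatever the outcome, step 4 still applies: the DDM ops Jira ticket is opened in any case.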

checksum errors during dq2-get/lcg-cp

DDM tools can check the consistency of a file on Storage against the LFC/DDM catalogs (filled when the file is registered in DDM and before any replication) on a file-by-file basis. As soon as you have a doubt about a file, follow the procedure:

  1. Check the file consistency:
    • If the file belongs to a tid dataset: run a script to check the consistency of the file on the Storage at the source T1: link
    • If the file does not belong to a tid dataset : use dq2-get which will copy the file locally, compute the checksum and report inconsistency
  2. If the file is not corrupted on the source T1 Storage:
    • Run dq2-get to check the file consistency on the SE used by your application. If the file is correct, go to the next point. If the file is not correct, file a Jira ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
    • Check within your application. For example, it is possible that the file was not copied to the scratch disk associated with the CPU because the disk was full or the copy timeout occurred before the file was completely copied.
  3. If the file is corrupted at the source T1, it needs to be deleted from the Storage and DDM. File a Jira ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
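
When a local copy of the file is already at hand, its checksum can also be compared by hand against the catalog value. The sketch below assumes the catalog stores an Adler-32 checksum, optionally written with an `ad:` prefix; this page does not state which algorithm is used, so check that first. The function names are hypothetical, and dq2-get remains the documented way to do this check.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def matches_catalog(path: str, catalog_checksum: str) -> bool:
    """Compare a local copy against a catalog value such as 'ad:0a1b2c3d'."""
    expected = catalog_checksum.strip().lower()
    if expected.startswith("ad:"):  # assumed prefix convention
        expected = expected[3:]
    return adler32_of_file(path) == expected
```

Reading in chunks keeps memory use flat even for multi-gigabyte files.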

Waiting Jobs Procedure

This is being discussed with the Production operation team.

How to know if the problem is task related or site related?

  • Check to see if any jobs are done (e.g. scouts) by entering the task number in the task field at the bottom of the panda browser. If the scouts have gone through and there are a lot of failures at 1-2 sites, the sites are more suspect than the task.
  • If the same task is found to be failing at several sites, it is probably a task-related problem.
  • If jobs from a task are failing at one single site and running OK at other clusters, it is probably a site-related problem.

NEW Group production jobs

  • Group production is now being handled by the new DPD Production Team (contact: atlas-phys-dpd-coordination@cernNOSPAMPLEASE.ch).
  • Monitoring for group production tasks
  • NEW Twiki with useful info for group production reporting
  • Please report Group production jobs which did not finish within 1 day.
  • A Group production task should run for less than 1 week.
  • Problematic tasks are to be reported to ADCo Support Jira:
    • put "TASK" string and task ID(s) to the Subject
    • put task owner to CC
  • Site issue is to be reported to GGUS.
  • DaTrI requests for Group production datasets - please check at the beginning of your shift.
  • Group contacts
  • More info for Groups: DPDProductionTeam

Group production jobs experience/hints

  • When a job fails with "ATH_FAILURE - Athena non-zero exit", please first make sure that it has inputs defined. If the task has no input defined, mention it in Jira. In that case you do not have to check the Athena logs further or put a long log excerpt into the Jira ticket; the task has to be redefined by the task owner (whom you put into CC of the ticket).

NEW Task in pending state for long time

DDM

  • DDMGlobalOverview
  • Spot most problematic clouds in DDM dashboard: (begin with those in RED, then YELLOW and then BLUE):
    • Click on the Tier-1 name to get a breakdown for the sites. Chase the site(s) that is causing the low efficiency at the cloud by clicking on the error number (breakdown for errors).
      • Understand if the problem is site-related (DESTINATION error):
        • FTS State [Failed] FTS Retries [3] Reason [DESTINATION error during PREPARATION phase: [CONNECTION] failed to contact on remote SRM...
      • or if the problem is outside of this site (SOURCE error):
        • FTS State [Failed] FTS Retries [3] Reason [SOURCE error during PREPARATION phase: [CONNECTION] failed to contact on remote SRM
      • If the error message is: SOURCE error during TRANSFER_PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds
          1. click the number on the right. You will see the list of files with FAILED_TRANSFER
          2. click some of the files to see the history of the file transfer
          3. if you see the error is persistent (many errors for more than 1 day), the problem should be reported explicitly mentioning the error is persistent.
          4. otherwise (if only a few errors, or errors within 1 day), no need to report
      • Sites in DDM are intrinsically linked: a downtime at one site can cause collateral effects at all sites pulling data from it or pushing data to it.
    • For problematic sites check Services column: DQ/Grid Status and report in case it is not OK (by this time only DQ is monitored)
  • If you're new to the Team, please check DDMDashboardHowTo
  • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.

DDM dashboard shows timeout/SRMV2STAGER errors

  • When DDM dashboard shows timeout errors or SRMV2STAGER errors, you should wait and see if the error re-occurs and persists before you submit ticket to a site.
    • The aim of waiting is to make sure that the issue is still there, and to prevent us from sending false alarms to sites.
    • The duration of the waiting period is shown in the error message; it can range from tens of minutes to day(s). In any case please file an eLog entry about the timeout/SRMV2STAGER issue, and mention this issue in your daily report.
  • NEW Staging Statistics and Staging Errors logs. Use these 2 pages to get summary information about the staging failures. If the failure rate is too high, please consult with your fellow ADCoS Expert shifter whether a GGUS ticket should be filed.

What to fill into GGUS ticket subject (short description)

  • Site name or spacetoken name
  • Short description of observed issue:
    • If the FTS error transfer contains 'locality is unavailable' : put locality is unavailable into GGUS ticket subject
      • Example ERROR MSG:
        ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] Source file [srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/atlas/atlasdatadisk/step09/ESD/closed/step09.20201041000054L.physics_C.recon.ESD.closed/step09.20201041000054L.physics_C.recon.ESD.closed._lb0002._0001_1286547114]: locality is UNAVAILABLE]
    • If the FTS error contains 'gridftp_copy_wait: Connection timed out ' : put gridftp_copy_wait: Connection timed out
      • Example ERROR MSG:
        [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://grid05.lal.in2p3.fr:8446/srm/managerv2]. Givin' up after 3 tries]
    • Otherwise: try to describe the problem in up to 4 words. Please avoid phrases like "many transfer errors" when a better error description is provided in the ERROR MSG.
      • If FTS error message states SOURCE error, put SITE_X cannot export data (SITE_X is name of the SOURCE site)
      • If FTS error message states DESTINATION error, put SITE_X cannot receive data (SITE_X is name of the DESTINATION site)
      • If FTS error message states TRANSFER error, put Transfer issues between SITE_X and SITE_Y
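
The subject rules above can be applied mechanically to an FTS error message. The helper below is a hypothetical sketch, not part of GGUS or of any ADC tool: it checks the two quoted error strings first, then falls back to the SOURCE / DESTINATION / TRANSFER wording, and refuses to invent a subject when none of the rules match.

```python
def ggus_subject(site: str, fts_error: str, other_site: str = "") -> str:
    """Build a GGUS ticket subject that starts with the site name."""
    text = fts_error.lower()
    # Specific error strings take precedence over the generic wording.
    if "locality is unavailable" in text:
        return f"{site}: locality is unavailable"
    if "gridftp_copy_wait: connection timed out" in text:
        return f"{site}: gridftp_copy_wait: Connection timed out"
    if "source error" in text:
        return f"{site} cannot export data"
    if "destination error" in text:
        return f"{site} cannot receive data"
    if "transfer error" in text:
        return f"Transfer issues between {site} and {other_site}"
    # No rule matched: the shifter has to describe the problem by hand.
    raise ValueError("describe the problem manually in up to 4 words")
```

Here `site` is the SOURCE site for SOURCE errors and the DESTINATION site for DESTINATION errors, as in the rules above.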

Which Problems to report

  • Tier-1s: No transfer reported at dashboard level for a few hours (first cross-check whether the site is in Downtime).
  • Report DDM errors only if :
    • the source site is problematic (as reported by FTS). The site is probably down or the file is lost
    • T1/T0 <-> T1/T0
    • T1<->T2 within same cloud
    • T1/T0 <-> T2_PRODDISK (affects production), T2_GROUPDISK (group datasets are not aggregated at final destination) (cross cloud or not)
    • Do NOT report issues with T2-T2 transfers.
    • NEW When the ERROR code in the DDM dashboard is [DDM Site Services internal], just report the error following DDM-specific ticket
    • If FTS error means that it is a problem at source (pattern SOURCE in FTS error log)
  • Dashboard: If no transfers are shown in the DDM dashboard, please immediately notify the dashboard team: dashboard-support@cernNOSPAMPLEASE.ch. If you get no response after one hour and the status is unchanged, contact the ADC Expert directly.
    • DDM dashboard: Please report to atlas-adc-expert@cernNOSPAMPLEASE.ch and dashboard-support@cernNOSPAMPLEASE.ch. Please wait until the issue is resolved. Please monitor SAM SRMv2 tests in the meantime http://tinyurl.com/ATLAS-SRM-last48 and report the most recent issues to sites. Do not report to atlas-dq2-support at cern.ch.
    • That could be a side effect of various things:
      • Dashboard agents are not working
      • Site-services are not working
      • No data is transferred to the sites
      • Everything is fine, but there is no data at all (very rarely seen, as there is always traffic either in the Tier-0 or at the production dashboard)
  • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.
  • When a ticket is solved by the site and the issue disappears from our monitoring tools (no new issue of the same kind occurs within 1 hr of the ticket solution), consider the issue solved. When the same issue reoccurs more than 1 hr after the old ticket was solved, please open a new ticket.

Tier-0/Tier-1/Tier2 Data exportation

TBD

What to do when a site has no FREE disk space in space tokens?

  • In most cases, when there is no free disk space in a particular spacetoken, this spacetoken should be blacklisted for writing automatically

  • If you see error DESTINATION error [NO_SPACE_LEFT]
    DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] at Thu Jun 04 20:11:49 CEST 2009 state Failed : space with id=1209 does not have enough space
    
    • This is not a site problem but an ATLAS issue in the usage of the given resources; do not send a GGUS ticket to the site
    • If there is indeed no space left and the spacetoken is not blacklisted, submit a DDM ops Jira ticket with the cloud support in CC so that they can take action (increase the space, reduce the share, etc.)
    • EXCEPTIONS
      • If you see error [NO_SPACE_LEFT] for DATATAPE or MCTAPE, it is a site issue. Send email to atlas-adc-expert(at)cern.ch.

  • The following error messages mean that the log area of the FTS server is full. In this case, submit a GGUS ticket to the site which hosts the FTS server
    [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]
    
    or
    Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] error creating file for memmap /var/tmp/glite-url-copy-edguser/BNL-NDGF__2010-01-16-0659_m91sgu.mem: No space left on device]
    
    or
    [FTS] FTS State [Failed] FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 500 Command failed. : write error: No space left on device]
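
The routing of the "no space" errors above can be summarised as a small decision function. This is a sketch with hypothetical names. Two readings of the text are built in as assumptions: any spacetoken whose name ends in TAPE is treated as the tape exception, and no ticket is suggested when the spacetoken is already blacklisted.

```python
def no_space_action(error: str, spacetoken: str, blacklisted: bool) -> str:
    """Suggest the follow-up for a transfer error that mentions missing space."""
    if "No space left on device" in error:
        # Per this page, this wording means the FTS server log area is full.
        return "GGUS ticket to the site hosting the FTS server"
    if "[NO_SPACE_LEFT]" in error:
        if spacetoken.upper().endswith("TAPE"):
            # Tape tokens are the stated exception: a site issue.
            return "email atlas-adc-expert"
        if blacklisted:
            return "no ticket: spacetoken already blacklisted"
        return "DDM ops Jira with cloud support in CC"
    return "not a space problem"
```

The order of the checks matters: the "No space left on device" wording is tested first because it points at the FTS server rather than at the destination spacetoken.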
    

The DDM endpoints *_LOCALGROUPDISK are not managed centrally or by the site admins. If a DDM endpoint is full, DDM automatically blacklists the site as destination. ADCoS shifter should submit a Jira ticket within ADCo Support for information. The cloud squad should acknowledge and close the ticket. The squad is responsible to inform the local users.

The actions are similar for ATLASGROUPDISK (DDM endpoints mainly called PERF-* or PHYS-*). The Jira ticket should be submitted to ADCo Support Jira and assigned to Group Production and the DPD contact person should be put in CC (list in DPDProductionTeam#Group_DPD_contact_persons). The ticket should be acknowledged by the Group production responsible and closed.

For the other space tokens, an automatic cleaning algorithm is defined and running for all space tokens. If the site is full, it means that the cleaning procedure is not perfect. The cleaning monitoring can be found at

  • More detailed information about the whole procedure (involving AMODs) is at ADCOpsSiteExclusion.

What to do when subscriptions are not processing?

  • If Site Services are suspected, follow: Central Services procedure
  • If the problem is related to data loss or catalogue inconsistencies:
    • File a DDM ops Jira bug
      • Report the error message and associated link.
      • Report the dataset not transferred and when it was done.
      • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.
  • Hardware status

Checking blacklisted sites in DDM

  • When a site is heavily failing and exceeds a certain error threshold (#errors/time), the site is removed from site services so that no further transfer requests are made for the site. This is done by ADCoS Expert Shifters for T2 and T3 sites, and by AMOD for T0 and T1 sites. There can be several reasons for doing this: long scheduled/unscheduled downtime, persistent storage problems, FTS issues, etc.
    • Sites on downtime (GOCDB/AGIS) are excluded automatically.
    • NEW Sites failing SAAB nagios tests are excluded automatically. SAAB blacklistings can be monitored here. Currently only put tests are active, so in case of the site problem only write/upload (w/u) part is blacklisted by SAAB. If the site has also a problem as a source or at deletions (r/f/d), shifter must treat them as regular transfer/deletion failure case. If site was blacklisted by SAAB, then the failing test issue should be followed up in GGUS ticket to the site. See more information in SAAB TWiki.
    • Sites which are not excluded automatically have to be excluded manually.
  • Shifters should check at the end of the shift if the blacklisted sites are still in troubles or if they have solved the problems so the site can be set online again, this has to be mainly done following these steps:
  • Instructions for ADCoS Expert shifters to exclude/re-include spacetoken from DDM: ADCoSExpert#DDM_spacetoken_exclusion

Checking the deletion error rate per site

  • Go to DDM dashboard
  • For each cloud in the table
    • if a site has more than 1000 errors over the last 4 hours:
      • Check if the error rate is constant over these 4 hours
    • Report to the ADCoS expert, who will check whether it is worthwhile to contact the site and, if necessary, file a GGUS ticket
      • file a GGUS ticket to that site with CC to the corresponding cloud support
        • Ticket subject has to contain the site name
          • e.g. Site CA-VICTORIA-WESTGRID-T2 has more than 18k deletion errors in last 4 hrs
        • Ticket details have to contain a list of problematic spacetokens and examples of errors extracted from the error table. The URL can be provided so that the site can check for itself that the error rate has decreased after the correction
      • file an elog
        • reference created GGUS ticket in that eLog
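
The check above can be sketched as a small script. This is only an illustration under stated assumptions: the input format (hourly deletion error counts per site, copied by hand from the dashboard), the function name, and the interpretation of a "constant" error rate as "errors present in every one of the 4 hours" are all hypothetical and are not part of the DDM dashboard.

```python
from typing import Dict, List

# From the procedure: more than 1000 errors over the last 4 hours.
ERROR_THRESHOLD = 1000


def sites_to_report(hourly_errors: Dict[str, List[int]]) -> List[str]:
    """Return the sites worth reporting to the ADCoS expert.

    hourly_errors maps a site name to its deletion error counts for each
    of the last 4 hours (hypothetical input format).
    """
    flagged = []
    for site, counts in sorted(hourly_errors.items()):
        if sum(counts) <= ERROR_THRESHOLD:
            continue
        # Assumed heuristic for "constant": no hour without errors, i.e.
        # the failures are ongoing rather than a single past burst.
        if all(count > 0 for count in counts):
            flagged.append(site)
    return flagged


if __name__ == "__main__":
    example = {
        "SITE-A": [300, 300, 300, 300],  # over threshold, ongoing
        "SITE-B": [5000, 0, 0, 0],       # over threshold, single burst
        "SITE-C": [10, 10, 10, 10],      # below threshold
    }
    print(sites_to_report(example))
```

A site that is flagged here still needs the human judgement described in the steps above; the script only narrows down which sites to look at.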

Panda queues

If a site or a queue at a site is in downtime or is failing heavily, the site should be set to test so that no more jobs are directed to it until the problem is solved. Currently, sites are permanently tested and manipulated by HammerCloud. The current status of queues can be found at the sites page. Recent changes, in general and for a given queue, can be found at the incidents page. Test jobs can be checked here: http://bigpanda.cern.ch/jobs/?jobtype=test&display_limit=100&prodsourcelabel=prod_test .


Central services

Central services (hosted at CERN) to be monitored https://sls.cern.ch/sls/service.php?id=ADC_CS (need NICE login)

Frontier

The Frontier service provides access to the conditions data stored in the 3D databases which is streamed from CERN to several Tier1 sites. Conditions data accessed from Frontier is primarily used in user analysis jobs. Because conditions data changes relatively slowly, many requests are identical, so a series of squid caches has been set up to reduce the load on the Oracle databases. When a job requires conditions data it will first try to get it from a local site squid. If the required data is not in the squid, the squid should connect to the designated Frontier server, which will connect to the Oracle database if it doesn't have the data cached already. The system is set up so that if a site squid or Frontier server fails, the request will try other Frontier / squid combinations in order to get its data. Problems with a site squid or Frontier server should therefore not cause jobs to fail, although they will cause additional load elsewhere. If this is allowed to build up, the whole service could eventually fail.

Periodically (2-3 times per shift) check: http://sls.cern.ch/sls/service.php?id=ATLAS-Frontier

If this is not at 100% for any of the sites for more than an hour check:

If the site is not in a downtime and it is down in both SLS and MRTG then submit an urgent GGUS ticket to the site and cc in atlas-frontier-support@cernNOSPAMPLEASE.ch. If in doubt email atlas-frontier-support and copy in the expert shifter.

Once per shift check: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Frontier_Squid&highlight=false

If the site is red, click on the link. This will take you to the MRTG monitoring page, which will show you when the squid stopped working. Check whether the site is in downtime. If it isn't and the squid has not been responding for more than 4 hours (no "had been up for" line), submit a less urgent GGUS ticket to the site and cc in atlas-frontier-support@cernNOSPAMPLEASE.ch. Exception sites are mentioned on the Known Problem page.
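
The two Frontier checks above can be summarised as a pair of decision functions. This is a minimal sketch, assuming the shifter has already read the downtime, SLS and MRTG status by hand; the function names, argument names and returned strings are hypothetical, and mapping "down in only one of SLS and MRTG" to the "if in doubt" case is an assumption. Exception sites from the Known Problem page are not modelled.

```python
def frontier_server_action(in_downtime: bool, down_in_sls: bool, down_in_mrtg: bool) -> str:
    """Suggested action for a Frontier server below 100% for more than an hour."""
    if in_downtime:
        return "no ticket"
    if down_in_sls and down_in_mrtg:
        return "urgent GGUS ticket to the site, cc atlas-frontier-support"
    if down_in_sls or down_in_mrtg:
        # Assumption: monitoring disagrees, so treat this as the "in doubt" case.
        return "email atlas-frontier-support and copy in the expert shifter"
    return "no ticket"


def squid_action(in_downtime: bool, hours_unresponsive: float) -> str:
    """Suggested action for a site squid shown red in the SSB view."""
    if not in_downtime and hours_unresponsive > 4:
        return "less urgent GGUS ticket to the site, cc atlas-frontier-support"
    return "no ticket"


if __name__ == "__main__":
    print(frontier_server_action(False, True, True))
    print(squid_action(False, 6))
```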

Miscellanea

Contacting the Cloud Support

  • Always CC the Tier-1 expert list when submitting a GGUS ticket
  • CC the cloud support when an action is performed that affects the sites inside the cloud

NEW Contacting the ticket portal support

In case GGUS or Jira tracker pages are not available, you should try to
  • clear the cache of your browser and try again
  • ask other shifters or colleagues if they are affected by this problem or it is only you
  • if you cannot find a cause of the problem on your side, send email to portal's support

Pilot Factories and methods

Communication and Organization

Elog Management

Choosing the right criticality in eLog

  • 1) top priority: data export from CERN, problems at the Tier-0, problems of the central services, central catalog, etc.
  • 2) very urgent: problems at the Tier-1s, such as data from the Tier-0 not being accepted, FTS down
  • 3) urgent: problems that affect the cloud
  • 4) less urgent: others

Replying to eLog entry

  • When replying to an eLog entry, modify the subject of the entry, for example:
    • If you are updating information on a problem which has already been reported, put [update] in the subject
    • If the problem is solved, please put [SOLVED] in the subject
    • If the subject says that queues were set OFFLINE and you are setting them to TEST/ONLINE, reflect this in the subject: queues were set in TEST mode/ONLINE
    • If you see that the eLog subject does not briefly describe the problem (for example, the site name is missing), please modify the subject (add a site name, if appropriate)

Ticket Management

Site naming convention - exceptions

General Rules

  1. Check GGUS Atlas tickets (see How to find tickets section)
    • Now shifters can follow all the TEAM tickets on the GGUS interface
  2. DO NOT open duplicate tickets.
    • If a ticket is already open about the SAME problem follow up on that ticket.
  3. Open only TEAM tickets so that other shifters can find them and follow up
  4. When you open a ticket: Write in the ADCoS eLog the reason it was opened and put a link to the opened GGUS ticket
    • Tickets for the US can now be opened in GGUS so they can also be treated in the same way.
    • Some T3s cannot be found in the list of available sites. In this case, please use the TPM option instead of direct routing to a site. It is very important to put the site name in the description line (for example, SITET3: transfers are failing because certificate has expired).
  5. When you close a ticket: Write the solution in the ADCoS eLog
  6. Write everything that is in between in the ticket. The ticket is now the reference for what happens in between opening and closing.
    • Only the reason the ticket was opened, the link to the ticket, and the solution when the ticket is closed should go in the ADCoS eLog.
  7. When updating a ticket, do not change the ticket status to "waiting for reply"; this status is reserved for sites. That way, when a shifter checks for tickets which need to be updated, it is easy to spot the "waiting for reply" tickets.
  8. Don't open tickets for sites in downtime.
  9. Do not try to re-open an ALARM ticket. If an ALARM ticket is solved, the problem has re-appeared and you can't contact an ADC expert, open a new TEAM ticket.

How to Submit GGUS Team-Tickets (direct routing to sites)

  • Notice that when you open the interface for submitting a new ticket, a label appears at the top: Open TEAM ticket
  • Clicking it takes you to the interface for our special tickets, which get routed directly to the site
    • Set the ticket priority:
      1. top priority : Problem at CERN Services (affecting exports to every site) should be marked as "top priority". This includes LFC at CERN, FTS at CERN, SRM(CASTOR) at CERN
      2. very urgent : Problem at services at Tier-1s (affecting exports to the given Tier-1 and within the Tier-1 cloud) or services at calibration Tier-2s should be marked as "very urgent". This includes LFC at Tier-1s, FTS at Tier-1s, SRM at Tier-1s, SRM at Calibration Tier-2s.
      3. urgent : Any other problem should be marked as "urgent"
      4. less urgent : Informational entries should be marked as "less urgent"
    • Select type of the problem
    • Select MoU Area
    • Select site affected
    • Put cloud support in the CC field - atlas-adc-cloud-[CLOUD]@cern.ch [where [CLOUD] stands for CA, CERN, DE, ES, FR, IT, ND, NL, RU, TW, UK, US] (see #ContactingCloudSupport)
  • GGUS.png
  • All people from ADCoS team with the correct certificate permissions in GGUS machinery can track and follow the tickets opened by anyone in our team.
    • Please notify GGUS support if you have problems accessing it.

  • NEW EXCEPTION : South African sites (ZA-*) are not registered in GGUS. Submit the ticket to ROC NGI_ZA (as of 14 May 2012; this should be solved in the coming weeks)
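
The priority rules and the cloud support CC address described above can be sketched in a few lines. This is an illustration only: the function names and the location labels ("CERN", "T1", "CALIB_T2") are hypothetical, and letting "informational" take precedence over the location is an assumption. The cloud names and the address pattern atlas-adc-cloud-[CLOUD]@cern.ch are taken from the list above.

```python
# Cloud names listed in the instructions above.
CLOUDS = ("CA", "CERN", "DE", "ES", "FR", "IT", "ND", "NL", "RU", "TW", "UK", "US")


def cloud_support_cc(cloud: str) -> str:
    """Build the cloud support address following atlas-adc-cloud-[CLOUD]@cern.ch."""
    cloud = cloud.strip().upper()
    if cloud not in CLOUDS:
        raise ValueError(f"unknown cloud: {cloud!r}")
    return f"atlas-adc-cloud-{cloud}@cern.ch"


def ticket_priority(location: str, informational: bool = False) -> str:
    """Pick a GGUS TEAM ticket priority.

    location is a hypothetical label: "CERN" for CERN services, "T1" for
    Tier-1 services, "CALIB_T2" for calibration Tier-2 services, and
    anything else for all other problems.
    """
    if informational:
        # Assumption: informational entries are always "less urgent".
        return "less urgent"
    if location == "CERN":
        return "top priority"
    if location in ("T1", "CALIB_T2"):
        return "very urgent"
    return "urgent"


if __name__ == "__main__":
    print(ticket_priority("T1"), cloud_support_cc("de"))
```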

Overview of Jira trackers which ADCoS shifter might need

How to find tickets in GGUS

Shift Reports and tickets

To help expert shifters compile the weekly report, and to help your fellow shifters get oriented when they start their shifts:

  • Write the ticket numbers of the tickets you have opened and closed in the shifter report.
    • There is now a dedicated field for it.

Ticket format

GGUS

  • Provide a meaningful subject (non-ATLAS people should understand it), including the site name.
  • Provide time information: when did the failure(s) start to happen?
  • Provide an extract of the error message. The shifter should understand the error before reporting it and translate it from "ATLAS" language into general language.
    • When possible, provide detailed information on the command that failed (if the failure is reproducible)
  • Provide a link to the log file(s) (Panda/production dashboard for MC or DDM dashboard for data distribution)
  • Approximate number of failures (related to the problem reported):
    • Last 12h for Monte Carlo (default Panda monitoring view)
    • Last 4h/24h for Data distribution (possible views in DDM dashboard)
  • Monte Carlo Specific:
    • Node(s) affected
      • Sometimes worker nodes act as black holes. A high number of failures on the same processing host can be evidence of this.
    • Provide the local batch system job ID
    • When providing a link to Panda monitoring, please link to one particular failing job; the "last12h" aggregation link might no longer be valid by the time the site tries to address the ticket

Production or Validation Jira Bug Reporting

  • Task:
    • Task ID
    • Task name
    • Task Progress (Done, ToBeDone, Running, Pending)
    • Task efficiency
    • Task details (release, trf_version, DBrelease)
  • Errors:
    • Error summary. The content of the job log file is accessible from its panda monitoring page (and the panda page is linked from the dashboard). Try the Find and view log files link. If it doesn't work, click on the job log file name (in the table above, file type log). At the bottom of the new page, you find the SURL(s) for the log file and you can download them directly in a shell.
    • Link to the Log files (Panda/ProdSys dashboard)
  • Info flow:
    • Start the ticket body with the line mentioning the task owner’s name (Task Owner: name). To put the correct name write @ symbol and start writing the name of the task owner. A drop-down menu will appear to help you find the right person and complete the name.
    • Add task owner to the watchers (see below).
    • Remember to eLog.
NEW Note! ATLAS Distributed Computing groups have moved away from Savannah and now use the Jira Issue Tracking Service for bug reporting. Jira is quite easy and intuitive to use. To see the list of issues (tickets) in Jira, click on "Issues" in the left side menu. In the "Issues" view you can choose to look at "All Issues" or only at those belonging to a certain category (Unresolved, Added recently, Resolved recently, etc.). To open a new ticket, click on the "Create Issue" button in the upper menu and fill in the form. Jira has no direct CC option, but the same effect is achieved by adding "Watchers". To do so, find the label called "Watchers:" in the right side menu while inside a particular issue (ticket), and click on the colored circle with a number inside (the number indicates how many watchers that issue already has). You will be prompted to add a watcher. Simply start typing the first letters of the name of the person you want to CC (the task owner, for example). Jira will then show a list of matches from which you can select the desired name, add that person as a "Watcher", and send them an email notification every time the ticket is updated.

Downtimes

  • Check the AGIS downtime calendar for downtimes.
  • Check all ongoing entries for the site in question. Also pay attention to downtimes marked as NO_RISK_FOR_ATLAS.

Monitoring tools

Daily SHIFT report

  • Submit your daily shift summary report using the interface located at: Shift report elog form. This triggers an automated shift report that is sent to the ADCoS mailing list; a copy is also stored in the elog: Shift summaries elog

Trainee evaluation report

  • If a trainee shifter participated in the shift, send an e-mail to: 1) the ADCoS coordinators (atlas-adc-adcos-coordinators@cernNOSPAMPLEASE.ch), and 2) the current ADCoS Expert, whose name can be found from the query (PDF) at the top of the Checklist. The e-mail subject should have the form "Trainee evaluation of ShifterName (shift Number), date timezone", where an example of the date and timezone is "10/10/2012 EU", and the total number of trainee shifts taken so far should be reported, like "(shift 3)"
  • Please, report the following (use copy/paste) :
    • Active presence: how much one is present and how much is proactive,
    • Monitoring tools understanding,
    • Errors understanding (at least as far as those explained in the twiki),
    • Ticket handling (learning how not to open duplicates and not to write a single error line... etc etc).
    • Evaluation grade: 0-3 range (1: new shifter, still learning; 2: quite experienced, but not yet ready for promotion; 3: ready to be promoted to senior shifter; 0: very quiet shift, not enough information to evaluate) NEW
  • More detailed description of evaluation grades
    • 0:not enough information to evaluate - for example, no ticket submitted or updated during the shift, not enough interactions and discussions to evaluate the shifter experience.
    • 1:Shifter lacks understanding of ADCoS shifter duties and is in process of learning them.
    • 2:Shifter has a basic understanding of ADCoS duties but makes mistakes while performing them. Examples could be submitting a GGUS ticket to the wrong site or with incomplete information, submitting a JIRA ticket to the wrong tracker or with incomplete information, or making mistakes mentioned in the Most Common Mistakes by Shifters section.
    • 3:Shifter knows how to perform all the duties mentioned in the Checklist. In case of transfer issues, (s)he is able to submit GGUS ticket to correct site following the rules for GGUS ticket content. In case of task issues, (s)he is able to submit JIRA ticket to correct tracker following the rules for JIRA ticket content
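
The subject line and grade conventions above can be assembled with a short helper. This is a sketch, not an official tool: the function names are hypothetical, and the timezone labels AP/EU/US are assumed to be the shift timezone abbreviations used on this page.

```python
# Shift timezone labels (assumed from the AP/EU/US shift slots on this page).
SHIFT_TIMEZONES = ("AP", "EU", "US")

# Short grade descriptions taken from the evaluation grade list above.
GRADES = {
    0: "very quiet shift, not enough information to evaluate",
    1: "new shifter, still learning",
    2: "quite experienced, but not yet ready for promotion",
    3: "ready to be promoted to senior shifter",
}


def evaluation_subject(shifter_name: str, shift_number: int, date: str, timezone: str) -> str:
    """Format the e-mail subject of a trainee evaluation report."""
    if timezone not in SHIFT_TIMEZONES:
        raise ValueError(f"unknown shift timezone: {timezone!r}")
    return f"Trainee evaluation of {shifter_name} (shift {shift_number}), {date} {timezone}"


def grade_line(grade: int) -> str:
    """Format the evaluation grade line of the report (0-3 range)."""
    if grade not in GRADES:
        raise ValueError("grade must be in the 0-3 range")
    return f"Evaluation grade: {grade} ({GRADES[grade]})"


if __name__ == "__main__":
    print(evaluation_subject("ShifterName", 3, "10/10/2012", "EU"))
    print(grade_line(2))
```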

ADC Virtual Control Room

Troubleshooting

  • Currently, it looks like Gmail jabber accounts face strange behaviour when you join the chat: you'll see very ancient chat log, but you will not see the most recent one when you log in. In the meantime, please try to use other jabber account than Gmail, e.g. try out your jabber account at CERN (jabber.cern.ch).
  • Jabber server jabbim.* is a privately-held jabber server. If it stops working there is nothing ADC experts can do about it. In that case please check jabbim.* servers monitoring and wait for the servers to get back. If getting jabbim back takes too long, please use your CERN jabber account.
  • If you are disconnected with a "Conflict" error, please reconnect again with a different nickname (handle).
  • If you are disconnected during the jabbim.com server downtime, please use your CERN jabber account to reconnect.

GGUS ATLAS TEAM membership

  • If for some reason your grid certificate is not yet in the /atlas/team VOMS group, ask the ADCoS Coordinators to add it. In particular, don't forget to do that when you get a completely new certificate. GGUS receives this list on a daily basis and updates the membership accordingly.

Useful links

Tutorials

Shift Credits

  • Before coming to your shift, make sure that you fulfill shift requirements!
  • ADCoS is a Class 2 shift. All ADCoS shifts have the same weight within ATLAS OTP.
  • NEW Each shifter is required to take at least 6 shifts every 4 months!!!
  • We have 3 flavours of shifters: ADCoS Expert shifters, ADCoS Senior shifters, ADCoS Trainee shifters.
    • Senior shifter: 8 hours of shift, 2-day blocks (Mon+Tue, Wed+Thu) and Friday credited with 78% (scaled from 100%), 2-day block (Sat+Sun) bonus credit 155% (scaled from 100%).
      • No upper limit on the number of shifts. Please take at least a few shifts a month. ADCoS training should be repeated if no shifts were taken within a year.
    • Trainee shifter: 8 hours of shift, shift slots available Mon-Sat, 0% shift credit (scaled from 100%), no Sunday shift.
      • Please book your Trainee shift slot only if it is "red" on the OTP calendar. Please do not overallocate shift slots.
      • The trainee period nominally takes 10 shifts; however, the final number of Trainee shifts strictly depends on the Trainee shifter's performance and can be significantly lower or higher than 10.
      • There is a time limit of 3 months for trainee shifters to finish training. A trainee shifter who has not taken any shifts within the last month is automatically excluded from the list. Each shift will be evaluated by a Senior shifter.
      • After promotion to Senior, the first Senior shift must be taken as soon as possible (within a month).
    • Expert shifter: 9 hours of shift, 7-day shift Wed-Tue, credited with 100%, no weekend bonus.
  • We provide 24/7 operations shift in three timezones (defined in CERN time)
    • 00:00 - 08:00 ASIA/PACIFIC (AP) - Shift Captain: Hiroshi Sakamoto
    • 08:00 - 16:00 EUROPE (EU) - Shift Captain: Alexei Sedov
    • 16:00 - 24:00 AMERICAS (US) - Shift Captain: Armen Vartapetian
  • Shifts are booked on first-come-first-served basis in ATLAS OTP.
  • ADCoS tasks in OTP:
    • 529221 - ADCoS Expert shifts
    • 529222 - ADCoS Senior shifts
    • 529223 - ADCoS Trainee shifts
    • 86 - ADCoS Coordination Shifts
  • More general information about ATLAS shifts is available at the ATLAS OtpShiftClasses TWiki page.
  • In case of questions please contact ADCoS coordinators (atlas-adc-adcos-coordinators@cernNOSPAMPLEASE.ch).
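
As a quick aid for working out which shift slot covers a given moment, the three 8-hour slots listed above can be expressed as a lookup on the hour in CERN time. This is a minimal sketch; the function name is hypothetical, and the caller is assumed to have already converted the time to CERN local time.

```python
def shift_timezone(cern_hour: int) -> str:
    """Return the shift slot (AP, EU or US) covering an hour given in CERN time.

    Slots as listed above: 00:00-08:00 AP, 08:00-16:00 EU, 16:00-24:00 US.
    """
    if not 0 <= cern_hour <= 23:
        raise ValueError("cern_hour must be between 0 and 23")
    if cern_hour < 8:
        return "AP"
    if cern_hour < 16:
        return "EU"
    return "US"


if __name__ == "__main__":
    for hour in (0, 7, 8, 15, 16, 23):
        print(f"{hour:02d}:00 CERN time -> {shift_timezone(hour)}")
```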

ADCoS Expert Duties

TEAM MEMBERS


Major updates:
-- XavierEspinal - 30 Jul 2008 -- JaroslavaSchovancova - 2010-2011 -- MichalSvatos - 2014

%RESPONSIBLE% AlexeySedov
%REVIEW% Never reviewed

-- MichalSvatos - 25 Jun 2014

Topic revision: r11 - 2015-01-13 - MichalSvatos
 