Frontier squids

Frontier contacts

  • CERN: Alessandro.DeSalvo@roma1.infn.it

Frontier machines

Frontier monitoring code

Every investigation can be started by checking what crons are running:
sudo -u dbfrontier crontab -l

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@frontiermon1.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@frontiermon2.cern.ch
Monitoring scripts are located at ~dbfrontier/scripts.

Testing

Everything in ~dbfrontier on the master machine is copied to the backup machine every 5 minutes. To stop it (for example if I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~dbfrontier/.ssh/authorized_keys. This will cause synchronization errors if left in place for too long (hours).
  • for longer testing:
    • edit /home/dbfrontier/scripts/bin/identical_frontiermon by adding the folders I do not want to be overwritten to the EXCLUDE line (a hypothetical example is shown below)
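For illustration only, assuming the EXCLUDE line is a space-separated list of paths (relative to the dbfrontier home) that the sync should skip, protecting a test area could look like the line below; the real syntax must be checked in identical_frontiermon itself:
EXCLUDE="scripts/maxthreads www/maxthreads"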

Code development

There is a git repo.
  • checkout
git clone https://:@gitlab.cern.ch:8443/frontier/frontiermon.git
  • commit (in the frontiermon folder) changes to existing files
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • publish
~dbfrontier/bin/frontiermon_publish

N.B. Sometimes it is necessary to set execution rights on the script after it is published.

Frontier monitoring

frontpage

awstats monitoring

Awstats data are located on the squidmon machines.
  • pages:
  • config file (in checked out repo files):
    • /home/squidmon/conf/awstats/SiteProjectNodesMapping - sites from which awstats are read (a hypothetical example row is given after this list). Syntax:
      • SiteProject - grouping according to the site
      • DNS alias - actual name of the machine
      • awstats name - name of the machine (I choose) in awstats monitoring
      • role - launchpad/proxy
      • mode - production/testing
      • awstats group - another grouping of the machines
      • stratum1 time zone - time zone for CVMFS stratum 1s
  • the webpage uses the perl script from the awstats installation (/home/squidmon/etc/awstats/wwwroot/cgi-bin/awstats.pl), i.e. the pages for individual frontiers and time intervals come from awstats itself; ATLAS creates only the summary page
  • the page is created/updated by hand. It can be updated at wlcgsquidmon/wwwsrc/awstats/atlas.html in the SVN
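For orientation only, a purely hypothetical example row with the fields in the order listed above (the actual delimiter, column order, and values must be taken from SiteProjectNodesMapping itself):
SOMESITE  frontier1.example.org  somesite-frontier1  launchpad  production  frontier  -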

maxthreads monitoring

Example of the log output:
# [03/19/24 00:01:24.900 CET +0100] to [03/19/24 00:04:51.006 CET +0100]
2024/03/19 00:04:51 atlr maxthreads=3 averagetime=14.586 msec avedbquerytime=12.1718 msec threadsthreshold=375
2024/03/19 00:04:51 devatlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=375
2024/03/19 00:04:51 t0atlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=375
  • config file:
    • /home/dbfrontier/conf/maxthreads_monitor.config.atlas - sites displayed in the monitoring. Syntax:
      • servers - awstats frontier names
      • srcdir - directory where awstats data will be stored
      • mailaddr - email address to receive alerts (atlas-frontier-support)
  • scripts:
    • /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.py - Python script that processes the logs, makes the plots, and sends alerts (a simplified parsing sketch is shown below)
    • /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.sh - bash wrapper
  • www pages:
    • /home/dbfrontier/www/maxthreads/
  • cronjob:
# Puppet Name: Frontier instance threads monitor
*/5 * * * * ismaster || exit; /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.sh >> /home/dbfrontier/logs/maxthreads_monitor.log 2>&1
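As a minimal sketch (not the actual maxthreads_monitor.py), lines like the log excerpt above could be parsed as follows; the regular expression and the threshold check are assumptions based only on that excerpt:

import re
import sys

# hypothetical pattern for lines like:
# 2024/03/19 00:04:51 atlr maxthreads=3 averagetime=14.586 msec avedbquerytime=12.1718 msec threadsthreshold=375
LINE = re.compile(r'^(?P<date>\S+) (?P<time>\S+) (?P<server>\S+) '
                  r'maxthreads=(?P<maxthreads>\d+) averagetime=(?P<avg>[\d.]+) msec '
                  r'avedbquerytime=(?P<dbavg>[\d.]+) msec threadsthreshold=(?P<threshold>\d+)')

for line in sys.stdin:
    m = LINE.match(line)
    if not m:
        continue  # skips e.g. the "# [start] to [end]" marker lines
    if int(m.group('maxthreads')) > int(m.group('threshold')):
        print(f"{m.group('date')} {m.group('time')} {m.group('server')}: "
              f"maxthreads={m.group('maxthreads')} above threshold {m.group('threshold')}")

It could be run e.g. as python3 sketch.py < /home/dbfrontier/logs/maxthreads_monitor.log (hypothetical invocation).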

Availability monitoring

# Puppet Name: Execute SLS probes
*/5 * * * * ismaster || exit; /home/dbfrontier/scripts/slsfrontier/sls.sh &> /dev/null

Site squids

Every investigation can be started by checking what crons are running:
sudo -u squidmon crontab -l

Squid monitoring code

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@wlcgsquidmon2.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@wlcgsquidmon1.cern.ch

Testing

Everything in ~squidmon on the master machine is copied to the backup machine every 5 minutes. To stop it (for example if I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~squidmon/.ssh/authorized_keys. This will cause synchronization errors if left in place for too long (hours).
  • for longer testing:
    • edit /home/squidmon/scripts/bin/identicalsquidmon by adding the folders I do not want to be overwritten to the EXCLUDE line

Code development

There is a git repo.
  • checkout master
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
  • checkout centos7 branch
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
cd wlcgsquidmon/
git checkout -b centos7 origin/centos7
  • commit (in the wlcgsquidmon folder) changes to existing files
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • publish
~squidmon/bin/squidmon_publish
N.B. Sometimes it is necessary to set execution rights on the script after it is published.

Squid monitoring

frontpage

WLCG mrtg monitoring

# Puppet Name: Generate squid information files
40 */3 * * * ismaster || exit; /home/squidmon/scripts/make_squid_info.py >> /home/squidmon/logs/make_squid_info.log 2>&1
  • this script (/home/squidmon/scripts/make_squid_info.py) runs 16 other scripts

ATLAS mrtg monitoring

  • page: http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas2/indexatlas2.html
  • Uses the same cronjob as the all-experiments WLCG page above
  • scripts:
    • SquidList.py - reads squid info from GOCDB/OIM and CRIC, matches sites+squids (from GOCDB/OIM) with sites+endpoints (from CRIC). Then it matches sites+squids (from GOCDB/OIM) with sites+nodes (from CRIC). Finally, it writes the output into a JSON file (a simplified matching sketch is given after this list)
    • PageBuilder.py - reads list of squids from JSON and creates the webpage (at /home/squidmon/www/snmpstats/mrtgatlas2)
  • config files
    • exceptions are in /home/squidmon/conf/exceptions/mrtgatlas2exceptions.txt
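A simplified sketch of the matching idea described above (not the real SquidList.py); the input dictionaries and the output file name are hypothetical, the real script queries GOCDB/OIM and CRIC:

import json

# hypothetical inputs: squids registered per site in GOCDB/OIM, endpoints known to CRIC
gocdb_squids = {'SOMESITE': ['squid1.example.org']}
cric_endpoints = {'SOMESITE': ['squid1.example.org:3128']}

merged = {}
for site, squids in gocdb_squids.items():
    merged[site] = {
        'squids': squids,
        'cric_endpoints': cric_endpoints.get(site, []),  # empty if CRIC knows nothing for the site
    }

with open('atlas_squids.json', 'w') as f:  # hypothetical output file name
    json.dump(merged, f, indent=2, sort_keys=True)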

SSB+availability monitoring

# Puppet Name: ATLAS SSB
*/25 * * * * ismaster || exit; /home/squidmon/scripts/cron/atlasSSB.job >> /home/squidmon/logs/atlasSSB.log 2>&1

Failover monitoring

# Puppet Name: Failover monitor
18 * * * * ismaster || exit; /home/squidmon/scripts/failover-mon/check-failover.sh >> /home/squidmon/logs/failover-mon.log 2>&1

Elastic Search

Logstash

Usage

The Elastic Search instance hosted in Chicago gives access to job logs, which can provide further details during an investigation.

How to search details in ATLAS Elastic Search

  1. Open the Elastic Search page (needs user account)
  2. Select "frontier_sql" index
  3. Click on "Add a filter"
  4. Choose "clientmachine" and "is"
  5. Put the IP address of the machine as the value.
  6. Save

Filtering

  • to filter a message
message:value
  • to filter out a message
NOT message:value
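For example (hypothetical values; clientmachine and message are the field names used above), filters can be combined, assuming the search bar accepts Lucene-style query syntax:
clientmachine:10.141.2.15 AND NOT message:error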

WLCG-WPAD dashboard

  • link
  • the hits come when something at a WLCG site (or a non-WLCG site running WLCG jobs, e.g. LHC@Home jobs) tries to use proxy autodiscovery to get information about the nearest squid (see the example request after this list)
  • services monitored (each has one server in FNAL and one in CERN):
    • wlcg-wpad - replies positively only at grid sites, and includes backup proxies at those sites after squids that are found. I think the only production use is old-config LHC@Home jobs.
    • lhchomeproxy - like wlcg-wpad except at non-grid sites it replies DIRECT for openhtc.io destinations, so will use cloudflare. Used by current LHC@Home jobs.
    • cernvm-wpad - like lhchomeproxy except at non-grid sites it watches for too many requests in too short of a time (more below) and if so directs them to cernvm backup proxies on port 3125. Used as default for CernVM, cvmfsexec, and soon to be the default configuration for cvmfs if people do not set CVMFS_HTTP_PROXY and are using the cvmfs-config-default configuration rpm.
    • cmsopsquid - like cernvm-wpad except too many requests from non-grid sites in too short of a time get sent to the cms backup proxies on port 3128. Used by CMS opportunistic jobs in the U.S.
  • dashboard content
    • type of info
      • no squid found - wpad returned no squid
      • no org found - nothing found in the geoip database
      • default squid - wpad returned site's squid
      • disabled - if site's squid is disabled (recorded in worker-proxies.json and/or shoal-squids.json)
      • overload - if there are too many requests from one geoip org in too short of a time
    • service names
      • hits per service
      • type of info for each of the service names
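To see what a client would actually receive from one of these services, the standard WPAD convention is to fetch wpad.dat from the service host, which returns a PAC file; for instance (hostname assumed from the service name above):
curl http://wlcg-wpad.cern.ch/wpad.dat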

SAM tests

Decommissioning of a monitoring

  1. stop execution - either ask for removal from the cron or remove the particular script from the file which is run by the cron
  2. remove files from git - scripts from scripts, config files from conf and webpages from wwwsrc
  3. delete data folders
  4. delete html of the monitoring page
  5. if it is on the wlcg-squid-monitor.cern.ch, remove it from there

Tools

snmpwalk

Command that provides all information about the squid, e.g.
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 squid
  • definition of variables https://wiki.squid-cache.org/Features/Snmp#Squid_OIDs
  • important variables:
    • cache_mem : specifies the ideal amount of memory to be used for In-Transit objects, Hot Objects, and Negative-Cached objects
      • cacheMemMaxSize - The value of the cache_mem parameter in MB - should be 256 MB
      • cacheSysVMsize - Amount of cache_mem storage space used, in KB.
    • cache_dir :
      • cacheSwapMaxSize - The total of the cache_dir space allocated in MB - should be 100000 - 200000 for non-small sites
      • cacheSysStorage - Amount of cache_dir storage space used, in KB.
    • cacheMemUsage - Total memory accounted for, in KB
    • cacheCpuUsage - The percentage use of the CPU
    • cacheHttpErrors - if it is high, there is a problem (it is hard to say from outside what it is); it needs a check of the squid logs (access.log and cache.log)
    • cacheVersionId - version of the squid-frontier displayed on the MRTG monitoring page
    • cacheUptime - uptime displayed on the MRTG monitoring page
To get only one variable:
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 cacheSwapMaxSize.0
Command that checks if the squid is working (if the snmpwalk gives timeouts, try traceroute squidname)
snmpwalk -v2c -Cc -c public squidname:3401 .1.3.6.1.4.1.3495.1.1
Just to check if the squid is responding (returns End of MIB if the squid is reachable and Timeout: No Response from if it is not)
snmpwalk -v2c -c public squidname:3401 .

nmap

To check if the squid has an open monitoring port:
nmap -p 3401 squidname

Decoding a query

  • command which decodes query from encoded string in the squid log: ~dwd/adm/bin/decodesquidlog

Contacts

-- MichalSvatos - 2017-06-19
