Frontier squids

Frontier contacts

  • CERN: Alessandro.DeSalvo@roma1.infn.it

Frontier machines

Frontier monitoring code

Every investigation can be started by checking what crons are running:
sudo -u dbfrontier crontab -l

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@frontiermon1.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@frontiermon2.cern.ch
Monitoring scripts are located at ~dbfrontier/scripts.

Testing

Everything in ~dbfrontier on the master machine is copied to the backup machine every 5 minutes. To stop it (for example if I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~dbfrontier/.ssh/authorized_keys. This will cause synchronization errors if left in place for too long (hours).
  • for longer testing:
    • edit /home/dbfrontier/scripts/bin/identical_frontiermon by adding the folders I do not want to be overwritten to the EXCLUDE line (a hypothetical example is shown below)
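For illustration only, assuming the EXCLUDE line is a space-separated list of paths (relative to the dbfrontier home) that the sync should skip, protecting a test area could look like the line below; the real syntax must be checked in identical_frontiermon itself:
EXCLUDE="scripts/maxthreads www/maxthreads"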

Code development

There is a git repo.
  • checkout
git clone https://:@gitlab.cern.ch:8443/frontier/frontiermon.git
  • commit (in the frontiermon folder) changes to existing files
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • publish
~dbfrontier/bin/frontiermon_publish

N.B. Sometimes it is necessary to set execution rights on the script after it is published.

Frontier monitoring

frontpage

awstats monitoring

Awstats data are located on the squidmon machines.
  • pages:
  • config file (in checked out repo files):
    • /home/squidmon/conf/awstats/SiteProjectNodesMapping - sites from which awstats are read (a hypothetical example row is given after this list). Syntax:
      • SiteProject - grouping according to the site
      • DNS alias - actual name of the machine
      • awstats name - name of the machine (I choose) in awstats monitoring
      • role - launchpad/proxy
      • mode - production/testing
      • awstats group - another grouping of the machines
      • stratum1 time zone - time zone for CVMFS stratum 1s
  • the webpage uses the perl script from the awstats installation (/home/squidmon/etc/awstats/wwwroot/cgi-bin/awstats.pl), i.e. the pages for individual frontiers and time intervals come from awstats itself; ATLAS creates only the summary page
  • the page is created/updated by hand. It can be updated at wlcgsquidmon/wwwsrc/awstats/atlas.html in the SVN
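For orientation only, a purely hypothetical example row with the fields in the order listed above (the actual delimiter, column order, and values must be taken from SiteProjectNodesMapping itself):
SOMESITE  frontier1.example.org  somesite-frontier1  launchpad  production  frontier  -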

maxthreads monitoring

Example of the log output:
# [03/19/24 00:01:24.900 CET +0100] to [03/19/24 00:04:51.006 CET +0100]
2024/03/19 00:04:51 atlr maxthreads=3 averagetime=14.586 msec avedbquerytime=12.1718 msec threadsthreshold=375
2024/03/19 00:04:51 devatlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=375
2024/03/19 00:04:51 t0atlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=375
  • config file:
    • /home/dbfrontier/conf/maxthreads_monitor.config.atlas - sites displayed in the monitoring. Syntax:
      • servers - awstats frontier names
      • srcdir - directory where awstats data will be stored
      • mailaddr - email address to receive alerts (atlas-frontier-support)
  • scripts:
    • /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.py - Python script that processes the logs, makes the plots, and sends alerts (a simplified parsing sketch is shown below)
    • /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.sh - bash wrapper
  • www pages:
    • /home/dbfrontier/www/maxthreads/
  • cronjob:
# Puppet Name: Frontier instance threads monitor
*/5 * * * * ismaster || exit; /home/dbfrontier/scripts/maxthreads/maxthreads_monitor.sh >> /home/dbfrontier/logs/maxthreads_monitor.log 2>&1
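As a minimal sketch (not the actual maxthreads_monitor.py), lines like the log excerpt above could be parsed as follows; the regular expression and the threshold check are assumptions based only on that excerpt:

import re
import sys

# hypothetical pattern for lines like:
# 2024/03/19 00:04:51 atlr maxthreads=3 averagetime=14.586 msec avedbquerytime=12.1718 msec threadsthreshold=375
LINE = re.compile(r'^(?P<date>\S+) (?P<time>\S+) (?P<server>\S+) '
                  r'maxthreads=(?P<maxthreads>\d+) averagetime=(?P<avg>[\d.]+) msec '
                  r'avedbquerytime=(?P<dbavg>[\d.]+) msec threadsthreshold=(?P<threshold>\d+)')

for line in sys.stdin:
    m = LINE.match(line)
    if not m:
        continue  # skips e.g. the "# [start] to [end]" marker lines
    if int(m.group('maxthreads')) > int(m.group('threshold')):
        print(f"{m.group('date')} {m.group('time')} {m.group('server')}: "
              f"maxthreads={m.group('maxthreads')} above threshold {m.group('threshold')}")

It could be run e.g. as python3 sketch.py < /home/dbfrontier/logs/maxthreads_monitor.log (hypothetical invocation).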

Availability monitoring

# Puppet Name: Execute SLS probes
*/5 * * * * ismaster || exit; /home/dbfrontier/scripts/slsfrontier/sls.sh &> /dev/null

Site squids

Every investigation can be started by checking what crons are running:
sudo -u squidmon crontab -l

Squid monitoring code

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@wlcgsquidmon2.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@wlcgsquidmon1.cern.ch

Testing

Everything in ~squidmon on the master machine is copied to the backup machine every 5 minutes. To stop it (for example if I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~squidmon/.ssh/authorized_keys. This will cause synchronization errors if left in place for too long (hours).
  • for longer testing:
    • edit /home/squidmon/scripts/bin/identicalsquidmon by adding the folders I do not want to be overwritten to the EXCLUDE line

Code development

There is a git repo.
  • checkout master
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
  • checkout centos7 branch
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
cd wlcgsquidmon/
git checkout -b centos7 origin/centos7
  • commit (in the wlcgsquidmon folder) changes to existing files
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • publish
~squidmon/bin/squidmon_publish
N.B. Sometimes it is necessary to set execution rights on the script after it is published.

Squid monitoring

frontpage

WLCG mrtg monitoring

# Puppet Name: Generate squid information files
40 */3 * * * ismaster || exit; /home/squidmon/scripts/make_squid_info.py >> /home/squidmon/logs/make_squid_info.log 2>&1
  • this script (/home/squidmon/scripts/make_squid_info.py) runs 16 other scripts

ATLAS mrtg monitoring

  • page: http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas2/indexatlas2.html
  • Uses the same cronjob as the all-experiments WLCG page above
  • scripts:
    • SquidList.py - reads squid info from GOCDB/OIM and CRIC, matches sites+squids (from GOCDB/OIM) with sites+endpoints (from CRIC). Then it matches sites+squids (from GOCDB/OIM) with sites+nodes (from CRIC). Finally, it writes the output into a JSON file (a simplified matching sketch is given after this list)
    • PageBuilder.py - reads list of squids from JSON and creates the webpage (at /home/squidmon/www/snmpstats/mrtgatlas2)
  • config files
    • exceptions are in /home/squidmon/conf/exceptions/mrtgatlas2exceptions.txt
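A simplified sketch of the matching idea described above (not the real SquidList.py); the input dictionaries and the output file name are hypothetical, the real script queries GOCDB/OIM and CRIC:

import json

# hypothetical inputs: squids registered per site in GOCDB/OIM, endpoints known to CRIC
gocdb_squids = {'SOMESITE': ['squid1.example.org']}
cric_endpoints = {'SOMESITE': ['squid1.example.org:3128']}

merged = {}
for site, squids in gocdb_squids.items():
    merged[site] = {
        'squids': squids,
        'cric_endpoints': cric_endpoints.get(site, []),  # empty if CRIC knows nothing for the site
    }

with open('atlas_squids.json', 'w') as f:  # hypothetical output file name
    json.dump(merged, f, indent=2, sort_keys=True)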

SSB+availability monitoring

# Puppet Name: ATLAS SSB
*/25 * * * * ismaster || exit; /home/squidmon/scripts/cron/atlasSSB.job >> /home/squidmon/logs/atlasSSB.log 2>&1

Failover monitoring

# Puppet Name: Failover monitor
18 * * * * ismaster || exit; /home/squidmon/scripts/failover-mon/check-failover.sh >> /home/squidmon/logs/failover-mon.log 2>&1

Elastic Search

Logstash

Usage

The Elastic Search instance hosted in Chicago gives access to job logs, which can provide further details during an investigation.

How to search details in ATLAS Elastic Search

  1. Open the Elastic Search page (needs user account)
  2. Select "frontier_sql" index
  3. Click on "Add a filter"
  4. Choose "clientmachine" and "is"
  5. Put the IP address of the machine as the value.
  6. Save

Filtering

  • to filter a message
message:value
  • to filter out a message
NOT message:value
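For example (hypothetical values; clientmachine and message are the field names used above), filters can be combined, assuming the search bar accepts Lucene-style query syntax:
clientmachine:10.141.2.15 AND NOT message:error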

WLCG-WPAD dashboard

  • link
  • the hits come when something at a WLCG site (or a non-WLCG site running WLCG jobs, e.g. LHC@Home jobs) tries to use proxy autodiscovery to get information about the nearest squid (see the example request after this list)
  • services monitored (each has one server in FNAL and one in CERN):
    • wlcg-wpad - replies positively only at grid sites, and includes backup proxies at those sites after squids that are found. I think the only production use is old-config LHC@Home jobs.
    • lhchomeproxy - like wlcg-wpad except at non-grid sites it replies DIRECT for openhtc.io destinations, so will use cloudflare. Used by current LHC@Home jobs.
    • cernvm-wpad - like lhchomeproxy except at non-grid sites it watches for too many requests in too short of a time (more below) and if so directs them to cernvm backup proxies on port 3125. Used as default for CernVM, cvmfsexec, and soon to be the default configuration for cvmfs if people do not set CVMFS_HTTP_PROXY and are using the cvmfs-config-default configuration rpm.
    • cmsopsquid - like cernvm-wpad except too many requests from non-grid sites in too short of a time get sent to the cms backup proxies on port 3128. Used by CMS opportunistic jobs in the U.S.
  • dashboard content
    • type of info
      • no squid found - wpad returned no squid
      • no org found - nothing found in the geoip database
      • default squid - wpad returned site's squid
      • disabled - if site's squid is disabled (recorded in worker-proxies.json and/or shoal-squids.json)
      • overload - if there are too many requests from one geoip org in too short of a time
    • service names
      • hits per service
      • type of info for each of the service names
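To see what a client would actually receive from one of these services, the standard WPAD convention is to fetch wpad.dat from the service host, which returns a PAC file; for instance (hostname assumed from the service name above):
curl http://wlcg-wpad.cern.ch/wpad.dat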

SAM tests

Decommissioning of a monitoring

  1. stop execution - either ask for removal from the cron or remove the particular script from the file which is run by the cron
  2. remove files from git - scripts from scripts, config files from conf and webpages from wwwsrc
  3. delete data folders
  4. delete html of the monitoring page
  5. if it is on the wlcg-squid-monitor.cern.ch, remove it from there

Tools

snmpwalk

Command that provides all information about the squid, e.g.
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 squid
  • definition of variables https://wiki.squid-cache.org/Features/Snmp#Squid_OIDs
  • important variables:
    • cache_mem : specifies the ideal amount of memory to be used for In-Transit objects, Hot Objects, and Negative-Cached objects
      • cacheMemMaxSize - The value of the cache_mem parameter in MB - should be 256 MB
      • cacheSysVMsize - Amount of cache_mem storage space used, in KB.
    • cache_dir :
      • cacheSwapMaxSize - The total of the cache_dir space allocated in MB - should be 100000 - 200000 for non-small sites
      • cacheSysStorage - Amount of cache_dir storage space used, in KB.
    • cacheMemUsage - Total memory accounted for, in KB
    • cacheCpuUsage - The percentage use of the CPU
    • cacheHttpErrors - if it is high, there is a problem (it is hard to say from outside what it is); it needs a check of the squid logs (access.log and cache.log)
    • cacheVersionId - version of the squid-frontier displayed on the MRTG monitoring page
    • cacheUptime - uptime displayed on the MRTG monitoring page
To get only one variable:
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 cacheSwapMaxSize.0
Command that checks if the squid is working (if the snmpwalk gives timeouts, try traceroute squidname)
snmpwalk -v2c -Cc -c public squidname:3401 .1.3.6.1.4.1.3495.1.1
Just to check if the squid is responding (returns End of MIB if the squid is reachable and Timeout: No Response from if it is not)
snmpwalk -v2c -c public squidname:3401 .

nmap

To check if the squid has an open monitoring port:
nmap -p 3401 squidname

Decoding a query

  • command which decodes query from encoded string in the squid log: ~dwd/adm/bin/decodesquidlog

Contacts

-- MichalSvatos - 2017-06-19
