Logbook Entries

24-July-2019 np04daq keytab expired

The new keytab file is available from the np04cluster in ~np04daq/krb5/np04daq.keytab. This file is actually a soft link to np04daq.keytab.20190724 in the same directory.

A keytab file is used in place of username/password for Kerberos access for the np04daq account. The DAQ operates as the np04daq user.

The keytab file stopped working when I intentionally let the password expire on the np04daq service account; I had expected the keytab to keep working. To generate a new keytab:

  • Log in as root.
  • cern-get-keytab --keytab np04daq.keytab.20190724 --user --login np04daq
  • Enter the np04daq password when prompted.
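The np04daq.keytab soft link then needs to point at the new dated file. A minimal sketch of the rotation (a temporary directory stands in for the real ~np04daq/krb5 path, and `touch` stands in for the generated keytab, so this is safe to run anywhere):

```shell
# Hypothetical sketch of rotating the keytab symlink; paths are stand-ins.
cd "$(mktemp -d)"                         # stands in for ~np04daq/krb5
touch np04daq.keytab.20190724             # stands in for the newly generated keytab
ln -sfn np04daq.keytab.20190724 np04daq.keytab   # -f replaces any old link atomically-ish
readlink np04daq.keytab                   # confirms where the link points
```

The new keytab can then be checked with `kinit -kt np04daq.keytab np04daq`, which should acquire a ticket without prompting for a password.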

What I don't know is whether the keytab file would have expired if I had updated the password instead of letting it expire. I suspect it would.

20-Jun-2019 timing artdaq builds

mrb z          # delete the build area (zapBuild)
mrbsetenv      # set up the build environment
time mrb install -j8
time mrb install -j32

np04-srv-023 builds

np04daq on np04-srv-023 in /nfs/sw/work_dirs/geoff_v330_beta

time mrb install -j8
real   5m32.037s
user   28m54.031s
sys   3m50.150s

time mrb install -j8 &>./log-build-8.txt
real   5m30.889s
user   28m50.026s
sys   3m51.627s


time mrb install -j32
real   3m24.388s
user   39m16.966s
sys   4m43.713s

time mrb install -j32 &>./log-build-32.txt
real   3m23.804s
user   39m17.094s
sys   4m43.530s
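From the real times above, going from -j8 to -j32 (four times the job slots) buys only about a 1.6x wall-clock speedup. A one-liner to compute it from the measurements:

```shell
# Speedup of the -j32 build over the -j8 build, from the real times above
# (5m32.037s vs 3m24.388s).
speedup=$(awk 'BEGIN { printf "%.2f", (5*60 + 32.037) / (3*60 + 24.388) }')
echo "j32 is ${speedup}x faster than j8"
```

The sub-linear scaling suggests parts of the build are serialized (e.g. the dependency-scanning step noted below for np04-srv-019).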

np04-srv-019 builds

The build spends a lot of time at this step: "Scanning dependencies of target artdaq-core_Core".

06-May-2019 Backups failing

[root@np04-srv-008 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/cc_np04--srv--008-root       50G   13G   34G  28% /
devtmpfs                                7.8G     0  7.8G   0% /dev
tmpfs                                   7.8G     0  7.8G   0% /dev/shm
tmpfs                                   7.8G   18M  7.8G   1% /run
tmpfs                                   7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/sda2                               976M  248M  662M  28% /boot
/dev/mapper/cc_np04--srv--008-rscratch  3.0T  2.0T  849G  71% /rscratch
/dev/mapper/cc_np04--srv--008-back      3.0T  2.9T     0 100% /back
np04-srv-007:/home                      3.0T  644G  2.2T  23% /nfs/home
np04-srv-007:/sw                        3.0T  2.1T  741G  75% /nfs/sw
tmpfs                                   1.6G     0  1.6G   0% /run/user/0
[root@np04-srv-008 ~]# cd /back
[root@np04-srv-008 back]# ls
data      home        np04-srv-009-etc   np04-srv-009-opt  np04-srv-014  wincc
database  lost+found  np04-srv-009-home  np04-srv-013      sw
[root@np04-srv-008 back]# du -sh
^C
[root@np04-srv-008 back]# du -h --max-depth=1
587M   ./np04-srv-009-home
680G   ./home
1.8T   ./sw
238G   ./database
47M   ./np04-srv-009-etc
41G   ./np04-srv-013
975M   ./wincc
16K   ./lost+found
6.5G   ./data
114G   ./np04-srv-014
546M   ./np04-srv-009-opt
2.9T   .
[root@np04-srv-008 back]# ls -l
total 176
drwxr-xr-x 2 np04daq np-comp 126976 May  2  2018 data
drwxr-xr-x 3 np04daq np-comp   4096 Jan 17  2018 database
drwxr-xr-x 4 root    root      4096 Nov  8  2017 home
drwx------ 2 root    root     16384 Aug 18  2017 lost+found
drwxr-xr-x 3 root    root      4096 Nov 10  2017 np04-srv-009-etc
drwxr-xr-x 3 root    root      4096 Nov 10  2017 np04-srv-009-home
drwxr-xr-x 3 root    root      4096 Nov 10  2017 np04-srv-009-opt
drwxr-xr-x 5 root    root      4096 Mar  6 19:12 np04-srv-013
drwxr-xr-x 5 root    root      4096 Mar  7 16:56 np04-srv-014
drwxr-xr-x 3 root    root      4096 Nov  7  2017 sw
drwxr-xr-x 4 root    root      4096 Nov 14  2017 wincc
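With /back at 100%, ranking the subdirectories by size shows where to reclaim space; sorting the `du` output does this directly. A sketch (a scratch directory with a dummy file stands in for /back so the example is safe to run anywhere):

```shell
# List the largest subdirectories first. $dir stands in for /back here.
dir=$(mktemp -d)
mkdir -p "$dir/sw" "$dir/home"
dd if=/dev/zero of="$dir/sw/blob" bs=1024 count=64 2>/dev/null  # dummy space consumer
du -h --max-depth=1 "$dir" | sort -rh | head                    # biggest entries first
```

On np04-srv-008 the same pipeline over /back would put the 1.8T ./sw tree at the top, the obvious place to start pruning.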

16-Oct-2018 DAQ Operations

The DAQ group recommends keeping the DAQ running as much as possible to reduce the time spent starting a run.

  • During beam running, if beam is not delivered for more than 30 minutes, stop the run and immediately start a new one. Keep that run going when beam arrives.
  • During extended non-beam running, start a new run 30 minutes before beam is anticipated to arrive.

26-Sep-2018 Configuration upgrades

  • Move logstash
  • Supervisord configuration updates.
  • Rearrange disk layout on srv011/012/013/014.
  • Try to add spare disk to array on np04-srv-002.
  • Remove event builders running on np04-srv-004.

In addition, I have not been able to bring the two raid arrays back; I'm trying to get assistance from CERN IT. When we come back tomorrow I'd like to turn off the event builders on srv004 so I can run additional tests without the risk of bringing the DAQ down. If I can fix the array on srv004, I would put srv004 into production and remove srv003 so I can work on its non-working raid array.

Computing tasks performed this morning. Thanks to Roland for helping today. Roland is the architect of the logstash and supervisord configuration management.

  • Logstash was moved to np04-srv-010 from np04-srv-014. srv014 runs board readers and srv010 is a utility computer. Logstash forwards messages from all the DAQ log files to kibana for display.
  • Supervisord configuration updates for online monitoring on np04-srv-023. We found some other issues in the supervisord configuration. Supervisord restarts applications automatically if they stop unexpectedly.
  • Disk layouts on srv011/012/013/014 were updated. We are having issues with root partitions filling to 100%; when this happens runs will not start. All other np04 servers were installed with separate partitions for /log, /scratch, and /home. All servers have /home partitions, but these four have larger /home partitions than the others.
    • On srv011/srv012 I created log and scratch directories in /home, then linked them to /log and /scratch.
    • srv013/srv014 already had /scratch partitions, so I created log directories in /home and linked them to /log.
  • Enrico removed event builders 13,15, and 16 from the RC so I can work on the disk array that is having issues.

  • I did not have time to try and add in the spare disk back into md0 on np04-srv-002.
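The /log and /scratch relocation on srv011/srv012 follows this pattern (a sketch only; a temporary directory stands in for the root filesystem so the example runs without touching real mounts):

```shell
# Relocate /log and /scratch onto the larger /home partition by creating
# real directories there and symlinking. $root stands in for / here.
root=$(mktemp -d)
mkdir -p "$root/home/log" "$root/home/scratch"
ln -sfn "$root/home/log" "$root/log"          # /log -> /home/log
ln -sfn "$root/home/scratch" "$root/scratch"  # /scratch -> /home/scratch
ls -ld "$root/log" "$root/scratch"
```

Anything writing to /log or /scratch then lands on the /home partition instead of the root partition.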

21-Sep-2018 Shift and run plan for the weekend

Nikos tuned the beam for 1,2,3,4,5,6,7 GeV. Beamline control moved to CESAR in NP04 control room.

David Rivera (trigger) tested trigger configurations with PD system.

Run 4581 - Cosmic, PD only run. Run 4583 - Cosmic, PD only run.

Beam down for 2 hours.

Started run 4584 just at the end of the shift.

Done

  • Removed all WIB, RCE, and FELIX components from the run.
  • Turned off the beam.
  • 30 min - Cosmics run - disabled trigger_0, np04_WibsReal_Ssps00125
  • Turned on the 1 GeV beam.
  • Then discovered there is no beam from the CPS because POPS is down.

Status

  • Beam line configured for 1 GeV
  • DAQ configured for PD only. Only SSP components included.

Next

  • Overnight - 1 GeV - enable trigger_0, np04_WibsReal_Ssps_BeamTrig_00012
  • When purity monitor wants to run stop the run and start a new one.
  • Turn off beam
  • 30 min - Cosmics run - disable trigger_0, np04_WibsReal_Ssps00125
  • Turn on 7 GeV beam
  • 2 hours - 7 GeV - enable trigger_0, np04_WibsReal_Ssps_BeamTrig_4x_prescale_00003

20-Sep-2018 Move Run Control to srv024 (from srv010)

  • This morning run control was moved to np04-srv-024 (from np04-srv-010). Along with run control we moved the artdaq configuration database, the DIM dns server (dnsd), and the inhibit master.
  • The configuration was updated to np04_WibsReal_Sssps00124. This configuration has the location of dnsd and the inhibit master updated to srv024.
  • Configuration database access is now from srv024 (instead of srv010).
  • This move has Run Control working on a more powerful computer and is part of our plan to improve computer security.

  • stop dnsd on np04-srv-010
  • start dnsd on np04-srv-024
  • Giovanna verified configurations can be retrieved from the database and inserted into the database.

artdaq db

[root@np04-srv-010 ~]# systemctl stop  webconfigeditor@cern_pddaq_v3x_db.service
[root@np04-srv-010 ~]# systemctl stop  mongodbserver@cern_pddaq_v3x_db.service
[root@np04-srv-010 ~]# systemctl disable  webconfigeditor@cern_pddaq_v3x_db.service
Removed symlink /etc/systemd/system/multi-user.target.wants/webconfigeditor@cern_pddaq_v3x_db.service.
[root@np04-srv-010 ~]# systemctl disable  mongodbserver@cern_pddaq_v3x_db.service
Removed symlink /etc/systemd/system/multi-user.target.wants/mongodbserver@cern_pddaq_v3x_db.service.
[root@np04-srv-010 ~]# systemctl status  mongodbserver@cern_pddaq_v3x_db.service
● mongodbserver@cern_pddaq_v3x_db.service - Mongo database service
   Loaded: loaded (/etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Sep 20 12:02:17 np04-srv-010 systemd[1]: Stopping Mongo database service...
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_DATABASE_NAME is set to 'cern_pddaq_v3x_db'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_UPS_VER is set to 'v3_4_6'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_UPS_QUAL is set to 'e14:prof'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: mongod found: '/nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/mongod-201809201202.log
Sep 20 12:02:27 np04-srv-010 mongod-ctrl.sh[21916]: Stopping mongod: [  OK  ]
Sep 20 12:02:27 np04-srv-010 systemd[1]: Stopped Mongo database service.
[root@np04-srv-010 ~]# systemctl status  webconfigeditor@cern_pddaq_v3x_db.service
● webconfigeditor@cern_pddaq_v3x_db.service - WebConfigEditor service
   Loaded: loaded (/etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_UPS_VER is set to 'v1_01_00'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_UPS_QUAL is set to 'e14:prof:s50'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_BASE_PORT is set to '8880'
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: node found: '/nfs/sw/artdaq/products/nodejs/v4_5_0/Linux64bit/bin/node'
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/webconfigeditor-201809201202.log
Sep 20 12:02:03 np04-srv-010 node[6842]: DIGEST-MD5 common mech free
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Stopping Web Config Editor: [  OK  ]
Sep 20 12:02:03 np04-srv-010 systemd[1]: Stopped WebConfigEditor service.

[root@np04-srv-024 ~]# systemctl enable mongodbserver@cern_pddaq_v3x_db.service
Created symlink from /etc/systemd/system/multi-user.target.wants/mongodbserver@cern_pddaq_v3x_db.service to /etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service.
[root@np04-srv-024 ~]# 
[root@np04-srv-024 ~]# systemctl enable webconfigeditor@cern_pddaq_v3x_db.service
Created symlink from /etc/systemd/system/multi-user.target.wants/webconfigeditor@cern_pddaq_v3x_db.service to /etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service.
[root@np04-srv-024 ~]# systemctl start  mongodbserver@cern_pddaq_v3x_db.service
[root@np04-srv-024 ~]# systemctl start  webconfigeditor@cern_pddaq_v3x_db.service
[root@np04-srv-024 ~]# systemctl status  mongodbserver@cern_pddaq_v3x_db.service
● mongodbserver@cern_pddaq_v3x_db.service - Mongo database service
   Loaded: loaded (/etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-09-20 12:06:27 CEST; 25s ago
  Process: 50379 ExecStart=/nfs/sw/database/mongod-ctrl.sh start (code=exited, status=0/SUCCESS)
 Main PID: 50757 (mongod)
   CGroup: /system.slice/system-mongodbserver.slice/mongodbserver@cern_pddaq_v3x_db.service
           └─50757 /nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod --dbpath=/nfs/sw/database/cern_pddaq_v3x_db/data --pidfil...

Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_UPS_VER is set to 'v3_4_6'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_UPS_QUAL is set to 'e14:prof'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Info: mongod found: '/nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod'
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/mongod-201809201206.log
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Starting mongod: about to fork child process, waiting until server is ready for connections.
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: forked process: 50757
Sep 20 12:06:27 np04-srv-024 mongod-ctrl.sh[50379]: child process started successfully, parent exiting
Sep 20 12:06:27 np04-srv-024 systemd[1]: Started Mongo database service.
[root@np04-srv-024 ~]# systemctl status  webconfigeditor@cern_pddaq_v3x_db.service
● webconfigeditor@cern_pddaq_v3x_db.service - WebConfigEditor service
   Loaded: loaded (/etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-09-20 12:06:45 CEST; 19s ago
  Process: 52848 ExecStop=/nfs/sw/database/webconfigeditor-ctrl.sh stop (code=exited, status=0/SUCCESS)
  Process: 51654 ExecStart=/nfs/sw/database/webconfigeditor-ctrl.sh start (code=exited, status=0/SUCCESS)
 Main PID: 52138 (code=exited, status=1/FAILURE)

Sep 20 12:06:45 np04-srv-024 systemd[1]: Unit webconfigeditor@cern_pddaq_v3x_db.service entered failed state.
Sep 20 12:06:45 np04-srv-024 systemd[1]: webconfigeditor@cern_pddaq_v3x_db.service failed.

Firewall Sets

The firewall has been updated by Giovanna to allow the CERN linux support servers access to our servers. Technically this means the linuxsoft set has been added to the "GPN services exposed to NP04" set.

I have tested this on np04-onl-002. All my tests worked:

  • ping np04-onl-002 from lxplus
  • ssh from lxplus to np04-onl-002 as root, dsavage, and np04daq

The linuxsoft set consists of: IT LICENCE SERVERS, IT LINUXSOFT, IT NETWORK SERVICES, NICE_DFS, NICE_DOMAINCONTROLLERS, NICE_LDAP, NICE_XLDAP.

14-Sep-2018 Reboot np04-srv-010

The run control developers were having issues working on np04-srv-010 this morning. After restarting software and seeing no change they rebooted np04-srv-010 to try and resolve the problem.

Looking at the monitoring of np04 computers we see a gradual upward trend in memory usage. We believe the culprit is mongodb, the database used to store configurations. When memory runs out, the kernel's out-of-memory killer terminates the process using the most memory.

I have reenabled swap on this computer. We had turned off swapping to disk on the DAQ computers to ensure more reliable performance, but srv010 does not run applications for transferring data, so disabling swap is not needed there.

The artdaq team is also looking into reducing the memory needed for the configuration database.

Sep 10 02:06:10 np04-srv-010 kernel: Out of memory: Kill process 14124 (mongod) score 424 or sacrifice child
Sep 10 02:06:10 np04-srv-010 kernel: Killed process 14124 (mongod) total-vm:7872428kB, anon-rss:6868300kB, file-rss:0kB, shmem-rss:0kB
Sep 11 02:06:07 np04-srv-010 kernel: Out of memory: Kill process 44940 (mongod) score 433 or sacrifice child
Sep 11 02:06:07 np04-srv-010 kernel: Killed process 44940 (mongod) total-vm:8168792kB, anon-rss:7020916kB, file-rss:0kB, shmem-rss:0kB
Sep 12 02:08:01 np04-srv-010 kernel: Out of memory: Kill process 30647 (mongod) score 448 or sacrifice child
Sep 12 02:08:01 np04-srv-010 kernel: Killed process 30647 (mongod) total-vm:8337200kB, anon-rss:7260292kB, file-rss:0kB, shmem-rss:0kB
Sep 12 10:27:36 np04-srv-010 kernel: Out of memory: Kill process 20488 (mongod) score 444 or sacrifice child
Sep 12 10:27:36 np04-srv-010 kernel: Killed process 20488 (mongod) total-vm:8223896kB, anon-rss:7194596kB, file-rss:0kB, shmem-rss:0kB
Sep 12 10:27:36 np04-srv-010 kernel: Out of memory: Kill process 20490 (Backgro.kSource) score 444 or sacrifice child
Sep 12 10:27:36 np04-srv-010 kernel: Killed process 20490 (Backgro.kSource) total-vm:8223896kB, anon-rss:7194712kB, file-rss:0kB, shmem-rss:0kB
Sep 13 02:09:38 np04-srv-010 kernel: Out of memory: Kill process 12295 (mongod) score 251 or sacrifice child
Sep 13 02:09:38 np04-srv-010 kernel: Killed process 12295 (mongod) total-vm:4997296kB, anon-rss:4073428kB, file-rss:0kB, shmem-rss:0kB
Sep 14 02:10:31 np04-srv-010 kernel: Out of memory: Kill process 14679 (mongod) score 361 or sacrifice child
Sep 14 02:10:31 np04-srv-010 kernel: Killed process 14679 (mongod) total-vm:6774520kB, anon-rss:5848112kB, file-rss:0kB, shmem-rss:0kB
Sep 14 02:10:31 np04-srv-010 kernel: Out of memory: Kill process 14686 (Backgro.kSource) score 361 or sacrifice child
Sep 14 02:10:31 np04-srv-010 kernel: Killed process 14686 (Backgro.kSource) total-vm:6774520kB, anon-rss:5848308kB, file-rss:0kB, shmem-rss:0kB
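A quick way to tally the kills per process from a syslog excerpt like the one above (a sketch; on the server the input would come from /var/log/messages rather than an inlined string):

```shell
# Count OOM kills per process name. Two of the syslog lines above are
# inlined as sample input; field 9 is the parenthesized process name.
log='Sep 10 02:06:10 np04-srv-010 kernel: Killed process 14124 (mongod) total-vm:7872428kB
Sep 14 02:10:31 np04-srv-010 kernel: Killed process 14686 (Backgro.kSource) total-vm:6774520kB'
echo "$log" | awk '/Killed process/ { gsub(/[()]/, "", $9); kills[$9]++ }
                   END { for (p in kills) print kills[p], p }' | sort
```

Run over the full log this makes the pattern obvious: mongod is killed almost every night shortly after 02:00.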

07-Sep-2018 New computers

The installation of four new computers was completed today. The four computers are housed in the same enclosure. Computers are np04-srv-021, np04-srv-022, np04-srv-023, np04-srv-024.

Allocations:

  • np04-srv-021 - felix
  • np04-srv-022
  • np04-srv-023 - online monitoring (monet)
  • np04-srv-024 - other services that are scattered on computers in the DAQ
    • DAQ error messages (logstash)
    • System monitoring (prometheus)
    • File transfer to EOS (FTS-lite)

05-Sep-2018 CRT computer network interfaces

Moved the 10 Gb interface on the CRT computer to be np04-crt-001 (np04-crt-001 was originally the 1 Gb interface on the CRT computer). This move accomplishes two things.

  • For ssh to work correctly, the DNS name needs to match the device name in landb (the CERN network database). This requirement is unexpected, and a service desk ticket did not resolve the issue, so the swap was needed.
  • For artdaq to work correctly all DAQ computers need to be connected to the router. This is for multicast support.
The 1 Gb interface is now np04-crt-001-ctrl and requires a password to log in.

04-Sep-2018 np04-srv-004 system disk filled

The system disk on np04-srv-004 filled up. This prevented DAQ runs from starting.

The cause is a raid array failure following last week's power outages. The event builder (eb14) corresponding to the failed raid array was selected for use in the run. Instead of writing data to the large raid array, the event builder wrote to the mount point, which sits on the system disk, causing it to fill up.

The temporary fix was to remove eb14 from the run control. Thanks to Enrico Gamberini for doing this. Once the raid array is restored the event builder will be reenabled in run control.

Event builder 11 was also removed from run control for the same reason. The raid array failed following the power outage.
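One guard against a repeat (a sketch only, not what run control actually does; the /data1 path is hypothetical) is to check that an event builder's output directory is a real mounted filesystem before including it in a run:

```shell
# Check that a data directory is a mounted filesystem, not a bare mount
# point sitting on the system disk. "/" is used so the sketch runs anywhere;
# on np04-srv-004 the argument would be the raid array mount, e.g. /data1.
dir=/
if mountpoint -q "$dir"; then
    echo "$dir is a mounted filesystem - safe to write"
else
    echo "$dir is NOT mounted - writes would land on the system disk" >&2
fi
```

`mountpoint -q` exits nonzero for a plain directory, so the same test works in a pre-run script or a monitoring check.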

01-Sep-2018 High number of DNS queries

Admins received emails about a high number of dns queries for srv013 and srv014 at Friday, August 31, 2018 21:09. Email is below.

This indicates the system is not configured correctly for the nscd service. Indeed, when I checked today, the two systems were not configured to use nscd. The nscd service can be enabled with the CERN configuration tool, locmap. All the other computers I checked today were running nscd.

I created a configuration in the np04 configuration management to configure all the services controlled through locmap, with nscd enabled. A configuration run over all the np04 servers succeeded, and nscd is now running on srv013 and srv014.

Dear np04-onl-admins@cern.ch

You are listed as responsible for np04-srv-013 (.cern.ch).
Our DNS servers are warning that this host has been sending a VERY HIGH
rate of queries for the last hour (78 requests/sec).

Please, check the cause of this problem and sort it out
since it impacts the central DNS service performance. Please
also consult http://service-dns.web.cern.ch/service-dns/faq.asp
for information on setting up dns for high demanding clients.

Should this problem continue, we will have to block this system
to avoid performance problems in the central DNS service.

Thanks in advance,
CERN Network Support

More info:
10.73.136.33 queried 242715 times name livlhcb010.dyndns.cern.ch
10.73.136.33 queried 23874 times name np04-srv-013.cern.ch
10.73.136.33 queried 11937 times name 29.221.141.128.in-addr.arpa
10.73.136.33 queried 11937 times name 33.136.73.10.in-addr.arpa
10.73.136.33 queried 4 times name lxplus011.cern.ch
np04-onl-admins@cern.ch

30-Aug-2018 NFS failure

There were many complaints this morning about run control working slowly.

srv010 is the run control computer. These NFS error messages appeared in the srv010 system log files.

  • NFS: nfs4_reclaim_open_state: Lock reclaim failed!

Restarting nfs-client on srv010 did not fix the problem.

  • systemctl restart nfs-client

srv007 is the main nfs server with all the DAQ software. I restarted the nfs server on srv007 (systemctl restart nfs-server) and the error messages on srv010 stopped. This was monitored in an already open session.

Logging into srv010 did not work smoothly. The login hung. ctrl-c allowed the login to continue but the full login process did not complete. srv010 was in a non-operational state. The df command did not work. Reboot from the command line did not work.

I went to the DAQ barrack and power cycled srv010 manually.

28-Aug-2018 Cooling Water Failure

This is a continuation of a power cut from 26-Aug-2018 at 5:20 am.

Yesterday (Mon Aug 27) about 1800 hrs we were notified of a cooling water failure. This impacted the cooling water in the DAQ barrack. Giovanna powered off np04 computers via the network. Milo manually powered off all the np02 computers. Today I learned that the SPS is also off because of the same cooling water issue. Recovery time is not known at this time. The cooling water will be off for at least the rest of today.

We have turned on enough computing for CRT to resume development while Matt and John are here:

  • np04-srv-007 - nfs server with software
  • tlu and fanout_0 - timing system
  • np04-onl-001 - usb connections to timing system
  • np04-srv-012 - network connection to timing system

26-Aug-2018 Cooling Water Failure

DAQ racks powered off on Sunday 26 at 5:20 am. I received a slew of raid error emails.

Recovery status:

  • Computers in good shape overall.
  • Issues with np04-srv-003 - raid array 2 did not recover - can't run eventbuilder11
  • Issues with np04-srv-004 - raid array missing a spare out of the 11 disks used.

06-Mar-2018 Power Cuts

Powered down DAQ yesterday 05-Mar evening about 1930 hrs.

Giovanna enabled the power strips in the DAQ rack at 1230 06-Mar. For safety the power strips do not automatically turn back on.

DAQ recovery began at 1330 hrs. All computers were up at 1530. During the recovery process I changed the bios so all np04 computers that lose power unexpectedly do not boot when power is restored. There were no issues related to the recovery process.

Recovery order.

  • np04-srv-007 - nfs server with home areas
  • np04-srv-008 - nfs server with backup and scratch areas
  • np04-srv-009 - usb connections to timing and VST
  • np04-srv-010 - run control
  • np04-srv-011 thru 019 - DAQ computers
  • np04-srv-001 - disk server
  • np04-srv-002 - disk server to be configured
  • np04-srv-003
  • np04-srv-004 - still to be configured

Add 1 Gb network switch

Marc and Federico, from CERN networking, were at EHN1 today adding a 1 Gb switch with RJ45 ports. They have returned to their office to check on the connections. Once they are confident the two switches are on the General Purpose Network (GPN) they will send us instructions on how to request IP addresses.

DAQ status on Friday 22-Jun-2017

pddaq-gen01-ctrl0, pddaq-gen02-ctrl0, pddaq-gen03-daq0, pddaq-gen04-daq0, pddaq-gen05-daq0

nfs configuration on pddaq-gen05-daq0 /etc/exports

/daq/artdaq 10.73.136.0/16(rw,sync,no_root_squash,no_all_squash)
#/daq/artdaq 10.193.0.0/16(rw,sync,no_root_squash,no_all_squash)

# restart nfs as follows:
#   sudo exportfs -a ; sudo systemctl restart nfs
# don't forget to make sure the stupid firewall is off forever:
#   sudo systemctl stop firewalld
#   sudo systemctl disable firewalld

On pddaq-gen04-daq0, pddaq-gen03-daq0, pddaq-gen01-ctrl0, and pddaq-gen02-ctrl0:

sudo mount -t nfs 10.73.136.20:/daq/artdaq /daq/artdaq


NFS server is pddaq-gen05-daq0 and /daq/artdaq is mounted on gen01,03,04 from gen05.

Karol updated the RCE configurations to the new IPs.

-- DavidGeoffreySavage - 2017-06-23

Topic revision: r19 - 2019-07-24 - GeoffSavage
 