Logbook Entries
24-July-2019 np04daq keytab expired
The new keytab file is available on the np04 cluster at ~np04daq/krb5/np04daq.keytab. This file is actually a soft link to np04daq.keytab.20190724 in the same directory.
A keytab file is used in place of a username/password for Kerberos access to the np04daq account. The DAQ operates as the np04daq user.
The keytab file stopped working when I intentionally let the password on the np04daq service account expire. I had thought the keytab file would keep working.
-
Log in as root
-
cern-get-keytab --keytab np04daq.keytab.20190724 --user --login np04daq
-
Enter the np04daq password when prompted.
What I don't know is whether the keytab file would have expired if I had updated the password instead of letting it expire. I suspect it would have.
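As a sanity check after regenerating, the keytab can be exercised without a password (a sketch; klist and kinit are standard Kerberos tools, and the principal name is assumed to be np04daq in the default realm):

```shell
# Sanity checks on the new keytab (paths from this entry; run as np04daq).
KT=~np04daq/krb5/np04daq.keytab
if [ -e "$KT" ]; then
    readlink "$KT"            # should print np04daq.keytab.20190724
    klist -k "$KT"            # list the principals stored in the keytab
    kinit -kt "$KT" np04daq   # obtain a ticket without typing a password
fi
```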
20-Jun-2019 Timing artdaq builds
mrb z
mrbsetenv
time mrb install -j8
time mrb install -j32
np04-srv-023 builds
np04daq on np04-srv-023 in /nfs/sw/work_dirs/geoff_v330_beta
time mrb install -j8
real 5m32.037s
user 28m54.031s
sys 3m50.150s
time mrb install -j8 &>./log-build-8.txt
real 5m30.889s
user 28m50.026s
sys 3m51.627s
time mrb install -j32
real 3m24.388s
user 39m16.966s
sys 4m43.713s
time mrb install -j32 &>./log-build-32.txt
real 3m23.804s
user 39m17.094s
sys 4m43.530s
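From the real times above, the jump from -j8 to -j32 buys roughly a 1.6x wall-clock speedup, at the cost of more total CPU (user time grows from ~29 to ~39 minutes). A quick check of the arithmetic:

```shell
# Wall-clock speedup of -j32 over -j8, from the 'real' times above.
j8=$((5*60 + 32))    # -j8  real time: 5m32s = 332 s
j32=$((3*60 + 24))   # -j32 real time: 3m24s = 204 s
awk -v a="$j8" -v b="$j32" 'BEGIN { printf "%.2fx speedup\n", a / b }'
# prints 1.63x speedup
```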
np04-srv-019 builds
The np04-srv-019 build spends a lot of time at this step:
Scanning dependencies of target artdaq-core_Core
06-May-2019 Backups failing
[root@np04-srv-008 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cc_np04--srv--008-root 50G 13G 34G 28% /
devtmpfs 7.8G 0 7.8G 0% /dev
tmpfs 7.8G 0 7.8G 0% /dev/shm
tmpfs 7.8G 18M 7.8G 1% /run
tmpfs 7.8G 0 7.8G 0% /sys/fs/cgroup
/dev/sda2 976M 248M 662M 28% /boot
/dev/mapper/cc_np04--srv--008-rscratch 3.0T 2.0T 849G 71% /rscratch
/dev/mapper/cc_np04--srv--008-back 3.0T 2.9T 0 100% /back
np04-srv-007:/home 3.0T 644G 2.2T 23% /nfs/home
np04-srv-007:/sw 3.0T 2.1T 741G 75% /nfs/sw
tmpfs 1.6G 0 1.6G 0% /run/user/0
[root@np04-srv-008 ~]# cd /back
[root@np04-srv-008 back]# ls
data home np04-srv-009-etc np04-srv-009-opt np04-srv-014 wincc
database lost+found np04-srv-009-home np04-srv-013 sw
[root@np04-srv-008 back]# du -sh
^C
[root@np04-srv-008 back]# du -h --max-depth=1
587M ./np04-srv-009-home
680G ./home
1.8T ./sw
238G ./database
47M ./np04-srv-009-etc
41G ./np04-srv-013
975M ./wincc
16K ./lost+found
6.5G ./data
114G ./np04-srv-014
546M ./np04-srv-009-opt
2.9T .
[root@np04-srv-008 back]# ls -l
total 176
drwxr-xr-x 2 np04daq np-comp 126976 May 2 2018 data
drwxr-xr-x 3 np04daq np-comp 4096 Jan 17 2018 database
drwxr-xr-x 4 root root 4096 Nov 8 2017 home
drwx------ 2 root root 16384 Aug 18 2017 lost+found
drwxr-xr-x 3 root root 4096 Nov 10 2017 np04-srv-009-etc
drwxr-xr-x 3 root root 4096 Nov 10 2017 np04-srv-009-home
drwxr-xr-x 3 root root 4096 Nov 10 2017 np04-srv-009-opt
drwxr-xr-x 5 root root 4096 Mar 6 19:12 np04-srv-013
drwxr-xr-x 5 root root 4096 Mar 7 16:56 np04-srv-014
drwxr-xr-x 3 root root 4096 Nov 7 2017 sw
drwxr-xr-x 4 root root 4096 Nov 14 2017 wincc
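To see at a glance what is filling the 100%-full /back, the same du output can be sorted largest-first (a sketch built from the commands above):

```shell
# Rank the top-level directories under /back by size, largest first.
du -h --max-depth=1 /back 2>/dev/null | sort -rh | head -n 10
```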
16-Oct-2018 DAQ Operations
The DAQ group recommends keeping the DAQ running as much as possible to reduce the time spent starting a run.
- During beam running, if beam is not delivered for more than 30 minutes, stop the run and start a new run immediately, so we are already running when beam returns.
- During extended non-beam running, start a new run 30 minutes before beam is anticipated to arrive.
26-Sep-2018 Configuration upgrades
- Move logstash
- Supervisord configuration updates.
- Rearrange disk layout on srv011/012/013/014.
- Try to add spare disk to array on np04-srv-002.
- Remove event builders running on np04-srv-004.
In addition, I have not been able to bring the two raid arrays back. I'm trying to get assistance from CERN IT. When we come back tomorrow I'd like to turn off the event builders on srv004 so I can run additional tests without the risk of bringing the DAQ down. If I can fix the array on srv004, I would then put srv004 into production and remove srv003, so I can work on the raid array there that is not working.
Computing tasks performed this morning. Thanks to Roland for helping today. Roland is the architect of the logstash and supervisord configuration management.
- Logstash was moved to np04-srv-010 from np04-srv-014. srv014 runs board readers and srv010 is a utility computer. Logstash forwards messages from all the DAQ log files to kibana for display.
- Supervisord configuration updates for online monitoring on np04-srv-023. We found some other issues in the supervisord configuration. Supervisord restarts applications automatically if they stop unexpectedly.
- Disk layouts on srv011/012/013/014 were updated. We are having issues with root partitions filling to 100%; when this happens, runs will not start. The other np04 servers were installed with separate partitions for /log, /scratch, and /home, while these four servers instead have larger /home partitions.
- srv011/srv012: created log and scratch directories in /home, then linked them to /log and /scratch.
- srv013/srv014 already had /scratch partitions, so I created log directories in /home and linked them to /log.
- Enrico removed event builders 13,15, and 16 from the RC so I can work on the disk array that is having issues.
- I did not have time to try and add in the spare disk back into md0 on np04-srv-002.
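The relocation done on srv011/srv012 amounts to the following sketch (ROOT would be / on the real servers; it defaults to a scratch directory here so the sketch is safe to run):

```shell
# Move /log and /scratch onto the large /home partition and leave
# symlinks at the old paths (as done on srv011/srv012).
ROOT=${ROOT:-$(mktemp -d)}
mkdir -p "$ROOT/home/log" "$ROOT/home/scratch"
ln -sfn "$ROOT/home/log" "$ROOT/log"
ln -sfn "$ROOT/home/scratch" "$ROOT/scratch"
ls -l "$ROOT/log" "$ROOT/scratch"   # both should now be symlinks into /home
```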
21-Sep-2018 Shift and run plan for the weekend
Nikos tuned the beam for 1, 2, 3, 4, 5, 6, and 7 GeV. Beamline control moved to CESAR in the NP04 control room.
David Rivera (trigger) tested trigger configurations with PD system.
Run 4581 - Cosmic, PD only run.
Run 4583 - Cosmic, PD only run.
Beam down for 2 hours.
Started run 4584 just at the end of the shift.
Done
- Removed all WIB, RCE, and FELIX components from the run.
- Turn off beam
- 30 min - Cosmics run - disable trigger_0, np04_WibsReal_Ssps00125
- Turn on 1 GeV beam
- Then discovered there is no beam from CPS because the POPS is down.
Status
- Beam line configured for 1 GeV
- DAQ configured for PD only. Only SSP components included.
Next
- Overnight - 1 GeV - enable trigger_0, np04_WibsReal_Ssps_BeamTrig_00012
- When purity monitor wants to run stop the run and start a new one.
- Turn off beam
- 30 min - Cosmics run - disable trigger_0, np04_WibsReal_Ssps00125
- Turn on 7 GeV beam
- 2 hours - 7 GeV - enable trigger_0, np04_WibsReal_Ssps_BeamTrig_4x_prescale_00003
20-Sep-2018 Move Run Control to srv024 (from srv010)
- This morning run control was moved to np04-srv-024 (from np04-srv-010). Along with run control we moved the artdaq configuration database, the DIM dns server (dnsd), and the inhibit master.
- The configuration was updated to np04_WibsReal_Sssps00124. This configuration has the location of dnsd and the inhibit master updated to srv024.
- Configuration database access is now from srv024 (instead of srv010).
- This move puts Run Control on a more powerful computer and is part of our plan to improve computer security.
- stop dnsd on np04-srv-010
- start dnsd on np04-srv-024
- Giovanna verified configurations can be retrieved from the database and inserted into the database.
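DIM clients locate the name server through the standard DIM_DNS_NODE environment variable, so after the move it needs to point at the new host (a sketch; the variable name is standard DIM, the hostname is from this entry):

```shell
# Point DIM clients at the dnsd now running on np04-srv-024.
export DIM_DNS_NODE=np04-srv-024
```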
artdaq db
[root@np04-srv-010 ~]# systemctl stop webconfigeditor@cern_pddaq_v3x_db.service
[root@np04-srv-010 ~]# systemctl stop mongodbserver@cern_pddaq_v3x_db.service
[root@np04-srv-010 ~]# systemctl disable webconfigeditor@cern_pddaq_v3x_db.service
Removed symlink /etc/systemd/system/multi-user.target.wants/webconfigeditor@cern_pddaq_v3x_db.service.
[root@np04-srv-010 ~]# systemctl disable mongodbserver@cern_pddaq_v3x_db.service
Removed symlink /etc/systemd/system/multi-user.target.wants/mongodbserver@cern_pddaq_v3x_db.service.
[root@np04-srv-010 ~]# systemctl status mongodbserver@cern_pddaq_v3x_db.service
● mongodbserver@cern_pddaq_v3x_db.service - Mongo database service
Loaded: loaded (/etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Sep 20 12:02:17 np04-srv-010 systemd[1]: Stopping Mongo database service...
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_DATABASE_NAME is set to 'cern_pddaq_v3x_db'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_UPS_VER is set to 'v3_4_6'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_UPS_QUAL is set to 'e14:prof'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: mongod found: '/nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod'
Sep 20 12:02:17 np04-srv-010 mongod-ctrl.sh[21916]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/mongod-201809201202.log
Sep 20 12:02:27 np04-srv-010 mongod-ctrl.sh[21916]: Stopping mongod: [ OK ]
Sep 20 12:02:27 np04-srv-010 systemd[1]: Stopped Mongo database service.
[root@np04-srv-010 ~]# systemctl status webconfigeditor@cern_pddaq_v3x_db.service
● webconfigeditor@cern_pddaq_v3x_db.service - WebConfigEditor service
Loaded: loaded (/etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_UPS_VER is set to 'v1_01_00'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_UPS_QUAL is set to 'e14:prof:s50'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:02:02 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: WEBEDITOR_BASE_PORT is set to '8880'
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: node found: '/nfs/sw/artdaq/products/nodejs/v4_5_0/Linux64bit/bin/node'
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/webconfigeditor-201809201202.log
Sep 20 12:02:03 np04-srv-010 node[6842]: DIGEST-MD5 common mech free
Sep 20 12:02:03 np04-srv-010 webconfigeditor-ctrl.sh[18515]: Stopping Web Config Editor: [ OK ]
Sep 20 12:02:03 np04-srv-010 systemd[1]: Stopped WebConfigEditor service.
[root@np04-srv-024 ~]# systemctl enable mongodbserver@cern_pddaq_v3x_db.service
Created symlink from /etc/systemd/system/multi-user.target.wants/mongodbserver@cern_pddaq_v3x_db.service to /etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service.
[root@np04-srv-024 ~]#
[root@np04-srv-024 ~]# systemctl enable webconfigeditor@cern_pddaq_v3x_db.service
Created symlink from /etc/systemd/system/multi-user.target.wants/webconfigeditor@cern_pddaq_v3x_db.service to /etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service.
[root@np04-srv-024 ~]# systemctl start mongodbserver@cern_pddaq_v3x_db.service
[root@np04-srv-024 ~]# systemctl start webconfigeditor@cern_pddaq_v3x_db.service
[root@np04-srv-024 ~]# systemctl status mongodbserver@cern_pddaq_v3x_db.service
● mongodbserver@cern_pddaq_v3x_db.service - Mongo database service
Loaded: loaded (/etc/systemd/system/mongodbserver@cern_pddaq_v3x_db.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-09-20 12:06:27 CEST; 25s ago
Process: 50379 ExecStart=/nfs/sw/database/mongod-ctrl.sh start (code=exited, status=0/SUCCESS)
Main PID: 50757 (mongod)
CGroup: /system.slice/system-mongodbserver.slice/mongodbserver@cern_pddaq_v3x_db.service
└─50757 /nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod --dbpath=/nfs/sw/database/cern_pddaq_v3x_db/data --pidfil...
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_BASE_DIR is set to '/nfs/sw/database'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_UPS_VER is set to 'v3_4_6'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_UPS_QUAL is set to 'e14:prof'
Sep 20 12:06:20 np04-srv-024 mongod-ctrl.sh[50379]: Info: MONGOD_PORT is set to '27037'
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Info: mongod found: '/nfs/sw/artdaq/products/mongodb/v3_4_6/Linux64bit+3.10-2.17-e14-prof/bin/mongod'
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Info: logfile=/nfs/sw/database/cern_pddaq_v3x_db/logs/mongod-201809201206.log
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: Starting mongod: about to fork child process, waiting until server is ready for connections.
Sep 20 12:06:21 np04-srv-024 mongod-ctrl.sh[50379]: forked process: 50757
Sep 20 12:06:27 np04-srv-024 mongod-ctrl.sh[50379]: child process started successfully, parent exiting
Sep 20 12:06:27 np04-srv-024 systemd[1]: Started Mongo database service.
[root@np04-srv-024 ~]# systemctl status webconfigeditor@cern_pddaq_v3x_db.service
● webconfigeditor@cern_pddaq_v3x_db.service - WebConfigEditor service
Loaded: loaded (/etc/systemd/system/webconfigeditor@cern_pddaq_v3x_db.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Thu 2018-09-20 12:06:45 CEST; 19s ago
Process: 52848 ExecStop=/nfs/sw/database/webconfigeditor-ctrl.sh stop (code=exited, status=0/SUCCESS)
Process: 51654 ExecStart=/nfs/sw/database/webconfigeditor-ctrl.sh start (code=exited, status=0/SUCCESS)
Main PID: 52138 (code=exited, status=1/FAILURE)
Sep 20 12:06:45 np04-srv-024 systemd[1]: Unit webconfigeditor@cern_pddaq_v3x_db.service entered failed state.
Sep 20 12:06:45 np04-srv-024 systemd[1]: webconfigeditor@cern_pddaq_v3x_db.service failed.
Firewall Sets
The firewall has been updated by Giovanna to allow the CERN linux support servers access to our servers. Technically this means that the linuxsoft set has been added to the "GPN services exposed to NP04" set.
I have tested this on np04-onl-002. All my tests worked.
+ ping np04-onl-002 from lxplus
+ ssh from lxplus to np04-onl-002 as root, dsavage, and np04daq
The linuxsoft set consists of:
IT LICENCE SERVERS
IT LINUXSOFT
IT NETWORK SERVICES
NICE_DFS
NICE_DOMAINCONTROLLERS
NICE_LDAP
NICE_XLDAP
14-Sep-2018 Reboot np04-srv-010
The run control developers were having issues working on np04-srv-010 this morning. After restarting software and seeing no change, they rebooted np04-srv-010 to try to resolve the problem.
Looking at the monitoring of the np04 computers, we see a gradual upward trend in memory usage. We believe the cause is mongodb, the database used to store configurations. The out-of-memory killer terminates the process using the most memory.
I have re-enabled swap on this computer. We had turned off swapping on the DAQ computers to ensure more reliable performance, but srv010 does not run applications for transferring data, so disabling swap there is not needed.
The artdaq team is also looking into reducing the memory needed for the configuration database.
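Re-enabling swap amounts to the following sketch (assumes the swap entries were still present in /etc/fstab; swapon and free are standard util-linux/procps tools):

```shell
# Reactivate all swap areas listed in /etc/fstab (root required), then confirm.
sudo swapon -a 2>/dev/null || echo "swapon requires root" >&2
swapon --show   # list active swap devices (empty if none are configured)
free -h         # the Swap: line should now show a non-zero total
```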
Sep 10 02:06:10 np04-srv-010 kernel: Out of memory: Kill process 14124 (mongod) score 424 or sacrifice child
Sep 10 02:06:10 np04-srv-010 kernel: Killed process 14124 (mongod) total-vm:7872428kB, anon-rss:6868300kB, file-rss:0kB, shmem-rss:0kB
Sep 11 02:06:07 np04-srv-010 kernel: Out of memory: Kill process 44940 (mongod) score 433 or sacrifice child
Sep 11 02:06:07 np04-srv-010 kernel: Killed process 44940 (mongod) total-vm:8168792kB, anon-rss:7020916kB, file-rss:0kB, shmem-rss:0kB
Sep 12 02:08:01 np04-srv-010 kernel: Out of memory: Kill process 30647 (mongod) score 448 or sacrifice child
Sep 12 02:08:01 np04-srv-010 kernel: Killed process 30647 (mongod) total-vm:8337200kB, anon-rss:7260292kB, file-rss:0kB, shmem-rss:0kB
Sep 12 10:27:36 np04-srv-010 kernel: Out of memory: Kill process 20488 (mongod) score 444 or sacrifice child
Sep 12 10:27:36 np04-srv-010 kernel: Killed process 20488 (mongod) total-vm:8223896kB, anon-rss:7194596kB, file-rss:0kB, shmem-rss:0kB
Sep 12 10:27:36 np04-srv-010 kernel: Out of memory: Kill process 20490 (Backgro.kSource) score 444 or sacrifice child
Sep 12 10:27:36 np04-srv-010 kernel: Killed process 20490 (Backgro.kSource) total-vm:8223896kB, anon-rss:7194712kB, file-rss:0kB, shmem-rss:0kB
Sep 13 02:09:38 np04-srv-010 kernel: Out of memory: Kill process 12295 (mongod) score 251 or sacrifice child
Sep 13 02:09:38 np04-srv-010 kernel: Killed process 12295 (mongod) total-vm:4997296kB, anon-rss:4073428kB, file-rss:0kB, shmem-rss:0kB
Sep 14 02:10:31 np04-srv-010 kernel: Out of memory: Kill process 14679 (mongod) score 361 or sacrifice child
Sep 14 02:10:31 np04-srv-010 kernel: Killed process 14679 (mongod) total-vm:6774520kB, anon-rss:5848112kB, file-rss:0kB, shmem-rss:0kB
Sep 14 02:10:31 np04-srv-010 kernel: Out of memory: Kill process 14686 (Backgro.kSource) score 361 or sacrifice child
Sep 14 02:10:31 np04-srv-010 kernel: Killed process 14686 (Backgro.kSource) total-vm:6774520kB, anon-rss:5848308kB, file-rss:0kB, shmem-rss:0kB
07-Sep-2018 New computers
The installation of four new computers was completed today. The four computers are housed in the same enclosure.
Computers are np04-srv-021, np04-srv-022, np04-srv-023, np04-srv-024.
Allocations:
- np04-srv-021 - felix
- np04-srv-022
- np04-srv-023 - online monitoring (monet)
- np04-srv-024 - other services currently scattered across DAQ computers:
- DAQ error messages (logstash)
- System monitoring (prometheus)
- File transfer to EOS (FTS-lite)
05-Sep-2018 CRT computer network interfaces
Moved the 10 Gb interface on the CRT computer to be np04-crt-001. (np04-crt-001 was originally the 1 Gb interface on the CRT computer). This move accomplishes two items.
- For ssh to work correctly, the DNS name needs to match the device name in landb (the CERN network database). This requirement was unexpected, and a service desk ticket did not resolve the issue, so the swap was needed.
- For artdaq to work correctly all DAQ computers need to be connected to the router. This is for multicast support.
The 1 Gb interface is now np04-crt-001-ctrl; logging in through it requires entering your password.
04-Sep-2018 np04-srv-004 system disk filled
The system disk on np04-srv-004 filled up. This prevented DAQ runs from starting.
The cause was a raid array failure following last week's power outages. The event builder (eb14) corresponding to the failed raid array was selected for use in the run; instead of writing data to the large raid array, it wrote to the mount point, which is on the system disk, and the system disk filled up.
The temporary fix was to remove eb14 from the run control. Thanks to Enrico Gamberini for doing this. Once the raid array is restored the event builder will be reenabled in run control.
Event builder 11 was also removed from run control for the same reason. The raid array failed following the power outage.
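A guard in the data-writing startup path would prevent a recurrence (a sketch; DATA_DIR is a hypothetical output path, not taken from this entry):

```shell
# Guard sketch: refuse to write unless the output directory is a real mount,
# so a failed raid array can't silently fill the system disk.
# DATA_DIR is hypothetical; it defaults to / here so the sketch runs anywhere.
DATA_DIR=${DATA_DIR:-/}
if mountpoint -q "$DATA_DIR"; then
    echo "$DATA_DIR is a mounted filesystem; safe to write"
else
    echo "ERROR: $DATA_DIR is not mounted; refusing to write" >&2
fi
```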
01-Sep-2018 High number of DNS queries
Admins received emails about a high number of dns queries for srv013 and srv014 at Friday, August 31, 2018 21:09. Email is below.
This indicates the system is not configured correctly for the nscd service. Indeed, when I checked today, the two systems were not configured to use nscd. The nscd service can be enabled with the CERN configuration tool, locmap. All the other computers I checked today were running nscd.
I created a configuration in the np04 configuration management to configure all the services controlled through locmap, with nscd enabled. A configuration run over all the np04 servers succeeded, and nscd is now running on srv013 and srv014.
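The locmap steps are roughly the following (a sketch; the --enable/--configure flags are from memory of the locmap CLI and should be checked with locmap --help):

```shell
# Enable and apply the nscd module via locmap (CERN's configuration tool).
if command -v locmap >/dev/null; then
    locmap --enable nscd
    locmap --configure nscd
else
    echo "locmap is not available on this host" >&2
fi
```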
Dear np04-onl-admins@cern.ch
You are listed as responsible for np04-srv-013 (.cern.ch).
Our DNS servers are warning that this host has been sending a VERY HIGH
rate of queries for the last hour (78 requests/sec).
Please, check the cause of this problem and sort it out
since it impacts the central DNS service performance. Please
also consult http://service-dns.web.cern.ch/service-dns/faq.asp
for information on setting up dns for high demanding clients.
Should this problem continue, we will have to block this system
to avoid performance problems in the central DNS service.
Thanks in advance,
CERN Network Support
More info:
10.73.136.33 queried 242715 times name livlhcb010.dyndns.cern.ch
10.73.136.33 queried 23874 times name np04-srv-013.cern.ch
10.73.136.33 queried 11937 times name 29.221.141.128.in-addr.arpa
10.73.136.33 queried 11937 times name 33.136.73.10.in-addr.arpa
10.73.136.33 queried 4 times name lxplus011.cern.ch
np04-onl-admins@cern.ch
30-Aug-2018 NFS failure
There were many complaints this morning about run control working slowly.
srv010 is the run control computer. These NFS error messages appeared in the srv010 system log files.
-
NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Restarting nfs-client on srv010 did not fix the problem.
-
systemctl restart nfs-client
srv007 is the main nfs server with all the DAQ software. Restarted nfs server on srv007.
-
systemctl restart nfs-server
Error messages on srv010 have stopped. This was monitored in an already-open session.
Logging into srv010 did not work smoothly: the login hung, and ctrl-c allowed it to continue, but the full login process did not complete. srv010 was in a non-operational state; the df command did not work, and rebooting from the command line did not work.
I went to the DAQ barrack and power cycled srv010 manually.
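A bounded-wait check can detect a stalled NFS mount without the check itself hanging (a sketch using the coreutils timeout command):

```shell
# Detect an unresponsive NFS mount without letting the check itself hang.
if timeout 5 df -h /nfs/sw >/dev/null 2>&1; then
    echo "/nfs/sw responding"
else
    echo "/nfs/sw is not responding (df timed out or failed)" >&2
fi
```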
28-Aug-2018 Cooling Water Failure
This is a continuation of a power cut from 26-Aug-2018 at 5:20 am.
Yesterday (Mon Aug 27) about 1800 hrs we were notified of a cooling water failure. This impacted the cooling water in the DAQ barrack. Giovanna powered off np04 computers via the network. Milo manually powered off all the np02 computers. Today I learned that the SPS is also off because of the same cooling water issue. Recovery time is not known at this time. The cooling water will be off for at least the rest of today.
We have turned on enough computing for CRT to resume development while Matt and John are here.
nfs server with software - np04-srv-007.
timing system - tlu and fanout_0
np04-onl-001 - usb connections to timing system
np04-srv-012 - network connection to timing system
26-Aug-2018 Cooling Water Failure
DAQ racks powered off on Sunday, 26-Aug, at 5:20 am. I received a slew of raid error emails.
Recovery status:
- Computers in good shape overall.
- Issues with np04-srv-003 - raid array 2 did not recover - can't run event builder 11
- Issues with np04-srv-004 - raid array is missing a spare out of the 11 disks used.
06-Mar-2018 Power Cuts
Powered down DAQ yesterday 05-Mar evening about 1930 hrs.
Giovanna enabled the power strips in the DAQ rack at 1230 06-Mar. For safety the power strips do not automatically turn back on.
DAQ recovery began at 1330 hrs, and all computers were up at 1530. During the recovery process I changed the BIOS settings so that np04 computers do not boot automatically when power is restored after an unexpected power loss. There were no issues related to the recovery process.
Recovery order.
- np04-srv-007 - nfs server with home areas
- np04-srv-008 - nfs server with backup and scratch areas
- np04-srv-009 - usb connections to timing and VST
- np04-srv-010 - run control
- np04-srv-011 thru 019 - DAQ computers
- np04-srv-001 - disk server
- np04-srv-002 - disk server to be configured
- np04-srv-003
- np04-srv-004 - still to be configured
Add 1 Gb network switch
Marc and Federico, from CERN networking, were at EHN1 today adding a 1 Gb switch with RJ45 ports. They have returned to their office to check on the connections. Once they are confident the two switches are on the General Purpose Network (GPN) they will send us instructions on how to request IP addresses.
DAQ status on Friday 22-Jun-2017
pddaq-gen01-ctrl0
pddaq-gen02-ctrl0
pddaq-gen03-daq0
pddaq-gen04-daq0
pddaq-gen05-daq0
nfs configuration on pddaq-gen05-daq0 /etc/exports
/daq/artdaq 10.73.136.0/16(rw,sync,no_root_squash,no_all_squash)
#/daq/artdaq 10.193.0.0/16(rw,sync,no_root_squash,no_all_squash)
# restart nfs as follows:
# sudo exportfs -a ; sudo systemctl restart nfs
# don't forget to make sure the stupid firewall is off forever
# sudo systemctl stop firewalld
# sudo systemctl disable firewalld
On pddaq-gen04-daq0, pddaq-gen03-daq0, pddaq-gen01-ctrl0, pddaq-gen02-ctrl0
sudo mount -t nfs 10.73.136.20:/daq/artdaq /daq/artdaq
Computers:
pddaq-gen01-ctrl0
pddaq-gen02-ctrl0
pddaq-gen03-daq0
pddaq-gen04-daq0
pddaq-gen05-daq0
NFS server is pddaq-gen05-daq0 and /daq/artdaq is mounted on gen01,03,04 from gen05.
Karol updated the RCE configurations to the new IPs.
--
DavidGeoffreySavage - 2017-06-23