Scientific Cluster Deployment & Recovery
Using puppet to simplify cluster management
V. Hendrix (1), D. Benjamin (2), Y. Yao (1)
(1) Lawrence Berkeley National Laboratory, Berkeley, CA, USA
(2) Duke University, Durham, NC, USA
Introduction
Deployment, maintenance and recovery of a scientific cluster, which has complex, specialized services, can be a time-consuming task requiring the assistance of Linux system administrators, network engineers and domain experts. Universities and small institutions that have only a part-time FTE with limited experience administering such clusters can be strained by these maintenance tasks.
This work is the result of an effort to maintain a data analysis cluster with minimal effort by a local system administrator. The realized benefit is that the scientist, who is also the local system administrator, can focus on the data analysis instead of the intricacies of managing a cluster. Our work provides a cluster deployment and recovery process based on the puppet configuration engine that allows a part-time FTE to easily deploy and recover entire clusters.
Puppet is a configuration management system (CMS) used widely in computing centers for the automatic management of resources. Domain experts use Puppet's declarative language to define reusable modules for service configuration and deployment.
Our deployment process has three actors: domain experts, a cluster designer and a cluster manager. The domain experts first write the puppet modules for the cluster services. A cluster designer then defines a cluster: this includes creating cluster roles, mapping the services to those roles and determining the relationships between the services. Finally, a cluster manager acquires the resources (machines, networking), enters the cluster input parameters (hostnames, IP addresses) and automatically generates the deployment scripts that puppet uses to configure each machine for its designated role. In the event of a machine failure, the originally generated deployment scripts, along with puppet, can be used to easily reconfigure a new machine.
The cluster definition produced in our cluster deployment process is an integral part of automating cluster deployment in a cloud environment. Our future cloud efforts will further build on this work.
Our Roadmap
This document currently focuses on the deployment of new and existing Tier3g clusters. We plan to do the following work.
Puppet Configuration Work
Step 1
Get the existing puppet modules working with the Argonne Test Cluster
The following are items that we need to be aware of, or to consider, in order to finish this step:
- Refactor Puppet Modules: The puppet modules we have do not deal with the issues listed below.
- head node HTTP port 80 on the private network
- frontier squid regenerates the conf file
- hostname
- the machine where NFS storage is clustered, in terms of xrootd, needs a private node
- a stand-alone xrootd governed under the redirector needs a public hostname
- the LDAP VM needs a /root/bin area for LDAP commands
- the gateway needs to be settable (in this case, the head node)
- Firewall Management: Currently iptables is managed through system-config-securitylevel-tui in the at3_security puppet module. It handles masquerading and the opening of ports. For existing Tier 3 sites, we could do any one of the following:
- NO firewall management in puppet: remove at3_security module from all the machine roles
- Manage the firewall in puppet as is
- Manage the firewall in puppet but add some extra rules to match the site's policy
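The third option could be sketched as keeping the site's extra rules in an iptables-restore fragment that can be reviewed before loading. The interface names, networks and ports below are illustrative assumptions, not the at3_security defaults:

```shell
# Sketch: keep site-specific extra firewall rules in an
# iptables-restore fragment instead of applying them live.
# Interfaces, networks and ports here are illustrative assumptions.
cat > /tmp/at3-extra-rules.v4 <<'EOF'
*nat
# Masquerade the private tier3 network out through eth0
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
# Let the private network reach the head node's HTTP install server
-A INPUT -i eth1 -p tcp --dport 80 -j ACCEPT
# Site-specific extra rule: allow ssh from the campus network only
-A INPUT -p tcp -s 10.0.0.0/8 --dport 22 -j ACCEPT
COMMIT
EOF
# Load (as root) with: iptables-restore --noflush < /tmp/at3-extra-rules.v4
```

Keeping the rules in a file makes it easy to diff a site's policy against what the puppet module would apply.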
- Puppet VM configuration: Currently the only configuration of the puppet node happens during the kickstart installation. We need to consider how we are going to manage the puppet server.
- Should we have it manage itself via the puppet server? Things that might need to be managed are the firewall and networking.
- We need to refresh nodes.pp and pxelinux.cfg.default from the head node when changes are made to the configuration scripts.
- Thoughts for improvements:
- Customized templates and parameters.py are not easy to update with code changes: This results in runtime errors for missing parameters. We could consider the following solutions.
- Default Values in Role Definition: default values for any additional parameters are defined in the role definitions. That way, the only time you add a parameter to parameters.py is if it either does not have a default value or you want to override it.
- Validation of parameters.py, with a user-friendly message telling the user what is missing.
- The script generation process is not completely data driven: It would be nice if the module and role definitions (JSON files) could drive the whole script generation process. Right now there is some role-specific logic in the code that generates nodes.pp. This logic makes up for the lack of robustness in the definition of roles and modules.
- Puppet Module Reuse: When we had the Tier 3 Configuration and Recovery meeting at the end of July we discussed getting input from the T2 and T1 sites that are using puppet. The thought was that the puppet repository could be general enough for any site to use. Are we on track with this? Or do we need to step it up?
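To make the role-definition idea concrete, here is a hypothetical sketch of what a JSON role definition might look like. The field names are assumptions for illustration only; the real schema lives in the at3moduledef package in SVN:

```shell
# Hypothetical shape of a JSON role definition (field names are
# illustrative assumptions; the real schema lives in at3moduledef).
cat > head.role.json <<'EOF'
{
  "role": "head",
  "modules": ["at3_security", "at3_pxe"],
  "parameters": {
    "public_iface": "eth0",
    "private_iface": "eth1",
    "http_port": 80
  }
}
EOF
# Validate that the definition is well-formed JSON
python3 -m json.tool head.role.json
```

With defaults carried in the role definition like this, parameters.py would only need entries that override a default or have none.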
Step 2
Enhancements and Improvements of cluster configuration
- condor config in puppet instead of NFS
Backup and Recovery
The following items are what we think need to be backed up. The puppet modules reside in CERN's public read-only SVN repo.
The following site-specific configurations will be backed up in the BNL repo.
- generated kickstart files and parameters.py
- puppet configs (put host keys in puppet)
- user areas
- LDAP database (cron output needs to be backed up off of the LDAP VM)
- VM definitions (/var/lib XML)
- back up /var/lib/puppet for host keys
- recovery scenarios: "if this crashes, this is what you do"
- grid cert machines: machines holding grid certificates should be backed up
- grid host certs should be in puppet
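The list above could be driven by a small script before pushing to the BNL repo. This is a minimal sketch; the directory paths in the usage comment are assumptions based on the items listed, not a fixed layout:

```shell
# Minimal backup sketch: tar each configured directory into a dated
# archive area. Paths in the usage comment are assumptions based on
# the backup list above.
backup_dirs() {
    dest=$1; shift
    stamp=$(date +%Y%m%d)
    mkdir -p "$dest"
    for d in "$@"; do
        [ -e "$d" ] || continue              # skip paths not on this node
        name=$(echo "$d" | tr '/' '_')
        tar -czf "$dest/${stamp}${name}.tar.gz" "$d" 2>/dev/null
    done
}
# e.g. on the head node:
# backup_dirs /var/backups/at3 /var/lib/puppet /etc/puppet /var/lib/libvirt/qemu
```

Running this from cron would also cover the LDAP dump once it is exported off the LDAP VM.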
New Tier3g Site Setup
Before Start
Connect Network Cables
- Connect your network as follows:
- For Head, Interactive, NFS
- connect eth0 to your outside network
- connect eth1 to the internal network just for tier3
- For Workers:
- connect eth0 to your internal network just for tier3
Create USB Key
Prepare Bootable USB Key
Creating a bootable USB Key
- Get Disk one ISO image of Scientific Linux installation CD or DVD.
- Create a USB Key
Generate Kickstart Files in the USB Key
- Check out the kickstart package:
svn export http://svnweb.cern.ch/guest/atustier3/ks/trunk ks
Create the configuration files for your cluster
cd ks
./generateScripts
- Basic Configuration
Customize the parameters.py mentioned in the output of the previous action. This is where you put your hostnames, IP addresses, etc. The parameters.py file should be self-explanatory, but let me know if it isn't. Please note that this process assumes that the INTERACTIVE and WORKER nodes have a starting IP address in the private subnet which increments for each successive node. If this is an issue, you can make the changes:
vi ./mytier3/src/parameters.py
- Generate all kickstart files and other necessary files
./generateScripts
ls ./mytier3 # you should see all the generated files
cd ./mytier3
- Copy kickstart and other necessary files to your USB key
cp -R /path/to/ks /path/to/usb/mount/ks
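The incrementing-IP assumption mentioned above can be illustrated with a tiny helper. This function is hypothetical, for illustration only; generateScripts implements the increment internally:

```shell
# Illustration of the incrementing-IP assumption: successive INTERACTIVE
# and WORKER nodes get consecutive addresses after a starting IP.
# (Hypothetical helper; generateScripts does this internally.)
next_ip() {
    base=${1%.*}          # e.g. 192.168.100
    last=${1##*.}         # e.g. 10
    echo "$base.$((last + 1))"
}
next_ip 192.168.100.10    # prints 192.168.100.11
```

So if the first WORKER is at 192.168.100.10, the second is expected at 192.168.100.11, and so on; change parameters.py if your site cannot follow that pattern.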
Install Physical HEAD Node
The head node is the gateway and it will contain virtual machines for the PUPPET, PROXY and LDAP nodes. The HEAD node also has an Http server used for network installations of the other nodes in the cluster.
Kickstart Install
Boot into the USB stick and type:
linux ks=hd:sdx/ks/mytier3/kickstart-head.cfg
Note: replace head with the real kickstart file name for the node you are installing; for the head node this is normally its domain name.
Replace sdx with the drive name of your USB disk. Normally it is the one after all your SATA hard disks; e.g. if you have 4 hard disks, your USB key will be sde.
Click "Ignore Drive" when prompted, so the installer does not format the USB drive.
Configure HEAD
- Copy configuration files from USB key
mkdir /mnt/usb
mount /dev/sdx /mnt/usb # where 'x' is the letter of your usb key
export AT3_CONFIG_DIR=/root/atustier3 # or /root/working if you prefer
mkdir -p $AT3_CONFIG_DIR
cp -r /mnt/usb/ks $AT3_CONFIG_DIR
cd $AT3_CONFIG_DIR
- Configure HEAD and install PUPPET
Install VM LDAP
- Open a terminal on HEAD as root. It is important to use -X, which enables X11 forwarding, allowing the "vm_ldap Virt Viewer" to load in your local environment.
ssh -X head ## where 'head' is the name of your head node.
cd $AT3_CONFIG_DIR/mytier3
./crvm-vmldap.sh
Once the installation of the VM finishes, close the X-windows session or Ctrl-C on the head node.
virsh autostart vm_vmldap # puts symlink to the xml file for the VM so that
# the VM can be restarted if the head node is rebooted
- If you would like to reattach to VM LDAP after the viewing session has been cancelled:
ssh -X head
virt-viewer --connect qemu:///system vm_vmldap
Install VM PROXY
- Open a terminal on HEAD as root. It is important to use -X, which enables X11 forwarding, allowing the "vm_proxy Virt Viewer" to load in your local environment.
ssh -X head ## where 'head' is the name of your head node.
cd $AT3_CONFIG_DIR/mytier3
./crvm-vmproxy.sh
Once the installation of the VM finishes, close the X-windows session or Ctrl-C on the head node.
virsh autostart vm_vmproxy # puts symlink to the xml file for the VM so that
# the VM can be restarted if the head node is rebooted
- If you would like to reattach to VM PROXY after the viewing session has been cancelled:
ssh -X head
virt-viewer --connect qemu:///system vm_vmproxy
Install and configure the rest of the cluster
- NFS node
The other nodes are clients of the NFS service so this node should be up and configured with puppet before continuing with the other nodes.
- Configure HEAD node with puppet by following these instructions: Create certificates for puppet client
- Install INTERACTIVE nodes
- Install WORKER nodes
Existing Tier3g Site Setup
Follow these instructions to configure a Tier 3 puppet server and run the puppet clients against the nodes.
On the HEAD Node
ssh -X root@head
yum -y --enablerepo=epel-testing install puppet
yum install python-setuptools
easy_install simplejson
Generate kickstart files for nodes
The script here creates kickstart files, which are of course unnecessary for retrofitting existing Tier 3 sites with puppet; the other files it generates are still needed.
Checkout the kickstart package in a working directory.
export AT3_CONFIG_DIR=/root/atustier3 # or /root/working if you prefer
mkdir -p $AT3_CONFIG_DIR
cd $AT3_CONFIG_DIR
svn export http://svnweb.cern.ch/guest/atustier3/ks/trunk ks
Create the configuration files for your cluster
cd ks
./generateScripts
- Basic Configuration
Customize the parameters.py mentioned in the output of the previous action. This is where you put your hostnames, IP addresses, etc. The parameters.py file should be self-explanatory, but let me know if it isn't. Please note that this process assumes that the INTERACTIVE and WORKER nodes have a starting IP address in the private subnet which increments for each successive node. If this is an issue, you can make the changes:
vi ./mytier3/src/parameters.py
- Generate all kickstart files and other necessary files
./generateScripts
ls ./mytier3 # you should see all the generated files
cd ./mytier3
- Configure HEAD and install PUPPET
On Any WORKER Node
Install puppet
yum -y --enablerepo=epel-testing install puppet
Now you can run puppet on the WORKER nodes by following:
Create certificates for puppet clients
Supplementary Installation Notes
Configure HEAD and install PUPPET
- Minimally configure the HEAD node before installing the puppet server
./apply-puppet.sh head-init.pp # where head is the name of your head node
- Kickstart installation of puppet VM on HEAD node
Instructions to come. The following script makes the following assumptions
- On the PUPPET Node
Create certificates for puppet clients
- On the PUPPET client
First run puppet on the puppet client. This will create a certificate request with the puppet CA and wait 30 seconds between retries:
puppetd --no-daemonize --test --debug --waitforcert 30
- On the PUPPET server
Now sign the request
puppetca --list # this will tell you of the waiting requests
puppetca --sign puppetclient
- On the PUPPET client
You should see the puppet agent startup after 30 seconds and run successfully. After you have confirmed that the puppet client runs successfully, do the following:
chkconfig puppet on
service puppet start
- On the PUPPET server
You should turn on the puppetmaster service
chkconfig puppetmaster on
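The --waitforcert behaviour on the client amounts to a retry loop: keep asking until the CA has signed the certificate. As a generic sketch only (puppetd implements this itself; the cert path in the usage comment is an example):

```shell
# Generic retry loop, analogous to what --waitforcert 30 does on the
# client: retry an action at a fixed interval until it succeeds or the
# attempt budget runs out. (Sketch only; puppetd implements this itself.)
wait_for() {
    interval=$1; attempts=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
# e.g.: wait_for 30 10 test -f /var/lib/puppet/ssl/certs/myhost.pem
```

This is why the client appears to hang until puppetca --sign is run on the server: it is simply polling.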
Puppet Server SETUP during kickstart installation
The following is how the puppet server is setup during the kickstart installation of the puppet VM
#########################
# Puppet Configuration
cd /etc/puppet
mkdir modules
# Checkout puppet definitions for the whole cluster
svn export http://svnweb.cern.ch/guest/atustier3/at3moduledef/trunk at3moduledef
svn export http://svnweb.cern.ch/guest/atustier3/puppetrepo/trunk puppetrepo
# Checkout all modules for use in the WORKER nodes only
# # AUTOMATIC CHECKOUT WITH puppetrepo.py
python puppetrepo/puppetrepo.py --action export --moduledef=/etc/puppet/at3moduledef/modules.def --moduledir=/etc/puppet/modules/ --modulesppfile=/etc/puppet/manifests/modules.pp --loglevel=info
cd /etc/puppet
cp at3moduledef/auth.conf at3moduledef/fileserver.conf ./
cp at3moduledef/site.pp manifests/
## Copy config files over to puppet server from HEAD node
wget -O /etc/puppet/manifests/nodes.pp http://192.168.100.1:8080/nodes.pp
wget -O /etc/puppet/modules/at3_pxe/templates/default.erb http://192.168.100.1:8080/pxelinux.cfg.default
chown -R puppet:puppet /etc/puppet
chmod -R g+rw /etc/puppet
Updating Puppet Modules
You perform these commands to update the puppet modules from the SVN repository:
cd /etc/puppet
svn export --force http://svnweb.cern.ch/guest/atustier3/at3moduledef/trunk at3moduledef
python puppetrepo/puppetrepo.py --action export --moduledef=/etc/puppet/at3moduledef/modules.def --moduledir=/etc/puppet/modules/ --modulesppfile=/etc/puppet/manifests/modules.pp --loglevel=info --svnopts=--force
Checking your configuration files into BNL usatlas-cfg
- Get access to the SVN Repository
- Send the Distinguished Name (DN) of your Grid Certificate to Doug Benjamin <benjamin@phy.duke.edu>
- Create a .p12 file for subversion credentialed access
- These steps make sure that a password is never stored in the clear.
emacs -nw ~/.subversion/servers
Add the following sections:
[global]
store-passwords = yes
store-plaintext-passwords = no
store-ssl-client-cert-pp-plaintext = no
[groups]
usatlas = svn.usatlas.bnl.gov
[usatlas]
ssl-client-cert-file = /path/to/your/user-cert.p12
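The user-cert.p12 that ssl-client-cert-file points at can be built from a grid certificate/key pair with openssl. In the sketch below a throwaway self-signed pair stands in for your real usercert.pem/userkey.pem (typically under ~/.globus) so the commands run end to end; the paths and passphrase are examples:

```shell
# Sketch of building user-cert.p12. Normally -in/-inkey would point at
# your grid usercert.pem/userkey.pem (e.g. under ~/.globus); here a
# throwaway self-signed pair stands in so the commands run end to end.
workdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=demo" \
    -keyout "$workdir/userkey.pem" -out "$workdir/usercert.pem" 2>/dev/null
openssl pkcs12 -export -passout pass:changeit \
    -in "$workdir/usercert.pem" -inkey "$workdir/userkey.pem" \
    -out "$workdir/user-cert.p12"
chmod 600 "$workdir/user-cert.p12"   # keep the bundle private
```

Subversion will prompt once for the .p12 passphrase and, with the servers settings above, never store it in plaintext.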
Setup 1st LDAP User "atlasadmin"
#LDAP Configuration atlas tier3
#Login to any node (with a public IP)
cat > tt <<EOF
dn: dc=mytier3,dc=com
objectClass: dcObject
objectClass: organization
dc: mytier3
o: mytier3

dn: ou=People,dc=mytier3,dc=com
ou: People
objectClass: top
objectClass: organizationalUnit

dn: ou=Group,dc=mytier3,dc=com
ou: Group
objectClass: top
objectClass: organizationalUnit

dn: cn=ldapusers,ou=Group,dc=mytier3,dc=com
objectClass: posixGroup
objectClass: top
cn: ldapusers
userPassword: {crypt}x
gidNumber: 9000

dn: cn=atlasadmin,ou=People,dc=mytier3,dc=com
cn: atlasadmin
objectClass: posixAccount
objectClass: shadowAccount
objectClass: inetOrgPerson
sn: User
uid: atlasadmin
uidNumber: 1025
gidNumber: 9000
homeDirectory: /export/home/atlasadmin
userPassword: {SSHA}MQstDGq3bTK1Fle+iAa+p4jYgeyl1RIG
EOF
ldapadd -x -D "cn=root,dc=mytier3,dc=com" -c -w abcdefg -f tt -H ldap://ldap/
ldapsearch -x -b 'dc=mytier3,dc=com' '(objectclass=*)' -H ldap://ldap/
mkdir /export/home/atlasadmin;chown atlasadmin:ldapusers /export/home/atlasadmin
Test if condor works:
Log in as atlasadmin on int1:
cat > simple.c <<EOF
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}
EOF
gcc -o simple simple.c
cat > submit <<EOF
Universe = vanilla
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
Queue
EOF
condor_submit submit
Kickstart Install a Node
There are several ways to perform a kickstart installation once you have the HEAD node up and minimally configured. Choose the one that fits your machines.
- USB Key
You may duplicate the USB key to install the rest of the nodes in parallel. Using the previously made USB key, boot the machine you are installing and use the command below, replacing xxx with the short hostname of the node:
linux ks=hd:sdx/ks/mytier3/kickstart-xxx.cfg
After reboot, Create certificates for puppet clients
- PXE Install NFS, WORKER and INTERACTIVE Nodes
If your machines are PXE capable. You may enable a PXE boot in the Bios. Make sure that ethernet cable for the PXE boot is connected to the private network. Otherwise you will not be able to connect to the HEAD node which is the PXE Server.
- Enable and Boot via PXE
- Choose from the menu which node it is; it will then automatically kickstart-install the node.
- After reboot, Create certificates for puppet clients
- Check /var/log/messages for error messages
Major updates:
--
ValHendrix - 16-Sep-2011