What is Roger?
Roger is a front end to a Riak key/value store, with authentication and authorization that match CERN's machine ownership model. It stores key/value pairs useful to monitoring; these can be updated or queried from anywhere at any time, and can feed into Puppet compilations. On the machine side, they can be used to fire off actions on value transitions.
What keys & values exactly?
Here's an example entry:
$ curl --key ~/private/x509/priv.pem --cert ~/private/x509/bejones.pem -k https://aiteigi01.cern.ch:8202/roger/v1/state/aiteigi01.cern.ch/
{"app_alarmed": false, "appstate": "production", "expires": "", "hostname": "aiteigi01.cern.ch", "hw_alarmed": true, "message": "switch hw alarms on", "os_alarmed": false, "update_time": "1375889868", "updated_by": "bejones"}
| key | description | expected values |
| app_alarmed | toggle application alarms for target | true or false |
| os_alarmed | toggle operating system alarms for target | true or false |
| hw_alarmed | toggle hardware alarms for target | true or false |
| appstate | is the machine in production or some drain state | production, draining, quiesce |
| expires | the time at which an entry will expire | epoch time string |
| hostname | name of the machine | fqdn |
| update_time | when the entry was created or updated | epoch time string |
| updated_by | authenticated user who updated the entry | string |
| message | a few words about why | string |
If an entry has expired, any "get" operation returns the previous entry from the history instead. The use case is people wanting to add or remove alarms for a set period of time, as is currently done with sms.
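The expiry rule can be pictured with a small sketch. Roger applies this on the server side for every "get", so the code below is only an illustration and not Roger's implementation. The newest-first `history` list and both function names are assumptions made for the example, as is the choice to keep walking back if the previous entry has also expired.

```python
import time


def is_expired(entry, now=None):
    """Return True if the entry carries an 'expires' epoch string in the past."""
    expires = entry.get("expires", "")
    if not expires:
        # An empty string means the entry never expires.
        return False
    if now is None:
        now = time.time()
    return int(expires) <= now


def effective_entry(history, now=None):
    """Return the newest entry that has not expired.

    'history' is a hypothetical list of entries for one host, newest first.
    Returns None if every entry has expired.
    """
    for entry in history:
        if not is_expired(entry, now):
            return entry
    return None
```

With a temporary "alarms off" entry on top of a permanent "alarms on" entry, `effective_entry` returns the temporary entry until its `expires` time passes and the permanent one afterwards.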
What's the API?
It's REST, but for monitoring purposes I presume you'll only really be interested in queries. Please note that in the following examples the server name in the URLs is not the production one, but the rest of each URL should be the same.
Note there are two ports: 8201 for Kerberos (mod_auth_kerb) and 8202 for SSL. You can use either. All connections must be authenticated, but otherwise anyone can read; the exception is machines using their own cert or keytab, which can only see their own entries.
So, again, here's getting one entry:
$ curl --key ~/private/x509/priv.pem --cert ~/private/x509/bejones.pem -k https://aiteigi01.cern.ch:8202/roger/v1/state/aiteigi01.cern.ch/
{"app_alarmed": false, "appstate": "production", "expires": "", "hostname": "aiteigi01.cern.ch", "hw_alarmed": true, "message": "switch hw alarms on", "os_alarmed": false, "update_time": "1375889868", "updated_by": "bejones"}
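For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch covering the SSL port only; the helper names `state_url` and `get_json` are made up for the example, and certificate verification is switched off to mirror `curl -k`, which you would probably not want against a production server.

```python
import json
import ssl
import urllib.request


def state_url(server, hostname="", port=8202):
    """Build the state URL for one host, or for the full list if hostname is empty."""
    base = "https://%s:%d/roger/v1/state/" % (server, port)
    return base + (hostname + "/" if hostname else "")


def get_json(url, cert, key):
    """GET a URL using an X.509 client certificate and decode the JSON body."""
    ctx = ssl.create_default_context()
    # Equivalent of curl -k: do not verify the server certificate.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # cert and key are full paths to PEM files (expand "~" before calling).
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    with urllib.request.urlopen(url, context=ctx) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `get_json(state_url("aiteigi01.cern.ch", "aiteigi01.cern.ch"), cert, key)` should return the entry shown above as a Python dict.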
You can get everything (the wiki isn't going to help with formatting this much output):
$ curl --key ~/private/x509/priv.pem --cert ~/private/x509/bejones.pem -k https://aiteigi01.cern.ch:8202/roger/v1/state/
{"meta": {"limit": 20, "next": "/roger/v1/state/?limit=20&offset=20", "offset": 0, "previous": null, "total_count": 1562}, "objects": [{"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05153026053325.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912723", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbrf18b04.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912468", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbsp2810.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912579", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbrf18b11.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912471", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbsq2302.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912623", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbsp2335.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912551", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "b6502beeeb.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912366", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05153026607938.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912758", "updated_by": "bejones"}, 
{"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbsq2215.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912618", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "c2repacksrv401.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912447", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "b6f395192e.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912414", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05153061403857.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912823", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "b64a42d2e0.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912364", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "lxbrf29b10.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912509", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05151876953154.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912720", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "b581fc2d4b.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912343", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05153061053073.cern.ch", "hw_alarmed": true, 
"message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912791", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "b6db8ad68b.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912408", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05151876753742.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912716", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "p05153026904396.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912780", "updated_by": "bejones"}]}
The main thing to note with list results is that they are always paginated. You have to be able to deal with that; the "meta" block gives the limit, the offset (from 0), the total count, and URIs for "next" and "previous".
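One way to deal with pagination is to keep following "next" until it is null. The sketch below assumes only the response shape shown in the listing above ("meta" with "next", plus "objects"). The `fetch_page` callable is hypothetical: it stands for whatever function you use to GET a path on the server and decode the JSON.

```python
def iter_objects(fetch_page, first_path="/roger/v1/state/"):
    """Yield every object of a paginated listing by following meta["next"].

    fetch_page is any callable that takes a path such as
    "/roger/v1/state/?limit=20&offset=20" and returns the decoded page.
    """
    path = first_path
    while path:
        page = fetch_page(path)
        for obj in page["objects"]:
            yield obj
        # "next" is null (None once decoded) on the last page.
        path = page["meta"]["next"]
```

Passing the fetcher in as an argument keeps the loop independent of the authentication method (cert or Kerberos) and makes it easy to try out against canned pages.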
Or you can search. Most of the fields are searchable in principle, but in practice update_time is probably the most useful, to see what has changed since time T:
$ curl --key ~/private/x509/priv.pem --cert ~/private/x509/bejones.pem -k 'https://aiteigi01.cern.ch:8202/roger/v1/state/?update_time__gt=1375912880'
{"meta": {"limit": 20, "next": null, "offset": 0, "previous": null, "total_count": 5}, "objects": [{"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "siteargus02.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912881", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "vmargus02.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912883", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "vmargus01.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912882", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "vmargus03.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912883", "updated_by": "bejones"}, {"app_alarmed": true, "appstate": "production", "expires": "", "hostname": "siteargus03.cern.ch", "hw_alarmed": true, "message": "bulk update from puppetdb", "os_alarmed": true, "update_time": "1375912881", "updated_by": "bejones"}]}
Search is Django style: a key and one of "gt", "gte", "lt", "lte", joined with two underscores, as in update_time__gt above.
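Building such a query string in a script might look like the sketch below. The function name `filter_query` is made up for the example, and the set of lookups is limited to the four named above; whether Roger accepts any others is not covered here.

```python
from urllib.parse import urlencode

# The four comparison lookups mentioned above.
LOOKUPS = ("gt", "gte", "lt", "lte")


def filter_query(field, lookup, value):
    """Return a query string such as 'update_time__gt=1375912880'."""
    if lookup not in LOOKUPS:
        raise ValueError("unsupported lookup: %r" % (lookup,))
    # Field name and lookup are joined with two underscores.
    return urlencode({"%s__%s" % (field, lookup): value})
```

Appending "?" and the result of `filter_query("update_time", "gt", 1375912880)` to the state URL gives the search request shown above.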
What will production look like?
It's Riak, so it scales horizontally. All machines have to be able to query their data directly. Initially there will be three servers behind a DNS load balancer.