The block latency analysis
Meetings
For our first meeting (Wednesday 26th November, 10:00am) we will use Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=AGJFEAldmWkm. The goals for this meeting are to organise people, data and process.
Timeline
This analysis needs to be done in time for CHEP 2015, which means we need final plots by about February, not much later than that. Time is not on our side!
Eventually, we want to be able to monitor the latency of transfers continuously in a way that makes sense. How we do that will depend partly on what we learn from this analysis.
Background
Since PhEDEx version 4 (around 2012), we have stored information in PhEDEx on the latency of block transfers and file transfers. Historical information is kept in the t_log_block_latency and t_log_file_latency tables. Their definitions are:
create table t_log_block_latency
(time_update float not null,
destination integer not null,
block integer , -- block id, can be null if block removed
files integer not null, -- number of files
bytes integer not null, -- block size in bytes
priority integer not null, -- t_dps_block_dest priority
is_custodial char (1) not null, -- t_dps_block_dest custodial
time_subscription float not null, -- time block was subscribed
block_create float not null, -- time the block was created
block_close float not null, -- time the block was closed
first_request float , -- time block was first routed (t_xfer_request appeared)
first_replica float , -- time the first file was replicated
percent25_replica float , -- time the 25th-percentile file was replicated
percent50_replica float , -- time the 50th-percentile file was replicated
percent75_replica float , -- time the 75th-percentile file was replicated
percent95_replica float , -- time the 95th-percentile file was replicated
last_replica float not null, -- time the last file was replicated
primary_from_node integer , -- id of the node from which most of the files were transferred
primary_from_files integer , -- number of files transferred from primary_from_node
total_xfer_attempts integer , -- total number of transfer attempts for all files in the block
total_suspend_time float , -- seconds the block was suspended since the start of the transfer
latency float not null -- final latency for this block
);
create table t_log_file_latency
(time_subscription float not null,
time_update float not null,
destination integer not null, -- destination node id
fileid integer , -- file id, can be NULL for invalidated files
inblock integer not null, -- block id
filesize integer not null, -- file size in bytes
priority integer , -- task priority
is_custodial char (1) , -- task custodiality
time_request float , -- timestamp of the first time the file was activated for transfer by FileRouter
original_from_node integer , -- node id of the source node for the first valid transfer path created by FileRouter
from_node integer , -- node id of the source node for the successful transfer task (can differ from above in case of rerouting)
time_route float , -- timestamp of the first time that a valid transfer path was created by FileRouter
time_assign float , -- timestamp of the first time that a transfer task was created by FileIssue
time_export float , -- timestamp of the first time the file was exported for transfer (staged at source Buffer, or same as assigned time for T2s)
attempts integer , -- number of transfer attempts
time_first_attempt float , -- timestamp of the first transfer attempt
time_on_buffer float , -- timestamp of the successful WAN transfer attempt (to Buffer for T1 nodes)
time_at_destination float , -- timestamp of arrival on destination node (same as before for T2 nodes, or migration time for T1s)
);
The block latency information is kept without pruning, but the file latency information is only kept for about 3-4 months (I'm not sure of the exact time), because it would otherwise grow too quickly. For this analysis we only need the block information; I don't think we will want to look at the file information. In any case we can't, because we haven't been harvesting it, and anything older than the last few months is lost.
The key parameters in the block latency log table are the first_replica, percent_X, and last_replica fields. These record the times at which the first file was transferred, at which X percent of the block (by file count) was transferred, and at which the last file was transferred.
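As an illustration, here is a minimal Python sketch of how these fields can be turned into a total transfer time per block. It assumes the extracted CSV files carry the same column names as the table; the column headers and block ids below are invented for the example.

```python
import csv
import io

def block_transfer_times(csv_text):
    """Return {block_id: last_replica - first_replica} for rows where
    both timestamps are present (they can be empty for pruned blocks)."""
    times = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        first, last = row.get("first_replica"), row.get("last_replica")
        if first and last:
            times[row["block"]] = float(last) - float(first)
    return times

# Toy input: two blocks, the second missing its first_replica timestamp
sample = (
    "block,first_replica,last_replica\n"
    "1001,1400000000,1400003600\n"
    "1002,,1400007200\n"
)
```

The same pattern extends to the percentile fields, which is what the skew variables below are built from.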
Q for Nicolo: Are these values adjusted correctly for blocks that are filled while transferring? I think so, but am not sure.
Code, organisation...
I have a github repository, git@github.com:TonyWildish/PhEDEx-latency.git, with some initial code that I wrote a while ago to explore this data. There's nothing complete there, but it does provide a starting point, and we can use it to host the analysis code. Please send me your github account name if you want to be able to write to it directly, or just make pull requests when you have something to add and I will merge them.
The github repository contains a bin and a data directory, with README files. The short version is that the bin directory has a script for extracting the data from PhEDEx (you will need a PhEDEx installation for the Perl modules, and a DBParam with read access to the production database). For convenience, I've extracted a set of CSV files and stored them in the data directory. This may be enough for the analysis to work with, or we may need to correlate with other variables later on, in which case the extraction will need to be enhanced.
There is also a directory 'Tony' for my preliminary analysis code. There's a README there, but basically it reads the CSV files and produces a bunch of R data-frames and a few initial plots to explore the data. The code all runs, but isn't that well documented. Note that it takes quite a bit of memory; you may have trouble fitting it into an older laptop.
For now, I suggest we all make our own directories in the repository, rather than trying to share code directly. Once we get established we can see again how to organise ourselves better.
I personally don't care what language is used for the analysis. I prefer R, others will definitely prefer Python. I say we each use whatever we like best, we can converge/convert later if we need to.
Starting the analysis
Cleaning the data
The block latency data contains information on every block transferred since we made the schema changes. This includes all the single-file blocks, all the blocks with small files, and all the other odd things that we get in PhEDEx. Someone will have to spend some effort cleaning the data to extract a meaningful sub-set for analysis.
There will probably also be blocks that were growing while the transfer was taking place, in which case we may need to correct for them, exclude them, or treat them differently somehow. To spot that, we will have to look at the t_dps_file table for the creation time of files in blocks and see if they fall in the window between first_replica and last_replica.
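Once we have the per-block file creation times from t_dps_file, the check itself is simple. A sketch (the function name and argument layout are mine, not anything in the repository):

```python
def grew_during_transfer(first_replica, last_replica, file_create_times):
    """True if any file in the block was created after the block's first
    file had already been replicated, i.e. the block grew mid-transfer."""
    return any(first_replica < t < last_replica for t in file_create_times)
```

Blocks flagged this way can then be counted, and excluded or treated separately depending on how common they turn out to be.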
Some analysis variables
I had the idea of defining a skew for a dataset. Define the variable skew_X for a block as:
skew_X = (time spent transferring the last 5 percent of the files) / (time spent transferring the first X percent of the files) * (X/5)
If transfers happen at a constant rate, the skew should be one for all values of X. If the skew is much greater than one, then the last 5 percent took much longer than they should compared to the first X percent. If the skew is much lower than one then the first X percent took much longer than it should, compared to the last 5 percent of the files.
Given the values recorded in the table, we can calculate skew_25, skew_50, skew_75 and skew_95. Depending on what the source of latency turns out to be, one or more of these skews may be relevant, but I expect the skew_75 and skew_95 variables to show the most promise.
We could define other skews, based not on the last 5 percent but on the last 25 percent etc, but that might be less useful.
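The skew definition above maps directly onto the percentile timestamps in t_log_block_latency. A sketch in Python; note it assumes the "first X percent" clock starts at first_replica (whether it should instead start at the subscription or request time is still to be decided):

```python
def skew(first_replica, percentX_replica, percent95_replica, last_replica, X):
    """skew_X = (time for the last 5% of files) / (time for first X%) * X/5.

    Assumes the clock for the 'first X percent' starts at first_replica.
    At a constant transfer rate every skew_X comes out to 1.
    """
    last5 = last_replica - percent95_replica
    firstX = percentX_replica - first_replica
    return (last5 / firstX) * (X / 5.0)
```

For example, with timestamps proportional to the fraction transferred (a constant rate), skew(0, 50, 95, 100, 50) gives 1; stretching the tail so the last file arrives at t=200 pushes skew_50 well above 1, flagging a long-tail block.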
Plan of attack
Here's a proposal for how to proceed:
- (send me your github names if you want to write code!)
- Someone (who?) needs to do a preliminary analysis to select a subsample of blocks. These should have a reasonable number of files, each of a reasonable size, whatever that turns out to be.
- We need to cross-check that the blocks we select are clean, in that they weren't being added to while they were transferring.
- For blocks that were growing while being transferred, we need to see if there are lots or not, and figure out how to deal with them (ignore them or not?)
- Once we have a clean subset we can figure out what the next steps are
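The subsample-selection step could start as simply as the sketch below, using the files and bytes fields from t_log_block_latency. The thresholds are placeholders, to be tuned once we actually look at the data:

```python
def select_blocks(blocks, min_files=5, min_mean_size=1e9):
    """Keep blocks with a 'reasonable' number of files of 'reasonable' size.

    `blocks` is a list of dicts with 'files' and 'bytes' keys (as in
    t_log_block_latency); the default thresholds are illustrative only.
    """
    return [
        b for b in blocks
        if b["files"] >= min_files
        and b["bytes"] / b["files"] >= min_mean_size
    ]
```

This drops the single-file blocks and small-file blocks mentioned above in one pass; the growing-block cross-check would then run on whatever survives.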
The difficult question is: how do we break down the work to achieve this?