Advanced Network Services for Experiments
FDT, the fdtcp wrapper and fdtd daemon
Here's a small primer concerning the FDT integration into PhEDEx. As you well know, FDT is a standalone transfer tool which has a number of advantages over existing transfer mechanisms.
In order to integrate FDT into PhEDEx, several modules/wrappers have been created. This development work was done independently of the FDT team. Currently the integration comprises 4 main components:
- The FDT java tool itself
- A PhEDEx backend (FDT.pm). This was created in Perl mirroring the functionality of SRM.pm.
- fdtcp wrapper. This is a Python wrapper that interfaces between PhEDEx and FDT data transfers. It:
- prepares the copyjob/fileList transfer file as required by FDT
- does the necessary translation of source/destination file names
- harvests report and log files to propagate back to PhEDEx
- invokes the remote fdtd service (forwarding a certificate proxy for authentication)
- fdtd - FDT service wrapper. It runs permanently as a daemon on FDT-enabled sites. It:
- receives requests (PYRO (Python Remote Objects) calls) to launch FDT on sites: it launches the FDT client on the source site or the FDT server on the destination site
- is responsible for authentication
Here is a diagram of the basic mode of operation of these components. This example mirrors the current ANSE testbed, in which we have a total of 4 different computers (although it can be reduced to 2 computers, if the PhEDEx site is also a storage site). In this configuration, I have stumbled upon some issues with fdtd and fdtcp, which I will go over later.
When PhEDEx decides to copy files from site A to site B, its FDT backend calls the fdtcp wrapper locally. This wrapper invokes (through PYRO calls) both fdtd services. The fdtd service on site A (the source) launches the FDT client tool, while the fdtd service on site B (the destination) launches the FDT server tool. Note that fdtd and fdtcp need to be installed on every computer that wants to transfer data in this mode, even if no FDT tools are called on the PhEDEx sites themselves.
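Since this calling pattern recurs throughout the rest of the page, here is a minimal sketch of the sequence just described. It is not the actual fdtcp code: it uses Pyro4 purely for illustration (fdtcp uses PYRO), and the remote object name "fdtd" and the method names are hypothetical.

import Pyro4

def start_transfer(src_host, dst_host, file_list, fdt_port):
    # Both fdtd daemons are assumed to listen on port 8444 (see the fdtd.py
    # note further down); "fdtd" is a hypothetical Pyro object name.
    dst = Pyro4.Proxy("PYRO:fdtd@%s:8444" % dst_host)
    src = Pyro4.Proxy("PYRO:fdtd@%s:8444" % src_host)
    # Destination side first: fdtd launches the FDT server tool on fdt_port.
    dst.start_fdt_server(fdt_port)
    # Source side: fdtd launches the FDT client tool, pointing it at the
    # destination and handing it the prepared fileList.
    return src.start_fdt_client(dst_host, fdt_port, file_list)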
More details about the FDT integration can be found here.
Changes to fdtcp
The fdtcp wrapper was developed a while back, and some of its components and configuration files needed to be updated to suit our needs.
I have made various changes to the fdtcp RPMs (will post a link to the new ones here...):
- removed any Hadoop dependencies from the configuration files, as Hadoop is not used in the current testbed.
- parametrised and updated some hardcoded flags that were passed to the FDT client and server
- modified fdtd.py to listen on all interfaces instead of just one (see the sketch after this list). Sandy01-{gva,ams} have at least 2 interfaces: a management interface, visible from the outside, and a private one on which the circuit has been established. The problem was that fdtd was not listening on port 8444 on all interfaces by default - it resolved the host name to an IP (that of the management interface) and listened on that interface alone.
- Removed the "-f" flag passed to the FDT server, which restricts which clients can connect to the machine. This was in response to a problem I had when issuing transfers from hermes2.uslhcnet.org (T2_ANSE_Geneva) between the Sandy01-{gva,ams} nodes (this is how PhEDEx would work with attached storage). As I explained earlier, two PYRO calls are issued from the hermes2 side, one going to the server and one to the client. The command sent to the server (the receiving side) also specifies a list of allowed IPs that may connect to it. The problem was that the DNS name of the client was resolved on the hermes2 side instead of on the server, which meant that the IP passed as an argument was the public IP of the client instead of the circuit interface address.
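The essence of the fdtd.py change above, shown as a plain-socket sketch (the real daemon uses PYRO, so the actual code differs, but the binding logic is the same idea):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Before: binding to socket.gethostbyname(socket.gethostname()) meant that only
# the management interface accepted connections on port 8444.
srv.bind(("0.0.0.0", 8444))   # after: listen on every interface
srv.listen(5)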
fdtd-system-conf.sh
...
# Reporting interval (in seconds) to MonALISA (FDT default is 30 seconds)
AP_MON_DELAY=5
# FDT Java client configs
FDT_PARALLEL=16
FDT_READER_COUNT=1
# FDT Java server settings
FDT_BUFFER_SIZE=2M
FDT_WRITER_COUNT=1
...
fdtd.conf
...
fdtSendingClientCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s -c %(hostDest)s -d / -fl %(fileList)s -rCount $FDT_READER_COUNT -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
fdtReceivingServerCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -bs $FDT_BUFFER_SIZE -p %(port)s -wCount $FDT_WRITER_COUNT -S -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
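The %(...)s placeholders in fdtd.conf appear to be Python %-style interpolation, filled in per transfer, while the $VARIABLES come from fdtd-system-conf.sh and are expanded by the shell. A rough sketch of that substitution; the keys match the placeholders above, but the values here are invented:

template = ("sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR "
            "lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s -c %(hostDest)s "
            "-d / -fl %(fileList)s -rCount $FDT_READER_COUNT -noupdates "
            "-enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY")
values = {"sudouser": "phedex",                      # hypothetical unprivileged user
          "port": 54321,                             # FDT data port for this transfer
          "hostDest": "sandy01-gva.uslhcnet.org",
          "fileList": "/tmp/fdt-filelist-example",
          "monID": "example-transfer-id"}
# $FDTJAR, $FDT_PARALLEL, etc. are expanded later by the shell that runs the command
print(template % values)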
ANSE Testbed (FDT-related)
As of 30.10.2013, the working ANSE testbed for FDT consists of 4 servers in uslhcnet.org. Of those 4 servers, sandy01-ams and sandy01-gva are used as attached storage. Between them, a 7Gbps circuit has been established for use in ANSE tests. A higher bandwidth can be reserved if needed. The two "sandy01" servers have several SSDs working in RAID 0, under 4 controllers.
Initial tests with FDT transfers using only one controller indicate that we can reach about 4500Mbps network throughput and 525MB/sec disk throughput when transferring 100GB files (random fill).
FDT transfer results
PhEDEx transfers: multiple jobs of 2GB files each, all files situated on one disk controller
- PhEDEx reported rates:
- Network output reported by MonALISA (sandy01-ams to sandy01-gva)
- Disk read reported by MonALISA (sandy01-ams to sandy01-gva)
The transfer rates reported by PhEDEx were below 300MB/sec. This seemed a bit low, since we knew that 1 disk controller could do much more. The plots that we got from MonALISA also seemed to indicate that there was something amiss.
The following issues were identified:
- Some FDT transfers remained active until the job timed out, even though there was no data left to be transferred.
- There was a 4-5 second delay between subsequent file transfers
- PhEDEx sometimes reported 2x the usual transfer rate (the two peaks at ~600MB/sec)
- Errors didn't seem to be propagated correctly upstream, causing PhEDEx to report very high transfer rates when the transfer job failed.
Upon further investigation:
- Was caused by the TCP buffers not being flushed properly at the end of the transfer. This caused FDT to remain active but idle, which effectively lowered the rates reported by PhEDEx
- Had two causes:
- There was a large idle time caused by the file system when FDT forced a sync on a file close. To ameliorate this, in version 0.19.0 FDT no longer forces this and results have significantly improved.
- A cronjob runs every minute, finding files older than 5 minutes and erasing them. This also causes a small delay.
- Still under investigation
- Validation script was to blame
Standalone FDT transfers: one transfer task of multiple 5GB files each, all files situated on one disk controller:
With the new version of FDT, transfers ran much more smoothly and the 4-5 second delay between subsequent file transfers disappeared. This is what was possible using a single disk controller. Given that the disk transfer speed was reached on that particular controller, we may attribute the transfer rate fluctuations to the SSD write amplification phenomenon.
Standalone FDT transfers, one transfer task of multiple 15GB files each. Files equally distributed on two controllers:
One of the built-in features of FDT is especially useful: when given a list of files to transfer, it will automatically:
- detect if the files to be transferred reside on different disks/controllers (in order to launch multiple Readers)
- detect if files at the destination will be written on different disks/controllers (in order to launch multiple Writers)
Here files were distributed equally both on the source and destination sides. Because of this, FDT launched 2 Readers and 2 Writers.
This is an especially important point for PhEDEx. It means that, given a list of files to transfer, FDT will always launch the optimal number of readers/writers, trying to maximize the transfer performance.
When the circuit is the limiting factor instead of the disks, we're seeing much more stable transfer rates.
PhEDEx transfers: multiple jobs of 15GB files each. Files equally distributed on two controllers:
In order to get the same result with PhEDEx and FDT, we had to properly set up a few things.
We have to make sure that the list of source and destination files given to FDT is balanced between different controllers, both on the transmitting and on the receiving side.
On the source (transmitting) side, we have to make sure to build and inject a block which consists of files located on different disks/controllers. PhEDEx will only export at most one block at a time. If that block doesn't have its files distributed on different controllers, FDT won't be able to start more than one Reader.
On the destination (receiving) side, we have to make sure to distribute files on different disks/controllers as well, or else FDT won't be able to start more than one Writer. This is done with the help of the TFC and smart file naming on our part.
Here is the TFC on the gva side used to transfer files from sandy01-ams/data{2,3} to sandy01-gva/data{2,3}
<storage-mapping>
<lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1cb.root)" result="/data2/ANSE/$1"/>
<lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1e4.root)" result="/data3/ANSE/$1"/>
<lfn-to-pfn protocol="fdt" destination-match=".*" chain="direct" path-match="/*(.*)" result="fdt://sandy01-gva.uslhcnet.org:8444/$1"/>
<pfn-to-lfn protocol="direct" destination-match=".*" path-match="(/?.*)(/store.*)" result="$2"/>
<pfn-to-lfn protocol="fdt" destination-match=".*" chain="direct" path-match="(fdt://[\w\d\-\.]*(:\d*)?)(/.*)" result="$3"/>.
</storage-mapping>
As you can see, the files not only have to be located on different disks/controllers on the source side, they also have to be named differently depending on that location. This ensures that we can write rules in the TFC which distribute the files accordingly on the destination side.
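To make the chain concrete, here is a rough Python approximation of how the rules above resolve one LFN on the gva side. The file name is hypothetical but follows the -526fb1cb.root convention that steers files onto /data2; PhEDEx applies its own TFC matching, so this is only an illustration:

import re

lfn = "/store/data/test/data/RAW/000/000000000/file0001-526fb1cb.root"

# "direct" rule: names ending in -526fb1cb.root are placed under /data2/ANSE
direct = re.sub(r"(store/data/test/data/RAW/000/000000000/(.*)-526fb1cb\.root)",
                r"/data2/ANSE/\1", lfn.lstrip("/"), count=1)

# "fdt" rule (chain="direct"): prefix the resulting PFN with the FDT URL
pfn = re.sub(r"/*(.*)", r"fdt://sandy01-gva.uslhcnet.org:8444/\1", direct, count=1)

print(direct)   # /data2/ANSE/store/data/test/data/RAW/000/000000000/file0001-526fb1cb.root
print(pfn)      # fdt://sandy01-gva.uslhcnet.org:8444/data2/ANSE/store/...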
Here are two PhEDEx "runs" in which ~15TB of data is transferred, divided into several jobs.
At the end of some transfer tasks, we can see that there is a drop off from 7Gbps to around 4.5Gbps.
This is due to the list of files not being perfectly balanced among controllers (more files from one disk/controller were queued for transfer than on the other).
PhEDEx transfers: multiple jobs of 15GB files each. Files equally distributed on all controllers (four):
Blocks of 800 files, jobs of 150 files each. Rates are reported when a job finishes, so in our case each report covers either a 150-file job or the 50-file job that is the last remaining piece of the block (800 - 5x150 = 50). Because of the 1-hour binning, the PhEDEx-reported rates are bumpy, depending on where each report falls with respect to each bin.
Analyzing delays in fdtcp
fdtcp 10 files copyjobs:
Command issued @ 16:02:55,684
Transfer started @ 16:03:08
Delay = 12.3sec
fdtcp 100 files copyjobs:
Command issued @ 16:07:16,880
Transfer started @ 16:07:30
Delay = 13.2sec
fdtcp 1000 files copyjobs:
Command issued @ 16:08:56,360
Transfer started @ 16:09:09
Delay = 12.6sec
fdtcp 10000 files copyjobs:
Command issued @ 16:18:13,096
Transfer started @ 16:18:54
Delay = 41sec
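The delays above are just the difference between the timestamp of the fdtcp invocation in its log (comma-separated milliseconds) and the time the transfer is reported as started; for example, for the 10000-file copyjob:

from datetime import datetime

issued  = datetime.strptime("16:18:13,096", "%H:%M:%S,%f")
started = datetime.strptime("16:18:54", "%H:%M:%S")
print((started - issued).total_seconds())   # ~40.9 s, i.e. the ~41 sec quoted above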
Database related ideas