Advanced Network Services for Experiments

FDT, the fdtcp wrapper and fdtd daemon

Here is a short primer on the FDT integration into PhEDEx.

FDT is a standalone transfer tool which has a number of advantages over existing transfer mechanisms.

In order to integrate FDT into PhEDEx, several modules/wrappers have been created. This development work was done independently of the FDT team.

Currently, this integration comprises four main components:

  • The FDT Java tool itself
  • A PhEDEx backend (FDT.pm), written in Perl and mirroring the functionality of SRM.pm.
  • fdtcp, a Python wrapper that interfaces between PhEDEx and FDT data transfers. It:
    • prepares the copyjob/fileList transfer file required by FDT (see the example after this list)
    • performs the necessary translation of source and destination file names
    • harvests report and log files to propagate back to PhEDEx
    • invokes the remote fdtd service (forwarding the certificate proxy for authentication)
  • fdtd, the FDT service wrapper, which runs permanently as a daemon on FDT-enabled sites. It:
    • receives requests (PYRO (Python Remote Objects) calls) to launch FDT on sites: it launches the FDT client on source sites and the FDT server on destination sites
    • is responsible for authentication
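
For illustration, a copyjob pairs a source PFN with a destination PFN, one transfer per line. The entries below are hypothetical (the fdt:// PFN form matches the TFC shown later on this page; the exact fileList syntax fdtcp hands to FDT may differ):

fdt://sandy01-ams.uslhcnet.org:8444/data2/ANSE/store/data/test/f0-526fb1cb.root fdt://sandy01-gva.uslhcnet.org:8444/data2/ANSE/store/data/test/f0-526fb1cb.root
fdt://sandy01-ams.uslhcnet.org:8444/data3/ANSE/store/data/test/f1-526fb1e4.root fdt://sandy01-gva.uslhcnet.org:8444/data3/ANSE/store/data/test/f1-526fb1e4.root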

Here is a diagram of the basic mode of operation of these components.
FDT_transfer_schema.jpg

This example mirrors the current ANSE testbed, in which we have a total of 4 different computers (although this can be reduced to 2 computers if the PhEDEx site is also a storage site). In this configuration I stumbled upon some issues with fdtd and fdtcp, which I will go over later.

When PhEDEx decides to copy files from site A to site B, its FDT backend calls the fdtcp wrapper locally. The wrapper invokes (through PYRO calls) both fdtd services: the fdtd service on site A (the source) launches the FDT client tool, while the fdtd service on site B (the destination) launches the FDT server tool.
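
As a minimal sketch of this call flow (using the Pyro4 API for illustration; the actual fdtcp code uses the original PYRO library, and the object and method names below are hypothetical):

import Pyro4

SRC = "sandy01-ams.uslhcnet.org"   # site A (source)
DST = "sandy01-gva.uslhcnet.org"   # site B (destination)

# Connect to the fdtd daemons on both sites (URIs are illustrative;
# 8444 is the fdtd port mentioned elsewhere on this page).
src_fdtd = Pyro4.Proxy("PYRO:fdtd@%s:8444" % SRC)
dst_fdtd = Pyro4.Proxy("PYRO:fdtd@%s:8444" % DST)

# Destination first: its fdtd launches the FDT server so that it is
# listening, then the source fdtd launches the FDT client pointing at it.
# (Method names are hypothetical.)
dst_fdtd.startFdtServer(port=54321, monID="transfer-001")
src_fdtd.startFdtClient(destHost=DST, port=54321,
                        fileList="/tmp/copyjob-001", monID="transfer-001")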

Note that fdtd and fdtcp need to be installed on every computer that wants to transfer data in this mode, even if no FDT tools would be called on the PhEDEx sites.

More details about the FDT integration can be found here.

Changes to fdtcp

The fdtcp wrapper was developed a while back, and some of its components and configuration files needed to be updated to suit our needs.

I have made various changes to the fdtcp RPMs (will post a link to the new ones here...):

  1. Removed the Hadoop dependencies from the configuration files, as Hadoop is not used in the current testbed.
  2. Parametrised and updated some hard-coded flags that were passed to the FDT client and server.
  3. Modified fdtd.py to listen on all interfaces instead of just one (see the sketch after this list). Sandy01-{gva,ams} have at least two interfaces: a management interface, visible from the outside, and a private one on which the circuit has been established. The problem was that fdtd did not listen on port 8444 on all interfaces by default: it resolved the host name to an IP (that of the management interface) and listened on that interface alone.
  4. Removed the "-f" flag passed to the FDT server, which restricts the clients that may connect to the machine. This was in response to a problem I hit when issuing transfers from hermes2.uslhcnet.org (T2_ANSE_Geneva) between the Sandy01-{gva,ams} nodes (this is how PhEDEx would work with attached storage). As explained earlier, two PYRO calls are issued from the hermes2 side, one to the server and one to the client. The command sent to the server (the receiving side) also specifies a list of IPs allowed to connect to it. The problem was that the DNS name of the client was resolved on the hermes2 side instead of on the server, so the IP passed as an argument was the public IP of the client instead of the circuit interface address.
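
The interface fix from point 3 boils down to binding the wildcard address instead of the address the host name resolves to. A minimal sketch of the difference (plain sockets shown for illustration; fdtd actually listens through its PYRO daemon):

import socket

# Before: fdtd resolved its own host name and bound only that address,
# i.e. the management interface.
# addr = socket.gethostbyname(socket.gethostname())

# After: bind the wildcard address so connections arriving on the private
# circuit interface are accepted as well.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", 8444))
sock.listen(5)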

fdtd-system-conf.sh

...
# Reporting interval (in seconds) to MonALISA (FDT default is 30 seconds)
AP_MON_DELAY=5

# FDT Java client configs
FDT_PARALLEL=16
FDT_READER_COUNT=1

# FDT Java server settings
FDT_BUFFER_SIZE=2M
FDT_WRITER_COUNT=1
...

fdtd.conf

...
fdtSendingClientCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s -c %(hostDest)s -d / -fl %(fileList)s -rCount $FDT_READER_COUNT -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
fdtReceivingServerCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -bs $FDT_BUFFER_SIZE -p %(port)s -wCount $FDT_WRITER_COUNT -S -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
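
The %(...)s placeholders in these templates are presumably Python string-interpolation keys, filled in per transfer by the Python wrappers, while the $VARS are left for the shell to expand (via wrapper_fdt.sh and fdtd-system-conf.sh). A minimal sketch of the expansion, with hypothetical values:

template = ("sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR "
            "lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s "
            "-c %(hostDest)s -d / -fl %(fileList)s")

# Values below are hypothetical, for illustration only.
command = template % {
    "sudouser": "phedex",
    "port": 54321,
    "hostDest": "sandy01-gva.uslhcnet.org",
    "fileList": "/tmp/copyjob-001",
}
print(command)   # $FDTJAR and $FDT_PARALLEL are expanded later by the shell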

ANSE Testbed (FDT-related)

As of 30.10.2013, the working ANSE testbed for FDT consists of 4 servers in uslhcnet.org:
Preliminary_tests.jpg

Of those 4 servers, sandy01-ams and sandy01-gva are used as attached storage. Between them, a 7Gbps circuit has been established for use in ANSE tests.
A higher bandwidth can be reserved if needed. The two "sandy01" servers have several SSDs working in RAID 0, under 4 controllers.
Initial tests with FDT transfers using only one controller indicate that we can reach about 4500Mbps network throughput and 525MB/sec disk throughput when transferring 100GB files (random fill).

FDT transfer results

PhEDEx transfers: multiple jobs of 2GB files, all situated on one disk controller
  • PhEDEx reported rates:
    PhEDEx_rate_plot.png
  • Network output reported by MonALISA (sandy01-ams to sandy01-gva)
    sandy01-ams-netout_mean_405MBps.png
  • Disk read reported by MonALISA (sandy01-ams to sandy01-gva)
    sandy01-ams-diskread_mean_379.31MBps.png

The transfer rates reported by PhEDEx were below 300MB/sec. This seemed a bit low, since we knew that one disk controller could do much more.
The plots we got from MonALISA also seemed to indicate that something was amiss.

The following issues were identified:

  1. (Fixed) Some FDT transfers remained active until the job timed out, even though there was no data left to be transferred.
  2. (Fixed) There was a 4-5 second delay between subsequent file transfers.
  3. (Important, still open) PhEDEx sometimes reported twice the usual transfer rate (the two peaks at ~600MB/sec).
  4. (Fixed) Errors did not seem to be propagated correctly upstream, causing PhEDEx to report very high transfer rates when a transfer job failed.

Upon further investigation:

  1. This was caused by the TCP buffers not being flushed properly at the end of the transfer. FDT remained active but idle, which effectively lowered the rates reported by PhEDEx.
  2. This had two causes:
    1. There was a large idle time caused by the file system when FDT forced a sync on file close. To ameliorate this, FDT no longer forces the sync as of version 0.19.0, and results have improved significantly.
    2. A cron job runs every minute, finding files older than 5 minutes and erasing them (see the sketch after this list). This also causes a small delay.
  3. Still under investigation.
  4. The validation script was to blame.
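
For reference, the cleanup pass from point 2.2 amounts to something like the following sketch (the path and the exact implementation used on the testbed are assumptions):

import os, time

MAX_AGE = 300   # 5 minutes, as in the cron job described above

def purge_old_files(root):
    # Walk the test area and remove anything older than the cutoff.
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > MAX_AGE:
                os.remove(path)

purge_old_files("/data2/ANSE")   # hypothetical test area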

Standalone FDT transfers: one transfer task of multiple 5GB files, all situated on one disk controller:
ContinuousDownload.png

With the new version of FDT, transfers ran much more smoothly and the 4-5 second delay between file transfers disappeared. This is what was possible using a single disk controller.
Given that the disk transfer speed was reached on that particular controller, we may attribute the transfer rate fluctuations to the SSD write amplification phenomenon.

Standalone FDT transfers: one transfer task of multiple 15GB files, equally distributed on two controllers:

One of FDT's built-in features is especially useful here: when given a list of files to transfer, it will automatically:

  • detect whether the files to be transferred reside on different disks/controllers (in order to launch multiple Readers)
  • detect whether the files at the destination will be written to different disks/controllers (in order to launch multiple Writers)

Here, files were distributed equally on both the source and destination sides, so FDT launched 2 Readers and 2 Writers.
This is an especially important point for PhEDEx: given a list of files to transfer, FDT will always launch the optimal number of Readers/Writers, trying to maximize transfer performance.
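
The underlying idea can be illustrated with a short sketch (this is not FDT's actual Java code): the number of distinct devices spanned by a file list determines how many Readers (or, on the destination, Writers) are worth launching.

import os
from collections import defaultdict

def group_by_device(paths):
    # Group files by the device (disk/controller) they live on; one
    # Reader/Writer per distinct device is the natural parallelism.
    groups = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_dev].append(path)
    return groups

# Files spread over /data2 and /data3 fall into two groups, so two
# Readers are launched on the source (and two Writers on the destination).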

FDT_Transfers_-_15GB_files.png

When the circuit is the limiting factor instead of the disks, we're seeing much more stable transfer rates.

PhEDEx transfers: multiple jobs of 15GB files, equally distributed on two controllers:

In order to get the same result with PhEDEx and FDT, we had to set up a few things properly.
We have to make sure that the list of source and destination files given to FDT is balanced between different controllers, both on the transmitting and on the receiving side.

TIP: On the source (transmitting) side, we have to build and inject a block consisting of files located on different disks/controllers. PhEDEx exports at most one block at a time; if that block does not have its files distributed over different controllers, FDT will not be able to start more than one Reader.
TIP: On the destination (receiving) side, we have to distribute files over different disks/controllers as well, or else FDT will not be able to start more than one Writer. This is done with the help of the TFC and smart file naming on our part.

Here is the TFC on the gva side, used to transfer files from sandy01-ams/data{2,3} to sandy01-gva/data{2,3}:

<storage-mapping>

  <lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1cb.root)" result="/data2/ANSE/$1"/>
  <lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1e4.root)" result="/data3/ANSE/$1"/>

  <lfn-to-pfn protocol="fdt" destination-match=".*" chain="direct" path-match="/*(.*)" result="fdt://sandy01-gva.uslhcnet.org:8444/$1"/>

  <pfn-to-lfn protocol="direct" destination-match=".*" path-match="(/?.*)(/store.*)" result="$2"/>
  <pfn-to-lfn protocol="fdt" destination-match=".*" chain="direct" path-match="(fdt://[\w\d\-\.]*(:\d*)?)(/.*)" result="$3"/>.

</storage-mapping>

As you can see, the files not only have to be located on different disks/controllers on the source side, they also have to be named differently depending on that location. This ensures that we can write TFC rules which distribute the files across controllers on the destination side as well.
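
A minimal sketch of how these rules resolve an LFN (Python used for illustration; PhEDEx applies the TFC through its own machinery):

import re

# The two "direct" rules above, as (pattern, result) pairs.
DIRECT_RULES = [
    (r"(store/data/test/data/RAW/000/000000000/.*-526fb1cb\.root)$", r"/data2/ANSE/\1"),
    (r"(store/data/test/data/RAW/000/000000000/.*-526fb1e4\.root)$", r"/data3/ANSE/\1"),
]

def lfn_to_direct_pfn(lfn):
    # protocol="direct": the hash suffix in the file name picks the controller.
    for pattern, result in DIRECT_RULES:
        match = re.match(pattern, lfn)
        if match:
            return match.expand(result)
    return None

def lfn_to_fdt_pfn(lfn):
    # protocol="fdt" chains through "direct", then wraps it in an fdt:// URL.
    direct = lfn_to_direct_pfn(lfn)
    tail = re.match(r"/*(.*)", direct).group(1)
    return "fdt://sandy01-gva.uslhcnet.org:8444/" + tail

# A file named ...-526fb1cb.root lands on /data2, ...-526fb1e4.root on /data3.
print(lfn_to_fdt_pfn("store/data/test/data/RAW/000/000000000/f0-526fb1cb.root"))
print(lfn_to_fdt_pfn("store/data/test/data/RAW/000/000000000/f1-526fb1e4.root"))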

PhEDEx_Transfers__-_multiple_jobs_of_150_files.png

PhEDEx_Other_transfers__-_multiple_jobs_of_150_files_.png

Here are two PhEDEx "runs" in which ~15TB of data is transferred, divided into several jobs.

At the end of some transfer tasks we can see a drop-off from 7Gbps to around 4.5Gbps.
This is due to the list of files not being perfectly balanced among controllers (more files from one disk/controller were queued for transfer than from the other).

PhEDEx transfers: multiple jobs of 15GB files, equally distributed on all four controllers:
Blocks of 800 files, jobs of 150 files each. Rates are reported when a job finishes; in our case that is either a 150-file job or a 50-file job (the last remaining piece of the block: 800-750). Because of the 1h binning, the PhEDEx-reported rates are bumpy, depending on where each report falls with respect to each bin.
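
A back-of-the-envelope calculation (assuming the full ~7Gbps circuit rate) shows why the reports land unevenly in the 1h bins:

# Approximate duration of one 150-file job at the ~7Gbps circuit rate.
rate_MB_per_s = 7000 / 8.0          # ~875 MB/sec
job_GB = 150 * 15                   # 150 files x 15GB = 2250GB
job_minutes = job_GB * 1000 / rate_MB_per_s / 60
print("%.0f minutes per 150-file job" % job_minutes)   # ~43 minutes

# At ~43 minutes per job (plus the shorter 50-file job at the end of each
# 800-file block), some 1h bins catch one completion report and some catch
# two, which makes the per-bin rates look bumpy.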

24h_PhEDEx_reported_rates.png
Sandy01-AMS_NETOUT.png
Sandy01-GVA_NETIN.png
AMS_CPU_Utilisation.png
GVA_CPU_Utilisation.png

Analyzing delays in fdtcp

Startup delays for fdtcp copyjobs of increasing size:

| Copyjob size | Command issued | Transfer started | Delay |
| 10 files | 16:02:55,684 | 16:03:08 | 12.3 sec |
| 100 files | 16:07:16,880 | 16:07:30 | 13.2 sec |
| 1000 files | 16:08:56,360 | 16:09:09 | 12.6 sec |
| 10000 files | 16:18:13,096 | 16:18:54 | 41 sec |

The startup delay stays roughly constant at 12-13 seconds up to 1000 files per copyjob, but grows markedly at 10000 files.

Database-related ideas

Proposals for implementing circuits in PhEDEx

Proposals for PhEDEx Monitoring

Topic attachments

| Attachment | Size | Date | Comment |
| 24h_PhEDEx_reported_rates.png | 31.4 K | 2013-12-13 | Latest tests |
| AMS_CPU_Utilisation.png | 261.2 K | 2013-12-13 | Latest tests |
| ContinuousDownload.png | 188.5 K | 2013-10-30 | FDT transfer - 5GB files between AMS and GVA |
| FDT_Transfers_-_15GB_files.png | 54.3 K | 2013-11-06 | Transfers from two controllers |
| FDT_transfer_schema.jpg | 542.1 K | 2013-10-30 | FDTCP, FDTD, FDT transfer mechanism |
| GVA_CPU_Utilisation.png | 241.0 K | 2013-12-13 | Latest tests |
| PhEDEx_Other_transfers__-_multiple_jobs_of_150_files_.png | 131.5 K | 2013-11-06 | Transfers from two controllers II |
| PhEDEx_Transfers__-_multiple_jobs_of_150_files.png | 125.1 K | 2013-11-06 | Transfers from two controllers |
| PhEDEx_rate_plot.png | 29.3 K | 2013-10-30 | Preliminary rate plot |
| Preliminary_tests.jpg | 578.1 K | 2013-10-30 | FDT - PhEDEx Testbed |
| Sandy01-AMS_NETOUT.png | 412.0 K | 2013-12-13 | Latest tests |
| Sandy01-GVA_NETIN.png | 413.7 K | 2013-12-13 | Latest tests |
| sandy01-ams-diskread_mean_379.31MBps.png | 201.0 K | 2013-10-30 | Sandy01AMS tests |
| sandy01-ams-netout_mean_405MBps.png | 197.9 K | 2013-10-30 | Sandy01AMS tests |