Advanced Network Services for Experiments
FDT, the fdtcp wrapper and fdtd daemon
Here's a small primer concerning the FDT integration into PhEDEx. As you well know, FDT is a standalone transfer tool which has a number of advantages over existing transfer mechanisms.
In order to integrate FDT into PhEDEx, several modules/wrappers have been created. This development work was done independently of the FDT team. Currently the integration comprises 4 main components:
- The FDT java tool itself
- A PhEDEx backend (FDT.pm). This was created in Perl mirroring the functionality of SRM.pm.
- fdtcp wrapper. This is a Python wrapper that interfaces between PhEDEx and FDT data transfers. It:
- prepares the copyjob/fileList transfer file as required by FDT
- does the necessary translation of source/destination file names
- harvests report and log files to propagate back to PhEDEx
- invokes the remote fdtd service (forwarding a certificate proxy for authentication)
- fdtd - FDT service wrapper. It runs permanently as a daemon on FDT-enabled sites. It:
- receives requests (PYRO (Python Remote Objects) calls) to launch FDT on sites: it launches the FDT client on the source site or the FDT server on the destination site
- is responsible for authentication
Here is a diagram of the basic mode of operation of these components. This example mirrors the current ANSE testbed, in which we have a total of 4 different computers (although it can be reduced to 2 computers, if the PhEDEx site is also a storage site). In this configuration, I have stumbled upon some issues with fdtd and fdtcp, which I will go over later.
When PhEDEx decides to copy files from site A to site B, its FDT backend calls the fdtcp wrapper locally. This wrapper invokes (through PYRO calls) both fdtd services. The fdtd service on site A (the source) launches the FDT client tool, while the fdtd service on site B (the destination) launches the FDT server tool. Note that fdtd and fdtcp need to be installed on every computer that wants to transfer data in this mode, even if no FDT tools are called on the PhEDEx sites themselves.
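Since this calling pattern recurs throughout the rest of the page, here is a minimal sketch of the sequence just described. It is not the actual fdtcp code: it uses Pyro4 purely for illustration (fdtcp uses PYRO), and the remote object name "fdtd" and the method names are hypothetical.

import Pyro4

def start_transfer(src_host, dst_host, file_list, fdt_port):
    # Both fdtd daemons are assumed to listen on port 8444 (see the fdtd.py
    # note further down); "fdtd" is a hypothetical Pyro object name.
    dst = Pyro4.Proxy("PYRO:fdtd@%s:8444" % dst_host)
    src = Pyro4.Proxy("PYRO:fdtd@%s:8444" % src_host)
    # Destination side first: fdtd launches the FDT server tool on fdt_port.
    dst.start_fdt_server(fdt_port)
    # Source side: fdtd launches the FDT client tool, pointing it at the
    # destination and handing it the prepared fileList.
    return src.start_fdt_client(dst_host, fdt_port, file_list)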
More details about the FDT integration can be found here.
Changes to fdtcp
The fdtcp wrapper was developed a while back, and some of its components and configuration files needed to be updated to suit our needs.
I have made various changes to the fdtcp RPMs (will post a link to the new ones here...):
- removed any Hadoop dependencies from the configuration files, as Hadoop is not used in the current testbed.
- parametrised and updated some hardcoded flags that were passed to the FDT client and server
- modified fdtd.py to listen on all interfaces instead of just one (see the sketch after this list). Sandy01-{gva,ams} have at least 2 interfaces: a management interface, visible from the outside, and a private one on which the circuit has been established. The problem was that fdtd was not listening on port 8444 on all interfaces by default - it resolved the host name to an IP (that of the management interface) and listened on that interface alone.
- Removed the "-f" flag passed to the FDT server, which restricts which clients can connect to the machine. This was in response to a problem I had when issuing transfers from hermes2.uslhcnet.org (T2_ANSE_Geneva) between the Sandy01-{gva,ams} nodes (this is how PhEDEx would work with attached storage). As I explained earlier, two PYRO calls are issued from the hermes2 side, one going to the server and one to the client. The command sent to the server (the receiving side) also specifies a list of allowed IPs that may connect to it. The problem was that the DNS name of the client was resolved on the hermes2 side instead of on the server, which meant that the IP passed as an argument was the public IP of the client instead of the circuit interface address.
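The essence of the fdtd.py change above, shown as a plain-socket sketch (the real daemon uses PYRO, so the actual code differs, but the binding logic is the same idea):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Before: binding to socket.gethostbyname(socket.gethostname()) meant that only
# the management interface accepted connections on port 8444.
srv.bind(("0.0.0.0", 8444))   # after: listen on every interface
srv.listen(5)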
fdtd-system-conf.sh
...
# Reporting interval (in seconds) to MonALISA (FDT default is 30 seconds)
AP_MON_DELAY=5
# FDT Java client configs
FDT_PARALLEL=16
FDT_READER_COUNT=1
# FDT Java server settings
FDT_BUFFER_SIZE=2M
FDT_WRITER_COUNT=1
...
fdtd.conf
...
fdtSendingClientCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s -c %(hostDest)s -d / -fl %(fileList)s -rCount $FDT_READER_COUNT -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
fdtReceivingServerCommand = sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR lia.util.net.copy.FDT -bs $FDT_BUFFER_SIZE -p %(port)s -wCount $FDT_WRITER_COUNT -S -noupdates -enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY
...
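The %(...)s placeholders in fdtd.conf appear to be Python %-style interpolation, filled in per transfer, while the $VARIABLES come from fdtd-system-conf.sh and are expanded by the shell. A rough sketch of that substitution; the keys match the placeholders above, but the values here are invented:

template = ("sudo -u %(sudouser)s /usr/bin/wrapper_fdt.sh -cp $FDTJAR "
            "lia.util.net.copy.FDT -P $FDT_PARALLEL -p %(port)s -c %(hostDest)s "
            "-d / -fl %(fileList)s -rCount $FDT_READER_COUNT -noupdates "
            "-enable_apmon -monID %(monID)s -apmon_rep_delay $AP_MON_DELAY")
values = {"sudouser": "phedex",                      # hypothetical unprivileged user
          "port": 54321,                             # FDT data port for this transfer
          "hostDest": "sandy01-gva.uslhcnet.org",
          "fileList": "/tmp/fdt-filelist-example",
          "monID": "example-transfer-id"}
# $FDTJAR, $FDT_PARALLEL, etc. are expanded later by the shell that runs the command
print(template % values)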
ANSE Testbed (FDT-related)
As of 30.10.2013, the working ANSE testbed for FDT consists of 4 servers in uslhcnet.org. Of those 4 servers, sandy01-ams and sandy01-gva are used as attached storage. Between them, a 7Gbps circuit has been established for use in ANSE tests. A higher bandwidth can be reserved if needed. The two "sandy01" servers have several SSDs working in RAID 0, under 4 controllers.
Initial tests with FDT transfers using only one controller indicate that we can reach about 4500Mbps network throughput and 525MB/sec disk throughput when transferring 100GB files (random fill).
FDT transfer results
PhEDEx transfers: multiple jobs of 2GB files each, all files situated on one disk controller
- PhEDEx reported rates:
- Network output reported by MonALISA (sandy01-ams to sandy01-gva)
- Disk read reported by MonALISA (sandy01-ams to sandy01-gva)
The transfer rates reported by PhEDEx were below 300MB/sec. This seemed a bit low, since we knew that 1 disk controller could do much more. The plots that we got from MonALISA also seemed to indicate that there was something amiss.
The following issues were identified:
- Some FDT transfers remained active until the job timed out, even though there was no data left to be transferred.
- There was a 4-5 second delay between subsequent file transfers
- PhEDEx sometimes reported 2x the usual transfer rate (the two peaks at ~600MB/sec)
- Errors didn't seem to be propagated correctly upstream, causing PhEDEx to report very high transfer rates when the transfer job failed.
Upon further investigation:
- Was caused by the TCP buffers not being flushed properly at the end of the transfer. This caused FDT to remain active but idle, which effectively lowered the rates reported by PhEDEx
- Had two causes:
- There was a large idle time caused by the file system when FDT forced a sync on a file close. To ameliorate this, in version 0.19.0 FDT no longer forces this and results have significantly improved.
- A cronjob runs every minute, finding files older than 5 minutes and erasing them. This also causes a small delay.
- Still under investigation
- Validation script was to blame
Standalone FDT transfers: one transfer task of multiple 5GB files each, all files situated on one disk controller:
With the new version of FDT, transfers ran much more smoothly and the 4-5 second delay between subsequent file transfers disappeared. This is what was possible using a single disk controller. Given that the disk transfer speed was reached on that particular controller, we may attribute the transfer rate fluctuations to the SSD write amplification phenomenon.
Standalone FDT transfers, one transfer task of multiple 15GB files each. Files equally distributed on two controllers:
One of the built-in features of FDT is especially useful: when given a list of files to transfer, it will automatically:
- detect if the files to be transferred reside on different disks/controllers (in order to launch multiple Readers)
- detect if files at the destination will be written on different disks/controllers (in order to launch multiple Writers)
Here files were distributed equally both on the source and destination sides. Because of this, FDT launched 2 Readers and 2 Writers.
This is an especially important point for PhEDEx. It means that, given a list of files to transfer, FDT will always launch the optimal number of readers/writers, trying to maximize the transfer performance.
When the circuit is the limiting factor instead of the disks, we're seeing much more stable transfer rates.
PhEDEx transfers: multiple jobs of 15GB files each. Files equally distributed on two controllers:
In order to get the same result with PhEDEx and FDT, we had to properly set up a few things.
We have to make sure that the list of source and destination files given to FDT is balanced between different controllers, both on the transmitting and on the receiving side.
On the source (transmitting) side, we have to make sure to build and inject a block which consists of files located on different disks/controllers. PhEDEx will only export at most one block at a time. If that block doesn't have its files distributed on different controllers, FDT won't be able to start more than one Reader.
On the destination (receiving) side, we have to make sure to distribute files on different disks/controllers as well, or else FDT won't be able to start more than one Writer. This is done with the help of the TFC and smart file naming on our part.
Here is the TFC on the gva side used to transfer files from sandy01-ams/data{2,3} to sandy01-gva/data{2,3}
<storage-mapping>
<lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1cb.root)" result="/data2/ANSE/$1"/>
<lfn-to-pfn protocol="direct" destination-match=".*" path-match="(store/data/test/data/RAW/000/000000000/(.*)-526fb1e4.root)" result="/data3/ANSE/$1"/>
<lfn-to-pfn protocol="fdt" destination-match=".*" chain="direct" path-match="/*(.*)" result="fdt://sandy01-gva.uslhcnet.org:8444/$1"/>
<pfn-to-lfn protocol="direct" destination-match=".*" path-match="(/?.*)(/store.*)" result="$2"/>
<pfn-to-lfn protocol="fdt" destination-match=".*" chain="direct" path-match="(fdt://[\w\d\-\.]*(:\d*)?)(/.*)" result="$3"/>.
</storage-mapping>
As you can see, the files not only have to be located on different disks/controllers on the source side, they also have to be named differently depending on that location. This ensures that we can write rules in the TFC which distribute the files accordingly on the destination side.
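To make the chain concrete, here is a rough Python approximation of how the rules above resolve one LFN on the gva side. The file name is hypothetical but follows the -526fb1cb.root convention that steers files onto /data2; PhEDEx applies its own TFC matching, so this is only an illustration:

import re

lfn = "/store/data/test/data/RAW/000/000000000/file0001-526fb1cb.root"

# "direct" rule: names ending in -526fb1cb.root are placed under /data2/ANSE
direct = re.sub(r"(store/data/test/data/RAW/000/000000000/(.*)-526fb1cb\.root)",
                r"/data2/ANSE/\1", lfn.lstrip("/"), count=1)

# "fdt" rule (chain="direct"): prefix the resulting PFN with the FDT URL
pfn = re.sub(r"/*(.*)", r"fdt://sandy01-gva.uslhcnet.org:8444/\1", direct, count=1)

print(direct)   # /data2/ANSE/store/data/test/data/RAW/000/000000000/file0001-526fb1cb.root
print(pfn)      # fdt://sandy01-gva.uslhcnet.org:8444/data2/ANSE/store/...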
Here are two PhEDEx "runs" in which ~15TB of data is transferred, divided into several jobs.
At the end of some transfer tasks, we can see that there is a drop off from 7Gbps to around 4.5Gbps.
This is due to the list of files not being perfectly balanced among controllers (more files from one disk/controller were queued for transfer than on the other).
PhEDEx transfers: multiple jobs of 15GB files each. Files equally distributed on all controllers (four):
Blocks of 800 files, jobs of 150 files each. Rates are reported when a job finishes, so in our case each report covers either a 150-file job or the 50-file job that is the last remaining piece of the block (800 - 5x150 = 50). Because of the 1-hour binning, the PhEDEx-reported rates are bumpy, depending on where each report falls with respect to each bin.
Analyzing delays in fdtcp
fdtcp 10 files copyjobs:
Command issued @ 16:02:55,684
Transfer started @ 16:03:08
Delay = 12.3sec
fdtcp 100 files copyjobs:
Command issued @ 16:07:16,880
Transfer started @ 16:07:30
Delay = 13.2sec
fdtcp 1000 files copyjobs:
Command issued @ 16:08:56,360
Transfer started @ 16:09:09
Delay = 12.6sec
fdtcp 10000 files copyjobs:
Command issued @ 16:18:13,096
Transfer started @ 16:18:54
Delay = 41sec
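The delays above are just the difference between the timestamp of the fdtcp invocation in its log (comma-separated milliseconds) and the time the transfer is reported as started; for example, for the 10000-file copyjob:

from datetime import datetime

issued  = datetime.strptime("16:18:13,096", "%H:%M:%S,%f")
started = datetime.strptime("16:18:54", "%H:%M:%S")
print((started - issued).total_seconds())   # ~40.9 s, i.e. the ~41 sec quoted above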
Database related ideas