IPv6 Transfer Test Results

Testing the full mesh

I ran a test between several sites recently:
  • the sites were Caltech, CERN, FZU, GARR, GRIDKA, and INFN (DESY was excluded: they still have a problem with the UID of their IPv6 user)
  • for each pair of sites, tests were run in both directions
  • for any source-destination pair, the process was (sketched in code after this list):
    • transfer a 1 GB file using globus-url-copy
    • use uberftp to verify the file arrived with the correct size
    • delete the file at the destination, again using uberftp
    • 10 seconds later, start the next transfer
  • so, any of the 6 sites could be involved in from 0 to 5 outbound transfers and from 0 to 5 inbound transfers simultaneously
  • all transfers were third-party, controlled from a VM at CERN. This VM did not transfer data itself during this test
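
For concreteness, here's a minimal sketch in Python of the per-pair driver loop described above. The endpoint URLs are placeholders, and the exact uberftp options are assumptions (check your uberftp version); only the overall transfer/verify/delete/pause cycle is taken from the procedure above.

    import subprocess
    import time

    # Hypothetical endpoints; the real hostnames and paths differ per site.
    SRC = "gsiftp://source.example.org/store/ipv6test/file-1GB"
    DST = "gsiftp://dest.example.org/store/ipv6test/file-1GB"
    EXPECTED_SIZE = 1_000_000_000  # the 1 GB test file

    def one_cycle():
        # Third-party transfer: the control host only mediates, the data
        # flows directly between the two sites.
        subprocess.run(["globus-url-copy", SRC, DST], check=True)

        # Verify the file arrived with the correct size. The '-size' option
        # is an assumption; parsing 'uberftp -ls <url>' would also work.
        out = subprocess.run(["uberftp", "-size", DST],
                             check=True, capture_output=True, text=True)
        if str(EXPECTED_SIZE) not in out.stdout:
            raise RuntimeError("size mismatch at destination")

        # Delete the file at the destination, again using uberftp.
        subprocess.run(["uberftp", "-rm", DST], check=True)

    while True:
        start = time.time()
        one_cycle()
        print(f"transfer cycle took {time.time() - start:.1f}s")
        time.sleep(10)  # 10 seconds later, start the next transfer

In the real test one such loop would run per directional site pair, which is how a site with five partners can see several inbound and outbound transfers at once.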

The results are shown below. The graphic shows the distribution of transfer durations between site pairs. The source site is along the rows, the destination site along the columns, so the plot in the top-right is from Caltech to INFN and the plot in the bottom-left is from INFN to Caltech. The number of data points in each plot (including overflows) is shown as N, and the mean transfer duration as M. The x-axis is in seconds, and all plots are on the same scale, 0-1500 seconds. The y-axis differs from plot to plot, but only the shape of the distribution matters.

N.B. don't be fooled by the bin-size. Bin-size is determined by the properties of the data, so data that's all bunched up at the left gets lots of narrow bins, while data that's spread out may have wider bins and show a lot more blue (don't ask me why!). E.g., compare Caltech to INFN (top-right) with CERN to INFN (just below it). The first plot has 23 entries; the second has 5, and many of those are in the overflow!
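
If you want to see the binning effect in isolation, here's a small Python sketch using numpy's data-driven "auto" binning. This is an assumption for illustration (the actual plots were made with a different tool, so the exact binning rule differs), but the behaviour is the same in spirit: bunched-up data gets many narrow bins, spread-out data gets fewer, wider ones.

    import numpy as np

    rng = np.random.default_rng(0)

    # Durations bunched up at the left (fast transfers)...
    fast = rng.normal(loc=60, scale=10, size=23).clip(min=0)
    # ...versus durations spread over the whole 0-1500 s range.
    slow = rng.uniform(low=100, high=1500, size=5)

    for name, data in [("fast", fast), ("slow", slow)]:
        counts, edges = np.histogram(data, bins="auto")
        print(f"{name}: {len(counts)} bins of width {edges[1] - edges[0]:.0f}s")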

  • Grid of transfer duration plots between source-destination pairs:
    Rstats-01.gif

Conclusions:

  • For Caltech, GRIDKA, and INFN, things look good. Transfers between these sites run well in both directions
  • FZU
    • Transfers from FZU run well
    • Transfers to FZU are much slower, often less than 1 MB/sec (> 1000 seconds for a 1 GB file; see the quick arithmetic below)
  • For GARR, transfers to or from anywhere are extremely slow, often off the chart.
  • CERN
    • Transfers to CERN are fine for INFN and GRIDKA, but not for Caltech and GARR
    • Transfers from CERN are off the chart, often less than 1 MB/sec

So some sites (Caltech, GRIDKA, INFN) behave nicely. Others show transfers that perform badly and asymmetrically (FZU and CERN), or uniformly badly (GARR).
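
The duration-to-rate conversion used in these conclusions is simple, but worth making explicit since all the plots are in seconds. A quick Python check:

    def rate_mb_per_s(size_bytes: int, duration_s: float) -> float:
        """Average transfer rate in MB/sec (1 MB = 10**6 bytes)."""
        return size_bytes / 1e6 / duration_s

    # A 1 GB file taking 1000 s averages 1 MB/sec...
    print(rate_mb_per_s(1_000_000_000, 1000))  # -> 1.0
    # ...while the same file in 100 s would be 10 MB/sec.
    print(rate_mb_per_s(1_000_000_000, 100))   # -> 10.0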

CERN, GARR, FZU

I have other results (not shown here, sorry) showing that transfers from my private VM to the CERN host are also quite slow (mean 108 seconds), while transfers from my VM to Caltech are fast (mean 13 seconds), so it's probably not a firewall issue at CERN.

For GARR and FZU, I suspect the problems may be load-related. I ran a test between just these two nodes, no others. Transfers ran reliably (no errors) but slowly (1 MB/sec).

Comparison with IPv4 performance

Here's the same plot after three days of running with the IPv4 endpoints for comparison:

  • Duration plots between source-destination pairs using the IPv4 endpoints:
    R-combined-all-IPv4.gif

Points to note:

  • Caltech and INFN still look OK, once the misbehaving sites are taken into account.
  • Transfers to and from CERN are mostly better than in the IPv6 case.
  • Transfers from FZU are significantly worse than in the IPv6 case; transfers to FZU are better.
  • Transfers to and from GARR are somewhat better, but still not exactly fast.
  • GRIDKA performs much worse on IPv4 than on IPv6, both as a source and as a destination

So some sites are not so different (Caltech, INFN), some are better on IPv4 (CERN, GARR), some are worse on IPv4 (GRIDKA), and some are better in one direction, worse in the other (FZU). I have no idea what to conclude from that!

Error rates

To check the error rates, I ran a test between only CERN and Caltech, to avoid overloading the CERN end. Preliminary results after a day of running show that I still see an error rate of 2-4%. The errors are randomly distributed in time and occur in both directions (see the plots below). All 13 errors seen so far report the same problem:
    error: globus_ftp_client: the server responded with an error
    500 500-Command failed. : callback failed.
    500-globus_xio: System error in recv: Connection reset by peer
    500-globus_xio: A system call failed: Connection reset by peer
    500 End.

This error rate is small enough that I can build a working PhEDEx system with it, though of course it is worth reducing it further.
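
To put a number on "small enough": if failures are independent at rate p (the random-in-time distribution above suggests they roughly are, though that's an assumption), a layer that retries failed transfers, as PhEDEx does, drives the residual failure rate down geometrically. A quick Python sketch:

    # Probability that a transfer still fails after n attempts,
    # assuming independent failures at rate p per attempt.
    def residual_failure(p: float, attempts: int) -> float:
        return p ** attempts

    for p in (0.02, 0.04):  # the observed 2-4% range
        print(f"p={p:.0%}: one retry -> {residual_failure(p, 2):.4%}, "
              f"two retries -> {residual_failure(p, 3):.6%}")

Even at the top of the observed range, a couple of retries make losses negligible.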

  • Transfer profile from Caltech to CERN. Blue dots are successful transfers, red dots show failed transfers:
    Rstats-Caltech-to-CERN.gif

  • Transfer profile from CERN to Caltech. Blue dots are successful transfers, red dots show failed transfers:
    Rstats-CERN-to-Caltech.gif

Single link-pair tests

Running the same test with single link-pairs (i.e. a pair of nodes that only transfer between themselves, in both directions) showed more stable performance. The link-pairs tested were:
  • Caltech to/from FZU
  • Caltech to/from GRIDKA
  • Caltech to/from INFN
  • CERN to/from GRIDKA
  • CERN to/from INFN
  • GARR to/from INFN

At least 1000 transfers of the same 1 GB file were run in each direction on each link-pair. The only errors seen were from CERN to INFN, where 2.3% of the transfers failed. There's no obvious reason why this particular link should be the only one to show errors; see below for more results.
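
To judge whether that one link is a genuine outlier rather than a fluctuation, a rough confidence interval on the failure rate helps. Here's a Python sketch using a normal approximation; the counts (roughly 23 failures in 1000 transfers) are read off the 2.3% figure above, so treat them as approximate.

    import math

    def approx_95ci(failures: int, trials: int) -> tuple[float, float]:
        """Normal-approximation 95% confidence interval for a failure rate."""
        p = failures / trials
        half = 1.96 * math.sqrt(p * (1 - p) / trials)
        return max(0.0, p - half), p + half

    lo, hi = approx_95ci(23, 1000)  # CERN to INFN, ~2.3% of ~1000 transfers
    print(f"failure rate: 2.3%, 95% CI {lo:.1%} to {hi:.1%}")

Since every other link-pair saw zero failures in a comparable number of transfers, the CERN-to-INFN rate sits well outside statistical noise.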

Full-mesh tests with small files

In order to investigate performance problems that might be due to transferring a large file (1 GB in the above tests), I tried a full-mesh test with a 100 MB file. The results are shown below.

From the distribution of transfer profiles, you can see that all nodes are pretty reliable, though performance varies a lot:

  • transfers from CERN are systematically slow, less than 1 MB/sec.
  • GARR is the other main problem, with transfers both to and from it typically running at 1 MB/sec.

  • Transfer duration profiles for a 100 MB file:
    R-combined-all.gif

The only errors seen during transfer were either from CERN or to CERN. Error rates are less than 1% in all cases, so are essentially negligible, but it's interesting that CERN is always involved. These errors are widely distributed in time and occur with a number of partner nodes.
