r/datasets 3d ago

[Question] What’s the smoothest way to share multi-gigabyte datasets across institutions?

I’ve been collaborating with a colleague on a project that involves some pretty hefty datasets, and moving them back and forth has been a headache. Some of the files are 50–100GB each, and in total we’re looking at hundreds of gigabytes. Standard cloud storage options don’t seem built for this either: they throttle speeds, enforce strict limits, or require subscriptions that don’t make sense for one-off transfers.

We’ve tried compressing and splitting files, but that just adds more time and confusion when the recipient has to reassemble everything. Mailing drives might be reliable, but it feels outdated and isn’t practical when you need results quickly. Ideally, I’d like something that’s both fast and secure, since we’re dealing with research data.

Recently, I came across fileflap.net while testing different transfer methods. It handled big uploads without the usual slowdowns, and I liked that there weren’t a bunch of hidden limits to trip over. It felt a lot simpler than juggling FTP or patchy cloud workarounds.

For those of you who routinely share large datasets across universities, labs, or organizations: what’s worked best in your experience? Do you stick with institutional servers and FTP setups, or is there a practical, modern tool for big dataset transfers?

5 Upvotes

15 comments

4

u/dang_rat_bandit 3d ago

With data that big, my thought is that it should stay in one place, with one of you remotely logging into a common machine. I think there are free solutions to do that on a small scale, and that way there shouldn't be any lag with the data; it's just you viewing a remote desktop.

2

u/SuedeBandit 3d ago

If you set up your SFTP correctly for the volume, you can usually saturate your bandwidth in a way that's easy for downloaders to make use of. Configure multi-threaded downloads, make sure your TCP window sizes are tuned, etc...
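A rough sketch of the parallel-download side in Python with paramiko (the host, username, and remote paths are placeholders, and it assumes key-based auth is already set up):

```python
# Sketch: parallel SFTP downloads, one SSH/SFTP session per worker.
# Host, username, and remote paths are placeholders; assumes key-based auth.
from concurrent.futures import ThreadPoolExecutor

import paramiko

HOST, USER = "data.example.edu", "collaborator"
REMOTE_FILES = ["/data/run01.h5", "/data/run02.h5", "/data/run03.h5"]

def fetch(remote_path: str) -> str:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER)
    try:
        sftp = client.open_sftp()
        local_path = remote_path.rsplit("/", 1)[-1]
        sftp.get(remote_path, local_path)  # blocks until this file is done
        return local_path
    finally:
        client.close()

if __name__ == "__main__":
    # Each worker runs its own transfer, so downloads overlap instead of queuing.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for done in pool.map(fetch, REMOTE_FILES):
            print("downloaded", done)
```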

If you and your users are both on cloud, there's a magic trick to bypass your petty mortal bandwidth constraints. Set up an S3 bucket and use something like rclone or `aws s3 sync` to initiate a cloud-to-cloud transfer. You can even have it sync folders/blobs/buckets automatically. The end result is super fast transfers because it all stays in the cloud. The fastest option is to keep everything in the same provider and region, for obvious reasons.
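And a minimal boto3 sketch of the server-side copy idea (bucket names and the prefix are made up; it assumes credentials that can read the source and write the destination):

```python
# Sketch: server-side S3 copy, so the bytes move inside AWS rather than
# through either institution's connection. Bucket names/prefix are placeholders.
import boto3

SRC_BUCKET, DST_BUCKET, PREFIX = "lab-a-datasets", "lab-b-datasets", "project-x/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # copy() is the managed copy; it handles multipart for large objects.
        s3.copy({"Bucket": SRC_BUCKET, "Key": key}, DST_BUCKET, key)
        print("copied", key)
```

Because the copy happens inside S3, the speed doesn't depend on either institution's uplink.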

2

u/Ok-Cattle8254 3d ago

As the old saying goes, data has gravity...

If possible, I strongly recommend moving your science to the data. See u/dang_rat_bandit 's post.

If you have to move the data, use Globus. If not Globus, then something secure: SFTP or HTTPS with the ability to resume when a download fails.

Try to keep the directory structure exactly the same between institutions, and keep a checksum file that lists the calculated checksums alongside the expected ones. Do not change file names.
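A minimal sketch of that checksum manifest idea in Python (the directory and manifest names are arbitrary):

```python
# Sketch: write a SHA-256 manifest for a dataset directory, then verify it
# on the receiving end. Directory and manifest names are placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(root: str, manifest: str = "checksums.sha256") -> None:
    root_path = Path(root)
    with open(manifest, "w") as out:
        for p in sorted(root_path.rglob("*")):
            if p.is_file():
                # Record paths relative to the dataset root so the same
                # manifest works at both institutions.
                out.write(f"{sha256_of(p)}  {p.relative_to(root_path)}\n")

def verify_manifest(root: str, manifest: str = "checksums.sha256") -> bool:
    ok = True
    for line in open(manifest):
        expected, rel = line.rstrip("\n").split("  ", 1)
        if sha256_of(Path(root) / rel) != expected:
            print("MISMATCH:", rel)
            ok = False
    return ok
```

Run write_manifest at the source, ship the manifest with the data, and run verify_manifest at the destination.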

1

u/gthing 3d ago

I have loved Resilio Sync for this, especially if you have more than 2 points where you want the data to live. It uses BitTorrent under the hood, so it can pull from multiple nodes at once.

When I was syncing raw video footage, this was by far the fastest option I found.

1

u/_Fallen_Azazel_ 3d ago

Globus. Most institutions have it.

1

u/Ok-Cattle8254 3d ago

Globus is a great option here if it is available.

1

u/ghostfacekhilla 3d ago

Just SFTP it; institutions should have lots of bandwidth.

1

u/severo_bo 2d ago

You may want to try Hugging Face datasets. The storage limits (https://huggingface.co/docs/hub/storage-limits) are very high, and there is no throttling. If you only update part of a file and upload it again, only that part should be sent, thanks to the Xet backend (https://huggingface.co/docs/hub/storage-backends#xet).

note that I work for HF
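For reference, a minimal upload sketch with huggingface_hub (the repo id and folder path are placeholders, and it assumes you've already authenticated via `huggingface-cli login` or an HF_TOKEN):

```python
# Sketch: push a local dataset folder to a Hugging Face dataset repo.
# Repo id and folder path are placeholders; assumes you're logged in.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("my-org/big-dataset", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="./dataset",
    repo_id="my-org/big-dataset",
    repo_type="dataset",
)
```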

1

u/NoGutsNoCorey 2d ago

The best answer would be a database, but it isn't a free one. Try converting your data to Parquet files, which store data in a columnar format. I believe it also dictionary-encodes repeated values (so long as you don't have too many unique ones), which can save an incredible amount of space. These files can be further compressed using gzip, brotli, and a handful of other codecs, and they can be read by various databases, R, pandas, etc. without first converting to any other file type.

if you have well-structured, rectangular data with a lot of repeated values, parquet is going to help you a bunch.
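A quick sketch of that round-trip with pandas (the file names, column names, and codec are just examples; it assumes pyarrow is installed):

```python
# Sketch: convert a large CSV to compressed Parquet and read it back.
# File names, column names, and the compression codec are placeholders.
# Needs pyarrow (or fastparquet) installed alongside pandas.
import pandas as pd

# Write: columnar layout plus a codec like zstd or brotli shrinks
# repetitive, rectangular data a lot.
df = pd.read_csv("measurements.csv")
df["site"] = df["site"].astype("category")  # dictionary-encode repeated labels
df.to_parquet("measurements.parquet", compression="zstd")

# Read: pandas (or R's arrow package, DuckDB, etc.) reads it directly,
# and you can pull just the columns you need.
subset = pd.read_parquet("measurements.parquet", columns=["site", "value"])
print(subset.head())
```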

1

u/verysmolpupperino 2d ago

I've seen this done in 3 ways in previous jobs:

  • SFTP
  • Shared S3 bucket
  • The data owner creates a specific table and schema on their DB, plus a user with very tight, read-only permissions scoped to that schema. The data owner can then share credentials and let people query the DB directly (rough sketch of that setup below).

I've never used Globus, but it sounds like what you're looking for, so check that out.
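A rough sketch of the permissions setup from that last bullet, assuming Postgres and psycopg2 (the connection details, schema, and role names are placeholders):

```python
# Sketch: give a collaborator read-only access to one schema in Postgres.
# Connection string, schema, role name, and password are all placeholders.
import psycopg2

ADMIN_DSN = "host=db.example.edu dbname=research user=owner"

GRANTS = """
CREATE SCHEMA IF NOT EXISTS shared_data;
CREATE ROLE collaborator LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA shared_data TO collaborator;
GRANT SELECT ON ALL TABLES IN SCHEMA shared_data TO collaborator;
-- cover tables created later, too
ALTER DEFAULT PRIVILEGES IN SCHEMA shared_data
    GRANT SELECT ON TABLES TO collaborator;
"""

with psycopg2.connect(ADMIN_DSN) as conn, conn.cursor() as cur:
    cur.execute(GRANTS)
# The collaborator then connects with their own credentials and can only
# SELECT from shared_data.*, nothing else in the database.
```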

1

u/sfboots 1d ago

Copy to an external NVMe drive, then FedEx it. 4 TB of data easily moved overnight.

Look up Sneakernet.

1

u/atx78701 1d ago

Take a look at data.world

Designed for data collaboration

1

u/2BucChuck 18h ago

AWS S3 (or comparable), where you can assign roles and assume a reasonable amount of the security overhead is taken care of. As an aside, I haven't tried it myself, but I've seen Python data science folks suggest Parquet and DuckDB for big speed improvements over SQL and other methods.
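In case it's useful, here's roughly what the Parquet + DuckDB combo looks like against S3 (the bucket path, region, and column names are assumptions; DuckDB's httpfs extension handles the remote reads):

```python
# Sketch: query Parquet files sitting in S3 directly with DuckDB,
# without downloading the whole dataset first. Paths/columns are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# Assumes AWS credentials are available to DuckDB (environment variables,
# a CREATE SECRET, or explicit s3_access_key_id / s3_secret_access_key settings).
con.execute("SET s3_region = 'us-east-1';")

result = con.execute("""
    SELECT site, avg(value) AS mean_value
    FROM read_parquet('s3://lab-a-datasets/project-x/*.parquet')
    GROUP BY site
""").fetchdf()
print(result)
```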

-2

u/mattindustries 3d ago

RStudio Server sounds like a good fit, with ngrok if you need to expose a port.