For both backup and staging purposes, we regularly need to copy an entire S3 bucket to another bucket. AWS has no built-in function to do this, nor does the boto Python library.
We started off with a simple `for key in bucket.list()` loop and copied the files one by one, in sequence, with `key.copy(dest_bucket, key_name)`. This is imperfect for a few reasons:
- There are many files, and the files are very large. Processing them one by one takes a long time, and sometimes we need a copy ASAP.
- AWS is designed to fail. Applications built on AWS should be developed to handle failures. With the sequential design, if any one of the key copy requests fails, for any reason, it will interrupt the rest of the process.
This seems like a perfect problem for threading, and I have been looking for an excuse to play with Python’s built-in threading features. This also seems like a perfect chance to try hosting an open source project on GitHub, also a first for me.
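To make the threading idea concrete, here is a minimal sketch of a worker pool that addresses both problems above: copies run in parallel, and a failed copy is recorded instead of stopping the run. It is not the code from the project itself. The names `copy_all`, `copy_one`, and `num_workers` are made up for illustration, and the copy operation is passed in as a plain callable so the sketch does not depend on boto.

```python
import queue
import threading


def copy_all(items, copy_one, num_workers=10):
    """Call copy_one(item) for every item using a pool of worker threads.

    Returns a list of (item, exception) pairs for the copies that failed.
    All names here are hypothetical; this is a sketch, not the project's code.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)

    failures = []
    failures_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                copy_one(item)
            except Exception as exc:
                # Record the failure and keep going, so one bad copy
                # does not interrupt the rest of the run.
                with failures_lock:
                    failures.append((item, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

With boto, the call might look like `copy_all(bucket.list(), lambda key: key.copy(dest_bucket, key.name))`, reusing the same `key.copy` call as the sequential version. Whether one boto connection can safely be shared by all threads is a question this sketch leaves open.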
- The script does not set the ACL on copied keys. I assume the copies get the bucket default.
- The timeout handling is clumsy and results in multiple 30-second delays. Instead, it should log errors and timeouts and retry a fixed number of times.
- 52 GB / 7 minutes = 52,000 MB / 420 seconds ≈ 123.8 MB/sec
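The "log and retry x times" idea from the list above could be sketched roughly like this. It is only an illustration of the approach, not what the script currently does. The helper name `with_retries`, the logger name, and the default values for `max_attempts` and `delay_seconds` are all assumptions.

```python
import logging
import time

log = logging.getLogger("s3copy")  # hypothetical logger name


def with_retries(func, max_attempts=3, delay_seconds=1.0):
    """Wrap func so that it is retried up to max_attempts times.

    Each failure is logged. The last failure is re-raised so the
    caller can still see that the operation never succeeded.
    """
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                log.warning("attempt %d/%d failed: %s",
                            attempt, max_attempts, exc)
                if attempt == max_attempts:
                    raise
                time.sleep(delay_seconds)
    return wrapped
```

Wrapping the per-key copy call in something like `with_retries` would replace a long fixed wait with a few short, logged attempts. A real version would probably want to retry only on errors that are worth retrying.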
With tweaks to timeout and error handling, this can be significantly improved. Curious to hear other people’s experiences too.
Try it out!