I am trying to imagine a website/system where only one copy of a file is ever stored, but multiple users get their own custom link to it, and it still takes advantage of CDN features. For this question, assume we are using S3 and CloudFront (or other AWS CDN features), though I will probably use Google Cloud Storage instead; I’ll translate an S3 solution over, since I assume a Google Cloud question would get much less attention than an AWS one.
I am imagining taking an MD5 or SHA-2 hash of the file, like a blockchain does, to create a unique fingerprint of the content, so the same file is never stored more than once.
Say UserA stores an image of something. That creates a database record of the fingerprint and the resulting bucket location in S3. Then UserB tries to upload the same file, not knowing that UserA already uploaded it. The system would see from the fingerprint that the file already exists, and wouldn’t upload a duplicate to a new location. Instead, it would simply return the user a new URL/ID referencing their file. So from the user’s perspective, it seems as though they just uploaded a new file.
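A minimal sketch of that fingerprint-and-dedupe step, using SHA-256 and in-memory dicts as stand-ins for the database tables (the names and the bucket path scheme here are made up for illustration):

```python
import hashlib

# Hypothetical stand-ins for the database tables described above.
fingerprint_to_location = {}   # content hash -> shared bucket location
slug_to_fingerprint = {}       # per-user slug  -> content hash

def store_file(data: bytes, slug: str) -> str:
    """Store `data` at most once; map the user's slug to the shared location."""
    fingerprint = hashlib.sha256(data).hexdigest()
    if fingerprint not in fingerprint_to_location:
        # First upload: actually write the object to the bucket (omitted here)
        # and record where it landed.
        fingerprint_to_location[fingerprint] = f"/bucket1/images/{fingerprint}.bin"
    slug_to_fingerprint[slug] = fingerprint
    return fingerprint_to_location[fingerprint]

# UserA and UserB upload identical bytes; only one stored copy ever exists,
# but each user keeps their own slug.
loc_a = store_file(b"same picture bytes", "id/123")
loc_b = store_file(b"same picture bytes", "id/312")
assert loc_a == loc_b
assert len(fingerprint_to_location) == 1
```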
This would work fine if we go through the theoretical “app” layer (say a Node.js or Rails app), which would take a path like /id/123 and translate it to the bucket location /bucket1/images/foo.jpg. Then UserB might get /id/312, which also resolves to /bucket1/images/foo.jpg, because the app database contains a mapping from “slug” to bucket location.
But I am pretty sure you couldn’t use CDN functionality in this case. If requests route through your app to S3, you are piping the image/file through your app from S3 (or CloudFront) to the customer, which pretty much negates the benefits of a CDN as I understand it.
Question is, can you somehow have this same sort of mechanism, but also take advantage of the CDN? I know you cannot create “symlinks” on S3, for one. But is there any way architecturally to accomplish this?
- Each file is uploaded and stored only once (due to unique fingerprinting).
- Yet users get their own unique URL/ID to the image if they try to upload the same file.
- Yet, the file locations are close to the end customer (via the CDN).
Is all of this possible somehow? If so, what is the general mechanism to accomplish it? If not, what is the main problem?
If it’s not possible, then I will forget the idea of trying to save on space and prevent 1000 people from uploading the same 50 MB PDF or whatever; I’ll just upload it 1000 times. But if there is a way to accomplish this, it could save a lot of file storage costs.
Sidenote: I don’t want to give users the same ID/path because, for example, I want them to have custom permissions on the file, and to be charged based on how much traffic loads their particular URL. So if 50 users all uploaded the same file and shared their links, each would pay only for the traffic received on their own URL.
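The per-URL billing part only needs accounting keyed on the slug, not on the stored object. A toy sketch (in a real setup with a CDN in front, you would feed this from the CDN’s access logs rather than from your app; the numbers and slugs are invented):

```python
from collections import defaultdict

# Hypothetical per-slug accounting: the bytes come from one shared object,
# but each user's URL is metered independently.
bytes_served = defaultdict(int)

def record_download(slug: str, size_bytes: int) -> None:
    """Attribute one download's traffic to the slug that was requested."""
    bytes_served[slug] += size_bytes

# Two users share the same underlying 50 MB file; each is billed
# only for downloads of their own link.
record_download("id/123", 50_000_000)
record_download("id/123", 50_000_000)
record_download("id/312", 50_000_000)
assert bytes_served["id/123"] == 100_000_000
assert bytes_served["id/312"] == 50_000_000
```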
Trying to optimize for file storage costs, and also for performance (latency and load speed) for the end customer (like loading an image in the browser).
- User uploads file to your app and receives a unique URL which points to your app.
- User follows that URL; your app receives the request and responds with a 302 redirect to the CDN’s shared file location.
- User’s browser automatically follows the redirect and downloads or displays the file or image directly, without passing through your app.
As long as you are fine with the “true” location of the file showing up in the Network tab of a user’s browser developer tools, this seems to satisfy your objective.
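The redirect step above can be sketched framework-agnostically as a pure function the app would call on GET /id/&lt;slug&gt; (the CDN hostname and the slug table are example values, not real AWS names):

```python
# Made-up CDN domain for illustration.
CDN_BASE = "https://dexample123.cloudfront.net"

# Both slugs resolve to the same shared object key, per the dedupe scheme.
slug_to_key = {"123": "images/foo.jpg", "312": "images/foo.jpg"}

def resolve(slug: str):
    """Return the (status, location) pair the app would send for /id/<slug>."""
    key = slug_to_key.get(slug)
    if key is None:
        return 404, None
    # Per-slug permission checks and traffic accounting would run here,
    # before the browser is handed off to the CDN.
    return 302, f"{CDN_BASE}/{key}"

assert resolve("123") == (302, f"{CDN_BASE}/images/foo.jpg")
assert resolve("312") == (302, f"{CDN_BASE}/images/foo.jpg")
assert resolve("999") == (404, None)
```

The app stays in the request path only for the cheap redirect; the heavy file bytes flow straight from the CDN edge to the browser.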