Summary: I'm using Blobstore to let users upload images to be served. I want to prevent users from uploading files that aren't valid images or have dimensions that are too large. I'm using App Engine's Images service to get the relevant metadata. BUT, in order to get any information about the image type or dimensions from the Images service, you have to first execute a transform, which fetches the transformed image to the App Engine server. I have it do a no-op crop and encode as a very low quality JPEG image, but it's still fetching an actual image, and all I want is the dimensions and file type. Is this the best I can do? Will the internal transfer of the image data (from Blobstore to App Engine server) cost me?
Details:
It seems like Blobstore was carefully designed for efficient serving of images from App Engine. On the other hand, certain operations seem to make you jump through inefficient hoops. I'm hoping someone can tell me that there's a more efficient way, or convince me that what I'm doing is not as wasteful as I think it is.
I'm letting users upload images to be served as part of other user-generated content. Blobstore makes the uploading and serving pretty easy. Unfortunately it lets the user upload any file they want, and I want to impose restrictions.
(Side note: Blobstore does let you limit the file size of uploads, but this feature is poorly documented. It turns out that if the user tries to exceed the limit, Blobstore will return a 413 "Entity too large", and the App Engine handler is not called at all.)
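For reference, that limit is set when creating the upload URL; a rough sketch (the handler path and the 1 MB figures are just placeholders I picked):

from google.appengine.ext import blobstore

# Cap each uploaded blob, and the upload as a whole, at 1 MB.
# '/upload' is whatever route maps to the upload handler below.
upload_url = blobstore.create_upload_url(
    '/upload',
    max_bytes_per_blob=1024 * 1024,
    max_bytes_total=1024 * 1024)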
I want to allow only valid JPEG, GIF, and PNG files, and I want to limit the dimensions. The way to do this seems to be to check the file after upload, and delete it if it's not allowed. Here's what I've got:
class ImageUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        try:
            # TODO: Check that user is logged in and has quota; xsrfToken.
            uploads = self.get_uploads()
            if len(uploads) != 1:
                logging.error('{} files uploaded'.format(len(uploads)))
                raise ServerError('Must be exactly 1 image per upload')
            image = images.Image(blob_key=uploads[0].key())
            # Do a no-op transformation; otherwise execute_transforms()
            # doesn't work and you can't get any image metadata.
            image.crop(0.0, 0.0, 1.0, 1.0)
            image.execute_transforms(output_encoding=images.JPEG, quality=1)
            if image.width > 640 or image.height > 640:
                raise ServerError('Image must be 640x640 or smaller')
            resultUrl = images.get_serving_url(uploads[0].key())
            self.response.headers['Content-Type'] = 'application/json'
            self.response.body = jsonEncode({'status': 0, 'imageUrl': resultUrl})
        except Exception as e:
            for upload in uploads:
                blobstore.delete(upload.key())  # TODO: delete in parallel with delete_async
            self.response.headers['Content-Type'] = 'text/plain'
            self.response.status = 403
            self.response.body = e.args[0]
Comments in the code highlight the issue.
I know the image can be resized on the fly at serve time (using get_serving_url), but I'd rather force users to upload a smaller image in the first place, to avoid using up storage. Later, instead of putting a limit on the original image dimensions, I might want to have it automatically get shrunk at upload time, but I'd still need to find out its dimensions and type before shrinking it.
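If I do go the shrink-at-upload route, I imagine it would look roughly like this with the same Images service (untested sketch; note that execute_transforms only returns the resized bytes, so I would still have to store them somewhere and delete the oversized original):

from google.appengine.api import images

# Sketch: produce a copy that fits within 640x640, preserving aspect ratio.
image = images.Image(blob_key=uploads[0].key())
image.resize(width=640, height=640)
resized_bytes = image.execute_transforms(output_encoding=images.JPEG)
# Storing resized_bytes and deleting the original blob is left out here.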
Am I missing an easier or more efficient way?
Actually, the Blobstore is not exactly optimized for serving images; it operates on any kind of data. The BlobReader class can be used to manage the raw blob data.
The GAE Images service can be used to manage images (including those stored as blobs in the Blobstore). You are right that this service only offers information about an uploaded image after executing a transformation on it, which doesn't help with deleting undesirable blob images before any processing.
What you can do is use the Image module from the PIL library (available among GAE's Runtime-Provided Libraries), overlaid on top of the BlobReader class.
Use the PIL Image format and size attributes to get the info you seek and validate the image data without reading the entire image:
>>> image = Image.open('Spain-rail-map.jpg')
>>> image.format
'JPEG'
>>> image.size
(410, 317)
Reading these attributes should be very efficient, since the open call only needs the image header info from the blob:
Opens and identifies the given image file. This is a lazy operation;
the function reads the file header, but the actual image data is not
read from the file until you try to process the data (call the load
method to force loading).
This is how overlaying can be done in your ImageUploadHandler:
from PIL import Image

with blobstore.BlobReader(uploads[0].key()) as fd:
    image = Image.open(fd)
    logging.error('format=%s' % image.format)
    logging.error('size=%dx%d' % image.size)
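From there, your handler can validate and bail out before ever calling execute_transforms; a rough sketch reusing the question's 640 limit and its ServerError exception (Image.open raises an error on non-image data, which your existing except block would catch):

from PIL import Image
from google.appengine.ext import blobstore

ALLOWED_FORMATS = ('JPEG', 'PNG', 'GIF')

with blobstore.BlobReader(uploads[0].key()) as fd:
    image = Image.open(fd)  # lazy: only the header is read here
    if image.format not in ALLOWED_FORMATS:
        raise ServerError('Image must be JPEG, GIF or PNG')
    width, height = image.size
    if width > 640 or height > 640:
        raise ServerError('Image must be 640x640 or smaller')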
When you upload to Google Cloud Storage (GCS) instead of the Blobstore, you have much more control over upload conditions such as object name, content type, and size. A policy document specifies the conditions the upload must meet; if the upload does not meet them, the object is rejected.
Docs here.
Example:
{"expiration": "2010-06-16T11:11:11Z",
"conditions": [
["starts-with", "$key", "" ],
{"acl": "bucket-owner-read" },
{"bucket": "travel-maps"},
{"success_action_redirect":"http://www.example.com/success_notification.html" },
["eq", "$Content-Type", "image/jpeg" ],
["content-length-range", 0, 1000000]
]
}
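The policy above can also be assembled in code; a minimal Python sketch (the values simply mirror the example, and the signing of the encoded policy with your service account key, which produces the form's signature field, is omitted):

import base64
import json

policy = {
    'expiration': '2010-06-16T11:11:11Z',
    'conditions': [
        ['starts-with', '$key', ''],
        {'acl': 'bucket-owner-read'},
        {'bucket': 'travel-maps'},
        {'success_action_redirect': 'http://www.example.com/success_notification.html'},
        ['eq', '$Content-Type', 'image/jpeg'],
        ['content-length-range', 0, 1000000],
    ],
}
# The base64-encoded document goes into the form's "policy" field.
encoded_policy = base64.b64encode(json.dumps(policy).encode('utf-8'))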
The POST response if the content length was exceeded:
<Error>
  <Code>EntityTooLarge</Code>
  <Message>
    Your proposed upload exceeds the maximum allowed object size.
  </Message>
  <Details>Content-length exceeds upper bound on range</Details>
</Error>
The POST response if a PDF was sent:
<Error>
  <Code>InvalidPolicyDocument</Code>
  <Message>
    The content of the form does not meet the conditions specified in the policy document.
  </Message>
  <Details>Policy did not reference these fields: filename</Details>
</Error>
And here you can find my Python code for a direct upload to GCS.
Related
My current task is to serve images from my host (currently S3), but the catch is that nothing persistent about the image should be exposed. I can't hand out its URL directly, since S3 always includes the same key name in the URL even when it is presigned. The solution for that would be an image server that downloads the image from S3 and sends it back to the client, with a URL that is always dynamic and random (using a JWT). The remaining problem is that the base64 the client receives is persistent; it never changes. One trade-off I can accept is randomly modifying a few characters inside the base64 string, maybe 2 or 3, which would mess up a pixel or two, and as long as it isn't noticeable that's okay with me. But this technique seems a bit slow because of the bandwidth involved. Is there any way to make the image non-persistent and random every time the client receives it?
I provide users with a SAS token to upload blobs. I'd like to check whether the blobs represent a valid image or not. How can I do this? In the SAS token, I make sure the blob name ends with a jpeg extension, but this does not mean the users upload an image since everything is uploaded as a byte stream.
This is not possible, as described here. Perhaps the better way to validate is at the front end, when the user tries to upload the file.
You can write an Azure Function that is triggered every time a new blob is uploaded. In that function you can validate whether the blob is a valid image file; if it is not, you can delete it or send an email to the uploader.
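The validation part of such a function could be as small as this Pillow-based sketch run against the downloaded blob bytes (the trigger binding and the delete/notify follow-up are left out, and the helper name is just a placeholder):

import io
from PIL import Image

def is_valid_image(blob_bytes):
    # Returns True only if the bytes parse as a JPEG, PNG or GIF image.
    try:
        img = Image.open(io.BytesIO(blob_bytes))
        img.verify()  # checks integrity without decoding every pixel
        return img.format in ('JPEG', 'PNG', 'GIF')
    except Exception:
        return False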
I am trying to parse text in an image of a restaurant bill. I've been able to set up the Ruby AWS SDK, which includes the Rekognition client, using this example. Locally, I have been able to make a call to Rekognition, passing in a local image.
When I make the call with #detect_text (docs), I get a response whose TextDetections represent either lines or words in the image. However, I would like the response to contain only TextDetections of type LINE. Here are my questions:
Is it possible to get a response back that only contains TextDetections of type LINE?
Is it possible to increase the limit of words detected in an image? Apparently according to the docs:
DetectText can detect up to 50 words in an image
That sounds like a hard limit to me.
Is there a way I can get around the limit of 50 words in an image? Perhaps I can make multiple calls on the same image where Rekognition can parse the same image multiple times until it has all the words?
Yes, that is a hard limit: you cannot detect more than 50 words in an image. A workaround is to crop the image into multiple images and run DetectText on each cropped image.
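The question uses the Ruby SDK, but here is the tiling idea sketched with boto3 and Pillow (the file name and strip count are arbitrary placeholders, and text that straddles a tile boundary can be cut in half, so treat results as approximate):

import io
import boto3
from PIL import Image

rekognition = boto3.client('rekognition')
img = Image.open('bill.jpg')
width, height = img.size
tile_h = (height + 1) // 2  # split into two horizontal strips

lines = []
for top in range(0, height, tile_h):
    tile = img.crop((0, top, width, min(top + tile_h, height)))
    buf = io.BytesIO()
    tile.save(buf, format='JPEG')
    resp = rekognition.detect_text(Image={'Bytes': buf.getvalue()})
    # Keep only LINE detections, per the first question.
    lines += [d['DetectedText'] for d in resp['TextDetections']
              if d['Type'] == 'LINE']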
I am converting Oracle BLOB content into a byte stream and uploading the contents to Azure cloud storage. Is there any way I can cross-check whether the files uploaded to the storage are intact and not corrupted?
Thanks for your support.
#Bala,
As far as I know, we can check whether files were uploaded successfully via these methods:
After uploading the file, we can get the blob's length property and compare it with the original file size.
blob.FetchAttributes();
bool success = blob.Properties.Length == length;
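If you are doing the upload from Python rather than .NET, the same length check looks roughly like this with the azure-storage-blob v12 SDK (the connection string, names, and local_file_size are placeholders):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str, container_name='uploads', blob_name='export.bin')
props = blob.get_blob_properties()
# Compare the uploaded blob's size with the size of the source data.
success = props.size == local_file_size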
Another approach is to split the file into chunks and upload those chunks asynchronously using the PutBlockAsync method. You can also show upload progress by building a progress bar from the chunk size and the number of chunks uploaded. I recommend this post on how to use the method:
https://stackoverflow.com/a/21182669/4836342 or this blog.
I currently have two buckets in S3 - let's call them photos and photos-thumbnails. Right now, when a user uploads an image from our iOS app, we directly upload that photo to the photos bucket, which triggers a lambda function that resizes the photo into a thumbnail and uploads the thumbnail into the photos-thumbnails bucket.
I now want to compress the images in the original photos bucket before the thumbnail is created. However, if I set the compression Lambda function to be triggered whenever an object is created in the photos bucket, it will end up in a never-ending loop: the user uploads the original photo, which triggers the compression and places the result back in the same bucket, which triggers the compression again, and so on.
Is there a way I can intercept this before it becomes a recursive call for image compression? Or is the only way to create a third bucket?
A third bucket would probably be best. If you really want to use the same bucket, choose some criterion controlling whether the image in photos should be modified or not (perhaps image file size), then ensure that images that have been processed once fall below the threshold. The Lambda will still run twice, but the second time it will examine the image, find it has already been processed, and not process it again. To my knowledge there is no way to suppress the second run of the Lambda.
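The criterion could be the file size as suggested above, or a marker your function writes into the object's metadata; here is a boto3 sketch of the metadata variant (the metadata key and the compress() helper are placeholders for your own code):

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    head = s3.head_object(Bucket=bucket, Key=key)
    if head['Metadata'].get('compressed') == 'true':
        return  # written by this function on the previous run; skip it

    original = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    s3.put_object(Bucket=bucket, Key=key, Body=compress(original),
                  Metadata={'compressed': 'true'})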
Another option might be to filter based on how the object is created. S3 supports the following event types; use one for what your users upload (maybe POST?) and the other for what your Lambda does (maybe PUT?), as sketched after the list:
s3:ObjectCreated:Put
s3:ObjectCreated:Post
s3:ObjectCreated:Copy
s3:ObjectCreated:CompleteMultipartUpload
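If you go this route, the filtering lives in the bucket's notification configuration rather than in the function code; roughly (the bucket name and Lambda ARN below are placeholders):

import boto3

s3 = boto3.client('s3')

# Trigger the compression Lambda only for form POST uploads; if the Lambda
# writes its output back with a plain PUT, it will not re-trigger itself.
s3.put_bucket_notification_configuration(
    Bucket='photos',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:compress-photos',
            'Events': ['s3:ObjectCreated:Post'],
        }],
    })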
A third bucket would work. Or, for essentially the same effect, rename the file with a prefix after compressing it and then check for that prefix before reprocessing the file.
If you name the outputs of your function in a predictable way, you can simply filter out, at the start of the function, any files that were created by the function itself.
However, as was mentioned previously, using a different bucket for the output would be simpler.