I'm considering building a serverless web API which uses API Gateway to receive a stream of JSON blobs. I'd like to archive every incoming blob (after some basic authentication and validation of course). What are your recommendations on how to do this?
Additional info:
I'm using AWS Lambda to reduce cost.
The archives will be accessed very infrequently, so I've been eyeballing S3 Glacier to reduce storage pricing. My issue is figuring out how to batch multiple blobs into each S3 object to avoid the overhead of many small files.
Alternative storage services I've been looking at are CloudWatch Logs and DynamoDB.
Good question; I'll post a possible solution below. Could you also list more requirements: what load are you expecting, what targets are you trying to hit, and so on? Are you sure Glacier is the service you want? Retrieving data from it takes quite a bit of time and can be expensive, so do some calculations based on the amount of data you have.
I would split this task into two different AWS Lambda functions:
The first Lambda receives the JSON and saves it into S3 as a plain text file.
Then set up a CloudWatch alarm based on the number of objects saved into the bucket or their total size (http://docs.aws.amazon.com/AmazonS3/latest/dev/cloudwatch-monitoring.html). The alarm triggers the second Lambda, the archive function. Let's say the alarm fires after 30 MB of data has been received.
The archive function's job is to take the S3 bucket content, put it together into a zip/tar package, remove the archived files, and move the archive into the next S3 bucket/Glacier.
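For illustration, here is a rough Python sketch of what the archive function could look like with boto3; the bucket names and key scheme are placeholders, not part of the original proposal:

```python
import io
import tarfile
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

STAGING_BUCKET = "my-json-staging"   # hypothetical bucket the first Lambda writes to
ARCHIVE_BUCKET = "my-json-archive"   # hypothetical bucket, e.g. with a Glacier lifecycle rule

def handler(event, context):
    """Bundle all staged objects into one tar.gz, upload it, then delete the originals."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=STAGING_BUCKET):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    if not keys:
        return {"archived": 0}

    # Build the tar.gz in memory; for very large batches, stream to /tmp instead.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for key in keys:
            body = s3.get_object(Bucket=STAGING_BUCKET, Key=key)["Body"].read()
            info = tarfile.TarInfo(name=key)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    buf.seek(0)

    archive_key = datetime.now(timezone.utc).strftime("archive-%Y%m%dT%H%M%SZ.tar.gz")
    s3.put_object(Bucket=ARCHIVE_BUCKET, Key=archive_key, Body=buf.getvalue())

    # delete_objects accepts at most 1000 keys per call.
    for i in range(0, len(keys), 1000):
        s3.delete_objects(
            Bucket=STAGING_BUCKET,
            Delete={"Objects": [{"Key": k} for k in keys[i:i + 1000]]},
        )
    return {"archived": len(keys), "archive_key": archive_key}
```

The first Lambda would simply put each validated JSON blob into the staging bucket under a unique key.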
I am looking to set up an event-driven architecture to process messages from SQS and load them into AWS S3. The events will be low volume, and I was looking at either Databricks or AWS Lambda to process these messages, as these are the two tools we have already procured.
I want to understand which one would be best to use. I'm struggling to differentiate them for this task because the throughput is only up to 1,000 messages per day and unlikely to go higher for now, so both are capable of handling it.
I just wanted to see what other people would consider to be the differentiators between the two products, so I can make sure this is as future-proofed as it can be.
We have used Lambda more where I work, and it may help to stay consistent since we have more AWS skills in house, but we are also looking to build out our Databricks capability, and I personally find it easier to use.
If it was big data then I would have made the decision easier.
Thanks
AWS Lambda seems to be a much better choice in this case. Below are some of the benefits you get with Lambda compared to Databricks.
Pros:
Free of cost: AWS Lambda's free tier includes 1 million requests per month and 400,000 GB-seconds of compute time per month, so your request rate of 1,000/day will easily be covered.
Very simple setup: The Lambda function implementation will be very straightforward. Connect the SQS queue to your Lambda function using the AWS Console or the AWS CLI. The function code will be just a couple of lines: it receives the message from the SQS queue and writes it to S3 (a minimal sketch is included at the end of this answer).
Logging and monitoring: You won't need any separate setup to track the performance metrics - How many messages were processed by Lambda, how many were successful, how much time it took. All these metrics are automatically generated by AWS CloudWatch. You also get an in-built retry mechanism, just specify the retry policy and AWS Lambda will take care of the rest.
Cons:
One drawback of this approach is that each invocation of Lambda will write a separate file to S3, because S3 doesn't provide APIs to append to existing objects. So you will get about 1,000 files in S3 per day. Maybe you are fine with this (it depends on what you want to do with the data in S3). If not, you will either need a separate job to merge the files periodically, or have the Lambda download the existing file from S3, append to it, and upload it back, which makes your Lambda a bit more complex.
Databricks, on the other hand, is built for a different kind of use case: loading large datasets from Amazon S3 and performing analytics, SQL-like queries, building ML models, etc. It isn't a good fit for this one.
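To give a sense of how small the function is, here is a rough Python sketch; the bucket name and key scheme are assumptions for illustration:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-landing-bucket"  # hypothetical bucket name

def handler(event, context):
    """Write each SQS message in the incoming batch to its own S3 object."""
    for record in event["Records"]:
        # SQS message IDs are unique, so they make a convenient object key.
        key = f"messages/{record['messageId']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=record["body"].encode("utf-8"))
    return {"written": len(event["Records"])}
```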
For now, every team in my company that needs to serve data from HDFS to users builds its own tool for that task.
We want to create a generic tool for serving that data quickly, in near real time, over HTTP from HDFS to our services. By generic I mean the tool should serve data only for the services added to its configuration, and adding a service to that configuration should be the only action users have to perform to use it. The tool should be notified about new data appearing in HDFS and then invoke some kind of job that moves the data from HDFS to our fast storage.
Applications can update their data every day or every hour, but each service can do it at a different time (service A can be updated every day at 7 AM and service B every hour). I think we do not want to use any schemas; we want to access our data using only a key and a partition date. Querying is not necessary.
We do not know yet how much capacity, or how many reads/writes per second, our tool needs to withstand.
We have worked out a solution for our problem, but we are interested in whether similar open-source solutions already exist, or whether any of you have had a similar use case.
This is our proposed architecture:
[architecture diagram]
If you need to access HDFS over HTTP, then WebHDFS might fit your use case. You could add a caching layer to speed up requests for hot files, but I think that as long as you are using HDFS you will never get sub-second responses for any file that isn't already cached. You have to decide whether that is acceptable for you.
I'm unsure of how well WebHDFS works with huge files.
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
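As a quick illustration, reading a file through WebHDFS is just an HTTP GET against the NameNode (which redirects to a DataNode). Here is a small Python sketch using the requests library; the host, port, path, and user are placeholders:

```python
import requests

NAMENODE = "http://namenode.example.com:50070"        # hypothetical NameNode HTTP address
HDFS_PATH = "/data/service-a/2024-01-01/part-00000"   # hypothetical file path

def read_hdfs_file(path: str, user: str = "hdfs") -> bytes:
    """Read a file via the WebHDFS OPEN operation.

    The NameNode answers with a redirect to a DataNode; requests follows it
    automatically and returns the file content.
    """
    url = f"{NAMENODE}/webhdfs/v1{path}"
    resp = requests.get(url, params={"op": "OPEN", "user.name": user}, timeout=60)
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    data = read_hdfs_file(HDFS_PATH)
    print(f"Read {len(data)} bytes")
```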
I have a lifecycle rule enabled on S3 which moves objects to Glacier after 30 days. Since AWS does not support event notifications for this transition yet, I don't have a way to tell my application that an object has moved to Glacier.
My use case is: once an object has moved to Glacier, I want to restrict users from performing any action on that object. Is there a way to get an update once an object moves to Glacier?
I am planning to implement a scheduler (using Spring's @Scheduled) which will run every hour, scan all objects in S3, check whether they have moved to Glacier, and update the application's RDS accordingly.
Let me know if there are other good approaches to handle this use case instead of writing a scheduler.
Regards.
Is there a way to get an update once an object moves to Glacier?
At the moment, there is no S3 event support for this.
Let me know if there are other good approaches to handle this use case instead of writing a scheduler.
The approach you are planning to use seems reasonable. For the scheduler, you can use a Lambda function with scheduled events, so you don't put the burden on your current server.
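A rough Python sketch of such a scheduled Lambda; the bucket name is a placeholder and the database update is left as a stub:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical bucket name

def handler(event, context):
    """Triggered by a CloudWatch Events/EventBridge schedule, e.g. rate(1 hour).

    Lists all objects and collects the keys whose storage class is GLACIER,
    so the application database can be updated accordingly.
    """
    glacier_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj.get("StorageClass") == "GLACIER":
                glacier_keys.append(obj["Key"])

    # Stub: update the corresponding RDS rows here, e.g. mark these
    # objects as read-only in the application.
    return {"glacier_objects": len(glacier_keys)}
```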
I will be starting a project in the not-too-distant future in which we will store thousands upon thousands of images over time. I'm having a hard time deciding whether to use Amazon S3 or EFS to store those images. Both, I think, are very good options, but my question is: which would be the better service, or what would be best practice?
My application will be built with Laravel, and I have already done the integration with both services.
The main characteristics of the project are:
Most of the files I will store will be photos, about 95%.
Approximately 1.5k photos would be stored daily.
The photos are very large (from professional cameras).
Traffic to the application will not be heavy: approx. 100 users at a time.
Each user would view about 100 photos per day.
What do you recommend?
S3 is absolutely the right answer and the right practice. I have built numerous applications like the one you describe, some with hundreds of millions of images, and S3 is superior. It also gives you flexibility: your API can return the images as pre-signed URLs, which reduces load on your servers (a short sketch of generating one follows the reference links below); images can be linked directly via static website hosting; and lifecycle policies let you archive less-used data. Additionally, further integration with other AWS services is easy using event triggers.
As for storing/uploading, S3 multi-part upload is very useful to both increase performance and increase reliability.
EFS would make sense for your type of scenario if you were doing some intensive processing where you had a cluster of servers that needed lower latency with a shared file system - think HPC. EFS also comes at a higher cost and doesn't provide as many extensibility options or built-in features as S3. Your scenario doesn't sound like it requires EFS.
http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/ShareObjectPreSignedURL.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
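For illustration, generating a pre-signed URL is a one-liner with boto3 (a Laravel app would do the equivalent via the AWS SDK for PHP); the bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def photo_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a time-limited URL the client can fetch directly from S3,
    so the image bytes never pass through your application servers."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Example (hypothetical bucket and key):
# url = photo_url("my-photo-bucket", "2024/06/01/IMG_0001.jpg")
```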
For the scenario you propose, AWS S3 is the choice. Why?
Since images are mostly just added rather than modified, S3 costs roughly 1/10th of EFS.
Less overhead on your web servers, since files can be uploaded and downloaded directly with S3.
You can leverage event-driven processing with Lambda, e.g. generating thumbnails or applying image-processing filters via an S3 Lambda trigger (see the sketch after this list).
Higher SLAs for availability and durability.
Built-in lifecycle management to archive data and reduce cost.
AWS EFS can also be an option if you happen to modify the images frequently (EBS is an option there as well).
You can also consider using AWS CloudFront with either option to cache images.
Note: in the end it's not about using a single service. Based on your upcoming requirements, you can choose either one of them or both.
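A rough sketch of the thumbnail trigger mentioned above; the bucket names are placeholders, and it assumes Pillow is packaged with the function (e.g. as a Lambda layer):

```python
import io
from urllib.parse import unquote_plus

import boto3
from PIL import Image  # assumed to be packaged with the function or provided by a layer

s3 = boto3.client("s3")
THUMBNAIL_BUCKET = "my-photo-thumbnails"  # hypothetical output bucket
THUMBNAIL_SIZE = (400, 400)

def handler(event, context):
    """Triggered by s3:ObjectCreated:* on the photo bucket.

    Downloads each new image, creates a thumbnail, and stores it in a second bucket.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # event keys are URL-encoded

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(original)).convert("RGB")
        image.thumbnail(THUMBNAIL_SIZE)

        out = io.BytesIO()
        image.save(out, format="JPEG")

        s3.put_object(
            Bucket=THUMBNAIL_BUCKET,
            Key=f"thumbs/{key}",
            Body=out.getvalue(),
            ContentType="image/jpeg",
        )
```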
I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning to build a simple analytics system to make sense of certain events that occur on my site.
So I'm planning to have code (both front and back end) trigger events that queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that writes the data continuously to HDFS. Then I can run a MapReduce job at any time to analyze the current data set.
I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is: from the server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use it across servers? Is there a tool available?
Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very much helpful. Thanks!
Since you mention EMR: it takes its input from a folder in S3, so you can use your preferred language's library to push data to S3 and analyze it later with EMR jobs. For example, in Python you can use boto.
There are even drivers that allow you to mount S3 as a device, but a while ago all of them were too buggy to use in production systems. Maybe things have changed since then.
From the EMR FAQ:
Q: How do I get my data into Amazon S3?
You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.
Note that EMR (as well as S3) implies additional costs, and its usage is justified for really big data. Also note that it is always beneficial to have relatively large files, both in terms of Hadoop performance and storage costs.
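A minimal sketch of the upload step using boto3 (boto's successor); the bucket and key layout are assumptions for illustration:

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-input"  # hypothetical bucket that the EMR job reads as input

def upload_batch(local_path: str) -> str:
    """Upload a locally buffered batch of events under a date-partitioned prefix."""
    now = datetime.now(timezone.utc)
    key = now.strftime("events/dt=%Y-%m-%d/batch-%H%M%S.log")
    s3.upload_file(local_path, BUCKET, key)
    return key

# Example: the RabbitMQ consumer appends events to a local file and
# periodically flushes it to S3, keeping each object reasonably large.
# upload_batch("/tmp/events-buffer.log")
```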