How to upload a jar file for batch processing? - heroku

I have a Java process which is an executable jar file. This Java process reads in a file, performs some processing, and then writes the output to a new file. I'd like to share this process via a cloud service, but I'm not sure what to use.
Is there a Heroku or Amazon setup I could use for this?
Using Amazon, could I upload the file to be processed to Amazon Simple Storage Service (S3), trigger the processing job, and then expose the results of this job via a web service?
I'm just enquiring about high-level options, as I'm not sure where to begin.

Where do these input files come from? What services consume them afterwards?
One possible workflow on AWS:
1. Modify your jar to be able to read from and write to S3 (sketched below)
2. Create a web service in this jar with a simple API that takes an S3 location and gives back an S3 location where the output will go.
3. Run this jar on an EC2 host, possibly using Elastic Beanstalk
The workflow would be something like:
1. Upload an input file to S3
2. Call the web service, providing the S3 URL of the input file
3. The web service does the job and writes the file to another S3 bucket, giving back the URL
This can be modified in all sorts of ways, but it's difficult to give more guidance without knowing more of your requirements.
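For step 1 of the workflow above, the S3 plumbing might look roughly like this; a minimal sketch assuming the AWS SDK for Java (aws-java-sdk-s3) is on the classpath, with placeholder bucket names, keys, and a stubbed process() method standing in for your existing logic:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import java.io.File;

    public class S3BatchJob {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // 1. Download the input object from S3 to local disk
            File input = new File("/tmp/input.dat");
            s3.getObject(new GetObjectRequest("my-input-bucket", "jobs/input.dat"), input);

            // 2. Run the jar's existing file-in/file-out logic
            File output = process(input);

            // 3. Upload the result to an output bucket
            s3.putObject("my-output-bucket", "jobs/output.dat", output);
        }

        private static File process(File input) {
            // placeholder for the existing processing code
            return input;
        }
    }

The web service in steps 2 and 3 would then just accept an input location, run this logic, and return the output location.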

Related

Sending Oracle AWS RDS logs to an S3 bucket

I'm trying to send logs from an Oracle RDS instance hosted in Amazon to an S3 bucket. I'd like to send the logs to the S3 bucket daily.
What would be a recommended course of action to achieve this? I'm not concerned whether the data is compressed or in its original format.
I'm also relatively new to AWS, so I'm not fully aware of all the features available that could make this possible, if there are any.
There are two ways you can do that:
1. Download the log file using the instructions here: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.html#USER_LogAccess.Procedural.Downloading and then upload it to S3.
2. Automate the process: download the log file using the CLI (check the above link for CLI commands) and upload it to S3.
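If you prefer calling the API directly over the CLI, a minimal sketch with the AWS SDK for Java (aws-java-sdk-rds and aws-java-sdk-s3) could look like the following; the instance identifier, log file name, and bucket are placeholder assumptions, and you would schedule it to run daily (cron, scheduled task, etc.):

    import com.amazonaws.services.rds.AmazonRDS;
    import com.amazonaws.services.rds.AmazonRDSClientBuilder;
    import com.amazonaws.services.rds.model.DownloadDBLogFilePortionRequest;
    import com.amazonaws.services.rds.model.DownloadDBLogFilePortionResult;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class RdsLogToS3 {
        public static void main(String[] args) {
            AmazonRDS rds = AmazonRDSClientBuilder.defaultClient();
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            StringBuilder log = new StringBuilder();
            String marker = "0";
            boolean pending = true;

            // DownloadDBLogFilePortion returns the log in chunks; page until nothing is pending
            while (pending) {
                DownloadDBLogFilePortionResult portion = rds.downloadDBLogFilePortion(
                        new DownloadDBLogFilePortionRequest()
                                .withDBInstanceIdentifier("my-oracle-instance")
                                .withLogFileName("trace/alert_ORCL.log")
                                .withMarker(marker));
                log.append(portion.getLogFileData());
                marker = portion.getMarker();
                pending = Boolean.TRUE.equals(portion.getAdditionalDataPending());
            }

            s3.putObject("my-log-bucket", "oracle/alert_ORCL.log", log.toString());
        }
    }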

Cloud Services to run Batch script when file is uploaded?

I am looking to run a batch script on files that are uploaded from my website (one at a time), and return the resulting file produced by that batch script. The website is hosted on a shared Linux environment, so I cannot run the batch file on the server.
It sounds like something I could accomplish with Amazon S3 and Amazon Lambda, but I was wondering if there were any other services out there that would allow me to accomplish the same task.
I would recommend that you look into S3 Events and Lambda.
Using S3 events, you can trigger a Lambda function on puts and deletes in an S3 bucket, and depending on your "batch file" task you may be able to achieve your goal purely in Lambda.
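As a rough illustration of that option, here's a minimal sketch of an S3-triggered Lambda handler in Java, assuming the aws-lambda-java-core and aws-lambda-java-events libraries (import paths vary slightly between versions); the output bucket and the trivial "processing" step are placeholders for whatever your batch script does:

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

    public class UploadHandler implements RequestHandler<S3Event, String> {
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        @Override
        public String handleRequest(S3Event event, Context context) {
            for (S3EventNotificationRecord record : event.getRecords()) {
                String bucket = record.getS3().getBucket().getName();
                String key = record.getS3().getObject().getKey();

                // Fetch the uploaded file, run the "batch" logic, store the result
                String content = s3.getObjectAsString(bucket, key);
                String result = content.toUpperCase();  // placeholder for the real processing
                s3.putObject("my-output-bucket", key + ".out", result);
            }
            return "done";
        }
    }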
If you cannot use Lambda to replace the functionality of your batch file you can try the following:
If you need to have the batch process run on a specific instance, take a look at Amazon SQS. You can have the S3-event-triggered Lambda create a work item in SQS, and your instance can regularly poll SQS for work to process.
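The instance-side worker could be a simple long-polling loop, sketched here with the AWS SDK for Java (aws-java-sdk-sqs); the queue URL is a placeholder, and the assumption is that the Lambda puts the S3 location of the uploaded file in the message body:

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

    public class SqsWorker {
        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-work";

            while (true) {
                // Long poll for up to 20 seconds to avoid hammering the queue
                ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                        .withWaitTimeSeconds(20)
                        .withMaxNumberOfMessages(1);
                for (Message message : sqs.receiveMessage(request).getMessages()) {
                    String s3Location = message.getBody();   // S3 location written by the Lambda
                    runBatch(s3Location);                    // hand off to the existing batch script
                    sqs.deleteMessage(queueUrl, message.getReceiptHandle());
                }
            }
        }

        private static void runBatch(String s3Location) {
            // placeholder: download the object and invoke the batch script against it
        }
    }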
If you need something a bit more real time, you could use Amazon SNS for a push rather than pull approach to the above.
If you don't need the file to be processed by a specific instance but you still have to run a batch file against it, perhaps you can have your S3-event-triggered Lambda launch an instance with a UserData script that preps the server as needed, downloads the S3 file, runs the batch against it, and then finally self-terminates by looking up its own instance ID via the EC2 metadata service and calling the TerminateInstances API.
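The self-terminate step of that last option might look like this; a sketch, assuming the AWS SDK for Java (aws-java-sdk-ec2), where EC2MetadataUtils reads the instance ID from the metadata service linked below:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
    import com.amazonaws.util.EC2MetadataUtils;

    public class SelfTerminate {
        public static void main(String[] args) {
            // Reads http://169.254.169.254/latest/meta-data/instance-id
            String instanceId = EC2MetadataUtils.getInstanceId();

            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
        }
    }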
Here is some related reading to assist with the above approaches:
Amazon SQS
https://aws.amazon.com/documentation/sqs/
Amazon SNS
https://aws.amazon.com/documentation/sns/
Amazon Lambda
https://aws.amazon.com/documentation/lambda/
Amazon S3 Event Notifications
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
EC2 UserData
http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-instance-metadata.html#instancedata-add-user-data
EC2 Metadata Service
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval
AWS Tools for Powershell Cmdlet Reference
http://docs.aws.amazon.com/powershell/latest/reference/Index.html

Download a file from HDFS cluster

I am developing an API that uses HDFS as distributed file storage. I have made a REST API that allows a server to mkdir, ls, create, and delete a file in the HDFS cluster using WebHDFS. Since WebHDFS does not support downloading a file, are there any solutions for achieving this? I mean, I have a server that runs my REST API and communicates with the cluster. I know the OPEN operation just supports reading a text file's content, but suppose I have a file which is 300 MB in size, how can I download it from the HDFS cluster? I was thinking of directly pinging the datanodes for the file, but this solution is flawed: if the file is 300 MB in size, it will put a huge load on my proxy server. So is there a streaming API to achieve this?
As an alternative you could make use of streamFile, provided by the DataNode API.
wget http://$datanode:50075/streamFile/demofile.txt
It'll not read the file as a whole, so the burden will be low, IMHO. I have tried it, but only on a pseudo-distributed setup, and it works fine. You can give it a try on your fully distributed setup and see if it helps.
One way which comes to my mind is to use a proxy worker which reads the file using the Hadoop FileSystem API and creates a normal local file, and then provides a download link to that file (a short sketch follows below). The downsides being:
Scalability of the proxy server
Files may theoretically be too large to fit on the disk of a single proxy server.
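A minimal sketch of that proxy worker, assuming hadoop-client is on the classpath and a placeholder NameNode address; it reads the HDFS file in chunks and copies them to any OutputStream (a local file, or directly the HTTP response of your REST API), so the whole 300 MB never has to sit in memory at once:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.OutputStream;
    import java.net.URI;

    public class HdfsDownloadProxy {
        // Copy an HDFS file to any OutputStream in 64 KB chunks
        public static void stream(String hdfsPath, OutputStream out) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            byte[] buffer = new byte[64 * 1024];
            try (FSDataInputStream in = fs.open(new Path(hdfsPath))) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);  // forward each chunk straight to the client
                }
            }
        }
    }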

Amazon S3 multipart upload often fails

I'm trying to upload a 32 GB file to an S3 bucket using the s3cmd CLI. It's doing a multipart upload and often fails. I'm doing this from a server which has 1000 Mbps of bandwidth to play with, but the upload is still VERY slow. Is there something I can do to speed this up?
On the other hand, the file is in HDFS on the server I mentioned. Is there a way for the Amazon Elastic MapReduce job to pick it up from this HDFS? It's still an upload, but the job gets executed as well, so the overall process is much quicker.
First I'll admit that I've never used the Multipart feature of s3cmd, so I can't speak to that. However, I have used boto in the past to upload large (10-15GB files) to S3 with a good deal of success. In fact, it became such a common task for me that I wrote a little utility to make it easier.
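Not the boto utility mentioned above, but for comparison, a rough sketch of the same idea with the AWS SDK for Java's TransferManager, which splits large files into parts, uploads them in parallel, and retries failed parts; bucket, key, and path are placeholders:

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;
    import java.io.File;

    public class LargeFileUpload {
        public static void main(String[] args) throws InterruptedException {
            TransferManager tm = TransferManagerBuilder.standard().build();

            // Multipart upload happens automatically for large files
            Upload upload = tm.upload("my-bucket", "backups/huge-file.dat",
                    new File("/data/huge-file.dat"));
            upload.waitForCompletion();  // blocks until every part has been uploaded
            tm.shutdownNow();
        }
    }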
As for your HDFS question, you can always reference an HDFS path with a fully qualified URI, e.g., hdfs://{namenode}:{port}/path/to/files. This assumes your EMR cluster can access this external HDFS cluster (you might have to play with security group settings).

How can I share jar libraries with amazon elastic mapreduce?

To speed up jar uploads to S3, I want to copy all my common jars to something like "$HADOOP_HOME/lib" in normal Hadoop. Is it possible for me to create a custom EMR Hadoop instance with these libraries preinstalled, or is there an easier way?
You could do this as a bootstrap action. It's as simple as placing a script to do the copying into S3 and then, if you're starting EMR from the command line, adding a parameter like this:
--bootstrap-action 's3://my-bucket/bootstrap.sh'
Or if you're doing it through the web interface, just enter the location in the appropriate field.
