Amazon MapReduce best practices for log analysis - hadoop

I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server) and aggregating statistics for each delivered file by date / referrer / user agent.
Tons of logs are generated every hour, and that volume is likely to increase dramatically in the near future, so processing this kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.
Right now my mappers and reducers are ready, and I have tested the whole process with the following flow (a rough script for it is sketched after the list):
uploaded mappers, reducers and data to Amazon S3
configured appropriate job and processed it successfully
downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database by running a CLI script
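For concreteness, here is that flow expressed as a single script using boto3 and pymysql. This is only a sketch: the bucket name, instance types, release label, reducer output format and MySQL schema are all placeholders, not the actual values used above.

import boto3
import pymysql

BUCKET = "my-log-bucket"   # placeholder bucket
DATE = "2024-01-01"        # partition being processed

s3 = boto3.client("s3")
emr = boto3.client("emr")

# 1. Upload mapper, reducer and raw logs to Amazon S3.
for local_path, key in [("mapper.py", "code/mapper.py"),
                        ("reducer.py", "code/reducer.py"),
                        ("access.log.gz", f"input/{DATE}/access.log.gz")]:
    s3.upload_file(local_path, BUCKET, key)

# 2. Configure and run the job: a transient cluster with one Hadoop Streaming step.
cluster = emr.run_job_flow(
    Name=f"log-aggregation-{DATE}",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "aggregate-access-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-files", f"s3://{BUCKET}/code/mapper.py,s3://{BUCKET}/code/reducer.py",
                     "-mapper", "mapper.py", "-reducer", "reducer.py",
                     "-input", f"s3://{BUCKET}/input/{DATE}/",
                     "-output", f"s3://{BUCKET}/output/{DATE}/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# 3. Wait for the cluster to finish, then download the results and insert them into MySQL.
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster["JobFlowId"])
conn = pymysql.connect(host="localhost", user="stats", password="secret", database="stats")
cur = conn.cursor()
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=f"output/{DATE}/").get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read().decode()
    for line in body.splitlines():
        # assumed reducer output format: file \t referrer \t user_agent \t hits
        file_name, referrer, user_agent, hits = line.split("\t")
        cur.execute("INSERT INTO file_stats (day, file, referrer, user_agent, hits) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    (DATE, file_name, referrer, user_agent, int(hits)))
conn.commit()
conn.close()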
I've done all of those steps manually, following the many Amazon EMR tutorials that can be found online.
What should I do next? What is the best approach to automate this process?
Should I control the Amazon EMR JobTracker via the API?
How can I make sure my logs will not be processed twice?
What is the best way to move processed files to an archive?
What is the best approach for inserting results into PostgreSQL/MySQL?
How should data for the jobs be laid out in the input/output directories?
Should I create a new EMR job each time using the API?
What is the best approach for uploading raw logs to Amazon S3?
Can anyone share their setup of the data processing flow?
How should I track file uploads and job completions?
I think this topic can be useful for the many people who try to process access logs with Amazon Elastic MapReduce but have not been able to find good materials and/or best practices.
UPD: Just to clarify, here is the single final question:
What are best practices for logs processing powered by Amazon Elastic MapReduce?
Related posts:
Getting data in and out of Elastic MapReduce HDFS

That's a very broad, open-ended question, but here are some thoughts you could consider:
Using Amazon SQS: this is a distributed queue and is very useful for workflow management. You can have one process that writes a message to the queue as soon as a log is available, and another that reads from the queue, processes the log described in the message, and deletes the message once processing is done. This would ensure that each log is processed only once.
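A minimal sketch of that consumer loop with boto3; the queue URL, message format and process_log() are placeholders:

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/raw-logs"  # placeholder

def process_log(bucket, key):
    # placeholder: fetch the log from S3 and run/submit the EMR step for it
    pass

# Producer side: enqueue a message as soon as a new log lands in S3, e.g.
# sqs.send_message(QueueUrl=QUEUE_URL,
#                  MessageBody=json.dumps({"bucket": "my-logs", "key": "2024-01-01/access.log.gz"}))

# Consumer side: read, process, and delete only after success, so a crashed
# worker leaves the message on the queue to be retried rather than lost.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        process_log(body["bucket"], body["key"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])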
Apache Flume, as you mentioned, is very useful for log aggregation. It is something you should consider even if you don't need real time, as it gives you at the very least a standardized aggregation process.
Amazon recently released Simple Workflow (SWF). I have just started looking into it, but it sounds promising for managing every step of your data pipeline.
Hope that gives you some clues.

Related

In-memory processing with Apache Beam

I am running my own GRPC server collecting events coming from various data sources. The server is developed in Go and all the event sources send the events in a predefined format as a protobuf message.
What I want to do is to process all these events with Apache Beam in memory.
I looked through the Apache Beam docs and couldn't find a sample that does something like what I want. I'm not going to use Kafka, Flink or any other streaming platform; I just want to process the messages in memory and output the results.
Can someone show me a direction of a right way to start coding a simple stream processing app?
OK, first of all, Apache Beam is not a data processing engine; it's an SDK that allows you to create a unified pipeline and run it on different engines, like Spark, Flink, Google Dataflow, etc. So, to run a Beam pipeline you need to use one of the supported data processing engines, or use the DirectRunner, which will run your pipeline locally but (!) has many limitations and was mostly developed for testing purposes.
Like every pipeline in Beam, yours has to have a source transform (bounded or unbounded) that reads data from your data source. I would guess that in your case this will be your GRPC server, which should retransmit the collected events. So, for the source transform, you can either use one of the already implemented Beam IO transforms (IO connectors) or create your own, since there is no GrpcIO or anything similar in Beam for now.
Regarding processing the data in memory, I'm not sure I fully understand what you meant. It will mostly depend on the data processing engine used, since in the end your Beam pipeline will be translated into, for example, a Spark or Flink pipeline (if you use the SparkRunner or FlinkRunner, respectively) before actually running, and then that engine will manage the pipeline workflow. Most modern engines do their best to keep all processed data in memory and spill to disk only as a last resort.
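For example, here is a minimal Beam pipeline on the DirectRunner, assuming the events have already been pulled off the GRPC server into an in-memory Python list; the event fields are invented for illustration:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Events already collected from the GRPC server; field names are placeholders.
events = [
    {"source": "sensor-a", "value": 3},
    {"source": "sensor-b", "value": 5},
    {"source": "sensor-a", "value": 7},
]

options = PipelineOptions(runner="DirectRunner")  # in-process, testing-oriented runner
with beam.Pipeline(options=options) as p:
    (p
     | "CreateEvents" >> beam.Create(events)                         # bounded in-memory source
     | "KeyBySource" >> beam.Map(lambda e: (e["source"], e["value"]))
     | "SumPerSource" >> beam.CombinePerKey(sum)                     # aggregate per key
     | "Print" >> beam.Map(print))

In a real streaming setup, a custom unbounded source (or a small bridge that writes the GRPC events into a supported IO) would replace beam.Create.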

How do you set up multiple Spark Streaming jobs with different batch durations?

We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (why is this?). Do we therefore need multiple Spark clusters, or do we allow multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!
In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that multiplicity, we need to configure each Spark Streaming job for its own requirements: not only the batch interval, but also resources like memory and CPU, data partitioning, and the number of executor nodes (when the load is network-bound).
It follows that each Spark Streaming job becomes a separate deployment on a Spark cluster. That also allows separate pipelines to be monitored and managed independently of each other and helps with further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 24x7.
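To illustrate the one-deployment-per-stream approach: each application creates its own StreamingContext and therefore picks its own batch interval and resources. The source, interval and output path below are placeholders:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("clickstream-etl")       # one application == one pipeline
        .set("spark.executor.memory", "4g")  # per-job resource tuning
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second batches for this pipeline only

lines = ssc.socketTextStream("ingest-host", 9999)  # placeholder source
lines.map(lambda l: l.upper()).saveAsTextFiles("hdfs:///etl/clickstream/out")

ssc.start()
ssc.awaitTermination()

A second pipeline would be an identical, separately submitted application with, say, a 30-second batch interval and its own memory and core settings.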

Use Spark to stream the contents of an S3 bucket that is constantly updated

I have an app that exports files to an S3 bucket at regular intervals. I need to develop a Spark Streaming app that streams from this bucket and delivers the lines of any new files every 30 seconds.
I have read this post, which helped me understand the credentials, but it still doesn't address my needs.
Q1. Could anyone provide some code or hints on how to do this? I've seen the Twitter example, but I could not figure out how to apply it to my scenario.
Q2. How does Spark Streaming know which file it streamed last before picking up the next one? Is this based on the file's LastModified header or some sort of timestamp?
Q3. If the cluster goes down, how do I start streaming again from where I left off?
Thanks in advance!!

Is there any tool to find at what time of day a Hadoop cluster is usually free of load, and to submit a job at that time daily?

I need to schedule a job in our production cluster. I am trying to schedule it at a time when the cluster is expected to be free, based on how the cluster load looked over the past 30 days. Oozie doesn't have any feature that supports this out of the box, so I am trying to achieve it using some hacks within Oozie.
Is there any standard way to find at what times the cluster was usually free over the past few days, and to automatically submit the job at that time every day?
LinkedIn's White Elephant seems to be what you are looking for. Ganglia also has pretty good APIs for gauging cluster usage, which you could use.
You can use Cloudera Manager to check overall cluster health (if you are using CDH).
There are also Cloudera Manager APIs you can interact with; have a look at those for your workaround.
http://blog.cloudera.com/blog/2012/09/automating-your-cluster-with-cloudera-manager-api/
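Whatever the metrics source, the scheduling logic itself is simple. Here is an illustrative sketch that, given hourly load samples for the past 30 days (however you export them from Ganglia or Cloudera Manager; the two entries below are placeholder data), picks the hour of day with the lowest average load:

from collections import defaultdict
from datetime import datetime

# samples: (timestamp, cluster load) pairs parsed from your metrics export.
samples = [(datetime(2024, 1, 1, 3, 0), 0.21),
           (datetime(2024, 1, 1, 14, 0), 0.87)]

loads_by_hour = defaultdict(list)
for ts, load in samples:
    loads_by_hour[ts.hour].append(load)

avg_by_hour = {hour: sum(vals) / len(vals) for hour, vals in loads_by_hour.items()}
quietest_hour = min(avg_by_hour, key=avg_by_hour.get)
print(f"Schedule the daily job around {quietest_hour:02d}:00")

The resulting hour can then be fed into the Oozie coordinator's daily schedule.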

MapReduce on AWS

Anybody played around with MapReduce on AWS yet? Any thoughts? How's the implementation?
It's easy to get started.
Here's a FAQ: http://aws.amazon.com/elasticmapreduce/faqs/
And here's the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
If you have an EC2 account already, you can enable MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.
I did the pre-packaged Word Count sample application, which returns a count of each word contained in about 20 MB of text. You can provision up to 20 instances to run concurrently, though I just used 2 instances and the job completed in about 3 minutes.
The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.
I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.
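For reference, that language flexibility comes from the Hadoop Streaming model: the mapper and reducer are plain scripts that read stdin and write tab-separated key/value pairs to stdout. A word-count pair in Python might look roughly like this (a generic sketch in two separate scripts, not the exact sample EMR ships):

# mapper.py -- emit (word, 1) for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- input arrives sorted by key, so counts can be summed per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")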
Be aware that, since AWS charges for a full hour when an instance is created, and since the MapReduce instances are automatically terminated at the end of the job flow, the cost of multiple fast-running job flows can add up quickly.
For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run the job flow 3 more times, I'll be charged for 80 hours of machine time even though I only had 20 instances running for 1 hour.
You also have the option of running MapReduce (Hadoop) on AWS with StarCluster. This tool configures the cluster for you and has the advantage that you don't have to pay the extra Amazon Elastic MapReduce price (if you want to reduce your costs), and you can create your own image (AMI) with your tools (this can be useful if your tools can't be installed by a bootstrap script).
It is very convenient because you don't have to administer your own cluster; you just pay per use, so I think it is a good idea if you have a job that only needs to run once in a while. We run Amazon MapReduce just once a month, so, for our usage, it is worth it.
However, as far as I can tell, a drawback of Amazon Elastic MapReduce is that you can't tell which operating system is running, or even its version. This caused me problems running C++ code compiled with g++ 4.44; some of the OS images do not support the cURL library, etc.
If you don't need any special libraries for your use case, I would say go for it.
Good answer by MB.
To be clear: you can run Hadoop clusters in two ways:
1) Run it on Amazon EC2 instances. This means that you have to install it, configure it, terminate it, etc.
2) Run it using Elastic MapReduce, or EMR: this is an automated way to run a Hadoop cluster on Amazon Web Services. You pay a little extra on top of the basic EC2 cost, but you don't need to manage anything: just upload your data, then your algorithm, then crunch. EMR shuts down the instances automatically once your jobs are finished.
Best,
Simone
EMR is the best way to use the available resources, with very little added cost over EC2, and you will see how time-saving and easy it is. Most MapReduce implementations in the cloud use this model, e.g. Apache Hadoop on Windows Azure, Mortar Data, etc. I have worked with both Amazon EMR and Apache Hadoop on Windows Azure and found both of them great to use.
Also, depending on the type and duration of the jobs you plan to run, you can use AWS Spot Instances with EMR to get better pricing.
I am working with AWS EMR. It is pretty neat: once you start up a cluster and log in to its master node, you can play around with the Hadoop directory structure and do pretty cool things. If you have an .edu account, don't forget to apply for a research grant; they give up to $100 in free AWS credits.
AWS EMR is a good choice when you use S3 storage for your data.
It provides out of the box integration with S3 for loading files and posting processed files.
In use cases where you only need to run a job on demand, you are saved the cost of running the whole cluster all the time, which really helps you save on instance hours.
Leveraging that advantage, you can use AWS Lambda to spawn event-driven clusters.
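A hypothetical sketch of that pattern: a Lambda function triggered by an S3 PUT notification spins up a transient EMR cluster whose step processes the newly uploaded object. The bucket names, roles, release label and step script are placeholders.

import boto3

emr = boto3.client("emr")

def handler(event, context):
    # S3 PUT notification: pull out the bucket and key that triggered us.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    emr.run_job_flow(
        Name=f"on-demand-{key.replace('/', '-')}",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down after the step
        },
        Steps=[{
            "Name": "process-new-object",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-code-bucket/job.py",
                         f"s3://{bucket}/{key}", "s3://my-output-bucket/results/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )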
