Make an execution timeline on Amazon EMR - hadoop

I am interested in using the job_history_summary.py script to create a Task Timeline of my EMR cluster, similar to this (picture from the Smith College Hadoop Tutorial 1.1, but apparently taken from the Yahoo report on the TeraSort experiment).
It seems that the Hadoop logs are stored on each node rather than on a central server. Do I need to combine the logs manually? It also seems that the script doesn't actually produce the graph.

You can enable logging and provide an S3 bucket. The logs will be zipped and stored in the bucket you provide.
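For reference, a minimal sketch of turning that on with boto3 (the bucket, region, and cluster settings below are made up; the key part is LogUri, which is where EMR copies each node's Hadoop logs, including the job history files that job_history_summary.py parses):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Hypothetical cluster; LogUri makes EMR ship every node's Hadoop logs
    # (including the job history files) to the given S3 prefix.
    response = emr.run_job_flow(
        Name="timeline-demo",
        ReleaseLabel="emr-5.36.0",
        LogUri="s3://my-log-bucket/emr-logs/",  # assumption: bucket already exists
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

Once the cluster terminates, the per-node logs all land under that prefix, so you don't have to combine them by hand; you can pull the job history files down and run the script against them locally (as you note, the script emits data rather than the graph itself, so you still plot its output with something like gnuplot or matplotlib).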

Related

How can I copy files from an external Hadoop cluster to Amazon S3 without running any commands on the cluster

I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands on the cluster. I should be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive? Is that possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is visible from EC2-land, you could run a distcp command there or, if it's a specific bit of data, a Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth though: you cannot use distcp from an un-Kerberized cluster to read data from a Kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems; distcp is best when it schedules the uploads on the hosts which actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files...this is what incremental backup tools tend to use.
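A rough sketch of the distcp route from an edge host (the NameNode host, bucket, and paths are placeholders; this assumes hadoop is on the PATH and the s3a credentials are already configured in core-site.xml or the environment):

    import subprocess

    # Placeholder source and destination. distcp normally runs as a MapReduce
    # job on the cluster the client is configured against; run purely locally
    # it is limited by this host's bandwidth, as noted above.
    src = "hdfs://namenode.example.com:8020/warehouse/mydb/mytable"
    dst = "s3a://my-bucket/landing/mytable"

    subprocess.run(["hadoop", "distcp", src, dst], check=True)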

Is Namenode still necessary if I use S3 instead of HDFS?

I have recently been setting up my Hadoop cluster over an object store, with all data files stored in S3 instead of HDFS, and I have successfully run Spark and MapReduce over S3. So I wonder: is my NameNode still necessary? If so, what does my NameNode do while I am running Hadoop applications over S3? Thanks.
No, provided you have a means to deal with the fact that S3 lacks the consistency needed by the output committers that ship with Hadoop and Spark. Every so often, if S3's listings are inconsistent enough, your results will be invalid and you won't even notice.
Different suppliers of Spark on AWS solve this in their own way. If you are using ASF Spark, there is nothing bundled which can do this.
https://www.youtube.com/watch?v=BgHrff5yAQo
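To make the "no NameNode" point concrete, here is a minimal PySpark sketch that reads and writes S3 directly through the s3a connector (the bucket, paths, and column name are made up; it assumes hadoop-aws and the matching AWS SDK jar are on the classpath and that credentials come from an instance role or the default provider chain):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-no-hdfs-demo")
        # Assumption: credentials are resolved by the default AWS provider chain.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        .getOrCreate()
    )

    # No NameNode is involved here; both input and output are object-store paths.
    # The job commit, however, still depends on S3 listings being consistent,
    # which is exactly the caveat described above.
    df = spark.read.json("s3a://my-bucket/input/events/")
    df.groupBy("event_type").count().write.parquet("s3a://my-bucket/output/counts/")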

Can you send AWS RDS Postgres logs to an AWS Hadoop cluster easily?

In particular, I'd like to push all of the INSERT, UPDATE, and DELETE statements from my Postgres logs to an AWS Hadoop cluster and have a nice way to search them to see the history of a row or rows.
I'm not a Hadoop expert in any way, so let me know if this is a red herring.
Thanks!
Use Flume to send logs from your RDS instance to the Hadoop cluster. With Flume you could use a regex interceptor to filter events and send just the INSERT, UPDATE, and DELETE statements. Hadoop does not make your data searchable, so you have to use something like Solr.
You could either get the data to Hadoop first and then run a bunch of MapReduce jobs to index the data into Solr, or you could configure Flume to write data directly to Solr; see the links below.
Links:
Using flume solr sink
Flume Regex Filtering Interceptor
EDIT:
It seems that RDS instances don't have SSH access, which means you cannot run Flume on the RDS instance itself; instead you have to periodically pull the RDS instance's logs to a machine (this could be an EC2 instance) which has Flume configured.
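As an illustration of what the regex interceptor would be matching, here is a small Python sketch of the same filter (the sample log lines and the exact Postgres log format are assumptions; a Flume regex_filter interceptor would carry an equivalent expression in its configuration):

    import re

    # Keep only data-modifying statements, mirroring what the interceptor would pass on.
    DML_PATTERN = re.compile(r"\b(INSERT|UPDATE|DELETE)\b", re.IGNORECASE)

    sample_events = [
        "2015-06-01 12:00:01 UTC LOG:  statement: INSERT INTO users VALUES (1, 'a')",
        "2015-06-01 12:00:02 UTC LOG:  statement: SELECT * FROM users",
        "2015-06-01 12:00:03 UTC LOG:  statement: UPDATE users SET name = 'b' WHERE id = 1",
    ]

    forwarded = [line for line in sample_events if DML_PATTERN.search(line)]
    for line in forwarded:
        print(line)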

Amazon S3 multipart upload often fails

I'm trying to upload a 32 GB file to an S3 bucket using the s3cmd CLI. It's doing a multipart upload and often fails. I'm doing this from a server which has 1000 Mbps of bandwidth to play with, but the upload is still VERY slow. Is there something I can do to speed this up?
On the other hand, the file is in HDFS on the server I mentioned. Is there a way to have the Amazon Elastic MapReduce job pick it up from this HDFS? It's still an upload, but the job is executing at the same time, so the overall process would be much quicker.
First I'll admit that I've never used the multipart feature of s3cmd, so I can't speak to that. However, I have used boto in the past to upload large (10-15 GB) files to S3 with a good deal of success. In fact, it became such a common task for me that I wrote a little utility to make it easier.
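For what it's worth, here is a minimal sketch of a multipart upload with classic boto, in the spirit of that utility (bucket, key, and file names are made up; the appeal is that each part can be retried on its own instead of restarting a single 32 GB PUT):

    import math
    import os

    import boto  # classic boto, as mentioned above; boto3's TransferConfig is the modern route

    bucket_name = "my-bucket"          # hypothetical
    key_name = "backups/bigfile.bin"   # hypothetical
    file_path = "/data/bigfile.bin"    # hypothetical
    part_size = 256 * 1024 * 1024      # 256 MB parts: fewer, larger requests

    conn = boto.connect_s3()  # credentials from the environment or ~/.boto
    bucket = conn.get_bucket(bucket_name)
    mp = bucket.initiate_multipart_upload(key_name)

    try:
        file_size = os.path.getsize(file_path)
        part_count = int(math.ceil(file_size / float(part_size)))
        with open(file_path, "rb") as fp:
            for i in range(part_count):
                # Upload one chunk per part; a failed part can be re-sent
                # without redoing the whole transfer.
                mp.upload_part_from_file(
                    fp, part_num=i + 1,
                    size=min(part_size, file_size - i * part_size))
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()
        raise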
As for your HDFS question, you can always reference an HDFS path with a fully qualified URI, e.g., hdfs://{namenode}:{port}/path/to/files. This assumes your EMR cluster can access that external HDFS cluster (you might have to play with security group settings).
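A hedged sketch of wiring that up as an EMR step with boto3 (the cluster ID, NameNode host, and bucket are placeholders; s3-dist-cp ships with EMR, and the external HDFS must be reachable from the cluster's security groups):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Hypothetical IDs and hosts; the point is that the step's source is a fully
    # qualified hdfs:// URI on the external cluster, so the EMR nodes do the copy.
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[
            {
                "Name": "copy-from-external-hdfs",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "s3-dist-cp",
                        "--src", "hdfs://namenode.example.com:8020/data/bigfile",
                        "--dest", "s3://my-bucket/data/bigfile",
                    ],
                },
            }
        ],
    )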

Anyone using DynamoDB and Hive without using EMR?

I was reading about the integration below, which uses Hive for querying data in DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
But as per that link, Hive needs to be set up on top of EMR. I wanted to know if I can use this integration with the standalone Hadoop cluster I already have instead of using EMR. Has anyone done this? Will there be sync issues between the data in DynamoDB and HDFS compared to using EMR?
To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB (it probably involves a custom SerDe as well).
It does not seem to be available at the moment, at least not on the AWS website.
What you can do is use the JDBC interface provided by Amazon to issue the queries from your cluster, but they would still be executed on top of EMR.
