Run Pig with Lipstick on AWS EMR - hadoop

I'm running an AWS EMR Pig job using script-runner.jar as described here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
Now I want to hook up Netflix's Lipstick to monitor my scripts. I set up the server, and followed the wiki here: https://github.com/Netflix/Lipstick/wiki/Getting-Started but I can't quite figure out how to run the last step:
hadoop jar lipstick-console-[version].jar -Dlipstick.server.url=http://$LIPSTICK_URL
Should I substitute script-runner.jar with this?
Also, after following the build process in the wiki I ended up with 3 different console jars:
lipstick-console-0.6-SNAPSHOT.jar
lipstick-console-0.6-SNAPSHOT-withHadoop.jar
lipstick-console-0.6-SNAPSHOT-withPig.jar
What is the purpose of the latter two jars?
UPDATE:
I think I'm making progress, but it still does not seem to work.
I set the pig.notification.listener parameter as described here, along with the Lipstick server URL. There is more than one way to do this in EMR; since I am using the Ruby API, I had to specify a step:
hadoop_jar_step:
  jar: 's3://elasticmapreduce/libs/script-runner/script-runner.jar'
  properties:
    - pig.notification.listener.arg: com.netflix.lipstick.listeners.LipstickPPNL
    - lipstick.server.url: http://pig_server_url
Next, I added lipstick-console-0.6-SNAPSHOT.jar to the Hadoop classpath. For this, I had to create a bootstrap action as follows:
bootstrap_actions:
  - name: copy_lipstick_jar
    script_bootstrap_action:
      path: #s3 path to bootstrap_lipstick.sh
where the contents of bootstrap_lipstick.sh are:
#!/bin/bash
hadoop fs -copyToLocal s3n://wp-data-west-2/load_code/java/lipstick-console-0.6-SNAPSHOT.jar /home/hadoop/lib/
The bootstrap action copies the Lipstick jar to the cluster nodes, and /home/hadoop/lib/ is already on the Hadoop classpath (EMR takes care of that).
It still does not work, but I think I am missing something really minor ... Any ideas appreciated.
Thanks!

Currently Lipstick's Main class is a drop-in replacement for Pig's Main class. This is a hack (and far from ideal) to get access to the logical and physical plans for your script, before and after optimization, which are simply not accessible otherwise. As such, it unfortunately won't work to just register the LipstickPPNL class as a PPNL for Pig. You've got to run Lipstick's Main as though it were Pig.
I have not tried to run Lipstick on EMR, but it looks like you're going to need to use a custom JAR step rather than a script step. See the docs here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-launch-custom-jar-cli.html
The jar would be lipstick-console-0.6-SNAPSHOT-withHadoop.jar; it contains all the necessary dependencies to run Lipstick. Additionally, the lipstick.server.url property will need to be set.
Alternatively, you might take a look at https://www.mortardata.com/ which runs on EMR and has Lipstick integration built in.

Related

How to use AvroParquetReader inside a Flink application?

I am having trouble using AvroParquetReader inside a Flink Application. (flink>=1.15)
Motivation (AKA why I want to use it)
According to the official docs, one can read Parquet files in Flink into a FileSource. However, I only want to write a function that loads a Parquet file into Avro records without creating a DataStreamSource. In particular, I want to load Parquet files into FileInputFormat, which is a completely separate API (for some weird reason). (And, digging one level deeper, I could not easily see how one could cast BulkFormat or StreamFormat into it.)
Therefore, it would be much simpler to use org.apache.parquet.avro.AvroParquetReader to read the files directly.
Error description
However, I found this error after running the Flink application locally: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
This is quite unexpected, since the flink-s3-fs-hadoop jar has already been loaded by the plugin system (and its path has been added to HADOOP_CLASSPATH as well). So not only does Flink know where it is, the local Hadoop should as well.
Comments:
Without the AvroParquetReader part, the Flink app can write to S3 without problems.
Hadoop is not the Flink-shaded one, but installed separately, version 2.10.
Would love to hear if you have some insights about this.
AvroParquetReader should be able to read the parquet files without problem.
There is an official Hadoop guide that has some potential fixes for the issue and can be found here. If I recall correctly, this issue was caused by some missing Hadoop AWS dependencies.
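For what it's worth, below is a minimal sketch (plain Java, not Flink-specific) of what the reading side can look like once hadoop-aws and its matching aws-java-sdk jars are on the application classpath, which is what the ClassNotFoundException points at. The bucket and path are hypothetical, and AWS credentials are assumed to be configured elsewhere (core-site.xml, environment variables, etc.):
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ReadParquetFromS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Usually unnecessary (the default configuration already maps s3a), but it
        // makes the dependency problem explicit: this class must be on the classpath.
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        Path path = new Path("s3a://my-bucket/data/file.parquet"); // hypothetical location
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
                                  .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record); // replace with your own record handling
            }
        }
    }
}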

Running a hadoop job using java command

I have a simple Java program that sets up an MR job. I can successfully execute it on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar'. But I want to achieve the same thing using the java command, as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
Just as you run your 'hadoop jar' command with the other parameters, you can run it using java.
First check that this command evaluates to the Hadoop classpath:
$ hadoop classpath
Then run your class with your custom jar added to the classpath:
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar className
I am able to get mine working this way on my Hadoop cluster.
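To make the configuration question concrete, here is a minimal hypothetical driver (the class name, job name, and paths are made up) that picks up the cluster configuration from the classpath supplied by `hadoop classpath` and also honours -D options, because it goes through ToolRunner:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains core-site.xml, hdfs-site.xml, mapred-site.xml, etc.,
        // because `hadoop classpath` puts the conf directory on the classpath,
        // plus any -Dkey=value options parsed by ToolRunner below.
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        // mapper/reducer/output-type setup omitted in this sketch
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}
Launched as java -cp `hadoop classpath`:myjob.jar MyDriver /in /out, this should behave the same way as running it through hadoop jar.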
I don't think you can find documentation on this. The hadoop command is a script, and a lot of classes are used in it, e.g. FsShell for accessing the filesystem, RunJar for running a jar, etc. Adding the Hadoop libraries and configuration files to the classpath is handled by the hadoop command itself.
You'd better take a look at the hadoop script.
How can you do that? Executing any jar on Hadoop means it has to execute in a distributed environment where all the daemons work together to complete the execution.
We are not running locally or on a local file system, so it needs to be executed according to the norms of HDFS; I don't think we can execute it the way we do on a local file system.
Hadoop is a framework which simplifies distributed computing. Before Hadoop, programmers already knew about parallel processing and multi-threading concepts. But when you deal with multiple machines you need to know how to:
Communicate between machines
Handle network processing
Handle fault tolerance: what if one machine fails?
and many more. That is a huge amount of work, and that's where Hadoop simplifies your job. It takes care of all the operating-level stuff so you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer. Just passing parameters to your program won't work; you would need to write a lot of library code to deal with distributed computing. If you want to explore it, I would suggest you go ahead and read the Hadoop source code.
http://hadoop.apache.org/version_control.html

Spark workers unable to find JAR on EC2 cluster

I'm using spark-ec2 to run some Spark code. When I set master to
"local", then it runs fine. However, when I set master to $MASTER,
the workers immediately fail, with java.lang.NoClassDefFoundError for
the classes. The workers connect to the master, show up in the UI, and try to run the task, but immediately raise that exception as soon as the task loads its first dependency class (which is in the assembly jar).
I've used sbt-assembly to make a jar with the classes, confirmed using
jar tvf that the classes are there, and set SparkConf to distribute
the classes. The Spark Web UI indeed shows the assembly jar to be
added to the classpath:
http://172.x.x.x:47441/jars/myjar-assembly-1.0.jar
It seems that, despite the fact that myjar-assembly contains the
class, and is being added to the cluster, it's not reaching the
workers. How do I fix this? (Do I need to manually copy the jar file?
If so, to which dir? I thought that the point of the SparkConf add
jars was to do this automatically)
My attempts at debugging have shown:
The assembly jar is being copied to /root/spark/work/app-xxxxxx/1/
(Determined by ssh to worker and searching for jar)
However, that path doesn't appear on the worker's classpath
(Determined from logs, which show java -cp but lack that file)
So, it seems like I need to tell Spark to add the path to the assembly
jar to the worker's classpath. How do I do that? Or is there another culprit? (I've spent hours trying to debug this but to no avail!)
NOTE: EC2 specific answer, not a general Spark answer. Just trying to round out an answer to a question asked a year ago, one that has the same symptom but often different causes and trips up a lot of people.
If I am understanding the question correctly, you are asking, "Do I need to manually copy the jar file? If so, to which dir?" You say "and set SparkConf to distribute the classes", but it is not clear whether this is done via spark-env.sh or spark-defaults.conf. So I'm making some assumptions, the main one being that you are running in cluster mode, meaning your driver runs on one of the workers and you don't know which one in advance... then...
The answer is yes, to the dir named in the classpath. In EC2 the only persistent data storage is /root/persistent-hdfs, but I don't know if that's a good idea.
In the Spark docs on EC2 I see this line:
To deploy code or data within your cluster, you can log in and use
the provided script ~/spark-ec2/copy-dir, which, given a directory
path, RSYNCs it to the same location on all the slaves.
SPARK_CLASSPATH
I wouldn't use SPARK_CLASSPATH because it's deprecated as of Spark 1.0 so a good idea is to use its replacement in $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath /path/to/jar/on/worker
This should be the option that works. If you need to do this on the fly, not in a conf file, the recommendation is "./spark-submit with --driver-class-path to augment the driver classpath" (from the Spark docs about spark.executor.extraClassPath; see the end of this answer for another source on that).
BUT... you are not using spark-submit... I don't know how that works in EC2; looking at the script, I couldn't figure out where EC2 lets you supply these parameters on a command line. You mention you already do this in setting up your SparkConf object, so stick with that if it works for you; a rough sketch of that approach follows below.
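As an illustration only (written in Java; the paths and app name are hypothetical), this is roughly what setting spark.executor.extraClassPath from the SparkConf object could look like, assuming the assembly jar has already been copied to the same location on every worker (for example with ~/spark-ec2/copy-dir):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitWithExtraClasspath {
    public static void main(String[] args) {
        // hypothetical path; must exist at the same location on every worker
        String assemblyJar = "/root/jars/myjar-assembly-1.0.jar";
        SparkConf conf = new SparkConf()
            .setAppName("my-app")
            .setMaster(args[0]) // e.g. the EC2 master URL, spark://<master-host>:7077
            // distribute the jar to executors ...
            .setJars(new String[] { assemblyJar })
            // ... and also put the on-worker copy on the executor classpath
            .set("spark.executor.extraClassPath", assemblyJar);
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}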
I see in Spark-years this is a very old question so I wonder how you resolved it? I hope this helps someone, I learned a lot researching the specifics of EC2.
I must admit, as a limitation on this, it confuses me that the Spark docs say, for spark.executor.extraClassPath:
Users typically should not need to set this option
I assume they mean most people will get the classpath out through a driver config option. I know most of the docs for spark-submit make it sound like the script handles moving your code around the cluster, but I think that's only in "standalone client mode", which I assume you are not using; I assume EC2 must be in "standalone cluster mode".
MORE / BACKGROUND ON SPARK_CLASSPATH deprecation:
More background that leads me to think SPARK_CLASSPATH is deprecated: this archived thread, and this one (crossing the other thread), and this one about a WARN message when using SPARK_CLASSPATH:
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:
/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
You are required to register a jar with the Spark cluster when submitting your app; to make that possible you can edit your code as follows.
jars(0) = "/usr/local/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar"
val conf: SparkConf = new SparkConf()
.setAppName("Busigence App")
.setMaster(sparkMasterUrl)
.setSparkHome(sparkHome)
.setJars(jars);

how to get multipleOutput in hadoop

I'm new to Hadoop and now have to process an input file. I want to process each line, and the output should be one file per line.
I surfed the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place, and I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs as well as interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will be run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (neither in 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some notes on using MultipleOutputs, and there is a sketch below).
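Here is a rough sketch (class and output-name choices are mine, not from the question) of using MultipleOutputs from the new mapreduce API in a map-only job, writing each input line to its own output file named after the line's byte offset. The driver would set zero reducers and the usual output format (LazyOutputFormat helps avoid empty part files):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LinePerFileMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The byte offset of the line becomes part of the file name, so each
        // input line ends up in its own file, e.g. line-1234-m-00000.
        mos.write(NullWritable.get(), value, "line-" + key.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
Keep in mind that one file per input line means a very large number of small files for any non-trivial input, which HDFS handles poorly, so this pattern is usually only reasonable for small inputs.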

Access hdfs from outside hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node? or is there a way to access HDFS outside of hadoop?
Any other suggestions on how to do this are fine. Unfortunately my executables can not be run within hadoop though.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open, which will give you a stream that acts like a generic open file (see the sketch after this list).
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically create a bridge to this command line with something like popen.
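For the Java API route mentioned in the first bullet, a minimal sketch could look like the following (the NameNode URI and file path are placeholders; alternatively, put core-site.xml/hdfs-site.xml on the classpath instead of hard-coding the URI):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the NameNode explicitly (placeholder host/port).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FSDataInputStream in = fs.open(new Path("/path/to/file/part-r-00000"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // hand each line to the external executable instead
        }
        reader.close();
        fs.close();
    }
}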
Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 23.1 release as well. Since it's based on a REST API, any language can access it, and Hadoop need not be installed on the node that needs the HDFS files. Also, it's about as fast as the other options mentioned by orangeoctopus.
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem; it can act as an HDFS proxy.
I had a similar issue and asked an appropriate question. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution I posted an answer here for HDFS. The most painful issue there turned out to be user authentication, which in my case was solved in the simplest way (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, there is a nice Cloudera article on how to configure Maven to build a JAR for this. 100% success in my case.
The main difference between posting a remote MapReduce job and HDFS access is only one configuration setting (check the mapred.job.tracker variable).
