EC2 Job Flow Failure - amazon-ec2

I have a MapReduce jar file that I'd like to run against data in S3. It takes two args: an input dir and an output file.
So I tried the following command using the elastic-mapreduce ruby cmd line tool:
elastic-mapreduce -j j-JOBFLOW --jar s3n://this.bucket.com/jars/this.jar --arg s3n://this.bucket.com/data/ --arg s3n://this.bucket.com/output/this.csv
This failed with error
Exception in thread "main" java.lang.ClassNotFoundException: s3n://this/bucket/com/data/
So I tried it using --input and --output with the respective args. That failed too, with an error about the --input class not being found (it seems like it couldn't decipher --input itself, not the argument after it).
This seems like such a basic thing, yet I'm having trouble getting it to work. Any help is much appreciated. Thanks.

Try:
elastic-mapreduce --create --jar s3n://this.bucket.com/jars/this.jar --args "s3n://this.bucket.com/data/,s3n://this.bucket.com/output/this.csv"
Double-check that your jar and input data are there:
s3cmd ls s3://this.bucket.com/data/
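If you want to keep submitting to the existing job flow rather than creating a new one, here is a hedged sketch reusing the -j/--jar/--main-class/--arg flags that appear elsewhere on this page; the bucket paths are the placeholders from the question and the main class name is hypothetical:

# verify the jar and the input prefix exist first (placeholder paths from the question)
s3cmd ls s3://this.bucket.com/jars/this.jar
s3cmd ls s3://this.bucket.com/data/

# add the jar as a step to the existing job flow; com.example.MyJob is a
# hypothetical main class name, use whatever your jar's manifest actually specifies
./elastic-mapreduce -j j-JOBFLOW \
    --jar s3n://this.bucket.com/jars/this.jar \
    --main-class com.example.MyJob \
    --arg s3n://this.bucket.com/data/ \
    --arg s3n://this.bucket.com/output/this.csv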

Related

How to load results of bash command into gcs using Airflow?

I am trying to save JSON that is the result of a bash command to a GCS bucket. Executing the bash command in my local terminal, everything works properly and the data loads into GCS. Unfortunately, the same bash command doesn't work via Airflow. Airflow marks the task as successfully done, but in GCS I see an empty file. I suspect this happens because Airflow runs out of memory, but I am not sure. If so, can someone explain how and where the results are stored in Airflow? I see in the BashOperator documentation that Airflow creates a temporary directory which is cleaned after execution. Does that mean the results of the bash command are also cleaned afterwards? Is there any way to save the results to GCS?
This is my dag:
get_data = BashOperator(
    task_id='get_data',
    bash_command="curl -X GET -H 'XXX: xxx' some_url | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json; rm manifest.txt",
    dag=dag,
)
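One possible explanation (an assumption, not confirmed by the question): bash only reports the exit status of the last command in the string, so a failing curl piped into gsutil can still leave the task marked as successful while the uploaded object is empty. A minimal sketch of just the shell part that makes such failures visible, keeping the URL, header, and bucket path as the placeholders from the question:

set -euo pipefail   # fail the task if curl or gsutil fails, not just the final rm
curl -fsS -X GET -H 'XXX: xxx' some_url \
    | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json
rm -f manifest.txt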

Google Dataproc initialization script error File not Found

I'm using Google Dataproc to initialize a Jupyter cluster.
At first I used the "dataproc-initialization-actions" available on GitHub, and it works like a charm.
This is the create cluster Call available in the documentation:
gcloud dataproc clusters create my-dataproc-cluster \
--metadata "JUPYTER_PORT=8124" \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--bucket my-dataproc-bucket \
--num-workers 2 \
--properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
--worker-machine-type=n1-standard-4 \
--master-machine-type=n1-standard-4
But I want to customize it, so I downloaded the initialization file and saved it to my own Google Cloud Storage bucket (under the same project where I'm trying to create the cluster). Then I changed the call to point to my script instead, like this:
gcloud dataproc clusters create my-dataproc-cluster \
--metadata "JUPYTER_PORT=8124" \
--initialization-actions \
gs://myjupyterbucketname/jupyter.sh \
--bucket my-dataproc-bucket \
--num-workers 2 \
--properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
--worker-machine-type=n1-standard-4 \
--master-machine-type=n1-standard-4
But running this I got the following error:
Waiting on operation [projects/myprojectname/regions/global/operations/cf20466c-ccb1-4c0c-aae6-fac0b99c9a35].
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/myprojectname/regions/global/operations/cf20466c-ccb1-4c0c-aae6-fac0b99c9a35] failed: Multiple Errors:
- Google Cloud Dataproc Agent reports failure. If logs are available, they can be found in 'gs://myjupyterbucketname/google-cloud-dataproc-metainfo/231e5160-75f3-487c-9cc3-06a5918b77f5/my-dataproc-cluster-m'.
- Google Cloud Dataproc Agent reports failure. If logs are available, they can be found in 'gs://myjupyterbucketname/google-cloud-dataproc-metainfo/231e5160-75f3-487c-9cc3-06a5918b77f5/my-dataproc-cluster-w-1'.
Well, the files were there, so I don't think it's an access permission problem. The file named "dataproc-initialization-script-0_output" has the following content:
/usr/bin/env: bash: No such file or directory
Any ideas?
Well, I found my answer here.
Turns out the script had Windows line endings instead of Unix line endings.
I made an online conversion using dos2unix and now it runs fine.
With help from @tix I could check that the file was reachable using an SSH connection to the cluster (a successful "gsutil cat gs://myjupyterbucketname/jupyter.sh"),
AND the initialization file was correctly saved locally in the directory "/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0".
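For anyone hitting the same "/usr/bin/env: bash: No such file or directory" symptom, a short sketch of the fix described above (convert the line endings locally and re-upload; the bucket path is the one from the question):

# convert CRLF (Windows) line endings to LF (Unix) in the init action script
dos2unix jupyter.sh

# re-upload the fixed script to the bucket referenced by --initialization-actions
gsutil cp jupyter.sh gs://myjupyterbucketname/jupyter.sh

# then re-run the gcloud dataproc clusters create command from above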

MissingArgumentException while configuring Flume

I installed Flume
and tried to run this command
flume-ng agent -n $agent_name -c conf -f /home/gautham/Downloads/apache-flume-1.5.0.1-bin/conf/flume-conf.properties.template
and I get this exception
ERROR node.Application: A fatal error occurred while running. Exception follows.
org.apache.commons.cli.MissingArgumentException: Missing argument for option: n
at org.apache.commons.cli.Parser.processArgs(Parser.java:343)
at org.apache.commons.cli.Parser.processOption(Parser.java:393)
at org.apache.commons.cli.Parser.parse(Parser.java:199)
at org.apache.commons.cli.Parser.parse(Parser.java:85)
at org.apache.flume.node.Application.main(Application.java:252)
Check whether you named your flume file with a .conf extension.
And try to use the below command:
$ flume-ng agent \
    --conf-file PathOfYourFlumeFile \
    --name agentNameInFlumeFile \
    --conf $FLUME_HOME/conf
Replace $agent_name with the name of the agent you used in your flume file.
You have to give the path of a flume file with a .conf extension instead of /home/gautham/Downloads/apache-flume-1.5.0.1-bin/conf/flume-conf.properties.template.
Instead of $agent_name, use the actual name of the agent in your conf file.
I suspect that you do not have an $agent_name environment variable, so it's being replaced with an empty string.
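In other words, the value passed with -n/--name has to match the agent prefix used inside the .conf file. A small sketch, assuming a hypothetical agent named a1 and a placeholder path for the configuration file:

# list the agent prefixes actually defined in the file (lines look like a1.sources = ...)
grep -o '^[A-Za-z0-9_]*\.' /path/to/flume.conf | sort -u

# start Flume with that exact name instead of the unset $agent_name variable
flume-ng agent -n a1 -c "$FLUME_HOME/conf" -f /path/to/flume.conf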
I had a similar issue. I later found out that after retyping all the hyphens (-), it started working. Probably when I copied this command, the hyphens had been replaced with similar-looking dash characters.

Error: Could not find or load main class org.apache.flume.node.Application - Install flume on hadoop version 1.2.1

I have built a Hadoop cluster with one master node and one slave node. Now I want to set up Flume on the master machine to collect all of the cluster's logs. However, when I try to install Flume from the tarball, I always get:
Error: Could not find or load main class org.apache.flume.node.Application
So please help me find the answer, or the best way to install Flume on my cluster.
Many thanks!
It is basically because of FLUME_HOME.
Try this command
$ unset FLUME_HOME
I know it's been almost a year since this question, but I saw it!
When you start your agent using sudo bin/flume-ng..., make sure to specify the file where the agent configuration is:
--conf-file flume_Agent.conf -> -f conf/flume_Agent.conf
This did the trick!
It looks like you ran flume-ng from the /bin folder.
After the build, Flume is in
/flume-ng-dist/target/apache-flume-1.5.0.1-bin/apache-flume-1.5.0.1-bin
Run flume-ng from there.
I suppose you are trying to run Flume from Cygwin on Windows? If that is the case, I had a similar issue. The problem might be with the flume-ng script.
Find the following line in bin/flume-ng:
$EXEC java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "$FLUME_CLASSPATH" \
-Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*
and replace it with this
$EXEC java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp `cygpath -wp "$FLUME_CLASSPATH"` \
-Djava.library.path=`cygpath -wp $FLUME_JAVA_LIBRARY_PATH` "$FLUME_APPLICATION_CLASS" $*
Notice that the paths have been replaced with Windows directories. Java would not be able to find the library paths from the cygdrive paths, so we have to convert them to the correct Windows paths wherever applicable.
Maybe you are using the source files. You should first compile the source code and generate the binary distribution; then, inside the binary directory, you can execute: bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1. You can follow all of this at: https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
I got the same issue before; it was simply because FLUME_CLASSPATH was not set.
The best way to debug is to look at the java command being fired and make sure that the flume lib directory is included in the CLASSPATH (-cp).
In the following command it's looking for /lib/*, which is where the flume-ng-*.jar files should be, but that's incorrect because there's nothing in /lib; see the classpath in this line: -cp '/staging001/Flume/server/conf://lib/*:/lib/*'. It has to be ${FLUME_HOME}/lib.
/usr/lib/jvm/java-1.8.0-ibm-1.8.0.3.20-1jpp.1.el7_2.x86_64/jre/bin/java -Xms100m -Xmx500m $'-Dcom.sun.management.jmxremote\r' \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34545 \
-cp '/staging001/Flume/server/conf://lib/*:/lib/*' \
-Djava.library.path= org.apache.flume.node.Application \
-f /staging001/Flume/server/conf/flume.conf -n client
So, if you look at the flume-ng script,
there's a FLUME_CLASSPATH setup; if it is absent, it is set up based on FLUME_HOME.
# prepend $FLUME_HOME/lib jars to the specified classpath (if any)
if [ -n "${FLUME_CLASSPATH}" ] ; then
FLUME_CLASSPATH="${FLUME_HOME}/lib/*:$FLUME_CLASSPATH"
else
FLUME_CLASSPATH="${FLUME_HOME}/lib/*"
fi
So make sure either of those environment variables is set. With FLUME_HOME set (I'm using systemd):
Environment=FLUME_HOME=/staging001/Flume/server/
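If you are not running Flume under systemd, a quick sketch of the equivalent in a plain shell (same path as above), so that the classpath block quoted from flume-ng picks up ${FLUME_HOME}/lib/*:

# export FLUME_HOME before invoking flume-ng so the script builds
# FLUME_CLASSPATH as ${FLUME_HOME}/lib/*
export FLUME_HOME=/staging001/Flume/server
"$FLUME_HOME"/bin/flume-ng agent -n client -c "$FLUME_HOME"/conf -f "$FLUME_HOME"/conf/flume.conf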
Here's the working java exec.
/usr/lib/jvm/java-1.8.0-ibm-1.8.0.3.20-1jpp.1.el7_2.x86_64/jre/bin/java -Xms100m -Xmx500m \
$'-Dcom.sun.management.jmxremote\r' \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34545 \
-cp '/staging001/Flume/server/conf:/staging001/Flume/server/lib/*:/lib/*' \
-Djava.library.path= org.apache.flume.node.Application \
-f /staging001/Flume/server/conf/flume.conf -n client

How to access file in s3n when running a customized jar in Amazon Elastic MapReduce

I am running the below step in my cluster in EMR:
./elastic-mapreduce -j CLUSTERID -jar s3n://mybucket/somejar
--main-class SomeClass
--arg -conf --arg 's3n://mybucket/configuration.xml'
SomeClass is a Hadoop job and implements the Runnable interface. It reads configuration.xml for its parameters, but with the above command SomeClass cannot access "s3n://mybucket/configuration.xml" (no error is reported). I tried "s3://mybucket/configuration.xml" and it does not work either. I am sure the file exists, since I can see it with "hadoop fs -ls s3n://mybucket/configuration.xml". Any suggestion for the problem?
Thanks,
Here are the options to try:
1. Use s3 instead of s3n.
2. Check the access permissions for the S3 bucket. You can specify the log location and check the logs after the job fails. You can create the job flow like below:
elastic-mapreduce --create --name "j_flow_name" --log-uri "s3://your_s3_bucket"
It gives you more debug information.
3. ./elastic-mapreduce -j JobFlowId -jar s3://your_bucket --arg "s3://your_conf_file_bucket_name" --arg "second parameter"
For more detailed information, see the EMR CLI documentation.
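Putting options 1 and 3 together, a sketch of the resubmitted step using the s3:// scheme, with the jar, class, and config paths taken from the question:

./elastic-mapreduce -j CLUSTERID -jar s3://mybucket/somejar \
    --main-class SomeClass \
    --arg -conf --arg 's3://mybucket/configuration.xml'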
