Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc - hadoop

I am developing an application that tries to read log files stored in an S3 bucket and parse them using Elastic MapReduce. Currently the log file has the following format:
-------------------------------
COLOR=Black
Date=1349719200
PID=23898
Program=Java
EOE
-------------------------------
COLOR=White
Date=1349719234
PID=23828
Program=Python
EOE
So I try to load the file in my Pig script, but the built-in Pig loaders don't seem to be able to load my data, so I have to create my own UDF. Since I am pretty new to Pig and Hadoop, I want to try a script written by others before I write my own, just to get a taste of how a UDF works. I found one here, http://pig.apache.org/docs/r0.10.0/udf.html, the SimpleTextLoader. In order to compile this SimpleTextLoader, I have to add a few imports:
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.PigException;
import org.apache.pig.LoadFunc;
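For reference, this is roughly what the rest of that loader looks like, condensed from the example on that page (the real constructor also handles escaped delimiters such as \t; combined with the imports above it compiles the same way):
public class SimpleTextLoader extends LoadFunc {
    protected RecordReader in = null;
    private byte fieldDel = '\t';
    private ArrayList<Object> mProtoTuple = null;
    private TupleFactory mTupleFactory = TupleFactory.getInstance();

    // Pig passes the '=' from USING myudfs.SimpleTextLoader('=') into this constructor
    public SimpleTextLoader(String delimiter) {
        if (delimiter.length() == 1) {
            this.fieldDel = (byte) delimiter.charAt(0);
        }
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!in.nextKeyValue()) {
                return null;                 // no more input lines
            }
            Text value = (Text) in.getCurrentValue();
            byte[] buf = value.getBytes();
            int len = value.getLength();
            int start = 0;
            // split the line on the delimiter and collect the fields
            for (int i = 0; i < len; i++) {
                if (buf[i] == fieldDel) {
                    readField(buf, start, i);
                    start = i + 1;
                }
            }
            readField(buf, start, len);      // last field on the line
            Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
            mProtoTuple = null;
            return t;
        } catch (InterruptedException e) {
            throw new ExecException("Error while reading input", 6018,
                    PigException.REMOTE_ENVIRONMENT, e);
        }
    }

    private void readField(byte[] buf, int start, int end) {
        if (mProtoTuple == null) {
            mProtoTuple = new ArrayList<Object>();
        }
        if (start == end) {
            mProtoTuple.add(null);           // empty field -> null
        } else {
            mProtoTuple.add(new DataByteArray(buf, start, end));
        }
    }

    @Override
    public InputFormat getInputFormat() {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        in = reader;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }
}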
Then I found out I need pig.jar to compile this file, so I install Subversion, check out the Pig source, and build it:
sudo apt-get install subversion
svn co http://svn.apache.org/repos/asf/pig/trunk
ant
Now I have a pig.jar file, and I try to compile the loader against it:
javac -cp ./trunk/pig.jar SimpleTextLoader.java
jar -cf SimpleTextLoader.jar SimpleTextLoader.class
It compiles successfully. I then start Pig, and in the grunt shell I try to load the file:
grunt> register file:/home/hadoop/myudfs.jar
grunt> raw = LOAD 's3://mys3bucket/samplelogs/applog.log' USING myudfs.SimpleTextLoader('=') AS (key:chararray, value:chararray);
2012-12-05 00:08:26,737 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc Details at logfile: /home/hadoop/pig_1354666051892.log
Inside pig_1354666051892.log, it has:
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc
java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc
I also tried another UDF (UPPER.java) from http://wiki.apache.org/pig/UDFManual, and I still get the same error when I try to use the UPPER method. Can you please help me out? What's the problem here? Much thanks!
UPDATE: I did try the EMR built-in pig.jar at /home/hadoop/lib/pig/pig.jar, and got the same problem.

Put the UDF jar in the /home/hadoop/lib/pig directory or copy the pig-*-amzn.jar file to /home/hadoop/lib and it will work.
You would probably use a bootstrap action to do either of these.
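For example, a bootstrap action script along these lines (the S3 path for the jar is hypothetical) could copy the UDF jar into that directory on every node before Pig runs:
#!/bin/bash
# hypothetical bootstrap action: copy the UDF jar from S3 into Pig's lib directory
# (assumes the jar was uploaded to s3://mys3bucket/jars/myudfs.jar beforehand)
hadoop fs -copyToLocal s3://mys3bucket/jars/myudfs.jar /home/hadoop/lib/pig/myudfs.jar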

Most of the Hadoop ecosystem tools, like Pig and Hive, pick up environment variables from $HADOOP_HOME/conf/hadoop-env.sh.
I was able to resolve this issue by adding pig-0.13.0-h1.jar (it contains all the classes required by the UDF) to the HADOOP_CLASSPATH:
export HADOOP_CLASSPATH=/home/hadoop/pig-0.13.0/pig-0.13.0-h1.jar:$HADOOP_CLASSPATH
pig-0.13.0-h1.jar is available in the Pig home directory.
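To make the setting persist for every session, the same line can be appended to hadoop-env.sh:
# in $HADOOP_HOME/conf/hadoop-env.sh
export HADOOP_CLASSPATH=/home/hadoop/pig-0.13.0/pig-0.13.0-h1.jar:$HADOOP_CLASSPATH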

Related

error: cannot import non-existent remote object, image import failing

I'm trying to import an existing Linux image. I used the following command
terraform import azurerm_marketplace_agreement.publisher /subscriptions/YOUR-AZURE-SUBSCRIPTION-ID/providers/Microsoft.MarketplaceOrdering/agreements/publisher/offers/offer/plans/plan
But when I run this in the pipeline, I get an error on every alternate run. The error is
Error: cannot import non-existent remote object
Do I need to do anything special in my script before I run this command?

ClassNotFoundException while running MissingPokerCards on ec2 Instance

I'm getting the following error when I try to run the jar file:
Exception in thread "main" java.lang.ClassNotFoundException: finalPoker.MissingPokerCards
at java.net.URLClassLoader$1.run(URLClassLoader.java:360)
at java.net.URLClassLoader$1.run(URLClassLoader.java:349)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:348)
at java.lang.ClassLoader.loadClass(ClassLoader.java:430)
at java.lang.ClassLoader.loadClass(ClassLoader.java:363)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
The following code is my MissingPokerCards program, which counts the number of missing cards in a deck of 52 cards.
package MissingPokerCards;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PokerCardsProgramme {
    //Mapper function
    //Reduce funtion
    //Main function
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        Job job = new Job(config, "Search for list of missing Cards");
        job.setJarByClass(PokerCardsProgramme.class);
        job.setMapperClass(mapper.class);
        job.setReducerClass(reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Compiled code using - javac -classpath /home/ec2-user/hadoop_home/hadoop-1.2.1/hadoop-core-1.2.1.jar PokerCardsProgramme.java
Jar is created by using following command - jar cvf MissingPokerCards.jar PokerCardsProgramme*.class
Jar file is run using - hadoop jar MissingPokerCards.jar MissingPokerCards.PokerCardsProgramme \input\inputcards.txt output
My Hadoop version is 1.2.1 and java version is 1.7.0_241
I even tried using a different version, Hadoop 2.7.3:
hadoop jar /home/ec2-user/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.3.jar MissingPokerCards.PokerCardsProgramme inputcards.txt /output
Still facing the same issue. I think I am missing a jar file related to the PokerCards classes.
Can anybody please help me with this problem? Am I using the correct commands to compile and run the program, or is there another way to execute the MissingPokerCards program on an EC2 instance?
I am able to run the same code in Eclipse, but when I try to execute it on EC2 it shows this issue.
The error has nothing to do with Hadoop or EC2. This is just a regular Java error. If you really want to run Hadoop code in AWS, use EMR, not EC2 instances.
Your package is MissingPokerCards. The error says it's finalPoker.
Your class is PokerCardsProgramme. Your error says it's MissingPokerCards.
FWIW, not many people actually write MapReduce nowadays, but you should definitely be using Hadoop 2 or 3 with Java 8, not 1.2.1 with Java 7.
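For example (a rough sketch, keeping the class and package names as posted; input/output paths are illustrative), the package directory has to end up inside the jar and the fully-qualified name has to match when you run it:
javac -classpath hadoop-core-1.2.1.jar -d . PokerCardsProgramme.java   # -d . writes MissingPokerCards/PokerCardsProgramme.class
jar cvf MissingPokerCards.jar MissingPokerCards/*.class
hadoop jar MissingPokerCards.jar MissingPokerCards.PokerCardsProgramme /input/inputcards.txt /output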

Input path doesn't exist in pyspark for hadoop path

I am trying to fetch a file from HDFS in PySpark using Visual Studio Code.
I have checked with jps that all the nodes are active.
My file path in Hadoop is
hadoop fs -cat emp/part-m-00000
1,A,ABC
2,B,ABC
3,C,ABC
and core-site.xml is
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
I am fetching the above-mentioned file through Visual Studio Code in PySpark, but I am getting an error like
py4j.protocol.Py4JJavaError: An error occurred while calling o31.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/emp/part-m-00000
Please help me.
I have tried giving the Hadoop path:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sc= SparkContext('local','example')
hc = HiveContext(sc)
tf1 = sc.textFile("hdfs://localhost:9000/emp/part-m-00000")
print(tf1.first())
I need to get the file from Hadoop.

PYspark SparkContext Error "error occurred while calling None.org.apache.spark.api.java.JavaSparkContext."

I know this question has been posted before, and I tried implementing the solutions, but none worked for me. I installed Spark for Jupyter Notebook using this tutorial:
https://medium.com/#GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b#.be80dcqat
I installed the latest version of Apache Spark on the Mac.
When I try to run the following code in Jupyter
wordcounts = sc.textFile('words.txt')
I get the following error:
name 'sc' is not defined
When I try adding the code:
from pyspark import SparkContext, SparkConf
sc =SparkContext()
I get the following error:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.util.StringUtils
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
I added the path in bash:
export SPARK_PATH=~/spark-2.2.1-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, You have to add the line below or you will get an error
# export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
Please help me resolve this.

RethinkDB import error

I'm trying to import a CSV or JSON file into RethinkDB, but I always get the same error:
rethinkdb import -f ~/Downloads/convertcsv.json --table test.stats --format json
[ ] 0%
0 rows imported in 1 table
'indexes'
In file: /home/xxxxx/Downloads/convertcsv.json
Errors occurred during import
I don't see anything in the logs, and the same files import fine on my laptop.
Import creates the table but that's about it.
My system:
- Ubuntu 10.10
- Python 2.7.8
- rethinkdb 1.16.0+1~0utopic (GCC 4.9.1)
I already tried re-installing RethinkDB and running sudo pip2 install --upgrade rethinkdb. Not sure what else I can do.
This appears to have been an oversight when adding export/import of secondary indexes: the import script looks for the indexes field in the info, which doesn't exist when importing a single file. It can be worked around by passing the flag --no-secondary-indexes. A fix was released in the RethinkDB Python driver version 1.16.0-2; see GitHub issue #3278 for details.
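With that flag, the command from the question becomes:
rethinkdb import -f ~/Downloads/convertcsv.json --table test.stats --format json --no-secondary-indexes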
