Input path doesn't exist in pyspark for hadoop path - visual-studio

Am trying to get the fetch the file from hdfs in pyspark using visual studio code...
i have checked through jps all the nodes are in active status only.
my file path in hadoop is
hadoop fs -cat emp/part-m-00000
1,A,ABC
2,B,ABC
3,C,ABC
and core-site.xml is
fs.default.name
hdfs://localhost:9000
am fetching the above mentioned file through visual studio code in pyspark..
but am getting error like
py4j.protocol.Py4JJavaError: An error occurred while calling o31.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/emp/part-m-00000
please help me
i have tried giving the hadoop path
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sc= SparkContext('local','example')
hc = HiveContext(sc)
tf1 = sc.textFile("hdfs://localhost:9000/emp/part-m-00000")
print(tf1.first())
i need to get the file from hadoop

Related

Jobs spark failure

When I want to launch a spark job on R I get this error :
Erreur : java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82) ....
In spark logs (/opt/mapr/spark/spark-version/logs) I find a lot of theses exceptions :
ERROR FsHistoryProvider: Exception encountered when attempting to load application log maprfs:///apps/spark/.60135a9b-ec7c-4f71-8f92-4d4d2fbb1e2b
java.io.FileNotFoundException: File maprfs:///apps/spark/.60135a9b-ec7c-4f71-8f92-4d4d2fbb1e2b does not exist.
Any idea how could I solve this issue ?
You need to create sparkContext (or get if it exists)
import org.apache.spark.{SparkConf, SparkContext}
// 1. Create Spark configuration
val conf = new SparkConf()
.setAppName("SparkMe Application")
.setMaster("local[*]") // local mode
// 2. Create Spark context
val sc = new SparkContext(conf)
or
SparkContext.getOrCreate()

FileStream on cluster gives me an exception

I am writing a Spark STreaming application using file stream...
val probeFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP", filterF, false) //.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get exception error for file/IO..for
16/09/07 10:20:30 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: /mapr/cellos-mapr/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP
at com.mapr.fs.MapRFileSystem.listMapRStatus(MapRFileSystem.java:1486)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:1523)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:86)
While the directory exist in my cluster.
I am running my job using spark submit
spark-submit --class "StreamingEngineSt" target/scala-2.11/sprkhbase_2.11-1.0.2.jar
This could be related to file permissions or ownership(May be need hdfs user).

sparkSession/sparkContext can not get hadoop configuration

I am running spark 2, hive, hadoop at local machine, and I want to use spark sql to read data from hive table.
It works all fine when I have hadoop running at default hdfs://localhost:9000, but if I change to a different port in core-site.xml:
<name>fs.defaultFS</name>
<value>hdfs://localhost:9099</value>
Running a simple sql spark.sql("select * from archive.tcsv3 limit 100").show(); in spark-shell will give me the error:
ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
.....
From local/147.214.109.160 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;
.....
I get the AlreadyExistsException before, which doesn't seem to influence the result.
I can make it work by creating a new sparkContext:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
sc.stop()
var sc = new SparkContext()
val session = SparkSession.builder().master("local").appName("test").enableHiveSupport().getOrCreate()
session.sql("show tables").show()
My question is, why the initial sparkSession/sparkContext did not get the correct configuration? How can I fix it? Thanks!
If you are using SparkSession and you want to set configuration on the the spark context then use session.sparkContext
val session = SparkSession
.builder()
.appName("test")
.enableHiveSupport()
.getOrCreate()
import session.implicits._
session.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
You don't need to import SparkContext or created it before the SparkSession

Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found (Spark 1.6 Windows)

I am trying to access s3 files from local spark context using pySpark.
I keep getting File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
I had set os.environ['AWS_ACCESS_KEY_ID'] and
os.environ['AWS_SECRET_ACCESS_KEY'] before I called df = sqc.read.parquet(input_path). I also added these lines:
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
I have also tried changing s3 to s3n, s3a. Neither worked.
Any idea how to make it work?
I am on Windows 10, pySpark, Spark 1.6.1 built for Hadoop 2.6.0
I'm running pyspark appending the libraries from hadoop-aws.
You will need to use s3n in your input path. I'm running that from Mac-OS. so I'm not sure if it will work in Windows.
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
This package declaration works even in spark-shell
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1
and specify in the shell
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxxxxx")

Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

I am developing an application that try to read log file stored in S3 bucks and parse it using Elastic MapReduce. Current the log file has following format
-------------------------------
COLOR=Black
Date=1349719200
PID=23898
Program=Java
EOE
-------------------------------
COLOR=White
Date=1349719234
PID=23828
Program=Python
EOE
So I try to load the file into my Pig script, but the build-in Pig Loader doesn't seems be able to load my data, so I have to create my own UDF. Since I am pretty new to Pig and Hadoop, I want to try script that written by others before I write my own, just to get a teast of how UDF works. I found one from here http://pig.apache.org/docs/r0.10.0/udf.html, there is a SimpleTextLoader. In order to compile this SimpleTextLoader, I have to add a few imports, as
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.PigException;
import org.apache.pig.LoadFunc;
Then, I found out I need to compile this file. I have to download svn and pig running
sudo apt-get install subversion
svn co http://svn.apache.org/repos/asf/pig/trunk
ant
Now i have a pig.jar file, then I try to compile this file.
javac -cp ./trunk/pig.jar SimpleTextLoader.java
jar -cf SimpleTextLoader.jar SimpleTextLoader.class
It compiles successful, and i type in Pig entering grunt, in grunt i try to load the file, using
grunt> register file:/home/hadoop/myudfs.jar
grunt> raw = LOAD 's3://mys3bucket/samplelogs/applog.log' USING myudfs.SimpleTextLoader('=') AS (key:chararray, value:chararray);
2012-12-05 00:08:26,737 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc Details at logfile: /home/hadoop/pig_1354666051892.log
Inside the pig_1354666051892.log, it has
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc
java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc
I also try to use another UDF (UPPER.java) from http://wiki.apache.org/pig/UDFManual, and I am still get the same error by try to use UPPER method. Can you please help me out, what's the problem here? Much thanks!
UPDATE: I did try EMR build-in Pig.jar at /home/hadoop/lib/pig/pig.jar, and get the same problem.
Put the UDF jar in the /home/hadoop/lib/pig directory or copy the pig-*-amzn.jar file to /home/hadoop/lib and it will work.
You would probably use a bootstrap action to do either of these.
Most of the Hadoop ecosystem tools like pig and hive look up $HADOOP_HOME/conf/hadoop-env.sh for environment variables.
I was able to resolve this issue by adding pig-0.13.0-h1.jar (it contains all the classes required by the UDF) to the HADOOP_CLASSPATH:
export HADOOP_CLASSPATH=/home/hadoop/pig-0.13.0/pig-0.13.0-h1.jar:$HADOOP_CLASSPATH
pig-0.13.0-h1.jar is available in the Pig home directory.

Resources