Spark job failure - hadoop

When I try to launch a Spark job from R, I get this error:
Error: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82) ....
In the Spark logs (/opt/mapr/spark/spark-version/logs) I find a lot of these exceptions:
ERROR FsHistoryProvider: Exception encountered when attempting to load application log maprfs:///apps/spark/.60135a9b-ec7c-4f71-8f92-4d4d2fbb1e2b
java.io.FileNotFoundException: File maprfs:///apps/spark/.60135a9b-ec7c-4f71-8f92-4d4d2fbb1e2b does not exist.
Any idea how I could solve this issue?

You need to create a SparkContext (or get the existing one if it already exists):
import org.apache.spark.{SparkConf, SparkContext}
// 1. Create Spark configuration
val conf = new SparkConf()
  .setAppName("SparkMe Application")
  .setMaster("local[*]") // local mode
// 2. Create Spark context
val sc = new SparkContext(conf)
or
SparkContext.getOrCreate()
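If you want getOrCreate to reuse your own configuration, you can also pass the SparkConf to it. A minimal sketch (the app name and master are just examples):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("SparkMe Application")
  .setMaster("local[*]")
// Returns the active SparkContext if one already exists, otherwise creates one from this conf
val sc = SparkContext.getOrCreate(conf)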

Related

Input path doesn't exist in pyspark for hadoop path

I am trying to fetch a file from HDFS in PySpark using Visual Studio Code...
I have checked with jps that all the nodes are active.
My file path in Hadoop is:
hadoop fs -cat emp/part-m-00000
1,A,ABC
2,B,ABC
3,C,ABC
and core-site.xml is:
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
I am fetching the above-mentioned file through Visual Studio Code in PySpark, but I am getting an error like:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/emp/part-m-00000
Please help me. I have tried giving the Hadoop path:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext('local', 'example')
hc = HiveContext(sc)
tf1 = sc.textFile("hdfs://localhost:9000/emp/part-m-00000")
print(tf1.first())
I need to get the file from Hadoop.
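One thing worth checking (a sketch of a check, not a confirmed fix; the user name below is a placeholder): a relative path such as emp/part-m-00000 in hadoop fs commands resolves under /user/<current user>/, so the file may actually live at hdfs://localhost:9000/user/<user>/emp/part-m-00000 rather than hdfs://localhost:9000/emp/part-m-00000. In Scala (to match the other snippets on this page), both paths can be tested with the Hadoop FileSystem API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hconf = new Configuration()
hconf.set("fs.defaultFS", "hdfs://localhost:9000")
val fs = FileSystem.get(hconf)
// "youruser" is a placeholder for the user that ran the hadoop fs commands
println(fs.exists(new Path("/emp/part-m-00000")))               // path used in the PySpark code
println(fs.exists(new Path("/user/youruser/emp/part-m-00000"))) // where relative paths resolve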

Submitting Spark application via YARN client

I am using org.apache.spark.deploy.yarn.Client (Spark 2.1.0) to submit a Spark YARN application (the SparkPi example). Following are the pertinent lines:
List<String> arguments = Lists.newArrayList("--class", "org.apache.spark.examples.SparkPi","--jar", "path/to/spark examples jar", "--arg", "10");
SparkConf sparkConf = new SparkConf();
applicationTag = "TestApp-" + new Date().getTime();
sparkConf.set("spark.yarn.submit.waitAppCompletion", "false");
sparkConf.set("spark.yarn.tags", applicationTag);
sparkConf.set("spark.submit.deployMode", "cluster");
sparkConf.set("spark.yarn.jars", "/opt/spark/jars/*.jar");
System.setProperty("SPARK_YARN_MODE", "true");
System.setProperty("SPARK_HOME", "/opt/spark");
ClientArguments cArgs = new ClientArguments(arguments.toArray(new String[arguments.size()]));
Client client = new Client(cArgs, sparkConf);
client.run();
This seems to be working: the Spark application appears in the YARN RM UI and succeeds. However, the container logs show that the URL for the staging directory is picked up as
SPARK_YARN_STAGING_DIR -> file:/home/{current user}/.sparkStaging/application_xxxxxx. Going through org.apache.spark.deploy.yarn.Client, the likely reason is that the base path for the staging directory is not picked up correctly. The base path should be hdfs://localhost:9000/user/{current user}/ rather than file:/home/{current user}/, as confirmed by the following error appearing in the logs when the staging directory is cleaned up:
java.lang.IllegalArgumentException: Wrong FS: file:/home/user/.sparkStaging/application_1496908076154_0022, expected: hdfs://127.0.0.1:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:714)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
This all works fine when spark-submit is used as I believe it sets all the required environment variables correctly.
I have also tried setting sparkConf.set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/{current user}"), but to no avail, as it results in other errors, such as hdfs not being recognised as a valid file system.
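One workaround that may help (a sketch, not verified against this exact setup): the staging directory's file system is resolved from the Hadoop Configuration that the Client builds from the SparkConf, so the default file system can be supplied either by putting core-site.xml/hdfs-site.xml on the classpath (e.g. via HADOOP_CONF_DIR) or through the spark.hadoop.* pass-through properties:
// spark.hadoop.* entries are copied into the Hadoop Configuration used by the yarn Client,
// so the staging directory should then resolve against HDFS instead of the local file system.
sparkConf.set("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000");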

FileStream on cluster gives me an exception

I am writing a Spark Streaming application using fileStream...
val probeFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP", filterF, false) //.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get a file I/O exception:
16/09/07 10:20:30 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: /mapr/cellos-mapr/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP
at com.mapr.fs.MapRFileSystem.listMapRStatus(MapRFileSystem.java:1486)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:1523)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:86)
While the directory exists in my cluster.
I am running my job using spark-submit:
spark-submit --class "StreamingEngineSt" target/scala-2.11/sprkhbase_2.11-1.0.2.jar
This could be related to file permissions or ownership (you may need to run the job as the hdfs user).
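A quick way to verify that (a sketch; the path is the one from the question) is to list the directory through the Hadoop FileSystem API as the same user and with the same classpath the streaming job uses, and inspect owner and permissions:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// If this throws the same FileNotFoundException, the configured file system (MapR-FS here)
// does not expose the directory under this path for this user.
val fs = FileSystem.get(new Configuration())
fs.listStatus(new Path("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP"))
  .foreach(s => println(s"${s.getPath} owner=${s.getOwner} perms=${s.getPermission}"))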

sparkSession/sparkContext can not get hadoop configuration

I am running Spark 2, Hive, and Hadoop on my local machine, and I want to use Spark SQL to read data from a Hive table.
It all works fine when Hadoop is running at the default hdfs://localhost:9000, but if I change to a different port in core-site.xml:
<name>fs.defaultFS</name>
<value>hdfs://localhost:9099</value>
Running a simple query spark.sql("select * from archive.tcsv3 limit 100").show() in spark-shell gives me the error:
ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
.....
From local/147.214.109.160 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;
.....
I got the AlreadyExistsException before, and it doesn't seem to influence the result.
I can make it work by creating a new SparkContext:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
sc.stop()
var sc = new SparkContext()
val session = SparkSession.builder().master("local").appName("test").enableHiveSupport().getOrCreate()
session.sql("show tables").show()
My question is: why did the initial sparkSession/sparkContext not get the correct configuration? How can I fix it? Thanks!
If you are using SparkSession and you want to set configuration on the underlying Spark context, then use session.sparkContext:
val session = SparkSession
.builder()
.appName("test")
.enableHiveSupport()
.getOrCreate()
import session.implicits._
session.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
You don't need to import SparkContext or create it before the SparkSession.
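For the port problem in the question itself, the same mechanism can point the existing session at the new NameNode address (a sketch; it assumes the updated core-site.xml is simply not on the driver's classpath):
session.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://localhost:9099")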

/hbase/meta-region-server because node does not exist (not an error)

I'm running HBase in cluster mode and I'm getting the following error:
DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil - catalogtracker-on-hconnection-0x6e704bd0x0, quorum=node2:2181, baseZNode=/hbase Set watcher on znode that does not yet exist, /hbase/meta-region-server
I had a similar error and resolved it by doing the following:
1) Making sure the HBase client version is compatible with the HBase version on the cluster.
2) Adding hbase-site.xml to your application classpath, so that the HBase client picks up all the appropriate HBase configuration from it.
val conf = org.apache.hadoop.hbase.HBaseConfiguration.create()
// Instead of the following settings, pass hbase-site.xml in classpath
// conf.set("hbase.zookeeper.quorum", hbaseHost)
// conf.set("hbase.zookeeper.property.clientPort", hbasePort)
HBaseAdmin.checkHBaseAvailable(conf);
log.debug("HBase found! with conf " + conf);
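If putting hbase-site.xml on the classpath is not convenient, the file can also be loaded explicitly (a sketch; the path below is a placeholder for wherever the file lives on your machine):
// Configuration.addResource merges the extra XML file into the same conf object
conf.addResource(new org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))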
