I joined two dataframes on one common column and then called the show method:
df = df1.join(df2, df1.col1 == df2.col2, 'inner')
df.show()
The join ran very slowly and finally raised an error: slave lost.
Py4JJavaError: An error occurred while calling o109.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 : ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
After some searching, this seems to be a memory-related issue. I repartitioned to 3000 partitions, increased executor memory, and increased memoryOverhead, but still no luck: I got the same slave lost error. During df.show(), I noticed that the shuffle write size of one executor was very high, while the others were not so high.
Any clue?
Thanks
If you are using Scala, try:
val df = df1.join(df2,Seq("column name"))
In PySpark:
df = df1.join(df2,["columnname"])
or
df = df1.join(df2,df1.columnname == df2.columnname)
display(df)
To do the same through SQL temp views (note that the last three lines below use SparkR syntax, not PySpark):
df1.createOrReplaceTempView("left_test_table")
df2.createOrReplaceTempView("right_test_table")
left <- sql(sqlContext, "SELECT * FROM left_test_table")
right <- sql(sqlContext, "SELECT * FROM right_test_table")
head(drop(join(left, right), left$name))
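The question reports that one executor had a much larger shuffle write size than the others, which is often a sign that a few join-key values hold most of the rows. The following is a minimal sketch of how such skew could be summarised, assuming the key values (or a sample of them) have already been pulled out of the dataframe. The function name `skew_report` and the sample data are hypothetical; in Spark itself the counts would more likely come from a grouped count on the join column.

```python
from collections import Counter


def skew_report(keys, top_n=3):
    """Summarise how unevenly join-key values are distributed."""
    counts = Counter(keys)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    # Share of all rows held by the single most frequent key.
    top_share = top[0][1] / total if total else 0.0
    return {
        "total_rows": total,
        "distinct_keys": len(counts),
        "top_keys": top,
        "top_share": top_share,
    }


# Hypothetical sample: key "a" dominates the column.
sample_keys = ["a"] * 8 + ["b", "c"]
report = skew_report(sample_keys)
print(report)
```

If one key accounts for a large share of the rows, adding partitions or memory may not help much, because all rows for that key still land in the same task.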
Has anybody managed to write files (especially CSV) using Spark's DataFrame API on Windows?
Many answers on SO are outdated (e.g. this one) because Spark has had native CSV writing (and a unified write() method) since version 2.0. Also, I downloaded and added winutils.exe as proposed here.
Code:
// reading works just fine
val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("file:///C:/tmp/in.csv")
// writing fails, none of these work
df.write.csv("file:///C:/tmp/out.csv")
df.write.csv("C:/tmp/out.csv")
Error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:551)
at prost.ebtl.load.DataSourceCSV$.loadFromFilesystem(DataSourceCSV.scala:12)
at TestScala$$anonfun$main$2.apply(TestScala.scala:98)
at TestScala$$anonfun$main$2.apply(TestScala.scala:80)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at TestScala$.main(TestScala.scala:80)
at TestScala.main(TestScala.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 13, 192.168.56.1): java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Ljava/lang/String;JJJI)Ljava/io/FileDescriptor;
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileOutputStreamWithMode(NativeIO.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:305)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:294)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:326)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:393)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:191)
at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:169)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1904)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
... 27 more
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Ljava/lang/String;JJJI)Ljava/io/FileDescriptor;
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileOutputStreamWithMode(NativeIO.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:305)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:294)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:326)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:393)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:191)
at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:169)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Note: A folder named out.csv is created, though.
Setup: Hadoop 2.7.3, Spark 2.0.1, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Windows 7 workstation
I tried this and it works. You need to set the warehouse dir configuration; that's the only thing missing from your code. Also, do you have write access to the directory you are trying to write to?
val spark = SparkSession
  .builder()
  .appName("Spark SQL CSV example")
  .master("local")
  .config("spark.sql.warehouse.dir", "file:///C:/IJava/")
  .getOrCreate()
val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("file:///C:/Users/sankar/Downloads/FLinsurancesample.csv")
df.write.csv("file:///C:/Users/sankar/Downloads/out.csv")
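Since the UnsatisfiedLinkError in the question is raised from NativeIO$Windows, which is Hadoop's native Windows code, it may also be worth checking that the native helper files are present where Hadoop looks for them. The sketch below is only an illustration under stated assumptions: that the files live in the bin folder under the Hadoop home directory, and that hadoop.dll is needed next to winutils.exe. The function name `missing_hadoop_binaries` is hypothetical.

```python
import os
import tempfile

# Assumption: these are the native helper files Hadoop expects under
# <hadoop_home>/bin on Windows; adjust the list for your distribution.
REQUIRED_FILES = ("winutils.exe", "hadoop.dll")


def missing_hadoop_binaries(hadoop_home):
    """Return the required files that are absent from <hadoop_home>/bin."""
    bin_dir = os.path.join(hadoop_home, "bin")
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(bin_dir, name))
    ]


# Demonstration against a throwaway directory that only contains winutils.exe.
demo_home = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_home, "bin"))
open(os.path.join(demo_home, "bin", "winutils.exe"), "w").close()
demo_missing = missing_hadoop_binaries(demo_home)
print(demo_missing)
```

On a real machine the argument would be the value of the HADOOP_HOME environment variable rather than a temporary directory.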
My Spark application fails when it has to access numerous CSV files (~1000 at 63MB each) from S3 and pipe them into a Spark RDD. The actual process of splitting up the CSV seems to work, but an extra function call to S3NativeFileSystem seems to cause an error and crash the job.
To begin, the following is my PySpark Application:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
import time
startTime = float(time.time())
dataPath = 's3://PATHTODIRECTORY/'
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "MYKEY")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
def buildSchemaDF(tableName, columnList):
    # Read the pipe-delimited files and split every line into its fields
    currentRDD = sc.textFile(dataPath + tableName).map(lambda line: line.split("|"))
    currentDF = currentRDD.toDF(columnList)
    return currentDF
loadStartTime = float(time.time())
lineitemDF = buildSchemaDF('lineitem*', ['l_orderkey','l_partkey','l_suppkey','l_linenumber','l_quantity','l_extendedprice','l_discount','l_tax','l_returnflag','l_linestatus','l_shipdate','l_commitdate','l_receiptdate','l_shipinstruct','l_shipmode','l_comment'])
lineitemDF.registerTempTable("lineitem")
loadTimeElapsed = float(time.time()) - loadStartTime
queryStartTime = float(time.time())
qstr = """
SELECT
lineitem.l_returnflag,
lineitem.l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_discount) as sum_disc,
sum(l_tax) as sum_tax,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(l_orderkey) as count_order
FROM
lineitem
WHERE
l_shipdate <= '19981001'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus
"""
tpch1DF = sqlContext.sql(qstr)
queryTimeElapsed = float(time.time()) - queryStartTime
totalTimeElapsed = float(time.time()) - startTime
tpch1DF.show()
queryResults = [qstr, loadTimeElapsed, queryTimeElapsed, totalTimeElapsed]
distData = sc.parallelize(queryResults)
distData.saveAsTextFile(dataPath + 'queryResults.csv')
print 'Load Time: ' + str(loadTimeElapsed)
print 'Query Time: ' + str(queryTimeElapsed)
print 'Total Time: ' + str(totalTimeElapsed)
To take it step by step I start off by spinning up a Spark EMR Cluster with the following AWS CLI command (carriage returns added for readability):
aws emr create-cluster --name "Big TPCH Spark cluster2" --release-label emr-4.6.0
--applications Name=Spark --ec2-attributes KeyName=blazing-test-aws
--log-uri s3://aws-logs-132950491118-us-west-2/elasticmapreduce/j-1WZ39GFS3IX49/
--instance-type m3.2xlarge --instance-count 6 --use-default-roles
After the EMR cluster finishes provisioning I then copy over my Pyspark application onto the master node at '/home/hadoop/pysparkApp.py'. With it copied over I'm able to add the Step for spark-submit.
aws emr add-steps --cluster-id j-1DQJ8BDL1394N --steps
Type=spark,Name=SparkTPCHTests,Args=[--deploy-mode,cluster,--conf,spark.yarn.submit.waitAppCompletion=true,--num-executors,5,--executor-cores,5,--executor-memory,20g,/home/hadoop/tpchSpark.py],ActionOnFailure=CONTINUE
Now if I run this step over only a few of the aforementioned CSV files the final results will be generated, but the script will still claim to have failed.
I think it's associated with an extra call to S3NativeFileSystem, but I'm not certain. These are the Yarn log messages I'm getting which lead me to that conclusion. The first call appears to work just fine:
16/05/15 23:18:00 INFO HadoopRDD: Input split: s3://data-set-builder/splitLineItem2/lineitemad:0+64901757
16/05/15 23:18:00 INFO latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[ED8011CE4E1F6F18], ServiceEndpoint=[https://data-set-builder.s3-us-west-2.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, ClientExecuteTime=[77.956], HttpRequestTime=[77.183], HttpClientReceiveResponseTime=[20.028], RequestSigningTime=[0.229], CredentialsRequestTime=[0.003], ResponseProcessingTime=[0.128], HttpClientSendRequestTime=[0.35],
While the second one does not seem to execute properly, resulting in "Partial Results" (206 Error):
16/05/15 23:18:00 INFO S3NativeFileSystem: Opening 's3://data-set-builder/splitLineItem2/lineitemad' for reading
16/05/15 23:18:00 INFO latency: StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[10BDDE61AE13AFBE], ServiceEndpoint=[https://data-set-builder.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, Client Execute Time=[296.86], HttpRequestTime=[295.801], HttpClientReceiveResponseTime=[293.667], RequestSigningTime=[0.204], CredentialsRequestTime=[0.002], ResponseProcessingTime=[0.34], HttpClientSendRequestTime=[0.337],
16/05/15 23:18:02 INFO ApplicationMaster: Waiting for spark context initialization ...
I'm lost as to why it's even making the second call to S3NativeFileSystem when the first one appears to have responded effectively and even split the file. Is this something that is a product of my EMR configuration? I know S3Native has file limit issues and that a straight S3 call is optimal, which is what I've tried to do, but this call seems to be there no matter what I do. Please help!
Also, here are a few other error messages from my YARN log, in case they are relevant.
1)
16/05/15 23:19:22 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
16/05/15 23:19:22 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
2)
16/05/15 23:19:22 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:226)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/15 23:19:22 ERROR BypassMergeSortShuffleWriter: Error while deleting file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
16/05/15 23:19:22 WARN TaskMemoryManager: leak 32.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap#762be8fe
16/05/15 23:19:22 ERROR Executor: Managed memory leak detected; size = 33816576 bytes, TID = 14
16/05/15 23:19:22 ERROR Executor: Exception in task 13.0 in stage 1.0 (TID 14)
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/3a/temp_shuffle_b9001fca-bba9-400d-9bc4-c23c002e0aa9 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The order of precedence for Spark configuration is:
SparkContext (code/application) > spark-submit > spark-defaults.conf
So there are a couple of things to point out here:
Use YARN cluster as the deploy mode and master in your spark-submit command:
spark-submit --deploy-mode cluster --master yarn ...
OR
spark-submit --master yarn-cluster ...
Remove the "local" string from the line sc = SparkContext("local", "Simple App") in your code. Instead, initialize the Spark context like this:
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
Ref - http://spark.apache.org/docs/latest/programming-guide.html
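To make the precedence rule above concrete, here is a small plain-Python sketch that models the three configuration sources as dictionaries and merges them so that the higher-precedence source wins. It is only a model of the ordering described in this answer, not Spark's own implementation; the function name `effective_conf` and all the setting values are hypothetical.

```python
def effective_conf(defaults_file, submit_flags, in_code):
    """Merge settings so that later (higher-precedence) sources win."""
    merged = {}
    # Lowest precedence first: spark-defaults.conf, then spark-submit,
    # then values set in the application code.
    for source in (defaults_file, submit_flags, in_code):
        merged.update(source)
    return merged


demo_conf = effective_conf(
    {"spark.master": "yarn", "spark.executor.memory": "4g"},
    {"spark.submit.deployMode": "cluster", "spark.executor.memory": "20g"},
    {"spark.master": "local"},  # hard-coded in the application
)
print(demo_conf["spark.master"])
```

In this model the hard-coded "local" master overrides whatever master was requested at submit time, which is why the answer suggests removing it from the code.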
I have a couple of problems running Hive on a Cosmos FIWARE Lab instance.
First, after logging into the machine, I start the Hive command line and get the following error (I saw other questions related to this, but I couldn't find a solution):
$ hive
log4j:ERROR Could not instantiate class [org.apache.hadoop.hive.shims.HiveEventCounter].
java.lang.RuntimeException: Could not load shims in class org.apache.hadoop.log.metrics.EventCounter
at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:123)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:115)
at org.apache.hadoop.hive.shims.ShimLoader.getEventCounter(ShimLoader.java:98)
at org.apache.hadoop.hive.shims.HiveEventCounter.<init>(HiveEventCounter.java:34)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at java.lang.Class.newInstance0(Class.java:357)
at java.lang.Class.newInstance(Class.java:310)
at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:330)
at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:121)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:664)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:354)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jDefault(LogUtils.java:127)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:77)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:58)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:641)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.log.metrics.EventCounter
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:171)
at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:120)
... 27 more
log4j:ERROR Could not instantiate appender named "EventCounter".
Logging initialized using configuration in jar:file:/usr/local/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
However, I'm able to run a query like SELECT * FROM table;
On the other hand, if I try to run a more specific query, such as selecting a single column, a MapReduce job starts and results in the following error:
hive> SELECT table.column FROM table;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201507101501_40071, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201507101501_40071
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -kill job_201507101501_40071
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-29 12:49:45,518 Stage-1 map = 0%, reduce = 0%
2016-01-29 12:50:08,642 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201507101501_40071 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201507101501_40071
Examining task ID: task_201507101501_40071_m_000002 (and more) from job job_201507101501_40071
Task with the most failures(4):
-----
Task ID:
task_201507101501_40071_m_000000
URL:
http://cosmosmaster-gi:50030/taskdetails.jsp?jobid=job_201507101501_40071&tipid=task_201507101501_40071_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Any help or suggestion is welcome.
Thanks.
The first error is not relevant and does not affect Hive querying, as you have seen.
Regarding the second error, most probably it occurs because the data stored in HDFS is in JSON format (most probably stored by the Cygnus tool), so a JSON serializer/deserializer (SerDe) must be set. You can do this by executing the following statement in the Hive command line before running the select column from table query:
hive> add jar /usr/local/apache-hive-0.13.0-bin/lib/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
hive> select column from table;
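As a purely conceptual illustration of why a deserializer is needed to select a single column, the sketch below treats each stored line as one JSON document and extracts one field from it. This is not how Hive or the JSON SerDe is implemented; the field names and values are hypothetical and are not taken from the actual table.

```python
import json

# Hypothetical file content: one JSON document per line.
raw_lines = [
    '{"recvTime": "2016-01-29T12:49:45", "attrName": "temperature", "attrValue": "21.5"}',
    '{"recvTime": "2016-01-29T12:50:08", "attrName": "humidity", "attrValue": "48"}',
]


def select_column(lines, column):
    """Parse every line as JSON and return the requested field."""
    return [json.loads(line).get(column) for line in lines]


values = select_column(raw_lines, "attrValue")
print(values)
```

Without something that knows how to parse each line, the individual fields cannot be picked out of the raw text.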
When I run select * from table_name; in Hive, it works.
When I run select t.a from table_name t or select * from table_name where ..., the following error occurs:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414949555870_118360, Tracking URL = N/A
Kill Command = /usr/local/hadoop-2.5.1/bin/hadoop job -kill job_1414949555870_118360
java.io.IOException: Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
at org.apache.hadoop.mapreduce.Job.getJobState(Job.java:347)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.getJobState(JobClient.java:295)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.isJobPreparing(HadoopShimsSecure.java:104)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:242)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:541)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:431)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1485)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1263)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1091)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:921)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:790)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:623)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Ended Job = job_1414949555870_118360 with exception java.io.IOException(Job status not available )
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
So, what's wrong with Hive?
The DEBUG output follows, in case it is useful. Please help!
15/04/14 16:53:48 DEBUG ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client#60cb201b
15/04/14 16:53:48 DEBUG mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1414949555870_118441 retrying..
java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "slave109":43759; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
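The DEBUG output above ends with an UnknownHostException for "slave109", which suggests that the client machine cannot resolve that worker's host name. The sketch below shows one way to check which host names fail to resolve. The function name `unresolvable_hosts`, the stand-in resolver, and the address are hypothetical; on a real machine the default resolver (socket.gethostbyname) would be used instead of the stand-in.

```python
import socket


def unresolvable_hosts(hosts, resolve=socket.gethostbyname):
    """Return the host names that `resolve` cannot turn into an address."""
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


# Demonstration with a stand-in resolver so the example runs without DNS.
known = {"master": "10.0.0.1"}


def fake_resolve(host):
    if host not in known:
        raise socket.gaierror("unknown host: " + host)
    return known[host]


demo_failed = unresolvable_hosts(["master", "slave109"], resolve=fake_resolve)
print(demo_failed)
```

Any name reported here would need an entry in DNS or in the hosts file of the machine that runs the Hive client.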