SparkR dapply not working

I'm trying to call lapply within a function applied to a Spark DataFrame. According to the documentation, this has been possible since Spark 2.0.
wrapper <- function(df) {
  out <- df
  out$len <- unlist(lapply(df$value, function(y) length(y)))
  return(out)
}
# dd is a Spark DataFrame with one column (value) of type raw
dapplyCollect(dd, wrapper)
It returns this error:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 37, localhost): org.apache.spark.SparkException: R computation failed with
Error in (function (..., deparse.level = 1, make.row.names = TRUE) :
incompatible types (from raw to logical) in subassignment type fix
The following works fine:
wrapper(collect(dd))
But we want the computation to run on the worker nodes (not on the driver).
What could be the problem? There is a related question, but it does not help.
Thanks.

You need to add the schema, as it can only be defaulted when the output columns have the same mode as the input columns; here the wrapper adds a new integer column (len) that the input does not have.
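For illustration, a minimal sketch (untested) using dapply with an explicit output schema, assuming the raw column maps to SparkR's "binary" schema type:
# schema describing the wrapper's output: the original column plus the new len column
schema <- structType(structField("value", "binary"),
                     structField("len", "integer"))
res <- dapply(dd, wrapper, schema)   # runs wrapper on the workers
head(collect(res))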

Related

PyTorch is not working with DistributedDataParallel for multi-GPU training

I am trying to train my model on multiple GPUs. I used these libraries and added the following code for it:
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
Initialization
def ddp_setup(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    init_process_group(backend="gloo", rank=0, world_size=1)
My model:
model = CMGCNnet(config,
                 que_vocabulary=glovevocabulary,
                 glove=glove,
                 device=device)
model = model.to(0)
if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
    model = DDP(model, device_ids=[0, 1])
It throws the following error:
config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset…
True
done splitting
Loading FVQATestDataset…
Loading glove…
Building Model…
Traceback (most recent call last):
  File "trainfvqa_gruc.py", line 512, in <module>
    train()
  File "trainfvqa_gruc.py", line 145, in train
    ddp_setup(0,1)
  File "trainfvqa_gruc.py", line 42, in ddp_setup
    init_process_group(backend="gloo", rank=0, world_size=1)
  File "/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost.
localdomainlocalhost
I tried debugging the issue with os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"; it outputs:
Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Segmentation fault
With the NCCL backend it starts the training but gets stuck and doesn't go further than this:
Training for epoch 0:
0%| | 0/2039 [00:00<?, ?it/s]
I found this solution, but where do I add these lines?
GLOO_SOCKET_IFNAME, for example export GLOO_SOCKET_IFNAME=eth0
as mentioned in
https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579/3
Can someone help me with this issue? I am hoping to get an answer.
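For what it's worth, a minimal sketch of where such a variable typically goes (an assumption, not from the thread): it has to be visible to every process before init_process_group runs, so either export it in the shell before launching the script, or set it at the top of ddp_setup.
# option 1: in the shell, before launching training
#   export GLOO_SOCKET_IFNAME=eth0
# option 2: in Python, before init_process_group is called
import os
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"   # "eth0" is a placeholder; use the actual interface name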

pyspark: org.xml.sax.SAXParseException Current config of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000

I am trying to parse XML files against an XSD using the spark-xml library in PySpark.
Below is the code:
xml_df = spark.read.format("com.databricks.spark.xml") \
    .option("rootTag", "Document") \
    .option("rowTag", "row01") \
    .option("rowValidationXSDPath", "auth.011.001.02_ABC_1.1.0.xsd") \
    .load("/mnt/bronze/ABC-3.xml")
I am getting this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.1.4 executor driver): java.util.concurrent.ExecutionException: org.xml.sax.SAXParseException; systemId: file:/local_disk0/auth.011.001.02_ABC_1.1.0.xsd; lineNumber: 5846; columnNumber: 99; Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000.
I have looked for ways to set jdk.xml.maxOccurLimit=0 on the Databricks cluster but didn't find any.
Any help on solving this error will be highly appreciated.
As per the documentation, you can set jdk.xml.maxOccurLimit=0. I reproduced the same in my environment and got the expected output with the sample code below.
Sample code:
spark.conf.set("spark.jvm.args", "-Djdk.xml.maxOccurLimit=0")
df = spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("dbfs:/FileStore/gg.xml")
display(df)
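Note that JVM system properties generally cannot be changed after the JVM has started, so if the runtime spark.conf.set approach does not take effect, an alternative (an assumption about the setup, not part of the answer above) is to add the flag to the cluster's Spark config as extra Java options, so it is applied when the driver and executor JVMs start:
spark.driver.extraJavaOptions -Djdk.xml.maxOccurLimit=0
spark.executor.extraJavaOptions -Djdk.xml.maxOccurLimit=0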

How to avoid out-of-memory error when creating large arrays from PySpark and Parquet for data analysis?

We have a somewhat unusual requirement that we're experimenting with. We have a Parquet table with ~320 GB of frequency pixel values in arrays; the data is from a 3-dimensional spectral-line image cube (a FITS source file).
+--------------------+------------+-------+
| col_name| data_type|comment|
+--------------------+------------+-------+
| spi_index| int| null|
| spi_image|array<float>| null|
| spi_filename| string| null|
| spi_band| int| null|
|# Partition Infor...| | |
| # col_name| data_type|comment|
| spi_filename| string| null|
| spi_band| int| null|
+--------------------+------------+-------+
There is a legacy source-finding algorithm written in C that is used to analyse the original data source files, but it's constrained to the physical memory available on one machine. We're running a CDH Express cluster: 3 management nodes (8 cores, 32 GB RAM), 10 worker nodes (16 cores, 64 GB RAM), and a gateway node running JupyterHub (8 cores, 32 GB RAM). We've modified the original C program into a shared object and distributed it across the cluster, and we've incorporated the C shared object into a partitioner class so we can run multiple partitions in multiple executors across the cluster. We have this up and running using PySpark.
The problem we're experiencing is that we optimally need an input pixel array of at least about 15 GB, but we seem to be hitting a wall when creating arrays of around 7.3 GB, and we're unsure why.
YARN maximum allocation setting.
yarn.scheduler.maximum-allocation-mb = 40GB
Spark configuration settings
--executor-memory 18g \
--num-executors 29 \
--driver-memory 18g \
--executor-cores 5 \
--conf spark.executor.memoryOverhead='22g' \
--conf spark.driver.memoryOverhead='22g' \
--conf spark.driver.maxResultSize='24g' \
--conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=20G -XX:+UseCompressedOops" \
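For reference, a rough sketch of how these settings combine into a single executor container request under YARN (the usual executor-memory plus memoryOverhead accounting; an illustration, not from the original post):
executor_memory_gb = 18          # --executor-memory
memory_overhead_gb = 22          # spark.executor.memoryOverhead
container_request_gb = executor_memory_gb + memory_overhead_gb   # 40 GB
yarn_max_allocation_gb = 40      # yarn.scheduler.maximum-allocation-mb
assert container_request_gb <= yarn_max_allocation_gb            # sits exactly at the cap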
Summary of partitioner class
class Partitioner:
    def __init__(self):
        self.callPerDriverSetup()

    def callPerDriverSetup(self):
        pass

    def callPerPartitionSetup(self):
        sys.path.append('sofia')
        #import example
        import sofia
        import myLib
        import faulthandler
        import time as tm
        from time import time, clock, process_time, sleep
        self.sofia = sofia
        self.myLib = myLib
        #self.example=example
        self.parameterFile = SparkFiles.get('sofia.par')

    def doProcess(self, element):
        ### here's the call to the C library for each row of the dataframe partition
        ### In here we have to transform the flattened array data to the format SoFiA
        ### requires, as well as the
        ra = np.array(element.right_ascension, dtype='<f4')
        dec = np.array(element.declination, dtype='<f4')
        frequencies = np.array(element.frequency, dtype='<f4')
        Pixels = np.array(element.pixels, dtype='<f4')
        dataPtr = Pixels.ravel()
        #
        # create the dictionary of the original header
        #
        hdrKey = np.array(element.keys, dtype='U')
        hdrValue = np.array(element.values, dtype='U')
        hdrDict = dict(zip(hdrKey, hdrValue))
        newHdr = self.myLib.CreateHeader(hdrDict)
        # Get the crpix adjustment values for the new header
        crpix1, crpix2, crpix3, crpix4 = self.myLib.GetHeaderUpdates(newHdr,
                                                                     element.raHeaderIndex,
                                                                     element.decHeaderIndex,
                                                                     element.freqHeaderIndex)
        # Get the new axis values
        naxis1 = len(ra)
        naxis2 = len(dec)
        naxis4 = len(frequencies)
        newHdr['CRPIX1'] = float(crpix1)
        newHdr['CRPIX2'] = float(crpix2)
        newHdr['CRPIX3'] = float(crpix3)
        newHdr['CRPIX4'] = float(crpix4)
        newHdr['NAXIS1'] = float(naxis1)
        newHdr['NAXIS2'] = float(naxis2)
        newHdr['NAXIS4'] = float(naxis4)
        newHdr.pop("END", None)
        hdrstr, hdrsize = self.myLib.dict2FITSstr(newHdr)
        path_to_par = self.parameterFile
        parsize = len(path_to_par)
        # pass off to sofia C library
        try:
            # This is the call to the shared object
            ret = self.sofia.sofia_mainline(dataPtr, hdrstr, hdrsize, path_to_par, parsize)
            returnCode = ret[0]
            sofiaMsg = "Call to sofia has worked!"
        except RuntimeError:
            print("Caught general exception\n")
            sofiaMsg = "Caught general exception"
            returnCode = 1
            #sys.exit()
        except StopIteration:
            print("Caught Null pointer\n")
            sofiaMsg = "Caught Null pointer"
            returnCode = 2
            #sys.exit()
        except MemoryError:
            print("Caught ALLOC error\n")
            sofiaMsg = "Caught ALLOC error"
            returnCode = 3
            #sys.exit()
        except IndexError:
            print("Caught index range error\n")
            sofiaMsg = "Caught index range error"
            returnCode = 4
            #sys.exit()
        except IOError:
            print("Caught file error\n")
            sofiaMsg = "Caught file error"
            returnCode = 5
            #sys.exit()
        except OverflowError:
            print("Caught integer overflow error\n")
            sofiaMsg = "Caught integer overflow error"
            returnCode = 6
            #sys.exit()
        except TypeError:
            print("Caught user input error\n")
            sofiaMsg = "Caught user input error"
            returnCode = 7
            #sys.exit()
        except SystemExit:
            print("Caught no sources error\n")
            sofiaMsg = "Caught no sources error"
            returnCode = 8
            #sys.exit()
        if returnCode == 0:
            catalogXML = ret[5].tobytes()
        else:
            catalogXML = ""
        DELIMITER = chr(255)
        msg = "{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}"\
            .format(str(returnCode), DELIMITER,
                    str(element.raHeaderIndex), DELIMITER,
                    str(element.decHeaderIndex), DELIMITER,
                    str(len(ra)), DELIMITER,
                    str(len(dec)), DELIMITER,
                    str(len(frequencies)), DELIMITER,
                    str(element.freqHeaderIndex), DELIMITER,
                    str(catalogXML), DELIMITER,
                    str(dataPtr.nbytes))
        return msg

    def processPartition(self, partition):
        self.callPerPartitionSetup()
        for element in partition:
            yield self.doProcess(element)
We create a dataframe where each row represents a 3-dimensional array of data, with columns of array data for the 3 position dimensions plus an array containing the pixel values (the ra, dec, frequencies and pixels accessed in doProcess). Each row also contains coordinate-system information from the original source file, which we use to build a new set of header information; the rows are passed to an instance of the Partitioner class via a df.rdd.mapPartitions call.
p = Partitioner()
try:
    ...
    ...
    ...
    ...
    # Creates the positional array dataframes,
    # and the single rows representing the 3d images in finalSubcubeDF
    #
    ...
    ...
    ...
    finalSubcubeDF = finalSubcubeDF\
        .join(broadcast(raRangeDF), finalSubcubeDF.bins == raRangeDF.grp, "inner")\
        .join(broadcast(decRangeDF), finalSubcubeDF.bins == decRangeDF.grp, "inner")\
        .join(broadcast(freqRangeDF), finalSubcubeDF.bins == freqRangeDF.grp, "inner")\
        .join(broadcast(hdrDF), finalSubcubeDF.bins == hdrDF.grp, "inner")\
        .select("bins", "right_ascension", "declination", "frequency", "raHeaderIndex",
                "decHeaderIndex", "freqHeaderIndex", "pixels", "keys", "values")
    # repartition on the bins column to distribute the processing
    finalSubcubeDF = finalSubcubeDF.repartition("bins")
    finalSubcubeDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
    ...
    # Calls the partitioner class which contains the C shared object calls, as above
    ...
    rddout = finalSubcubeDF.rdd.mapPartitions(p.processPartition)
    DELIMITER = chr(255)
    rddout = rddout.map(lambda x: x.split(DELIMITER))
    ...
    ...
    # Write the results (which triggers the computation), including the catalogue XML file
    rddout.saveAsTextFile("hdfs:///<file path>")
except Exception as e:
    ...
    ...
    ...
Error message captured from the calling process
2021-02-16 11:03:42,830 INFO ERROR! ProcessThread FAILURE...An error occurred while calling o1989.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.1 failed 4 times, most recent failure: Lost task 5.3 in stage 47.1 (TID 5875, hercules-1-2.nimbus.pawsey.org.au, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1612679558527_0181_01_000014 on host: hercules-1-2.nimbus.pawsey.org.au. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
And the error message from the YARN Container log
LogType:stdout
Log Upload Time:Tue Feb 16 11:19:03 +0000 2021
LogLength:124
Log Contents:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
# Executing /bin/sh -c "kill 20417"...
So the problem is obviously memory related, but we're not entirely sure why that's the case, given that the executor and driver memory settings are set quite high. At the moment we're just grasping at straws.
We're aware that collecting distributed data into an array under normal circumstances isn't recommended; however, it seems that being able to run multiple C shared objects in parallel across the cluster could be more efficient than running 30-40 GB extractions serially on a single machine.
Thanks in advance for any thoughts and assistance.

Deepwater threw java.lang.ArrayIndexOutOfBoundsException during training if balance_classes=TRUE

In AWS, I followed the instructions here and launched a g2.2xlarge EC2 instance using the community AMI ami-97591381.
On the Docker image, I can run a simple Deep Water tutorial without a problem. However, when I tried to train a Deep Water model using my own data (which worked OK with a non-GPU deep learning model), H2O gave me this exception:
java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 <= 186393 < 170807
at water.Futures.blockForPending(Futures.java:88)
at hex.deepwater.DeepWaterDatasetIterator.Next(DeepWaterDatasetIterator.java:99)
at hex.deepwater.DeepWaterTask.setupLocal(DeepWaterTask.java:168)
at water.MRTask.setupLocal0(MRTask.java:550)
at water.MRTask.dfork(MRTask.java:456)
at water.MRTask.doAll(MRTask.java:389)
at water.MRTask.doAll(MRTask.java:385)
at hex.deepwater.DeepWater$DeepWaterDriver.trainModel(DeepWater.java:345)
at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:205)
at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1256)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 <= 186393 < 170807
at water.fvec.Vec.elem2ChunkIdx(Vec.java:925)
at water.fvec.Vec.chunkForRow(Vec.java:1063)
at hex.deepwater.DeepWaterDatasetIterator$FrameDataConverter.compute2(DeepWaterDatasetIterator.java:76)
... 6 more
This is my code, which you can run as I made the S3 links public:
library(h2o)
library(jsonlite)
library(curl)
h2o.init()
df.truth <- h2o.importFile("https://s3.amazonaws.com/nw.data.test.us.east/df.truth.zeroed", header = T, sep=",")
df.truth$isFemale <- h2o.asfactor(df.truth$isFemale)
hotnames.truth <- fromJSON("https://s3.amazonaws.com/nw.data.test.us.east/hotnames.json", simplifyVector = T)
# Training and validation sets
splits <- h2o.splitFrame(df.truth, c(0.9), seed=1234)
train.truth <- h2o.assign(splits[[1]], "train.truth.hex")
valid.truth <- h2o.assign(splits[[2]], "valid.truth.hex")
dl.2.balanced <- h2o.deepwater(
  training_frame = train.truth, model_id = "dl.2.balanced",
  x = setdiff(hotnames.truth[1:(length(hotnames.truth)/2)], c("isFemale", "nwtcs")),
  y = "isFemale", stopping_metric = "AUTO", seed = 1000000,
  sparse = F,
  balance_classes = T,
  mini_batch_size = 20)
The h2o version is 3.13.0.356.
Update:
I think I found the h2o bug: if I set balance_classes to FALSE, it runs without crashing.
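For reference, a minimal sketch of the workaround (the same call as above with only balance_classes switched off; untested, and it assumes nothing else needs to change):
dl.2.unbalanced <- h2o.deepwater(
  training_frame = train.truth, model_id = "dl.2.unbalanced",
  x = setdiff(hotnames.truth[1:(length(hotnames.truth)/2)], c("isFemale", "nwtcs")),
  y = "isFemale", stopping_metric = "AUTO", seed = 1000000,
  sparse = F,
  balance_classes = F,  # workaround for the ArrayIndexOutOfBoundsException
  mini_batch_size = 20)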
Please note that Deep Water is a legacy project (as of December 2017), which means that it is no longer under active development. The H2O.ai team has no current plans to add new features, however, contributions from the community (in the form of pull requests) are welcome.

Writing a Parquet file in standalone mode works; multiple-worker mode fails

In Spark, version 1.6.1 (code is in Scala 2.10), I am trying to write a data frame to a Parquet file:
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Overwrite).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
When I do it in development mode, everything works fine. It also works fine if I set up a master and one worker in standalone mode in a Docker environment (separate Docker containers) on the same machine. It fails when I try to execute it on a cluster (1 master, 5 workers). If I set it up locally on the master, it also works.
When I try to execute it, I get the following stacktrace:
{
"duration": "18.716 secs",
"classPath": "LDFSparkLoaderJobTest2",
"startTime": "2016-07-18T11:41:03.299Z",
"context": "sql-context",
"result": {
"errorClass": "org.apache.spark.SparkException",
"cause": "Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, curry-n3): java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)\n\nDriver stacktrace:",
"stack":[
"org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)",
"scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
"scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)",
"org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"scala.Option.foreach(Option.scala:236)",
"org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)",
"org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)",
"org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)",
"org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)",
"org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)",
"org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)",
"org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)",
"org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)",
"LDFSparkLoaderJobTest2$.readFile(SparkLoaderJob.scala:55)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:48)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:18)",
"spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:268)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
"java.lang.Thread.run(Thread.java:745)"
],
"causingClass": "org.apache.spark.SparkException",
"message": "Job aborted."
},
"status": "ERROR",
"jobId": "54ad3056-3aaa-415f-8352-ca8c57e02fe9"
}
Notes:
The job is submitted via the Spark Jobserver.
The file that needs to be converted to a Parquet file is 15.1 MB in size.
Question:
Is there something I am doing wrong (I followed the docs)?
Or is there another way I can create the Parquet file, so all my workers have access to it?
In your standalone setup only one worker is writing with ParquetRecordWriter, so it works fine.
In the real test, i.e. the cluster (1 master, 5 workers), it will fail with ParquetRecordWriter since you are writing concurrently from multiple workers...
Please try the below.
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Append).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
Please see SaveMode.Append ("append"): when saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.
I had not exactly the same, but similar, issues writing dataframes to Parquet files in cluster mode. Those problems disappeared when deleting the file just before writing, using this convenience function write(..):
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
..
def main(arg: Array[String]) {
  ..
  val fs = FileSystem.get(sc.hadoopConfiguration)
  ..
  def write(df: DataFrame, fn: String) = {
    val op1 = s"hdfs:///user/you/$fn"
    fs.delete(new Path(op1))
    df.write.parquet(op1)
  }
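  // For example, a hypothetical call with the DataFrame from the question
  // ("triples" is assumed from the question, not part of this answer); it
  // deletes hdfs:///user/you/table.parquet first, then writes it:
  write(triples, "table.parquet")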
Give it a go, tell me if it works for you...
