Unable to save RDD on local filesystem on Windows 10

I have a Scala/Spark program that validates XML files in an input directory and then writes a report to another input parameter (a local filesystem path to write the report to).
As per the stakeholders' requirements, this program is to run on local machines, hence I am using Spark in local mode.
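For reference, the session is created in local mode roughly like this (a simplified sketch; the app name is illustrative, not the real one):
import org.apache.spark.sql.SparkSession

// Local mode: run Spark entirely inside the local JVM, no cluster needed
val spark = SparkSession.builder()
  .appName("XmlReportValidator")
  .master("local[*]")
  .getOrCreate()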
Until now things were fine; I was using the code below to save my report to a file:
dataframe.repartition(1)
.write
.option("header", "true")
.mode("overwrite")
.csv(reportPath)
However, this required winutils to be installed/configured on the machines running my program.
Given that we take Cloudera updates very often, there was the overhead of changing winutils after every update, as we would be updating the jars to the latest version in our pom file. Hence, I have been asked to remove the dependency on winutils.
After a quick Google search, and after coming across How to save Spark RDD to local filesystem,
I decided to change the above piece of code to:
val outputRdd = dataframe.rdd
val count = outputRdd.count()
println("\nCount is: " + count + "\n")
println("\nOutput path is: " + reportPath + "\n")
outputRdd.coalesce(1).saveAsTextFile(reportPath)
However, on running the code, I now get this error:
Count is: 15
Output path is: C:\\codingdir\\test\\report
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.mapred.JobContextImpl.<init>(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapreduce/JobID;)V from class org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil
at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.createJobContext(SparkHadoopWriter.scala:178)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:67)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
at com.optus.dcoe.hawk.XmlParser$.delayedEndpoint$com$optus$dcoe$hawk$XmlParser$1(XmlParser.scala:120)
at com.optus.dcoe.hawk.XmlParser$delayedInit$body.apply(XmlParser.scala:16)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.optus.dcoe.hawk.XmlParser$.main(XmlParser.scala:16)
at com.optus.dcoe.hawk.XmlParser.main(XmlParser.scala)
I have tried changing the value of the reportPath variable to
C:\codingdir\test\report
file://C:/codingdir/test/report
and other values as suggested on
Write RDD as textfile using Apache Spark
How to save Spark RDD to local filesystem
How to access local files in Spark on Windows?
and other links, but I am still getting the same error.
I have found these articles about java.lang.IllegalAccessError, but I am not sure how to get around this error:
https://examples.javacodegeeks.com/java-basics/exceptions/java-lang-illegalaccesserror-how-to-resolve-illegal-access-error/
java.lang.IllegalAccessError: tried to access method
Can someone please help me in resolving this?
The env variable HADOOP_HOME pertaining to winutils has been removed.
The winutils entry has been removed from the PATH variable.
I am using Java 8 on Windows 10 (all the users of the program would be on similar laptops).
Spark version is 2.4.0-cdh6.2.1

Finally found out the issue.
It was caused by some unwanted MapReduce-related dependencies, which have now been removed, and I have moved on to another error now.
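For anyone hitting the same IllegalAccessError: a quick diagnostic (just a sketch, not part of the original program) is to print which jar the conflicting Hadoop class is actually loaded from; that is where the stray mapreduce dependency shows up:
// Print the jar that provides the class named in the IllegalAccessError
// (getCodeSource can be null for JDK bootstrap classes, but not for Hadoop jars)
val conflicting = Class.forName("org.apache.hadoop.mapred.JobContextImpl")
println(conflicting.getProtectionDomain.getCodeSource.getLocation)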

Related

File read from ADLS Gen2 Error - Configuration property xxx.dfs.core.windows.net not found

I am using ADLS Gen2, and from a Databricks notebook I am trying to process a file using an 'abfss' path.
I am able to read parquet files just fine, but when I try to load the XML files I get the error that the configuration is not found: Configuration property xxx.dfs.core.windows.net not found.
I haven't tried mounting the storage, but I am trying to understand whether this is a known limitation with XML files, since I am able to read the parquet files just fine.
Here is my XML library config:
com.databricks:spark-xml_2.11:0.9.0
I tried a couple of things per the other articles but am still getting the same error.
Added a new scope to see if it's a scope issue in the Databricks workspace.
Tried adding the configuration:
spark.conf.set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .option("inferSchema", "true")
      .option("error_bad_lines", True)
      .option("mode", "DROPMALFORMED")
      .load(abfsssourcename))  # abfsssourcename is the path of the source file
Exception Details: Py4JJavaError: An error occurred while calling o1113.load.
Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1281)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1269)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:820)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1269)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:43)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
at scala.Option.getOrElse(Option.scala:121)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I summarize the solution below.
The package com.databricks:spark-xml seems to use the RDD API to read the XML file. When we use the RDD API to access Azure Data Lake Storage Gen2, we cannot access Hadoop configuration options set using spark.conf.set(...). So we should update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to here.
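For illustration, in a Scala notebook cell the same fix would look roughly like this (the account name, key and path are placeholders, as in the question):
// Set the ADLS Gen2 account key on the Hadoop configuration, which the
// RDD-based spark-xml reader actually consults, then load the XML
val abfssSourceName = "abfss://<container>@xxxxx.dfs.core.windows.net/<path>" // placeholder path
spark.sparkContext.hadoopConfiguration
  .set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
val df = spark.read.format("xml")
  .option("rootTag", "BookArticle")
  .option("inferSchema", "true")
  .load(abfssSourceName)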
Besides, you can also mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks.

Unable to run basic example for hadoop

Hadoop version 2.9.0, Java - 1.8.0_162
When trying to run the example given here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html, under standalone operation, I get the following error:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'
java.lang.NoSuchMethodError: org.apache.hadoop.util.ProgramDriver.run([Ljava/lang/String;)I
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I am new to Hadoop and not sure how to fix this. I have set JAVA_HOME in hadoop-env.sh. I am pretty sure that I am using compatible versions of Java and Hadoop.
Any help will be useful.
Nish,
The error message means that while the runtime was able to find the class ProgramDriver, the method run() is not present.
The most likely reason for this is that you're running an old version of Hadoop that exposed a different interface in ProgramDriver. About a year ago this method was renamed to run(), having previously been called driver().
The fix for that would be making sure you're running a recent version of Hadoop.
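If you want to confirm this on your classpath, a small diagnostic sketch (assuming the Hadoop jars are on the classpath) is to check which jar ProgramDriver is loaded from and which of the two methods it exposes:
// Which jar the class comes from, i.e. which Hadoop version is really being run
val pd = Class.forName("org.apache.hadoop.util.ProgramDriver")
println(pd.getProtectionDomain.getCodeSource.getLocation)
// Older versions expose only driver(String[]); newer ones also have run(String[])
pd.getMethods.filter(m => m.getName == "run" || m.getName == "driver").foreach(println)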
For your reference, please check the following links; they ask the same question:
Error while executing hadoop-mapreduce-examples-2.2.0.jar
Can the hadoop programm which write under the hadoop-2.2.0 run in hadoop-1.2.1?

Input path does not exist: file:/D:/pigsample_1749383998_1377684507424

I am facing a weird issue.
I am running Pig 0.11 on a Windows 7/64-bit machine with the latest version of Cygwin.
I have a weblog which I want to order by userName, so that all the activities for the same user are together, ready to feed into the next line of processing.
I start the command prompt -> cygwin.bat -> on the Cygwin console go to D:/ -> pig, and type the following script on the grunt shell (local mode).
(Note: I've set PIG_HOME and PIG_CLASSPATH correctly.)
Script is :
USERACTIVITIES = LOAD '/D:/path/of/logs/useractivities' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (datetimeUnProcessed:chararray, username:chararray, request:chararray);
USERACTIVITIES_ORDERED = ORDER USERACTIVITIES by username;
STORE USERACTIVITIES_ORDERED INTO '/D:/readyfornextinput/useractivities' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
When I do illustrate USERACTIVITIES_ORDERED, I see it going smoothly.
But when I do store/dump, I face a weird issue.
It fails, saying:
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1749383998_1377684507424
When I searched for this pigsample_<number> file, I found it in:
D:/tmp//mapred/local/localRunner
I am not sure how this is happening.
I am not sure if it's a Windows/Cygwin-related issue or whether someone has seen this on Linux as well.
For reference, you can find the stack trace attached here:
2013-08-28 15:38:28,863 [Thread-46] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1749383998_1377684507424
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:157)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.(MapTask.java:677)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/pigsample_1288777582_1377684802262
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
at org.apache.pig.impl.io.ReadToEndLoader.(ReadToEndLoader.java:126)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:131)
... 6 more
Any help on this will be useful.
Looks like this is reproducible only in a Cygwin environment.
I've documented the root cause and solution here

Oracle Data Integrator (ODI - v11.1.1.3) "unable to load language: beanshell" Error

Following an install of Eclipse 3.7.2 on my Ubuntu 12.04 development machine, I have been unable to execute any ODI packages/interfaces/procedures. On execution (for both simulated and actual runs), an error is thrown (Java trace below). I am not sure if it has anything to do with the Eclipse install, but it seems likely. Does anyone have an idea how to fix this?
Also, when launching ODI from the terminal using 'bash odi', the following error is displayed in the terminal:
2013-08-15 14:43:46.162 ERROR Error during RuntimeClassLoader initialization. ODI will start without RuntimeClassLoader
Error output:
oracle.odi.core.exception.OdiRuntimeException: Error during Code Interpretor creation
at com.sunopsis.dwg.codeinterpretor.SnpCodeInterpretor.getInstance(SnpCodeInterpretor.java:209)
at com.sunopsis.dwg.codeinterpretor.SnpGeneratorSQLCIT.<init>(SnpGeneratorSQLCIT.java:300)
at com.sunopsis.graphical.dialog.SnpsDialogExecution.doPackageExecuter(SnpsDialogExecution.java:907)
at oracle.odi.ui.action.SnpsPopupActionExecuteHandler.actionPerformed(SnpsPopupActionExecuteHandler.java:68)
at oracle.odi.ui.SnpsActionControler.handleEvent(SnpsActionControler.java:75)
at oracle.ide.controller.IdeAction.performAction(IdeAction.java:529)
at oracle.ide.controller.IdeAction.actionPerformedImpl(IdeAction.java:884)
at oracle.ide.controller.IdeAction.actionPerformed(IdeAction.java:501)
at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
at java.awt.Component.processMouseEvent(Component.java:6297)
at javax.swing.JComponent.processMouseEvent(JComponent.java:3275)
at java.awt.Component.processEvent(Component.java:6062)
at java.awt.Container.processEvent(Container.java:2039)
at java.awt.Component.dispatchEventImpl(Component.java:4660)
at java.awt.Container.dispatchEventImpl(Container.java:2097)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575)
at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4236)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4166)
at java.awt.Container.dispatchEventImpl(Container.java:2083)
at java.awt.Window.dispatchEventImpl(Window.java:2489)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674)
at java.awt.EventQueue.access$400(EventQueue.java:81)
at java.awt.EventQueue$2.run(EventQueue.java:633)
at java.awt.EventQueue$2.run(EventQueue.java:631)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
at java.awt.EventQueue$3.run(EventQueue.java:647)
at java.awt.EventQueue$3.run(EventQueue.java:645)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:644)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: org.apache.bsf.BSFException: unable to load language: beanshell
at org.apache.bsf.BSFManager.loadScriptingEngine(BSFManager.java:718)
at com.sunopsis.dwg.codeinterpretor.SnpCodeInterpretor.loadEngine(SnpCodeInterpretor.java:85)
at com.sunopsis.dwg.codeinterpretor.SnpCodeInterpretor.<init>(SnpCodeInterpretor.java:75)
at com.sunopsis.dwg.codeinterpretor.SnpCodeInterpretor.getInstance(SnpCodeInterpretor.java:184)
... 45 more
After digging around for about a day on this issue, I brazenly tried running ODI as the root user on the off chance that this was a permissions issue. I started ODI from the command line (using 'bash odi') for greater verbosity, and it loaded without the error mentioned above. Something gave me the impression that this wasn't a permissions issue, but one related to the user settings.
To rectify the issue, I removed my user's odi settings folder (renaming it, for safety):
mv ~/.odi ~/.backup_odi
Then I started ODI from the terminal under my own user (i.e. not root) - there were no errors! None of my connections were available in the new settings folder though. This I fixed by closing ODI and entering the following:
cp ~/.backup_odi/oracledi/snps_login_work.xml ~/.odi/oracledi/
If anybody else encounters this issue, I hope you find this post quicker than it took me to fix it!
org.apache.bsf.BSFException: unable to load language: beanshell
The exception was thrown because bsh-2.0b4.jar was not on the classpath; it is a dependency of bsf.jar.

LeaseExpiredException: No lease error on HDFS

I am trying to load large data into HDFS and I sometimes get the error below. Any idea why?
The error:
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /data/work/20110926-134514/_temporary/_attempt_201109110407_0167_r_000026_0/hbase/site=3815120/day=20110925/107-107-3815120-20110926-134514-r-00026 File does not exist. Holder DFSClient_attempt_201109110407_0167_r_000026_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1557)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1548)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1603)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1591)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:675)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy1.complete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.complete(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3566)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3481)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:966)
at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1297)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:78)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:303)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:456)
at com.my.hadoop.platform.sortmerger.MergeSortHBaseReducer.cleanup(MergeSortHBaseReducer.java:145)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
I managed to fix the problem:
When the job ends, it deletes the /data/work/ folder. If a few jobs are running in parallel, the deletion will also delete the files of another job; actually I need to delete /data/work/.
In other words, this exception is thrown when the job tries to access files which no longer exist.
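As an illustration of that idea (the paths and naming here are hypothetical, not from the original job), each job can write under its own directory and clean up only that directory:
import java.util.UUID
import org.apache.spark.rdd.RDD

// Hypothetical sketch: give every job a unique work directory under the shared base
// so that one job's cleanup can never remove files a concurrent job still needs
def writeToOwnWorkDir(output: RDD[String], baseDir: String = "/data/work"): String = {
  val jobWorkDir = s"$baseDir/${UUID.randomUUID()}"
  output.saveAsTextFile(jobWorkDir)
  jobWorkDir // at cleanup time, delete only this directory, not the shared baseDir
}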
I met the same problem when I used Spark Streaming to saveAsHadoopFile to Hadoop (2.6.0-cdh5.7.1); of course, I used MultipleTextOutputFormat to write different data to different paths. Sometimes the exception Zohar described would happen. The reason is, as Matiji66 says:
another program reading, writing and deleting this tmp file causes this error.
But the root reason, which he didn't talk about, is Hadoop speculative execution:
Hadoop doesn’t try to diagnose and fix slow running tasks, instead, it tries to detect them and runs backup tasks for them.
So the real reason is that your task executes slowly, and Hadoop runs another task to do the same thing (in my case, to save data to a file on Hadoop). When one of the two tasks finishes, it deletes the temp file; when the other one finishes, it tries to delete the same file, which no longer exists, so the exception
does not have any open files
happens.
You can fix it by turning off speculative execution for Spark and Hadoop:
sparkConf.set("spark.speculation", "false");
sparkConf.set("spark.hadoop.mapreduce.map.speculative", "false");
sparkConf.set("spark.hadoop.mapreduce.reduce.speculative", "false")
In my case, another program reading, writing and deleting this tmp file caused this error.
Try to avoid this.
ROOT CAUSE
A storage policy was set on the staging directory, and hence the MapReduce job failed.
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
RESOLUTION
Set up a staging directory for which no storage policy is set, i.e. modify yarn.app.mapreduce.am.staging-dir in yarn-site.xml:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/tmp</value>
</property>
I use Sqoop to import into HDFS and had the same error. With the help of the previous answers, I realized that I needed to remove the trailing "/" from --target-dir /dw/data/.
I used --target-dir /dw/data
and it works fine.
I encountered this problem when I changed my program to use the saveAsHadoopFile method to improve performance, a scenario in which I can't make use of the DataFrame API directly; see the problem.
The reason this happens is basically what Zohar said: the saveAsHadoopFile method with MultipleTextOutputFormat doesn't actually allow multiple programs running concurrently to save files to the same directory. Once one program finishes, it deletes the common _temporary directory the others still need. I am not sure if it's a bug in the M/R API. (2.6.0-cdh5.12.1)
You can try the solution below if you can't redesign your program:
This is the source code of FileOutputCommitter in the M/R API (you must download the version corresponding to your cluster):
package org.apache.hadoop.mapreduce.lib.output;
public class FileOutputCommitter extends OutputCommitter {
private static final Log LOG = LogFactory.getLog(FileOutputCommitter.class);
/**
* Name of directory where pending data is placed. Data that has not been
* committed yet.
*/
public static final String PENDING_DIR_NAME = "_temporary";
Changes:
"_temporary"
To:
System.getProperty("[the property name you like]")
Compile the single class with all required dependencies, then create a jar with the three output class files and place the jar on your classpath (ahead of the original jar).
Or, you can simply put the source file into your project.
Now you can configure the temp directory for each program by setting a different system property.
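For example, each program can set its own value before the job starts (the property name below is just a placeholder; it must match whatever name you compiled into the patched class):
import java.util.UUID

// "pending.dir.name" is a placeholder for the property name used in place of
// "_temporary" inside the patched FileOutputCommitter
System.setProperty("pending.dir.name", s"_temporary_${UUID.randomUUID()}")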
Hope it can help you.
