How to run the Sparkling Water example with Spark in local mode

I am trying to run the Sparkling Water deep learning demo in IntelliJ IDEA.
The code is at:
https://github.com/h2oai/sparkling-water/blob/RELEASE-2.0.3/examples/src/main/scala/org/apache/spark/examples/h2o/DeepLearningDemo.scala
It fails to start with the following exception:
17/01/06 11:18:41 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1,1483672721446,JobFailed(org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down))
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
at java.lang.Thread.getStackTrace(Thread.java:1108)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:911)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.startH2O(InternalBackendUtils.scala:163)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$.startH2O(InternalBackendUtils.scala:262)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:99)
at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:102)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:279)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:301)
at com.xyz.HelloSparklingWater$.main(HelloSparklingWater.scala:35)
at com.xyz.HelloSparklingWater.main(HelloSparklingWater.scala)
It looks like the exception is thrown while constructing H2OContext and InternalH2OBackend.
How can I run this example in Spark local mode, that is, from within the IDE?

Related

org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21

I have a YARN cluster with Spark (1.6.1), HDFS and Hive (2.1). My workflows worked fine for a few months until today, with no changes to the code or the environments. I started to get errors like this:
org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21
Serialization trace:
outputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:131)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
at org.apache.hadoop.hive.ql.exec.spark.KryoSerializer.deserialize(KryoSerializer.java:49)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:318)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Using Hive I can do simple selects, but every other operation that needs Spark ends with Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3) in the console, and the error above in the YARN logs.
Now every one of my Hive databases is paralyzed (I have a few). I spent the whole day trying to solve this problem, but nothing helped (restarting Hive, restarting the YARN nodes, changing the YARN master).
What do you think causes the problem, and how can it be solved?
I figured it out.
After restarting hive-server2, for a short period of time, instead of the error org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 26 I got the error org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hadoop.hive.ql.io.RCFileOutputFormat. With the second form it was obvious that Spark running on the nodes didn't have some jars on its classpath. I don't know why Spark was suddenly unable to load these jars, but after copying them manually to its lib folder on every node and restarting the nodes, everything went back to normal.
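If you want to confirm a missing-jar diagnosis like this before copying files around, a small probe can help. The following is only a sketch, not part of Hive or Spark: the class name ClasspathCheck and its methods are made up for illustration, and it assumes you run it with the same classpath as the failing process on the node.

```java
import java.security.CodeSource;

// Minimal classpath probe (hypothetical helper, not a Hive or Spark API).
final class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Describes where a class was loaded from, or reports that it is missing. */
    static String describe(String className) {
        try {
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // JDK classes usually have no code source, so handle null explicitly
            return className + " -> " + (src == null ? "(no code source, JDK class)" : src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> NOT FOUND (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Default class name taken from the second error message above.
        String[] names = args.length > 0
                ? args
                : new String[] {"org.apache.hadoop.hive.ql.io.RCFileOutputFormat"};
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

When the class is found, the printed location shows which jar it came from, which also helps to spot duplicate or mismatched jar versions.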

Submitting remote spark job from eclipse IDE, getting paranamer error

I'm submitting a remote Spark job to the cluster in yarn-client mode from the Eclipse IDE,
but I get the error message below in the Eclipse IDE.
java.lang.NoClassDefFoundError: com/thoughtworks/paranamer/BytecodeReadingParanamer
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.<init>(BeanIntrospector.scala:40)
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.<clinit>(BeanIntrospector.scala)
com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector.<init>(ScalaPropertiesCollector.scala:22)
com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospector$.constructPropertyCollector(ScalaClassIntrospector.scala:24)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.collectProperties(BasicClassIntrospector.java:142)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:68)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:11)
com.fasterxml.jackson.databind.SerializationConfig.introspect(SerializationConfig.java:490)
com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:133)
com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:873)
com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:833)
com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:387)
com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:478)
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:97)
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2718)
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2210)
org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:51)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:144)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:181)
com.nuevora.core.spark.commons.CommonFunctions.getJavaRDDFromFile(CommonFunctions.java:61)
com.nuevora.core.spark.UpdateDataset.modifyInputDataset(UpdateDataset.java:103)
com.nuevora.controllers.FormsValidatorServlet.service(FormsValidatorServlet.java:3070)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
The executor logs are not helpful, but the final status of the job is "succeeded".
I need help solving this error.
I solved this issue by adding the paranamer jar that comes with the Cloudera distribution.

Java run time error on shell script in pentaho

I have a script in Pentaho that takes values such as the process ID and a few variables from the successful execution of the previous job, and writes a file, named after the process ID and variables, to another location. While the shell script is executing, it throws the runtime error below, but only occasionally. Please help.
ERROR 21-07 03:27:26,604 - Shell_Create_Trigger_File - (stderr) java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read(BufferedInputStream.java:325)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at org.pentaho.di.core.util.StreamLogger.run(StreamLogger.java:57)
at java.lang.Thread.run(Thread.java:745)
"(stderr) java.io.IOException: Stream closed" usually happens if there is an abrupt closing of the connection or data connection is getting lost. You can check this question.
There could also be other causes, such as a slow network or the PDI process being killed midway, so the answer to your question is broad. Since the error happens only occasionally, I suggest you look into the server on which you are running the code.
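To see where this message comes from, here is a minimal reproduction in plain Java. It is only a sketch and not Pentaho's code: the class StreamClosedDemo is made up for illustration, and an in-memory stream stands in for the stderr of the shell process. It closes the underlying stream while a BufferedReader is still about to read from it, which matches the BufferedInputStream and BufferedReader.readLine frames in the trace above.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the in-memory stream is an assumption standing in
// for the process output that a logging thread would read.
final class StreamClosedDemo {

    /** Reads a line after the underlying stream was closed; returns the exception message, or null if none was thrown. */
    static String readAfterClose() {
        InputStream raw = new BufferedInputStream(
                new ByteArrayInputStream("line 1\nline 2\n".getBytes(StandardCharsets.UTF_8)));
        BufferedReader reader = new BufferedReader(new InputStreamReader(raw, StandardCharsets.UTF_8));
        try {
            raw.close();        // the stream goes away underneath the reader
            reader.readLine();  // the reader still tries to fill its buffer
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("IOException message: " + readAfterClose());
    }
}
```

The point is that the exception is raised on the reading side, so it only tells you that the stream disappeared, not why. The cause still has to be found on the server or in the process that was being read.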
Hope it helps :)

Kafka on Windows - start service error

I am trying to cleanly start Kafka 2.10-0.8.2.1 on Windows, but I get an annoying error every time I start it.
I have just installed Kafka by following the Quick Start guide (except that I installed Zookeeper myself). Both Kafka and Zookeeper have a very basic installation on a single machine.
Problem
When I run the starting script:
kafka-server-start.bat C:\kafka_2.10-0.8.2.1\config\server.properties
I get the error:
Error
[2015-07-14 17:00:45,197] WARN Error when freeing index buffer (kafka.log.OffsetIndex)
java.lang.NullPointerException
at kafka.log.OffsetIndex.kafka$log$OffsetIndex$$forceUnmap(OffsetIndex.scala:301)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:283)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
at kafka.utils.Utils$.inLock(Utils.scala:535)
at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
at kafka.log.Log.loadSegments(Log.scala:179)
at kafka.log.Log.<init>(Log.scala:67)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$7$$anonfun$apply$1.apply$mcV
at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-14 17:00:45,219] INFO Completed load of log test-0 with log end offset 0 (kafka.log.Log)
What I observed
When I delete the Kafka log folder, the error does not appear the first time I re-run the starting script. Kafka log folder path:
C:\tmp\kafka-logs
I have tried to stop the service using the provided script, but it does not help. The server stop script:
kafka-server-stop.bat
Although the same error appears when I start it the second time, Kafka start-up continues, and it seems that it works normally.
Help
How can I get rid of the above error?
This exception shouldn't be a problem during startup but if you want to get rid of it I guess you have to file a Jira for it.
This seems to be related to KAFKA-1008 and this commit.
/**
 * Forcefully free the buffer's mmap. We do this only on windows.
 */
private def forceUnmap(m: MappedByteBuffer) {
  try {
    if(m.isInstanceOf[sun.nio.ch.DirectBuffer])
      (m.asInstanceOf[sun.nio.ch.DirectBuffer]).cleaner().clean()
  } catch {
    case t: Throwable => warn("Error when freeing index buffer", t)
  }
}
When the index file size is 0, cleaner() will be null, so the call to clean() throws the NullPointerException that is then logged as a warning.
See:
https://github.com/apache/kafka/pull/1718
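One way to avoid the warning would be to check the cleaner for null before calling clean(). The sketch below shows only that guard, in plain Java. Cleaner and DirectBufferLike are hypothetical stand-ins for the JDK-internal sun.nio.ch.DirectBuffer and its cleaner, and I have not checked whether the linked pull request fixes it in exactly this way.

```java
// Hypothetical illustration of a null-guarded unmap; not Kafka's actual code.
final class NullSafeUnmap {

    // Stand-ins for the JDK-internal types, so the example compiles on any JDK.
    interface Cleaner { void clean(); }
    interface DirectBufferLike { Cleaner cleaner(); }

    /** Returns true if a cleaner was present and invoked, false if there was nothing to free. */
    static boolean forceUnmap(Object buffer) {
        if (!(buffer instanceof DirectBufferLike)) {
            return false;                       // not a direct buffer: nothing to do
        }
        Cleaner cleaner = ((DirectBufferLike) buffer).cleaner();
        if (cleaner == null) {
            return false;                       // e.g. an empty index file: no mapping to release
        }
        cleaner.clean();
        return true;
    }

    public static void main(String[] args) {
        DirectBufferLike emptyIndex = () -> null;               // cleaner() returns null
        boolean[] cleaned = {false};
        DirectBufferLike mappedIndex = () -> () -> { cleaned[0] = true; };

        System.out.println("empty index freed: " + forceUnmap(emptyIndex));
        System.out.println("mapped index freed: " + forceUnmap(mappedIndex)
                + ", cleaner ran: " + cleaned[0]);
    }
}
```

With the guard in place, the empty-index case simply returns instead of ending up in the catch block.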

My Hadoop job died after 252 hours (tasks then killed)

I had 81,068 tasks complete, but then 11,799 failed and only 12 were killed. They all SEEM to have failed with:
2013-09-10 03:07:36,316 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201308301539_0002_m_083001_0: Error initializing attempt_201308301539_0002_m_083001_0:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201308301539_0002/work in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1817)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1933)
at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:830)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:824)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)
At this point, I am just looking for guidance on how to debug this before I re-run the job. For some reason, out in the cluster it looks like all the files have been deleted, though I thought Hadoop M/R only deleted the logs of successful tasks?
Anyone have some advice/ideas on how to debug this further?
It looks like all the default directories for map/reduce are used... /tmp/hadoop-hduser for my hduser.
I have seen suggestions about /etc/hosts, but then I don't understand why 81,000 tasks succeeded before the rest finally failed.
I am using the web interface to get some of this information, of course, plus some logs under hadoopinstalled/logs.
thanks,
Dean

Resources