NLTK StanfordPOSTagger not working - Windows

OS: Windows 10 x64
Python: 2.7.3
NLTK: 3.1
I want to use the Stanford POS tagger in Python, so, following the documentation, I did the following:
I downloaded stanford-postagger-2015-04-20.zip and extracted it to D:\Downloads\stanford-postagger-2015-04-20\stanford-postagger-2015-04-20
I created two environment variables:
CLASSPATH >>> D:\Downloads\stanford-postagger-2015-04-20\stanford-postagger-2015-04-20\stanford-postagger.jar
STANFORD_MODELS >>> D:\Downloads\stanford-postagger-2015-04-20\stanford-postagger-2015-04-20\models\
Using this code to use the tagger:
import os
java_path = "C:/Program Files (x86)/Java/jdk1.7.0_71/bin/java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag import StanfordPOSTagger
st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
st.tag('What is the airspeed of an unladen swallow ?'.split())
raises this error:
java.lang.UnsupportedClassVersionError: edu/stanford/nlp/tagger/maxent/MaxentTagger : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Exception in thread "main"
Traceback (most recent call last):
File "<tmp 1>", line 9, in <module>
st.tag('What is the airspeed of an unladen swallow ?'.split())
File "c:\python27\lib\site-packages\nltk\tag\stanford.py", line 66, in tag
return sum(self.tag_sents([tokens]), [])
File "c:\python27\lib\site-packages\nltk\tag\stanford.py", line 89, in tag_sents
stdout=PIPE, stderr=PIPE)
File "c:\python27\lib\site-packages\nltk\internals.py", line 134, in java
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['C:/Program Files (x86)/Java/jdk1.7.0_71/bin/java.exe', '-mx1000m', '-cp', 'D:\\Downloads\\stanford-postagger-2015-04-20\\stanford-postagger-2015-04-20\\stanford-postagger.jar', 'edu.stanford.nlp.tagger.maxent.MaxentTagger', '-model', 'D:\\Downloads\\stanford-postagger-2015-04-20\\stanford-postagger-2015-04-20\\models\\english-bidirectional-distsim.tagger', '-textFile', 'c:\\users\\wiki\\appdata\\local\\temp\\tmp0ruajz', '-tokenize', 'false', '-outputFormatOptions', 'keepEmptySentences', '-encoding', 'utf8']
What should I do?
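The message "Unsupported major.minor version 52.0" indicates that the tagger's classes were compiled for Java 8 (class-file major version 52), while the JAVAHOME path in the code above points at a JDK 1.7 install. Pointing JAVAHOME at a Java 8 or newer java.exe is therefore the likely fix; that is an inference from the error text, not something confirmed in this thread. The helper below is a hypothetical sketch (required_java_release is not part of NLTK) that reads the required Java release out of such a message:

```python
import re

def required_java_release(error_text):
    """Return the Java release needed for the class file named in an
    UnsupportedClassVersionError message, or None if no version is found."""
    match = re.search(r"major\.minor version (\d+)\.\d+", error_text)
    if match is None:
        return None
    major = int(match.group(1))
    # Class-file major versions from 49 upward map to Java 5, 6, 7, 8, ...
    return major - 44 if major >= 49 else None

message = ("java.lang.UnsupportedClassVersionError: "
           "edu/stanford/nlp/tagger/maxent/MaxentTagger : "
           "Unsupported major.minor version 52.0")
print(required_java_release(message))
```

If the printed release is higher than the JDK that JAVAHOME points to, the tagger cannot be loaded by that JVM.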

Related

install pyspark in EC2 instance (amazon linux)

Hi, I tried to install pyspark on an EC2 instance (standard Amazon Linux image). I installed Anaconda Python 3.6 and used "pip install pyspark" to install Spark, which worked fine. But when I try to start the shell with the command "pyspark", I get the following error message. What could have gone wrong? Thanks!
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/03 05:47:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/03 05:47:13 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 45, in <module>
spark = SparkSession.builder\
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: 10-236-108-194: 10-236-108-194: Temporary failure in name resolution
at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
at org.apache.spark.internal.config.package$.<init>(package.scala:204)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
... 14 more
Caused by: java.net.UnknownHostException: 10-236-108-194: Temporary failure in name resolution
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
... 23 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 54, in <module>
spark = SparkSession.builder.getOrCreate()
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.internal.config.package$
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Just use a Docker container (https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook). Why go through the hassle of configuring it yourself?
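For anyone who wants to stay on the plain EC2 install: the innermost "Caused by" in the trace above is java.net.UnknownHostException raised from InetAddress.getLocalHost, which means the machine's own hostname (10-236-108-194) did not resolve to an address. A common remedy for that kind of failure is to map the hostname to an IP address in /etc/hosts; that remedy is an assumption here, since the thread does not confirm it. The check below is a minimal sketch using only the standard library (hostname_resolves is a hypothetical helper name):

```python
import socket

def hostname_resolves(name):
    """Return True if the given hostname resolves to an IPv4 address."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.error:
        # Covers socket.gaierror, e.g. "Temporary failure in name resolution"
        return False

local_name = socket.gethostname()
print(local_name, hostname_resolves(local_name))
```

If this reports that the local hostname does not resolve, the JVM started by pyspark is likely to hit the same lookup failure.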

Error with hc=H2OContext.getOrCreate(sc) in pysparkling

I am new to PySparkling. I work with a YARN cluster, Spark 1.6, Cloudera CDH 5.8.0, and Python 2.7.6, and I have a problem with hc=H2OContext.getOrCreate(sc). Do you have any ideas?
from pysparkling import *
import h2o
hc = H2OContext.getOrCreate(sc)
17/04/16 17:13:59 INFO spark.SparkContext: Added JAR /root/.cache/Python-Eggs/h2o_pysparkling_1.6-1.6.10-py2.7.egg-tmp/sparkling_water/sparkling_water_assembly.jar at spark://147.232.202.114:47251/jars/sparkling_water_assembly.jar with timestamp 1492355639066
17/04/16 17:13:59 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
17/04/16 17:13:59 WARN internal.InternalH2OBackend: Due to non-deterministic behavior of Spark broadcast-based joins We recommend to disable them by configuring spark.sql.autoBroadcastJoinThreshold variable to value -1: sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
17/04/16 17:13:59 WARN internal.InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified! We recommend to pass --conf spark.scheduler.minRegisteredResourcesRatio=1
17/04/16 17:13:59 WARN internal.InternalH2OBackend: Unsupported options spark.dynamicAllocation.enabled detected!
17/04/16 17:13:59 WARN internal.InternalH2OBackend: The application is going down, since the parameter (spark.ext.h2o.fail.on.unsupported.spark.param,true) is true! If you would like to skip the fail call, please, specify the value of the parameter to false.
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.linux-x86_64/egg/pysparkling/context.py", line 128, in getOrCreate
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o54.invoke.
: java.lang.IllegalArgumentException: Unsupported argument: (spark.dynamicAllocation.enabled,true)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$checkUnsupportedSparkOptions$1.apply(InternalBackendUtils.scala:48)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$checkUnsupportedSparkOptions$1.apply(InternalBackendUtils.scala:40)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.checkUnsupportedSparkOptions(InternalBackendUtils.scala:40)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkUnsupportedSparkOptions(InternalH2OBackend.scala:31)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.checkAndUpdateConf(InternalH2OBackend.scala:61)
at org.apache.spark.h2o.H2OContext.(H2OContext.scala:96)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:294)
at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
This can be solved by running the program with the following command. It has been tested with Spark 1.6 and H2O version > 3.0:
bin/pysparkling h2o_spark.py --conf spark.ext.h2o.fail.on.unsupported.spark.param=false
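The warnings in the log show that the check fails because spark.dynamicAllocation.enabled is set, and the command above tells Sparkling Water not to abort on it. An alternative worth trying, stated here as an assumption rather than something tested in this thread, is to switch dynamic allocation off for the job instead. The sketch below only builds the --conf arguments for either choice; conf_args is a hypothetical helper, not part of PySparkling or Spark:

```python
def conf_args(settings):
    """Turn a dict of Spark properties into ['--conf', 'key=value', ...]."""
    args = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, bool):
            value = str(value).lower()  # Spark properties use true/false
        args += ["--conf", "{0}={1}".format(key, value)]
    return args

# Option from the answer: keep dynamic allocation, skip the failing check.
skip_check = conf_args({"spark.ext.h2o.fail.on.unsupported.spark.param": False})
# Assumed alternative: disable dynamic allocation for this job.
disable_dynamic = conf_args({"spark.dynamicAllocation.enabled": False})

print(" ".join(skip_check))
print(" ".join(disable_dynamic))
```

Either list can be appended to the launch command; which one is appropriate depends on whether the cluster relies on dynamic allocation.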

Trying to connect to a Phoenix table with PySpark and getting the following error

My command for reading the Phoenix table:
sql_sc.read.format("org.apache.phoenix.spark").option("table", tablename).option("zkUrl", "10.0.11.21:2181").load()
error:
Traceback (most recent call last):
File "/bdaas/exe/healthcare/hl7visualization.py", line 42, in
hl7 = phoenix_sparkdata(spark_app='hl7-app',spark_master='local',table_name='hl7table_v2_3')
File "/bdaas/exe/healthcare/hl7visualization.py", line 19, in __init__
self.dataframe = self.phoenix_getdataframe(table_name)
File "/bdaas/exe/healthcare/hl7visualization.py", line 41, in phoenix_getdataframe
df = self.sql_sc.read.format("org.apache.phoenix.spark").option("table", tablename).option("zkUrl", "10.0.11.21:2181").load()
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 139, in load
File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala:35)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:81)
at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:1152)
at org.apache.phoenix.spark.PhoenixRDD.(PhoenixRDD.scala:46)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:50)
at org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Running a Windows Batch File through Piping in Apache Spark

I need to run a Windows batch file using Apache Spark on multiple nodes of the Spark cluster.
Is it possible to do this using Spark's piping mechanism?
I have previously run a shell script using piping in Spark on an Ubuntu machine. The code below does that and runs fine:
from pyspark import SparkFiles
data = ["hi","hello","how","are","you"]
distScript = "/home/aawasthi/echo.sh"
distScriptName = "echo.sh"
sc.addFile(distScript)
RDDdata = sc.parallelize(data)
print RDDdata.pipe(SparkFiles.get(distScriptName)).collect()
I tried to adapt the same code to run a Windows batch file on a Windows machine with Spark installed (1.6, prebuilt for Hadoop 2.6), but it fails at the sc.addFile step. The code is below:
batchFile = "D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv"
batchFileName = "runOpenCv"
sc.addFile(batchFile)
The error thrown by Spark is below:
Py4JJavaError Traceback (most recent call last)
<ipython-input-11-9e13c265cbae> in <module>()
----> 1 sc.addFile(batchFile)
Py4JJavaError: An error occurred while calling o160.addFile.
: java.io.FileNotFoundException: Added file D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv does not exist.
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1364)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
However, the batch file does exist at the given location.
UPDATE:
I added the .bat extension to batchFile and batchFileName, and prefixed the file path with file:///. The modified code is:
from pyspark import SparkFiles
from pyspark import SparkContext
sc
batchFile = "file:///D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv.bat"
batchFileName = "runOpenCv.bat"
sc.addFile(batchFile)
RDDdata = sc.parallelize(["hi","hello"])
print SparkFiles.get("runOpenCv.bat")
print RDDdata.pipe(SparkFiles.get(batchFileName)).collect()
Now the addFile step no longer raises an error, and print SparkFiles.get("runOpenCv.bat") prints the path
C:\Users\abhilash.awasthi\AppData\Local\Temp\spark-c0f383b1-8365-4840-bd0f-e7eb46cc6794\userFiles-69051066-f18c-45dc-9610-59cbde0d77fe\runOpenCv.bat
So the file is added. But the last line of the code throws the error below:
Py4JJavaError Traceback (most recent call last)
<ipython-input-6-bf2b8aea3ef0> in <module>()
----> 1 print RDDdata.pipe(SparkFiles.get(batchFileName)).collect()
D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.pyc in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773
D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func
return f(iterator)
File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func
shlex.split(command), env=env, stdin=PIPE, stdout=PIPE)
File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func
return f(iterator)
File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func
shlex.split(command), env=env, stdin=PIPE, stdout=PIPE)
File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Please escape /
batchFile = "D://spark-1.6.2-bin-hadoop2.6//data//OpenCV//runOpenCv"
Also, as AA suggested above, it may have .cmd or .bat extension.
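One possible explanation for the WindowsError in the question, offered as an inference from the traceback rather than a confirmed diagnosis: the worker frames show pipe calling shlex.split(command), and shlex.split in its default POSIX mode treats every backslash as an escape character. A Windows path such as the one returned by SparkFiles.get would then lose its separators before being handed to subprocess. The sketch below demonstrates this with a hypothetical path and the standard library only:

```python
import shlex

# Hypothetical Windows-style path, standing in for SparkFiles.get(...)
windows_path = r"C:\Temp\userFiles\runOpenCv.bat"

# POSIX-mode splitting (the default) consumes each backslash as an escape,
# so the directory separators disappear from the resulting token.
print(shlex.split(windows_path))

# The same path written with forward slashes survives splitting unchanged.
forward = windows_path.replace("\\", "/")
print(shlex.split(forward))
```

If this is the cause, passing the path to pipe with backslashes replaced by forward slashes may be worth trying; that workaround has not been verified against Spark 1.6 here.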

org.apache.hadoop.hbase.io.ImmutableBytesWritable exception in HBase

We tried to test the following example code for accessing HBase tables (Spark-1.3.1, HBase-1.1.1, Hadoop-2.7.0):
import sys
from pyspark import SparkContext
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print >> sys.stderr, """
        Usage: hbase_inputformat <host> <table>
        Run with example jar:
        ./bin/spark-submit --driver-class-path /path/to/example/jar \
        /path/to/examples/hbase_inputformat.py <host> <table>
        Assumes you have some data in HBase already, running on <host>, in <table>
        """
        exit(-1)
    host = sys.argv[1]
    table = sys.argv[2]
    sc = SparkContext(appName="HBaseInputFormat")
    conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)
    sc.stop()
We got the following error:
15/10/14 12:46:24 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
File "/opt/python/son.py", line 30, in
conf=conf)
File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
jconf, batchSize)
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.util.Utils$.classForName(Utils.scala:157)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:494)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Any insights are highly appreciated.
The error occurs because the HBase libraries are not on your classpath. You will need the hbase-common and hbase-client jars, which you should pass to pyspark via the --jars parameter.
I resolved this by adding hbase-common.jar to the HADOOP_CLASSPATH environment variable before executing the MapReduce job:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/lib/hbase-common-1.3.1.jar
