Is it possible to read a file using a SparkSession object in Scala on Windows?

I've been trying to read a .csv file in many ways using the SparkContext object. I found that it is possible with the scala.io.Source.fromFile function, but I want to use the Spark object. Every time I run the textFile function of org.apache.spark.SparkContext I get the same error:
scala> sparkSession.read.csv("file://C:\\Users\\184229\\Desktop\\bigdata.csv")
21/12/29 16:47:32 WARN streaming.FileStreamSink: Error while looking for metadata directory.
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
.....
As mentioned in the title, I run the code on Windows, in IntelliJ.
[Edit]
My build.sbt has no redundant or overlapping dependencies. I use hadoop-tools, spark-sql and hadoop-xz.

Have you tried to run your spark-shell using local mode?
spark-shell --master=local
Also take care not to use both hadoop-core and hadoop-common as dependencies, since you may run into conflicting-jar issues.

I've found the solution; more precisely, one of my colleagues did.
In the build.sbt dependencies I changed hadoop-tools to hadoop-common and it worked.
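Separately from the dependency fix: the path in the question, file://C:\\Users\\..., is not a well-formed file URI, because everything directly after file:// is parsed as a host name. I am not claiming this caused the error above. As a minimal sketch of what a well-formed URI for that path looks like, here it is built with Python's standard pathlib (the helper name windows_path_to_file_uri is made up for illustration):

```python
from pathlib import PureWindowsPath


def windows_path_to_file_uri(path: str) -> str:
    """Convert an absolute Windows path into a file:/// URI.

    PureWindowsPath parses Windows syntax on any operating system,
    so this also runs on Linux or macOS.
    """
    return PureWindowsPath(path).as_uri()


# The path from the question, written as a raw string.
uri = windows_path_to_file_uri(r"C:\Users\184229\Desktop\bigdata.csv")
print(uri)
```

The resulting string uses three slashes and forward slashes throughout, which is the form Hadoop-style path parsers generally expect for local files.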

Related

org.apache.kylin.job.exception.ExecuteException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/typeinfo/TypeInfo

I found a similar error at https://issues.apache.org/jira/browse/KYLIN-2511
env:
hadoop-2.7.1
hbase-1.3.2
apache-hive-2.1.1-bin
apache-kylin-1.6.0-hbase1.x-bin
I've tried copying all the Hive libs to Kylin, but then I get another error:
org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/typeinfo/TypeInfo
The missing class should be in hive-exec-.jar. Check and debug "bin/find-hive-dependency.sh" to see why it wasn't able to locate this jar on your server. You can manually add it to the "hive_exec_path" variable.
BTW, Kylin 1.6 is quite old, try to upgrade to a 2.x version.
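Before adding a jar to "hive_exec_path" by hand, it can help to confirm that the jar you picked really contains the missing class. A jar is a zip archive, so a rough check is possible with Python's zipfile module. This is only a sketch; the function name jar_contains_class is hypothetical and the jar path is whatever you find on your own server:

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path has an entry for class_name.

    class_name is a fully qualified Java class name, for example
    "org.apache.hadoop.hive.serde2.typeinfo.TypeInfo".
    """
    # Class files are stored under their package path inside the jar.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Call it with the path to your hive-exec jar and the class name from the NoClassDefFoundError; a False result means that jar is not the one you need.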
Why don't you just try the method mentioned in https://issues.apache.org/jira/browse/KYLIN-2511? You should prepare the environment according to the v1.6 documentation. It is also better to use the latest version of Kylin, which has more features and fixes some bugs.

Beam / DataFlow unexpected error ProtocolMessageEnum not implemented when using DataFlowRunner

When running my Beam pipeline locally it all works as expected but when trying to run it on the DataflowRunner I suddenly get the error below. Honestly I don't even know where to start evaluating this because the DataflowRunner seems to be a black box.
Jan 14, 2019 11:26:51 AM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 165 files. Enable logging at DEBUG level to see which files will be staged.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Class org.apache.beam.model.pipeline.v1.RunnerApi$StandardPTransforms$Primitives does not implement the requested interface com.google.protobuf.ProtocolMessageEnum
at org.apache.beam.runners.core.construction.BeamUrns.getUrn(BeamUrns.java:27)
at org.apache.beam.runners.core.construction.PTransformTranslation.<clinit>(PTransformTranslation.java:58)
at org.apache.beam.runners.core.construction.UnconsumedReads$1.visitValue(UnconsumedReads.java:49)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:666)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:458)
at org.apache.beam.runners.core.construction.UnconsumedReads.ensureAllReadsConsumed(UnconsumedReads.java:40)
at org.apache.beam.runners.dataflow.DataflowRunner.replaceTransforms(DataflowRunner.java:868)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:660)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:173)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
at (my code: pipe.run().waitUntilFinish();)
Check the versions of Beam and related libraries, and upgrade your dependencies where possible.
I had the same error and after seeing you get this error, I thought it must be a dependency conflict as it didn't exist before.
I'm using scio to deploy to dataflow and just referenced what they're using. https://github.com/spotify/scio/blob/v0.7.1/build.sbt
I updated guava and protobuf also.
I know you're using Java, but try updating Beam to 2.9.0 and maybe Guava, protobuf...
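Since the runner stages every jar on the classpath, one quick way to look for this kind of conflict is to list the jar file names and see whether the same artifact shows up in more than one version. Here is a minimal sketch in Python. It assumes jar names follow the common artifact-version.jar pattern, and the function name and the sample file names are made up for illustration:

```python
import re
from collections import defaultdict

# Assumes names like "protobuf-java-3.6.0.jar": artifact, dash, version.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Map each artifact that appears in several versions to those versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:  # names that do not fit the pattern are skipped
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


# Hypothetical classpath listing, not taken from the question.
conflicts = find_version_conflicts(
    ["protobuf-java-3.6.0.jar", "protobuf-java-2.5.0.jar", "guava-20.0.jar"]
)
print(conflicts)
```

Any artifact reported here is a candidate for pinning to a single version in the build file.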

Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found (Spark 1.6 Windows)

I am trying to access S3 files from a local Spark context using PySpark.
I keep getting:
File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
I had set os.environ['AWS_ACCESS_KEY_ID'] and
os.environ['AWS_SECRET_ACCESS_KEY'] before I called df = sqc.read.parquet(input_path). I also added these lines:
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
I have also tried changing s3 to s3n and s3a. Neither worked.
Any idea how to make it work?
I am on Windows 10, pySpark, Spark 1.6.1 built for Hadoop 2.6.0
I run pyspark with the hadoop-aws libraries added via --packages.
You will need to use s3n in your input path. I'm running this on macOS, so I'm not sure whether it will work on Windows.
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
This package declaration works even in spark-shell
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1
and specify the credentials in the shell:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxxxxx")
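For PySpark the same settings could be applied roughly as below. This is a sketch, not a tested Windows setup: the property names are the fs.s3n ones from this answer, the environment variable names are the ones from the question, the helper names are made up, and sc._jsc is a private PySpark attribute that may change between versions.

```python
import os


def s3n_credentials_conf(env=None):
    """Build the fs.s3n.* credential properties from environment variables."""
    env = os.environ if env is None else env
    return {
        "fs.s3n.awsAccessKeyId": env["AWS_ACCESS_KEY_ID"],
        "fs.s3n.awsSecretAccessKey": env["AWS_SECRET_ACCESS_KEY"],
    }


def with_s3n_scheme(path):
    """Rewrite an s3:// or s3a:// path so that it uses the s3n:// scheme."""
    for scheme in ("s3://", "s3a://"):
        if path.startswith(scheme):
            return "s3n://" + path[len(scheme):]
    return path


def apply_hadoop_conf(sc, conf):
    """Copy the properties into the Hadoop configuration of a SparkContext."""
    hadoop_conf = sc._jsc.hadoopConfiguration()  # private attribute, see note above
    for key, value in conf.items():
        hadoop_conf.set(key, value)
```

In a pyspark shell started with the --packages option shown above, you would call apply_hadoop_conf(sc, s3n_credentials_conf()) and then read from with_s3n_scheme(input_path).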

AdminTask.listTCPEndPoints('abc(abc)') throws exception: ADMF0007E: target object is required

I'm working on deploying an application to WebSphere 7 using a Python script, and the script throws an exception at this line:
AdminTask.listTCPEndPoints('abc(abc)')
If I run the above command before I run the Python script, it behaves as expected and gives me the error ADMF0003E: Invalid parameter value. But the same command fails in the Python script with this error:
wsadmin>AdminTask.listTCPEndPoints('abc(abc)')
WASX7015E: Exception running command: "AdminTask.listTCPEndPoints('abc(abc)')"; exception information: com.ibm.websphere.management.cmdframework.CommandValidationException: ADMF0007E: target object is required.
I can guess that there is something in the Python script that is causing this issue, but I don't understand why the AdminTask.listTCPEndPoints command is not able to see the parameter being passed. I'm new to WebSphere; I have only used it in the past but never configured it. Any help/insight would be highly appreciated.
Thanks!
Added stack trace of interactive mode option
wsadmin>print AdminTask.listTCPEndPoints('-interactive')
List NamedEndPoints that can be used by a TCPInboundChannel
Lists all NamedEndPoints that can be associated with a TCPInboundChannel
*TCPInboundChannel: abc(abc)
excludeDistinguished (excludeDistinguished): 0
WASX7435W: Value 0 is converted to a boolean value of false.
unusedOnly (unusedOnly): 0
WASX7435W: Value 0 is converted to a boolean value of false.
List NamedEndPoints that can be used by a TCPInboundChannel
F (Finish)
C (Cancel)
Select [F, C]: [F] F
WASX7278I: Generated command line: AdminTask.listTCPEndPoints('[-excludeDistinguished false -unusedOnly false]')
WASX7015E: Exception running command: "AdminTask.listTCPEndPoints('-interactive')"; exception information:
com.ibm.websphere.management.cmdframework.CommandValidationException: ADMF0007E: target object is required.
Follow this link. It appears that you have not specified the target object, which is why that error occurs.
I suggest using the following command as a starter:
print AdminTask.listTCPEndPoints('-interactive')
Note: Instead of copying and pasting the command, type it on the command line. Sometimes the command editor does not accept a command that was pasted directly.
Okay, I was able to fix the error. I was getting it because, as part of the application deployment script, I was copying a few of my application jars to WebSphere's java/jre/lib/ext directory so that they would be available on the classpath. In one of those jars I had bundled an IBM class (Base64Coder.class) that was required by a class in my jar, and it was corrupting the WebSphere AdminTask utility. When I removed Base64Coder.class from my jar, the Python script worked fine. I believe it corrupted WebSphere because the class was duplicated in the JVM: it also ships with the IBM WebSphere installation and was present in AppServer/runtimes/com.ibm.ws.webservices.thinclient_7.0.0.jar
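If you suspect the same kind of duplication, a rough way to check is to scan the jars you copy plus the WebSphere jars and list every jar that bundles a given class. The sketch below uses only Python's zipfile module and is meant to be run with a regular Python interpreter, not inside wsadmin. The function name is hypothetical, and it matches on the simple class name because the package of Base64Coder is not given here:

```python
import zipfile


def jars_bundling_class(jar_paths, simple_name):
    """Return the jars that contain a class with the given simple name.

    The match is on the file name only, so "Base64Coder" matches
    any/package/Base64Coder.class in any package.
    """
    suffix = "/" + simple_name + ".class"
    hits = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            # Prefix "/" so classes in the default package match too.
            if any(("/" + entry).endswith(suffix) for entry in jar.namelist()):
                hits.append(path)
    return hits
```

If more than one jar comes back for the same class, the copy in your own jar is the likely candidate to remove.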

My Pig UDF runs in local mode but fails with "Deserialization error: could not instantiate" on my cluster

I have a Pig UDF which runs perfectly in local mode, but fails with: could not instantiate 'com.bla.myFunc' with arguments 'null' when I try it on the cluster.
My mistake was not digging hard enough in the task logs.
When you dig through them in the JobTracker UI, you can see that the root cause was:
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
So, besides the usual:
pigServer.registerFunction("myFunc", new FuncSpec("com.bla.myFunc"));
we should add:
registerJar(pigServer, Maps.class);
and so on for every jar used by the UDF.
Another option is to use build-jar-with-dependencies, but then you have to put pig.jar before yours in the classpath, or else you'll run into this one: embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?
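To avoid missing the root cause in long task logs, you could save the log text and pull out every class named in a "Caused by: java.lang.ClassNotFoundException" line. A minimal sketch in Python (the function name is made up, and the sample text is just the two log lines quoted above):

```python
import re

MISSING_CLASS = re.compile(
    r"Caused by: java\.lang\.ClassNotFoundException: ([\w.$]+)"
)


def missing_classes(log_text):
    """Return the distinct missing class names, in order of first appearance."""
    seen = []
    for name in MISSING_CLASS.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


sample_log = (
    "Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps\n"
    "at java.net.URLClassLoader$1.run(URLClassLoader.java:202)\n"
)
print(missing_classes(sample_log))
```

Each class it reports points at a jar that still has to be registered for the UDF.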

Resources