Unable to connect to Twitter API using Spark Streaming - spark-streaming

I am using the Twitter API for analysis for the first time, and I am not able to connect to it from Spark. I am using Scala with SBT and the following dependencies:
"org.apache.spark" % "spark-core_2.11" % "1.5.2",
"org.apache.spark" % "spark-streaming_2.11" % "1.5.2",
"org.apache.spark" % "spark-streaming-twitter_2.11" % "1.5.2"
I have created the Twitter app and got all the access tokens. I am getting the following error:
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
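For context, here is a minimal sketch of how such a job is typically wired with spark-streaming-twitter 1.5.2; the app name and the OAuth values are placeholders (not from the question), and the PKIX error above is raised during the TLS handshake with the Twitter endpoint rather than by this code itself:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Placeholder credentials from the Twitter app; twitter4j reads them from system properties
System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

val conf = new SparkConf().setAppName("twitter-test").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))

// None means the default twitter4j authorization built from the system properties above
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()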

Related

When I run the openai-java example code on my machine, an SSLHandshakeException occurs

I tried to run the openai-java example, but an SSLHandshakeException occurs. Here is the error:
java.lang.RuntimeException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at io.reactivex.internal.util.ExceptionHelper.wrapOrThrow(ExceptionHelper.java:45)
at io.reactivex.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:91)
at io.reactivex.Single.blockingGet(Single.java:2585)
at com.theokanning.openai.OpenAiService.createCompletion(OpenAiService.java:116)
Also, I used Postman to send POST requests to https://api.openai.com/v1/completions with my secret key, and it returns "You exceeded your current quota, please check your plan and billing details".
I guess it may be because of the HTTPS certificate, so I followed How to Resolve Error message "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target" and put the certificate of https://openai.com/ into my JDK, but it doesn't work. How can I fix this problem?

UnsatisfiedLinkError while writing to S3 using Staging S3A Committer on Windows

I'm trying to write Parquet data to an AWS S3 directory with Apache Spark. I use my local machine on Windows 10 without having Spark and Hadoop installed; instead, I added them as SBT dependencies (Hadoop 3.2.1, Spark 2.4.5). My build.sbt is below:
scalaVersion := "2.11.11"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.5",
  "org.apache.spark" %% "spark-hadoop-cloud" % "2.3.2.3.1.0.6-1",
  "org.apache.hadoop" % "hadoop-client" % "3.2.1",
  "org.apache.hadoop" % "hadoop-common" % "3.2.1",
  "org.apache.hadoop" % "hadoop-aws" % "3.2.1",
  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.704"
)

dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.11.0",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.11.0",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.11.0"
)

resolvers ++= Seq(
  "apache" at "https://repo.maven.apache.org/maven2",
  "hortonworks" at "https://repo.hortonworks.com/content/repositories/releases/"
)
I use the S3A Staging Directory Committer as described in the Hadoop and Cloudera documentation. I'm also aware of these two questions on Stack Overflow and used them for proper configuration:
Apache Spark + Parquet not Respecting Configuration to use “Partitioned” Staging S3A Committer
How To Get Local Spark on AWS to Write to S3
I have added all the required (to my understanding) configurations, including the last two, which are specific to Parquet:
val spark = SparkSession.builder()
  .appName("test-run-s3a-commiters")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "false")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
  .config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "true")
  .config("spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads", "true")
  .config("spark.hadoop.fs.s3a.buffer.dir", "tmp/")
  .config("spark.hadoop.fs.s3a.committer.staging.tmp.path", "hdfs_tmp/")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
spark.sparkContext.setLogLevel("info")
From the logs I can see that the StagingCommitter is actually applied (I can also see intermediate data in my local filesystem under the specified paths, and no _temporary directory in S3 during execution, as there would be with the default FileOutputCommitter).
Then I run simple code to write test data to an S3 bucket:
import spark.implicits._

val sourceDF = spark
  .range(0, 10000)
  .map(id => {
    Thread.sleep(10)
    id
  })

sourceDF
  .write
  .format("parquet")
  .save("s3a://my/test/bucket/")
(I use Thread.sleep to simulate some processing and to have a little time to check the intermediate content of my local temp directory and the S3 bucket.)
However, I get a java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat error during the commit task attempt.
Below is a piece of the logs (reduced to 1 executor) and the error stack trace.
20/05/09 15:13:18 INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 15000
20/05/09 15:13:18 INFO StagingCommitter: Starting: Task committer attempt_20200509151301_0000_m_000000_0: needsTaskCommit() Task attempt_20200509151301_0000_m_000000_0
20/05/09 15:13:18 INFO StagingCommitter: Task committer attempt_20200509151301_0000_m_000000_0: needsTaskCommit() Task attempt_20200509151301_0000_m_000000_0: duration 0:00.005s
20/05/09 15:13:18 INFO StagingCommitter: Starting: Task committer attempt_20200509151301_0000_m_000000_0: commit task attempt_20200509151301_0000_m_000000_0
20/05/09 15:13:18 INFO StagingCommitter: Task committer attempt_20200509151301_0000_m_000000_0: commit task attempt_20200509151301_0000_m_000000_0: duration 0:00.019s
20/05/09 15:13:18 ERROR Utils: Aborting task
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:460)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:821)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:735)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:703)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:52)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2091)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2071)
at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2190)
at org.apache.hadoop.fs.s3a.S3AUtils.applyLocatedFiles(S3AUtils.java:1295)
at org.apache.hadoop.fs.s3a.S3AUtils.flatmapLocatedFiles(S3AUtils.java:1333)
at org.apache.hadoop.fs.s3a.S3AUtils.listAndFilter(S3AUtils.java:1350)
at org.apache.hadoop.fs.s3a.commit.staging.StagingCommitter.getTaskOutput(StagingCommitter.java:385)
at org.apache.hadoop.fs.s3a.commit.staging.StagingCommitter.commitTask(StagingCommitter.java:641)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:225)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.commitTask(PathOutputCommitProtocol.scala:220)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/09 15:13:18 ERROR Utils: Aborting task
To my current understanding, the configuration is correct. The error is probably caused by some version incompatibility or by my local environment settings.
The provided code works as expected for ORC and CSV without any error, but not for Parquet.
Please suggest what could cause the error and how to resolve it.
For everyone who comes here, I found the solution. As expected, the problem is not related to S3A output committers or library dependencies.
The UnsatisfiedLinkError on the Java native method was raised because of a version incompatibility between the Hadoop version in the SBT dependencies and winutils.exe (the Hadoop wrapper binaries for Windows) on my machine.
I downloaded the corresponding version from cdarlint/winutils and it all worked. LOL
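For anyone who hits the same error, a sketch of the usual Windows setup, assuming the binaries are placed under C:\hadoop\bin (the path is arbitrary and only an assumption here); both winutils.exe and hadoop.dll from the build matching the Hadoop version on the classpath (3.2.1 in this case) are needed, since the NativeIO methods are loaded from hadoop.dll:
import org.apache.spark.sql.SparkSession

// Assumed layout (placeholder path): C:\hadoop\bin\winutils.exe and C:\hadoop\bin\hadoop.dll,
// taken from cdarlint/winutils for the same Hadoop version as the SBT dependencies (3.2.1).
// C:\hadoop\bin should also be on PATH (or java.library.path) so hadoop.dll can be loaded.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val spark = SparkSession.builder()
  .appName("test-run-s3a-commiters")
  .master("local[*]")
  .getOrCreate()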
This is related to the installation not having the native libraries needed to support the file:// URL, and to S3A using that for buffering writes.
You can switch to using memory for buffering; just make sure that you are uploading to S3 as fast as you generate data. There are some options covered in the S3A docs to help manage that by limiting the number of active blocks a single output stream can queue for uploading in parallel.
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>bytebuffer</value>
</property>
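In Spark, the same option can be set on the session builder shown earlier; a small sketch, with fs.s3a.fast.upload.active.blocks as the knob the S3A docs describe for capping parallel uploads per stream (the app name and the value 4 are illustrative assumptions):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-memory-buffering")
  .master("local[*]")
  // buffer multipart uploads in JVM memory instead of the local filesystem (no file:// buffering)
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
  // limit how many blocks a single output stream may queue for parallel upload
  .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "4")
  .getOrCreate()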

Spring Boot in IntelliJ: unable to find valid certification path to requested target

IntelliJ Community Edition, Java 8, Spring Boot 2.1.11.
I am trying to follow a basic LinkedIn course (Building Reactive Apps with Spring Boot 2 by Chris Anatalio).
I am unable to run the application. It uses an embedded MongoDB.
failed
:ReactivespringApplication.main()
org.springframework.beans.BeanInstantiationException: Failed to instantiate [de.flapdoodle.embed.mongo.MongodExecutable]: Factory method 'embeddedMongoServer' threw exception; nested exception is de.flapdoodle.embed.process.exceptions.DistributionException: prepare executable
de.flapdoodle.embed.process.exceptions.DistributionException: prepare executable
java.io.IOException: Could not open inputStream for https://downloads.mongodb.org/win32/mongodb-win32-x86_64-3.5.5.zip
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Update:
I found that LinkedIn provides an 'end' code set for each lecture. I downloaded and imported it, and it ran fine. I don't know what is different; I did a WinMerge comparison but found no obvious differences.
I've been in the same situation. The root cause is that you are behind a proxy server which requires an SSL certificate to communicate/download.
Resolution:
1. Try a different network; this worked for me when I connected to a mobile hotspot.
2. Import/create an SSL certificate or bypass SSL; I am also still exploring this path.

Accessing S3 from Spark 2.0

I'm trying to access an S3 file from a Spark SQL job. I have already tried solutions from several posts, but nothing seems to work, maybe because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.
I set up Hadoop this way:
sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)
I build an uber JAR with sbt assembly using:
name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
ExclusionRule("com.amazonaws", "aws-java-sdk"),
ExclusionRule("commons-beanutils")
)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
When I submit my job to the cluster, I always get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)
It seems that the driver is able to read from S3 without a problem, but the workers/executors are not. I do not understand why my uber JAR is not sufficient.
I also tried, without success, to configure spark-submit using:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
PS: If I switch to the s3n protocol, I get the following exception:
java.io.IOException: No FileSystem for scheme: s3n
If you want to use s3n:
sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths, and, if you're using client mode, make sure to distribute them to the worker nodes via the --jars flag:
spark-submit \
  --conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
  --jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \
Also, if you're building your uber JAR and including aws-java-sdk and hadoop-aws, there is no reason to use the --packages flag.
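One detail worth checking if the JARs are bundled into the uber JAR instead (an assumption about a possible cause, not something established above): sbt-assembly should merge, not discard, the META-INF/services/org.apache.hadoop.fs.FileSystem entries, since Hadoop uses that service file to register FileSystem implementations. A hedged build.sbt sketch, with the sbt-assembly plugin enabled:
// Keep FileSystem service registrations from hadoop-common, hadoop-aws, etc. when shading
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}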
Actually, all Spark operations run on the workers, and you set this configuration only on the master. So you could try applying the S3 configuration inside mapPartitions { ... } so it takes effect on the executors as well.

Apache Spark: Unable to build: [error] Server access Error..jetty

I have downloaded spark-1.4.1.tgz and unzipped it.
Now, when I try to build as follows, it gets stuck:
$ ./sbt/sbt assembly
Invoking 'build/sbt assembly' now
[info] Loading project definition...
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'.
...
[info] Resolving org.eclipse.jetty#jetty-parent;18
[error] Server access Error: Connection reset url=http://download.eclipse.org/jgit/maven/org/eclipse/jetty/jetty-parent/18/jetty-parent-18.jar
[error] Server access Error: Server redirected too many times (20) url=http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases/org.eclipse.jetty.orbit/jetty-orbit/1/jars/jetty-orbit.jar
The build gets stuck at this point. Am I missing any configuration or libraries?
You have a network connectivity issue and probably need to retry.
