Why don't the netlib-java native BLAS/LAPACK libraries give a performance improvement?

I am using this piece of code to calculate spark recommendations:
SparkSession spark = SparkSession
        .builder()
        .appName("SomeAppName")
        .config("spark.master", "local[" + args[2] + "]")
        .config("spark.local.dir", args[4])
        .getOrCreate();
JavaRDD<Rating> ratingsRDD = spark
        .read().textFile(args[0]).javaRDD()
        .map(Rating::parseRating);
Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
ALS als = new ALS()
        .setMaxIter(Integer.parseInt(args[3]))
        .setRegParam(0.01)
        .setUserCol("userId")
        .setItemCol("movieId")
        .setRatingCol("rating")
        .setImplicitPrefs(true);
ALSModel model = als.fit(ratings);
model.setColdStartStrategy("drop");
Dataset<Row> rowDataset = model.recommendForAllUsers(50);
These are maven dependencies to make this piece of code work:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>
Calculating recommendations with this code takes ~70sec for my data file. It produces the following warnings:
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Now I try to enable netlib-java by adding this dependency in maven:
<dependency>
    <groupId>com.github.fommil.netlib</groupId>
    <artifactId>all</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>
To keep this new environment from crashing I had to add this extra trick:
LD_PRELOAD=/usr/lib64/libopenblas.so
Now it runs without warnings, but it is slower: the same calculation takes ~170sec on average. I am running this on CentOS.
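For reference, one way to check which implementation netlib-java actually loaded is to query it directly; a minimal standalone sketch, assuming the com.github.fommil.netlib 1.1.x API that the warnings above refer to:
import com.github.fommil.netlib.BLAS;
import com.github.fommil.netlib.LAPACK;

public class BlasCheck {
    public static void main(String[] args) {
        // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native
        // library was loaded, or com.github.fommil.netlib.F2jBLAS for the pure-Java fallback.
        System.out.println("BLAS:   " + BLAS.getInstance().getClass().getName());
        System.out.println("LAPACK: " + LAPACK.getInstance().getClass().getName());
    }
}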
Shouldn't it be faster with native libraries? Is it possible to make it faster?

First, check your CentOS version: on CentOS 6 the native libraries may not be used, check this.
As far as I know, the ALS algorithm has been improved since version 2.0; check the "Highlights in 2.2" release notes and the source code from 2.2 below.
So the native libraries do not help!

Related

S3 access point as Spark Eventlog Directory

We have a standalone Spring Boot based Spark application where, at the moment, the property spark.eventLog.dir is set to an s3 location.
SparkConf sparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("MyApp")
        .set("spark.hadoop.fs.permissions.umask-mode", "000")
        .set("hive.warehouse.subdir.inherit.perms", "false")
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .set("spark.speculation", "false")
        .set("spark.eventLog.enabled", "true")
        .set("spark.extraListeners", "com.ClassName");
sparkConf.set("spark.eventLog.dir", "s3a://my-bucket-name/eventlog");
This has been working as expected; however, the bucket access has now changed to an access point, so the URL has to be arn:aws:s3:<bucket-region>:<accountNumber>:accesspoint:<access-point-name>, e.g.:
sparkConf.set("spark.eventLog.dir", "s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
After this change we are getting the following stack trace while booting up the app:
java.lang.NullPointerException: null uri host.
at java.base/java.util.Objects.requireNonNull(Objects.java:246)
at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:71)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:522)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
Looking at the class S3xLoginHelper, it looks like it is failing to create a java.net.URI object because of the ':' characters in the URL string.
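A quick way to reproduce the parsing issue outside of Spark (a hypothetical standalone snippet, not part of our application):
import java.net.URI;

public class UriHostDemo {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
        // The extra ':' characters prevent server-based authority parsing,
        // so getHost() returns null -- exactly the "null uri host" that S3xLoginHelper complains about.
        System.out.println("host = " + uri.getHost());
    }
}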
I have the following relevant maven dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>
Update:
Also, I tried adding the following in core-site.xml (and also in hdfs-site.xml), as mentioned in the hadoop-aws documentation:
<property>
    <name>fs.s3a.bucket.my-access-point.accesspoint.arn</name>
    <value>arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point</value>
    <description>Configure S3a traffic to use this AccessPoint</description>
</property>
And updated the code with sparkConf.set("spark.eventLog.dir", "s3a://my-access-point/eventlog");
This gives a stack trace with java.io.FileNotFoundException: Bucket my-access-point does not exist, which indicates that it is not using those updated properties for spark.eventLog.dir and is treating my-access-point as a bucket name!

Databricks local test fails with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws the following exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(Ljava/lang/String;)V
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2747)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2747)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
From the source code, the error comes from this initialization:
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("Helper Functions Unit Testing") \
    .getOrCreate()
I searched for this error and most of the results are related to Maven configuration, adding a dependency on hadoop-auth. However, for pyspark I don't know how to deal with it. Does anyone have experience or insight into this error?
My workaround here is to use Python 3.7 and change the pyspark version to 3.0, and then it seems OK. So it is related to an inconsistent environment and dependencies.
This is limited to my case; from my search on the web, most answers relate to adding a hadoop-auth.jar dependency to the Hadoop configuration in Maven.
I encountered this error in a Maven project written in Scala, not Python. What did it for me was adding not only the hadoop-auth dependency like the OP specified, but also the hadoop-common dependency in my pom file, like so:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>3.1.2</version>
</dependency>
Replace 3.1.2 with whatever version you're using. However, I also found that I had to track down other dependencies that conflicted with hadoop-common and hadoop-auth and add exclusions to them, like so:
<exclusions>
    <exclusion>
        <artifactId>hadoop-common</artifactId>
        <groupId>org.apache.hadoop</groupId>
    </exclusion>
    <exclusion>
        <artifactId>hadoop-auth</artifactId>
        <groupId>org.apache.hadoop</groupId>
    </exclusion>
</exclusions>
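When chasing these mixed Hadoop versions, it can also help to print what actually ends up on the classpath; a small diagnostic sketch (not from the original answer, it only uses standard Hadoop and Java reflection APIs):
import org.apache.hadoop.security.HadoopKerberosName;
import org.apache.hadoop.util.VersionInfo;

public class HadoopClasspathCheck {
    public static void main(String[] args) {
        // The Hadoop version the classpath resolves to.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        // The jar HadoopKerberosName was loaded from; if it disagrees with the
        // version above, that mismatch usually explains the NoSuchMethodError.
        System.out.println("HadoopKerberosName from: "
                + HadoopKerberosName.class.getProtectionDomain().getCodeSource().getLocation());
    }
}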

How to resolve NoSuchMethodError in Spring TransactionSynchronizationManager.currentTransaction when using spring-data-r2dbc

I am writing a service based on Spring WebFlux which reads data from PostgreSQL using r2dbc. I need to use the latest release of r2dbc; however, I am getting a NoSuchMethodError exception when using TransactionSynchronizationManager from the spring-tx 5.2.0.RELEASE library.
Basically, I need to know the correct spring-tx library version that is compatible with the version of spring-data-r2dbc which works correctly with the latest r2dbc-postgresql and r2dbc-spi libraries.
Here are my Maven dependencies.
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-r2dbc</artifactId>
    <version>1.0.0.M2</version>
</dependency>
<dependency>
    <groupId>io.r2dbc</groupId>
    <artifactId>r2dbc-postgresql</artifactId>
    <version>0.8.0.RC2</version>
</dependency>
<dependency>
    <groupId>io.r2dbc</groupId>
    <artifactId>r2dbc-spi</artifactId>
    <version>0.8.0.M8</version>
</dependency>
I am using an interface extending the ReactiveCrudRepository interface to retrieve the table data, as below.
@Query("...")
Flux<QuoteHistory> findAllBySecIdAndDateTimeBetweenAndUpdateTypeIn(LocalDate date, Long secId);
I was able to get this code to work with earlier versions of r2dbc-postgresql and r2dbc-spi, but now I am getting the following exception.
java.lang.NoSuchMethodError: org.springframework.transaction.reactive.TransactionSynchronizationManager.currentTransaction()Lreactor/core/publisher/Mono;
at org.springframework.data.r2dbc.connectionfactory.ConnectionFactoryUtils.doGetConnection(ConnectionFactoryUtils.java:88) ~[spring-data-r2dbc-1.0.0.M2.jar:1.0.0.M2]
at org.springframework.data.r2dbc.connectionfactory.ConnectionFactoryUtils.getConnection(ConnectionFactoryUtils.java:70) ~[spring-data-r2dbc-1.0.0.M2.jar:1.0.0.M2]
at org.springframework.data.r2dbc.core.DefaultDatabaseClient.getConnection(DefaultDatabaseClient.java:189) ~[spring-data-r2dbc-1.0.0.M2.jar:1.0.0.M2]
These are the r2dbc dependencies that the code works with.
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-r2dbc</artifactId>
    <version>1.0.0.M1</version>
</dependency>
<dependency>
    <groupId>io.r2dbc</groupId>
    <artifactId>r2dbc-postgresql</artifactId>
    <version>1.0.0.M7</version>
</dependency>
Please use the following dependency combination:
R2DBC Postgres: 0.8.0.RC2
R2DBC SPI: 0.8.0.RC2
Spring Data R2DBC: 1.0.0.RC1
Spring Framework: 5.2.0.RELEASE
Alternatively, go to https://start.spring.io to get a dependency-managed project with versions that work together.
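If you are unsure which spring-tx jar actually wins on the classpath, a small diagnostic like this can help (a hypothetical sketch, not from the original answer; it only relies on standard reflection):
import org.springframework.transaction.reactive.TransactionSynchronizationManager;

public class SpringTxCheck {
    public static void main(String[] args) {
        // The jar the reactive TransactionSynchronizationManager was loaded from; the reactive
        // variant and its currentTransaction() method live in the Spring Framework 5.2 line,
        // so an older or mismatched spring-tx here explains the NoSuchMethodError.
        System.out.println("spring-tx jar: "
                + TransactionSynchronizationManager.class.getProtectionDomain().getCodeSource().getLocation());
        System.out.println("version: "
                + TransactionSynchronizationManager.class.getPackage().getImplementationVersion());
    }
}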

Correct the classpath of your application so that it contains a single, compatible version of org.elasticsearch.common.logging.Loggers

I'm getting the following error while running my Spring Boot app. I'm new to Spring Boot and Elasticsearch, so please help me solve this issue. My pom dependencies are attached below.
Thanks in advance,
*************************** APPLICATION FAILED TO START ***************************
Description:
An attempt was made to call the method org.elasticsearch.common.logging.Loggers.getLogger(Ljava/lang/String;)Lorg/apache/logging/log4j/Logger; but it does not exist. Its class, org.elasticsearch.common.logging.Loggers, is available from the following locations:
jar:file:/C:/Users/Sudhakar/.m2/repository/org/elasticsearch/elasticsearch/6.6.2/elasticsearch-6.6.2.jar!/org/elasticsearch/common/logging/Loggers.class
It was loaded from the following location:
file:/C:/Users/Sudhakar/.m2/repository/org/elasticsearch/elasticsearch/6.6.2/elasticsearch-6.6.2.jar
Action:
Correct the classpath of your application so that it contains a single, compatible version of org.elasticsearch.common.logging.Loggers
Process finished with exit code 1
Maven dependencies:
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-to-slf4j</artifactId>
    <version>2.11.1</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.24</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>6.6.2</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>6.6.2</version>
</dependency>
Spring Boot uses Elasticsearch 6.4 by default. By using 6.6.2 as the version for only two of the Elasticsearch modules, you have ended up with a mixture of the two versions. You should remove the <version> configuration from your pom. If you are able to use Spring Boot's default version, there's nothing more to do. If you need to use 6.6.2, you should add an entry to your pom's <properties>:
<elasticsearch.version>6.6.2</elasticsearch.version>
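Once the versions are aligned, a quick sanity check is to print the Elasticsearch version that actually ends up on the classpath (a minimal sketch; org.elasticsearch.Version.CURRENT comes from the core elasticsearch jar):
import org.elasticsearch.Version;

public class EsVersionCheck {
    public static void main(String[] args) {
        // Should print the single version (e.g. 6.6.2) that both the core jar
        // and the transport client were resolved to.
        System.out.println("Elasticsearch on classpath: " + Version.CURRENT);
    }
}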

Maven fails to download CoreNLP models

When building the sample application from the Stanford CoreNLP website, I ran into a curious exception:
Exception in thread "main" java.lang.RuntimeException: edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
at edu.stanford.nlp.pipeline.StanfordCoreNLP$4.create(StanfordCoreNLP.java:493)
…
Caused by: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
…
This only happened when the property pos and the ones after it were included in the properties.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Here is the dependency from my pom.xml:
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.2.0</version>
        <scope>compile</scope>
    </dependency>
</dependencies>
I actually found the answer to this in the problem description of another question on Stack Overflow.
Quoting W.P. McNeill:
Maven does not download the model files automatically, but only if you add a models line to the .pom. Here is a .pom snippet that fetches both the code and the models.
Here's what my dependencies look like now:
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.2.0</version>
        <classifier>models</classifier>
    </dependency>
</dependencies>
The important part to note is the <classifier>models</classifier> entry at the bottom. In order for Eclipse to maintain both references, you'll need to configure a dependency for each of stanford-corenlp-3.2.0 and stanford-corenlp-3.2.0-models.
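With both artifacts on the classpath, the pipeline should now find the tagger model; here is a minimal usage sketch (the annotator list and sentence are made up for illustration):
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CoreNlpSmokeTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "pos" is the annotator that previously failed to load its model.
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("Stanford CoreNLP loads its models from the classpath.");
        pipeline.annotate(document);
        System.out.println("Pipeline ran without RuntimeIOException, so the models were found.");
    }
}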
In case you need to use the models for other languages (like Chinese, Spanish, or Arabic) you can add the following piece to your pom.xml file (replace models-chinese with models-spanish or models-arabic for these two languages, respectively):
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models-chinese</classifier>
</dependency>
With Gradle apparently you can use:
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2'
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2:models'
or, if you use compile (deprecated):
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2'
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2', classifier: 'models'
