Maven fails to download CoreNLP models

When building the sample application from the Stanford CoreNLP website, I ran into a curious exception:
Exception in thread "main" java.lang.RuntimeException: edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
at edu.stanford.nlp.pipeline.StanfordCoreNLP$4.create(StanfordCoreNLP.java:493)
…
Caused by: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
…
This only happened when the pos annotator and the ones after it were included in the annotators property.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Here is the dependency from my pom.xml:
<dependencies>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
<scope>compile</scope>
</dependency>
</dependencies>

I actually found the answer in the problem description of another question on Stack Overflow.
Quoting W.P. McNeill:
Maven does not download the model files automatically, but only if you add a models line to the .pom. Here is a .pom snippet that fetches both the code and the models.
Here's what my dependencies look like now:
<dependencies>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
<classifier>models</classifier>
</dependency>
</dependencies>
The important part to note is the <classifier>models</classifier> entry at the bottom. For Eclipse to maintain both references, you'll need to configure a separate dependency for each of stanford-corenlp-3.2.0 and stanford-corenlp-3.2.0-models.
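To confirm the fix, here is a minimal smoke test (a sketch, not from the original answer) using the CoreNLP API from the question; constructing the pipeline is the point at which the tagger model is loaded from the models jar:
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineSmokeTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only the annotators up to pos are needed to exercise the tagger model.
        props.put("annotators", "tokenize, ssplit, pos");
        // This constructor throws the RuntimeIOException from the question
        // if stanford-corenlp-3.2.0-models.jar is missing from the classpath.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(new Annotation("Stanford CoreNLP loads its models from the classpath."));
        System.out.println("POS tagger model loaded successfully.");
    }
}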

In case you need the models for other languages (like Chinese, Spanish, or Arabic), you can add the following snippet to your pom.xml (replace models-chinese with models-spanish or models-arabic for those languages, respectively):
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.8.0</version>
<classifier>models-chinese</classifier>
</dependency>
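Once the models-chinese artifact is on the classpath, the Chinese pipeline is typically loaded via the StanfordCoreNLP-chinese.properties file shipped inside that jar. A sketch (not from the original answer; the class name is mine):
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChinesePipelineExample {
    public static void main(String[] args) {
        // Reads StanfordCoreNLP-chinese.properties from the models-chinese jar,
        // which points the annotators at the Chinese segmenter and model files.
        StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        pipeline.annotate(new Annotation("今天天气很好。"));
        System.out.println("Chinese models loaded successfully.");
    }
}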

With Gradle apparently you can use:
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2'
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2:models'
or, if you use the deprecated compile configuration:
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2'
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2', classifier: 'models'

Related

Databricks local test fail with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws the following exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(Ljava/lang/String;)V
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2747)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2747)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
From the source code, it comes from this initialization:
spark = SparkSession.builder \
.master("local[2]") \
.appName("Helper Functions Unit Testing") \
.getOrCreate()
I searched for this error and most of the results relate to Maven configuration, adding a hadoop-auth dependency. However, for PySpark I don't know how to deal with it. Does anyone have experience with or insight into this error?
My workaround was to switch to Python 3.7 and change the PySpark version to 3.0, and then it seems OK. So it is related to an inconsistency between the environment and the dependencies.
This is limited to my case; from my searches on the web, most reports are about adding a hadoop-auth jar dependency to the Maven configuration for Hadoop.
I encountered this error for a Maven project written in Scala, not Python. What did it for me was adding not only the hadoop-auth dependency, as the OP mentioned, but also the hadoop-common dependency in my pom file, like so:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>3.1.2</version>
</dependency>
Replace 3.1.2 with whatever version you're using. However, I also found that I had to track down other dependencies that conflicted with hadoop-common and hadoop-auth and add exclusions to them (inside the <dependency> element of each conflicting artifact), like so:
<exclusions>
<exclusion>
<artifactId>hadoop-common</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
<exclusion>
<artifactId>hadoop-auth</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>

Why don't the netlib-java native BLAS/LAPACK libraries give a performance improvement?

I am using this piece of code to calculate Spark recommendations:
SparkSession spark = SparkSession
.builder()
.appName("SomeAppName")
.config("spark.master", "local[" + args[2] + "]")
.config("spark.local.dir",args[4])
.getOrCreate();
JavaRDD<Rating> ratingsRDD = spark
.read().textFile(args[0]).javaRDD()
.map(Rating::parseRating);
Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
ALS als = new ALS()
.setMaxIter(Integer.parseInt(args[3]))
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating").setImplicitPrefs(true);
ALSModel model = als.fit(ratings);
model.setColdStartStrategy("drop");
Dataset<Row> rowDataset = model.recommendForAllUsers(50);
These are the Maven dependencies needed to make this piece of code work:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
Calculating recommendations with this code takes ~70 sec for my data file. It produces the following warnings:
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Now I try to enable netlib-java by adding this dependency in Maven:
<dependency>
<groupId>com.github.fommil.netlib</groupId>
<artifactId>all</artifactId>
<version>1.1.2</version>
<type>pom</type>
</dependency>
To avoid crashes in this new environment, I had to apply this extra trick:
LD_PRELOAD=/usr/lib64/libopenblas.so
Now it also works and gives no warnings, but it is slower: it takes ~170 sec on average to perform the same calculation. I am running this on CentOS.
Shouldn't it be faster with native libraries? Is it possible to make it faster?
First, you can check your CentOS version; CentOS 6 may not be able to use the native libraries.
As far as I know, the ALS algorithm has been improved since version 2.0; you can check the "Highlights in 2.2" release notes.
Looking at the source code from 2.2, the native libraries do not help here.
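Not from the original answer: to see which implementation netlib-java actually picked up at runtime, a quick check (assuming the com.github.fommil.netlib artifacts from the question are on the classpath) is:
public class BlasCheck {
    public static void main(String[] args) {
        // NativeSystemBLAS / NativeRefBLAS mean a native backend was loaded;
        // F2jBLAS / F2jLAPACK mean netlib-java fell back to the pure-Java implementation.
        System.out.println("BLAS:   " + com.github.fommil.netlib.BLAS.getInstance().getClass().getName());
        System.out.println("LAPACK: " + com.github.fommil.netlib.LAPACK.getInstance().getClass().getName());
    }
}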

Selenium-TestNG-Maven - Getting "java.lang.NoClassDefFoundError: org/openqa/selenium/firefox/FirefoxDriver"

This is my first Selenium script using TestNG and Maven.
I created a simple "Hello World" program and a Selenium test that just checks the title of the Google page.
The Selenium code with TestNG is below:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.Test;
public class HelloTest {
@Test
public void testOne() {
//WebDriver d=new FirefoxDriver();
System.setProperty("webdriver.gecko.driver","D:\\Firefox Driver\\geckodriver-v0.17.0-win64\\geckodriver.exe");
WebDriver d=new FirefoxDriver();
d.get("https://www.google.com");
System.out.println("This is first TestNG");
}
}
This works absolutely fine when run through Eclipse (Run As > TestNG Test).
But when it is run through Maven (mvn clean install from the command prompt), I am getting the error below:
T E S T S
-------------------------------------------------------
Running HelloTest
Configuring TestNG with: TestNG652Configurator
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.94 sec <<< FAILURE! - in HelloTest
testOne(HelloTest) Time elapsed: 0.032 sec <<< FAILURE!
java.lang.NoClassDefFoundError: org/openqa/selenium/firefox/FirefoxDriver
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at HelloTest.testOne(HelloTest.java:11)
It shows an error at WebDriver d=new FirefoxDriver();, and I'm not sure where the issue is. I added all the jar files and checked the build path; all the jars were there. Below is my POM file.
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.demo.micky</groupId>
<artifactId>MavenDemoTwo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.9.8</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.12.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-firefox-driver</artifactId>
<version>3.12.0</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-server</artifactId>
<version>3.12.0</version>
</dependency>
<dependency>
<artifactId>guava</artifactId>
<groupId>com.google.guava</groupId>
<type>jar</type>
<version>15.0</version>
</dependency>
</dependencies>
</project>
Any help is deeply appreciated.
What is NoClassDefFoundError?
NoClassDefFoundError in Java occurs when the JVM is not able to find at runtime a particular class that was available at compile time. For example, if we have resolved a method call from a class, or access a static member of a class, and that class is not available at runtime, then the JVM will throw NoClassDefFoundError.
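As a tiny illustration with hypothetical classes (not from the question): Caller compiles against Helper, but if Helper.class is removed from the classpath before Caller is run, the JVM throws NoClassDefFoundError at runtime.
class Helper {
    static String greet() { return "hello"; }
}

public class Caller {
    public static void main(String[] args) {
        // Compile both classes, then delete Helper.class and run: java Caller
        // -> java.lang.NoClassDefFoundError: Helper
        System.out.println(Helper.greet());
    }
}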
The error you are seeing is:
java.lang.NoClassDefFoundError: org/openqa/selenium/firefox/FirefoxDriver
This clearly indicates that the JVM is trying to resolve the FirefoxDriver class at runtime from org/openqa/selenium/firefox/FirefoxDriver, and it is not available.
What went wrong:
This situation occurs when there are multiple sources through which the classes and methods can be resolved (JDK/Maven/Gradle).
From the pom.xml it is pretty clear that you have added multiple dependencies that provide the FirefoxDriver class, as follows:
<artifactId>selenium-java</artifactId>:
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.12.0</version>
<scope>test</scope>
</dependency>
<artifactId>selenium-firefox-driver</artifactId>:
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-firefox-driver</artifactId>
<version>3.12.0</version>
</dependency>
<artifactId>selenium-server</artifactId>:
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-server</artifactId>
<version>3.12.0</version>
</dependency>
Additionally, you have also added all the jar files to the build path.
From all the points mentioned above, it's clear that the related class or methods were resolved from one source at compile time that was not available at run time.
Solution:
Here are a few steps to solve NoClassDefFoundError:
While using a build tool, e.g. Maven or Gradle, remove all the external JARs from the Java Build Path. Maven or Gradle will download all the dependencies mentioned in the configuration file (e.g. pom.xml) to resolve the classes and methods.
If using Selenium JARs within a Java Project add only required External JARs within the Java Build Path and remove the unused and duplicate External JARs.
If you are using FirefoxDriver, use either <artifactId>selenium-java</artifactId> or <artifactId>selenium-server</artifactId>. Avoid using both at the same time.
Remove the unwanted and duplicated dependencies from pom.xml.
Clean your Project Workspace through your IDE and Rebuild your project with required dependencies only.
Use the CCleaner tool to wipe away OS clutter before and after the execution of your test suite.
If your base web client (browser) version is too old, uninstall it through Revo Uninstaller and install a recent GA release of the web client.
Take a System Reboot.
While you execute a Maven project, always perform the following in sequence:
mvn clean
mvn install
mvn test
You can find related discussions in:
Exception in thread “main” java.lang.NoClassDefFoundError: org/openqa/selenium/WebDriver
How to solve java.lang.NoClassDefFoundError? Selenium
java.lang.NoClassDefFoundError: com/google/common/base/Function
I was using Eclipse 2019-09 (the latest at the time) with the latest JDK (13) installed and the latest Selenium jar files (3.141.59). I installed other JDKs as a workaround after trying all the answers to similar questions. Then, after 4 days of trying, I installed the Eclipse Neon version (https://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/neon/3/eclipse-java-neon-3-win32-x86_64.zip&mirror_id=105) and used the Selenium-Java 3.5.2 jar files (https://jar-download.com/artifacts/org.seleniumhq.selenium/selenium-java/3.5.2/source-code), and now it is working perfectly, Alhamdulillah.
I don't know exactly what the root cause of the error was, but it is solved now.
I hope that helps you.

Stanford 3.7 models on Maven?

I am using Stanford 3.7 for NER/RegexNER. I have the following Maven dependencies in my pom:
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.7.0</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.7.0</version>
<classifier>models</classifier>
</dependency>
And I am using the Stanford CoreNLP API, following the documentation:
Properties props = new Properties();
props.put("regexner.mapping", "my_file_name");
props.put("regexner.ignorecase", "true");
props.put("annotators", "tokenize, ssplit, ner, regexner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
However, I still get the RuntimeException error:
java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL
Does anyone know how to use the API with the models from version 3.7? Any advice welcome.
That file is in stanford-corenlp-3.7.0-models.jar, so it might be that when you run your application that jar is somehow not on the CLASSPATH. How are you running this Java code?
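One quick way to verify that (a sketch using the model path from the exception) is to look the model up as a classpath resource from the same JVM that runs your pipeline:
public class ModelResourceCheck {
    public static void main(String[] args) {
        String path = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
        // Returns null if stanford-corenlp-3.7.0-models.jar is not on the runtime classpath.
        java.net.URL url = ModelResourceCheck.class.getClassLoader().getResource(path);
        System.out.println(url == null ? "models jar is NOT on the classpath" : "found: " + url);
    }
}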

TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V

val jobConf = new JobConf(hbaseConf)
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, tablename)
val indataRDD = sc.makeRDD(Array("1,jack,15","2,Lily,16","3,mike,16"))
indataRDD.map(_.split(','))
val rdd = indataRDD.map(_.split(',')).map{arr=>{
val put = new Put(Bytes.toBytes(arr(0).toInt))
put.add(Bytes.toBytes("cf"),Bytes.toBytes("name"),Bytes.toBytes(arr(1)))
put.add(Bytes.toBytes("cf"),Bytes.toBytes("age"),Bytes.toBytes(arr(2).toInt))
(new ImmutableBytesWritable, put)
}}
rdd.saveAsHadoopDataset(jobConf)
When I run Hadoop or Spark jobs, I often hit this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapred.TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V
at org.apache.spark.SparkHadoopWriter.setIDs(SparkHadoopWriter.scala:158)
at org.apache.spark.SparkHadoopWriter.preSetup(SparkHadoopWriter.scala:60)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1188)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
at com.iteblog.App$.main(App.scala:62)
at com.iteblog.App.main(App.scala)
At first I thought it was a jar conflict, but I checked carefully and there are no other jars. The Spark and Hadoop versions are:
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>2.6.0-mr1-cdh5.5.0</version>
I found that TaskID and TaskType are both in the hadoop-core jar, but not in the same package. Why can mapred.TaskID refer to mapreduce.TaskType?
Oh, I have resolved this problem by adding the Maven dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.5.0</version>
</dependency>
and the error disappears!
I have also faced this issue. It is basically due to a jar problem.
Add the spark-core_2.10 jar from Maven:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.2</version>
</dependency>
After changing the jar file, the error went away.
