Flume - TwitterSource language filter - hadoop

I would like to ask your help in the following case.
I'm currently using Cloudera CDH 5.1.2, and I tried to collect Twitter data using Flume as described in the following posts (Cloudera):
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
github.com/cloudera/cdh-twitter-example
I downloaded the source and rebuilt the flume-sources after updating the versions in pom.xml:
<flume.version>1.5.0-cdh5.1.2</flume.version>
<hadoop.version>2.3.0-cdh5.1.2</hadoop.version>
It worked perfectly.
After that I wanted to add a "language" filter to capture only tweets in a specific language. For this, I modified TwitterSource.java to call the FilterQuery.language method, something like this:
FilterQuery query = new FilterQuery();
...
if (languages.length != 0) {
    query.language(languages);
}
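For completeness, here is a simplified, self-contained sketch of the query construction I ended up with (the keyword handling and the class/method names around it are illustrative, not the exact Cloudera source):

import twitter4j.FilterQuery;

public class LanguageFilterSketch {

    // Builds the streaming filter. FilterQuery.language exists as of the
    // twitter4j 3.0.x line; keyword handling is included for illustration.
    public static FilterQuery buildQuery(String[] keywords, String[] languages) {
        FilterQuery query = new FilterQuery();
        if (keywords.length != 0) {
            query.track(keywords); // only tweets matching these keywords
        }
        if (languages.length != 0) {
            query.language(languages); // only tweets in these languages
        }
        return query;
    }
}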
I'm trying to use twitter4j-stream version 3.0.6. I updated it in pom.xml:
<!-- For the Twitter API -->
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>3.0.6</version>
</dependency>
With these settings I rebuilt the jar (mvn package).
When I start my agent, I get the following exception (NoSuchMethodError):
Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.language([Ljava/lang/String;)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:165)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I checked, and this version of twitter4j-stream contains the language method:
github.com/yusuke/twitter4j/blob/3.0.6/twitter4j-stream/src/main/java/twitter4j/FilterQuery.java
What am I doing wrong?
Thanks in advance,
Peter

Finally I managed to solve this problem, so here's the solution for anyone out there facing the same issue.
First (as described in the original post above), I placed my generated jar in /var/lib/flume-ng/plugins.d/twitter-streaming/lib/ and configured Cloudera Manager to use this location.
In this case, CM placed this directory at the end of the classpath in the runner file (after the parcel directory), so the directory order in the classpath looked like this:
/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/flume-ng/lib/*
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/*
Unfortunately, the parcel directory contained a twitter4j-stream-3.0.3.jar and a twitter4j-core-3.0.3.jar, and Flume tried to use those instead of 3.0.6; in that version FilterQuery.language obviously doesn't exist.
So I just deleted those jars from the parcel directory, and it works fine now.
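As a side note: if you want to verify which jar a class is actually loaded from before deleting anything, a small diagnostic like this (my own sketch, not part of the Flume source) prints the origin of FilterQuery:

import twitter4j.FilterQuery;

public class WhichJar {
    public static void main(String[] args) {
        // Prints the jar that the currently visible FilterQuery class was
        // loaded from; with the parcel jars present, this would show 3.0.3.
        System.out.println(
                FilterQuery.class.getProtectionDomain().getCodeSource().getLocation());
    }
}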

I tried this with CDH3 and it worked fine for me. One thing I noticed was that the system time should be set to the current time. In your case, I think it is failing to find the language method in the FilterQuery class.

Related

Grails 3 (Gradle) dependency without parent directory

Can I not use the following Gradle approach to dependencies in Grails? I do not have, nor do I want, a parent directory:
https://stackoverflow.com/a/19303545/2288004
When I try it, I get the following error:
Caused by: java.lang.IllegalStateException: Expected method not found:
java.lang.NoSuchMethodException:
org.springframework.boot.context.embedded.tomcat.TomcatEmbeddedContext.addApplicationListener(org.apache.catalina.deploy.ApplicationListener)
It works when I use a parent directory for the settings.gradle, but unfortunately it’s not how I want to structure the project.
The following was indeed the solution I was looking for,
include ":myplugin"
project(':myplugin').projectDir = new File(settingsDir, '../myplugin')
The error was down to how I was managing my Tomcat dependencies between the two projects.
Tomcat was already being pulled in via the plugin, but since I still needed to reference Tomcat at compile time in the application, I also had to make sure it was the same version, so I added the following just above "dependencies" to target the version I required:
ext['tomcat.version'] = '7.0.70'
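As I understand it, Grails 3 builds on Spring Boot, and the tomcat.version property is picked up by Spring Boot's dependency management, so the plugin and the application end up resolving the same Tomcat version.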

Running spark job using Yarn giving error:com.google.common.util.concurrent.Futures.withFallback

I am trying to run a Spark job using YARN, but I am getting the error below:
java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(Lcom/google/common/util/concurrent/ListenableFuture;Lcom/google/common/util/concurrent/FutureFallback;Ljava/util/concurrent/Executor;)Lcom/google/common/util/concurrent/ListenableFuture;
at com.datastax.driver.core.Connection.initAsync(Connection.java:176)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:721)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:248)
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:194)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:82)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1307)
at com.datastax.driver.core.Cluster.init(Cluster.java:159)
at com.datastax.driver.core.Cluster.connect(Cluster.java:249)
at com.figmd.processor.ProblemDataloader$ParseJson.call(ProblemDataloader.java:46)
at com.figmd.processor.ProblemDataloader$ParseJson.call(ProblemDataloader.java:34)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Cluster details:
Spark 1.2.1, Hadoop 2.7.1
I have provided the classpath using spark.driver.extraClassPath, and the hadoop user has access to that classpath as well. But I think YARN is not picking up the JARs on that classpath.
I am not able to get to the root cause of it. Any help will be appreciated.
Thanks.
I faced the same problem, and the solution was to shade Guava to avoid the classpath collision.
If you're using sbt assembly to build your jar, you can just add this to your build.sbt:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.**" -> "shadeio.#1").inAll
)
I wrote a blog post which describes my process to arrive to this solution: Making Hadoop 2.6 + Spark-Cassandra Driver Play Nice Together.
Hope it helps!
The issue is related to a Guava version mismatch.
withFallback was added in version 14 of Guava; it looks like you have Guava < 14 on your classpath.
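If you want to verify which Guava actually ends up on the classpath at runtime, a small check like this (a sketch, not part of the Spark job itself) prints where Futures is loaded from and whether withFallback is present:

import java.lang.reflect.Method;

import com.google.common.util.concurrent.Futures;

public class GuavaCheck {
    public static void main(String[] args) {
        // Which jar does Guava come from at runtime?
        System.out.println(
                Futures.class.getProtectionDomain().getCodeSource().getLocation());
        // withFallback only exists on Guava 14+, so on older versions
        // this loop prints nothing.
        for (Method m : Futures.class.getMethods()) {
            if (m.getName().equals("withFallback")) {
                System.out.println(m);
            }
        }
    }
}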
Adding to @arjones' answer: if you are using Gradle + the Gradle Shadow plugin, you can add this to your build.gradle to relocate (rename) the Guava classes.
shadowJar {
    relocate 'com.google.common', 'com.example.com.google.common'
}
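After relocation, the shaded jar's copy of the Cassandra driver references the renamed Guava packages, so the older Guava that Hadoop and Spark put on the classpath can no longer shadow the version the driver needs.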

cleartk dependency not found when calling StanfordCoreNLPAnnotator from UIMA RUTA

I am trying to call ClearTK's StanfordCoreNLPAnnotator from within UIMA RUTA, but cannot get it to work. I am using Eclipse with a Maven-enabled RUTA project in which I also have Java code for auxiliary tasks. I have imported cleartk-stanford-corenlp 0.8 using Maven.
I tried using this line in my script:
ENGINE utils.MyStanfordEngine;
... where utils/MyStanfordEngine.xml is an XML descriptor file created using this Java code:
MyStanfordAnnotator.getDescription().toXML(new FileOutputStream("descriptor/utils/MyStanfordEngine.xml"));
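MyStanfordAnnotator is my own thin wrapper; roughly, the descriptor generation looks like this (a sketch, assuming the ClearTK annotator's usual static getDescription() helper):

import java.io.FileOutputStream;

import org.cleartk.stanford.StanfordCoreNLPAnnotator;

public class GenerateDescriptor {
    public static void main(String[] args) throws Exception {
        // Writes the uimaFIT-generated analysis engine descriptor to the
        // location the RUTA script imports via "ENGINE utils.MyStanfordEngine;".
        StanfordCoreNLPAnnotator.getDescription()
                .toXML(new FileOutputStream("descriptor/utils/MyStanfordEngine.xml"));
    }
}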
No errors appear, but upon execution I get:
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class ... failed.
(Descriptor: file:.../descriptor/mainScriptEngine.xml)
...
Caused by: org.apache.uima.resource.ResourceInitializationException: Annotator class
"org.cleartk.stanford.StanfordCoreNLPAnnotator" was not found.
(Descriptor: file:.../descriptor/utils/MyStanfordEngine.xml)
...
I think I understand that the RUTA project does not find it in the Maven dependencies, but I need to stick with Maven as my dependency tool for collaboration purposes.
Can someone help?
UPDATE:
When I encountered the problem, I was using RUTA 2.1.0. I have updated to 2.2.0rc1 since then, but the problem persisted.
With Peter's suggestion below (thanks!), I referenced, in the Java build path, a blank Maven-enabled Java project that does nothing but import cleartk-stanford-corenlp 0.8. I can now run the following RUTA code:
TYPESYSTEM utils.CleartkRutaTypeSystem;
ENGINE utils.MyStanfordEngine;
Document{-> CALL(MyStanfordEngine)};
... which successfully performs what looks like all intended annotations for all documents in the input folder, but eventually crashes with this exception:
[Stanford Tools Logging output ...]
22.02.2014 12:44:22 org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(406)
SCHWERWIEGEND: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
at org.apache.uima.ruta.engine.RutaEngine.process(RutaEngine.java:477)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:374)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.apache.uima.ruta.ide.launching.RutaLauncher.processFile(RutaLauncher.java:168)
at org.apache.uima.ruta.ide.launching.RutaLauncher.main(RutaLauncher.java:129)
Caused by: java.lang.NullPointerException
at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:483)
at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
at org.apache.uima.ruta.action.CallAction.callEngine(CallAction.java:192)
at org.apache.uima.ruta.action.CallAction.execute(CallAction.java:62)
at org.apache.uima.ruta.rule.AbstractRuleElement.apply(AbstractRuleElement.java:130)
at org.apache.uima.ruta.rule.RuleElementCaretaker.applyRuleElements(RuleElementCaretaker.java:111)
at org.apache.uima.ruta.rule.ComposedRuleElement.applyRuleElements(ComposedRuleElement.java:547)
at org.apache.uima.ruta.rule.AbstractRuleElement.doneMatching(AbstractRuleElement.java:84)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallback(ComposedRuleElement.java:468)
at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:377)
at org.apache.uima.ruta.rule.RutaRuleElement.startMatch(RutaRuleElement.java:100)
at org.apache.uima.ruta.rule.ComposedRuleElement.startMatch(ComposedRuleElement.java:73)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:47)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:40)
at org.apache.uima.ruta.rule.RutaRule.apply(RutaRule.java:29)
at org.apache.uima.ruta.RutaScriptBlock.apply(RutaScriptBlock.java:63)
at org.apache.uima.ruta.RutaModule.apply(RutaModule.java:48)
at org.apache.uima.ruta.engine.RutaEngine.process(RutaEngine.java:475)
... 6 more
Exception in thread "main" org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
[... followed by the same stack trace as above ...]
Sorry for the whole stack trace, but I thought if a RUTA developer is reading this they may want the whole thing.
Is there a way to solve this? What am I doing wrong?
There are several limitations to consider:
UIMA Ruta 2.1.0 does not support mixin projects: Maven dependencies need to be specified in another project, and the Ruta project then has to depend on that additional Java project.
UIMA Ruta Workbench 2.1.0 has some problems validating imported type systems that in turn import other type systems by name. Here, import by location should be used instead.
UIMA CAS Editor 2.5.0 has some problems resolving type system imports using the datapath, which causes problems visualizing the created annotations if the type system descriptor needs additional information such as the datapath. Here, the generated type system descriptor of a script should include (not only import) all types of the imported type systems; this can be configured in the preferences (I have not used that for a while). This problem can again be avoided by using import by location.
UIMA Ruta 2.2.0 supports mixin projects. Here, only the problem with the CAS Editor remains.
This described project can be created the following way (with UIMA Ruta 2.2.0):
Create a new UIMA Ruta Project
Make it a Maven project: popup->Configure->Convert to Maven Project
Add a dependency on cleartk-stanford-corenlp in the pom:
<dependency>
    <groupId>org.cleartk</groupId>
    <artifactId>cleartk-stanford-corenlp</artifactId>
    <version>0.8.0</version>
</dependency>
Provide the type systems in the descriptor folder or in a dependent project, e.g., copy the org folder of cleartk-type-system-1.2.0 to the descriptor folder. Mind that the CAS Editor will have problems resolving the imports if the descriptors are not adapted.
Create a simple script that imports the type system, imports the analysis engine, and executes the analysis engine. Here, the uimaFIT component is imported directly instead of a descriptor. The EXEC action needs to be extended with the types of interest if later rules should be able to operate on the result of the imported analysis engine (see the example after the script below).
TYPESYSTEM org.cleartk.TypeSystem;
UIMAFIT org.cleartk.stanford.StanfordCoreNLPAnnotator;
Document{->EXEC(StanfordCoreNLPAnnotator)};
If there is a text file in the input folder, then running this script should annotate it.
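For example, to make the results available to later rules, the types of interest can be listed in the EXEC action (the type names here are assumptions based on the ClearTK type system):
Document{->EXEC(StanfordCoreNLPAnnotator, {Sentence, Token})};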
This example directly uses the StanfordCoreNLPAnnotator instead of an additional analysis engine, but switching to another implementation or analysis engine should be straightforward.

Using elasticsearch Java API

I am very new to elasticsearch. I found some simple Java code for using elasticsearch:
import static org.elasticsearch.node.NodeBuilder.*;

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

// on startup
Node node = nodeBuilder().node();
Client client = node.client();

// on shutdown
node.close();
I am getting the following error:
package org.elasticsearch.node doesn't exist
Later, I found that I have to put some information in pom.xml. What is that? How do I make this simple program run?
Do you know what Maven is?
I mean that if you are using Maven, then you need to add elasticsearch-VERSION.jar as a dependency in your pom.xml.
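For example (a sketch; use the version that matches your setup, 1.0.0 here is just an illustration):
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.0.0</version>
</dependency>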
If not, then you need to add the elasticsearch jar to your project classpath, along with some other libs such as these (which ones depend on the elasticsearch version you are using):
antlr-runtime-3.5.jar
asm-4.1.jar
asm-commons-4.1.jar
jna-3.3.0.jar
jts-1.12.jar
log4j-1.2.17.jar
lucene-analyzers-common-4.6.0.jar
lucene-codecs-4.6.0.jar
lucene-core-4.6.0.jar
lucene-expressions-4.6.0.jar
lucene-grouping-4.6.0.jar
lucene-highlighter-4.6.0.jar
lucene-join-4.6.0.jar
lucene-memory-4.6.0.jar
lucene-misc-4.6.0.jar
lucene-queries-4.6.0.jar
lucene-queryparser-4.6.0.jar
lucene-sandbox-4.6.0.jar
lucene-spatial-4.6.0.jar
lucene-suggest-4.6.0.jar
spatial4j-0.3.jar
I'd recommend using Maven because it makes dealing with dependencies much easier.
Hope this helps.

JDOM + Jaxen + Websphere 7 = java.lang.NoClassDefFoundError: org.jaxen.BaseXPath

I would like to use JDOM in a webapp project, and this works just fine. But now I want to add some functionality using XPath, and when I try to work with an XPath, I just get an exception:
com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: Uncaught exception created in one of the service methods of the servlet MyServlet in application MyProjectEAR. Exception created : java.lang.NoClassDefFoundError: org.jaxen.BaseXPath
at java.lang.J9VMInternals.verifyImpl(Native Method)
at java.lang.J9VMInternals.verify(J9VMInternals.java:72)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:134)
at java.lang.Class.forNameImpl(Native Method)
at java.lang.Class.forName(Class.java:136)
at org.jdom.xpath.XPath.newInstance(XPath.java:126)
at org.jdom.xpath.XPath.selectNodes(XPath.java:337)
[..]
Caused by: java.lang.ClassNotFoundException: org.jaxen.BaseXPath
at java.net.URLClassLoader.findClass(URLClassLoader.java:421)
at com.ibm.ws.bootstrap.ExtClassLoader.findClass(ExtClassLoader.java:150)
at java.lang.ClassLoader.loadClass(ClassLoader.java:652)
at com.ibm.ws.bootstrap.ExtClassLoader.loadClass(ExtClassLoader.java:90)
at java.lang.ClassLoader.loadClass(ClassLoader.java:618)
at com.ibm.ws.classloader.ProtectionClassLoader.loadClass(ProtectionClassLoader.java:62)
at com.ibm.ws.classloader.ProtectionClassLoader.loadClass(ProtectionClassLoader.java:58)
at com.ibm.ws.classloader.CompoundClassLoader.loadClass(CompoundClassLoader.java:540)
at java.lang.ClassLoader.loadClass(ClassLoader.java:618)
... 35 more
The jaxen.jar is in my classpath, and the org.jaxen.BaseXPath class is there just fine. Why is WebSphere not finding it? It works with all the other libraries I have there. When googling I found this, where someone says that he has a conflicting version somewhere and that I should make sure the jars from my web app directory take precedence. In Eclipse's Build Path configuration I set the Web App Libraries above the WebSphere library (only the src dir is now above the web app libs), but that did not change anything. Unfortunately, I did not really understand the part about the EAR, which seems important...?
Update: In the meantime this gave me a new clue. In WebSphere's Administration Console I found the classpath and a list of all jars that are considered by the class loaders. There are quite a number of them, and I searched through them with a little grep and unzip -l magic and figured out that the file /opt/ibm/WebSphere/PortalServer/wcm/prereq.wcm/wcm/shared/app/jdom.jar contains JDOM (without the Jaxen stuff). So maybe this jdom.jar is loaded, but Jaxen, in an incompatible version, is loaded from my lib directory?
Additionally I found in WebSphere's Administration Console the "parent first/last" setting for my application, but everything is grayed out! I can't switch to parent last :-(.
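One thing I could do to test this theory is to run a quick probe from inside the servlet, so that the same class loaders apply (a sketch I put together; only the class names from the stack trace are given):

public final class LoaderProbe {

    // Call this from the servlet so that WebSphere's class loaders apply.
    public static void probe() throws ClassNotFoundException {
        // Which loader and jar serve the JDOM XPath class?
        Class<?> jdom = Class.forName("org.jdom.xpath.XPath");
        System.out.println(jdom.getClassLoader() + " <- "
                + jdom.getProtectionDomain().getCodeSource().getLocation());
        try {
            Class.forName("org.jaxen.BaseXPath", true, jdom.getClassLoader());
            System.out.println("Jaxen is visible from JDOM's loader");
        } catch (ClassNotFoundException e) {
            // If we end up here, the PortalServer jdom.jar is loaded by a parent
            // loader that cannot see the webapp's jaxen.jar, which would explain
            // the NoClassDefFoundError during verification.
            System.out.println("Jaxen is NOT visible from JDOM's loader");
        }
    }
}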
What can I do to find and fix the problem?
