Trying to run a simple Hadoop job, but Hadoop is throwing a NoClassDefFoundError on "org/w3c/dom/Document"
I'm trying to run the basic examples from the "Mahout In Action" book (https://github.com/tdunning/MiA).
I use nearly the same Maven setup, but tooled for Cassandra rather than a file-based data model.
But when I try to run the *-job.jar, it throws a NoClassDefFoundError from the DataStax/Hadoop end.
I'm using version 1.0.5-dse of the driver, as that's the only one that supports the current DSE version of Cassandra (1.2.1), if that helps at all; the issue seems to be deeper, though.
Here is a gist with more info: the Maven file, this brief overview, and the console output.
https://gist.github.com/zmarcantel/8d56ae4378247bc39be4
Thanks
Try dropping the jar file that contains org.w3c.dom.Document into the $DSE/resource/hadoop/lib/ folder as a workaround.
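If you're not sure which jar provides that class in your setup, a quick throwaway check like the one below (a hypothetical class, not part of the job) prints where it is actually being loaded from; a null code source means it comes from the JDK's own bootstrap classpath (rt.jar) rather than from a jar you could copy:

import java.security.CodeSource;

public class WhichJar {
    public static void main(String[] args) throws Exception {
        // Load the class the error complains about and ask where it came from.
        Class<?> c = Class.forName("org.w3c.dom.Document");
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // A null CodeSource means the JDK bootstrap classpath supplied the class,
        // otherwise the location of the jar is printed.
        System.out.println(src == null ? "JDK bootstrap classpath" : src.getLocation());
    }
}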
I'm not too familiar with Kafka, but I'd like to know the best way to read data from Kafka in batches so that I can use the Elasticsearch Bulk API to load the data faster and more reliably.
By the way, I'm using Vert.x for my Kafka consumer.
Thank you,
I can't tell whether this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it, since it did not do everything I wanted and was not easily scalable. The 2.0 version is now quite reliable, and we use it in production at our company, processing/indexing 300M+ events per day.
This is not self-promotion :) - just sharing how we do the same type of work. There may well be other options by now, of course.
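Whatever consumer you end up using, the core pattern is the same: poll a batch of records from Kafka, turn each one into an index request, and send them in a single bulk call. Below is a rough Java sketch of that pattern (not the indexer project's actual code; the Kafka 2.x consumer and Elasticsearch high-level REST client APIs, hosts, and topic/index names are assumptions for illustration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class KafkaToEsBulk {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("group.id", "es-indexer");
        props.put("enable.auto.commit", "false");            // commit manually, after the bulk succeeds
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            consumer.subscribe(Collections.singletonList("events"));   // topic name is an assumption
            while (true) {
                // One poll returns at most max.poll.records messages; that is the batch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                BulkRequest bulk = new BulkRequest();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the Kafka message value is already a JSON document.
                    bulk.add(new IndexRequest("events-index")
                            .source(record.value(), XContentType.JSON));
                }
                BulkResponse response = es.bulk(bulk, RequestOptions.DEFAULT);
                if (!response.hasFailures()) {
                    // Commit offsets only after Elasticsearch accepted the whole batch.
                    consumer.commitSync();
                }
            }
        }
    }
}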
https://github.com/confluentinc/kafka-connect-elasticsearch
Or you can try this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard JAR:
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties - update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml
   - specify the directory you want to store logs in
   - adjust the max sizes and number of log files as needed
4. Build the app jar (make sure you have Maven installed):
   cd $INDEXER_HOME
   mvn clean package
   The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies are placed into $INDEXER_HOME/bin/lib, and all JAR dependencies are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script:
   - make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh)
   - update the properties marked with "CHANGE FOR YOUR ENV" comments according to your environment
6. Run the app [use JDK 1.8]:
   ./run_indexer.sh
I used Spark Streaming, and it was quite a simple implementation in Scala.
I'm trying to create a custom filter on HBase 0.98.1 in standalone mode on Ubuntu 14.04.
I created a class extending FilterBase and put the jar in HBASE_HOME/lib. Looking in the logs, I can see that my jar is on the path.
Then I have a Java client that first does a get with a ColumnPrefixFilter, and then a get with my custom filter. The ColumnPrefixFilter works perfectly fine. With my custom filter, nothing happens: the client freezes for 10 minutes and then closes the connection.
I don't see anything in the logs.
Could you please give me some hints on what to check, and where?
regards,
EDIT:
It turned out to be a protoc version conflict. I generated the Java classes from the .proto file with protoc 2.4.0, but in my filter I was using protobuf-java 2.5.0.
I aligned everything to 2.5.0 and it's now working fine.
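For anyone else debugging this: a custom filter used from a remote client needs a working toByteArray()/static parseFrom() pair, since that is how the filter is shipped to and rebuilt on the RegionServer, and the protobuf code usually sitting behind those two methods is exactly what a protoc/protobuf-java mismatch breaks. Here is a minimal sketch of the overall shape (a hypothetical QualifierSuffixFilter, not the filter from the question; it avoids generated protobuf classes by serializing its single field directly):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.exceptions.DeserializationException;
import org.apache.hadoop.hbase.filter.FilterBase;

public class QualifierSuffixFilter extends FilterBase {

    private final byte[] suffix;

    public QualifierSuffixFilter(byte[] suffix) {
        this.suffix = suffix;
    }

    // Called for every Cell; keep only cells whose qualifier ends with the suffix.
    @Override
    public ReturnCode filterKeyValue(Cell cell) {
        byte[] qualifier = CellUtil.cloneQualifier(cell);
        return endsWithSuffix(qualifier) ? ReturnCode.INCLUDE : ReturnCode.SKIP;
    }

    private boolean endsWithSuffix(byte[] qualifier) {
        if (qualifier.length < suffix.length) {
            return false;
        }
        int offset = qualifier.length - suffix.length;
        for (int i = 0; i < suffix.length; i++) {
            if (qualifier[offset + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    // The client serializes the filter with toByteArray() and the RegionServer
    // rebuilds it by calling the static parseFrom() via reflection. This sketch
    // ships the raw suffix bytes; filters generated from a .proto file do the
    // same thing through protobuf classes, which is where a protoc/protobuf-java
    // version mismatch silently breaks the filter.
    @Override
    public byte[] toByteArray() {
        return suffix.clone();
    }

    public static QualifierSuffixFilter parseFrom(byte[] bytes) throws DeserializationException {
        return new QualifierSuffixFilter(bytes);
    }
}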
I've uploaded json-serde-1.1.9.2.jar to the blob store with path "/lib/" and added
ADD JAR /lib/json-serde-1.1.9.2.jar
But I am getting
/lib/json-serde-1.1.9.2.jar does not exist
I've tried it without the path, and also with the full URL in the ADD JAR statement, with the same result.
Would really appreciate some help on this, thanks!
If you don't include the scheme, then Hive is going to look on the local filesystem (you can see the code around line 768 of the source).
When you include the URI, make sure you use the full form:
ADD JAR wasb:///lib/json-serde-1.1.9.2.jar
If that still doesn't work, provide your updated command as well as some details about how you are launching the code. Are you RDP'd into the cluster running it via the Hive shell, or running remotely via PowerShell or some other API?
I'm writing my first Avro job, which is meant to take an Avro file and output text. I tried to reverse-engineer it from this example:
https://gist.github.com/chriswhite199/6755242
I am getting the error below though.
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
I looked around and found that it is likely an issue with which jar files are being used. I'm running CDH4 with MR1 and am using the jar files below:
avro-tools-1.7.5.jar
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I can't post code for security reasons, but it shouldn't need anything that isn't used in the example code. I don't have Maven set up yet either, so I can't follow those routes. Is there something else I can try to get around these issues?
Try using Avro 1.7.3; you are hitting the AVRO-1170 bug.
I want to deploy to all four processes on a WebSphere cluster with two nodes. Is there a way of doing this with one Jython command, or do I have to call 'AdminControl.invoke' on each one?
The easiest way to install an application using wsadmin is with AdminApp, not AdminControl.
I suggest you download wsadminlib.py (I got the link from here).
It has a lot of functions; one of them is installApplication, which also works with clusters.
Edit:
Recently I found out about AdminApplication, a script library included in WAS 7 (/opt/IBM/WebSphere/AppServer/scriptLibraries/application/V70).
The documentation in the Info Center is not great, but it's a .py file, so you can look inside to see what it does.
It is imported into wsadmin automatically, so you can use it without any imports or other configuration.
Worth a check.
@aviram-segal is right, wsadminlib is really helpful for this.
I use the following syntax:
arg = ["-reloadEnabled",
       "-reloadInterval '0'",
       "-cell " + self.cellName,
       "-node " + self.nodeName,
       "-server '" + self.serverName + "'",
       "-appname " + name,
       '-MapWebModToVH', [['.*', '.*', self.virtualHost]]]
AdminApp.install(path, arg)
Where path is the location of your EAR/WAR file.
You can find documentation here