Error calling Solr Cloud Index within Hadoop Job - hadoop

My goal is to run an Elastic MapReduce job that queries a Solr index in the map phase and writes the result to S3. Solr and Hadoop worked fine together when building a Solr index within a Hadoop job (i.e. writing to the Solr index). When I run a job that queries a Solr index, I get an error when trying to initiate the Solr client. I suspect there is a dependency conflict between Hadoop and Solr: I recall they use different versions of the HTTP client libraries, and the error is a method-not-found issue. Here's the stack trace:
2013-07-24 03:17:47,082 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
at org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:118)
at org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
at org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:179)
at org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:33)
at org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:115)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:105)
at org.apache.solr.client.solrj.impl.HttpSolrServer.<init>(HttpSolrServer.java:154)
at org.apache.solr.client.solrj.impl.HttpSolrServer.<init>(HttpSolrServer.java:127)
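For context, a minimal sketch of the kind of mapper that triggers this (the class name, Solr URL, and key/value types here are placeholders, not the real job): the failure surfaces as soon as SolrJ's HttpSolrServer constructor tries to build its HttpClient.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SolrQueryMapper extends Mapper<Object, Text, Text, Text> {

        private HttpSolrServer solr;

        @Override
        protected void setup(Context context) {
            // The NoSuchMethodError above is thrown here: SolrJ builds an
            // HttpClient and expects a newer httpclient than the version
            // Hadoop places on the task classpath.
            solr = new HttpSolrServer("http://solr-host:8983/solr/collection1");
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // Query Solr with the incoming record and emit the hit count.
                QueryResponse rsp = solr.query(new SolrQuery(value.toString()));
                context.write(value, new Text(String.valueOf(rsp.getResults().getNumFound())));
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
        }
    }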

Adding this option did the trick:
--args -s,mapreduce.user.classpath.first=true
Putting the user-defined classpath first resolved the dependency conflict between the Hadoop and Solr JARs.
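For a job submitted from your own driver rather than through the EMR CLI, a sketch of the equivalent (same property name as the flag above, assuming the Hadoop 2 mapreduce API; the job name is a placeholder) is to set the flag on the configuration before submitting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SolrQueryJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give the jars shipped with the job (e.g. SolrJ and its newer
            // httpclient) precedence over the versions bundled with Hadoop.
            conf.setBoolean("mapreduce.user.classpath.first", true);

            Job job = Job.getInstance(conf, "solr-query-job");
            // ... set mapper, input/output formats and paths here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }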

Related

Prometheus Integration with Hadoop (Ozone Cluster)

I am trying to follow the Apache documentation to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues running the Ozone cluster concurrently with Hadoop: it throws a ClassNotFoundException for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or the Storage Container Manager.
I also found that the Ozone 1.0 release is quite recent and is documented as tested against Hadoop 3.1, while my running Hadoop cluster is version 3.3.0, so I wonder whether the version mismatch is the problem.
The Ozone tarball also ships with its own Hadoop config files, but I want to configure Ozone against my existing Hadoop cluster.
Please let me know the right approach here. If this cannot be done, please also suggest a good way to monitor and extract metrics from Apache Hadoop in production.

ClassNotFoundException in Nifi flow

Hey StackOverflow Community,
I have some problems with my NiFi flow. I built one to take my data from Azure Blob storage and put it into my HDFS cluster (still in Azure).
My configuration of the PutHDFS processor in NiFi:
[PutHDFS configuration screenshot]
But when I fill in the "Hadoop resources" field, I get the following error:
PutHDFS[id=89381b69-015d-1000-deb7-50b6cf485d28] org.apache.hadoop.fs.adl.HdiAdlFileSystem: java.lang.ClassNotFoundException: org.apache.hadoop.fs.adl.HdiAdlFileSystem
PutHDFS[id=89381b69-015d-1000-deb7-50b6cf485d28] PutHDFS[id=89381b69-015d-1000-deb7-50b6cf485d28] failed to invoke #OnScheduled method due to java.lang.RuntimeException: Failed while executing one of processor's OnScheduled task.; processor will not be scheduled to run for 30 seconds: java.lang.RuntimeException: Failed while executing one of processor's OnScheduled task.
How can I resolve this and put my data into my cluster?
Thanks for your answer.
Apache NiFi does not bundle any of the Azure-related libraries; it only bundles the standard Apache Hadoop client, currently 2.7.3 if you are using a recent NiFi release.
You can specify the location of the additional Azure JARs through the PutHDFS processor property called "Additional Classpath Resources".

JDBC to HDFS import using Spring Batch job

I am able to import data from MS SQL to HDFS using JDBCHDFC Spring Batch jobs. But if the container fails, the job does not shift to another container. How do I make the job fault tolerant?
I am using the Spring XD 1.0.1 release.
You don't mention which version of Spring XD you're currently using so I can't verify the exact behavior. However, on a container failure with a batch job running in the current version, the job should be re-deployed to a new eligible container. That being said, it will not restart the job automatically. We are currently looking at options for how to allow a user to specify if they want it restarted (there are scenarios that fall into both camps so we need to allow a user to configure that).

Nutch 2.2.1 and Elasticsearch 0.90.11 NoSuchFieldError: STOP_WORDS_SET

I am trying to integrate Apache Nutch 2.2.1 with Elasticsearch 0.90.11.
I have followed all the tutorials available (although there are not many) and even changed bin/crawl.sh to index with Elasticsearch instead of Solr.
Everything seems to work when I run the script, until Elasticsearch tries to index the crawled data.
I checked hadoop.log in the logs folder under Nutch and found the following errors:
Error injecting constructor, java.lang.NoSuchFieldError: STOP_WORDS_SET
Error injecting constructor, java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.analysis.en.EnglishAnalyzer$DefaultSetHolder
If you managed to get it working I would very much appreciate the help.
Thanks,
Andrei.
Having never used Apache Nutch, but having briefly read about it, I suspect that your inclusion of Elasticsearch is causing a classpath collision with a different version of Lucene that is also on the classpath. Based on its Maven POM, which does not specify Lucene directly, I would suggest only including the Lucene bundled with Elasticsearch, which should be Apache Lucene 4.6.1 for your version.
Duplicated code (differing versions of the same JAR) tends to be the cause of NoClassDefFoundError when you are certain that you have the necessary code. Given that you switched from Solr to Elasticsearch, it would make sense that you left the Solr JARs on your classpath, which would cause the collision at hand. The current release of Solr is 4.7.0, which matches the current Lucene release and would collide with 4.6.1.
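A quick way to check this (a diagnostic sketch of my own, not something Nutch or Elasticsearch ships with) is to print which JAR the colliding Lucene class is actually loaded from:

    import org.apache.lucene.analysis.en.EnglishAnalyzer;

    public class WhichLucene {
        public static void main(String[] args) {
            // Prints the jar that EnglishAnalyzer (the class from the error above)
            // was loaded from, revealing whether an older Lucene jar is shadowing
            // the one bundled with Elasticsearch.
            System.out.println(EnglishAnalyzer.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());
        }
    }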

Pig accessing HBase using Spring Data Hadoop

Has anyone got experience of using Spring Data Hadoop to run a Pig script that connects to HBase using Elephant Bird's HBaseLoader?
I'm new to all of the above, but I need to take some existing Pig scripts that were executed via a shell script and instead wrap them up in a self-contained Java application. Currently the scripts are run from a specific server that has Hadoop, HBase and Pig installed, with config for all of the above in /etc/. Pig has the HBase config on its classpath, so I'm guessing this is how it knows how to connect to HBase.
I want to have all configuration in Spring. Is this possible if I need Pig to connect to HBase? How do I configure HBase such that the Pig script and the Elephant Bird library will know how to connect to it?
