How to reduce nodes using Hbase-Indexer-mr - hadoop

I have a question about how to reduce nodes when I am running hbase-indexer-mr-*-job.jar.
I am using Cloudera Manager 5.2
Basically, I have 9 servers. cys-master, cys-slave1, cys-slave2, cys-slave3, cys-slave4, cys-slave5, cys-slave6, cys-slave7, cys-slave8.
Hbase tables are stored in these 9 servers.
However, I only installed Solr in cys-master, cys-slave1, cys-slave2, cys-slave3, cys-slave4.
In cys-master, when I ran
hadoop jar hbase-indexer-mr-*-job.jar
--hbase-indexer-zk localhost
--hbase-indexer-name myindexer
--reducers 0
There are about 15 map tasks in hadoop.
cys-slave1, cys-slave2, cys-slave3 and cys-slave4 could successfully upload indexers to Solr.
cys-slave5, cys-slave6, cys-slave7, cys-slave8 got error messages: "Connection refused". The whole process will fail eventually.
My question is: How could I exclude cys-slave5, cys-slave6, cys-slave7 and cys-slave8?

Related

Elasticsearch indexing fails after successful Nutch crawl

I'm not sure why but Nutch 1.13 is failing to index the data to ES (v2.3.3). It is crawling, that is fine, but when it comes time to index to ES its giving me this error message:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Right before that is has this:
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
I'm not sure exactly if the timeout has anything to do with the job failing?
I've run Nutch v1.10 many times with no problems but decided to upgrade now. Never had this error before until now, with upgrading.
EDIT:
After closer inspection of the error message:
Error running:
/home/david/tutorials/nutch/nutch-1.13/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20170519125546
It seems to be failing there, on that particular segment, what does that mean? I only know the basics of how to use Nutch, I'm by no means an expert. Is it failing on a link?
Until Nutch 1.14 is out, you need to apply this patch https://github.com/apache/nutch/pull/156 and rebuild:
cd apache-nutch-1.13
wget https://raw.githubusercontent.com/apache/nutch/e040ace189aa0379b998c8852a09c1a1a2308d82/src/java/org/apache/nutch/indexer/CleaningJob.java
mv CleaningJob.java src/java/org/apache/nutch/indexer/.

Spark (shell), Cassandra : Hello, World?

How do you get a basic Hello, world! example running in Spark with Cassandra? So far, we've found this helpful answer:
How to load Spark Cassandra Connector in the shell?
Which works perfectly!
Then we attempt to follow to documentation and the getting started example:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
It says to do this:
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(conf).withSessionDo { session =>
session.execute("CREATE KEYSPACE test2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
session.execute("CREATE TABLE test2.words (word text PRIMARY KEY, count int)")
}
But it says we don't have com.datastax.spark.connector.cql?
Btw, we got the Spark connector from here:
Maven Central Repository (spark-cassandra-connector-java_2.11)
So how do you get to the point where you can create a keyspace, a table and insert rows after you have Spark and Cassandra running locally?
The jar you downloaded only has the Java api so it won't work with the Scala Spark Shell. I recommend you follow the instructions on the Spark Cassandra Connector page.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/13_spark_shell.md
These instructions will have you build the full assembly jar with all the dependencies and add it to the Spark Shell classpath using --jars.

hbase ExportSnapshot from 0.94 to 0.98

We have two clusters with hbase 0.94, hadoop 1.04 and hbase 0.98, hadoop 2.4
I've created a snapshot from table on 0.94 snapshot and want to migrate it to cluster with hbase 0.98.
After run this command on 0.98 cluster:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot-name -copy-from webhdfs://hadoops-master:9000/hbase -copy-to hdfs://solr1:8020/hbase
I see:
Exception in thread "main" org.apache.hadoop.hbase.snapshot.ExportSnapshotException: Failed to copy the snapshot directory: from=webhdfs://hadoops-master:9000/hbase/.hbase-snapshot/snapshot-name to=hdfs://solr1:8020/hbase/.hbase-snapshot/.tmp/snapshot-name
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:916)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.innerMain(ExportSnapshot.java:1000)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1004)
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:772)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.getResponse(WebHdfsFileSystem.java:596)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.run(WebHdfsFileSystem.java:530)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:417)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:630)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:641)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:914)
... 3 more
Lars Hofhansl himself (a principal HBase committer and the maintainer of 0.94) has stated that exports are not supported from 0.94 and 0.98. So you are likely going nowehere with this.
Here is an excerpt from his thread just this afternoon:
It's tricky from multiple angles:- replication between 0.94 and 0.98 does not work (there's a gateway process that supposedly does that, but it's not known to be very reliable)- snapshots cannot be exported from 0.94 and 0.98
source: hbase user's mailing list today 12/15/14
UPDATE On the HBase mailing list there was a report that one user was able to find a way to do the export. A piece of that info is:
If exporting from an HBase 0.94 cluster to an HBase 0.98 cluster, you will need to use the webhdfs protocol (or possibly hftp, though I couldn’t get that to work). You also need to manually move some files around because snapshot layouts have changed. Based on the example above, on the 0.98 cluster do the following:
check whether any imports already exist for the table:
hadoop fs -ls /apps/hbase/data/archive/data/default
It would not be correct for me to reproduce the entire discussion here: a link to nabble that contains the entire gory details is:
http://apache-hbase.679495.n3.nabble.com/0-94-going-forward-td4066883.html
I am digging into this issue and it has more to do with the underlying HDFS.
Once the streams (in my case for distcp) have been written, close is called:
public void close() throws IOException {
try {
super.close();
} finally {
try {
validateResponse(op, conn, true);
} finally {
conn.disconnect();
}
}
}
Wherein it fails in the validate response call (probably the other end of connection has been closed).
This could be due to incompatibility between HDFS 1.0 and 2.4!
I've repacked the hadoop and hbase libs with jarjar https://code.google.com/p/jarjar/
It needs to fix some class name issues.
Then I've write a mapreduce copyTable job. It reads rows from 94 culster and writes to 98 cluster.
Here is the code:
https://github.com/fiserro/copy-table-94to98
Thanks github.com/falsecz for the idea and help!

pig + hbase + hadoop2 integration

has anyone had successful experience loading data to hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 in an environment of hadoop-2.20+hbase-0.98.0+pig-0.12.0 combination without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions but all of them refer to pre-hadoop2 and base-0.94-x which were not applicable to my situation.
I have a 5 node hadoop-2.2.0 cluster and a 3 node hbase-0.98.0 cluster and a client machine installed with hadoop-2.2.0, base-0.98.0, pig-0.12.0. Each of them functioned fine separately and I got hdfs, map reduce, region servers , pig all worked fine. To complete an "loading data to base from pig" example, i have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar
:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
and when i tried to run : pig -x local -f loaddata.pig
and boom, the following error:ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this should be the 100+ times i got it dying countless tries to figure out a working setting).
the trace log shows:lava.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
the following is my pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created a base table 'weather'.
Has anyone had successful experience and be generous to share with us?
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against hbase 0.94. 94 and 95 are the only options.
If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArray, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig

HPCC/HDFS Connector

Does anyone know about HPCC/HDFS connector.we are using both HPCC and HADOOP.There is one utility(HPCC/HDFS connector) developed by HPCC which allows HPCC cluster to acess HDFS data
i have installed the connector but when i run the program to acess data from hdfs it gives error as libhdfs.so.0 doesn't exist.
I tried to build libhdfs.so using command
ant compile-libhdfs -Dlibhdfs=1
its giving me error as
target "compile-libhdfs" does not exist in the project "hadoop"
i used one more command
ant compile-c++-libhdfs -Dlibhdfs=1
its giving error as
ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
[get] Error getting http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
to /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
BUILD FAILED java.net.ConnectException: Connection timed out
any suggestion will be a great help
Chhaya, you might not need to build libhdfs.so, depending on how you installed hadoop, you might already have it.
Check in HADOOP_LOCATION/c++/Linux-<arch>/lib/libhdfs.so, where HADOOP_LOCATION is your hadoop install location, and arch is the machine’s architecture (i386-32 or amd64-64).
Once you locate the lib, make sure the H2H connector is configured correctly (see page 4 here).
It's just a matter of updating the HADOOP_LOCATION var in the config file:
/opt/HPCCSystems/hdfsconnector.conf
good luck.

Resources