hbase-indexer solr numFound different from hbase table rows size - hadoop

Recently my team has been using hbase-indexer on CDH to index an hbase table column into solr. We deployed the hbase-indexer server (which is called the Key-Value Store Indexer) and began testing, and we found that the row counts of the hbase table and the solr index are different:
We used Phoenix to count hbase table rows:
0: jdbc:phoenix:slave1,slave2,slave3:2181> SELECT /*+ NO_INDEX */ COUNT(1) FROM C_PICRECORD;
+------------+
| COUNT(1)   |
+------------+
| 4084355    |
+------------+
And we used the Solr Web UI to check the solr index size:
numFound : 4060479
We could not find any errors in the hbase-indexer log or the solr log, but the row counts of the hbase table and the solr index really are different! Has anyone run into this situation? I don't know what to do.
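A minimal sketch of how such a count comparison can be automated (Java, SolrJ 4.x client; the Solr URL and collection name are placeholders, the Phoenix URL and query are the same as above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CountComparison {
    public static void main(String[] args) throws Exception {
        // Count the HBase rows through Phoenix (same query as above).
        long hbaseCount;
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:slave1,slave2,slave3:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT /*+ NO_INDEX */ COUNT(1) FROM C_PICRECORD")) {
            rs.next();
            hbaseCount = rs.getLong(1);
        }
        // Ask Solr for numFound without fetching any documents.
        HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        long solrCount = solr.query(query).getResults().getNumFound();
        solr.shutdown();
        System.out.println("hbase=" + hbaseCount + " solr=" + solrCount + " missing=" + (hbaseCount - solrCount));
    }
}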

My understanding :
HBase row count - Solr row count (numFound) = missing records
4084355 - 4060479 = 23876 (records that are in HBase but missing from Solr)
The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables.
NRT works on incremental data, not the whole dataset.
In my experience, these are the possible reasons:
1) NRT worked initially, but if NRT suddenly stops working (due to some health issue), then there is a possibility of a discrepancy in the numbers.
2) NRT works on the WAL (write-ahead log); if the WAL is switched off while inserting records into HBase (possible, for performance reasons), NRT won't work. A client-side sketch of how that happens follows below.
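For illustration, this is roughly what a WAL-skipping write looks like on the client side (HBase 0.98-era API; table, family and column names are made up). Because the NRT indexer follows the WAL, rows written this way never reach Solr:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalWrite {
    public static void write(HTableInterface table) throws IOException {
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        // This bypasses the write-ahead log; older clients used put.setWriteToWAL(false).
        put.setDurability(Durability.SKIP_WAL);
        table.put(put);
    }
}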
Possible solutions:
1) Delete the Solr documents and freshly load the data into Solr from HBase.
You can run the HBase batch indexer on the whole dataset (the batch indexer does not work on incremental data; it works on the whole dataset). A sketch of the Solr clean-up step appears after this list.
2) As part of the data-flow pipeline, write a map-reduce program to insert the data into Solr (which is what we did in one of our implementations).
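A minimal sketch of the Solr clean-up step mentioned in option 1 (SolrJ 4.x; the URL and collection name are placeholders), to be run before re-indexing the whole table with the batch indexer:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ClearSolrCollection {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1");
        // Delete every document, commit, then re-run the HBase batch indexer over the full table.
        solr.deleteByQuery("*:*");
        solr.commit();
        solr.shutdown();
    }
}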

All right, we solved the problem recently.
The reason the solr numFound is different from the hbase table row count is that hbase-indexer mistakenly deleted some rows instead of inserting them. We found this out from the hbase-indexer metrics:
https://github.com/NGDATA/hbase-indexer/wiki/Metrics
We used jconsole to watch the JMX metrics and found:
indexer deletes count = hbase table row count - solr numFound
Finally we debugged the hbase-indexer source code and found the code that causes this problem; it may be an issue in hbase-indexer itself, please see: https://github.com/NGDATA/hbase-indexer/issues/78
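A minimal sketch of reading those metrics over JMX programmatically instead of through jconsole (the JMX host/port is a placeholder, and the exact MBean names are documented on the Metrics wiki and may vary by version, so the name filter below is only an assumption):

import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IndexerMetricsDump {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://indexer-host:10102/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Print every attribute of every MBean whose name mentions the indexer,
            // which is enough to compare the delete counter with the row-count gap.
            for (ObjectName name : mbs.queryNames(null, null)) {
                if (!name.toString().toLowerCase().contains("indexer")) {
                    continue;
                }
                System.out.println(name);
                for (MBeanAttributeInfo attr : mbs.getMBeanInfo(name).getAttributes()) {
                    try {
                        System.out.println("  " + attr.getName() + " = " + mbs.getAttribute(name, attr.getName()));
                    } catch (Exception ignore) {
                        // some attributes cannot be read remotely
                    }
                }
            }
        }
    }
}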

Related

Record limit to ingest from teradata to Hadoop

I am ingesting 5 tables from teradata to Hadoop using the jdbc connector. I have written configuration files for them.
Four out of the 5 tables ingest perfectly and the record counts match. One table is not getting ingested at all. The row count of this table is 56 million (the largest in this set); the ingestion runs up to roughly 35 million records and then stops abruptly with no error message. The table is not getting created in Hadoop even for those 35M records. This is my usual ingestion method and nothing should go wrong with it.
Can someone suggest whether there is any limit to the number of records that can be ingested from Teradata to Hadoop?

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to perform reporting on the data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into hadoop daily.
Hadoop works in append-only mode, hence we cannot perform Hive update/delete queries. We can perform insert overwrite on the dimension tables and add delta values to the fact tables. Introducing thousands of delta rows daily does not seem like a very good solution.
Are there any other, better standard ways to update modified data in Hadoop?
Thanks
HDFS might be append only, but Hive does support updates from 0.14 on.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time; a sketch of that merge pattern follows below.
Depending on your use case, have a look at Apache Impala/Hbase/... or even Drill.
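Assuming each row has a key column (id) and a modification timestamp (updated_at), the merge could look roughly like this over the HiveServer2 JDBC driver; all table, column and host names are made up, and the window function needs Hive 0.11 or later:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveMergeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Rebuild a full snapshot into a new table from the previous snapshot plus
            // the daily delta, keeping only the newest version of each row.
            stmt.execute(
                "INSERT OVERWRITE TABLE customers_snapshot_new " +
                "SELECT id, name, updated_at FROM ( " +
                "  SELECT all_rows.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) rn " +
                "  FROM (SELECT * FROM customers_snapshot UNION ALL SELECT * FROM customers_delta) all_rows " +
                ") ranked WHERE rn = 1");
        }
    }
}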

Hadoop Cassandra CqlInputFormat pagination

I am quite a newbie with Cassandra and have the following question:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows.
I run a hadoop job (the datanodes reside on the cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read into the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data and that a LIMIT clause is also added (1k by default). That explains the 7k rows read:
7 nodes * 1k limit = 7k rows read total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
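For context, that call normally sits in the Hadoop job setup, roughly like this (keyspace, table and contact point are placeholders, and the exact ConfigHelper calls can vary between Cassandra versions):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "read-cassandra-table");
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cassandra-host");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.Murmur3Partitioner");
        // Raise the per-node page size discussed above (the default is 1000 rows).
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "10000");
        return job;
    }
}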
Please help me with the questions below:
Is this a desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in the table and do not know the row count?
My problem was related to a bug in cassandra 2.0.11 that added a strange LIMIT clause to the underlying CQL query used to read data into the map task.
I posted that issue to cassandra jira: https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that the problem was strictly related to the following bug, fixed in cassandra 2.0.12: https://issues.apache.org/jira/browse/CASSANDRA-8166

Real time database in Hadoop ecosystem

Pardon me if this is a silly question.
I have cloudera manager installed on a single node.
I am trying to use Hbase and Hadoop for logging requests and responses in my web application.
I am trying to list the latest user activity using that log.
Rows are added using the table structure below:
1 column family, a RowId, 11 columns. I store every value as a string. Fairly simple and similar to a mysql table.
RowId
entry:addedTime
entry:value
entry:ip
entry:accessToken
entry:identifier
entry:userId
entry:productId
entry:object
entry:requestHeader
entry:completeDate
entry:tag
Now, in order to get rows from my Hbase, I use
SingleColumnValueFilter("entry", "userId", "=", binary:"25", true, true)
Now, I am struggling to order this by
entry:completeDate DESCENDING
and limit by 25 rows for pagination or infinite scroll.
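For reference, a Java sketch of that same filter with a PageFilter added to cap the scan at 25 rows (HBase 0.98-era client API). Note that HBase only returns rows in row-key order; there is no ORDER BY on a column such as entry:completeDate, which is exactly why the ordering part is hard:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class LatestActivityScan {
    public static void printRowsForUser(HTableInterface table) throws IOException {
        // Same predicate as the shell filter: entry:userId = "25", skipping rows without the column.
        SingleColumnValueFilter byUser = new SingleColumnValueFilter(
                Bytes.toBytes("entry"), Bytes.toBytes("userId"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("25"));
        byUser.setFilterIfMissing(true);
        // PageFilter caps the rows returned (it is evaluated per region server,
        // so the client should also stop reading after 25 rows).
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL, byUser, new PageFilter(25));
        Scan scan = new Scan();
        scan.setFilter(filters);
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
    }
}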
My questions:
Is Hbase the only real time querying database available in Hadoop ecosystem?
Am I using Hbase for wrong reasons? Is my table structure correct?
I work at a startup and these are our baby steps toward moving to BigData. Although BigData has created a lot of hype, Hadoop is poorly supported on recent linux distributions and looks too complicated.
Any help or suggestions would be appreciated.
Many thanks,
Karthik

How to create a data pipeline from hive table to relational database

Background :
I have a Hive table "log" which contains log information. This table is loaded with new log data every hour. I want to do some quick analytics on the logs for the past 2 days, so I want to extract the last 48 hours of data into my relational database.
To solve the above problem I have created a staging hive table which is loaded by a Hive SQL query. After loading the new data into the staging table, I load the new logs into the relational database using a sqoop query.
The problem is that sqoop loads data into the relational database in BATCHES, so at any particular time I have only partial logs for a particular hour.
This is leading to erroneous analytics output.
Questions:
1) How can I make this Sqoop data load transactional, i.e. either all records are exported or none are?
2) What is the best way to build this data pipeline, where the whole process is Hive Table -> Staging Table -> Relational Table?
Technical Details:
Hadoop version 1.0.4
Hive- 0.9.0
Sqoop - 1.4.2
You should be able to do this with sqoop by using the option called --staging-table. What this does is specify an auxiliary table that is used to stage the exported data. The staged data is finally moved to the destination table in a single transaction, so you shouldn't have consistency issues with partial data.
(source: Sqoop documentation)
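A sketch of what that export could look like, invoked here through Sqoop's Java entry point (the same flags work on the sqoop command line; the JDBC URL, tables and export directory are placeholders, and the staging table must already exist with the same schema as the target):

import org.apache.sqoop.Sqoop;

public class ExportWithStaging {
    public static void main(String[] args) {
        int exitCode = Sqoop.runTool(new String[] {
                "export",
                "--connect", "jdbc:mysql://db-host/reports",
                "--username", "etl",
                "--table", "logs_last_48h",
                // Rows are written here first, then moved to the target table in one transaction.
                "--staging-table", "logs_last_48h_staging",
                "--clear-staging-table",
                "--export-dir", "/user/hive/warehouse/logs_staging"
        });
        System.exit(exitCode);
    }
}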
Hive and Hadoop are great technologies that let your analytics run inside MapReduce tasks, performing them very fast by utilizing multiple nodes.
Use that to your benefit. First of all, partition your Hive table.
I guess that you store all logs in a single Hive table, so when you run your queries with something like
SQL .... WHERE LOG_DATA > '17/10/2013 00:00:00'
Then you effectively query all the data that you have collected so far.
Instead, if you use partitions - let's say one per day - you can write in your query
WHERE p_date=20131017 OR p_date=20131016
The Hive table is partitioned, and Hive now knows to read only those two partitions.
So let's say you have 10 GB of logs per day - then such a Hive query should finish within a few seconds on a decent Hadoop cluster.
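A rough sketch of what the partitioned layout could look like, executed over the HiveServer2 JDBC driver (table, column, path and host names are made up; the thread itself runs Hive 0.9, where the same DDL works but the JDBC URL would differ):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PartitionedLogsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // One partition per day, so a two-day report only touches two partitions on disk.
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (log_time STRING, log_level STRING, message STRING) "
                    + "PARTITIONED BY (p_date INT) STORED AS TEXTFILE");
            // The hourly load goes into the partition for the current day.
            stmt.execute("LOAD DATA INPATH '/staging/logs/2013-10-17' "
                    + "INTO TABLE logs PARTITION (p_date = 20131017)");
            // A query that restricts p_date reads only the matching partitions.
            ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(1) FROM logs WHERE p_date = 20131017 OR p_date = 20131016");
            if (rs.next()) {
                System.out.println("rows in the last two days: " + rs.getLong(1));
            }
        }
    }
}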
