Hadoop Cassandra CqlInputFormat pagination

I am quite new to Cassandra and have the following question:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows.
I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read into the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data, and a LIMIT clause is also added (1k by default). That explains the 7k rows read:
7 nodes * 1k limit = 7k rows read in total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
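For context, a typical CqlInputFormat job setup looks roughly like this (a sketch, not my exact code; the keyspace, table and node address are placeholders, and the exact helper methods can differ between Cassandra versions):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch of the input-side configuration only; mapper and output setup are omitted.
// The keyspace, table and node address are placeholders.
public class CassandraReadJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-cassandra-table");
        job.setInputFormatClass(CqlInputFormat.class);

        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cassandra-node-1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table");

        // Rows fetched per CQL query for each split -- the LIMIT discussed above.
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
    }
}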
Please help me with questions below:
Is this a desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug or should I just increase the InputCQLPageRowSize value?
What if I want to read all of the data in the table and do not know the row count?

My problem was related to a bug in Cassandra 2.0.11 that added a strange LIMIT clause to the underlying CQL query used to read data into the map task.
I posted the issue to the Cassandra JIRA: https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that the problem was strictly related to the following bug, fixed in Cassandra 2.0.12: https://issues.apache.org/jira/browse/CASSANDRA-8166

Related

How to tune the cassandra.yaml settings to get fast write queries?

When I tried to insert one million rows of data into a Cassandra database, it took one minute even though I have high-performance hardware. Can anyone help me with how to change Cassandra's default settings? I want to know exactly which parameters to modify in the cassandra.yaml or cassandra-env file to get the best performance out of my server.
I tried using the CQL COPY command:
CQL command: COPY table_name(field1, ..., fieldn) TO 'path_file'
9000 rows in a matter of seconds; 56 million rows inserted in 54 minutes.

Record limit to ingest from Teradata to Hadoop

I am ingesting 5 tables from Teradata into Hadoop using the JDBC connector, and I have written configuration files for each of them.
Four of the 5 tables ingest perfectly and the record counts match. One table is not getting ingested at all. This table has 56 million records (the largest in this set); the ingestion runs up to about 35 million records and then stops abruptly with no error message. The table is not getting created in Hadoop even for those 35M records. This is my usual ingestion method and nothing should go wrong with it.
Can someone tell me whether there is any limit to the number of records that can be ingested from Teradata to Hadoop?

hbase-indexer solr numFound different from hbase table rows size

Recently my team has been using hbase-indexer on CDH to index HBase table columns into Solr. After we deployed the hbase-indexer server (called the Key-Value Store Indexer) and began testing, we found that the row counts of the HBase table and the Solr index differ.
We used Phoenix to count the HBase table rows:
0: jdbc:phoenix:slave1,slave2,slave3:2181> SELECT /*+ NO_INDEX */ COUNT(1) FROM C_PICRECORD;
+------------------------------------------+
| COUNT(1) |
+------------------------------------------+
| 4084355 |
+------------------------------------------+
And we used the Solr Web UI to check the Solr index size:
numFound : 4060479
We could not find any errors in the hbase-indexer log or the Solr log, but the row counts of the HBase table and the Solr index really are different! Has anyone encountered this situation? I don't know what to do.
My understanding:
HBase row count - Solr row count (numFound) = missing records
4084355 - 4060479 = 23876 (rows that are in HBase but missing in Solr)
The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables.
NRT works on incremental data, not the whole data set.
In my experience, these are the possible reasons:
1) NRT worked initially, but if NRT suddenly stops working (due to some health issue), there is a possibility of a discrepancy in the numbers.
2) NRT relies on the WAL (write-ahead log); if the WAL is switched off while inserting records into HBase (possible, for performance reasons), NRT won't work.
Possible solutions:
1) Delete the Solr documents and freshly load the data into Solr from HBase. You can run the HBase batch indexer on the whole data set (the batch indexer does not work on incremental data; it works on the whole data set).
2) As part of the data-flow pipeline, write a MapReduce program to insert the data into Solr (which is what we did in one of our implementations); a rough sketch follows below.
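For option 2, a minimal sketch of such a mapper, assuming SolrJ's HttpSolrClient (older SolrJ/CDH versions use HttpSolrServer instead); the Solr URL, column family and field names are placeholders, not our actual implementation:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch of a mapper that re-indexes HBase rows into Solr.
// The driver would wire it to a Scan with TableMapReduceUtil.initTableMapperJob(...).
public class HBaseToSolrMapper extends TableMapper<NullWritable, NullWritable> {

    private SolrClient solr;

    @Override
    protected void setup(Context context) {
        // Placeholder Solr URL and collection name.
        solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/my_collection").build();
    }

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Bytes.toString(rowKey.get()));

        // Placeholder column family / qualifier / Solr field.
        byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("some_column"));
        if (value != null) {
            doc.addField("some_field", Bytes.toString(value));
        }
        try {
            solr.add(doc);
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            solr.commit();
            solr.close();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}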
All right, we solved the problem recently.
The reason the Solr numFound differs from the HBase table row count is that hbase-indexer mistakenly deletes some rows instead of inserting them. We found this by looking at the hbase-indexer metrics:
https://github.com/NGDATA/hbase-indexer/wiki/Metrics
We used jconsole to watch the JMX metrics data and found:
indexer deletes count = HBase table row count - Solr numFound
Finally, we debugged into the hbase-indexer source code and found the code that causes this problem. It may be an hbase-indexer issue; please see: https://github.com/NGDATA/hbase-indexer/issues/78

How to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to do reporting on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
Hadoop works in append-only mode, hence we cannot perform Hive update/delete queries. We can perform INSERT OVERWRITE on dimension tables and add delta values to the fact tables. Introducing thousands of delta rows daily does not seem like a very impressive solution.
Are there any other standard, better ways to update modified data in Hadoop?
Thanks
HDFS might be append-only, but Hive does support updates from 0.14 onwards;
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
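For example, running such an update through the HiveServer2 JDBC driver looks roughly like this (a sketch; the host, user, table and column names are placeholders, and the target table must be an ACID table, i.e. bucketed, stored as ORC, with 'transactional'='true'):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: running a Hive 0.14+ ACID update through the HiveServer2 JDBC driver.
// The host, database, table and column names are placeholders.
public class HiveUpdateExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver2-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // Requires a transactional table, e.g. created with
            // ... CLUSTERED BY (...) INTO n BUCKETS STORED AS ORC
            // TBLPROPERTIES ('transactional'='true').
            stmt.execute("UPDATE dim_customer SET email = 'new@example.com' WHERE customer_id = 42");
        }
    }
}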
A design pattern is to take all of your previous and current data and insert it into a new table every time.
Depending on your use case, have a look at Apache Impala/HBase/... or even Drill.
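A sketch of that rebuild-the-table pattern, again over HiveServer2 JDBC (the table names, column names and date suffixes are made up; the dedup keeps the newest version of each row from the previous snapshot plus the daily delta):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: build a fresh snapshot table from the previous snapshot plus the daily delta.
// Table/column names (customers_*, id, updated_at) are placeholders.
public class HiveRebuildExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver2-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE customers_20240102 AS " +
                "SELECT id, name, email, updated_at FROM (" +
                "  SELECT u.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn" +
                "  FROM (SELECT * FROM customers_20240101" +
                "        UNION ALL" +
                "        SELECT * FROM customers_delta) u" +
                ") ranked WHERE rn = 1");
        }
    }
}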

Real-time database in the Hadoop ecosystem

Pardon me if this is a silly question.
I have Cloudera Manager installed on a single node.
I am trying to use HBase and Hadoop for logging requests and responses in my web application.
I am trying to list the latest user activity using the log.
Rows are added using the table structure below:
1 column family, a RowId and 11 columns. I store every value as a string. Fairly simple and similar to a MySQL table.
RowId
entry:addedTime
entry:value
entry:ip
entry:accessToken
entry:identifier
entry:userId
entry:productId
entry:object
entry:requestHeader
entry:completeDate
entry:tag
Now, in order to get rows from HBase, I use
SingleColumnValueFilter("entry", "userId", "=", binary:"25", true, true)
Now, I am struggling to order this by
entry:completeDate DESCENDING
and limit it to 25 rows for pagination or infinite scroll.
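For context, this is roughly what that filter-based scan looks like with the HBase 1.x Java client (a sketch; the table name and connection details are placeholders, and note that scan results come back in row-key order, not ordered by entry:completeDate):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: scan the activity log for one user's rows, capped at 25 results.
// "activity_log" is a placeholder table name; results are ordered by row key.
public class UserActivityScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("activity_log"))) {

            SingleColumnValueFilter userFilter = new SingleColumnValueFilter(
                    Bytes.toBytes("entry"), Bytes.toBytes("userId"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("25"));
            userFilter.setFilterIfMissing(true);

            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                    userFilter, new PageFilter(25));

            Scan scan = new Scan();
            scan.setFilter(filters);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(
                            row.getValue(Bytes.toBytes("entry"), Bytes.toBytes("completeDate"))));
                }
            }
        }
    }
}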
My questions:
Is HBase the only real-time querying database available in the Hadoop ecosystem?
Am I using HBase for the wrong reasons? Is my table structure correct?
I work in a startup and these are our baby steps toward Big Data. Though Big Data has created a lot of hype, Hadoop is poorly supported on the latest Linux distributions and looks too complicated.
Any help or suggestions would be appreciated.
Many thanks,
Karthik
