Record limit to ingest from Teradata to Hadoop

I am ingesting 5 tables from Teradata into Hadoop using a JDBC connector, and I have written configuration files for them.
Four out of the 5 tables ingest perfectly and the record counts match. One table is not getting ingested at all. The count of this table is 56 million (the largest in this set); the ingestion runs up to roughly 35 million records and then stops abruptly with no error message. The table is not created in Hadoop even for those ~35M records. This is my usual ingestion method, so I don't see what could go wrong in it.
Can someone suggest whether there is any limit to the number of records that can be ingested from Teradata to Hadoop?

Related

How to do count validation for real-time streaming data

I am working on a project where the architecture is as follows:
Transactions initiated from the web application are first stored in an Oracle DB (the transactional database) and are then ingested into the big data framework, processed, and stored in HBase. CR8 is used for real-time data integration: it captures all the transactional logs from Oracle and sends them to Kafka. Spark jobs then process this data in real time and store the processed data in HBase.
To make sure all the transaction records in the Oracle DB arrive in HBase, what I am doing is taking a sample of 1000 transaction IDs from Oracle and checking whether those 1000 IDs are present in HBase. So I can only do count validation on sampled records.
Is there any method to do 100% count validation, to ensure that every transaction record in Oracle has been ingested into HBase?
I am expecting something like the --validate argument of a Sqoop import job, which does a 100% count validation between the RDBMS and HDFS.
Or is there any other way to do count validation in the above architecture?
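One way to get a full reconciliation rather than a sample check is to expose the HBase table to Hive and compare it against a complete export of the Oracle transaction IDs. A minimal sketch, assuming hypothetical table and column names (transactions_hbase, txn_ids_oracle, cf:txn_id):

-- Expose the HBase table to Hive so it can be queried with SQL.
-- All table, column family and column names here are illustrative.
CREATE EXTERNAL TABLE transactions_hbase (
  rowkey STRING,
  txn_id STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:txn_id")
TBLPROPERTIES ("hbase.table.name" = "TRANSACTIONS");

-- txn_ids_oracle is assumed to hold the full list of transaction IDs pulled
-- from Oracle (for example via a Sqoop import of just the ID column).
-- Any ID returned here exists in Oracle but is missing from HBase.
SELECT o.txn_id
FROM txn_ids_oracle o
LEFT JOIN transactions_hbase h ON o.txn_id = h.txn_id
WHERE h.txn_id IS NULL;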

Hive data processing taking longer than expected

I'm facing an issue with ORC data in Hive and need some suggestions from anyone who has faced a similar problem.
I have huge data stored in a Hive table (partitioned and stored as ORC). The ORC data size is around 4 TB. I'm trying to copy this data to an uncompressed, plain Hive table with the same table structure.
The process runs forever and occupies a huge amount of non-DFS storage along the way. At present the process has been running for 12 hours and has occupied 130 TB of non-DFS storage. That is very abnormal for a Hadoop cluster with 20 servers.
Below are my parameters:
Hadoop distribution: HDP 2.4
Hive: 0.13
No. of servers: 20 (2 NameNodes included)
I wonder what a simple join or a normal analytics operation on this ORC table would do, given that the theory says ORC-format data improves performance for basic DML queries.
Can someone please let me know if I'm doing something wrong, or is this normal behavior? This is my first experience with ORC data.
For a start, I have noticed that the YARN log files are growing to a huge size, and they mostly contain error logs.
Thanks
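For reference, one way to keep such a copy from spilling unbounded intermediate data is to move a bounded slice of partitions per run with dynamic partitioning, rather than rewriting all 4 TB in one statement. A rough sketch, with illustrative table and partition-column names (logs_orc, logs_plain, p_date):

-- Table and partition-column names are illustrative.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.compress.output = false;   -- uncompressed target, as described above

-- Copy one bounded date range per run; repeat for the next range.
INSERT OVERWRITE TABLE logs_plain PARTITION (p_date)
SELECT *
FROM logs_orc
WHERE p_date BETWEEN '20160101' AND '20160131';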

hbase-indexer Solr numFound different from HBase table row count

Recently my team has been using hbase-indexer on CDH to index HBase table columns into Solr. We deployed the hbase-indexer server (called the Key-Value Store Indexer) and began testing. We found that the row counts in the HBase table and the Solr index differ:
We used Phoenix to count the HBase table rows:
0: jdbc:phoenix:slave1,slave2,slave3:2181> SELECT /*+ NO_INDEX */ COUNT(1) FROM C_PICRECORD;
+------------------------------------------+
| COUNT(1) |
+------------------------------------------+
| 4084355 |
+------------------------------------------+
And we used the Solr Web UI to check the Solr index size:
numFound : 4060479
We could not find any errors in the hbase-indexer log or the Solr log, but the row counts in the HBase table and the Solr index really are different. Has anyone run into this situation? I don't know what to do.
My understanding:
HBase row count - Solr row count (numFound) = missing records
4084355 - 4060479 = 23876 (records that are in HBase but missing from Solr)
The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables.
NRT indexing works on incremental data, not on the whole data set.
In my experience these are the possible reasons:
1) NRT indexing worked initially, and if it suddenly stopped working (due to some health issue) there will be a discrepancy in the numbers.
2) NRT indexing relies on the WAL (write-ahead log); if the WAL is switched off while inserting records into HBase (possible, for performance reasons), NRT indexing won't work.
Possible solutions:
1) Delete the Solr documents and freshly load the data into Solr from HBase. You can run the HBase batch indexer over the whole data set (the batch indexer does not work on incremental data; it works on the whole data set).
2) As part of the data-flow pipeline, write a MapReduce program to insert the data into Solr (which is what we did in one of our implementations).
All right, we solved the problem recently.
The Solr numFound differs from the HBase table row count because hbase-indexer mistakenly deletes some rows instead of inserting them. We found this via the hbase-indexer metrics:
https://github.com/NGDATA/hbase-indexer/wiki/Metrics
We used JConsole to watch the JMX metrics and found:
indexer delete count = HBase table row count - Solr numFound
Finally we debugged the hbase-indexer source code and found the code that causes this problem; it may be an hbase-indexer issue, see: https://github.com/NGDATA/hbase-indexer/issues/78

Hadoop Cassandra CqlInputFormat pagination

I am quite a newbie with Cassandra and have the following question:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows.
I run a Hadoop job (the DataNodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows reach the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data, and that a LIMIT clause is added (1k by default). That explains the 7k rows read:
7 nodes * 1k limit = 7k rows read in total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
Please help me with the questions below:
Is this the desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug, or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in the table and do not know the row count?
My problem was related to a bug in Cassandra 2.0.11 that added a spurious LIMIT clause to the underlying CQL query used to read data into the map task.
I posted the issue to the Cassandra JIRA: https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that the problem was strictly related to the following bug, fixed in Cassandra 2.0.12: https://issues.apache.org/jira/browse/CASSANDRA-8166

How to create a data pipeline from a Hive table to a relational database

Background:
I have a Hive table "log" which contains log information. This table is loaded with new log data every hour. I want to do some quick analytics on the logs for the past 2 days, so I want to extract the last 48 hours of data into my relational database.
To solve this, I created a staging Hive table which is loaded by a Hive SQL query. After loading the new data into the staging table, I load the new logs into the relational database with a Sqoop query.
The problem is that Sqoop loads data into the relational database in batches, so at any particular time I have only partial logs for a particular hour.
This leads to erroneous analytics output.
Questions:
1) How do I make this Sqoop data load transactional, i.e. either all records are exported or none are?
2) What is the best way to build this data pipeline, where the whole process is Hive table -> staging table -> relational table?
Technical Details:
Hadoop version 1.0.4
Hive- 0.9.0
Sqoop - 1.4.2
You should be able to do this with Sqoop by using the option called --staging-table. This basically acts as an auxiliary table used to stage the exported data. The staged data is finally moved to the destination table in a single transaction, so you shouldn't have consistency issues with partial data.
(source: Sqoop documentation)
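Conceptually, what Sqoop does with --staging-table on the relational side is equivalent to something like the following (table names are illustrative; the exact statements depend on the target database):

-- Sqoop first exports all rows into the (empty) staging table, then moves them
-- to the real target in one transaction, so readers never see a partial batch.
BEGIN;
INSERT INTO log_analytics
SELECT * FROM log_analytics_staging;
COMMIT;

-- With --clear-staging-table, Sqoop empties the staging table before the export runs.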
Hive and Hadoop are great technologies that let your analytics run inside MapReduce tasks, performing them quickly by utilizing multiple nodes.
Use that to your benefit. First of all, partition your Hive table.
I guess that you store all logs in a single Hive table, so when you run a query with a condition like
SQL .... WHERE LOG_DATA > '17/10/2013 00:00:00'
you effectively scan all the data you have collected so far.
Instead, if you partition the table (say, one partition per day), you can write in your query
WHERE p_date=20131017 OR p_date=20131016
and Hive knows to read only those two partitions.
So, say you collect 10 GB of logs per day: a Hive query over two such partitions should finish in a few seconds on a decent Hadoop cluster.
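A minimal sketch of the partitioning idea (table and column names are illustrative):

-- Illustrative partitioned log table, one partition per day.
CREATE TABLE log_partitioned (
  log_time  STRING,
  message   STRING
)
PARTITIONED BY (p_date STRING);

-- Load each hourly batch into the right day's partition.
INSERT INTO TABLE log_partitioned PARTITION (p_date = '20131017')
SELECT log_time, message FROM log;

-- The analytics query now reads only the last two days' partitions.
SELECT COUNT(*)
FROM log_partitioned
WHERE p_date = '20131017' OR p_date = '20131016';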
