How to validate a data transfer from an external database (Oracle) to HDFS

I have a job that transfers data from Oracle to HDFS. I need an efficient way to validate this transfer, to make sure that all the rows are properly transferred.

A simple approach is to take the row count from the source Oracle table:
select count(*) from tablename;
This gives you the row count on the Oracle side.
On the HDFS side, count the total number of lines (rows) in the HDFS files:
hadoop fs -cat /yourdestinationhdfsfiles/* | wc -l
Data validation strategy
Create a (temporary) Hive table matching the Oracle table structure.
Load a few records from the target HDFS files into the Hive table and check that the records and structure match (a manual validation step).
Note: this can also be done for the full data set, provided you have enough storage space and processing capacity; a scripted comparison of the two counts is sketched below.
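A rough sketch of automating that comparison from the shell; the Oracle connection string, table name, and HDFS path are placeholders, and it assumes the sqlplus client is available on the machine running the check:

#!/usr/bin/env bash
# Sketch: compare the Oracle row count with the HDFS line count after a transfer.
# ORACLE_CONN, TABLE and HDFS_DIR are placeholders -- adjust for your environment.
ORACLE_CONN="user/password@//dbhost:1521/ORCL"
TABLE="tablename"
HDFS_DIR="/yourdestinationhdfsfiles"

# Row count on the Oracle side (sqlplus -S suppresses banners and prompts).
oracle_count=$(sqlplus -S "$ORACLE_CONN" <<EOF
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT COUNT(*) FROM $TABLE;
EXIT;
EOF
)

# Line count on the HDFS side.
hdfs_count=$(hadoop fs -cat "$HDFS_DIR"/* | wc -l)

# Strip whitespace before comparing.
oracle_count=$(echo "$oracle_count" | tr -d '[:space:]')
hdfs_count=$(echo "$hdfs_count" | tr -d '[:space:]')

echo "Oracle rows: $oracle_count, HDFS lines: $hdfs_count"
if [ "$oracle_count" = "$hdfs_count" ]; then
    echo "Row counts match."
else
    echo "Row counts differ -- investigate the transfer." >&2
fi

Note that a matching row count only proves completeness, not correctness of individual values; the Hive sample check above covers that part.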
Hope this helps!

Related

Hive load data into HDFS

I have a data set with 100+ columns per row. The question is: how can I load only selected columns into HDFS using Hive?
For example: col1, col2, col3 ... col50, col51 ... col99, col100. I need to load only the selected columns col1, col2, col34 and col99.
Approach 1:
1. Load all the columns.
2. Create a view based on the selected columns.
Cons of approach 1: I have to load all the columns unnecessarily, which consumes more storage in HDFS, and I have to write a long query just to list the columns. Is there a better approach?
Hive provides a tabular view on top of HDFS data. If your data is in HDFS, you can create an external table on it to reference the existing data. You will need to put a schema over the data; this is a one-time effort, and afterwards you can use all the features of Hive to explore and analyze the dataset. Hive supports views as well.
Illustration
Sample data file: data.csv
1,col_1a,col1b
2,col_2a,col2b
3,col_3a,col3b
4,col_4a,col4b
5,col_5a,col5b
6,col_6a,col6b
7,col_7a,col7b
Load and verify data in HDFS
hadoop fs -mkdir /hive-data/mydata
hadoop fs -put data.csv /hive-data/mydata
hadoop fs -cat /hive-data/mydata/*
1,col_1a,col1b
2,col_2a,col2b
3,col_3a,col3b
4,col_4a,col4b
5,col_5a,col5b
6,col_6a,col6b
7,col_7a,col7b
Create a Hive table on top of the HDFS data in the default database
CREATE EXTERNAL TABLE default.mydata
(
id int,
data_col1 string,
data_col2 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/mydata';
Query the Hive table
select * from default.mydata;
mydata.id mydata.data_col1 mydata.data_col2
1 col_1a col1b
2 col_2a col2b
3 col_3a col3b
4 col_4a col4b
5 col_5a col5b
6 col_6a col6b
7 col_7a col7b
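Coming back to the original question of exposing only selected columns: a view can then be defined over the external table. A minimal sketch run through the Hive CLI; the view name mydata_selected and the column choice are only illustrative:

# Sketch: expose only a subset of the columns through a Hive view.
# The view name mydata_selected is hypothetical.
hive -e "
CREATE VIEW IF NOT EXISTS default.mydata_selected AS
SELECT id, data_col1
FROM default.mydata;

SELECT * FROM default.mydata_selected;
"

A view costs no extra storage; it is just a stored query over the same HDFS files, so downstream queries only see the columns you selected.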

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded with protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that partition columns do not need to be part of the data being loaded. I am trying to partition by date both for performance and, more importantly, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE. Hive does not manage the files of an external table for you, so adding the partition in the metastore does not create the corresponding directory in HDFS; you need to create that folder yourself using HDFS commands, MapReduce, or Spark. If you check the HDFS location '/test/dt=20171117' you will see that the folder has not been created.
My suggestion is to create the folder (partition) with "hadoop fs -mkdir '/test/dt=20171117'" and then query the table. It will return 0 rows at first, but you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a SequenceFile? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
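Putting the suggestions above together, a rough sketch of the workflow, assuming the table was created with LOCATION '/test' as recommended and that the dt=<value> directory convention is used (paths and the table name are taken from the question but should be treated as placeholders):

# Sketch: lay the data out using Hive's dt=<value> partition convention,
# then register the partition in the metastore.

# 1. Create the partition directory and move the day's files into it.
hadoop fs -mkdir -p /test/dt=20171117
hadoop fs -mv /test/20171117/* /test/dt=20171117/

# 2a. Add the partition explicitly ...
hive -e "ALTER TABLE test_messages ADD IF NOT EXISTS PARTITION (dt='20171117') LOCATION '/test/dt=20171117';"

# 2b. ... or let Hive discover every dt= directory under the table LOCATION.
hive -e "MSCK REPAIR TABLE test_messages;"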

How does Hive read data even after it has been deleted from HDFS?

I have an external table in Hive pointing to an HDFS location. By mistake I ran the job that loads the data into HDFS twice.
Even after deleting the duplicate file from HDFS, Hive still reports the row count doubled (i.e. it includes the count from the deleted duplicate data file).
select count(*) from tbl_name -- returns the doubled count
But:
select count(col_name) from tbl_name -- returns the actual count
When I tried the same table from Impala after running
INVALIDATE METADATA
I could see only the count of the data that is actually present in HDFS (no duplicates).
How can Hive report a doubled count even after the file has been deleted from the physical location (HDFS)? Does it read from statistics?
Hive is using statistics to compute count(*). You deleted the files manually (not through Hive), which is why the statistics are wrong.
The solution is either:
to switch off statistics usage in such cases:
set hive.compute.query.using.stats=false;
or to analyze the table, as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;
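As a quick illustration of applying both fixes from the shell (the table and partition columns are just the placeholders used above):

# Sketch: either bypass the stale statistics for this query, or refresh them.
hive -e "
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM tbl_name;  -- now counts what is actually in HDFS
"

# Recompute the statistics so future stats-based answers are correct again.
hive -e "ANALYZE TABLE tbl_name PARTITION(a, b, c) COMPUTE STATISTICS;"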

Data in HDFS files not seen under hive table

I have to create a Hive table from data present in Oracle tables.
I'm doing a Sqoop import, converting the Oracle data into HDFS files, and then creating a Hive table on top of those HDFS files.
The Sqoop job completes successfully and the files get generated in the HDFS target directory.
Then I run the create-table script in Hive. The table gets created, but it is empty: no data is seen in the Hive table.
Has anyone faced a similar problem?
Hive's default field delimiter is Ctrl-A; if you don't specify a delimiter, Hive assumes that default. Add the line below to your Hive script, making sure the delimiter matches the one Sqoop used when writing the files (tab, in this example):
row format delimited fields terminated by '\t'
Your Hive script and your expectation are wrong. You are trying to create a partitioned table directly on top of data that you have already imported, and partitions don't work that way. If your table definition has no partitions, you will be able to see the data.
Basically, if you want a partitioned table, you can't create it directly over the underlying data as you tried above. If you want Hive partitions, load the data from an intermediate table (or from that Sqoop directory) into your partitioned table so that the partitions get populated.
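A minimal sketch of making the Sqoop output and the Hive DDL agree on the delimiter; the connection string, table, columns, and target directory are hypothetical, and the key point is simply that --fields-terminated-by and FIELDS TERMINATED BY use the same character:

# Sketch: import with an explicit field delimiter, then declare the same
# delimiter in the Hive table definition. All names and paths are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser -P \
  --table MYTABLE \
  --fields-terminated-by '\t' \
  --target-dir /data/mytable

hive -e "
CREATE EXTERNAL TABLE mytable (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/mytable';
"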

Hive/Impala count distinct on a partitioned column results in all data files being read?

When querying a Hive table on a partitioning column only, it would be logical that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both Hive and Impala are unable to execute this query efficiently: they just read the entire table!
What do we need to do to ensure the above command executes rapidly?
As a hack: if the column is a partition column, each distinct value is bound to appear as its own subdirectory under the warehouse directory.
You could try something like:
hadoop fs -ls /<hive_warehouse_directory>/<database.db>/<table_name> | wc -l
Usually the Hive warehouse is kept at /user/hive/warehouse.
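An alternative that stays inside Hive rather than counting warehouse directories is to ask the metastore for the partition list. A small sketch, assuming the table name from the question and a hypothetical database directory:

# Sketch: count distinct partition values from the metastore instead of scanning data.
hive -e "SHOW PARTITIONS my_partitioned_table;" | wc -l

# Or, for the directory-listing hack above, count only the partition directories
# (their names contain '='), which also skips the 'Found N items' header line.
hadoop fs -ls /user/hive/warehouse/mydb.db/my_partitioned_table | grep -c '='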
