Data in Hive giving different data when inserted into a table vs. queried directly - hadoop

I am working on Hive version 2.3.8.
There is a table which gets its data directly from an S3 location on AWS. When I use this table to insert into a temporary ORC table, incorrect data is stored.
But the same query, without inserting into a table, gives the correct data.
There is no join on the table.
For now I am fetching the data directly, but that is not efficient since the data is required in more than 2-3 places. This increases the turnaround time of my query.
As asked:
Let the dump look like this:
col1 col2 col3
a 1st jan 1st jan
a 1st jan 2nd jan
a 1st jan 3rd jan
When I am inserting into a table:
col1 col2 col3
a 1st jan 1st jan
a 1st jan 2nd jan
a 1st jan 3rd jan
a 1st jan 4th jan
a 1st jan 5th jan
And if queried directly from the dump:
col1 col2 col3
a 1st jan 1st jan
a 1st jan 2nd jan
a 1st jan 3rd jan
Even if I insert into a completely new table with a very weird name that could never already exist with ghost data, I still get the same result.
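For reference, a minimal sketch of the pattern involved, with hypothetical table names (the real DDL is not shown above):

-- Hypothetical names: s3_external_table stands in for the table over the S3 dump.
CREATE TEMPORARY TABLE tmp_orc STORED AS ORC AS
SELECT col1, col2, col3
FROM s3_external_table;

-- These two counts should match, but here the ORC copy comes back with extra rows:
SELECT COUNT(*) FROM tmp_orc;
SELECT COUNT(*) FROM s3_external_table;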

Related

TIMESTAMP column not interpreting correct value for ORC file in HDP3.1

As part of a cluster migration we are copying ORC HDFS files from the old cluster (IBM IOP 4.2) to HDP 3.1. Post-migration we see the TIMESTAMP column shows -1 hour in HDP 3.1. A similar question was posted in: TimeStamp issue in hive 1.1
We cross-checked the TIME ZONE configuration for all the nodes in the cluster (Linux OS and Hive), and both are set to EDT (local time zone).
We tried testing this scenario by reading the ORC file content using hive --orcfiledump -d, and the actual ORC file has the correct timestamp value. The column value is getting changed when Hive reads it and displays the records.
ORC external table output on OLD cluster.
DATE_KEY DTDATE
20100701 7/1/2010 12:00:00 AM
20100702 7/2/2010 12:00:00 AM
20100703 7/3/2010 12:00:00 AM
ORC external table output on new HDP 3.1 cluster. The DTDATE column shows -1 hour
DATE_KEY DTDATE
20100701 6/30/2010 11:00:00 PM
20100702 7/1/2010 11:00:00 PM
20100703 7/2/2010 11:00:00 PM
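For reference, one way to line up what is physically in the file against what Hive returns (the path and table name below are hypothetical):

-- Dump the decoded rows straight from the ORC file, as done in the question:
--   hive --orcfiledump -d /warehouse/tablespace/external/hive/my_orc_table/000000_0
-- Then compare with what Hive itself returns for the same rows:
SELECT DATE_KEY, DTDATE
FROM my_orc_table
ORDER BY DATE_KEY
LIMIT 3;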

Oracle Partition based on other table data

I have two tables:
1. Order
2. Payout_bank
According to our business flow, an order happens on a date, and the “payout_bank” entry for that transaction happens 2 or more days after the order transaction.
Now I would like to partition the Order table based on a date, say 1st Jan 2018. So, two partitions.
I would like to partition Payout_bank based on the transactionNumber of orders: those dated 1st Jan 2018 or later in one partition, and the rest in the other partition.
Note: we have to do this because we will be dumping the older partition to an archival database, so we want to preserve the referential integrity of our data in the archival database.
Question: how can we define the partitioning of the Payout_bank table?
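The question doesn't show the DDL, so treat the following as a sketch with hypothetical column names. One technique that fits this requirement is Oracle reference partitioning: the child table inherits the parent's partitions through a NOT NULL foreign key, so each Payout_bank row lands in the same partition as its Order row, and the partitions can be archived together:

-- Parent table, range-partitioned on the order date.
CREATE TABLE orders (
  transaction_number NUMBER PRIMARY KEY,
  order_date         DATE NOT NULL
)
PARTITION BY RANGE (order_date) (
  PARTITION p_pre_2018 VALUES LESS THAN (DATE '2018-01-01'),
  PARTITION p_2018_on  VALUES LESS THAN (MAXVALUE)
);

-- Child table, partitioned by reference through the FK constraint:
-- each payout row goes into the same partition as its parent order.
CREATE TABLE payout_bank (
  payout_id          NUMBER PRIMARY KEY,
  transaction_number NUMBER NOT NULL,
  payout_date        DATE,
  CONSTRAINT fk_payout_order
    FOREIGN KEY (transaction_number) REFERENCES orders (transaction_number)
)
PARTITION BY REFERENCE (fk_payout_order);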

Incrementally fetching data every half an hour from an Oracle database

I have a table which has 150 billion rows.
1) First, I need to fetch historical data from the table and load it into the target with high performance (I have indexes on a few columns). I have tried queries like the below:
select *
from Historical_table
and
select /*+PARALLEL(a,8)*/ a.*
from Historical_table a
These take almost an hour and still return no result.
2) I also need an incremental query that, every half an hour, fetches the rows updated in the table during that window.
E.g., with a timestamp column: if I fetch the data at 4:00 PM it should fetch only the data from 3:30 PM-4:00 PM, and
if I run it at 4:30 PM it has to fetch the data from 4:00 PM-4:30 PM.
This incremental query should run in less than 15 minutes. I'm looking for a better-performing query.
Could someone please help me with this?
Thanks.
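For the half-hourly pull, a minimal sketch of the windowed query (column and bind names are hypothetical; it assumes an index on the timestamp column so only the half-hour slice is scanned, not the full table):

SELECT a.*
FROM   historical_table a
WHERE  a.last_updated_ts >= :window_start   -- e.g. TIMESTAMP '2024-06-01 15:30:00'
AND    a.last_updated_ts <  :window_end;    -- e.g. TIMESTAMP '2024-06-01 16:00:00'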

How to resolve date difference between Hive text file format and Parquet file format

We created an external Parquet table in Hive and inserted the existing text file data into it using insert overwrite,
but we observed that dates from the existing text file do not match those in the Parquet files.
Data from the two files:
txt file date: 2003-09-06 00:00:00
parquet file date: 2003-09-06 04:00:00
Questions :
1) How can we resolve this issue?
2) Why are we getting this discrepancy in the data?
We faced a similar issue when Sqooping tables from SQL Server; that was because of a driver/JAR issue.
When you are doing an insert overwrite, try using CAST on the date fields.
This should work; let me know if you face any issues.
Thanks for your help.
I am using both Beeline and the Impala query editor in Hue to access the data stored in the Parquet table, with the timestamp issue occurring when I use an Impala query via Hue.
This is most likely related to a known difference in the way Hive and Impala handle timestamp values:
- when Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads the data out, it converts back to local time.
- Impala, on the other hand, does no conversion when it reads the timestamp field; hence UTC time is returned instead of local time.
If your servers are located in the EST time zone, this explains the +4h time offset as follows:
- the timestamp 2003-09-06 00:00 in the example should be understood as EDT time (Sept. 6 falls in daylight saving time, therefore the UTC-4h offset)
- +4h is added to the timestamp when it is stored by Hive
- the same offset is subtracted when it is read back by Hive, giving the correct value
- no correction is done when it is read back by Impala, thus showing 2003-09-06 04:00:00
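A read-side sketch of what the explanation above implies, assuming the servers are in the US Eastern zone (table and column names are hypothetical): convert the UTC value that Hive wrote back to local time when querying through Impala.

-- Impala: convert the UTC value Hive stored back to local (EST/EDT) time.
SELECT FROM_UTC_TIMESTAMP(my_date_col, 'America/New_York') AS my_date_local
FROM   my_parquet_table;

Alternatively, Impala has a -convert_legacy_hive_parquet_utc_timestamps startup flag that makes it perform this conversion automatically, at some read-performance cost.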

Not able to partition table in Hive - Error in metadata

I have created a table in Hive with data loaded into it.
I want to partition it on the column DoJ where the value is 2012.
I used:
ALTER TABLE employee
ADD PARTITION (year='2012')
location '/home/rvarun/2012/part2012';
I am getting the error:
FAILED: Error in metadata: table is not partitioned but partition spec exists: {year=2012}
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I am a little new to Hive so please excuse me for any noobity.
The table I have looks like this:
1001 Varun 100000 Security Lead 2011
1002 Saloni 85000 Database Admin 2012
1003 Karan 90000 Network Engineer Lead 2012
1004 Pratik 98000 TrainEngine Driver 2012
1005 Ashish 120000 Senior Consultant 2013
1006 Gautam 70000 Salesforce Consultant 2013
1007 Mohit Sacheva 20000 Peon 2014
Can anyone tell me why this is happening? TIA
Your table was not created as a partitioned table. Let's say your table name is my_table; you should include
PARTITIONED BY (year string)
so your CREATE TABLE should look like the below:
DROP TABLE IF EXISTS my_table;
CREATE EXTERNAL TABLE my_table
(col1 string,
col2 string,
col3 string)
PARTITIONED BY (year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/rvarun/2012/part2012';
If you want to use DoJ as the column name, just replace year with DoJ in the PARTITIONED BY clause.
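Once the table is recreated with the PARTITIONED BY clause, the ALTER statement from the question should work as written:

-- Same statement as in the question, now against a partitioned table.
ALTER TABLE my_table
ADD PARTITION (year = '2012')
LOCATION '/home/rvarun/2012/part2012';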
