Insert overwrite to Hive table saves fewer records than the actual record count - hadoop

I have a partitioned table tab and I want to create a temporary table test1 from it. Here is how I created the temp table:
CREATE TABLE IF NOT EXISTS test1
(
COL1 string,
COL2 string,
COL3 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
I write to this table with:
INSERT OVERWRITE TABLE test1
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
Then I counted the records in test1: it has 94493486 records, while the following SQL returns a count of 149248486:
SELECT COUNT(*) FROM
(SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01') AS TMP;
Also, when I save the selected partition (PT is the partition column) to HDFS, the record count is correct:
INSERT OVERWRITE directory '/user/me/wtfhive' row format delimited fields terminated by '|'
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
My Hive version is 3.1.0, bundled with Ambari 2.7.1.0. Does anyone have any idea what may be causing this issue? Thanks.
=================== UPDATE =================
I found something that might be related to this issue.
The table tab uses ORC as its storage format. Its data was imported from the ORC data files of another table, in another Hive cluster, with the following script:
LOAD DATA INPATH '/data/hive_dw/db/tablename/pt=2019-04-16' INTO TABLE tab PARTITION(pt='2019-04-16');
As the two tables share the same format, the loading procedure basically just moves the data files from the HDFS source directory into the Hive table directory.
With the following procedure, however, I can load the data without any issues (sketched in HiveQL below):
1. export the data from the ORC table tab to an HDFS text file
2. load from the text file into a Hive temp table
3. load the data back into tab from the temp table
4. now I can select/export from tab to other tables without any records missing
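In HiveQL terms the workaround looks roughly like this; the scratch directory /tmp/tab_text and the temp table name tab_text are placeholders, and for illustration I only show the three columns from above plus the partition column:
-- 1. export the ORC partition to plain text on HDFS
INSERT OVERWRITE DIRECTORY '/tmp/tab_text'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT COL1, COL2, COL3 FROM TAB WHERE PT='2019-05-01';
-- 2. create a temp text table and move the exported files into it
CREATE TABLE tab_text (COL1 string, COL2 string, COL3 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/tab_text' INTO TABLE tab_text;
-- 3. write the rows back into the ORC partition
INSERT OVERWRITE TABLE TAB PARTITION (PT='2019-05-01')
SELECT COL1, COL2, COL3 FROM tab_text;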
I suspect the issue lies in the ORC format. I just don't understand why it can export to an HDFS text file without any problem, but exporting to another table (no matter what storage format the other table uses) loses data.

Use the SerDe properties below when creating the text table, i.e. replace ROW FORMAT DELIMITED with the OpenCSVSerde:
CREATE TABLE IF NOT EXISTS test1
(
COL1 string,
COL2 string,
COL3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE;

Related

Load text to ORC file

How do I load a text file into a Hive ORC external table?
create table MyDB.TEST (
Col1 String,
Col2 String,
Col3 String,
Col4 String)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
I have already created the above table as ORC, but while fetching data from the table it shows the error below:
Failed with exception
java.io.IOException:org.apache.orc.FileFormatException: Malformed ORC file hdfs://localhost:9000/Ext/sqooporc/part-m-00000. Invalid
postscript.
There are multiple steps to that. Here are the details.
Create a Hive table that can read the plain text file. Assuming your file is comma delimited and sits on HDFS at /user/data/file1.txt, the syntax is as follows (note that LOCATION must point to the containing directory, not to the file itself).
create table MyDB.TEST (
Col1 String,
Col2 String,
Col3 String,
Col4 String
)
row format delimited
fields terminated by ','
location '/user/data';
Now you have a schema that is in sync with the format of the data you possess.
Create another table with the ORC schema
Now you need to create the ORC table as you were doing earlier. Here is a simpler syntax for creating that table:
create table MyDB.TEST_ORC (
Col1 String,
Col2 String,
Col3 String,
Col4 String)
STORED AS ORC;
Your TEST_ORC table is empty for now. You can populate it with the data from the TEST table using the following command.
INSERT OVERWRITE TABLE TEST_ORC SELECT * FROM TEST;
The statement above selects all the records from the TEST table and writes them to the TEST_ORC table. Since TEST_ORC is an ORC table, the data is converted to ORC format on the fly as it is written into the table.
You can even check the storage location of TEST_ORC table for ORC files.
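For example, DESCRIBE FORMATTED prints the table's Location together with its InputFormat and SerDe, so you can then list that HDFS directory and inspect the ORC files (a quick sketch):
DESCRIBE FORMATTED MyDB.TEST_ORC;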
Now your data is in ORC format and your table TEST_ORC has the required schema to parse it. You may drop your TEST table now, if not needed.
Hope that helps!

Download hive query to csv format using beeline command

I need to download a Hive query result to a local file path in CSV format. Additionally, column values should be enclosed in quotes, fields should be terminated by commas, and the file should have column headers in the first row.
Can anyone help me with the best approach to achieve this? Note: the query usually returns more than 5M rows.
The best approach is to create a Hive table with your selected data, as below.
CREATE EXTERNAL TABLE ramesh_csv (col1 INT, col2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 'mylocation';
INSERT OVERWRITE TABLE ramesh_csv VALUES (1, 'TEST'), (2, 'TEST AGAIN');
In your case, you would insert your selected records into the table instead.
Now look at the HDFS file that was created. It will be comma separated, with values enclosed in double quotes.
See my output below:
"1","TEST"
"2","TEST AGAIN"
And you can use hdfs dfs -getmerge hdfs://mylocation data.csv to download the HDFS part files into a single local file.
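If you would rather drive this entirely from beeline and skip the staging table, something along these lines may also work; the JDBC URL, query, and output path are placeholders, and note that the csv2 output format only quotes values that contain the delimiter, so it does not strictly satisfy the "always quoted" requirement:
beeline -u 'jdbc:hive2://hiveserver:10000/default' \
    --outputformat=csv2 --showHeader=true --silent=true \
    -e 'SELECT col1, col2 FROM mydb.mytable' > /local/path/result.csv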

How to use Parquet in my current architecture?

My current system is architected in this way.
A log parser parses the raw logs every 5 minutes into TSV format and outputs them to HDFS. I created a Hive table from the TSV files on HDFS.
From some benchmarks, I found that Parquet can save 30-40% of the space usage. I also found that I can create a Hive table from Parquet files starting with Hive 0.13. I would like to know if I can convert the TSV files to Parquet files.
Any suggestion is appreciated.
Yes, in Hive you can easily convert from one format to another by inserting from one table to the other.
For example, if you have a TSV table defined as:
CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
And a Parquet table defined as:
CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;
You can convert the data with:
INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;
Or you can skip the Parquet table DDL altogether by using CREATE TABLE AS SELECT:
CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;
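Either way, a quick sanity check after the conversion (just a sketch) is to compare the row counts of the two tables:
SELECT COUNT(*) FROM data_tsv;
SELECT COUNT(*) FROM data_parquet;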

Hive - How to load data from a file with filename as a column?

I am running the following commands to create my table ABC and insert data from all the files in my designated file path. Now I want to add a column with the filename, but I can't find any way to do that without looping over the files or something similar. Any suggestions on the best way to do this?
CREATE TABLE ABC
(NAME string
,DATE string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive -e "LOAD DATA LOCAL INPATH '${DATA_FILE_PATH}' INTO TABLE ABC;"
Hive does have virtual columns, including INPUT__FILE__NAME, which you can reference directly in a query.
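For instance, a minimal sketch against the ABC table defined above:
SELECT INPUT__FILE__NAME, ABC.* FROM ABC LIMIT 10;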
To fill another table with the filename as a column, assume the location of your data is hdfs://hdfs.location:port/data/folder/filename1:
DROP TABLE IF EXISTS ABC2;
CREATE TABLE ABC2 (
filename STRING COMMENT 'this is the file the row was in',
name STRING,
date STRING);
INSERT INTO TABLE ABC2 SELECT split(INPUT__FILE__NAME,'folder/')[1], ABC.* FROM ABC;
You can alter the split to change how much of the full path you actually want to store.

Pig: get data from hive table and add partition as column

I have a partitioned Hive table that I want to load in a Pig script, and I would also like to add the partition as a column.
How can I do that?
Table definition in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS transactions
(
column1 string,
column2 string
)
PARTITIONED BY (datestamp string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path';
Pig script:
%default INPUT_PATH '/path'
A = LOAD '$INPUT_PATH'
USING PigStorage('|')
AS (
column1:chararray,
column2:chararray,
datestamp:chararray
);
The datestamp column is not populated. Why is it so?
I am sorry, I didn't get the part that says add the partition as a column. Once created, partition keys behave like regular columns. What exactly do you need?
Also, you are loading the data directly from a given HDFS location, not as a Hive table. If you intend to use Pig to load/store data from/into a Hive table, you should use HCatalog.
For example :
A = LOAD 'transactions' USING org.apache.hcatalog.pig.HCatLoader();
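From there the partition key can be used like any other field; a rough sketch (the date literal is only an example):
B = FILTER A BY datestamp == '2019-04-16';
C = FOREACH B GENERATE column1, column2, datestamp;
DUMP C;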
