How to make a UNION in Hive over two EXTERNAL TABLES which point to the same file - hadoop

I'm trying to write a Hive script which creates two external tables, both pointing to the same file LOCATION but with different regular expressions (filters). When I try to make a UNION between them, the results aren't as expected.
The first chunk of code creates the tables:
CREATE EXTERNAL TABLE logsFormat1(col1 INT, col2 STRING, col3 INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "Regex1",
"output.format.string" = "%1$s %2$s %3$s")
STORED AS TEXTFILE
LOCATION '/user/.../directoryFile';
CREATE EXTERNAL TABLE logsFormat2(col1 STRING, col2 INT, col3 INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "Regex2",
"output.format.string" = "%1$s %2$s %3$s")
STORED AS TEXTFILE
LOCATION '/user/.../directoryFile';
The UNION statement just gets results from the last SELECT in a weird way:
SELECT l1.url FROM logsFormat1 l1 where l1.url is not null
UNION ALL
SELECT l2.url FROM logsFormat2 l2 where l2.url is not null
I discovered that this happens because both tables' LOCATIONs point to the same file. The problem is that I can't have two files; I need to solve this with a single file location because the real file is very large.

Finally I solved my problem by creating a TEMPORARY TABLE:
CREATE TEMPORARY TABLE TEMPlogsFormat1
STORED AS TEXTFILE
AS SELECT * FROM logsFormat1 l1
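For completeness, a minimal sketch of the rest of the workaround; the second temporary table and the final query are assumed rather than taken from the original post, and the selected columns are the STRING columns declared in the CREATE statements above:
CREATE TEMPORARY TABLE TEMPlogsFormat2
STORED AS TEXTFILE
AS SELECT * FROM logsFormat2 l2;
-- The UNION now runs over two independent temporary copies, so the two
-- RegexSerDe filters no longer read from the same file within one query.
SELECT t1.col2 FROM TEMPlogsFormat1 t1 WHERE t1.col2 IS NOT NULL
UNION ALL
SELECT t2.col1 FROM TEMPlogsFormat2 t2 WHERE t2.col1 IS NOT NULL;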

Related

Creating external hive table from parquet file which contains json string

I have a parquet file which is stored in a partitioned directory. The format of the partition is
/dates=*/hour=*/something.parquet.
The content of the parquet file looks as follows:
{a:1,b:2,c:3}.
This is JSON data and I want to create an external Hive table.
My approach:
CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
After that I run MSCK REPAIR TABLE test_table; but I get the following output:
hive> select * from test_table;
OK
NULL NULL NULL 2021-09-27 09
The other three columns are null. I think I have to define the JSON schema somehow, but I have no idea how to proceed further.
Create a table with the same schema as the parquet file:
CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
Run repair table to mount partitions:
MSCK REPAIR TABLE test_table;
Parse value in query:
select e.a, e.b, e.c
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a,b,c
Cast values as int if necessary: cast(e.a as int) as a
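Putting the steps together, a minimal sketch of the final query with the casts applied (column names as in the question):
SELECT cast(e.a AS int) AS a,
       cast(e.b AS int) AS b,
       cast(e.c AS int) AS c
FROM test_table t
LATERAL VIEW json_tuple(t.value, 'a', 'b', 'c') e AS a, b, c;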
You can also create a table with the JSON fields as columns using this:
CREATE EXTERNAL TABLE IF NOT EXISTS test_table(
a INT,
b INT,
c INT)
partitioned by (dates string, hour string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
location '/user/output/';
Then run MSCK REPAIR TABLE test_table;
You would then be able to query the table directly without writing any parsers.
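For example, a query against that table might look like the sketch below (the partition values are just the ones from the sample output above):
SELECT a, b, c
FROM test_table
WHERE dates = '2021-09-27' AND hour = '09';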

Insert data into an external table from another external table

While inserting data into external table-2 from external table-1, the data of external table-2 gets stored in /user/hive/warehouse/db-name/table-name/, but as an external table it should not store data in the warehouse directory, right?
Should we specify a location for storing the external table's data?
Yes, you will have to mention the location while creating the external table.
You can simply do it in the following way.
Create the tables table1 and table2:
CREATE EXTERNAL TABLE table1(col1 INT, col2 BIGINT,col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '<hdfs_location1>';
CREATE EXTERNAL TABLE table2(col21 INT, col22 BIGINT,col23 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '<hdfs_location2>';
Now insert the data from table1 into table2:
INSERT OVERWRITE TABLE table2(col21,col22,col23) SELECT * FROM table1
It will copy the data from table1 to table2's HDFS location.
Please note that CTAS (CREATE TABLE AS SELECT) is not supported for external tables.
Any table you create in Hive, whether internal or external, has its files placed under '/user/hive/warehouse' (or whatever you specify in hive.metastore.warehouse.dir in hive-site.xml) when no LOCATION is given.
An external table is created to prevent data loss when someone drops the table accidentally. Try creating two external tables and browsing the filesystem; you can easily understand the concept.
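A minimal sketch of that experiment, with hypothetical table names and paths and assuming the default warehouse directory:
CREATE EXTERNAL TABLE demo_with_location(col1 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/demo_with_location';    -- data files stay under this path
CREATE EXTERNAL TABLE demo_without_location(col1 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;                    -- no LOCATION: files land under the warehouse directory
-- From the Hive CLI, browse both directories:
dfs -ls /tmp/demo_with_location;
dfs -ls /user/hive/warehouse/demo_without_location;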
I think you have created external table-2 without specifying LOCATION. Try using the syntax below:
CREATE EXTERNAL TABLE [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

Hive Create Table Partitions from file name

New to Hadoop. I know how to create a table in Hive (the syntax).
I'm creating a table with 3 partition keys, but the keys are in the file names.
File name example: ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD
There are hundreds of files in a directory and I want to create a table with the following partition keys taken from the file name: ServerName, ApplicationName, Date, and load all the files into the table.
A Hive script would be the preference, but I'm open to any other ideas.
(The files are CSV, and I know the schema (column definitions) of the files.)
I assume the file name is in the format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (I removed the second "ApplicationName", assuming it to be a typo).
Create a table on the contents of the original file, something like:
create external table default.stack
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/location1...';
Create another partitioned table in another location, like:
create external table default.stack_part
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
PARTITIONED BY ( servername string, applicationname string, load_date string)
STORED AS AVRO -- you can choose any format for the final file
location 'hdfs://nameservice1/location2...';
Insert into the partitioned table from the base table using the query below:
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
set hive.exec.parallel=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Insert overwrite table default.stack_part
partition ( servername, applicationname, load_date)
select *,
split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[0] as servername
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[0] as applicationname
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[3] as load_date
from default.stack;
I have tested this and it works.
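As a quick sanity check of those split expressions, the same logic can be run against a literal path; the file name below is made up to match the assumed format, and the query relies on a Hive version that allows a constant-only subquery:
SELECT split(reverse(split(reverse(fn),"/")[0]),"_")[0]                 AS servername,       -- 'Server01'
       split(split(reverse(split(reverse(fn),"/")[0]),"_")[1],'[.]')[0] AS applicationname,  -- 'App'
       split(split(reverse(split(reverse(fn),"/")[0]),"_")[1],'[.]')[3] AS load_date         -- '2021-09-27'
FROM (SELECT 'hdfs://nameservice1/location1/Server01_App.1234.log.2021-09-27' AS fn) t;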

How to insert overwrite a table in Hive with different WHERE clauses?

I want to read a .tsv file from HBase into Hive. The file has a column family which has 3 columns inside: news, social and all. The aim is to store these columns in a table in HBase which has the columns news, social and all.
CREATE EXTERNAL TABLE IF NOT EXISTS topwords_logs (key String,
columnfamily String, wort String, col String, occurance int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/home/hfu/Testdaten';
load data local inpath '/home/hfu/Testdaten/part-r-00000.tsv' into table topwords_logs;
CREATE TABLE newtopwords (columnall int, columnsocial int , columnnews int) PARTITIONED BY(wort STRING) STORED AS SEQUENCEFILE;
Here I created an external table which contains the data from HBase. Further on I created a table with the 3 columns.
What I have tried so far is this:
insert overwrite table newtopwords partition(wort)
select occurance, '1', '1', wort from topwords_logs;
This code works fine, but I would need an extra WHERE clause for each column. How can I insert the data like this?
insert overwrite table newtopwords partition(wort)
values(columnall,(select occurance from topwords_logs where col =' all')),(columnnews,( select occurance from topwords_logs where col =' news')) ,(columnsocial,( select occurance from topwords_logs where col =' social')),(wort,(select wort from topwords_log));
This code isn't working -> NoViableAltException.
In every example I only see code where data is inserted without a WHERE clause. How can I insert data with a WHERE clause?
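A minimal sketch of one way to do this, assuming the goal is one row per wort with the three occurrence values pivoted into the three columns; this uses conditional aggregation rather than separate WHERE clauses, and the literal col values are taken from the question:
set hive.exec.dynamic.partition.mode=nonstrict;  -- if not already set
INSERT OVERWRITE TABLE newtopwords PARTITION(wort)
SELECT max(CASE WHEN col = 'all'    THEN occurance END) AS columnall,
       max(CASE WHEN col = 'social' THEN occurance END) AS columnsocial,
       max(CASE WHEN col = 'news'   THEN occurance END) AS columnnews,
       wort
FROM topwords_logs
GROUP BY wort;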

hive external partitioned table

First I created a Hive external table partitioned by code and date:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
and then I execute an insert overwrite on this table, taking data from another table:
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
and after that I count the number of records in Hive:
select count(*) from XYZ;
It shows me that 1000 records are there.
Then I rename/move the location '/old_work/XYZ' to '/new_work/XYZ', drop the XYZ table, and create it again pointing the location to the new directory, '/new_work/XYZ':
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But then when I execute select count(*) from XYZ in Hive, it shows 0 records.
I think I missed something; please help me with this.
You need not drop the table and re-create it the second time.
As soon as you move or rename the external HDFS location of the table, just do this:
msck repair table <table_name>
In your case the error occurred because the Hive metastore wasn't updated with the new path.
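In this case, a minimal sketch of the fix once the data sits under the new location:
-- Re-register the CODE=/DATE= partition directories with the metastore
MSCK REPAIR TABLE XYZ;
-- The count should now reflect the data under /new_work/XYZ
SELECT COUNT(*) FROM XYZ;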
