Refresh hive tables in Hive - hadoop

I have a few tables in Hive, and every day a new CSV file is added to the Hive table location. When new data is available I need to refresh the tables so that I can see the new data in them.
Steps we follow to load the data:
first, create a table with CSV SerDe properties
then create another table stored as Parquet for use in production
insert the data from the first table into the second table.
Initial:
1,a
2,b
3,c
New file:
4,d
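For reference, a minimal sketch of the load pattern described above (table names, column names, and the HDFS location are illustrative only):

-- staging table over the daily CSV files
CREATE EXTERNAL TABLE staging_csv (id STRING, val STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/data/staging_csv';
-- production table stored as Parquet
CREATE TABLE prod_parquet (id INT, val STRING)
STORED AS PARQUET;
-- copy the staged rows into the Parquet table (OpenCSVSerde exposes everything as strings, hence the cast)
INSERT INTO TABLE prod_parquet SELECT CAST(id AS INT), val FROM staging_csv;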
I searched on Google and found this can be done via:
1) an incremental table: load the new file into an incremental table and do an insert statement. In my case we have more than 100 tables, so I do not want to create that many incremental tables.
2) using the REFRESH command via the Impala shell.
Our initial tables are stored in CSV SerDe format, so when I run REFRESH on the initial tables I get an error that Impala doesn't support the SerDe properties.
Can you please suggest a solution for my case?

Related

How to delete fields from a partitioned table in Hive stored as parquet?

I'm looking for a way to modify a Parquet table in Hive to remove some fields. The table is managed, but that doesn't matter because I can convert it to external.
The problem is that I cannot use ALTER TABLE ... REPLACE COLUMNS with partitioned Parquet tables.
It works well for the textfile format (partitioned or not), but for Parquet it only works on non-partitioned tables.
I've tried to replace a column, but this is the result:
hive> ALTER TABLE db_test.mytable REPLACE COLUMNS(name String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Replacing columns cannot drop columns for table db_test.mytable.
SerDe may be incompatible
I've thought about some solutions, but none of them fits my scenario:
First
- [Optional] Convert the table to external.
- Delete the table.
- Re-create the table with the fields that I want.
- MSCK REPAIR TABLE to add HDFS partitions.
- [Optional] Convert back to managed table.
Second
- Create a temporary table as a selection of the original table with the fields that I choose.
- Delete the original table.
- Rename the temporary table to the original name.
Both options affect my process because I would lose the statistics of my table. This table is consumed by MicroStrategy through Impala, and I need to maintain the statistics.
In addition, the second solution is impractical for very large tables.
Any suggestions?
Thanks in advance.
You can use the first method and then run
hive> analyze table <db_name>.<table_name> compute statistics;
to compute all the statistics of the table.
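A sketch of that first approach end to end, assuming the table is stored as Parquet and partitioned by a hypothetical column dt, with an illustrative HDFS location:

ALTER TABLE db_test.mytable SET TBLPROPERTIES ('EXTERNAL'='TRUE');  -- keep the data files when dropping
DROP TABLE db_test.mytable;
-- re-create the table with only the columns to keep, over the existing location
CREATE EXTERNAL TABLE db_test.mytable (name STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/warehouse/db_test.db/mytable';
-- re-register the existing HDFS partitions, then recompute statistics
MSCK REPAIR TABLE db_test.mytable;
ANALYZE TABLE db_test.mytable PARTITION (dt) COMPUTE STATISTICS;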

Insert partitioned data into partitioned hive table

I have stored the data in HDFS using Pig MultiStorage, keyed on the column id.
So the data is stored as
/output/1/part-0000
/output/2/
/output/3/
Now I have created a partitioned table in hive and I want to load the data from /output folder into this partitioned table. Is there any way to achieve this?
First create a temp Hive table and load all the data from the Pig output into it.
Then load your actual partitioned Hive table from the temp table.
Something like below:
FROM emp_external temp INSERT OVERWRITE TABLE emp_partition PARTITION(country) SELECT temp.id,temp.name,temp.dept,temp.sal,temp.country;
Otherwise you can explore HCatalog for this case.
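A minimal end-to-end sketch of the temp-table route above, assuming comma-delimited Pig output and illustrative column types:

-- temp table over the Pig output (if the output sits in nested sub-directories, Hive may need recursive input settings to read it)
CREATE EXTERNAL TABLE emp_external (id INT, name STRING, dept STRING, sal DOUBLE, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/output';
-- target table partitioned by country
CREATE TABLE emp_partition (id INT, name STRING, dept STRING, sal DOUBLE)
PARTITIONED BY (country STRING);
-- allow dynamic partitioning for the insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM emp_external temp
INSERT OVERWRITE TABLE emp_partition PARTITION (country)
SELECT temp.id, temp.name, temp.dept, temp.sal, temp.country;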
Not sure if you are looking to insert the data in the output folder (created by Pig) into an existing table, or to load the data in the output folder into a new partitioned Hive table.
If you want to load the data into a new Hive table, you can create a new partitioned table pointing to the output folder.
If you are looking to load the data into an existing Hive table, then you can either create a temp table as #Aman mentioned and do an insert into the destination table
or
You can just move/copy the files in HDFS from /output to the Hive table location.
Hope this helps
Assign a Hive schema to the Pig output location, with id as the partition column (ALTER TABLE ... ADD PARTITION). Now both are Hive tables, and you can use a WHERE clause over the partition column to move the data over.
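A possible sketch of this, with assumed column names and types, mapping the /output/<id> directories from the question onto partitions:

CREATE EXTERNAL TABLE pig_output (col1 STRING, col2 STRING)
PARTITIONED BY (id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/output';
-- register each Pig output directory as a partition
ALTER TABLE pig_output ADD PARTITION (id=1) LOCATION '/output/1';
ALTER TABLE pig_output ADD PARTITION (id=2) LOCATION '/output/2';
ALTER TABLE pig_output ADD PARTITION (id=3) LOCATION '/output/3';

The data can then be selected with a WHERE over id and inserted into the destination table.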

Refresh one hive table from another hive table

I have a few Hive tables that I am bringing in from an RDBMS using Sqoop incremental imports every hour and staging them. I join these tables and create new dimension tables. Whenever I bring in new rows from the RDBMS into the Hive staging tables, I have to refresh the dimension tables. If there are no new rows, the refresh of the dimension tables should not be done. The Hive version I'm using does not have ACID features.
Need some advice on how this could be achieved in Hive.
You can INSERT new data into existing Hive tables, like in any other database. And Hive also supports the WHERE NOT EXISTS clause.
INSERT INTO TABLE MyDim
SELECT Id, Blah1, Blah2
FROM MySource s
WHERE NOT EXISTS
(SELECT 1 FROM MyDim z WHERE z.Id = s.Id)
But there is a catch: each INSERT will create a new HDFS file, even when there are zero records involved. Too much fragmentation will reduce performance over time.
A weekly "compaction" job would be helpful (e.g. rename the fragmented table, re-CREATE the table, INSERT OVERWRITE from renamed table, drop renamed)

Load from HIVE table into HDFS as AVRO file

I want to load a file into HDFS (as an .avro file) from a Hive table.
Currently I am able to move a table as a file from Hive to HDFS, but I am not able to specify a particular format for my target file. Can someone help me with this?
So your question is really
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types as the old table, but with Avro as its storage format. Then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable
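For the Avro case specifically, a minimal sketch (assuming Hive 0.14 or later, where STORED AS AVRO is available; table and column names are illustrative):

-- new table with the same columns as the source, but stored as Avro
CREATE TABLE mytable_avro (id INT, name STRING)
STORED AS AVRO;
INSERT OVERWRITE TABLE mytable_avro SELECT * FROM mytable;

The Avro data files then sit under the new table's location in HDFS.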

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of Hadoop & Hive (& Impala) as a replacement for a large data warehouse. I have already set up a version, and read performance is great.
Can somebody give me a hint as to what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now new transactional data comes in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible, and HDFS cannot append, so what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date based searches and gives you the ability to use options 1, 2, & 3 at either the table or partition level.
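As a sketch combining the LOAD DATA and partitioning options above (table name, columns, and paths are illustrative), a date-partitioned table with one partition per daily delivery could look like:

CREATE TABLE transactions (id INT, amount DOUBLE)
PARTITIONED BY (load_date STRING);
-- each day, load that day's delivery into its own partition
-- (LOAD DATA only moves files, so the delivery files must already match the table's storage format)
LOAD DATA INPATH '/incoming/2016-05-01' INTO TABLE transactions PARTITION (load_date='2016-05-01');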
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old table.
But the simple solution is that you can load the file's data into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite then the existing data is replaced with the new data; otherwise the data is just appended.
You can even schedule this by putting the command in a shell script.
