hive difference between insert and load data - hadoop

I am new to Hadoop and Hive, and I am confused about Hive's INSERT INTO and LOAD DATA statements.
When I execute INSERT INTO TABLE_NAME (field1, field2) VALUES (value1, value2);, HiveServer runs a MapReduce job.
When I execute LOAD DATA LOCAL INPATH PATH_TO_MY_DATA INTO TABLE TABLE_NAME;, it only loads the data from the file and does nothing else.
I wrote a program in Python. Here is my problem: if I use pyhs2 and an INSERT statement to save data records, each record triggers its own MapReduce job, and it is very slow.
Should I first save my data somewhere and then use the LOAD DATA statement to load it?

Load
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Insert
Query results can be inserted into tables by using the INSERT clause.
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
With LOAD, all of the data in the file is copied into the table; with INSERT, you can insert data selectively, based on some condition.
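For example, a minimal sketch of a conditional insert (target_table, source_table, and the columns are placeholder names, not from the question):
-- copies only the rows matching the condition; this runs a MapReduce job
INSERT INTO TABLE target_table
SELECT field1, field2
FROM source_table
WHERE field2 > 100;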
Your solution
You execute your HQL statement for every single row, so a MapReduce job runs each time.
If you want to insert all the rows in a single MapReduce job, build one multi-row statement:
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
Create a single query like this and execute it. If you have many records, split them into batches of this form.

INSERT OVERWRITE will overwrite any existing data in the table or partition, and INSERT INTO will append to the table or partition, keeping the existing data (ref at apache.org).
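To answer the staging question from the original post: yes, writing the rows to a delimited file and then loading the file in one shot avoids the per-row MapReduce jobs. A minimal sketch, assuming a hypothetical table my_table whose field delimiter matches the staged file at /tmp/staged_rows.csv:
-- the table must be a text table with the same delimiter as the staged file
CREATE TABLE IF NOT EXISTS my_table (field1 STRING, field2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- appends the file contents; this is a file move/copy, no MapReduce job is launched
LOAD DATA LOCAL INPATH '/tmp/staged_rows.csv' INTO TABLE my_table;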

Related

Is there a way to prevent a Hive table from being overwritten if the SELECT query of the INSERT OVERWRITE does not return any results

I am developing a batch job that loads data into Hive tables from HDFS files. The flow of data is as follows:
Read the file received in HDFS using an external Hive table
INSERT OVERWRITE the final Hive table from the external Hive table, applying certain transformations
Move the received file to Archive
This flow works fine if there is a file in the input directory for the external table to read during step 1.
If there is no file, the external table will be empty and as a result executing step 2 will empty the final table. If the external table is empty, I would like to keep the existing data in the final table (the data loaded during the previous execution).
Is there a Hive property that I can set so that the final table is overwritten only if we are overwriting it with some data?
I know that I can check whether the input file exists using an HDFS command and conditionally launch the Hive queries. But I am wondering if I can achieve the same behavior directly in Hive, which would help me avoid this extra verification.
Try adding a dummy partition to your table, say LOAD_TAG, and use a dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE your_table PARTITION(LOAD_TAG)
select
col1,
...
colN,
'dummy_value' as LOAD_TAG
from source_table;
The partition value should always be the same in your case.
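This works because a dynamic-partition INSERT OVERWRITE only replaces partitions for which the SELECT actually returns rows; if the external table is empty, no partition is written and the existing LOAD_TAG partition keeps its data. For illustration, a sketch of what the target table definition might look like (the data columns are placeholders):
-- hypothetical schema: data columns plus the dummy partition column
CREATE TABLE your_table (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (LOAD_TAG STRING);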

Loading more records than actual in Hive

While inserting from one Hive table into another Hive table, it is loading more records than actually exist. Can anyone help with this weird behaviour of Hive?
My query looks like this:
insert overwrite table table_a
select col1,col2,col3,... from table_b;
My table_b consists of 6405465 records.
After inserting from table_b into table_a, I found that the total number of records in table_a is 6406565.
Can anyone please help here?
If hive.compute.query.using.stats=true, then the optimizer uses statistics for query calculation instead of querying the table data. This is much faster because the metastore is a fast database such as MySQL and no MapReduce is required. But the statistics can be stale if the table was loaded by something other than INSERT OVERWRITE, or if hive.stats.autogather, the configuration parameter responsible for automatic statistics gathering, was set to false. Statistics will also be stale after loading files directly or after using third-party tools (Sqoop loads, etc.), because those files were never analyzed; if you have put new files in place, nothing in the metastore knows how the data changed. So it is good practice to gather statistics for the table or partition after such loads using ANALYZE TABLE ... COMPUTE STATISTICS.
If it is impossible to gather statistics automatically (this works for INSERT OVERWRITE) or by running the ANALYZE statement, then it is better to switch off the hive.compute.query.using.stats parameter; Hive will then query the data instead of using the statistics.
See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive
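For example, a minimal sketch using the table name from the question:
-- gather fresh table-level statistics after a load that bypassed autogather
ANALYZE TABLE table_a COMPUTE STATISTICS;
-- or, if keeping statistics fresh is not practical, make the optimizer read the data
set hive.compute.query.using.stats=false;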

Hive reading while inserting

What would happen if I try to read from a Hive table while someone is concurrently inserting into it?
Does it lock the files while querying, or does it do a dirty read?
Hadoop is meant for parallel processing, so in a cluster, parallel querying can be done on a Hive table.
While data is being inserted into the table, if another user queries the same table, the files are not locked; rather, the job is accepted and put into execution.
Now if the insert of the new data completes before the second query is processed, the result of the second query will include the inserted data.
Note: In most cases data insertion takes less time than querying a table, because querying creates MapReduce jobs that run in the background.

HQL - How to copy/move data in a few partitions from one table to another

I have a table (main_table) which is partitioned and stores a history of records, with a flag to indicate whether the record is deleted or not. I have another table (del_table), which has the same schema as main_table but stores only the records deleted on a given day (delete_flag='Y').
As a process, I need to move the records available in del_table to main_table on a daily basis. I am trying to write a LOAD DATA INPATH command that could move the data in the respective partitions of del_table to the corresponding partitions of main_table, but none of my attempts seems to work. Please let me know if it is possible to achieve this with the LOAD DATA INPATH command, without specifying individual partitions.
I am trying the steps below, but it is failing at the second step:
Set the nonstrict Hive property, then:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
Do you have to use LOAD DATA INPATH?
If not, you could do the following: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
This would regenerate the whole table daily though.
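A hedged sketch of that dynamic-partition insert, assuming the partition column is named partition_col_name as in the question and using placeholder data columns (adjust to your schema):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- the partition column must be selected last so Hive can route each row
-- to the matching partition of main_table
INSERT INTO TABLE main_table PARTITION (partition_col_name)
SELECT col1, col2, partition_col_name
FROM del_table;
Note that with INSERT INTO (as opposed to INSERT OVERWRITE) the statement only appends to the partitions present in del_table.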
Try setting this property first:
set hive.exec.dynamic.partition.mode=nonstrict;
then run your command:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
For more info you can refer to this link: Partition

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of Hadoop & Hive (& Impala) as a replacement for a large data warehouse. I have already set up a version, and read performance is great.
Can somebody give me a hint about what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now new transactional data comes in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible. HDFS cannot append. So what is the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date based searches and gives you the ability to use options 1, 2, & 3 at either the table or partition level.
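For example, a minimal sketch of option 3 combined with date partitioning (the table name, columns, and HDFS path are placeholders, not from the question):
-- hypothetical daily load: each delivery lands in its own date partition
CREATE TABLE IF NOT EXISTS transactions (
  txn_id STRING,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/staging/transactions/2015-06-01.csv'
INTO TABLE transactions PARTITION (dt='2015-06-01');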
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old table.
But the simple solution is that you can load the file's data into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use OVERWRITE, the existing data is replaced with the new data; otherwise the data is appended.
You can even schedule this by wrapping the statement in a shell script.
