I'm currently implementing an ETL process (Talend) that loads monitoring data into HDFS and a Hive table.
I am now facing a concern about duplicates: if we run the same ETL job twice with the same input, we end up with duplicates in our Hive table.
The solution in an RDBMS would have been to store the input file name and to "DELETE WHERE file name=..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like your advice on how to handle this. I envisage two solutions:
Currently, the ETL puts CSV files onto HDFS, and these are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file? If so, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not familiar with Hive's ACID capability (available since Hive 0.14). Would you recommend enabling ACID in Hive? Would I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Best regards,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
  (SELECT 1
   FROM trg x
   WHERE x.key = src.key
     AND <<additional filter on target to reduce data volume>>
  )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys from the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap fits in the RAM available for the Mapper heap (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicates.
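For reference, these are the kinds of settings you might tune in the Hive script if the HashMap does not fit; the exact property names and sensible values depend on your Hadoop distribution and Hive version, so treat this as a sketch rather than a recommendation:
set hive.auto.convert.join=true;                -- let Hive convert the sub-query into a MapJoin
set hive.mapjoin.smalltable.filesize=50000000;  -- size threshold (bytes) below which MapJoin conversion is allowed
set mapreduce.map.memory.mb=4096;               -- container memory for Mappers
set mapreduce.map.java.opts=-Xmx3686m;          -- JVM heap inside the container (keep it below the container size)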
And in your actual use case you don't have to check each key, only a "batch ID", more precisely the original file name; the way I did it in a previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM trg x
   WHERE x.original_file_name = src.INPUT__FILE__NAME
     AND <<additional filter on target to reduce data volume>>
  )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters, so the overhead will stay low.
Note the explicit "DISTINCT" in the sub-query: a mature DBMS optimizer would do this automatically at execution time, but Hive does not (not yet), so you have to force it. Note also that the "1" is just a dummy value required by the "SELECT" semantics; again, a mature DBMS would allow a dummy "null", but some versions of Hive would crash (e.g. with Tez in V0.14), so "1" or "'A'" are safer.
References:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself: I found a solution.
I partitioned my table by (date, input_file_name). (Note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive.)
Once I did this, before running the ETL I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (file_name=...) so that the folder containing the input data is deleted if this input file has already been loaded into the ORC table.
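A minimal sketch of that approach; the table, column and file names below are made up for illustration and are not the poster's actual schema:
CREATE TABLE IF NOT EXISTS trg_orc (
  metric_name  STRING,
  metric_value DOUBLE
)
PARTITIONED BY (load_date STRING, file_name STRING)
STORED AS ORC;

-- Before (re)loading a file, drop its partition if it already exists
ALTER TABLE trg_orc DROP IF EXISTS PARTITION (load_date='2016-01-15', file_name='input_20160115.csv');

-- Reload with dynamic partitioning; the basename is extracted from the virtual column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE trg_orc PARTITION (load_date, file_name)
SELECT metric_name,
       metric_value,
       '2016-01-15'                                    AS load_date,
       regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0)  AS file_name
FROM src_csv;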
Thank you everyone for your help.
Cheers,
Orlando
Related
I have an external table in Hive pointing to an HDFS location. By mistake, I ran the job that loads the data into HDFS twice.
Even after deleting the duplicate file from HDFS, Hive shows the count twice (i.e. it still includes the count of the deleted duplicate data file).
select count(*) from tbl_name   -- returns double the count
But:
select count(col_name) from tbl_name   -- returns the actual count
For the same table, when I queried it from Impala after running
INVALIDATE METADATA
I could see only the count of the data actually available in HDFS (no duplicates).
How can Hive report double the count even after the file was deleted from the physical location (HDFS)? Does it read from statistics?
Hive is using statistics for computing count(*). You deleted files manually (not through Hive), which is why the statistics are wrong.
The solution is either:
to switch off statistics usage in such cases:
set hive.compute.query.using.stats=false;
or to analyze the table, as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;
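A minimal check sequence illustrating both options (the table and partition names are placeholders):
set hive.compute.query.using.stats=false;                     -- force a real scan instead of reading stats
select count(*) from tbl_name;                                -- now reflects the files actually in HDFS
analyze table tbl_name partition(a,b,c) compute statistics;   -- refresh the stored statistics
set hive.compute.query.using.stats=true;                      -- re-enable fast, stats-based counts
select count(*) from tbl_name;                                -- the stats-based count is now correct again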
I have external tables in Hive. When I run a select count(*) from table_name query, it returns instantaneously and gives a result that I think is already stored somewhere, and that result is not correct. Is there a way to force a MapReduce job so the query actually executes each time?
Note: This behavior is not seen for all external tables, only some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging, I found a method that kicks off a MapReduce job to count the number of records in an ORC table.
ANALYZE TABLE table_name PARTITION (partition_columns) COMPUTE STATISTICS;
-- or
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative to count(*), but it does report the latest count of records in the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. It would work only if the data were stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce job for count(*) on an ORC table, since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
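For example, to dump the metadata of a single ORC file (row counts per stripe included); the path below is only a placeholder:
hive --orcfiledump /user/hive/warehouse/mydb.db/mytable/000000_0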
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows in the first data file only. If the table was fed by multiple INSERTs, then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- it takes much longer, but it actually counts the rows.
With 0.14 the optimizer got smarter, and you must add a non-deterministic clause, e.g. "where MYKEY is null". That assumes MYKEY is a String; otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the following: run
hive> set hive.fetch.task.conversion=none;
in your Hive session and then trigger the select count(*) operation in the same session to force a MapReduce job.
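Combining that with the statistics setting mentioned earlier, a full-scan count could look like this (a sketch only; defaults for these properties vary by Hive version):
set hive.fetch.task.conversion=none;          -- disable the fetch-task shortcut
set hive.compute.query.using.stats=false;     -- ignore stored statistics
select count(*) from table_name;              -- now runs as a real MapReduce job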
In the HBase shell, when I use describe 'table_name', only the column families are returned. How can I find out all the columns in each column family?
As @zsxwing said, you need to scan all the rows, since in HBase each row can have a completely different schema (that's part of the power of Hadoop: the ability to store poly-structured data). You can look at the HFile file structure and see that HBase doesn't track the columns.
Thus the column family(s) and their settings are in fact the schema of the HBase table, and that's what you get when you 'describe' it.
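An informal way to see which column qualifiers are actually in use is to scan a small sample of rows from the shell and look at the qualifiers that appear; the LIMIT below is just an example, and for an exhaustive answer you would need a full scan or a MapReduce job:
hbase> scan 'table_name', {LIMIT => 10}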
I am evaluating the combination of Hadoop and Hive (and Impala) as a replacement for a large data warehouse. I have already set up a version, and read performance is great.
Can somebody give me a hint about what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS, but now I have new transactional data coming in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible. HDFS cannot append. So what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table; the underlying implementation of how this happens in HDFS doesn't matter. There are a number of ways you can append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date-based searches, and it gives you the ability to use options 1, 2, and 3 at either the table or partition level; see the sketch below for an example.
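A minimal sketch of option 3 combined with partitioning; the table, column and path names are made up for illustration:
CREATE TABLE IF NOT EXISTS transactions (
  txn_id STRING,
  amount DOUBLE
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Append today's delivery into its own partition; use LOAD DATA INPATH ... OVERWRITE INTO TABLE ... to replace the partition instead of appending
LOAD DATA INPATH '/staging/transactions/2013-01-23.csv'
INTO TABLE transactions PARTITION (ds='2013-01-23');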
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old one.
But the simple solution is that you can load the file's data into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite, the existing data is replaced with the new data; otherwise the data is appended.
You can even schedule it by wrapping the command in a shell script.
I have a design question. In my CDH 4.1.2 (Cloudera) installation, I have daily rolling log data dumped into HDFS, and I have some reports to calculate the success and failure rates per day.
I have two approaches:
Load the daily log data into Hive tables and create a complex query.
Run a MapReduce job upfront every day to generate the summary (which is essentially a few lines) and keep appending it to a common file, which is a Hive table. Later, while running the report, I could use a simple select query to fetch the summary.
I am trying to understand which of the two would be the better approach, or whether there is a better one altogether.
The second approach adds some complexity in terms of merging files. If the files are not merged, I would have lots of very small files, which seems to be a bad idea.
Your inputs are appreciated.
Thanks
Hive seems well suited to this kind of task, and it should be fairly simple to do:
Create an EXTERNAL table in Hive, partitioned by day. The goal is that the directory where you dump your data sits directly under your Hive table location. You can specify the field delimiter of your daily logs as shown below, where I use commas:
create external table mytable(...) partitioned by (day string) row format delimited fields terminated by ',' location '/user/hive/warehouse/mytable';
When you dump your data into HDFS, make sure you put it in a subdirectory named day=<date> under the table location so it can be recognized as a Hive partition, for example /user/hive/warehouse/mytable/day=2013-01-23.
You then need to let Hive know that this table has a new partition:
alter table mytable add partition (day='2013-01-23');
Now that the Hive metastore knows about your partition, you can run your summary query. Make sure you query only that partition by specifying ... where day='2013-01-23'.
You could easily script this to run daily from cron (or another scheduler), getting the current date (for example with the shell date command) and substituting it into a shell script that performs the steps above.
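A possible sketch of such a driver script; the local paths, the status column and the summary query are assumptions, not part of the original setup:
#!/bin/bash
# Hypothetical daily loader for the partitioned external table described above
DAY=$(date +%F)                                    # e.g. 2013-01-23
hdfs dfs -mkdir -p /user/hive/warehouse/mytable/day=${DAY}
hdfs dfs -put /local/logs/${DAY}.csv /user/hive/warehouse/mytable/day=${DAY}/
hive -e "alter table mytable add if not exists partition (day='${DAY}');
         select status, count(*) from mytable where day='${DAY}' group by status;"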