At my organization we are trying to use Hive or Pig as an alternative to Netezza.
Primary goal: reduce process time.
Current Netezza process time: 90 min.
Target: finish the process within 30 min.
How the process works:
The process maintains incremental history. There are two tables, history_table and new_table. The history table maintains the total history and new_table holds the updated records, so every day the updated records are added to the history table. The process has very complex stored procedures (joins/deletes/inserts/updates), and the same process is applied to multiple tables. Every history table has on the order of billions of records.
Doubts I have:
Does Hive/Pig perform better than Netezza?
Is a UDF in Hive a good alternative to a stored procedure, given that I want to create a generic process for multiple tables (where I can pass the table name as an argument)?
Which performs better, Hive or Pig, for really complex joins with multiple conditions, dynamically generated CREATE statements, and exception handling?
Use Impala, which is (roughly speaking) Netezza on Hadoop. Try Kudu if you need both real-time and batch access, or use HBase for real time and Impala for batch; note that you can also query HBase through Impala.
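For what it's worth, the daily merge itself maps fairly directly onto HiveQL (or Impala SQL). Below is a minimal sketch, assuming both tables share the same schema, a hypothetical key column id, and a hypothetical staging table history_table_staging that gets swapped in as the new history once the load succeeds:

-- Keep every history row whose key was NOT updated today, then append today's updated rows.
INSERT OVERWRITE TABLE history_table_staging
SELECT u.* FROM (
    SELECT h.* FROM history_table h
    LEFT OUTER JOIN new_table n ON h.id = n.id
    WHERE n.id IS NULL              -- history rows that were not updated today
    UNION ALL
    SELECT n.* FROM new_table n     -- today's new/updated rows
) u;

To keep the process generic across tables, the table names can be passed in as Hive variables (--hivevar) from a wrapper script rather than through a stored-procedure-style UDF.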
I had a case where we have a Hive table created, and when we run select * from table where <condition>; it returns results immediately without invoking an MR job. When I create a duplicate of the same table and try to execute the same query, MR is invoked. What could be the possible reason for this?
I got the answer. The reason was that the Hive ANALYZE command had been issued on the table. Once you execute a Hive ANALYZE command, it stores the number of rows and the file size in the Hive metastore. So when you do select count(*) from table, it fetches the result directly from the Hive metastore instead of invoking a MapReduce job.
You can also issue an ANALYZE command on columns as well:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
Documentation link:
https://cwiki.apache.org/confluence/display/Hive/StatsDev
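For example, against a hypothetical table mydb.mytable the commands could look like this:

ANALYZE TABLE mydb.mytable COMPUTE STATISTICS;              -- table-level stats (row count, file size) stored in the metastore
ANALYZE TABLE mydb.mytable COMPUTE STATISTICS FOR COLUMNS;  -- column-level stats (Hive 0.10.0 and later)

Note that whether a simple select count(*) is then answered straight from the metastore also depends on hive.compute.query.using.stats being enabled in your Hive version.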
Local mode (hive not invoking MR) depends on several conditions (see HIVE-1408):
hive.exec.mode.local.auto=true/false - Lets Hive determine whether to run in local mode automatically.
hive.exec.mode.local.auto.input.size.max=1G - When hive.exec.mode.local.auto is true, input bytes should be less than this for local mode.
hive.exec.mode.local.auto.input.files.max=4 - When hive.exec.mode.local.auto is true, the number of tasks should be less than this for local mode.
If the tables have the same data, my guess is that there is a difference in the number of tasks spawned when querying the two tables, causing one query to run in local mode and the other to spawn an MR job.
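You can check or override these properties for a session from the Hive CLI; the threshold values below are only illustrative:

SET hive.exec.mode.local.auto=true;                       -- let Hive pick local mode for small inputs
SET hive.exec.mode.local.auto.input.size.max=134217728;   -- max total input bytes for local mode (128 MB here)
SET hive.exec.mode.local.auto.input.files.max=4;          -- max number of input files for local mode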
I recently did an integration between Hive and HBase. I created a Hive table with the HBase SerDe, and when I insert records into the Hive table they get loaded into the HBase table. I am trying to understand what happens if the insert into the Hive-HBase table fails partway through (HBase service fails / network issue). I assume the records which have already been loaded into HBase will stay there, and when I rerun the operation I will have two copies of the data with different timestamps (assuming that out of 20K records, 10K were inserted before the failure occurred).
What is the best way to insert records into HBase?
Can Hive provide a check to see whether the data is already there?
Is MapReduce the best option for scenarios like these? I could write a MapReduce program that reads data from Hive and checks record by record in HBase before inserting, which would make sure there are no duplicate writes.
Any help on this would be greatly appreciated.
Yes, you will have two versions of the data when you rerun the load operation. But that's OK, since the extra version will get cleaned up on a subsequent major compaction. As long as your inserts are idempotent (which they most likely are), you won't have a problem.
At Lithium+Klout, we use a custom-built HBase SerDe which writes HFiles instead of using Puts to insert the data. So we generate the HFiles and use the bulk load tool to load all of the data after the job has completed. That's another way you can integrate Hive and HBase.
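For reference, here is a minimal sketch of the Put-based mapping the question describes; the table, column and HBase table names are hypothetical. The idempotency mentioned above comes from the row key mapping: re-running the insert writes to the same HBase row keys, just with newer timestamps, rather than creating duplicate rows.

CREATE TABLE hive_hbase_table (rowkey STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:col2")
TBLPROPERTIES ("hbase.table.name" = "hbase_table");

-- Rerunning this after a partial failure overwrites the same row keys instead of duplicating rows.
INSERT INTO TABLE hive_hbase_table
SELECT id, col1, col2 FROM source_hive_table;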
My understanding is that Hive is an SQL-like language that performs database-related tasks by invoking underlying MapReduce programs. However, I learned that some Hive commands do not invoke a MapReduce job. I am curious to know what these commands are, and why they do not need to invoke a MapReduce job.
You are right, Hive uses MR jobs in the background to process the data.
When you fire a SQL-like query in Hive, it converts it into one or more MR jobs in the background and gives you the result.
Having said that, there are a few queries that don't need MR jobs.
For example:
SELECT * FROM table LIMIT 10;
If you look at the above query, we don't need any data processing; all we need is to read a few rows from the table.
So the above Hive query doesn't fire an MR job.
But if we slightly modify the query:
SELECT COUNT(*) FROM table;
It will fire MR jobs, because we need to read all the data for this query, and an MR job will do it for us quickly (parallel processing).
Since a Hive table is stored as files in HDFS, Hive saves processing time and effort for operations like 'select *' and 'select * ... limit' by avoiding MapReduce calls and directly fetching the whole file, or part of it, from HDFS and displaying it to the user.
This default behavior can be changed via the hive.fetch.task.conversion property in hive-site.xml so that MapReduce programs are invoked even for these operations.
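For example, from the CLI (supported values are none, minimal and more; the default depends on the Hive version):

SET hive.fetch.task.conversion=none;   -- always compile a job, even for simple SELECT/LIMIT queries
SET hive.fetch.task.conversion=more;   -- fetch directly from HDFS for simple SELECTs with filters and LIMIT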
What would happen if I'm trying to read from a Hive table while someone is concurrently inserting into it?
Does it lock the files while querying them, or does it do a dirty read?
Hadoop is meant for parallel processing, so in a cluster, parallel queries can be run against a Hive table.
While data is being inserted into the table, if another user queries the same table, the files are not locked; rather, the query job is accepted and put into execution.
If the new data insert completes before the second query is processed, the result of the second query will reflect the inserted data.
Note: In most cases data insertion takes less time than querying a table, because querying creates MR jobs that run in the background.
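If you do need lock-based coordination, Hive ships an optional lock manager; once it is configured (typically ZooKeeper-backed, see the Hive docs) you can inspect locks from the CLI. A rough sketch against a hypothetical table my_table:

SET hive.support.concurrency=true;   -- only effective if a lock manager has been configured
SHOW LOCKS my_table;                 -- list locks currently held on the table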
I have a design question: in my CDH 4.1.2 (Cloudera) installation, daily rolling log data is dumped into HDFS. I have some reports to calculate the success and failure rates per day.
I have two approaches:
Load the daily log data into Hive tables and create a complex query at report time.
Run a MapReduce job up front every day to generate the summary (which is essentially a few lines) and keep appending it to a common file backed by a Hive table. Later, while running the report, I could use a simple select query to fetch the summary.
I am trying to understand which of the two would be the better approach, or whether there is a better one.
The second approach adds some complexity in terms of merging files. If they are not merged, I would end up with lots of very small files, which seems like a bad idea.
Your inputs are appreciated.
Thanks
Hive seems well suited to this kind of task, and it should be fairly simple to do:
Create an EXTERNAL table in Hive, partitioned by day. The goal is that the directory where you dump your data sits directly under your Hive table's location. You can specify the delimiter of the fields in your daily logs as shown below, where I use commas:
create external table mytable(...) partitioned by (day string) row format delimited fields terminated by ',' location '/user/hive/warehouse/mytable';
When you dump your data into HDFS, make sure you dump it into a day=<value> subdirectory of that location so it can be recognized as a Hive partition, for example /user/hive/warehouse/mytable/day=2013-01-23.
You then need to let Hive know that this table has a new partition:
alter table mytable add partition (day='2013-01-23');
Now that the Hive metastore knows about your partition, you can run your summary query. Make sure you're only querying that partition by specifying ... where day='2013-01-23'.
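As an illustration, the daily summary could be as simple as the sketch below; the status column and its 'SUCCESS' value are hypothetical, so adapt them to your log format:

SELECT day,
       SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
       SUM(CASE WHEN status <> 'SUCCESS' THEN 1 ELSE 0 END) AS failures
FROM mytable
WHERE day = '2013-01-23'
GROUP BY day;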
You could easily script this to run daily from cron or a similar scheduler: get the current date (for example with the shell date command) and substitute it into a shell script that performs the steps above.