Using Hive in real world applications? - hadoop

I am a newbie to the Hadoop stack; I have learned MapReduce and now Hive.
But I am not sure how Hive is actually used.
In MapReduce we have one or more output files and that's our final result, but in Hive we can select records using SQL-like queries (HQL), yet we don't get any final output file; results are shown on the terminal only.
Now my question is: how can we use these HQL SELECT results so that they can be used by some other analytics team?

There are a lot of ways to extract/export the Hive query result.
If you want the result in RDBMS storage, you can use Sqoop.
I suggest you read up on what Sqoop is and what it does.
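As a rough illustration, a Sqoop export of a Hive table's warehouse directory into a relational table might look like this (the connection string, credentials, table name and export directory are all assumptions to adapt to your environment):
sqoop export \
  --connect jdbc:mysql://dbhost:3306/analytics \
  --username dbuser -P \
  --table query_results \
  --export-dir /user/hive/warehouse/query_results \
  --input-fields-terminated-by ','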
And if you want your query results in a file, there are also several ways.
Hive supports exporting data from tables:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from table;
Another simple approach is to redirect your Hive query output to a file while running the query from the CLI:
hive -e "select * from table" > output.txt
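A close variant of the same idea, assuming the query lives in a script file: the -S (silent) flag suppresses Hive's log noise so only the result rows (tab-separated by default) end up in the file.
hive -S -f my_query.hql > output.tsv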

Related

hive, get the data location using a one-liner

I wonder if there is a way to get the data location from hive using a one-liner. Something like
select d.location from ( describe formatted table_name partition ( .. ) ) as d;
My current solution is to get the full output and then parse it.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database, in most cases MySQL or Postgres. The metastore database connection details can be found in hive-site.xml. If you have access to the metastore database, you can run SELECT on the TBLS table to get details about tables, on COLUMNS_V2 for details about columns, etc.
If you do not have access to the metastore, the only option is to describe each table to get the details. If you have many databases and tables, you could write a shell script that gets the list of tables with "show tables" and loops over them.
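For the data location specifically, a sketch of such a metastore query (assuming a MySQL-backed metastore and the default database; TBLS, DBS and SDS are standard metastore tables, and SDS holds the storage location):
SELECT d.NAME AS db_name, t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE d.NAME = 'default' AND t.TBL_NAME = 'table_name';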
Two methods if you do not have access to the metastore:
Parse DESCRIBE TABLE in the shell like in this answer: https://stackoverflow.com/a/43804621/2700344
Also Hive has a virtual column INPUT__FILE__NAME.
select INPUT__FILE__NAME from table
will output the location URL for each file.
You can split the URL by '/', take the element you need, aggregate, etc.
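For example, a sketch that reduces those per-file URLs to the distinct directories (table_name is a placeholder; regexp_extract keeps everything before the last '/'):
select distinct regexp_extract(INPUT__FILE__NAME, '^(.*)/[^/]+$', 1) as data_location
from table_name;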

One query runs map reduce, the other does not

I have a table inside hive, I want to fetch all data from it. The problem is that:
select * from tbl;
Gives me very different results than:
select count(*) from tbl;
Why is that? The second query seems to run a Hadoop MapReduce job, while the first does not; it simply returns the results. The table is not partitioned or bucketed, and it is stored in text (CSV) format.
When you submit a Hive query, Hive converts it into one or more stages. A stage could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or another kind of task Hive needs to perform.
select * from table_name;
This query simply scans the entire table and dumps the output to the screen, so you see different log output on the console.
select count(*) from table_name, on the other hand, just reads the statistics stored in the Hive metastore and returns the result from there; it does not run any MapReduce job.
You can run the command below on the Hive console to see that information:
hive> describe formatted table_name;
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles xx
numRows xxxxxxxx
In Hadoop, aggregation, conditional, and arithmetic operations require a processing engine to compute the result. Whenever you submit this kind of query, it is translated internally into a MapReduce program; that program is executed on behalf of the query and hands its result back to Hive, which displays it on your screen. That is why you see different behaviour.
You can put the EXPLAIN keyword in front of the query to see the query plan and other information.
Please refer to the Programming Hadoop book, Chapter 10, to learn more about Hive's EXPLAIN feature.
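For instance, you can compare the plans of the two queries above directly (the exact plan output varies by Hive version):
hive> explain select * from table_name;
hive> explain select count(*) from table_name;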

How to select the last table from a list of hive tables?

I have a list of hive tables and want to select the last table for performing some query.
Here is what I use to get the list of similar hive tables.
show tables 'test_temp_table*';
It displays the below result
test_temp_table_1
test_temp_table_2
test_temp_table_3
test_temp_table_4
test_temp_table_5
test_temp_table_6
I need to run a query on test_temp_table_6. I can do this with a shell script by writing the output to a temp file and reading the last value from it, but is there a simpler way, using a Hive query, to get the last table, i.e. the one with the maximum number at the end?
Using shell:
last_table=$(hive -e "show tables 'test_temp_table*';" | sort -r | head -n1)
You can actually run a SELECT query against the Hive metastore based on table names (and then sort with ordinary SQL using ORDER BY ... DESC and LIMIT 1) instead of using SHOW TABLES, by following the approach described here: Query Hive Metadata Store for table metadata
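A sketch of that approach against a MySQL-backed metastore (the database name 'default' is an assumption; note that plain string ordering only picks the right table while the numeric suffix stays single-digit, a caveat that also applies to the sort -r one-liner above):
SELECT t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = 'default'
  AND t.TBL_NAME LIKE 'test_temp_table%'
ORDER BY t.TBL_NAME DESC
LIMIT 1;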

Hive count(*) query is not invoking mapreduce

I have external tables in Hive. I am trying to run a select count(*) from table_name query, but the query returns instantly and gives a result that, I think, is already stored somewhere. The result returned by the query is not correct. Is there a way to force a MapReduce job and make the query execute each time?
Note: this behaviour is not seen on all external tables, only on some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off a MapReduce job to count the number of records in an ORC table.
ANALYZE TABLE table_name PARTITION(partition_columns) COMPUTE STATISTICS;
--OR
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative to count(*), but it refreshes the statistics and so provides the latest count of records in the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
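For example (the file path is an assumption; point it at one of the ORC files under the table's warehouse directory):
hive --orcfiledump /user/hive/warehouse/mytable/000000_0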
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows in the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- it takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter; you must add a non-deterministic clause, e.g. "where MYKEY is null". This assumes MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on the partition key(s) will also return wrong results -- all existing partitions are shown, even the empty ones. That one is not specific to ORC.
Please try the following:
set hive.fetch.task.conversion=none in your Hive session and then run the select count(*) query in the same session to force MapReduce.
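A minimal session sketch; the second setting, hive.compute.query.using.stats=false, is an additional assumption that stops Hive from answering count(*) out of stored statistics, and may or may not be needed in your configuration:
hive> set hive.fetch.task.conversion=none;
hive> set hive.compute.query.using.stats=false;
hive> select count(*) from table_name;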

MapReduce & Hive application Design

I have a design question: in my CDH 4.1.2 (Cloudera) installation I have daily rolling log data dumped into HDFS, and I have some reports to calculate the success and failure rates per day.
I have two approaches:
1. Load the daily log data into Hive tables and create a complex query.
2. Run a MapReduce job upfront every day to generate the summary (which is essentially a few lines) and keep appending it to a common file that backs a Hive table. Later, while running the report, I could use a simple select query to fetch the summary.
I am trying to understand which of the two would be the better approach, or whether there is a better one altogether.
The second approach adds some complexity in terms of merging files. If the files are not merged I would end up with lots of very small files, which seems like a bad idea.
Your inputs are appreciated.
Thanks
Hive seems well suited to this kind of task, and it should be fairly simple to do:
Create an EXTERNAL table in Hive, partitioned by day. The goal is that the directory where you dump your data sits directly under your Hive table's location. You can specify the field delimiter of your daily logs as shown below, where I use commas:
create external table mytable (...)
partitioned by (day string)
row format delimited fields terminated by ','
location '/user/hive/warehouse/mytable';
When you dump your data into HDFS, make sure you put it in a subdirectory named day=... so it can be recognized as a Hive partition, for example /user/hive/warehouse/mytable/day=2013-01-23.
You then need to let Hive know that this table has a new partition:
alter table mytable add partition (day='2013-01-23')
Now that the Hive metastore knows about your partition, you can run your summary query. Make sure you only query that partition by specifying ... where day='2013-01-23'.
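A sketch of such a summary query, assuming the logs have a status column whose values include 'SUCCESS' and 'FAILURE' (both the column and its values are assumptions):
select day,
       sum(case when status = 'SUCCESS' then 1 else 0 end) as success_count,
       sum(case when status = 'FAILURE' then 1 else 0 end) as failure_count
from mytable
where day = '2013-01-23'
group by day;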
You could easily script this to run daily from cron or similar: get the current date (for example with the shell date command) and substitute it into a shell script that performs the steps above.
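A rough daily driver script under those assumptions (the local log path and table name are placeholders):
#!/bin/bash
# sketch of a daily loader: drop the day's log under a day= subdirectory
# of the external table's location, then register the partition
DAY=$(date +%Y-%m-%d)
TARGET=/user/hive/warehouse/mytable/day=$DAY

hdfs dfs -mkdir -p "$TARGET"
hdfs dfs -put /var/log/myapp/app-$DAY.log "$TARGET/"

hive -e "alter table mytable add if not exists partition (day='$DAY');"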
