Force HiveServer2 to run MapReduce job - hadoop

I am using hive-1.1.0.
Submitting read-only queries with no predicates to HiveServer2 via Beeline causes HiveServer2 to try to read the data from HDFS itself, without spawning a MapReduce job:
SELECT * FROM my_table LIMIT 100;
For very large datasets this can cause HiveServer2 to hold onto a lot of memory, leading to long garbage-collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.
SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;
By "fake", I mean a predicate that does not matter; the above example predicate will always be true.
Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?
I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:
> SET hive.exec.mode.local.auto;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.exec.mode.local.auto=false |
+----------------------------------+--+
but queries without predicates are still read entirely by HiveServer2, causing issues.
Any guidance much appreciated.
Thanks!

Some SELECT queries can be converted to a single FETCH task, with no map-reduce at all.
This behavior is controlled by the hive.fetch.task.conversion configuration parameter.
Possible values are none, minimal and more.
If you want to disable fetch task conversion, set it to none:
set hive.fetch.task.conversion=none;
minimal will trigger a FETCH-only task for:
SELECT *, FILTER on partition columns (WHERE and HAVING clauses),
LIMIT only.
more will trigger a FETCH-only task for:
SELECT with any kind of expression, including UDFs, FILTER, LIMIT only
(including TABLESAMPLE and virtual columns).
Read also about the hive.fetch.task.conversion.threshold parameter; more details are in Hive Configuration Properties.
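A minimal sketch of the two options in a Beeline/Hive session, assuming the goal is to stop HiveServer2 from reading large tables itself (the threshold value below is illustrative, in bytes):
-- Option 1: force every query to run as a MapReduce job
SET hive.fetch.task.conversion=none;
-- Option 2: keep conversion, but only for small inputs; inputs larger than the
-- threshold (268435456 bytes = 256 MB here, an illustrative value) fall back to MapReduce
SET hive.fetch.task.conversion=more;
SET hive.fetch.task.conversion.threshold=268435456;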

Related

Is there a limit of sqoop.export.records.per.statement for Sqoop Export job?

Does anyone know if there is a limit on the value of sqoop.export.records.per.statement for a Sqoop batch export job?
I have a very large amount of data to export, around 200,000,000 rows, from Impala to Vertica. I get [Vertica][VJDBC](5065) ERROR: Too many ROS containers exist for the following projections if records per statement is set too low, or java.lang.OutOfMemoryError: GC overhead limit exceeded if records per statement is set too high.
Anyone know how to fix this problem? Thanks!
I think the limit is that of memory. If you increase the heap, it will let you set the number higher. Try adding -D mapred.child.java.opts=-Xmx1024M or some value larger than your current setting.
You could try to increase sqoop.export.statements.per.transaction and reduce your records per statement. I'm thinking this won't help on the ROS container side, because I believe each batch of SQL = 1 COPY statement = 1 ROS container. I don't think it will convert multiple batches of INSERTs into a single COPY, but I don't have a way to test it right now.
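As a rough sketch of how those -D properties could be combined on a single export invocation (the batch sizes, heap size, JDBC URL, table name and HDFS path are all illustrative and would need tuning for your cluster):
sqoop export \
  -D sqoop.export.records.per.statement=10000 \
  -D sqoop.export.statements.per.transaction=10 \
  -D mapred.child.java.opts=-Xmx2048M \
  --connect jdbc:vertica://database_host:5433/mydb \
  --username user --password password \
  --table mytable \
  --export-dir /user/hive/warehouse/mytable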
You could bypass sqoop and stream the data (You may need to construct the COPY), something like:
impala-shell -k -i server:port -B -q 'select * from mytable' --output_delimiter="|" | vsql -h database_host -U user -w password -c 'copy mytable from stdin direct'

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some failure cases the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
An example of such a "failure" where the data is in fact loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query Hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
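For example, such a query against an external table might look like this (the table and partition names are illustrative):
-- list the HDFS files that back one partition of an external table
SELECT DISTINCT INPUT__FILE__NAME
FROM my_external_table
WHERE dt='2015-08-17-05';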
But in your case, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed datafile)
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed; see the sketch below)
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  )
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
# run the count query and capture hive's stdout as a string
x = subprocess.check_output(
    ["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you have to spend some time working with quoting and backslashes to get hive -e to work from Python; it can be very difficult. It may be easier to write a file containing that simple query first, and then use hive -f filename. Then print the output of subprocess.check_output in order to see how the output is stored. You may need to do some regex or type conversions, but I think it should just come back as a string. Then simply use an if statement:
# if the partition already has rows, skip the load; otherwise run it
if int(x.strip()) > 0:
    pass
else:
    subprocess.check_call(
        ["hive", "-e",
         "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])

Hive unable to perform queries other than SELECT *

I am running Hive on my system, where I have successfully created a database and a table. I have loaded that table with a CSV file located on HDFS.
I am able to describe the table in Hive and see all of the columns that I intended to create.
I am also able to run the simple SELECT * FROM table; query, which returns an enormous list of data.
My problem starts whenever I try to run any query more complex than that: specifically, a query that selects a specific column name or any aggregate of the data. Whenever I do, I receive this error message after my map and reduce tasks have sat at 0% for a while.
Diagnostic Messages for this Task:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:230)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.NullPointerException
at org.ap
I have tried many different syntax techniques and performed numerous sanity checks to confirm that the table is actually there. What confuses me is that the SELECT * works while all other queries fail.
Any advice is appreciated.
Here is a query I ran with as many NULL checks as I could fit: SELECT year FROM flights WHERE year != NULL AND length(year) > 0 AND year <> ''; This query still failed.
SELECT * doesn't invoke a MapReduce job, but any more complex query does.
Please check the MR job logs.
This could also be a data issue: the data might be incompatible with the table schema.
Please check with fewer rows.
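One hedged way to act on that last suggestion: since the asker reports that SELECT * still works (it runs as a fetch task), a handful of rows can be inspected against the schema without invoking MapReduce at all; the table name is taken from the question:
-- SELECT * with LIMIT can run as a fetch task, so it may work even while MR jobs fail;
-- compare the returned values against the column types shown by DESCRIBE flights
SELECT * FROM flights LIMIT 5;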
Maybe your input data contains null values, because:
if you use a select-all query, the job will not enter the MapReduce phase;
if you select any specific column, it will enter the MapReduce phase, so that may be where you get this error.
What is happening here is that none of the queries involving MapReduce jobs are running.
The "select *" query doesn't invoke any MapReduce and just displays the data as it is. Please check your MapReduce logs and see if you can find what is causing this.

Hive Map reduce Job clarification on selecting column

MapReduce jobs for Hive statements
When I query the following statement in Hive
hive> SELECT * FROM USERS LIMIT 100;
It doesn't launch a MapReduce job, because we are selecting everything from the table and only limiting the number of records it returns.
But when I do the following
hive> select age,occupation from users limit 100;
it actually kicks off a MapReduce job.
Does that mean that applying a column-level projection requires a MapReduce job, even though I have not applied any kind of filter?
Whenever you run a plain 'select *', a fetch task is created rather than a MapReduce task; it just dumps the data as it is without doing anything to it. This is equivalent to:
hadoop fs -cat $file_name
Whereas whenever you do a 'select column', a map job internally projects that particular column and gives the output.
When you write select * from table_name, the whole file is read as it is; if you select a column, a map-only job is launched, with no reduce phase, since we are only projecting the column and not aggregating anything.
Select * from table_name; --> will not launch an MR job
Select column from table_name; --> will launch a map-only job (no reducers)
Select MAX(column_name) from table_name; --> will launch a full MR job (map and reduce)
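Note that whether the column projection actually launches MapReduce also depends on the hive.fetch.task.conversion setting discussed earlier on this page; a quick sketch (exact behavior may vary by Hive version):
-- With conversion set to "more", even a projection with LIMIT may run as a
-- FETCH-only task instead of a MapReduce job:
SET hive.fetch.task.conversion=more;
select age,occupation from users limit 100;
-- Setting it to "none" forces a MapReduce job for every query:
SET hive.fetch.task.conversion=none;
select age,occupation from users limit 100;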

Forcing a reduce phase or a second map reduce job in hive

I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludge-y way, without using DISTINCT.
What about wrapping your query in another SELECT, with a trivially true WHERE clause, to make sure it kicks off a second job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and file offset to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could get rid of the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ to pull the results into the local filesystem. This unfortunately reaches out of Hive, but maybe setting up a two step Oozie workflow is acceptable for your use case.
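A rough sketch of that two-step approach (the HDFS output directory and local path are illustrative placeholders; the query elisions are kept from the original):
# Step 1: write the query result to an HDFS directory instead of the local filesystem
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/query_out'
         SELECT /*+ MAPJOIN(...) */ * FROM ..."
# Step 2: merge the part files into a single file on the local filesystem
hadoop fs -getmerge /tmp/query_out /local/path/result.txt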
