I am running Hive on my system, where I have successfully created a database and a table, and I have loaded that table with a CSV file located on HDFS.
I am able to describe the table in Hive and see all of the columns that I intended to create.
I am also able to run the simple SELECT * FROM table; query, which returns an enormous list of data.
My problem starts whenever I try to run anything more complex than that, specifically any query that selects a specific column name or any aggregate of the data. In those cases the map and reduce tasks sit at 0% for a while, and then I receive the following error message.
Diagnostic Messages for this Task:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:230)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.NullPointerException
at org.ap
I have tried many different syntax techniques and performed numerous sanity checks to confirm that the table is actually there. What confuses me is that the SELECT * works while all other queries fail.
Any advice is appreciated.
Here is a query I ran with as many NULL checks as I could fit: SELECT year FROM flights WHERE year != NULL AND length(year) > 0 AND year <> ''; This query still failed.
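Side note: in HiveQL a comparison like year != NULL never evaluates to true, because any comparison with NULL yields NULL; the usual form is IS NOT NULL. As a sketch, the same filter would normally be written as:
SELECT year FROM flights
WHERE year IS NOT NULL AND length(year) > 0 AND year <> '';
That rewrite would still typically compile to a MapReduce job, so it does not by itself explain the NullPointerException.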
SELECT * doesn't invoke a MapReduce job,
but any more complex query does.
Please check the MR job logs.
This could also be a data issue; the data might be incompatible with the table schema.
Please check with fewer rows.
Maybe your input data contains some null values, because
if you use a select-all command, the job will not enter the MapReduce phase;
if you select any specific column, it will enter the MapReduce phase, so that is where you may get this error.
What is happening here is that none of the queries involving MapReduce jobs are running.
The "select *" query doesn't invoke any MapReduce and just displays the data as it is. Please check your MapReduce logs and see if you can find what is causing this.
I am using hive-1.1.0.
Submitting read-only queries with no predicates to HiveServer2 via Beeline causes HiveServer2 to try to read the data from HDFS itself, without spawning a MapReduce job:
SELECT * FROM my_table LIMIT 100;
For very large datasets this can cause HiveServer2 to hold onto a lot of memory leading to long garbage collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.
SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;
By "fake", I mean a predicate that does not matter; the above example predicate will always be true.
Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?
I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:
> SET hive.exec.mode.local.auto;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.exec.mode.local.auto=false  |
+----------------------------------+--+
but queries without predicates are still read entirely by HiveServer2, which causes these issues.
Any guidance much appreciated.
Thanks!
Some select queries can be converted to a single FETCH task, without map-reduce at all.
This behavior is controlled by hive.fetch.task.conversion configuration parameter.
Possible values are: none, minimal and more.
If you want to disable fetch task conversion, set it to none:
set hive.fetch.task.conversion=none;
minimal will trigger a FETCH-only task for: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
more will trigger a FETCH-only task for: SELECT any kind of expressions including UDFs, FILTER, LIMIT only (including TABLESAMPLE and virtual columns).
Read also about the hive.fetch.task.conversion.threshold parameter; more details are here: Hive Configuration Properties
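If you want to see which path a particular query takes, EXPLAIN makes it visible: with fetch conversion enabled the plan is a single Fetch Operator, while with conversion set to none the same query gets a map-reduce stage. A quick sketch, reusing my_table from the question:
SET hive.fetch.task.conversion=more;
EXPLAIN SELECT * FROM my_table LIMIT 100;   -- plan shows only a Fetch Operator
SET hive.fetch.task.conversion=none;
EXPLAIN SELECT * FROM my_table LIMIT 100;   -- plan now contains a map-reduce stage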
I got the following error when loading data from Impala to Vertica with Sqoop.
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: java.sql.BatchUpdateException: [Vertica]VJDBC One or more rows were rejected by the server.
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:233)
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:46)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:84)
... 10 more
Caused by: java.sql.BatchUpdateException: [Vertica]VJDBC One or more rows were rejected by the server.
at com.vertica.jdbc.SStatement.processBatchResults(Unknown Source)
at com.vertica.jdbc.SPreparedStatement.executeBatch(Unknown Source)
at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:231)
And I was running the following command:
sudo -u impala sqoop export -Dsqoop.export.records.per.statement=xxx \
  --driver com.vertica.jdbc.Driver --connect jdbc:vertica://host:5433/db \
  --username name --password pw --table table --export-dir /some/dir -m 1 \
  --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --batch
This error was not raised every time; I had several successful tests loading over 2 million rows of data, so I guess there might be some bad data containing special characters in the rejected rows. This is very annoying because when the error is raised, the MapReduce job rolls back and retries, which leaves lots of duplicate data in the target table.
Does anyone know whether there is a Sqoop export parameter that can be set to deal with special characters, or whether there is a way to skip the bad data, i.e. to disable the rollback? Thanks!
This may not be just special characters. If you try to stuff 'abc' into a numeric field, for example, that row would get rejected. Even though you get this error, I believe it is not raised until after the load, and all the data that could be committed should be committed (but I would verify that). If you isolate the "missing" rows you might be able to figure out what is wrong with the data or the field definition.
Common things to look for (a sketch of checks for some of these follows the list):
1. Stuffing character-type data into numeric fields (maybe via implicit conversions, or values that only show up when they are non-NULL).
2. NULL values going into NOT NULL fields.
3. Counting characters and VARCHAR octets as equivalent. VARCHAR(x) represents octets, but a UTF-8 character can have multiple octets.
4. Similar to #3, strings too long to fit in their designated fields.
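Where the source is still queryable from Hive or Impala, rough checks along these lines can surface the offending rows before the export; the table and column names here are placeholders, and the first check relies on the fact that a failed CAST returns NULL instead of raising an error:
-- values that will not convert to the numeric target type
SELECT count(*) FROM source_table
WHERE amount IS NOT NULL AND cast(amount AS decimal(18,2)) IS NULL;
-- NULLs headed for a NOT NULL target column
SELECT count(*) FROM source_table WHERE required_col IS NULL;
-- strings too long for a VARCHAR(255) target
-- (Vertica's limit is in octets, so multi-byte UTF-8 data needs extra headroom)
SELECT count(*) FROM source_table WHERE length(description) > 255;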
In the driver, the batch inserts are being replaced with a COPY FROM STDIN statement. You might be able to find the statement in query_requests although I'm not sure it will help.
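If you do go looking there, something along these lines may surface the generated statement; the column names are from memory of the v_monitor schema, so verify them against your Vertica version:
SELECT start_timestamp, request
FROM v_monitor.query_requests
WHERE request ILIKE 'COPY%'
ORDER BY start_timestamp DESC
LIMIT 20;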
Sqoop doesn't really give you much opportunity to investigate this further (as far as I am aware, I checked the generic JDBC Loader). One could look at the return array for executeBatch() and tie this to your execution batch. Maybe modify the generic JDBC loader?
Hope this helps.
I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some of these failures the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
Example of such a "failure" where the data is in fact loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter partition.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query Hive for the filenames that have been loaded into it? I can use DESCRIBE to see the number of files. Can I find out their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
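For example, assuming a hypothetical external table raw_logs defined over that upload directory (INPUT__FILE__NAME is a Hive virtual column exposing the HDFS file each row came from):
SELECT DISTINCT INPUT__FILE__NAME FROM raw_logs;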
But in your case, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed data file)...
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed).
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
(SELECT DISTINCT 1
FROM Target trg
WHERE trg.SrcFileName = src.INPUT__FILE__NAME
)
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive, so here are the basics of an implementation in Python:
import subprocess
# Run the count query through the Hive CLI and capture its stdout.
x = subprocess.check_output(
    ["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print(type(x))
print(x)
But you have to spend some time working with backslashes to get hive -e to work from Python, and it can be very difficult. It may be easier to write a file with that simple query in it first, and then use hive -f filename. Then, print the output of subprocess.check_output in order to see how the output is stored. You may need to do some regex or type conversion, but I think it should just come back as a string. Then simply use an if statement:
count = int(x.strip())
if count > 0:
    pass  # the partition already has data, nothing to do
else:
    subprocess.check_call(["hive", "-e",
        "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])
I am trying to use a Sqoop transfer on CDH5 to import a large PostgreSQL table to HDFS. The whole table is about 15 GB.
First, I tried to import it using just the basic information, entering only the schema and table name, but it didn't work; I always get GC overhead limit exceeded. I tried changing the JVM heap size in the Cloudera Manager configuration for YARN and Sqoop to the maximum (4 GB), but it still didn't help.
Then I tried to use the Sqoop transfer SQL statement to transfer part of the table, adding the following SQL statement in the field:
select * from mytable where id>1000000 and id<2000000 ${CONDITIONS}
(partition column is id).
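Side note: Sqoop generally expects the ${CONDITIONS} placeholder to be a predicate ANDed into the WHERE clause rather than appended after it, so the statement is conventionally written along the lines below; whether that explains the error reported next, I can't say for certain:
select * from mytable where id>1000000 and id<2000000 and ${CONDITIONS}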
The statement failed; in fact, any statement with my own "where" condition hit the error: "GENERIC_JDBC_CONNECTOR_0002:Unable to execute the SQL statement"
I also tried to use the boundary query. "select min(id), 1000000 from mytable" worked, but when I tried "select 1000000, 2000000 from mytable" to select data further ahead, it caused the Sqoop server to crash and go down.
Could someone help? How do I add a where condition, or how should I use the boundary query? I have searched in many places and didn't find any good documentation about how to write SQL statements with Sqoop2. Also, is it possible to use direct mode with Sqoop2?
Thanks
I am trying to execute the query below:
INSERT OVERWRITE TABLE nasdaq_daily
PARTITION(stock_char_group)
select exchage, stock_symbol, date, stock_price_open,
stock_price_high, stock_price_low, stock_price_close,
stock_volue, stock_price_adj_close,
SUBSTRING(stock_symbol,1,1) as stock_char_group
FROM nasdaq_daily_stg;
I have already set hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict.
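A quick way to confirm that both properties actually took effect in the session is to echo them back before running the insert; with no value given, SET simply prints the current setting:
SET hive.exec.dynamic.partition;
SET hive.exec.dynamic.partition.mode;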
The nasdaq_daily_stg table contains proper information in the form of a number of CSV files. When I execute this query, I get this error message:
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed.
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask
The MapReduce job didn't start at all, so there are no logs for this error in the JobTracker web UI. I am using Derby to store the metastore information.
Can someone help me fix this?
Please try this; this may be the issue. You may have the Derby classes on your classpath twice:
"SecurityException: sealing violation" when starting Derby connection