Hive out of memory even with two rows - hadoop

I tested Hive with the following queries:
create table test (key string, value string) stored as orc;
insert into table test values ('a','a'), ('b','b');
select key, count(*) from test group by key;
And I got the out-of-memory error:
Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:157)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
I have searched online, but people usually get this error when they are working with much larger files. In my case the table has only two rows, and my computer has 14 GB of memory.
I have set HADOOP_HEAPSIZE to 1024 in /etc/hadoop/conf/hadoop-env.sh, but it does not help.

First I increased tez.runtime.io.sort.mb, but then I got this error instead: tez.runtime.io.sort.mb should be larger than 0 and should be less than the available task memory.
Then I increased hive.tez.java.opts (and some other parameters) as suggested by Hellmar Becker. That fixed the problem.
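For reference, a rough sketch of the kind of session settings involved (the values below are only illustrative, and hive.tez.container.size is an assumption about one of those "other parameters"; tune everything to your own container size):
-- illustrative values, not a recommendation
set hive.tez.container.size=2048;
set hive.tez.java.opts=-Xmx1640m;
set tez.runtime.io.sort.mb=512;
A common guideline is to keep hive.tez.java.opts at roughly 80% of the container size, and tez.runtime.io.sort.mb below the available task memory, per the error message above.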

I got the same error while creating the truck table as ORC in this Hadoop Hello World tutorial. You can try reducing the ORC compression buffer size:
CREATE TABLE XXX STORED AS ORC TBLPROPERTIES ("orc.compress.size"="1024");
I hope this helps (it worked for me).

They also agree that it's an issue in the Sandbox:
https://community.hortonworks.com/questions/34426/failure-to-execute-hive-query-from-lab-2.html#comment-35900

I tried many solutions, but none worked. For the time being I am using this workaround:
CREATE TABLE avg_mileage (truckid STRING, avgmpg BIGINT) STORED AS ORC;

Related

ODI-1228: Task Load data-LKM SQL to Oracle- fails on the target connection

I'm working with Oracle Data Integrator, inserting information from the original source into a temp table (BI_DSA.TMP_TABLE), and I get this error:
ODI-1228: Task Load data-LKM SQL to Oracle- fails on the target
connection BI_DSA. Caused By: java.sql.BatchUpdateException:
ORA-12899: value too large for column
"BI_DSA"."C$_0DELTA_TABLE"."FIELD" (actual: 11, maximum: 10)
I tried changing the length of 'FIELD' to more than 10 and reverse engineering, but it didn't work.
Is this error coming from the original source? I'm doing a replica, so I only have view privileges on it, and I believe it is, because the error comes from the C$ table.
Thanks for the help!
Solution: I had tried the length option before, as the answers suggested, but it didn't work. Then I noticed the original source had modified their field length, so I reverse engineered the source table and the problem was solved.
Greetings!
As Bobby mentioned in the comments, it might come from the byte/char semantics.
The C$ tables created by the LKMs usually copy the structure of the source data. So a workaround would be to go into the model and manually increase the size of the FIELD column in the source datastore (even if it doesn't represent what is in the database). The C$ table will be created with that size on the next run.

Insufficient memory error in proc sort

My data is stored in the Oracle table MY_DATA. This table contains only 2 rows with 7 columns. But when I execute this step:
proc sort data=oraclelib.MY_DATA nodupkey out=SORTED_DATA;
by client_number;
run;
the following error appears:
ERROR: The SAS System stopped processing this step because of insufficient memory.
If I comment out the nodupkey option, the error disappears. If I copy the dataset into the work library and run proc sort on it there, everything is OK too.
My memory options:
SORTSIZE=1073741824
SUMSIZE=0
MAXMEMQUERY=268435456
LOADMEMSIZE=0
MEMSIZE=31565617920
REALMEMSIZE=0
What could be the root of the problem, and how can I fix it?
My Oracle password was in its grace period; once I changed it, the issue disappeared.

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a non-zero code, in which case our script waits and tries again. The problem is that in some of these failures the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
An example of such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query Hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data files into the HDFS directory mapped to LOCATION, then you could:
(a) just run hdfs dfs -ls on that directory from the command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...) (see the sketch below)
But in your case you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed datafile)
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed).
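As a minimal sketch of option (b), assuming a hypothetical EXTERNAL TABLE named raw_logs mapped to that HDFS directory:
select distinct INPUT__FILE__NAME from raw_logs;
Each row returned is the full HDFS path of one file backing the table, so you can match it against the log files you uploaded.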
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  );
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
x = subprocess.check_output(["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you may have to spend some time fiddling with quoting and backslashes to get hive -e to work from Python; it can be tricky. It may be easier to write the simple query to a file first and then use hive -f filename. Then print the output of subprocess.check_output to see how the result is stored. You may need some regex or type conversions, but it should just come back as a string. Then simply use an if statement:
if int(x.strip()) > 0:  # assumes hive printed only the count
    pass  # this partition already has data
else:
    subprocess.call(["hive", "-e", "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])

H2 database: load CSV data faster

I want to load about 2 million rows from a CSV-formatted file into a database, run some SQL statements for analysis, and then remove the data. The file is 2 GB in size; the data is web server log messages.
I did some research and found that the H2 in-memory database seems to be faster, since it keeps the data in memory. When I tried to load the data I got an OutOfMemory error because of 32-bit Java. I am planning to try with 64-bit Java.
I am looking for any optimization options to load the data quickly and run the SQL.
test.sql
CREATE TABLE temptable (
f1 varchar(250) NOT NULL DEFAULT '',
f2 varchar(250) NOT NULL DEFAULT '',
f3 reponsetime NOT NULL DEFAULT ''
) as select * from CSVREAD('log.csv');
Running it like this with 64-bit Java:
java -Xms256m -Xmx4096m -cp h2*.jar org.h2.tools.RunScript -url 'jdbc:h2:mem:test;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0' -script test.sql
If any other database is available for use on AIX, please let me know.
Thanks.
If the CSV file is 2 GB, then it will need more than 4 GB of heap memory when using a pure in-memory database. The exact memory requirements depend a lot on how redundant the data is. If the same values appear over and over again, then the database will need less memory as common objects are re-used (no matter if it's a string, long, timestamp,...).
Please note that LOCK_MODE=0, UNDO_LOG=0, and LOG=0 are not needed when using create table as select. In addition, CACHE_SIZE does not help when using the mem: prefix (but it does help for in-memory file systems).
I suggest trying the in-memory file system first (memFS: instead of mem:), which is slightly slower than mem: but usually needs less memory:
jdbc:h2:memFS:test;CACHE_SIZE=65536
If this is not enough, try the compressed in-memory mode (memLZF:), which is again slower but uses even less memory:
jdbc:h2:memLZF:test;CACHE_SIZE=65536
If this is still not enough, I suggest trying the regular persistent mode and seeing how fast that is:
jdbc:h2:~/data/test;CACHE_SIZE=65536

Creating an index in Hive 0.9

I am trying to create indexes on tables in Hive 0.9. One table has 1 billion rows, another has 30 million rows. The commands I used are (apart from creating the tables and so on):
CREATE INDEX DEAL_IDX_1 ON TABLE DEAL (ID) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
alter index DEAL_IDX_1 ON DEAL rebuild;
set hive.optimize.autoindex=true;
set hive.optimize.index.filter=true;
For the 30 million row table, the rebuild process looks alright (both mapper and reducer finish), until at the end it prints:
Invalid alter operation: Unable to alter index.
FAILED: Execution Error, return code 1
from org.apache.hadoop.hive.ql.exec.DDLTask
Checking the log, I found this error:
java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver
Not sure why this error was encountered, but anyway, I added the derby-version.jar:
add jar /path/derby-version.jar
The reported error was resolved, but I still got another error:
org.apache.hadoop.hive.ql.exec.FileSinkOperator:
StatsPublishing error: cannot connect to database
Not sure how to solve this problem. I do see the created index table under hive/warehouse, though.
For the 1 billion row table, it is another story. The mapper just gets stuck at around 2%, and the error shows:
FATAL org.apache.hadoop.mapred.Child: Error running child :
java.lang.OutOfMemoryError: Java heap space
I attempted to raise the max heap size as well as the max mapred memory (settings I saw mentioned elsewhere, not among Hive's own configuration settings):
set mapred.child.java.opts=-Xmx6024m;
set mapred.job.map.memory.mb=6000;
set mapred.job.reduce.memory.mb=4000;
However, this did not help. The mapper still got stuck at 2% with the same error.
I had a similar problem where the index was created and appeared under hive/warehouse, but the process as a whole failed. My index_name was TypeTarget (yours is DEAL_IDX_1), and after many days of trying different approaches, making the index_name all lowercase (typetarget) fixed the issue. My problem was on Hive 0.10.0.
Also, the ClassNotFoundException and StatsPublishing issues occur because hive.stats.autogather is turned on by default. Turning that off (false) in hive-site.xml should get rid of those issues.
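If you just want to verify this quickly, a session-level sketch of the same workaround (instead of editing hive-site.xml) would be to disable it before the rebuild:
set hive.stats.autogather=false;
alter index DEAL_IDX_1 ON DEAL rebuild;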
Hopefully this helps anyone looking for a quick fix.
