OOM in tez/hive - hadoop

[After a few answers and comments I asked a new question based on the knowledge gained here: Out of memory in Hive/tez with LATERAL VIEW json_tuple ]
One of my queries consistently fails with this error:
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1516602562532_3606_2_03, diagnostics=[Task failed, taskId=task_1516602562532_3606_2_03_000001, diagnostics=[TaskAttempt 0 failed, info=[Container container_e113_1516602562532_3606_01_000008 finished with diagnostics set to [Container failed, exitCode=255. Exception from container-launch.
Container id: container_e113_1516602562532_3606_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:237)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255
]], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
The keyword here seems to be java.lang.OutOfMemoryError: Java heap space.
I looked around, but nothing I thought I understood about Tez helps me:
yarn-site/yarn.nodemanager.resource.memory-mb is maxed out => I use all the memory I can
yarn-site/yarn.scheduler.maximum-allocation-mb: same as yarn.nodemanager.resource.memory-mb
yarn-site/yarn.scheduler.minimum-allocation-mb = 1024
hive-site/hive.tez.container.size = 4096 (multiple of yarn.scheduler.minimum-allocation-mb)
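As a rough sanity check of what that container size gives each task (a sketch, assuming the common default where the Tez task JVM heap is set to about 80% of hive.tez.container.size, e.g. via tez.container.max.java.heap.fraction or hive.tez.java.opts):
# back-of-the-envelope heap per Tez task container (assumption: heap ~= 0.8 * container size)
hive_tez_container_size_mb = 4096
heap_fraction = 0.8
task_heap_mb = hive_tez_container_size_mb * heap_fraction
print(task_heap_mb)   # ~3276 MB of heap per mapper, regardless of total node/cluster RAM
Whatever the cluster-wide totals are, each individual mapper is bounded by this per-container heap.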
My query has 4 mappers; 3 go very fast, the 4th dies every time. Here is the Tez graphical view of the query:
From this image:
table contact has 150M rows, 283GB of ORC compressed data (there is one large json field, LATERAL VIEW'ed)
table m has 1M rows, 20MB of ORC compressed data
table c has 2k rows, < 1MB ORC compressed
table e has 800k rows, 7GB of ORC compressed
e is LEFT JOIN'ed with all the other tables
e and contact are partitioned and only one partition is selected in the WHERE clause.
I thus tried to increase the number of mappers:
tez.grouping.max-size: 650MB by default; even lowering it to tez.grouping.min-size (16MB) makes no difference
tez.grouping.split-count: even increased to 1000, it makes no difference
tez.grouping.split-waves: 1.7 by default; even increased to 5, it makes no difference
If it's relevant, here are some other memory settings:
mapred-site/mapreduce.map.memory.mb = 1024 (Min container size)
mapred-site/mapreduce.reduce.memory.mb = 2048 (2 * min container size)
mapred-site/mapreduce.map.java.opts = 819 (0.8 * min container size)
mapred-site/mapreduce.reduce.java.opts = 1638 (0.8 * mapreduce.reduce.memory.mb)
mapred-site/yarn.app.mapreduce.am.resource.mb = 2048 (2 * min container size)
mapred-site/yarn.app.mapreduce.am.command-opts = 1638 (0.8 * yarn.app.mapreduce.am.resource.mb)
mapred-site/mapreduce.task.io.sort.mb = 409 (0.4 * min container size)
My understanding was that Tez can split the work into many small loads, thus taking longer but eventually completing. Am I wrong, or is there a way I have not found?
Context: HDP 2.6, 8 datanodes with 32GB RAM, query using a chunky LATERAL VIEW based on JSON, run via beeline.

The issue is clearly due to skewed data. I would recommend that you add DISTRIBUTE BY COL to your SELECT query from the source so that the reducers receive evenly distributed data. In the example below, COL3 is a more evenly distributed column, like an ID column.
Example
ORIGINAL QUERY : insert overwrite table X AS SELECT COL1,COL2,COL3 from Y
NEW QUERY : insert overwrite table X AS SELECT COL1,COL2,COL3 from Y distribute by COL3

I had the same issue and increasing all the memory parameters didn't help.
Then I switched to MR and got the error below.
Failed with exception Number of dynamic partitions created is 2795, which is more than 1000.
After setting a higher value for the maximum number of dynamic partitions, I went back to Tez and the problem was solved.

Related

Limit(n) vs Show(n) performance disparity in Pyspark

I'm trying to get a deeper understanding of how Spark works and was playing around with the PySpark CLI (2.4.0). I was looking for the difference between using limit(n).show() and show(n), and I ended up getting two very different performance times for two very similar queries. Below are the commands I ran. The Parquet file referenced in the code below has about 50 columns and is over 50 GB in size on remote HDFS.
# Create dataframe
>>> df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet')
# Create test1 dataframe
>>> test1 = df.select('test_col')
>>> test1.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test1.explain()
== Physical Plan ==
*(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
# Create test2 dataframe
>>> test2 = df.select('test_col').limit(5)
>>> test2.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test2.explain()
== Physical Plan ==
CollectLimit 5
+- *(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
Notice that the physical plan is almost identical for both test1 and test2. The only exception is that test2's plan starts with "CollectLimit 5". After setting this up I ran test1.show(5) and test2.show(5). Test 1 returned the results instantaneously. Test 2 showed a progress bar with 2010 tasks and took about 20 minutes to complete (I only had one executor).
Question
Why did test 2 (with limit) perform so poorly compared to test 1 (without limit)? The data set and result set were identical and the physical plan was nearly identical.
Keep in mind:
show() is an alias for show(20) and relies internally on take(n: Int): Array[T]
limit(n: Int) returns another dataset and is an expensive operation that reads the whole source
limit results in a new DataFrame and takes longer because predicate pushdown is currently not supported by your input file format; Spark therefore reads the entire dataset and then applies the limit.
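A minimal sketch of the behavioural difference, reusing the sqlContext and path from the question (illustration only, not a fix):
# show(n) goes through take(n): Spark scans only as many partitions as it
# needs to collect n rows, so it usually returns almost immediately.
df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet')
df.select('test_col').show(5)
# limit(n) builds a new DataFrame with a CollectLimit step; because the limit
# is not pushed down into the Parquet scan here, Spark may read far more of
# the source before the limit is applied, hence the slow show().
df.select('test_col').limit(5).show(5)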

Elasticsearch curl error Connection aborted.', RemoteDisconnected('Remote end closed connection without response')

I am using the requests library to connect to Elasticsearch and fetch data. I have:
26 indices,
spread across 2 nodes,
with 1st node having 16GB RAM / 8 vCPU and the
2nd 8GB RAM / 4 vCPU.
All my nodes are in AWS EC2.
In all I have around 200 GB of data. I am primarily using the database for aggregation exercises.
A typical data record would look like this
SITE_ID DATE HOUR MAID
123 2021-05-05 16 m434
I am using the following python3 definition to send the request and get the data.
import json
import requests

def elasticsearch_curl(uri, json_body='', verb='get'):
    headers = {'Content-Type': 'application/json'}
    resp_text = None  # ensure something is returned even if the request fails
    try:
        resp = requests.get(uri, headers=headers, data=json_body)
        try:
            # parse the JSON body of the response
            resp_text = json.loads(resp.text)
        except Exception:
            print("Error")
    except Exception as error:
        print('\nelasticsearch_curl() error:', error)
    return resp_text
##Variables
tabsite : ['r123','r234'] ##names of indices
siteid : [123,124,125] ##name of sites
I am using the following code to get the data:
for key, value in tabsite.items():
    k = key.replace('_', '')
    if es.indices.exists(index=k):
        url = "http://localhost:9200/" + str(k) + "/_search"
        jb1 = '{"size":0,"query": {"bool" : {"filter" : [{"terms" : {"site_id": ' + str(siteid) + '}},{"range" : {"date" : \
{"gte": "'+str(st)+'","lte": "'+str(ed)+'"}}}]}}, "aggs" : {"group_by" : {"terms": {"field": "site_id","size":100},"aggs" : {"bydate" : {"terms" : \
{"field":"date","size": 10000},"aggs" : {"uv" : {"cardinality": {"field": "maid"}}}}}}}}'
        try:
            r2 = elasticsearch_curl(url, json_body=jb1)
            k1 = r2.get('aggregations', {}).get('group_by', {}).get('buckets')
            print(k1)
        except:
            pass
The above code returns the data from r123, which has 18 GB of data, while it fails for r234, which has 55 GB of data.
I am getting the following error:
elasticsearch_curl() error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I have tried the following:
Running the above code on a machine which has only the r234 index with around 45 GB of data. It worked.
Increasing the RAM size of the 2nd machine in production from 8 GB to 16 GB - it failed.
When I searched for options here, I understood that I need to close the headers, but I am not sure how.
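If "closing" here means asking Elasticsearch not to keep the HTTP connection alive, a minimal sketch with requests would be to send a Connection: close header and a bounded timeout (an assumption on my part, not a confirmed fix for the node shutdowns):
# illustration only: reuses url and jb1 from the loop above
headers = {'Content-Type': 'application/json', 'Connection': 'close'}
resp = requests.get(url, headers=headers, data=jb1, timeout=120)  # fail fast instead of hanging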
I have the following questions:
How do I keep my Elasticsearch nodes stable without them getting shut down automatically?
How do I get rid of the above error, which shuts down one of the nodes or both?
Is there any optimal configuration ratio for volume of data : number of nodes : amount of RAM / vCPUs?

How to avoid out-of-memory error when creating large arrays from PySpark and Parquet for data analysis?

We have a somewhat unusual requirement that we're experimenting with. We have a Parquet table with ~320 GB of frequency pixel values in arrays - the data is from a 3-dimensional spectral line image cube (a FITS source file).
+--------------------+------------+-------+
| col_name| data_type|comment|
+--------------------+------------+-------+
| spi_index| int| null|
| spi_image|array<float>| null|
| spi_filename| string| null|
| spi_band| int| null|
|# Partition Infor...| | |
| # col_name| data_type|comment|
| spi_filename| string| null|
| spi_band| int| null|
+--------------------+------------+-------+
There is a legacy source-finding algorithm written in C that is used to analyse the original data source files, but it is constrained by the physical memory available on one machine. We're running a CDH Express cluster: 3 management nodes (8 cores, 32 GB RAM), 10 worker nodes (16 cores, 64 GB RAM) and a gateway node running JupyterHub (8 cores, 32 GB RAM). We've modified the original C program into a shared object and distributed it across the cluster, and we've incorporated the C shared object into a partitioner class so we can run multiple partitions in multiple executors across the cluster. We have this up and running using PySpark.
The problem we're experiencing is that we optimally need an input pixel array of a minimum of about 15 GB, but we seem to be hitting a wall at creating arrays of around 7.3 GB, and we're unsure why.
YARN maximum allocation setting.
yarn.scheduler.maximum-allocation-mb = 40GB
Spark configuration settings
--executor-memory 18g \
--num-executors 29 \
--driver-memory 18g \
--executor-cores 5 \
--conf spark.executor.memoryOverhead='22g' \
--conf spark.driver.memoryOverhead='22g' \
--conf spark.driver.maxResultSize='24g' \
--conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=20G -XX:+UseCompressedOops" \
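As a rough sanity check of these numbers (a sketch, assuming the usual YARN rule that each executor container request is executor-memory plus memoryOverhead, and that only executor-memory is JVM heap):
# back-of-the-envelope container sizing check (assumptions as stated above)
executor_memory_gb = 18
memory_overhead_gb = 22
container_request_gb = executor_memory_gb + memory_overhead_gb   # 40 GB
yarn_max_allocation_gb = 40
print(container_request_gb <= yarn_max_allocation_gb)            # True, but with no headroom
# only the 18 GB of executor-memory is heap available to task objects,
# which is the pool the java.lang.OutOfMemoryError below refers to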
Summary of partitioner class
class Partitioner:
    def __init__(self):
        self.callPerDriverSetup()

    def callPerDriverSetup(self):
        pass
    def callPerPartitionSetup(self):
        sys.path.append('sofia')
        #import example
        import sofia
        import myLib
        import faulthandler
        import time as tm
        from time import time, clock, process_time, sleep
        self.sofia = sofia
        self.myLib = myLib
        #self.example=example
        self.parameterFile = SparkFiles.get('sofia.par')

    def doProcess(self, element):
        ### here's the call to the C library for each row of the dataframe partition
        ### In here we have to transform the flattened array data to the format SoFiA
        ### requires, as well as the
        ra = np.array(element.right_ascension, dtype='<f4')
        dec = np.array(element.declination, dtype='<f4')
        frequencies = np.array(element.frequency, dtype='<f4')
        Pixels = np.array(element.pixels, dtype='<f4')
        dataPtr = Pixels.ravel()
        #
        # create the dictionary of the original header
        #
        hdrKey = np.array(element.keys, dtype='U')
        hdrValue = np.array(element.values, dtype='U')
        hdrDict = dict(zip(hdrKey, hdrValue))
        newHdr = self.myLib.CreateHeader(hdrDict)
        # Get the crpix adjustment values for the new header
        crpix1, crpix2, crpix3, crpix4 = self.myLib.GetHeaderUpdates(newHdr,
                                                                     element.raHeaderIndex,
                                                                     element.decHeaderIndex,
                                                                     element.freqHeaderIndex)
        # Get the new axis values
        naxis1 = len(ra)
        naxis2 = len(dec)
        naxis4 = len(frequencies)
        newHdr['CRPIX1'] = float(crpix1)
        newHdr['CRPIX2'] = float(crpix2)
        newHdr['CRPIX3'] = float(crpix3)
        newHdr['CRPIX4'] = float(crpix4)
        newHdr['NAXIS1'] = float(naxis1)
        newHdr['NAXIS2'] = float(naxis2)
        newHdr['NAXIS4'] = float(naxis4)
        newHdr.pop("END", None)
        hdrstr, hdrsize = self.myLib.dict2FITSstr(newHdr)
        path_to_par = self.parameterFile
        parsize = len(path_to_par)
        # pass off to sofia C library
        try:
            # This is the call to the shared object
            ret = self.sofia.sofia_mainline(dataPtr, hdrstr, hdrsize, path_to_par, parsize)
            returnCode = ret[0]
            sofiaMsg = "Call to sofia has worked!"
        except RuntimeError:
            print("Caught general exception\n")
            sofiaMsg = "Caught general exception"
            returnCode = 1
            #sys.exit()
        except StopIteration:
            print("Caught Null pointer\n")
            sofiaMsg = "Caught Null pointer"
            returnCode = 2
            #sys.exit()
        except MemoryError:
            print("Caught ALLOC error\n")
            sofiaMsg = "Caught ALLOC error"
            returnCode = 3
            #sys.exit()
        except IndexError:
            print("Caught index range error\n")
            sofiaMsg = "Caught index range error"
            returnCode = 4
            #sys.exit()
        except IOError:
            print("Caught file error\n")
            sofiaMsg = "Caught file error"
            returnCode = 5
            #sys.exit()
        except OverflowError:
            print("Caught integer overflow error\n")
            sofiaMsg = "Caught integer overflow error"
            returnCode = 6
            #sys.exit()
        except TypeError:
            print("Caught user input error\n")
            sofiaMsg = "Caught user input error"
            returnCode = 7
            #sys.exit()
        except SystemExit:
            print("Caught no sources error\n")
            sofiaMsg = "Caught no sources error"
            returnCode = 8
            #sys.exit()
        if returnCode == 0:
            catalogXML = ret[5].tobytes()
            pass
        else:
            catalogXML = ""
        DELIMITER = chr(255)  # "{}{}"
        msg = "{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}"\
            .format(str(returnCode), DELIMITER,
                    str(element.raHeaderIndex), DELIMITER,
                    str(element.decHeaderIndex), DELIMITER,
                    str(len(ra)), DELIMITER,
                    str(len(dec)), DELIMITER,
                    str(len(frequencies)), DELIMITER,
                    str(element.freqHeaderIndex), DELIMITER,
                    str(catalogXML), DELIMITER,
                    str(dataPtr.nbytes)
                    )
        return msg

    def processPartition(self, partition):
        self.callPerPartitionSetup()
        for element in partition:
            yield self.doProcess(element)
We create a dataframe where each row represents a 3-dimensional array of data, with columns of array data for the 3 position dimensions and a column containing the pixel values (read in doProcess as ra, dec, frequencies and Pixels). Each row also contains coordinate-system information from the original source file, which we use to build a new set of header information; the rows are passed to an instantiated instance of the Partitioner class via a df.rdd.mapPartitions call.
p = Partitioner()
try:
    ...
    ...
    ...
    ...
    # Creates the positional array dataframes,
    # and the single rows representing the 3d images in finalSubCubeDF
    # finalSubCube
    #
    ...
    ...
    ...
    finalSubcubeDF = finalSubcubeDF\
        .join(broadcast(raRangeDF), finalSubcubeDF.bins == raRangeDF.grp, "inner")\
        .join(broadcast(decRangeDF), finalSubcubeDF.bins == decRangeDF.grp, "inner")\
        .join(broadcast(freqRangeDF), finalSubcubeDF.bins == freqRangeDF.grp, "inner")\
        .join(broadcast(hdrDF), finalSubcubeDF.bins == hdrDF.grp, "inner")\
        .select("bins", "right_ascension", "declination", "frequency", "raHeaderIndex",
                "decHeaderIndex", "freqHeaderIndex", "pixels", "keys", "values")
    # repartition on the bins column to distribute the processing
    finalSubcubeDF = finalSubcubeDF.repartition("bins")
    finalSubcubeDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
    ...
    # Calls the partitioner class which contains the C shared object calls, as above
    ...
    rddout = finalSubcubeDF.rdd.mapPartitions(p.processPartition)
    DELIMITER = chr(255)
    rddout = rddout.map(lambda x: x.split(DELIMITER))
    ...
    ...
    # Write the results (which activates the computation) including the catalogue XML file
    rddout.saveAsTextFile("hdfs:///<file path>")
except Exception as e:
    ...
    ...
    ...
Error message captured from the calling process
2021-02-16 11:03:42,830 INFO ERROR! ProcessThread FAILURE...An error occurred while calling o1989.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.1 failed 4 times, most recent failure: Lost task 5.3 in stage 47.1 (TID 5875, hercules-1-2.nimbus.pawsey.org.au, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1612679558527_0181_01_000014 on host: hercules-1-2.nimbus.pawsey.org.au. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
And the error message from the YARN Container log
LogType:stdout
Log Upload Time:Tue Feb 16 11:19:03 +0000 2021
LogLength:124
Log Contents:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
# Executing /bin/sh -c "kill 20417"...
So the problem is obviously memory related, but we're not entirely sure why that's the case, given that the executor and driver memory settings are set quite high. At the moment we're just grasping at straws.
We're aware that collecting distributed data into an array under normal circumstances isn't recommended; however, it seems that being able to run multiple C shared objects in parallel across the cluster could be more efficient than running 30-40 GB extractions serially on a single machine.
Thanks in advance for any thoughts and assistance.

Getting AVG in pig

I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there are multiple delimiters, I read somewhere here on Stack Overflow to load it first via TextLoader():
loadusers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadusers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
after running
dump average_age_of_testusers
this is the error in hdfs
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException:
ERROR 2106: Error while computing average in Initial 2016-10-31 13:39:22,175 [main]
ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average. I thought I had made a mistake in the data types, but age is an int.
If you can help me, thank you.
I figured out the problem in this one. Please refer to How can correct data types on Apache Pig be enforced? for a better explanation.
But then, just to show what I did... I had to cast my data
FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
AVG is failing because loadusers.age is being treated as a string instead of an int.

SparkR dapply not working

I'm trying to call lapply within a function applied to a Spark data frame. According to the documentation this has been possible since Spark 2.0.
wrapper = function(df){
  out = df
  out$len <- unlist(lapply(df$value, function(y) length(y)))
  return(out)
}
# dd is Spark Data Frame with one column (value) of type raw
dapplyCollect(dd, wrapper)
It returns error:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 37, localhost): org.apache.spark.SparkException: R computation failed with
Error in (function (..., deparse.level = 1, make.row.names = TRUE) :
incompatible types (from raw to logical) in subassignment type fix
The following works fine:
wrapper(collect(dd))
But we want the computation to run on the worker nodes (not on the driver).
What could be the problem? There is a related question but it does not help.
Thanks.
You need to add the schema, as it can only be defaulted if the columns of the output are of the same mode as the input.
