In my cluster I have set local.cache.size to 10 GB, but I have seen files of around 24 GB in the cache. I fail to understand why it exceeded the defined limit; the limit was set at the initial installation of the cluster.
Does anyone know a solution for this?
In Hadoop 2.0 this property was removed and has no effect:
https://issues.apache.org/jira/browse/HADOOP-7184
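For reference, on Hadoop 1.x this was the property the question refers to; as a rough sketch (the exact *-site.xml file it belongs in varies by release, and the 10 GB value simply mirrors the question), it looked like:
<property>
  <!-- Hadoop 1.x only: limit, in bytes, on the task tracker's local distributed cache; 10737418240 bytes = 10 GB -->
  <name>local.cache.size</name>
  <value>10737418240</value>
</property>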
My Elasticsearch service is consuming around 1 GB of memory.
My total memory is 2 GB, and the Elasticsearch service keeps getting shut down. I guess the reason is the high memory consumption. How can I limit the usage to just 512 MB?
This is the memory usage before starting Elasticsearch.
After running sudo service elasticsearch start, the memory consumption jumps.
I appreciate any help! Thanks!
From the official doc
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is usually too small. If you are using the default heap values, your cluster is probably configured incorrectly.
So you can change it like this
There are two ways to change the heap size in Elasticsearch. The easiest is to set an environment variable called ES_HEAP_SIZE. When the server process starts, it will read this environment variable and set the heap accordingly. As an example, you can set it via the command line as follows: export ES_HEAP_SIZE=512m
But it's not recommended. You just can't run Elasticsearch optimally with so little RAM available.
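One practical note: an environment variable exported in your shell will not reach a daemon started with service. As a minimal sketch, assuming a Debian/Ubuntu package install of a 1.x/2.x Elasticsearch (paths and variable names may differ on other setups), the heap cap can be set in the service's defaults file:
# /etc/default/elasticsearch   (on RPM-based systems: /etc/sysconfig/elasticsearch)
# Cap the Elasticsearch JVM heap at 512 MB
ES_HEAP_SIZE=512m
Then restart with sudo service elasticsearch restart so the new heap size takes effect.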
This is with reference to the question: Small files and HDFS blocks, where the answer quotes Hadoop: The Definitive Guide:
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
I completely agree with this because, as per my understanding, blocks are just a way for the namenode to map which piece of a file is stored where in the cluster. And since HDFS is an abstraction over our regular filesystems, a 140 MB file will not consume 256 MB of space on HDFS when the block size is 128 MB; in other words, the remaining space in the last block does not get wasted.
However, I stumbled upon another answer here in Hadoop Block size and file size issue which says:
There are limited number of blocks available dependent on the capacity of the HDFS. You are wasting blocks as you will run out of them before utilizing all the actual storage capacity.
Does that mean that if I have 1280 MB of HDFS storage and I try to load 11 files of 1 MB each (assuming a 128 MB block size and a replication factor of 1 per block), HDFS will throw an error regarding the storage?
Please correct me if I am assuming anything wrong here. Thanks!
No, HDFS will not throw an error, because
the 1280 MB storage limit is not exhausted, and
the meta entries for 11 small files won't cross the memory limits of the namenode.
For example, say we have 3 GB of memory available on the namenode. The namenode needs to store a meta entry for each file and for each block, and each entry takes approximately 150 bytes. A file with a single block therefore costs roughly 300 bytes of namenode memory, so about 10 million such files would already consume the 3 GB. Thus, even if you have much more storage capacity, you will not be able to utilize it fully if a large number of small files exhausts the namenode's memory first.
But the specific example mentioned in the question comes nowhere near this memory limit, so there should not be any error.
Consider a hypothetical scenario where the namenode has only 300 bytes × 10 of memory available for meta entries. In that case, it would give an error on the request to store the 11th file's block.
References:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
https://www.mail-archive.com/core-user@hadoop.apache.org/msg02835.html
I have a 90 MB snappy-compressed file that I am attempting to use as input to Hadoop 2.2.0 on AMI 3.0.4 in AWS EMR.
Immediately upon attempting to read the file, my record reader gets the following exception:
2014-05-06 14:25:34,210 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:365)
...
I'm running on an m1.xlarge in AWS using the default memory and io.sort.mb settings. If we decompress the file and use that as input instead, everything goes fine. The trouble is that we have a very large number of compressed files and don't want to go around decompressing everything.
I'm not sure if we're missing a configuration setting or some wiring in our code. I'm not sure how to proceed.
As per the log you have provided, it seems the size of the decompressed block is larger than your available heap. I don't know the m1.xlarge instance specifications on EMR, but here are some things you can try to ward off this error. "Error running child" usually means that the child JVM spawned by YARN could not find enough heap space to continue its MR task. Options to try:
1) Increase mapred.child.java.opts. This is the heap each child task gets as a separate JVM process. By default it is 200 MB, which is small for any reasonable data analysis. Change the parameters -XmxNu (max heap size of N in units u) and -XmsNu (initial heap size of N in units u). Try 1 GB, i.e. -Xmx1g, see the effect, and if it succeeds then go smaller.
2) Set mapred.child.ulimit to 1.5 or 2 times the max heap size set above. It caps the amount of virtual memory a task process can use.
3) Reduce mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to lower the maximum number of mappers and reducers running in parallel at a time.
4) io.sort.mb, which you have already tried: aim for 0.25 * mapred.child.java.opts < io.sort.mb < 0.5 * mapred.child.java.opts.
In the end it is trial and error, so try these and see which one sticks.
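A minimal sketch of how those settings might look in mapred-site.xml (they can also be passed as -D options at job submission). The values are illustrative starting points only, and the property names are the MRv1-style ones used above; on Hadoop 2/YARN the heap setting maps to mapreduce.map.java.opts and mapreduce.reduce.java.opts.
<!-- Illustrative values only; tune for your cluster and data. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xms512m -Xmx1g</value>
</property>
<property>
  <!-- max virtual memory per task, in KB; roughly 2x the 1 GB heap above -->
  <name>mapred.child.ulimit</name>
  <value>2097152</value>
</property>
<property>
  <!-- fewer concurrent map slots leaves more memory per slot -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <!-- keep io.sort.mb between 0.25x and 0.5x of the child heap -->
  <name>io.sort.mb</name>
  <value>256</value>
</property>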
I am using Hadoop to process a large set of data. I set up a Hadoop node to use multiple volumes: one of these volumes is a NAS with a 10 TB disk, and the other one is the server's local disk with a storage capacity of 400 GB.
The problem, if I understood correctly, is that the datanode tries to place an equal amount of data in each volume. Thus, when I run a job on a large set of data, the 400 GB disk quickly fills up while the 10 TB disk still has plenty of space left. Then the MapReduce jobs produced by Hive freeze because my cluster turns on safe mode...
I tried to set the property for limiting the datanode's disk usage, but it does nothing: I still have the same problem.
I hope that someone can help me.
Well, it seems that my cluster enters safe mode because:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
I saw that error on the namenode web interface. I want to disable this check with the property dfs.safemode.threshold.pct, but I do not know whether that is a good way to solve it.
I think you can turn to dfs.datanode.fsdataset.volume.choosing.policy for help.
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
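If I recall correctly, the available-space policy can be tuned further with these companion settings in hdfs-site.xml; the values shown are what I believe the defaults to be, so please verify against hdfs-default.xml for your version.
<property>
  <!-- volumes whose free space differs by less than this many bytes are treated as balanced (believed default: 10 GB) -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <!-- fraction of new block allocations directed to the volumes with more free space (believed default: 0.75) -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>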
Use the dfs.datanode.du.reserved configuration setting in $HADOOP_HOME/conf/hdfs-site.xml for limiting disk usage.
Reference
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- cluster variant -->
  <value>182400</value>
  <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
  </description>
</property>
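Note that the value is in bytes, so the 182400 in the reference example reserves only about 178 KB per volume. As an illustration, reserving roughly 10 GB per volume for non-DFS use would look like this (the value is purely an example):
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 10 * 1024^3 = 10737418240 bytes reserved per volume for non-DFS use (illustrative) -->
  <value>10737418240</value>
</property>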
A colleague of mine thinks that HDFS has no maximum file size, i.e., that by partitioning a file into 128 or 256 MB chunks, any file size can be stored (obviously the HDFS disks have a finite size and that will impose a limit, but is that the only limit?). I can't find anything saying that there is a limit, so is she correct?
thanks, jim
Well, there is obviously a practical limit. But physically, HDFS block IDs are Java longs,
so they have a maximum of 2^63; with a 64 MB block size, that gives 2^63 blocks × 64 MB = 2^89 bytes, i.e., a maximum file size of 512 yottabytes.
I think she's right in saying there's no maximum file size on HDFS. The only thing you can really set is the chunk (block) size, which is 64 MB by default. I guess files of any length can be stored; the only constraint is that the bigger the file, the more hardware is needed to accommodate it.
I am not an expert in Hadoop, but AFAIK there is no explicit limitation on a single file's size, though there are implicit factors such as overall storage capacity and maximum namespace size. Also, there might be administrative quotas on the number of entities and on directory sizes. The HDFS capacity topic is very well described in this document. Quotas are described here and discussed here.
I'd recommend paying some extra attention to Michael G. Noll's blog referenced by the last link; it covers many Hadoop-specific topics.