From "Hadoop the definitive guide"
[Each map task has a circular memory buffer that it writes the output to. The buffer is
100 MB by default, a size that can be tuned by changing the io.sort.mb property. When
the contents of the buffer reaches a certain threshold size (io.sort.spill.percent,
which has the default 0.80, or 80%), a background thread will start to spill the contents
to disk]
The question here is: since each map task works on a single input split (which would more or less be equal to the size of an HDFS block, i.e. 64 MB), the condition for spilling to disk should never arise. Am I missing something? Please help.
Why do you assume the split size or the block size would be 64 MB? In practice I have seen that a small block size reduces performance (for the scale of data I analyse); I have seen better performance with a block size/split size of 256 MB in my use case.
Coming back to your question,
Having way too many mappers is also an overhead on the framework. In the use case described in the question we might indeed never spill key/value pairs from the 100 MB circular buffer. But consider these cases: the split size is 64 MB, yet the mapper does some calculations based on the input and emits the additional results as part of the map output, so the output can easily exceed the configured circular buffer size. Or the 64 MB split holds block-compressed data that bursts up in size when processed. Or think of mappers that fetch additional data via "side data distribution" or the distributed cache during the map phase.
Just an additional note: from my experience I can say that when we work with a framework, the default configuration will rarely suit our requirements; we need to tweak and tune the system to get the best possible performance.
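To make the tuning point concrete, here is a minimal sketch (not from the book) of how the spill buffer properties can be set through Hadoop's Configuration API. The class name and the 256 MB value are hypothetical; the property names shown are the MRv2 names, with the older io.sort.mb / io.sort.spill.percent keys quoted above being their pre-MRv2 equivalents.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical values: enlarge the circular buffer and keep the
            // default spill threshold. In older releases the equivalent keys
            // are io.sort.mb and io.sort.spill.percent.
            conf.setInt("mapreduce.task.io.sort.mb", 256);            // buffer size in MB
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill at 80% full

            Job job = Job.getInstance(conf, "spill-tuning-sketch");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }

Whether 256 MB is sensible depends entirely on how much the map output grows relative to the input, as described above.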
Related
I am implementing a model inspired by the NMT model. I am using a training set stored as TFRecords files, using a TFRecordDataset to fetch it and feed the model. Following Google's recommendations about input pipeline performances improvement, I have:
preprocessed as much as possible beforehand on CPU
stacked several training examples into TFRecord files of about 100 MB (having fewer files, each containing more examples)
used num_parallel_calls and prefetch on the Dataset map operations.
However, GPU utilization stays at 40% at most, and training is nearly as slow as when run on the CPU. I am thus wondering about the prefetch operation.
If I understand correctly, it will create a special thread that buffers N examples. But what does that mean? What happens to the other examples that are not buffered?
Is there an optimal relation between the prefetch buffer size, the number of examples in the complete Dataset, and the batch size? In the NMT code, the prefetch buffer size is set to 1000*batch_size, but why? If, for example, I am using 10000 examples and a batch size of 100, what should the prefetch buffer size be?
Any other advice regarding Dataset speedup would be appreciated.
Apparently, the Dataset API runs on the CPU and not on the GPU, so this answers the question.
I'm studying blocked sort-based indexing, and the algorithm talks about loading files in blocks of 32 or 64 KB, because disk reads happen block by block, which makes this efficient.
My first question is: how am I supposed to load a file by block? With a 64 KB buffered reader? And if I use a Java input stream, has this optimization already been done for me, so that I can just consume the stream?
I actually use Apache Spark, so does sparkContext.textFile() perform this optimization? And what about Spark Streaming?
I don't think the JVM gives you any direct view onto the file system that would make it meaningful to align reads with block sizes. Also, there are various kinds of drives and many different file systems now, and block sizes will most likely vary or even have little effect on the total I/O time.
The best performance would probably come from java.nio.FileChannel; you can then experiment with reading ByteBuffers of given block sizes to see whether it makes any performance difference. I would guess the only effect you will see is that the JVM overhead matters more for very small buffers (the extreme case being reading byte by byte).
You may also use the file-channel's map method to get hold of a MappedByteBuffer.
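As a rough illustration of the experiment suggested above, reading a file in fixed-size chunks with a FileChannel might look like the sketch below. The class name is hypothetical, and the 64 KB figure is simply the block size from the question, not a recommendation.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class BlockReadExperiment {
        public static void main(String[] args) throws IOException {
            int blockSize = 64 * 1024; // try 32 KB, 64 KB, 1 MB, ... and compare timings

            try (FileChannel channel = FileChannel.open(Paths.get(args[0]),
                                                        StandardOpenOption.READ)) {
                ByteBuffer buffer = ByteBuffer.allocateDirect(blockSize);
                long total = 0;
                while (channel.read(buffer) != -1) {
                    buffer.flip();
                    total += buffer.remaining(); // process the chunk here instead
                    buffer.clear();
                }
                System.out.println("Read " + total + " bytes in " + blockSize + "-byte chunks");
            }
        }
    }

Swapping the read loop for channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()) gives you the MappedByteBuffer variant mentioned above.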
I have been going through the documentation, but I cannot find the general guidelines for bulk loading.
As far as I can see, the best way to bulk load data into GraphDB is by using the LoadRDF tool.
However, the general rules for choosing appropriate settings are not clear to me.
First of all, if you have an "average" server with an SSD drive, what kind of parsing speed is acceptable? 1,000 statements/sec, 10,000 statements/sec, or is it much more or less than that?
Also, what are good settings? For example, you can set -Dpool.buffer.size, which has a default of 200,000 statements, but if you have 10 GB of RAM, what would be the rule of thumb for increasing this, and what if you have 100 or 300 GB of RAM?
Another option is -Dinfer.pool.size, which is set to as many threads as there are CPUs, with a minimum of 4. Thus 1 core = 4 threads and 32 cores = 32 threads. I assume this does not require any extra tuning, or is it only there if you want to reduce the CPU load and not overshoot to, say, 64 threads when you have 32 cores?
There are also extra options available through the Turtle file, with examples in configs/templates; perhaps owlim:cache-memory and owlim:tuple-index-memory could be useful during loading, while the other settings are more useful after loading?
Finally, does it also matter whether you have hundreds of individual files instead of one big Turtle file, and does compressing the files increase loading speed, or does it only reduce the initial disk usage?
For me personally, I currently have a setup with 290 GB of RAM, 32 cores, and 1.8 TB of RAID 0 SSD drives (which will be backed up after loading), and I am trying to do an initial load of 3 billion triples, from SSD to the same SSD. At the current overall speed of 16,461 statements per second this will take a while, and I am not sure whether and how I can improve it.
The best place to get a reference to the standard data loading speed is the GraphDB benchmark page.
From a computational point of view, the data loading process consists of generating unique internal IDs for all RDF resources and indexing all statements in multiple sorted collections such as PSOC, POSC and CPSO (if context indexes are enabled). This process is mainly affected by:
Reasoning complexity - the database integrates a forward-chaining inference engine, which means that for every newly added statement a predefined set of rules is fired recursively. Depending on the particular dataset and the configured rules, the number of materialised implicit statements may increase dramatically. The data loading speed is affected by the total number of indexed statements, not just by the explicit input triples.
Size of the dataset - as the number of indexed statements in each collection grows, the time to add more data also increases. The two main factors are the logarithmic complexity of the sorted collections and the number of page splits caused by the randomly arriving IDs in at least one of the collections.
The number of CPU cores will speed up the data loading only if there is inference. Importing each new file has a minimal overhead, so this should not be a concern unless the number of files becomes considerable. For the heap size, we have found that a combination of an SSD and a heap limited to 30 GB works best. If you restrict the heap size to 30 GB, you can benefit from -XX:+UseCompressedOops and still have reasonable GC times.
Please note that GraphDB 8.x will also reserve off-heap space for immutable data structures such as the mapping of RDF resources to internal IDs! For a 3B-statement dataset, this may become as big as 15 GB. The main reason behind this design decision is to save GC time.
I want to make sure I am getting this concept right:
In Hadoop: The Definitive Guide it is stated that "the goal while designing a file system is always to reduce the number of seeks in comparison to the amount of data to be transferred." In this statement the author is referring to the "seeks" of Hadoop's logical blocks, right?
I am thinking that no matter how big the Hadoop block size is (64 MB, 128 MB, or bigger), the number of seeks over the physical blocks (which are usually 4 KB or 8 KB) that the underlying filesystem (e.g. ext3/FAT) has to perform will stay the same.
Example: to keep the numbers simple, assume the underlying file system block size is 1 MB and we want to read a file of 128 MB. If the Hadoop block size is 64 MB, the file occupies 2 blocks and reading it involves 128 seeks. If the Hadoop block size is increased to 128 MB, the number of seeks performed by the file system is still 128; in the second case, Hadoop performs 1 seek instead of 2.
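For concreteness, here is a tiny sketch of that arithmetic under the same simplifying assumptions (1 MB underlying blocks, one seek per block at each layer, purely sequential reads); the class name is made up for illustration:

    public class SeekCountSketch {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileSize = 128 * mb;  // file to read
            long fsBlock  = 1 * mb;    // assumed underlying file system block

            for (long hadoopBlock : new long[] {64 * mb, 128 * mb}) {
                long physicalSeeks = fileSize / fsBlock;     // one per physical block
                long hadoopSeeks   = fileSize / hadoopBlock; // one per HDFS block
                System.out.printf("HDFS block %4d MB -> %d physical seeks, %d Hadoop-level seeks%n",
                        hadoopBlock / mb, physicalSeeks, hadoopSeeks);
            }
        }
    }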
Is my understanding correct?
If I am correct, a substantial performance improvement from increasing the block size will only be observed for very large files, right? I am thinking that for files in the ~1 GB range, reducing the number of seeks from ~20 (64 MB block size) to ~10 (128 MB block size) shouldn't make much of a difference, right?
You are correct that increasing the file system block size will improve performance. Linux requires that the block size be less than or equal to the page size. The x86 page size is limited to 4K; therefore, the largest block size that you can use is 4K, even if the file system can support larger block sizes. The performance benefits of a large block size and page size are significant: fewer read/write system calls, reduced rotational delays and seeks (leaving SSDs aside), fewer context switches, improved cache locality, fewer TLB misses, and so on. This is all goodness.
I analytically modeled the benefits of various block sizes based on our disk usage pattern and in some cases predicted order of magnitude improvements from the disk subsystem. This would shift the performance bottleneck elsewhere.
You are correct that substantial performance gains are possible. Unfortunately, a certain engineer who controls such improvements sees no value in page sizes larger than 4K. He mocks enterprise users who need high performance from largely homogeneous workloads on big iron, and focuses on heterogeneous workloads that are run interactively on desktop or laptop systems, where high performance is unimportant.
When recording from a microphone in CoreAudio, what is kAudioDevicePropertyBufferFrameSize for? The docs say it's "A UInt32 whose value indicates the number of frames in the IO buffers". However, this doesn't give any indication of why you would want to set it.
The kAudioDevicePropertyBufferFrameSizeRange property gives you a valid minimum and maximum for the buffer frame size. Does setting the buffer frame size to the maximum slow things down? When would you want to set it to something other than the default?
Here's what they had to say on the CoreAudio list:
An application that is looking for low-latency IO should set this value as small as it can keep up with.

On the other hand, apps that don't have large interaction requirements or other reasons for low latency can increase this value to allow the data to be chunked up in larger chunks and reduce the number of times per second the IOProc gets called. Note that this does not necessarily lower the total load on the system. In fact, increasing the IO buffer size can have the opposite impact, as the buffers are larger, which makes them much less likely to fit in caches and what not, which can really sap performance.

At the end of the day, the value an app chooses for its IO size is really something that is dependent on the app and what it does.
Usually you'd leave it at the default, but you might want to change the buffer size if you have an AudioUnit in the processing chain that expects or is optimized for a certain buffer size.
Also, generally, larger buffer sizes result in higher latency between recording and playback, while smaller buffer sizes increase the CPU load of each channel being recorded.