Hadoop Benchmark: TestDFSIO

I am testing my Hadoop configuration with the Apache-provided benchmark TestDFSIO. I'm running it according to this tutorial (resource 1):
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/#testdfsio
The usage of the test is as follows:
TestDFSIO.0.0.4
Usage: hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO
-read | -write | -clean
[-nrFiles N] [-fileSize MB]
[-resFile resultFileName] [-bufferSize Bytes]
I'm a little confused about some of the flags. Specifically, what is the buffer size flag for? Also, while navigating HDFS after the job completed successfully (I first performed a TestDFSIO write), I couldn't find the file I supposedly created by choosing a resultFileName. Why can't I find the file by the resultFileName I used?
I have also looked at this page (resource 2), specifically page 25:
http://wr.informatik.uni-hamburg.de/_media/research/labs/2009/2009-12-tien_duc_dinh-evaluierung_von_hadoop-report.pdf
As one of the parameters of their test, they were using block sizes of 64MB and 128MB. I tried putting '64MB' (converted to bytes) after the bufferSize flag, but this led to a failed job, which leads me to believe I do not understand what the bufferSize flag is for, or how to use different block sizes for testing. How do you change the block size of the test (as per resource 2)?

What is the buffer size flag for?
The buffer size flag describes the length of the write buffer in bytes. See the WriteMapper constructor in TestDFSIO.java:
public WriteMapper() {
  for (int i = 0; i < bufferSize; i++)
    buffer[i] = (byte) ('0' + i % 50);
}
Here, data is generated and written into the in-memory buffer before being written to disk. When it is later written to disk, a whole buffer is written in one call rather than one call per byte. Fewer write calls generally mean better performance, so a larger buffer might improve performance.
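To make the effect concrete, here is a minimal, self-contained sketch (not TestDFSIO's actual write path; the file name and sizes are made up) that fills a buffer the way WriteMapper does and then writes it out in bufferSize-byte chunks:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteSketch {
    public static void main(String[] args) throws IOException {
        int bufferSize = 1000000;            // value passed via -bufferSize
        long fileSize = 10L * 1024 * 1024;   // total bytes to write (10 MB here)

        byte[] buffer = new byte[bufferSize];
        for (int i = 0; i < bufferSize; i++) {
            buffer[i] = (byte) ('0' + i % 50);   // same fill pattern as WriteMapper
        }

        try (OutputStream out = new FileOutputStream("testdfsio-sketch.dat")) {
            for (long remaining = fileSize; remaining > 0; remaining -= bufferSize) {
                int chunk = (int) Math.min(remaining, (long) bufferSize);
                out.write(buffer, 0, chunk);     // one write call per buffer, not per byte
            }
        }
    }
}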
Why can't I find the file by the resultFileName I used?
Results are usually automatically written to /benchmarks/TestDFSIO. If you don't find it there, search for mapred.output.dir in your job log.
How do you change the block size of the test (as per resource 2)?
The block size can be passed as a generic option. Try something like:
hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -D dfs.block.size=134217728 -write
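For context, the reason a -D option like this ends up in the job configuration is Hadoop's generic option parsing for programs driven through ToolRunner. The sketch below shows that general Tool pattern; it is illustrative only (the class name is made up) and is not TestDFSIO's actual source:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GenericOptionsSketch extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // With "-D dfs.block.size=134217728" on the command line, the value is
        // already present in the Configuration by the time run() is called.
        System.out.println("dfs.block.size = " + getConf().get("dfs.block.size"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new GenericOptionsSketch(), args));
    }
}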

Why can't I find the file by the resultFileName I used?
You should probably have seen a line like this at the end of the job execution log:
java.io.FileNotFoundException: File does not exist: /benchmarks/TestDFSIO/io_write/part-00000
With TestDFSIO this usually means that LZO or some other compression is in use (so something extra is appended to the filename).
So instead of looking for
/benchmarks/TestDFSIO/io_write/part-00000
try this (note the * wildcard at the end):
hadoop fs -ls /benchmarks/TestDFSIO/io_write/part-00000*
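If you prefer to check programmatically, a small hypothetical helper using FileSystem.globStatus does the same wildcard lookup (the class name is just for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindTestDFSIOOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The trailing * also matches compressed outputs such as part-00000.lzo
        FileStatus[] matches = fs.globStatus(new Path("/benchmarks/TestDFSIO/io_write/part-00000*"));
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}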

Try this for the question "How do you change the block size of the test (as per resource 2)?":
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-*test*.jar TestDFSIO -write -nrFiles 4 -fileSize 250GB -resFile /tmp/TestDFSIOwrite.txt

Related

What determines bv_len inside BIO structure (for I/O request)?

I built a RAM-based virtual block device driver with the blk-mq API that uses none as its I/O scheduler. I am running fio to perform random reads/writes on the device and noticed that bv_len in each bio request is always 1024 bytes. I am not aware of any place in the code that sets this value explicitly. The file system is ext4.
Is this a default config or something I could change in code?
I am not aware of any place in the code that sets this [bv_len] value explicitly.
In a 5.7 kernel, isn't it set explicitly in __bio_add_pc_page() and __bio_add_page() (both within block/bio.c)? You'll have to trace back through the callers to see how the passed len was set, though.
(I found this by searching for the bv_len identifier in LXR and then going through the results.)
However, @stark's comment about tune2fs is the key to any answer. You never told us the filesystem block size, and if your block device is "small" it's likely your filesystem is also small, and by default the choice of block size depends on that. If you read the mke2fs man page you will see it says the following:
-b block-size
Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block. If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesystem (see the -T option).
[...]
-T usage-type[,...]
[...]
If this option is not specified, mke2fs will pick a single default usage type based on the size of the filesystem to be created. If the filesystem size is less than or equal to 3 megabytes, mke2fs will use the filesystem type floppy. [...]
And if you look in the default mke2fs.conf the blocksize for a floppy is 1024.

Writing last N bytes to file opened with FILE_FLAG_NO_BUFFERING

When writing lots of sequential data to disk, I found that having an internal 4 MB buffer helps, and when opening the file for writing I specify FILE_FLAG_NO_BUFFERING, so that my internal buffer is used.
But that also creates a requirement to write in full sector blocks (512 bytes on my machine).
How do I write the last N<512 bytes to disk?
Is there some flag to WriteFile to allow this?
Do I pad them with extra NUL characters and then truncate the file size down to the correct value?
(With SetFileValidData or similar?)
For those wondering about the reason for trying this approach: our application logs a lot. To handle this, a dedicated log thread exists, which formats and writes logs to disk. Also, if we log with the fullest detail, we might log more per second than the disk system can handle. (Usually noticed for customers with SAN systems that are not well tweaked.)
So, the goal is to write as much log data as possible, but also to notice when we start to overload the system, and then hold back a bit, for example by reducing the detail of the logs.
Hence the idea to fill a big memory block and hand it to the OS, hoping to reduce the overhead.
As the comments suggest, doing file writing this way is probably not the best solution for real-world situations. But if writing with FILE_FLAG_NO_BUFFERING is used, SetFileInformationByHandle is the way to mark the file as shorter than a whole number of blocks.
// str holds the remaining data; pad it up to a multiple of BLOCKSIZE,
// write the padded buffer, then shrink the file back to its real size.
DWORD data_len = static_cast<DWORD>(str.size());
DWORD len_last_block = data_len % BLOCKSIZE;
DWORD padding_to_fill_block = (len_last_block == 0) ? 0 : (BLOCKSIZE - len_last_block);
str.append(padding_to_fill_block, '\0');

DWORD bytes_written = 0;
::WriteFile(hFile, str.data(), data_len + padding_to_fill_block, &bytes_written, NULL);
m_filesize += bytes_written;

// Trim the padding off again so the file reports its true length.
FILE_END_OF_FILE_INFO end_of_file_pos;
end_of_file_pos.EndOfFile.QuadPart = m_filesize - padding_to_fill_block;
if (!::SetFileInformationByHandle(hFile, FileEndOfFileInfo, &end_of_file_pos, sizeof(end_of_file_pos)))
{
    DWORD err = ::GetLastError();
}

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20gb+). Each process is writing its own file and assume that the process writes x mb at a time, then does some processing and writes x mb again, etc..
What happens is that this write pattern causes the files to become heavily fragmented, since blocks of the different files end up allocated interleaved on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then setting the correct size before it is closed. But now an application accessing these files remotely, which is able to parse these in-progress files, obviously sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks.
Another, more drastic, solution would be to implement a minifilter driver which would report a "fake" file size.
But obviously both solutions listed above are far from optimal. So I would like to know whether there is a way to provide a file size hint to the filesystem so it "reserves" the consecutive space on the drive, but still reports the right file size to applications.
Obviously, writing larger chunks at a time also helps with fragmentation, but it still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed, I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
When the application was executed and waiting for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has a StreamDataSize of 0.
Sysinternals DiskView also confirms that clusters were allocated.
When pressing Enter to let the test continue (and waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture).
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

How to avoid Parquet MemoryManager exception

I'm generating some parquet (v1.6.0) output from a PIG (v0.15.0) script. My script takes several input sources and joins them with some nesting. The script runs without error but then during the STORE operation I get:
2016-04-19 17:24:36,299 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=FAILED, progress=TotalTasks: 249 Succeeded: 220 Running: 0 Failed: 1 Killed: 28 FailedTaskAttempts: 43, diagnostics=Vertex failed, vertexName=scope-1446, vertexId=vertex_1460657535752_15030_1_18, diagnostics=[Task failed, taskId=task_1460657535752_15030_1_18_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:parquet.hadoop.MemoryManager$1: New Memory allocation 134217728 exceeds minimum allocation size 1048576 with largest schema having 132 columns
at parquet.hadoop.MemoryManager.updateAllocation(MemoryManager.java:125)
at parquet.hadoop.MemoryManager.addWriter(MemoryManager.java:82)
at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:104)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:309)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:81)
at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:398)
...
The above exception was thrown when I executed the script using -x tez, but I get the same exception when using mapreduce. I have tried to increase parallelization using SET default_parallel, as well as adding an (unnecessary w.r.t. my real objectives) ORDER BY operation just prior to my STORE operations to ensure Pig has an opportunity to ship data off to different reducers and minimize the memory required on any given reducer. Finally, I've tried pushing up the available memory using SET mapred.child.java.opts. None of this has helped, however.
Is there something I'm just missing? Are there known strategies for avoiding the issue of one reducer carrying too much of the load and causing things to fail during write? I've experienced similar issues writing to avro output that appear to be caused by insufficient memory to execute the compression step.
EDIT: per this source file, the issue seems to boil down to the fact that memAllocation / nCols < minMemAllocation. However, the memory allocation seems unaffected by the mapred.child.java.opts setting I tried out.
I finally solved this using the parameter parquet.block.size. The default value (see the source) is big enough to write a 128-column-wide file, but no bigger. The solution in Pig was to use SET parquet.block.size x; where x >= y * 1024^2 and y is the number of columns in your output.
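As a worked example of that rule of thumb, with the 132-column schema from the stack trace above the block size needs to be at least 132 * 1048576 = 138412032 bytes. A minimal sketch of setting the same property from Java on a Hadoop Configuration (the class name is hypothetical; the property name is the one used above):
import org.apache.hadoop.conf.Configuration;

public class ParquetBlockSizeSketch {
    public static void main(String[] args) {
        int columns = 132;                        // width of the output schema
        long minPerColumn = 1024L * 1024L;        // 1 MB minimum allocation per column
        long blockSize = columns * minPerColumn;  // 138412032 bytes

        Configuration conf = new Configuration();
        conf.setLong("parquet.block.size", blockSize);
        System.out.println("parquet.block.size = " + conf.get("parquet.block.size"));
    }
}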

When it comes to MapReduce, how are the Accumulo tablets mapped to an HDFS block?

If my environment set up is as follows:
- 64MB HDFS block
- 5 tablet servers
- 10 tablets of size 1GB each per tablet server
If I have a table like below:
rowA | f1 | q1 | v1
rowA | f1 | q2 | v2
rowB | f1 | q1 | v3
rowC | f1 | q1 | v4
rowC | f2 | q1 | v5
rowC | f3 | q3 | v6
From the little documentation, I know that all data about rowA will go to one tablet, which may or may not contain data about other rows, i.e. it's all or none. So my questions are:
How are the tablets mapped to a DataNode or HDFS block? Obviously, one tablet is split into multiple HDFS blocks (16 in this case, given 1GB tablets and 64MB blocks), so would they be stored on the same or different datanode(s), or does it not matter?
In the example above, would all data about RowC (or A or B) go onto the same HDFS block or different HDFS blocks?
When executing a map reduce job how many mappers would I get? (one per hdfs block? or per tablet? or per server?)
Thank you in advance for any and all suggestions.
To answer your questions directly:
How are the tablets mapped to a DataNode or HDFS block? Obviously, one tablet is split into multiple HDFS blocks (16 in this case, given 1GB tablets and 64MB blocks), so would they be stored on the same or different datanode(s), or does it not matter?
Tablets are stored in blocks like all other files in HDFS. You will typically see all blocks for a single file on at least one data node (this isn't always the case, but it seems to mostly hold true when I've looked at block locations for larger files).
In the example above, would all data about RowC (or A or B) go onto the same HDFS block or different HDFS blocks?
It depends on the block size configured for your tablets (dfs.block.size, or the Accumulo property table.file.blocksize if it is configured). If the block size is the same as the tablet size, then obviously they will be in the same HDFS block. Otherwise, if the block size is smaller than the tablet size, then it's pot luck as to whether they are in the same block or not.
When executing a map reduce job how many mappers would I get? (one per hdfs block? or per tablet? or per server?)
This depends on the ranges you give InputFormatBase.setRanges(Configuration, Collection<Ranges>).
If you scan the entire table (-inf -> +inf), then you'll get a number of mappers equal to the number of tablets (caveated by disableAutoAdjustRanges). If you define specific ranges, you'll get different behavior depending on whether you've called InputFormatBase.disableAutoAdjustRanges(Configuration) or not (a short configuration sketch follows the list below):
If you have called this method, then you'll get one mapper per range defined. Importantly, if you have a range that starts in one tablet and ends in another, you'll get one mapper to process that entire range.
If you don't call this method and you have a range that spans tablets, then you'll get one mapper for each tablet the range covers.
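Here is the configuration sketch mentioned above. It mirrors the 1.4-era signatures quoted earlier (newer Accumulo versions take a Job instead of a Configuration); the row keys and class name are only illustrative:
import java.util.Collections;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

public class RangeSetupSketch {
    // conf is the MapReduce job's Configuration.
    static void configureRanges(Configuration conf, boolean oneMapperPerRange) {
        // A single explicit range that may span several tablets.
        Range range = new Range(new Text("rowA"), new Text("rowC"));
        AccumuloInputFormat.setRanges(conf, Collections.singleton(range));

        if (oneMapperPerRange) {
            // The range is no longer split per tablet: one mapper covers it all.
            AccumuloInputFormat.disableAutoAdjustRanges(conf);
        }
        // Otherwise the range is auto-adjusted and you get one mapper for each
        // tablet the range overlaps.
    }
}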
For writing to Accumulo (data ingest), it makes sense to run MapReduce jobs, where the mapper inputs are your input files on HDFS. You would basically follow this example from Accumulo documentation:
http://accumulo.apache.org/1.4/examples/mapred.html
(Section IV of this paper provides some more background on techniques for ingesting data into Accumulo: http://ieee-hpec.org/2012/index_htm_files/byun.pdf)
For reading from Accumulo (data query), I would not use MapReduce. Accumulo/Zookeeper will automatically distribute your query across tablet servers. If you're using rows as atomic records, use (or extend) the WholeRowIterator and launch a Scanner (or BatchScanner) on the range of rows you're interested in. The Scanner will run in parallel across your tablet servers. You don't really want to access Accumulo data directly from HDFS or MapReduce.
Here's some example code to help get you started:
//some of the classes you'll need (in no particular order)...
import java.util.Map.Entry;
import java.util.SortedMap;
import org.apache.accumulo.core.Constants;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.hadoop.io.Text;

//Accumulo client code...

//Accumulo connection
Instance instance = new ZooKeeperInstance( /* put your installation info here */ );
Connector connector = instance.getConnector(username, password);

//setup a Scanner or BatchScanner
Scanner scanner = connector.createScanner(tableName, Constants.NO_AUTHS);
Range range = new Range(new Text("rowA"), new Text("rowB"));
scanner.setRange(range);

//use a WholeRowIterator to keep rows atomic
IteratorSetting itSettings = new IteratorSetting(1, WholeRowIterator.class);
scanner.addScanIterator(itSettings);

//now read some data!
for (Entry<Key, Value> entry : scanner) {
    SortedMap<Key, Value> wholeRow = WholeRowIterator.decodeRow(entry.getKey(), entry.getValue());
    //do something with your data!
}
