why is the length of block after the block - binaryfiles

I'm extracting data from a binary file and see that the length of the binary data block comes after the block itself (the character chunks within the block have length first then 00 and then the information)
what is the purpose of the the block? is it for error checking?

Couple of examples:
The length of block was unknown when write operation began. Consider audio stream from microphone which we want to write as single block. It is not feasible to buffer it in RAM because it may be huge. That's why after we received EOF, we append effective size of block to the file. (Alternative way would be to reserve couple of bytes for length field in the beginning of block and then, after EOF, to write length there. But this requires more IO.)
Database WALs (write-ahead logs) may use such scheme. Consider that user starts transaction and makes lots of changes. Every change is appended as single record (block) to WAL. If user decides to rollback transaction, it is easy now to go backwards and then to chop off all records which were added as part of transaction user wants to rollback.
It is common for binary files to carry two blocks of metainformation: one block in the beginning (e.g. creation date, hostname) and another one in the end (e.g. statistics and checksum). When application opens existing binary file, it first wants to load these two blocks to make decisions about memory allocation and the like. This is much easier to load last block if its length is stored in the very end of file rather then scanning file from the beginning.

Related

How can I append bytes of data in an already uploaded file in StorJ?

I was getting segment error while uploading a large file.
I have read the file data in chunks of bytes using the Read method through io.Reader. Now, I need to upload the bytes of data continuously into the StorJ.
Storj, architected as an S3-compatible distributed object storage system, does not allow changing objects once uploaded. Basically, you can delete or overwrite, but you can't append.
You could make something that seemed like it supported append, however, using Storj as the backend. For example, by appending an ordinal number to your object's path, and incrementing it each time you want to add to it. When you want to download the whole thing, you would iterate over all the parts and fetch them all. Or if you only want to seek to a particular offset, you could calculate which part that offset would be in, and download from there.
sj://bucket/my/object.name/000
sj://bucket/my/object.name/001
sj://bucket/my/object.name/002
sj://bucket/my/object.name/003
sj://bucket/my/object.name/004
sj://bucket/my/object.name/005
Of course, this leaves unsolved the problem of what to do when multiple clients are trying to append to your "file" at the same time. Without some sort of extra coordination layer, they would sometimes end up overwriting each other's objects.

Very long lines of input in Ruby

I have a Ruby script that performs some substitutions on the output of mysqldump.
The input can have very long lines (hundreds of MB), because a single line can represent a multi-row INSERT statement for all the data in a table. The mysqldump utility can be coerced to produce one INSERT statement per row, but I don't have control of every client.
My script naively expects IO#each_line to control memory usage:
$stdin.each_line do |line|
next if options[:entity_excludes].any? { |entity| line =~ /^(DROP TABLE IF EXISTS|INSERT INTO) `custom_#{entity}(s|_meta)`/ }
line.gsub!(/^CREATE TABLE `/, "CREATE TABLE IF NOT EXISTS `")
line.gsub!('{{__OPF_SITEURL__}}', siteurl) if siteurl
$stdout.write(line)
end
I've already seen input with maximum line length over 400MB, and this translates directly into process resident memory.
Are there libraries for Ruby that allow text transforms on an input stream using buffers instead of relying on line-delimited input?
This was marked as a duplicate of a simpler question. But there's quite a bit more to this. You need to keep track of multiple buffers and test for application of transforms even when they apply across a buffer boundary. It's easy to get wrong, which is why I'm hoping a library already exists.

COMPARE AND WRITE on umapped block?

The COMPARE AND WRITE command description in the SBC-4 doesn't say anything about the case when the range of logical blocks to be replaced contains unmapped blocks.
What's the common practice to deal with this case on the target side? Should a target assume that the verification step is always successful in the case when an initiator asks to replace unmapped blocks with something meaningful?
When a block is unmapped it is equal to all zeroes. You should perform all your commands that read or compare data with this block as it was just zeroes.

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop and here I have the following questions:
(1) As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
(2) Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
No. A block(split to be precise) gets processed by only one mapper.
Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
You can get some valuable info, like the file containing split's data, the position of the first byte in the file to process. etc, with the help of FileSplit class. You might find it helpful.
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending FileInputFormat class. To begin with you could do this :
In your getSplits() method maintain a counter. Now, as you read the file line by line keep on tokenizing them. Collect each token and increase the counter by 1. Once the counter reaches the desired value, emit the data read upto this point as one split. Reset the counter and start with the second split.
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say 32mb max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
MapContext.getInputSplit() can usually be cast to a FileSplit and then you have the Path, offset and length of the file being / block being processed).
If your input files are true text flies, then you can use the method suggested by Tariq, but note this is highly inefficient for larger data sources as the Job Client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format), and write the records down to disk with a fixed number of words per file (using Multiple outputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case as for why you want a fixed number of words, we can better understand your needs and come up with alternatives

What does SetFileValidData doing ? what is the difference with SetEndOfFile?

I look for a way to extend a file asynchronously and efficiently .
In a support document Asynchronous Disk I/O Appears as Synchronous on Windows NT, Windows 2000, and Windows XP said:
NOTE: Applications can make the previously mentioned write operation
asynchronous by changing the Valid Data Length of the file by using
the SetFileValidData function, and then issuing a WriteFile.
in MSDN, SetFileValidData is a function for Sets the valid data length of the specified file.
But I still not understand what is the "valid data", what is the difference between it and the size of file?
I can use SetFilePointerEx and SetEndOfFile to extend the file size, but how do this by SetFileValidData?
SetFileValidData cannot input a argument large than the size of file. In this case, what is the living meaning of SetFileValidData?
When you use SetEndOfFile to increase the length of a file, the logical file length changes and the necessary disk space is allocated, but no data is actually physically written to the disk sectors corresponding to the new part of the file. The valid data length remains the same as it was.
This means you can use SetEndOfFile to make a file very large very quickly, and if you read from the new part of the file you'll just get zeros. The valid data length increases when you write actual data to the new part of the file.
That's fine if you just want to reserve space, and will then be writing data to the file sequentially. But if you make the file very large and immediately write data near the end of it, zeros need to be written to the new part of the file, which will take a long time. If you don't actually need the file to contain zeros, you can use SetFileValidData to skip this step; the new part of the file will then contain random data from previously deleted files.
Addendum:
The rules for sparse files are different.
You should not use SetFileValidData on a file that non-privileged users have read access to; this could leak content from deleted files that belonged to other users.
Please note that SetEndOfFile() doesn't write any zeros to any allocated sectors on disk, it just allocates the space pointers inside MFT records and then updates the space bitmap of the whole file system. But the OS, or FS, will record the valid/logical file length in its MFT record.
If you enlarge the file, from 1GB to 2GB, then the appended 1GB should be all zeros, but the FS won't write the zeros to disks, it refers to this file's valid length to know that the 1GB should be zeros. If you try to read from this enlarged 1GB portion, it will fill zeros directly in RAM and then feedback to your application. But if you write any byte inside this 1GB portion, the FS has to fill with zeros from the original 1GB offset to the current pointer that your application is trying to write to, but not the other bytes from the current location to the tail of the file. Meanwhile, it records the valid/logical length to be from 0 to the current location, the physical size and allocated size is still 2GB.
But, if you use SetFileValidData(), the FS will set the valid length to 2GB directly, and won't bother to fill any zeros. Whatever location you write to, it just writes, but whatever location you read from, you may read out some garbage data which was previously generated by other applications before the file was extended into that disk space.
Agree with Harry Johnston's answer, and from the practice point of view, while SetFileValidData has performance advantage because it does not require writing zeros, it does have security implications because the file might contain data from other deleted files. So a special privilege, SE_MANAGE_VOLUME_NAME, is required, as MSDN mentioned: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365544(v=vs.85).aspx
The reason is that, if the user account of the running program doesn't have that privilege, using SetFileValidData can expose other user's deleted data into the view of that particular file, so normal users (non-administrators) are not allowed to do that. Even for privileged users, they still need to take care to use ACL (access control lists) in the file system to protect that file so that it is not shared with non-privileged users.
It seems that SenEndofFile does not really allocate reserved disk space for the target file, SetFileValidData is responsible for the job.
Refered to MSDN,
You can use the SetFileValidData function to create large files in very specific circumstances so that the performance of subsequent file I/O can be better than other methods. Specifically, if the extended portion of the file is large and will be written to randomly, such as in a database type of application, the time it takes to extend and write to the file will be faster than using SetEndOfFile and writing randomly.
If SetEndOfFile really allocate space, then SetFileValidData will do nothing better than SetEndOfFile when writing randomly. So SetEndOfFile may just create a sparse file with hole(s), while SetFileValidData do the actual allocation.

Resources