Modifying an IO stream in-place? (Ruby)

I've been writing a Ruby program that merges the content of two files.
For example, if a torrent has been downloaded twice separately, it tries to merge their contents for the blocks that have been completed.
So I've been looking for a method that modifies a stream only at the required place and saves only that block instead of saving the whole stream again.
I'm reading the file in blocks of 16 KiB. How do I "replace" (not append) the content of one of those 16 KiB blocks so that only those bytes are written to disk and the whole file isn't rewritten each time?
Kind of,
#Doesn't exist unfortunately.
#By default it appends instead of replacing, so file size grows.
IO.write(file_name, content, offset, :replace => true)
Does a method exist that achieves that kind of functionality?

Open the file in "r+b" mode, seek to the location and just write to it:
f = File.new("some.existing.file", "r+b")
f.seek(1024)
f.write("test\n")
f.close
This will overwrite 5 bytes of the file, starting at offset 1024.
If the file is shorter than your seek offset, an appropriate number of null characters are inserted into the file to fill the gap.
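Applied to the question's 16 KiB blocks, a minimal sketch might look like this (the file name and the replace_block helper are made up):
BLOCK_SIZE = 16 * 1024

# Overwrite block number `index` of an existing file in place,
# without rewriting the rest of the file.
def replace_block(file_name, index, content)
  File.open(file_name, "r+b") do |f|   # read/write, binary, no truncation
    f.seek(index * BLOCK_SIZE)
    f.write(content)                   # only these bytes are rewritten
  end
end

# e.g. replace_block("download.part", 2, block_from_other_copy)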

Related

ruby file not created new when "created" multiple times in an each loop

I am trying to run a regex over the files in some subfolders, but this leads to unwanted behaviour. It seems that Ruby doesn't create the temporary file fresh but instead just re-opens it and continues. Every time I run the script, the whole file (or multiple files?) seems to get appended to and grows and grows and grows.
The first step of my strategy is as follows:
I discover the files in the subdirectories via Dir.glob, save the full paths in an array, and read each file into a buffer in an each loop.
I close the file and run the regex over the buffer.
The second step consists of
another each loop over the array of full paths of the files I want to manipulate, in which
I create a temporary file in the same folder, every time with the same name,
delete the original file,
write the regex'ed buffer into the temporary file,
rename the temporary file to the original file name.
So:
In the same directory as the original file
fh2 = File.new(temp_filename, "w:UTF-8")
count = fh2.write(file_buffer)
puts "That was #{count} bytes of data to #{temp_filename}"
fh2.close
count grows larger with each file that is manipulated/created this way.
Could it be that it doesn't create a fresh temp file because of the same file handle, the same name, or the w option together with File.new?
Thanks
Found my mistake. I didn't empty the buffer (and the buffer isn't local to the block!)
# load file in buffer
fh = File.open(file, "r+:UTF-8")
while (line = fh.gets)
  file_buffer += line
end
fh.close
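For reference, a minimal sketch of the fix described above (same names as the snippet): clear the buffer before reading each file.
file_buffer = ""                  # reset the buffer for every file (the missing step)
fh = File.open(file, "r+:UTF-8")
while (line = fh.gets)
  file_buffer += line
end
fh.close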
The thread can be closed. Sorry for wasting your time.

why is the length of a block stored after the block

I'm extracting data from a binary file and see that the length of the binary data block comes after the block itself (the character chunks within the block have the length first, then 00, and then the information).
What is the purpose of putting the length after the block? Is it for error checking?
Couple of examples:
The length of the block was unknown when the write operation began. Consider an audio stream from a microphone which we want to write as a single block. It is not feasible to buffer it in RAM because it may be huge. That's why, after we receive EOF, we append the effective size of the block to the file. (An alternative would be to reserve a couple of bytes for a length field at the beginning of the block and then, after EOF, write the length there, but this requires more IO.)
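A rough Ruby sketch of that layout, assuming a 4-byte big-endian length field; stream_chunks and the file name are stand-ins for whatever produces the incoming data:
File.open("capture.bin", "wb") do |out|
  bytes_written = 0
  stream_chunks.each { |chunk| bytes_written += out.write(chunk) }  # total size unknown up front
  out.write([bytes_written].pack("N"))  # append the block's length after the data
end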
Database WALs (write-ahead logs) may use such a scheme. Consider that a user starts a transaction and makes lots of changes. Every change is appended as a single record (block) to the WAL. If the user decides to roll back the transaction, it is now easy to go backwards and chop off all the records that were added as part of the transaction being rolled back.
It is common for binary files to carry two blocks of metainformation: one block at the beginning (e.g. creation date, hostname) and another at the end (e.g. statistics and checksum). When an application opens an existing binary file, it first wants to load these two blocks to make decisions about memory allocation and the like. It is much easier to load the last block if its length is stored at the very end of the file than to scan the file from the beginning.

How many bytes does a file called 'file.txt' with the text "test" take?

I have a file on my desktop called 'file.txt' and it contains the text "test".
When I right-click the file, go to Properties, and view the size, it says 4 bytes.
This makes sense because 4 characters = 4 bytes, but the file is called file.txt, so the name must take some space too, right?
It only says the file takes 4 bytes and nothing more.
I have tried searching the web but I could not find an answer to this question.
So how many bytes does a file called 'file.txt' with the text "test" actually take?
The size of a file is given by its content. In your case "test" is 4 chars in length = 4 bytes. Obviously you will end up with this size regardless of the filename.
The name of a file is stored in the directory information structure, which depends entirely on the filesystem in question. For more information on this topic you can consult https://unix.stackexchange.com/questions/117325/where-are-filenames-stored-on-a-filesystem
It depends on your OS. The OS does not store files on a byte-by-byte basis; a file has to fit into one or more blocks. The most common block size is 4 KB, so your file will take up one block on the disk, whatever the block size is on your specific system. As to where the filename is stored, that depends on your filesystem: FAT, NTFS, ext3, HFS, etc. all have lookup tables/structures to store the name and other metadata. The details are outside the scope of this answer.
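In Ruby you can compare the logical size with the allocated space; as a small sketch, File::Stat#blocks counts 512-byte units on most Unix-like systems and may be nil elsewhere:
st = File.stat("file.txt")
puts "content size:   #{st.size} bytes"             # 4 for "test"
puts "allocated size: #{st.blocks * 512} bytes" if st.blocks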

Write to specific part of preallocated file

I am currently trying to write to different locations of a pre-allocated file.
I first allocated my file like so:
File.open("file", "wb") { |file| file.truncate(size) }
Size being the total size of the file.
Afterwards I receive data of size XX which fits into location Y of that file. Keep in mind this portion of the process is forked. Each fork has its own unique socket and opens its own unique file handle, writes to the file, then closes it, like so.
data = socket.read(256)
File.open("file", "wb") do |output|
  output.seek(location * 256, IO::SEEK_SET)
  output.write(data)
end
This should in turn allow the forked processes to open a file handle, seek to the correct location (if location is 2 and data_size is 256, then the write range is 512 -> 768), and write the chunk of data they received.
What it is actually doing is beyond my comprehension, though. I monitor the file's size as it is being populated and it bounces around between different sizes, which should not be changing.
When analyzing the file with a hex editor, the area at the top where the file data header should be is filled with null bytes (likewise with about a quarter of the file). However, if I limit the forked processes to writing only one chunk and then exiting, the writes are fine and at their proper locations.
I have done some other testing, such as dumping the part locations and the start locations of the data, and my equation for seeking to the correct location in the file seems to be correct as well.
Is there something I am missing here or is there another way to have multiple threads/processes open a file handle to a file, seek to a specific location, and then write a chunk of data?
I have also attempted to use flock on the file, and it yields the same results, likewise with using the main process instead of forking.
I have tested the same application, but rather than opening/closing the file handle each time I need to write data in rapid succession (transferring close to 70 MB/s), I created one file handle per forked process and kept it open. This fixed the problem, resulting in a 1:1 duplication of the file with matching checksums.
So the question is, why is opening/writing/closing file handles to a file in rapid succession causing this behavior?
It's your file mode.
File.open("file", "wb")
"wb" means "upon opening, truncate the file to zero length".
I suggest "r+b", which means "reading and writing, no truncation". Read more about available modes here: http://ruby-doc.org/core-2.2.2/IO.html#method-c-new
BTW, "b" in those modes means "binary" (as opposed to default "t" (text))

Determining end of JPEG (when merged with another file)

I'm creating a program that "hides" an encrypted file at the end of a JPEG. The problem is, when retrieving this encrypted file again, I need to be able to determine where the JPEG it was stored in ends. At first I thought this wouldn't be a problem because I can just run through the file checking for 0xFF and 0xD9, the bytes JPEG uses to end the image. However, I'm noticing that in quite a few JPEGs this combination of bytes is not exclusive, so my program thinks the image has ended randomly halfway through it.
I'm thinking there must be a set way of expressing that a JPEG has finished, otherwise adding a load of bytes to the end of the file would obviously corrupt it. Is there a practical way to do this?
You should read the JFIF file format specifications
Well, there are always two places in the file that you can find with 100% reliability: the beginning and the end. So, when you add the hidden file, also add another 4 bytes that store the original length of the file, plus a special signature that's always distinct. When reading it back, first seek to end - 8 and read that length and signature. Then just seek to that position.
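A minimal sketch of that trailer scheme, assuming a 4-byte big-endian length and a 4-byte signature (the signature value and helper names are made up):
MAGIC = "HID1"   # hypothetical 4-byte signature marking "a payload is hidden here"

def embed(jpeg_path, payload)
  original_size = File.size(jpeg_path)
  File.open(jpeg_path, "ab") do |f|
    f.write(payload)                      # hidden data goes after the JPEG
    f.write([original_size].pack("N"))    # original JPEG length
    f.write(MAGIC)                        # signature, always the last 4 bytes
  end
end

def extract(jpeg_path)
  File.open(jpeg_path, "rb") do |f|
    f.seek(-8, IO::SEEK_END)
    original_size = f.read(4).unpack1("N")
    return nil unless f.read(4) == MAGIC  # no hidden payload present
    f.seek(original_size)
    f.read(File.size(jpeg_path) - original_size - 8)
  end
end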
You should read my answer on this question "Detect Eof for JPG images".
You're likely running into the thumbnail in the header. When moving through the file you should find that most marker segments contain a length indicator; there are references for which do and which don't. You can skip the bytes within those segments, as the true EOI marker will not be within them.
Within the actual JPEG compressed data, any FF byte should be followed either by 00 (the zero byte is then discarded), or by FE to mark a comment (which has a length indicator, and can be skipped as described above).
Theoretically, the only way you encounter a false EOI reading in the compressed data is within a comment.
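A rough sketch of that segment-walking approach (not a complete JPEG parser; it assumes a well-formed file and also skips restart markers inside the compressed data, which the answer above doesn't mention):
def find_eoi_offset(path)
  data = File.binread(path)
  pos = 2                                            # skip SOI (FF D8)
  while pos < data.bytesize - 1
    marker = data.getbyte(pos + 1)
    return pos + 2 if marker == 0xD9                 # real end-of-image marker
    if marker == 0xDA                                # SOS: entropy-coded data follows
      pos += 2 + data[pos + 2, 2].unpack1("n")       # skip the SOS header itself
      # scan until an FF that is not byte stuffing (FF 00) or a restart marker
      until pos >= data.bytesize - 1 ||
            (data.getbyte(pos) == 0xFF &&
             data.getbyte(pos + 1) != 0x00 &&
             !(0xD0..0xD7).cover?(data.getbyte(pos + 1)))
        pos += 1
      end
    elsif marker == 0x01 || (0xD0..0xD8).cover?(marker)
      pos += 2                                       # standalone marker, no length field
    else
      pos += 2 + data[pos + 2, 2].unpack1("n")       # skip the length-prefixed segment
    end
  end
  nil
end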
