I am working with proprietary files that are very similar to WAV files, but with a custom header that is longer than the WAV header (200 bytes versus 36 bytes). The sample data is the same, though. These files are quite large (typically 200 MB).
I am trying to batch convert the proprietary files to wav.
I wrote a short script using the wavefile gem: I read the whole array of samples and then create the WAV file. It works fine for smaller examples, but I get a memory allocation error for larger ones.
I noticed that copying the file with FileUtils.cp is impressively fast. I am wondering if I could somehow copy the file while "omitting" the first 164 bytes, then just write the wave header in the first 36 bytes and rename the file (.wav).
What would be the best/easiest way?
Something like this would likely work:
chunk_size = 1024 * 1024           # e.g. copy 1 MiB at a time
File.open(src, 'rb') do |r|
  File.open(dst, 'wb') do |w|
    w.write(new_dst_header)        # the new WAV header
    r.seek(200)                    # skip the 200-byte proprietary header
    until r.eof?
      w.write(r.read(chunk_size))
    end
  end
end
The bigger chunk_size is, the faster it goes and the more memory it uses.
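In case it helps, here is a sketch of how new_dst_header might be built for plain 16-bit PCM; the sample rate, channel count and bit depth below are placeholders you would replace with the values from your proprietary header. This produces the standard 44-byte PCM header; the 36 you mention appears as the constant added to the data size in the RIFF chunk-size field.

def wav_header(data_size, sample_rate: 44_100, channels: 1, bits: 16)
  byte_rate   = sample_rate * channels * (bits / 8)
  block_align = channels * (bits / 8)
  ["RIFF", data_size + 36, "WAVE",                                      # RIFF chunk
   "fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,  # fmt chunk (PCM)
   "data", data_size].pack("a4Va4a4VvvVVvva4V")                         # data chunk header
end

new_dst_header = wav_header(File.size(src) - 200)   # data = everything after the proprietary header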
I am currently trying to write to different locations of a pre-allocated file.
I first allocated my file like so:
File.open("file", "wb") { |file| file.truncate(size) }
where size is the total size of the file.
Afterwards I receive data of XX size which fits into Y location of that file. Keep in mind this portion of the process is forked. Each fork has its own unique socket and opens its own unique file handle, writes to the file, then closes it like so.
data = socket.read(256)
File.open("file", "wb") do |output|
  output.seek(location * 256, IO::SEEK_SET)
  output.write(data)
end
This should, in turn, allow the forked processes to open a file handle, seek to the correct location (if location is 2 and the data size is 256, then the write range is 512 to 768) and write the chunk of data that each one received.
What is actually happening, though, is beyond my comprehension. I monitor the file's size as it is being populated and it bounces around between different sizes, which should not be changing.
When analyzing the file with a hex editor, the area at the top where the data header should be is filled with null bytes (likewise with about a quarter of the file). However, if I limit the forked processes to write only one chunk and then exit, the writes are fine and land at their proper locations.
I have done some other testing, such as dumping the chunk locations and the start offsets of the data, and my formula for seeking to the correct location in the file appears to be correct as well.
Is there something I am missing here or is there another way to have multiple threads/processes open a file handle to a file, seek to a specific location, and then write a chunk of data?
I have also attempted to use flock on the file, and it yields the same results; likewise when using the main process instead of forking.
I have tested the same application, but rather than opening/closing the file handle each time I need to write data in rapid succession (transferring close to 70 MB/s), I created one file handle per forked process and kept it open. This fixed the problem, resulting in a 1:1 duplication of the file with matching checksums.
So the question is, why is opening/writing/closing file handles to a file in rapid succession causing this behavior?
It's your file mode.
File.open("file", "wb")
"wb" means "upon opening, truncate the file to zero length".
I suggest "r+b", which means "reading and writing, no truncation". Read more about available modes here: http://ruby-doc.org/core-2.2.2/IO.html#method-c-new
BTW, "b" in those modes means "binary" (as opposed to default "t" (text))
I have thousands (or more) of gzipped files in a directory (on a Windows system) and one of my tools consumes those gzipped files. If it encounters a corrupt gzip file, it conveniently ignores it instead of raising an alarm.
I have been trying to write a Perl program that loops through each file and makes a list of files which are corrupt.
I am using the Compress::Zlib module and have tried reading the first 1 KB of each file, but that did not work, since some of the files are corrupted towards the end (verified during a manual extract, with the alarm raised only near the end) and reading the first 1 KB doesn't reveal a problem. I am wondering if a CRC check of these files would be of any help.
Questions:
Will CRC validation work in this case? If yes, how does it work? Will the true CRC be part of the gzip header, to be compared with a CRC calculated from the file we have? How do I accomplish this in Perl?
Are there any other simpler ways to do this?
In short, the only way to check a gzip file is to decompress it until you get an error, or get to the end successfully. You do not however need to store the result of the decompression.
The CRC stored at the end of a gzip file is the CRC of the uncompressed data, not the compressed data. To use it for verification, you have to decompress all of the data. This is what gzip -t does, decompressing the data and checking the CRC, but not storing the uncompressed data.
Often a corruption in the compressed data will be detected before getting to the end. But if not, then the CRC, as well as a check against an uncompressed length also stored at the end, will with a probability very close to one detect a corrupted file.
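As an illustration of that "decompress and discard" check, here is a sketch in Ruby's bundled zlib wrapper (Ruby is what the rest of this page uses; the same loop can be written with Compress::Zlib in Perl):

require "zlib"

corrupt = []
Dir.glob("*.gz") do |path|
  begin
    Zlib::GzipReader.open(path) do |gz|
      gz.read(64 * 1024) until gz.eof?   # decompress in chunks, discard the output
    end
  rescue Zlib::Error => e                # raised on bad data, bad CRC, truncation, ...
    corrupt << "#{path}: #{e.message}"
  end
end
puts corrupt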
The Archive::Zip FAQ gives some very good guidance on this.
It looks like the best option for you is to check the CRC of each member of the archives, and a sample program that does this -- ziptest.pl -- comes with the Archive::Zip module installation.
It should be easy to test that a file is not corrupt by just using the "gunzip -t" command; gunzip is available for Windows as well and comes with the gzip package.
I need to do an integrity check on a single big file. I have read the SHA example code for Android, but it needs another file to hold the resulting digest. Is there a method that uses only a single file?
I need a simple and quick method. Can I merge the two files into a single file?
The file is binary and the file name is fixed. I can get the file size using fstat. My problem is that I can only have a single file. Maybe I should use a CRC, but it would be very slow because it is a large file.
My object is to ensure the file on the SD card is not corrupt. I write it on a PC and read it on an embedded platform. The file is around 200 MB.
You have to store the hash somehow, no way around it.
You can try writing it to the file itself (at the beginning or end) and skip it when performing the integrity check. This can work for things like XML files, but not for images or binaries.
You can also put the hash in the filename, or just keep a database of all your hashes.
It really all depends on what your program does and how it's set up.
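If the reader is your own code (as it appears to be here), one variant of the "write it to the file itself" idea is to append a fixed-size digest as a trailer on the PC and strip it during the check on the device. A rough sketch; append_digest, intact? and the 32-byte SHA-256 trailer layout are assumptions, not an established format:

require "digest"

TRAILER = 32   # SHA-256 digest appended to the end of the file

def append_digest(path)
  digest = Digest::SHA256.file(path).digest
  File.open(path, "ab") { |f| f.write(digest) }
end

def intact?(path)
  data_size = File.size(path) - TRAILER
  sha = Digest::SHA256.new
  File.open(path, "rb") do |f|
    remaining = data_size
    while remaining > 0
      chunk = f.read([remaining, 1 << 20].min)   # hash in 1 MiB pieces
      break if chunk.nil?
      sha.update(chunk)
      remaining -= chunk.bytesize
    end
    sha.digest == f.read(TRAILER)                # compare with the stored digest
  end
end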
I'm creating a program that "hides" an encrypted file at the end of a JPEG. The problem is, when retrieving this encrypted file again, I need to be able to determine where the JPEG it was stored in ends. At first I thought this wouldn't be a problem because I could just run through the file checking for 0xFF and 0xD9, the bytes JPEG uses to end the image. However... I'm noticing that in quite a few JPEGs, this combination of bytes is not exclusive to the end marker... so my program thinks the image has ended randomly halfway through it.
I'm thinking there must be a set way of expressing that a JPEG has finished, otherwise adding a load of bytes to the end of the file would obviously corrupt it... Is there a practical way to do this?
You should read the JFIF file format specification.
Well, there are always two places in the file that you can find with 100% reliability: the beginning and the end. So, when you add the hidden file, also add 4 bytes that store the original length of the file, plus a special 4-byte signature that's always distinct. When reading it back, first seek to the end minus 8 and read that length and signature, then just seek to that position.
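A sketch of that trailer layout (hide, extract and the 4-byte MAGIC value are made-up names; any distinctive signature works):

MAGIC = "HFv1"   # hypothetical 4-byte signature

def hide(jpeg_path, payload)
  original_size = File.size(jpeg_path)
  File.open(jpeg_path, "ab") do |f|
    f.write(payload)
    f.write([original_size].pack("N"))   # original JPEG length, 4 bytes big-endian
    f.write(MAGIC)
  end
end

def extract(jpeg_path)
  total = File.size(jpeg_path)
  File.open(jpeg_path, "rb") do |f|
    f.seek(total - 8)                    # trailer = 4-byte length + 4-byte signature
    original_size = f.read(4).unpack1("N")
    return nil unless f.read(4) == MAGIC
    f.seek(original_size)
    f.read(total - original_size - 8)    # just the hidden payload
  end
end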
You should read my answer on this question "Detect Eof for JPG images".
You're likely running into the thumbnail in the header. When moving through the file, you should find that most marker segments contain a length indicator (there's a reference for which do and which don't). You can skip the bytes within those segments, as the true EOI marker will not be inside them.
Within the actual JPEG compressed data, any FF byte should be followed either by 00 (the zero byte is then discarded), by one of the restart markers D0–D7, or by FE to mark a comment (which has a length indicator and can be skipped as described above).
Theoretically, the only way you encounter a false EOI reading in the compressed data is within a comment.
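A sketch of a scanner along those lines: it skips the length-prefixed header segments (including any embedded thumbnail) and, inside the entropy-coded data, treats FF 00 and the restart markers FF D0–FF D7 as data while looking for the real EOI. jpeg_end_offset is a made-up helper name and the parsing is simplified:

def jpeg_end_offset(path)
  data = File.binread(path)
  return nil unless data[0, 2] == "\xFF\xD8".b   # must start with SOI

  pos = 2
  while pos + 1 < data.size
    return nil unless data.getbyte(pos) == 0xFF
    marker = data.getbyte(pos + 1)
    case marker
    when 0xD9                        # EOI before any scan data
      return pos + 2
    when 0x01, 0xD0..0xD7            # standalone markers, no length field
      pos += 2
    when 0xDA                        # SOS: skip its header, then scan the compressed data
      return nil if pos + 3 >= data.size
      pos += 2 + data[pos + 2, 2].unpack1("n")
      while pos + 1 < data.size
        if data.getbyte(pos) != 0xFF
          pos += 1
        else
          nxt = data.getbyte(pos + 1)
          if nxt == 0xD9
            return pos + 2           # the real EOI
          elsif nxt == 0x00 || (0xD0..0xD7).cover?(nxt)
            pos += 2                 # byte stuffing or restart marker: still image data
          else
            break                    # some other marker: let the outer loop handle it
          end
        end
      end
    else                             # APPn, COM, DQT, DHT, SOF, ... have a 2-byte length
      return nil if pos + 3 >= data.size
      pos += 2 + data[pos + 2, 2].unpack1("n")
    end
  end
  nil
end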
I've been writing a Ruby program that merges the contents of two files.
For example, if a torrent has been downloaded twice separately, it tries to merge the two copies' contents for the blocks that have been completed.
So, I've been looking for a method which modifies a stream only at the place required and saves only that block instead of saving the whole stream again.
I'm reading the file in 16 KiB blocks. How do I "replace" (not append) the content of one of those 16 KiB blocks, so that only those bytes are written to disk and the whole file is not rewritten each time?
Kind of like this:
# Doesn't exist, unfortunately.
# By default it appends instead of replacing, so the file size grows.
IO.write(file_name, content, offset, :replace => true)
Does a method exist that achieves that kind of functionality?
Open the file in "r+b" mode, seek to the location and just write to it:
f = File.new("some.existing.file", "r+b")
f.seek(1024)
f.write("test\n")
f.close
This will overwrite five bytes of the file ("test\n"), starting at offset 1024.
If the file is shorter than your seek offset, the gap is filled with null bytes (the file is extended up to the offset before the write).
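Applied to the question's 16 KiB blocks, a minimal sketch (write_block and BLOCK_SIZE are made-up names):

BLOCK_SIZE = 16 * 1024

def write_block(path, index, block)
  File.open(path, "r+b") do |f|      # read/write, no truncation
    f.seek(index * BLOCK_SIZE)
    f.write(block)                   # overwrites exactly block.bytesize bytes in place
  end
end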