Ruby Reads Different File Sizes for Line Reads - ruby

I need to do something where the file sizes are crucial. This is producing strange results:
filename = "testThis.txt"
total_chars = 0
file = File.new(filename, "r")
file_for_writing = nil
while (line = file.gets)
  total_chars += line.length
end
puts "original size #{File.size(filename)}"
puts "Totals #{total_chars}"
like this:
original size 20121
Totals 20061
Why is the second one coming up short?
Edit: Answerers' hunches are right: the test file has 60 lines in it. If I change this line to
total_chars += line.length + 1
it works perfectly. But on *nix this change would be wrong?
Edit: Follow up is now here. Thanks!

There are special characters stored in the file that delineate the lines:
CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
0x0A (\n) on UNIX systems.
Ruby's gets uses the UNIX convention. So if you read a Windows file in text mode, you lose 1 byte for every line you read, because each \r\n pair is converted to a single \n.
Also, String#length is not a good measure of the size of the string in bytes. If the String is not plain ASCII, one character may be represented by more than one byte (Unicode). That is, it returns the number of characters in the String, not the number of bytes.
To get the size of a file, use File.size(file_name).
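If the goal is to make the per-line totals agree with File.size, a small sketch along these lines should do it (assuming Ruby 1.9+, where String#bytesize is available): open the file in binary mode so no \r\n translation happens, and sum bytes rather than characters.
filename = "testThis.txt"
total_bytes = 0
File.open(filename, "rb") do |file|      # binary mode: no \r\n -> \n translation
  while (line = file.gets)
    total_bytes += line.bytesize         # count bytes, not characters
  end
end
puts "original size #{File.size(filename)}"
puts "Totals #{total_bytes}"             # should now match File.size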

My guess would be that you are on Windows, and your "testThis.txt" file has \r\n line endings. When the file is opened in text mode, each line ending will be converted to a single \n character. Therefore you'll lose 1 character per line.
Does your test file have 60 lines in it? That would be consistent with this explanation.
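A quick way to test that hunch on the machine where the numbers above came from (a throwaway check, reusing the question's file name):
text = File.read("testThis.txt")                 # text mode: \r\n read back as \n
diff = File.size("testThis.txt") - text.length
puts "lines: #{text.lines.count}, difference: #{diff}"   # equal if every line ended in \r\n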

The line-ending issue is the most likely culprit here.
It's also worth noting that if the character encoding of the text file is something other than ASCII, you will have a discrepancy between the two as well. If the file is UTF-8, this will work for English and some European languages that use just the standard ASCII alphabet symbols. Beyond that, the file size and character count can differ wildly (the byte size can be up to four times the character count).
Relying on '1 character = 1 byte' is just asking for trouble as it is almost certainly going to fail at some point.
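A quick illustration of how far apart the two counts can drift once non-ASCII characters are involved (the sample string here is made up):
s = "naïve 文字"
puts s.length       # 8 characters
puts s.bytesize     # 13 bytes in UTF-8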

Related

how to read a file if EOL char is LF

I receive a file from internet, and the lines are separated by 0x0D char
I display it using this tool
https://www.fileformat.info/tool/hexdump.htm
When I read that file into Rexx using linein(), the whole file comes in as one single line. Obviously linein() works fine when the file has 0x0D0A as the end-of-line character.
How do I tell Rexx to split lines using the 0x0D char instead of 0x0D0A?
Apart from getting the file sent to you with proper CRLF record markers for Windows instead of the LF used on Unix-like systems, there are a couple of ways of splitting the data. Neither will read the file a record at a time; both extract each record from the long string read in.
1 - Use POS to find the position of the LF and SUBSTR to extract the record
2 - Use PARSE to split the data at the LF position
One way to read one record at a time is to use CHARIN to read a byte at a time until it encounters the LF.
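For comparison, in Ruby (the language of the main question above) you can hand the record separator straight to the line-reading methods, so the same job looks roughly like this; the file name is a placeholder and sep should be whichever byte actually ends the records:
sep = "\n"                                   # or "\r" if the records end in 0x0D
File.open("download.txt", "rb") do |f|       # binary mode: no newline translation
  while (record = f.gets(sep))
    puts record.chomp(sep)
  end
end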

shorten Chinese string to fit in character array C++

I am trying to fit a pinyin string into a character array. For example, if I have a pinyin string like the one below:
string str = "转换汉字为拼音音"; // needs at least 25 bytes to store
char destination[22];
strncpy(destination, str.c_str(), 20);
destination[20] = '\0';
Since Chinese characters take 3 bytes, I can do strncpy(destination, str.c_str(), (20/3)*3); but if str contains any character other than Chinese (one that takes 2 or 4 bytes in UTF-8 encoding) this logic will fail.
Later, if I try to print destination as pinyin characters, only the first 6 Chinese characters are printed properly and 2 stray bytes are printed in hexadecimal.
Is there any way I can shorten the string before copying it to destination, so that when destination is printed, proper Chinese characters are printed (without any stray hex bytes)? Perhaps using the Poco::TextEncoding or Poco::UTF8Encoding class?
Thanks in Advance.
Nothing short of creating your own way to encode the text would work. But even in that case you would have to create a 25-character array (don't forget the zero at the end!) to store the string so it prints properly, unless you also create your own printing routines.
I.e. the amount of work required doesn't balance out the win of saving 3 bytes.
Note that the code is practically C. In C++ you wouldn't use that style of code.
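If the goal is simply to avoid cutting a UTF-8 character in half, the usual trick is to truncate at a character boundary within the byte budget. A minimal sketch of that idea, written in Ruby like the main question rather than C++ (the helper name is made up for illustration):
# Keep whole characters only, never exceeding max_bytes bytes.
def truncate_utf8(str, max_bytes)
  out = ""
  str.each_char do |ch|
    break if out.bytesize + ch.bytesize > max_bytes
    out << ch
  end
  out
end

puts truncate_utf8("转换汉字为拼音音", 20)    # first 6 characters, 18 bytes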

Is there a time when 'gets' would be used without 'chomp'?

When collecting user input in Ruby, is there ever a time when using chomp on that input would not be the desired behavior? That is, when would simply using gets and not gets.chomp be appropriate?
Yes, if you specify the maximum length for input, having a "\n" included in the gets return value allows you to tell if Ruby gave you x characters because it encountered the "\n", or because x was the maximum input size:
> gets 5
abcdefghij
=> 'abcde'
vs:
> gets 5
abc\n
=> 'abc\n'
If the returned string contains no trailing newline, it means there are still characters in the buffer.
Without a limit on the input, the trailing newline (or any other delimiter) probably doesn't have much use, but it's kept for consistency.
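A small sketch of how that property can be used: read a long line in fixed-size pieces and let the trailing newline tell you when the line is actually complete (the 5-byte limit is only for illustration):
line = ""
while (chunk = $stdin.gets(5))           # read at most 5 bytes per call
  line << chunk
  break if chunk.end_with?("\n")         # delimiter seen: the line is finished
end
puts "read: #{line.inspect}"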

Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

I am looking for a way to check if a PDF is missing an end-of-file marker. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check whether the last characters are the EOF marker. I need to process lots of potentially large PDFs and I want to load as little into memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel this is a slightly more specific scenario than detecting general PDF "corruption". What is the best, fast way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer? filename
  File.open(filename) { |f| f.seek(-1024, :END); f.read.include?('%%EOF') }
end
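One caveat (my own note, not part of the original answer): seeking back 1024 bytes raises Errno::EINVAL on a PDF shorter than 1 KiB, so a slightly more defensive variant clamps the offset:
def valid_file_trailer? filename
  File.open(filename) do |f|
    f.seek(-[1024, f.size].min, :END)    # never seek past the start of the file
    f.read.include?('%%EOF')
  end
end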
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer? filename
  pattern = /^startxref\n\d+\n%%EOF\n\z/m
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end
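Either version can then be used to screen a batch of files, for example (the glob path is a placeholder):
Dir.glob("pdfs/**/*.pdf")
   .reject { |path| valid_file_trailer?(path) }
   .each { |path| puts "missing %%EOF: #{path}" }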

convert text from utf to read-able text

I have some UTF text starting with "ef bb bf". How can I turn this message into human-readable text? vim, gedit, etc. interpret the file as plain text and show the raw ef-bytes even when I force them to read the file with several UTF encodings. I tried the "recode" tool; it doesn't work. Even PHP's utf8_decode failed to produce the expected text output.
Please help: how can I convert this file so that I can read it?
ef bb bf is the UTF-8 BOM. Strip off the first three bytes and try to utf8_decode the remainder.
$text = "\xef\xbb\xbf....";
echo utf8_decode(substr($text, 3));
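The same check is easy to do from Ruby as well, if that is closer to your toolchain (the file name is a placeholder):
bytes = File.binread("input.txt")
bytes = bytes.byteslice(3, bytes.bytesize - 3) if bytes.start_with?("\xEF\xBB\xBF".b)
text  = bytes.force_encoding("UTF-8")
puts text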
Is it UTF8, UTF16, or UTF32? It matters a lot! I assume you want to convert the text into old-fashioned ASCII (where every character is 1 byte long).
UTF8 should already be (at least mostly) readable as it uses 1 byte for standard ASCII characters and only uses multiple bytes for special/multilingual characters (Character codes > 127). It sounds like your file isn't UTF8, or you'd already be able to read it! Online content is generally UTF-8.
Unicode character codes are the same as the old ASCII codes up to 127.
UTF16 uses at least 2 bytes and UTF32 always uses 4 bytes per character, even for characters that could be represented in a single byte. That makes the file unreadable if the text editor is expecting UTF8.
Gedit supports UTF16 and UTF32, but you need to add those encodings explicitly in the open dialog box (and possibly select them explicitly instead of using auto-detect).
