I received a file from the internet in which the lines are separated by a single 0x0D character.
I displayed it using this tool:
https://www.fileformat.info/tool/hexdump.htm
When I read that file into Rexx using "linein()", the whole file comes in as one single line. linein() works fine when the file uses 0x0D0A as the end-of-line sequence.
How do I tell Rexx to split lines on the 0x0D character instead of 0x0D0A?
Apart from getting the file sent to you with proper CRLF record markers for Windows instead of a single-character terminator (a lone CR, 0x0D, in this file; Unix-like systems use LF, 0x0A), there are a couple of ways of splitting the data. Neither reads the file a record at a time; both extract each record from the long string that was read in.
1 - Use POS to find the position of the terminator and SUBSTR to extract the record
2 - Use PARSE to split the data at the terminator position
One way to read one record at a time is to use CHARIN to read a byte at a time until it encounters the terminator.
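The find-and-extract approach from option 1 can be sketched as follows. This is an illustration in Python rather than Rexx, and the function name split_records is made up for the example; bytes.find plays the role of POS and slicing plays the role of SUBSTR.

```python
def split_records(data: bytes, sep: bytes = b"\r") -> list:
    """Split one long byte string into records at each separator."""
    records = []
    start = 0
    while True:
        pos = data.find(sep, start)       # position of the next terminator
        if pos == -1:
            break
        records.append(data[start:pos])   # the record before the terminator
        start = pos + len(sep)            # continue after the terminator
    if start < len(data):                 # last record has no terminator
        records.append(data[start:])
    return records


sample = b"first\rsecond\rthird"
print(split_records(sample))
```

The same loop works for any terminator, so passing sep=b"\n" would handle an LF-terminated file.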
I'm developing software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat such a file to inspect its contents, I'm thinking of having an ASCII-compatible "magic string" at the start of the file that gives the name and the version of the binary format.
I'm thinking of having at least ten lines (\n) in the message so that head, with its default settings, doesn't hit the binary part.
Now, I wonder if there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text, and should be just ignored? I tried 0x00 (the null byte) and 0x04 (ctrl-D) but they seem to be just ignored when catting the file.
cat regards a file as text. There is no way you can trigger an end-of-file, since EOF is not actually a character.
The other way around works, of course: specify a format whose reader only starts reading binary data from a certain character on.
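A minimal sketch of that idea, in Python, assuming a fixed ten-line text header followed by the binary payload. The format name MYFORMAT, the header wording and both function names are invented for the example.

```python
import io

# Ten newline-terminated text lines, so that `head` with its default
# settings shows only the human-readable part (hypothetical wording).
MAGIC_LINES = [b"MYFORMAT 1.0"] + [b"This is a binary file."] * 9
HEADER = b"\n".join(MAGIC_LINES) + b"\n"


def write_file(stream, payload: bytes) -> None:
    """Write the text header followed by the binary payload."""
    stream.write(HEADER)
    stream.write(payload)


def read_payload(stream) -> bytes:
    """Check the header, then return everything after it."""
    header = stream.read(len(HEADER))
    if header != HEADER:
        raise ValueError("not a MYFORMAT 1.0 file")
    return stream.read()


buf = io.BytesIO()
write_file(buf, b"\x00\x01\x02\xff")
buf.seek(0)
print(read_payload(buf))
```

The reader never has to interpret the header as text; it only needs to know where the binary part begins.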
I have a problem I need help solving. The business I am working for is using Informatica Cloud to do a lot of its ETL into AWS and other services.
We have been given a flat file by the business where the field delimiter is "~|". To the best of my knowledge, Informatica currently only accepts a single-character delimiter.
Does anyone know how to overcome this?
Informatica cannot read composite delimiters.
First, you could feed each line as one single long string into an Expression transformation. In this case the delimiter character should be set to \037; I haven't seen this character (ASCII Unit Separator) in use at least since 1982. Then use repeated invocations of InStr() within the EXP to identify the positions of the "~|" delimiters and split each line into fields using SubStr().
Second (easier in the mapping, more work with the session), you could feed the file into some utility which replaces each "~|" delimiter with the character ASCII 31 (the Unit Separator mentioned above); the session has to be set up such that it reads the output from this utility (input file type = Command instead of File). Then the source definition should contain \037 as the field delimiter instead of any pipe character.
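The replacement utility in the second approach could be as small as the following Python sketch. The function name convert and the sample rows are made up for illustration; how the command is wired into the session is not shown here.

```python
def convert(lines, old="~|", new="\x1f"):
    """Replace the two-character delimiter with ASCII 31 (Unit Separator)."""
    for line in lines:
        yield line.replace(old, new)


# Hypothetical sample rows using the "~|" delimiter.
sample = ["1~|Alice~|NY\n", "2~|Bob~|LA\n"]
converted = list(convert(sample))

# Splitting on the single-character delimiter now recovers the fields.
print([row.rstrip("\n").split("\x1f") for row in converted])
```

To use it as a filter, the same generator could be fed sys.stdin and its output written to sys.stdout. This assumes the byte 0x1F never occurs in the data itself.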
I've read that to use the fflush() function in Oracle, every line in the output should end with a newline character. Will put_line() automatically add the newline character that fflush() needs in order to work?
What is the newline character (\r\n, \n, or OS-dependent) that fflush() needs? And what newline character (\r\n, \n, or OS-dependent) does put_line() add, if it adds one at all?
Yes, put_line() adds the required new line character(s). From the documentation for put_line():
This procedure writes the text string stored in the buffer parameter to the open file identified by the file handle. The file must be open for write operations. PUT_LINE terminates the line with the platform-specific line terminator character or characters.
That's really the difference between put() and put_line():
No line terminator is appended by PUT; use NEW_LINE to terminate the line or use PUT_LINE to write a complete line with a line terminator.
It's slightly confusing that the description of fflush() refers to just "a newline character" while put_line() refers to "line terminator character or characters", but they do mean the same thing: for the flush to work, the buffered data must end with the operating-system line terminator character(s).
Note that it means the database server's operating system, not your client operating system, since utl_file (and all PL/SQL) is on the server and doesn't know anything about the client environment. It's generally safer to use put_line() or new_line() than to manually add \n or \r\n; even if you know the OS your database is running on now, it may move to a different OS one day.
I am using SAS's FILE statement to output a text file having fixed format (RECFM=F).
I would like each row to end in end-of-line control character(s) such as linefeed/carriage return. I tried the FILE statement's option TERMSTR=CRLF, but I still see no end-of-line control characters in the output file.
I think I could use the PUT statement to insert the desired linefeed and carriage return control characters, but would prefer a cleaner method. Is it a reasonable thing to expect of the FILE statement? (Is it a reasonable expectation for outputting fixed format data?)
(Platform: Windows v6.1.7600, SAS for Windows v9.2 TS Level 2M3 W32_VSPRO platform)
Do you really need to use RECFM=F? You can still get fixed length output with V:
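data _null_;
  /* variable-format records, each at most 12 characters long */
  file 'c:\temp\test.txt' lrecl=12 recfm=V;
  do i=1 to 5;
    x=rannor(123);
    /* @n is the column pointer: i starts in column 1, x in column 4 */
    put @1 i @4 x 6.4;
  end;
run;
By specifying where you want the data to go (@1 and @4) and the format (6.4) along with lrecl you will get fixed length output.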
There may be a work-around, but I believe SAS won't output a line-ending with the Fixed format.
I need to do something where the file sizes are crucial. This is producing strange results:
filename = "testThis.txt"
total_chars = 0
file = File.new(filename, "r")
file_for_writing = nil
while (line = file.gets)
  total_chars += line.length
end
puts "original size #{File.size(filename)}"
puts "Totals #{total_chars}"
like this
original size 20121
Totals 20061
Why is the second one coming up short?
Edit: Answerers' hunches are right: the test file has 60 lines in it. If I change this line
total_chars += line.length + 1
it works perfectly. But on *nix this change would be wrong?
Edit: Follow up is now here. Thanks!
There are special characters stored in the file that delineate the lines:
CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
LF (0x0A) (\n) on UNIX systems.
Ruby's gets uses the UNIX convention. So if you read a Windows file in text mode on Windows, you lose 1 byte for every line you read, because each \r\n pair is converted to \n.
Also, String#length is not a good measure of the size of a string in bytes. If the string is not plain ASCII, one character may be represented by more than one byte (as in UTF-8). That is, it returns the number of characters in the string, not the number of bytes.
To get the size of a file, use File.size(file_name).
My guess would be that you are on Windows, and your "testThis.txt" file has \r\n line endings. When the file is opened in text mode, each line ending will be converted to a single \n character. Therefore you'll lose 1 character per line.
Does your test file have 60 lines in it? That would be consistent with this explanation.
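The effect is easy to reproduce. The sketch below uses Python rather than Ruby, and the file contents are made up: it writes 60 CRLF-terminated lines to a temporary file, then compares the size on disk with the character count seen in text mode, where Python's newline translation turns each \r\n into \n on any platform.

```python
import os
import tempfile

lines = 60

# Write a small file with Windows-style CRLF line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"some text\r\n" * lines)
    path = tmp.name

# Text mode with newline=None: every "\r\n" is delivered as "\n".
with open(path, "r", newline=None) as f:
    text_chars = sum(len(line) for line in f)

# Binary mode: nothing is translated, so every byte is counted.
with open(path, "rb") as f:
    raw_bytes = sum(len(line) for line in f)

size = os.path.getsize(path)
os.remove(path)

print("size on disk:", size)
print("text-mode total:", text_chars)
print("binary-mode total:", raw_bytes)
```

The text-mode total comes up short by exactly one per line, while the binary-mode total matches the size on disk. That is why adding 1 per line only "works" for CRLF files read with translation.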
The line-ending issue is the most likely culprit here.
It's also worth noting that if the character encoding of the text file is something other than ASCII, you will have a discrepancy between the two as well. If the file is UTF-8, the counts will agree for English and for text in other languages that uses only the standard ASCII alphabet symbols. Beyond that, the file size and the character count can diverge widely: a single character can take up to 4 bytes in current UTF-8 (up to 6 in the original design).
Relying on '1 character = 1 byte' is just asking for trouble as it is almost certainly going to fail at some point.
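As a quick illustration (again in Python rather than Ruby, with sample characters chosen for the example), each of the following is a single character but occupies a different number of bytes once encoded as UTF-8:

```python
# One character each, written with escapes so the source stays ASCII.
samples = {
    "letter A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "emoji": "\U0001F600",
}

for name, ch in samples.items():
    # len(ch) counts characters; len(ch.encode(...)) counts bytes.
    print(name, len(ch), len(ch.encode("utf-8")))
```

Summing character counts therefore matches the file size only when every character happens to fit in one byte.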