Reading a text file in Ruby gives wrong output - ruby

I am not an experienced Ruby programmer, so bear with me. I have a problem with a specific text file containing two lines (the issue shows up only occasionally):
trim(0, 15447)
0, 15447
I am trying to read these two lines with the following code:
File.open(trim).each do |line|
puts line
end
Normally I get the expected output, but here I get only one line, with some characters missing:
0, 1544715447)
If I want to check the character codes, I get this:
irb(main):120:0> File.open(trim).each do |line|
irb(main):121:1* puts '========================'
irb(main):122:1> puts line
irb(main):123:1> puts '........................'
irb(main):124:1> puts line.each_byte {|c| print c, ' ' }
irb(main):125:1> end
========================
0, 1544715447)
........................
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55 trim(0,0, 15447
=> #<File:E:\Public\Public_videos\Soccer\1995_0129_odp_es\950129-ODP_&m3_trim30.txt>
I frankly don't understand what is going on, as I don't see any hidden character. This happens seemingly at random, but consistently with some files.
Any suggestion to help me understand or avoid this issue would be greatly appreciated.

What happened is that your file had two "lines" separated by a carriage return character, and not a linefeed.
You showed the bytes in your file as
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55
That 13 is a carriage return, which is sometimes "displayed" by the writer going back to the start of the line it is writing.
So first it wrote out
trim(0, 15447)
then it went back to the start of the same line and wrote
0, 15447
overlaying the initial line! What do you end up with?
0, 1544715447)
Your "problem" is probably best fixed by re-saving that text file of yours with a better way to separate lines. On Unix systems, including OS X these days, the line terminator is character 10, known as LINE FEED (LF). Windows uses the two-character combination 13 10 (CR LF). To my knowledge, only old Mac systems used a lone 13 (CR).
Many text editors today will allow you to select a "line ending" option, so you might be able to just open that file, then save it using a different line ending option. FWIW my guess is that you are using Windows now, which is known for rendering CRs and LFs differently than *Nix systems.
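If re-saving the file is not an option, the reading code can be made tolerant of all three conventions instead. The following is only a minimal sketch, assuming the file is small enough to read in one go; `split_lines_any_eol` and `read_lines_any_eol` are hypothetical helper names, not part of Ruby's standard library:

```ruby
# Hypothetical helper: split text on CR LF, a lone CR, or a lone LF.
def split_lines_any_eol(text)
  text.split(/\r\n|\r|\n/)
end

# Hypothetical helper: read a whole file as raw bytes (no newline
# translation) and split it into lines.
def read_lines_any_eol(path)
  split_lines_any_eol(File.binread(path))
end

# The bytes from the question: two "lines" separated by a carriage return (13).
sample = "trim(0, 15447)\r0, 15447"

# inspect prints hidden characters as escapes instead of sending them
# to the terminal, so a stray \r cannot overwrite the output.
split_lines_any_eol(sample).each { |line| puts line.inspect }
```

Printing `line.inspect` (or using `p line`) while debugging is also a quick way to spot a hidden `\r` without dumping byte codes.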

Related

Both display and save Plink output

I'm logging in to a remote SSH session using plink.exe to perform certain tasks from a batch script. Getting the output of these commands in a log file as well as on the screen is very important for me.
I tried the usual batch way, i.e. plink servername -m cmd.txt>logfile.log, but the problem with this is that it won't display the output on the Windows terminal that the batch script is running in.
Then I found the -sshlog option of Plink. This does the work, i.e. I get the output both on screen and in a log file, but the log file looks as follows:
00000f90 56 4c 41 4e 2a 2a 0d 0a 20 65 6e 63 61 70 73 75 VLAN**.. encapsu
00000fa0 6c 61 74 69 6f 6e 20 64 6f 74 31 51 20 34 30 34 lation dot1Q 404
00000fb0 0d 0a 20 69 70 20 61 64 64 72 65 73 73 20 31 30 .. ip address 10
00000fc0 2e 37 31 2e 31 39 31 2e 31 34 35 20 32 35 35 2e .71.191.145 255.
My actual output starts at "VLAN**.. encapsu" in the text above. The output has these "00000010 74 65 72 ..." hex characters, which I do not want. In addition, the main output (the text that would be displayed if I were using Plink interactively) is "word-wrapped" and looks horrible, which makes it very difficult for a general user to understand.
Is there any way to keep Plink from writing the unwanted 'sshlog' characters to the log file? Or is there any other way to get the output on screen and in a log file simultaneously in a Plink/PuTTY session inside a batch script?
I tried both -sshlog and -sshrawlog but same output. Also tried -sessionlog as per the documentation but it does not work!
I tried also > file.txt but it gave an empty file!
Expected results:
encapsulation dot1Q 404
ip address 10.71.191.145
There's no command-line switch in Plink to log only the readable output.
You would have to configure "Printable output" logging in Windows Registry prior to running Plink.
But there are other options.
See Displaying Windows command prompt output and redirecting it to a file.
So this should do what you need:
powershell "plink -batch servername -m cmd.txt | tee logfile.log"
(the tee is an alias to Tee-Object cmdlet)

invalid characters not visible in BASH

I have been working on a device that allows login via telnet. I extracted some data from such devices and made some reports, without any problems. Recently, I had to switch to SSH. The rest of the script is the same; only the login procedure has changed from telnet to SSH. Since switching to SSH, some lines of the extracted data contain invalid characters. Below is an example, with an invalid character after PON7 in the line:
OLT:LT6.PON7.ONT1,ALARM,Date time,
The problem is that this invalid character is not even visible in bash or in the CSV file; it was discovered only when I copied the line into Notepad++ or posted it here.
Now I have two questions:
1st: does anyone know what causes these invalid characters when switching from telnet to SSH?
2nd: how can I deal with this invalid character in bash, given that it is not even visible there? The report is used elsewhere, and these invalid characters cause problems.
Edit:
Pasting the text into a text-to-hex converter produces this:
4f 4c 54 3a 4c 54 36 2e 50 4f 4e 37 11 2e 4f 4e 54 31 2c 41 4c 41 52 4d 2c 44 61 74 65 20 74 69 6d 65 2c
It looks like there's a DC1 character (hex 11) between the "7" and the ".".
Unfortunately, this edit also has the side effect of removing the character from the sample text.
Passing your text through a text to hexadecimal converter shows that the invisible character is an ASCII DC1 character (hex 11, octal 021). This character is also known as Ctrl-Q or XON. It's sometimes used in flow control.
In a bash script, you could filter it out using the tr program:
echo "$badtext" | tr -d '\021'
SSH doesn't inherently insert DC1 characters into text streams. If you're getting a DC1 character in the output from a device, presumably the device sent that character.
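The tr command above removes only DC1. If the report is post-processed in a scripting language anyway, the same filtering can be sketched in Ruby. This is an illustration under assumptions, not part of the original answer: `strip_control_chars` is a hypothetical helper, and removing every ASCII control character except tab, LF, and CR is a broader policy than the one the answer describes.

```ruby
# Hypothetical helper: drop ASCII control characters (including DC1, 0x11)
# while keeping tab (0x09), line feed (0x0A), and carriage return (0x0D).
def strip_control_chars(text)
  text.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, "")
end

# The sample line from the question, with the DC1 byte made explicit.
bad = "OLT:LT6.PON7\x11.ONT1,ALARM,Date time,"

puts bad.inspect                 # shows the hidden character as an escape
puts strip_control_chars(bad)    # the cleaned line
```

To remove DC1 and nothing else, `text.delete("\x11")` would be the closer equivalent of `tr -d '\021'`.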

Reorder lines near the beginning of a huge text file (>20G)

I am a vim user and can use some basic awk or bash commands. I have a text (VCF) file larger than 20 GB. What I want is to move line #69 to just below line #66:
$less huge.vcf
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
69 ##contig=<ID=MT,length=16299>
...
What I wanted is:
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=MT,length=16299>
68 ##contig=<ID=X,length=171031299>
69 ##contig=<ID=Y,length=91744698>
...
I tried to open and edit it using vim (with the LargeFile plugin installed), but it still does not work very well.
The easy approach is to copy the section you want to edit out of your file, modify it in-place, then copy it back in.
# extract the first hundred lines
head -n 100 huge.txt >start.txt
# modify that extracted subset
vim start.txt
# copy that section back into the beginning of larger file
dd if=start.txt of=huge.txt conv=notrunc
Note that this only works if your edits don't change the size of the section being modified. That is to say -- make sure that start.txt has the exact same size in bytes after being modified that it had before.
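The same in-place idea can be sketched without a manual editing step. The snippet below is only an illustration: `move_line_in_place` is a hypothetical helper, it assumes lines are terminated by LF, and it should be tried on a copy first. Because it only reorders whole lines, the rewritten region keeps exactly the same size in bytes, which is the condition stated above.

```ruby
# Hypothetical helper: move line `from` so that it directly follows line
# `after` (both 1-based, after < from), rewriting only the first `from`
# lines of the file in place. The rest of the file is never read or copied.
def move_line_in_place(path, from:, after:)
  File.open(path, "r+b") do |f|
    head = Array.new(from) { f.gets } # read only the lines that are needed
    raise ArgumentError, "file has fewer than #{from} lines" if head.any?(&:nil?)

    original_size = head.sum(&:bytesize)
    moved = head.delete_at(from - 1)
    head.insert(after, moved)
    new_bytes = head.join

    # Defensive check: the region written back must be byte-for-byte the
    # same size as the region that was read.
    raise "size of rewritten region changed" unless new_bytes.bytesize == original_size

    f.seek(0)
    f.write(new_bytes)
  end
end

# Usage for the file in the question (assumed name huge.vcf):
# move_line_in_place("huge.vcf", from: 69, after: 66)
```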
Here's an awk version:
$ awk 'NR>=3 && NR<=4{b=b (b==""?"":ORS) $0;next}1;NR==5 {print b}' file
...
66 ##contig=<ID=9,length=124595110>
69 ##contig=<ID=MT,length=16299>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
...
You need to change the line numbers in the code, though: 3 -> 67, 4 -> 68 and 5 -> 69, and redirect the output to a new file. If you'd like it to edit the file in place, use -i inplace with GNU awk.

pcl6.exe v9.15 silently converting APOSTROPHE => RIGHT SINGLE QUOTATION MARK

Good afternoon.
I am running pcl6.exe version 9.15 on Windows 8.1.
I am running into a problem where pcl6.exe is silently converting any APOSTROPHE characters into RIGHT SINGLE QUOTATION MARK characters using the 16602 typeface in a PCL5 file.
Here is the command line I am using:
pcl6.exe -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=test.txt test.prn
test.prn input (hex)
1B 28 30 55 1B 28 73 31 70 31 30 76 31 36 36 30 32 54 1B 26 61 30 76 30 48 3E 27 3C
test.prn input (text ['.' is the escape character])
.(0U.(s1p10v16602T.&a0v0H>'<
test.txt output (hex)
20 20 3E E2 80 99 3C 0D 0A
test.txt output (text)
>’<..
expected test.txt output (hex)
20 20 3E 27 3C 0D 0A
expected test.txt output (text)
>'<..
Is there a flag or an option somewhere that can disable this conversion?
Thank you for your time.
txtwrite does the best it can with the input, PCL tends not to have much information in the PCL file to allow us to determine what the glyph should be (PostScript is better, and PDF often still better).
If you think there is a real problem, I would suggest your best bet is to open a bug report. Apart from anything else, I would need to see the PCL file to determine what's going on. Most likely the character code you are using corresponds to an apostrophe, which in the specific font is a right quote. There is no way for the text extraction device to know what shape the font will draw in response to a character code. At least, not in PCL.
The problem was caused by the symbolset.
The sample was using the PCL ISO 6: ASCII symbolset (code 0U )
http://www.pcl.to/symbolset/pcl_0u.pdf
According to the 0U symbolset reference, APOSTROPHE (0x27) is replaced with RIGHT SINGLE QUOTATION MARK (U+2019). Pcl6.exe then writes this code point in its UTF-8 form: E2 80 99.
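The byte sequence in the output can be checked by encoding the code point directly. A small sketch in Ruby, where `utf8_hex` is a hypothetical helper name used only for this illustration:

```ruby
# Hypothetical helper: return the UTF-8 bytes of a Unicode code point
# as space-separated uppercase hex.
def utf8_hex(codepoint)
  [codepoint].pack("U").bytes.map { |b| format("%02X", b) }.join(" ")
end

puts utf8_hex(0x2019) # RIGHT SINGLE QUOTATION MARK => E2 80 99
puts utf8_hex(0x27)   # APOSTROPHE                  => 27
```

This matches the E2 80 99 seen in the actual test.txt output and the 27 in the expected output.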
