Weird output characters (Chinese characters) when using Ruby to read / write CSV - ruby

I'm trying to print the first 5 lines from a set of large (>500MB) csv files into small headers in order to inspect the content more easily.
I'm using Ruby code to do this but am getting each line padded out with extra Chinese characters, like this:
week_num type ID location total_qty A_qty B_qty count਍㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㄀㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㄀㈀㔀㌀ഀ
44 small 14 A 907859 907859 0 550360਍㐀㄀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀㄀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㄀㜀㐀ऀ㌀ ㈀㄀
The first few lines of input file are like so:
week_num type ID location total_qty A_qty B_qty count
34 small 197 A 547203 547203 0 91253
44 small 14 A 907859 907859 0 550360
41 small 421 A 302174 302174 0 18198
The strange characters appear to be Line 1 and Line 3 of the data.
Here's my Ruby code:
num_lines=ARGV[0]
fh = File.open(file_in,"r")
fw = File.open(file_out,"w")
until (line=fh.gets).nil? or num_lines==0
fw.puts line if outflag
num_lines = num_lines-1
end
Any idea what's going on and what I can do to simply stop at the line end character?
Looking at input/output files in hex (useful suggestion by @user1934428)
Input file - each character looks to be two bytes.
Output file - notice the NULL (00) between each single byte character...
Ruby version 1.9.1

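The problem is an encoding mismatch: the code never specifies an encoding when reading or writing. The hex view shows that the input is UTF-16LE, where every ASCII character is followed by a 00 byte. If this is running on Windows, writing in text mode also inserts a single 0D byte before each 0A, which shifts the two-byte characters that follow by one byte until the next line ending shifts them back; that would explain why alternate lines show up as Chinese characters. Open the input in binary mode ("rb") with the utf-16le encoding, and write the output the same way.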
num_lines = ARGV[0].to_i # ARGV values are Strings; convert so the == 0 test and the subtraction work
# ****** Specifying the right encodings <<<< this is the key
fh = File.open(file_in,"rb:utf-16le")
fw = File.open(file_out,"wb:utf-16le")
until (line=fh.gets).nil? or num_lines==0
fw.puts line
num_lines = num_lines-1
end
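The same fix can be wrapped in a small method. This is only a sketch, assuming the input really is UTF-16LE; the name copy_head and its parameters are illustrative and not part of the original answer. It uses write rather than puts, so the lines are copied exactly as read.

```ruby
# Copy the first `count` lines of a UTF-16LE file to a new file in the
# same encoding. Binary mode ("b") disables newline translation, which
# would otherwise insert single bytes between the two-byte characters.
# NOTE: copy_head is a hypothetical helper name used for illustration.
def copy_head(path_in, path_out, count)
  File.open(path_in, "rb:utf-16le") do |fh|
    File.open(path_out, "wb:utf-16le") do |fw|
      # each_line without a block returns an Enumerator, so first(count)
      # stops reading after `count` lines instead of loading the whole file.
      fh.each_line.first(count).each { |line| fw.write(line) }
    end
  end
end

# Example call (file names are placeholders):
#   copy_head("big_input.csv", "header_sample.csv", 5)
```

Because only the requested lines are read, this should stay fast even on the >500MB inputs mentioned in the question.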
Useful references:
Working with encodings in Ruby 1.9
CSV encodings
Determining the encoding of a CSV file
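If you are unsure whether a file is UTF-16, the hex observation from the question (a 00 byte after every character) can be turned into a rough check. The sketch below is a heuristic only: guess_utf16 is a hypothetical helper, it inspects just the first few bytes, and the byte-pattern test only works when the start of the file is plain ASCII text such as a header row.

```ruby
# Guess whether a file is UTF-16 by looking at its first bytes.
# Returns "UTF-16LE", "UTF-16BE", or nil when neither pattern matches.
# NOTE: guess_utf16 is a hypothetical helper; this is a heuristic, not a
# reliable encoding detector.
def guess_utf16(path, sample_size = 64)
  bytes = (File.binread(path, sample_size) || "").bytes
  return nil if bytes.length < 2

  # A byte order mark, when present, settles the question.
  return "UTF-16LE" if bytes[0, 2] == [0xFF, 0xFE]
  return "UTF-16BE" if bytes[0, 2] == [0xFE, 0xFF]

  # Otherwise look for ASCII text stored as two-byte units:
  # little-endian puts the 00 second, big-endian puts it first.
  pairs = bytes.each_slice(2).select { |pair| pair.length == 2 }
  return "UTF-16LE" if pairs.all? { |low, high| high == 0 && low != 0 }
  return "UTF-16BE" if pairs.all? { |high, low| high == 0 && low != 0 }

  nil
end
```

The result can then be used to build the mode string, for example "rb:" followed by the returned name.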

Related

How to restore PDF from ASCII?

I have a question, how to restore PDF file, if all I have is the only ASCII output?
Example:
%PDF-1.3
%���������
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
x�ѽ
�0�ݧ8O�����[�AAqp� �jK|{S�"�f�2���[�
�(M#���#�FFIw�=*��?J4'�P�y^TP`�Q�
+�i�E�8ψ�g���º��(6�񮽗֭,���s0�T��ZL�~�e�.EA��`J�f��<��M�
[...]
0000120481 00000 n
0000122448 00000 n
trailer
<</Size 94 /Root 57 0 R /Prev 116103 /Info 1 0 R>>
startxref
122488
%%EOF
This is the beginning and end of the output I have, and I need to restore it to a readable form. I tried a few things, but without luck.
It is impossible: the information has been lost.
Binary data cannot be represented as printable ASCII text at a ratio of one byte to one character.
The ASCII table contains many non-printable characters, and a PDF stream also contains bytes outside the ASCII range. These may be suppressed or replaced when the binary file contents are converted to text, which destroys the original data.
Quoted-Printable and Base64 encodings are better suited to this purpose.
See also: Binary-to-text encoding
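To illustrate the difference, here is a small Ruby sketch comparing a lossy byte-to-text conversion with a Base64 round trip. The method names are made up for this example, and the lossy conversion (replacing invalid UTF-8 with "?") is only an assumption about what might have produced output like the one in the question.

```ruby
# NOTE: all three method names are hypothetical, chosen for illustration.

# Lossy: treat the bytes as UTF-8 text and replace everything that is not
# valid with "?". Different bytes collapse into the same character, so the
# original data cannot be reconstructed afterwards.
def lossy_text(bytes)
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

# Lossless: Base64 maps every 3 bytes to 4 printable ASCII characters.
# "m0" is the pack/unpack directive for Base64 without line breaks.
def encode_binary(bytes)
  [bytes].pack("m0")
end

def decode_binary(text)
  text.unpack("m0").first
end

# Every possible byte value, as a stand-in for the contents of a PDF.
sample = (0..255).to_a.pack("C*")

puts encode_binary(sample).ascii_only?              # the encoded form is plain ASCII
puts decode_binary(encode_binary(sample)) == sample # the round trip restores every byte
puts lossy_text(sample).b == sample                 # the scrubbed text no longer matches
```

The Base64 form is about a third larger than the original, which is the price of keeping every byte recoverable.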

Matching all lines between two lines recursively in ruby

I would like to match all lines (including the first line) between two lines that start with 'SLX-', convert them to a comma separated line and then append them to a text file.
A truncated version of the original text file looks like:
SLX-9397._TC038IV_L_FLD0214.Read1.fq.gz
Sequences: 1406295
With index: 1300537
Sufficient length: 1300501
Min index: 0
Max index: 115
0 1299240
1 71
2 1
4 1
Unique: 86490
# reads processed: 86490
# reads with at least one reported alignment: 27433 (31.72%)
# reads that failed to align: 58544 (67.69%)
# reads with alignments suppressed due to -m: 513 (0.59%)
Reported 27433 alignments to 1 output stream(s)
SLX-9397._TC044II_D_FLD0197.Read1.fq.gz
Sequences: 308905
With index: 284599
Sufficient length: 284589
Min index: 0
Max index: 114
0 284290
1 16
Unique: 32715
# reads processed: 32715
# reads with at least one reported alignment: 13114 (40.09%)
# reads that failed to align: 19327 (59.08%)
# reads with alignments suppressed due to -m: 274 (0.84%)
Reported 13114 alignments to 1 output stream(s)
SLX-9397._TC047II_D_FLD0220.Read1.fq.gz
I imagine the Ruby would:
Convert every \n between two lines starting with SLX- to a comma
Save the result as a new text file (or, even better, a CSV file).
I think I specifically have a problem with how to find and replace between two specific lines.
I guess I could do this without using ruby, but seeing as I'm trying to get into Ruby...
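Assuming that you have your string in str:
require 'csv'
# Note: a final block that is not followed by another SLX- line is not matched.
CSV.open("/tmp/file.csv", "wb") do |csv|
  str.scan(/^(SLX-.*?)(?=\R+SLX-)/m).map do |s| # one match per SLX- block
    s.first.split($/).map do |el| # split the block into lines ($/ is "\n" by default)
      "'#{el}'" # wrap each value in single quotes
    end
  end.each do |line| # each block is now an array of fields
    csv << line # write it as one CSV row
  end
end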
I don't know much about Ruby, but this should work. Read the entire file into a String. Use this regex - (\RSLX-) - to match every SLX- except the first one, and replace each match with ,SLX-. For an explanation of the regex, see https://regex101.com/r/pP3pP3/1
This question - Ruby replace string with captured regex pattern - might help you understand how to do the replacement in Ruby.
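An alternative that avoids regular expressions is to group the lines with Enumerable#slice_before. This is a sketch rather than either answerer's method: slx_records and records_to_csv are hypothetical names, and it assumes each record should become one CSV row with one field per line.

```ruby
require 'csv'

# Group the lines of `text` into records, starting a new record at every
# line that begins with "SLX-". Surrounding whitespace and blank lines
# are dropped.
# NOTE: slx_records and records_to_csv are hypothetical helper names.
def slx_records(text)
  text.each_line
      .map(&:strip)
      .reject(&:empty?)
      .slice_before { |line| line.start_with?("SLX-") }
      .to_a
end

# Turn each record into one CSV row; the CSV library adds quotes only
# where a field needs them.
def records_to_csv(records)
  CSV.generate { |csv| records.each { |record| csv << record } }
end

# Example call (file names are placeholders):
#   File.write("stats.csv", records_to_csv(slx_records(File.read("stats.txt"))))
```

Unlike a lookahead for the next SLX- line, this also keeps the last record in the file, even if it consists of the SLX- line alone.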

Fortran 90 - reading format

I'm trying to read that string in a formatted file: " PARAMETER (NE_M=10,NL_M=12)".
I want to replace the 12 by 11.
I tried to read the string like this:
integer :: i
character(len=30) :: text
10 format(6x,24a,2i) text,i
read(text_data,10) text, i
write(6,100) text, 11
But it doesn't work. Any idea?
The read and write statements as written will not do what you want. The input line you presented is 33 characters wide, but your format accounts for only 32 of those characters, and your write will not contain the closing ).
Consider the following code if you do not need to capture the 12 from the input.
program test
character(len=30) :: text
101 format(a30, i2, ')')
open(unit=10, file='testinput.f')
read(10,101) text
write(*,101) text, 11
end program
and the input (with 6 leading spaces) in file testinput.f:
      PARAMETER (NE_M=10,NL_M=12)
when run, produces the output:
% ./test
      PARAMETER (NE_M=10,NL_M=11)
This code was compiled and tested with GNU gfortran 4.8.2.
Assuming text_data is the unit number of an open file and 100 is a format statement label:
integer :: i
character(len=30) :: text
10 format(6x,a24,i2)
read(text_data,10) text, i
write(6,100) text(:24), i
Fixing those other issues:
integer :: i
character(len=30) :: text
open(unit=20,file='filename')
10 format(6x,a24,i2)
read(20,10) text, i
write(6,10) text(:24), i

Replace the n-th byte in a file with another byte

In Ruby, how do I replace, say, the 7th byte of a file with another byte?
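Use the binwrite method of the IO class. When an offset is given, the file is not truncated and only the bytes at that position are overwritten:
IO.binwrite("testfile", [0x0D].pack("C"), 7) # => 1 (number of bytes written)
# The byte at offset 7 of testfile is now 0x0D; the rest of the file is unchanged
0x0D is 13 in decimal. The offset counts from zero, so offset 7 is the 8th byte of the file; pass 6 if you count the 7th byte starting from one.
You may also want to read about the Array#pack method, which builds the one-byte string here.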
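Wrapped in a method with a bounds check, this could look like the sketch below. The name replace_byte is hypothetical, and the check is an addition for safety: without it, writing past the end would extend the file instead of replacing a byte.

```ruby
# Overwrite the byte at zero-based `offset` in the file at `path` with
# `byte` (an Integer from 0 to 255). Returns the number of bytes written.
# NOTE: replace_byte is a hypothetical helper name used for illustration.
def replace_byte(path, offset, byte)
  size = File.size(path)
  unless (0...size).cover?(offset)
    raise ArgumentError, "offset #{offset} is outside the file (#{size} bytes)"
  end
  # With an offset, binwrite overwrites in place and does not truncate.
  File.binwrite(path, [byte].pack("C"), offset)
end

# Example call (file name is a placeholder); offset 6 is the 7th byte
# when counting from one:
#   replace_byte("testfile", 6, 0x0D)
```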

MATLAB: how to display UTF-8-encoded text read from file?

The gist of my question is this:
How can I display Unicode characters in Matlab's GUI (OS X) so that they are properly rendered?
Details:
I have a table of strings stored in a file, and some of these strings contain UTF-8-encoded Unicode characters. I have tried many different ways (too many to list here) to display the contents of this file in the MATLAB GUI, without success. For example:
>> fid = fopen('/Users/kj/mytable.txt', 'r', 'n', 'UTF-8');
>> [x, x, x, enc] = fopen(fid); enc
enc =
UTF-8
>> tbl = textscan(fid, '%s', 35, 'delimiter', ',');
>> tbl{1}{1}
ans =
ÎÎÎÎÎΠΣΦΩαβγδεζηθικλμνξÏÏÏÏÏÏÏÏÏÏ
>>
As it happens, if I paste the string directly into the MATLAB GUI, the pasted string is displayed properly, which shows that the GUI is not fundamentally incapable of displaying these characters; but once MATLAB reads the string in from the file, it no longer displays it correctly. For example:
>> pasted = 'ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω'
pasted =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>>
Thanks!
I present below my findings after doing some digging... Consider these test files:
a.txt
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
b.txt
தமிழ்
First, we read files:
%# open file in binary mode, and read a list of bytes
fid = fopen('a.txt', 'rb');
b = fread(fid, '*uint8')'; %'# read bytes
fclose(fid);
%# decode as unicode string
str = native2unicode(b,'UTF-8');
If you try to print the string, you get a bunch of nonsense:
>> str
str =
Nonetheless, str does hold the correct string. We can check the Unicode code point of each character; as you can see, they are outside the ASCII range (except the last two, which are the non-printable CR-LF line ending):
>> double(str)
ans =
Columns 1 through 13
915 916 920 923 926 928 931 934 937 945 946 947 948
Columns 14 through 26
949 950 951 952 953 954 955 956 957 958 960 961 962
Columns 27 through 35
963 964 965 966 967 968 969 13 10
Unfortunately, MATLAB seems unable to display this Unicode string in a GUI on its own. For example, all these fail:
figure
text(0.1, 0.5, str, 'FontName','Arial Unicode MS')
title(str)
xlabel(str)
One trick I found is to use the embedded Java capability:
%# Java Swing
label = javax.swing.JLabel();
label.setFont( java.awt.Font('Arial Unicode MS',java.awt.Font.PLAIN, 30) );
label.setText(str);
f = javax.swing.JFrame('frame');
f.getContentPane().add(label);
f.pack();
f.setVisible(true);
As I was preparing to write the above, I found an alternative solution. We can use the DefaultCharacterSet undocumented feature and set the charset to UTF-8 (on my machine, it is ISO-8859-1 by default):
feature('DefaultCharacterSet','UTF-8');
Now with a proper font (you can change the font used in the Command Window from Preferences > Font), we can print the string in the prompt (note that DISP is still incapable of printing Unicode):
>> str
str =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>> disp(str)
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπÏςστυφχψω
And to display it in a GUI, UICONTROL should work (under the hood, I think it is really a Java Swing component):
uicontrol('Style','text', 'String',str, ...
'Units','normalized', 'Position',[0 0 1 1], ...
'FontName','Arial Unicode MS', 'FontSize',30)
Unfortunately, TEXT, TITLE, XLABEL, etc. still show garbage.
As a side note: It is difficult to work with m-file sources containing Unicode characters in the MATLAB editor. I was using Notepad++, with files encoded as UTF-8 without BOM.

Resources