csv file encoding issue in ruby

I am parsing a CSV file using Ruby and getting this error:
invalid byte sequence in utf-8 csv
I tried the encoding option:
CSV.foreach(path, {headers: true, encoding: 'windows-1251:utf-8'}) do |row|
  new_row = {}
  headers = []
  row.each do |k,v|
    headers << k
    v = v.force_encoding('UTF-8') || ''
    v.gsub! "\xE2\x80\x96", "-"
    v.gsub! "\xE2\x80\x93", "-"
    v.gsub! "\xE2\x80\x94", "-"
    v.gsub! "\xE2\x80\x95", "-"
    v.gsub! "\xE2\x80\x98", "'"
    v.gsub! "\xE2\x80\x99", "'"
    v.gsub! "\xE2\x80\x9C", "\""
    v.gsub! "\xE2\x80\x9D", "\""
    v.gsub! "\xE2\x80\xA6", "..."
    v.gsub! "\x0D\x0A", "\n"
    v.gsub! "\xC2\xA0", " "
    v.gsub! "\xC2\xB0", " "
    new_row[k] = v
  end
  output_csv.puts headers if output_csv.header_row?
  output_csv.puts new_row
end
Now I end up with:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
The string that raises this issue in the CSV file is "G�ran".
Below is a sample input row:
David Evans & Assocs www.deainc.com 13858534 jpv#deainc.com G�ran Volk 5034990383
Can anyone suggest how to solve this issue?

That issue is most likely caused by saving the file in the wrong encoding. Say you have the Unicode symbol “★” in your file. If you save it as ASCII, Latin-1, or another one-byte-per-character encoding, you lose some data.
The symbol “�” is known as the replacement character. It is used to indicate “here was a character that was apparently lost during encoding conversion.”
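A minimal sketch of how this plays out in Ruby. It assumes the file was really saved as Windows-1252, where "ö" is the single byte 0xF6; that is only a guess based on the name "Göran", since the question does not state the true source encoding. The sketch shows that the byte is invalid when read as UTF-8, that declaring the real source encoding and transcoding recovers the character, and that String#scrub merely hides the damage.

```ruby
# Assumption: the CSV was saved as Windows-1252, where "ö" is the byte 0xF6.
raw = "G\xF6ran".b                       # raw bytes as read from disk (binary string)

# Interpreted as UTF-8, the byte sequence is invalid.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?             # false

# Declare the real source encoding, then transcode to UTF-8.
fixed = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts fixed                               # Göran

# If the source encoding is unknown, scrub replaces invalid bytes
# instead of raising later; the original character is lost.
scrubbed = as_utf8.scrub('?')
puts scrubbed                            # G?ran
```

If the guess about the source encoding holds, the same pair can be given to CSV.foreach through its encoding option, for example encoding: 'Windows-1252:UTF-8', so that every field arrives as valid UTF-8 and no force_encoding call is needed.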

Related

can't write IP to text file without formatting issues

I'm having trouble reading an IP from a text file and properly writing it to another text file. It shows the written IP in the file as: "ÿþ1 9 2 . 1 6 8 . 1 1 0 . 4"
# Read the first line for the IP
def get_server_ip
  File.open("d:\\ip_addr.txt") do |line|
    a = line.readline()
    b = a.to_s
  end
end
# Append the IP to file2
def append_ip
  FileUtils.cp('file1.txt', 'file2.txt')
  file_names = ['file2.txt']
  file_names.each do |file_name|
    text = File.read(file_name)
    b = get_server_ip
    new_contents = text.gsub('ip_here', b)
    File.open(file_name, "w") {|file| file.puts new_contents }
  end
end
I've tried .strip and .delete(' ') with no luck. Can anyone see the issue?
Thank you

The file was generated with Notepad on Windows. It is encoded as UTF-16LE.
The first two bytes in the file are 0xFF and 0xFE; this is the Byte Order Mark (BOM) of UTF-16LE.
Each character in this file is encoded on 2 bytes (16 bits), least significant byte first (little-endian order).
The spaces between the printable characters in the output are in fact NUL characters (characters with code 0).
What you can do (apart from converting the file to a more convenient encoding such as UTF-8 or even ISO-8859-1) is pass 'rb:BOM|UTF-16LE' as the second argument of File.open.
r tells File.open to open the file in read-only mode (which it also does by default);
b means "binary mode"; it is required by BOM|UTF-16;
:BOM|UTF-16LE tells Ruby to read and ignore the BOM if it is present in the file and to expect the rest of the file to be encoded as UTF-16LE.
If you can, I recommend converting the file to UTF-8 or ISO-8859-1 with a decent editor (even Notepad can do it); then all these problems vanish.
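The open mode described above can be tried with a short, self-contained sketch. The temporary file and its contents are made up for illustration and stand in for d:\ip_addr.txt. The explicit encode('UTF-8') step is an addition that the answer does not mention; it is there so that the value can be passed to gsub together with UTF-8 text.

```ruby
require 'tempfile'

# Hypothetical sample: a UTF-16LE file with a BOM, similar to what
# Notepad writes when saving as "Unicode".
sample = Tempfile.new('ip_addr')
sample.binmode
sample.write("\xFF\xFE".b + "192.168.110.4\r\n".encode('UTF-16LE').b)
sample.close

# 'rb:BOM|UTF-16LE' skips the BOM and tags the data as UTF-16LE.
utf16 = File.open(sample.path, 'rb:BOM|UTF-16LE') { |f| f.read }

# Transcode to UTF-8 and drop the trailing line ending.
ip = utf16.encode('UTF-8').strip
puts ip            # 192.168.110.4
puts ip.encoding   # UTF-8

sample.unlink
```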

Windows cmd: piping python 3.5 py file results works but pyinstaller exe's leads to UnicodeEncodeError

I am somewhat out of options here...
# -*- coding: utf-8 -*-
print(chr(246) + " " + chr(9786) + " " + chr(9787))
print("End.")
When I run the code above in my Win7 cmd window, I get different results depending on how I invoke it:
python.exe utf8.py
-> ö ☺ ☻
python.exe utf8.py >test.txt
-> ö ☺ ☻ (in file)
utf8.exe
-> ö ☺ ☻
utf8.exe >test.txt
RuntimeWarning: sys.stdin.encoding == 'utf-8', whereas sys.stdout.encoding == 'cp1252', readline hook consumer may assume they are the same
Traceback (most recent call last):
File "Development\utf8.py", line 15, in <module>
print(chr(246) + " " + chr(9786) + " " + chr(9787))
File "C:\python35\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u263a' in position
Messing around with win_unicode_console doesn't help either. In the end, I get the same results.
PYTHONIOENCODING=utf-8
is set. But it seems that when using PyInstaller, the setting is ignored for stdout.encoding:
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
Output:
python.exe utf8.py > test.txt
utf-8
False
cp1252
mbcs
utf-8
utf8.exe >test.txt
cp1252
False
cp1252
mbcs
utf-8
The questions are: How does that happen? And: How can I fix that?
codecs.getwriter([something])(sys.stdout)
seems to be discouraged because it may lead to modules with broken output. Or is it possible to force that to utf-8 in case we did a check for a tty? Better: How to fix that in PyInstaller?
Thanks in advance...

Thanks to eryksun, the following workaround is working:
import os
import sys

STDOUT_ENCODING = str(sys.stdout.encoding)
try:
    PYTHONIOENCODING = str(os.environ["PYTHONIOENCODING"])
except KeyError:
    PYTHONIOENCODING = False
# Remark: In case the stdout gets modified, it will only append all information
# that has been written into the pipe until that very moment.
if sys.stdout.isatty() is False:
    print("Program is running in piping mode. (sys.stdout.isatty() is " + str(sys.stdout.isatty()) + ".)")
    if PYTHONIOENCODING is not False:
        print("PYTHONIOENCODING is set to a value. ('" + str(PYTHONIOENCODING) + "')")
        if str(sys.stdout.encoding) != str(PYTHONIOENCODING):
            print("PYTHONIOENCODING is differing from stdout encoding. ('" + str(PYTHONIOENCODING) + "' != '" + STDOUT_ENCODING + "'). This should normally not happen unless the PyInstaller setup is still broken. Setting hard utf-8 workaround.")
            sys.stdout = open(sys.stdout.fileno(), 'w', encoding='utf-8', closefd=False)
            print("PYTHONIOENCODING was differing from stdout encoding. ('" + str(PYTHONIOENCODING) + "' != '" + STDOUT_ENCODING + "'). This should normally not happen unless PyInstaller is still broken. Setting hard utf-8 workaround. New encoding: '" + str(PYTHONIOENCODING) + "'.", "D")
        else:
            print("PYTHONIOENCODING is equal to stdout encoding. ('" + str(PYTHONIOENCODING) + "' == '" + str(sys.stdout.encoding) + "'). - All good.")
    else:
        print("PYTHONIOENCODING is set False. ('" + str(PYTHONIOENCODING) + "'). - Nothing to do.")
else:
    print("Program is running in terminal mode. (sys.stdout.isatty() is " + str(sys.stdout.isatty()) + ".) - All good.")
Next, I will try setting up a fresh PyInstaller environment to see whether that fixes it from the start.

Issue running Classic ASP written in VBScript on Windows 2012 with IIS 8

This is a very peculiar issue. I have a 2012 Server running IIS 8 with support for classic ASP installed. I am building a comma separated string from a form. I then am retrieving this string from a table and want to split on the commas.
First, when I build the string and submit it to the DB (SQL Express 2014), something is adding a space after each comma even though there is no space in the code.
Second, when I return the string and attempt to split on the comma, it doesn't do anything; the ubound method returns -1... For testing purposes, I hand built an array and this has the same behavior.
Code that builds the csv string:
If fieldName = "txt_EnvironmentType" then
    strTempEnvCSV = strTempEnvCSV & fieldValue & ","
End If
Test code for split:
txtEnvironmentType = "This,Is,A,Test,String"
If txtEnvironmentType <> "" then
    response.write(txtEnvironmentType)
    array = split(txtEnvironmentType,",")
    l = ubound(array)
    response.write("<br>array is " & l & " long")
    For i = LBound(array) to UBound(array)
        response.write("<br>" & array(i))
    Next
End If
The above test code returns the following to the browser:
This,Is,A,Test,String
array is -1 long
Am I missing something?
Thanks!

For the mysterious spaces, add a TRIM() to make sure you aren't starting with spaces:
If fieldName = "txt_EnvironmentType" then
    strTempEnvCSV = strTempEnvCSV & TRIM(fieldValue) & ","
End If
For your second issue, the code below ran for me. The only change I made was to Dim the array variable and name it something other than "array", since Array is the name of a built-in VBScript function.
<%
dim arrMine
txtEnvironmentType = "This,Is,A,Test,String"
If txtEnvironmentType <> "" then
    response.write(txtEnvironmentType)
    arrMine = split(txtEnvironmentType,",")
    l = ubound(arrMine)
    response.write("<br>arrMine is " & l & " long")
    For i = LBound(arrMine) to UBound(arrMine)
        response.write("<br>" & arrMine(i))
    Next
End If
%>

QTP regular expression

I have a problem with a regular expression in QTP and can't understand why this pattern doesn't work:
Dim objRegExp
Set objRegExp = New RegExp
objRegExp.Pattern = Replace(Replace(Replace("Millennium [AUT]", "\", "\\"), "(", "\("), ")", "\)")
objRegExp.IgnoreCase = True
If objRegExp.Execute("Millennium [AUT]").Count < 1 Then
    Set objRegExp = Nothing
End If
Count returns 0. Could someone help, please?

Your Replace chain does not change the pattern "Millennium [AUT]", which searches for "Millennium" followed by " ", followed by one letter out of "A", "U", or "T". Your input "Millennium [AUT]" has a "[" where the pattern expects "A", "U", or "T".
So please follow the general rule when asking for solutions to regexp problems: give at least one sample input and its expected outcome.
Perhaps you meant:
>> set r = New RegExp
>> r.Pattern = "Millennium \[AUT\]"
>> set mts = r.Execute("Millennium [AUT]")
>> WScript.Echo mts.Count
>>
1
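The same behaviour can be reproduced outside VBScript. The sketch below uses Ruby rather than QTP's RegExp object, purely as an illustration: unescaped brackets form a character class, while escaping them, either by hand or with Regexp.escape, matches the literal text.

```ruby
input = 'Millennium [AUT]'

# Unescaped: "[AUT]" is a character class matching one of A, U or T.
unescaped = Regexp.new('Millennium [AUT]')
puts unescaped.match?(input)            # false
puts unescaped.match?('Millennium A')   # true

# Brackets escaped by hand.
manual = Regexp.new('Millennium \[AUT\]')
puts manual.match?(input)               # true

# Regexp.escape escapes every metacharacter, brackets included.
escaped = Regexp.new(Regexp.escape(input))
puts escaped.match?(input)              # true
```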

I use this site to verify my REGEX:
http://regexpal.com/
Good Luck!

Is there a SnakeYaml DumperOptions setting to avoid double-spacing output?

I seem to see double-spaced output when parsing/dumping a simple YAML file with a pipe-text field.
The test is:
public void yamlTest()
{
    DumperOptions printOptions = new DumperOptions();
    printOptions.setLineBreak(DumperOptions.LineBreak.UNIX);
    Yaml y = new Yaml(printOptions);
    String input = "foo: |\n" +
                   "  line 1\n" +
                   "  line 2\n";
    Object parsedObject = y.load(new StringReader(input));
    String output = y.dump(parsedObject);
    System.out.println(output);
}
and the output is:
{foo: 'line 1
line 2
'}
Note the extra space between line 1 and line 2, and after line 2 before the end of the string.
This test was run on Mac OS X 10.6, java version "1.6.0_29".
Thanks!
Mark

In the original string you use literal style, which is indicated by the '|' character. When you dump your text, the single-quoted style is used. That style folds a single line break into a space, so each '\n' has to be written as an empty line. That is why the output looks double-spaced.
Try setting a different style in DumperOptions:
// other options include FOLDED and DOUBLE_QUOTED
printOptions.setDefaultScalarStyle(DumperOptions.ScalarStyle.LITERAL);
