I was wondering how to print unicode characters, such as Japanese or fun characters like 📦.
I can print hearts with:
hearts = "\u2665"
puts hearts.encode('utf-8')
How can I print more unicode charaters with Ruby in Command Prompt?
My method works with some characters but not all.
Code examples would be greatly appreciated.
You need to enclose the unicode character in { and } if the number of hex digits isn't 4 (credit : /u/Stefan) e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
Alternatively you could also just put the unicode character directly in your source, which is quite easy at least on macOS with the Emoji & Symbols menu accessed by Ctrl + Command + Space by default (a similar menu can be accessed on Windows 10 by Win + ; ) in most applications including your text editor/Ruby IDE most likely:
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯
How it looks in the macOS terminal:
I am writing a simple program for windows using Python 2.7. It opens an email, take some words from it and puts them in a form on web. Problem starts when the email contains polish letters like Ó, Ź, Ł etc. Whenever I try to print it I get something like: ['\xc4\x84', '\xc5\xbb', '\xc3\x93', '\xc4\x86', '\xc5\xb9'].
I already know it is because of encoding and that Python 3 has no such problem. Here is what I tried already:
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
mail = mail.decode("UTF-8")
print mail
or
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
[x.encode('UTF8') for x in mail]
print mail
Can anyone please show me how to make the list print properly ?
Python 2.x uses ASCII as a default encoding. To force it to use Unicode, add this line to the top of your program.
# -*- coding: utf-8 -*-
Also you should prefix any string literals with 'u'. e.g.
polishLetters = u'Ą Ż Ó Ć Ź'
I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.
I seem to see double-spaced output when parsing/dumping a simple YAML file with a pipe-text field.
The test is:
public void yamlTest()
{
DumperOptions printOptions = new DumperOptions();
printOptions.setLineBreak(DumperOptions.LineBreak.UNIX);
Yaml y = new Yaml(printOptions);
String input = "foo: |\n" +
" line 1\n" +
" line 2\n";
Object parsedObject = y.load(new StringReader(input));
String output = y.dump(parsedObject);
System.out.println(output);
}
and the output is:
{foo: 'line 1
line 2
'}
Note the extra space between line 1 and line 2, and after line 2 before the end of the string.
This test was run on Mac OS X 10.6, java version "1.6.0_29".
Thanks!
Mark
In the original string you use literal style - it is indicating by the '|' character. When you dump your text, you use single-quoted style which ignores the '\n' characters at the end. That is why they are repeated with the empty lines.
Try to set different styles in DumperOptions:
// and others - FOLDED, DOUBLE_QUOTED
DumperOptions.setDefaultScalarStyle(ScalarStyle.LITERAL)
When I invoke a Python 3 script from a Windows batch.cmd,
a UTF-8 arg is not passed as "UTF-8", but as a series of bytes,
each of which are interpreted by Python as individual UTF-8 chars.
How can I convert the Python 3 arg string to its intended UTF-8 state?
The calling .cmd and the called .py are shown below.
PS. As I mention in a comment below, calling u00FF.py "ÿ" directly from the Windows console commandline works fine. It is only a problem when I invoke u00FF.cmd via the .cmd, and I am looking for a Python 3 way to convert the double-encoded UTF-8 arg back to a "normally" encoded UTF-8 form.
I've now include here, the full (and latest) test code.. Its a bit long, but I hope it explains the issue clearly enough.
Update: I've seen why the file read of "ÿ" was "double-encoding"... I was reading the UTF-8 file in binary/byte mode... I should have used codecs.open('u00FF.arg', 'r', 'utf-8') instead of just plain open('u00FF.arg','r')... I've updated the offending code, and the output. The codepage issues seems to be the only problem now...
Because the Python issue has been largely resolved, and the codepage issue is quite independent of Python, I have posted another codepage specific question at
Codepage 850 works, 65001 fails! There is NO response to “call foo.cmd”. internal commands work fine.
::::::::::::::::::: BEGIN .cmd BATCH FILE ::::::::::::::::::::
:: Windows Batch file (UTF-8 encoded, no BOM): "u00FF.cmd"
#echo ÿ>u00FF.arg
#u00FF.py "ÿ"
#goto :eof
::::::::::::::::::: END OF .cmd BATCH FILE ::::::::::::::::::::
################### BEGIN .py SCRIPT #####################################
# -*- coding: utf-8 -*-
import sys
print ("""
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \\xC3 \\xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
""")
print("## ====================================================")
print("## ÿ via hard-coding in this .py script itself ========")
print("##")
hard1s = "ÿ"
hard1b = hard1s.encode('utf_8')
print("hard1s: len", len(hard1s), " '" + hard1s + "'")
print("hard1b: len", len(hard1b), hard1b)
for i in range(0,len(hard1s)):
print("CodePoint[", i, "]", hard1s[i], "U+"+"{0:x}".upper().format(ord(hard1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ read into this .py script from a UTF-8 file ======")
print("##")
import codecs
file1 = codecs.open( 'u00FF.arg', 'r', 'utf-8' )
file1s = file1.readline()
file1s = file1s[:1] # remove \r
file1b = file1s.encode('utf_8')
print("file1s: len", len(file1s), " '" + file1s + "'")
print("file1b: len", len(file1b), file1b)
for i in range(0,len(file1s)):
print("CodePoint[", i, "]", file1s[i], "U+"+"{0:x}".upper().format(ord(file1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ via sys.argv from a call to .py from a .cmd) ===")
print("##")
argv1s = sys.argv[1]
argv1b = argv1s.encode('utf_8')
print("argv1s: len", len(argv1s), " '" + argv1s + "'")
print("argv1b: len", len(argv1b), argv1b)
for i in range(0,len(argv1s)):
print("CodePoint[", i, "]", argv1s[i], "U+"+"{0:x}".upper().format(ord(argv1s[i])).zfill(4) )
print(''' These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.''')
print()
################### END OF .py SCRIPT #####################################
Here is the output from the above code.
========================== BEGIN OUTPUT ================================
C:\>u00FF.cmd
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \xC3 \xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
## ====================================================
## ÿ via hard-coding in this .py script itself ========
##
hard1s: len 1 'ÿ'
hard1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ read into this .py script from a UTF-8 file ======
##
file1s: len 1 'ÿ'
file1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected
## ====================================================
## ÿ via sys.argv from a call to .py from a .cmd) ===
##
argv1s: len 2 '├┐'
argv1b: len 6 b'\xe2\x94\x9c\xe2\x94\x90'
CodePoint[ 0 ] ├ U+251C
CodePoint[ 1 ] ┐ U+2510
These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.
========================== END OF OUTPUT ================================
Batch files and encodings are a finicky issue. First of all: Batch files have no direct way of specifying the encoding they're in and cmd does not really support Unicode batch files. You can easily see that if you save a batch file with a Unicode BOM or as UTF-16 – they will throw an error.
What you see when you put the ÿ directly into the command line is that when running a command Windows will initially use the command line as Unicode (it may have been converted from some legacy encoding beforehand, but in the end what Windows uses is Unicode). So Python will (hopefully) always grab the Unicode content of the arguments.
However, since cmd has its own opinions about the codepage (and you never told it to use UTF-8) the UTF-8 string you put in the batch file won't be interpreted as UTF-8 but instead in the default cmd codepage (850 or 437, in your case).
You can force UTF-8 with chcp:
chcp 65001 > nul
You can save the following file as UTF-8 and try it out:
#echo off
chcp 850 >nul
echo ÿ
chcp 65001 >nul
echo ÿ
Keep in mind, though, that the chcp setting will persist in the shell if you run the batch from there which may make things weird.
Windows shell uses a specific code page (see CHCP command output). You need to convert from Windows code page to utf-8. See iconv module or decode() / encode()