Similar looking UTF8 characters for ASCII - utf-8

I'm looking for a table that maps ASCII characters to similar-looking non-ASCII Unicode characters. I know that whether they look the same also depends on the font, but something generic to start with is enough.
>>> # PY3 code:
>>> a='H' # ascii
>>> b='Н' # utf8
>>> a==b
False
>>> ' '.join(format(ord(x), 'b') for x in a)
'1001000'
>>> ' '.join(format(ord(x), 'b') for x in b)
'10000011101'
>>> a='P' # ascii
>>> b='Ρ' # utf8
>>> a==b
False
>>> ' '.join(format(ord(x), 'b') for x in a)
'1010000'
>>> ' '.join(format(ord(x), 'b') for x in b)
'1110100001'

This is a very useful tool: it shows you all the characters that look similar, and you can decide whether each one is REALLY similar enough for you :)
https://unicode.org/cldr/utility/confusables.jsp?a=test&r=None
Some other resources:
This is called Visual Spoofing
Python Package to detect confusables
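As a rough first check before consulting a confusables table, the standard unicodedata module can report which characters in a string fall outside ASCII and what they are actually called. This is only a sketch: the helper name describe_non_ascii is made up for illustration, and it does not detect lookalikes by itself, it just makes them visible.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]

# "\u041d" is the Cyrillic letter from the question, "\u03a1" the Greek one.
for entry in describe_non_ascii("H\u041d P\u03a1"):
    print(entry)
```

For the two examples in the question this reports CYRILLIC CAPITAL LETTER EN and GREEK CAPITAL LETTER RHO, which explains why the comparisons with the ASCII letters return False.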

Related

How to retrieve and format wifi MAC address in MicroPython on ESP32?

I have the following MicroPython code running on an ESP32:
import network
wlan_sta = network.WLAN(network.STA_IF)
wlan_sta.active(True)
wlan_mac = wlan_sta.config('mac')
print("MAC Address:", wlan_mac) # Show MAC for peering
The output looks like this:
MAC Address: b'0\xae\xa4z\xa7$'
I would like to display it in the more familiar format of six pairs of hex digits, like this:
MAC Address: AABBCC112233
After searching for a solution on the internet, I've tried:
print("MAC Address:", str(wlan_mac)) but it displays the same as when not using str()
print("MAC Address:", hex(wlan_mac)) but it results in TypeError: can't convert bytes to int
print("MAC Address:", wlan_mac.hex()) but it says AttributeError: 'bytes' object has no attribute 'hex'
I am also a little suspicious of the bytes retrieved from wlan_sta.config('mac'). I would have expected something that looked more like b'\xaa\xbb\xcc\x11\x22\x33' instead of b'0\xae\xa4z\xa7$'. The z and the $ seem very out of place for something that should be hexadecimal and it seems too short for what should be six pairs of digits.
So my question is two-fold:
Am I using the correct method to get the MAC address?
If it is correct, how can I format it as six pairs of hex digits?
I am also a little suspicious of the bytes retrieved from wlan_sta.config('mac'). I would have expected something that looked more like b'\xaa\xbb\xcc\x11\x22\x33' instead of b'0\xae\xa4z\xa7$'. The z and the $ seem very out of place for something that should be hexadecimal and it seems too short for what should be six pairs of digits.
You're not getting back a hexadecimal string, you're getting a byte string. So if the MAC address contains the value 7A, then the byte string will contain z (which has ASCII value 122 (hex 7A)).
Am I using the correct method to get the MAC address?
You are!
If it is correct, how can I format it as six pairs of hex digits?
If you want to print the MAC address as a hex string, you can use the ubinascii.hexlify method:
>>> import ubinascii
>>> import network
>>> wlan_sta = network.WLAN(network.STA_IF)
>>> wlan_sta.active(True)
>>> wlan_mac = wlan_sta.config('mac')
>>> print(ubinascii.hexlify(wlan_mac).decode())
30aea47aa724
Or maybe:
>>> print(ubinascii.hexlify(wlan_mac).decode().upper())
30AEA47AA724
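If a separator between the pairs is wanted, one option is to format each byte individually. The sketch below uses only iteration over bytes and % formatting, so it should also work without ubinascii; the function name format_mac is an assumption for illustration, and the snippet was written against the sample bytes from the question rather than run on an ESP32.

```python
def format_mac(mac_bytes, sep=":"):
    """Format a bytes object as upper-case hex pairs joined by sep."""
    # Iterating over bytes yields integers, so each one can be formatted as two hex digits.
    return sep.join("%02X" % b for b in mac_bytes)

sample = b'0\xae\xa4z\xa7$'      # the value printed in the question
print(format_mac(sample))        # 30:AE:A4:7A:A7:24
print(format_mac(sample, ""))    # 30AEA47AA724
```

This also shows why the original output looked odd: '0', 'z' and '$' are simply the printable ASCII characters for the byte values 0x30, 0x7A and 0x24.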
You can use:
def wifi_connect(ssid, pwd):
    sta_if = None
    import network
    sta_if = network.WLAN(network.STA_IF)
    if not sta_if.isconnected():
        print("connecting to network...")
        sta_if.active(True)
        sta_if.connect(ssid, pwd)
        while not sta_if.isconnected():
            pass
    print("----------------------------------------")
    print("network config:", sta_if.ifconfig())
    print("----------------------------------------")
    get_my_mac_addr(sta_if)
Then:
def get_my_mac_addr(sta_if):
    import ubinascii
    import network
    wlan_mac = sta_if.config('mac')
    my_mac_addr = ubinascii.hexlify(wlan_mac).decode()
    my_mac_addr = format_mac_addr(my_mac_addr)
    # Return the formatted address so callers can use it
    return my_mac_addr
Then:
def format_mac_addr(addr):
    mac_addr = addr
    mac_addr = mac_addr.upper()
    new_mac = ""
    for i in range(0, len(mac_addr), 2):
        #print(mac_addr[i] + mac_addr[i+1])
        if (i == len(mac_addr) - 2):
            new_mac = new_mac + mac_addr[i] + mac_addr[i+1]
        else:
            new_mac = new_mac + mac_addr[i] + mac_addr[i+1] + ":"
    print("----------------------------------------")
    print("My MAC Address:" + new_mac)
    print("----------------------------------------")
    return new_mac
Output:
----------------------------------------
My MAC Address:xx:xx:xx:xx:xx:xx
----------------------------------------

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
If anyone is interested, here is some Python code that decodes three of the IFC encodings: \X\, \X2\ and \S\
import re
def decodeIfc(txt):
    # In regex "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ','\\')
    return txt
def decodeIfcX2(match):
    # X2 encodes characters as groups of 4 hexadecimal digits.
    return ''.join(list(map(lambda x : chr(int(x,16)), re.findall('([0-9A-F]{4})',match.group(1)))))
def decodeIfcS(match):
    return chr(ord(match.group(1))+128)
def decodeIfcX(match):
    # Sometimes, IFC files were made on an old Mac... which uses MacRoman encoding.
    num = int(match.group(1), 16)
    if (num <= 127) | (num >= 160):
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")
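The code above does not cover the \X4\ directive mentioned earlier for characters outside the Basic Multilingual Plane. A minimal sketch of how it might be handled follows, assuming that \X4\ is followed by one or more groups of eight hexadecimal digits and closed by \X0\; the function name decode_ifc_x4 is hypothetical and the pattern has not been checked against real IFC files.

```python
import re

# Assumption: \X4\ introduces groups of 8 upper-case hex digits, terminated by \X0\.
X4_PATTERN = re.compile(r'\\X4\\((?:[0-9A-F]{8})+)\\X0\\')

def decode_ifc_x4(text):
    """Replace each \\X4\\...\\X0\\ run with the characters it encodes."""
    def repl(match):
        digits = match.group(1)
        # Every 8 hex digits form one Unicode code point.
        return ''.join(
            chr(int(digits[i:i + 8], 16)) for i in range(0, len(digits), 8)
        )
    return X4_PATTERN.sub(repl, text)

print(decode_ifc_x4(r'Smile \X4\0001F600\X0\ here'))
```

Using a raw string for the pattern avoids the need for the placeholder substitution, at the cost of doubling each backslash inside the regular expression.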

Lua string.format using UTF8 characters

How can I get the 'right' formatting using string.format with strings containing UTF-8 characters?
Example:
local str = "\xE2\x88\x9E"
print(utf8.len(str), string.len(str))
print(str)
print(string.format("###%-5s###", str))
print(string.format("###%-5s###", 'x'))
Output:
1 3
∞
###∞  ###
###x    ###
It looks like string.format pads using the byte length of the infinity sign instead of its "character length".
Is there a UTF-8 equivalent of string.format?
function utf8.format(fmt, ...)
   local args, strings, pos = {...}, {}, 0
   for spec in fmt:gmatch'%%.-([%a%%])' do
      pos = pos + 1
      local s = args[pos]
      if spec == 's' and type(s) == 'string' and s ~= '' then
         table.insert(strings, s)
         args[pos] = '\1'..('\2'):rep(utf8.len(s)-1)
      end
   end
   return (
      fmt:format(table.unpack(args))
         :gsub('\1\2*', function() return table.remove(strings, 1) end)
   )
end
local str = "\xE2\x88\x9E"
print(string.format("###%-5s###", str)) --> ###∞  ###
print(string.format("###%-5s###", 'x')) --> ###x    ###
print(utf8.format ("###%-5s###", str)) --> ###∞    ###
print(utf8.format ("###%-5s###", 'x')) --> ###x    ###
Lua added the utf8 library in version 5.3, and it provides only a small set of functions for minimal needs. It is still "fresh" and not really a focus of the language. Your issue is about how the characters are interpreted and rendered, and rendering is not something the standard library or typical Lua use is concerned with.
For now, you should just adjust your format pattern to suit the input.
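The byte-versus-character mismatch from the question can be reproduced outside Lua. The Python sketch below works on UTF-8 encoded bytes and compares padding by byte length with padding by code point count; the helper name ljust_utf8 is made up for illustration. Note that counting code points is still only an approximation of display width, since wide and combining characters are not accounted for.

```python
def ljust_utf8(data, width):
    """Left-justify UTF-8 encoded bytes to a width counted in code points."""
    n_chars = len(data.decode("utf-8"))
    return data + b" " * max(0, width - n_chars)

inf = b"\xe2\x88\x9e"                          # the infinity sign, 3 bytes in UTF-8
print(len(inf), len(inf.decode("utf-8")))      # 3 1
print(b"###" + inf.ljust(5) + b"###")          # byte-based: only 2 spaces added
print(b"###" + ljust_utf8(inf, 5) + b"###")    # code-point-based: 4 spaces added
```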

Capybara, rspec- How to find text anywhere on page

There are multiple ways to find it but I want to do this in a specific manner. Here it is-
To get an element with some text in it, my framework creates an xpath in this manner-
#xpath = "//h1[contains(text(), '[the-text-i-am-searching-for]')]"
Then it executes-
find(:xpath, #xpath).visible?
Now in similar format I want to create an xpath which just looks for a text anywhere in the page and then can be used in find(:xpath,#xpath).visible? to return a true or false.
To give a little more context:
My HTML paragraph looks something like this-
<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text Learn more [the-text-i-am-searching-for]</p></blink>
but if I try to find it using find(:xpath, #xpath) where my xpath is
#xpath = "//p[contains(text(), '[the-text-i-am-searching-for]')]"
it fails.
Try replacing "//p[contains(text(), '[the-text-i-am-searching-for]')]" with "//p[contains(., '[the-text-i-am-searching-for]')]"
I don't know your environment but in Python with lxml it works:
>>> import lxml.etree
>>> doc = lxml.etree.HTML("""<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text Learn more [the-text-i-am-searching-for]</p></blink>""")
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
>>> doc.xpath('//p[contains(., "[the-text-i-am-searching-for]")]')
[<Element p at 0x1c1b9b0>]
>>>
The context node . will be converted to a string to match the signature boolean contains(string, string) (http://www.w3.org/TR/xpath/#section-String-Functions)
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>>
Consider these variations
>>> doc.xpath('//p')
[<Element p at 0x1c1b9b0>]
>>> doc.xpath('//p/*')
[<Element b at 0x1e34b90>, <Element a at 0x1e34af0>]
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>> doc.xpath('//p/text()')
['some text here ', ' again some text ', ' [the-text-i-am-searching-for]']
>>> doc.xpath('string(//p/text())')
'some text here '
>>> doc.xpath('//p/text()[3]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p/text()[contains(., "[the-text-i-am-searching-for]")]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]

Encoding issue with Sqlite3 in Ruby

I have a list of SQL queries beautifully encoded in UTF-8. I read them from files, perform the inserts and then do a select.
# encoding: utf-8
require 'sqlite3'
def exec_sql_lines(file_name)
  puts "----> #{file_name} <----"
  File.open(file_name, 'r') do |f|
    # sometimes a query doesn't fit one line
    previous_line=""
    i = 0
    while line = f.gets do
      puts i+=1
      if(line[-2] != ')')
        previous_line += line[0..-2]
        next
      end
      puts (previous_line + line) # <---- (1)
      $db.execute((previous_line + line))
      previous_line =""
    end
    a = $db.execute("select * from Table where _id=6")
    puts a # <---- (2)
  end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different from the one in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding, it says UTF-8.
Why do Polish letters come out broken? How to fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating the database output. I need to fix its content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
Ruby has three default encodings: source, external and internal.
In your code you set the source encoding.
Perhaps there is a problem with the external and internal encodings?
A quick test in windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
  p f.external_encoding
  p f.internal_encoding
  p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even though the source encoding is UTF-8, the data is read as CP850.
In your case:
Does File.open(filename,'r:utf-8') help?
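To see what this kind of mismatch does to the text, it helps to reproduce the broken output on purpose. The Python sketch below encodes one of the affected words as UTF-8 and then decodes the bytes with a single-byte code page. Windows-1250 is an assumption here: it was chosen only because it reproduces the "Ăł" seen in the question's output, and the code page actually in effect on the asker's machine is not known.

```python
original = "antagonistów"

utf8_bytes = original.encode("utf-8")
# Assumption: the bytes are interpreted as Windows-1250 somewhere along the way.
garbled = utf8_bytes.decode("cp1250")
print(garbled)  # antagonistĂłw

# Undoing the wrong decoding recovers the original text, which shows that
# in this scenario the bytes themselves were never damaged.
print(garbled.encode("cp1250").decode("utf-8"))  # antagonistów

def read_utf8(path):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

The read_utf8 helper is a hypothetical counterpart to opening the file with 'r:utf-8' in Ruby: in both cases the encoding is stated explicitly rather than taken from the platform default.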

Resources