Lua string.format using UTF8 characters - utf-8

How can I get the 'right' formatting using string.format with strings containing UTF-8 characters?
Example:
local str = "\xE2\x88\x9E"
print(utf8.len(str), string.len(str))
print(str)
print(string.format("###%-5s###", str))
print(string.format("###%-5s###", 'x'))
Output:
1 3
∞
###∞ ###
###x ###
It looks like the string.format uses the byte length of the infinity sign instead of the "character length".
Is there an UTF-8 string.format equivalent?

function utf8.format(fmt, ...)
local args, strings, pos = {...}, {}, 0
for spec in fmt:gmatch'%%.-([%a%%])' do
pos = pos + 1
local s = args[pos]
if spec == 's' and type(s) == 'string' and s ~= '' then
table.insert(strings, s)
args[pos] = '\1'..('\2'):rep(utf8.len(s)-1)
end
end
return (
fmt:format(table.unpack(args))
:gsub('\1\2*', function() return table.remove(strings, 1) end)
)
end
local str = "\xE2\x88\x9E"
print(string.format("###%-5s###", str)) --> ###∞ ###
print(string.format("###%-5s###", 'x')) --> ###x ###
print(utf8.format ("###%-5s###", str)) --> ###∞ ###
print(utf8.format ("###%-5s###", 'x')) --> ###x ###

Lua added the UTF-8 library with version 5.3 with just small functionality for minimal needs. It's "fresh" and not really in focus for this language. Your issue is how the characters are interpreted & rendered but graphics isn't a point for the standard library or usual use of Lua.
For now, you should just fix your pattern for the input.

Related

How to decoding IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
str.gsub(ESCAPE_SEQUENCE_EXPR) do
$1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
If someone is interested, I wrote here a Python Code that decode 3 of the IFC encodings : \X, \X2\ and \S\
import re
def decodeIfc(txt):
# In regex "\" is hard to manage in Python... I use this workaround
txt = txt.replace('\\', 'µµµ')
txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
txt = txt.replace('µµµ','\\')
return txt
def decodeIfcX2(match):
# X2 encodes characters with multiple of 4 hexadecimal numbers.
return ''.join(list(map(lambda x : chr(int(x,16)), re.findall('([0-9A-F]{4})',match.group(1)))))
def decodeIfcS(match):
return chr(ord(match.group(1))+128)
def decodeIfcX(match):
# Sometimes, IFC files were made with old Mac... wich use MacRoman encoding.
num = int(match.group(1), 16)
if (num <= 127) | (num >= 160):
return chr(num)
else:
return bytes.fromhex(match.group(1)).decode("macroman")

How to convert UTF8 byte arrays to string in lua

I have a table like this
table = {57,55,0,15,-25,139,130,-23,173,148,-24,136,158}
it is utf8 encoded byte array by php unpack function
unpack('C*',$str);
how can I convert it to utf-8 string I can read in lua?
Lua doesn't provide a direct function for turning a table of utf-8 bytes in numeric form into a utf-8 string literal. But it's easy enough to write something for this with the help of string.char:
function utf8_from(t)
local bytearr = {}
for _, v in ipairs(t) do
local utf8byte = v < 0 and (0xff + v + 1) or v
table.insert(bytearr, string.char(utf8byte))
end
return table.concat(bytearr)
end
Note that none of lua's standard functions or provided string facilities are utf-8 aware. If you try to print utf-8 encoded string returned from the above function you'll just see some funky symbols. If you need more extensive utf-8 support you'll want to check out some of the libraries mention from the lua wiki.
Here's a comprehensive solution that works for the UTF-8 character set restricted by RFC 3629:
do
local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
function utf8(decimal)
if decimal<128 then return string.char(decimal) end
local charbytes = {}
for bytes,vals in ipairs(bytemarkers) do
if decimal<=vals[1] then
for b=bytes+1,2,-1 do
local mod = decimal%64
decimal = (decimal-mod)/64
charbytes[b] = string.char(128+mod)
end
charbytes[1] = string.char(vals[2]+decimal)
break
end
end
return table.concat(charbytes)
end
end
function utf8frompoints(...)
local chars,arg={},{...}
for i,n in ipairs(arg) do chars[i]=utf8(arg[i]) end
return table.concat(chars)
end
print(utf8frompoints(72, 233, 108, 108, 246, 32, 8364, 8212))
--> Héllö €—

Join array of strings into 1 or more strings each within a certain char limit (+ prepend and append texts)

Let's say I have an array of Twitter account names:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
And a prepend and append variable:
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
How can I turn this into an array of as few strings as possible each with a maximum length of 140 characters, starting with the prepend text, ending with the append text, and in between the Twitter account names all starting with an #-sign and separated with a space. Like this:
tweets = ['Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday', 'Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday', 'Check out these cool people: #example18 #example19 #example20 #FollowFriday']
(The order of the accounts isn't important so theoretically you could try and find the best order to make the most use of the available space, but that's not required.)
Any suggestions? I'm thinking I should use the scan method, but haven't figured out the right way yet.
It's pretty easy using a bunch of loops, but I'm guessing that won't be necessary when using the right Ruby methods. Here's what I came up with so far:
# Create one long string of #usernames separated by a space
tmp = twitter_accounts.map!{|a| a.insert(0, '#')}.join(' ')
# alternative: tmp = '#' + twitter_accounts.join(' #')
# Number of characters left for mentioning the Twitter accounts
length = 140 - (prepend + append).length
# This method would split a string into multiple strings
# each with a maximum length of 'length' and it will only split on empty spaces (' ')
# ideally strip that space as well (although .map(&:strip) could be use too)
tweets = tmp.some_method(' ', length)
# Prepend and append
tweets.map!{|t| prepend + t + append}
P.S.
If anyone has a suggestion for a better title let me know. I had a difficult time summarizing my question.
The String rindex method has an optional parameter where you can specify where to start searching backwards in a string:
arr = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
str = arr.map{|name|"##{name}"}.join(' ')
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
max_chars = 140 - prepend.size - append.size
until str.size <= max_chars do
p str.slice!(0, str.rindex(" ", max_chars))
str.lstrip! #get rid of the leading space
end
p str unless str.empty?
I'd make use of reduce for this:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
prepend = 'Check out these cool people:'
append = '#FollowFriday'
# Extra -1 is for the space before `append`
max_content_length = 140 - prepend.length - append.length - 1
content_strings = string.reduce([""]) { |result, target|
result.push("") if result[-1].length + target.length + 2 > max_content_length
result[-1] += " ##{target}"
result
}
tweets = content_strings.map { |s| "#{prepend}#{s} #{append}" }
Which would yield:
"Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday"
"Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday"
"Check out these cool people: #example18 #example19 #example20 #FollowFriday"

How to advance past a deflate byte sequence contained in a byte stream?

I have a byte stream that is a concatenation of sections, where each section is composed of a header plus a deflated byte stream.
I need to split this byte stream sections but the header only contains information about the data in uncompressed form, no hint about the compressed data length so I can advance properly in the stream and parse the next section.
So far the only way I found to advance past the deflated byte sequece is to parse it according to the this specification. From what I understood by reading the specification, a deflate stream is composed of blocks, which can be compressed blocks or literal blocks.
Literal blocks contain a size header which can be used to easily advance past it.
Compressed blocks are composed with 'prefix codes', which are bit sequences of variable length that have special meanings to the deflate algorithm. Since I'm only interested in finding out the deflated stream length, I guess the only code I need to look for is '0000000' which according to the specification signals the end of block.
So I came up with this coffeescript function to parse the deflate stream(I'm working on node.js)
# The job of this function is to return the position
# after the deflate stream contained in 'buffer'. The
# deflated stream begins at 'pos'.
advanceDeflateStream = (buffer, pos) ->
byteOffset = 0
finalBlock = false
while 1
if byteOffset == 6
firstTypeBit = 0b00000001 & buffer[pos]
pos++
secondTypeBit = 0b10000000 & buffer[pos]
type = firstTypeBit | (secondTypeBit << 1)
else
if byteOffset == 7
pos++
type = buffer[pos] & (0b01100000 >>> byteOffset)
if type == 0
# Literal block
# ignore the remaining bits and advance position
byteOffset = 0
pos++
len = buffer.readUInt16LE(pos)
pos += 2
lenComplement = buffer.readUInt16LE(pos)
if (len ^ ~lenComplement)
throw new Error('Literal block lengh check fail')
pos += (2 + len) # Advance past literal block
else if type in [1, 2]
# huffman block
# we are only interested in finding the 'block end' marker
# which is signaled by the bit string 0000000 (256)
eob = false
matchedZeros = 0
while !eob
byte = buffer[pos]
for i in [byteOffset..7]
# loop the remaining bits looking for 7 consecutive zeros
if (byte ^ (0b10000000 >>> byteOffset)) >>> (7 - byteOffset)
matchedZeros++
else
# reset counter
matchedZeros = 0
if matchedZeros == 7
eob = true
break
byteOffset++
if !eob
byteOffset = 0
pos++
else
throw new Error('Invalid deflate block')
finalBlock = buffer[pos] & (0b10000000 >>> byteOffset)
if finalBlock
break
return pos
To check if this works, I wrote a simple mocha test case:
zlib = require 'zlib'
test 'sample deflate stream', (done) ->
data = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' # length 30
zlib.deflate data, (err, deflated) ->
# deflated.length == 11
advanceDeflateStream(deflated, 0).shoudl.eql(11)
done()
The problem is that this test fails and I do not know how to debug it. I accept any answer that points what I missed in the parsing algorithm or contains a correct version of the above function in any language.
The only way to find the end of a deflate stream or even a deflate block is to decode all of the Huffman codes contained within. There is no bit pattern that you can search for that can not appear earlier in the stream.

Encoding issue with Sqlite3 in Ruby

I have a list of sql queries beautifully encoded in utf-8. I read them from files, perform the inserts and than do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
puts "----> #{file_name} <----"
File.open(file_name, 'r') do |f|
# sometimes a query doesn't fit one line
previous_line=""
i = 0
while line = f.gets do
puts i+=1
if(line[-2] != ')')
previous_line += line[0..-2]
next
end
puts (previous_line + line) # <---- (1)
$db.execute((previous_line + line))
previous_line =""
end
a = $db.execute("select * from Table where _id=6")
puts a <---- (2)
end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different than the one in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding is says UTF-8.
Why do Polish letters come out broken? How to fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating with database output. I need to fix it's content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three default encoding.
In you code you set the source encoding.
Perhaps there is a problem with External and Internal Encoding?
A quick test in windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
p f.external_encoding
p f.internal_encoding
p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even if UTF-8 is used, the data are read as cp850.
In your case:
Does File.open(filename,'r:utf-8') help?

Resources