How to decode IFC using Ruby - ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.

I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/

def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
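For example, the escaped group 00E9 from the question is hexadecimal for code point 233, which is é:

"00E9".to_i(16)                       # => 233
"00E9".to_i(16).chr(Encoding::UTF_8)  # => "é"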
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
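If you do need characters outside the Basic Multilingual Plane, a hypothetical extension for \X4 might look like the sketch below. It assumes, from my reading of the implementation guide, that \X4\ works like \X2\ but with eight hex digits per code point, so treat it as something to verify against the spec rather than a reference implementation:

X4_ESCAPE_SEQUENCE_EXPR = /\\X4\\(.*?)\\X0\\/

def decode_ifc_with_x4(str)
  # First expand \X4\ groups (assumed: 8 hex digits each), then \X2\ groups (4 hex digits each)
  str.gsub(X4_ESCAPE_SEQUENCE_EXPR) { $1.gsub(/.{8}/) { $&.to_i(16).chr(Encoding::UTF_8) } }
     .gsub(ESCAPE_SEQUENCE_EXPR) { $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) } }
end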

If anyone is interested, here is some Python code that decodes three of the IFC encodings: \X, \X2\ and \S\
import re

def decodeIfc(txt):
    # In a regex, "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt

def decodeIfcX2(match):
    # X2 encodes characters with a multiple of 4 hexadecimal digits.
    return ''.join(list(map(lambda x: chr(int(x, 16)), re.findall('([0-9A-F]{4})', match.group(1)))))

def decodeIfcS(match):
    return chr(ord(match.group(1)) + 128)

def decodeIfcX(match):
    # Sometimes IFC files were made with old Macs... which use MacRoman encoding.
    num = int(match.group(1), 16)
    if (num <= 127) | (num >= 160):
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

Related

How to print unicode characters in Command Prompt with Ruby

I was wondering how to print unicode characters, such as Japanese or fun characters like 📦.
I can print hearts with:
hearts = "\u2665"
puts hearts.encode('utf-8')
How can I print more unicode characters with Ruby in Command Prompt?
My method works with some characters but not all.
Code examples would be greatly appreciated.
You need to enclose the Unicode code point in { and } if the number of hex digits isn't 4 (credit: /u/Stefan), e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
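If you have the code points as integers rather than string literals, the same strings can also be built with Integer#chr or Array#pack (a small sketch of the same three examples):

heart = 0x2665.chr(Encoding::UTF_8)
package = 0x1F4E6.chr(Encoding::UTF_8)
fire_and_one_hundred = [0x1F525, 0x1F4AF].pack('U*')

puts heart
puts package
puts fire_and_one_hundred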
Alternatively, you could just put the Unicode character directly in your source. On macOS this is easy with the Emoji & Symbols menu (Ctrl + Command + Space by default); a similar menu is available on Windows 10 via Win + ; and it works in most applications, including your text editor or Ruby IDE:
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯

can't write IP to text file without formatting issues

I'm having trouble reading an IP from a text file and properly writing it to another text file. It shows the written IP in the file as: "ÿþ1 9 2 . 1 6 8 . 1 1 0 . 4"
# Read the first line for the IP
def get_server_ip
  File.open("d:\\ip_addr.txt") do |line|
    a = line.readline()
    b = a.to_s
  end
end

# Append the IP to file2
def append_ip
  FileUtils.cp('file1.txt', 'file2.txt')
  file_names = ['file2.txt']
  file_names.each do |file_name|
    text = File.read(file_name)
    b = get_server_ip
    new_contents = text.gsub('ip_here', b)
    File.open(file_name, "w") { |file| file.puts new_contents }
  end
end
I've tried .strip and .delete(' ') with no luck. Can anyone see the issue?
Thank you
The file was generated with Notepad on Windows. It is encoded as UTF-16LE.
The first two bytes in the file have the codes 0xFF and 0xFE; this is the Byte Order Mark of UTF-16LE.
Each character is encoded on 2 bytes (16 bits), least significant byte first (little-endian order).
The spaces between the printable characters in the output are, in fact, NUL characters (characters with code 0).
What you can do (apart from converting the file to a more convenient format like UTF-8 or even ISO-8859-1) is to pass 'rb:BOM|UTF-16LE' as the second argument of File#open.
r tells File#open to open the file in read-only mode (which it also does by default);
b means "binary mode"; it is required by BOM|UTF-16LE;
:BOM|UTF-16LE tells Ruby to read and ignore the BOM if it is present in the file and to expect the rest of the file to be encoded as UTF-16LE.
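Putting that together, a minimal sketch of get_server_ip with that mode string (converting to UTF-8 before further use is an extra step I'm assuming you want):

def get_server_ip
  # BOM|UTF-16LE: skip the BOM if present and decode the rest as UTF-16LE
  File.open("d:\\ip_addr.txt", "rb:BOM|UTF-16LE") do |file|
    file.readline.encode("UTF-8").strip
  end
end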
If you can, I recommend converting the file to UTF-8 or ISO-8859-1 with a decent editor (even Notepad can do it), and all these problems vanish.

libsvm: read vectors from word2vec

Is there an easy way to use w2v's output vectors in libsvm?
There are two output formats for w2v: binary and text. In the text format each line begins with a word followed by a space-separated vector. e.g.:
something -0.197045 -0.292196 -0.107292 -0.168469 0.114897 -0.006383 -0.000056 0.068514 -0.079548 0.251488 0.185607 0.248675 -0.058647 0.062771 0.129014 -0.024715 -0.168974 -0.035367 -0.009597 0.090379 0.030133 0.017338 0.062264 -0.219165 -0.214198 0.226869 -0.058710 0.034563 -0.046304 0.2
I found a way with Ruby:
First require the libsvm wrapper:
require 'libsvm'
read the vectors file (assuming textual form):
lines = File.readlines('vectors.txt')
insert them into a hash:
words = {}
lines[1..-1].each{ |l| sp = l.strip.split; words[sp[0]] = sp[1..-1].map(&:to_f) }
and finally use libsvm:
examples = words.values.map { |ary| Libsvm::Node.features(ary) }
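From there, training is up to you. Here is a minimal sketch using the rb-libsvm wrapper required above; the labels are made up purely so the sketch is self-contained, you would supply your own:

problem = Libsvm::Problem.new
parameter = Libsvm::SvmParameter.new
parameter.cache_size = 100 # in megabytes
parameter.eps = 0.001
parameter.c = 10

# Fake alternating labels, for illustration only
labels = Array.new(examples.size) { |i| i.even? ? 1 : -1 }

problem.set_examples(labels, examples)
model = Libsvm::Model.train(problem, parameter)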

How to convert UTF8 byte arrays to string in lua

I have a table like this
table = {57,55,0,15,-25,139,130,-23,173,148,-24,136,158}
It is a UTF-8 encoded byte array produced by the PHP unpack function:
unpack('C*',$str);
How can I convert it to a UTF-8 string I can read in Lua?
Lua doesn't provide a direct function for turning a table of utf-8 bytes in numeric form into a utf-8 string literal. But it's easy enough to write something for this with the help of string.char:
function utf8_from(t)
  local bytearr = {}
  for _, v in ipairs(t) do
    local utf8byte = v < 0 and (0xff + v + 1) or v
    table.insert(bytearr, string.char(utf8byte))
  end
  return table.concat(bytearr)
end
Note that none of Lua's standard functions or provided string facilities are UTF-8 aware. If you try to print the UTF-8 encoded string returned from the above function, you'll just see some funky symbols. If you need more extensive UTF-8 support, check out some of the libraries mentioned on the Lua wiki.
Here's a comprehensive solution that works for the UTF-8 character set restricted by RFC 3629:
do
  local bytemarkers = { {0x7FF, 192}, {0xFFFF, 224}, {0x1FFFFF, 240} }
  function utf8(decimal)
    if decimal < 128 then return string.char(decimal) end
    local charbytes = {}
    for bytes, vals in ipairs(bytemarkers) do
      if decimal <= vals[1] then
        for b = bytes + 1, 2, -1 do
          local mod = decimal % 64
          decimal = (decimal - mod) / 64
          charbytes[b] = string.char(128 + mod)
        end
        charbytes[1] = string.char(vals[2] + decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end

function utf8frompoints(...)
  local chars, arg = {}, {...}
  for i, n in ipairs(arg) do chars[i] = utf8(arg[i]) end
  return table.concat(chars)
end
print(utf8frompoints(72, 233, 108, 108, 246, 32, 8364, 8212))
--> Héllö €—

Recovering hex data from a large log-file using Ruby and RegEx

I'm trying to filter/append lines of hex data from a large log-file, using Ruby and RegEx.
The lines of the log-file that I need look like this:
Data: 10 55 61 (+ lots more hex data)
I want to append all of the hex data for further processing later. The regex /^\sData:(.+)/ should do the trick.
My Ruby program looks like this:
puts "Start"
fileIn = File.read("inputfile.txt")
fileOut = File.new("outputfile.txt", "w+")
fileOut.puts "Start of regex data\n"
fileIn.each_line do
  dataLine = fileIn.match(/^\sData:(.+)/).captures
  fileOut.write dataLine
end
fileOut.puts "\nEOF"
fileOut.close
puts "End"
It works - sort of - but the lines in the output file are all the same, just repeating the result of the first regex match.
What am I doing wrong?
You are matching against the entire file (fileIn) on every iteration, so you always get the first match. You need to match against each line instead:
fileIn.each_line do |line|
  dataLine = line.match(/^\sData:(.+)/).captures
  fileOut.write dataLine
end
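One caveat, assuming the log also contains lines without a Data: section (only one sample line is shown): on those lines match returns nil and calling .captures raises NoMethodError. A guarded version might look like this:

fileIn.each_line do |line|
  m = line.match(/^\sData:(.+)/)
  fileOut.write m[1] if m  # skip lines that have no Data: section
end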
