Sort a table with UTF-8 encoded values alphabetically

I am storing dictionary entries in a Lua table, using it as an array. I want to sort the entries from Lua, so that I can add new ones without having to move to the correct position myself (which gets quite tedious soon). However, I am facing several problems:
Many words contain non-ASCII characters, which makes the built-in comparison operator for strings unsuitable for the task (for instance, it makes amputar come before ámbito).
There are words from various languages (all Western, though), namely Spanish, German and English. The problem here is that, probably, different languages have different notions of the alphabetical order. Since the main language is Spanish, I would like to use its rules, although I'm unsure as to whether that will work with characters not contained in the Spanish alphabet.
Some words contain capital letters, or, even worse, start with them. For example, all German nouns start with upper-case letters. By the built-in comparison operator, the capital letters come before their lower-case siblings, which is not my desired behaviour; I would like upper-case letters to be treated exactly as their lower-case counterparts.
Take, for example, the following table:
local entries =
{
'amputar',
'Volksgeist',
'ámbito'
}
Those entries should be ordered like this:
ámbito
amputar
Volksgeist
However, with my current code, the output is wrong:
local function compare_utf8_strings( o1 , o2 )
-- Using the built-in non-UTF-8-aware non-locale-aware string comparison operator
return o1 < o2
end
table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )
for i, entry in ipairs(entries) do
print( entry )
end
That outputs:
Volksgeist
amputar
ámbito
Could you please take the following code, and hack it to fulfill my requirements?
local entries =
{
'amputar',
'Volksgeist',
'ámbito'
}
local function compare_utf8_strings( o1 , o2 )
-- Hack here, please, accomplishing my requirements
end
table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )
for i, entry in ipairs(entries) do
print( entry )
end
It should output this:
ámbito
amputar
Volksgeist
As an additional requirement, this Lua code all runs inside LuaTeX, which currently supports version 5.2 of the language. As for external libraries, I guess it's possible to use them.
I am a novice in the Lua camp, so, please, forgive any error I have made, and feel free to notify it, so I fix it.

After some time searching to no avail, I found this article by Joseph Wright. Although it touched my issue, it didn't provide a clear solution to follow. I asked him, and it turned out that there's currently no direct way to do what I want. He pointed out, however, that slnunicode comes built-in with LuaTeX (albeit it will be replaced in the future).
I developed a 'crude' solution using the facilities provided in the LuaTeX environment. It isn't elegant, but it works, and it doesn't pull any external dependencies. About its efficiency, I have not perceived any difference in the document build time.
-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8
--[[
Each character's position in this array-like table determines its 'priority'.
Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
-- The space is here because of other requirements of my project
{ ' ' },
{ 'a', 'á', 'à', 'ä' },
{ 'b' },
{ 'c' },
{ 'd' },
{ 'e', 'é', 'è', 'ë' },
{ 'f' },
{ 'g' },
{ 'h' },
{ 'i', 'í', 'ì', 'ï' },
{ 'j' },
{ 'k' },
{ 'l' },
{ 'm' },
{ 'n' },
{ 'ñ' },
{ 'o', 'ó', 'ò', 'ö' },
{ 'p' },
{ 'q' },
{ 'r' },
{ 's' },
{ 't' },
{ 'u', 'ú', 'ù', 'ü' },
{ 'v' },
{ 'w' },
{ 'x' },
{ 'y' },
{ 'z' }
}
-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
for i, alphabet_entry in ipairs(alphabet) do
for _, alphabet_char in ipairs(alphabet_entry) do
if character == alphabet_char then
return i
end
end
end
--[[
If it isn't in the alphabet, abort: it's better than silently outputting some
random garbage, and, thanks to the message, allows to add the character to
the table.
]]
assert( false , "'" .. character .. "' was not in alphabet" )
end
-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
--[[
I saw this variable being used in several code snippets around the Web, but
it isn't provided in my LuaTeX environment; I use this form of initialization
to be safe if it's defined in the future.
]]
utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"
local characters = {}
for character in s:gmatch(utf8.charpattern) do
table.insert( characters , character )
end
return characters
end
local function compare_utf8_strings( _o1 , _o2 )
--[[
`o1_chars´ and `o2_chars´ are array-like tables containing all of the
characters of each string, which are all made lower-case using the
slnunicode facilities that come built-in with LuaTeX.
]]
local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )
-- Count the extracted characters, so the loop bounds always match the tables
local o1_len = #o1_chars
local o2_len = #o2_chars
for i = 1, math.min( o1_len , o2_len ) do
local o1_pos = get_pos_in_alphabet( o1_chars[i] )
local o2_pos = get_pos_in_alphabet( o2_chars[i] )
if o1_pos > o2_pos then
return false
elseif o1_pos < o2_pos then
return true
end
end
return o1_len < o2_len
end
I cannot integrate this solution in the question's framework because my test environment, the ZeroBrane Studio Lua IDE, doesn't come with slnunicode and I don't know how to add it.
That was it. If anyone has any doubt or would like further explanations, please, use the comments. I hope it's useful to someone else.

Related

Insert multiple characters in string at once

Whereas str[] will replace a character, str.insert will insert a character at a position. But it requires two lines of code:
str = "COSO17123456"
str.insert 4, "-"
str.insert 7, "-"
=> "COSO-17-123456"
I was thinking how to do this in one line of code. I came up with the following solution:
str = "COSO17123456"
str.each_char.with_index.reduce("") { |acc,(c,i)| acc += c + ( (i == 3 || i == 5) ? "-" : "" ) }
=> "COSO-17-123456"
Is there a built-in Ruby helper for this task? If not, should I stick with the insert option rather than combining several iterators?
Use each to iterate over an array of indices:
str = "COSO17123456"
[4, 7].each { |i| str.insert i, '-' }
str #=> "COSO-17-123456"
You can use slices and .join:
> [str[0..3], str[4..5],str[6..-1]].join("-")
=> "COSO-17-123456"
Note that the position of every hyphen after the first one differs from the insert version, because the slice indices refer to the original string rather than to a string that has already grown through earlier insertions. Here the second hyphen goes between indices 5 and 6 of the original, which seems more natural (to me anyway...).
In other words, you insert at absolute indices of the original string, not at relative indices that move as insertions are made.
If you want to insert at specific absolute index values, you can also use .each_with_index and control the behavior character by character:
str2 = ""
tgts=[3,5]
str.split("").each_with_index { |c,idx| str2+=c; str2+='-' if tgts.include? idx }
Both of the above create a new string.
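The difference between original-string indices and shifting indices can be made explicit with a small helper. This is only a minimal sketch; the name insert_at_original_indices is hypothetical and not a built-in Ruby method. It works from the highest index down, so earlier insertions never move the positions that are still pending:

```ruby
# Hypothetical helper: insert `sep` at positions given relative to the
# ORIGINAL string. Processing the indices in descending order means an
# insertion never shifts a position that has not been handled yet.
def insert_at_original_indices(str, indices, sep = "-")
  result = str.dup                          # leave the caller's string untouched
  indices.sort.reverse_each { |i| result.insert(i, sep) }
  result
end

insert_at_original_indices("COSO17123456", [4, 6])  #=> "COSO-17-123456"
```

Compare this with the chained insert calls, which need 4 and 7 because the second index refers to the string after the first hyphen has been added.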
String#insert returns the string itself.
This means you can chain the method calls, which can be prettier and more efficient if you only have to do it a couple of times, as in your example:
str = "COSO17123456".insert(4, "-").insert(7, "-")
puts str
COSO-17-123456
Your reduce version can be therefore more concisely written as:
[4,7].reduce(str) { |str, idx| str.insert(idx, '-') }
I'll bring one more variation to the table, String#unpack:
new_str = str.unpack("A4A2A*").join('-')
# or with String#%
new_str = "%s-%s-%s" % str.unpack("A4A2A*")
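To show how the unpack format string is put together, here is a small sketch that builds it from a list of field widths. The helper name hyphenate is hypothetical. Each "A<n>" directive takes n bytes and "A*" takes the rest; note that the A directive also strips trailing spaces and nulls from each field, so this only suits data like the example above:

```ruby
# Hypothetical helper: split `str` into fields of the given widths
# (plus whatever remains) and join the fields with `sep`.
def hyphenate(str, widths, sep = "-")
  format = widths.map { |w| "A#{w}" }.join + "A*"   # e.g. [4, 2] -> "A4A2A*"
  str.unpack(format).join(sep)
end

hyphenate("COSO17123456", [4, 2])  #=> "COSO-17-123456"
```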

Formatting large numbers using numeral.js

In my app, I want to format various numbers using a library, and I have a couple of related questions (which I don't submit separately because I think they might represent a very common set of problems)
Format a number using a format string constant to achieve compressed literals such as 1.2k or 1.23M
Format a number using a format string constant to have a thousand delimiter applied, regardless of the client's locale settings.
I tried to achieve a formatting result, where the language thousand delimiter is actually taken into consideration
http://jsfiddle.net/erbronni/19mLmekt/
// load a language
numeral.language('fr', {
delimiters: {
thousands: ' ',
decimal: ','
},
abbreviations: {
thousand: 'k',
million: 'M',
billion: '',
trillion: 't'
},
ordinal : function (number) {
return number === 1 ? 'er' : 'ème';
},
currency: {
symbol: '€'
}
});
numeral.language('fr');
document.getElementById('f1').innerHTML = numeral(12345678).format('0 000') // intended output: '12 345 678' -- does not seem to work
Numeral.js has this built in. It can be easily achieved using a format string such as .format('0.00a').
Some full examples:
numeral(1000000).format('0a') will return 1m
numeral(250500).format('0.0a') will return 250.5k
numeral(10500).format('0.00a') will return 10.50k

Grouping regex based on the previous grouping result

I have some parameters that I have to sort into different lists. The prefix determines which list should it belong to.
I use the prefixes c, a, n and o, plus an optional leading hyphen (-) that determines whether the parameter goes into the include list or the exclude list.
I use the regex grouped as:
/^(-?)([o|a|c|n])(\w+)/
But here the third group (\w+) is not generic, and it should actually be dependent on the second group's result. I.e, if the prefix is:
'c' or 'a' -> /\w{3}/
'o' -> /\w{2}/
else -> /\w+/
Can I do this with a single regex? Currently I am using an if condition to do so.
Example input:
Valid:
"-cABS", "-aXYZ", "-oWE", "-oqr", "-ncanbeanyting", "nstillanything", "a123", "-conT" (will go to c_exclude_list)
Invalid:
"cmorethan3chars", "c1", "-a1234", "prefizisnotvalid", "somethingelse", "oABC"
Output: for each arg push to the correct list, ignore the invalid.
c_include_list, c_exclude_list, a_include_list, a_exclude_list etc.
You can use this pattern:
/(-?)\b([aocn])((?:(?<=[ac])\w{3}|(?<=o)\w{2}|(?<=n)\w+))\b/
The idea is to use lookbehinds to check the previous character without including it in the capture group.
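As a rough sketch of how the lookbehind idea could be used to fill the lists, the following applies an anchored variant of the pattern to each argument. The names ARG_PATTERN and sort_args are hypothetical, the list names follow the ones mentioned in the question, and anchoring with \A and \z (instead of \b) is an assumption that every argument is tested as a whole string:

```ruby
# Optional "-", one prefix letter, then a payload whose allowed length
# depends on the prefix; each lookbehind re-checks the prefix just consumed.
ARG_PATTERN = /\A(-?)([aocn])((?<=[ac])\w{3}|(?<=o)\w{2}|(?<=n)\w+)\z/

# Hypothetical helper: returns a hash such as {"c_exclude_list" => [...]}.
def sort_args(args)
  lists = Hash.new { |hash, key| hash[key] = [] }
  args.each do |arg|
    match = ARG_PATTERN.match(arg)
    next unless match                                  # ignore invalid arguments
    kind = match[1].empty? ? "include" : "exclude"     # "-" means exclude
    lists["#{match[2]}_#{kind}_list"] << match[3]
  end
  lists
end

lists = sort_args(%w[-cABS -conT a123 -oWE nstillanything c1 oABC -a1234])
lists["c_exclude_list"]  #=> ["ABS", "onT"]
lists["a_include_list"]  #=> ["123"]
```

Arguments such as c1, oABC and -a1234 do not match and are skipped, so they never reach any list.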
Since version 2.0, Ruby has switched from Oniguruma to Onigmo (a fork of Oniguruma), which adds support for conditional regex, among other features.
So you can use the following regex to customize the pattern based on the prefix:
^-(?:([ca])|(o)|(n))?(?(1)\w{3}|(?(2)\w{2}|(?(3)\w+)))$
Demo at rubular
Is a single, mind-bending regex the best way to deal with this problem?
Here's a simpler approach that does not employ a regex at all. I suspect that it would be at least as efficient as a single regex, considering that with the latter you must still assign matching strings to their respective arrays. I think it also reads better and would be easier to maintain. The code below should be easy to modify if I have misunderstood some fine points of the question.
Code
def devide_em_up(str)
h = { a_exclude: [], a_include: [], c_exclude: [], c_include: [],
o_exclude: [], o_include: [], other_exclude: [], other_include: [] }
str.split.each do |s|
exclude = (s[0] == ?-)
s = s[1..-1] if exclude
first = s[0]
s = s[1..-1] if 'cao'.include?(first)
len = s.size
case first
when 'a'
(exclude ? h[:a_exclude] : h[:a_include]) << s if len == 3
when 'c'
(exclude ? h[:c_exclude] : h[:c_include]) << s if len == 3
when 'o'
(exclude ? h[:o_exclude] : h[:o_include]) << s if len == 2
else
(exclude ? h[:other_exclude] : h[:other_include]) << s
end
end
h
end
Example
Let's try it:
str = "-cABS cABT -cDEF -aXYZ -oWE -oQR oQT -ncanbeany nstillany a123 " +
"-conT cmorethan3chars c1 -a1234 prefizisnotvalid somethingelse oABC"
devide_em_up(str)
#=> {:a_exclude=>["XYZ"], :a_include=>["123"],
# :c_exclude=>["ABS", "DEF", "onT"], :c_include=>["ABT"],
# :o_exclude=>["WE", "QR"], :o_include=>["QT"],
# :other_exclude=>["ncanbeany"],
# :other_include=>["nstillany", "prefizisnotvalid", "somethingelse"]}

ruby extract string between two string

I have a string like the one below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I wanted to extract the somename string from the above string. Values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change but the syntax will remain same as above.
I saw similar posts on this site but don't know how to use regex in the above case.
Any ideas on how to extract that string?
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!.
First, look at what delimits that value: it has quotes, then there's a colon to the left and comma to the right. However, looking at other parts, such layout is also used near the command and near the pwd. So, colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere in the input data except this place. This is just great! That means, that we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
now, we know:
where the part starts
and that after the "Name" text there's always a colon, a quote, and then the-interesting-data
and that there's always a quote after the interesting-data
let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
The index method returns the start position of the match, so it would return the position of : and the position of \. Since we want to get the text between them, we must add/subtract a few positions to move past the :\" at the beginning or move back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
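Put together, the marker-based steps above might look like the sketch below. The method name extract_after_marker is hypothetical, and the sample string is the one from the question written on a single line, in single quotes, so that every \" stays a literal backslash followed by a quote:

```ruby
# Hypothetical helper wrapping the index arithmetic described above.
def extract_after_marker(text, marker)
  marker_pos = text.index(marker)
  raise ArgumentError, "marker #{marker} not found" unless marker_pos

  value_start = text.index(':\"', marker_pos) + 3   # skip colon, backslash, quote
  value_end   = text.index('\"', value_start) - 1   # stop before the closing backslash
  text[value_start..value_end]
end

inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

extract_after_marker(inputdata, '\"Name\"')  #=> "somename"
extract_after_marker(inputdata, '\"Pwd\"')   #=> "123456789"
```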
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says a colon or comma. Now, in parts you will get all the, well, parts, that were detected by cutting the inputdata to pieces at every colon or comma.
If the layout never changes, then your interesting-data will always be the 13th element (index 12):
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there's 2 chars to be cut from both sides:
value = almostvalue[2..-3]
Ok, another way. Since regex already showed up, let's try with them. We know:
data is prefixed with \"Name\", then a colon and a backslash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a backslash-quote
the parts in regex syntax would be, respectively:
\"Name\":\"
[^\"]*
\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
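The same regex can be generalized to any key with a small helper. This is only a sketch: the method name field is hypothetical, and it assumes the values never contain quotes or backslashes. String#[] with a regex and a group number returns that capture group, or nil when nothing matches:

```ruby
# Hypothetical helper: return the value stored under `key`, or nil.
def field(text, key)
  # \\" matches a literal backslash followed by a quote;
  # the capture group takes everything up to the next backslash or quote.
  text[/\\"#{Regexp.escape(key)}\\":\\"([^"\\]*)\\"/, 1]
end

inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

field(inputdata, "Name")     #=> "somename"
field(inputdata, "Id")       #=> "xx:xx:xx:xx:xx:xx"
field(inputdata, "Missing")  #=> nil
```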
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
\"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0\"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colons were replaced with =>, and the backslashes before the quotes are gone. Oh, and the opening and closing quotes are gone, and so is that \0 at the end. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about colons.. We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part.. But, look: all the other colons have always a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data was so similar to a Ruby hash because it was obviously formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library, when asked to read the data, will read all of the data, and only then can you ask it "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then, scanning with a regex or scanning manually for "known markers" may (not must) be much faster and thus preferable. But, still, much less convenient. Have fun.

ruby find and replace portion of string

I have a large file in a ruby variable, it follows a common pattern like so:
// ...
// comment
$myuser['bla'] = 'bla';
// comment
$myuser['bla2'] = 'bla2';
// ...
Given a 'key', I am trying to replace the 'value'.
The following replaces the entire string. How do I fix it? Another method I thought of is to do it in two steps: first find the value within the quotes, then perform a string replace. Which is best?
def keyvalr(content, key, value)
return content.gsub(/\$bla\[\'#{key}\'\]\s+\=\s+\'(.*)\'/) {|m| value }
end
The .* is greedy and consumes as much as possible (everything until the very last '). Make that . a [^'] then it is impossible for it to go past the first closing '.
/(\$bla\[\'#{key}\'\]\s+\=\s+\')[^']*(\')/
I also added parentheses to capture everything except for the value, which is to be replaced. The first set of parens will correspond to \1 and the second to \2, so you replace the match with the following (in single quotes, because inside a double-quoted Ruby string \1 would be read as an octal escape):
'\1yournewvaluehere\2'
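For illustration, a complete version of the capture-and-keep approach could look like this sketch. The helper name replace_value is hypothetical, and the pattern assumes the variable is called $myuser as in the sample data, rather than $bla as in the regex above. It uses the block form of gsub with $1 and $2, so a new value containing backslashes is inserted literally:

```ruby
# Hypothetical helper: replace the quoted value assigned to $myuser['key'].
def replace_value(content, key, value)
  # Group 1 keeps everything up to the opening quote of the value,
  # group 2 keeps the closing quote; only the text between them changes.
  pattern = /(\$myuser\['#{Regexp.escape(key)}'\]\s*=\s*')[^']*(')/
  content.gsub(pattern) { "#{$1}#{value}#{$2}" }
end

php = %q{$myuser['bla'] = 'bla';
$myuser['bla2'] = 'bla2';}

puts replace_value(php, 'bla', 'foo')
# $myuser['bla'] = 'foo';
# $myuser['bla2'] = 'bla2';
```

Because the key is followed by '] in the pattern, asking for bla leaves the bla2 line alone.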
I'd use something like:
text = %q{
// ...
// comment
$myuser['bla'] = 'bla';
// comment
$myuser['bla2'] = 'bla2';
// ...
}
from_to = {
'bla' => 'foo',
'bla2' => 'bar'
}
puts text.gsub(/\['([^']+)'\] = '([^']+)'/) { |t|
key, val = t.scan(/'([^']+)'/).flatten
"['%s'] = '%s'" % [ key, from_to[key] ]
}
Which outputs:
// ...
// comment
$myuser['bla'] = 'foo';
// comment
$myuser['bla2'] = 'bar';
// ...
This is how it works:
If I do:
puts text.gsub(/\['([^']+)'\] = '([^']+)'/) { |t|
puts t
}
I see:
['bla'] = 'bla'
['bla2'] = 'bla2'
Then I tried:
"['bla'] = 'bla'".scan(/'([^']+)'/).flatten
=> ["bla", "bla"]
That gave me a key, "value" pair, so I could use a hash to look-up the replacement value.
Sticking it inside a gsub block meant whatever matched got replaced by my return value for the block, so I created a string to replace the "hit" and let gsub do its "thang".
I'm not a big believer in using long regex. I've had to maintain too much code that tried to use complex patterns, and got something wrong, and failed to accomplish what was intended 100% of the time. They're very powerful, but maintenance of code is a lot harder/worse than developing it, so I try to keep patterns I write in spoon-size pieces, having mercy on those who follow me in maintaining the code.

Resources