Detect Character Set/Script of Arbitrary String - utf-8

I'm working on cleaning up a database of "profiles" of entities (people, organizations, etc.), and one part of each profile is the name of the individual in their native script (e.g. Thai), encoded in UTF-8. In the previous data structure we didn't capture the character set of the name, so now we have more records with invalid values than we can review manually.
What I need to do at this point is, via script, determine what language/script any given name is in. With a sample data set of:
Name: "แผ่นดินต้น"
Script: NULL
Name: "አብርሃም"
Script: NULL
I need to end up with
Name: "แผ่นดินต้น"
Script: Thai
Name: "አብርሃም"
Script: Amharic
I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?

You can use charnames in Perl to figure out the name of a given character.
use strict;
use warnings;
use charnames '';
use feature 'say';
use utf8;
say charnames::viacode(ord 'Բ');
__END__
ARMENIAN CAPITAL LETTER BEN
With that, you can break apart all your strings into characters, and then build a counting hash for each type of character group. Figuring out groups from this is a bit tricky, but it's a start. Once you're done with a string, the group with the highest count should win. That way, punctuation or numbers won't get in the way.
It's probably smarter to find something that already has the names of the Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that can be abused to make the lookup easier.
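The counting approach described above can be sketched in Ruby, whose regular expressions accept Unicode script names such as \p{Thai}. This is only a minimal sketch: the SCRIPTS list and the dominant_script helper are assumptions made for illustration and cover just a handful of scripts, so a real data set would need a longer list.

```ruby
# Hypothetical, deliberately short list of scripts to test for.
SCRIPTS = %w[Latin Greek Cyrillic Armenian Hebrew Arabic Thai Ethiopic Han Hiragana Katakana].freeze

# One pattern per script, e.g. /\p{Thai}/.
SCRIPT_PATTERNS = SCRIPTS.map { |name| [name, Regexp.new("\\p{#{name}}")] }.to_h.freeze

# Count characters per script and return the most frequent script name.
# Characters matching none of the listed scripts (digits, punctuation, ...)
# are ignored. Returns nil when nothing matches.
def dominant_script(name)
  counts = Hash.new(0)
  name.each_char do |char|
    match = SCRIPT_PATTERNS.find { |_script, pattern| pattern.match?(char) }
    counts[match.first] += 1 if match
  end
  winner = counts.max_by { |_script, count| count }
  winner && winner.first
end

puts dominant_script('แผ่นดินต้น')
puts dominant_script('አብርሃም')
```

Note that this reports the script (Ethiopic), not the language (Amharic); several languages can share one script.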

Using the unicodedata2 Python module described here and here, you can examine the Unicode script for each character, like so:
#!/usr/bin/env python2
#coding: utf-8
import unicodedata2
import collections
def scripts(name):
    scripts = [unicodedata2.script(char) for char in name]
    scripts = collections.Counter(scripts)
    scripts = scripts.most_common()
    scripts = ', '.join(script for script,_ in scripts)
    return scripts
assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'


Ruby regexp /(\/.*?(?=\/|$)){2}/

My regexp behaves just like I want it to on http://regexr.com, but not like I want it in irb.
I'm trying to make a regular expression that will match the following:
A forward slash,
then 2 * any number of random characters (i.e. `.*`),
up to but not including another /
OR the end of the string (whichever comes first)
I'm sorry as that was probably unclear, but it's my best attempt at an English translation.
Here's my current attempt and hopefully that will give you a better idea of what I'm trying to do:
/(\/.*?(?=\/|$)){2}/
The usage scenario is I want to be able to take a path like /foo/bar/baz/bin/bash and shorten it to the level I'm at in the filesystem, in this case the second level (/foo/bar). I'm trying to do this using the command path.scan(-regex-).shift.
The usage scenario is I want to be able to take a path like /foo/bar/baz/bin/bash and shorten it to the level I'm at in the filesystem, in this case the second level (/foo/bar)
Ruby already has a class for handling paths, Pathname. You can use Pathname#relative_path_from to do what you want.
require 'pathname'
path = Pathname.new("/foo/bar/baz/bin/bash")
# Normally you'd use Pathname.getwd
cwd = Pathname.new("/foo/bar")
# baz/bin/bash
puts path.relative_path_from(cwd)
Regexes just invite problems, like assuming the path separator is /, not honoring escapes, and not dealing with extra /. For example, "//foo/bar//b\\/az/bin/bash". // is particularly common in code which joins together directories using paths.join("/") or "#{dir}/#{file}".
For completeness, the general way you match a single piece of a path is this.
%r{^(/[^/]+)}
That's the beginning of the string, a /, then 1 or more characters which are not /. Using [^/]+ means you don't have to try and match an optional / or end of string, a very useful technique. Using %r{} means less leaning toothpicks.
But this is only applicable to a canonicalized path. It will fail on //foo//b\\/ar/. You can try to fix up the regex to deal with that, or do your own canonicalization, but just use Pathname.
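To get the shortened path the question asks for (/foo/bar), one possible sketch is below. The shorten helper is a hypothetical name, and the code assumes an absolute path with / as the separator. One thing worth knowing about the original attempt: String#scan returns the captured groups rather than the whole match when the pattern contains a capture group, which may be part of why path.scan(...).shift behaved differently in irb.

```ruby
require 'pathname'

# Keep only the first `depth` components of an absolute path.
# (Hypothetical helper, not part of Pathname itself.)
def shorten(path, depth)
  parts = Pathname.new(path).each_filename.first(depth)
  File.join('/', *parts)
end

path = '/foo/bar/baz/bin/bash'
puts shorten(path, 2)

# Regex variant for an already canonical path: a non-capturing group
# repeated twice, extracted with String#[] instead of scan.
puts path[%r{\A(?:/[^/]+){2}}]
```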

What is the difference between "hello".length and "hello" .length?

I am surprised when I run the following examples in ruby console. They both produce the same output.
"hello".length
and
"hello" .length
How does the ruby console remove the space and provide the right output?
You can put spaces wherever you want, the interpreter looks for the end of the line. For example:
Valid
"hello".
length
Invalid
"hello"
.length
The interpreter sees the dot at the end of the line and knows something has to follow it up. While in the second case it thinks the line is finished. The same goes for the amount of spaces in one line. Does it matter how the interpreter removes the spaces? What matters is that you know the behavior.
If you want you can even
"hello" . length
and it will still work.
I know this is not an answer to your question, but does the "how" matter?
EDIT: I was corrected in the comments below. The multi-line examples given above are both valid when run in a script instead of IRB. I had mixed them up with operators, where the following applies even when running a script:
Valid
result = true || false
Valid
result = true ||
false
Invalid
result = true
|| false
This doesn't have as much to do with the console as it has to do with how the language itself is parsed by the compiler.
Most languages are parsed in such a way that items to be parsed are first grouped into TOKENS. Then the compiler is defined to expect a certain SEQUENCE of tokens in order to interpret each programming statement.
Because the compiler is only looking for a TOKEN SEQUENCE, it doesn't matter if there is space in between or not.
In this case the compiler is looking for:
STRING DOT METHOD_NAME
So it won't matter if you write "hello".length, or even "hello" . length. The same sequence of tokens is present in both, and that is all that matters to the compiler.
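One way to see this for yourself is Ripper, the lexer/parser interface that ships with Ruby's standard library. The sketch below lexes each spelling and throws away the whitespace tokens; the significant_tokens helper is a hypothetical name, and the token types Ripper reports are more fine-grained than the simplified STRING DOT METHOD_NAME picture above.

```ruby
require 'ripper'

# Lex a snippet and drop whitespace tokens, keeping [type, text] pairs.
# Each element returned by Ripper.lex starts with position, type and text.
def significant_tokens(source)
  Ripper.lex(source)
        .reject { |token| token[1] == :on_sp }
        .map { |token| [token[1], token[2]] }
end

compact = significant_tokens('"hello".length')
spaced  = significant_tokens('"hello" . length')

# Both spellings yield the same tokens once whitespace is ignored.
puts compact == spaced
```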
If you are curious how these token sequences are defined in the Ruby source code, you can look at parse.y starting around line 1042:
https://github.com/ruby/ruby/blob/trunk/parse.y#L1042
This file is written in YACC, a language used to define parsers.
Even without knowing anything about YACC, you should already be able to get some clues on how it works by just looking around the file a bit.

How do I filter file names out of a SQLite dump?

I'm trying to filter out all file names from an SQLite text dump using Ruby. I'm not very handy/familiar with regex and need a way to read, and write to a file, another dump of image files that are within the SQLite dump. I can filter out everything except stuff like this:
VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);
and this:
src="/images/9/94/folder%2FGraph.JPG"
I can't figure out the easiest way to filter through this. I've tried using split and other functions, but instead of splitting the string into an array by the character specified, it just removed the character.
You should be able to use .gsub('%2', ' ') to replace the %2 with a space; as long as the pattern is quoted, it should be fine.
Split does remove the character that is being split, though. So you may not want to do that, or if you do, you may want to use the Array#join method with the argument of the character you split with to put it back in.
I want to 'extract' the file name from the statements above. Say I have src="/images/9/94/folder%2FGraph.JPG", I want folder%2FGraph.JPG to be extracted out.
If you want to extract what is inside the src parameter:
foo = 'src="/images/9/94/folder%2FGraph.JPG"'
foo[/^src="(.+)"/, 1]
=> "/images/9/94/folder%2FGraph.JPG"
That returns the string without the surrounding double quotes.
Here's how to do the first one:
bar = "VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);"
bar.split(',')[4][1..-2]
=> "/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG"
Not everything in programming is a regex problem. Some things, actually, in my opinion, most things, are not candidates for a pattern. For instance, the first example could be written:
foo.split('=')[1][1..-2]
and the second:
bar[/'(.+?)'/, 1]
The idea is to use whichever is most clean and clear and understandable.
If all you want is the filename, then use a method designed to return only the filename.
Use one of the above and pass its output to File.basename. File.basename returns only the filename and extension.
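Putting the pieces together, a rough sketch for pulling every image filename out of a dump might look like this. The IMAGE_PATH pattern and the assumption that every path starts with /images/ and ends at a quote or whitespace are mine, based only on the two sample lines in the question.

```ruby
# The two sample lines from the question stand in for the real dump.
dump = <<~'DUMP'
  VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);
  src="/images/9/94/folder%2FGraph.JPG"
DUMP

# Assumed shape of an image path: starts with /images/ and runs until
# the next quote character or whitespace.
IMAGE_PATH = %r{/images/[^'"\s]+}

# No capture groups in the pattern, so scan returns whole matches.
filenames = dump.scan(IMAGE_PATH).map { |path| File.basename(path) }

puts filenames
```

For a real dump, the string could come from File.read and the result could be written out with File.write, one filename per line.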

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。' where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How can I check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page, which displays in the relevant language depending on the location it is being viewed from. I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, indexed by a variable with a name like 'language', which lets the tests change the language on the fly. So validations would look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but fall within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle. (Note that it would fail if no digits appear; if you want zero or more digits, you would need a * in place of the +.)
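Rather than writing each regex by hand, the pattern could be derived from the property-file template itself. The sketch below is one way to do that under stated assumptions: message_pattern and TEMPLATES are hypothetical names, and the placeholder is assumed to always be the literal text {0} standing for a whole number.

```ruby
# Hypothetical templates, as they might be read from the property files.
TEMPLATES = {
  en: 'Waiting time is {0} minutes.',
  ja: '待ち時間は{0}分です。'
}.freeze

# Turn a template into an anchored regex: the literal parts are escaped
# and each {0} placeholder becomes a capture group for one or more digits.
def message_pattern(template)
  body = template.split('{0}', -1)
                 .map { |part| Regexp.escape(part) }
                 .join('(\d+)')
  Regexp.new("\\A#{body}\\z")
end

match = message_pattern(TEMPLATES[:en]).match('Waiting time is 12 minutes.')
minutes = match[1].to_i
puts minutes
```

With Watir, the string passed to match would be the element's text, and the captured number can then be compared against whatever value the test expects.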
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
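As a rough illustration of that intermediate form, the test can compare the [key, value] pair and leave rendering to a lookup table. MESSAGES and render are hypothetical names, and the %d format strings are an assumption standing in for the {0} placeholders of the real property files.

```ruby
# Hypothetical message catalogue keyed by message name, then by locale.
MESSAGES = {
  waiting_time_in_minutes: {
    en: 'Waiting time is %d minutes.',
    ja: '待ち時間は%d分です。'
  }
}.freeze

# Render an intermediate [key, *args] message for the given locale.
def render(message, locale)
  key, *args = message
  format(MESSAGES.fetch(key).fetch(locale), *args)
end

message = [:waiting_time_in_minutes, 10]
puts render(message, :en)
puts render(message, :ja)
```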
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

What's the most reliable way to parse a piece of text out into paragraphs in RealBasic that will work on Windows, Mac, and Linux?

I'm writing a piece of software using RealBASIC 2011r3 and need a reliable, cross-platform way to break a string out into paragraphs. I've been using the following but it only seems to work on Linux:
dim pTemp() as string
pTemp = Split(txtOriginalArticle.Text, EndOfLine + EndOfLine)
When I try this on my Mac it returns it all as a single paragraph. What's the best way to make this work reliably on all three build targets that RB supports?
EndOfLine changes depending upon the platform the code runs on, and the line endings in a string depend upon the platform that created the string. You'll need to check for the type of EndOfLine in the string. I believe it's sMyString.EndOfLineType. Once you know what it is, you can then split on it.
There are further properties for the EndOfLine. It can be EndOfLine.Macintosh/Windows/Unix.
EndOfLine docs: http://docs.realsoftware.com/index.php/EndOfLine
I almost always search for and replace the combinations of line break characters before continuing. I'll usually do a few lines of:
yourString = replaceAll(yourString,chr(10)+chr(13),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13)+chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13),"<someLineBreakHolderString>")
The order here matters (do 10+13 before an individual 10) because you don't want to end up replacing a line break that contains a 10 and a 13 with two of your line break holders.
It's a bit cumbersome and I wouldn't recommend using it to actually modify the original string, but it definitely helps to convert all of the line breaks to the same item before attempting to further parse the string.
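The normalize-then-split idea is not specific to RealBasic. Purely as an illustration of the logic, here it is sketched in Ruby; the paragraphs helper is a hypothetical name, and treating any run of two or more line breaks as a paragraph boundary is an assumption.

```ruby
# Convert every line-break variant to "\n" in a single pass, then split
# on blank lines. The two-character breaks come first in the alternation
# so that a CR+LF pair is never counted as two separate breaks.
def paragraphs(text)
  normalized = text.gsub(/\r\n|\n\r|\r/, "\n")
  normalized.split(/\n{2,}/).reject(&:empty?)
end

sample = "one\r\n\r\ntwo\r\rthree\n\nfour"
puts paragraphs(sample).inspect
```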
