For example:
local a = "Lua"
local u = "Луа"
print(a:len(), u:len())
Output:
3 6
How can I get the number of characters in a UTF-8 string?
If you need to use Unicode/UTF-8 in Lua, you need to use an external library, because Lua itself only works with 8-bit strings. One such library is slnunicode. Example of how to calculate the length of your string:
local unicode = require "unicode"
local utf8 = unicode.utf8
local a = "Lua"
local u = "Луа"
print(utf8.len(a), utf8.len(u)) --> 3 3
In Lua 5.3, you can use utf8.len to get the length of a UTF-8 string:
local a = "Lua"
local u = "Луа"
print(utf8.len(a), utf8.len(u))
Output: 3 3
You don't.
Lua is not Unicode aware. All it sees is a string of bytes. When you ask for the length, it gives you the length of that byte string. If you want to use Lua to interact in some way with Unicode strings, you have to either write a Lua module that implements those interactions or download such a module.
Another alternative is to wrap the native OS UTF-8 string functions and use them to do the heavy lifting. This depends on which OS you use; I've done this on OS X and it works a treat. Windows would be similar. Of course, it opens another can of worms if you just want to run a script from the command line; it depends on your app.
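The byte-versus-character distinction described above is easy to see in Python, sketched here purely for illustration: `str` counts code points, while the encoded `bytes` length matches what Lua's `:len()` reports for the same text.

```python
s = "Луа"
print(len(s))                  # 3 code points
print(len(s.encode("utf-8")))  # 6 bytes, which is what Lua's :len() sees
```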
I want to create a Lua script that will convert text to a UTF-8-encoded string.
The problem is that I am using Lua 5.2 and cannot use LuaJIT, which has libraries for doing this.
So, I need a function that will do this task for me.
For example I will pass "hey this is sam this side" it should give me the utf8 encoded string like "\x68\x65\x79\x20\x74\x68\x69\x73\x20\x69\x73\x20\x73\x61\x6d\x20\x74\x68\x69\x73\x20\x73\x69\x64\x65"
The requirement is that I must use plain Lua only.
You can do it like this:
local str = "hey this is sam this side"
local answer = string.gsub(str,"(.)",function (x) return string.format("\\x%02X",string.byte(x)) end)
print(answer)
The output is:
"\x68\x65\x79\x20\x74\x68\x69\x73\x20\x69\x73\x20\x73\x61\x6D\x20\x74\x68\x69\x73\x20\x73\x69\x64\x65"
AFAIK, strings in OCaml are just plain sequences of bytes. They have no notion of encoding.
This is fine for most purposes. However, some pieces of standard library make assumptions about the string being encoded in a single-byte charset, for example the aligning features of printf:
# printf "[%4s]\n[%4s]\n" "a" "à";;
[ a]
[ à]
- : unit = ()
Is there an upgraded printf somewhere that deals with this correctly, for example by looking at LANG and LC_* to guess the encoding being used on the terminal? (I'm using Core)
If you need to print UTF-8 data, you can use Uuseg's pretty printers.
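To see why the padding comes out wrong, here is a small illustration in Python (used here only for comparison; the assumption is that OCaml's %4s pads by bytes in the same way as the bytes case below). A two-byte character such as "à" satisfies a four-byte width with only two spaces of padding instead of three.

```python
s = "\u00e0"  # "à" as a single precomposed code point
print(len(s))                  # 1 code point
print(len(s.encode("utf-8")))  # 2 bytes
# Byte-oriented padding pads to 4 *bytes*, so the two-byte "à"
# gets only two spaces of padding instead of three:
print(b"[%4s]" % s.encode("utf-8"))
```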
I am writing a terminal program for the Raspberry Pi using ncurses. I want to add a shadow around a box. I want to use mvaddch() to print extended characters such as char 233 (upper half box character). What would be the syntax for the mvaddch() command? Or is there another way to accomplish this?
You're probably referring to something like code page 866. ncurses will assume your terminal shows characters consistent with the locale encoding, which probably is UTF-8. So (unless you want to convert the characters in your program) the way to go is using Unicode values.
The Unicode Consortium publishes mapping tables you can use to look up a particular code, e.g., ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT. For your example, the relevant row is
0xdf 0x2580 #UPPER HALF BLOCK
(because 0xdf is 223). You would use the Unicode value 0x2580 in a call to the function mvaddwstr, e.g.,
wchar_t mydata[] = { 0x2580, 0 };
mvaddwstr(0,0, mydata);
(the similarly-named wadd_wch uses a data structure which is more complicated).
You would have to link with the ncursesw library, and of course initialize your program's locale using setlocale as mentioned in the ncurses manual page.
I'm working on cleaning up a database of "profiles" of entities (people, organizations, etc.), and one part of each profile is the name of the individual in their native script (e.g. Thai), encoded in UTF-8. In the previous data structure we didn't capture the character set of the name, so we now have more records with invalid values than we can manually review.
What I need to do at this point is, via script, determine what language/script any given name is in. With a sample data set of:
Name: "แผ่นดินต้น"
Script: NULL
Name: "አብርሃም"
Script: NULL
I need to end up with
Name: "แผ่นดินต้น"
Script: Thai
Name: "አብርሃም"
Script: Amharic
I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?
You can use charnames in Perl to figure out the name of a given character.
use strict;
use warnings;
use charnames ();
use feature 'say';
use utf8;
say charnames::viacode(ord 'Բ');
__END__
ARMENIAN CAPITAL LETTER BEN
With that, you can break apart all your strings into characters, and then build a counting hash for each type of character group. Figuring out groups from this is a bit tricky, but it's a start. Once you're done with a string, the group with the highest count should win. That way, you'll not have punctuation or numbers get in the way.
Probably it's smarter to find something that already has the names of Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that could be used to make the lookup easier.
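The counting-hash idea above can be sketched with nothing but the standard library. This is a heuristic, not a proper lookup of Unicode script ranges: it assumes the first word of a character's Unicode name is its script ("THAI CHARACTER SARA AE" becomes "THAI"), which holds for letters but produces groups like "DIGIT" for numerals.

```python
import unicodedata
from collections import Counter

def dominant_group(s):
    # Heuristic: take the first word of each character's Unicode name and
    # return the most common one. Spaces are skipped; unnamed characters
    # would raise ValueError and may need their own handling.
    groups = Counter(unicodedata.name(ch).split()[0]
                     for ch in s if not ch.isspace())
    return groups.most_common(1)[0][0]

print(dominant_group("แผ่นดินต้น"))  # THAI
print(dominant_group("አብርሃም"))      # ETHIOPIC
```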
Using the unicodedata2 Python module described here and here, you can examine the Unicode script for each character, like so:
#!/usr/bin/env python2
#coding: utf-8
import unicodedata2
import collections
def scripts(name):
    scripts = [unicodedata2.script(char) for char in name]
    scripts = collections.Counter(scripts)
    scripts = scripts.most_common()
    scripts = ', '.join(script for script, _ in scripts)
    return scripts
assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'
I'm writing a piece of software using RealBASIC 2011r3 and need a reliable, cross-platform way to break a string out into paragraphs. I've been using the following but it only seems to work on Linux:
dim pTemp() as string
pTemp = Split(txtOriginalArticle.Text, EndOfLine + EndOfLine)
When I try this on my Mac it returns it all as a single paragraph. What's the best way to make this work reliably on all three build targets that RB supports?
EndOfLine changes depending upon the platform and upon the platform that created the string. You'll need to check for the type of EndOfLine in the string; I believe it's sMyString.EndOfLineType. Once you know what it is, you can then split on it.
There are further properties for the EndOfLine. It can be EndOfLine.Macintosh/Windows/Unix.
EndOfLine docs: http://docs.realsoftware.com/index.php/EndOfLine
I almost always search for and replace the combinations of line-break characters before continuing. I'll usually do a few lines like:
yourString = replaceAll(yourString,chr(10)+chr(13),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13)+chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13),"<someLineBreakHolderString>")
The order here matters (do 10+13 before an individual 10) because you don't want to end up replacing a line break that contains a 10 and a 13 with two of your line break holders.
It's a bit cumbersome and I wouldn't recommend using it to actually modify the original string, but it definitely helps to convert all of the line breaks to the same item before attempting to further parse the string.
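The same replace-in-order idea can be sketched in Python for comparison; regex alternation applies the same "pairs before singles" ordering, so a CRLF or LFCR pair never turns into two breaks.

```python
import re

def normalize_breaks(text, nl="\n"):
    # Alternation order matters: try the two-character pairs (CRLF, LFCR)
    # before a lone CR or LF, mirroring the replaceAll ordering above.
    return re.sub(r"\r\n|\n\r|\r|\n", nl, text)

print(normalize_breaks("a\r\nb\rc\nd"))  # a, b, c, d on four lines
```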