Swift 2.0 String behavior - swift2

Strings in 2.0 no longer conform to CollectionType. Each character in a String is now an extended grapheme cluster.
Without digging too deep into the grapheme-cluster details, I tried a few things with Swift Strings:
String now has a characters property that contains what we humans recognize as characters. Each distinct character in the string counts as one Character, and the count property gives us the number of distinct characters.
What I don't quite understand is: even though characters.count reports 10, why do the indexes show the emojis occupying 2 positions?

In Swift 2.0, the index of a String is no longer related to the number of characters (count). It is an "opaque" struct (defined as CharacterView.Index) used only to iterate through the characters of a string. So even if it prints like an integer, it should not be considered or used as an integer; you cannot, for instance, add 2 to it to get the index two characters ahead of the current one. What you can do is apply the two methods predecessor and successor to get the previous or the next index in the String. So, for instance, to get the character two positions after the one at index idx in mixedString you can do:
mixedString[idx.successor().successor()]
Of course you can use more comfortable ways of reading the characters of a string, such as the for-in statement or the indices property of the characters view.
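For instance, a quick sketch of those index operations in a Swift 2 playground (mixedString here is just a stand-in, not the exact string from the question):

let mixedString = "Swift 🇺🇸 2.0"
let idx = mixedString.startIndex
mixedString[idx.successor().successor()]    // "i", the character two positions ahead
mixedString[idx.advancedBy(6)]              // "🇺🇸"; advancedBy(_:) replaces integer arithmetic on indexes
for i in mixedString.characters.indices {   // iterate over the opaque indexes
    print(mixedString[i])                   // prints each Character, one per line
}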
Consider that the main benefit of this approach is not the handling of multi-byte characters in Unicode strings, such as emoji, but rather the uniform treatment of strings that are identical (to us humans!) yet can have multiple representations in Unicode, as different sets of "scalars", or Unicode characters. An example is café, which can be represented either with four Unicode "scalars" (Unicode characters) or with five. And note that this is a completely different thing from Unicode encodings like UTF-8, UTF-16, etc., which are ways of mapping Unicode scalars into memory bytes.
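Here is how the café example plays out in a Swift 2 playground (a sketch; the hex escapes below spell out the two representations mentioned above):

let composed = "caf\u{E9}"       // "café" with é as a single scalar, U+00E9
let decomposed = "cafe\u{301}"   // "café" as e followed by a combining acute accent, U+0301
composed == decomposed           // true: Strings compare by canonical equivalence
composed.characters.count        // 4
decomposed.characters.count      // 4 as well, even though...
composed.unicodeScalars.count    // ...this one has 4 scalars
decomposed.unicodeScalars.count  // ...and this one has 5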

An extended grapheme cluster can still occupy multiple bytes; however, the correct way to determine the index position of a character would be:
import Foundation

let mixed = "MADE IN THE USA 🇺🇸"
let index = mixed.rangeOfString("🇺🇸")
let intIndex: Int = mixed.startIndex.distanceTo(index!.startIndex)
Result:
16
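Going the other way, advancedBy(_:) turns that integer offset back into a String.Index (a one-line sketch reusing the mixed constant from the snippet above):

mixed[mixed.startIndex.advancedBy(intIndex)]   // "🇺🇸"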
The way you are trying to get the index would normally apply to an array, and I don't think Swift can work it out that way on your mixedString.

Related

Is there some kind of sorting convention?

Is there an established convention for sorting lines (characters)? Some convention that plays a role similar to the one PCRE plays for regular expressions.
For example, if you sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and with Vim (:%sort), the result is the same (see below). However, I'm not sure it will be the same in other editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted by their numeric code value. While this originally applied only to ASCII characters, the same principle has been adopted by Unicode encodings as well. http://www.asciitable.com/
If no preference to the contrary is given, this is the de facto standard for sorting characters. Apart from the actual alphabetical characters, the ordering is somewhat arbitrary.
There are two main ways of sorting character strings:
Lexicographic: by the numeric value of either the codepoint values, the code unit values, or the serialized code unit values (bytes). For some character encodings, these would all be the same. The algorithm is very simple, but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, that database is called the CLDR. Also, when sorting Unicode, the sort can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing marks (applied as extensions of the previous glyph). (A short sketch of both approaches follows below.)
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that need to be applied where users expect them. Note: the ISO 8601 date format (based on the Gregorian calendar) sorts correctly regardless of method, in essentially any character encoding.
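To make the two approaches concrete, here is a small sketch in Swift (chosen only because it is the language used elsewhere on this page; the locale-aware comparison relies on Foundation):

import Foundation

let lines = ["0", "A", "1", "b", "-", "a", "2", "_", "B"]

// Lexicographic: order by the numeric value of the leading code point.
let byCodepoint = lines.sort { $0.unicodeScalars.first!.value < $1.unicodeScalars.first!.value }
// ["-", "0", "1", "2", "A", "B", "_", "a", "b"], matching the Sublime/Vim result above

// Culture/locale-specific: order by the current locale's (CLDR-backed) collation rules,
// which typically group upper- and lowercase letters together.
let byLocale = lines.sort { $0.localizedCompare($1) == .OrderedAscending }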

Convert a long character field to numeric, NOT scientific notation (SAS)

I need to join two tables. One table has householdid as CHAR30, which appears to be center-aligned, and the other has householdid as numeric 20. I need to convert to the numeric 20, but when I do that the value appears truncated, perhaps because of the strange alignment (not all of the 30 positions are actually needed).
When I try to keep the full 30 positions as a numeric I instead get a conversion to scientific notation, so of course this will not work as a key ID for later operations.
As long as the number is converted properly, it doesn't matter what format it has. A format just tells SAS how to show you the number. Behind the scenes, it is just a DOUBLE.
1.0 = 1 = 1e0
Now if you have converted to a number and cannot get a join, then look at the informat you used to read it in.
try
num_id = input(strip(char_id),best32.);
STRIP removes leading and trailing blanks. The BEST32. informat tries its "best" to read a number of up to 32 characters in length.
You cannot store a 20 digit number as a numeric in SAS. SAS stores all numbers as 8 byte floating point and so does not have enough bits to represent that many digits uniquely. You can ask SAS what is the largest integer it can represent exactly by using the CONSTANT() function.
data _null_;
  x=constant('EXACTINT',8);
  put x = comma32. ;
run;

x=9,007,199,254,740,992
Read and store your 20 and 30 digit strings as character variables.
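The limit is not peculiar to SAS; it is inherent to 8-byte IEEE floating point. A quick illustration in Swift (the 20-digit ID below is made up):

import Foundation

// Integers above 2^53 = 9,007,199,254,740,992 cannot all be represented
// exactly in a 64-bit double, which is how SAS stores every numeric value.
let id: Double = 12345678901234567890      // a fictitious 20-digit household ID
print(String(format: "%.0f", id))          // prints 12345678901234567168: the low digits are gone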
Use the BESTD32. format; it tends to work out pretty well for long key variables. Depending on the length of the variable, you can change 32 to whatever length you need.
Based on the comments under the original question, the only thing you can do is convert all ID fields to strings and use the strings to do the joins. @Reeza suggested this in one of the comments, but it should have been posted as an answer.
I assume you are pulling this information out of another database/system that allows for greater numeric precision than SAS does. If you don't convert the values to strings when they are read into SAS, you run the risk of losing precision.
If you lose precision, the ID in SAS is likely to become very slightly different to the ID in the original system, which can cause problems when searching the original system for an ID obtained from SAS.
Be sure you don't read the numbers into SAS as numeric and then convert to string: done that way, precision is already lost the moment the numbers are stored in SAS as numeric variables.

Glyph to unicode string translation

Given a glyph index for a specific font, I need to get the Unicode translation of the glyph. In order to build a glyph-to-Unicode translation table I'm using GetGlyphIndices for the whole Unicode range, and from the result I build the reverse translation (a glyph-to-Unicode-character map). However, this gives me a translation from a single glyph to a single Unicode character, and I can see that in Hindi, for example, two Unicode characters can be represented by one glyph.
For example, in the word namaste (नमस्ते) there are 6 Unicode characters which are represented by 5 glyphs (the middle two Unicode characters are represented by one glyph). I can see this by attaching to notepad.exe, setting a breakpoint in ExtTextOut and printing this word from Notepad.
Is there any way I can translate a glyph to a Unicode string (in case the glyph represents more than one Unicode character)?
1) For all but very simple cases, you should use Uniscribe functions (not GetGlyphIndices) for converting a string (sequence of Unicodes) into glyphs. This is noted in the documentation for GetGlyphIndices: http://msdn.microsoft.com/en-us/library/windows/desktop/dd144890(v=vs.85).aspx
2) There is no way to reliably do what you want to do for all cases. Even for most cases. This is the result of something known as complex script shaping, which translates a sequence of input Unicodes into a sequence of output glyphs. This is done using a number of tables in the font data. The two of most interest are the cmap and the GSUB.
The cmap maps Unicode values to font-specific glyphs. The cmap may specify multiple Unicodes mapping to a single glyph (multi-mapping). This is a commonly-used scheme in many fonts. Also, many glyphs in the font may not even be mapped in the cmap. Thus with this alone, you cannot reliably reverse-map a glyph to a single Unicode.
But it gets even more difficult: the GSUB may specify numerous rules and may convert one input glyph to many output glyphs, or a series of input glyphs into one output glyph. It can even specify contexts under which the conversion will occur (for example, it could say something like "convert 'A' to 'B' but only when the 'A' is preceded by a 'C'", so CA -> CB but DA -> DA). In some cases, specifically with Hindi and other Indic languages, the output glyph sequence may even be in a different order than the logical Unicode input sequence. The net result is that the output sequence of glyphs may map back to a single Unicode, or multiple Unicodes, or none at all. It may be possible to decode the rules of the GSUB + the logic of the script-shaping engine to narrow things down a bit (an adventure not suitable for the weak of spirit!), but the problem is still that multiple input Unicodes could end up resolving to the same output glyph.
Bottom line: it's best to view the process of converting a string -> font-specific glyphs as a one-way trip.
For a better understanding of these concepts, I strongly recommend that you read up on complex script shaping as implemented in Windows: http://www.microsoft.com/typography/otspec/TTOCHAP1.htm . As for coding in an application, the Uniscribe reference is also very informative: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374091(v=vs.85).aspx

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
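For comparison, here is a sketch of the word-boundary approach in Swift (the language used elsewhere on this page), using Foundation's word enumeration rather than the Python packages above; how the Chinese portion gets segmented depends on the system tokenizer, so treat the reported count as indicative only:

import Foundation

let text = "this is just an example\n这只是个例子"
var words = 0
text.enumerateSubstringsInRange(text.startIndex..<text.endIndex, options: .ByWords) {
    (substring, substringRange, enclosingRange, stop) in
    words += 1
}
print(words)   // number of word-boundary segments across both lines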
I thought about a quick hack, since Chinese characters are 3 bytes long in UTF-8:
(pseudocode)
for each byte in the line:
    if the byte's high bit is 1:
        add 1 to total chinese bytes
    else if it is a space:
        add 1 to total "normal" words
    else if it is a newline:
        break
Then take (total chinese bytes / 3) + total words to get the count for each line. This will give an erroneous count for mixed-language lines, but it should be a good start.
这是test
The line above, however, will give a total of 2 (1 for each of the Chinese characters); the trailing "test" is missed because it is not followed by a space. A space between the two languages would be needed to give the correct count.
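If you want to try the same byte-counting hack from Swift instead of bash or Python, the UTF-8 view makes the byte test easy. A sketch (roughWordCount is a made-up helper, and it shares all of the hack's limitations):

func roughWordCount(line: String) -> Int {
    var chineseBytes = 0
    var spaceWords = 0
    for byte in line.utf8 {
        if byte & 0x80 != 0 {        // high bit set: part of a multi-byte (e.g. Chinese) character
            chineseBytes += 1
        } else if byte == 0x20 {     // ASCII space
            spaceWords += 1
        }
    }
    return chineseBytes / 3 + spaceWords
}

roughWordCount("this is just an example")   // 4, because the last word has no trailing space
roughWordCount("这只是个例子")                 // 6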

The difference and use of strings and string arrays?

Okay, so as far as I know a string is basically an array of characters. So why would there be string arrays in VB, and what are the differences between them?
Just the basics, please; the way they operate is what I'm interested in.
At times it is very useful to think of a String as an array of characters. It can also be useful to think of it as an array of bytes at times too - and this is of course not the same thing at all.
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for better understanding of the differences between bytes and the characters held by Strings (UTF-16LE) as well as other character encodings commonly used.
But all of that aside, a String is really a higher level abstraction that you should not think of as an array of any kind.
After all, by that sort of logic an Integer or Long is an array as well.
So, considering that a String is meant to be viewed as a primitive scalar value type, the purpose of String arrays should be pretty clear. Arrays of Strings have pretty much the same sorts of uses as arrays of any other data type.
The fact that there are operations you can perform on Strings that root around inside them (substring operations) isn't conceptually much different from the operations that work on the data inside any other simple type.
Say you need to store a list of names. It might be 100 names or 200 names, depending on the case. What would you do?
A String array can handle such a case.
Try this:
Dim Names() As String          ' declare a dynamic array of Strings
ReDim Names(3) As String       ' size it to hold indexes 0 through 3
Names(0) = "First"
Names(1) = "Second"
Names(2) = "Third"
Names(3) = "Fourth"

Dim l As Long
For l = LBound(Names) To UBound(Names)
    MsgBox Names(l)            ' show each name in turn
Next
