How to create a cfcollection / verity collection with a UTF-8 character in the name?

I'd just like to be able to use a UTF-8 character in the name of the collection. We base our code logic on the names of the collections, which are related to a given company. This new company has an abbreviation of XØZ3, and both the CF Administrator and cfcollection seem to have issues with using the ø in the collection name.
The errors presented are:
Unable to create collection peoplexscvdocsXØZ3.
Unable to create collection peoplexscvdocsxøz3.
An error occurred while creating the collection: com.verity.api.administration.ConfigurationException: Fail to create the index. (-6220)

If Verity doesn't accept UTF-8 and there isn't a workaround, I guess you'll have to:
have two fields, one with an ASCII-based version of the character and one with the HTML/XML version of the character;
pass the ASCII version of the characters through when searching the collection to match.
So you'd have:
plaintext: XOZ3
XMLText: X&#216;Z3
And a function that takes Ø and changes it to O when searching Verity on the plaintext field, and returns the matching XMLText field.
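A minimal sketch of that two-field idea, written in Python for illustration rather than CFML. The names `FALLBACK`, `to_plaintext` and `to_xmltext` are hypothetical, and the hand-written table is an assumption: Ø has no Unicode decomposition, so it has to be mapped explicitly.

```python
import unicodedata

# Hypothetical hand-written table for letters that Unicode normalization
# does not decompose (Ø/ø have no base letter + accent form).
FALLBACK = {"Ø": "O", "ø": "o"}


def to_plaintext(name):
    """Return an ASCII-oriented version of name for the collection name."""
    out = []
    for ch in name:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        # Split accented letters into base letter + combining marks,
        # then drop the marks (é becomes e). Other characters pass through.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def to_xmltext(name):
    """Return name with every non-ASCII character as a numeric reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in name)


print(to_plaintext("XØZ3"))  # XOZ3
print(to_xmltext("XØZ3"))    # X&#216;Z3
```

The plaintext form would be used for the collection name and for searching, and the XMLText form stored alongside it for display.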


Firestore will not save words with accents?

I'm trying to move data to Firestore from a MySQL table encoded as utf-8 (specifically, utf8mb4_unicode_520_ci). I'm using Golang's Firestore libraries along with sqlx. Most or every word that has accent characters fails, e.g., müller, évident, etc. The error returned is as follows:
rpc error: code = Internal desc = grpc: error while marshaling: proto:
field "google.firestore.v1.Value.ValueType" contains invalid UTF-8
I can enter the accent characters into Firestore manually using the browser-based interface, so I'm guessing the issue lies with the Golang library. Is there any workaround that would preserve the accent characters?
The solution to my issue was unrelated to Firestore and libraries I was using, but instead was a problem in a word-tokenization function I had written. The tokenization was mangling accented characters into bad UTF-8, so converting them to runes before tokenization solved the issue.
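The original tokenizer is not shown, so the following is only a hypothetical illustration, in Python rather than Go, of how splitting by bytes can produce invalid UTF-8 while splitting by characters (the analogue of converting to runes) does not. The function names are made up for this sketch.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def chunk_bytes(data, size):
    """Naive chunking by byte count; can cut a multi-byte character in half."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_chars(text, size):
    """Chunking by character count; never splits a character."""
    return [text[i:i + size] for i in range(0, len(text), size)]


word = "müller"  # ü takes two bytes in UTF-8
byte_chunks = chunk_bytes(word.encode("utf-8"), 2)
char_chunks = [c.encode("utf-8") for c in chunk_chars(word, 2)]

print([is_valid_utf8(c) for c in byte_chunks])  # first two chunks are invalid
print([is_valid_utf8(c) for c in char_chunks])  # all valid
```

A validity check like `is_valid_utf8` just before the write would also have pointed at the tokenizer rather than the client library.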

How to Select Only Alphanumeric characters from a string in Datastage?

I have a problem with my data: a column contains characters other than alphanumerics. For example, the Name column has values like Ravicᅩhandr¬an (¬ᅩ○`), and there are many such characters. I need a result like Ravichandran. How can I achieve this? Is there any way to remove them in a Transformer stage?
I tried the Convert function in the Transformer stage, but the problem with Convert is that I don't know in advance which unknown characters will appear; the ones shown above are just an example.
My requirement is that everything other than alphanumeric characters must be removed, and the rest of the string should stay the same.
How can I get this done?
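The following nested Convert call can be used in the Transformer stage to remove any kind of unknown/special characters from the column. The inner Convert deletes every allowed character from the value, leaving only the unwanted ones; the outer Convert then deletes exactly those unwanted characters from the original value. Note that the allowed list ends with a space, so spaces are kept.
Convert(Convert('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ','', Column_Name1),'',Column_Name1)
Ex: Convert(Convert('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ','', to_txm.SourceCode),'',to_txm.SourceCode)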
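To see why the nesting works, here is a rough Python model of it. The `convert` function below is only an approximation of DataStage's Convert (characters in the first list are replaced by the character at the same position in the second list, or deleted when there is none); `keep_allowed` is a hypothetical name.

```python
import string

# Same allowed set as in the expression above, including the trailing space.
ALLOWED = string.ascii_uppercase + string.ascii_lowercase + string.digits + " "


def convert(from_chars, to_chars, text):
    """Approximate Convert: map from_chars to to_chars, delete unmatched ones."""
    table = {}
    for i, ch in enumerate(from_chars):
        replacement = to_chars[i] if i < len(to_chars) else None  # None deletes
        table.setdefault(ord(ch), replacement)
    return text.translate(table)


def keep_allowed(text):
    # Inner call: strip all allowed characters, leaving only the unwanted ones.
    unwanted = convert(ALLOWED, "", text)
    # Outer call: strip those unwanted characters from the original text.
    return convert(unwanted, "", text)


print(keep_allowed("Ravic¬handr○an"))  # Ravichandran
```

The point of the trick is that the set of unwanted characters never has to be listed by hand; it is computed from each value.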

What constitutes a valid URI query parameter key?

I'm looking over Section 3.4 of RFC 3986 trying to understand what constitutes a valid URI query parameter key, but I'm not seeing a clear answer.
The reason I'm asking is because I'm writing a Ruby class that composes a URI with query parameters. When a new parameter is added I want to validate the key. Based on experience, it seems like the key will be invalid if it requires any escaping.
I should also say that I plan to validate the key. I'm not sure how to go about validating this data either, but I do know that in all cases I should escape this value.
Advice is appreciated. Advice in the context of how validation might already be possible through say a Ruby Gem would also be a plus.
I could well be wrong, but that spec seems to say that anything following '?' or '#' is valid as long. I wonder if you should be looking more at the spec for 'application/x-www-form-urlencoded' (ie. the key/value pairs we're all used to)?
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
I don't believe key=value is part of the RFC; it's a convention that has emerged. Wikipedia suggests this is a 'W3C recommendation'.
Seems like some good stuff to be found searching on the application/x-www-form-urlencoded content type.
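As a rough illustration of that encoding, here is a sketch using Python's standard `urllib.parse` as a stand-in for whatever Ruby library ends up being used. The helper `needs_escaping` and the sample parameters are hypothetical. Rather than rejecting keys that need escaping, it escapes keys and values alike and lets the parser reverse it.

```python
from urllib.parse import parse_qsl, quote_plus, urlencode


def needs_escaping(key):
    """Return True if the key changes when form-urlencoded."""
    return quote_plus(key) != key


# Hypothetical parameters; keys and values are both escaped by urlencode.
params = [("full name", "Jörg Müller"), ("a&b", "1=2")]
query = urlencode(params)

print(query)                      # full+name=J%C3%B6rg+M%C3%BCller&a%26b=1%3D2
print(parse_qsl(query))           # round-trips to the original pairs
print(needs_escaping("user_id"))  # False
print(needs_escaping("a&b"))      # True
```

With this approach a key containing `&` or `=` is still usable, because it is escaped before being joined into the query string.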

Julia: Strange characters in my string

I scraped some text from the internet, which I put in a UTF8String. I can use this string normally, but when I select some specific characters (strange characters with accents, like in my case ú), which are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain strange characters.
Any way to solve this?
I have a variable word of type SubString{UTF8String}
When I do method(word), no problems occur. When I do method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).
Julia indexes strings by byte position instead of character position. That is much more efficient for a variable-length encoding like UTF-8, but it means some operations need a bit more boilerplate.
The problem is that some code points are encoded as multiple bytes, so when you slice the string from 2:end you can end up starting in the middle of the first character (which is invalid, and you get an error).
The solution is to use the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
PS. Sorry for a less than clear answer on my phone.
I tried this, and it seems like SubString{UTF8String} and UTF8String have different behaviour on slicing. I've reported it as bug #7811 on GitHub.
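To make the byte-versus-character distinction concrete, here is a small Python model of it (not Julia code). `next_index` is a hypothetical helper that plays the role of `nextind` on raw UTF-8 bytes, using 0-based offsets where Julia uses 1-based indices.

```python
def next_index(data, i):
    """Return the byte offset where the character after offset i starts.

    UTF-8 continuation bytes have the bit pattern 10xxxxxx, so they are
    skipped until the next character start (or the end of the data).
    """
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


data = "úa".encode("utf-8")  # ú is two bytes, a is one

# Slicing at offset 1 starts in the middle of ú and cannot be decoded.
try:
    data[1:].decode("utf-8")
except UnicodeDecodeError:
    print("offset 1 is not a character boundary")

# Slicing at the next valid offset works.
print(data[next_index(data, 0):].decode("utf-8"))  # a
```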

Parsing free format text in Cocoa

My Cocoa app needs to parse free format text entered via NSTextView. The result of the process should be a collection of keyword strings which can then be displayed for review to the user and optionally persisted using Core Data.
I looked at NSScanner but from the samples in Apple's documentation it looks like it's not capable of presenting a list of keyword strings from a given string. Its focus seems to be more on finding a particular occurrence of a given string within another string.
Are there alternatives?
EDIT: To make this clearer: all words in the entered text are potential keywords, so basically all words delimited by spaces should be considered. Let's assume that the user can specify a minimum required length for a string to be considered a keyword, to eliminate irrelevant words like "to", "of", "in" etc. Once the parsing is done, a list of parsed keywords should be presented (possibly using a table view). The user can then select or reject each keyword. Rejected keywords will be stored so the parsing can be made smarter as more texts are scanned.
You can absolutely use NSScanner to do this. All NSScanner does is go through a string character by character. It is up to you to decide what the keyword boundaries are and to interpret them using the scanner.
I suggest reading more about NSScanner in Apple's String Programming Guide.
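The parsing logic described in the question is independent of NSScanner, so here is a language-neutral sketch of it in Python rather than Objective-C. The function name and its parameters are hypothetical, and stripping punctuation and lowercasing are assumptions not stated in the question.

```python
import string


def extract_keywords(text, min_length=3, rejected=frozenset()):
    """Return unique candidate keywords from text, in order of appearance."""
    seen = set()
    keywords = []
    for word in text.split():  # words are delimited by whitespace
        # Assumption: surrounding punctuation and case are not significant.
        word = word.strip(string.punctuation).lower()
        if len(word) < min_length:
            continue  # too short to be a keyword ("to", "of", "in", ...)
        if word in rejected or word in seen:
            continue  # previously rejected by the user, or a duplicate
        seen.add(word)
        keywords.append(word)
    return keywords


print(extract_keywords("Go to the store, buy milk in bulk.",
                       min_length=4, rejected={"bulk"}))
# ['store', 'milk']
```

In the Cocoa app the same loop could be driven by a scanner that reads up to each whitespace boundary, with the rejected set loaded from the persistent store.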
