Microsoft LUIS mistakes different letters for the same one - azure-language-understanding

I'm using several large List Entities in Microsoft LUIS. To import these lists in different LUIS apps, and also for a more convenient workflow, I want to import these lists as JSON files. In general, this works but for some lists I get the following error message:
Unfortunately, this error message doesn't give any clue as to which entries are problematic. So I split the lists into smaller ones and eventually found the problem. I had these two items in the list:
"Athena" and "Aþena". According to Wikipedia, þ is an Icelandic letter that is pronounced similarly to th and was therefore often replaced by it in the past. However, entering those two words manually also leads to a 409 error.
Of course, I can just remove one of the entries, and also handle the þ letter in a script, but I would like to know if there is a list of letters that are treated in the same way, which may lead to the same issue. I didn't find anything in the documentation. If you deal with many large entity lists that may grow in the future, it is not very practical to track down those cases individually.
Also I found that the letter æ is interpreted as ae.
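One way to catch such collisions before importing is to fold every entry with an approximation of the service's normalization and report the entries that end up with the same folded form. The sketch below is only a guess at what LUIS does: the FOLDS table holds just the two cases observed above (þ as th, æ as ae) plus generic accent stripping, and fold and find_collisions are hypothetical helpers, not part of any LUIS SDK.

```python
import unicodedata
from collections import defaultdict

# Hand-maintained folds. Only the two cases observed in the question are
# listed; this is an assumption, not documented LUIS behavior.
FOLDS = {"þ": "th", "Þ": "th", "æ": "ae", "Æ": "ae"}

def fold(text):
    """Build an approximate case- and accent-insensitive comparison key."""
    replaced = "".join(FOLDS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop the combining marks (accents) left over after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def find_collisions(entries):
    """Return groups of distinct entries that share the same folded key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[fold(entry)].append(entry)
    return [group for group in groups.values() if len(set(group)) > 1]

print(find_collisions(["Athena", "Aþena", "Zeus", "Cæsar", "Caesar"]))
```

Running this over each list before the import at least points at the offending pairs, and any newly discovered equivalence only needs one more entry in the table.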


What category / filter structure is this using? Nested Interval?

You can see on their category links that it's quite obvious that the only portion of their URL that matters is the small hash near the end of the URL itself.
For instance, Water Heaters category found under Heating/Cooling is:|1
and Water Heaters category found under Plumbing is:|1
That being said, obviously their structure could be a number of different things...
But the only thing I can think of is that it's a hex string that gets decoded into a numerator and denominator, but I can't figure it out...
Apparently it's important to them to obfuscate this for some reason?
Any ideas?
At first I was thinking it was some sort of base-16 / hex conversion of a standard numerator / denominator, or the ID of a node and its adjacency?
Does anyone have enough experience with this to assist?
They are building on top of IBM WebSphere Commerce. Nothing fancy going on here, though. The alpha-numeric identifiers N-xxxxxxx are simple node identifiers that do not capture hierarchical structure in themselves; the structure (parent nodes and direct child nodes) is coded inside the node data itself, and there are tell-tale signs to that effect (see below.) They have no need for nested intervals (sets), their user interface does not expose more than one level at a time during normal navigation.
Take Lowe's.
If you look inside the cookies (WC_xxx) as well as see where they serve some of their contents from (.../wcsstore/B2BDirectStorefrontAssetStore/...) you know they're running on WebSphere Commerce. On their listing pages, everything leading up to /_/ is there for SEO purposes. The alpha-numeric identifier is fixed-length, base-36 (although as filters are applied additional Zxxxx groups are tacked on -- but everything that follows a capital Z simply records the filtering state.)
Let's say you then wrote a little script to inventory all 3600-or-so categories Lowe's currently has on their site. You'd get something like this:
N-1z0y28t /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Kits
N-1z0y28u /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Towers
N-1z0y28v /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Shelves
N-1z0y28w /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Hardware
N-1z0y28x /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Accessories
N-1z0y28y /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Pedestal-Bases
N-1z0y28z /Cleaning-Organization/Closet-Organization/Wood-Closet-Systems
N-1z0y294 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Mix-Match-Mini-Pendant-Shades
N-1z0y295 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Mix-Match-Mini-Pendant-Light-Fixtures
N-1z0y296 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Chandeliers
N-1z13dp5 /Plumbing/Plumbing-Supply-Repair
N-1z13dr7 /Plumbing
N-1z13dsg /Lawn-Care-Landscaping/Drainage
N-1z13dw5 /Lawn-Care-Landscaping
N-1z13e72 /Tools
N-1z13e9g /Cleaning-Organization/Hooks-Racks
N-1z13eab /Cleaning-Organization/Shelves-Shelving/Laminate-Closet-Shelves-Organizers
N-1z13eag /Cleaning-Organization/Shelves-Shelving/Shelves
N-1z13eak /Cleaning-Organization/Shelves-Shelving/Shelving-Hardware
N-1z13eam /Cleaning-Organization/Shelves-Shelving/Wall-Mounted-Shelving
N-1z13eao /Cleaning-Organization/Shelves-Shelving
N-1z13eb3 /Cleaning-Organization/Baskets-Storage-Containers
N-1z13eb4 /Cleaning-Organization
N-1z13eb9 /Outdoor-Living-Recreation/Bird-Care
N-1z13ehd /Outdoor-Living
N-1z13ehn /Appliances/Air-Purifiers-Accessories/Air-Purifiers
N-1z13eho /Appliances/Air-Purifiers-Accessories/Air-Purifier-Filters
N-1z13ehp /Appliances/Air-Purifiers-Accessories
N-1z13ejb /Appliances/Humidifiers-Dehumidifiers/Humidifier-Filters
N-1z13ejc /Appliances/Humidifiers-Dehumidifiers/Dehumidifiers
N-1z13ejd /Appliances/Humidifiers-Dehumidifiers/Humidifiers
N-1z13eje /Appliances/Humidifiers-Dehumidifiers
N-1z13elr /Appliances
N-1z13eny /Windows-Doors
Notice how the entries are for the most part sequential (it's a sequential identifier, not a hash) and mostly, though not always, grouped together. The identifier reflects chronology, not structure: it captures insertion sequence, and insertions happened in single or multiple batches, sometimes years and thousands of identifiers apart, at the other end of the database. Notice also how "parent" nodes always come after their children, sometimes after holes. These are all tell-tale signs that, as categories are added and/or removed, new versions of their corresponding parent nodes are rewritten and the old, superseded or removed versions are ultimately deleted.
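To see the sequence for yourself, you can decode the part after N- as a base-36 integer and look at the gaps between neighbours. This is a minimal sketch that takes the base-36 reading at face value; node_number is a hypothetical helper, and the identifiers are copied from the listing above.

```python
def node_number(identifier):
    """Decode an 'N-xxxxxxx' identifier as a base-36 integer."""
    digits = identifier[2:] if identifier.startswith("N-") else identifier
    return int(digits, 36)

ids = ["N-1z0y28t", "N-1z0y28u", "N-1z0y28v", "N-1z0y28z", "N-1z0y294"]
numbers = [node_number(i) for i in ids]

# Differences between consecutive identifiers: small gaps suggest a counter,
# not a hash.
gaps = [b - a for a, b in zip(numbers, numbers[1:])]
print(gaps)  # [1, 1, 4, 5]
```

A hash would scatter neighbouring categories across the whole range; here the siblings sit one apart, and the parent (28z) follows its children.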
If you think there's more you need to know you may want to further inquire with WebSphere Commerce experts as to what exactly Lowe's might be using specifically for its N-xxxxxxx catalogue (though I suspect that whatever it is is 90%+ custom.) FWIW I believe Home Depot (who also appear to be using WebSphere) upgraded to version 7 earlier this year.
UPDATE: Joshua mentioned Endeca, and it is indeed Endeca (those N-xxxxxxx identifiers) that is being used behind WebSphere in this case (though I believe that since the acquisition of Endeca, Oracle is pushing SUN^H^H^Htheir own Java EE "Endeca Server" platform.) So not really a 90% custom job despite the appearances (the presentation and their JavaScript are heavily customized, but that's the tip of the iceberg.) You should be able to use Solr as a substitute.

Validating FirstName in a web application

I do not want to be too strict, as there may be thousands of possible characters in a first name:
Normal English letters, accented letters, non-English letters, numbers(??), common punctuation symbols
M.D. Shah (dots and space)
Jatin "Tom" Shah
However, I do not want to accept HTML tags, semicolons, etc.
Is there a list of characters that are absolutely bad from a web application perspective?
I can then use a regex to blacklist these characters.
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~#$%;]*";
if (!firstName.matches(regex)) { // firstName: the submitted value (variable name assumed)
    throw new InputValidationException("Invalid FirstName");
}
My question is more about the design than the coding ... I am looking for an exhaustive (well, to a good degree of exhaustiveness) list of characters that I should blacklist.
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it's highly unlikely to be a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semicolons, etc. only makes sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you going to create filenames that include the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.
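As a rough illustration of "store it raw, escape it per medium", here is a small sketch. It uses Python's standard library only to stay short and self-contained; in a Servlet/JSP app the analogous tools would be a PreparedStatement for SQL and an escaping tag such as JSTL's <c:out> for HTML. The table name and the render_greeting and store_and_load helpers are made up for the example.

```python
import html
import sqlite3

def render_greeting(first_name):
    """Escape at the point of HTML output, not at input time."""
    return "<p>Hello, {}!</p>".format(html.escape(first_name))

def store_and_load(first_name):
    """Store the raw name with a parameterized query and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (first_name TEXT)")
    # The placeholder keeps quotes and semicolons from being parsed as SQL.
    conn.execute("INSERT INTO users (first_name) VALUES (?)", (first_name,))
    row = conn.execute("SELECT first_name FROM users").fetchone()
    conn.close()
    return row[0]

name = 'Jatin "Tom" Shah; <i>'
print(store_and_load(name))   # stored and returned unchanged
print(render_greeting(name))  # markup characters escaped for HTML only
```

The stored value stays exactly what the user typed, so a later change to the HTML rules applies to old and new records alike.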

Intern Problem Statement for a bank

I saw an internship opportunity at a bank in Dubai. They have a defined problem statement to be solved in 2 months. They gave us just 2 lines:
"Basically the problem is about name matching logic.
There are two fields (variables) – both are employer names, and it’s a free text field. So we need to write a program to match these two variables."
Can anyone help me understand it? Is it just simple pattern matching?
Any help/comments would be appreciated.
I think this is what they are asking for:
They have two sources of related data, for example, one from an internal database, and the other from name card input.
Because the two fields are free text fields, there will be inconsistency. For example, Nitin Garg, or Garg, Nitin, or Mr. Nitin Garg, etc. Here is an extreme case of Gadaffi.
What you are supposed to do is to find a way to match all the names for a specific person together.
In short, match two pieces of data together by employer names, taking possible inconsistency into account.
Once upon a time there was a nice simple answer to the problem of matching up names despite misspellings and different transliterations: Soundex. But people have put a lot of work into this problem, so now you should probably use the results of that work, which are built into databases and add-ons, some of them free. See Fuzzy matching using T-SQL.
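To give a feel for what "taking inconsistency into account" can mean in code, here is a minimal sketch that normalizes both fields and then scores them with Python's difflib. It is neither Soundex nor the T-SQL approach mentioned above, just one simple stand-in; the NOISE word list and the helper names are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# Tokens ignored when comparing; an illustrative list, not a complete one.
NOISE = {"mr", "mrs", "ms", "dr", "ltd", "llc", "inc", "co"}

def normalize(name):
    """Lowercase, drop punctuation and noise words, and sort the tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(t for t in tokens if t not in NOISE))

def similarity(a, b):
    """Return a score between 0.0 and 1.0 for two free-text names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("Nitin Garg", "Garg, Nitin"))        # word order ignored
print(similarity("Acme Trading LLC", "Acme Tradng"))  # typo tolerated
print(similarity("Nitin Garg", "Acme Bank"))          # unrelated names
```

In practice you would pick a threshold (say, treat anything above some score as a candidate match) and review the borderline cases by hand, since the right cut-off depends on the data.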

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
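Once such a list of readings exists, the sorting itself is the easy part: sort on the kana column and display the kanji column. A minimal sketch, using a few prefectures with their readings typed in by hand; note that comparing hiragana by code point only approximates 五十音順 (voiced sounds and small kana are not collated the way a dictionary would), so treat it as a starting point.

```python
# (kanji name, hiragana reading) pairs; readings entered manually.
PREFECTURES = [
    {"name": "北海道", "reading": "ほっかいどう"},
    {"name": "青森県", "reading": "あおもりけん"},
    {"name": "東京都", "reading": "とうきょうと"},
    {"name": "大阪府", "reading": "おおさかふ"},
    {"name": "沖縄県", "reading": "おきなわけん"},
]

def sort_by_reading(items):
    """Sort records by their kana reading instead of by the kanji."""
    return sorted(items, key=lambda item: item["reading"])

for item in sort_by_reading(PREFECTURES):
    print(item["reading"], item["name"])
```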
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
For data, dig into Google's Japanese IME (Mozc) data files here.
There is lots of interesting data there, including IPA dictionaries.
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words.
There are Ruby bindings for it too.
And here is somebody who tested Ruby with MeCab using the tagger option -Oyomi.
Just a quick follow-up to explain the eventual actual solution we used. Thanks to all who recommended MeCab; this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combines the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Uses the data from the resulting .csv file to seed our Rails app with its built-in values.
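For anyone not on Ruby, the two post-processing steps above (katakana to hiragana, then interleaving the reading columns) can be sketched in a few lines of Python. This is not the script we used; it assumes the MeCab output has already been read into rows, and both helper names are made up. The conversion relies on the katakana block sitting exactly 0x60 code points above the hiragana block.

```python
KATAKANA_FIRST = 0x30A1  # ァ
KATAKANA_LAST = 0x30F6   # ヶ
OFFSET = 0x60            # distance between the katakana and hiragana blocks

def katakana_to_hiragana(text):
    """Shift full-width katakana to hiragana; leave everything else alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST else ch
        for ch in text
    )

def add_yomigana(rows, converted_rows):
    """Interleave each original column with its converted reading."""
    merged = []
    for original, converted in zip(rows, converted_rows):
        out = []
        for value, reading in zip(original, converted):
            out.extend([value, katakana_to_hiragana(reading)])
        merged.append(out)
    return merged

rows = [["東京病院", "東京都"]]
converted = [["トウキョウビョウイン", "トウキョウト"]]
print(add_yomigana(rows, converted))
```

The long-vowel mark ー lies outside the shifted range and is passed through unchanged, which is usually what you want for sorting.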
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
If you're fluent in Japanese, have a look here:
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
Let me introduce another method as well.
If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.
see :
Sorting prefectures by their pronunciation is not common. Most Japanese are used to prefectures sorted by 「都道府県コード」.
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" or "ISO 3166-2:JP".
see (Wikipedia Japanese) :

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matters more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources.
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably best be found by searching for items that are grammatically significant. For example, English capitalizes only the first word of a sentence and proper nouns. So you could take every word whose first letter is capitalized and check it against a database that contains the word and its type [e.g. Bob - Name, Elon - Place, England - Place].
Secondly, information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use a regex, and use an algorithm to score the quality of the matches.
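For the formulaic part, a minimal sketch of the regex idea might look like this. The two patterns are deliberately loose assumptions, not validated formats: the phone pattern in particular will also hit things like date ranges ("2005-2010"), which is exactly why some scoring of the matches is needed before stripping anything.

```python
import re

# Loose illustrative patterns; real-world formats vary a lot more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace anything that looks like an email or phone number."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(redact("Mail bob@example.com or call +1 555-123-4567."))
# Mail [EMAIL] or call [PHONE].
```

Replacing matches with a placeholder rather than deleting them keeps the surrounding text intact for the full-text index.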
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.
