How to split/extract specific entities automatically in MS LUIS? - azure-language-understanding

I am currently working with MS LUIS.ai.
My string/utterance contains both English and Chinese.
Here is the problem:
If the sentence is all in English, LUIS works fine, probably because an English sentence is composed of words separated by spaces.
However, in Chinese (both Traditional and Simplified), a sentence is composed of words that are concatenated together and are difficult to split.
For example, in English I can write:
I love you so much: there are 5 words here. In LUIS I can select "I love you" and turn it into an entity. Later, when more utterances containing "I love you" go into LUIS, it can identify the related intent easily.
However, in Chinese if I write:
我很喜歡你: this has the same meaning as the English above. Under LUIS it will be counted as 1 word. If I want to extract the word 喜歡 (which means "love/like"), I cannot do this in LUIS.
Only if I put spaces around 喜歡, like this: 我很 喜歡 你, will I be able to select 喜歡 as a particular entity.
My Question:
Are there any ways/methods/tricks I can use so that, when someone enters a joined string like the Chinese example above, LUIS will be able to identify specific words as entities automatically, without any manual change?
Thank you very much in advance for all your help.

To perform machine learning, LUIS breaks an utterance into tokens based on culture, and this tokenization cannot be suppressed. LUIS tokenizes Chinese at the character level and returns tokenized entities, whereas for English it tokenizes on every space or special character. Note that in the zh-cn culture, LUIS expects the Simplified Chinese character set rather than the Traditional one.
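To see why the two cultures behave so differently, compare space-based splitting with character-level splitting. This Ruby snippet only mimics the behavior described above; it is not LUIS's actual tokenizer:

```ruby
# Space-based tokenization, as used for English:
english_tokens = "I love you so much".split(' ')
# => ["I", "love", "you", "so", "much"]

# Character-level tokenization, as applied to Chinese:
chinese_tokens = "我很喜歡你".chars
# => ["我", "很", "喜", "歡", "你"]
```

With character-level tokens, a two-character word like 喜歡 is simply a span of two tokens, which is why it can only be labeled once the characters are separated.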
Hope this helps!!

Related

LUIS.AI patterns: optional punctuation works differently for words vs. entity lists than expected

Here is a pattern I've created:
({SeminarsList}|seminar)[s]((list|lists)|(info|information))[(?|.|!)]
I would expect the optional [s] to work both for the entity list and for the plain word, i.e. an optional trailing s as in seminar vs. seminars. However, only the entity list works as expected: the s in seminars is ignored and the pattern isn't recognized for seminars info.
Is this a bug or expected behavior? I would rather it work like the entity list, as that makes perfect sense and matches the way it is presented in the documentation.
Update
Also, the word on its own, without being in a group, works as expected.
So, for example, this works:
where[(are|is)][the](SeminarsList|seminar)[[']s][seminar][[']s] [(location|locate|located)]
i.e. the second seminar with optional punctuation works as expected, just not in a grouping.
Update
Here is an example from the documentation:
Select the OrgChart-Manager intent, then enter the following template utterances:
Template utterances
Who is {Employee} the subordinate of[?]
Who does {Employee} report to[?]
Who is {Employee}['s] manager[?]
Who does {Employee} directly report to[?]
Who is {Employee}['s] supervisor[?]
Who is the boss of {Employee}[?]
The example above shows how the documentation says this works, including adding optional punctuation to the end of the sentence. Since that works, I would expect the other approach to work too.
Per the docs (emphasis mine):
Pattern syntax is a template for an utterance. The template should contain words and entities you want to match as well as words and punctuation you want to ignore. It is not a regular expression.
So, the pattern syntax is not meant to be used for single letters, but for full words. It has to do with the tokenization of utterances.
If you'd like this feature added, I'd recommend upvoting this LUIS UserVoice ticket.

LUIS builtin number entity for German culture completely ignores decimal separator -> BUG?

The English number format 1,000,000.90 is 1.000.000,90 in German.
In the English culture, LUIS interprets the number as 1000000.9.
But in the German culture, LUIS interprets the number as 10000009.
I.e. it completely ignores the decimal separator!
This is a bug, right?
LUIS's predefined entities/intents are useful, but they may not fulfil every purpose. You can try doing the matching yourself with a regex entity, or add a preprocessing step to your pipeline where you check for numbers and format them so LUIS will understand them (using regex, plain code, an NLP library...).
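A minimal sketch of such a preprocessing step, written in Ruby. The helper name and the regex are illustrative, not part of LUIS; they assume the standard German number format (dots as thousands separators, comma as decimal separator):

```ruby
# Illustrative preprocessing: rewrite German-formatted numbers into the
# dot-decimal form LUIS parses correctly, before sending the utterance off.
def normalize_german_numbers(utterance)
  utterance.gsub(/\d+(?:\.\d{3})*(?:,\d+)?/) do |num|
    num.delete('.').tr(',', '.')  # drop grouping dots, comma becomes decimal point
  end
end

normalize_german_numbers("Der Preis ist 1.000.000,90 Euro")
# => "Der Preis ist 1000000.90 Euro"
```

The reverse mapping (if you need to echo the number back to the user in German format) would be a second, symmetric step.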

How to implement fuzzy search for Chinese pinyin and Japanese romaji?

I have some data in Chinese and Japanese, and I want it possible to search by their romanizations (Pinyin for Chinese, Romaji for Japanese). Assume that the romanizations are already provided, separated by syllables.
e.g. the text "示例文本", which romanizes to ["shi", "li", "wen", "ben"].
Users should be able to match this by typing
whole syllables, with or without spaces, e.g. shi li wen ben or shiliwenben
initials or the first few letters of syllables, e.g. shlwb or slwb
they might also type only part of the string, e.g. wenben or wb (these examples correspond to the last two syllables of the text above)
Is there an elegant way of implementing this?
(note: I did not specify any programming language in this question, because I want to implement this in different languages. If your response is language-specific or requires specific libraries, please make it clear. Thank you!)
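Since the question is language-agnostic, here is one possible shape for the matcher in Ruby. All names are illustrative and no particular library is assumed. The three rules above reduce to a single condition: the query matches if it can be split into non-empty prefixes of consecutive syllables, starting at any syllable position:

```ruby
# Returns true if the query can be decomposed into non-empty prefixes of
# consecutive syllables, starting anywhere in the syllable list.
def fuzzy_match?(syllables, query)
  q = query.downcase.delete(' ')  # spaces between syllables are optional
  (0...syllables.length).any? { |start| match_from(syllables, start, q) }
end

def match_from(syllables, idx, query)
  return true if query.empty?
  return false if idx >= syllables.length
  syllable = syllables[idx]
  # try consuming every non-empty prefix of the current syllable
  (1..syllable.length).any? do |len|
    query.start_with?(syllable[0, len]) &&
      match_from(syllables, idx + 1, query[len..])
  end
end

syls = %w[shi li wen ben]
fuzzy_match?(syls, "shiliwenben")  # => true (whole syllables)
fuzzy_match?(syls, "slwb")         # => true (initials)
fuzzy_match?(syls, "wenben")       # => true (partial, from the middle)
```

This brute-force sketch is exponential in the worst case, but queries and names are short enough that it is fine in practice; for large datasets you would precompute an index (e.g. all initial strings and suffixes) instead of scanning every entry.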

LUIS inserts whitespace in utterances when punctuation present causing entity getting incorrectly parsed

I am playing around with the LUIS stock ticker example here (GitHub MicrosoftBotBuilder Example). It works well and the entity in the utterances is identified, but there are stock tickers in the world that contain periods, such as bt.a.
LUIS by default pre-processes utterances so that word breaks are inserted around punctuation characters; therefore an utterance of "what is price of bt.a" becomes "what is price of bt. a", and LUIS thinks the entity is "bt" instead of "bt.a".
Does anyone know how to get around this? Thx
This is how LUIS tokenizes utterances, and I don't think it'll change in the near future.
I think you can investigate one of these 2 solutions:
Preprocess the utterance and normalize entities containing punctuation (perhaps saving them in a map), then reverse the process after LUIS is called and the entities have been extracted.
Use phrase list features and add the entities that LUIS misses in their tokenized form, label the entity tokens in the utterance, and retrain the model (I suggest you try this in a clone of your app, so you don't lose any current progress).
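A rough sketch of the first solution in Ruby. The ticker and its placeholder are made up for illustration; the idea is just a reversible map of punctuation-bearing entities to tokens LUIS won't split:

```ruby
# Illustrative normalization map: replace punctuation-bearing entities with
# placeholders before calling LUIS, then reverse after entity extraction.
PLACEHOLDERS = { 'bt.a' => 'btsharea' }.freeze

def preprocess(utterance)
  PLACEHOLDERS.reduce(utterance) { |u, (raw, safe)| u.gsub(raw, safe) }
end

def postprocess(entity)
  PLACEHOLDERS.key(entity) || entity  # reverse lookup, pass through if unmapped
end

preprocess("what is price of bt.a")  # => "what is price of btsharea"
postprocess("btsharea")              # => "bt.a"
```

The map would need to cover every such entity you expect, which is why this works best when the problematic vocabulary (ticker symbols, URLs) is enumerable.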
I need to process sentences with website addresses in them so I had to deal with a few different symbols. I found a technique that works for me, but it is not very elegant.
I am assuming here that you have an entity setup to represent the "stock symbol"
Here is what this would look like in your case.
Detect the cases when LUIS gets the "stock symbol" entity wrong. In your case this may be whenever it ends in a period.
When LUIS gets the entity wrong, tokenize the raw query using spaces as the separator, then grab the proper token by looking for a match with the wrong partial token.
So, for your example:
"what is price of bt.a"
You would see the "stock symbol" entity of "bt." and know that it is wrong because it ends in a period. You would then tokenize the query and look for tokens that contain "bt.". This would identify "bt.a" as the requested symbol.
It's not pretty, but in the case of website addresses it has been reliable.
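The recovery heuristic above, sketched in Ruby (the method name is illustrative): if the extracted entity looks truncated, re-tokenize the raw query on spaces and find the token that contains the partial match.

```ruby
# Recover a truncated entity by re-scanning the raw query's
# space-separated tokens for one containing the partial match.
def recover_symbol(raw_query, extracted)
  return extracted unless extracted.end_with?('.')  # only treat trailing-period cases as wrong
  raw_query.split(' ').find { |token| token.include?(extracted) } || extracted
end

recover_symbol("what is price of bt.a", "bt.")   # => "bt.a"
recover_symbol("what is price of msft", "msft")  # => "msft" (left untouched)
```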

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and so on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojūon-jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
Thank you in advance. (宜しくお願いします)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
There are Ruby bindings for that too:
https://taku910.github.io/mecab/bindings.html
And here is somebody's test of Ruby with MeCab, using the tagger option -Oyomi:
http://hirai2.blog129.fc2.com/blog-entry-4.html
Just a quick follow-up to explain the eventual actual solution we used. Thanks to all who recommended MeCab; this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata as hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output UTF-8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combines the input file with the mecab-translated file, so that the resulting file has extra columns inserted, e.g. NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
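The katakana-to-hiragana standardization and the column interleaving can also be sketched without NKF; everything below is illustrative (sample names included), not the script we actually ran:

```ruby
# Full-width katakana and hiragana occupy parallel Unicode ranges, so a
# character-range transliteration converts between them:
def katakana_to_hiragana(str)
  str.tr('ァ-ヶ', 'ぁ-ゖ')
end

# Interleave each original column with its hiragana reading, producing
# NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, ...
def merge_rows(orig_row, yomi_row)
  orig_row.zip(yomi_row.map { |c| katakana_to_hiragana(c.to_s) }).flatten
end

merge_rows(%w[東京病院 新宿区], %w[トウキョウビョウイン シンジュクク])
# => ["東京病院", "とうきょうびょういん", "新宿区", "しんじゅくく"]
```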
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
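Once readings exist, step three is the easy part: plain codepoint comparison on hiragana strings already follows gojūon order. A quick Ruby illustration (the reading map is made up for the example):

```ruby
# Sort kanji names by their hiragana readings; hiragana codepoints are
# already in gojūon order, so a plain sort_by suffices.
readings = { '高橋' => 'たかはし', '高谷' => 'たかや', '佐藤' => 'さとう' }
readings.keys.sort_by { |kanji| readings[kanji] }
# => ["佐藤", "高橋", "高谷"]
```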
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
Let me introduce another method as well.
If your app is written in Microsoft VBA, you can call the GetPhonetic function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by pronunciation is not common. Most Japanese are used to prefectures sorted by 都道府県コード (prefecture codes).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in JIS X 0401 and ISO 3166-2:JP.
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
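Sorting by these codes is then trivial; here is a Ruby sketch using the subset of JIS X 0401 codes listed above:

```ruby
# Prefecture codes from JIS X 0401 (subset for illustration); sorting by
# code gives the order Japanese users expect.
PREFECTURE_CODES = {
  '北海道' => 1, '青森県' => 2, '東京都' => 13, '大阪府' => 27, '沖縄県' => 47
}.freeze

PREFECTURE_CODES.keys.sort_by { |name| PREFECTURE_CODES[name] }
# => ["北海道", "青森県", "東京都", "大阪府", "沖縄県"]
```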
