Does LUIS have a solution for German compound words like "SALAMIPIZZA" (English: salami pizza)?
The German instance splits compound words without any apparent rule: "Apfelsaft" (English: apple juice), for example, is not split, but "Salamipizza" is.
You can use a list entity to account for this. For example, create a list entity named "Foods", then create a canonical form called "SALAMIPIZZA", and for that canonical form's list, enter "SALAMIPIZZA" and "SALAMI PIZZA". This also lets you account for other spellings, such as "SALAMI-PIZZA".
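As a rough sketch of what that entity looks like, here is the structure expressed as a Python dict mirroring the JSON shape LUIS uses for list (closed-list) entities; the exact field names in the authoring payload are an assumption on my part, so treat this as illustrative rather than a definitive API call:

# Sketch of the "Foods" list entity described above. The field names follow
# the closed-list structure in a LUIS app export (an assumption, not verified
# against a specific API version); the point is mapping one canonical form to
# its alternative spellings.
foods_list_entity = {
    "name": "Foods",
    "subLists": [
        {
            "canonicalForm": "SALAMIPIZZA",
            "list": ["SALAMI PIZZA", "SALAMI-PIZZA"],
        },
        # Other compounds can be normalized the same way:
        {
            "canonicalForm": "APFELSAFT",
            "list": ["APFEL SAFT", "APFEL-SAFT"],
        },
    ],
}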
When I input the parameter "marry" into a function, it should return "this is a name"; when I input the parameter "MIT", it should return "this is an institution name". It is just a single word to recognize, not a whole sentence. Which NLP algorithm should I use? Thanks.
This looks like part-of-speech tagging and/or named entity recognition, BUT if you are processing English, single words without context are potentially ambiguous. Also, single words may not be informative: "new" on its own can be an adjective (POS), but "New York" is most likely a location (NER). Check some literature on both tasks and consider processing at least sentence-level features.
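To make the contrast concrete, here is a minimal sketch using spaCy (my choice of toolkit, not something the answer prescribes): a lone word only gets a POS guess, while the same word inside a sentence can be resolved by NER.

# Minimal sketch using spaCy (one possible toolkit; the choice is an
# assumption). Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# A single word carries almost no context: we only get a POS guess.
single = nlp("new")
print([(tok.text, tok.pos_) for tok in single])        # e.g. [('new', 'ADJ')]

# The same word inside a sentence can be resolved by NER.
sentence = nlp("She moved to New York after graduating from MIT.")
print([(ent.text, ent.label_) for ent in sentence.ents])
# e.g. [('New York', 'GPE'), ('MIT', 'ORG')]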
Here is what I want to achieve:
My field value : "one two three"
I want to be able to match this field by typing: one or onetwo or onetwothree or onethree or twothree or two or three
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
There is the same problem in the German language, where different words are joined into one. For this purpose Elasticsearch uses a technique called "compound words". There is also a specific token filter called the "compound word token filter". It tries to find sub-words from a given dictionary within a string. You only have to define a dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
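As a rough sketch (the index name, word list, field mapping, and Python client usage are my illustrative assumptions, not part of the linked docs), the analysis settings could look like this:

# Sketch of an index using the dictionary-based compound word token filter,
# created through the official Python client. Names and the word list are
# illustrative; see the Elasticsearch docs linked above for all options.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on the default port

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_decompounder": {
                    "type": "dictionary_decompounder",
                    # Dictionary of sub-words the filter should look for.
                    "word_list": ["one", "two", "three"],
                }
            },
            "analyzer": {
                "decompound_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_decompounder"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "my_field": {"type": "text", "analyzer": "decompound_analyzer"}
        }
    },
}

es.indices.create(index="my_index", body=settings)

# The filter keeps the original token and adds the sub-words it finds, so a
# query like "onetwothree" analyzed with this analyzer also produces "one",
# "two", and "three" and can therefore match the field value "one two three".
print(es.indices.analyze(index="my_index",
                         body={"analyzer": "decompound_analyzer",
                               "text": "onetwothree"}))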
I have read the detailed description given here - http://nlp.stanford.edu/software/crf-faq.shtml#a - on training the model based on the labelled input file according to the .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space-separated words that together form a single token rather than individual words. For instance, "Wright State University" is a single token, even though Wright, State, and University are separate words. I would like to generate the model with the above token treated as a single one. The article says that the input file used to generate the model should contain tab-separated words, with the first column being the token and the second column the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
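If it helps, here is a minimal sketch for writing training data in exactly that tab-separated, one-token-per-line layout, with multi-word entities labeled token by token. The naive whitespace tokenizer and the file name are my assumptions, not something the FAQ prescribes.

# Minimal sketch for producing the tab-separated training format shown above.
sentences = [
    # (sentence, {phrase that should be tagged: label})
    ("He attended Wright State University .", {"Wright State University": "SCHOOL"}),
]

with open("train.tsv", "w", encoding="utf-8") as out:
    for text, phrases in sentences:
        tokens = text.split()                     # naive whitespace tokenization
        labels = ["O"] * len(tokens)
        for phrase, label in phrases.items():
            phrase_tokens = phrase.split()
            # Find the phrase and label every token inside it.
            for i in range(len(tokens) - len(phrase_tokens) + 1):
                if tokens[i:i + len(phrase_tokens)] == phrase_tokens:
                    for j in range(i, i + len(phrase_tokens)):
                        labels[j] = label
        for token, label in zip(tokens, labels):
            out.write(f"{token}\t{label}\n")      # token TAB label, no spaces
        out.write("\n")                           # blank line between sentences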
If you don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html
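For the RegexNER route, the mapping file is simply tab-separated lines pairing a token-sequence pattern with its tag. Here is a small sketch of generating one; the file name, the phrase list, and the property name used to load it are illustrative, so check the page above for the exact invocation.

# Sketch of a RegexNER mapping file: each line is a token-sequence pattern,
# a TAB, and the NER class to assign. File name and phrases are illustrative.
phrases = {
    "Wright State University": "SCHOOL",
    "Massachusetts Institute of Technology": "SCHOOL",
}

with open("school_mapping.txt", "w", encoding="utf-8") as out:
    for phrase, tag in phrases.items():
        out.write(f"{phrase}\t{tag}\n")

# The mapping is then handed to the pipeline's regexner annotator, e.g. via a
# property like regexner.mapping=school_mapping.txt (see the linked page).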
I'm testing Amazon CloudSearch for my web application and I'm running into some strange issues.
I have the following domain indexes: name, email, id.
For example, I have data such as: John Doe, john@example.com, 1
When I search for jo I get nothing. If I search for joh I still get nothing, but if I search for john then I get the above document as a hit. Why does it not match when I enter partial strings? I even put suggesters on name and email with fuzzy matching enabled. Is there something else I'm missing? I read the below on this:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-compound-queries.html
I'm doing the searches using boto as well as with the search form on the AWS page.
What you're trying to do -- finding "john" by searching "jo" -- is called a prefix search.
You can accomplish this either by searching
(prefix field=name 'jo')
or
q=jo*
Note that if you use the q=jo* method of appending * to all your queries, you may want to do something like q=jo* |jo because john* will not match john.
This can seem a little confusing, but imagine if Google gave back results for prefix matches: if you searched for tort and got back a mess of results about tortoises and torture instead of tort (a legal term), you would be very confused (and frustrated).
A suggester is also a viable approach but that's going to give you back suggestions (like john, jordan and jostle rather than results) that you would then need to search for; it does not return matching documents to you.
See "Searching for Prefixes in Amazon CloudSearch" at http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
Are your index field types "Text"? If they are just "Literals", they have to be an exact match.
I think you must have your name and email fields set as the literal type instead of the text type, otherwise a simple text search of 'jo' or 'Joh' should've found the example document.
While using a prefix search may have solved your problem (and that makes sense if the fields are set as the literal type), the accepted answer isn't really correct. The notion that it's "like a google search" isn't based on anything in the documentation. It actually contradicts the example they use, and in general muddies up what's possible with the service. From the docs:
When you search text and text-array fields for individual terms, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, in the sample movie data, the title field is configured as a text field. If you search the title field for star, you will find all of the movies that contain star anywhere in the title field, such as star, star wars, and a star is born. This differs from searching literal fields, where the field value must be identical to the search string to be considered a match.
I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can you use to deal with this? I am using a trie to store the names.
The only solutions I could come up with were to say that two-word names cannot be used (either run them together or put a dash in the middle). Another idea was to create a list of these kinds of names and have a separate loop to check the input. The other, and possibly best, idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. I was wondering if there is a better solution out there?
Edit: I realized I wasn't very clear about what I was asking. My problem isn't adding two-word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell whether they are typing the rest of the first name or are now typing the last name?
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
Is there a reason you are rolling your own autocomplete script instead of using an existing one, such as YUI autocomplete? (i.e., are you doing it just for fun, etc.?)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
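Here is a minimal sketch of the include-the-spaces idea, in Python for brevity (the question uses JavaScript, but the structure translates directly); the branching-based pruning heuristic could be layered on by counting children under the node that follows a space.

# Minimal sketch of a trie that treats the space as an ordinary character,
# so a full name like "john smith" lives on a single path.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a complete name ends here

class NameTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored names that start with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return results

trie = NameTrie()
for name in ["john smith", "john doe", "johanna", "mary jane watson"]:
    trie.insert(name)

print(trie.complete("john "))   # -> ['john smith', 'john doe'] (order may vary)
print(trie.complete("jo"))      # includes 'johanna' as well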
If you are using a multiline editor, I guess the best choice for autocomplete items would be single words. So the first name, middle name, and last name should each be parsed and added as a lookup item.
For a (one-line) textbox you can include whitespace (and a firstname + space + middlename + space + lastname pattern) in the search criteria.