I am adding some keywords to a phrase list; there are ~8,000 words. Is there a limit on the size of a LUIS phrase list? I am getting the error "BadArgument: Too many words in Data Dictionary".
Can anyone tell me the limit on the number of phrase list words?
Also, is there any other approach for incorporating these words?
You can have 10 Phrase Lists, and each of them can hold a maximum of 5,000 items. You can find the information here in the docs.
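If splitting the words across more than one phrase list is acceptable, a trivial way to stay under the per-list limit is to chunk the word file before uploading. A minimal sketch in plain Python (no LUIS client involved; the names are illustrative):

```python
# Split a large word list into chunks that fit the 5,000-item-per-phrase-list
# limit, so ~8,000 words would become two phrase lists.
MAX_PHRASE_LIST_ITEMS = 5000  # documented per-phrase-list limit

def chunk_phrase_list(words, chunk_size=MAX_PHRASE_LIST_ITEMS):
    """Yield successive chunks of at most chunk_size words."""
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]

words = [f"keyword_{i}" for i in range(8000)]  # stand-in for your ~8,000 words
chunks = list(chunk_phrase_list(words))
print(len(chunks), [len(c) for c in chunks])   # -> 2 [5000, 3000]
```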
Regarding incorporating words, can you give an example of the words you're using?
How can I include fuzziness in phrase matching? The Elasticsearch documentation says that fuzziness is not supported with phrase matching.
I have documents containing phrases, and I have a text body. I want to find the phrases that the text and the documents have in common, but I need to match phrases that might be spelled incorrectly.
There are some ways to do this:
Remove whitespace and index the whole phrase as one token (I think there is a filter for that in Elasticsearch). In your query you would have to do the same (see the sketch after this list).
There is a tokenizer whose name I forget (maybe someone can help out here?) that lets you index more than one word together. If your phrases have a common maximum length, like 5 words or so, this could do the trick.
Beware that fuzziness only works up to a maximum edit distance of 2, so if you have a very long phrase, 2 might not be enough and you may have to split it.
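Here is a rough sketch of the first approach (strip whitespace so the whole phrase is indexed as one token, then run a fuzzy match query on that field). It assumes the Elasticsearch 7.x Python client, and the index/field names are made up; adjust for your setup:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Analyzer that strips whitespace so a whole phrase becomes a single token,
# which makes it usable with a fuzzy "match" query (fuzziness is per token).
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_spaces": {"type": "pattern_replace", "pattern": "\\s+", "replacement": ""}
            },
            "analyzer": {
                "joined_phrase": {
                    "type": "custom",
                    "char_filter": ["strip_spaces"],
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"phrase_joined": {"type": "text", "analyzer": "joined_phrase"}}
    },
}
es.indices.create(index="phrases", body=settings)
es.index(index="phrases", body={"phrase_joined": "machine learning"}, refresh=True)

# The query text goes through the same analyzer, so "machne lerning" becomes a
# single token and can fuzzy-match "machinelearning" within edit distance 2.
query = {"query": {"match": {"phrase_joined": {"query": "machne lerning", "fuzziness": "AUTO"}}}}
print(es.search(index="phrases", body=query)["hits"]["hits"])
```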
Given a phrase, I want to check whether one of the entries in a predefined list of phrases is contained in the phrase ("phrase match"). The predefined list of phrases may be huge, and I'm reading it from a file.
If I were looking for an exact match, I would read the list into a hash. But since I'm not looking for an exact match, I don't know which data structure to use.
Do I have to go over the entire list for each new phrase? Do you know of a data structure that fits the phrase-match use case?
Thanks,
Li
I have a limited number of industries (around 300), and I would like to create an index that gives the frequency of these keywords in the indexed documents. Is there any way to do this in Sphinx?
Not really.
But the --buildstops function of indexer will produce a list of the most common keywords in an index.
So you can just look at the output of that and compare it with your industry list. In theory your industries should be near the top of the list, so you don't have to make it too long.
There is a trick in Sphinx to get keyword statistics from the index: the BuildKeywords API call (http://sphinxsearch.com/docs/current.html#api-func-buildkeywords) with the hits flag set will return per-keyword frequencies from the given index.
Hope this helps.
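A rough sketch of that call using the official sphinxapi Python client (the local searchd address and the index name "documents" are placeholders):

```python
from sphinxapi import SphinxClient

industries = ["banking", "insurance", "telecom"]  # stand-in for your ~300 keywords

cl = SphinxClient()
cl.SetServer("localhost", 9312)

# With the hits flag set, BuildKeywords returns per-keyword docs/hits counts
# taken from the index statistics.
result = cl.BuildKeywords(" ".join(industries), "documents", True)

for kw in result or []:
    # each entry typically carries 'tokenized', 'normalized', 'docs' and 'hits'
    print(kw["normalized"], kw["docs"], kw["hits"])
```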
There are about 1,000 records in a database table. A column named title stores the titles of articles. Before inserting a record, I need to check whether an article with a similar title already exists in that table. If so, I will skip the insert.
What's the fastest way to perform this kind of fuzzy matching? Assume all words in the sentences can be found in an English dictionary. If 70% of the words in sentence #1 can be found in sentence #2, we consider them a match. Ideally, the algorithm could pre-compute a value for each sentence so that the value can be stored in the database.
For 1,000 records, doing the dumb thing and just iterating over all the records could work (assuming the strings aren't too long and you aren't handling too many queries). Just pull all of the titles out of your database and sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).
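A sketch of that brute-force approach in plain Python (titles hard-coded here instead of fetched from the database):

```python
# Compare a candidate title to every existing title using Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

existing_titles = ["How to parse JSON in Python", "Fuzzy matching of article titles"]
candidate = "Fuzzy matching of articles titles"

# Sort existing titles by distance to the candidate; skip the insert if the
# closest one is "close enough" (the threshold is a judgment call).
ranked = sorted(existing_titles, key=lambda t: levenshtein(candidate.lower(), t.lower()))
closest = ranked[0]
print(closest, levenshtein(candidate.lower(), closest.lower()))
```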
A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance-wise, but it could mean more work:
http://en.wikipedia.org/wiki/N-gram
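As an illustration of the n-gram idea (character trigrams and Jaccard overlap are one possible choice, not the only one):

```python
# Precompute character trigrams per title, store them (e.g. in a side table),
# then compare new titles by trigram overlap instead of scanning full strings.
def trigrams(text: str) -> set:
    """Set of character 3-grams of a lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

# Precomputed once per stored title (this is the value you could persist):
stored = {"Fuzzy matching of article titles": trigrams("Fuzzy matching of article titles")}

candidate = trigrams("Fuzzy matching of articles titles")
for title, grams in stored.items():
    print(title, round(jaccard(candidate, grams), 2))
```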
You can read up on forward/reverse indexing of token-to-value storage for getting faster search results. I personally prefer reverse (inverted) indexing, which stores a hash map from each token (key) to the values (here, the titles) containing it.
Whenever you write a new article, like a new Stack Overflow question, the tokens in its title would be looked up against all the titles available.
To refine the results, i.e. to get a fuzzy ranking, you can sort the titles by how many of the searched-for tokens they contain. For example, if t1, t2 and t3 refer to the tokens 'what', 'is' and 'love', the title 'what this love is for?' would appear in all three token mappings, so it would be placed at the top.
You can play around with this more. I hope this approach is simple and appealing.
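A small sketch of that reverse-index-and-rank idea, using the 'what is love' example:

```python
# Map each token to the set of titles containing it, then rank candidate
# titles by how many query tokens they match.
from collections import defaultdict

titles = [
    "what this love is for?",
    "what is a closure?",
    "love in the time of cholera",
]

# Build the reverse index once (token -> titles containing it).
index = defaultdict(set)
for title in titles:
    for token in title.lower().split():
        index[token.strip("?.,!")].add(title)

def rank(query: str):
    """Titles sorted by the number of query tokens they contain."""
    counts = defaultdict(int)
    for token in query.lower().split():
        for title in index.get(token, ()):
            counts[title] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(rank("what is love"))  # 'what this love is for?' matches all three tokens
```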
I have a database where users upload articles.
I would like to make an algorithm so that my web app suggests similar texts based on the one the user is reading.
I saw some examples like Levenshtein distance, but those algorithms measure the distance between strings, not between whole articles. Is there a way to extract the most significant keywords from a text? Of course, I understand that "most significant" is an ambiguous term.
How do other sites manage this?
thanks a lot
Is there a way to extract most significant keywords from text?
Yes. Basically, you extract all the words from the text, sort them by frequency, eliminate the common words (a, an, the, etc.) by matching them against a common-word dictionary, and save the top 20 or more words, along with their frequencies, from each article.
The number of top words you save depends on both the length of the article and the subject matter of all the articles. Fewer words work for general-interest articles, while more words are needed for special-interest articles, like answers to programming questions.
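A minimal sketch of that extraction step in Python (the tiny stop-word list and the example text are only illustrations; a real common-word dictionary would be much longer):

```python
# Count word frequencies, drop common ("stop") words, keep the top N per article.
from collections import Counter
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "for", "on"}

def top_words(article_text: str, n: int = 20):
    words = re.findall(r"[a-z']+", article_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)  # list of (word, frequency) pairs

print(top_words("The cat sat on the mat. The cat is a happy cat."))
# e.g. [('cat', 3), ('sat', 1), ('mat', 1), ('happy', 1)]
```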
Articles that match more than half of the top words could be considered related. The degree of relatedness would depend on the number of matching top words and the frequencies of the matching words.
You could calculate a relatedness score by multiplying the frequencies of each matched word from the two articles and summing all the products. The higher the score, the more the articles are related.
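A small sketch of that scoring, assuming each article has already been reduced to a map from its top words to their frequencies:

```python
# For words that appear in both articles' top-word lists, multiply their
# frequencies and sum the products; a higher score means more related.
def relatedness(top_a: dict, top_b: dict) -> int:
    shared = top_a.keys() & top_b.keys()
    return sum(top_a[w] * top_b[w] for w in shared)

article_a = {"python": 12, "index": 7, "query": 5}
article_b = {"python": 9, "query": 4, "database": 6}
print(relatedness(article_a, article_b))  # 12*9 + 5*4 = 128
```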
You might try to correct the 'weight' of each word by the frequency with which it appears across all the articles. The best indicators of similarity would then be the words that appear only in the two compared articles and nowhere else. This would automatically disregard the common words (a, an, the, etc.) mentioned by @Gilbert Le Blanc.
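To sketch that weighting, an IDF-like factor (my phrasing, not the answerer's) can scale each product down by how many articles contain the word, so words shared only by the two compared articles count the most:

```python
# Down-weight words by how many articles they appear in (IDF-like factor).
import math

def idf_weights(articles: list) -> dict:
    """Map each word to log(N / number-of-articles-containing-it)."""
    n = len(articles)
    doc_freq = {}
    for top in articles:
        for word in top:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def weighted_relatedness(top_a: dict, top_b: dict, weights: dict) -> float:
    shared = top_a.keys() & top_b.keys()
    return sum(top_a[w] * top_b[w] * weights.get(w, 0.0) for w in shared)

corpus = [{"python": 12, "query": 5}, {"python": 9, "database": 6}, {"cooking": 3}]
w = idf_weights(corpus)
print(weighted_relatedness(corpus[0], corpus[1], w))  # "python" is down-weighted
```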