Get the number of repetitions of the searched word in an Elasticsearch index - sorting

How can I get the number of repetitions of the searched word in an Elasticsearch index?
I have an index with two types, and I want to sort on one of them. To sort my search results, I want to write a script that sorts by a custom algorithm.
To write that algorithm I need the repetition number and the type length. I found out that I can get my index length by calling curl -XGET localhost:9200/my_index/_stats?pretty=true , but I couldn't find how to get the repetition number.
Can I get the repetition number? If so, can anyone tell me how and show me an example?
Thanks.

Related

Efficiently Querying Phrases in a Huge Document Collection

In an interview, I was asked to design a search engine for an online library that contains thousands of documents with millions of words. Initially, the interviewer only asked me to search for single words, clarifying:
Search for exact keywords ("overflow" returns true while "overflo" does not)
Case-sensitivity can be ignored
My answer was to use a crawler algorithm that runs through each document before any query is executed and builds a lookup table recording which documents each word appears in. Then, once a query is executed, all the algorithm has to do is find the word in the lookup table and return the list of documents in which the word is used.
In the second step, they asked me what I would do if they wanted to search for multiple words (not necessarily consecutive), and my answer was to run a separate query for each word and take the intersection of the results.
Finally, the interviewer asked me what I would do if they wanted to query consecutive words or phrases (e.g. "stack overflow"). At this point my lookup table failed, since it stores no connection between consecutive words, and I couldn't come up with a solution with this approach. How can I handle this kind of query? Are there any problems with my initial answers and design? I have searched the Internet but couldn't find anything noteworthy.
For the second case, generate the map such that every key is a word and every value is a set of objects that have the following properties:
{
string document name : location/document_name,
integer index : index, //location in document,
toString : hash the object
}
Then, when you need to find results for "stack overflow":
For each element in set 1 (the entries for "stack"), check whether set 2 (the entries for "overflow") contains an entry for the same document whose index is item_index + 1.
Get the results for the two words, but return just the document names. If there are three words, do the same as for two words, and then for the third word take only the word 2 entries that passed the word 1 and word 2 check, and see whether those items exist in set 3 with item_index + 1.
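The approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: documents are held in memory as a dict of name to text, words are split on whitespace and lowercased, and the names build_positional_index and phrase_search are made up for this example.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each lowercased word to a set of (document_name, position) pairs."""
    index = defaultdict(set)
    for name, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].add((name, position))
    return index


def phrase_search(index, phrase):
    """Return names of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    # Start with every occurrence of the first word.
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        postings = index.get(word, set())
        # Keep only occurrences that are directly followed by the next word,
        # and advance each surviving match to that next position.
        matches = {(name, pos + 1) for name, pos in matches
                   if (name, pos + 1) in postings}
    return {name for name, _ in matches}


# Hypothetical sample data.
documents = {
    "a.txt": "Stack overflow is a site",
    "b.txt": "the overflow of the stack",
}
index = build_positional_index(documents)
print(phrase_search(index, "stack overflow"))
```

Both sample documents contain the two words, but only a.txt has them next to each other, so only a.txt is returned for the phrase.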

Elastic search calculation with data from different indexes

Good day, everyone. I have a slightly unusual use case for Elasticsearch.
There are two different indexes, and each index contains one data type.
The first type contains the following fields that matter for this case:
keyword (text, keyword)
URL (text, keyword)
position (number)
The second type contains the following fields:
keyword (text,keyword)
numberValue (number).
I need to do the following:
1. Group the data from the first index by URL.
2. For each object in a group, calculate a new metric (metric A) with this simple formula: position*numberValue*Param
3. For each group, calculate the sum of the metric A values calculated in step 2.
4. Order the resulting groups in descending order by the sums calculated in step 3.
5. Take some interval of the resulting groups.
Param is a parameter that I need to supply for the calculation; it is not stored in Elasticsearch.
This is not a difficult algorithm, but the data is in different indices and I don't know how to do it fast. I would prefer to do it at the Elasticsearch level.
I don't know how to build an efficient search or data processing pipeline that would help me implement this case.
I use ES version 6.2.3, if that matters.
Please give me some advice on how I can implement this algorithm.
From step 2, you seem to assume that keyword is some sort of primary key. Elasticsearch is not a relational database and can only reason over one document at a time, so unless numberValue and position are (indexed) fields of the same document, you can't combine them.
The rest of the steps seem achievable with aggregations.
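To make the five steps concrete, here is a plain Python sketch of the calculation itself, done outside Elasticsearch on rows that have already been fetched. It is not an Elasticsearch query; the function name rank_urls, the sample rows, and the decision to skip keywords that have no numberValue are assumptions made for illustration.

```python
from collections import defaultdict


def rank_urls(position_rows, number_values, param, offset=0, limit=10):
    """Sum position * numberValue * param per URL and return one page of results.

    position_rows: rows from the first index (keyword, url, position).
    number_values: keyword -> numberValue, taken from the second index.
    """
    sums = defaultdict(float)
    for row in position_rows:
        value = number_values.get(row["keyword"])
        if value is None:
            continue  # assumption: ignore keywords missing from the second index
        # Steps 1-3: group by URL and accumulate metric A.
        sums[row["url"]] += row["position"] * value * param
    # Step 4: order groups by their sums, descending.
    ranked = sorted(sums.items(), key=lambda item: item[1], reverse=True)
    # Step 5: take an interval of the result groups.
    return ranked[offset:offset + limit]


# Hypothetical sample data.
position_rows = [
    {"keyword": "a", "url": "u1", "position": 1},
    {"keyword": "b", "url": "u1", "position": 2},
    {"keyword": "a", "url": "u2", "position": 3},
]
number_values = {"a": 10, "b": 5}
print(rank_urls(position_rows, number_values, param=2))
```

If numberValue were copied into each document of the first index at indexing time, the join step (the dictionary lookup here) would disappear and the remaining grouping and summing is the part that aggregations can cover.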

Is there a way to find out the max theoretical score from an Elasticsearch query?

I have a search that's purely based on attributes rather than any text searching. I'd like to know if there's a way to interpret the scores returned from Elasticsearch so as to determine whether a match is good or not (or how good it is on a scale of 0-100).
The scores obviously change based on the query: if I ask for things that have 5 attributes using an OR search, those that have all 5 get a high score, whilst those with 1 get a lower score (which is fine). I'd like to know if there's an easy way to ask ES: given this query, what's the max score anything could give me?
Then I could say that this result is a 90% match to your query and this one is a 50% match, rather than that this one scored 1.746373.
I'd rather not double-check each result against the search to work this out.

Sphinx - How to index only a limited number of words?

I have a limited number of industries (around 300), and I would like to create an index that gives the frequency of these keywords in the indexed documents. Is there any way to do this in Sphinx?
Not really.
But the --buildstops function of indexer will produce a list of the most common keywords in an index.
So you can just look at that output and compare it with your industry list. In theory, I would think your industries should be near the top of the list, so you don't have to make it too long.
There is a trick in Sphinx to get keyword statistics from the index. The BuildKeywords API call ( http://sphinxsearch.com/docs/current.html#api-func-buildkeywords ) with the hits flag set will return per-keyword frequencies from the given index.
Hope this helps

Hash Table and Substring Matching

I have hundreds of keys, for example:
redapple
maninred
foraman
blueapple
I have data related to these keys. Each data item is a string that has its related key at the end.
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
I am expected to use a hash table and a hash function to store the data by key, and to be able to retrieve the data from the table.
I know how to use a hash function and a hash table; there is no problem here.
But:
I am expected to give the program a string that occurs as a substring of the keys and retrieve the data for the matching keys.
For example:
If I give "red", I must get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
If I give "apple", I must get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
The only thing I can think of is to search all keys for a matching substring. Is there some other solution? If I search all the key strings for every query, hashing is unnecessary and meaningless, isn't it?
But searching all keys for a substring is O(N), and I am expected to solve the problem in O(1).
With hashing I can hash a key, e.g. "redapple" to 943 and "maninred" to 332.
If the user gives the query string "red", how can I find out from 943 and 332 that those keys contain the substring "red"? It is beyond my CS thinking skills.
Thanks for any advice or ideas.
Possibly you should use an inverted index over n-grams; the same approach is used for spelling correction. For the word redapple you will have the following set of 3-grams: red, eda, dap, app, ppl, ple. For each n-gram you keep a list of the strings that contain it. For example, for red it will be
red -> maninred, redapple
The words in each list must be kept sorted. When you want to find all strings that contain a given substring, you split the substring into n-grams and intersect the word lists of those n-grams.
This algorithm is not O(n), but in practice it is fast enough.
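A minimal sketch of that n-gram index, assuming 3-grams and using Python sets instead of sorted lists for the intersection. The function names are made up for this example. Two details are additions not spelled out above: queries shorter than n fall back to a plain scan, and candidates are re-checked with a substring test, because sharing every n-gram does not guarantee that the n-grams occur contiguously.

```python
from collections import defaultdict

N = 3  # n-gram length, an assumption for this sketch


def ngrams(text, n=N):
    """Return the set of all n-character substrings of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_ngram_index(keys, n=N):
    """Map every n-gram to the set of keys that contain it."""
    index = defaultdict(set)
    for key in keys:
        for gram in ngrams(key, n):
            index[gram].add(key)
    return index


def search_ngram_index(index, keys, query, n=N):
    """Return the sorted keys that contain `query` as a substring."""
    if len(query) < n:
        # Too short to form an n-gram: fall back to scanning every key.
        return sorted(key for key in keys if query in key)
    # Intersect the key sets of all n-grams of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in ngrams(query, n)))
    # Verify candidates, since the n-grams might not be adjacent in the key.
    return sorted(key for key in candidates if query in key)


keys = ["redapple", "maninred", "foraman", "blueapple"]
index = build_ngram_index(keys)
print(search_ngram_index(index, keys, "red"))    # ['maninred', 'redapple']
print(search_ngram_index(index, keys, "apple"))  # ['blueapple', 'redapple']
```

The matching keys can then be used for ordinary hash table lookups to fetch the data.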
It cannot be done nicely with a hash table. Given a substring, you cannot predict the hash of the entire string (1).
A reasonable alternative is a suffix tree. Each terminal in the suffix tree holds a list of references to the complete strings that its suffix belongs to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s such that t is a prefix of x. Traverse the suffix tree while reading t, then find all the terminals reachable from the node you reached. These terminals contain all the needed strings.
(1) Assuming a reasonable hash function. If hashCode() == 0 for every element, you can obviously predict the hash value.
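The suffix-based lookup might look like the sketch below. Note that it builds a plain, uncompressed suffix trie rather than a true suffix tree, which is simpler to write but uses memory quadratic in the key length; that trade-off and the names Node, build_suffix_trie and search_suffix_trie are choices made for this illustration.

```python
class Node:
    """One trie node; `keys` holds the keys that have a suffix ending here."""

    def __init__(self):
        self.children = {}
        self.keys = set()


def build_suffix_trie(keys):
    """Insert every suffix of every key and mark its terminal node with the key."""
    root = Node()
    for key in keys:
        for start in range(len(key)):
            node = root
            for ch in key[start:]:
                node = node.children.setdefault(ch, Node())
            node.keys.add(key)
    return root


def search_suffix_trie(root, query):
    """Return the set of keys that contain `query` as a substring."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return set()  # no suffix starts with the query
    # Collect the keys stored at every terminal reachable from this node.
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        found |= current.keys
        stack.extend(current.children.values())
    return found


root = build_suffix_trie(["redapple", "maninred", "foraman", "blueapple"])
print(sorted(search_suffix_trie(root, "red")))    # ['maninred', 'redapple']
print(sorted(search_suffix_trie(root, "apple")))  # ['blueapple', 'redapple']
```

Walking down the trie costs time proportional to the query length; the remaining cost is collecting the terminals below the node that was reached.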
I researched this problem recently and I'm sure that it cannot be done. Like you, I hoped a hash table would help me speed up searching, but I was disappointed.

Resources