Get the most frequent terms of text field - elasticsearch

How do I get a list of all individual tokens of a text field along with their document frequency? I want this in order to build a domain-specific list of frequent (and therefore useless) stop words.
This question covers all the methods I found so far, but:
the "keyword" data type is not an option because I'm interested in individual terms (so tokenisation is necessary)
the "significant terms" aggregation is not an option because I'm interested in the most frequent, not the most significant, terms
term vectors are not an option because I need this for the whole index, not just a particular document or a small subset.

You will have to enable fielddata on your field to do this.
But be careful: it can significantly increase heap memory usage.
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
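For illustration, here is a minimal sketch of that approach, assuming an index named my_index with an analyzed text field named body (both names are hypothetical): enable fielddata on the field, then run a terms aggregation over it; each bucket's doc_count is that token's document frequency.

import requests

ES = "http://localhost:9200"

# Enable fielddata on the analyzed text field (heap-hungry, see the link above).
requests.put(f"{ES}/my_index/_mapping", json={
    "properties": {"body": {"type": "text", "fielddata": True}}
})

# Terms aggregation over the tokenized field: most frequent tokens + document frequency.
resp = requests.post(f"{ES}/my_index/_search", json={
    "size": 0,
    "aggs": {"top_tokens": {"terms": {"field": "body", "size": 100}}}
})
for bucket in resp.json()["aggregations"]["top_tokens"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])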

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits from re-indexing entire documents, even though only a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes: since Elasticsearch segments are internally append-only, every update effectively creates a new copy of the document and marks the old one as deleted, to be cleaned up later during segment merges.
Parent-child (a.k.a. join) or nested fields can help with this, but they come with a significant performance cost of their own, so your search performance will probably suffer.
Another approach is to use external fast storage like Redis (as described in this article), though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is: all use cases are different, and you should carefully benchmark all feasible options.
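To make the parent-child option concrete, here is a hedged sketch (all index, field and relation names are my own, not from the question) of a join-field mapping where the frequently flipped boolean lives in a tiny child document, so flipping it re-indexes only the child rather than the large parent:

import requests

ES = "http://localhost:9200"

# Join mapping: "item" parents hold the big, rarely changed fields;
# "flag" children hold only the frequently updated boolean.
requests.put(f"{ES}/items", json={
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"item": "flag"}},
            "title":    {"type": "text"},
            "active":   {"type": "boolean"}
        }
    }
})

# Large parent document, indexed once.
requests.put(f"{ES}/items/_doc/1", json={"relation": "item", "title": "..."})

# Tiny child document, updated often; it must be routed to the parent's shard.
requests.put(f"{ES}/items/_doc/1-flag", params={"routing": "1"},
             json={"relation": {"name": "flag", "parent": "1"}, "active": True})

The trade-off mentioned above shows up at query time: filtering parents by the flag requires a has_child query, which is where the extra search cost comes from.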

How can we design system for document search?

I was recently asked a system design question where I needed to "design a system for document search", and the first thing that came to mind was how Elasticsearch works. So I came up with the inverted-index approach that is used to support text search. The inverted index has a record for each term. Each record has the list of documents that the term appears in. Documents are identified by an integer document ID, and the list of document IDs is sorted in ascending order.
So I said something along those lines, but I am not sure whether this is how it should work in a distributed fashion, because we may have a lot of documents to index, so we need some load balancer, and somehow we need to partition the inverted index or the data. In other words: which process uploads the documents, and which process tokenizes them (is it just one machine or a bunch of machines)? Basically I wanted to understand the right way to design a system like this with the proper components. How should we talk about this in a system design interview, and what are the things we should touch on for this problem?
What is the right way to design a system for document search in a distributed fashion, with the right components?
OK... it is a vast subject. Elasticsearch was built for exactly that, but so was Google, and from Elasticsearch to Google Search there is a technological gap.
If you go for your own implementation it is still possible, but there is a ton of work to do to be as efficient as Elasticsearch. The quick answer is: use Elasticsearch.
You may be curious, or you may need to write it yourself for some reason. So here is how it works:
TF-IDF and cosine distance
As you said, first you will tokenize,
then you will represent the tokenized text as a vector and measure the angular distance between the text and the search words.
Imagine your language has only 3 words: "foo, bar, bird".
Then a text with "foo bar bird" can be represented by the vector [1,1,1].
A text with
A) "foo foo foo foo bird" will be [4,0,1]
and another with
B) "foo bar" will be [1,1,0].
If you search for "bar", which is represented by [0,1,0], you will look for the text with the minimal angular distance to it. The angle between your search and B is 45°, which is smaller than the 90° angle with A, so B is the better match.
Actually a language has far more than 3 words, so you will compute the distance in a vector space with many more dimensions, because 1 word = 1 dimension. :)
TF-IDF stands for term frequency–inverse document frequency.
It weights the frequency of a word in a document by the inverse of the frequency of that word across all documents. What it does is highlight the important words in a document.
Let's explain that:
words like "that", "in", "a", "the", etc. appear everywhere, so they are not important
across all of your texts the word "corpus" has a frequency of, say, 0.0001%
in one particular text it is cited 5 times and has a frequency of 0.1%
so it is quite rare in the corpus but, by comparison, quite important in that text
so when you search for "corpus", what you want is to get first the texts where it appears those 5 times.
So instead of a vector of raw occurrence counts you will have a vector of relative occurrence frequencies, for example [0.2, 1, 0.0005].
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
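As a toy illustration of the idea (my own sketch, not how Lucene actually implements scoring), here is TF-IDF weighting plus cosine similarity over the tiny "foo / bar / bird" corpus from above:

import math
from collections import Counter

# Toy TF-IDF + cosine similarity over the two example texts A and B.
docs = {
    "A": "foo foo foo foo bird".split(),
    "B": "foo bar".split(),
}
vocab = sorted({w for d in docs.values() for w in d})

def idf(word):
    df = sum(1 for d in docs.values() if word in d)
    return math.log(len(docs) / df) + 1.0  # +1 keeps words present in every doc from zeroing out

def tfidf_vector(tokens):
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = tfidf_vector("bar".split())
for name, tokens in docs.items():
    print(name, round(cosine(query, tfidf_vector(tokens)), 3))
# B scores higher than A for the query "bar", matching the angles discussed above.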
I have a little surprise for you at the end: this is what is behind
the scoring in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
In addition, Elasticsearch gives you replication, scalability, distribution and anything else you could dream of.
There are still reasons not to use Elasticsearch:
research purposes
you are actually writing another search engine like DuckDuckGo (why not)
you are curious
The scaling and distribution part
Read about Lucene first: https://en.wikipedia.org/wiki/Apache_Lucene
It depends greatly on the volume of text to be indexed and on the vocabulary.
If it only represents 1M of text, you don't need to distribute it.
You will need a distributed inverted index if you index something big like Wikipedia (Wikipedia uses Elasticsearch for its search box).
Say "foo" appears in texts A, B, C and R, so I will partition my index (a toy sketch follows below).
I will use a distributed cache with the words as keys and a list of pointers to the vectors as values, and I would store the values in memory-mapped files.
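As a toy sketch of what "partition my index" could mean (entirely my own illustration, not Elasticsearch's actual sharding), terms can be hashed to shards and each shard keeps the posting lists for its terms:

from collections import defaultdict

NUM_SHARDS = 4
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]  # one term -> doc-ID map per shard

def shard_for(term):
    return hash(term) % NUM_SHARDS

def index_doc(doc_id, text):
    # Each distinct term of the document goes to the shard that owns it.
    for term in set(text.lower().split()):
        shards[shard_for(term)][term].append(doc_id)

def postings(term):
    return shards[shard_for(term)].get(term, [])

index_doc(1, "foo bar bird")
index_doc(2, "foo foo bird")
print(postings("foo"))  # [1, 2] -- both documents contain "foo"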
A search engine is something that must be fast, so if you do it yourself you will keep external libraries to a minimum and probably use C++.
At Google they ended up in a situation where the vectors took so much space that they needed to store them on multiple machines, so they invented GFS, a distributed file system.
Cosine distance computations between multidimensional vectors are time-consuming, so I would go for computation on the GPU, because GPUs are efficient for floating-point operations on matrices and vectors.
Actually it is a little bit crazy to reimplement all of that unless you have a good reason to do so, for example a very good business model. :)
I would probably use Kubernetes, Docker and Mesos to virtualise all my components, and I would look for something similar to GFS if high volume is needed.
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
You will need to get the text back, so I would use any NIO web server that scales, in any language: nginx to serve the static pages and something like Netty or Vert.x to handle the search and build the response of links to the texts (it depends on how many users you want to serve per second).
All of that only if I plan to index something bigger than Wikipedia, and if I plan to invent something better than Elasticsearch (a hard task, good luck).
For example, Wikipedia is less than 1 TB of text.
Finally:
If you do it with Elasticsearch, you are done and in production in one week, maybe two. If you do it yourself, you will need at least one senior developer, a data scientist and an architect, and something like a year or more depending on the volume of text you want to index. I can't stop asking myself "what is it for?".
Actually, if you read the source code of Lucene you will know exactly what you need. They have done it: Lucene is the engine of Elasticsearch.
Twitter uses Lucene for its real-time search.

Is having empty fields bad for lucene index?

The ES docs on mappings state the following:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.
I'm wondering how strictly should I take this.
Say I have three types of documents, with each sharing same 60-70% of fields and the rest being unique to each type.
Should I put each type in a separate index?
Or would one single index be fine as well, meaning there won't be a lot of wasted storage or any noticeable performance hit on search or index operations?
Basically I'm looking for any information to either confirm or disprove the quote above.
If your types overlap by 60-70%, ES will be fine; that does not sound 'mutually exclusive' at all. Note that:
Things will improve in future versions of ES.
If you don't need them, you can disable norms and doc_values, as recommended here.
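As an illustration of that second point, a hedged mapping sketch (the index and field names are hypothetical) that disables norms on a sparse text field whose length-normalized scoring you don't need, and doc_values on a sparse keyword field you never sort or aggregate on:

import requests

ES = "http://localhost:9200"

requests.put(f"{ES}/my_index", json={
    "mappings": {
        "properties": {
            # sparse text field, not used for length-normalized scoring
            "type_a_notes": {"type": "text", "norms": False},
            # sparse keyword field, never sorted or aggregated on
            "type_b_code":  {"type": "keyword", "doc_values": False}
        }
    }
})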

In elasticsearch, is there some method to reduce the importance of a set of search terms?

Ideally, I would like to reduce the importance of certain words such as "store", "shop", "restaurant".
I would like "Jimmy's Steak Restaurant" to be about as important as "Ralph's Steak House" when a user searches for "Steak Restaurant". I hope to accomplish this by severely diminishing the importance of the word "restaurant" (along with 20-50 other words).
Stop words work well for some words, such as "a", "the", "of", etc, but they are all-or-nothing.
Is there a way to provide a weighting or boost value per word at the index or mapping level?
I can probably accomplish this at the query level, but that could be very bad if I have 50 words whose impact I need to reduce.
This was a generalized example. In my actual complex solution, I really do need to reduce the impact of quite a few search terms.
I don't believe it is possible to specify a term-level boost during indexing. In this thread, Shay mentions that it is possible in Lucene, but that it's a tricky feature to surface through the API.
Another relevant thread, suggesting the same thing. Shay recommends trying to sort it out using a custom_score query:
I think that you should first try and solve it on the search side. If you know the weights when you do a search, you can either construct a query that applies different boosts depending the tag, or use custom_score query.
Custom_score query is slower than other queries, but I suggest you run and check if it's ok for you (with actual data, and relevant index size). The good thing is that if its slow for you (and slow here means both latency and QPS under load), you can always add more replicas and more machines to separate the load.
Here is an example of a custom_score query that boosts on a somewhat-similar term level (except it's for a special field that only has one category term, so this may not apply). It might be easier to break the script out into a native script, instead of using mvel, since you'll have a big list of words.
As an alternative, perhaps add a synonym token filter that interchanges words like "shop", "restaurant", "store", etc?
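To sketch that last alternative (the index, field, filter and analyzer names are my own), a synonym token filter can collapse the low-value words into a single token so they all carry the same, diluted weight:

import requests

ES = "http://localhost:9200"

# Analyzer that rewrites "restaurant", "store", "shop", "house" to one shared token.
requests.put(f"{ES}/places", json={
    "settings": {
        "analysis": {
            "filter": {
                "venue_synonyms": {
                    "type": "synonym",
                    "synonyms": ["restaurant, store, shop, house => venue"]
                }
            },
            "analyzer": {
                "venue_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "venue_synonyms"]
                }
            }
        }
    },
    "mappings": {
        "properties": {"name": {"type": "text", "analyzer": "venue_analyzer"}}
    }
})

Because every business name then contains the same "venue" token, its document frequency is high and its contribution to the score correspondingly low, which is roughly the dampening effect asked for.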

Payload performance in Lucene

I know there are several topics on the web, as well as on SO, regarding indexing and query performance within Lucene, but I have yet to find one that discusses whether (and if so, by how much) creating payloads will affect query performance...
Here's the scenario ...
Let's say I want to index a collection of documents (anywhere from 100K - 10M), and each document has a subsection that I want to be able to search separately (or perhaps rank higher, depending on whether a match was found within that section).
I'm considering adding a payload (during indexing) to any term that appears within that subsection, so I can efficiently make that determination at query-time.
Does anyone know of any performance issues related to using payloads, or even better, could you point me to any online documentation about this topic?
Thanks!
EDIT: I appreciate the alternative solutions to my scenario, but in case I do need to use payloads in the future, does anyone have any comments regarding the original question about query performance?
The textbook solution to what you want to do is to index each original document as two fields: one for the full document, and the other for the subsection. You can boost the subsection field separately, either during indexing or during retrieval.
Having said that, you can read about Lucene payloads here: Getting Started with Payloads.
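For what it's worth, here is how the two-field idea looks at query time expressed in Elasticsearch terms (which wraps Lucene; the field names and boost factor are hypothetical): the full text and the subsection are indexed as separate fields and the subsection is boosted when searching.

import requests

ES = "http://localhost:9200"

# Query both fields; matches in the subsection count three times as much.
resp = requests.post(f"{ES}/docs/_search", json={
    "query": {
        "multi_match": {
            "query": "payload performance",
            "fields": ["full_text", "subsection^3"]   # ^3 boosts subsection matches
        }
    }
})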
Your use case doesn't fit well with the purpose of payloads -- it looks to me that any payload information would be redundant.
Payloads are attached to individual occurrences of terms in the document, not to document/term pairs. In order to store and access payloads, you have to use the offset of the term occurrence within the document. In your case, if you know the offset, you should be able to calculate which section the term occurrence is in, without using payload data.
The broader question is the effect of payloads on performance. My experience is that when properly used, the payload implementation takes up less space and is faster than whatever workaround I was previously using. The biggest impact on disk space will be wherever you currently use Field.setOmitTermFreqAndPositions(true) to reduce index size. You will need to include positions to use payloads, which potentially makes the index much larger.
