How do I dump out the term dictionary from Elasticsearch?
I just want to look for pathological cases in indexing, so I want to see what is actually getting put into the index.
These are text fields, so I can't just do a terms aggregation.
_termvectors only works for single or multiple documents. I want top terms in the index, like the terms component in Solr.
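For reference, this is the kind of per-document call I mean, sketched with elasticsearch-py (index, document id, and field names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# _termvectors only returns the terms of ONE document at a time.
tv = es.termvectors(index="my-index", id="1", fields="my_text_field",
                    term_statistics=True)
for field, data in tv["term_vectors"].items():
    for term, stats in data["terms"].items():
        # term_freq is per document; doc_freq/ttf are index-wide statistics
        print(field, term, stats["term_freq"], stats.get("ttf"))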
I am using Elasticsearch 6.2, and I have some queries that analyze a massive number of documents. I am sorting on one field inside the index. Elasticsearch examines 10,000 documents (the default configuration value) and then returns them paginated.
I tried to read the documentation, but I cannot find any information on whether the database applies the sorting before or after the analysis of the documents in the index.
In other words, is the sort applied directly to the analyzed (indexed) values, or are the documents sorted after analysis? If the latter is correct, what kind of sort does Elasticsearch apply during the scan?
Thanks a lot.
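For reference, the kind of request I am running looks roughly like this with elasticsearch-py (index and field names are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-index",
    body={
        "query": {"match_all": {}},
        "sort": [{"my_field": {"order": "asc"}}],  # the field I sort on
        "from": 0,    # pagination offset
        "size": 100,  # page size; from + size is capped at 10,000 by default
    },
)
print(resp["hits"]["total"])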
Sorting, aggregations, and access to field values in scripts requires
a different data access pattern. Instead of looking up the term and
finding documents, we need to be able to look up the document and find
the terms that it has in a field.
This quote from the Elasticsearch reference documentation implies to me that sorting happens at the non-analyzed level, but I also decided to double-check and run some tests on it.
In Elasticsearch we can sort on non-analyzed fields, e.g. keyword fields. Those fields use doc_values for sorting, and after testing I can say that sorting uses the original, pre-analysis values, ordered by character codes (numbers, then uppercase letters, then lowercase letters).
It's also possible to sort on text fields, with some caveats and tuning (e.g. you need to enable fielddata, since text fields do not support doc_values).
In this case the documents are sorted according to the analyzed values. Of course a lot depends on the analysis pipeline, since it can transform the text in various ways (a minimal sketch of both cases follows the quote below). Also, just as a reminder:
Fielddata can consume a lot of heap space, especially when loading
high cardinality text fields. Once fielddata has been loaded into the
heap, it remains there for the lifetime of the segment. Also, loading
fielddata is an expensive process which can cause users to experience
latency hits. This is why fielddata is disabled by default.
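Here is that minimal sketch, assuming a 7.x-style (typeless) mapping and the elasticsearch-py client; index and field names are invented, and enabling fielddata is only for demonstration:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="sort-test",
    body={
        "mappings": {
            "properties": {
                # keyword field: sorted via doc_values (original values)
                "title_keyword": {"type": "keyword"},
                # text field: sorting requires fielddata (analyzed terms)
                "title_text": {"type": "text", "fielddata": True},
            }
        }
    },
)

# Sorting on the keyword field uses the original, non-analyzed value.
es.search(index="sort-test", body={"sort": [{"title_keyword": "asc"}]})

# Sorting on the text field uses the analyzed terms loaded into fielddata.
es.search(index="sort-test", body={"sort": [{"title_text": "asc"}]})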
I'm wondering if Elasticsearch has an aggregation that returns whether or not a field has any terms (i.e. whether the terms aggregation would return an empty list or not). Obviously I can use a terms aggregation and check if there are any results, but I was wondering if there is something cheaper.
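For reference, the check I have in mind is roughly the following sketch with elasticsearch-py (index and field names are invented); I'm asking whether something cheaper exists:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-index",
    body={
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "any_terms": {"terms": {"field": "my_keyword_field", "size": 1}}
        },
    },
)

# An empty bucket list means the field has no terms for the matched documents.
has_terms = len(resp["aggregations"]["any_terms"]["buckets"]) > 0
print(has_terms)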
I have a corpus of documents indexed. I also stored the term vectors when indexing. Now I want to retrieve the term vectors of all documents satisfying some filtering options.
I was able to get the term vector for a single document, or for a set of documents, by providing the document IDs. But is there a way to get term vectors for all documents without providing document IDs?
Eventually what I want to do is to get the frequency counts of all the terms in a field, for all documents in an index (i.e., a bag of words matrix).
I am using elasticsearch-py as a client.
Appreciate any pointers. Thanks!
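For reference, this is roughly what already works for me when I do know the document IDs, sketched with elasticsearch-py (index and field names are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.mtermvectors(
    index="my-index",
    ids=["1", "2", "3"],      # this is the part I would like to avoid
    fields="my_text_field",
    term_statistics=True,
)

for doc in resp["docs"]:
    terms = doc["term_vectors"]["my_text_field"]["terms"]
    # per-document term frequencies, i.e. one row of the bag-of-words matrix
    print(doc["_id"], {t: s["term_freq"] for t, s in terms.items()})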
It seems that in elastic search you would define an index on a collection, whereas in a relational DB you would define your index on a column. If the entire collection is indexed, why does it need to be defined?
There is an unfortunate usage of the word "index", which means slightly (edit: VERY) different things in ES and in relational databases, as they are optimized for different use cases.
An "index" in a database is a secondary data structure which makes WHERE queries and JOINs fast, and it typically stores values exactly as they appear in the table. You can still have columns which aren't indexed, but then WHEREs require a full table scan, which is slow on large tables.
An "index" in ES is actually a schematic collection of documents, similar to a database in the relational world. You can have different "types" of documents in ES, quite similar to tables in databases. ES gives you the flexibility of defining, for each field of a document, whether you want to be able to retrieve it, search by it, or both. Some details on these options can be found, for example, here; this is also related to the _source field (the original JSON which was submitted to ES).
ES uses an inverted index to efficiently find matching documents, but most importantly it typically "normalizes" strings into tokens so that accurate free-text search can be performed. For example, sentences might be split into individual words, words are normalized to lower case, etc., so that searching for "holland" would match the text "Vacation at Holland 2015".
If a field does not have an inverted index, you cannot perform any searching on it (unlike in databases, where a full table scan is still possible). Interestingly, you can also define fields so that you can search on them but cannot retrieve them back; this is mainly beneficial when minimizing disk and RAM usage is important.
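To make these options concrete, here is a minimal, hypothetical mapping sketch using elasticsearch-py (7.x-style, typeless mappings; all index and field names are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="articles",
    body={
        "mappings": {
            # Searchable but not retrievable: "secret_tag" is indexed, but it is
            # excluded from the stored _source, so it cannot be read back.
            "_source": {"excludes": ["secret_tag"]},
            "properties": {
                "body": {"type": "text"},     # analyzed full-text search
                "tag": {"type": "keyword"},   # exact-value search/filter/sort
                "secret_tag": {"type": "keyword"},
                # Retrievable but not searchable: no inverted index is built.
                "notes": {"type": "keyword", "index": False},
            },
        }
    },
)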
Elasticsearch is by design a search engine, and is generally not preferred as primary storage the way SQL Server or MongoDB are.
Why is the entire collection indexed?
Elasticsearch internally uses a structure called an inverted index, which stores each field's (column's) values for searching.
If the field contains a string, it will tokenize it and apply filters such as lowercasing or uppercasing.
Either way, you can only find data that is present in the inverted index.
So by default Elasticsearch indexes all fields to make them available/searchable to you.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
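As a quick illustration of that tokenize-and-filter step, the _analyze API shows what an analyzer produces; sketched with elasticsearch-py, and the analyzer and sample text are just examples:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run a sample sentence through the standard analyzer to see the tokens
# that would end up in the inverted index.
resp = es.indices.analyze(
    body={"analyzer": "standard", "text": "The QUICK Brown Fox"}
)
print([t["token"] for t in resp["tokens"]])
# -> ['the', 'quick', 'brown', 'fox']  (split into words and lowercased)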
This is not like adding an index in a relational DB.
In a relational DB you already have all the data available, and what you need is to index the most-used columns for quicker lookups.
But a relational DB is far less efficient at finding all the rows that contain part of a given word (i.e., searching for a word).
I'll refer to:
"It seems that in elastic search you would define an index on a
collection"
In Elasticsearch, an index is like a database in the relational world.
The index contains multiple documents, just like a relational database contains tables.
So far, this is very clear.
In order to manage large amounts of data, Elasticsearch (a distributed database by nature) breaks each index into smaller chunks called shards, which are distributed across the Elasticsearch nodes.
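For example, the shard count is fixed when an index is created; a small sketch with elasticsearch-py (the index name and numbers are arbitrary):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="my-index",
    body={
        "settings": {
            "number_of_shards": 3,    # the index is split into 3 shards
            "number_of_replicas": 1,  # plus one replica copy of each shard
        }
    },
)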
The confusion starts with the fact that shards are data structures based on the Apache Lucene library.
Apache Lucene's index falls into the family of indexes known as inverted indexes.
It is called an "inverted index" because it lists, for each term, the documents that contain it:
Term         Documents               Frequency
Brasil       doc_id_1, doc_id_8      4 (2 in doc_id_1, 2 in doc_id_8)
Argentina    doc_id_1, doc_id_6      3 (2 in doc_id_1, 1 in doc_id_6)
So, as you can see above, this structure stores statistics (frequencies) about terms in order to make term-based search more efficient.
(*) This is an inverse (Term -> Document) of the natural relationship, in which documents list terms (Document -> Terms).
Summary:
1) Elasticsearch index:
There are 2 different usages of the word "index".
One is quite trivial: an index is like a database.
The other is confusing: shards are based on a data structure called an "inverted index".
2) Relational database index:
A structure associated with a table or view that speeds up retrieval of rows from that table or view.
I have a large database of annotations of images stored in an Elasticsearch database. I want to use this database for keyword extraction. The input is a text (typically a newspaper article). My basic idea for an algorithm is to go through each term in the article and use Elasticsearch to discover how frequent the term is in the image annotations, then output the terms from the article that are not frequent (in order to prefer names of people or places over common English words).
I don't need anything very sophisticated, since these keywords are only used as suggestions for user input, but I want something faster than issuing N search queries (where N is the number of terms in the text) to Elasticsearch, which can be slow for large texts. Is there a robust and fast technique for keyword extraction in Elasticsearch?
You can use Elasticsearch terms aggregations for this. They can return bucketed keywords with document counts, which indicate their relative frequency. Here is an example query in YML.
query:
  match:
    annotation:
      query: text of your article
aggregations:
  term_frequencies:
    terms:
      field: annotation
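If you are using elasticsearch-py, as mentioned in one of the questions above, the same request might look roughly like this (the index name is assumed); note that a terms aggregation on a text field such as annotation generally requires fielddata to be enabled on that field:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="annotations",  # assumed index name
    body={
        "query": {"match": {"annotation": {"query": "text of your article"}}},
        "aggregations": {
            "term_frequencies": {"terms": {"field": "annotation"}}
        },
    },
)

# Each bucket is a term with the number of matching documents containing it.
for bucket in resp["aggregations"]["term_frequencies"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])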