How to query a parent/child relation using matching _version?

I have this join datatype:
"Review_Sentence": {
"type": "join",
"relations": {
"Review": "Sentence"
}
},
If I have a review v1 like this
review_v1
sentence1_v1
sentence2_v1
sentence3_v1
and later someone updates it and removes the last sentence
review_v2
sentence1_v2
sentence2_v2
then in Elasticsearch I still have sentence3_v1 referring to the same review, so the query will return something like this
review_v2
sentence1_v2
sentence2_v2
sentence3_v1
How can I make sure the child _version is the same as the parent _version? I tried to use an external _version, but Elasticsearch just gives me the latest _version for a given _id, regardless of whether the parent and child _version values match.
So far my workaround is to delete all children for that review and insert the new ones, but this introduces latency that I would like to get rid of.
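For reference, this is roughly what that delete-and-reinsert workaround looks like with the Python client (the index name, "text" field, and helper function are placeholders, not from my real code): delete every Sentence child of the review with a parent_id query, then bulk-index the new sentences routed to the same parent.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def replace_sentences(review_id, new_sentences, index="reviews"):  # assumed index name
    # 1. Delete every existing Sentence child of this review.
    es.delete_by_query(
        index=index,
        body={"query": {"parent_id": {"type": "Sentence", "id": review_id}}},
        refresh=True,
    )
    # 2. Bulk-index the new sentences, routed to the parent's shard.
    actions = (
        {
            "_index": index,
            "_routing": review_id,
            "_source": {
                "text": sentence,  # assumed field name
                "Review_Sentence": {"name": "Sentence", "parent": review_id},
            },
        }
        for sentence in new_sentences
    )
    helpers.bulk(es, actions, refresh=True)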

Related

Does a non-indexed field update trigger reindexing in Elasticsearch 8?

My index mapping is the following:
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "query_str": {"type": "text", "index": false},
      "search_results": {
        "type": "object",
        "enabled": false
      },
      "query_embedding": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
The search_results field is disabled. Actual search is performed only via query_embedding; the other fields are just non-searchable data.
If I update the search_results field in an existing document, will it trigger reindexing?
The docs say that "The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way". So it seems logical not to re-index docs if changes took place only in the non-indexed part, but I'm not sure.
Elasticsearch documents (Lucene segments) are immutable, so every change you make to a document will delete the document and create a new one. This is Lucene's behavior:
Lucene's index is composed of segments, each of which contains a subset of all the documents in the index, and is a complete searchable index in itself, over that subset. As documents are written to the index, new segments are created and flushed to directory storage. Segments are immutable; updates and deletions may only create new segments and do not modify existing ones. Over time, the writer merges groups of smaller segments into single larger ones in order to maintain an index that is efficient to search, and to reclaim dead space left behind by deleted (and updated) documents.
When you set enabled: false you just avoid having the field's content in the searchable structures, but the data still lives in Lucene.
You can see a similar answer here:
Partial update on field that is not indexed
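To make that concrete, here is a minimal sketch with the Python client (index name, document id, and field values are made up for illustration): even though search_results is never parsed or indexed, the partial update below still rewrites the whole Lucene document, which is why its _version keeps increasing.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Partial update of the enabled:false field (hypothetical index and id).
es.update(
    index="queries",
    id="q-42",
    body={"doc": {"search_results": {"hits": ["doc1", "doc7"]}}},
)

# The old Lucene document is marked deleted and a new one is indexed,
# so the version advances on every such update.
print(es.get(index="queries", id="q-42")["_version"])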

How to delete multiple documents by ID in elasticsearch?

I am trying to delete a short list of documents in one swoop on Elasticsearch 2.4, and I can't seem to give it a query that results in >0 documents getting deleted.
id_list = ["AWeKNmt5qJi-jqXwc6qO", "AWeKT7ULqJi-jqXwc6qS"] #example
# The following does not delete any document (despite these ids being valid)
delres = es.delete_by_query("my_index", doc_type="my_doctype", body={
    "query": {
        "terms": {
            "_id": id_list
        }
    }
})
If I go one by one, they get deleted just fine, which seems to point to my query being the problem.
for the_id in id_list:
    es.delete("my_index", doc_type="my_doctype", id=the_id)
I've also tried the ids query instead of terms, but that also does not delete anything.
es.delete_by_query(..., body={"query": {"ids": {"values": id_list}}})
What am I missing?
delete_by_query was deprecated in ES 1.5.3, removed in ES 2.0, and reintroduced in ES 5.0. From https://www.elastic.co/guide/en/elasticsearch/reference/1.7/docs-delete-by-query.html:
Delete by Query will be removed in 2.0: it is problematic since it silently forces a refresh which can quickly cause OutOfMemoryError during concurrent indexing, and can also cause primary and replica to become inconsistent. Instead, use the scroll/scan API to find all matching ids and then issue a bulk request to delete them.
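Since the ids are already known here, the scroll/scan step can be skipped and a single bulk request with delete actions does the job on 2.4. A minimal sketch with the 2.x Python client (index and type names taken from the question):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
id_list = ["AWeKNmt5qJi-jqXwc6qO", "AWeKT7ULqJi-jqXwc6qS"]

# One bulk request containing a delete action per id.
actions = (
    {
        "_op_type": "delete",
        "_index": "my_index",
        "_type": "my_doctype",
        "_id": doc_id,
    }
    for doc_id in id_list
)
helpers.bulk(es, actions)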

Group by field in found document

The best way to explain what I want to accomplish is by example.
Let us say that I have an object with fields name, color, and transaction_id. I want to search for documents where name and color match the specified values, and that I can accomplish easily with boolean queries.
But I do not want only the documents found by the search query. I also want the transaction to which those documents belong, which is specified by transaction_id. For example, if a document has been found with transaction_id equal to 123, I want my query to return all documents with transaction_id equal to 123.
Of course, I can do that with two queries: the first fetches all documents that match the criteria, and the second returns all documents that have one of the transaction_id values found in the first query.
But is there any way to do it in a single query?
You can use a parent-child relationship between transactions and your objects, or denormalize your data to nest the objects inside the transactions. Otherwise you'll have to do an application-side join, meaning two queries.
Try an index mapping similar to the following, and set the parent id when indexing the objects.
{
  "mappings": {
    "transaction": {},
    "object": {
      "_parent": {
        "type": "transaction"
      }
    }
  }
}
Further reading:
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
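With that mapping a single query becomes possible: search the object type with a has_parent query that wraps a has_child query, so every object whose transaction has at least one child matching name and color is returned, siblings included. A hedged sketch (index name and field values are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "has_parent": {
            "parent_type": "transaction",
            "query": {
                # Transactions with at least one child matching the criteria...
                "has_child": {
                    "type": "object",
                    "query": {
                        "bool": {
                            "must": [
                                {"term": {"name": "some-name"}},
                                {"term": {"color": "red"}},
                            ]
                        }
                    }
                }
            }
        }
    }
}

# ...and all objects belonging to those transactions come back in one search.
res = es.search(index="my_index", doc_type="object", body=query)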

ElasticSearch results aren't relevant

In Elasticsearch, I've created two documents with one field, "CategoryMajor":
In doc1, I set CategoryMajor to "Restaurants"
In doc2, I set CategoryMajor to "Restaurants Restaurants Restaurants Restaurants Restaurants"
If I perform a search for CategoryMajor:Restaurants, doc1 shows up as MORE RELEVANT than doc2. That is not typical Lucene behavior, which gives more relevance the more times a term appears; doc2 should be MORE RELEVANT than doc1.
How do I fix this?
You can add &explain=true to your GET query to see that score of doc2 is lowered by "fieldNorm" factor. This is caused by default lucene similarity calculation formula, which lowers score for longer documents. Please read this document about default lucene similarity formula:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/search/Similarity.html
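For example, with the Python client the same explanation can be requested like this (index and type names are placeholders); each hit then carries an _explanation tree in which the fieldNorm factor is visible:

from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.search(
    index="my_index",          # placeholder index name
    doc_type="my_type",        # placeholder type name
    q="CategoryMajor:Restaurants",
    explain=True,              # adds an _explanation block to every hit
)
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
    # hit["_explanation"] breaks the score down, including the fieldNorm factor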
To disable this behaviour, add "omit_norms": "true" for the CategoryMajor field to your index mapping by sending a PUT request to:
http://localhost:9200/index/type/_mapping
with request body:
{
  "type": {
    "properties": {
      "CategoryMajor": {
        "type": "string",
        "omit_norms": "true"
      }
    }
  }
}
I'm not certain, but it may be necessary to delete your index, create it again, put the above mapping, and then reindex your documents. Reindexing after changing the mapping is definitely necessary :).
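A rough sketch of that delete / recreate / reindex cycle with the Python client (index and type names and the sample documents are made up; in practice the documents would come from your primary datastore or a scan of the old index):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

mapping = {
    "mappings": {
        "type": {
            "properties": {
                "CategoryMajor": {"type": "string", "omit_norms": "true"}
            }
        }
    }
}

docs = [
    {"CategoryMajor": "Restaurants"},
    {"CategoryMajor": "Restaurants Restaurants Restaurants Restaurants Restaurants"},
]

es.indices.delete(index="my_index", ignore=404)    # drop the old index if it exists
es.indices.create(index="my_index", body=mapping)  # recreate it with omit_norms
helpers.bulk(
    es,
    ({"_index": "my_index", "_type": "type", "_source": d} for d in docs),
)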

How can I query/filter an elasticsearch index by an array of values?

I have an elasticsearch index with numeric category ids like this:
{
  "id": "50958",
  "name": "product name",
  "description": "product description",
  "upc": "00302590602108",
  "categories": [
    "26",
    "39"
  ],
  "price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.
As you already discovered, a terms query would work. I would suggest a terms filter though, since filters are faster and cacheable.
Facets won't limit results, but they are excellent tools. They count hits for specific terms within your total results, and can be used for faceted navigation.
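A minimal sketch of that terms filter using the old filtered-query syntax this era of Elasticsearch expects (the index name is an assumption); the filter is cached, so repeated category lookups stay cheap:

from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"terms": {"categories": ["26", "39"]}},
        }
    }
}

res = es.search(index="products", body=body)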
