Searching for a field in AWS ElasticSearch - elasticsearch

After indexing DynamoDB (ddb) records into Elasticsearch, a simple search such as /_search?q=test returns hits like this:
"hits": [
{
// ignore other fields ...
"_id": "z0YdS3I",
"_source": {
"M": {
"name": {
"S": "test name"
},
"age": {
"N": "18"
},
// ignore other fields ...
}
}
},
....
]
However, when I search for a specific field, e.g. /_search?q=name:test, I get zero hits. This happens with every field.
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
So instead I have to search like this: /_search?q=M.name.S:test, which is a bit cumbersome. Is there a cleaner way to search for a field? Maybe I'm missing some configuration during the indexing step?

You could try this:
First define mappings for your index according to your requirements, for example:
"properties": {
"name": { "type": "text" },
"age": { "type": "integer" }
}
and so on for the remaining fields.
Then check that the mapping was applied properly using the /_mapping API. Once the data types are what you want, start indexing data.
Details on mappings: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

I found out I could use the DynamoDB Converter provided by the AWS SDK to convert back and forth between a JavaScript object and its equivalent DDB AttributeValue type. That way I can index a document in the right mapping and access it through its normal fields.
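To illustrate what that conversion does, here is a minimal hand-rolled sketch in Python. It is not the AWS SDK Converter itself, only an approximation of the unmarshalling step, and it assumes that just the S, N, BOOL, NULL, L and M type tags occur in the records. The function name unmarshal is made up for this example.

```python
from decimal import Decimal


def unmarshal(attr):
    """Convert one DynamoDB AttributeValue dict into a plain Python value.

    Assumption: only the S, N, BOOL, NULL, L and M type tags are handled.
    """
    # An AttributeValue is a single-entry dict such as {"S": "test name"}.
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        # DynamoDB transports numbers as strings.
        number = Decimal(value)
        return int(number) if number == number.to_integral_value() else float(number)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(item) for item in value]
    if type_tag == "M":
        return {key: unmarshal(item) for key, item in value.items()}
    raise ValueError(f"unsupported AttributeValue type: {type_tag}")


# The record shape shown in the question's _source.
record = {"M": {"name": {"S": "test name"}, "age": {"N": "18"}}}
document = unmarshal(record)
print(document)
```

Indexing the flattened document instead of the raw record should make a query like /_search?q=name:test address the field directly, without the M and S path segments.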

Related

Retrieve selected fields with Spring Data Elasticsearch by using fields option, not by source filtering

We're trying to migrate our OpenSearch cluster to Elasticsearch.
We've been using Spring Data Elasticsearch with OpenSearch, using the fields option to retrieve selected fields
(BaseQueryBuilder#withFields along with BaseQueryBuilder#withSourceFilter).
Sample code:
NativeSearchQueryBuilder queryBuilder =
new NativeSearchQueryBuilder().withQuery(query)
.withFields("someId")
.withSourceFilter(new FetchSourceFilter(new String[] {}, new String[] {"*"}));
This was working with Spring Data 4.2.X and OpenSearch 1.2.4.
However, with Spring Data 4.4.X and Elasticsearch 8.3, the SearchHit's content field does not contain the given fields.
What I want to achieve is similar to this query:
GET some_index/_search
{
"query": {
"match_all": {}
},
"fields": [
"someId"
],
"_source": false
}
Attempts so far:
1.
NativeSearchQueryBuilder queryBuilder =
new NativeSearchQueryBuilder().withQuery(query)
.withFields("someId");
No luck. It's as if this parameter is ignored; the response contains all the fields of the documents.
2.
NativeSearchQueryBuilder queryBuilder =
new NativeSearchQueryBuilder().withQuery(query)
.withSourceFilter(new FetchSourceFilter(new String[] {"someId"}, null));
It works. However, the official ES documentation states:
Using fields is typically better
These options are usually not required. Using the fields option is
typically the better choice, unless you absolutely need to force
loading a stored or docvalue_fields.
So is source filtering worse than the fields option performance-wise?
Is it possible to disable _source and get selected fields via the fields option with Spring Data Elasticsearch?
If it's not possible, we are considering using SearchSourceBuilder instead of Spring Data's NativeSearchQueryBuilder.
I cannot reproduce that. I just ran a sample application (set up using Spring Boot 2.7.0 with Spring Data Elasticsearch 4.4.2 and an Elasticsearch instance version 8.3.3).
My test entity class is named Foo and the property in this object I use is moreText.
The code to send the request:
var query=new NativeSearchQueryBuilder()
.withQuery(matchAllQuery())
.withFields("moreText")
.withSourceFilter(new FetchSourceFilter(new String[]{},new String[]{"*"}))
.build();
return operations.search(query, Foo.class);
I trace the calls to Elasticsearch with an intercepting proxy and see that this creates the following call:
{
"from": 0,
"size": 10,
"query": {
"match_all": {
"boost": 1.0
}
},
"version": true,
"explain": false,
"_source": {
"includes": [],
"excludes": [
"*"
]
},
"fields": [
{
"field": "more-text"
}
]
}
The returned answer is:
{
"took": 93,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "foo",
"_type": "_doc",
"_id": "42",
"_version": 1,
"_score": 1.0,
"_source": {},
"fields": {
"more-text": [
"More text!"
]
}
}
]
}
}
The returned answer is exactly what is wanted. This is mapped into the Java entity, see this screenshot from the debugger:
As you can see, Spring Data Elasticsearch also handles the case where the name of the entity's property (moreText) differs from the field name in Elasticsearch (more-text).
So there must be something different in your setup. Could it be that in your mapping the _source is disabled (or some fields?), see https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#disable-source-field
Can you provide a minimal sample application on Github that reproduces this behaviour?

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently from the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
"settings": {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"doc": {
"properties": {
"name": { "type": "keyword" },
"description": { "type": "text" }
}
}
}
}
Two sample documents:
POST /sample/doc
{
"name": "Jack Beauregard",
"description": "An aging hero"
}
POST /sample/doc
{
"name": "Master Splinter",
"description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
"query": {
"match": { "description": "hero" }
},
"_source": false
}
... which gives me:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.22396864,
"hits": [
{
"_index": "sample",
"_type": "doc",
"_id": "hoDsm2oB22SyyA49oDe_",
"_score": 0.22396864
},
{
"_index": "sample",
"_type": "doc",
"_id": "h4Dsm2oB22SyyA49xDf8",
"_score": 0.22227617
}
]
}
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of this), which would be 3 in this example, because the second document contains the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be information ES comes across during search anyway, right? I mean, it ranks the documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights. Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results to do this by setting size to a high enough value, whereas I actually only want to fetch the number of results requested by the client. This would be a lot of overhead!
The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it persists on inner hit level and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
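As a rough sketch of that idea, the following Python walks an explain tree and adds up the term-frequency leaves. The wording of the description strings is an assumption here: as far as I know, older versions label the node "termFreq=..." and newer ones "freq, occurrences of term within document", so check this against the output of your own cluster. The sample tree is hand-written and heavily trimmed for illustration. It is not real cluster output, and its values are placeholders.

```python
def sum_term_frequencies(explanation):
    """Recursively add up every term-frequency leaf found in an explain tree.

    Assumption: term-frequency nodes have a description starting with
    "termFreq=" (older versions) or "freq," (newer versions).
    """
    description = explanation.get("description", "")
    if description.startswith("termFreq=") or description.startswith("freq,"):
        return explanation.get("value", 0)
    return sum(sum_term_frequencies(child) for child in explanation.get("details", []))


# Hand-written, trimmed stand-in for the "explanation" of one hit (placeholder values).
sample_explanation = {
    "value": 1.0,
    "description": "weight(description:hero in 1), result of:",
    "details": [
        {
            "value": 1.0,
            "description": "score(doc=1), product of:",
            "details": [
                {"value": 1.0, "description": "idf, computed from:", "details": []},
                {
                    "value": 1.0,
                    "description": "tfNorm, computed from:",
                    "details": [
                        {"value": 2.0, "description": "termFreq=2.0", "details": []},
                        {"value": 1.2, "description": "parameter k1", "details": []},
                    ],
                },
            ],
        }
    ],
}

print(sum_term_frequencies(sample_explanation))
```

To get a total across a result page, the same function could be applied to the explanation of each hit returned by a search sent with "explain": true and the results summed. That still only covers the hits actually returned, and it carries the performance cost mentioned above.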

elasticsearch: copying meta-field _id to other field while creating document

I am using elasticsearch. I see there is meta-field _id for each document. I want to search document using this meta-field as I don't have any other field as unique field in document. But _id is a string and can have dashes which are not possible to search unless we add mapping for field as type :keyword. But it is possible as mentioned here. So now I am thinking to add another field newField in document and make it same as _id. One way to do it is: first create document and assign _id to that field and save document again. But this will have 2 connections which is not that good. So I want to find some solution to set newField while creating document itself. Is it even possible?
You can search for a document that contains dashes:
PUT my_index/tweet/testwith-
{
"fullname" : "Jane Doe",
"text" : "The twitter test!"
}
We just created a document with a dash in its id
GET my_index/tweet/_search
{
"query": {
"terms": {
"_id": [
"testwith-"
]
}
}
}
We search for the document that has the following id: testwith-
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "tweet",
"_id": "testwith-",
"_score": 1,
"_source": {
"fullname": "Jane Doe",
"text": "The twitter test!"
}
}
]
}
}
We found it. We can search for documents that have - in their id.
You could also use a set processor in an ingest pipeline to store the id in an additional field; see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/accessing-data-in-pipelines.html and https://www.elastic.co/guide/en/elasticsearch/reference/5.5/set-processor.html
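A minimal sketch of such a pipeline definition, built as a Python dict so it can be serialized and sent with any HTTP client. The pipeline name copy-id is hypothetical, and newField is the field name from the question. I would expect the {{_id}} template to resolve only when the id is supplied by the client in the index request; treat the behaviour with auto-generated ids as untested here.

```python
import json

# Hypothetical pipeline name; "newField" is the target field from the question.
PIPELINE_NAME = "copy-id"

pipeline_body = {
    "description": "Copy the document _id into a regular field",
    "processors": [
        # The set processor writes a templated value into the target field.
        {"set": {"field": "newField", "value": "{{_id}}"}}
    ],
}

# Body for: PUT _ingest/pipeline/copy-id
print(json.dumps(pipeline_body, indent=2))
```

Documents would then be indexed with the pipeline query parameter, for example PUT my_index/tweet/testwith-?pipeline=copy-id, so the field is filled in the same request that creates the document.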

Change the structure of ElasticSearch response json

In some cases, I don't need all of the fields in response json.
For example,
// request json
{
"_source": "false",
"aggs": { ... },
"query": { ... }
}
// response json
{
"took": 123,
"timed_out": false,
"_shards": { ... },
"hits": {
"total": 123,
"max_score": 123,
"hits": [
{
"_index": "foo",
"_type": "bar",
"_id": "123",
"_score": 123
}
],
...
},
"aggregations": {
"foo": {
"buckets": [
{
"key": 123,
"doc_count": 123
},
...
]
}
}
}
Actually I don't need the _index/_type every time. When I do aggregations, I don't need hits block.
"_source" : false or "_source": { "exclude": [ "foobar" ] } can help ignore/exclude the _source fields in hits block.
But can I change the structure of ES response json in a more common way? Thanks.
I recently needed to "slim down" the Elasticsearch response as it was well over 1MB in json and I started using the filter_path request variable.
This allows you to include or exclude specific fields and supports different types of wildcards. Do read the filter_path docs, as there is quite a lot of information there.
eg.
_search?filter_path=aggregations.**.hits._source,aggregations.**.key,aggregations.**.doc_count
This reduced (in my case) the response size by half without significantly increasing the search duration, so well worth the effort.
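For building such a URL programmatically, here is a small sketch using only the Python standard library. The host, the index name and the helper name search_url are placeholders chosen for this example.

```python
from urllib.parse import urlencode


def search_url(base_url, index, filter_paths):
    """Build a _search URL that asks Elasticsearch to return only the given paths."""
    # Keep "*" and "," unescaped so the wildcard expressions stay readable.
    query = urlencode({"filter_path": ",".join(filter_paths)}, safe="*,")
    return f"{base_url}/{index}/_search?{query}"


# Placeholder host and index name.
url = search_url(
    "http://localhost:9200",
    "my_index",
    ["aggregations.**.key", "aggregations.**.doc_count"],
)
print(url)
```

The request body (query and aggs) stays unchanged; only the response is trimmed to the listed paths.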
In the hits section, you will always have the _index, _type and _id fields. If you want to retrieve only some specific fields in your search results, you can use the fields parameter in the root object:
{
"query": { ... },
"aggs": { ... },
"fields":["fieldName1","fieldName2", etc...]
}
When doing aggregations, you can use the search_type (documentation) parameter with count value like this :
GET index/type/_search?search_type=count
It won't return any document but only the result count, and your aggregations will be computed in the exact same way.

Is it possible to eliminate "empty" facets with Elastic Search?

I've finally managed to get Elastic Search indexing to work the way I want it to work, indexing the raw values of certain fields using subfields and not_analyzed. The facets are what I expect, however, in some cases, due to the source data having null/empty values for those fields, I get results like this in the facets section:
"things": {
"_type": "terms",
"missing": 187,
"total": 12214,
"other": 10608,
"terms": [
{
"term": "foo",
"count": 912
},
{
"term": "",
"count": 532
},
{
"term": "bar",
"count": 37
}
]
}
Note the "" in the second item. I can see why ElasticSearch wouldn't automatically exclude this, as one might want to know how many documents don't have the field. But for my purposes I'd like to just not have this returned.
Is there some way that I can configure ElasticSearch to ignore these, either in the indexing or in the query?
Try putting
"exclude" : ""
in your aggregation terms
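A sketch of where that option would sit in a request body, together with a client-side fallback for cases where the server-side option is not available. The field name things.raw and both helper names are assumptions made for this example. The sample entries are the ones from the facet output in the question.

```python
def terms_request(field):
    """Search body whose terms aggregation excludes the empty string, as suggested above."""
    return {
        "size": 0,  # only the aggregation is of interest, no hits
        "aggs": {"things": {"terms": {"field": field, "exclude": ""}}},
    }


def drop_empty_terms(entries, key="term"):
    """Client-side fallback: remove entries whose term is the empty string."""
    return [entry for entry in entries if entry.get(key) != ""]


# Entries taken from the facet output in the question.
facet_terms = [
    {"term": "foo", "count": 912},
    {"term": "", "count": 532},
    {"term": "bar", "count": 37},
]

print(terms_request("things.raw"))  # "things.raw" is a hypothetical field name
print(drop_empty_terms(facet_terms))
```

Filtering on the client does not change the counts of the remaining terms, so it is only a cosmetic fix, but it works regardless of the server version.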

Resources