I have the following query and I want to express it in PyES:
{
  "facets": {
    "participating-org.name": {
      "terms": {
        "field": "participating-org.name"
      },
      "nested": "participating-org"
    }
  }
}
I have searched the PyES documentation and found:
class pyes.facets.TermsFacetFilter(field=None, values=None, _name=None, execution=None, **kwargs)
I don't know how to use it, and I couldn't find any examples related to it. I hope the PyES maintainers will publish better documentation with examples in the future.
I have just figured it out myself:
from pyes import *
from pyes.facets import *

conn = ES('localhost:9200', default_indices='org', default_types='activity')

q2 = MatchAllQuery().search()
q2.facet.add_term_facet('participating-org.role', nested="participating-org")

# Display the ES JSON query.
print(q2)

resultset = conn.search(q2)

# Display all results.
for r in resultset:
    print(r)

# Display the facet counts.
print(resultset.facets)
This code produces JSON in the same shape as the query above (here on the participating-org.role field) and returns the exact counts I needed.
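For reference, resultset.facets should be roughly the facets section of the raw ES response, so a terms facet comes back in a shape like this (values are illustrative):

{
  "participating-org.role": {
    "_type": "terms",
    "missing": 0,
    "total": 42,
    "other": 0,
    "terms": [
      { "term": "Funding", "count": 25 },
      { "term": "Implementing", "count": 17 }
    ]
  }
}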
I've been at this for a day and I don't quite understand how to do it! This is the query I want to "recreate" with the new Elasticsearch Java API Client (using Spring Boot):
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "timestamp",
        "ranges": [
          { "to": "now-2d" }
        ]
      }
    },
    "aggs": {
      "top_hits": {
        "_source": {
          "includes": [ "Id", "timestamp" ]
        }
      }
    }
  }
}
I tried doing it with DateRangeAggregation.of, but I can't seem to get the right results or type. Here's what I have:
SearchResponse<MyDto> response = client.search(b -> b
        .index("test-index")
        .size(0)
        .aggregations("range", a -> a
            .dateRange(DateRangeAggregation.of(d -> d
                .field("timestamp")
                .ranges(r -> r.to(t -> t.expr("now-2d"))))))
        .aggregations("hits", a -> a
            .topHits(h -> h.source(SourceConfig.of(c -> c
                .filter(f -> f.includes(Arrays.asList("Id", "timestamp"))))))),
    MyDto.class
);
I've also tried removing the sub-aggregation and the query for now, but I don't seem to be on the right track to even get the doc_count from the bucket. I don't quite get how to work with dateRange() here.
Edit: I played around a bit and was able to at least get the doc_count. I'm not sure whether this is a good way to do it, though:
Aggregation agg = Aggregation.of(a -> a
    .dateRange(d -> d
        .field("timestamp")
        .ranges(r -> r.to(FieldDateMath.of(v -> v.expr("now-2d"))))));

SearchResponse<MyDto> response = client.search(b -> b
        .index("test-index")
        .size(0)
        .aggregations("range", agg),
    MyDto.class
);

return response.aggregations().get("range").dateRange().buckets().array().get(0).docCount();
I also fixed the query above; it had an unnecessary extra query that broke the result.
My thought process was wrong. I wanted the documents that were aggregated within this time range, and I mistakenly thought top_hits would give them to me, but that's not how it works! I wrote a separate range query instead, which actually returns the documents I needed.
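In query DSL terms, such a range query might look roughly like this (reusing the timestamp field and the now-2d bound from the aggregation above; the size and _source filtering are illustrative):

{
  "size": 100,
  "_source": [ "Id", "timestamp" ],
  "query": {
    "range": {
      "timestamp": {
        "lt": "now-2d"
      }
    }
  }
}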
I'm storing blog articles in Elasticsearch in this format:
{
  blog_id: keyword,
  blog_article_id: keyword,
  timestamp: date,
  article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids that have articles with the same word at least n times within a date range?
Is this the right way to model the problem, or should I use nested objects for an easier query?
Can this be made into a report in Kibana?
The simplest query that comes to mind is:
{
  "_source": "blog_id",
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "article_text": "xyz"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-30d"
            }
          }
        }
      ]
    }
  }
}
Nested objects are most probably not going to simplify anything -- on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kibana Query Language) or using the dropdowns, and choose a metric that you want to track (total blog_id count, time-series frequency, etc.).
EDIT regarding the number of occurrences:
I know of two ways:
There's the term vectors API, which gives you the word-frequency information, but it's a standalone API and cannot be used at query time (see the example request after the script example below).
Then there's the scripted approach, whereby you look at the whole article text, treat it as a case-sensitive keyword, and count the number of substring occurrences, thereby eliminating the articles with insufficient word frequency. Note that you don't have to use function_score as I did -- a simple script query will do. It may take a non-trivial amount of time to resolve if you have a non-trivial number of docs.
In your case it could look like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                def word = 'xyz';
                def docval = doc['article_text.keyword'].value;
                String temp = docval.replace(word, "");
                def no_of_occurences = ((docval.length() - temp.length()) / word.length());
                return no_of_occurences >= 2;
              """
            }
          }
        }
      ]
    }
  }
}
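For completeness, the term vectors API from the first approach is a standalone request per document. On recent versions it looks roughly like this (index name and document id are placeholders); the response lists the term_freq of every term in article_text:

GET /blog_articles/_termvectors/1
{
  "fields": [ "article_text" ],
  "term_statistics": false,
  "positions": false,
  "offsets": false
}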
I am working with Elasticsearch and I am getting a query error:
elasticsearch.exceptions.TransportError: TransportError(500, 'search_phase_execution_exception', 'script score query returned an invalid score: NaN for doc: 32894')
It seems like my metric is returning NaN for document 32894 (NaN for doc: 32894). Naturally, the next step is to look at that document to see if there is anything wrong with it.
The problem is that I index my documents using my own IDs, so "32894" is meaningless to me.
A query like
curl -X GET "localhost:9200/my_index/_doc/one_of_my_ids?pretty"
works fine, but it fails if I try the doc number from the error message.
I expected this to be trivial, but some Googling has failed to help.
How can I find this document, then? Or is using my own IDs not recommended, and the unfixable source of this problem?
Edit: as requested, this is the query that fails. Fixing it is obviously my ultimate goal, but it is not the specific point of this question. Help is appreciated in either case.
I am using the elasticsearch library in Python.
self.es.search(index=my_index, body=query_body, size=number_results)
With
query_body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilaritySparse(params.queryVector, doc['embedding']) + 10.0",
                "params": {"queryVector": query_vector}
            }
        }
    }
}
I have an external ES instance which I need to query for documents older than 6 months. The problem is that the timestamp is stored like this:
"timestamp": {
"year": 2018,
"monthValue": 5,
"dayValue": 1,
}
Is it possible to create a range query combining these fields and getting documents "lt" "now-6m" or something like that?
You should be able to accomplish this using a Script Query. That would enable you to create a date object using the field values, and then compare that date with the current date.
Notional example:
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "params": {
              "monthRange": 6
            },
            "source": """
              def today = new Date();
              def timestamp = new Date(doc['timestamp']['year'].value, doc['timestamp']['monthValue'].value, doc['timestamp']['dayValue'].value);
              /* Date comparison magic (I don't know Java, so you're on your own here) */
              /* return result of comparison */
            """,
            "lang": "painless"
          }
        }
      }
    }
  }
}
I've only used Painless once before, so I'm not familiar enough to give a perfect answer. But this may help you get started. If you get stuck, just ask another question specific to the issue you're having, and someone who's more familiar with Java/Painless can help you out.
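If it helps, here is one possible way to fill in the comparison using the java.time classes that Painless exposes. This is an untested sketch: it assumes the year/monthValue/dayValue parts are mapped as numeric sub-fields (addressed as timestamp.year and so on), and it passes the current time in as a nowMillis parameter rather than computing "now" inside the script:

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "params": {
              "monthRange": 6,
              "nowMillis": 1735689600000
            },
            "source": """
              /* Rebuild a date from the three stored fields. */
              def ts = LocalDate.of(
                (int) doc['timestamp.year'].value,
                (int) doc['timestamp.monthValue'].value,
                (int) doc['timestamp.dayValue'].value
              );
              /* nowMillis is supplied by the caller (the value above is just a placeholder). */
              def cutoff = Instant.ofEpochMilli(params.nowMillis)
                  .atZone(ZoneId.of('UTC'))
                  .toLocalDate()
                  .minusMonths(params.monthRange);
              /* Older than monthRange months means strictly before the cutoff. */
              return ts.isBefore(cutoff);
            """
          }
        }
      }
    }
  }
}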
I have tried looking for another solution to this, but the bool query in ES doesn't seem to do quite what I'm looking for -- or I'm just not using it correctly.
In our current implementation of search we are trying to boost performance and reduce the memory footprint of each query by changing our query logic. Today, if you search for "The Red Ball" you may get back 5 million documents, because ES returns any document that matches "the" OR "red" OR "ball", which means we get back WAAAAAY too many irrelevant documents (mostly because of the "the" term). I would like to change our query to use AND instead, so ES returns only documents that match "the" AND "red" AND "ball".
I am using the NEST client with C#, so an example using the client would be best, since that is where I cannot figure out what to do. Thanks.
You can simply use a query_string query with the AND operator.
{
  "query": {
    "query_string": {
      "default_field": "your_field", <--- remove this if you want to search on all fields
      "query": "the red ball",
      "default_operator": "AND"
    }
  }
}
or simply
{
  "query": {
    "query_string": {
      "query": "the AND red AND ball"
    }
  }
}
I do not know C#, but this is how it might look in NEST (everyone, feel free to edit):
client.Search<your_index>(q => q
    .Query(qu => qu
        .QueryString(qs => qs
            .OnField(x => your_field)
            .Query("the AND red AND ball")
        )
    )
);
I found the appropriate query to make using the NEST client:
SearchDescriptor<BackupEntitySearchDocument> desc = new SearchDescriptor<BackupEntitySearchDocument>();
desc.Query(qq => qq.MultiMatch(m => m.OnFields(_searchFields).Query(query).Operator(Operator.And)));
var searchResp = await _client.SearchAsync<BackupEntitySearchDocument>(desc).ConfigureAwait(false);
Where _searchFields is a List<string> containing the fields to match on, and query is the term to search for.
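For reference, that multi_match call should translate to a query along these lines (field names are placeholders):

{
  "query": {
    "multi_match": {
      "query": "the red ball",
      "fields": [ "field1", "field2" ],
      "operator": "and"
    }
  }
}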