Parent join in Elasticsearch is not searching as expected - elasticsearch

We recently migrated from Elasticsearch 5.5 to 7.7.
Elasticsearch 7.7 has removed the concept of multiple mapping types, so we used the join data type to map the relationship between users and tweets, like below:
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html
PUT twitter
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": {
        "type": "join",
        "relations": {
          "users": "tweets"
        }
      }
    }
  }
}
Here users is the parent and tweets are the children.
My twitter index has 1 million entries, a mix of users and tweets.
Queries using hasParentQuery / hasChildQuery work as expected and return the proper results.
But when I try to query only the parents in the twitter index (i.e., in this case, search only the users), I query like below:
// to filter only users
QueryBuilder query1 = QueryBuilders.matchQuery("my_join_field", "users");
// to get all the users whose name contains "joh"
QueryBuilder query2 = QueryBuilders.wildcardQuery("username", "*joh*").boost(1.0f);
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery().must(query1).must(query2);
SearchResponse searchResp = commondao.prepareSearch("twitter").setQuery(boolQueryBuilder).setFrom(from).setSize(size).execute().actionGet();
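For reference, the same request expressed as a raw query body in Python (a sketch only; the index and field names follow the mapping above, and the dict would be passed to the Python client's search method):

```python
def build_user_search(name_pattern, from_=0, size=10):
    """Bool query that keeps only parent docs (join name 'users') and
    matches a username wildcard, mirroring the Java client code above."""
    return {
        "from": from_,
        "size": size,
        "query": {
            "bool": {
                "must": [
                    # restrict to the parent side of the join relation
                    {"match": {"my_join_field": "users"}},
                    # wildcard match on the username field
                    {"wildcard": {"username": {"value": name_pattern, "boost": 1.0}}},
                ]
            }
        },
    }

body = build_user_search("*joh*", from_=0, size=20)
# es.search(index="twitter", body=body) would run the same search.
```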
The twitter index has 1 million documents: 50K users and the rest tweets.
This query takes the same time with or without the my_join_field filter, i.e., as long as searching the whole index.
What am I doing wrong? Any help is appreciated!

Related

Is it possible to check that specific data matches a query without loading it into the index?

Imagine that I have a specific data string and a specific query. The simple way to check that the query matches the data is to load the data into the Elasticsearch index and run the query. But can I do it without putting the data into the index?
Maybe there are some open-source libraries that implement the Elasticsearch functionality offline, so I can call something like getScore(data, query)? Or is it possible to implement this using specific API endpoints?
Thanks in advance!
What you can do is leverage the percolator field type.
This allows you to store the query instead of the document, and then test whether a document would match the stored query.
For instance, you first create an index with a field of type percolator that will contain your query (you also need to add to the mapping any field used by the query, so ES knows what their types are):
PUT my_index
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "message": {
        "type": "text"
      }
    }
  }
}
Then you can index a real query, like this:
PUT my_index/_doc/match_value
{
  "query": {
    "match": {
      "message": "bonsai tree"
    }
  }
}
Finally, you can use the percolate query to check whether the query you've just stored would match a given document:
GET /my_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "message": "A new bonsai tree in the office"
      }
    }
  }
}
So all you need to do is store the queries (not the documents); the percolate query then tells you which stored queries would have selected a given document, without the document ever being indexed.
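In Python, the percolate request body can be sketched like this (index and field names follow the example above; the client call itself is shown only as a comment, since it needs a running cluster):

```python
def percolate_body(message_text):
    """Build a percolate search body for one candidate document."""
    return {
        "query": {
            "percolate": {
                "field": "query",  # the percolator-typed field from the mapping
                "document": {"message": message_text},
            }
        }
    }

body = percolate_body("A new bonsai tree in the office")
# es.search(index="my_index", body=body) would return the stored queries
# (here: the "match_value" document) that match this candidate document.
```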

Datatype creation in AWS DynamoDB and elastic search for List of URL's

I have enabled AWS DynamoDB Streams and created a Lambda function to index the data into Elasticsearch.
In my DynamoDB table there is a column named URL in which I am going to store a list of URLs for a single row.
Each URL is most likely the object URL of an AWS S3 object.
After streaming, I index the data into Elasticsearch. My question is: what data type should I prefer for storing multiple URLs in both DynamoDB (single row) and Elasticsearch (single document)?
Could someone help me achieve this in the most efficient way? Thanks in advance.
JSON structure:
{
  "id": "234561",
  "policyholdername": "xxxxxx",
  "age": "24",
  "claimnumber": "234561",
  "policynumber": "456784",
  "url": "https://dgs-dms.s3.amazonaws.com/G-3114_Textract.pdf",
  "claimtype": "Accident",
  "modified_date": "2020-02-05T17:36:49.053Z",
  "dob": "2020-02-05T17:36:49.053Z",
  "client_address": "no,7 royal avenue thirumullaivoyal chennai"
}
In the future, a single claim number may have multiple URLs.
So, how do I handle this?
Not sure about DynamoDB types, but in Elasticsearch there is no dedicated type for lists: any field can hold one or more values. To store a list of strings (URLs in your case) you can use the keyword field type.
For example, your data can look like:
{
  "id": "234561",
  "policyholdername": "xxxxxx",
  "age": "24",
  "claimnumber": "234561",
  "policynumber": "456784",
  "url": ["https://dgs-dms.s3.amazonaws.com/G-3114_Textract.pdf", "https://foo/bar/foo.pdf"],
  "claimtype": "Accident",
  "modified_date": "2020-02-05T17:36:49.053Z",
  "dob": "2020-02-05T17:36:49.053Z",
  "client_address": "no,7 royal avenue thirumullaivoyal chennai"
}
and the equivalent Elasticsearch mapping could be
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        }
      }
    }
  }
}
and the search query can be
POST index/_search
{
  "query": {
    "term": {
      "url": "https://foo/bar/foo.pdf"
    }
  }
}
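Since the answer relies on keyword fields accepting multiple values, here is a minimal Python sketch of the document and query bodies (field names follow the JSON above):

```python
# A keyword field accepts either a single string or a list of strings,
# so the indexed document can carry several URLs in one field.
doc = {
    "claimnumber": "234561",
    "url": [
        "https://dgs-dms.s3.amazonaws.com/G-3114_Textract.pdf",
        "https://foo/bar/foo.pdf",
    ],
}

def term_query(field, value):
    """Exact-match query on a keyword field; it matches any value in a list."""
    return {"query": {"term": {field: value}}}

body = term_query("url", "https://foo/bar/foo.pdf")
# es.search(index="index", body=body) would return every document whose
# url list contains that exact string.
```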

Elasticsearch From and Size on aggregation for pagination

First of all, I want to say that the requirement I want to achieve works very well on Solr 5.3.1, but not on Elasticsearch 6.2 as a service on AWS.
My actual query is very large and complex, and it works fine in Kibana, but as soon as I cross from = 100 and size = 50 it shows an error in the Kibana console.
What I know:
For a normal search, the maximum from can be 10000, and for an aggregated search, the maximum from can be 100.
If I cross that limit I would have to raise the maximum, which is not possible since I am using ES on AWS as a service, OR I have to use the scroll API with a scroll id to get paginated data.
The scroll API works fine; I've used it in another part of my project. But when I try the same scroll with an aggregation, it does not work as expected:
with the scroll API, the first search returns the aggregated data, but the second call with the scroll id returns only the hits, not the aggregation results.
Query on Kibana
GET /properties/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "published": true
          }
        },
        {
          "match": {
            "country": "South Africa"
          }
        }
      ]
    }
  },
  "aggs": {
    "aggs_by_feed": {
      "terms": {
        "field": "feed",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "tops": {
          "top_hits": {
            "from": 100,
            "size": 50,
            "_source": [
              "id",
              "feed_provider_id"
            ]
          }
        }
      }
    }
  },
  "sort": [
    {
      "instant_book": {
        "order": "desc"
      }
    }
  ]
}
Search with Python: the problem I'm facing is that the first call returns the aggregated data along with the hits, but subsequent calls with the scroll id return only the hits, not the aggregated data.
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while scroll_size > 0:
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results that we returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data:")
        print(page['aggregations'])
With Elasticsearch-DSL in Python: with this approach I'm struggling to select the _source field names like id and feed_provider_id on the second aggs, i.e., tops->top_hits:
es = init_es()
s = Search(using=es, index=index_name, doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top', 'top_hits', field='id')
response = s.execute()
print('Hit........')
for hit in response:
    print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
    print(tag)
So my questions are:
Is it not possible to use the from and size fields on the aggregated query above with from = 100?
If it is possible, please give me a hint, either the plain Elasticsearch way or the elasticsearch-dsl Python way, as I am not very familiar with elasticsearch-dsl concepts such as bucket, metric, etc.
Some answers on SO suggest using partitions, but I don't know how to apply them to my scenario: How to control the elasticsearch aggregation results with From / Size?
Others say this feature is not currently supported by ES (it is an open feature request). If it's not possible, what else can be done in place of the grouping I had in Solr?
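For what it's worth, the partition approach mentioned above can be sketched as a raw request body: the terms aggregation can split its buckets into num_partitions slices, and each request fetches one slice. Field and aggregation names follow the query above; the helper name is illustrative.

```python
def partitioned_terms_body(field, partition, num_partitions, bucket_size=50):
    """Request body that returns only one partition of the terms buckets."""
    return {
        "size": 0,  # we only want the aggregation, not top-level hits
        "aggs": {
            "aggs_by_feed": {
                "terms": {
                    "field": field,
                    "size": bucket_size,
                    # include.partition selects one of num_partitions slices
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                }
            }
        },
    }

# Page through the feed buckets one twentieth at a time:
bodies = [partitioned_terms_body("feed", p, 20) for p in range(20)]
```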

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have the same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need the Elasticsearch equivalent of the following:
SELECT * FROM session s1, session s2
WHERE s1.device = s2.device
What you are trying to achieve is simply grouping docs on a field, via a self-join.
A similar notion of grouping can be achieved with the terms aggregation in Elasticsearch. On its own, this aggregation returns only group-level metrics like count, sum, etc.; it does not return the individual records.
However, there is another aggregation, top_hits, which can be applied as a sub-aggregation to the terms aggregation.
The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determines by which properties a result set get sliced into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "top-mobiles": {
      "terms": {
        "field": "device"
      },
      "aggs": {
        "top_device_hits": {
          "top_hits": {}
        }
      }
    }
  }
}
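To show how the grouped documents come back, here is a small Python sketch that walks a terms + top_hits response. The response fragment is fabricated for illustration, but the bucket structure matches the sample query above.

```python
def groups_from_response(resp):
    """Map each device value to the documents in its top_hits bucket."""
    groups = {}
    for bucket in resp["aggregations"]["top-mobiles"]["buckets"]:
        hits = bucket["top_device_hits"]["hits"]["hits"]
        groups[bucket["key"]] = [h["_source"] for h in hits]
    return groups

# Fabricated response fragment: one bucket of documents sharing device=computer.
sample = {
    "aggregations": {
        "top-mobiles": {
            "buckets": [
                {
                    "key": "computer",
                    "doc_count": 2,
                    "top_device_hits": {
                        "hits": {
                            "hits": [
                                {"_source": {"mobile": "android", "device": "computer"}},
                                {"_source": {"mobile": "ios", "device": "computer"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

groups = groups_from_response(sample)
```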

ElasticSearch performance when querying by element type

Assume that we have a dataset containing a collection of domains { domain.com, domain2.com } and also a collection of users { user#domain.com, angryuser#domain2.com, elastic#domain3.com }.
Let's assume that domains and users share several attributes, such as "domain", and that when the attribute name matches, so do the mapping and possible values.
Then we load up our Elasticsearch index with both collections, separating them by type: domain and user.
Obviously, in our system we have many more users than domains, so when querying for domain-related data, the expectation is that it would be much faster to filter the query by the type of the attribute, right?
My question is: with around 5 million users and 200K domains, why do queries run much faster when my index contains only domain data (users deleted) than when I filter the objects by their type? Shouldn't the performance be at least similar? Currently we can match 20 domains per second when there are no users in the index, but that drops to 4 when we load the users, even though we still filter by type.
Maybe there is something I'm missing, as I'm new to Elasticsearch.
UPDATE:
This is the query basically
"query": {
  "flt_field": {
    "domain_address": {
      "like_text": "chroma",
      "fuzziness": 0.3
    }
  }
}
And the mapping is something like this
"user": {
  "properties": {
    ...,
    "domain_address": {
      "type": "string",
      "boost": 2.4,
      "similarity": "linear"
    }
  }
},
"domain": {
  "properties": {
    ...,
    "domain_address": {
      "type": "string",
      "boost": 2.4,
      "similarity": "linear"
    }
  }
}
There are other fields where the "..." appears, but their mappings should not influence the outcome, should they?
