ElasticSearch: How to query by multiple conditions in different locations?

I've been trying to build this Elasticsearch query against the Danish CVR database API, so far without success. Basically, I'm trying to find companies where:
The company has a relationship with a "deltager" (participant) whose "enhedsNummer" (ID) equals NUMBER
The relationship is still active, i.e. the "end of period" field is null
How do I construct a query that has multiple conditions like this?
'query': {
'bool': {
'must': [
{
'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
AND
'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
},
],
},
},
}
FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping

You can try:
GET /cvr-permanent/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": {
"value": "your_value_here"
}
}
}
],
"must_not": [
{
"exists": {
"field": "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
}
}
]
}
}
}
The trick here is to combine must_not with an exists query to match null (missing) values.
P.S. I cannot check it because it requires authorisation.

It doesn't appear that Elasticsearch queries are as dynamic as I had wanted (or I don't know how to use them). Instead, the Python code below seems to be the best way to produce the desired result:
import requests
import pandas as pd

# Placeholder: the enhedsNummer of the participant ("deltager") to look up
some_value = 0

# Result columns
virksomhedsnavne = []
virksomhedscvr = []
relation_fra = []
relation_til = []

headers = {'Content-Type': 'application/json'}

# Pull data in pages (apparently limited to 3000 elements at a time),
# using the highest CVR number seen so far as the cursor for the next page
highestcvrnummer = 0
for i in range(20):
    json_data = {
        "_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],
        "sort": [{"Vrvirksomhed.cvrNummer": {"order": "asc"}}],
        "query": {
            "bool": {
                "must": [
                    {"term": {"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": some_value}},
                    {"range": {"Vrvirksomhed.cvrNummer": {"gt": highestcvrnummer}}}
                ]
            }
        },
        "size": 3000
    }
    response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search', headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
    hits = response.json()['hits']['hits']
    if not hits:
        # No more pages to fetch
        break

    # Aggregate and format data neatly
    for data in hits:
        virksomhed_data = data['_source']['Vrvirksomhed']
        virksomhedscvr.append(virksomhed_data['cvrNummer'])
        try:
            virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
        except (KeyError, TypeError):
            virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
        # Loop through all "deltagere" and find the one matching some_value
        fra, til = None, None
        for relation in virksomhed_data['deltagerRelation']:
            if relation['deltager']['enhedsNummer'] == some_value:
                # Use the last "organisationer" entry as the most recent period
                relation_gyldig = relation['organisationer'][-1]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
                fra, til = relation_gyldig['gyldigFra'], relation_gyldig['gyldigTil']
                break
        # Always append, so all four columns stay the same length
        relation_fra.append(fra)
        relation_til.append(til)
    highestcvrnummer = max(virksomhedscvr)

# Export to Excel
columns = {'CVR nummer': virksomhedscvr, 'navn': virksomhedsnavne, 'Relation fra': relation_fra, 'Relation til': relation_til}
df = pd.DataFrame(columns)
df.to_excel("output.xlsx")
If anyone else is working with the Danish CVR register's API, I hope this helps!
Also, if you find a better solution, please let me know :)

Related

Use query result as parameter for another query in Elasticsearch DSL

I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:
{
"query": {
"bool": {
"must_not": {
"terms": {
"request_id": {
"query": {
"match": {
"processing.message": "OUT Followup Synthesis"
}
},
"fields": [
"request_id"
],
"_source": false
}
}
}
}
}
}
As you can see above, I'm trying to search for documents whose request_id is not one of the request_ids of documents where processing.message equals OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, Elasticsearch supports neither SQL-style joins nor subqueries.
So you'll have to do the join on the application side: run your first query, take the retrieved IDs and put them into a second query, ideally a terms query.
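The two-step approach can be sketched in Python as below. This is only a sketch under assumptions: the helper names are hypothetical, the sample response is hand-written for illustration, match_phrase is swapped in for match so that the IN and OUT messages are not confused by their shared words, and request_id is assumed to be indexed so that a terms query matches it exactly (for example as a keyword field).

```python
# Two-step emulation of "NOT IN (subquery)".
# Helper names are hypothetical; only the request bodies are Elasticsearch query DSL.

def build_first_query(disallowed_msg, size=1000):
    """Step 1: fetch only the request_id of documents carrying the disallowed message."""
    return {
        "size": size,
        "_source": ["request_id"],
        "query": {"match_phrase": {"processing.message": disallowed_msg}},
    }


def extract_request_ids(response):
    """Collect the distinct request_ids from a search response dictionary."""
    return sorted({hit["_source"]["request_id"] for hit in response["hits"]["hits"]})


def build_second_query(wanted_msg, excluded_ids):
    """Step 2: match the wanted message while excluding the IDs found in step 1."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"processing.message": wanted_msg}}],
                "must_not": [{"terms": {"request_id": excluded_ids}}],
            }
        }
    }


# Hand-written sample of what step 1 might return (not real data)
sample_first_response = {"hits": {"hits": [{"_source": {"request_id": "abc"}}]}}

excluded = extract_request_ids(sample_first_response)
second = build_second_query("IN Followup Sythesis", excluded)
```

Each body would be sent to the index's _search endpoint in turn. If the first query can match more documents than one page holds, it would need to be paged before building the second query.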
That said, this limitation can be worked around within a single request by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
"size": 0,
"query": {
"match": {
"processing.message": "IN Followup Sythesis"
}
},
"aggs": {
"subquery_mock": {
"scripted_metric": {
"params": {
"disallowed_msg": "OUT Followup Synthesis"
},
"init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
"map_script": """
def req_id = params._source.request_id;
def msg = params._source.processing.message;
if (msg.contains(params.disallowed_msg)) {
state.disallowed_request_ids.add(req_id);
// won't need this particular doc so continue looping
return;
}
if (state.by_request_ids.containsKey(req_id)) {
// there may be multiple docs under the same ID
// so concatenate them
state.by_request_ids[req_id].add(params._source);
} else {
// initialize an appendable arraylist
state.by_request_ids[req_id] = [params._source];
}
""",
"combine_script": """
state.by_request_ids.entrySet()
.removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
return state.by_request_ids
""",
"reduce_script": "return states"
}
}
}
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend testing this script on a smaller set of documents before letting it target your whole index, for example by restricting it through a date range query or similar.
FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.

Elastic Search Suggestions Return Zero Results

I'm trying to set up Elasticsearch using the elasticsearch_dsl Python library. I have been able to set up the index, and I am able to search using the .filter() method, but I cannot get the .suggest() method to work.
I am trying to use the completion mapping type, and the suggest query method since this is going to be used for an autocomplete field (recommended on elastic's docs).
I am new to elastic, so I am guessing I am missing something.
Any guidance will be greatly appreciated!
What I have done so far
I did not find a tutorial that had exactly what I wanted, but I read through the documentation on ElasticSearch.com and elasticsearch_dsl, and looked at some examples here and here.
PS: I am using Searchbox Elasticsearch on Heroku
Index / Mappings Setup:
# imports [...]
edge_ngram_analyzer = analyzer(
'edge_ngram_analyzer',
type='custom',
tokenizer='standard',
filter=[
'lowercase',
token_filter(
'edge_ngram_filter', type='edgeNGram',
min_gram=1, max_gram=20
)
]
)
class DocumentIndex(ElasticDocument):
title = Text()
title_suggest = Completion(
analyzer=edge_ngram_analyzer,
)
class Index:
name = 'documents-index'
# [...] Initialize index
# [...] Upload Documents (5,000 documents)
# DocumentIndex.init()
# [DocumentIndex(**doc).save() for doc in mydocs]
Mappings Output:
This is the mapping as shown in the web console:
{
"documents-index": {
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text"
},
"title_suggest": {
"type": "completion",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
}
}
}
}
}
}
Attempting to Search
Verify Index exists:
>>> search = Search(index='documents-index')
>>> search.count() # Returns correct amount of documents
5000
>>> [doc for doc in search.scan()][:3]
[<Hit(documents-index/doc/1): ...} ...
Test Search - Works:
>>> query = search.filter('match', title='class')
>>> result = query.execute()
>>> result.hits
<Response: [<Hit(documents-in [ ... ]
>>> len(result.hits)
10
>>> query.to_dict() # see query payload
{
"query":{
"bool":{
"filter":[
{
"fuzzy":{
"title":"class"
}
}
]
}
}
}
The part that fails
I cannot get any of the .suggest() methods to work.
Note:
* I am following the official library docs
Test Suggest:
>>> query = search.suggest(
'title-suggestions',
'class',
completion={
'field': 'title_suggest',
'fuzzy': True
})
>>> query.execute()
<Response: {}>
>>> query.to_dict() # see query payload
{
"suggest": {
"title-suggestions": {
"text": "class",
"completion": { "field": "title_suggest" }
}
}
}
I also tried the code below, and obviously many different types of queries and values, but the results were similar. (note with .filter() I always get the expected result).
>>> query = search.suggest(
'title-suggestions',
'class',
term=dict(field='title'))
>>> query.to_dict() # see query payload
{
"suggest": {
"title-suggestions": {
"text": "class",
"term": {
"field": "title"
}
}
}
}
>>> query.execute()
<Response: {}>
Update
Per Honza's suggestion, I updated the title_suggest mapping to be only Completion, with no custom analyzers. I also deleted the index and reindexed from scratch
class DocumentIndex(ElasticDocument):
    title = Text()
    title_suggest = Completion()

    class Index:
        name = 'documents-index'
Unfortunately, the problem remains. Here are some more tests:
Verify title_suggest is being indexed properly
>>> search = Search(index='documents-index')
>>> search.index('documents-index').count()
23369
>>> [d for d in search.scan()][0].title
'AnalyticalGrid Property'
>>> [d for d in search.scan()][0].title_suggest
'AnalyticalGrid Property'
Tried searching again:
>>> len(search.filter('term', title='class').execute().hits)
10
>>> search.filter('term', title_suggest='Class').execute().hits
[]
>>> search.suggest('suggestions', 'class', completion={'field':
'title_suggest'}).execute().hits
[]
Verify Mapping:
>>> pprint(index.get_mapping())
{
"documents-index": {
"mappings": {
"doc": {
"properties": {
"title": { "type": "text" },
"title_suggest": {
"analyzer": "simple",
"max_input_length": 50,
"preserve_position_increments": True,
"preserve_separators": True,
"type": "completion"
}
}
}
}
}
}
For completion fields you do not want to use ngram analyzers. The completion field automatically indexes all prefixes and optimizes for prefix queries, so you are doing the work twice and confusing the system. Start with an empty completion field and go from there.
I wanted to formalize the solution that Honza provided in one of the comments on another answer.
The problem was not the mapping, but simply the fact that results from the .suggest() method are not returned under hits.
The suggestions are visible in the dictionary returned by response.to_dict():
>>> response = query.execute()
>>> print(response)
<Response: {}>
>>> response.to_dict()
# output is
# {'query': {},
# 'suggest': {'title-suggestions': {'completion': {'field': 'title_suggest'},
# [...]
I have also found additional details on this github issue:
HonzaKral commented 27 days ago
The Response object provides access to any and all fields that have
been returned by elasticsearch. For convenience there is a shortcut
that allow to iterate over the hits as that is both most common and
also easy to do. For other parts of the response, like aggregations or
suggestions, you need to access them explicitly like
response.suggest.foo.options.
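To make the point concrete, here is a minimal sketch of pulling suggestion texts out of the dictionary returned by response.to_dict(). It assumes the usual shape of a suggest response, where each named suggestion is a list of entries and each entry carries its own options list. The helper name and the sample dictionary are hypothetical and hand-written for illustration.

```python
def suggestion_texts(response_dict, name):
    """Return the text of every option under the named suggestion.

    Returns an empty list when the suggestion is absent from the response.
    """
    entries = response_dict.get("suggest", {}).get(name, [])
    return [option["text"] for entry in entries for option in entry.get("options", [])]


# Hand-written sample shaped like a suggest response (not real output)
sample = {
    "hits": {"total": 0, "hits": []},  # hits stay empty for suggest-only requests
    "suggest": {
        "title-suggestions": [
            {
                "text": "class",
                "offset": 0,
                "length": 5,
                "options": [
                    {"text": "Class Property", "_score": 1.0},
                    {"text": "Classification", "_score": 1.0},
                ],
            }
        ]
    },
}

texts = suggestion_texts(sample, "title-suggestions")
```

With a live response this would be called as suggestion_texts(query.execute().to_dict(), 'title-suggestions'). An empty hits list alongside a populated suggest section is exactly the situation described in the question.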

Combination of and or elasticsearch

How do I write a query for the following condition in Elasticsearch?
Select * from table1 where (cnd1 or cond2) and (cnd3)
The value for cond2 comes from a nested object. My JSON object is below:
details={ "name"="name1",
"address":"{
"city":"city1"
}"
}
I need to get the city from the object above:
details.address.city
Is the syntax above correct? If not, how do I get the value of city from the inner object?
{
"bool" : {
"must" : cond3,
"should" : [
cond1,
cond2
],
"minimum_should_match" : 1
}
}
See the bool query documentation for more info: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-bool-query.html
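As an illustration of the bool structure above, the sketch below builds the request body for (cond1 OR cond2) AND cond3 in Python. The helper name is hypothetical, and so are the status field and all the values; only details.address.city comes from the question.

```python
def and_of_or(required, alternatives):
    """Build a bool query: `required` AND (at least one of `alternatives`)."""
    return {
        "query": {
            "bool": {
                "must": [required],
                "should": list(alternatives),
                # Without this, should clauses next to a must clause are optional
                "minimum_should_match": 1,
            }
        }
    }


# Hypothetical conditions for illustration
cond1 = {"match": {"status": "active"}}
cond2 = {"match": {"details.address.city": "city1"}}
cond3 = {"match": {"details.name": "name1"}}

body = and_of_or(cond3, [cond1, cond2])
```

The resulting dictionary can be serialized with json.dumps and sent as the body of a _search request.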
You can easily build conditional queries with Elasticsearch, but there is something odd about your data:
details={ "name"="name1",
"address":"{
"city":"city1"
}"
}
Elasticsearch stores your data as a JSON object, so you have to send it as valid JSON. Let us examine the object you are trying to send:
The details object has a name attribute, which is a string. It also has an address attribute, and that is a string too. It should instead be an object containing a city attribute if you want to reach it via details.address.city. Let's fix that:
{
"id":...,
...
"details": {
"name": "name1",
"address": {
"city": "city1"
}
}
}
Here I removed the double quotation marks around the address value, so that it is an object rather than a string. Now you can reach the city attribute via details.address.city. Next, we create a query that uses it:
{
"query": {
"bool": {
"must": {
"term": {
"your-json-attribute": "???"
}
},
"should": [
{
"term": {
"your-json-attribute": "???"
}
},
{
"term": {
"your-json-attribute": "???"
}
}
]
}
}
}
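I used term queries here, but there are many other query types; you can look them up in the documentation. For AND and OR logic, use the bool query. Note that when a must clause is present, the should clauses are optional unless you also set minimum_should_match to 1. See https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-bool-query.html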

Elasticsearch Mutually Exclusive results

I have an Elasticsearch query with a condition that checks whether issoldout = false. Based on it, I have a few Sum and Count aggregation fields.
However, if issoldout = false fetches no results, I would like to get the aggregation values for issoldout = true instead. Is there any way to do this without running a second search with issoldout = true?
You could literally submit two queries using _msearch, but you can also compute both cases within the same request.
You can do this with a filter aggregation per case: the one below filters on issoldout = true, and a sibling aggregation can do the same for false. Alternatively, you could use a terms aggregation on issoldout, but then you would always get the bucket for false as well.
{
"query": {
... normal query ...
},
"aggs": {
"group_by_soldout": {
"filter": {
"term": {
"issoldout": true
}
},
"aggs": {
"stats_for_field": {
"stats": {
"field": "your_field"
}
}
}
}
}
}
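A rough Python sketch of the fallback logic follows. It assumes the main query does not itself filter on issoldout (otherwise one branch would always be empty). The helper names, aggregation names and the sample response are hypothetical; the request body uses the same filter and stats aggregations as above.

```python
def build_request(base_query, field):
    """One request with a filter aggregation for each issoldout value."""
    def branch(flag):
        return {
            "filter": {"term": {"issoldout": flag}},
            "aggs": {"stats_for_field": {"stats": {"field": field}}},
        }
    return {
        "size": 0,
        "query": base_query,
        "aggs": {"not_soldout": branch(False), "soldout": branch(True)},
    }


def pick_stats(response):
    """Prefer the issoldout = false branch, fall back to true when it is empty."""
    aggs = response["aggregations"]
    chosen = "not_soldout" if aggs["not_soldout"]["doc_count"] > 0 else "soldout"
    return chosen, aggs[chosen]["stats_for_field"]


request_body = build_request({"match_all": {}}, "your_field")

# Hand-written sample response where nothing is in stock (not real data)
sample_response = {
    "aggregations": {
        "not_soldout": {"doc_count": 0, "stats_for_field": {"count": 0, "sum": 0.0}},
        "soldout": {"doc_count": 3, "stats_for_field": {"count": 3, "sum": 42.0}},
    }
}

chosen, stats = pick_stats(sample_response)
```

Both branches are computed in the same round trip, and the choice between them happens on the client.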

NEST elasticsearch.NET search query not returning results (part 2)

I'm using the object initializer syntax with NEST to form a search query. When I include the second pdfQuery with the logical OR operator, I get no results. If I exclude it, I get results.
QueryContainer titleQuery = new MatchQuery
{
Field = Property.Path<ElasticBook>(p => p.Title),
Query = query,
Boost = 50,
Slop = 2,
MinimumShouldMatch = "55%"
};
QueryContainer pdfQuery = new MatchQuery
{
Field = Property.Path<ElasticBook>(p => p.Pdf),
Query = query,
CutoffFrequency = 0.001
};
var result = _client.Search<ElasticBook>(new SearchRequest("bookswithstop", "en")
{
From = 0,
Size = 10,
Query = titleQuery || pdfQuery,
Timeout = "20000",
Fields = new []
{
Property.Path<ElasticBook>(p => p.Title)
}
});
If I debug and inspect the result variable, and copy the value of one of the request properties, I get:
{
"timeout": "20000",
"from": 0,
"size": 10,
"fields": [
"title"
],
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "Proper Guide To Excel 2010",
"slop": 2,
"boost": 50.0,
"minimum_should_match": "55%"
}
}
},
{
"match": {
"pdf": {
"query": "Proper Guide To Excel 2010",
"cutoff_frequency": 0.001
}
}
}
]
}
}
}
The problem is that if I paste that query into Sense, it returns about 100 results (albeit slowly). I've checked the header info, and what NEST sends seems to be correct as well:
ConnectionStatus = {StatusCode: 200,
Method: POST,
Url: http://elasticsearch-blablablamrfreeman/bookswithstop/en/_search,
Request: {
"timeout": "20000",
"from": 0,
"size": 10,
"fields": [
"title"
],
"query": {
"bool": {
"shoul...
The pdf field uses the Elasticsearch attachment plugin (located at https://github.com/elastic/elasticsearch-mapper-attachments), and I was getting Newtonsoft.JSON System.OutOfMemoryException errors before (but not now, for some reason).
My only guess, therefore, is that there is some serialization issue between my query and NEST. If that were the case, though, I'm not sure why the request would execute successfully with a 200 code and give 0 documents in the Documents property.
Could anyone please explain how I would go about troubleshooting this? It clearly doesn't like my second search query (pdfQuery), but I'm not sure why, and the resulting JSON request syntax seems to be correct as well!
I think this part is causing problems
Fields = new []
{
Property.Path<ElasticBook>(p => p.Title)
}
When you use the Fields option, Elasticsearch does not return the _source field, so you can't access results through result.Documents. Instead, you have to use result.FieldSelections, which is quite unpleasant.
If you want to return only specific fields from Elasticsearch and still be able to use result.Documents, you can take advantage of source includes / excludes. With NEST you can do this as follows:
var searchResponse = client.Search<Document>(s => s
.Source(source => source.Include(f => f.Number))
.Query(q => q.MatchAll()));
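At the level of the raw request body, the same idea looks roughly like the Python sketch below: source filtering keeps _source in each hit, trimmed to the requested fields. The helper names and the sample response are hypothetical and hand-written for illustration.

```python
def build_search(query, include_fields, size=10):
    """Search body that returns a trimmed _source instead of using `fields`."""
    return {"from": 0, "size": size, "_source": list(include_fields), "query": query}


def titles(response):
    """Read the title out of each hit's _source."""
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]


body = build_search({"match": {"title": "Proper Guide To Excel 2010"}}, ["title"])

# Hand-written sample response (not real data)
sample_response = {
    "hits": {"hits": [{"_id": "1", "_source": {"title": "Proper Guide To Excel 2010"}}]}
}

found = titles(sample_response)
```

Because each hit still carries a _source object, a client that deserializes documents from _source (as result.Documents does) has something to work with.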
Hope this helps you.

Resources