Can I apply facets on the entire solr result instead of the filtered solr result?

Let's say I have a collection of solr documents:
{'id': 1, 'title': 'shirt', 'brand': 'adidas'},
{'id': 2, 'title': 'shirt2', 'brand': 'adidas'},
{'id': 3, 'title': 'shirt', 'brand': 'nike'},
{'id': 4, 'title': 'shirt', 'brand': 'puma'}
Now if I try to apply faceting on this with facet.field=brand, I will get the following
{'adidas': 2, 'nike': 1, 'puma': 1}
In addition, if I apply a filter brand:adidas, I get the following for faceting:
{'adidas': 2, 'nike': 0, 'puma': 0}
Is it possible to still get the count of documents returned by solr before the filtering happened? I would like to still see {'adidas': 2, 'nike': 1, 'puma': 1} regardless of what filters I apply

You can use tagging and excluding filters as shown in the reference guide.
q=yourquery&fq={!tag=brand}brand:adidas&facet=true&facet.field={!ex=brand}brand
The {!ex} instruction will then exclude any filters matching that tag from the active filters when generating the facet results, allowing you to keep the complete set of brands matching the query available as facets.
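A minimal sketch of issuing that request from Python with the requests library (the Solr URL, core name, and field values are assumptions based on the example above):

import requests

# Hypothetical core URL; adjust host and core name to your setup.
SOLR_URL = 'http://localhost:8983/solr/products/select'

params = {
    'q': '*:*',
    # Tag the filter so faceting can exclude it.
    'fq': '{!tag=brand}brand:adidas',
    'facet': 'true',
    # Exclude the tagged filter when counting facets for "brand".
    'facet.field': '{!ex=brand}brand',
    'wt': 'json',
}

resp = requests.get(SOLR_URL, params=params)
brand_facets = resp.json()['facet_counts']['facet_fields']['brand']
print(brand_facets)  # e.g. ['adidas', 2, 'nike', 1, 'puma', 1]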

Partially updating elasticsearch list field value using python

The purpose of this question is to ask the community how to go about partially updating a field without removing any other contents of that field.
There are many examples on StackOverflow of partially updating Elasticsearch _source fields using Python, curl, etc. The elasticsearch Python library ships with an elasticsearch.helpers module containing functions - parallel_bulk, streaming_bulk, bulk - that let developers update documents easily.
If users have data in a pandas dataframe, one can easily iterate over the rows to build a generator that updates/creates documents in Elasticsearch. Elasticsearch documents are immutable, so when an update occurs Elasticsearch takes the information passed in and creates a new document, incrementing the document's version while applying the changes. If a document has a list field and the update request contains a single value, that value replaces the entire list. (Many SO Q&As cover this.)
I do not want to replace the value of that list with the new value, but instead to update a single value in a list to a new value.
For example, in my _source I have a field as ['101 country drive', '35 park drive', '277 thunderroad belway']. This field has three values, but let's say we realize that this document is incorrect and we need to update '101 country drive' to '1001 country drive'.
I do not want to delete the other values in the list, instead, I want to only update the index value with a new value.
Do I need to write a painless script to perform this action, or is there another method to perform this action?
Example:
Want to update the document
From ---
{'took': 176,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 0, 'relation': 'eq'},
'max_score': None,
'hits': [{'_index': 'docobot', '_type': '_doc', '_id': '19010239',
'_source': {'name': 'josephine drwaler', 'address': ['101 country drive', '35 park drive', '277 thunderroad belway']
}}]}}
to
{'took': 176,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 0, 'relation': 'eq'},
'max_score': None,
'hits': [{'_index': 'docobot', '_type': '_doc', '_id': '19010239',
'_source': {'name': 'josephine drwaler', 'address': ['1001 country drive', '35 park drive', '277 thunderroad belway']
}}]}}
Notice that the address is updated only for the first index, but the index number should not be a factor in updating the value of address in _source.
What is the most efficient and pythonic way to go about partially updating documents in elasticsearch while keeping the integrity of the remaining values in that field?
The _source is what was passed to Elasticsearch in the API request; it is not a "field" in the same sense that address is.
That said, you need to replace the entire address field with what you want, not just the value you want corrected. Elasticsearch assumes that what you pass in is the entirety of the field's value and will overwrite that field with what it gets.
You need to create a painless script to do the update. When doing so, keep in mind that you can assign to any field in _source like this:
ctx._source.address = ['1001 country drive', '35 park drive', '277 thunderroad belway']
But this doesn't solve the problem...
The field is a list, so we need to iterate through it. The painless script below loops through each item, compares it to the search parameter, and adds either the replacement value or the original item to the result.
def upd_address = [];
for (def item : ctx._source.address) {
    if (item == params.search_id) {
        upd_address.add(params.answer);
    } else {
        upd_address.add(item);
    }
}
ctx._source.address = upd_address;
You can use the above with elasticsearch_dsl like this:
from elasticsearch_dsl import UpdateByQuery

ubq = UpdateByQuery(using=[your es connection], doc_type='doc', index=['your index'])
ubq = ubq.script(source=[above script], params={'search_id': addrss, 'answer': upd_addrss})
res = ubq.execute()
print(res, type(res))
The update-by-query request loops through each item in the list, checks whether it matches the search id, and keeps the replacement value if it does, otherwise it keeps the original item.
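For completeness, a minimal sketch of the same update against a single document with the low-level elasticsearch Python client (the index name, document id, and address values are taken from the example above and are illustrative only):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a reachable cluster on localhost:9200

# Same painless loop as above: rebuild the list, swapping the matching entry.
script = """
def upd_address = [];
for (def item : ctx._source.address) {
    if (item == params.search_id) { upd_address.add(params.answer); }
    else { upd_address.add(item); }
}
ctx._source.address = upd_address;
"""

es.update(
    index='docobot',
    id='19010239',
    body={'script': {
        'source': script,
        'lang': 'painless',
        'params': {'search_id': '101 country drive', 'answer': '1001 country drive'},
    }},
)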

Rethinkdb: Chain getAll & between

I get the error RqlRuntimeError: Expected type TABLE_SLICE but found SELECTION when chaining the getAll and between methods.
r.db('mydb').table('logs')
.getAll('1', {index: 'userId'})
.between(r.time(2015, 5, 1, 'Z'), r.time(2015, 5, 4, 'Z'), {index: 'createdAt'})
Is there a way to use indexes when querying for data that belongs to userId=1 and where createdAt is between two dates? As I understand it, filter does not use indexes.
You can't use two indexes like that, but you can create a compound index:
r.table('logs').indexCreate('compound', function(row) {
return [row('userId'), row('createdAt')];
})
r.table('logs').between([1, firstTime], [1, secondTime], {index: 'compound'})
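For reference, a rough equivalent with the RethinkDB Python driver (connection details, index values, and time bounds are assumptions based on the example above):

import rethinkdb as r  # pre-2.4 style import; newer drivers use RethinkDB()

conn = r.connect('localhost', 28015)

# One-time setup: compound index over userId and createdAt.
r.db('mydb').table('logs').index_create(
    'compound', lambda row: [row['userId'], row['createdAt']]
).run(conn)
r.db('mydb').table('logs').index_wait('compound').run(conn)

# All logs for userId '1' created between the two dates.
cursor = r.db('mydb').table('logs').between(
    ['1', r.time(2015, 5, 1, 'Z')],
    ['1', r.time(2015, 5, 4, 'Z')],
    index='compound'
).run(conn)
for log in cursor:
    print(log)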

quick check whether an elasticsearch index will return search hits

We're running queries against a few elasticsearch indices like so:
curl -XGET 'http://es-server:9200/logstash-2015.01.28,logstash-2015.01.27/_search?pretty' -d @/a_query_file_in_json_format
Works great most of the time, and we can parse the results we need.
However, when the indices are in a bad state (maybe there's been a lag in indexing, or some shards are acting up), the query above will return no results, and it's impossible to know whether that's because there are no matching records or because the index is unstable in some way.
I've been looking at the elasticsearch indices recovery API but am a bit overwhelmed. Are there some queries I can run that will give a yes/no answer to 'can a search against these indices be relied upon at the moment?'
You have multiple ways to get this information.
1) You can use the cluster health API at the indices level like this :
GET _cluster/health/my_index?level=indices
This will output the status of the cluster, with information about the status and shards of the index my_index:
{
"cluster_name": "elasticsearch_thomas",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 5,
"active_shards": 5,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 5,
"indices": {
"my_index": {
"status": "yellow",
"number_of_shards": 5,
"number_of_replicas": 1,
"active_primary_shards": 5,
"active_shards": 5,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 5
}
}
}
2) If you want a less verbose answer, or to filter on some specific information only, you can rely on the _cat API, which allows you to customize the output. However, the output is no longer JSON.
For example, if you want only the name and health status of the indices, the following request will do the trick:
GET _cat/indices/my_index?h=index,health&v
which outputs this:
index health
my_index yellow
Note that the column headers are shown only because of the verbose flag (v GET parameter in the previous request).
To have a complete list of the available columns, use the help parameter:
GET _cat/indices?help
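To turn this into a quick yes/no check from Python, one option is a small helper with the elasticsearch client (the host, index names, and the "anything but red is searchable" threshold are assumptions; tighten them as needed):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://es-server:9200'])

def indices_searchable(*indices):
    """Return True if none of the given indices is in 'red' status."""
    health = es.cluster.health(index=','.join(indices), level='indices')
    return all(info['status'] != 'red' for info in health['indices'].values())

print(indices_searchable('logstash-2015.01.28', 'logstash-2015.01.27'))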

Paginating Distinct Values in Mongo

I would like to retrieve the distinct values for a field from a collection in my database. The distinct command is the obvious solution. The problem is that some fields have a large number of possible values and are not simple primitive values (i.e. they are complex sub-documents rather than just a string). This means the results are large, which overloads the client I am delivering the results to.
The obvious solution is to paginate the resulting distinct values. But I cannot find a good way to optimally do this. Since distinct doesn't have pagination options (limit, skip, etc) I am turning to the aggregation framework. My basic pipeline is:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$limit: 10},
]
This gives me the first 10 unique values for myfield. To get the next page a $skip operator is added to the pipeline. So:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
But sometimes the field I am collecting unique values from is an array. This means I have to unwind it before grouping. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$myfield'},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Other times the field I am getting unique values for may not be an array, but its parent node might be an array. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Finally, sometimes I also need to filter on data inside the field I am getting distinct values for. This means I sometimes need another match operator after the unwind:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$match: {... filter within list.myfield ...}},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
The above are all simplifications of my actual data. Here is a real pipeline from the application:
[
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id": 1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$unwind":"$data.surroundings.school.nearby"},
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id":1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$group":{"_id":"$data.surroundings.school.nearby"}},
{"$sort":{"_id":1}},
{"$skip":10},
{"$limit":10}
]
I hand the same $match document to both the initial filter and the filter after the unwind because the $match document comes from a 3rd party and is somewhat opaque to me, so I don't really know which parts of the query filter outside my unwind versus inside the data I am getting distinct values for.
Is there any obviously different way to go about this? In general, my strategy is working, but there are some queries where it takes 10-15 seconds to return the results. There are about 200,000 documents in the collection, although only about 60,000 remain after the first $match in the pipeline is applied (which can use indexes).
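For what it's worth, a minimal pymongo sketch of how such paginated distinct queries might be wrapped (the database, collection, and filter names are placeholders, not the actual application names):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
coll = client['mydb']['listings']

def distinct_page(match, field, page, page_size=10):
    """Return one page of distinct values of `field` for documents matching `match`."""
    pipeline = [
        {'$match': match},
        {'$group': {'_id': '$' + field}},
        {'$sort': {'_id': 1}},
        {'$skip': page * page_size},
        {'$limit': page_size},
    ]
    return [doc['_id'] for doc in coll.aggregate(pipeline, allowDiskUse=True)]

print(distinct_page({'source_name': 'fmls'}, 'data.status.current', page=0))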

CouchDB - hierarchical comments with ranking. Hacker News style

I'm trying to implement a basic way of displaying comments the way Hacker News does, using CouchDB. Not only should they be ordered hierarchically, but each level of the tree should also be ordered by a "points" variable.
The idea is that I want a view to return the comments already in the order I expect, rather than making many Ajax calls, for example, to retrieve them and reorder them on the client.
This is what I got so far:
Each document is a "comment".
Each comment has a property path which is an ordered list containing all its parents.
So for example, imagine I have 4 comments (with _id 1, 2, 3 and 4). Comment 2 is a child of 1, comment 3 is a child of 2, and comment 4 is also a child of 1. This is what the data would look like:
{ _id: 1, path: ["1"] },
{ _id: 2, path: ["1", "2"] },
{ _id: 3, path: ["1", "2", "3"] }
{ _id: 4, path: ["1", "4"] }
This works quite well for the hierarchy. A simple view will already return things ordered the way I want it.
The issue comes when I want to order each "level" of the tree independently. So for example documents 2 and 4 belong to the same branch, but are ordered, on that level, by their ID. Instead I want them ordered based on a "points" variable that I want to add to the path, but I can't figure out where to add this variable for it to work the way I want.
Is there a way to do this? Consider that the "points" variable will change in time.
Because each level needs to be sorted recursively by score, Couch needs to know the score of each parent to make this work the way you want it to.
Taking your example with the following scores (1: 10, 2: 10, 3: 10, 4: 20)
In this case you'd want the ordering to come out like the following:
.1
.1.4
.1.2
.1.2.3
Your document needs a scores array like this:
{ _id: 1, path: [1], scores: [10] },
{ _id: 2, path: [1, 2], scores: [10,10] },
{ _id: 3, path: [1, 2, 3], scores: [10,10,10] },
{ _id: 4, path: [1, 4], scores: [10,20] }
Then you'll use the following sort key in your view.
emit([doc.scores, doc.path], doc)
The path gets used as a tiebreaker because there will be cases where sibling comments have the exact same score. Without the tiebreaker, their descendants could lose their grouping (by chain of ancestry).
Note: This approach will return scores from low to high, whereas you probably want scores high to low and the path/tiebreaker low to high. So a workaround is to populate the scores array with the inverse of each score like this:
{ _id: 1, path: [1], scores: [0.1] },
{ _id: 2, path: [1, 2], scores: [0.1,0.1] },
{ _id: 3, path: [1, 2, 3], scores: [0.1,0.1,0.1] },
{ _id: 4, path: [1, 4], scores: [0.1,0.2] }
and then use descending=true when you request the view.
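As an illustrative sketch, requesting such a view in descending order from Python might look like this (the database, design document, and view names are made up):

import requests

# Hypothetical CouchDB database/design doc/view names.
url = 'http://localhost:5984/comments/_design/threads/_view/by_score_and_path'

resp = requests.get(url, params={'descending': 'true', 'include_docs': 'true'})
for row in resp.json()['rows']:
    print(row['key'], row['doc']['_id'])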
For anyone interested, there is a thread on this question with several solution variants:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/201205.mbox/thread -> thread "Hierarchical comments Hacker News style", 16/05/2012
