Solr - unexpected sorting order and sortMissingLast

I've got text type defined as below:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
...
A couple of fields use this type. One of them is a title field, which is always defined and is neither missing nor empty in any document. When sorting by this field, either asc or desc, Solr nevertheless did not return documents in the requested order but in a seemingly random one. Only after adding sortMissingLast="true" to the type declaration did the sorting come out in the proper order.
Can anybody explain why this is so? In my understanding, sortMissingLast shouldn't have any effect on this sort, as a) it's connected with the insertion of documents and b) all documents in my collection have this field defined.
Reading further:
If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc).
I do indeed have other fields that use the same text type, however all of them are present. They might be empty, but they're present within the document.

I tested a sample index with your fieldType, that is, text. When I sorted with title asc, I got the response below:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"sort":"title asc",
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":10,"start":0,"docs":[
{
"id":["123"],
"title":"awesome designs_takeaway",
"lastmodified":"f75a2e26-cb41-4028-abb2-bcd7f61e4f9e",
"_version_":1538551521252212736},
{
"id":["124"],
"title":"breathtaking_designs takeaway",
"lastmodified":"170b3857-d906-44df-950c-547c25b4e594",
"_version_":1538551543494606848},
{
"id":["125"],
"title":"curtain raiser",
"lastmodified":"ea7149d5-449f-4d69-919b-617b90420381",
"_version_":1538573292313509888},
{
"id":["126"],
"title":"defying gravity_008",
"lastmodified":"82844b75-24ba-4b2f-be20-9bb3fe83e6b1",
"_version_":1538551590630195200},
{
"id":["127"],
"title":"emancipation_of the poor",
"lastmodified":"d19482a5-1666-4d4e-a40e-eb93c00eca7e",
"_version_":1538551627310432256},
{
"id":["128"],
"title":"functioning of the-metadata",
"lastmodified":"7b07f281-1268-48cc-aee6-7a6636702ba5",
"_version_":1538551653171462144},
{
"id":["130"],
"title":"graphics enhancer 101",
"lastmodified":"67fd79d6-2ae5-4597-b2e1-128bfd815b67",
"_version_":1538551680471138304},
{
"id":["131"],
"title":"half-hearted attempt",
"lastmodified":"abb4707c-8392-4595-aaeb-fbf6d4f098b1",
"_version_":1538551699761790976},
{
"id":["132"],
"title":"INK jet corporation",
"lastmodified":"b29ba3af-f3da-49d1-bd45-f7d277c53cff",
"_version_":1538551727666495488},
{
"id":["136"],
"title":"xamarin",
"filecontent":"bolshevik",
"lastmodified":"af8a1445-e693-4bac-9ac8-84fa2c9b838d",
"_version_":1538571040880328704}]
}}
When I sorted with title desc, this was my response:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"sort":"title desc",
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":10,"start":0,"docs":[
{
"id":["136"],
"title":"xamarin",
"filecontent":"bolshevik",
"lastmodified":"af8a1445-e693-4bac-9ac8-84fa2c9b838d",
"_version_":1538571040880328704},
{
"id":["132"],
"title":"INK jet corporation",
"lastmodified":"b29ba3af-f3da-49d1-bd45-f7d277c53cff",
"_version_":1538551727666495488},
{
"id":["131"],
"title":"half-hearted attempt",
"lastmodified":"abb4707c-8392-4595-aaeb-fbf6d4f098b1",
"_version_":1538551699761790976},
{
"id":["130"],
"title":"graphics enhancer 101",
"lastmodified":"67fd79d6-2ae5-4597-b2e1-128bfd815b67",
"_version_":1538551680471138304},
{
"id":["128"],
"title":"functioning of the-metadata",
"lastmodified":"7b07f281-1268-48cc-aee6-7a6636702ba5",
"_version_":1538551653171462144},
{
"id":["127"],
"title":"emancipation_of the poor",
"lastmodified":"d19482a5-1666-4d4e-a40e-eb93c00eca7e",
"_version_":1538551627310432256},
{
"id":["126"],
"title":"defying gravity_008",
"lastmodified":"82844b75-24ba-4b2f-be20-9bb3fe83e6b1",
"_version_":1538551590630195200},
{
"id":["125"],
"title":"curtain raiser",
"lastmodified":"ea7149d5-449f-4d69-919b-617b90420381",
"_version_":1538573292313509888},
{
"id":["124"],
"title":"breathtaking_designs takeaway",
"lastmodified":"170b3857-d906-44df-950c-547c25b4e594",
"_version_":1538551543494606848},
{
"id":["123"],
"title":"awesome designs_takeaway",
"lastmodified":"f75a2e26-cb41-4028-abb2-bcd7f61e4f9e",
"_version_":1538551521252212736}]
}}
As you can see, I am getting the expected results. I used Solr v5.3.2. Your text type does not split the text into multiple tokens (KeywordTokenizerFactory keeps the whole value as a single token), so it is a good candidate for sorting, and the analysis chain is unlikely to be the cause of the problem. The sortMissingLast and sortMissingFirst parameters serve a completely different purpose. I did try them in order to replicate your observations, but saw only the expected results. Since you say that every document has a title field, I also kept a title field in all of my documents, so sortMissingLast and sortMissingFirst had nothing to act on: they only affect documents that have no title field. Your results should therefore not deviate from mine. This leads me to infer that your Solr version may have a bug. If you are not using the same version as I am, try your documents on 5.3.2 or on some other version different from yours. Alternatively, if you don't suspect a Solr bug, can you provide a subset of the titles that are sorted incorrectly, so we can see how they are analyzed? Let me know if that helps :)
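To make the reasoning above concrete, here is a minimal Python sketch of the ordering I would expect from a KeywordTokenizer plus LowerCaseFilter field. It is only a model, not Solr code: the function names are made up, document "999" (which has no title) is invented for illustration, and the placement of missing values follows the behaviour described in the Solr schema comments (missing first for asc and last for desc by default, always last with sortMissingLast="true").

```python
def sort_key(title):
    # KeywordTokenizer keeps the whole value as one token and
    # LowerCaseFilter lowercases it, so the key is the lowercased title.
    return title.lower()


def solr_like_sort(docs, field, descending=False, sort_missing_last=False):
    """Model of sorting on a single-token, lowercased field (hypothetical helper)."""
    present = [d for d in docs if d.get(field) is not None]
    missing = [d for d in docs if d.get(field) is None]
    ordered = sorted(present, key=lambda d: sort_key(d[field]), reverse=descending)
    if sort_missing_last:
        # Documents without the field come last regardless of direction.
        return ordered + missing
    # Assumed default: missing first for asc, last for desc.
    return ordered + missing if descending else missing + ordered


docs = [
    {"id": "132", "title": "INK jet corporation"},
    {"id": "136", "title": "xamarin"},
    {"id": "123", "title": "awesome designs_takeaway"},
    {"id": "131", "title": "half-hearted attempt"},
    {"id": "999"},  # invented document without a title
]

for doc in solr_like_sort(docs, "title", sort_missing_last=True):
    print(doc["id"], doc.get("title"))
```

Because the key is lowercased, "INK jet corporation" lands between "half-hearted attempt" and "xamarin", as in the responses above. When every document has a title, the sort_missing_last flag changes nothing in this model, which is why it should not matter for your data.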

Related

Terms query does not work on keyword field which contains an array of values

I am a beginner in Elasticsearch. I recently added a new field, jc_job_meta_field, which is of keyword type (see image 1 below, where I output the mapping of all my fields), and my index is en-gb. I expect it to be an array holding a bunch of values, and I now have a document with ["Virtual", "Hybrid"] in that field. I want to be able to search for all entries with Virtual in the field jc_job_meta_field. But when I run a terms query like this
{
"query": {
"terms": {
"jc_job_meta_field": ["Virtual"]
}
}
}
nothing is returned (see image 2 below). Shouldn't it at least return that exact document with ["Virtual", "Hybrid"]? I checked a similar post here and it seems like I am doing exactly what is supposed to work. What went wrong here? Thanks in advance!
My Mapping and field values:
My query:
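For reference, a small Python sketch of how a terms query is expected to match an array-valued keyword field: the document matches if any stored value is exactly equal (case-sensitive) to any of the query terms. This is only a model under the assumption that the field is a plain keyword field without a normalizer; terms_match is a hypothetical helper, not an Elasticsearch API.

```python
def terms_match(doc, field, terms):
    """Model of a terms query on an unanalyzed keyword field (hypothetical helper)."""
    values = doc.get(field)
    if values is None:
        return False
    if not isinstance(values, list):
        values = [values]  # a single value behaves like a one-element array
    # Exact, case-sensitive comparison: keyword values are not analyzed.
    return any(value in terms for value in values)


doc = {"jc_job_meta_field": ["Virtual", "Hybrid"]}

print(terms_match(doc, "jc_job_meta_field", ["Virtual"]))
print(terms_match(doc, "jc_job_meta_field", ["virtual"]))
```

Under these semantics the query in the question would match the document, so if nothing comes back it may be worth double-checking that the field is really mapped as keyword in the index being searched and that the document is in that index.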

Type of field for prefix search in Elastic Search

I'm confused about which field type I should use for prefix search. Many suggest search_as_you_type, but I don't think autocomplete is what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Using 3739 should not return it, as the ID doesn't start with 3739. Initially I was going for partial search, but the wildcard field type is not supported by Amazon AWS, so I compromised on prefix search instead.
I tried a search_as_you_type field, but it doesn't return the result when I use the whole ID. In my use case the results are shown when the user presses Enter, not live as they type, so it's OK if speed is compromised; I just hope for something that works well with many rows of data.
Thanks
If you have not explicitly defined an index mapping, you need to use the id.keyword field instead of the id field for the prefix query to return the appropriate results. The id.keyword sub-field is of type keyword, so the value is indexed as a single unanalyzed term, whereas the id text field is split into tokens by the standard analyzer.
{
"query": {
"prefix": {
"id.keyword": {
"value": "34y72ca1"
}
}
}
}
Otherwise, you can modify your index mapping by adding multi-fields for the id field.
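To illustrate why the prefix query should target the keyword sub-field, here is a rough Python sketch comparing a single unanalyzed term with a tokenized value. The tokenizer below is only an approximation of the standard analyzer (lowercase, split on non-alphanumeric characters), and all function names are hypothetical.

```python
import re

UUID = "34y72ca1-3739-41ff-bbec-f6d17479384c"


def keyword_terms(value):
    # A keyword field indexes the whole value as one term.
    return [value]


def standard_like_terms(value):
    # Rough stand-in for the standard analyzer: lowercase and split
    # on anything that is not a letter or digit (an approximation).
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def prefix_matches(terms, prefix):
    # A prefix query matches if any indexed term starts with the prefix.
    return any(term.startswith(prefix) for term in terms)


for prefix in ["3", "34", "34y72ca1", "34y72ca1-3739", UUID, "3739"]:
    print(
        prefix,
        prefix_matches(keyword_terms(UUID), prefix),
        prefix_matches(standard_like_terms(UUID), prefix),
    )
```

In this model the keyword term matches every prefix listed in the question and rejects 3739, while the tokenized variant accepts 3739 (it is the start of one token) and rejects 34y72ca1-3739 (no single token contains the hyphen).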

Kibana scripted field which loops through an array

I am trying to use the metricbeat http module to monitor F5 pools.
I make a request to the f5 api and bring back json, which is saved to kibana. But the json contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array. eg
doc['http.f5pools.items.monitor'].value.length()
returns in the preview results with the same 'Additional Field' added for comparison:
[
{
"_id": "rT7wdGsBXQSGm_pQoH6Y",
"http": {
"f5pools": {
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
},
"pool.MemberCount": [
7
]
},
If I try
doc['http.f5pools.items']
Or similar I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e., is my code, or the way I'm indexing the data, wrong?
If not is there an alternative approach within metricbeats? I don't want to have to make a whole new api to do the calculation and add a separate field
-- update.
Weirdly it seems that the number values in the array do return the expected results. ie.
doc['http.f5pools.items.ratio']
returns
{
"_id": "BT6WdWsBXQSGm_pQBbCa",
"pool.MemberCount": [
1,
1
]
},
-- update 2
Ok, so if the strings in the field have different values then you get all the values. if they are the same you just get one. wtf?
I'm adding another answer instead of deleting my previous one, which does not address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting. My hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not explain why it works differently for strings than for numbers: numbers are preserved, but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
A deeper dive into doc values revealed that they use a compression technique which de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior, you can disable doc values.
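The difference between strings and numbers described above can be modelled in a few lines of Python. This is only a sketch of my understanding (multi-valued string fields stored as a sorted set of unique terms, multi-valued numerics stored as a sorted list that keeps duplicates); the function names are made up and nothing here is Elasticsearch code.

```python
def string_doc_values(values):
    # Strings go through an ordinal table: de-duplicated and sorted.
    return sorted(set(values))


def numeric_doc_values(values):
    # Numbers are sorted per document, but duplicates are kept.
    return sorted(values)


monitors = ["default", "default"]
ratios = [1, 1]

print(string_doc_values(monitors))  # what doc['...monitor'] would expose in this model
print(numeric_doc_values(ratios))   # what doc['...ratio'] would expose in this model
```

In this model two identical monitor strings collapse into one entry while two identical ratio numbers both survive, which matches what the question's updates observed.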
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are prefiltered to only return distinct values (except in the case of ints, apparently?)
The solution is to use params._source instead of doc[].
The answer for why doc doesn't work
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo-points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects.
Also, important to add a null check as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
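As an illustration of the map-of-maps idea, here is a Python sketch that walks a nested source document and counts array items with a given field value. A plain dict stands in for params._source, count_items is a hypothetical helper, and matching on monitor == "default" is only taken from the sample output in the question; the field that actually marks a pool member as up is not shown there.

```python
source = {
    "http": {
        "f5pools": {
            "items": [
                {"monitor": "default"},
                {"monitor": "default"},
            ]
        }
    }
}


def count_items(source, path, field, wanted):
    """Walk nested maps along `path`, then count list items whose field equals `wanted`."""
    node = source
    for key in path:
        # Guard against missing keys, similar in spirit to a null check.
        if not isinstance(node, dict) or key not in node:
            return 0
        node = node[key]
    if not isinstance(node, list):
        return 0
    return sum(
        1 for item in node
        if isinstance(item, dict) and item.get(field) == wanted
    )


print(count_items(source, ["http", "f5pools", "items"], "monitor", "default"))
```

Because the raw source keeps every array element, identical values are counted individually here, unlike in the doc-values view.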
Responding to your comment with an example:
The keyword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running the request below and see the mapping it creates:
PUT t5/doc/2
{
"items": [
{
"monitor": "default"
},
{
"monitor": "default"
}
]
}
GET t5/_mapping
{
"t5" : {
"mappings" : {
"doc" : {
"properties" : {
"items" : {
"properties" : {
"monitor" : { <-- monitor is a property of items property(Object)
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}

How to treat certain field values as null in `Elasticsearch`

I'm parsing log files which, for simplicity's sake, let's say have the following format:
{"message": "hello world", "size": 100, "forward-to": "127.0.0.1"}
I'm indexing these lines into an Elasticsearch index, where I've defined a custom mapping such that message, size, and forward-to are of type text, integer, and ip respectively. However, some log lines will look like this :
{"message": "hello world", "size": "-", "forward-to": ""}
This leads to parsing errors when Elasticsearch tries to index these documents. For technical reasons, it's far from trivial for me to pre-process these documents and change "-" and "" to null. Is there any way to define which values my mapping should treat as null? Is there perhaps an analyzer I can write that works on any field type whatsoever and that I can add to all entries in my mapping?
Basically I'm looking for somewhat of the opposite of the null_value option. Instead of telling Elasticsearch what to turn a null_value into, I'd like to tell it what it should turn into a null_value. Also acceptable would be a way to tell Elasticsearch to simply ignore fields that look a certain way but still parse the other fields in the document.
So this one's easy apparently. Add the following to your mapping settings :
{
"settings": {
"index": {
"mapping": {
"ignore_malformed": "true"
}
}
}
}
This will still index the field (contrary to what I've understood from the documentation...) but it will be ignored during aggregations (so if you have 3 entries in an integer field that are "1", 3, and "hello world", an averaging aggregation will yield 2).
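The averaging example can be modelled with a short Python sketch. It assumes that numeric strings such as "1" are coerced to integers and that values which cannot be parsed are simply left out of the aggregation; parse_integer and average_ignoring_malformed are hypothetical helpers, not Elasticsearch functions.

```python
def parse_integer(value):
    """Return the value as an int, or None if it is malformed for an integer field."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())  # "1" is coerced to 1
        except ValueError:
            return None  # "-", "" and "hello world" end up here
    return None


def average_ignoring_malformed(values):
    # Only well-formed values take part in the average.
    parsed = [p for p in (parse_integer(v) for v in values) if p is not None]
    return sum(parsed) / len(parsed) if parsed else None


print(average_ignoring_malformed(["1", 3, "hello world"]))
```

With the three entries from the example, only 1 and 3 contribute, so the average in this model is 2.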
Keep in mind that because of the way the option was implemented (and I would say this is a bug), this still fails for an object that is entered where a concrete value is expected, and vice versa. If you'd like to get around that, you can set the field's enabled value to false like this:
{
"mappings": {
"my_mapping_name": {
"properties": {
"my_unpredictable_field": {
"enabled": false
}
}
}
}
}
This comes at a price, though: the field won't be indexed, but the values entered will still be stored, so you can still access them by finding the document through another field. This usually shouldn't be an issue, as you likely won't be filtering documents based on the value of such an unpredictable field, but that depends on your specific use case. See here for the official discussion of this issue.

Sort results in alphabetical order with type=text_en

I have a solr text field as follows.
<field name="news_headline_ln_en" type="text_en" indexed="true" stored="true"/>
And when querying to sort results as follows, it doesn't show results in correct alphabetical order.
http://localhost:8983/solr/news/select?fl=news_headline_ln_en&indent=on&q=*:*&rows=100&sort=news_headline_ln_en%20desc&start=0&wt=json
Result response:
{
"responseHeader":{
"status":0,
"QTime":45,
"params":{
"q":"*:*",
"indent":"on",
"fl":"news_headline_ln_en",
"start":"11610",
"sort":"news_headline_ln_en asc",
"rows":"12021",
"wt":"json",
"_":"1478085256196"}},
"response":{"numFound":12621,"start":11610,"docs":[
{
"news_headline_ln_en":"Eleven stocks up despite UAE markets decline"},
{
"news_headline_ln_en":"\nOil Prices Decline on Fed Rate Rise Jitters"},
{
"news_headline_ln_en":"Euro unemployment rate declines in February"},
{
"news_headline_ln_en":"Investors Holding’s Q4 profits decrease"},
{
"news_headline_ln_en":"DED honors ‘On Time’ in Oud Metha for excellence"},
{
"news_headline_ln_en":"\nTreasures From The Deep -- WSJ"},
{
"news_headline_ln_en":"Tunisia shares deepen early losses"},
{
"news_headline_ln_en":"EGX deepens losses in week"},
{
As you can see, it is not sorted alphabetically. Does anyone know a possible reason? I'd appreciate any help.
You can't. text_en isn't suited for sorting, as it tokenizes the input and breaks the text up into separate tokens. These tokens are not usable for sorting.
The solution is to add a copyField instruction that copies the content from the text_en field over to a field that is suitable for sorting, such as a string field or a text field with a KeywordTokenizer (which will allow you to lowercase the string, but keep it as a single token - if you want the sort to be case insensitive). If you're using a string field, you'll have to lowercase the field before indexing it yourself if you want the sort to be case insensitive.
<copyField source="news_headline_ln_en" dest="news_headline_ln_en_sort" />
.. and then use sort=news_headline_ln_en_sort for sorting. You can use the maxChars setting if you only need to copy the beginning of the original string (for example, if you're sorting by the start of an article, you probably only need the first 20-40 characters of the article for the sort to be useful).
Also see defining fields and the Schema API.
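As a rough model of what the copied sort field gives you, the Python sketch below builds a sort value by optionally truncating the headline (standing in for maxChars) and lowercasing it (standing in for KeywordTokenizer plus a lowercase filter). The function names are hypothetical, and the headlines are taken from the response in the question.

```python
def sort_value(headline, max_chars=None):
    # maxChars on copyField copies only the first max_chars characters.
    copied = headline if max_chars is None else headline[:max_chars]
    # The whole copied string stays one token; lowercasing makes the
    # sort case-insensitive.
    return copied.lower()


def sort_headlines(headlines, descending=False, max_chars=None):
    return sorted(
        headlines,
        key=lambda h: sort_value(h, max_chars),
        reverse=descending,
    )


headlines = [
    "Tunisia shares deepen early losses",
    "EGX deepens losses in week",
    "Euro unemployment rate declines in February",
    "Eleven stocks up despite UAE markets decline",
]

for headline in sort_headlines(headlines):
    print(headline)
```

Note that a very small max_chars makes many headlines compare as equal, so the truncation length should be chosen with the data in mind. Some headlines in the question start with a newline character; this sketch does not trim whitespace, so such values would sort ahead of the rest unless they are cleaned up before indexing.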
