How to sort over TimeUUID field in Solr?

According to the Mapping Solr types article, the UUIDField type can be used to represent Cassandra's TimeUUID fields. It works fine until I try to sort on that field in Solr. It seems that Solr sorts such fields as ordinary UUIDs. Am I right? How can I sort on TimeUUID columns in Solr in that case?

Related

Elasticsearch 7 - Sort on custom field of multi-field property

I am working on upgrading a system at work from using ES1 to ES7.
Part of the ES1 implementation included a custom plugin to add an analyzer for custom sorting. The custom sorting behavior we have is similar to "natural sort", but extended to deal with legal codes. For example, it will sort 1.1.1 before 1.10.1. We've been calling this "legal sort". We used this plugin to add an extra .legalsort field to multi-field properties in our index, and then we would sort based on this field when searching.
I am currently trying to adapt the main logic for indexing and searching to ES7. I am not trying to replace the "legal sort" plugin yet. When trying to implement sorting for searches, I ran into the error Fielddata is disabled on text fields by default. The solution I've seen suggested for that is to add a .keyword field for any text properties, which will be used for sorting and aggregation. This "works", but I don't see how I can then apply our old logic of sorting based on a .legalsort field.
Is there a way to sort on a field other than .keyword, which can use a custom analyzer, like we were able to in ES1?
The important aspect is not the name of your field (like *.keyword), but the type of the field. For exact-match searches, sorting, and aggregations, the type of the field should be "keyword".
If you only use the legalsort field for display, sorting, aggregations, or exact match, simply change its type from "text" to "keyword".
If you want to use the same information for both purposes, it's recommended to make it a multi-field: use the "keyword"-type field for sorting, aggregations, and exact-match search, and use the "text"-type field for full-text search.
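For illustration, here is a minimal sketch of such a multi-field mapping and a sort on its keyword sub-field, using the elasticsearch-py 7.x client (the index name "documents", the field name "legal_code" and the sub-field name "sort" are made up for this example; a custom "legal sort" analyzer is not covered here):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "text" for full-text search, plus a "keyword" sub-field for sorting/aggregations.
# Note: a plain keyword sub-field sorts lexicographically, not in "legal sort" order.
es.indices.create(index="documents", body={
    "mappings": {
        "properties": {
            "legal_code": {
                "type": "text",
                "fields": {
                    "sort": {"type": "keyword"}
                }
            }
        }
    }
})

# Search on the text field, sort on the keyword sub-field.
es.search(index="documents", body={
    "query": {"match": {"legal_code": "1.1"}},
    "sort": [{"legal_code.sort": "asc"}]
})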
Having two types available for the two purposes is a significant improvement over the single string type you had in ES 1.0. When you sorted in ES 1.0, the information stored in the inverted index had to get uninverted and was kept in RAM. This data structure is called fielddata. It was unbounded and often caused out-of-memory exceptions. Newer versions of Lucene introduced an alternative data structure which resides on disk (and in the file system cache) as a replacement for fielddata. It's called doc values and allows sorting on huge amounts of data without consuming a significant amount of heap RAM. The only drawback: doc values are not available for analyzed text (fields of type text), hence the need for a field of type keyword.
You could also set the mapping parameter "fielddata" to true for your legalsort field, enabling fielddata for this particular field to get back the previous behaviour, with all its drawbacks.
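A hedged sketch of that last option (same made-up index and field names as above; this re-enables the heap-hungry uninverted structure, so use it with care):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Turn fielddata on for an existing text field so it can be sorted/aggregated on.
es.indices.put_mapping(index="documents", body={
    "properties": {
        "legal_code": {
            "type": "text",
            "fielddata": True
        }
    }
})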

Point of types in Elastic 6.x

ES newbie here, simple question: What is the point of a type in ES 6.x if each index can only have one type? I've noticed that inserting a document requires both the type and index to be specified, but this seems redundant to me.
Quoting https://www.elastic.co/guide/en/elasticsearch/reference/6.x/removal-of-types.html:
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Mapping types will be completely removed in Elasticsearch 7.0.0.
and:
Why are mapping types being removed?
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.
In an Elasticsearch index, fields that have the same name in different mapping types are backed by the same Lucene field internally. In other words, using the example above, the user_name field in the user type is stored in exactly the same field as the user_name field in the tweet type, and both user_name fields must have the same mapping (definition) in both types.
This can lead to frustration when, for example, you want deleted to be a date field in one type and a boolean field in another type in the same index.
On top of that, storing different entities that have few or no fields in common in the same index leads to sparse data and interferes with Lucene’s ability to compress documents efficiently.
For these reasons, we have decided to remove the concept of mapping types from Elasticsearch.
You can find more details under the link.
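As a concrete illustration (not from the quote above), here is roughly what this looks like against a 6.x cluster with the elasticsearch-py 6.x client, where doc_type is still a parameter even though each index may hold only one type (the index name "tweets" is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# In 6.x an index can contain a single mapping type, conventionally named "_doc".
es.index(index="tweets", doc_type="_doc", id=1,
         body={"user_name": "kimchy", "text": "hello"})

# Indexing into a second type in the same index is rejected by 6.x with an
# illegal_argument_exception (the final mapping would have more than one type).
es.index(index="tweets", doc_type="users", id=2, body={"user_name": "kimchy"})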

Can I make elasticsearch index like hive's partitioned table?

In Hive, a table can be partitioned by the values of a date field.
Can I do the same with an Elasticsearch index?
I would like to be able to partition an index by date using specific field values within the index.
I would appreciate any techniques for this, even if they don't necessarily use partitioning on specific field values.
Thank you.
Sure, you can define one index per day, with a YYYY.MM.dd suffix in its name.
This is what Logstash does by default.
In Kibana, you can do wildcard searches on logstash-* or logstash-2018.*. I'm not sure if you can do the same with the regular search API.
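A minimal sketch of this pattern (index names and fields are illustrative; wildcard index patterns are in fact accepted by the regular search API as well):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Route each document into a daily index, Logstash-style.
now = datetime.now(timezone.utc)
index_name = "logstash-" + now.strftime("%Y.%m.%d")
es.index(index=index_name, body={"@timestamp": now.isoformat(), "message": "something happened"})

# Query across all the daily "partitions" with a wildcard index pattern.
es.search(index="logstash-2018.*",
          body={"query": {"range": {"@timestamp": {"gte": "2018-01-01"}}}})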

Can ElasticSearch be used purely for aggregations?

In my current use case, I'm using ElasticSearch as a document store, over which I am building a faceted search feature.
The docs state the following:
Sorting, aggregations, and access to field values in scripts requires a different data access pattern.
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.
Does this imply that the aggregations are not dependent on the index? If so, is it advisable to prevent the fields from being indexed altogether by setting {"index": "no"} ?
This is a small deviation, but where does the setting enabled come in? How is it different from index?
On a broader note, should I be using ElasticSearch if aggregations is all I'm looking for? Should I opt for other solutions like MongoDB? If so, what are the performance considerations?
HELP!
It is definitely possible to use Elasticsearch for the sole purpose of aggregating data. I've seen such setups a few times. For instance, in one past project, we'd index data but we'd only run aggregations in order to build financial reports, and we rarely needed to get documents/hits. 99% of the use cases were simply aggregating data.
If you have such a use case, then you can tune your mapping accordingly:
The role of enabled is to decide whether your data is indexed or not. It is true by default, but if you set it to false, your data will simply be stored (in _source) and completely ignored by analyzers, i.e. it won't be analyzed, tokenized, or indexed, and thus it won't be searchable; you'll still be able to retrieve the _source, but not search on it. If you need to use aggregations, then enabled needs to be true (the default value).
The store parameter decides whether you want to store the field or not. By default, the field value is indexed but not stored, as it is already stored within the _source itself and you can retrieve it using source filtering. For aggregations, this parameter doesn't play any role.
If your use case is only about aggregations, you might be tempted to set _source: false, i.e. not store the _source at all, since all you'll need is to index the field values in order to aggregate them, but this is rarely a good idea, for various reasons.
So, to answer your main question: aggregations do depend on the index, but the (doc) values used for aggregations are written in dedicated files whose inner structure is much more efficient for building aggregations than accessing the data in the inverted index.
If you're using ES 1.x, make sure to set doc_values to true for all the fields you'll want to aggregate on (except analyzed strings and boolean fields).
If you're using ES 2.x, doc_values is true by default, so you don't need to do anything special.
Update:
It is worth noting that aggregations depend on doc_values (i.e. the per-document values in the .dvd and .dvm Lucene files), which basically contain the same information as the inverted index, but organized in a column-oriented fashion, which makes them much more efficient for aggregations.
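To make the aggregations-only pattern concrete, here is a minimal sketch (the index "sales" and the fields "date" and "amount" are made up; the "interval" parameter matches the ES 1.x/2.x era discussed above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "size": 0 skips fetching hits entirely; only the doc_values-backed aggregation work is done.
resp = es.search(index="sales", body={
    "size": 0,
    "aggs": {
        "revenue_per_month": {
            "date_histogram": {"field": "date", "interval": "month"},
            "aggs": {"revenue": {"sum": {"field": "amount"}}}
        }
    }
})
print(resp["aggregations"]["revenue_per_month"]["buckets"])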

What are the advantages of mapping a field to a type in Elasticsearch?

I have about 10 million very flat (like an RDBMS row) documents stored in ES. There are, say, 10 fields in each document, and 5 of the fields are actually enumerations.
I have created a mapping that maps each enum's ordinal to a short, and I pass the ordinal in when I index the document.
Does Elasticsearch actually store these values as a Short in its index? Or do they get .toString()'ed? What is actually happening "under the hood" when I map a field to a data type?
Since ES is built on top of Lucene, that is the place to look to see how fields are actually stored and used "under the hood".
As far as I understand, Lucene does in fact store data in more than just String format. So to answer one of your questions, I believe the answer is no - everything does not get .toString()'ed. In fact, if you look at the documentation for Lucene's document package, you'll see it has many numeric types (e.g. IntField, LongField, etc).
The Elasticsearch documentation on Core Types also alludes to this fact:
"It uses specific constructs within Lucene in order to support numeric values. The number types have the same ranges as corresponding Java types."
Furthermore, Lucene offers queries (which ES takes advantage of) designed specifically for searching fields with known numeric terms, such as the NumericRangeQuery which is discussed in Lucene's search package. The same numeric types in Lucene allow for efficient sorting as well.
One other benefit is data integrity. Just like any database, if you only expect a field to contain numeric data and your application attempts to insert non-numeric data, in most cases you would want that insert to fail. This is the default behavior of ES when you try to index a document whose field values do not match the type mapping. (Though, you can disable this behavior on numeric fields using ignore_malformed, if you wish)
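A short, hedged sketch of these two points (the index and field names are invented; 7.x-style mapping syntax is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="records", body={
    "mappings": {
        "properties": {
            # Indexed as a real numeric type (enabling numeric range queries and sorting), not a string.
            "status_code": {"type": "short"},
            # Malformed values are silently ignored instead of failing the whole document.
            "category": {"type": "short", "ignore_malformed": True}
        }
    }
})

# Succeeds: the ordinal fits in a short.
es.index(index="records", body={"status_code": 3, "category": 1})

# Fails with a mapper parsing error: "three" cannot be parsed as a short.
es.index(index="records", body={"status_code": "three"})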
Hope this helps...
