Converting stringified float to float in Elasticsearch

I have a mapping in an Elasticsearch index with a certain string field called duration. However, duration is actually a float, but it's passed in as a string by my provisioning chain, so it will always look something like this: "0.12". So now I'd like to create a new index with a new mapping, where the duration field is a float. Here's what I've done, which isn't working at all, either for old entries or for incoming new ones.
First, I create my new index with my new mapping:
PUT new_index
{
  "mappings": {
    "new_mapping": {
      "properties": {
        "duration": { "type": "float" },
        ...
      }
    }
  }
}
I then check that the new mapping is really in place using:
GET new_index/_mapping
I then copy the contents of the old index into the new one:
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
However, when I look at the entries in new_index, be it the ones I added with that last POST or the new ones that have since come in through my provisioning chain, the duration field is still a string, even though their _type is new_mapping.
What am I doing wrong here? Or is there simply no way to convert a string to a float within Elasticsearch?

The duration field in the new index will be indexed as a float (as per your mapping). However, if the duration field in the source document is still a string, it will stay a string in _source, while still being indexed as a float.
You can run a range query from 1.00 to 3.00 on both indices and compare the results. Since the old index performs a lexicographical range comparison (because of the string type), you might get results with a duration of 22.3, while on the new index you'll only get durations that are really between 1.00 and 3.00.
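If you also want the value in _source to become a real number (not just the indexed value), you can reindex with a script that parses the string. A minimal sketch, assuming a Painless-capable version (on older releases the script key is inline rather than source):
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": "ctx._source.duration = Float.parseFloat(ctx._source.duration)"
  }
}
Note that the mapping alone never rewrites the JSON you send, so new incoming documents will still carry a string unless you fix them upstream or via an ingest pipeline.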

Related

Format reading ElasticSearch dates

This is my mapping for one of the properties in my ElasticSearch model:
"timestamp":{
"type":"date",
"format":"dd-MM-yyyy||yyyy-MM-dd'T'HH:mm:ss.SSSZ||epoch_millis"
}
I'm not sure if I'm misunderstanding the documentation. It clearly says:
The first format will also act as the one that converts back from milliseconds to a string representation.
And that is exactly what I want. I would like to be able to read the dates directly (if possible) as dd-MM-yyyy.
Unfortunately, when I go to the document itself (that is, accessing Elasticsearch's endpoint directly, not via the application layer) I still get:
"timestamp" : "2014-01-13T15:48:25.000Z",
What am I missing here?
As @Val mentioned, you'd get the value in the format in which it was indexed.
However, if you want to view the date in a particular format regardless of the format it was indexed in, you can make use of script fields. Note that they are applied at query time.
The query below would be your solution:
POST <your_index_name>/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "timestamp": {
      "script": {
        "inline": "def sf = new SimpleDateFormat(\"dd-MM-yyyy\"); def dt = new Date(doc['timestamp'].value); def mydate = sf.format(dt); return mydate;"
      }
    }
  }
}
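On more recent Elasticsearch versions the script key is source instead of inline, and doc['timestamp'].value is a date object rather than epoch milliseconds, so the same idea needs a small adjustment. A sketch, assuming a recent Painless-era release where java.time is whitelisted:
POST <your_index_name>/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "timestamp": {
      "script": {
        "source": "DateTimeFormatter.ofPattern('dd-MM-yyyy').format(doc['timestamp'].value)"
      }
    }
  }
}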
Let me know how it goes.

Partitioning aggregates with groups

I'm trying to partition an aggregate similar to the example in the ElasticSearch documentation, but am not getting the example to work.
The index is populated with event-types:
public class Event
{
    public int EventId { get; set; }
    public string SegmentId { get; set; }
    public DateTime Timestamp { get; set; }
}
The EventId is unique, and each event belongs to a specific SegmentId. Each SegmentId can be associated with zero to many events.
The question is:
How do I get the latest EventId for each SegmentId?
I expect the number of unique segments to be in the range of 10 million, and the number of unique events to be one or two orders of magnitude greater. That's why I don't think using top_hits by itself is appropriate, as suggested here. Hence, partitioning.
Example:
I have set up a demo-index populated with 1313 documents (unique EventId), belonging to 101 distinct SegmentId (i.e. 13 events per segment). I would expect the query below to work, but the exact same results are returned regardless of which partition number I specify.
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 15,             <-- I want 15 segments from each query
        "include": {
          "partition": 0,       <-- Trying to retrieve the first partition
          "num_partitions": 7   <-- Expecting 7 partitions (7*15 > 101 segments)
        }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [ "timestamp", "eventId", "segmentId" ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}
If I remove the include and set size to a value greater than 101, I get the latest event for every segment. However, I doubt that is a good approach with a million buckets...
You are trying to do a scroll over an aggregation.
The Scroll API is supported only for search queries, not for aggregations. If you do not want to use top_hits, as you have stated, due to the huge number of documents, you can try either of these:
Parent/child approach - create each segment as a parent document and its events as child documents, and every time you add a child, update the timestamp field in the parent document. By doing so, you can query just the parent documents and you will have your segment id plus the last event timestamp.
Another approach is to get the top hits only for the last 24 hours: add a query that first filters down to the last 24 hours and then compute the aggregation using top_hits, as sketched below.
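A minimal sketch of that second approach, reusing the demo index and field names from the question (the 24-hour window and the terms size of 10000 are assumptions to tune):
POST /demo/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-24h"
      }
    }
  },
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 10000
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}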
It turns out I was investigating the wrong question... My example actually works perfectly.
The problem was my local ElasticSearch node. I don't know what went wrong with it, but when I repeated the example on another machine, it worked. I was, however, unable to get partitioning working on my current ES installation, so I uninstalled and reinstalled ElasticSearch, and then the example worked.
To answer my original question: the example I provided is the way to go. I solved my problem by using the cardinality aggregation to get an estimate of the total number of segments, from which I derived a suitable number of partitions. Then I looped the query above for each partition and added the documents to a final list.
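For reference, a minimal sketch of that first step against the demo index from the question; the number of partitions is then ceil(unique_segments / partition_size):
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segment_count": {
      "cardinality": {
        "field": "segmentId"
      }
    }
  }
}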

Elasticsearch document id type integer vs string: Is there any performance difference?

I am using Elasticsearch 2.3.1. Currently all the document ids are integers, but I have a situation where the document ids can be numeric values or sometimes alphanumeric strings, so I need to make the field type 'string'.
I need to know if there is any performance difference based on the type of id. Please help...
Elasticsearch will store the id as a String even if your mapping says otherwise:
"mappings": {
"properties": {
"id": {
"type": "integer"
},
That is my mapping, but when I do a sort on _id I get documents ordered as:
10489, 10499, 105, 10514...
i.e. in String order.
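A minimal query that reproduces that ordering (my_index is a placeholder name):
GET my_index/_search
{
  "sort": [
    { "_id": "asc" }
  ]
}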
The latest version of ES (7.14) mandates the document's _id to be a String. You can see this in the documentation for org.elasticsearch.action.index.IndexRequest: it only accepts a String _id, and no other types are supported. Example usage of IndexRequest can be found here: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-document-index.html
In case the above link stops working later, here is the snippet from it:
IndexRequest request = new IndexRequest("posts");
request.id("1"); // This is the only method available to set the document's _id.
String jsonString = "{" +
        "\"user\":\"kimchy\"," +
        "\"postDate\":\"2013-01-30\"," +
        "\"message\":\"trying out Elasticsearch\"" +
        "}";
request.source(jsonString, XContentType.JSON);

Elasticsearch 2.x index mapping _id

I ran ElasticSearch 1.x (happily) for over a year. Now it's time for some upgrading, to 2.1.x. The nodes should be turned off and then (one by one) on again. Seems easy enough.
But then I ran into trouble. The major problem is the field _uid, which I created myself so that I knew the exact location of a document from a random other one (by hashing a value). This way I knew that only the exact one would be returned. During the upgrade I got
MapperParsingException[Field [_uid] is a metadata field and cannot be added inside a document. Use the index API request parameters.]
But when I try to map my former _uid to _id (which should also be good enough) I get something similar.
The reason why I used the _uid param is that the lookup time is a lot lower than that of a termsQuery (or the like).
How can I still use the _uid or _id field in each document for fast (and exact) lookup of certain documents? Note that I have to fetch thousands of exact ones at a time, so I need an ID-like query. Also, it may occur that the _uid or _id of a document does not exist (in that case I want, like now, a 'false-like' result).
Note: The upgrade from 1.x to 2.x is pretty big (filters gone, no dots in names, no default access to _xxx).
Update (to no avail):
Updating the mapping of _uid or _id using:
final XContentBuilder mappingBuilder = XContentFactory.jsonBuilder().startObject().startObject(type)
        .startObject("_id").field("enabled", "true").field("default", "xxxx").endObject()
        .endObject().endObject();
CLIENT.admin().indices().prepareCreate(index).addMapping(type, mappingBuilder)
        .setSettings(Settings.settingsBuilder().put("number_of_shards", nShards)
                .put("number_of_replicas", nReplicas))
        .execute().actionGet();
results in:
MapperParsingException[Failed to parse mapping [XXXX]: _id is not configurable]; nested: MapperParsingException[_id is not configurable];
Update: Changed the name to _id instead of _uid, since the latter is built out of _type#_id. So then I'd need to be able to write to _id.
Since there appears to be no way to set the _uid or _id from within the document itself, I'll post my solution. I mapped all documents which had a _uid to a regular uid field (for internal referencing). At some point it came to me: you can set the relevant id on the index request itself.
To bulk-insert documents with an id you can:
final BulkRequestBuilder builder = client.prepareBulk();
for (final Doc doc : docs) {
    builder.add(client.prepareIndex(index, type, doc.getId()).setSource(doc.toJson()));
}
final BulkResponse bulkResponse = builder.execute().actionGet();
Notice the third argument; it may be null (or you can use the two-argument variant), in which case the id will be generated by ES.
To then get some documents by id you can:
final List<String> uids = getUidsFromSomeMethod(); // ids of the documents to get
final MultiGetRequestBuilder builder = CLIENT.prepareMultiGet();
builder.add(index_name, type, uids);
final MultiGetResponse multiResponse = builder.execute().actionGet();
for (final MultiGetItemResponse response : multiResponse.getResponses()) {
    if (only_want_to_know_whether_it_exists) {
        // in this case I simply want to know whether the doc exists
        exist.add(response.getResponse().isExists());
    } else {
        // retrieve the doc as JSON (getSourceAsString lives on the item response, not the builder)
        final String json = response.getResponse().getSourceAsString();
        // handle JSON
    }
}
If you only want 1:
client.prepareGet().setIndex(index).setType(type).setId(id);
Doing the single update using curl follows the mapping-id-field documentation (note: exact copy):
# Example documents
PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}
PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ]
    }
  },
  "script_fields": {
    "UID": {
      "script": "doc['_id']"
    }
  }
}

How to sort a field inside an array in a Mongo document

I have a Mongo document structured as below:
{
  "_id": value,
  "imageShared": {
    "imageid": value,
    "commentdatadoc": [
      {
        "whocommented": value,
        "commenttext": value,
        "commenttimestamp": isodate(111)
      },
      {
        "whocommented": value,
        "commenttext": value,
        "commenttimestamp": isodate(444)
      },
      {
        "whocommented": value,
        "commenttext": value,
        "commenttimestamp": isodate(222)
      }
    ]
  }
}
Here I want to sort on the field commenttimestamp, descending. I tried the way below, but it is not working...
Query getComments = new Query();
getComments.addCriteria(Criteria.where("imageShared.imageId").is(imageId))
        .with(new Sort(Sort.Direction.DESC, "imageShared.commentDataDoc"));
SharedMediaCollec sharedMediaCollec = mongoTemplate.findOne(getComments, SharedMediaCollec.class);
Does anyone have an idea how to sort a document field which is inside an array?
When you need to get all documents anyway, it might be far easier to do the sorting in C# after you receive the data from MongoDB. An elegant way to do this automatically would be to represent the commentdatadoc array in your C# object with a SortedSet.
But when you definitely want a database-side solution, you can do it with an aggregation pipeline consisting of a $match step, an $unwind step and a $sort step, as sketched below. To perform an aggregation with the C# driver, call collection.Aggregate and then set the aggregation stages on the returned IAggregateFluent interface.
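For illustration, here is that pipeline in mongo-shell syntax, a minimal sketch using the field names from the document above (the collection name is a placeholder):
db.sharedMediaCollec.aggregate([
  { "$match": { "imageShared.imageid": imageId } },
  { "$unwind": "$imageShared.commentdatadoc" },
  { "$sort": { "imageShared.commentdatadoc.commenttimestamp": -1 } }
])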
