Partitioning aggregates with groups - elasticsearch

I'm trying to partition an aggregate similar to the example in the ElasticSearch documentation, but am not getting the example to work.
The index is populated with event-types:
public class Event
{
    public int EventId { get; set; }
    public string SegmentId { get; set; }
    public DateTime Timestamp { get; set; }
}
The EventId is unique, and each event belongs to a specific SegmentId. Each SegmentId can be associated with zero to many events.
The question is:
How do I get the latest EventId for each SegmentId?
I expect the number of unique segments to be in the range of 10 million, and the number of unique events to be one or two orders of magnitude greater. That's why I don't think using top_hits by itself is appropriate, as suggested here. Hence, partitioning.
Example:
I have set up a demo-index populated with 1313 documents (unique EventId), belonging to 101 distinct SegmentId (i.e. 13 events per segment). I would expect the query below to work, but the exact same results are returned regardless of which partition number I specify.
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 15,             <-- I want 15 segments from each query
        "include": {
          "partition": 0,       <-- Trying to retrieve the first partition
          "num_partitions": 7   <-- Expecting 7 partitions (7*15 > 101 segments)
        }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [
              "timestamp",
              "eventId",
              "segmentId"
            ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}
If I remove the include and set size to a value greater than 101, I get the latest event for every segment. However, I doubt that is a good approach with a million buckets...

You are essentially trying to scroll over aggregation buckets.
The Scroll API is supported only for search queries, not for aggregations. If you do not want to rely on top_hits alone, as you have stated, due to the huge number of documents, you can try either of the following:
Parent/child approach: create each segment as a parent document and its events as child documents. Every time you add a child, update a timestamp field on the parent. That way you only need to query the parent documents, and you will have your segment id plus the last event timestamp.
Another approach is to get the top hits only for the last 24 hours: first add a query that filters on the last 24 hours, and then get the aggs using top_hits on that reduced set, as sketched below.
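A rough sketch of that second approach, reusing the field names from the question (the 24-hour window and the terms size of 1000 are placeholder values, not recommendations):

POST /demo/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": { "gte": "now-24h" }
    }
  },
  "aggs": {
    "segments": {
      "terms": { "field": "segmentId", "size": 1000 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [ "timestamp", "eventId", "segmentId" ],
            "sort": { "timestamp": "desc" }
          }
        }
      }
    }
  }
}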

It turns out I was investigating the wrong question... My example actually works perfectly.
The problem was my local Elasticsearch node. I don't know what went wrong with it, but when I repeated the example on another machine it worked, while partitioning kept failing on my current installation. After uninstalling and reinstalling Elasticsearch, the example worked there as well.
To answer my original question: the example I provided is the way to go. I solved my problem by using the cardinality aggregation to get an estimate of the total number of segments, from which I derived a suitable number of partitions. Then I looped the query above for each partition and added the documents to a final list.
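For reference, a minimal sketch of that first step, estimating the number of distinct segments with a cardinality aggregation (field name taken from the example above):

POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segment_count": {
      "cardinality": { "field": "segmentId" }
    }
  }
}

From the returned count you can derive num_partitions (e.g. ceil(count / 15) for 15 buckets per request) and then run the partitioned query above once per partition, from 0 to num_partitions - 1, collecting the top_hits of each response.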

Related

GitHub GraphQL Query - Count PR reviews by user for a given month

I am attempting to use GitHub's GraphQL interface to query for the number of PRs reviewed within a GitHub repo during a specific month.
Ideally, I would like to only pull comments and reviews that took place within a specific time frame. Ultimately, I would like to group my results on a user by user basis.
userA 10 reviews during month
userB 6 reviews during month
userC 4 reviews during month
Here is the query that I have created so far.
{
  repository(owner: "DSpace", name: "DSpace") {
    pullRequests(last: 50) {
      nodes {
        state
        resourcePath
        comments(last: 100) {
          nodes {
            author {
              login
            }
          }
        }
        reviews(last: 100) {
          nodes {
            state
            author {
              login
            }
          }
        }
      }
    }
  }
}
I suspect that I will need to iterate over all pull requests and then filter reviews/comments that fall within a specific date range.
When I look at the GitHub GraphQL schema, the "reviews" and "comments" objects do not seem to be filterable by date. I see that the available filters are first, last, author, and state. Is there some way to express such a filter in the GraphQL query language?
I see that GraphQL provides a way to filter by boolean properties with @include and @skip. Is there a way to pass expressions to these constructs?
You could try using GitHub GraphQL's search (more of a v3 REST construct, but still available in v4 GraphQL). The query string specifies the owner, repo name, the date range in which the pull requests were last updated (which serves as an approximation of when comments/reviews are created), and type pr.
The new query, building on top of what you have so far, with a bit of modification:
{
  search(first: 100, query: "repo:DSpace/DSpace updated:2018-11-01..2018-12-01 type:pr", type: ISSUE) {
    nodes {
      ... on PullRequest {
        # use part of your original query from here onward
        state
        resourcePath
        comments(last: 100) {
          nodes {
            author {
              login
            }
          }
        }
        reviews(last: 100) {
          nodes {
            author {
              login
            }
          }
        }
      }
    }
  }
}
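Note that search returns at most 100 nodes per page, so if more pull requests were updated in that window you will need to paginate. A sketch using the standard pageInfo/after cursor mechanism (the cursor value is a placeholder taken from the previous page's endCursor):

{
  search(first: 100, after: "PREVIOUS_END_CURSOR", query: "repo:DSpace/DSpace updated:2018-11-01..2018-12-01 type:pr", type: ISSUE) {
    pageInfo {
      endCursor
      hasNextPage
    }
    nodes {
      ... on PullRequest {
        resourcePath
      }
    }
  }
}

The per-user counts (userA 10 reviews, and so on) would then be tallied client-side from the review author logins in the responses.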

Converting stringified float to float in Elasticsearch

I have a mapping in an Elasticsearch index with a certain string field called duration. However, duration is actually a float, but it's passed in as a string from my provisioning chain, so it will always look something like this: "0.12". So now I'd like to create a new index with a new mapping, where the duration field is a float. Here's what I've done, which isn't working at all, either for old entries or for incoming new ones.
First, I create my new index with my new mapping:
PUT new_index
{
  "mappings": {
    "new_mapping": {
      "properties": {
        "duration": { "type": "float" },
        ...
      }
    }
  }
}
I then check that the new mapping is really in place using:
GET new_index/_mapping
I then copy the contents of the old index into the new one:
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
However, when I look at the entries in new_index, be it the ones I've added with that last POST or the new ones that came in since through my provisioning chain, the duration entry is still a string, even when its _type is new_mapping.
What am I doing wrong here? Or is there simply no way to convert a string to a float within Elasticsearch?
The duration field in the new index will be indexed as float (as per your mapping); however, if the duration value in the source document is a string, it will stay a string in the _source, while still being indexed as float.
You can verify this with a range query from 1.00 to 3.00 on the new index and compare with what you get from the old index. Since the old index runs a lexical range (because of the string type) you might get results with a duration of 22.3, while in the new index you'll only get durations that really lie between 1.00 and 3.00.
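For example, a range query along those lines on the new index (a sketch, reusing the field and index names from the question):

POST new_index/_search
{
  "query": {
    "range": {
      "duration": {
        "gte": 1.00,
        "lte": 3.00
      }
    }
  }
}

Running the same body against old_index should show the lexical-vs-numeric difference described above.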

Separate indices or use type field in Elasticsearch

I'm developing an Elasticsearch service and we have multiple sources, such as our support ticket portal and a forum. Currently, I'm segregating each source into its own index, as each will have a child type. The ticket portal will of course search tickets (with nested replies) but also search users and such, so there are multiple types under the portal index. Simple stuff so far.
However, I'm starting to think of merging the indices and prefixing the type (portalTicket, portalUser, forumThread, forumUser, etc.), as I want to search across both sources, but maybe there is a way to query them and bring it all back together. I'm just working with tickets and threads at the moment to start small; here are the two simple mappings I'm using thus far:
{
  ticket : {
    properties : {
      replies : {
        type : 'nested'
      }
    }
  }
}
{
  thread : {
    properties : {
      posts : {
        type : 'nested'
      }
    }
  }
}
I wanted to show that to make clear I'm using nested objects with different names. I could of course use the same names, but there will also be other metadata attached to the ticket and thread mappings that will be nested types too, and that brings me to the issue. When I search without specifying the index, I get errors from the index that doesn't have the nested type, as expected: the thread mapping doesn't have a replies property, it has posts. I can get around it using an indices filter like so:
{
  filter : {
    indices : {
      index : 'portal',
      no_match_query : 'none',
      query : {
        bool : {
          should : [
            {
              match : {
                title : 'help'
              }
            },
            {
              nested : {
                path : 'replies',
                query : {
                  match : {
                    'replies.text' : 'help'
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}
OK, that works for the portal index, but extending it to also cover the forum index makes me feel like I'm fighting Elasticsearch rather than using it properly.
So should I keep them in separate indices and write a filter that returns results from both indices, or should I merge them into a single index, use a field to hold the source, and likely normalize the nested properties? Or is there a way to work with multiple indices in a faceted way? (I know, aggregations in ES 2.)
Reading these two posts (thanks to the commenters for pointing these out):
Elastic search, multiple indexes vs one index and types for different data sets?
https://www.elastic.co/blog/index-vs-type
I have decided that my data is too different and the amount of documents that I anticipate (and future additions) means that I should go with different indices.
Now I need to learn how to search across the different indices, but this post was more about which strategy to use, so I'm going to open a new question for that.
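For what it's worth, Elasticsearch lets you address several indices in a single request by listing them comma-separated, so a minimal cross-index search could look like this (a sketch, assuming the index names portal and forum used above):

POST portal,forum/_search
{
  "query": {
    "match": { "title": "help" }
  }
}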

Elasticsearch 2.x index mapping _id

I ran ElasticSearch 1.x (happily) for over a year. Now it's time for some upgrading - to 2.1.x. The nodes should be turned off and then (one-by-one) on again. Seems easy enough.
But then I ran into trouble. The major problem is the field _uid, which I created myself so that I could find the exact location of a document from a random other one (by hashing a value). This way I knew that only that exact document would be returned. During the upgrade I got
MapperParsingException[Field [_uid] is a metadata field and cannot be added inside a document. Use the index API request parameters.]
But when I try to map my former _uid to _id (which should also be good enough) I get something similar.
The reason I used the _uid param is that the lookup time is a lot lower than with a terms query (or the like).
How can I still use the _uid or _id field in each document for fast (and exact) lookup of specific documents? Note that I have to fetch thousands of exact documents at a time, so I need an ID-like query. Also, it may happen that the _uid or _id of a document does not exist (in that case I want, as now, a 'false-like' result).
Note: The upgrade from 1.x to 2.x is pretty big (Filters gone, no dots in names, no default access to _xxx)
Update (no avail):
Updating the mapping of _uid or _id using:
final XContentBuilder mappingBuilder = XContentFactory.jsonBuilder()
        .startObject().startObject(type)
            .startObject("_id").field("enabled", "true").field("default", "xxxx").endObject()
        .endObject().endObject();
CLIENT.admin().indices().prepareCreate(index).addMapping(type, mappingBuilder)
        .setSettings(Settings.settingsBuilder()
            .put("number_of_shards", nShards)
            .put("number_of_replicas", nReplicas))
        .execute().actionGet();
results in:
MapperParsingException[Failed to parse mapping [XXXX]: _id is not configurable]; nested: MapperParsingException[_id is not configurable];
Update: changed the name to _id instead of _uid, since the latter is built out of _type#_id. So then I'd need to be able to write to _id.
Since there appears to be no way around setting _uid or _id directly, I'll post my solution. I mapped all documents that had a _uid to a regular uid field (for internal referencing). At some point it occurred to me that you can simply set the relevant id yourself.
To bulk insert documents with an explicit id you can:
final BulkRequestBuilder builder = client.prepareBulk();
for (final Doc doc : docs) {
    builder.add(client.prepareIndex(index, type, doc.getId()).setSource(doc.toJson()));
}
final BulkResponse bulkResponse = builder.execute().actionGet();
Notice the third argument to prepareIndex; it may be null (or you can use the two-argument variant), in which case the id will be generated by ES.
To then get some documents by id you can:
final List<String> uids = getUidsFromSomeMethod(); // ids of the documents to get
final MultiGetRequestBuilder builder = CLIENT.prepareMultiGet();
builder.add(index_name, type, uids);
final MultiGetResponse multiResponse = builder.execute().actionGet();
// in this case I simply want to know whether each doc exists
if (only_want_to_know_whether_it_exists) {
    for (final MultiGetItemResponse response : multiResponse.getResponses()) {
        final boolean exists = response.getResponse().isExists();
        exist.add(exists);
    }
} else {
    // retrieve each doc as json
    for (final MultiGetItemResponse response : multiResponse.getResponses()) {
        final String json = response.getResponse().getSourceAsString();
        // handle JSON
    }
}
If you only want 1:
client.prepareGet().setIndex(index).setType(type).setId(id);
Doing the single-document equivalent over the REST API follows the mapping-id-field documentation (note: exact copy):
# Example documents
PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ]
    }
  },
  "script_fields": {
    "UID": {
      "script": "doc['_id']"
    }
  }
}

ElasticSearch - specify range for a string field

I am trying to retrieve the mentions of years between 1933 and 1949 from a string field called text. However, I cannot seem to find a working range query for that. What I have tried so far crashes:
{"query":
{"query_string":
{
"text": [1933 TO 1949]
}
}
}
I have also tried it like this:
{"query":
{"filtered":
{"query":{"match_all":{}},
"filter":{"range":{"text":[1933 TO 1949]}
}
}
}
but it still crashes.
A sample text field looks like the one below, containing a mention of the year 1933:
"Primera División 1933 (Argentinië), seizoen in de Argentijnse voetbalcompetitie\n* Primera Divisió n 1933 (Chili), seizoen in de Chileense voetbalcompetitie\n* Primera División 1933 (Uruguay), seizoen in de Uruguayaanse voetbalcompetitie\n \n "
However, I also have documents not containing any years inside, and I would like to filter all the documents to preserve only the ones mentioning years in a given period. I read here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html that the range query can be applied to text fields as well, and I don't want to use any intermediate solution to identify dates inside texts.
What I basically want to achieve is to be able to get the same results as when using a search URI query:
urltomyindex/_search?q=text:%7B1933%20TO%201949%7D%27
which works perfectly.
Is it still possible to achieve my goal? Any help much appreciated!
This should do it:
GET index1/type1/_search
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "fieldNameHere": [
            "1933", "1934", "1935", "1936", "1937", "1938", "1939",
            "1940", "1941", "1942", "1943", "1944", "1945", "1946",
            "1947", "1948", "1949"
          ]
        }
      }
    }
  }
}
If you know you're going to need this kind of search frequently, it would be much better to create a new field, "yearPublished" or something like that, so you can search it as a number rather than a text field, as sketched below.
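A rough sketch of that suggestion, assuming a new integer field named yearPublished (index and type names copied from the query above):

PUT index1/_mapping/type1
{
  "properties": {
    "yearPublished": { "type": "integer" }
  }
}

GET index1/type1/_search
{
  "query": {
    "range": {
      "yearPublished": { "gte": 1933, "lte": 1949 }
    }
  }
}

You would of course need to extract the year from the text and populate the new field at indexing time.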
