Randomly selecting document ArangoDB often gives the same results - random

The other day I saw a method for querying for a random document from a collection using AQL on this very same website:
Randomly select a document in ArangoDB
My implementation of this at the moment is:
//brands
let b1 = (
for brand in brands
filter brand.brand == #brand1
return brand._id
)
//pick random car with brand 1
let c1 = (
for edge in edges
filter edge._from == b1[0]
for car in cars
filter car._id == edge._to
sort rand() limit 1
return car._id
)
However, when I use that method it can hardly be called 'random'. For instance, in a 3500+ document collection I manage to get the same document 5 times in a row, and over the course of 25+ attempts there're maybe 3 to 4 documents that keep being returned to me. It seems the method is geared towards particular documents being output. I was wondering if there's still some improvement to be done here or another method that wasn't mentioned in that thread. The problem is that I can't comment on the thread yet due to low reputation levels, so I can't ask the question in the same place. However I think it merits a discussion nonetheless. I hope someone can help me out in getting a better randomization.

Essentially the rand() function is being seeded the same on each query execution. Multiple calls within the same query will be different, but the next execution will start back from the same number.
I ran this query and saw the same 3 numbers each time:
return {
"1": rand(),
"2": rand(),
"3": rand()
}
Not always, but more often than not got the same numbers:
[
{
"1": 0.5635853144932401,
"2": 0.19330423902096622,
"3": 0.8087405011139256
}
]
Then, seeded with current milliseconds:
return {
"1": rand() + DATE_MILLISECOND(DATE_NOW()),
"2": rand() + DATE_MILLISECOND(DATE_NOW()),
"3": rand() + DATE_MILLISECOND(DATE_NOW())
}
Now I always get a different number.
[
{
"1": 617.8103840407173,
"2": 617.0999366056549,
"3": 617.6308832757169
}
]
You can use various techniques to produce pseudorandom numbers that won't repeat like calling rand() with the same seed.
Edit: this is actually a Windows bug. If you can use linux you should be fine.

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns to raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
Where key is a string and value is a integer on which I would like to sort my results upon on where as if a certain document receives a key it gets sorted by the value.
I ended up creating an object and just put the key value pairs inside so I can match it easy.
For example I have sorting: "object.key".
I was wondering if I just use a simple object with bunch of strings inside that are just there for exact matching should I worry about raising this limit to 10k, or 20k.
Because I now have an issue where there can be more then 1k of these records. I've found I could use nested sorting but it still has a default limit of 10k.
Is there a good design pattern approach for this or should I not be worried by raising the field limits?
Simplified version of the query:
GET products/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"sortingObject.someSortingKey1": {
"order": "desc",
"missing": 2,
"unmapped_type":"float"
}
}
]
}
Point is that I get the sortingKey from request and I use it to sort my results. There are 100k different ways to sort the result for example
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html

How can I get accurate results from ElasticSearch and Kibana?

I have a problem to solve in graphing from ElasticSearch / Kibana. For the sake of argument, I have a turnstile and I need a 100% accurate count of the number of unique people who've passed through the turnstile. If Fred and Joe go through then the count is 2 - but if Fred and Joe and Joe go through (because Joe left and came in again) then the count is still two. Rather than people, I'm dealing with files - and rather than names I'm using UUIDs but the principle is the same.
We've tried using Cardinality Aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) but that doesn't work. Even with tuning it only approaches 100% accuracy, and the possibility of a 100% accurate result decreases as the number of data points goes up. The number of data points that I'm looking at is in the tens, and possibly hundreds, of millions.
I understand that there's a performance / accuracy tradeoff - I can live with slow, but I can't live with inaccurate.
What would be the correct function - or correct way - of getting a 100% accurate count of unique names?
There's a workaround of doing a complete terms aggregation and then running a scripted_metric on that, but this is really really expensive.
{
"byFullListScripting": {
"terms": {
"field": "groupId",
"shard_size": Integer.MAX_VALUE,
"size": Integer.MAX_VALUE
},
"aggs": {
"cntScripting": {
"scripted_metric": {
"map_script": "targetId='u'+doc['cntTargetId']; if (_agg[targetId] == null) { _agg[targetId] = 1}",
"reduce_script": "map=[:]; for (a in _aggs){ map.putAll(a) }; return map.size()"
}
}
}
}

Elastic Ingest Pipeline split field and create a nested field

Dear freindly helpers,
I have an index that is fed by a database via Kafka. Now this database holds a field that aggregates a couple of pieces of information like so key/value; key/value; (don't ask for the reason, I have no idea who designed it liked that and why ;-) )
93/4; 34/12;
it can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in tha field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered number of these categories (they tell me how many minutes a specific process took longer) and relate them in ranges.
Example: I filter categories x, y ,z and then group how many documents for the day had no delay, which had a delay up to 5 minutes and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there on but it was a complete wrong approach I guess.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline it seems ok
"delays": {
"62": "3",
"86": "2"
}
but there are two things that don't work.
I can't know upfront how many of these combinations I have and thus converting the values from string t int in the same pipeline is an issue.
When I want to create a kibana index pattern I end up with many fields like delay.82 and delay.82.keyword which does not make sense at all for the usecase as I can't filter (get only the sum of delays where the key is one of x,y,z) and aggregate.
I have looked into other processors (dorexpander) but can't really get my head around how to get this working.
I hope my question is clear (I lack english skills, sorry) and that someone can point me at the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "extracts k/v pairs",
"processors": [
{
"script": {
"source": """
def categories = ctx.categories;
def kv_pairs = new ArrayList();
for (def pair : categories.entrySet()) {
def k = pair.getKey();
def v = pair.getValue();
kv_pairs.add(["key": k, "value": v]);
}
ctx.categories = kv_pairs;
"""
}
}
]
},
"docs": [
{
"_source": {
"categories": {
"82": 4,
"93": 7
}
}
}
]
}
P.S.: Do make sure your categories field is mapped as nested b/c otherwise you'll lose the connections between the keys & the values (also called flattening).

Discover historical trends in Elasticsearch (not visual)

I have some experience with Elastic as logs storage, but I'm stuck on basic trends recognition (where I need to compare found documents to each other) over time periods.
Easy query would answer following question:
Find all occurrences of document rows (row is specified by growing/continues #timestamp value), where specific field (e.g. threads_count) is growing for fixed count of documents, or time period.
So if I have thread_count of some application, logged every minute over a day including timestamp. And I specify that I'm looking for growing trend in 10 minutes - result should return documents or document sets where thread_count was greater over the one from document minute before at least for 10 documents.
It is very similar task to see line graph, and identify growing parts by eye.
Maybe I just miss proper function name for search. I'm not interested in visualization, I would like to search similar situations over the API and take needed actions.
Any reference to documentation or simple example is welcome!
Well script cannot be used between documents. So you will have to use a payload.
In your query sort the result by date.
https://www.elastic.co/guide/en/elastic-stack-overview/6.3/how-watcher-works.html
A script in the payload could tell you if a field is increasing (something like that, don't have access to a es index right now)
"transform": {
"script": {
"source": "ctx.payload.transform = []; def current_score = -1;
def current = []; for (int j=0;j<ctx.payload.hits.hits;j++){
//check in the loop if current_score increasing using ctx.payload.hits.hits[j]._source.message], if not return "FALSE"
} ; return "TRUE",
"lang": "painless"
}
}
If you use logstash to index your documents, take a look to elapsed, could be nice too: https://www.elastic.co/guide/en/logstash/current/plugins-filters-elapsed.html

ElasticSearch Score Function Depending on Neighbor Documents

I have an ElasticSearch index with 2 mappings (types).
In the app I need to display a paginated feed containing items of both types.
Currently the items are sorted just by creation date, but I also want to have control on how the items alternate with each other on the page.
For example, I want to set a rule for sequence "3 items of type A, 1 item of type B, and so on".
I need it to make sure items of both types are displayed on each page and equally distributed across the pages.
But as far as I see it's not possible to access another documents in custom score function script.
Of course it's easy to implement directly in the app logic, but it's not clear how to implement pagination using this way.
Any ideas on how to achieve that?
I don't think you can do this.
One approach (that doesn't work) is to keep a global variable in a script and to increment that once every document is being returned/processed. And then to take this number, divide it by 3 and get the modulo number. Based on this number, to sort the docs. But "global" variables are not possible in sripts.
The only two approaches that I can think of is to use a script to generate a random number and based on that to sort. In this way, you get some chances to have a "mixed list of types.
Or, if you want the smallest deterministic way of sorting the docs, still in a script take the ID of the document (you said is a number) modulo 3 it and use the value to sort.
For the random approach:
"sort": [
{
"date": {
"order": "desc"
}
},
{
"_script": {
"script": "Math.random()",
"type": "number",
"order": "asc"
}
}
]

Resources