Visualization in Elasticsearch using a customized query

Here's the situation I have. Suppose my indexed documents look like this:
{
  "user": 1,
  "started": "2021-06-05",
  "finished": -1,
  "status": "ONGOING"
}
{
  "user": 2,
  "started": "2021-06-05",
  "finished": "2021-06-06",
  "status": "DONE"
}
I have 100 docs indexed like this. The ongoing documents have -1 as the finished time, and completed ones have a valid timestamp. I want to visualize a graph that gives me the number of ongoing applications, with the "started" field on the X-axis.
In the date histogram, I'm only able to get the ongoing processes filtered to that specific interval. But I want each ongoing application to be counted in every interval until the document is updated with the finish time.
Is there any way I can visualize this in Kibana? Even an Elasticsearch query that can give me this output will do.

This is really similar to a problem I had and have now solved. I spent ages trying to create a query that does this, to no avail, but luckily this can be achieved using Vega's transforms.
If you want to bin the data evenly rather than using the start times as your x-values, here is the posted solution (look for my answer). The one thing I would add: for the documents that have -1 as the finished time, a formula transform lets you round these to the end bin times.
However, if you still want the "started"/"finished" fields to be the points of summation/evaluation, this is also possible. I'll give you a quick rundown on how to do this...
Method:
The first thing you need to do is create two copies of your data with a common field referring to the "timestamp". The first dataset will have the "started" value assigned to the field "timestamp" (the started dataset) and the second will have "finished" (the finished dataset). You can achieve this using the formula transform.
You will then need to create a column in each dataset named "operation", referring to what that data entry does: add a user or remove a user. For the finished dataset you want to assign a column of -1s, and for the started dataset a column of 1s. Again, use formula transforms.
Then join these datasets back up, order by "timestamp", and cumulatively sum the "operation" column. This can be achieved using the window transform.
This should give you the data needed to plot it. Arguably this is much more accurate than binning, but if your dataset is large it can yield quite messy results; binning in that case is much cleaner.
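To make the method concrete, here is a minimal Python sketch of the same pipeline run outside Vega, on documents already fetched from the index (the field names follow the question; everything else is illustrative):

docs = [
    {"user": 1, "started": "2021-06-05", "finished": -1, "status": "ONGOING"},
    {"user": 2, "started": "2021-06-05", "finished": "2021-06-06", "status": "DONE"},
]

# build the two copies: starts add a user, finishes remove one
events = []
for d in docs:
    events.append({"timestamp": d["started"], "operation": 1})
    if d["finished"] != -1:  # skip the -1 sentinel for ongoing documents
        events.append({"timestamp": d["finished"], "operation": -1})

# order by timestamp (ISO dates sort lexicographically) and cumulatively sum
events.sort(key=lambda e: e["timestamp"])
ongoing = 0
for e in events:
    ongoing += e["operation"]
    print(e["timestamp"], ongoing)  # one (x, y) point of the step chart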
Good luck! There is obviously a lot to fill in, but a working example would have taken me quite a while to draw up; plus, where is the fun in copying?

Related

elasticsearch fill gaps with previous value

I have time-series data in Elasticsearch and I want to aggregate it to create a histogram. What I want to achieve is to fill the null buckets with the value of the previous data point. I know that I can use min_doc_count: 0, but that sets the value to 0, and I couldn't find any out-of-the-box way to do this via Elastic. Maybe there is some trick that I am not aware of?
Appreciate your feedback.
I think the Date Histogram Aggregation does not provide a native way to do what you would like.
The closest thing I can think of is the missing value parameter. However, this will set a static value for all the dates where no values are found, which is not exactly what you want.
I also thought of using Painless with the following logic:
Get the first value in the Histogram and store it in a variable current.
If the next value is different from 0, store this value in current.
If the value is 0, set that histogram bucket's value to current. Don't change current.
Repeat steps 2 and 3 until you finish the Histogram.
Using Painless is, in my experience, really painful, but you can consider it as an alternative.
Additionally, I would recommend limiting ES to searches and aggregations. If you require additional logic on the output, consider performing it outside ES. You can use the Python ES client, for instance.
I can think of the following script, with similar logic to the Painless scenario:
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details omitted
current = 0
results = es.search(...)
for bucket in results["aggregations"]["my_histogram_name"]["buckets"]:
    if not bucket["doc_count"]:  # same as: if bucket["doc_count"] == 0
        bucket["doc_count"] = current
    current = bucket["doc_count"]  # changed or not, we always carry the last value forward
After that, the histogram should look the way you want and be ready to be displayed.
Hope this is helpful! :)

How to sort by a derived value that includes a moving date in ElasticSearch?

I have a requirement to sort the results returned by Elasticsearch by a special value I define; let's call it 'X'.
Now, the problem is that 'X' is a value derived from:
field A in the document (which is a 'term')
field B (which is a 'date')
the current date (UTC)
So the problem is obviously item 3. The current date always changes, so I'm not sure how to include it in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about the performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly and store it in each document. But that has a few drawbacks:
I need to have something running in the background to update this value.
There could be a lot of documents to update (60%+ every night).
I lose precision for the value depending on how long between script runs (if I run nightly, the value is up to 23 hours stale).
Any advice?
Thanks
This can be done by having a script run nightly that calculates the value and stores it in each document.
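For reference, the script-based sort mentioned in the question looks roughly like the sketch below (a Python sketch assuming Elasticsearch 7.x and its Python client; the field name fieldB and the derivation formula are placeholders, not from the original post). Passing the current date in as a parameter keeps the script itself constant and cacheable:

import time
from elasticsearch import Elasticsearch

es = Elasticsearch()

now_millis = int(time.time() * 1000)  # current UTC time in epoch milliseconds

body = {
    "query": {"match_all": {}},
    "sort": {
        "_script": {
            "type": "number",
            "order": "asc",
            "script": {
                "lang": "painless",
                # hypothetical derivation of X: age of field B in days
                "source": "(params.now - doc['fieldB'].value.toInstant().toEpochMilli()) / 86400000.0",
                "params": {"now": now_millis},
            },
        }
    },
}
results = es.search(index="my-index", body=body)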

Filter Data for Each Row in a Column

EVE Online Manufacturing Spreadsheet
In Batch!F3:G, I'm attempting to break down the data input from columns B3:C into their components (and eventually materials/minerals in I3:J) by using FILTER to compare against results in Engine!P:R, multiplied of course by the total number of each finished product I need.
I've been trying to figure out ways to ARRAYFORMULA this together, and I even tried quite a few QUERY functions without success. The best I've been able to come up with is to string the actual formulas together, appending them with {}, but this gets bloated quickly. I need this to be open-ended because I have a tendency to build a lot of things at once. Any help would be appreciated, even just pointing me in the right direction!
Well, based on my limited knowledge of Google Sheets, I can only think of one way to do this automatically.
Here's a sheet I constructed based on your sheet.
https://docs.google.com/spreadsheets/d/1AfX8o05gUGPiN5S90w4o0yxuIYjsJRaXsaYUFTJuEPo/edit?usp=sharing
First, on the Engine sheet, add one more column that gives you the number of materials required for that part, looked up in the PART LIST of the BATCH sheet. For this I use VLOOKUP, as you can see in D2.
Then, on the BATCH sheet, query the materials for which VLOOKUP returns a positive result, multiply by the amount of each item, and then sum them. This is done by the QUERY used in F3.
This method only works if you don't have duplicate items in your PART LIST, due to the way VLOOKUP works.
Of course, if you want to break the material list down further, you can take the same approach.

how to improve Neo4J performance in creating edges?

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS data; currently, i'm trying to get things working for the traffic on a single day on the Berlin subway network. these are the grand totals i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime, which denotes the event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip, which is a series of stoptimes; they happen on a specific date and time, and they happen at a given stop for a given route (or 'line') in a transit network. so there are a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order to speed things up, i've already (1) cut down the data to a single day, (2) deleted the database files and had the server create a fresh one (very effective!), and (3) searched a lot for better queries. alas, with the figures given above, it still takes 30~50 minutes to build all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up about 80% of the running time (with 10% unrelated to Neo4J, and 10% needed to feed the node data using plain HTTP POST requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:linked]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query alone still takes some minutes, which i find longish given that there are only thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that involve stoptimes take several tens of minutes each on my desktop machine.
(i've also checked whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible: in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173 subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280, close enough. there are some faulty? / double / extra stop nodes in the data (211 stops instead of 173), but those are few.)
frankly, if i don't find a way to speed things up at least tenfold (rather more), it'll make little sense to use Neo4J for this project. just to cover the single city of Berlin, many, many more stoptimes would have to be added, as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have around 170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update: the above edge-creation queries, which i perform one by one, have now been running for over an hour and have not yet finished, meaning that, if things scale linearly, feeding the Berlin public transport data for a single day would take something like a week. the code currently performs several orders of magnitude too slowly to be viable.
Update: @MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with profile and look at the profile output to see whether it uses the index or just a label scan.
Do you use parameters for your insert queries, so that Neo4j can re-use built queries?
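To illustrate the point about parameters, here is a minimal sketch with the Neo4j Python driver (the question itself uses NodeJS and plain HTTP; the connection details and property values here are made up, and only the property names follow the question):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def create_stoptime(tx, props):
    # the query text stays constant, so Neo4j can cache its execution plan;
    # only the parameter map changes from call to call
    tx.run(
        "CREATE (st:stoptime {id: $id, `trip-id`: $trip_id, "
        "`stop-id`: $stop_id, idx: $idx})",
        id=props["id"], trip_id=props["trip-id"],
        stop_id=props["stop-id"], idx=props["idx"],
    )

with driver.session() as session:
    session.execute_write(create_stoptime,
                          {"id": "st-1", "trip-id": "t-1", "stop-id": "s-1", "idx": 0})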
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point me to your datasource? We can convert it into CSV if it isn't already and then import it even more quickly.
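As an illustration of the CSV route, a sketch with the same Python driver (the file name and columns are assumptions based on the GTFS stop_times.txt layout, and the Cypher uses current LOAD CSV syntax, newer than what this thread originally discussed):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# one CSV row per stoptime; MATCH finds the already-imported trip node
load_stoptimes = """
LOAD CSV WITH HEADERS FROM 'file:///stop_times.csv' AS row
MATCH (trip:trip {id: row.trip_id})
CREATE (st:stoptime {`trip-id`: row.trip_id, `stop-id`: row.stop_id,
                     idx: toInteger(row.stop_sequence)})
CREATE (trip)-[:`trip/stoptime`]->(st)
"""

with driver.session() as session:
    session.run(load_stoptimes)  # auto-commit; the CSV must sit in Neo4j's import directory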
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
For your initial import you also don't need CREATE UNIQUE, as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan to express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me at michael at neo4j.org.
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that @MichaelHunger's solution works like a charm. i modified the edge-building queries from the question into the shapes below, which keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
as for the ~label field and also the question why i use dashes in names where underscores would be more convenient, well, that's a long story about what i perceive as good and practical naming that sometimes clashes with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more interesting is the question: why is there a ~label attribute that repeats what the element label says (what you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make the labels appear in the returned values.
mind you, labels are central to graph thinking the Neo4J way, but in query results, labels are conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set, that edge will arrive as an empty object, telling you only that there is something, but not what. so i decided to duplicate the label on each single node and each single edge. not an optimal solution, but at least now i get an informative graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of research for me right now. i guess it will all start with a 'field of interest', like 'show all Berlin subway stops', or 'all buses departing within the next 15 minutes from a bus stop near me'. the data already allows one to see which stops are directly connected by a subway line, their geographical distance, what services are present and what routes they take. the idea is to grab the data and present it in novel, usable and beautiful ways. 9292 is quite close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

DateHistogram Facet return ZERO if nothing took place at certain time

I am using a date histogram to get the count per hour of some messages in Elasticsearch.
However, the date histogram facet will only show the count per hour for the hours where some activity took place. Is there a way to force it to return zero if no activity happened during that interval?
This is just how it works; we have some code in place on our end to populate missing "buckets" with zeros. I'm not sure yet if this is a good design decision or a bug. Perhaps open an issue for it on http://github.com/elasticsearch/elasticsearch ?
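A minimal sketch of that kind of client-side zero-filling in Python, assuming the old facet response shape with "entries" carrying epoch-millisecond "time" keys and an hourly interval (an illustration, not the code referred to above):

facet_result = {"entries": [{"time": 0, "count": 3}, {"time": 7200000, "count": 1}]}  # example response

counts = {e["time"]: e["count"] for e in facet_result["entries"]}
HOUR = 3600 * 1000  # the histogram interval, in milliseconds

filled = []
t = min(counts)
while t <= max(counts):
    # emit a zero bucket wherever the facet returned nothing
    filled.append({"time": t, "count": counts.get(t, 0)})
    t += HOUR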
