Using Elasticsearch to build flow/funnel results based on unique identifiers

I want to be able to return a set of counts of individual documents from a single index based on a previous set of results, and am wondering if there is a way to do it without running a separate query for each.
So, given a data set like this (simplified version of my ES documents):
{
  "name": "visit",
  "sessionId": "session1"
},
{
  "name": "visit",
  "sessionId": "session2"
},
{
  "name": "visit",
  "sessionId": "session3"
},
{
  "name": "click",
  "sessionId": "session1"
},
{
  "name": "click",
  "sessionId": "session3"
}
What I would like to do is search for name: visit and get a count of all those documents. That part is easy. But I would also like to count the name: click docs whose sessionId appears in the name: visit result set, and return that count alongside the name: visit count.
Is there an easy way to do this? I have looked at the aggregation APIs, but none of them seem to quite fit my needs. There is also the parent/child relationship, but it doesn't apply to my situation, since both kinds of documents I want to count individually are of the same type.
Expected result would be something like this:
{
  "count": {
    // total number of visit events since this is my start point
    "visit": 3,
    // the amount of click results that have sessionId
    // matching my previous search's sessionId values
    "click": 2
  }
}

At first glance, you need to do this in two queries:
the first aggregation query to retrieve the sessionIds and
a second aggregation query filtered with those sessionIds to find the count of clicks.
I don't think it's a big deal to run those two queries, but that depends on how much data you have and how many sessionIds you want to retrieve at once.
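For illustration, here is what those two queries could look like. This is only a sketch: the index name events is an assumption, and it assumes name and sessionId are mapped as exact-value (keyword/not_analyzed) fields. The first query counts the visits and collects their sessionIds in a terms aggregation:

POST /events/_search
{
  "size": 0,
  "query": {
    "term": { "name": "visit" }
  },
  "aggs": {
    "visit_sessions": {
      "terms": { "field": "sessionId", "size": 10000 }
    }
  }
}

The total hit count of that query is your visit count, and the bucket keys are the sessionIds. The second query then counts the clicks restricted to those sessionIds:

POST /events/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "click" } },
        { "terms": { "sessionId": [ "session1", "session2", "session3" ] } }
      ]
    }
  }
}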

Related

How to index and query nested documents in Elasticsearch

I have 1 million users in a Postgres table. It has around 15 columns of different data types (integer, array of strings, string, and so on). Currently I am using plain SQL queries to filter the data as needed.
I also have N projects (at most 5) under each user. I have indexed these projects in Elasticsearch and am doing fuzzy search on them. Currently, for each project (a text file) I have created a document in Elasticsearch.
Both systems are working fine.
Now I need to query the data in both systems. For example: I want all the records that contain the keyword java (in Elasticsearch) and have more than 10 years of experience (available in Postgres).
Since the user count will be increasing drastically, I have moved all the Postgres data into Elasticsearch.
Filters will only ever be applied on fields related to the user (not on project-related fields).
Now I need to nest the projects under their corresponding users. I tried parent-child types, and that didn't work for me.
Could anyone help me with the following things?
What will be the correct way of indexing projects associated with the users?
Since each project document has a field called category, is it possible to get the matched category name in the response?
Is there any better way to implement this?
From your description, it sounds like your "base document" is the user.
Now, regarding your questions:
Based on what I said before, you can add all the projects associated with each user as an array, like this:
{
  "user_name": "John W.",
  ..., # More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Dev",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_3",
      "role": "QA",
      "category": "Machine Learning"
    }
  ]
},
{
  "user_name": "Diana K.",
  ..., # More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Project Leader",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_4",
      "role": "DataBase Manager",
      "category": "Mobile Devices"
    },
    {
      "project_name": "project_5",
      "role": "Project Manager",
      "category": "Web services"
    }
  ]
}
The goal of this structure is to keep all of a user's info in each document, even if some of it is repeated across documents. Doing this will allow you to bring back, for example, all the users that work on a specific project with queries like this:
{
  "query": {
    "match": {
      "projects.project_name": "project_1"
    }
  }
}
Yes. As in the query above, you can match projects by their "category" field. However, keep in mind that since your base document is the user, such a query will bring back the whole user document.
For that case, you might want to use the terms aggregation, which returns the unique values of a given field. It can be combined with a query, like this:
{
  "size": 0, # Set this to 0 since you want to focus on the aggregation's result
  "query": {
    "match": {
      "projects.category": "Mobile Devices"
    }
  },
  "aggs": {
    "unique_projects_names": {
      "terms": { "field": "projects.project_name" }
    }
  }
}
That last query will bring back, in the aggregation result, all the unique project names with the category "Mobile Devices".
You can create a new index where you store all the information related to your projects. However, the relationship between users and projects won't be easy to maintain (remember that ES is NOT intended to be a structured or relational DB, like SQL), and the queries will become very complex, even if you name both of your indices (users and projects) in a way that lets you query them with a wildcard.
EDIT: Additionally, you could consider storing all the project-related info in Postgres and making the calls separately: first get the project ID (or name) from ES, and then get the project's info from Postgres (since I assume that is the info least likely to change).
Hope this is helpful! :D

ACID update of an Elasticsearch document

I'm trying to build a Tinder-like system right now. Here I need to know which cards have already been seen.
If I save the cards in Elasticsearch, I would have documents like this:
{ "name": "David", "location": { "lat": ..., "lon": ... }, "seenFromUsers": [] }
I'm just wondering if it makes sense to keep a list in the document itself. There could be around 2000 entries in it.
But if I do an update in Elasticsearch, I always have to pass all 2000 entries. If two users do this at the same time, does one get lost? How can I simply add another ID to the array? Is that even possible?
What other solutions are there?
One other solution would be a completely different approach. Instead of creating documents like this:
{
  "name": "David",
  "location": { "lat": ..., "lon": ... },
  "seenFromUsers": ["Laura", "Simone"]
}
think in relations, like this:
{
  "name": "David",
  "seenBy": "Laura"
}
{
  "name": "David",
  "seenBy": "Simone"
}
This approach will give you simpler queries, and the ACID problem is solved: new profile views are simply new documents...
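For example (a minimal sketch; the index name profile_views is an assumption), recording a new view is just a single index request, with no read-modify-write cycle that could lose concurrent updates:

POST /profile_views/_doc
{
  "name": "David",
  "seenBy": "Laura"
}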
As a benefit, you'll get rid of the inner objects, and it will be easier to add additional data to the relation:
{
  "name": "David",
  "seenBy": "Laura",
  "timestamp": ...,
  "liked": true
}
{
  "name": "David",
  "seenBy": "Simone",
  "timestamp": ...,
  "liked": false
}
And now you'll be able to do a simple query for all positive likes of a profile, or for bi-directional likes/matches...
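A sketch of that query (assuming the same profile_views index as above, with name and seenBy as exact-value fields; all of these names are assumptions):

POST /profile_views/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "David" } },
        { "term": { "liked": true } }
      ]
    }
  }
}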

Elasticsearch: Get sort index of the record

Let me describe my scenario with a real example.
I have a page where I need to show a list of companies sorted by the field "overallRank", with a few filters (like companyType and employeeSize).
Now, it's easy to get the results from the ES index with the filters applied and then sort them by overallRank. But I also want to know the rank of each company among all the company data, not only within the filtered result.
For example: Amazon is the 3rd company in the location US with companyType=Private, but it is the 5th company in the US if we remove the companyType filter. While showing the results with the companyType filter, I want to know this overall ranking (i.e. 5th). Is it possible to include this field in the result somehow?
What I am currently doing is first getting the filtered result by companyType and location US, then getting the result sorted by location only. This second query gives the result by overall ranking in the location (where Amazon comes in 5th place). I then iterate over the first result and look up where each company sits in the second result to determine its overall ranking.
The problem with this approach is that the second query, which determines the overall ranking across the whole company data set, is very expensive because it has to retrieve around 60k results. With a batch size of 1000, it has to make around 60 round trips to ES to get all the results into memory. It is both time- and space-consuming.
Can somebody please suggest a better way of doing this?
I think you can solve it using a filter aggregation combined with a top_hits aggregation.
As an example you can do something like:
{
  "aggs": {
    "filtered_companies_by_us": {
      "filter": {
        "term": {
          "location": "US"
        }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [
              {
                "overallRank": {
                  "order": "desc"
                }
              }
            ],
            "size": 5
          }
        }
      }
    }
  }
}
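Since filter aggregations are just regular aggregations, you can also add a sibling aggregation without the companyType filter in the same request, getting the filtered list and the overall ranking in a single round trip. A sketch reusing the field names above (the companyType field and the value Private are assumptions from the question):

{
  "size": 0,
  "aggs": {
    "us_private_companies": {
      "filter": {
        "bool": {
          "filter": [
            { "term": { "location": "US" } },
            { "term": { "companyType": "Private" } }
          ]
        }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [ { "overallRank": { "order": "desc" } } ],
            "size": 5
          }
        }
      }
    },
    "us_companies_overall": {
      "filter": {
        "term": { "location": "US" }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [ { "overallRank": { "order": "desc" } } ],
            "size": 100
          }
        }
      }
    }
  }
}

You can then find a company's position in the us_companies_overall bucket without a second network round trip.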

Solr facet equivalent of group by?

If I have some data like this:
{"field1":"x", "field2":".."}
{"field1":"x", "field2":".."}
{"field1":"y", "field2":".."}
{"field1":"y", "field2":".."}
{"field1":"y", "field2":".."}
Using a simple group=true&group.field=field1&group.limit=0 I get results like this:
{
  "responseHeader": {..},
  "grouped": {
    "field1": {
      "matches": 5,
      "groups": [
        {"groupValue": "x", "doclist": {"numFound": 2, ...}},
        {"groupValue": "y", "doclist": {"numFound": 3, ...}}
      ]
    }
  }
}
Using this, I know the number of documents found for each groupValue (numFound). The problem is that I need to sort the resulting groups by count in descending order, which is not possible with either sort (a simple sort=numFound results in an exception saying the field numFound does not exist) or group.sort (which would sort the documents inside each group).
Is there an equivalent of this using facets where I can sort the results by count?
You can try:
http://localhost:8983/solr/your_core/select?facet.field=field1&facet.sort=count&facet.limit=-1&facet=on&indent=on&q=*:*&rows=0&start=0&wt=json
The result will be something like:
{
  "responseHeader": {
    "status": 0,
    "QTime": 17,
    "params": {
      "q": "*:*",
      "facet.field": "field1",
      "indent": "on",
      "start": "0",
      "rows": "0",
      "facet": "on",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 225364,
    "start": 0,
    "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "field1": [
        "x", 113550,
        "y", 111814
      ]
    },
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  }
}
Just tested with Solr 6.3.0.
For more information you can check the related part of the Solr documentation.
If you also want to compute the number of distinct facet values, you can use the Solr stats component (as long as the field is of type numeric, string, or date).
Keep in mind, though, that server performance and memory overhead issues might appear.
Running a query like:
http://localhost:8983/solr/your_core/select?facet.field=field1&facet.sort=count&facet.limit=10&facet=true&indent=on&q=*:*&rows=0&start=0&wt=json&stats=true&stats.field={!cardinality=true}field1
The response is something like:
{
  "responseHeader": {
    "status": 0,
    "QTime": 614,
    "params": {
      "facet.limit": "10",
      "q": "*:*",
      "facet.field": "field1",
      "indent": "on",
      "stats": "true",
      "start": "0",
      "rows": "0",
      "facet": "true",
      "wt": "json",
      "facet.sort": "count",
      "stats.field": "{!cardinality=true}field1"
    }
  },
  "response": {
    "numFound": 2336315,
    "start": 0,
    "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "field1": [
        "Value1", 708116,
        "Value2", 607088,
        "Value3", 493949,
        "Value4", 314433,
        "Value5", 104478,
        "Value6", 41099,
        "Value7", 28879,
        "Value8", 18767,
        "Value9", 9308,
        "Value10", 4545
      ]
    },
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  },
  "stats": {
    "stats_fields": {
      "field1": {
        "cardinality": 27
      }
    }
  }
}
For more information about stats you can check here.

How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts
users have geolocation attributes (lat/long)
posts simply have a body of text.
I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.
This is my current solution:
Index the posts and have a geolocation attribute in each document. When a user changes location, run an Elasticsearch batch update on all that user's posts, and modify the geolocation attribute on those documents.
Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.
Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?
Updating 2000 posts per minute is not a big deal, either with the update-by-query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them at short intervals (e.g. 1 minute), it might indeed not be that scalable. Say it takes 500 milliseconds to update all of a user's posts; you'd start to lag behind at around 120 users.
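For reference, here is what such a bulk update looks like with the _update_by_query API that is built into later Elasticsearch versions (a sketch: the posts index name and the user_id/location field names are assumptions):

POST /posts/_update_by_query
{
  "query": {
    "term": { "user_id": 123 }
  },
  "script": {
    "source": "ctx._source.location = params.location",
    "params": {
      "location": { "lat": 32.5362723, "lon": -80.3654783 }
    }
  }
}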
Clearly, since the users' posts need to "follow" the user and don't keep the location the user had when she posted them, I would first query the users around a given location and get their IDs, and then run a second query on posts filtered by those user IDs and the matching body text.
It is perfectly OK to keep both of your indices simple and only update the location in a single user's document every minute. The two queries I'm suggesting should be quite fast, and you should not be worried about running them. People are often worried when they need to run two or more queries in order to find their results. Sometimes, trying to tie the documents too tightly together is not the solution; simply running two queries over two indices is the key, and it works perfectly well.
The query to retrieve users would look like the first one below, where you only retrieve the _id property of the user. I'm assuming that your user documents use the user's id as their ES _id, so you do not have to retrieve the _source at all (i.e. "_source": false), which is even faster, and you can simply return the _id with response filtering:
POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}
You'll get back the _id values of all the users who are currently within 100 meters of the desired geographic location. The next query then consists of filtering the posts by those ids while matching their body text.
POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}
