Excluding users in a search, the most optimal way - elasticsearch

I have two indexes, one for a collection of profiles and another containing each user's excludes, e.g. blocked profiles.
The per-user exclude lists will be updated very often, while the profiles are seldom updated by comparison. In this situation it is recommended to separate the data into two indexes, as I understand it.
EDIT [2017-01-25]
These are the mappings for the two indexes:
PROFILES MAPPING
PUT xyz_profiles
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"profile": {
"_all": {"enabled":false},
"dynamic":"strict",
"properties": {
"profile_id":{"type":"integer"},
"firstname": {"type":"text"},
"age": {"type":"integer"},
"gender":{"type":"keyword"},
"height":{"type":"integer"},
"area_id":{"type":"integer"},
"created":{
"type": "date",
"format":"date_time_no_millis"
},
"location": {"type": "geo_point"}
}
}
}
}
EXCLUDE LISTS MAPPING
PUT xyz_exclude_from_search
{
"settings": {
"auto_expand_replicas": "0-all"
},
"mappings": {
"exclude_profile": {
"_all": {"enabled":false},
"dynamic":"strict",
"properties": {
"profile_id":{"type":"integer"},
"exclude_ids":{"type":"integer"}
}
}
}
}
number_of_shards is 1 since this is on a single node (my test server).
auto_expand_replicas is set to 0-all to make sure that the exclude list is copied to all nodes. I am aware that this is superfluous on a single node, but I don't want to forget it when this is implemented on the production cluster.
exclude_ids will be an array of integers (profile ids) to exclude from the search.
This is the part of a search where certain profiles are excluded using the current user's (id 3076) exclude list:
GET /xyz_profiles/profile/_search
{
"query" : {
"bool": {
"must_not": {
"terms" : {
"profile_id" : {
"index" : "xyz_exclude_from_search",
"type" : "exclude_profile",
"id" : "3076",
"path" : "exclude_ids"
}
}
}
}
}
}
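As a minimal sketch, the same request body could be built per user from application code. The helper name `build_exclusion_search` is hypothetical; the index, type, and path values are the ones from the mappings above, and the `type` key is kept only because the question's version uses mapping types.

```python
def build_exclusion_search(user_id,
                           lookup_index="xyz_exclude_from_search",
                           lookup_type="exclude_profile",
                           path="exclude_ids"):
    """Return a search body that hides every profile on the user's exclude list.

    The terms lookup tells Elasticsearch to fetch the id list from the
    exclude document whose _id equals the searching user's id.
    """
    return {
        "query": {
            "bool": {
                "must_not": {
                    "terms": {
                        "profile_id": {
                            "index": lookup_index,
                            "type": lookup_type,
                            "id": str(user_id),  # document ids are strings
                            "path": path,
                        }
                    }
                }
            }
        }
    }


body = build_exclusion_search(3076)
```

The resulting dictionary can be passed as the request body by whichever client is in use.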
Being very new to Elasticsearch, I would very much like to know if this is the most optimal way of doing it. I imagine there are some very experienced people out there, who can pinpoint if my mappings or my search is missing something obvious that would improve performance.
For example, I haven't fully understood the analyze/not_analyzed part of mappings as well as using routings in the search.

This is an interesting question. I think it's quite a common pattern, but at the moment there is not much information about it on the Internet.
I was in a similar situation some time ago and it was solved in a similar way to the one you propose. But I did not separate it into two indexes, I just added an exclude_ids field to our user index. For example, let's say that when the user with id 1 is searching, we use a Term Query to check that id 1 is not inside exclude_ids of target users, a query like:
{ "term": { "exclude_ids": 1 } }
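A rough sketch of how that check could be wrapped in a full query, assuming the term clause sits inside `must_not`. The function names and the sample documents are made up for illustration; `visible_to` only mimics the filter in plain Python so the semantics can be checked without a cluster.

```python
def build_query(searcher_id):
    """Search body that drops profiles whose exclude_ids contain the searcher."""
    return {
        "query": {
            "bool": {
                "must_not": {"term": {"exclude_ids": searcher_id}}
            }
        }
    }


def visible_to(searcher_id, profiles):
    """Plain-Python stand-in for the filter above, for illustration only."""
    return [p["profile_id"] for p in profiles
            if searcher_id not in p.get("exclude_ids", [])]


# Hypothetical sample data: profile 7 is hidden from user 1.
profiles = [
    {"profile_id": 5, "exclude_ids": [2, 3]},
    {"profile_id": 7, "exclude_ids": [1]},
    {"profile_id": 9},
]
print(visible_to(1, profiles))
```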
After using it with around two million documents I found out that:
Search is fast:
Taking into account how the inverted index works, I think this usage is correct.
Search is done inside the same index (having to search in other indexes means checking more shards).
Updates are slow:
Each time an id is added to exclude_ids the whole document is reindexed, since Elasticsearch applies even a partial update by reindexing the complete document. If the exclude_ids array gets very long, the updates can become especially slow.
For the same reason, indexed data that is not usually updated, like name or age, is reindexed as well.
In your case, since you are moving the exclude list into another index, the data that is not usually updated does not have to be reindexed each time, as you said. But the problem of arrays that grow indefinitely is still there.
Plus, taking into account the way you would do the query (with a Terms Query using lookup, I guess), filtering a large amount of data could end up adding some overhead. But I'm not sure about this. This is discussed here.
It's difficult to decide which one would scale better with a huge amount of data, so running load tests could be a good idea.
A way to solve the expensive updates problem could be not inserting exclude_ids in Elasticsearch at all, but keeping only the active users' exclude lists in memory (using Redis or similar) with a TTL set on them. I suppose that the original data is still being stored in MySQL, so it can be taken from there and loaded into memory each time it is necessary (for example, when a user becomes active). But I think this is not recommended, since it seems that a Terms Query with many terms degrades performance a lot (explained in this issue).
There is already a similar question, but in my opinion there are many things that should be taken into account that are not discussed there. I would be happy to read more opinions about search and update performance with large amounts of data.
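To make the caching idea concrete, here is a minimal in-process sketch. The `ExcludeListCache` class, the `load_from_db` callback, and the 15-minute TTL are all assumptions; a plain dictionary stands in for Redis, and the caveat about very long terms lists still applies to the query it produces.

```python
import time


class ExcludeListCache:
    """Keeps exclude lists of active users in memory and expires them after a TTL."""

    def __init__(self, load_from_db, ttl_seconds=900, clock=time.monotonic):
        self._load = load_from_db      # hypothetical loader, e.g. a MySQL lookup
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # user_id -> (expires_at, [profile ids])

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is None or entry[0] <= now:
            # Missing or expired: reload from the primary store.
            entry = (now + self._ttl, list(self._load(user_id)))
            self._entries[user_id] = entry
        return entry[1]


def exclusion_query(exclude_ids):
    """Inline terms filter built from the cached ids (no lookup index involved)."""
    return {"query": {"bool": {"must_not": {"terms": {"profile_id": exclude_ids}}}}}
```

With this shape, frequent block/unblock actions only touch the primary store and the cache, and Elasticsearch never has to reindex an exclude document.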

Related

Elasticsearch Join

I have two indices. One index, "indications", holds a set of values.
The other is "projects". In this index, I add an indication value like "indication = oncology".
Now I want to show all indications, which I can do using a terms aggregation. But my issue is that I also want to show the count of projects in which each indication is used.
So for that, I need to write a join query.
Can anyone help me to resolve this issue?
Expected result example:
[{name:"oncology",projectCount:"12"}]
You cannot have joins in Elasticsearch. What you can do is store the indication name in the project index and then apply a terms aggregation on the project index. That will give you the distinct indications across all project documents and the count for each indication.
Something of the sort:
GET /project/_search
{
"query": { "match_all": {} },
"aggs": {
"indication":{
"terms": {
"field": "indication_name"
}
}
}
}
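As a small sketch, the aggregation buckets in the response can then be reshaped into the list the question asks for. The function name is hypothetical and the sample response is invented; only the `aggregations`, `buckets`, `key`, and `doc_count` keys come from the standard terms aggregation response.

```python
def indication_counts(response, agg_name):
    """Turn terms-aggregation buckets into [{name, projectCount}, ...]."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [{"name": b["key"], "projectCount": b["doc_count"]} for b in buckets]


# Invented sample response with a single bucket.
sample = {
    "aggregations": {
        "indication": {"buckets": [{"key": "oncology", "doc_count": 12}]}
    }
}
print(indication_counts(sample, "indication"))
# [{'name': 'oncology', 'projectCount': 12}]
```

Pass whatever aggregation name was used in the request as `agg_name`.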
Elasticsearch does not support joins. That's the whole point of NoSQL: you keep the data as denormalised as you can and make the documents as self-sufficient as possible.
There are some ways to add a form of relationship between your data. This is a nice blog on it.

ElasticSearch Index Modeling

I am new to Elasticsearch (you will figure that out after reading the question!) and I need help designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for the companies in the Russell 2000 Index. To define an index for these companies, I have the following mapping:
{
"mappings": {
"company": {
"_all": { "enabled": false },
"properties": {
"ticker": { "type": "text" },
"name": { "type": "text" },
"CEO": { "type": "text" },
"CEO_start_date": {"type": "date"},
"CEO_end_date": {"type": "date"}
}
}
}
}
As CEO of a company changes, I want to update end_date of the existing document and add a new document with start date.
Here,
(1) For such a dataset, what is an ideal id scheme? Since I want to keep multiple documents, should I consider a (company_id + date) combination as the id?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example, depending on the analyzer you configure, common words like "the" can be dropped and word endings like "ing" can be removed. Depending on how you want to search for names in your index, you may also want to switch that field to keyword. Also note that you have the option of indexing a field twice, using BOTH keyword and text, if you need to support both search methods.
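Indexing a field both ways is done with multi-fields. The following is a sketch of what the relevant part of the mapping could look like, written as a Python dictionary; the sub-field name `raw` and the sample search values are assumptions.

```python
import json

# ticker is matched exactly; name is analyzed for full-text search and also
# indexed unanalyzed under the (hypothetical) sub-field name "raw".
company_properties = {
    "ticker": {"type": "keyword"},
    "name": {
        "type": "text",
        "fields": {"raw": {"type": "keyword"}},
    },
}

# Exact match goes against the keyword sub-field ...
exact_name_query = {"query": {"term": {"name.raw": "Acme Corp"}}}
# ... while full-text search goes against the analyzed field.
full_text_query = {"query": {"match": {"name": "acme"}}}

print(json.dumps(company_properties, indent=2))
```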
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get far more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). I also find that loading or updating data in ES is slower than in an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then adding a new field where you combine the company ID and date would probably work fine.
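If you do go with your own IDs, a combined key could be as simple as the sketch below. The function name, the underscore separator, and the choice of the CEO start date as the date component are assumptions, not requirements.

```python
from datetime import date


def ceo_doc_id(company_id, start_date):
    """Combine company id and CEO start date into a deterministic document id."""
    return f"{company_id}_{start_date.isoformat()}"


print(ceo_doc_id("ACME", date(2017, 1, 25)))
# ACME_2017-01-25
```

Because the id is deterministic, re-indexing the same CEO record overwrites the existing document instead of creating a duplicate.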
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.

Elasticsearch index with no fields indexed

I need to create an Elasticsearch index whose contents will be accessed only by the document-id. There will never be any queries related to the contents of documents. These documents can contain any JSON, including instances where the same field can contain different types of data, etc.
I've hunted for this info, and have found much about indexing individual fields, but nothing about treating the entire document as essentially opaque.
Any help would be much appreciated.
You can do what you want, but in my opinion this is not the right way.
First, you need to create a mapping for the index:
PUT index_name_here
{
"mappings": {
"document_type_here": {
"properties": {
"field_name_for_json_data_here": {
"type": "object",
"enabled": false
}
}
}
}
}
After this, you can create documents with any field structure. You just need to store your JSON not directly in the document, but inside a field of the document (in my example, inside the field "field_name_for_json_data_here").
If possible, tell me why you chose Elasticsearch to store this kind of data. If I understood the question correctly, you need a simple key/value store (you could store your JSON as a string), and many databases are more suitable for this.

Inconsistent result set

I am trying to query Elasticsearch, but time and again I get a different result set for the same query. My cluster has 3 shards and 2 replicas. My first guess was that this happens because multiple shards are involved, so I tried querying with dfs_query_then_fetch, but I still have the same issue. After a lot of searching I found out that the shard copy being searched keeps changing, so I added a preference to the query. I still observe the same issue. I am out of options and unable to figure out what the problem is.
Pasting my query here
POST _search?search_type=dfs_query_then_fetch&preference=metiswayfinder
{
"query":{
"multi_match":{
"query":"this is a test",
"fields":[
"subject^3",
"message"
]
}
},
"sort":[
{
"_score":{
"order":"desc"
}
},
{
"incidentcount":{
"order":"desc"
}
}
]
}
The relevance score is vital for me, and I observe that the score changes every time I run the search. Elasticsearch is giving an inconsistent result set and inconsistent scores. Is there any way to fix both problems without much change?
Adding the segment query result
Thanks
Ashit
Take a look at https://www.elastic.co/guide/en/elasticsearch/reference/master/consistent-scoring.html
According to the comments in your post, it looks like you already identified that deleting documents with replication can cause this issue - though this post helps explain why. In terms of what to do next, the article suggests the following:
The recommended way to work around this issue is to use a string that identifies the user that is logged is (a user id or session id for instance) as a preference. This ensures that all queries of a given user are always going to hit the same shards, so scores remain more consistent across queries.
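A small sketch of how the per-user preference could be attached to each search request. The function name and the session id value are hypothetical; the leading-underscore check is there because preference values starting with `_` are reserved by Elasticsearch.

```python
from urllib.parse import urlencode


def search_params(session_id):
    """Build the query string so one user always hits the same shard copies."""
    if not session_id or session_id.startswith("_"):
        raise ValueError("preference must be a non-empty string not starting with '_'")
    return urlencode({
        "search_type": "dfs_query_then_fetch",
        "preference": session_id,
    })


print(search_params("user-42"))
# search_type=dfs_query_then_fetch&preference=user-42
```

Compared with one fixed preference string for everybody, a per-user value keeps results stable for each user while still spreading load across replicas.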
I hope that helps!

many indexes for mongodb refined searches

Referring to this question here:
I am working on a similar site using MongoDB as my main database. As you can imagine, each user object has a lot of fields that need to be searchable, for example mood, city, age, sex, smoker, drinker, etc.
Now, apart from the problem that there cannot be more than 64 indexes per collection, is it wise to assign an index to each of my fields?
There might be another viable way of doing it: tags (refer to this other question). If I set the index on an array of predetermined tags and then text-search over them, would that be better, as I am using only ONE index? What do you think? E.g.:
{
name: "john",
tags: ["happy", "new-york", "smoke0", "drink1"]
}
MongoDB doesn't (yet) support index intersection, so the rule is: one index per query. Some of your query parameters have extremely low selectivity, the extreme example being the boolean ones, and indexing those will usually slow things down rather than speed them up.
As a simple approximation, you could create a compound index that starts with the highest-selectivity fields, for instance {"city", "age", "mood", ... }. However, then you will always have to use a city constraint. If you query for {age, mood}, the above index wouldn't be used.
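The prefix rule can be illustrated with a deliberately simplified model: only the leading run of index fields that the query constrains can be used. The function below is a toy, not how MongoDB's query planner actually decides, and its name is made up.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query constrains."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # a gap ends the usable prefix
        prefix.append(field)
    return prefix


index = ["city", "age", "mood"]
print(usable_prefix(index, {"city", "mood"}))  # ['city']
print(usable_prefix(index, {"age", "mood"}))   # []
```

An empty result corresponds to the {age, mood} case above, where the compound index cannot help.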
If you can narrow down your result set to a reasonable size using indexes, a scan within that set won't be a performance hog. More precisely, if you say limit(100) and MongoDB has to scan 200 items to fill up those 100, it won't be critical.
The danger lies in very narrow searches across the database: if you have to scan the entire dataset to find the only unhappy, drinking non-smoker older than 95, things get ugly.
If you want to allow very fine grained searches, a dedicated search database such as SolR might be a better option.
EDIT: The tags suggestion looks a bit like using a crowbar to me. Maybe the key/value multikey index recommended in the MongoDB FAQ is a cleaner solution:
{ _id : ObjectId(...),
attrib : [
{ k: "mood", v: "happy" },
{ k: "city", v: "new york" },
{ k: "smoker", v: false },
{ k: "drinker", v: true }
]
}
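A sketch of how documents and queries for this layout could be produced. The helper names are hypothetical, and the index specification on `attrib.k` and `attrib.v` is an assumption about how the multikey index would be declared; the filter uses MongoDB's `$all` with `$elemMatch` so that every requested key/value pair must match within a single array element.

```python
def to_attrib(profile):
    """Flatten {"mood": "happy", ...} into the [{k, v}, ...] layout."""
    return [{"k": key, "v": value} for key, value in profile.items()]


def attrib_query(**wanted):
    """Filter that requires every given key/value pair to be present."""
    return {
        "attrib": {
            "$all": [{"$elemMatch": {"k": key, "v": value}}
                     for key, value in wanted.items()]
        }
    }


# Assumed compound multikey index specification (one index for all attributes).
ATTRIB_INDEX = [("attrib.k", 1), ("attrib.v", 1)]

doc = {"attrib": to_attrib({"mood": "happy", "smoker": False})}
query = attrib_query(mood="happy", smoker=False)
```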
However, YMMV and 'clean' and 'fast' often don't point in the same direction, so the tags approach might not be bad at all.

Resources