We're using Elasticsearch to index about 100,000 documents. Users can favorite items, and we currently use Django Haystack's RelatedSearchQuerySet to restrict search results to a user's favorites. This executes a lot of SQL queries to filter that subset of documents, which makes searching through a user's favorites obscenely slow.
To speed things up, I thought about adding a multivalue field to each document (e.g. favorited_by) and storing users' primary keys in it, but since users can favorite thousands of items, popular documents will become large.
Searching through a user's favorites seems like a solved problem. How is it done?
In Elasticsearch I would store each favorite together with the username of the user who favorited it, then add a filter on the username to the query. You could even use the username in a multivalue field, though I'm not sure how that scales with thousands of users. The advantage would be a single document per item when multiple users favorite the same one, but I think I would still keep one favorite per user.
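For the multivalue-field variant, the query shape is simple. A sketch in Python, assuming a multivalue keyword field named favorited_by on each document (the field name is an assumption, not something Haystack creates for you):

```python
def favorites_search(user_id, text):
    """Full-text query restricted to documents the user favorited.

    Assumes each document carries a multivalue keyword field
    "favorited_by" holding the primary keys of users who favorited it.
    """
    return {
        "query": {
            "bool": {
                "must": {"match": {"text": text}},
                # a term filter is cached by Elasticsearch and avoids
                # the per-result SQL round trips Haystack was doing
                "filter": {"term": {"favorited_by": user_id}},
            }
        }
    }
```

The whole filter runs inside Elasticsearch, so no SQL is issued per result.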
Popular search engines are quite performant at full-text search and many other things, but I'm not sure how to map the main document store's security policies onto ES and/or Solr.
Consider Google Drive and its folders. When a user shares a folder, the files and folders below it are shared too. Content management systems use something similar.
But how do you map that onto an external search engine (one not built into the application's content management system), especially with millions of documents in tens of thousands of folders and tens of thousands of users? Would it help if, for example, the depth (nesting) of folders were limited to some small number?
I know ES has user roles, but I can't see how they help here, because access is granted more or less arbitrarily. Another approach is to somehow materialize user access in the documents (folders and files) themselves, but then a change to a user's role that is local to some folder would mean updating many thousands of documents.
Also, searches can be quite arbitrary and return long result lists, so pagination is required; fetching "everything" and filtering by user access on the application side is therefore not an option.
I believe this scenario is quite common, but I can't find any hints on how to implement it.
I have used Solr as a search engine, with Solr's Data Import Handler (DIH) feature to import data from the database into Solr.
I would suggest going with the approach of indexing the ACLs along with the documents.
I took the same approach and it has worked fine so far.
I agree that you have to re-index data on the Solr side whenever folder access or the access level of a document changes. We already need to re-index a document when its metadata or content changes; in the same way, we can update documents on the Solr side for any change in the ACL (Access Control List).
Why index the ACL along with the document information?
Whenever a user searches for a document, you can pass the user's ACL as part of the query in the form of a filter query and get back only the documents that are accessible to that user.
I feel this removes the complexity of applying the ACL logic on the back-end side.
If you don't index the ACL in Solr, then you have to filter the documents after retrieving them from Solr, checking each document ID against whatever ACL logic applies.
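A sketch of that filter-query approach in Python (the "acl" field name and the principal values are illustrative):

```python
def acl_filter(user, groups):
    """Build a Solr fq clause matching documents whose "acl" field
    contains the user or any of the user's groups."""
    principals = [user] + list(groups)
    return "acl:(" + " OR ".join(principals) + ")"

# Passed alongside the main query, e.g. via pysolr or plain HTTP:
params = {"q": "content:report", "fq": acl_filter("alice", ["sales", "emea"])}
print(params["fq"])  # acl:(alice OR sales OR emea)
```

Because fq clauses are cached independently of the main query, repeated searches by the same user reuse the cached ACL filter.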
The last option is to index the documents without ACLs and let users search across all of them. When a user tries to perform an action on a document, you check the permission and either allow the action or tell the user they don't have sufficient access to the document.
Actions could be View, Download, Update, etc.
You need to decide which approach suits and works out in your case.
The database structure for my Python application is very similar to Instagram's: I have users and posts, and users can follow each other. There are public and private accounts.
I am indexing this data in Elasticsearch and search works fine so far. However, there is a problem: search returns all posts without filtering by whether the current user has access to them (e.g. a post created by another user with a private account whom the current user isn't following).
My data in ElasticSearch is indexed simply across several indexes in a flat format, one index for users, one for posts.
I can post-process the results Elasticsearch returns and remove the posts the current user doesn't have access to, but this introduces an additional database query to retrieve that user's follower list, and possibly a blocklist (I don't want to show posts between users who block each other either).
I can also add a list of follower IDs to each user document in Elasticsearch at index time and then match against them, but when a user has thousands of followers these lists become huge, and I'm not sure how practical it is to keep them in Elasticsearch.
How can I efficiently do this? My stack is backend Python + Flask, PostgreSQL database and ElasticSearch as search index.
Maybe you already found a solution...
Using Elastic's "terms lookup" can solve this problem if you have an index with the list of followers you can filter on, as you said here:
"I can also add a list of follower IDs for each user to Elasticsearch upon indexing and then match against them, but in case where user has thousands of followers, these lists will be huge, and I am not sure how convenient it will be to keep them in Elasticsearch."
More details in the doc:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-terms-query.html#query-dsl-terms-lookup
Note that there's a default limit of 65,536 terms (it can be raised), so as long as your service doesn't have millions of users the default limit will be fine.
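A sketch of what such a terms-lookup query could look like, assuming a "following" index with one document per user holding a "following_ids" array (all names here are illustrative):

```python
def visible_posts_query(current_user_id):
    """Match posts whose author is in the current user's following
    list, fetched at query time from a separate "following" index."""
    return {
        "query": {
            "terms": {
                "author_id": {
                    # Elasticsearch loads "following_ids" from this
                    # document and uses its values as the terms
                    "index": "following",
                    "id": str(current_user_id),
                    "path": "following_ids",
                }
            }
        }
    }
```

The follower list lives in a single document per user, so updating it on follow/unfollow is one indexed write rather than a fan-out to every post.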
I'm using ElasticSearch 7.1.1 as a full-text search engine. At the beginning all the documents are accessible to every user. I want to give users the possibility to edit documents. The modified version of the document will be accessible only to the editor and everyone else will only be able to see the default document.
To do this I will add two arrays to every document:
An array of users excluded from seeing the doc
An array with the single user who can see this doc
Every time someone edits a document I will:
Add the user that made the edit to the excluded users list
Create a document containing the edit, available only to that user.
This way in the index I'll have three types of documents:
Documents accessible to everyone
Documents accessible to everyone except some users
Documents accessible only to a specific user
I use Elasticsearch not only to fetch documents but also to compute live aggregations (e.g. sums over some field), so at query time I need to fetch the user-specific documents.
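Something like this sketch, assuming the two arrays are indexed as "excluded_users" and "visible_to" (the field names follow the description above; the query shape is an assumption):

```python
def docs_for_user(user_id):
    """Default documents the user is not excluded from, plus the
    user's own edited copies."""
    return {
        "query": {
            "bool": {
                "should": [
                    # default docs: no private owner, user not excluded
                    {
                        "bool": {
                            "must_not": [
                                {"exists": {"field": "visible_to"}},
                                {"term": {"excluded_users": user_id}},
                            ]
                        }
                    },
                    # the user's private edited copies
                    {"term": {"visible_to": user_id}},
                ],
                "minimum_should_match": 1,
            }
        }
    }
```

The same filter works inside aggregations, since it just narrows the document set before the sums are computed.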
I don't expect a lot of edits, less than 1% of the total documents.
Is there a smarter, less query-intensive way to obtain the same results?
You could implement document-level security.
With that, you can define roles that restrict read access to the documents matching a query (e.g. one keyed on the document's ID).
So instead of updating the documents each time via your proposed array solution, you would update the role, or grant the role to the particular users. This does of course require that every user has an Elasticsearch user.
As far as I know, this feature is the only way Elasticsearch fulfills your requirements "out of the box".
I hope this helps.
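A sketch of such a role body for the security API (sent via PUT _security/role/&lt;name&gt;); the index name and the query are assumptions, not taken from the question:

```python
import json

# Illustrative document-level-security role: a user holding this role
# can only read documents in "docs" that match the embedded query.
role_body = {
    "indices": [
        {
            "names": ["docs"],
            "privileges": ["read"],
            # the DLS query is embedded as a JSON string
            "query": json.dumps({"term": {"visible_to": "alice"}}),
        }
    ]
}
print(json.dumps(role_body, indent=2))
```

Note that a role's query can also use templated values, which avoids defining one literal role per user.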
For a property sale/rent website, a search function should be provided, and users should be able to apply filters to narrow the results down to what they want.
A property normally has many attributes: price, address, year built, area, plus many amenities such as a balcony or a washing machine; possibly over 100 in total.
So how should the database (MySQL or some NoSQL store) and the architecture be designed so that search is as efficient as possible?
Sounds like your application requires a lot more search queries than update queries, and that the search queries are quite diverse.
In this case, try Elasticsearch: keep a database as the system of record where you store and modify your data, and propagate every update to an Elasticsearch index, where you upload a denormalized view of the data that is closer to what users expect to get back when searching.
https://www.quora.com/Whats-the-best-way-to-setup-MySQL-to-Elasticsearch-replication
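A sketch of that denormalization step (field and index names are illustrative); the returned document is what you would send to Elasticsearch on every update:

```python
def denormalize_property(row, amenities):
    """Flatten a property row and its related amenity rows into the
    single document shape users expect to search against."""
    return {
        "price": row["price"],
        "address": row["address"],
        "year_built": row["year_built"],
        # one-to-many relation flattened into a keyword array,
        # so filters like "has balcony" become simple term queries
        "amenities": sorted(a["name"] for a in amenities),
    }

doc = denormalize_property(
    {"id": 7, "price": 250000, "address": "1 Main St", "year_built": 1998},
    [{"name": "balcony"}, {"name": "washing-machine"}],
)
print(doc["amenities"])  # ['balcony', 'washing-machine']
```

The document would then be indexed with something like es.index(index="properties", id=row["id"], document=doc) using the official Python client.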
I'm wondering if having thousands of different indexes is a bad idea?
I'm adding a search page to my web app based on Elasticsearch. The search page lets users search for other users on the site, filtering on a number of indexed criteria (name, location, gender, etc.). This is fairly straightforward and requires just one index containing a document for every user of the site.
However, I also want a page where users can see a list of all the other users they follow, with the same filtering options that are available on the search page. Would a good way to do this be to create a separate index per user, containing a document for each user they follow?
While you can certainly create thousands of indices in Elasticsearch, I don't really see the need for it in your use case. I think you can use one index. Simply create an additional child type, followers, for the main user record. Every time user A follows user B, create a child record under B with the content {"followed_by": "A"}. To get the list of users the current user is following, simply add a Has Child filter to your query.
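In recent Elasticsearch versions the parent/child pattern above is modeled with a "join" field rather than separate mapping types, but the query shape is similar. A sketch with illustrative names:

```python
def followed_users_query(current_user, extra_filters):
    """Parent "user" docs that have a child "follower" doc created by
    current_user, with the page's normal filters also applied."""
    return {
        "query": {
            "bool": {
                # the same criteria used on the search page,
                # e.g. [{"term": {"location": "NYC"}}]
                "must": extra_filters,
                "filter": {
                    "has_child": {
                        "type": "follower",
                        "query": {"term": {"followed_by": current_user}},
                    }
                },
            }
        }
    }
```

This keeps everything in one index while reusing the existing filter criteria unchanged.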
I would like to add to Igor's answer that creating thousands of indexes on a tiny cluster (one or two nodes) has some drawbacks.
Each shard of an index is a full Lucene instance, so on a single node (or a small cluster, in terms of node count) you will end up with many open files (probably too many open files).
That's one of the major reasons why I would not define too many indices...
See also the "File Descriptors" section of the installation guide.