I'm designing a solution and want to leverage some of Elasticsearch's query capabilities (version 7.x). We expect to have around 10M documents per index.
Documents might have different 'associations' to what we call 'users' (not necessarily the same meaning as in ES):
associated with all users: queryable in any context.
associated with a single user: should appear only in that user's context searches.
associated with a 'group' of users (of size up to 1000K): should appear in queries for users of this group.
We expect to have a lot of users, in the 100Ks or so, which also means we might have a lot of different groups; any two users might form a custom group.
I've been investigating ES's capabilities, and it looks like each solution I came up with has disadvantages:
RBAC - would require creating a lot of roles (per user + per group; can ES even handle that many?)
ABAC - would require creating a lot of users (can ES even handle that many?)
Simple AND clauses on dedicated properties (a complex query template, as explained here)
It is important to note that I have a single ES user that I will use to query on behalf of the users I create, should I choose to go down this path.
I came across this question, but I figured things might have evolved since it was answered: Document access control in ElasticSearch
Any other suggestions I should check out? Maybe even custom third-party solutions?
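For reference, here is roughly what I picture for option 3; the visibility, allowed_users, and allowed_groups fields are my own illustration, not anything ES provides out of the box:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_as_user(user_id, group_ids, text):
    """Full-text search restricted to documents this user may see."""
    return es.search(
        index="documents",  # hypothetical index name
        body={
            "query": {
                "bool": {
                    "must": [{"match": {"content": text}}],
                    # Visible if public, shared with the user directly,
                    # or shared with one of the user's groups. All three
                    # fields are illustrative, not built-in ES features.
                    "filter": {
                        "bool": {
                            "should": [
                                {"term": {"visibility": "all"}},
                                {"term": {"allowed_users": user_id}},
                                {"terms": {"allowed_groups": group_ids}},
                            ],
                            "minimum_should_match": 1,
                        }
                    },
                }
            }
        },
    )
```

The filter clause runs in filter context, so it is cacheable and does not affect scoring, which seems relevant at 10M documents per index.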
Related
Currently I'm developing a user-search application where users can do full-text search. It should be extremely fast, and there can be a lot of users, around 100,000. There are also around 10,000 user groups. I came across Solr and started implementing this, but it seems like I'm failing at the design level.
The requirements:
There is a default schema which is applied to all user groups
Each user is assigned to exactly one user group
A user group can have additional fields (beyond the default schema) which should be displayed in the result set (so groups can extend the data with custom fields)
The search should be extremely fast
How would you design an application that meets these requirements?
First, I thought about creating a "master core" for the default schema and a core for each user group, so that I could join the necessary cores when a user requests data. But it seems that joining cores in standalone mode would not work, because it does not support sharding. Even if it did work, I'm concerned about the performance of joining at query time.
SolrCloud does seem to support sharding, but again, I would need to join the queries into one result set, which would hurt performance. Additionally, I came across this post Query multiple collections with different fields in solr which says that I would need a merged schema (share-unification) to be able to query across collections/shards. This would mean that whenever a user group's schema changes, I would need to change my share-unification. As all user groups' schemas rely on the share-unification, the search would be unavailable while I re-index at least two schemas.
A simple solution would be to put everything into a single core (standalone) or collection (cloud), but this feels unwieldy.
Has anyone done something similar before who can give good advice, or even a best practice?
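For reference, the single-collection route I'm weighing would look roughly like this with pysolr, using Solr dynamic fields for the per-group extras (the collection name and field suffixes below are placeholders):

```python
import pysolr

# One collection for everyone; a group_id field scopes each document,
# and dynamic-field suffixes (e.g. *_s, *_i in the default schema)
# carry a group's custom fields without schema edits.
solr = pysolr.Solr("http://localhost:8983/solr/users", always_commit=True)

solr.add([
    {"id": "u1", "name": "Alice", "group_id": "g1", "department_s": "sales"},
    {"id": "u2", "name": "Bob", "group_id": "g2", "badge_i": 1234},
])

# Restrict a search to the requesting user's group with a filter query;
# fq filters are cached, which should help with the speed requirement.
results = solr.search("name:alice", fq="group_id:g1")
for doc in results:
    print(doc)
```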
We have started using Elasticsearch in our project. We store user data with each user's friend list as a nested object, and the friend's friend list as a nested-in-nested object, because we need this data when doing a global search.
We are syncing this data with our database in real time at 50-100 TPS; is this fine, or will it create problems in the future?
We need to create complex queries for updating the data, because we manage the friend list at the 2nd level. So how do we write advanced Painless scripts? I have searched Google but found nothing detailed.
If my approach to this is wrong, please let me know.
To answer your first question, about whether syncing in real time at 50-100 TPS will create problems in the future:
As of ES 6.0, multi-level nesting is automatically supported and detected, so an inner nested query automatically matches the relevant nesting level (and not the root) if it sits within another nested query. But there is a caveat: indexing a document with 100 nested fields actually indexes 101 documents, as each nested document is indexed as a separate document. To safeguard against ill-defined mappings, the number of nested fields that can be defined per index is limited to 50 by default via the index.mapping.nested_fields.limit setting. This setting lets you cap the number of nested field mappings that can be created manually or dynamically, to prevent bad documents from causing a mapping explosion. So to answer your question: this is fine for now, but as your data grows it becomes more complicated to manage, and you run the risk of a mapping explosion.
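As a concrete illustration, the nesting is declared in the mapping and the cap is an index setting; the index and field names below are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each nested level is declared explicitly; every nested object is
# stored as a separate hidden Lucene document at index time.
es.indices.create(
    index="profiles",  # placeholder name
    body={
        "settings": {
            # 50 is the default cap on distinct nested fields per index.
            "index.mapping.nested_fields.limit": 50
        },
        "mappings": {
            "properties": {
                "name": {"type": "keyword"},
                "friends": {
                    "type": "nested",
                    "properties": {
                        "name": {"type": "keyword"},
                        # Second level: friends of friends.
                        "friends": {
                            "type": "nested",
                            "properties": {"name": {"type": "keyword"}},
                        },
                    },
                },
            }
        },
    },
)
```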
To answer your second question, about how to write advanced Painless scripts to update the friend list at the 2nd level:
You might need to present some context here for us to understand why your approach is necessary, but basically, in a social-profile context, managing a friends list the way you are doing it is generally a bad idea, especially if you anticipate scaling in the future. It may work for smaller use cases, but it does not work well at scale, because the relationships become more sophisticated and you end up with too many multi-nested objects. All other factors held constant, you might want to look at a graph database for this kind of scenario. You could, however, have other reasons for your approach, which is why you might want to describe your context so we can advise you better.
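That said, if you do keep the nested friend list for now, updates typically go through a scripted update. A minimal sketch with the Python client (the index name, document ID, and field shapes are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Append a new friend to a user's nested friend list with a Painless
# script. Index "profiles" and the "friends" field are assumptions,
# not a known mapping from the question.
es.update(
    index="profiles",
    id="user-42",
    body={
        "script": {
            "lang": "painless",
            "source": """
                if (ctx._source.friends == null) {
                  ctx._source.friends = [];
                }
                ctx._source.friends.add(params.friend);
            """,
            "params": {
                "friend": {"name": "alice", "friends": []}
            },
        }
    },
)
```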
Hope this helps!!
I have a scenario where security is mandated to be handled in Active Directory.
I have thousands of users.
I have hundreds of thousands of contracts.
I have 6 different roles that any user can have on any given contract.
The original implementation (don't blame me, I just got here!) created a new security group for each contract-role pair. As soon as users were assigned to the appropriate groups (1800+ in some cases), login and LDAP query performance for those users became unbearable (11+ minute logins).
So I'm looking for a better option. How would you tackle this association? Are there other ways to structure the relationship? Custom classes/attributes? Thoughts?
It looks like you need to rethink this. A quick search of the Active Directory scalability-limits documentation shows the following:
http://technet.microsoft.com/en-us/library/active-directory-maximum-limits-scalability(v=ws.10).aspx#BKMK_Groups
Security principals (that is, user, group, and computer accounts) can be members of a maximum of approximately 1,015 groups. This limitation is due to the size limit for the access token that is created for each security principal. The limitation is not affected by how the groups may or may not be nested.
I need to know the recommended solution when I want to index my Solr data using multiple queries and entities.
I ask because I have to add new fields to the schema.xml configuration, and depending on the entity (query) there should be different field definitions.
query_one = "select * from car"
query_two = "select * from user"
The car and user tables have different fields, so I should account for this fact in my schema.xml config (when preparing the field definitions).
Maybe some of you would create a new Solr instance for this kind of problem?
I found something called MultiCore. Is it the right solution for my problem?
Thanks
Solr does not stop you from hosting multiple entities in a single collection.
You can define the fields for both entities and have them hosted within the collection.
You would need an identifier to distinguish the entities if you want to filter the results per entity.
If your collections are small, or there is a relationship between the user and car entities, it might be helpful to host them within the same collection, for example:
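A rough sketch of that identifier approach with pysolr (the entity_type field and collection name are assumptions):

```python
import pysolr

# One collection holds both entities; entity_type tells them apart.
solr = pysolr.Solr("http://localhost:8983/solr/catalog", always_commit=True)

solr.add([
    {"id": "car-1", "entity_type": "car", "model": "Corolla"},
    {"id": "user-1", "entity_type": "user", "name": "Alice"},
])

# Filter per entity at query time with a filter query.
cars = solr.search("*:*", fq="entity_type:car")
```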
As for Solr multicore:
Solr multicore is basically a setup allowing Solr to host multiple cores.
These cores can host completely different sets of unrelated entities.
You can have a separate core for each table as well.
For example, if you have collections for documents, people, and stocks, which are completely unrelated entities, you would want to host them in different collections.
A multicore setup would allow you to:
Host unrelated entities separately so that they don't impact each other
Have a different configuration for each core, with different behavior
Perform activities on each core separately (updating data, loading, reloading, replication)
Keep the size of each core in check and configure caching accordingly
It's more a matter of preference and requirements.
The main question for you is whether people will search for cars and users together. If not (they are different domains), you can set up multiple collections/cores. If they are going to be used together (e.g. a search for something that shows up in both cars and people), you may want to merge them into one index.
If you do use a single collection for both types, you may want to set up dedicated request handlers returning different sets of fields and possibly tuning the searches. You can see an example of doing that (and a bit more) in the multilingual example from my book.
QUESTION
Is it possible to query other CouchDB documents as part of a standard CouchDB validation function?
If not, what is the standard approach for including properties of other documents in a validation rule inside a CouchDB validation function?
RATIONALE
Consider a run-of-the-mill address book application where the validation function is intended to prevent two or more entries having the same value in the 'e-mail' field of an address book entry.
Consider also an address book application where it is possible to specify validation rules in separate documents, based on whether the postal code is a US-based postal code or something else.
No, it is not possible to query other CouchDB documents inside a validate_doc_update function. Each one runs in isolation, receiving references only to the new document, the old document, and the user (where applicable).
My personal experience has been there are at least three options for dealing with duplicate checking:
Use Cloudant as your CouchDB provider. They offer a free tier for now if you'd like to experiment, and they guarantee consistency across nodes for a CouchDB database. (See #2.)
I've used a secondary "reserve table" for names, using the type-key as the ID (see the sketch after this list). Then you need to check for conflicts if you're not using a system like Cloudant. Basically, there's a simple document that maintains a key to prevent duplicates. It's not fun code to write, given that you need to watch for conflicts. (Even with Cloudant, you need to deal with failed write requests, but that's easier than dealing with timing issues around data replication across multiple nodes.)
Use a traditional DB like MySQL that can maintain a unique and consistent index for specific data values like the ones you're describing, but store the documents themselves in CouchDB. While it's slightly annoying to need two data providers, it's reliable.
(Optional: decide that CouchDB isn't a great fit for the type of system you're building)
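To illustrate option 2, here is a minimal sketch of the reserve-document idea in Python, leaning on CouchDB's per-database _id uniqueness (the database name and ID scheme are my own):

```python
from urllib.parse import quote

import requests

COUCH = "http://localhost:5984"   # assumed local CouchDB
DB = "email_reservations"         # hypothetical reservation database

def reserve_email(email):
    """Claim an e-mail address by writing a doc whose _id encodes it.

    CouchDB enforces _id uniqueness within a database, so a second
    writer gets HTTP 409 Conflict instead of a duplicate. Across
    replicated nodes you still have to watch for conflicts afterwards.
    """
    doc_id = quote("email:" + email.lower(), safe="")
    resp = requests.put(f"{COUCH}/{DB}/{doc_id}", json={"type": "reservation"})
    if resp.status_code == 201:
        return True   # reservation won
    if resp.status_code == 409:
        return False  # address already taken
    resp.raise_for_status()

# Only the first call for a given address succeeds.
print(reserve_email("alice@example.com"))
```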