Spring security Access Control List Billions of rows - performance

Implementing security solution based on spring security framework particularly its acl modules. There are millions of domain objects and some hundreds of users in the application. Using Spring Security Acl module the entry in acl_sid and other related tables grows to 10's of billions which impacts the performance of the application. Would like to know the best practice for handling such scenarios. Are there any alternative security framework available which deals with similar situation in efficient way.

There are several frameworks that make access control more manageable.
First of all, ACLs are great and easy to configure but they do not scale well.
Option #1: Role-based Access Control (RBAC)
RBAC is a well-known model having been defined by NIST in 1992. Many applications and frameworks implement an RBAC model. In RBAC, you give users a set of roles and each role has a set of permissions. As a consequence, users inherit those permissions. You can for instance have a manager role with the permission to view all transactions.
Spring Security, Apache Shiro, JAAS, and many other frameworks (open-source, commercial...) implement RBAC.
Option #2: Attribute-based Access Control (ABAC)
Sometimes RBAC is not enough. In particular when you want to use context or relationships. For instance, in RBAC, it is hard to implement roles and permissions that would handle the following:
Managers can view transactions in their own department
To do that you would use ABAC. You would define a role attribute, a user department attribute, and a transaction department attribute. You would then combine the attributes together in a policy:
A user with the role==manager can do the action=='view transaction' if
user.department==transaction.department
XACML - an implementation of ABAC
XACML, the eXtensible Access Control Markup Language, is a standard defined by OASIS and increasingly used to implement complex authorization challenges. There are several implementations today:
Open source
SunXACML
WSO2
Commercial
Axiomatics.
How do RBAC and ABAC reduce the management burden?
In access control lists, you have a list per item you want to protect and you have to insert user identities in those lists. You may also want to add action data so you end up with:
Item #1 ACL
Alice, read
Alice, write
Bob, read
Carol, read
Item #2
...
If you have 1 million items and 10,000 users, you have a potential of 1 million x 10k x 3 actions (read, write, delete) = a grand total of 30 billion lines. That equates to a management nightmare but also potentially a performance issue.
Now the idea with RBAC was to streamline that a bit. Instead of assigning users to items in ACLs, we use roles and permissions as a level of indirection. So Alice would be an editor. Bob and Carol would be viewers. Your ACLs are now simpler:
Item #1
Editor, read
Editor, edit
Viewer, read
The list is growing smaller. Yet RBAC still have several issues. It still has to have an ACL per object. If you have a million objects, you will still have a few million rows (still better than 30 billion though).
With ABAC, you choose to use object attributes e.g. the department or the classification. Objects no longer have ACLs and you end up writing policies that use these attributes. This makes the number of policies smaller (in the hundreds typically).
Thanks to attributes, ABAC scales better.

Related

how to restrict access to documents in elasticsearch?

I'm designing a solution and want to leverage some of Elasticsearch's query capabilities (version 7.x). We are expected to have around 10M documents per index.
Documents might have different 'associations' to what we call 'users' (not necessarily same meaning as in ES) -
associated to all, queryable in any context.
associated to single user, should appear only in this user context searches.
associated to a 'groups' of users (of size of up to 1000K), should appear in queries for user's of this group.
We expect to have a lot of users, in the 100Ks or so. which also mean we might have a lot of different groups, each 2 users might form a custom group.
I've been investigating ES's capabilities and it looks like each solution I came up with have disadvantages:
RBAC - will require creating a lot of rolls (per user + per group, can ES even handle that many?)
ABAC - will require creating a lot of users (can ES even handle that many?)
Simple AND clauses on a dedicated properties (complex template of the query as explained here)
it is important to note that I have a single user that I will be using in order to query on behalf of the users I will create, in case I will choose to go down this path.
I came across this question but I figured that thing might have evolved since its been answered Document access control in ElasticSearch
Any other suggestions that I should check out? maybe even custom 3rd party solutions?

Solr - schema per user group

currently I'm developing user-search application where users can do a full-text search. It should be extremely fast and there can be a lot of users, like 100.000. There are also like 10.000 user groups. Now I came across Solr and started to implement this, but it seems like I'm failing at the design level.
The requirements:
There is a default schema which is applied to all user groups
Each user is assigned to exactly one user group
A user group can have additional fields (besides the default schema) which should be displayed in the result set (so they can extend the data with custom data)
The search should be extremely fast
How would you realize that application that suits the requirements?
First, I thought about creating a "master core" for the default schema and create a core for each user group, so that I could join the necessary cores when a user requests the data. But it seems like that joining cores in standalone would not work because it does not support sharding. However, even if it would work, I'm concerned about performance because of joining at query time.
SolrCloud does seem to support sharding, but again, I would need to join the queries to one result set which would impact performance again. Additionally, I came across this post Query multiple collections with different fields in solr which says that I would need a merged schema (share-unification) to be able to query across collections/shards. So this would mean: whenever a user group's schema is changed, I would need to change my share-unifacation. As all user group's schemas rely on the share-unification, the search would be unavailable because I would need to re-index at least two schemas.
A simple solution would be to put everything into a single core (standalone) or collection (cloud), but this feels overwhelming.
Has someone did something similar before and can give a good advice or even a best practice?

Recommendation on AD organization

I have a scenario where security is mandated to be handled in Active Directory.
I have thousands of users.
I have hundreds of thousands of contracts.
I have 6 different roles that any user can have on any given contract.
The original implementation (don't blame me, I just got here!) created a new security group for each contract-role. As soon as users were assigned to the appropriate groups (1800+ in some cases) login and LDAP query performance for those users became unbearable (11+ minute logins).
So I'm looking for a better option, how would you tackle this association? Other ways to structure the relationship? Custom classes/attributes? Thoughts?
Looks like you need to rethink, a quick search in the Active Directory scalability limits documentation shows this:
http://technet.microsoft.com/en-us/library/active-directory-maximum-limits-scalability(v=ws.10).aspx#BKMK_Groups
Security principals (that is, user, group, and computer accounts) can be members of a maximum of approximately 1,015 groups. This limitation is due to the size limit for the access token that is created for each security principal. The limitation is not affected by how the groups may or may not be nested.

Multi tenancy with tenant sharing data

I'm currently in the process of making a webapp that sell subscriptions as a multi tenant app. The tech i'm using is rails.
However, it will not just be isolated tenants using the current app.
Each tenant create products and publish them on their personnal instance of the app. Each tenant has it's own user base.
The problematic specification is that a tenant may share its product to others tenants, so they can resell it.
Explanation :
FruitShop sells apple oranges and tomatoes.
VegetableShop sells radish and pepper bell.
Fruitshop share tomatoes to other shops.
VegetableShop decide to get tomatoes from the available list of shared
items and add it to its inventory.
Now a customer browsing vegetableshop will see radish, pepper bell and
Tomatoes.
As you can guess, a select products where tenant_id='vegetableshop_ID' will not work.
I was thinking of doing a many to many relation with some kind of tenant_to_product table that would have tenant_id, product_id, price_id and even publish begin-end dates. And products would be a "half tenanted table" where the tenant ID is replaced by tenant_creator_id to know who is the original owner.
To me it seems cumbersome, adding it would mean complex query, even for shop selling only their own produts. Getting the sold products would be complicated :
select tenant_to_products.*
where tenant_to_products.tenant_ID='current tenant'
AND (tenant_to_products.product match publication constraints)
for each tenant_to_product do
# it will trigger a lot of DB call
Display tenant_to_product.product with tenant_to_product.price
Un-sharing a product would also mean a complex update modifying all tenant_to_products referencing the original product.
I'm not sure it would be a good idea to implement this constraint like this, what do you suggest me to do? Am I planning to do something stupid or is it a not so bad idea?
You are going to need a more complicated subscription to product mechanism, as you have already worked out. It sounds like you are on the right track.
Abstract the information as much as possible. For example, don't call the table 'tenant_to_product', instead call it 'tenant_relationships', and have the product Id as a column in this table.
Then, when the tenant wants to have services, you can simply add a column to this table 'service Id' without having to add a whole extra table.
For performance, you can have a read-only database server with tenant relationships that is updated on a slight delay. Azure or similar cloud services would make this easy to spin up. However, that probably isn't needed unless you're in the order of 1 million+ users.
I would suggest you consider:
Active/Inactive (Vegetable shop may prefer to temporarily stop selling Tomatoes, as they are quite faulty at the moment, until the grower stops including bugs with them)
Server-side services for notification, such as 'productRemoved' service. These services will batch-up changes, providing faster feedback to the user.
Don't delete information, instead set columns 'delete_date' and 'delete_user_id' or similar.
Full auditing history of changes to products, tenants, relationships, etc. This table will grow quite large, so avoid reading from it and ensure updates are asynchronous so that the caller isn't blocked waiting for the table to update. But it will probably be very useful from a business perspective.
EDIT:
This related question may be useful if you haven't already seen it: How to create a multi-tenant database with shared table structures?
Multi-tenancy does seem the obvious answer as you are providing multiple clients your system.
However as an alternative, perhaps consider a reseller 'service layer', this would enable a degree of isolation whilst still offering integration. Taking inspiration to how reseller accounts work with 3rd parties like Amazon.
I realise this is very abstract reasoning but perhaps considering the integration at a higher tier than the data layer could be of benefit.
From experience strictly enforcing multi-tenancy at a data layer level we have found that tenancy sometimes has to be manipulated at a business layer (like your reseller ideas) to such a degree that the tenancy becomes a tricky concept. So considering alternatives early on could help.

optimization of db queries when implementing bounded contexts

In our project we're trying to apply the Bounded Context ideology and we've faced kind of obvious problem of performance. E.g., we have different classes (in different contexts) for representing a user in the system: Person in our core domain's context and User in security context. So, we have two different repositories for each of the aggregate, but they are using the same table in DB and sometimes accessing the same data.
Is there common solution to minimize db roundtrips in this case? Are there ORM's which deals with it, or should we code some caching system by ourselves?
upd: the db is from legacy app, and we'll have to use it "as is"
So, we have two different repositories for each of the aggregate, but
they are using the same table in DB and sometimes accessing the same
data.
The fact that you have two aggregates stored in the same table is an indication of a problem with the design. In this case, it seems you have two bounded contexts - a BC for the core domain (Person is here) and an identity/access BC (User is here). The BCs are related and the latter can be seen as upstream from the former. A Person in the core domain has a corresponding User in the identity BC, but they are not exactly the same thing.
Beyond this relationship between the BCs there are questions regarding ownership of behavior. For example, both a Person and a User may have a name and what is to be determined is who own's the behavior of changing a name. This can be implemented in several ways. Person may have its own name and changes should be propagated to the identity BC. Similarly, User may own changes to name, in which case they must be propagated to Person via a synchronization mechanism.
Overall, your problem could be addressed in two ways. First, you can store Person and User aggregates in different tables. Any given query should only use one of these tables and they can be synchronized in an eventually consistent matter. Another approach is to decouple the behavioral domain model from a model designed for queries (read-model). This way, you can create a read-model designed to serve a specific screen(s) and have a customized query, perhaps even outside of an ORM.
If all the Users are Person too (sometimes external services are modeled as special users too), the only data that User and Person should share on the database are their identifiers.
Indeed each entity in a domain model should hold references only to the data that they need to ensure their invariants.
Moreover I guess that Users are identified by Username and Persons are identified by something else (VAT code or so..).
Thus, the simplest optimization technique is to avoid to encapsulate in an entity those informations that are not required to ensure its invariants.
Furthermore you simply need an effective context mapping technique to easily pass from User to Person when needed. I use shared identifiers for this.
As an example you can expose the Person's identifier in the User class, so that a simple query to the Person's repository can provide you the data you need.
Finally I suggest you the Vaughn Vernon series on Aggregate Root Design.

Resources