Solr: more than one entity in DataImportHandler - performance

I need to know the recommended approach for indexing data into Solr using multiple queries and entities.
I ask because I have to add new fields to the schema.xml configuration, and the field definitions differ depending on the entity (query).
query_one = "select * from car"
query_two = "select * from user"
The car and user tables have different fields, so I need to account for this in my schema.xml config when preparing the field definitions.
Has anyone created a separate Solr instance for this kind of problem?
I found something called MultiCore. Is it a good solution for my problem?
Thanks

Solr does not stop you from hosting multiple entities in a single collection.
You can define the fields for both entities and host them within the same collection.
You would need an identifier field to distinguish the entities if you want to filter the results per entity.
If your collections are small, or there is a relationship between User and Car, it might be helpful to host them within the same collection.
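As a rough sketch of the identifier idea, assuming every document carries a discriminator field (the name `entity_type` is hypothetical, not something Solr defines), a per-entity search can add a filter query on that field. The helper below only builds the query string for a /select request; it does not talk to Solr.

```python
from urllib.parse import urlencode


def build_select_params(user_query, entity_type=None):
    """Build the query string for a Solr /select request.

    `entity_type` is a hypothetical discriminator field that would have to be
    defined in schema.xml and populated for every indexed document.
    """
    params = [("q", user_query)]
    if entity_type is not None:
        # fq narrows the result set without influencing relevance scoring
        params.append(("fq", f"entity_type:{entity_type}"))
    return urlencode(params)


# Search cars only; omit entity_type to search across both entities.
print(build_select_params("name:golf", "car"))
```

With DataImportHandler, one way to populate such a field would be to select a constant column in each entity's SQL query, but that depends on how the import is configured.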
For Solr multicore, check this answer.
Solr multicore is basically a setup that allows Solr to host multiple cores.
These cores can host completely different sets of unrelated entities.
You can have a separate core for each table as well.
For example, if you have collections for Documents, People, and Stocks, which are completely unrelated entities, you would want to host them in different collections.
A multicore setup would allow you to:
Host unrelated entities separately so that they don't impact each other
Have a different configuration for each core, with different behavior
Perform activities on each core differently (update data, load, reload, replication)
Keep the size of each core in check and configure caching accordingly
It's more a matter of preference and requirements.

The main question for you is whether people will search for cars and users together. If not (they are different domains), you can set up multiple collections/cores. If they are going to be used together (e.g. a search for something that shows up in both cars and people), you may want to merge them into one index.
If you do use a single collection for both types, you may want to set up dedicated request handlers that return different sets of fields and possibly tune the searches. You can see an example of doing that (and a bit more) in the multilingual example from my book.

Related

how to restrict access to documents in elasticsearch?

I'm designing a solution and want to leverage some of Elasticsearch's query capabilities (version 7.x). We are expected to have around 10M documents per index.
Documents might have different 'associations' to what we call 'users' (not necessarily same meaning as in ES) -
associated to all, queryable in any context.
associated to single user, should appear only in this user context searches.
associated to a group of users (of size up to 1000K), should appear in queries for users of this group.
We expect to have a lot of users, in the 100Ks or so, which also means we might have a lot of different groups, since any 2 users might form a custom group.
I've been investigating ES's capabilities and it looks like each solution I came up with has disadvantages:
RBAC - will require creating a lot of roles (per user + per group, can ES even handle that many?)
ABAC - will require creating a lot of users (can ES even handle that many?)
Simple AND clauses on dedicated properties (complex template of the query as explained here)
It is important to note that if I go down this path, I will use a single ES user to query on behalf of the users I create.
I came across this question, but I figure things might have evolved since it was answered: Document access control in ElasticSearch
Any other suggestions that I should check out? maybe even custom 3rd party solutions?

Solr - schema per user group

Currently I'm developing a user-search application where users can do a full-text search. It should be extremely fast and there can be a lot of users, like 100.000. There are also around 10.000 user groups. I came across Solr and started to implement this, but it seems like I'm failing at the design level.
The requirements:
There is a default schema which is applied to all user groups
Each user is assigned to exactly one user group
A user group can have additional fields (besides the default schema) which should be displayed in the result set (so they can extend the data with custom data)
The search should be extremely fast
How would you build an application that meets these requirements?
First, I thought about creating a "master core" for the default schema and a core for each user group, so that I could join the necessary cores when a user requests the data. But it seems that joining cores in standalone mode would not work because it does not support sharding. Even if it did work, I'm concerned about performance because of joining at query time.
SolrCloud does seem to support sharding, but again, I would need to join the queries into one result set, which would impact performance. Additionally, I came across this post, Query multiple collections with different fields in solr, which says that I would need a merged schema (share-unification) to be able to query across collections/shards. So this would mean: whenever a user group's schema is changed, I would need to change my share-unification. As all user groups' schemas rely on the share-unification, the search would be unavailable because I would need to re-index at least two schemas.
A simple solution would be to put everything into a single core (standalone) or collection (cloud), but this feels overwhelming.
Has anyone done something similar before who can give good advice or even a best practice?

Elasticsearch - Modelling video catalogue information into one index vs multiple indexes

I need to model a video catalogue composed of movies, TV shows, episodes, TV channels and live program information in Elasticsearch. Some of these entities are correlated, some not.
The attributes of these entities are quite different, even if there are some common ones.
Since I may need to query across entities (imagine a customer searching for something that could be a movie, a TV channel or a live event program), is it better to have one single index containing a generic entity marked with a logical type attribute, or multiple indexes, one for each entity (movie, show episode, channel, program)?
In addition, some of these entities, like movies, can have metadata attributes in multiple languages.
Coming from a relational data model DB, I would create different indexes, one for every entity, and have a language variant index for every language. Any suggestion or better approach in order to have great search performance and usability?
Whether to use several indexes or not very much depends on the application, so I cannot provide a definite answer, rather a few thoughts.
From my experience, indexes are rather a means to help maintenance and operations than for data modeling. It is, for example, much easier to delete an index than delete all documents from one source from a bigger index. Or if you support totally separate search applications which do not query across each others data, different indexes are the way to go.
But when you want to query, as you do, documents across data sources, it makes sense to keep them in one index. If only to have comparable ranking across all items in your index. Make sure to re-use fields across your data that have similar meaning (title, year of production, artists, etc.) For fields unique to a source we usually use prefix-marked field names, e.g. movie_... for movie-only meta data.
As for the language, you need to use language-specific fields, like title_en, title_es, title_de. Ideally, at query time, you know your user's language (from the browser, because they selected it explicitly, ...) and then search in the language-specific fields where available. Be sure to use the language-specific analyzers for these fields, at query and at index time.
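A minimal sketch of picking the language-specific field at query time, assuming the title_<lang> naming from above. The set of supported languages and the English fallback are assumptions made for illustration; the resulting dictionary is an Elasticsearch multi_match query body.

```python
SUPPORTED_LANGUAGES = {"en", "es", "de"}  # assumed set of indexed languages


def title_fields(user_language, default="en"):
    """Return the title fields to search for a user's language tag."""
    lang = user_language.lower().split("-")[0]  # "de-AT" -> "de"
    if lang not in SUPPORTED_LANGUAGES:
        lang = default
    fields = [f"title_{lang}"]
    if lang != default:
        # also search the default language in case no translation exists
        fields.append(f"title_{default}")
    return fields


def build_query(text, user_language):
    """Build a multi_match query body restricted to the chosen title fields."""
    return {
        "query": {
            "multi_match": {"query": text, "fields": title_fields(user_language)}
        }
    }


print(build_query("der pate", "de-AT"))
```

Whether to fall back to a default language at all is a product decision; the fallback here only shows where such a rule would live.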
I see a search engine a bit as the dual of a database: A database stores data but can also index it. A search engine indexes data but can also store it. A database tends to normalize the schema to remove redundancy, a search engine works best with denormalized data for query performance.

Multi-tenant database. One collection or one db per tenant?

For a multi-tenancy architecture for a web application using a document-oriented database I can see two conceivable options:
Having one database per tenant, and the collections logically separate different kinds of object.
Having one collection per tenant, and all user data is stored in one database, with some kind of flag or object type identifier on each record.
Have there been any studies or has any documentation been produced regarding these two options and the differences between them?
Is there a particular standard or good reason why someone designing a web application which allows multiple users to store vastly different kinds of data would choose one over the other?
Aside from speed/efficiency issues, are there any other things to be said about this that would influence the decision?
EDIT I'm aware some of the terminology might be database specific, so for all wondering I am specifically referring to MongoDB.
I wouldn't want tenant specific collections. In my application, I usually hard code collection names, in the same way as I'd hardcode table names if I were using SQL tables. There'd be one comments collection that stores all comments for a blog. I would not want to deal with collection names like comments_tenant_1 and comments_tenant_2, because 1) that feels error prone, and 2) would make the application code more complicated (collection names would have to be replaced with functions that computed the collection name). And 3) the number of collections in a single database could grow huge, which would make a list of all collections look daunting, and also MongoDB isn't built for having very many collections (see the link in the comment below your question, which David B posted, https://docs.mongohq.com/use-cases/multi-tenant.html).
However, database names aren't coupled to application data structures, and you can grant permissions on databases (but not on single collections). So one database per tenant could be reasonable. As could be a per document tenant_id field in a single database for all tenants (see the above-mentioned link).
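For the single-database variant, one way to keep the tenant_id check from being forgotten is to route every query filter through a small helper. This is only a sketch: the helper name is made up, and it works on plain dictionaries rather than on any particular driver.

```python
def scoped_filter(tenant_id, query=None):
    """Return a copy of `query` that can only match one tenant's documents."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    scoped = dict(query or {})
    # Always overwrite, so a caller cannot pass in another tenant's id
    scoped["tenant_id"] = tenant_id
    return scoped


# The result is an ordinary filter document for the comments collection.
print(scoped_filter("tenant_1", {"post_id": 7}))
```

The returned dictionary can then be handed to the driver's find call, so collection names such as comments stay hard coded while the tenant separation happens in the filter.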

couchdb validation based on content from existing documents

QUESTION
Is it possible to query other couchdb documents as part of a standard couchdb validation function ?
If not, what is the standard approach for including properties of other documents as part of a validation rule inside a couchdb validation function?
RATIONALE
Consider a run-of-the-mill address book application where the validation function is intended to prevent two or more entries having the same value for the 'e-mail' in one of the address book entry fields.
Consider also an address book application where it is possible to specify validation rules in separate documents, based on whether the postal code is a US-based postal code or something else.
No, it is not possible to query other couchdb documents in a validate_doc_update function. Each runs in isolation passing references only to: the new document, the old document, and user (where applicable).
My personal experience has been there are at least three options for dealing with duplicate checking:
Use Cloudant as your CouchDB provider. They offer a free tier for now if you'd like to experiment, but they guarantee consistency across nodes for a CouchDB database. (See #2)
I've used a secondary "reserve table" for names, using the type-key as the ID. Then, you need to check for conflicts if not using a system like Cloudant. Basically, there's a simple document that maintains a key to prevent duplicates. It's not fun code to write given that you need to watch for conflicts. (Even with Cloudant, you need to deal with failed requests to write, but it's easier than dealing with timing issues surrounding data replication across multiple nodes).
Use a traditional DB like MySQL for example that can maintain a unique and consistent index for specific data values like you're describing. Store the documents away in CouchDB though. While slightly annoying that you need different data providers, it's reliable.
(Optional: decide that CouchDB isn't a great fit for the type of system you're building)
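The "reserve table" option can be sketched as follows. The store here is an in-memory stand-in, not a CouchDB client: it only imitates the behaviour that creating a document under an ID that already exists fails with a conflict (HTTP 409 in CouchDB). The class and function names are hypothetical.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 returned when a document ID already exists."""


class InMemoryStore:
    """Tiny stand-in for a single database node, for illustration only."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id, doc):
        if doc_id in self._docs:
            raise Conflict(doc_id)
        self._docs[doc_id] = doc


def reserve_email(store, email, owner_id):
    """Try to claim an e-mail address by creating a document keyed on it."""
    key = "email:" + email.strip().lower()  # type-key used as the document ID
    try:
        store.create(key, {"type": "reservation", "owner": owner_id})
    except Conflict:
        return False  # someone else already holds this address
    return True


store = InMemoryStore()
print(reserve_email(store, "A@example.com", "contact-1"))   # first claim succeeds
print(reserve_email(store, " a@example.com", "contact-2"))  # duplicate is rejected
```

This covers a single node only. With replication across several nodes, two reservations can still be accepted independently and surface later as a conflict, which is the part the answer describes as not fun to write.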
