Which changes does a RethinkDB proxy node receive?

I have a RethinkDB cluster with thousands of changefeed subscribers and the load from the subscriptions is pretty high, so I'm looking into proxy nodes. One question I have is which changes each proxy node receives, or put differently, is there any benefit in trying to cluster similar subscriptions on specific proxy nodes?
The concrete situation is that I have one table with two fields relevant to this discussion: an account field and a topic field. The subscriptions filter by account and do a "between" across two topics to match a topic prefix. The table has a compound secondary index on account and topic, so the filter is really an index range scan.
What I'm wondering is whether breaking the table up into one table per account would help with subscriptions. This would only be the case if I could direct all subscriptions for an account to one proxy, and if, at the RethinkDB level, that proxy then would not receive changes for tables it has no subscriptions on. Is that the case, or does each proxy receive all changes?
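For concreteness, a subscription looks roughly like this minimal sketch (Python driver; the table, index, and host names here are made up):

    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect(host="proxy-1.internal", port=28015)  # connect through a proxy node

    # Range scan on the compound (account, topic) secondary index,
    # turned into a changefeed. The "\uffff" upper bound is one way to
    # express "every topic starting with this prefix".
    feed = (
        r.table("events")
        .between(["acct-42", "orders/"], ["acct-42", "orders/\uffff"],
                 index="account_topic")
        .changes()
        .run(conn)
    )
    for change in feed:
        print(change["new_val"])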

Related

Elasticsearch Best Practices Flow

I am using Elasticsearch for product filtering. We have complex product-availability logic, and I can see two options:
Use Elasticsearch to store only product-specific data, with the availability logic living in the web server: we first filter the data with Elasticsearch, then check whether each result in that set matches the availability logic.
Or we can flatten the data and store it in Elasticsearch, although in that case there will be duplicated data.
My concern is whether it is good practice to call the Elasticsearch endpoint from the browser, as it has no auth system by default and every query and response is visible in the network log. I believe the call should be made from the web server to Elasticsearch, with the front end talking only to the web server, unaware of Elasticsearch's existence.
Any best-practice insight will be helpful.
Simply create an authenticated endpoint in your backend and send the queries to that endpoint (see the sketch below). Do make sure there are some enforced limits, such as:
size -- you don't want to let anybody download your whole index, and
aggregation depth -- you don't want anyone to perform summaries on your whole index/indices to get a competitive advantage.
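As a minimal sketch of such an endpoint, assuming Flask and the official elasticsearch Python client (the index name, route, and limits are placeholders, and authentication middleware is omitted):

    from flask import Flask, abort, jsonify, request
    from elasticsearch import Elasticsearch

    app = Flask(__name__)
    es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

    MAX_SIZE = 50  # enforced size limit per request

    @app.post("/api/products/search")
    def search_products():
        body = request.get_json(force=True)
        # Enforce the size limit so nobody can page through the whole index.
        size = min(int(body.get("size", 10)), MAX_SIZE)
        # Disallow client-supplied aggregations entirely.
        if "aggs" in body or "aggregations" in body:
            abort(400, "aggregations are not allowed")
        result = es.search(index="products",
                           query=body.get("query", {"match_all": {}}),
                           size=size)
        return jsonify([hit["_source"] for hit in result["hits"]["hits"]])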
Regarding the duplicates: I wouldn't worry too much about the storage aspect (many NoSQL approaches will probably have some duplication to facilitate fast queries) but keep in mind that aggregations might yield "wrong" counts and sums. You'd typically perform those aggregations to get, say, the totals in your product categories and you want to make sure they are representative of your warehouse state.
More cannot really be said right now based on the limited information you've provided.

Best way to set up ElasticSearch for searching in each customer's data only

We have a SAAS product where companies create accounts and populate their own private data. We are thinking about using ElasticSearch to allow the customer to search all their own data in our system.
As an example we would have a free text search where the user can type anything and the API would return multiple different types of objects. E.g. they type John and the API returns the user object for users matching a first name containing John, or an email containing John. Or it might also return a team object where the team name matches John (e.g. John's Team) etc.
So my questions are:
Is ElasticSearch a sensible choice for what we want to do from a concept perspective?
If we did use ElasticSearch, what would be the best way to index the data so we can search all data for a particular customer? Does each customer have its own index?
Are there any hints on how we keep ElasticSearch in sync with the data in the database (DynamoDB)? If we index a customer's data and then update it as it changes, is it sensible to also reindex the data on a scheduled basis?
Thanks!
I will try to provide general answers from my own experience with split customer data in Elasticsearch:
If you want to search through a lot of data really fast, ES is always a really good solution for this - it comes at the cost of a secondary data store that you will have to keep in sync with your database.
You can't have different data types in one index, so the choice is either to create one index per data type and customer (careful, indices come with an overhead - avoid creating too many with little data in them) - or to create one index per data type and add a property, e.g. a customer number, that you can then filter on (see the sketch after these points).
You will have to denormalize your data as much as possible to benefit from Elasticsearch.
As mentioned in 1, you will need to keep both in sync - there are plenty of ways to do that. As an example, we use an event-driven approach to push critical updates into Elasticsearch as soon as possible (careful: it's not SQL, so you will always have some concurrency issues when you need read and write safety). For data that is not highly critical, we use jobs that update it regularly. When you index a document with the same id, it gets completely replaced.
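A minimal sketch of the filter-by-property option from point 2, assuming the official elasticsearch Python client; the index, field, and customer names are hypothetical:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index a user document tagged with the customer that owns it.
    es.index(index="users", id="user-1", document={
        "customer_id": "acme",
        "first_name": "John",
        "email": "john@example.com",
    })

    # Search only within one customer's data by always ANDing a filter
    # on customer_id with the free-text query.
    resp = es.search(index="users", query={
        "bool": {
            "must": {"multi_match": {"query": "John",
                                     "fields": ["first_name", "email"]}},
            "filter": {"term": {"customer_id": "acme"}},
        },
    })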
Hope this helps, feel free to ask questions.

Best way to include aggregated document counts as part of percolation queries

Imagine that I have a stream of events, each with a particular event type and scoped to a particular user/account.
Users can set up alerts of the form:
Send alert when event A has occurred 3 times within the last year/month/day, etc.
I'd expect to receive hundreds of such events a second.
I was thinking that I would have a separate index for each day.
I was also wondering whether pre-aggregating counts would somehow be necessary, as doing a separate aggregation/count query for each incoming event seems excessive and not scalable - but maybe it's not a problem?
What would be the best approach for this problem?
One approach that comes to my mind is:
Have a percolate query for each user with their settings, e.g. letting them route events containing the word "error" to the level error.
Each event is indexed in a per-client index; if you have a lot of events per client, it may be useful to have a per-client, per-level index, like events_clientId_alarm.
Then the mapping of an event should be something like:
    {
      "properties": {
        "indexed_at": { "type": "date" },
        "level":      { "type": "keyword" },
        "log":        { "type": "text" }
      }
    }

(where level takes values like fatal/error/debug/...)
You will then have a stream of events coming in to be percolated; once an event is percolated, you know where to store it.
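A minimal sketch of that percolation flow, assuming the official elasticsearch Python client; the index, id, and field names are hypothetical:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The alerts index stores one query per user setting; it must also map
    # the fields of the documents that will be percolated against it.
    es.indices.create(index="alerts", mappings={
        "properties": {
            "query": {"type": "percolator"},
            "indexed_at": {"type": "date"},
            "level": {"type": "keyword"},
            "log": {"type": "text"},
        },
    })

    # One user's setting: match events whose log contains "error".
    es.index(index="alerts", id="user-42-error", document={
        "query": {"match": {"log": "error"}},
    })

    # Percolate an incoming event to find which stored queries it triggers.
    resp = es.search(index="alerts", query={
        "percolate": {
            "field": "query",
            "document": {"level": "error", "log": "disk error on node 3"},
        },
    })
    matching = [hit["_id"] for hit in resp["hits"]["hits"]]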
You can then use a Kibana/Grafana/etc. approach to monitor your index data and raise alarms if there are, say, 4 events with an alarm level in the last 5 minutes.
In the worst case you will have one index with more or less 8640000 * 365 documents (if you have only one user producing 100 events per second). This is a huge index, but it can be managed correctly by Elasticsearch, given enough shards to keep your searches/aggregations by log level and date fast.
The most important thing here is to know how your data will grow over time, because Elasticsearch doesn't let you add more shards to an existing index. You therefore need to consider how each customer's data will increase over time and estimate how many shards you need to keep everything running smoothly.
NOTE:
Depending on your agreements with your customers (e.g. whether they want the whole history of their event data), you can store one index per year per client, which allows you to delete old data if that is required and allowed.
Hope it helps; I did a similar project and took a similar approach to accomplish it.

Elasticsearch best practices

1) We are fairly new to Elasticsearch. In our Spring Boot application, we use Spring's Elasticsearch support, which is based on the in-memory node client. Inserts/updates/deletes happen on our primary relational database (DB2), and we use Elasticsearch solely for handling searches. We have a synchronization mechanism to keep Elasticsearch up to date with the latest changes.
2) In production, we have 4 instances of our application running. To synchronize the in-memory elastic store across all 4 servers, we have a JMS topic in place where all the DB2 updates are posted. The application has a topic listener that consumes any DB changes posted to this JMS topic and updates the in-memory elastic store.
Question:
i) Is the above an ideal way to implement Elasticsearch in your application? If not, what else would you recommend?
ii) Any Elasticsearch best practices that you can point us to?
Thanks Much!
1- In prod, choose 3 master nodes and 4 data nodes; always keep an odd number of master-eligible nodes so they can form a quorum.
2- Define your mappings and indices in advance; don't rely on the auto-create option (see the sketch below).
Define data types explicitly.
Define amounts as scaled_float with a scaling factor of 100.
Numeric fields that you need to range-query ('between'), sort, or aggregate on should be defined as long.
Choose carefully between the keyword and text field types. Use text only where it is necessary.
3- Use an external version if you update the same record again and again, to avoid updating with stale data.
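A minimal sketch of points 2 and 3, assuming the official elasticsearch Python client; the index and field names are hypothetical:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Explicit mapping defined up front instead of relying on auto-create.
    es.indices.create(index="orders", mappings={
        "properties": {
            # scaled_float with factor 100 stores 123.45 internally as 12345
            "amount": {"type": "scaled_float", "scaling_factor": 100},
            "quantity": {"type": "long"},   # range queries, sort, aggregations
            "status": {"type": "keyword"},  # exact matches, not analyzed
            "notes": {"type": "text"},      # full-text search where needed
        },
    })

    # External versioning: this write is rejected unless 7 is greater than
    # the currently stored version, protecting against stale updates.
    es.index(index="orders", id="order-1",
             version=7, version_type="external",
             document={"amount": 123.45, "quantity": 2, "status": "open"})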

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Consider a system where companies (with multiple employees) can register, administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how do I separate the data of those companies for searches? How can this be implemented with Elasticsearch?
Is this separation to be handled by Elasticsearch itself? I.e., is there some mapping between the companies in my system and a related user for Elasticsearch?
Or is this to be handled by the backend of my system? I.e., the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results from Elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there; please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern for solving this separation.
Thanks in advance!
Elasticsearch on its own does not support authorization and authentication; you need to add this via plugins, of which there are two that I know of. Shield is the official solution; it is part of X-Pack, and you need to pay Elastic if you want to use it. SearchGuard is an open-source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine-grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read and write that index. Or, if you absolutely want all documents in one index, you can add document-level restrictions as well, so that everybody queries the same index but only gets results returned for their own company. Depending on how many companies you expect to serve, the latter might make more sense in order not to have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
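As a rough illustration of the index-per-company restriction, here is a sketch using the security REST API that ships with X-Pack these days (the successor of Shield), via the official Python client; the role, user, and index names are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200",
                       basic_auth=("elastic", "changeme"))  # admin credentials

    # A role that may only read and write the acme-* indices.
    es.security.put_role(name="company_acme", indices=[{
        "names": ["acme-*"],
        "privileges": ["read", "write"],
    }])

    # A user for that company, bound to the role above.
    es.security.put_user(username="acme_user", password="s3cret-example",
                         roles=["company_acme"])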
Without these plugins you would need to resort to something at the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs, but I'd strongly advise against this - lots of pain lies that way!
