Using Elasticsearch replication in Django

The current setup is composed of 3 Elasticsearch servers, of which one is the master and the other 2 are slaves; at least, that is how they identify themselves.
The master might go down for any kind of reason, in which case Elasticsearch will find a new eligible master and switch to it.
The problem is that my application on the frontend servers is completely unaware of this, so it keeps making queries to the same backend, which of course takes down my whole website because that backend no longer answers. I had a look around but was not able to find anything about switching backends on the fly, even in the new Haystack 2.x.
Any suggestions?
Many thanks in advance

It doesn't seem necessary to me to handle this in your application layer. Most probably you access ES through HTTP REST requests, which means you can put any HTTP load balancer, such as Nginx, in front of your ES servers; it should also switch to another node if one times out.
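If you still want a fallback inside the application, the idea is the same one the load balancer applies: try each node in turn and use the first that answers. The following is only a minimal sketch under stated assumptions. The node URLs are hypothetical, and `http_probe` and `first_reachable` are illustrative helpers, not part of Haystack or of any Elasticsearch client.

```python
import urllib.request

# Hypothetical node addresses; replace with your own cluster members.
NODES = [
    "http://es1.example.internal:9200",
    "http://es2.example.internal:9200",
    "http://es3.example.internal:9200",
]


def http_probe(url, timeout=1.0):
    """Return True if the node answers an HTTP GET on its root endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection errors and socket timeouts all derive from OSError.
        return False


def first_reachable(nodes, probe=http_probe):
    """Return the first node for which probe(node) is true, or None."""
    for node in nodes:
        if probe(node):
            return node
    return None
```

The probe is injectable so the selection logic can be exercised without a running cluster. In practice a load balancer remains the simpler option, since it keeps this logic out of every frontend server.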

Related

How to remotely connect to a local elasticsearch server - in a secure way ofc

I have been playing around with creating a web app that uses Elasticsearch to perform queries. Currently, everything is in production, thus on the localhost; let's say Elasticsearch runs at 123.123.123.123:9200. All fun and games, but once the web application (React) is finished, it should be able to send queries to the current local Elasticsearch database.
I have been reading up on how to get this done in a proper and, most of all, secure way. In summary, what I have so far is:
"First off, exposing an Elasticsearch node directly to the internet without protections in front of it is usually bad, bad news." (see here: Accessing elasticsearch from a public domain name or IP).
Another interesting blog I found: https://code972.com/blog/2017/01/dont-be-ransacked-securing-your-elasticsearch-cluster-properly-107.
The problem with the above-mentioned sources is that they are a bit older, and thus I am not sure whether they are up to date.
Therefore the following questions:
Is nginx sufficient to act as a secure middleman, passing the queries from the end-users to elastic?
What is the difference at that point with writing a backend into the react application (e.g. using node and express)?
What is the added value taking into account the built-in security from elasticsearch (usernames, password, apikey, certificates, https,...)?
I am reading a lot about using a VPN or tunneling. I have the impression that these solutions are more geared towards a corporate-collaborative approach. Let's say I am running my front-end on a live server, I can use tunneling to show my work to colleagues, my employer. VPN would be more realistic for allowing employees -wish I had them, just a cs student here- to access e.g. the database within my private network (let's say an employee needs to access kibana to adapt something, let's say an API-key - just making something up here), he/she could use a VPN connection for that.
Thank you so much for helping me clarify the above-mentioned points!
TLS, authorisation and access control are free for the Elastic Stack, and have been for a while. I'd start by looking at the docs, as it's an easy way to natively secure your cluster.
As for nginx, it can be useful for rate limiting or for blocking specific queries, for example. However, it's another thing to configure and maintain.
From a client point of view, it would really only matter if you are using the official Elasticsearch clients and you use nginx to change the way the API responds to the client (e.g. path rewrites, rate limiting).
The built-in security is free, it's native, and it's easy to manage via Kibana.
I'd follow the docs to secure Elasticsearch and then see whether you need a VPN or tunnel at some point in the future. That would be handled outside Elasticsearch anyway, and you'd still want to secure Elasticsearch.
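To illustrate what the built-in security looks like from the client side, here is a hedged sketch of a search request authenticated with an API key. It assumes an API key has already been created, that HTTPS is enabled on the node, and that the key is sent as `Authorization: ApiKey <base64 of id:api_key>`. The helper name, index name and URL are hypothetical, and the request is only built, not sent.

```python
import base64
import urllib.request


def build_search_request(base_url, index, key_id, key_secret, body):
    """Build (but do not send) an API-key-authenticated _search request."""
    # The header value is the base64 encoding of "<id>:<api_key>".
    token = base64.b64encode(f"{key_id}:{key_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"ApiKey {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A key like this should live in a backend (for example the Node/Express layer mentioned in the question), not in the React bundle, where any visitor could read it.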
Exposing Elasticsearch nodes directly to the internet makes your system more vulnerable in principle. You should follow the rule of keeping the least possible "surface" of your system on the internet.
A good practice is to hide from the internet whatever doesn't need to be there, even if it is well protected. It takes about 20 minutes for cyber attacks to reach any exposed service (see a showcase).
So I suggest you set up a private network, such as a traditional VPN or an SDP product such as Shieldoo Mesh.

Socket.io nodejs cluster hash loadbalancing with proxy

I built an app where people connect to rooms through WebSocket based on the room's ID in the URL, such as app.com/9l4CvjXFxn. I want to run multiple Node instances so that one is always up in case others crash, and also because I heard load balancing is good. I also have static UI content to serve. I am using Socket.IO only, no REST API.
My plan now is to use a load-balancing and serving proxy such as nginx or HAProxy. I have never used either of them. I have also thought about using PM2 to easily run many Node instances. The app will probably be deployed on AWS.
WebSocket connections are established through an HTTP/1.1 upgrade to a certain path in Socket.IO, which I have set to the root /. So I can't change the path in the initial upgrade request, but I can put things into the URL query or cookie headers.
So the main requirement for the load balancer is to consistently direct WebSocket upgrades for a given room to one specific Node instance. Hence I thought of hash load balancing, but I have no idea how to do this or whether it is the correct approach at all.
Could you help?
Thanks!
I couldn't solve it, and it might be that it's the "wrong" problem to have. It arises because I don't have persistence and I keep my objects in Node. If my Node instances were stateless and used Redis, I could spin up many of them without having this problem. That is the next step, though, since it's still an early app. Check it out, though: quarantime.io :)
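For reference, the hash-based routing described in the question boils down to mapping a room ID deterministically onto one backend. A minimal sketch of that mapping follows, assuming the room ID is available to the balancer (for example as a query parameter) and using hypothetical backend addresses. It uses plain modulo hashing, not consistent hashing, so changing the number of backends remaps most rooms.

```python
import hashlib

# Hypothetical backend addresses, one per Node instance.
BACKENDS = ["127.0.0.1:3001", "127.0.0.1:3002", "127.0.0.1:3003"]


def backend_for_room(room_id, backends=BACKENDS):
    """Map a room ID to one backend; the same ID always yields the same backend."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it modulo
    # the number of backends.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

A built-in hash function is avoided here because Python randomizes string hashes per process, which would break the "same room, same backend" guarantee across restarts.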

Elasticsearch: Several independent nodes in the same machine

Our current software solution uses a local ES installation (1 cluster and 1 node) to store documents so that the user can search them later. Documents are not ingested continuously but, let's say, once a month using bulk requests. The document set isn't huge and the documents are small. This solution has been working correctly on normal laptop PCs (i5 with 8 GB RAM), since the use case does not require high performance.
Now we're facing 2 new requirements for our software solution:
Should be branded for other customers
The same final user (using the same machine) should be able to work with several instances of our solution (from different customers)
With these 2 new requirements, the current solution cannot be used, because all documents would be indexed in the same node using the same index. Subsequent searches would show documents from different customers.
A first approach to solving this issue was to index documents by customer, that is, to create one index per customer and index/search documents in the corresponding index. However, we're thinking of another solution that allows us the following:
ES indexed information must be easily removed from the system (i.e. by removing the data folder)
Each customer may want to use a newer version of our solution (e.g. one which uses ES 7), whereas others will remain with older versions (e.g. ES 6)
Based on this, I think that the solution would be to have several ES installations on the same PC, each one with its customer dependent configuration:
Different cluster
Different node name and port
Different ES version
My questions then would be: has anyone faced a similar use case? Would there be performance issues from installing several ES instances and letting their services run continuously at the same time? What problems could arise from this configuration?
Any help would be appreciated.
UPDATE
Based on the answer received and for possible future answers, I would like to clarify a bit more about the architecture of our solution + ES:
Our solution is a desktop application executed on normal laptop PCs
Single user
Even if more than one customer specific solution is installed in the PC, only 1 will be active at a time
Searches will be executed sporadically when the user wants to search for a specific document (as if someone opens Wikipedia to search for an article)
So topics as ...
Infrastructure failure
Data replication
Performance at high search demand
... are not critical
You can run multiple installations of ES on the same machine in production, but it has a lot of disadvantages.
Ideally, you should have at least 1 replica of each shard, and it should be located on another physical machine (node) so that the cluster can recover in case of infrastructure failure. This improves the resiliency of your system.
In production, it's common to come across a use case where a single shard is not enough and you need to break your index into multiple primary shards to make it horizontally scalable. But if you use just 1 physical server, having multiple shards will not help you.
Having multiple installations also doesn't help when one installation gets a lot of traffic and consumes all the physical resources (RAM, CPU, disk), bringing down all the other installations in production as well. It also becomes difficult to isolate the root cause and quickly fix the issue, as an ES installation is not stateless: you cannot just start the same installation on another machine without moving all its data and configuration.
Basically, yours is a truly tenant-based SaaS application, and looking at your requirements, you should design your system considering the points below:
Upgrading the ES version is sometimes not very straightforward and involves a lot of breaking changes in your application code as well, so just having a cluster running the latest version will not solve the problem. Hence your application should expose a tenant (your customer) registration API which also takes the version of ES the customer wants to use, and your code handles that accordingly.
"ES indexed information must be easily removed from the system": I didn't get what the issue is here. You can simply delete it using the ES API, which is the recommended way of doing that, instead of doing it manually.
I hope my answer is clear; let me know if I missed any of your requirements or if you need further clarification.
Based on the update to the question, I am adding the points below:
As the OP mentioned, it's a very small desktop application and not a server-side application, so it's very important not to mix and store the content of every customer together. Anybody can install an ES web admin tool like https://github.com/lmenezes/cerebro and read the data of other customers.
The best solution in your case is to have a single installation of ES, in the version specified by the customer, with just 1 index pertaining to the customer running the desktop application. You can then easily use the delete API as I mentioned earlier.
There is no need to have multiple installations at all. Even though they won't all be active, they still consume local disk space (which is even more important for a desktop app) and can cause this and this issue. It's also not a clean design to store unnecessary information in a desktop app, and it causes a security issue, which is a much bigger concern in general.
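The "one index per customer, removed through the API" suggestion can be sketched as follows. This is only an illustration under assumptions: the `docs-<customer>` naming scheme, the helper names and the local URL are hypothetical. The request is built but not sent, and it relies on the delete index call `DELETE /<index>`.

```python
import re
import urllib.request


def index_name_for(customer):
    """Derive a lowercase index name from a customer label (naming scheme is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", customer.lower()).strip("-")
    return f"docs-{slug}"


def build_delete_index_request(base_url, customer):
    """Build (but do not send) a DELETE request that removes the customer's index."""
    return urllib.request.Request(
        url=f"{base_url}/{index_name_for(customer)}",
        method="DELETE",
    )
```

Passing the returned request to `urllib.request.urlopen` against the local node would remove all of that customer's indexed documents in one call, which covers the "easily removed" requirement without touching the data folder.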

Can a person add CORS headers using the ELB Application Load Balancer (sitting in front of Solr)?

We have a number of EC2 instances running Solr, which we've used in the past through another application. We would like to move towards allowing users (via web browser) to access Solr directly.
Without something "in front" of Solr this is a security risk, so we have opted to try using ELB (specifically the Application Load Balancer) as a simple and maintenance-free way of preventing certain requests from hitting Solr (i.e. preventing the public from deleting or otherwise modifying the documents in Solr).
This worked great, but we realize that we need to deal with the CORS issue. In other words, we need to add the appropriate headers to the responses to requests that come in from a browser. I have not yet seen a way of doing this with the Application Load Balancer, but I am wondering whether it is possible in some way. If it is not possible, I would love, as an additional recommendation, the easiest and least complicated way of adding these headers. We really, really, really hate to add something like nginx in front of Solr because then we've got additional redundancy to deal with, more servers, etc.
Thank you!
There is not much I can find on CORS for ALB either, and I remember that when I used Beanstalk with ELB I had to add CORS support directly in my Java application.
Having said that, I can find a lot of articles on how to set up CORS for Solr.
Can it be an option for you?

SOLR issue - too many search queries

We have a PHP web application which uses SOLR for searching. The app uses cURL to connect to the SOLR server and runs in a loop over thousands of predefined keywords. That creates thousands of different search queries to SOLR at a given time.
My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server:
Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed
Our assumption is that the SOLR server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in SOLR?
Any help would be highly appreciated.
Thanks,
Arun S
What about OR'ing a subset of the keywords and doing just one query?
Then, if nothing was found, try with the next subset of keywords?
For this particular case I think you need to improve your application performance, not SOLR's, even if you need to do some trickery.
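The batching idea above can be sketched like this: split the keyword list into subsets and build one OR query per subset, so that thousands of requests become a few dozen. This is a minimal sketch with assumptions: the batch size is arbitrary, each keyword is quoted as a phrase, and the resulting strings are meant to be sent as the `q` parameter by whatever HTTP client the application already uses.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def or_query(keywords):
    """Join keywords into one OR query, quoting each keyword as a phrase."""
    quoted = [
        '"' + kw.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for kw in keywords
    ]
    return " OR ".join(quoted)


def batched_queries(keywords, batch_size=50):
    """Return one OR query string per batch of keywords."""
    return [or_query(batch) for batch in chunked(keywords, batch_size)]
```

For example, `batched_queries(["red shoes", "blue hat", "green bag"], batch_size=2)` produces two query strings instead of three separate requests; the caller can run them in order and stop as soon as one returns results.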
Is there any Maximum connection limit in SOLR?
This is heavily dependent on the hardware and operating system you are using, and probably which servlet container you use.
If you need more performance out of SOLR, you may need to tweak your schema (which you could post here for more help), scale vertically (RAM directories or SSDs) or consider a master-slave setup.
