SOLR issue - too many search queries - performance

We have a PHP web application which uses Solr for searching. The app uses cURL to connect to the Solr server and runs in a loop over thousands of predefined keywords, which creates thousands of different search queries to Solr at a given time.
My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server:
Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed
Our assumption is that the Solr server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in Solr?
Any help would be highly appreciated.
Thanks,
Arun S

What about OR'ing a subset of the keywords and doing just one query? Then, if nothing was found, trying the next subset of keywords?
For this particular case I think you need to improve your application's performance, not Solr's, even if you need to do some trickery.
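As a hedged sketch of that batching idea (shown in Python for brevity; the Solr URL, core name, and keyword list are assumptions), one request covers a whole subset of keywords, and reusing a single HTTP session keeps connections alive, which also avoids the ephemeral-port exhaustion typically behind "Cannot assign requested address":

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycore/select"  # assumed host and core
KEYWORDS = ["solr", "lucene", "search"]  # stand-ins for the real keyword list

def batched(seq, size):
    """Yield consecutive subsets of the keyword list."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

session = requests.Session()  # reuses TCP connections (HTTP keep-alive)

for batch in batched(KEYWORDS, 50):
    # One query for up to 50 keywords: "kw1" OR "kw2" OR ...
    q = " OR ".join('"%s"' % kw for kw in batch)
    resp = session.get(SOLR_SELECT, params={"q": q, "rows": 100, "wt": "json"})
    docs = resp.json()["response"]["docs"]
    if docs:
        break  # something was found; otherwise fall through to the next subset
```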
Is there any maximum connection limit in Solr?
This is heavily dependent on the hardware and operating system you are using, and probably on which servlet container you use.
If you need more performance out of Solr, you may need to tweak your schema (which you could post here for more help), scale vertically (RAM directories or SSDs), or consider a master-slave setup.

Related

How to remotely connect to a local elasticsearch server - in a secure way ofc

I have been playing around with creating a web app that uses Elasticsearch to perform queries. Currently everything is still in development and thus on localhost; let's say Elasticsearch runs at 123.123.123.123:9200. All fun and games, but once the web application (React) is finished, it should be able to send its queries to the currently local Elasticsearch DB.
I have been reading around on how to get this done in a proper and, most of all, secure way. The summary of it all is currently:
"First off, exposing an Elasticsearch node directly to the internet without protections in front of it is usually bad, bad news." (see here: Accessing elasticsearch from a public domain name or IP).
Another interesting blog I found: https://code972.com/blog/2017/01/dont-be-ransacked-securing-your-elasticsearch-cluster-properly-107.
The problem with the above-mentioned sources is that they are a bit older, and thus I am not sure whether they are up to date.
Therefore the following questions:
Is nginx sufficient to act as a secure middleman, passing the queries from the end-users to elastic?
What is the difference at that point with writing a backend into the react application (e.g. using node and express)?
What is the added value, taking into account the built-in security from Elasticsearch (usernames, passwords, API keys, certificates, HTTPS, ...)?
I am reading a lot about using a VPN or tunneling. I have the impression that these solutions are more geared towards a corporate, collaborative approach. Let's say I am running my front end on a live server: I could use tunneling to show my work to colleagues or my employer. A VPN would be more realistic for allowing employees (wish I had them, just a CS student here) to access e.g. the database within my private network; say an employee needs to access Kibana to adapt something, like an API key (just making something up here), then he/she could use a VPN connection for that.
Thank you so much for helping me clarify the above-mentioned points!
TLS, authorisation and access control are free for the Elastic Stack, and have been for a while. I'd start by looking at the docs, as it's an easy way to natively secure your cluster.
As for nginx, it can be useful for rate limiting or for blocking specific queries, for example; however, it's another thing to configure and maintain.
From a client point of view it would really only matter if you are using the official Elasticsearch clients, and you use nginx to make changes to the way the API responds to the client (e.g. path rewrites, rate limiting).
It's free, it's native, and it's easy to manage via Kibana.
I'd follow the docs to secure Elasticsearch and then see if you need this at some point in the future; that would be handled outside Elasticsearch anyway, and you'd still want to secure Elasticsearch itself.
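As an illustrative sketch of that native route (the host, credentials, and certificate path are made up, and the basic_auth keyword assumes the 8.x Python client; 7.x used http_auth):

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Connect over HTTPS to a cluster with security enabled
# (xpack.security.enabled: true in elasticsearch.yml).
es = Elasticsearch(
    "https://es.example.com:9200",            # assumed host
    basic_auth=("app_user", "app_password"),  # or api_key=...
    ca_certs="/path/to/http_ca.crt",          # CA that signed the node certificate
)
print(es.info())
```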
The point about not exposing Elasticsearch nodes directly to the internet is, in principle, about vulnerability: you should follow the rule of exposing the least possible "surface" of your system to the internet.
A good practice is to hide from the internet whatever doesn't need to be there, even if it is well protected. It takes roughly 20 minutes for cyber attacks to start hitting any newly exposed service (see a showcase).
So I suggest you set up a private network, such as a traditional VPN or an SDP product such as Shieldoo Mesh.

Is using Elastic Search as authoritative datastore for applications advisable?

I'm new to Elasticsearch, and I'm trying to find a datastore for our application to which we can also add a front end for analytics, in this case Kibana. I'm planning to use it as the datastore for dr/cr transactions in our billing system.
Most use cases I read about are geared towards data analytics and search. I don't see use cases where it serves as the regular datastore for an application, so I'm worried I might be applying it to the wrong use case.
I was hoping someone could add their insights on this, i.e. why or why not to use Elasticsearch as the authoritative/primary datastore for an application.
You should read the official Elasticsearch blog post on this topic, where they clearly state that databases must be robust and should not stop working unless you tell them to.
From the robustness section of the same blog post:
A database should be robust, especially if it is your authoritative system of record. Ideally, a costly query should be possible to cancel, and you certainly don't want the database to stop working unless you tell it to.
Unfortunately, Elasticsearch (and the components it's made of) does not currently handle OutOfMemory-errors very well. We cover this in more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes. It is very important to provide Elasticsearch with enough memory and be careful before running searches with unknown memory requirements on a production cluster.
In short, you shouldn't use Elasticsearch as a primary datastore when you can't afford to lose the data.
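To make that concrete, here is a minimal sketch of the usual compromise (table and index names are invented, and the document= keyword assumes the 8.x Python client): keep the authoritative dr/cr record in a relational database and mirror it into Elasticsearch for Kibana, so a lost index can always be rebuilt from the primary:

```python
import sqlite3
from elasticsearch import Elasticsearch

db = sqlite3.connect("billing.db")  # authoritative system of record
db.execute("CREATE TABLE IF NOT EXISTS txn (id INTEGER PRIMARY KEY, kind TEXT, amount REAL)")
es = Elasticsearch("http://localhost:9200")  # secondary, rebuildable search index

def record_transaction(kind, amount):
    # 1. Commit to the primary datastore first; this copy must never be lost.
    cur = db.execute("INSERT INTO txn (kind, amount) VALUES (?, ?)", (kind, amount))
    db.commit()
    # 2. Then index into ES for analytics; a failure here is recoverable,
    #    because the index can be rebuilt from the database at any time.
    es.index(index="billing-txn", id=cur.lastrowid,
             document={"kind": kind, "amount": amount})

record_transaction("dr", 120.50)
```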

Elasticsearch: Several independent nodes in the same machine

Our current software solution uses a local ES installation (1 cluster and 1 node) to store documents so that the user can search them later. Documents are not ingested continuously but, let's say, once a month using bulk requests. The document set isn't huge and the documents are small. This solution has been working correctly without problems on normal laptop PCs (i5 with 8 GB RAM), since the use case does not require high performance.
Now we're facing 2 new requirements for our software solution:
Should be branded for other customers
The same final user (using the same machine) should be able to work with several instances of our solution (from different customers)
With these 2 new requirements the current solution cannot be used, because all documents would be indexed in the same node using the same index, and searches would then show documents from different customers.
A first approach to solving this issue was to index documents based on customer, that is, to create one index per customer and index/search documents on the corresponding index. However, we're thinking of another solution that allows us the following:
ES indexed information must be easily removed from the system (i.e. by removing the data folder)
Each customer may want to use a newer version of our solution (i.e. one which uses ES 7) whereas others will remain on older versions (i.e. ES 6)
Based on this, I think that the solution would be to have several ES installations on the same PC, each one with its customer dependent configuration:
Different cluster
Different node name and port
Different ES version
My questions then would be: has anyone faced a similar use case? Would there be performance issues from installing several ES instances and letting their services run continuously at the same time? Which problems could arise from having this configuration?
Any help would be appreciated.
UPDATE
Based on the answer received and for possible future answers, I would like to clarify a bit more about the architecture of our solution + ES:
Our solution is a desktop application executed on normal laptop PCs
Single user
Even if more than one customer specific solution is installed in the PC, only 1 will be active at a time
Searches will be executed sporadically when the user wants to search for a specific document (as if someone opens Wikipedia to search for an article)
So topics such as ...
Infrastructure failure
Data replication
Performance at high search demand
... are not critical
You can run multiple installations of ES on the same machine in production, but it has a lot of disadvantages.
Ideally, you should have at least 1 replica of each shard, and it should be present on another physical machine (node) so that it can recover in case of infrastructure failure; this is done to improve the resiliency of your system.
In production, it's common to come across a use case where having a single shard is not enough and you need to break your index into multiple primary shards to make it horizontally scalable, but if you use just 1 physical server then having multiple shards will not help you.
Having multiple installations also doesn't help in the case where there is a lot of traffic on one installation and it consumes all the physical resources (RAM, CPU, disk), bringing all the other installations down in production as well. It also becomes difficult to isolate the root cause and quickly fix an issue, as an ES installation is not stateless and you cannot just start the same installation on another machine without moving all of its data and configuration.
Basically, yours is truly a tenant-based SaaS application, and looking at your requirements, you should design your system considering the points below:
Upgrading the ES version is sometimes not very straightforward and can involve a lot of breaking changes in your application code as well; having just one cluster running the latest version will not solve the problem. Hence your application should expose a tenant (your customer) registration API which also records which version of ES the customer wants to use, and your code handles that accordingly.
ES-indexed information must be easily removable from the system: I don't see the issue here; you can simply delete it using the ES API, which is the recommended way of doing that, instead of doing it manually. See the sketch below.
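For instance, with one index per customer (the index name here is just an assumption), removal is a single API call rather than touching the data folder; a sketch with the Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Remove everything indexed for one customer by deleting their index,
# instead of deleting files from the data folder by hand.
es.indices.delete(index="customer-acme", ignore_unavailable=True)
```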
I hope my answer is clear; let me know if I missed any of your requirements or if you need further clarification.
Based on the update to the question, I am adding the points below:
As the OP mentioned, this is a very small desktop application and not a server-side application, so it's very important not to mix and store the content of different customers together. Anybody could install an ES web admin tool like https://github.com/lmenezes/cerebro and read the data of other customers.
The best solution in your case is to have a single installation of ES, based on the version specified by the customer, with just 1 index pertaining to the customer running the desktop application. You can then easily use the delete API as I mentioned earlier.
There is no need to have multiple installations at all; even when they are not active, they still consume local disk space (which matters even more in the case of a desktop app) and can cause this and this issue. It is not at all a clean design to store unnecessary information in a desktop app, and it also causes security issues, which are a much bigger concern in general.

How to perform Solr performance testing? [closed]

I have been given the task of performance testing Solr. I am completely new to Solr and have no idea how to approach the testing.
The Solr instance we are using consumes a lot of RAM and CPU. Because of that, our application hangs and sends server error messages.
What would be the right way to test Solr, and is any tool required to create multiple concurrent threads?
According to the Solr Quick Start guide
Searching
Solr can be queried via REST clients, cURL, wget, Chrome POSTMAN, etc., as well as via the native clients available for many programming languages.
so you can use "usual" HTTP Request samplers to mimic multiple users concurrently using Solr.
References:
Building a Web Test Plan
Testing SOAP/REST Web Services Using JMeter
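If you would rather script the load than use JMeter's GUI, here is a minimal sketch of the same idea in Python (the URL, core name, and terms are placeholders for your setup):

```python
import concurrent.futures
import time
import requests

SOLR = "http://localhost:8983/solr/mycore/select"  # assumed host and core
TERMS = ["laptop", "phone", "camera", "tablet"]    # stand-ins for real queries

def one_request(term):
    """Fire a single query and time it, as one simulated user would."""
    start = time.perf_counter()
    r = requests.get(SOLR, params={"q": term, "wt": "json"})
    return term, r.status_code, time.perf_counter() - start

# Mimic 20 users querying Solr concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for term, status, elapsed in pool.map(one_request, TERMS * 5):
        print(f"{term}: HTTP {status} in {elapsed * 1000:.1f} ms")
```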
For search applications, the number of requests by itself usually isn't as important as the query profile. There's a lot of internal caching going on, and the only useful way to do decent performance testing is to use your actual query logs to replicate the query profile that represents your users. You'll also have to use the actual data that you have in your Solr server, so you get (at least close to) the same cardinality for fields and values.
This means using the same filters, the same kinds of queries, and the same kind of simultaneous load. Since you probably want to go above the load you see in production, consider replaying logs from several days as if they were a single day (and be sure to get weekends vs. weekdays in there; if you have a particularly heavy day, such as Black Friday for e-commerce, keep those logs available so you're able to replicate that profile).
There are (many) tools to do the HTTP requests to Solr, but be sure to use a query profile and sets of queries that actually represent how you're using Solr; otherwise you're just hitting the query cache every single time, or you have data that doesn't represent the actual data in your dataset, which will give you completely irrelevant response times (random data performs a lot worse than actual English text where tokens are repeated across documents).
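A hedged sketch of that log-replay idea (the log file name and the way the q= parameter is extracted are assumptions about your log format):

```python
import random
import re
from urllib.parse import unquote_plus

import requests

SOLR = "http://localhost:8983/solr/mycore/select"  # assumed host and core

# Pull the q= parameter out of real Solr request logs so the test replays
# the same terms and filters, and hence the same cache behaviour, as production.
queries = []
with open("solr_requests.log") as log:  # assumed log file
    for line in log:
        m = re.search(r"q=([^&\s]+)", line)
        if m:
            queries.append(unquote_plus(m.group(1)))

random.shuffle(queries)  # several days of traffic replayed as one burst
for q in queries:
    requests.get(SOLR, params={"q": q, "wt": "json"})
```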
You can use SolrMeter to do the performance testing; see the SolrMeter wiki.
The most important thing is not how to test the queries, but putting together the scenarios that you want to test, mimicking the real application usage you see.
Initially you need to decide what you want to find out with your test. Do you want to find bottlenecks? Do you want to find out if your current setup can meet business requirements? Do you need to find the breaking points of your current architecture and setup?
Solr utilizing a lot of CPU is very often related to indexing, and high memory usage might be related to segment merging, so it sounds like you need to define your scenarios:
how much content should you push to Solr for it to perform indexing?
how many queries do you need to send?
what are the features of the queries (facets, highlighting, spellcheck etc)?
You could use JMeter to test the throughput of your search application, and you could also check IO, load, CPU usage and RAM usage on each Solr instance.

using elasticsearch replications in django

The current setup is composed of 3 Elasticsearch servers, of which one is the master and the other 2 are slaves, or at least that is how they define themselves.
It might happen that the master goes down for any kind of problem; this means Elasticsearch is going to elect a new eligible master and switch to it.
Currently the problem is that my application on the frontend servers is completely unaware of this, so it will keep making queries to the same backend, which of course kills my whole website because that backend will not answer. I had a look around, but I was not able to find anything related to switching backends on the fly, not even in the new Haystack 2.x.
Any suggestion?
Many thanks in advance
It doesn't seem necessary to me to handle this in your application layer. Most probably you access ES through HTTP REST requests, which means you can put any HTTP load balancer like nginx in front of your ES servers; it should then switch to another node if one times out.
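An alternative to a load balancer is client-side failover: the official elasticsearch-py client accepts several hosts and retries on another node when one fails (hostnames below are made up, and whether Haystack exposes these options depends on its backend):

```python
from elasticsearch import Elasticsearch

# The client round-robins across the listed nodes and retries on another
# node when a connection fails, so a dead master doesn't take the app down.
es = Elasticsearch(
    ["http://es1.internal:9200",
     "http://es2.internal:9200",
     "http://es3.internal:9200"],
    retry_on_timeout=True,
    max_retries=3,
)
print(es.cluster.health())
```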
