Why does SolrCloud slow down when one instance is killed?

I am using SolrCloud (version 4.7.1) with 4 instances and embedded ZooKeeper (test environment).
When I simulate the failure of one of the instances, indexing time goes from 4 seconds to 17 seconds.
It goes back to 4 seconds after the instance is brought back to life.
Search speed is not affected.
Our production environment shows similar behavior (only the configuration is more complex).
Is this normal or did I miss some configuration option?

This is due to running ZooKeeper embedded in the Solr cluster. Please try an external ZooKeeper ensemble; that setup gives the expected results.
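A rough sketch of the switch (hostnames are placeholders; assuming a standalone ZooKeeper install and the start.jar launcher that Solr 4.x shipped with):

# on each ZooKeeper host, start a standalone ZooKeeper
bin/zkServer.sh start

# start each Solr instance against the external ensemble
# instead of the embedded one (-DzkRun)
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar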

Related

Solr 8 Performance issue on restart

I am trying to figure out why a Solr core doesn't respond upon a restart of the Solr daemon. I have multiple cores, and the configuration is a leader/follower approach, with each core serving certain business needs.
When I restart Solr on the server, the cores that have <100K documents show up immediately when they are queried.
But there are 2 specific cores, with around 2 to 3M documents, that take around 2 minutes to become available for querying.
I know about warmup / firstSearcher queries, etc., but those queries are commented out, so it should not be running the firstSearcher queries.
I noticed that when I set this to "true" (the default value is false):
<useColdSearcher>true</useColdSearcher>
the core that has 2M+ documents shows up immediately on a restart of Solr.
This never happened in the Solr 6.6 world. Is this something new in Solr 8.x?
Can someone who has experienced this shed some light on it?
In Solr 6.x we had the defaults and the cores were available right away, but the same settings in Solr 8.11 don't make the core available after a restart.
thanks in advance
B
Since I did not get an answer, I tried the following experiments.
I changed useColdSearcher to true and restarted the core; the core came up right away and started serving requests.
I also ran a load test with useColdSearcher=true and did not see much of a difference; I tried the load test with both true and false.
The default for useColdSearcher in solrconfig.xml is false, so the same index with a similar configuration started the searcher quickly in Solr 6 but not in Solr 8, until I made the above change.
I experimented with questions on ChatGPT as well. Its response is below.
The "useColdSearcher" setting in Solr can potentially slow down the process of registering a new searcher in Solr 8.x, but it shouldn't have any effect on Solr 6.x.
It's important to note that useColdSearcher is only available for SolrCloud mode and not for standalone mode.
This setting is not available in Solr 6.x, so it wouldn't have any impact on the registration of new searchers in that version.
Since my setup is leader/follower, I guess I should be good setting useColdSearcher to true.
One should try the above tests before deciding on a course of action, but it worked for me, so I wanted to post the answer.
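For reference, a minimal sketch of where the setting lives in solrconfig.xml (it sits in the <query> section in the stock config; the comments are mine):

<query>
  <!-- serve requests from a not-yet-warmed searcher instead of blocking -->
  <useColdSearcher>true</useColdSearcher>
  <!-- firstSearcher/newSearcher warmup listeners would sit alongside it -->
</query>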

SparkException: Master removed our application

I know there are other very similar questions on Stack Overflow, but those either didn't get answered or didn't help me out. In contrast to those questions, I put much more stack trace and log file information into this question. I hope that helps, although it makes the question rather long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successfully as long as they finish within ~3 minutes.
I would like to analyze much larger datasets, so I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesting parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executor's stderr is empty. However, I think it shows nothing useful for tackling the issue.
On the Spark Master UI I verified that all worker nodes are ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance, while the executors on the two worker nodes get respawned until the whole application is removed. Is that okay, or does it indicate some issue? I think it might be related to the "failed 10 times" error message from above.
Worker logs
Furthermore, I can show you the logs of the two Spark worker nodes. I removed most of the classpath arguments to shorten the logs; let me know if you need to see them. As each worker node spawns multiple executors, I attached links to some (not all) of the executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permissions and/or timeouts, but I can't figure out any details from them.
Attempts
As mentioned above, there are some similar questions, but none of them got answered or helped me solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changed.
Increased spark.akka.askTimeout, which makes the Spark/Shark app live longer, but eventually it still gets removed (see the sketch after this list).
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
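For reference, a sketch of one way to raise that timeout (the value of 300 seconds is illustrative; in Spark 1.1 the property can go into conf/spark-defaults.conf, or be passed per invocation, assuming dse spark forwards --conf to the shell like stock Spark does):

# conf/spark-defaults.conf
spark.akka.askTimeout   300

# or per invocation
dse spark --conf spark.akka.askTimeout=300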
Summary
To sum up, one could say that the timeouts, and the fact that long-running queries execute successfully in local mode, all indicate some misconfiguration. But I cannot be sure, and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node and one or more workers.
Another thing to try: verify that all workers are reachable by hostname from the master, either via DNS or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, then adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, communication from the master to the worker by hostname failed, and the master killed the job.
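A quick sketch of the check, run from the master (the worker hostname and address are placeholders):

# does the master resolve and reach each worker by name?
getent hosts worker-1
ping -c 1 worker-1

# without DNS, add one entry per worker to /etc/hosts
echo "172.31.46.48  worker-1" >> /etc/hosts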

Postgresql slow - how best to begin tuning?

My PostgreSQL server seems to be going down intermittently. I have a PgBouncer pool in front of it, so the website hits are well managed, or were until recently.
When I explore what's going on with the top command, I see the postmaster doing some CLUSTER. There's no CLUSTER command in any of my cron jobs, though. Is this what autovacuum is called these days?
How can I start to find out what's happening? What commands are the usual tricks in a PostgreSQL DBA's toolbox? I'm a bit new to this database and am only looking for starting points.
Thank you!
No, autovacuum never runs CLUSTER. You have something on your system that's doing so - a daemon, a cron job, or whatever. Check individual user crontabs.
CLUSTER takes an exclusive lock on the table. That's probably why you think the system is "going down" - all queries that access the table will wait for the CLUSTER to complete.
The other common cause of intermittent issues people report is checkpoints taking a long time on slow disks. You can enable checkpoint logging to see if that's an issue. There's lots of existing info on dealing with checkpointing performance issues, so I won't repeat it here.
The other key tools are (see the sketch after this list):
The pg_stat_activity and pg_locks views
pg_stat_statements
The PostgreSQL logs, with a useful log_line_prefix, log_checkpoints enabled, a log_min_duration_statement set, etc.
auto_explain
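As a starting sketch (pg_stat_activity column names per PostgreSQL 9.2+; log_checkpoints = on and log_min_duration_statement = 1000 would go into postgresql.conf before the reload):

# who is running the CLUSTER, and since when?
psql -c "SELECT pid, usename, query_start, query FROM pg_stat_activity WHERE query ILIKE '%cluster%';"

# which lock requests are currently waiting?
psql -c "SELECT locktype, relation::regclass, pid, granted FROM pg_locks WHERE NOT granted;"

# pick up the postgresql.conf logging changes without a restart
psql -c "SELECT pg_reload_conf();"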

H2 Database Cluster Recovery

I have a Spring MVC application which runs on Apache Tomcat and uses an H2 database.
The infrastructure contains two application servers (let's name them A and B), each running its own Tomcat server. I also have H2 database clustering in place.
On one system (A) I ran the following command:
java org.h2.tools.Server -tcp -tcpPort 9101 -tcpAllowOthers -baseDir server1
On the other (B) I ran
java org.h2.tools.Server -tcp -tcpPort 9101 -tcpAllowOthers -baseDir server2
Then I created the cluster from machine A:
java org.h2.tools.CreateCluster
-urlSource jdbc:h2:tcp://IpAddrOfA:9101/~/test
-urlTarget jdbc:h2:tcp://IpAddrOfB:9101/~/test
-user sa
-serverList IpAddrOfA:9101,IpAddrOfB:9101
It has been mentioned that when any one of the servers goes down, one has to delete the database that failed, restart that server, and rerun CreateCluster.
I have the following questions:
1. If both servers are down, how can I ascertain which database to delete so that I can restart that server and rerun the cluster?
2. CreateCluster contains a urlSource and a urlTarget. Do I need to give them the same values as previously, or can I interchange them without any side effect?
3. Do I need to run the CreateCluster command from both machines? If so, do I need to interchange the urlSource and urlTarget?
4. Is there a way to know whether both, one, or none of the servers are running? I want both IP addresses returned if both are up, one IP address if only one is up, and none if all are down.
If both servers are down, how can I ascertain which database to delete?
The idea of the cluster is that the second database adds redundancy to the system. Let's assume a server fails once every 100 days (hard disk failure, power failure, or so). That is 99% availability, which might not be good enough for you; that's why you may want to use a cluster with two servers. Even if each server fails once every 100 days, the chance of both failing at the same time is very, very low. Ideally, the risks of failure are completely independent, which would mean the risk of both failing on the exact same day is 1 in 10,000 (100 times 100), giving you 99.99% availability. So the situation where both servers are down is exactly what the cluster feature should prevent.
CreateCluster contains a urlSource and a urlTarget. Do I need to give them the same values as previously?
It depends on which one you want to use as the source and which one as the target. The database at the source URL is copied to the target; the source is the database whose contents you want to keep and copy over.
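For example, a sketch of the recovery run if server B failed and A survived (same placeholder addresses as in the question; B's database files would be deleted first, so the intact database on A is the urlSource):

java org.h2.tools.CreateCluster
 -urlSource jdbc:h2:tcp://IpAddrOfA:9101/~/test
 -urlTarget jdbc:h2:tcp://IpAddrOfB:9101/~/test
 -user sa
 -serverList IpAddrOfA:9101,IpAddrOfB:9101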
Do I need to run the CreateCluster command from both machines?
No.
Is there a way to know whether both, one, or none of the servers are running?
You could try to open a TCP/IP connection to them to check whether the listener is running. What I usually do is run telnet <server> <port> on the command line.
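A minimal sketch of the same check in a scriptable form (assuming a netcat build that supports -z; the addresses are the placeholders from the question):

nc -z IpAddrOfA 9101 && echo "A is up" || echo "A is down"
nc -z IpAddrOfB 9101 && echo "B is up" || echo "B is down"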

liferay performance issue on linux

I have a Liferay 6 with Tomcat setup on two machines:
Machine 1:
Windows 2003 Server
2GB RAM, 2GHz CPU
Mysql Ver 14.14 Distrib 5.1.49
Liferay 6.0.6 with Tomcat 6
Machine 2:
Linux CentOS 5.5
4GB RAM, 2GHz CPU
Mysql Ver 14.14 Distrib 5.5.10
Liferay 6.0.6 with Tomcat 6
Both Liferay systems have identical startup parameters and MySQL configurations.
The Liferay system contains a custom theme and a servlet filter hook checking each URL access.
We have written a Grinder script to load test the system, starting with 50 concurrent users.
The test script does the following things:
Open home page
Login with username/password
Enter security key (custom portlet)
Move to a private community
Logout
On the Windows system the response time is as expected (nearly 40 seconds mean time for each test in Grinder).
However, on the Linux system the response time is far too high (nearly 4 minutes) for the same operations.
We tried revising the MySQL, Tomcat, connection pool, and a few other parameters, but the results were the same. We also tested Liferay against the other machine's MySQL (machine 1 Liferay -> machine 2 MySQL).
We are facing the same issue on Linux machines in our test environment and also at our client's end.
This looks like a duplicate question. I suspect your issue is related to memory / JVM configuration, and specifically garbage collection. High CPU utilization under small loads tends to point in that direction.
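As a sketch, GC logging can be switched on in Tomcat's bin/setenv.sh to confirm that suspicion (flag names per the HotSpot JVMs of the Java 6/7 era that Liferay 6 ran on; the heap sizes are illustrative):

# bin/setenv.sh
JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx1024m \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:$CATALINA_BASE/logs/gc.log"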
In your Grinder script, have you set each of the steps as a separate transaction? This will allow you to see how much time each step is taking. It might be useful to know if everything is slower across the board, or if it's just one transaction type that's slowing you down.
Also, is there anything in the Tomcat logs on the Linux box you aren't seeing on windows? Unexpected java stack traces, etc?
Finally, are the databases on each machine identical? Do they have the same amount of data? Do they have the same indexes?
Edit: Is it one transaction that's taking up all the extra time, or is each transaction slower? When you run top on your Linux box, is it the Tomcat Java process that's eating all your CPU, or some other process?
