Timeout warning during alias change process - Elasticsearch

Our system has two indexes (e.g. data1 and data2) and one alias (e.g. data).
The alias data is switched between index data1 and index data2 every 10 minutes.
The index that does not currently hold the alias is then rebuilt by another program: delete → put mapping → load data → wait for the next alias switch.
Alias change Java code (switching to data2):
IndicesAliasesRequestBuilder aliasRequest = ElasticUtils.getTransportClientInstance()
        .admin().indices().prepareAliases()
        .addAlias("data2", "data")
        .removeAlias("data1", "data");
IndicesAliasesResponse aliasResponse = aliasRequest.execute().actionGet();
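For context, here is a rough sketch of the rebuild cycle described above. The index name, type, mapping, and document are placeholders (not our actual ones), and it assumes ElasticUtils.getTransportClientInstance() returns an org.elasticsearch.client.Client, using the ES 2.x TransportClient API:
Client client = ElasticUtils.getTransportClientInstance();
// delete the index that currently does NOT hold the alias
client.admin().indices().prepareDelete("data2").execute().actionGet();
// recreate it with a (placeholder) mapping
client.admin().indices().prepareCreate("data2")
        .addMapping("doc", "{\"doc\":{\"properties\":{\"field1\":{\"type\":\"string\"}}}}")
        .execute().actionGet();
// load a (placeholder) document
client.prepareIndex("data2", "doc", "1")
        .setSource("{\"field1\":\"value1\"}")
        .execute().actionGet();
// ... then wait for the alias switch shown above ...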
My question:
Normally the alias update takes less than one second, but occasionally (roughly once a week?) a timeout warning appears. ES server log:
[2016-06-13 09:12:33,333][WARN ][cluster.service ] [Node#2] cluster state update task [index-aliases] took 40.1s above the warn threshold of 30s
Does anyone know why this problem happens occasionally? Thanks.
PS: we are using Elasticsearch 2.2.

Related

Spring Scheduler not working in Google Cloud Run with CPU throttling off

Hello all, I have a Spring Scheduler job that has to run on Google Cloud Run at a scheduled time interval.
It works perfectly fine with a local docker-compose deployment; it gets triggered without any issue.
However, in the Google Cloud Run service, even with CPU throttling off (which keeps the CPU allocated 100% of the time), it stops working after the first run.
I will paste the Dockerfile here for reference, but I am pretty sure it is working fine.
FROM maven:3-jdk-11-slim AS build-env
# Set the working directory to /app
WORKDIR /app
COPY pom.xml ./
COPY src ./src
COPY css-common ./css-common
RUN echo $(ls -1 css-common/src/main/resources)
# Build and create the common jar
RUN cd css-common && mvn clean install
# Build the job
RUN mvn package -DskipTests
# It's important to use OpenJDK 8u191 or above that has container support enabled.
# https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds
FROM openjdk:11-jre-slim
# Copy the jar to the production image from the builder stage.
COPY --from=build-env /app/target/css-notification-job-*.jar /app.jar
# Run the web service on container startup
CMD ["java","-Djava.security.egd=file:/dev/./urandom","-jar","/app.jar"]
And below is the terraform script used for the deployment
resource "google_cloud_run_service" "job-staging" {
name = var.cloud_run_job_name
project = var.project
location = var.region
template {
spec {
containers {
image = "${var.docker_registry}/${var.project}/${var.cloud_run_job_name}:${var.docker_tag_notification_job}"
env {
name = "DB_HOST"
value = var.host
}
env {
name = "DB_PORT"
value = 3306
}
}
}
metadata {
annotations = {
"autoscaling.knative.dev/maxScale" = "4"
"run.googleapis.com/vpc-access-egress" = "all-traffic"
"run.googleapis.com/cpu-throttling" = false
}
}
}
timeouts {
update = "3m"
}
}
Something I noticed in the logs:
2022-01-04T00:19:39.178057Z 2022-01-04 00:19:39.177 INFO 1 --- [ionShutdownHook] j.LocalContainerEntityManagerFactoryBean : Closing JPA EntityManagerFactory for persistence unit 'default'
2022-01-04T00:19:39.182017Z 2022-01-04 00:19:39.181 INFO 1 --- [ionShutdownHook] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown initiated...
2022-01-04T00:19:39.194117Z 2022-01-04 00:19:39.193 INFO 1 --- [ionShutdownHook] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown completed.
It is shutting down the entity manager. I provided -Xmx1024m heap memory to make sure it has enough memory.
Although the Google documentation mentions that this should work, for some reason the scheduler is not getting triggered. Any help would be really nice.
TL;DR: Using Spring Scheduler on Cloud Run is a bad idea. Prefer Cloud Scheduler instead.
In fact, you have to understand the lifecycle of a Cloud Run instance. First of all, CPU is allocated to the process ONLY while a request is being processed.
The immediate effect is that a background process, like a scheduler, can't work, because no CPU is allocated outside of request processing.
Except if you set CPU throttling to off. You did that? Great, but there are other caveats!
An instance is created when a request comes in, and lives for up to 15 minutes without processing any request. Then the instance is offloaded and you scale to zero.
Here again, the scheduler can't work if the instance is shut down. The solution is to set min instances to 1 AND CPU throttling to false, to keep one instance up 100% of the time and let the scheduler do its job.
The final issue with Cloud Run is scalability. You set 4 in your Terraform, which means you can have up to 4 instances in parallel, and therefore 4 schedulers running in parallel, one on each instance. Is that really what you want? If not, set max instances to 1 to limit the number of parallel instances.
In the end, you have one instance, up full time, that can't scale up or down. Because it can't scale, I don't recommend performing the processing on this instance; instead, call another API running on another Cloud Run service, which will be able to scale up and down according to the scheduler's requirements.
And so, you will have only one scheduler that performs API calls to other Cloud Run services to do the work. That's exactly the purpose of Cloud Scheduler.
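To make that concrete, here is a minimal, hypothetical sketch (the class name and path are made up, not taken from the question) of exposing the job as a plain HTTP endpoint that Cloud Scheduler invokes on a schedule, instead of an in-process @Scheduled method:
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class NotificationJobController {

    // Cloud Scheduler calls this endpoint on the desired cron schedule via HTTP POST,
    // so the work only runs while a request is being processed - exactly when
    // Cloud Run guarantees CPU to the container.
    @PostMapping("/tasks/run-notification-job")
    public ResponseEntity<String> runJob() {
        // ... the logic previously inside the @Scheduled method goes here ...
        return ResponseEntity.ok("job executed");
    }
}
With this setup, the Cloud Run service no longer needs CPU throttling off or a minimum instance, because the work is driven entirely by incoming requests.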

How to change the spark.r.backendConnectionTimeout value in RStudio?

I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. From Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is set to 6000s. I would like to change this value to something large enough that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
A similar post is around, but that involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: it is to modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT note: restart the Hadoop and YARN services, then try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the setting took effect at http://localhost:4040/environment/
I hope this is useful for other people.

How to change MonetDB log settings (merovingian.log)

How do I change the MonetDB log settings?
I want to change the log level of merovingian.log.
I want ERROR and WARN messages to be written to merovingian.log, but currently merovingian.log only outputs MSG log lines.
2016-07-22 18:12:03 MSG merovingian[7825]: proxying client x.x.x.x:51609 for database 'test' to mapi:monetdb:///var/MonetDB/dbfarm/test/.mapi.sock?database=test
2016-07-22 18:12:03 MSG merovingian[7825]: target connection is on local UNIX domain socket, passing on filedescriptor instead of proxying
OS is "CentOS 6.4",
MonetDB version is "MonetDB-11.19.7".
Any advice how to solve this problem?
No fine-tuning is provided for where log information ends up.

Informatica error 1417 :: Task not yet registered with this service process

I am getting the following error while running a workflow in Informatica.
Session task instance [worklet.session] : [TM_6775 The master DTM process was unable to connect to the master service process to update the session status with the following message: error message [ERROR: The session run for [Session task instance [worklet.session]] and [ folder id = 206, workflow id = 16042, workflow run id = 65095209, worklet run id = 65095337, task instance id = 13272 ] is not yet registered with this service process.] and error code [1417].]
This error occurs randomly for many other sessions when they are run through the workflow as a whole. However, if I "start task" on the failed task the next time, it runs successfully.
Any help is much appreciated.
Just an idea to try if you use versioning: check that everything is checked in correctly. If the mapping, workflow, or worklet is checked out, then you and Informatica will run different versions, which may cause the behaviour to differ when you start it manually.
Informatica will always use the checked-in version, while you will always use the checked-out version.

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

I wrote a MapReduce job to extract some info from a dataset. The dataset is users' ratings of movies. The number of users is about 250K and the number of movies is about 300K. The output of the map is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I will process these pairs.
When I run the job, the mappers complete as expected, but the reducers always complain that
Task attempt_* failed to report status for 600 seconds.
I know this is due to a failure to update the status, so I added a call to context.progress() in my code like this:
int count = 0;
while (values.hasNext()) {
    if (count++ % 100 == 0) {
        context.progress();
    }
    /* other code here */
}
Unfortunately, this does not help; many reduce tasks still fail.
Here is the log:
Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!
By the way, the error happens in the reduce copy phase; the log says:
reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385
Thanks for the help.
The easiest way is to set this configuration parameter:
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes -->
</property>
in mapred-site.xml
Another easy way is to set it in your job Configuration inside the program:
Configuration conf = new Configuration();
long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000 (10 minutes), but you can use any value
conf.setLong("mapred.task.timeout", milliSeconds);
Before setting it, please check the job file (job.xml) in the JobTracker GUI for the correct property name, i.e. whether it is mapred.task.timeout or mapreduce.task.timeout.
While the job is running, check the job file again to verify that the property has been changed to the value you set.
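For example, here is a minimal sketch (the class and job names are placeholders) of setting the timeout programmatically and then reading it back from the job configuration to confirm it was applied:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 30 minutes; check job.xml to see whether your version uses
        // mapred.task.timeout or mapreduce.task.timeout
        conf.setLong("mapred.task.timeout", 30L * 60 * 1000);

        Job job = new Job(conf, "movie-rating-job");
        // ... set mapper, reducer, input and output paths here ...

        // read the value back from the job configuration to confirm it was applied
        System.out.println("mapred.task.timeout = " + job.getConfiguration().get("mapred.task.timeout"));
    }
}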
In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:
The number of milliseconds before a task will be terminated if it
neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.
Below is an example setting in the mapred-site.xml:
<property>
  <name>mapreduce.task.timeout</name>
  <value>0</value> <!-- A value of 0 disables the timeout -->
</property>
If you have a Hive query that is timing out, you can set the above configurations in the following way:
set mapred.tasktracker.expiry.interval=1800000;
set mapred.task.timeout= 1800000;
From https://issues.apache.org/jira/browse/HADOOP-1763
the causes might be:
1. The TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TTs.
3. All the reduce tasks connect to all the TTs where maps were run.
4. Since there are lots of reducers wanting to connect to the map output servers, the Jetty servers run out of threads (default 40).
5. The TaskTrackers continue to make periodic heartbeats to the JT, so they are not considered dead, but their Jetty servers are (temporarily) down.
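If that is the cause on your cluster, one possible mitigation (a sketch; the value of 80 is just an example to tune for your cluster) is to raise the number of HTTP worker threads on the TaskTrackers in mapred-site.xml, above the default of 40:
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value> <!-- default is 40 -->
</property>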
