Turn off time series data in a CockroachDB single-node cluster

Is there a way to turn off the recording of time series data in a single-node cluster via some start-up flags? I'm using CockroachDB version 20.1.2.
It appears this setting will turn it off:
timeseries.storage.enabled=false
or you can set these TTLs to 0s:
timeseries.storage.resolution_10s.ttl=0s
timeseries.storage.resolution_30m.ttl=0s
I'm running CockroachDB in a Docker container and would like to set these properties when cockroach starts, but I get errors when I try to pass them as flags. For example, this fails with an "invalid flag" error:
start-single-node --timeseries.storage.enabled=false --insecure
Is there a way to turn off timeseries data storage on startup without running queries to change cluster settings?

There is no other way.
Building and storing timeseries is a cluster-wide job and is controlled through cluster settings.
This is distinct from node-level features that can be controlled through command line flags.
You can find some details on this on the cluster settings page:
In contrast to cluster-wide settings, node-level settings apply to a
single node. They are defined by flags passed to the cockroach start
command when starting a node and cannot be changed without stopping
and restarting the node.
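If it helps, here is a minimal sketch of applying those settings once against the running container (SET CLUSTER SETTING is the standard way to change them; the container name and path are illustrative, and cluster settings persist, so this only needs to run once):
# run once after the node is up; adjust the container name to yours
docker exec -it my-cockroach ./cockroach sql --insecure \
  --execute "SET CLUSTER SETTING timeseries.storage.enabled = false;" \
  --execute "SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '0s';" \
  --execute "SET CLUSTER SETTING timeseries.storage.resolution_30m.ttl = '0s';"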

Related

Easy way to switch Scylla cluster to larger AWS instances

I have a Scylla cluster running on AWS i3en.xlarge instances which has 16 nodes.
Is there an easy way for me to switch the cluster to i3en.2xlarge or i3en.4xlarge other than replacing the existing nodes one by one (e.g. add a new node and remove a node)?
If I add one i3en.2xlarge instance, will the cluster auto-balance the data so that the i3en.2xlarge uses roughly twice the disk space of an i3en.xlarge?
You can add a logical DC with the new nodes, run repair and then get rid of the original DC (a rough command-level sketch follows the steps below):
Add a new DC with the desired instance type (see the procedure @TzachLivyatan posted in his comment)
Wait for streaming to the new DC to complete
Run a full cluster repair -> wait for it to complete
Decommission the "original" DC:
https://docs.scylladb.com/operating-scylla/procedures/cluster-management/decommissioning_data_center/
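As a hedged sketch of the commands involved (the DC and keyspace names are illustrative; the linked Scylla procedures are the authoritative reference):
# 1) after the new-DC nodes (desired instance type) have joined, include the
#    new DC in the keyspace replication
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};"
# 2) on each new-DC node, stream the existing data from the old DC
nodetool rebuild -- dc_old
# 3) run a full repair and wait for it to complete
nodetool repair -pr
# 4) decommission the old-DC nodes one at a time, per the linked procedure
nodetool decommission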

Setting the job priority in YARN

My cluster (HDP) is using YARN capacity scheduler.
The nameNode UI shows Version 2.7.1.2.4.3.30.
I am trying to set the job priority to HIGH in my hive script:
set mapreduce.job.priority=HIGH;
However I see no difference in the allocation of resources.
I cannot see the property yarn.scheduler.fair.preemption in the yarn-site.xml.
Moreover, what is the equivalent property for Tez?
Because you are using Hadoop 2.7.1, you have to explicitly set yarn.scheduler.fair.preemption in yarn-site.xml. If it is not set to true, the prioritisation will not work, because the default value of yarn.scheduler.fair.preemption is false.
Regarding your question about Tez, there is no equivalent. You may want to read more about how pre-emption works here.
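As a hedged sanity check (the property names are the standard Hadoop 2.7 ones; on HDP, configuration changes are normally made through Ambari and need a ResourceManager restart), you can see what your client-side configuration resolves to from a Hive session, since Hive's set command prints a property's current value:
# prints the active scheduler class, the preemption flag, and the job priority
hive -e "set yarn.resourcemanager.scheduler.class; set yarn.scheduler.fair.preemption; set mapreduce.job.priority;"
If yarn.scheduler.fair.preemption comes back undefined or false, add it with a value of true inside the <configuration> element of yarn-site.xml on the ResourceManager and restart it; note that it only has an effect when the Fair Scheduler is the active scheduler.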

"Too many fetch-failures" while using Hive

I'm running a Hive query against a Hadoop cluster of 3 nodes, and I am getting an error which says "Too many fetch failures". My Hive query is:
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;
That's the query I'm trying to run. All I want to do is transfer data from tablename2 to tablename1. Any help is appreciated.
This can be caused by various Hadoop configuration issues. Here are a couple to look for in particular:
DNS issues: examine your /etc/hosts
Not enough HTTP threads on the mapper side for the reducer
Some suggested fixes (from Cloudera troubleshooting):
set mapred.reduce.slowstart.completed.maps = 0.80
tasktracker.http.threads = 80
mapred.reduce.parallel.copies = sqrt (node count) but in any case >= 10
Here is a link to the troubleshooting deck for more details:
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
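If it helps, a hedged sketch of applying the per-job settings above from a Hive session, reusing the query from the question (tasktracker.http.threads is a daemon-side setting, so it belongs in mapred-site.xml on the TaskTrackers rather than here; with a 3-node cluster the parallel-copies floor of 10 applies):
hive -e "
set mapred.reduce.slowstart.completed.maps=0.80;
set mapred.reduce.parallel.copies=10;
insert overwrite table tablename1 partition(namep)
select id, name, substring(name,5,2) as namep from tablename2;
"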
Update for 2020: things have changed a lot and AWS mostly rules the roost. Here is some troubleshooting guidance for it:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource-1.html
Too many fetch-failures
The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.
There are several reasons the output may not be available:
The prerequisite task is still processing. This is often a map task.
The data may be unavailable due to poor network connectivity if the data is located on a different instance.
If HDFS is used to retrieve the output, there may be an issue with HDFS.
The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.
One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.
If you are seeing this issue, there are several things you can try:
Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time. For more information, see Create Bootstrap Actions to Install Additional Software.
Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
Use a combiner class to minimize the number of outputs that need to be fetched.
Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
Review the CPU and memory resources of the instances in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Configure Cluster Hardware and Networking.
Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.
Jetty is an open-source HTTP server that is used for machine-to-machine communication within a Hadoop cluster.
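For reference, a hedged sketch of lowering the reducer slow start at cluster creation via the configure-hadoop bootstrap action (the lowercase -m flag is, to the best of my recollection, the key=value form for mapred-site settings, so double-check it against the EMR bootstrap-action docs; 0.90 is an illustrative value):
./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.reduce.slowstart.completed.maps=0.90"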

Allow more than one hadoop/EMR tasks to fail before shutting down

I'm trying to use hadoop on Amazon Elastic MapReduce where I have thousands of map tasks to perform. I'm OK if a small percentage of the tasks fail, however, Amazon shuts down the job and I lose all of the results when the first mapper fails. Is there a setting I can use to increase the number of failed jobs that are allowed? Thanks.
Here's the answer for hadoop:
Is there any property to define failed mapper threshold
To use the setting described above in EMR, look at:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop
Specifically, you create an XML file (config.xml in the example) with the setting that you want to change and apply the bootstrap action:
./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-M,s3://myawsbucket/config.xml"

How does Quartz detect node failures?

My production environment runs a Java scheduler job using Quartz 2.1.4 on a WebLogic cluster of 4 machines, and normally only one scheduled job executes on one cluster node (node 1); it has worked this way for a few months. Last night, however, node 2 suddenly decided that node 1 had failed and took over the executing job. In fact node 1 showed no error (according to the server, network, database and application logs), and this event caused duplicate messages because two processes executed concurrently.
What mechanism does Quartz use to detect node failures? A ping scan, a heartbeat via UDP broadcast, database response time, or something else? Is there any configuration for it?
I have read the Quartz configuration guide
http://quartz-scheduler.org/documentation/quartz-2.1.x/configuration/ConfigJDBCJobStoreClustering
but there is no answer there.
I am using JDBCJobStore. After detailed checking, we found a database (Oracle) statement that was executing abnormally long (from 5 sec to 30 sec). The incident happened during this period of time. Do you think it is related?
My configuration is:
org.quartz.threadPool.threadCount=10
org.quartz.threadPool.threadPriority=5
org.quartz.jobStore.misfireThreshold = 10000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
Anyone have this information? Thanks.
I know the answer is very late, but maybe somebody like both of us will still need it.
Short version: it is all handled through the database. The important property is org.quartz.jobStore.clusterCheckinInterval.
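As a hedged sketch, the clustering-related entries look roughly like this (values are illustrative; every node needs isClustered=true, must point at the same database, and the clocks must be synchronized, as the quoted text below explains):
# append the clustering entries to quartz.properties on every node
cat >> quartz.properties <<'EOF'
org.quartz.jobStore.isClustered=true
# check-in interval in milliseconds; a node whose LAST_CHECK_TIME is older
# than (roughly) this interval is considered failed by the other nodes
org.quartz.jobStore.clusterCheckinInterval=20000
EOF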
Long version (all credit goes to http://flylib.com/books/en/2.65.1.91/1/):
Detecting Failed Scheduler Nodes
When a Scheduler instance performs the check-in routine, it looks to
see if there are other Scheduler instances that didn't check in when
they were supposed to. It does this by inspecting the SCHEDULER_STATE
table and looking for schedulers that have a value in the
LAST_CHECK_TIME column that is older than the property
org.quartz.jobStore.clusterCheckinInterval (discussed in the next
section). If one or more nodes haven't checked in, the running
Scheduler assumes that the other instance(s) have failed.
Additionally, the next paragraph might also be important:
Running Nodes on Separate Machines with Unsynchronized Clocks
As you can ascertain by now, if you run nodes on different machines and the
clocks are not synchronized, you can get unexpected results. This is
because a timestamp is being used to inform other instances of the
last time one node checked in. If that node's clock was set for the
future, a running Scheduler might never realize that a node has gone
down. On the other hand, if a clock on one node is set in the past, a
node might assume that the node has gone down and attempt to take over
and rerun its jobs. In either case, it's not the behavior that you
want. When you're using different machines in a cluster (which is the
normal case), be sure to synchronize the clocks. See the section
"Quartz Clustering Cookbook," later in this chapter for details on how
to do this.
