tarantool long WAL write

I'm using Tarantool. Why am I getting these strange messages in the log?
2016-03-24 16:19:58.987 [5803] main/493623/http/XXX.XXX.XXX.XXX:57295 txn.cc:214 W> too long WAL write: 0.527 sec
2016-03-24 16:20:09.841 [5803] main/493714/http/XXX.XXX.XXX.XXX:57346 txn.cc:214 W> too long WAL write: 0.605 sec
2016-03-24 16:20:12.988 [5803] main/493716/http/XXX.XXX.XXX.XXX:57347 txn.cc:214 W> too long WAL write: 1.682 sec
2016-03-24 16:20:15.023 [5803] main/493717/http/XXX.XXX.XXX.XXX:37825 txn.cc:214 W> too long WAL write: 3.373 sec
2016-03-24 16:20:35.145 [5803] main/494145/http/

The message "too long wal write" means that too much time has elapsed between writing updates to the .xlog file ("too much" here meaning "more than specified in Tarantool's configuration parameter too_long_threshold").
There are two common reasons: 1) slow disk 2) problems on the application's side.
To figure out which, launch atop with a 1-second interval and check what happened during the "too long" events: high disk utilization points to disk issues; high CPU utilization points to application issues.
The recommended solution for slow-disk issues is to write changes to the write-ahead log in batches, where every batch is wrapped in a single transaction. This will give you just one disk write per transaction. You'll need no yields in this case (see the notes about fiber.yield further on).
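For illustration, a minimal sketch of such batching, assuming a hypothetical space named metrics and a hypothetical fetch_pending_rows() source of updates:

local BATCH_SIZE = 100                          -- illustrative batch size
local rows = fetch_pending_rows()               -- hypothetical: collect pending updates first

for i = 1, #rows, BATCH_SIZE do
    box.begin()
    for j = i, math.min(i + BATCH_SIZE - 1, #rows) do
        box.space.metrics:replace(rows[j])      -- all replaces join the open transaction
    end
    box.commit()                                -- the whole batch is committed as one WAL entry
end

Each box.commit() turns the whole batch into a single WAL record, so the disk sees one write per batch instead of one per row.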
Typical application issues are as follows:
you launched too many fibers (so, due to successive fiber switches, too much time may elapse before the next WAL write);
you make no yields within time-consuming operations (like a full-scan search, deleting a huge number of records, etc.).
Notes on yields:
You need to make explicit yields using fiber.yield().
You don't need to move time-consuming operations to a dedicated fiber; you can just as well run them in the main loop: require('fiber') and occasionally yield control within your processing cycle (not too often, though; a few times per the interval specified in too_long_threshold is quite enough).
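As an illustration of such a yield, here is a minimal sketch with a hypothetical space users and a hypothetical process() function:

local fiber = require('fiber')

local processed = 0
for _, tuple in box.space.users:pairs() do      -- long full scan
    process(tuple)                              -- hypothetical per-record work
    processed = processed + 1
    if processed % 1000 == 0 then
        fiber.yield()                           -- hand control back to the event loop
    end
end

The modulo constant is arbitrary; as noted above, yielding a few times per too_long_threshold interval is enough.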
As you optimize your application code, remember that one Tarantool instance can utilize only one CPU core, so increasing the number of CPU cores is useless — the only solution is to ensure proper control yields among the fibers.

After direct on-site help and debugging with agent-0007, we found several issues.
Most of them were related to a slow virtualized environment (OpenVZ was used), which shows inadequate I/O timings.
This problem is also related to the question "Tarantool sophia make slow selects?"
Additionally, there are recommendations regarding slow disks:
If possible, try to place the WAL and the Tarantool snapshots or Sophia storage on separate disks.
snap_dir, wal_dir and sophia_dir options:
http://tarantool.org/doc/book/configuration/index.html#basic-parameters
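For example, a minimal sketch with hypothetical mount points (parameter names are the 1.6-era ones from the page linked above):

box.cfg{
    wal_dir    = '/mnt/disk-a/tarantool/wal',     -- WAL on its own disk
    snap_dir   = '/mnt/disk-b/tarantool/snap',    -- snapshots on a different disk
    sophia_dir = '/mnt/disk-b/tarantool/sophia',  -- Sophia storage on a different disk
}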
Thanks.

Related

Why real time is much higher than "user" and "system" CPU TIME combined?

We have a batch process that executes every day. This week, a job that usually does not take more than 18 minutes of execution time (real time, as you can see) is taking more than 45 minutes to finish.
The FULLSTIMER option is already active, but we don't know why only the real time increased.
In the old documentation there are FULLSTIMER stats that could help identify the problem, but they do not appear in the batch log (the stats are those listed below: Page Faults, Context Switches, Block Operations, and so on, as you can see).
It might be an I/O issue. Does anyone know how we can identify whether it is really an I/O problem or some other issue (the network, for example)?
To be more specific, this is one of the queries whose run time has increased dramatically. As you can see, it reads from a database (SQL Server, VAULT schema) and from WORK, and writes to the WORK directory.
The number of observations is almost the same.
We asked the customer about any change in network traffic, and they said it is still the same.
Thanks in advance.
For a process to complete, much more needs to be done than the actual calculations on the CPU.
Your data has to be read and your results have to be written.
You might have to wait for other processes to finish first, and if your process includes multiple steps, writing to and reading from disk each time, you will have to wait for the CPU each time too.
In our situation, if real time is much larger than CPU time, we usually see a lot of traffic to our Network File System (NFS).
As a programmer, you might notice that storing intermediate results in WORK is more efficient than storing them in remote libraries.
You might save a lot of time by creating intermediate results as views instead of tables, if you only use them once. That is possible not only in SQL, but also in DATA steps like this:
data MY_RESULT / view=MY_RESULT;
set MY_DATA;
where transaction_date between '1jan2022'd and '30jun2022'd;
run;

What's a sensible basic OLTP configuration for Postgres?

We're just starting to investigate using Postgres as the backend for our system which will be used with an OLTP-type workload: > 95% (possibly >99%) of the transactions will be inserting 1 row into 4 separate tables, or updating 1 row. Our test machine is running 9.5.6 (using out-of-the-box config options) on a modest cloud-hosted Windows VM with a 4-core i7 processor, with a conventional 7200 RPM disk. This is much, much slower than our targeted production hardware, but useful right now for finding bottlenecks in our basic design.
Our initial tests have been pretty discouraging. Although the insert statements themselves run fairly quickly (combined execution time is around 2 ms), the overall transaction time is around 40 ms, due to the commit statement taking 38 ms. Furthermore, during a simple 3-minute load test (5000 transactions), we're only seeing about 30 transactions per second, with pgBadger reporting 3 minutes spent in "commit" (38 ms avg.), and the next highest statements being the inserts at 10 seconds (2 ms avg.) and 3 seconds (0.6 ms avg.) respectively. During this test, the CPU on the Postgres instance is pegged at 100%.
The fact that the time spent in commit is equal to the elapsed time of the test tells me that not only is commit serialized (unsurprising, given the relatively slow disk on this system), but that it is consuming a CPU for that duration, which surprises me. I would have assumed beforehand that if we were I/O bound, we would be seeing very low CPU usage, not high usage.
In doing a bit of reading, it would appear that using Asynchronous Commits would solve a lot of these issues, but with the caveat of data loss on crashes/immediate shutdown. Similarly, grouping transactions together into a single begin/commit block, or using multi-row insert syntax improves throughput as well.
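For illustration, minimal sketches of those three options (the events table and its columns are hypothetical):

-- 1) Asynchronous commit for the current session: a crash may lose the last
--    few transactions, but the data that survives stays consistent.
SET synchronous_commit TO OFF;

-- 2) Grouping several operations into one transaction so they share one WAL flush:
BEGIN;
INSERT INTO events (id, payload) VALUES (1, 'a');
INSERT INTO events (id, payload) VALUES (2, 'b');
COMMIT;

-- 3) Multi-row insert syntax:
INSERT INTO events (id, payload) VALUES (3, 'c'), (4, 'd'), (5, 'e');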
All of these options are possible for us to employ, but in a traditional OLTP application none of them would be (you need fast, atomic, synchronous transactions). 35 transactions per second on a 4-core box would have been unacceptable 20 years ago on other RDBMSs running on much slower hardware than this test machine, which makes me think that we're doing something wrong, as I'm sure Postgres is capable of handling much higher workloads.
I've looked around but can't find some common-sense config options that would serve as starting points for tuning a Postgres instance. Any suggestions?
If COMMIT is your time hog, that probably means:
Your system honors the FlushFileBuffers system call, which is as it should be.
Your I/O is miserably slow.
You can test this by setting fsync = off in postgresql.conf – but don't ever do this on a production system. If that improves performance a lot, you know that your I/O system is very slow when it actually has to write data to disk.
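If editing postgresql.conf is inconvenient, the same throwaway test can be done from psql as a superuser (fsync only needs a reload, not a restart); a sketch:

-- Diagnostic only, never on data you care about:
ALTER SYSTEM SET fsync = off;
SELECT pg_reload_conf();
-- ...run the load test, then revert:
ALTER SYSTEM RESET fsync;
SELECT pg_reload_conf();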
There is nothing that PostgreSQL (or any other reliable database) can improve here without sacrificing data durability.
Although it would be interesting to see some good starting configs for OLTP workloads, we've solved our mystery of the unreasonably high CPU during commits. It turns out it wasn't Postgres at all; it was Windows Defender constantly scanning the Postgres data files. The team that set up the VM hosting our test server didn't understand that we needed a backend configuration as opposed to a user configuration.

Physical memory usage keeps increasing for Spark application on YARN

I am running a Spark application in YARN-client mode with six executors (four cores each, executor memory = 6 GB, overhead = 4 GB; Spark version: 1.6.3 / 2.1.0).
I find that my executor memory keeps increasing until the executor gets killed by the node manager, which prints a message telling me to boost spark.yarn.executor.memoryOverhead.
I know that this parameter mainly controls the size of memory allocated off-heap, but I don't know when and how the Spark engine uses this part of memory. Also, increasing that part of memory does not always solve my problem: sometimes it works and sometimes it doesn't. It tends to be useless when the input data is large.
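For reference, a minimal sketch of where that setting lives for the 2.1.0 run (values mirror the 6 GB heap + 4 GB overhead above; with 1.6.3 the same keys go on the SparkConf passed to SparkContext, and spark-submit --conf works equally well):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application setup; note the correct spelling of the property name.
val conf = new SparkConf()
  .setAppName("daily-compaction")                      // hypothetical app name
  .set("spark.executor.instances", "6")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "6g")
  .set("spark.yarn.executor.memoryOverhead", "4096")   // off-heap headroom per executor, in MiB

val spark = SparkSession.builder().config(conf).getOrCreate()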
FYI, my application’s logic is quite simple. It means to combine the small files generated in one single day (one directory one day) into a single one and write back to HDFS. Here is the core code:
val df = spark.read.parquet(originpath)
.filter(s"m = ${ts.month} AND d = ${ts.day}")
.coalesce(400)
val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")
dropDF.repartition(1).write
.mode(SaveMode.ErrorIfExists)
.parquet(targetpath)
The source data may have hundreds to thousands of partitions, and the total Parquet size is around 1 to 5 GB.
Also, I find that in the step that shuffle-reads data from different machines, the size of the shuffle read is about four times larger than the input size, which is weird, or follows some principle I don't know.
Anyway, I have done some searching on this problem myself. Some articles say that it's related to the direct buffer memory (which I don't set myself).
Some articles say that people solve it with more frequent full GC.
Also, I found one person on Stack Overflow with a very similar situation: Ever increasing physical memory for a Spark application in YARN
That person claimed that it's a bug with Parquet, but a comment questioned him. People on this mailing list may also have received an email a few hours ago from blondowski, who described this problem while writing JSON: Executors - running out of memory
So it looks like a common question across different output formats.
I hope someone with experience of this problem can offer an explanation. Why does this happen, and what is a reliable way to solve it?
I did some investigation on this over the last few days with my colleague. Here is my thinking: since Spark 1.2, Netty with off-heap memory is used to reduce GC during shuffle and cache block transfer. In my case, if I try to increase the memory overhead enough, I get the max direct buffer exception. When Netty does block transfer, there are five threads by default grabbing data chunks for the target executor. In my situation, one single chunk was too big to fit into the buffer, so GC won't help here. My final solution was to do another repartition before the repartition(1), making roughly 10x more partitions than the original. This reduces the size of each chunk Netty transfers, and this is how I finally made it work.
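Applied to the snippet from the question, the workaround is just an extra repartition; a minimal sketch (4000 is illustrative, roughly 10x the original 400 partitions):

dropDF
  .repartition(4000)              // extra repartition: ~10x more, smaller partitions
  .repartition(1)                 // then collapse to the single output file as before
  .write
  .mode(SaveMode.ErrorIfExists)
  .parquet(targetpath)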
I also want to say that repartitioning a big dataset into a single file is not a good choice; this extremely unbalanced scenario is kind of a waste of your compute resources.
Comments are welcome; I still don't understand this part well.

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has essentially zero scheduling delay for hours, but then it suddenly jumps to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is impacting this delay?
Go into each job and look at the data/records processed by each executor; you can find problems there.
There may be skew in the data partitions as well. If the application is reading data from Kafka and processing it, there can be skew in the data across cores if the partitioning is not well defined. Tune these parameters: the number of Kafka partitions, the number of RDD partitions, the number of executors, and the number of executor cores.
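As a sketch of the repartitioning knob (names and numbers are made up; kafkaStream stands for a DStream created with KafkaUtils.createDirectStream):

val numExecutors = 10
val coresPerExecutor = 4

kafkaStream
  .repartition(numExecutors * coresPerExecutor)   // spread records evenly across all cores
  .foreachRDD { rdd =>
    rdd.foreach(record => handle(record))         // handle() is a hypothetical processing function
  }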

Is there a way to configure timeout for speculative execution in Hadoop?

I have a Hadoop job with tasks that are expected to run for a significant length of time (a few minutes). However, Hadoop starts speculative execution too soon. I do not want to turn speculative execution off completely, but I want to increase how long Hadoop waits before considering a task for speculative execution. Is there a config option to control this timeout?
Thanks
I don't believe the speculative-execution delay is currently configurable. On the other hand, there's probably no need to adjust it. Speculative execution is meant to bail you out of slow-running tasks (usually due to degraded hardware performance). If you have available cluster resources such that spec exec is kicking in, what's the harm in letting it do so? Note that a few minutes is not considered "significant" and is more than normal for medium-sized or larger jobs.
It's also worth noting that while mapper spec exec is almost always fine and low overhead to the system, reducer spec exec can hurt and probably should be disabled. The rationale is that if a mapper is progressing slowly and there are available resources where the data is local (normal), there's no shared overhead. If a reducer is performing slowly, starting another attempt of the same task will simply double the network load - normally the most painful part of reducer execution. If the network is what is causing the reducer to be "slow," starting a second attempt only hurts both attempts.
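For reference, a sketch of that split (mapper spec exec on, reducer spec exec off) in mapred-site.xml or the per-job configuration; the property names below are the Hadoop 2 ones, while Hadoop 1 uses mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution:

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>    <!-- keep speculation for mappers: cheap and usually helpful -->
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>   <!-- avoid doubling shuffle traffic for slow reducers -->
</property>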
If you truly have a use case for adjusting the spec exec time, it might be worth filing a jira at http://issues.apache.org.
Hope this helps.
