We have been experimenting with Cassandra lately (version 1.0.7) and we seem to have some problems with memory. Our test environment is EC2: three nodes with 3.7 GB of memory and 1 core @ 2.4 GHz each, all running Ubuntu Server 11.10.
The problem is that the node we hit from our Thrift interface dies regularly (approximately after we store 2-2.5 GB of data). The error message is OutOfMemoryError: Java Heap Space, and according to the log it did in fact use all of the allocated memory.
The nodes are under relatively constant load and store about 2000-4000 row keys a minute, which are batched through the Thrift interface in groups of 10-30 row keys at a time (with about 50 columns each). The number of reads is very low, around 1000-2000 a day, and each one requests only the data of a single row key. There is currently only one column family in use.
Our initial thought was that something was wrong in the cassandra-env.sh file, so we specified the variables 'system_memory_in_mb' (3760) and 'system_cpu_cores' (1) according to our nodes' specification. We also changed 'MAX_HEAP_SIZE' to 2G and 'HEAP_NEWSIZE' to 200M (we think the latter is related to garbage collection). Unfortunately, that did not solve the issue and the node we hit via Thrift keeps dying regularly.
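For reference, the overrides in cassandra-env.sh looked roughly like this (using the variable names from that script):
system_memory_in_mb="3760"
system_cpu_cores="1"
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="200M"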
In case you find it useful: swap is off, and unevictable memory seems to be very high on all 3 servers (2.3 GB; on other Linux servers we usually observe around 0-16 KB of unevictable memory). We are not quite sure how unevictable memory ties into Cassandra; it's just something we noticed while looking into the problem. The CPU is pretty much idle the entire time. According to nodetool, the heap is clearly being reduced once in a while, but it obviously grows over the limit as time goes by.
Any ideas? Thanks in advance.
The cassandra-env.sh defaults are fine for almost all workloads, so until you know why this is happening it is best to put them back to their defaults; otherwise you may be making things worse without realizing it.
I see concurrent reads and writes of 2k/sec/node on our cluster, so 2k-4k writes per minute is very little, although the fact that it's only the node accepting your connections that is dying is a little strange.
If you connect your app to the thrift endpoint on one of the other nodes is it then that one that dies?
Client connections use memory, so it might be worth double-checking that you're not opening too many at a time. "netstat -A inet | grep 9160" on the dying Cassandra node should tell you how many client connections you have. Depending heavily on your application, you'd expect tens or hundreds rather than thousands.
What do the writes look like?
Are you writing the same row keys repeatedly and if so are you appending new column names or overwriting the same ones?
How big is each write? Anything else you can tell me?
If you're overwriting the same column names in the same row keys constantly compaction may be struggling.
If you're appending new column names to the same row keys constantly you might be growing your rows too large to fit into memory.
The output of "nodetool -h localhost tpstats" on the dying node might also give some clues as to where you're falling down. Anything constantly pending is probably bad news, especially at such a low write rate.
If you're going to use cassandra in production you should get graphing of the internals to better understand what's going on. jmxtrans and graphite should be your new best friends.
There are some things you can try tweaking. First, make sure you don't have row caching enabled on your column family. It's also worthwhile to check the log for errors and tpstats in case something died due to an error and work is getting backed up in a queue. The stack trace of the exception could be meaningful too, since there are actually different types of OOM that may just call for kernel tweaks.
If you're simply using more memory per node than you would expect for the size of your data set, try checking cfstats; you can identify roughly how much space is spent on bloom filters. As you get more rows in a CF this can grow linearly, and it is part of the base minimum memory your nodes are going to require.
nodetool cfstats | grep Bloom.*Used | awk '{ SUM += $5} END { print SUM " bytes" }'
Since you don't read very often, you can probably increase the false positive rate on them. Each SSTable has a bloom filter it uses to check whether a row exists in it. You can change it with cqlsh:
ALTER TABLE MyColumnFamily WITH bloom_filter_fp_chance = 0.1;
After that, run an SSTable upgrade for that CF on each node (this will be slow):
nodetool upgradesstables MyKeyspace MyColumnFamily
There are consequences: reads may take longer, since there is roughly a 10% (the 0.1) chance the filter will pass for rows that don't exist in an SSTable, resulting in extra disk seeks.
Another major memory sink, if you have column families with a large number of rows, is the sampling rate of the index. This can be modified per node in cassandra.yaml:
http://www.datastax.com/docs/1.1/configuration/node_configuration#index-interval
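For example, a per-node override in cassandra.yaml might look like this (128 is the usual default; a larger interval uses less heap at the cost of slightly slower reads):
index_interval: 512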
If you have it set up to take heap dumps on OOM (-XX:+HeapDumpOnOutOfMemoryError, on by default I believe), there should be some heap dumps available in the /var/lib/cassandra/data directory. You can open these up in VisualVM or whatever tool you like to identify which parts of the heap are going where.
I'm testing NiFi SplitRecord with a small file of only 11 records.
However, SplitRecord hangs for a long time and I have no clue what it is doing.
[screenshots: hung processor, SplitRecord properties]
Is Records Per Split controlling the maximum, the minimum, or the exact number of records per split?
If the total number of records is less than Records Per Split, what is the behavior of SplitRecord? Does it wait until a timeout and then put all on-hold records into a single split?
After about 10 minutes, or some random number of start/stop/terminate/restart cycles, the processor may be triggered to split the data sooner.
Records Per Split controls the maximum; see SplitRecord.java for the code. If there are fewer records than the RECORDS_PER_SPLIT value, it will immediately push them all out.
However, it does look like it creates a new FlowFile even if the total record count is less than the RECORDS_PER_SPLIT value, meaning it's doing disk writes regardless of whether a split really occurred.
I would probably investigate two things:
Host memory - how much memory does the host have? How much is configured as NiFi max heap? How much total system memory is in use/free? NiFi performs best when plenty of system memory is left for file cache.
Host's disks, specifically the disk that holds the Content Repository. Capacity? IO? Is it shared with other services? FlowFile content is written to the Content Repository; if that disk is shared with the OS or other busy services (or other NiFi repos), it can really slow content modification down. See the snippet below for where both the heap and the repository location are configured.
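As a rough pointer (exact property names can vary slightly between NiFi versions, and the values here are only illustrative), the heap is set in conf/bootstrap.conf and the content repository location in conf/nifi.properties:
# conf/bootstrap.conf -- NiFi JVM heap
java.arg.2=-Xms4g
java.arg.3=-Xmx4g
# conf/nifi.properties -- where FlowFile content is written
nifi.content.repository.directory.default=./content_repository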
Note: your NiFi version is over 3 years old; please consider upgrading.
I am running a Spark application in YARN-client mode with six executors (4 cores each, executor memory = 6 GB, overhead = 4 GB; Spark version: 1.6.3 / 2.1.0).
I find that my executor memory keeps increasing until the executor gets killed by the node manager, which prints a message telling me to boost spark.yarn.executor.memoryOverhead.
I know that this parameter mainly controls the size of memory allocated off-heap, but I don't know when and how the Spark engine uses this part of memory. Increasing it also does not always solve my problem: sometimes it works and sometimes it doesn't. It tends to be useless when the input data is large.
FYI, my application's logic is quite simple. It combines the small files generated within a single day (one directory per day) into one file and writes it back to HDFS. Here is the core code:
val df = spark.read.parquet(originpath)
.filter(s"m = ${ts.month} AND d = ${ts.day}")
.coalesce(400)
val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")
dropDF.repartition(1).write
.mode(SaveMode.ErrorIfExists)
.parquet(targetpath)
The source directory may have hundreds to thousands of partitions, and the total Parquet data is around 1 to 5 GB.
I also find that in the stage that shuffle-reads data from different machines, the size of the shuffle read is about four times larger than the input size, which is weird, or follows some principle I don't know.
Anyway, I have done some searching on this problem myself. Some articles say it's related to direct buffer memory (which I don't set myself); others say that people solve it with more frequent full GCs.
I also found someone on Stack Overflow with a very similar situation: Ever increasing physical memory for a Spark application in YARN
He claimed that it was a bug in Parquet, but a comment questioned that. People on this mailing list may also have received an email a few hours ago from blondowski, who described this problem while writing JSON: Executors - running out of memory
So it looks like a common problem across different output formats.
I hope someone with experience of this problem can explain the issue. Why does this happen, and what is a reliable way to solve it?
I did some investigation on this over the last few days with my colleague. Here is my thinking: since Spark 1.2, shuffle and cache block transfer use Netty with off-heap memory to reduce GC. In my case, if I increase the memory overhead far enough I just get a "max direct buffer" exception instead. When Netty does block transfer, there are five threads by default grabbing data chunks for the target executor, and in my situation a single chunk was too big to fit into the buffer, so GC won't help here. My final solution was to do another repartition before the repartition(1), making roughly 10x more partitions than the original data had. That way each chunk Netty has to transfer is smaller, and that finally made it work.
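A rough sketch of the change described above, based on the code earlier in this thread (the factor of 10 is just what worked for us; treat it as a tuning knob):
import org.apache.spark.sql.SaveMode

val df = spark.read.parquet(originpath)
  .filter(s"m = ${ts.month} AND d = ${ts.day}")
  .coalesce(400)

val dropDF = df.drop("hh").drop("mm").drop("mode").drop("y").drop("m").drop("d")

// Extra repartition before collapsing to one file: with ~10x more map-side
// partitions, each shuffle block Netty moves is much smaller, so the direct
// (off-heap) buffers stay within bounds.
dropDF.repartition(4000)
  .repartition(1)
  .write
  .mode(SaveMode.ErrorIfExists)
  .parquet(targetpath)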
I also want to say that it's not a good choice to repartition a big dataset into a single file; this extremely unbalanced scenario is kind of a waste of your compute resources.
Any comments are welcome; I still don't understand this part well.
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
There are a couple of options for this scenario of reading the same record (or a small set of records) over and over.
One idea is to use the row cache, but be careful: any update/delete to a single column invalidates the whole partition in the cache, so you lose all the benefit. The row cache works best for small datasets that are read frequently but almost never modified.
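For example, to switch the row cache on for a table (assuming Cassandra 2.1+ CQL syntax; the table name is a placeholder, and row_cache_size_in_mb must be > 0 in cassandra.yaml for it to have any effect):
ALTER TABLE mykeyspace.mytable WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};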
Are you sure that your cassandra-stress scenario never updates or writes to the same partition over and over again?
Here are my findings: when I set row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that Cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even when reading a single record over and over. This likely confirms (as also alluded to in my comment) that Cassandra incurs the cost of serialization and deserialization even when its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want native hashmap speeds for a small-working-set workload but the ability to spill to disk if the map grows big, you would need to write your own cache on top of Cassandra (or any of the other key-value stores like Mongo, Redis, etc. for that matter).
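To make that last point concrete, here is a minimal sketch of the kind of thin caching layer I mean (names are hypothetical; the two store functions would wrap whatever Cassandra/Redis/Mongo client you use):
import scala.collection.concurrent.TrieMap

// Read-through/write-through cache: hot keys are served from an in-process map
// at hashmap speed; only misses pay the store's network and (de)serialization cost.
class ReadThroughCache[K, V](fetchFromStore: K => V, writeToStore: (K, V) => Unit) {
  private val hot = TrieMap.empty[K, V]

  def get(key: K): V = hot.getOrElseUpdate(key, fetchFromStore(key))

  def put(key: K, value: V): Unit = {
    writeToStore(key, value) // keep the backing store authoritative
    hot.put(key, value)      // and keep the hot set coherent
  }
}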
For those interested, I also verified that Redis is the fastest among Cassandra, Mongo, and Redis for a simple get/put small-working-set workload, but even Redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which comes nowhere close to a native hashmap that simply returns pointers and can comfortably do over 2 million reads/s.
I was playing around with the cassandra-stress tool on my own laptop (8 cores, 16 GB) with Cassandra 2.2.3 installed out of the box with its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
Using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications, I got ~7000 inserts/sec.
When I modified line 35 of the code above from "cluster: fixed(1000)" to "cluster: fixed(100)", i.e. configured my test data distribution to have 100 clustering keys per partition instead of 1000, the performance jumped to ~11000 inserts/sec.
When I configured it to have 5000 clustering keys per partition, the performance dropped to just 700 inserts/sec.
However, the documentation says Cassandra can support up to 2 billion rows per partition. I don't need that many, but I still don't get how just 5000 records per partition can slow the writes down 10 times. Am I missing something?
"Supported" is a little different from "best performing". You can have very wide partitions, but the rule of thumb is to try to keep them under 100 MB, for miscellaneous performance reasons. Some operations can be performed more efficiently when the entire partition can be stored in memory.
As an example (this is an old one, and a complete non-issue post-2.0, where everything is single-pass): in some versions, when a partition was >64 MB, compaction used a two-pass process, which halved compaction throughput. It still worked with huge partitions; I've seen many multi-GB ones that worked just fine. But systems with huge partitions were difficult to work with operationally (managing compactions/repairs/GCs).
I would say target the 100 MB rule of thumb initially and test from there to find your own optimum. Things always behave differently based on the use case; to get the most out of a node, the best you can do is run benchmarks as close as possible to what you're actually going to do (true of all systems). This seems like something you're already doing, so you're definitely on the right path.
I was trying to put some heavy load on my Redis instance for testing purposes and to find its upper limits. First I loaded it with 50,000 and then 100,000 keys of 32 characters, with values of around 32 characters. It took no more than 8-15 seconds for both key counts. Now I am trying to put 4 KB of data as the value for each key. The first 10,000 keys take 800 milliseconds to set, but from that point it slows down gradually, and setting all 50,000 keys takes around 40 minutes. I am loading the database from Node.js with node_redis (Mranney). Am I making some mistake, or is Redis just that slow with big values of 4 KB?
One more thing I found: when I run another client in parallel with the current one and update keys, that second client finishes loading the 50,000 keys with 4 KB values within 8 seconds, while the first client still does its thing forever. Is it a bug in Node or in the redis library? This is alarming and not acceptable for production.
You'll need some kind of back pressure when doing bulk writes from Node into Redis. By default, Node will queue all writes and does not enforce an upper bound on the outgoing queue size.
node_redis has a "drain" event that you can listen for to implement some rudimentary back pressure.
The default redis configuration is not optimized for that sort of usage. I suspect you have it swapping to disk with a page size of 32 bytes, which means that each key added has to find 128 contiguous free pages and may end up using system VM or needing to expand the swap file a lot.
When you update a key, the space is already allocated so you don't see any performance issues.
Since I was doing a lot of set(key, value) operations from Node.js, which are issued asynchronously, a lot of writes end up outstanding on the socket at the same time. The Node.js socket write buffer might be overloaded, and GC might come in and fiddle with the node process.
PS: I changed the Redis memory configuration as Tom suggested, but it still performed the same.