Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100GB of data? - memory-management

I came to know that Cassandra uses Bloom filters for performance, and that it stores this filter data in physical memory.
1) Where does Cassandra store these filters? (in heap memory?)
2) How much memory do these filters consume?

When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
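If you want to script that check, here is a minimal sketch in Java (the data directory path is an assumption; point it at your data_file_directories location):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FilterSizes {
    public static void main(String[] args) throws IOException {
        // Assumed default data directory; adjust to your cassandra.yaml settings.
        Path dataDir = Paths.get("/var/lib/cassandra/data");
        try (Stream<Path> files = Files.walk(dataDir)) {
            long totalBytes = files
                    .filter(p -> p.getFileName().toString().endsWith("-Filter.db"))
                    .mapToLong(p -> p.toFile().length())
                    .sum();
            System.out.printf("Total Bloom filter size on disk: %.1f MB%n",
                    totalBytes / (1024.0 * 1024.0));
        }
    }
}
```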

Related

Why is Elasticsearch designed to be near real time?

As we know, a new document written into Elasticsearch usually becomes visible after about 1 second, because Elasticsearch makes data visible only after it has been moved from the cache into segments.
However, since that newly added data is already stored in the in-memory cache, why not take it into consideration directly, without waiting for it to be moved to segments?
The official documentation explains it pretty well why it's designed the way it is: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/near-real-time.html
Sitting between Elasticsearch and the disk is the filesystem cache. Documents in the in-memory indexing buffer (Figure 1) are written to a new segment (Figure 2). The new segment is written to the filesystem cache first (which is cheap) and only later is it flushed to disk (which is expensive). However, after a file is in the cache, it can be opened and read just like any other file.
So, between the moment a document to be indexed arrives into the Elasticsearch in-memory indexing buffer (i.e. inside the JVM heap) and the moment it is written into a segment to the physical disk, it will transit through the filesystem cache (i.e. the remaining 50% of the physical RAM) where it is already searchable.
The transition from the indexing buffer to the filesystem cache is carried out by the refresh operation which happens in general every second, hence why "near real time". Then transiting the data from the filesystem cache to the disk requires a Lucene commit operation, which is a much more expensive operation and is performed less frequently.
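To make this concrete, here is a small sketch using the Elasticsearch low-level Java REST client (index name and document are made up): it indexes a document, forces a refresh so the indexing buffer is written out to a new, filesystem-cached segment, and then searches for the document.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RefreshDemo {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Index a document; without a refresh it only becomes searchable
        // after the next periodic refresh (every 1s by default).
        Request index = new Request("PUT", "/demo-index/_doc/1");
        index.setJsonEntity("{\"title\":\"hello\"}");
        client.performRequest(index);

        // Force a refresh: the in-memory indexing buffer is written to a new
        // segment in the filesystem cache, making the document visible to search.
        client.performRequest(new Request("POST", "/demo-index/_refresh"));

        Response search = client.performRequest(
                new Request("GET", "/demo-index/_search?q=title:hello"));
        System.out.println(search.getStatusLine());

        client.close();
    }
}
```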

Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

I have a use case where the Apache Flink process must integrate near real-time data streams (events) from multiple sources, but due to the lack of uniform keys in the different systems I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options
Local map
If the surrogate key mapping never changes, you could just load it in RichMapFunction#open and perform the lookup in memory (see the sketch below). That of course means you will have to adjust the memory settings so that Flink doesn't try to take all the memory for its own operations.
Some quick math: assume both keys are strings of length 10. Together they need about 40 bytes of char data in memory, and with some object overhead we get to ~50 bytes per entry. With 50M entries, that is 2.5 GB of RAM. Because the hash map itself adds some overhead, I'd plan for 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager are not loading the same map twice. Also, I'd choose a format that is suited to loading the data as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
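A minimal sketch of the local-map variant, assuming the mapping is available as a two-column CSV file on a path readable by every task manager (the path and file format are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class SurrogateKeyLookup extends RichMapFunction<String, String> {

    private transient Map<String, String> naturalToSurrogate;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the full mapping once per task instance when the operator starts.
        naturalToSurrogate = new HashMap<>(64_000_000);
        try (Stream<String> lines = Files.lines(Paths.get("/data/sk_mapping.csv"))) {
            lines.forEach(line -> {
                String[] parts = line.split(",", 2);
                naturalToSurrogate.put(parts[0], parts[1]);
            });
        }
    }

    @Override
    public String map(String naturalKey) {
        // Pure in-memory lookup, no database round trip per element.
        return naturalToSurrogate.get(naturalKey);
    }
}
```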
State-based
If memory is an issue or the data is changing, you can also model the lookup data as a map-state. I'd add a second input for that lookup data and use a KeyedCoProcessFunction, then feed whatever comes from the second input into the map-state. The state should use a RocksDB backend, so that the data effectively resides on disk.
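A minimal sketch of that approach, assuming both streams are keyed by the natural key and the lookup stream delivers (naturalKey, surrogateKey) tuples. Because everything is keyed by the natural key here, a per-key ValueState takes the role of the map-state described above; with the RocksDB state backend it effectively lives on disk.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class StateBasedSurrogateLookup
        extends KeyedCoProcessFunction<String, String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> surrogateKey;

    @Override
    public void open(Configuration parameters) {
        surrogateKey = getRuntimeContext().getState(
                new ValueStateDescriptor<>("surrogate-key", String.class));
    }

    @Override
    public void processElement1(String naturalKey, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        // Event stream: read the surrogate key for this natural key from state.
        String sk = surrogateKey.value();
        if (sk != null) {
            out.collect(Tuple2.of(naturalKey, sk));
        }
    }

    @Override
    public void processElement2(Tuple2<String, String> mapping, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        // Lookup stream: store or refresh the mapping for this key in state.
        surrogateKey.update(mapping.f1);
    }
}
```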
Joining data
A lookup can also be modeled as a join. If you are already using the Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStreams with Tables.

What are the drawbacks of using a Lucene directory as a primary file store?

I want to use a Lucene MMapDirectory as a primary file store. Each file would be stored in a separate document as a byte array in a StoredField. All file properties that should be searchable, like file name, size etc., would be stored in indexable fields in the same document.
My questions would be:
What are the drawbacks of using Lucene directories for storing files, especially with regards to indexing and search performance and memory (RAM) consumption?
If this is not a "no-go", is there a better/faster way of storing files in the directory than as a byte array?
Short Answer
I really love Lucene and consider it to be the best open-source library, but I'm afraid it's not a good decision to use it as a primary file store, due to:
high CPU/memory overhead
slow index/query performance
high HDD utilization and doubled index size
weak recovery capabilities
Long Answer
Under the hood, Lucene uses the following files to keep all stored fields in one segment:
the fields index file (.fdx),
the fields data file (.fdt).
You can read more about how it works in Lucene50StoredFieldsFormat’s docs.
This means that in case of any I/O issue it is almost impossible to restore any single file.
In order to return one file, Lucene has to read and decompress binary data from disk block by block. This means high CPU overhead for decompression and a high memory footprint to keep the whole file in the Java heap space. No streaming is available either, unlike with file and network storage.
The maximum document size is limited by the codec implementation to 2 GB per document.
Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents. Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This architecture becomes a big problem for file storage:
you cannot delete a file, only mark it as unavailable
a merge operation requires 2x disk space and consumes a lot of resources and disk throughput - it creates a new .fdt file and copies the content of the other .fdt files through Java code and the Java heap
So you won't just be using an MMapDirectory but an actual Lucene index.
I have had good experiences with using Lucene as the primary data store for some projects.
Just be sure to also include a generated/natural unique ID, because the document IDs are not constant or reliable.
Also make sure you use a Directory implementation that fits your use case. I have switched to the normal RandomAccess-based implementation in the low-load case, since it uses less memory and is almost as fast.
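To illustrate the pattern from the question plus the advice above, here is a minimal sketch (paths and field names are made up): each file becomes one document with a stable ID of our own, a searchable name field, and the raw bytes in a StoredField.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.BytesRef;

public class LuceneFileStore {
    public static void main(String[] args) throws Exception {
        Directory dir = new MMapDirectory(Paths.get("/tmp/lucene-file-store"));

        // Store one file: our own stable ID + searchable name + the payload bytes.
        Path file = Paths.get("example.pdf");
        String id = UUID.randomUUID().toString();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
            doc.add(new StoredField("content", Files.readAllBytes(file)));
            // Upsert by our own ID, never by the (unstable) Lucene docID.
            writer.updateDocument(new Term("id", id), doc);
        }

        // Retrieve the file again by its ID; note that the whole payload is
        // decompressed and materialized on the heap.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", id)), 1);
            BytesRef payload = searcher.doc(hits.scoreDocs[0].doc).getBinaryValue("content");
            System.out.println("restored " + payload.length + " bytes");
        }
    }
}
```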

Elasticsearch: What if size of index is larger than available RAM?

Assuming a single machine system with an in-memory indexing schema.
I am not able to find this info in the ES docs. Does ES start swapping out the overflowing data, loading it when needed, and continue working, or does it give an error?
In-memory indices provide better performance at the cost of limiting the index size to the amount of available physical memory.
Via the 1.7 documentation. Memory stores are no longer available in 2.0+.
Under the hood it uses the Lucene RAMDirectory, which will just consume RAM (and eventually swap) until either you hit Java heap limits and ES crashes with out-of-memory errors, or the system gives up and oomkills the Elasticsearch process. Don't use in-memory indexes for large indexes, or for any situation where persistence is important.

MongoDB - make in-memory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I have a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node has 64GB of RAM.
From the mongodb docs it states:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data is quite dynamic (it can range from 50GB to 75GB every few hours), would it theoretically perform better to let MongoDB manage itself with its cache (the default setup of Mongo), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, use swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. It provides an efficient way to access the data, while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
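For illustration, a small sketch using the MongoDB Java sync driver (database and collection names are made up), showing how the write concern decides whether a write waits for the journal, which is where the extra cost for write traffic comes from:

```java
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class JournaledWriteDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("demo").getCollection("events");

            // Acknowledged write: confirmed by the primary, not necessarily journaled yet.
            events.withWriteConcern(WriteConcern.ACKNOWLEDGED)
                  .insertOne(new Document("type", "fast"));

            // Journaled write: acknowledged only after the journal reaches disk,
            // so it pays the extra I/O cost mentioned above.
            events.withWriteConcern(WriteConcern.JOURNALED)
                  .insertOne(new Document("type", "durable"));
        }
    }
}
```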
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be just as fast once the data has been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.

Resources