How to limit the write-ahead log (WAL) file used in OrientDB - caching

I use OrientDB version 2.1.3 in embedded mode. Everything is more than fine (performance is very good compared to the legacy H2 storage) except the storage space. I have very little information to store in the database, so I don't want HDD space to be wasted by temporary files.
In the database directory, I see the .wal file growing and growing (very fast). So I did some research on the internet and ended up with:
OGlobalConfiguration.DISK_CACHE_SIZE.setValue(16);
OGlobalConfiguration.WAL_CACHE_SIZE.setValue(16);
But this does nothing. The .wal file keeps growing, and even when I delete it, it grows past 16 MB again.
What can cause this file to grow even with that configuration in place?
Is there a way to keep the cache files under a known limit?

There are no cache files in the database. Data files are cached to speed up system performance. The more RAM is allocated to the disk cache, the faster your system will be. The amount of RAM allocated for the disk cache does not affect WAL size.
The properties you have set are not related to the WAL size. Instead, you should set OGlobalConfiguration#WAL_MAX_SIZE property.
Also, a single WAL segment (OGlobalConfiguration#WAL_MAX_SEGMENT_SIZE) is 128 megabytes by default, so the WAL cannot be smaller than 128 megabytes, or more precisely, than the value of that setting.
So, to wrap up, both properties (OGlobalConfiguration#WAL_MAX_SEGMENT_SIZE and OGlobalConfiguration#WAL_MAX_SIZE) should be set before any call to the OrientDB classes. Ideally, they should be set through system properties (storage.wal.maxSegmentSize and storage.wal.maxSize).
Please be aware that using such small values means the disk cache will have to be forcefully flushed after very few operations, to make it possible to truncate the database journal (WAL) and keep it very small.
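For example, a minimal sketch of an embedded setup with a capped WAL (the plocal path and the 16 MB / 32 MB values are illustrative, not recommendations from the answer above; only the OGlobalConfiguration constants and system property names come from it):

import com.orientechnologies.orient.core.config.OGlobalConfiguration;
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;

public class SmallWalExample {
    public static void main(String[] args) {
        // Must run before any other OrientDB class touches the storage.
        // Equivalent JVM flags: -Dstorage.wal.maxSegmentSize=16 -Dstorage.wal.maxSize=32
        OGlobalConfiguration.WAL_MAX_SEGMENT_SIZE.setValue(16); // megabytes per WAL segment
        OGlobalConfiguration.WAL_MAX_SIZE.setValue(32);         // megabytes for the whole WAL

        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/smallwal");
        if (db.exists()) {
            db.open("admin", "admin");
        } else {
            db.create();
        }
        try {
            // ... normal work with the database ...
        } finally {
            db.close();
        }
    }
}

Passing the JVM flags instead of calling setValue() avoids any risk of an OrientDB class being initialized before the limits are applied.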

Related

Why is Elasticsearch designed to be near real time?

As we know, a new document written into Elasticsearch usually becomes visible after about 1 second, because Elasticsearch makes data visible only after it has been moved from the in-memory buffer into segments.
However, since the newly added data is already stored in an in-memory cache, why not take that data into consideration directly, without waiting for it to be moved into segments?
The official documentation explains pretty well why it is designed the way it is: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/near-real-time.html
Sitting between Elasticsearch and the disk is the filesystem cache. Documents in the in-memory indexing buffer (Figure 1) are written to a new segment (Figure 2). The new segment is written to the filesystem cache first (which is cheap) and only later is it flushed to disk (which is expensive). However, after a file is in the cache, it can be opened and read just like any other file.
So, between the moment a document to be indexed arrives into the Elasticsearch in-memory indexing buffer (i.e. inside the JVM heap) and the moment it is written into a segment to the physical disk, it will transit through the filesystem cache (i.e. the remaining 50% of the physical RAM) where it is already searchable.
The transition from the indexing buffer to the filesystem cache is carried out by the refresh operation which happens in general every second, hence why "near real time". Then transiting the data from the filesystem cache to the disk requires a Lucene commit operation, which is a much more expensive operation and is performed less frequently.
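As an illustration of that trade-off, a client that cannot wait for the periodic refresh can ask for one explicitly. The sketch below uses only the JDK HTTP client and the documented refresh=wait_for URL parameter; the host, index name and document are made up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RefreshExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Index a document and block until the next refresh makes it searchable.
        // Without refresh=wait_for it only becomes visible after the periodic
        // refresh (index.refresh_interval, 1 second by default).
        HttpRequest index = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_doc/1?refresh=wait_for"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"user\":\"alice\",\"message\":\"hello\"}"))
                .build();
        System.out.println(client.send(index, HttpResponse.BodyHandlers.ofString()).body());

        // The document is now guaranteed to show up in searches.
        HttpRequest search = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_search?q=user:alice"))
                .build();
        System.out.println(client.send(search, HttpResponse.BodyHandlers.ofString()).body());
    }
}

Waiting for refreshes like this trades indexing throughput for visibility, which is exactly the cost the default one-second refresh interval is there to amortize.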

The approach for processing a large file on Spark

When I process a large file on a Spark cluster, an out-of-memory error occurs. I know I can extend the heap size, but in the general case I don't think that is a good approach. I am wondering whether splitting the large file into small files and processing them in batches is a good choice, so that we process small files in batches instead of one large file.
I have encountered the OOM problem too. Since Spark computes in memory, the data, the intermediate files and so on are all stored in memory. I think cache or persist will be helpful: you can set the storage level to MEMORY_AND_DISK_SER.
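For instance, a sketch of what that could look like with the Java API, assuming Spark 2.x (the input path and the word-count logic are placeholders; the relevant parts are that Spark already splits the file into partitions and that MEMORY_AND_DISK_SER lets partitions spill to disk instead of failing with an OOM):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class LargeFileJob {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("large-file-job"));

        // Spark splits the input into partitions (roughly one per HDFS block),
        // so each task only processes a slice of the file, never the whole thing.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/big-input.txt");

        // Serialized, disk-spillable storage level: partitions that do not fit
        // in memory are written to local disk instead of blowing up the heap.
        lines.persist(StorageLevel.MEMORY_AND_DISK_SER());

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///data/big-output");
        sc.stop();
    }
}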

Mongodb - make inmemory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I had a question about which design would bring better performance. These nodes will be dedicated only to MongoDB. For the sake of an example, say each node has 64GB of RAM.
The MongoDB docs state:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory:
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data were quite dynamic (it can range from 50GB to 75GB every few hours), would it theoretically perform better to let MongoDB manage itself with its cache (the default setup of Mongo), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, use swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. This provides an efficient way to access the data while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.
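If you want to check how much of the working set actually sits in RAM on such a node, one option is to read the mem section of serverStatus. A small sketch with the MongoDB Java driver (the connection string is a placeholder):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class MemCheck {
    public static void main(String[] args) {
        // Connection string is illustrative only.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            Document status = client.getDatabase("admin")
                    .runCommand(new Document("serverStatus", 1));

            // The "mem" section reports resident and virtual memory in MB; if
            // resident stays close to the working-set size, reads are effectively
            // served from RAM, as described above.
            Document mem = status.get("mem", Document.class);
            System.out.println("resident MB: " + mem.get("resident"));
            System.out.println("virtual  MB: " + mem.get("virtual"));
        }
    }
}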

what does " local caching of data" mean in the context of this article?

The following paragraphs of text (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially readable files are not suitable for local caching, but I don't understand what "local" means here.
There are two possible interpretations in my opinion: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory for clients to access quickly. Can anyone explain this further? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real MapReduce job is probably going to process GBs (tens, hundreds or thousands) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (a typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file/block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, especially since multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.
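For reference, the access pattern being described is a single long streaming pass over the file, which with the Hadoop FileSystem API looks roughly like the sketch below (the path is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // One sequential pass from start to end - exactly what HDFS is optimized
        // for. Nothing is cached client-side; re-reading the data simply streams
        // the blocks from the datanodes again.
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (FSDataInputStream in = fs.open(new Path("/data/big-input.txt"))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        System.out.println("streamed " + total + " bytes");
    }
}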

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100GB of data?

I have come to know that Cassandra uses Bloom filters for performance, and that it stores this filter data in physical memory.
1) Where does Cassandra store these filters? (In heap memory?)
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
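If you want to total those files up rather than eyeball them, here is a small sketch using plain Java NIO (the data directory shown is a common default and may differ on your install):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class BloomFilterSizes {
    public static void main(String[] args) throws IOException {
        // Adjust to the data_file_directories setting in cassandra.yaml.
        Path dataDir = Paths.get("/var/lib/cassandra/data");

        try (Stream<Path> files = Files.walk(dataDir)) {
            long totalBytes = files
                    .filter(p -> p.getFileName().toString().endsWith("-Filter.db"))
                    .mapToLong(p -> p.toFile().length())
                    .sum();
            System.out.printf("Bloom filters on disk: %.1f MB%n",
                    totalBytes / (1024.0 * 1024.0));
        }
    }
}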
