Chronicle Queue has the notion of RollCycles, which determine when an appender will switch to a new file. Would it be possible to have RollCycles based on other criteria, for instance the current file size? The appender would switch to a new file if the current one would exceed a certain file size or number of records.
There isn't a means of doing this currently, as the internal representation is time-based.
Instead, I suggest rolling often enough that the files don't get too large. You can use roll cycles of 1, 5, 10, 15, 20, 30, 60, 120, 240, or 360 minutes instead of daily.
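For illustration, a minimal sketch of configuring a shorter roll cycle with the open-source Java API (the queue path and payload are assumptions; the minute-based cycles above map to further RollCycles constants):

import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.RollCycles;

public class RollCycleExample {
    public static void main(String[] args) {
        // Roll to a new queue file every hour instead of daily.
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-dir") // path is an assumption
                .rollCycle(RollCycles.HOURLY)
                .build()) {
            queue.acquireAppender().writeText("hello");
        }
    }
}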
I'm trying to forecast steam consumption (1 hour ahead) using 11 continuous variables, but my predictions seem to be offset from the observed values.
All data cleaning was done, and during feature engineering I generated 140 variables, including the following (a small sketch of these features appears after the list):
Window features using the mean, median, standard deviation, minimum, and maximum (with window sizes of 30, 60, and 90 for each variable)
Lags of 30, 60 and 90 for each variable
Day, hour, and minute extracted from the datetime index
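For illustration, a minimal sketch of such lag and trailing-window features (plain Java with hypothetical helper names; the actual pipeline presumably uses a dataframe library):

import java.util.Arrays;

public class WindowFeatures {
    // Series value k steps in the past (NaN where no history exists yet).
    static double[] lag(double[] x, int k) {
        double[] out = new double[x.length];
        Arrays.fill(out, Double.NaN);
        for (int i = k; i < x.length; i++) out[i] = x[i - k];
        return out;
    }

    // Mean over the trailing window of w observations ending at each point.
    static double[] rollingMean(double[] x, int w) {
        double[] out = new double[x.length];
        Arrays.fill(out, Double.NaN);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i];
            if (i >= w) sum -= x[i - w];
            if (i >= w - 1) out[i] = sum / w;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(lag(x, 2)));         // [NaN, NaN, 1.0, 2.0, 3.0, 4.0]
        System.out.println(Arrays.toString(rollingMean(x, 3))); // [NaN, NaN, 2.0, 3.0, 4.0, 5.0]
    }
}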
I'm training a LightGBM regressor to do the forecasting task.
The model's MAPE on the unseen dataset was 1.8%, but my problem is an offset between predicted and observed values. I would like to know if there is any strategy I can use to remove or mitigate this offset.
This picture shows the offset of the predictions on unseen data.
I am monitoring disk space usage using Metricbeat.
Now I want alerts via Elastalert depending on the disk usage:
Alert when disk space used crosses 50%
Alert when disk space used crosses 70%
Alert when disk space used crosses 80%
Alert when disk space used crosses 95%
Alert when disk space used crosses 100%
Now the catch here is that an alert should be raised only once when usage crosses a given threshold (50, 70, 80, 95, 100).
So, if an alert has already been sent for crossing the 50% mark, no alert should be sent for 50.1% / 50.2% / ... / 69.9%.
The next alert should only be raised when usage crosses 70%.
Initial Approach:
if (dir_size == 50 || dir_size == 70 || dir_size == 80 || dir_size == 95 || dir_size == 100)
    alert
I planned to use an "any" rule to match the disk space field against the different values and alert. But this may generate false alerts too: if the storage sits saturated at exactly 50.0% for an hour (consider no new data written to the DB) and we evaluate rules every 10 minutes, it will raise the alert 6 times in that hour. I also don't want to use realert, as I don't know how long to wait.
Approach v1:
Make n rule configs, where n is the number of different conditions
Use a realert setting that is so long it's effectively "never":
realert:
  weeks: 9999
This approach is not ideal, as we do need repeated alerts.
Example: when usage drops below 50% and then crosses 50% again, an alert is required.
Approach v2:
A combination of two rules can be used (consider 50% only):
Rule 1: check disk space >= 50%, send mail, enable Rule 2, and disable itself using a command.
Rule 2: check disk space < 50%, enable Rule 1, and disable itself using a command.
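For illustration, a hypothetical sketch of what Rule 1 could look like (the index pattern, the Metricbeat field name, the address, and the toggle script are all assumptions, not Elastalert built-ins):

# rule1.yaml - hypothetical sketch: fires once usage reaches 50%,
# then swaps itself out for the "below 50%" rule via an external script
name: disk-above-50
type: any
index: metricbeat-*
filter:
- range:
    system.filesystem.used.pct:
      gte: 0.5
alert:
- email
- command
email:
- "ops@example.com"  # assumption
command: ["/opt/elastalert/toggle_rules.sh", "--disable", "rule1.yaml", "--enable", "rule2.yaml"]  # assumption

Rule 2 would mirror this with lt: 0.5 and the rule names swapped.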
Any better approach?
Created a custom rule. For more details, check this post:
Using Elastalert to monitor disk growth
I am developing a key-value database similar to Apache Cassandra and LevelDB.
It utilizes LSM trees (https://en.wikipedia.org/wiki/Log-structured_merge-tree).
Keys are strings and I am using C++.
Currently, data is stored on disk in several IMMUTABLE "sstables", each with two files:
data file - a "flat" file - record after record, sorted by key, without any disk block alignment.
index file - contains the number of records and an array of 8-byte (uint64_t) offsets to each record. You can position (seek) in it and find the offset of any record in the data file.
I have mmap-ed these two files in memory.
If the program needs to locate a record, it does a binary search. However, this means the program performs lots of disk seeks. I did some optimizations; for example, I actually do a jump search and sequentially read the last 40-50 records. But on a file with 1 billion keys, it still does 20-25 seeks (instead of 30).
This all works pretty fast - 4-5 seconds for 1 billion keys without the virtual memory cache (e.g. on the first-ever request) and way under 1 second with the virtual memory cache.
However, I am thinking of building some additional data structure on disk that can speed up lookups even more. I thought of using a "level-ordered array", e.g. instead of:
1, 2, 3, 4, 5, 6, 7
to be
4, 2, 6, 1, 3, 5, 7
In this case, the most frequently accessed keys - the ones every binary search touches first - will be at the beginning of the file, but I am not 100% sure it will help that much.
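For what it's worth, a minimal sketch of lookup in such a level-ordered array (this layout is sometimes called an Eytzinger layout); the sketch is in Java for brevity, but the same index arithmetic applies to a mmap-ed array in C++:

public class LevelOrderSearch {
    // Implicit tree: the children of slot i are at 2*i+1 and 2*i+2, so the
    // first few levels - the entries every search touches - sit at the
    // start of the array/file.
    static int search(long[] a, long key) {
        int i = 0;
        while (i < a.length) {
            if (a[i] == key) return i;
            i = key < a[i] ? 2 * i + 1 : 2 * i + 2;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        long[] levelOrdered = {4, 2, 6, 1, 3, 5, 7}; // level order of 1..7
        System.out.println(search(levelOrdered, 5)); // prints 5 (slot of key 5)
    }
}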
My second thought is to do something like a B-tree or B+-tree, but creation seems very complex, with lots of disk syncs - or at least that is how I see it.
Apache Cassandra uses key samples - it keeps every 1024th key in memory. I do not want to go this way because of memory consumption. However, if I keep the samples on disk in a "flat" file, it still takes lots of seeks to find the "sample" key.
Is there any alternative I am missing?
I have n files, 50 <= n <= 100, that contain sorted integers, all of them the same size, 250MB or 500MB.
e.g.
1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...
I am running this on a 4-core machine, and the goal is to merge the files as fast as possible.
Since the total size can reach 50GB, I can't read them into RAM.
So far I tried to do the following:
1) Read a number from every file and store the numbers in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read the next number from the file the lowest number came from (if that file is not empty).
Repeat steps 2-4 until no numbers are left.
Reading and writing is done using buffers of 4MB.
My algorithm above works correctly, but it's not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250MB than with 50 files x 500MB.
What is the most efficient merge algorithm in my case?
Well, you can first significantly improve efficiency by doing step (2) of your algorithm smartly. Instead of doing a linear search over all the numbers, use a min-heap; inserting and deleting the minimal value from the heap take logarithmic time, so this improves the speed for a large number of files. It changes the time complexity to O(n log k) from the naive O(n*k) (where n is the total number of elements and k is the number of files).
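For illustration, a minimal sketch of the heap-based merge (assuming one integer per line as text; binary files would swap the readers for DataInput streams):

import java.io.*;
import java.util.PriorityQueue;

public class KWayMerge {
    // One buffered cursor per input file, holding its current smallest value.
    static class Cursor {
        final BufferedReader in;
        long value;
        Cursor(BufferedReader in) { this.in = in; }
        boolean advance() throws IOException {
            String line = in.readLine();
            if (line == null) { in.close(); return false; }
            value = Long.parseLong(line.trim());
            return true;
        }
    }

    public static void merge(String[] inputs, String output) throws IOException {
        // Min-heap ordered by current value: poll/offer cost O(log k).
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.value, b.value));
        for (String f : inputs) {
            Cursor c = new Cursor(new BufferedReader(new FileReader(f), 4 << 20)); // 4 MB buffer
            if (c.advance()) heap.add(c);
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(output), 4 << 20)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();       // overall smallest current value
                out.write(Long.toString(c.value));
                out.newLine();
                if (c.advance()) heap.add(c); // refill from the same file
            }
        }
    }
}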
In addition, you need to minimize the number of "random" reads from the files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
(Java) Use GZIPInputStream and GZIPOutputStream for the .gz compression. Maybe that will reduce the disk I/O to some extent. Use fast rather than high compression.
Disk-head movement across the several files should then be reduced, say by merging files two at a time into ever larger sorted sequences.
For repetitions, maybe use run-length encoding - instead of repeating a value, add a repetition count: 11 12 13#7 15 (here 13 occurs 7 times).
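A minimal sketch of that encoding (using the value#count notation above):

public class RunLength {
    // Encodes {11, 12, 13 x7, 15} as "11 12 13#7 15".
    static String encode(long[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; ) {
            int run = 1;
            while (i + run < values.length && values[i + run] == values[i]) run++;
            if (sb.length() > 0) sb.append(' ');
            sb.append(values[i]);
            if (run > 1) sb.append('#').append(run);
            i += run;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(new long[]{11, 12, 13, 13, 13, 13, 13, 13, 13, 15}));
        // prints: 11 12 13#7 15
    }
}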
An effective way to utilize the multiple cores might be to perform input and output in threads distinct from the main comparison thread, such that all the cores are kept busy and the main thread never unnecessarily blocks on input or output: one thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
The input and output threads could also perform stream-specific pre- and post-processing - for example, depending on the distribution of the input data, a run-length encoding scheme of the type alluded to by @Joop might provide a significant speedup of the main thread by allowing it to efficiently order entire ranges of input.
Naturally all of this increases complexity and the possibility of error.
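As a rough sketch of such a pipeline (queue capacities, chunk shape, and a single reader are assumptions; a real version would run several readers and do the heap merge in the middle thread):

import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    static final long[] POISON = new long[0]; // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<long[]> input = new ArrayBlockingQueue<>(8);  // readers -> merger
        BlockingQueue<long[]> output = new ArrayBlockingQueue<>(8); // merger -> writer

        Thread reader = new Thread(() -> {
            try {
                // Real version: read 4 MB chunks from a subset of the files.
                for (int i = 0; i < 4; i++) input.put(new long[]{i, i + 1});
                input.put(POISON);
            } catch (InterruptedException ignored) { }
        });

        Thread writer = new Thread(() -> {
            try {
                for (long[] chunk = output.take(); chunk != POISON; chunk = output.take()) {
                    System.out.println(Arrays.toString(chunk)); // real version: write to the output file
                }
            } catch (InterruptedException ignored) { }
        });

        reader.start();
        writer.start();
        // The merge thread (here: main) only ever blocks on the queues, never on disk.
        for (long[] chunk = input.take(); chunk != POISON; chunk = input.take()) {
            output.put(chunk); // real version: heap-merge incoming chunks here
        }
        output.put(POISON);
        reader.join();
        writer.join();
    }
}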
I have a program that uses multiple threads and a queue. The threads shrink the queue (an item is removed once its data has been successfully processed), but there are certain items (which I don't know up-front) that simply cannot be processed and removed.
So say I have a queue of size 700; it goes down to 670, 340, 20... and then it gets stuck there.
Is there some way to tell Ruby to do something in case the queue size stays the same for more than X seconds?