I have a case where I have collected a lot of SNMP data and stored it via rrdtool. (Using OpenNMS)
My goal is to determine, among hundreds of servers, which ones haven't had memory usage exceed a certain amount within the past six months. (say, 64 gigs)
My plan was to write a bash script to extract and process the data from rrdtool, but I'm unsure how to start. This seems like a common enough task that I felt I should ask here if anyone has any ideas.
Thanks!
RRDTool terminology:
RRD : The RRDTool database file
DS : Data Source. One of the variables being measured
RRA : Round-Robin Archive. A consolidation archive defined in the RRD file
CF : Consolidation Function. MAX, MIN, AVERAGE, LAST. How the RRA consolidates the data
DP : Data Point. A sample of data as stored into the RRD before consolidation
CDP : Consolidated Data Point. A data point in an RRA, which corresponds to one or more DPs merged using the CF of that RRA.
I would suggest doing this in two parts.
First, extract the maximum of the values over the time period for each DS. This step is simplified considerably if you create an RRA with a MAX CF and an appropriate granularity, such as 1 day. How you do the extract will depend on whether you have a single RRD with many DSs or many RRDs with one DS each; either way, you will need to use rrdtool xport and NOT rrdtool fetch to retrieve the data, so that you get a single data value for each DS. The xport function of rrdtool lets you further consolidate your 1 CDP == 1 day RRA down to a single CDP: set the step to 6 months and force your DEF to use a MAX CF. The reason we use a 1-day RRA rather than a 6-month one is so that we can run the calculation on any date, not just once every 6 months.
Assuming your file is data1.rrd containing a single DS dsname for host host1:
rrdtool xport --end now --start "end - 6 months" --step 15552000 --maxrows 1 \
    DEF:x=data1.rrd:dsname:MAX \
    XPORT:x:host1
Next, you will need to threshold and filter these values to get the list of DSs whose MAX is below your threshold. This would be a simple process in bash that shouldn't tax you! A rough sketch of the whole loop follows.
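For illustration, here is a minimal sketch in Python (rather than pure bash) of what that loop could look like. It assumes the default XML output of rrdtool xport, that the DS is recorded in bytes, and the rrd_files mapping of host to RRD path and DS name is a placeholder you would build from your own OpenNMS directory layout:

#!/usr/bin/env python3
# Sketch only: list hosts whose 6-month MAX stays below 64 GiB.
import subprocess
import xml.etree.ElementTree as ET

THRESHOLD = 64 * 1024**3   # 64 GiB, assuming the DS stores bytes

rrd_files = {
    "host1": ("/var/rrd/host1/data1.rrd", "dsname"),   # hypothetical layout
}

for host, (rrd, ds) in rrd_files.items():
    xml_out = subprocess.check_output([
        "rrdtool", "xport", "--end", "now", "--start", "end-6months",
        "--step", "15552000", "--maxrows", "1",
        f"DEF:x={rrd}:{ds}:MAX", f"XPORT:x:{host}",
    ])
    # rrdtool xport prints XML by default; the single consolidated value
    # is the first <v> element in the <data> section.
    root = ET.fromstring(xml_out)
    values = [v.text for v in root.iter("v") if v.text and v.text != "NaN"]
    if values and float(values[0]) < THRESHOLD:
        print(host)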
I am quite new to "Big Data" technologies, especially Cassandra, so I need your advice for the task I have to do. I have been looking at the Datastax examples about handling time series, and at various discussions here about this topic, but if you think I might have missed something, feel free to tell me.
Here is my problem.
I need to store and analyze data coming from about 100 sensor stations that we are testing. In each sensor station, we have several thousand sensors. For each station, we run several tests (about 10, each one lasting about 2h30), during which the sensors record information every millisecond (the values can be boolean, integer or float). The records of each test are kept on the station during the test, then they are sent to me once the test is completed. That amounts to about 10 GB per test (each parameter is about 1 MB of information).
Here is a schema to illustrate the hierarchy:
[diagram from the original post: station → tests → sensors/parameters]
Right now, I have access to a small Hadoop cluster with Spark and Cassandra for testing. I may be able to install other tools, but I would really prefer to keep working with Spark/Cassandra.
My question is: what could be the best data model for storing then analyzing the information coming from these sensors?
By “analyzing”, I mean:
find the min, max and average value of a specific parameter recorded by a specific sensor on a specific station; or find those values for a specific parameter across all stations; or find those values for a specific parameter only when one or two other parameters of the same station are above a limit
plot the evolution of one or more parameters to compare them visually (the same parameter on different stations, or different parameters on the same station)
do some correlation analysis between parameters or stations (e.g. to find out whether a sensor is malfunctioning).
I was thinking of putting all the information in a Cassandra Table with the following data model:
CREATE TABLE data_stations (
station text, // station ID
test int, // test ID
parameter text, // name of recorded parameter/sensor
tps timestamp, // timestamp
val float, // measured value
PRIMARY KEY ((station, test, parameter), tps)
);
However, I don't know if one table would be able to handle all the data: a quick calculation gives 10^14 rows with the preceding data model (100 stations x 10 tests x 10,000 parameters x 9,000,000 ms (2h30 in milliseconds) ~= 10^14), even if each partition is "only" 9,000,000 rows.
Other ideas were to split the data into different tables (e.g. one table per station, or one table per test per station, etc.). I don't know what to choose or how, so any advice is welcome!
Thank you very much for your time and help, if you need more information or details I would be glad to tell you more.
Piar
You are on the right track; Cassandra can handle such data. You can store all the data you want in column families and use Apache Spark over Cassandra to do the required aggregations.
I feel Apache Spark is a good fit for your use case, as it can be used both for the aggregations and for calculating correlations.
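For illustration, a minimal PySpark sketch of the first kind of aggregation, assuming the table from the question, the DataStax spark-cassandra-connector on the classpath, and a made-up keyspace name (sensors) and parameter name (pressure_01):

from pyspark.sql import SparkSession, functions as F

# Sketch only: aggregate one parameter per station/test with Spark over Cassandra.
spark = (SparkSession.builder
         .appName("sensor-aggregates")
         .config("spark.cassandra.connection.host", "127.0.0.1")   # placeholder host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="sensors", table="data_stations")          # keyspace name is assumed
      .load())

stats = (df.filter(F.col("parameter") == "pressure_01")            # hypothetical parameter
           .groupBy("station", "test")
           .agg(F.min("val"), F.max("val"), F.avg("val")))
stats.show()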
You may also check out Apache Hive, as it can work on and query data in HDFS directly (through external tables).
Check these:
Cassandra - Max. size of wide rows?
Limitations of Cassandra
BACKGROUND
I have a binary classification task where the data is highly imbalanced. Specifically, there is far more data with label 0 than with label 1. To address this, I plan to subsample the data with label 0 to roughly match the size of the data with label 1. I did this in a Pig script. Instead of sampling only one chunk of training data, I did this 10 times to generate 10 data chunks and train 10 classifiers, similar to bagging, to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle data, I give a random number to each data
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
trainingChunkiRaw::id AS id,
trainingChunkiRaw::label AS label,
dataFeatures::features AS features,
RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem I have is that if I choose to generate 10 chunks of data, there are a huge number of mapper/reducer tasks, more than 10K. The majority of the mappers do very little work (they run for less than 1 min). And at some point, the whole Pig script gets jammed: only one mapper/reducer task can run and all other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
In order to figure out what was happening, I first reduced the number of chunks to 3. The situation was less severe: there were roughly 7 or 8 mappers running at the same time. Again, these mappers did very little work (each ran for about 1 min).
Then I increased the number of chunks to 5, and at this point I observed the same problem as with 10 chunks: at some point, there was only one mapper or reducer running and all other mappers and reducers were blocked.
I then removed part of the script so that it only stores id and label, without the features:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back:
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
id,
label,
features,
RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappeared. Even worse, at some point there was no mapper/reducer running at all; the whole program hung without making any progress. I added another machine and the program ran for a few minutes before it jammed again. It looks like there is some dependency issue here.
WHAT'S THE PROBLEM
I suspect there is some dependency that leads to a deadlock. The confusing thing is that before the shuffling, I have already generated the data chunks. I was expecting the shuffling to be executed in parallel, since these data chunks are independent of each other.
I also noticed there are many mappers/reducers that do very little work (they exist for less than 1 min). In such a case, I would imagine the overhead of launching mappers/reducers is high; is there any way to control this?
What's the problem? Any suggestions?
Is there a standard way to do this sampling? I would imagine there are many cases where we need to do such subsampling, like bootstrapping or bagging, so there might be some standard way to do this in Pig. I couldn't find anything useful online.
Thanks a lot
ADDITIONAL INFO
The size of table 'labelZeroTrainingData' is really small, around 16 MB gzipped.
Table 'labelZeroTrainingData' is also generated in the same Pig script, by filtering.
I ran the Pig script on 3 AWS c3.2xlarge machines.
Table 'dataFeatures' could be large, around 15 GB gzipped.
I didn't modify any default Hadoop configuration.
I checked the disk space and memory usage. Disk space usage is around 40%. Memory usage is around 90%. I'm not sure memory is the problem, since I was told that if memory were the issue, the whole task would fail.
After a while, I think I figured out something. The problem is likely to be the multiple STORE statements. It looks like a Pig script runs its statements as one batch by default, so for each chunk of data there is a job running, which leads to a lack of resources, e.g. mapper and reducer slots. None of the jobs can finish because each needs more mapper/reducer slots.
SOLUTION
Use piggybank. There is a storage function called MultiStorage which might be useful in this case. I had some version-incompatibility issues between piggybank and Hadoop, but it might work.
Disable Pig's batch execution of operations. Pig tries to optimize the execution; I simply disabled this multiquery feature by adding -M. So when you run the Pig script, it looks something like pig -M -f pig_script.pg, which executes one statement at a time without any optimization. This might not be ideal because no optimization is done; for me, it's acceptable.
Use EXEC in Pig to enforce a certain execution order, which is helpful in this case.
I've got a class in Parse with 1-4k records per user. These need to be replaced from time to time (actually these are records representing multiple timetables).
The problem I'm facing is that deleting and inserting these records is a ton of requests. Is there maybe a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a db or a file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is the req/s limit (or rather, as the docs state, the req/min limit).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
A possible solution would also be to redesign my datasets and use Array columns or something, but I'd rather avoid that if possible.
I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use saveAll/destroyAll and anything -All that Parse provides in its SDK.
You'd still hit the 1000-objects limit per query, but to counter that you can loop using the skip property of a request.
Set a limit of 1000 and a skip of 0, do the query, then increase the skip value by the previous limit, and so on; you'd have 2 or 3 requests of 1000 objects each time. You stop the loop when your result count is smaller than your limit; if it's not, you query again with the skip set to the limit × loop count. A rough sketch of this loop is shown below.
Now, since you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop would just run for longer until your number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
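Here is a rough sketch of that loop in Python against Parse's REST API (the class name "Entries", the keys and the page size are placeholders; the same limit/skip pattern applies with query.limit()/query.skip() in the JS SDK):

# Sketch only: page through a Parse class with limit/skip until exhausted.
import requests

APP_ID = "YOUR_APP_ID"          # placeholder
REST_KEY = "YOUR_REST_API_KEY"  # placeholder
HEADERS = {"X-Parse-Application-Id": APP_ID, "X-Parse-REST-API-Key": REST_KEY}

def fetch_all(class_name, page_size=1000):
    results, skip = [], 0
    while True:
        resp = requests.get(
            f"https://api.parse.com/1/classes/{class_name}",
            headers=HEADERS,
            params={"limit": page_size, "skip": skip},
        )
        batch = resp.json().get("results", [])
        results.extend(batch)
        if len(batch) < page_size:   # last page reached, stop looping
            break
        skip += page_size            # advance by the previous limit
    return results

entries = fetch_all("Entries")       # hypothetical class name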
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large number of small records which add up to a significant size (up to 500 KB of JSON [~1.5 MB of XML] in my current plans).
So I've chosen a middle path: I implemented a sort of vertical partitioning.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries have only 2 fields: the ID of the school record and data, which is of type Array.
I decided to split the "partitions" every 1000 records, which is about ~60-70 KB per record, though by my calculations it may go up to ~100 KB.
I also made the field names in the JSON 1 letter long, because every letter across 1000 records adds 1 or 2 KB, depending on the encoding.
That approach made the PHP code about twice as fast, and there is a lot less load on the network and the remote database (basically 1000 times fewer inserts/destroys).
So, that is my solution, if anybody has any other ideas, please post it as answer here, cause probably I'm not the only one with such problem and that certainly isn't the only solution.
We developed a Spring Batch application in which we have two flows: 1. Forward, 2. Backward. We only use file read/write; no DB is involved.
Forward scenario: the input file has records with 22 fields. The 22 fields are converted into 32 fields by operations such as sequence-number generation and adding a few filler fields. Based on country codes, the output is split into at most 3 files; each chunk has 250K records. (If there are millions of records, multiple files are generated for the same country.)
For 8 million records it takes 36 minutes.
The 8 million records are in a single file.
We are using Spring Batch with 1000 threads.
Backward flow: the input file has 82 fields per record. These 82 fields are converted into 86 fields. Two fields, taken from the Forward flow's input file, are added in between; the other fields are simply copied over. Error records also have to be written to an error file; the error records are nothing but the actual input records that came in for the Forward flow. To track them, we persist the sequence number and the actual records in a file; this is done in the Forward flow itself. We take that persisted file in the Backward flow and compare the sequence numbers; if anything is missing, we write it to the error records as key/value pairs. This process runs after the Backward flow completes.
The maximum size of an input file is 250K records.
For 8 million records it takes 1 hour 8 minutes, which is far too slow.
There are 32 input files (each 250K records) in this flow.
No threads are used in the Backward flow. I don't know what thread usage would look like there; I tried, but the process hung.
Server Configurations:
12 CPU & 64 GB Linux Server.
Can you guys help us improve the performance, given that we have 12 CPUs / 64 GB RAM?
You are already using 1000 threads, and that is a very high number. I have fine-tuned Spring Batch jobs, and this is what I have done (a rough sketch of points 1 and 2 follows after the list):
1. Reduce network traffic - try to reduce the number of calls to the database or file system in each process. Can you get all the info you need in one shot and save it in memory for the life of the thread? I have used org.apache.commons.collections.map.MultiKeyMap for storage and retrieval of such data.
For example, in your case you need the sequence-number comparison, so get all sequence numbers into one map before you start the process. You can store the ids (if there are not too many) in the step execution context.
2. Write less frequently - keep accumulating the info you need to write for some time and then write it out at the end.
3. Set unused objects to null at the end of the process to expedite GC.
4. Check your GC frequency through VisualVM or JConsole. You should see frequent GC happening while your process is running, which means objects are being created and garbage collected. If your memory graph keeps increasing, something is wrong.
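As a rough, language-agnostic sketch of points 1 and 2 (the real code would live in your Java reader/processor/writer; every file path and function name here is hypothetical):

# Sketch only: preload the lookup once (point 1), buffer output and flush in
# large blocks (point 2). File layouts and the transform are made up.

def load_sequence_numbers(path):
    # One pass over the forward-flow file: sequence number -> original record.
    with open(path) as f:
        return {line.split(",", 1)[0]: line.rstrip("\n") for line in f}

def transform(line, lookup):
    # Placeholder for the real per-record conversion.
    seq = line.split(",", 1)[0]
    return line.rstrip("\n") + "," + lookup.get(seq, "MISSING") + "\n"

def run(input_path, output_path, lookup_path, flush_every=50_000):
    lookup = load_sequence_numbers(lookup_path)   # point 1: fetch once, keep in memory
    buffer = []
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            buffer.append(transform(line, lookup))
            if len(buffer) >= flush_every:        # point 2: write in large batches
                dst.writelines(buffer)
                buffer.clear()
        dst.writelines(buffer)                    # flush the remainder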
I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set).
The queries will be performed against chip-sequencing data. Each record is one line in a file. To be clear, below is a sample record from the data set.
one line (record) looks like:
1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C
The highlighted field is called the "position of match", and the query we are interested in is the number of sequences in a certain range of this "position of match". For instance, the range could be "position of match" > 200 and "position of match" + 36 < 200,000.
Any suggestions on which Hadoop product I should start with to accomplish this task? HBase, Pig, Hive, or ...?
Rough guideline: if you need lots of queries that return quickly and you do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis- and aggregation-focused, you want Pig or Hive.
HBase allows you to specify start and end rows for scans, meaning it should satisfy the query example you provide, and it seems most appropriate for your use case.
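As an illustration, a rough sketch of such a range scan from Python using the third-party happybase client; the connection host, the table name and the zero-padded row-key scheme (row key = "position of match" padded to a fixed width so lexicographic order matches numeric order) are all assumptions:

# Sketch only: count records whose "position of match" falls in the example range
# from the question, assuming rows were loaded with a zero-padded position as key.
import happybase

connection = happybase.Connection("hbase-master")     # placeholder host
table = connection.table("sequences")                 # hypothetical table name

start, stop = 200, 200_000 - 36
count = 0
for _key, _data in table.scan(row_start=f"{start:012d}".encode(),
                              row_stop=f"{stop:012d}".encode()):
    count += 1
print(count)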
For posterity, here's the answer Xueling received on the Hadoop mailing list:
First, further detail from Xueling:
The datasets won't be updated often, but the queries against a data set are frequent. The quicker the query, the better. For example, we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables), and the slowest query against the biggest table (400,000,000 records) is around 12 minutes. So if using any Hadoop product can speed up the search, then that product is what we are looking for.
The response, from Cloudera's Todd Lipcon:
In that case, I would recommend the following:

Put all of your data on HDFS
Write a MapReduce job that sorts the data by position of match
As a second output of this job, you can write a "sparse index" - basically a set of entries giving, for roughly every 10K records, the file offset and the number of entries following it.

If you index every 10K records, then 5 billion total will mean 100,000 index entries. Each index entry shouldn't be more than 20 bytes, so 100,000 entries will be 2 MB. This is super easy to fit into memory. (You could probably index every 100th record instead and end up with 200 MB, still easy to fit in memory.)

Then, to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up the "number of entries following" column. The start and finish blocks will only be partially covered, so you can use the file-offset info to load those blocks off HDFS, start reading at those offsets, and finish the count.

Total time per query should be < 100 ms, no problem.
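For what it's worth, here is a rough Python sketch of the in-memory lookup described in that reply. The exact index-entry layout did not survive in the quote, so the (first position, file offset, entries in block) tuple and the read_block() HDFS reader are assumptions:

# Sketch of the count-range query over a sparse index. Each index entry is
# assumed to be (first_position_in_block, file_offset, entries_in_block);
# read_block(offset), standing in for an HDFS read at that offset, is hypothetical
# and should yield the "position of match" of every record in that block.
import bisect

def count_in_range(index, lo, hi, read_block):
    positions = [e[0] for e in index]            # index is sorted by position
    first = bisect.bisect_left(positions, lo)    # first block that may overlap
    last = bisect.bisect_right(positions, hi)    # one past the last such block
    if first > 0:
        first -= 1                               # the previous block may straddle lo
    total = 0
    for i in range(first, last):
        block_lo = index[i][0]
        block_hi = index[i + 1][0] if i + 1 < len(index) else float("inf")
        if lo <= block_lo and block_hi <= hi:
            total += index[i][2]                 # block fully inside: use the stored count
        else:
            # boundary block: read it off HDFS at the stored offset and count directly
            total += sum(lo <= pos <= hi for pos in read_block(index[i][1]))
    return total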
A few subsequent replies suggested HBase.
You could also take a quick look at JAQL (http://code.google.com/p/jaql/), though unfortunately it's for querying JSON data. Maybe it helps anyway.
You may want to look at NoSQL database approaches like HBase or Cassandra. I would prefer HBase, as it has a growing community.