Filtering time-based data in Pig - hadoop

I'm using Pig 0.11.1 in local mode for now, loading data from a CSV.
So far, I've been able to load our data set and perform the required calculations on it. The next step is to take some samples from the data and perform the same calculations. To replicate existing processes, we want to grab a data point every fifteen minutes.
This is where the trouble comes in. I can write a filter in Pig that will match if a data point is exactly on a fifteen-minute interval, but how would I grab data points that are near the fifteen-minute boundary?
I need to look at the fifteen-minute mark and grab the record that's there. If there is no record right on that mark (most likely), then I need to grab the next record after the mark.
I think I'll need to write my own Filter UDF, but it seems like the UDF would need to be stateful so that it knows when it's found the first match after the time interval. I haven't been able to find any examples of stateful UDFs, and from what I can tell it's probably a bad idea given that we won't know how data is mapped/reduced when eventually run against Hadoop.
I could do this in a couple of steps, by storing key/timestamp values and writing a Python script that would parse those. I'd really like to keep as much of this process in Pig as possible, though.
Edit: The data at its most basic is like this: {id:long, timestamp:long}. The timestamp is in milliseconds. Each set of data is sorted on timestamp. If record X falls exactly on a 15-minute boundary after the minimum timestamp (start time), grab it. Otherwise, grab the very next record after that 15-minute boundary, whenever that might be. I don't have a good example of what the expected results are because I haven't had time to sort through the data by hand.

It might be tricky in MapReduce to satisfy the condition "Otherwise, grab the very next record after that 15-minute boundary, whenever that might be", but if you change it slightly to "grab the previous record before that 15-minute boundary" then it becomes quite easy. The idea is that 15 minutes is 900000 milliseconds, so we can group the records into buckets that each cover 900000 milliseconds, sort each bucket by timestamp, and take the top record. Here is an example of the script, off the top of my head:
inpt = LOAD '....' AS (id:long, timestamp:long);
intervals = FOREACH inpt GENERATE id, timestamp, timestamp / 900000 as interval;
grp = GROUP intervals BY interval;
result = FOREACH grp {
    sorted = ORDER intervals BY timestamp DESC;
    latest = LIMIT sorted 1;
    GENERATE FLATTEN(latest);
};
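If you do need the behaviour described in the question, the first record at or after each boundary rather than the last one before it, ordering ascending in the same nested block should work. This is an untested sketch, and note that it still buckets on absolute epoch intervals rather than on the minimum timestamp mentioned in the question:
result = FOREACH grp {
    sorted = ORDER intervals BY timestamp ASC;
    first_after = LIMIT sorted 1; -- first record at or after the 15-minute boundary
    GENERATE FLATTEN(first_after);
};
Note that this yields at most one record per non-empty interval; an interval containing no records produces nothing rather than borrowing the next record, which is one way it still differs from the question's exact requirement.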

Related

How to exclude lowest value from average calculation in Kibana

I usually do this in Excel, but it is not easy for me to do the same in Kibana.
I have this table in Excel, and every hour I want to average the values in the field "detail" across all instances, but excluding the three lowest values (there are nine details each hour, so the average should be taken over only the six highest of them). In Excel I use the LARGE function.
https://docs.google.com/spreadsheets/d/1LcKO8TGl49dz6usWNwxRx0oVgQb9s_h1/edit?usp=sharing&ouid=114168049607741321864&rtpof=true&sd=true
In your opinion, is there any chance to do it directly in Kibana?
No idea how to proceed
You can use a Lens table visualization, set the number of rows to 6, and order the rows by descending CPU load. Look at the sample data table here
The average here is calculated for the top 6 values of bytes only.
Here are the settings:
You can try replacing clientIP here with detail, and bytes with CPU load.
No, it is not possible to automatically remove the lowest N results from the calculation in Kibana. You would have to manually filter them out of the visualization every time.
The only alternative I see is to add an extra step that deletes or flags the 3 results per hour you want to exclude, and then in Kibana you just add a regular filter.
The easiest way I can think of is creating a watcher that groups the results by hour, sorts them by CPU, and then ingests the top 6 results into a different index that you can query from Kibana.
Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-alerting.html
If this is acceptable to you, I can edit this answer with more details about the watcher I would create.
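For a rough idea of the kind of search such a watcher could run (this is only a sketch, not the full watcher offered above), a date_histogram per hour with a top_hits sub-aggregation sorted by CPU would return the six highest documents per hourly bucket; the field names @timestamp and cpu_load below are assumptions and would need to match your index:
{
  "size": 0,
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
      "aggs": {
        "top_cpu": {
          "top_hits": {
            "size": 6,
            "sort": [ { "cpu_load": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
The watcher's index action would then write those six hits per bucket into the separate index that Kibana queries.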

Data Structure for time scheduling?

I am in need of a data structure that can properly model blocks of time, like appointments. For example, each appointment has a time it starts on, and a time it ends on. I need to have extremely fast access to things like:
Does a specified start time and end time conflict with an existing event?
What events exist between a specified start time and end time?
Ideally the data structure could model something like the image below.
I thought of using a binary search tree (e.g., Java's TreeMap), but I can't think of what key or value I would use. Is there a single data structure or combination of data structures that is strong at modeling this?
A Guava Table would probably work for your use case, depending on what it is you want to actually index on.
A naive approach would be to index by name, then time of day, and then have a value indicating whether or not that particular block is occupied by that particular person.
This would make the instantiation of the object become...
Table<String, LocalDateTime, Boolean> calendar = TreeBasedTable.create();
You would populate each individual's allocation at a given interval. You get to set what that interval is, whether it's broken into 15-minute, 30-minute, or 1-hour periods (as defined by the table).
To find out whether a time is occupied, you look for the closest interval to the time you want to schedule a person for. You'd use the column() method to see if there's any availability, or you could get specific and get a row for the individual. This means you'd have to pull two values: the start time you want, and however many minutes out your interval is. That part I'll have to leave as an exercise for the reader.
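As a minimal sketch of that idea, assuming 15-minute slots keyed by their start time (the Schedule class and its method names below are purely illustrative, not part of Guava):
import com.google.common.collect.TreeBasedTable;
import java.time.LocalDateTime;
import java.util.Map;

public class Schedule {
    // Row key = person, column key = slot start time, value = occupied flag.
    private final TreeBasedTable<String, LocalDateTime, Boolean> calendar = TreeBasedTable.create();

    // Mark a single 15-minute slot as taken for one person.
    public void book(String person, LocalDateTime slotStart) {
        calendar.put(person, slotStart, Boolean.TRUE);
    }

    // Is anyone booked in this slot? column() gives a person -> occupied view of one slot.
    public boolean slotHasConflict(LocalDateTime slotStart) {
        Map<String, Boolean> slot = calendar.column(slotStart);
        return slot.containsValue(Boolean.TRUE);
    }

    // Is this person free for the whole [start, end) range, checked slot by slot?
    public boolean isFree(String person, LocalDateTime start, LocalDateTime end) {
        for (LocalDateTime t = start; t.isBefore(end); t = t.plusMinutes(15)) {
            if (Boolean.TRUE.equals(calendar.row(person).get(t))) {
                return false;
            }
        }
        return true;
    }
}
Because TreeBasedTable keeps its keys sorted, each person's row view comes back ordered by slot time, which helps when scanning a schedule in order.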

how to get N last records from rethinkdb fast?

Let's say you want to get the first N records from RethinkDB. That is easy to do with:
objects = r.db("db").getAll(val, {index:"index"}).limit(N)
But in order to get the last N records, you first have to get the count of objects and then do a slice, like this:
count = r.db("db").getAll(val, {index:"index"}).count()
objects = r.db("db").getAll(val, {index:"index"}).slice(count - N, count)
There is a huge difference in time:
First one with .Limit in golang takes: 63.28276ms
Second one with .Slice in golang takes: 1.028439202s
Doing an orderBy on some timestamp makes the whole thing even slower.
So as you can see, it is just crazy from a speed perspective. This query is executed on 26,000 documents in the database.
I need some idea on how to solve this.
So, I have tried many things; getting the last N records is always going to be slow, so I approached the problem a different way. I wrote a script that puts the last N records into a file/Redis every 20 seconds, and then the Go web app reads those records quickly.
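A rough sketch of that caching script in Go, assuming a hypothetical fetchLastN helper that runs the slow getAll/slice query against RethinkDB (the record fields and file name are placeholders):
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// Record mirrors the stored document; the fields here are placeholders.
type Record struct {
	ID        string `json:"id"`
	Timestamp int64  `json:"timestamp"`
}

// fetchLastN is a hypothetical helper that runs the slow getAll/slice
// query against RethinkDB and returns the last n records.
func fetchLastN(n int) ([]Record, error) {
	// ... driver call omitted ...
	return nil, nil
}

func main() {
	const n = 100
	ticker := time.NewTicker(20 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		records, err := fetchLastN(n)
		if err != nil {
			log.Println("refresh failed:", err)
			continue
		}
		data, err := json.Marshal(records)
		if err != nil {
			log.Println("marshal failed:", err)
			continue
		}
		// Write to a temp file and rename so readers never see a partial file.
		if err := os.WriteFile("last_n.json.tmp", data, 0o644); err != nil {
			log.Println("write failed:", err)
			continue
		}
		if err := os.Rename("last_n.json.tmp", "last_n.json"); err != nil {
			log.Println("rename failed:", err)
		}
	}
}
The web handler then just serves the contents of last_n.json (or the Redis key) instead of querying RethinkDB on every request.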

How to store time series data in a list (or any other data structure) to get reasonable trends over a variety of horizons?

Say I want to store a forex rate trend in which I receive two updates every second on average. I don't want to store every update against its timestamp over a whole day, as the data would be huge, but I do want to show every update for the last two minutes, every second update for the last hour, and so on with decreasing frequency over a day. Which algorithm/data structure is best for this?
You could use a circular buffer. But generally Stack Overflow is not for questions like that.
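A minimal sketch of that idea, with one fixed-size ring buffer per horizon, each keeping only every k-th update (the class name and the choice of k per horizon are assumptions, not anything standard):
import java.time.Instant;

// One buffer per horizon, e.g. keepEvery = 1 for the last 2 minutes,
// keepEvery = 2 for the last hour, and a larger value for the whole day.
class DownsampledRing {
    private final Instant[] times;
    private final double[] rates;
    private final int keepEvery;
    private int next;  // next write position
    private long seen; // total updates offered so far

    DownsampledRing(int capacity, int keepEvery) {
        this.times = new Instant[capacity];
        this.rates = new double[capacity];
        this.keepEvery = keepEvery;
    }

    // Offer every incoming tick; only every keepEvery-th one is stored,
    // overwriting the oldest entry once the buffer is full.
    void offer(Instant timestamp, double rate) {
        if (seen++ % keepEvery != 0) {
            return;
        }
        times[next] = timestamp;
        rates[next] = rate;
        next = (next + 1) % times.length;
    }
}
With roughly two updates per second, a capacity of 240 with keepEvery = 1 covers the last two minutes, and a capacity of 3600 with keepEvery = 2 covers the last hour, matching the frequencies described in the question.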

How should I select top 10% of the table?

I need to select the top x% of rows of a table in Pig. Could someone tell me how to do it without writing a UDF?
Thanks!
As mentioned before, first you need to count the number of rows in your table and then obviously you can do:
A = LOAD 'X' AS (row);
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A) AS cnt;
D = LIMIT A C.cnt / 10; -- you might need a cast to integer here
The catch is that dynamic argument support for the LIMIT operator was introduced in Pig 0.10. If you're working with a previous version, then a suggestion is offered here using the TOP function.
Not sure how you would go about pulling a percentage, but if you know your table size is 100 rows, you can use the LIMIT command to get the top 10%, for example:
A = load 'myfile' as (t, u, v);
B = order A by t;
C = limit B 10;
(Above example adapted from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator)
As for dynamically limiting to 10%, I'm not sure you can do this without knowing how 'big' the table is, and I'm pretty sure you couldn't do this in a UDF; you'd need to run a job to count the number of rows, then another job to do the LIMIT query.
I won't write the Pig code, as it would take a while to write and test, but I would do it like this (if you need the exact solution; if not, there are simpler methods):
1. Get a sample from your input, say a few thousand data points or so.
2. Sort this and find the n quantiles, where n should be somewhere in the order of the number of reducers you have, or somewhat larger.
3. Count the data points in each quantile.
4. At this point the minimum point of the top 10% will fall into one of these intervals. Find this interval (this is easy, as the counts tell you exactly where it is), and, using the sum of the counts of the larger quantiles together with the relevant quantile, find the 10% point within this interval.
5. Go over your data again and filter out everything but the points larger than the one you just found.
Portions of this might require UDFs.
