Extracting either one or two intervals in a tier - Praat

I'm new to Praat scripting, so bear with me: I have a for loop set up, and I want to extract data from three tiers. My first two tiers work beautifully, but I'm having trouble with the third.
In the third tier, at a given point in the loop, there could be either 1 or 2 elements. (A linguistics researcher is having me write this; I don't fully understand what exactly I'm extracting.) I don't know how to check how many elements there are. Is there a function that lets me get the number of elements at a given interval? My line of thought at the moment: get the number of elements in the third tier at that point in the loop. If there is only one, get it, assign it to the correct variable name, and move on. If there are two, grab both.

I can think of two ways to do this: "manually", and by extracting parts of the TextGrid.
Let's imagine (for clarity) that you want to count the number of points that fall within a given interval. There are some differences between this and counting intervals that fall within intervals, but baby steps.
Manually
What I mean by manually is that you can get the index of the "first" point within your interval (the first point after the beginning of the interval) and the index of the "last" point, and then just subtract one from the other (beware of fencepost errors!). If the first is 3 and the last is 8, you know there are 6 points in your interval.
Let's assume we have this:
textgrid = selected("TextGrid")
# The tier with the main interval
main_tier = 1
# The tier with the elements you want to count
sub_tier = 2
# The interval in the main tier
interval = 3
start = Get starting point: main_tier, interval
end = Get end point: main_tier, interval
Then we can do this:
first = Get high index from time: sub_tier, start
last = Get low index from time: sub_tier, end
total = last - first + 1
appendInfoLine: "There are ", total, " points within interval ", interval
(Or you could use the "Count points in range..." command in the tgutils CPrAN plugin).
If you were counting intervals, you'd have to change that slightly:
first = Get high interval at time: sub_tier, start
last = Get low interval at time: sub_tier, end
Or, if you wanted to count only those intervals that fall entirely within your main interval:
first = Get high interval at time: sub_tier, start
last_edge = Get interval edge from time: sub_tier, end
last = last_edge + 1
Extracting parts
An entirely different approach would be to use the "Extract part..." command for TextGrids. You can extract the part of the TextGrid that falls within your time window, and then work with that part only. Counting the number of intervals in that part is then simply a matter of counting the total number of intervals in that new TextGrid.
Of course, this does not check whether the intervals counted as being within the window fall entirely within it.
A simple example:
Extract part: start, end, "yes"
# And then you just count the intervals
intervals = Get number of intervals: sub_tier
# or points
points = Get number of points: sub_tier
If you want to do this repeatedly (e.g. for each of the intervals in your main tier), the tgutils plugin mentioned above has a script to "explode" TextGrids. Although the name might be a bit unnerving, this just separates a TextGrid into interval-sized chunks using the intervals in a given tier (by calling the same command mentioned above). As an example, if you "explode" a TextGrid using an interval tier with 5 intervals, you'd get as a result 5 smaller TextGrids, corresponding to each of the original intervals.
The script can preserve the time stamps of the resulting TextGrids, to make it easier to refer back to the original. And if run with a TextGrid and a Sound selected, it will "explode" the Sound too, so you can work on the two objects in combination.
(Full disclosure: I wrote that plugin).

Related

Time series data change point detection

I am trying to segment time series data into different zones.
In each time period, the pressure runs under an allowed maximum stress level (which was not known beforehand).
Edit: each time period is more than a week long.
How can I detect the start and end of the different time periods? Could anyone point me in the right direction?
Once the different time zones are divided, I guess I could average several maximum readings in each zone to get the maximum allowed stress.
I would take, say, enough values for one hour and calculate their average.
After that, you set each average in relation to the one before it.
Some pseudocode, to make it visual:
class Chunk {
    private double[] values; // e.g. one hour of readings
    double average();        // returns the mean of 'values'
}

enum Relation { FALLING, RISING, EQUAL }

void algorithm(Chunk[] chunks) {
    double[] averages = new double[chunks.length];
    for (int i = 0; i < chunks.length; i++)
        averages[i] = chunks[i].average();
    // Got the averages; now classify each step as rising, falling or equal.
    Relation[] relations = new Relation[chunks.length];
    for (int i = 1; i < chunks.length; i++) {
        double diff = averages[i] - averages[i - 1];
        if (diff == 0) // TODO: allow a small tolerance (e.g. deviations of +-3)
            relations[i] = Relation.EQUAL;
        else
            relations[i] = diff > 0 ? Relation.RISING : Relation.FALLING;
    }
    // After that, you have to find sequences of many FALLING or RISING,
    // followed by many EQUAL.
}
To process this array of Relations, you could divide it into smaller arrays and average those (mapping, say, FALLING=0, RISING=1, EQUAL=2). After that, you simply "merge" consecutive equal values like this:
F=FALLING
R=RISING
E=EQUAL
//Before merging
[RREEEEFFEEEEERRREEEE]
//After merging
[REFERE]
And there you can see the mountains and valleys.
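Since this merging step is the heart of the approach, here is a minimal concrete sketch of it in Java (using the Relation enum from the pseudocode above); it simply drops every value that repeats its predecessor:
import java.util.ArrayList;
import java.util.List;

class RelationMerger {
    // Collapse consecutive runs of the same relation into one entry,
    // e.g. [R,R,E,E,E,E,F,F,E,E,E,E,E,R,R,R,E,E,E,E] -> [R,E,F,E,R,E].
    static List<Relation> merge(Relation[] relations) {
        List<Relation> merged = new ArrayList<>();
        for (Relation r : relations) {
            if (merged.isEmpty() || merged.get(merged.size() - 1) != r) {
                merged.add(r);
            }
        }
        return merged;
    }
}
To recover where each mountain or valley starts, you would also store the start index of each run alongside its value, which is exactly the range-wrapping described next.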
Now, to get the exact values at which a mountain or valley starts, you have to extend Chunk a bit.
class Chunk {
    // Each entry pairs an x-axis time with a y-axis reading.
    private Tuple<Time, Double>[] values;
    // Returns the time range this chunk covers and the average over that range.
    Tuple<Tuple<Time, Time>, Double> average();
}
Furthermore, you can't use a raw Relation anymore; you have to wrap it with the range from where it starts to where it ends.

Operations on Two Streams of Data - Design Algorithm

I have seen this algorithm question or variants of it several times but have not found or been able to determine an optimal solution to the problem. The question is:
You are given two queues where each queue contains {timestamp, price} pairs. You have to print a "price1, price2" pair for all those timestamps where abs(ts1 - ts2) <= 1 second, where ts1 and price1 are from the first queue and ts2 and price2 are from the second queue.
How would you design a system to handle these requirements?
Then a follow-up to this question: what if one of the queues is slower than the other (its data is delayed)? How would you handle that?
You can do this in a similar fashion to the merging algorithm from merge sort, only doubled.
I'm going to describe an algorithm in which I choose queue #1 to be my "main queue." This will only provide a partial solution; I'll explain how to complete it afterwards.
At any time you keep one entry from each queue in memory. Whenever the two entries you have uphold your condition of being at most one second apart, print out their prices. Either way, discard the one with the lower timestamp and fetch the next entry from its queue. If at any point the timestamp from queue #1 is lower than that from queue #2, discard entries from queue #1 until that is no longer the case. If both have the same timestamp, print the pair and advance queue #1. Repeat until done.
This will print out all the pairs of "price1, price2" whose corresponding ts1 and ts2 uphold that 0 <= ts1 - ts2 <= 1.
Now, for the other half, do the same only this time choose queue #2 as your "main queue" (i.e. do everything I just said with the numbers 1 and 2 reversed) - except don't print out pairs with equal time stamps, since you've already printed those in the first part.
This will print out all the pairs of "price1, price2" whose corresponding ts1 and ts2 uphold that 0 < ts2 - ts1 <= 1, which is like saying 0 > ts1 - ts2 >= -1.
Together you get the printout for all the cases in which -1 <= ts1 - ts2 <= 1, i.e. in which abs(ts1 - ts2) <= 1.
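As a minimal sketch (illustrative names; assuming both queues are sorted by timestamp and yield (timestamp, price) entries), one pass of this merge could look like the Java below; running it once with the queues in each order, skipping equal timestamps on the second run, gives the full output:
import java.util.Deque;

class PairPrinter {
    record Entry(double ts, double price) {}

    // One pass with q1 as the "main queue": prints pairs where
    // 0 <= ts1 - ts2 <= 1. Set skipEqual on the second (swapped) pass
    // so equal-timestamp pairs are not printed twice.
    static void pass(Deque<Entry> q1, Deque<Entry> q2, boolean skipEqual) {
        Entry e1 = q1.poll(), e2 = q2.poll();
        while (e1 != null && e2 != null) {
            if (e1.ts() < e2.ts()) {  // catch queue #1 up to queue #2
                e1 = q1.poll();
                continue;
            }
            double diff = e1.ts() - e2.ts();
            if (diff <= 1 && !(skipEqual && diff == 0)) {
                System.out.println(e1.price() + ", " + e2.price());
            }
            e2 = q2.poll();           // advance the lower timestamp
        }
    }
}
Note that this keeps only one entry per queue in memory, exactly as described; like the simple merge in a later answer, such a single-lookahead version can miss combinations when several entries from both queues fall within the same second (the sliding-window answer below handles that case).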
In addition to the queues, use two hashmaps (one per queue).
As soon as a new item arrives, strip the seconds out and use the result as the key into the corresponding hashmap.
Using the very same key, retrieve all the items from the other hashmap.
One by one, compare whether each retrieved item is at most 1 second away from the item in step 2.
Note that this alone fails to detect items that straddle a minute boundary: 10:00:59 and 10:01:00 will not be matched.
To solve this:
for items like XX:XX:59, hit the hashmap twice, using keys XX:XX and XX:XX+1.
for items like XX:XX:00, hit the hashmap twice, using keys XX:XX and XX:XX-1.
Note: do a date addition (not a plain numeric one), since that automatically deals with things like 01:59:59 + 1 = 02:00:00, or Monday the 1st 23:59:59 becoming Tuesday the 2nd 00:00:00.
BTW, this algorithm also deals with the delay issue.
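A minimal sketch of this bucketing in Java (illustrative names; it uses epoch-second timestamps and the minute number ts / 60 as the key instead of an "XX:XX" string, so the ±1-minute neighbour lookup is plain integer arithmetic and the date-rollover cases handle themselves):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MinuteBuckets {
    record Entry(long ts, double price) {} // ts in epoch seconds

    private final Map<Long, List<Entry>> buckets = new HashMap<>();

    // Called when a new item arrives on this queue; 'other' holds the
    // other queue's buckets. Prints every pair within one second.
    void onArrival(Entry e, MinuteBuckets other) {
        long minute = e.ts() / 60;
        // Check the same minute and both neighbours, which covers
        // boundary cases like 10:00:59 vs 10:01:00.
        for (long m = minute - 1; m <= minute + 1; m++) {
            for (Entry o : other.buckets.getOrDefault(m, List.of())) {
                if (Math.abs(e.ts() - o.ts()) <= 1) {
                    System.out.println(e.price() + ", " + o.price());
                }
            }
        }
        buckets.computeIfAbsent(minute, k -> new ArrayList<>()).add(e);
    }
}
Because each arrival is indexed as it comes in, a delayed queue simply finds its partners already waiting in the other map, which is why this approach also copes with the delay.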
The speed of the queues does not matter at all if the algorithm is based on the comparison of timestamps alone. If one queue is empty and you cannot proceed, just check periodically until you can continue.
You can solve this by maintaining a list for one of the queues. In the algorithm below the first queue was chosen, so the list is called l1. It works like a sliding window (see the sketch after these steps).
1. Dequeue the 2nd queue: d2.
2. While the timestamp of the head of l1 is smaller than that of d2 and the difference is greater than 1, remove the head from l1.
3. Go through the list and print all pairs l1[i].price, d2.price as long as the difference of the timestamps is smaller than 1. If you don't reach the end of the list, continue with step 1.
4. Get the next element from the first queue and add it to the list. If the difference between the timestamps is smaller than 1, print the prices and repeat this step; if not, continue with step 1.
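A sketch of the same sliding window in Java (illustrative names; assumes both queues are sorted by timestamp, and uses <= 1 to match the problem statement). This variant refills the window before printing, so every valid combination is printed:
import java.util.ArrayDeque;
import java.util.Deque;

class SlidingWindow {
    record Entry(double ts, double price) {}

    static void run(Deque<Entry> q1, Deque<Entry> q2) {
        Deque<Entry> l1 = new ArrayDeque<>(); // window over queue 1
        for (Entry d2 : q2) {
            // Drop window entries more than 1 second older than d2.
            while (!l1.isEmpty() && d2.ts() - l1.peekFirst().ts() > 1) {
                l1.removeFirst();
            }
            // Pull from queue 1 until the next entry is over 1 second ahead.
            while (!q1.isEmpty() && q1.peekFirst().ts() - d2.ts() <= 1) {
                l1.addLast(q1.removeFirst());
            }
            // Every remaining window entry is within 1 second of d2.
            for (Entry e1 : l1) {
                System.out.println(e1.price() + ", " + d2.price());
            }
        }
    }
}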
Here is my solution; you need the following services:
Design a service to read messages from Queue1 and push the data to a DB.
Design another service to read messages from Queue2 and push the data to the same DB.
Design another service to read the data from the DB and print the results at whatever frequency is needed.
Edit
The system above is designed with the following points in mind:
Scalability: if the load on the system increases, the number of services can be scaled up.
Slowness: as already mentioned, one queue is slower than the other, so the first queue may receive more messages than the second; because both are buffered in the DB, the reader can still pair them up and produce the desired output.
Output frequency: if the requirement changes in the future and we want to show a 1-hour difference instead of a 1-second difference, that is also very much possible.
Get the first element from both queues.
Compare the timestamps. If within one second, output the pair.
From the queue that gave the earlier timestamp, get the next element.
Repeat.
EDIT:
After @maraca's comment, I had to rethink my algorithm. And yes, if there are multiple events within a second on both queues, it will not produce all the combinations.

What data structure can we use to efficiently check for resource availability?

This question is asked on behalf of reddit user /u/Dasharg95.
I want to build a hotel room reservation system where each hotel room can be booked for an arbitrary set of time frames. A common query against the reservation data set is trying to figure out what rooms are available for a given time frame. Is there a data structure for the reservation data set that allows this kind of query to be performed efficiently?
For example, say, we have five rooms with the following occupation times:
room 1: 9:00 -- 12:00, 15:00 -- 18:00, 19:30 -- 20:00
room 2: 8:00 -- 9:30, 15:30 -- 17:30, 18:00 -- 20:00
room 3: 6:30 -- 7:00, 7:30 -- 8:15
room 4: 12:00 -- 20:00
room 5: 7:00 -- 14:15, 18:00 -- 21:55
I want a data structure for the occupation times that is reasonably space efficient and allows for the following queries to be performed with reasonable performance:
what times a given room is occupied for
what rooms are free for the entirety of a given time frame
The 2D array system can still be useful without heavy resource usage. The room number can correspond to the index; for example, index i maps to room i + 1:
String[] rooms = {"taken", "not", "taken", "not", "taken"};
An index is the position of an element.
The second element, "not", is at index 1, since the first element is at index 0. To get the room number, add 1 to the index: if a hotel had just one room, it would be "Room 1", not "Room 0". So index + 1 holds the room number.
If you assign the times a fixed width (xxxx.yyyy, with xxxx being the opening time and yyyy the closing time), then you can split the element with a substring to get the first four / last four characters for each time, printing it out by putting a colon in the middle, like xx:xx.
It could be stored in a simple 1D array, like so:
String[] rooms = {"0900.1200", "1500.1800", "1930.2000"};
... edit: just realised that those times would all be for one room x( ...
So, to assign multiple times to one room, you might want to use a formatting system, like:
// * = the next four digits are an opening time
// - = the next four digits are a closing time
So you could hold multiple times in one element, like {"*0800-0930*1530-1730*1800-2000", ...}.
It's extremely convoluted, but this uses only one array, and the computer can use a while loop to check whether there are more times after a closing time; if there are none, move on to the next element (set of times) and room number (index).
Once you have cycled through all the elements, the room check is finished.
Just imagine you'd like to have it in 15-minute intervals: then you would have 24 × 4 = 96 different intervals, the first running from 0:00 to 0:15. Put this in binary, with some added information to check that you selected the right room, and you could use 100 bits per room. Now you create functions to build the bit string and to decode it, and store the strings in an array. Done.
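A sketch of that bit-per-slot idea in Java (hypothetical names; 15-minute granularity gives the 96 slots per day computed above, which fit comfortably in a java.util.BitSet):
import java.util.BitSet;

class RoomSchedule {
    static final int SLOTS_PER_DAY = 24 * 4; // 96 fifteen-minute slots

    private final BitSet occupied = new BitSet(SLOTS_PER_DAY);

    private static int slot(int hour, int minute) {
        return hour * 4 + minute / 15;
    }

    // Mark a half-open range [from, to) as booked, e.g. book(9, 0, 12, 0)
    // for 9:00 -- 12:00. Half-open means a booking ending at 12:00 does
    // not collide with one starting at 12:00.
    void book(int fromHour, int fromMin, int toHour, int toMin) {
        occupied.set(slot(fromHour, fromMin), slot(toHour, toMin));
    }

    // A room is free for the whole frame iff no occupied slot overlaps it.
    boolean isFree(int fromHour, int fromMin, int toHour, int toMin) {
        return occupied.get(slot(fromHour, fromMin), slot(toHour, toMin)).isEmpty();
    }
}
Finding all rooms free for a given time frame is then a linear scan over the rooms calling isFree; listing the times a room is occupied is just reading its set bits back.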

Need help aligning time series data in MATLAB

So my friend and I ran an experiment two weeks ago, and we've encountered something a bit weird. I should preface this by saying that I don't really program much, so sorry if this is a dumb question that seems like a waste of time.
Let's say we have data set A and data set B (the experiment itself doesn't matter). All of the times are given in fractional days. The format of the data should all match, but the times at which the data points were recorded aren't necessarily aligned between the sets (they each have their own time vectors). For example, the measurements for data set A are recorded every 100 ms, whereas the instrument for data set B averages and only records a point once every minute or so. My problem is aligning the times of the different types of data collected. For data set A, the data and time vectors have a length of 25042 (25042x1 double); data set B and its time vector have a length of 828 (828x1 double).
It comes down to this: I need to look at data set B and find the times that correspond to peaks in its data. Those times are the only times of interest to me in data set A. That's why I need a way of aligning the time vectors/series, and thus the data. If an exact solution isn't possible, even an approximation would be a great help. Does anybody have any ideas?
So you have two time vectors, tA and tB, and a vector bIndices containing the indices of the known peak(s) in B; these correspond to time(s) tB(bIndices(:)). Loop through bIndices and, for each peak, search tA from the start until you hit the first time that is greater than or equal to tB(b):
bIndices = [101, 403,...]; %Vector containing the indices of the peaks in 'tB'
aIndices = []; %Allocate an empty vector
A = []; %Allocate an empty vector
B = []; %Allocate an empty vector
for b = bIndices %Cycle through all peak indices, one at a time
    B = [B tB(b)]; %Retrieve the actual peak time using the index and concatenate it
    for a = 1:length(tA) %Loop through the entire time vector tA
        if (tA(a) >= tB(b)) %First time in tA at or after the peak time
            %Concatenate the newly found index 'a' from tA to the vector aIndices:
            aIndices = [aIndices a];
            %Concatenate the newly found time 'tA(a)' to the time vector A:
            A = [A tA(a)];
            break; %Exit the inner loop and move on to the next index 'b'
        end
    end
end
At the end, A stores the peak times matched in tA (approximately; probably a little later than the true peak times). A - B is the discrepancy between the two time vectors (both should be the same length); it should be pretty small, and any zeros mean the two aligned perfectly at those instants. aIndices holds the corresponding indices of tA at the desired times. I didn't actually test this code, but hopefully the logic is sound.

Simple Popularity Algorithm

Summary
As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively, meaning you only need to store a single popularity value with each object and then you can recursively adjust this value when an event occurs. There's no need to record every event.
This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:
(a * t) + ((1 - a) * p)
a — coefficient between 0 and 1 (higher values discount older events faster)
t — current timestamp
p — current popularity value (e.g. stored in a database)
Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days. Choosing N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.
To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:
{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}
Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.
One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.
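For completeness, here is the same update outside of Twig, as a small Java sketch (names illustrative); the three initialization strategies from the caveat above are simply three different starting values for p:
class Popularity {
    static final double ALPHA = 0.02; // a = 2/(N+1) with N = 100

    // Recursive EMA update: one stored value per item, adjusted per event.
    static double update(double p, long eventTimestamp) {
        return ALPHA * eventTimestamp + (1 - ALPHA) * p;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // Initialize with the current timestamp (promotes new content);
        // alternatives: the app's launch timestamp (items climb from the
        // bottom), or the database-wide average (items start mid-pack).
        double p = now;
        p = update(p, now + 60); // a view one minute later
        System.out.println(p);
    }
}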
Original Proposal
There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:
p = (p + t) / 2
Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:
Initialize p with the current timestamp t
Initialize p with the average of all p values in the database
Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. On the other hand, initialization method (2) treats new items as equals when compared to historical items.
Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.
With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.
An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/
If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.
Now for my question: do you see any major flaws with this approach?
I think this is a very good approach, given its simplicity. A very interesting result.
I made a quick set of calculations and found that this algorithm does seem to capture what "popularity" means. Its problem is that it has a clear tendency to favor recent votes, like this:
Imagine we take time and break it into discrete timestamp values ranging from 100 to 1000. Assume that at t=100 both items A and B have the same popularity P = 100.
A gets voted 7 times, on 200, 300, 400, 500, 600, 700 and 800, resulting in a final Pa(800) = 700 (approx.).
B gets voted 4 times, on 300, 500, 700 and 900, resulting in a final Pb(900) = 712 (approx.).
When t=1000 comes, both A and B receive votes, so:
Pa(1000) = 850 with 8 votes
Pb(1000) = 856 with 5 votes
Why? Because the algorithm allows an item to quickly beat historical leaders if it receives more recent votes, even if it has fewer votes in total.
EDIT INCLUDING SIMULATION
The OP created a nice fiddle that I changed to get the following results:
http://jsfiddle.net/wBV2c/6/
Item A receives one vote each day from 1970 till 2012 (15339 votes)
Item B receives one vote each month from Jan to Jul 2012 (7 votes)
The result: B is more popular than A.
The proposed algorithm is a good approach, and is a special case of an Exponential Moving Average where alpha=0.5:
p = alpha*p + (1-alpha)*t = 0.5*p + 0.5*t = (p+t)/2 //(for alpha = 0.5)
A way to counter the fact that the proposed solution with alpha = 0.5 tends to favor recent votes (as noted by daniloquio) is to choose a higher value for alpha (e.g. 0.9 or 0.99). Note that applying this to the test case proposed by daniloquio doesn't work, however, because with a higher alpha the algorithm needs more 'time' to settle (so the arrays would need to be longer, which is often true in real applications).
Thus:
for alpha=0.9 the algorithm averages approximately the last 10 values
for alpha=0.99 the algorithm averages approximately the last 100 values
for alpha=0.999 the algorithm averages approximately the last 1000 values
etc.
I see one problem: only the last ~24 votes count.
p_{i+1} = (p_i + t_{i+1}) / 2
For two votes we have:
p2 = (p1 + t2) / 2 = ((p0 + t1) / 2 + t2) / 2 = p0/4 + t1/4 + t2/2
Expanding that for 32 votes gives:
p32 = p0*2^-32 + t1*2^-32 + t2*2^-31 + ... + t32*2^-1
So for 32-bit timestamp values, the initial value p0 and the oldest vote t1 have effectively no effect on the result: divided by 2^32, they contribute less than one unit to p32.
If two items A and B receive the same last 32 votes, they will have the same popularity, no matter how different their earlier histories are. So your history only goes back 32 votes: there is no difference between 2032 votes and 32 votes if the last 32 are the same.
If the timestamp differences are less than a day (86400 seconds, i.e. less than 2^17), the values will be equal after just 17 votes.
The flaw is that something with 100 votes is usually more meaningful than something with only one recent vote. However it isn't hard to come up with variants of your scheme that work reasonably well.
I don't think the logic discussed above is going to work:
p_{i+1} = (p_i + t) / 2
Article A gets viewed at timestamps 70, 80 and 90: popularity(A) = 82.5
Article B gets viewed at timestamps 50, 60, 70, 80 and 90: popularity(B) = 80.625
In this case, the popularity of Article B should have been higher. Firstly, Article B was viewed as recently as Article A, and secondly, it was also viewed more times than Article A.
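These figures are easy to reproduce; a quick Java sketch (assuming, as the example implies, that p is initialized with the first view's timestamp):
class Replay {
    // Repeatedly apply p = (p + t) / 2, starting from the first view.
    static double popularity(double... views) {
        double p = views[0];
        for (int i = 1; i < views.length; i++) {
            p = (p + views[i]) / 2;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(popularity(70, 80, 90));         // 82.5
        System.out.println(popularity(50, 60, 70, 80, 90)); // 80.625
    }
}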
