Count number of observations by group matlab - performance

I have a matlab dataset that looks like this:
year value
1995 90000
1995 53000
1995 80000
1995 60000
1995 37000
1995 42000
1995 13102
1996 35000
1996 50000
1996 32000
1996 47000
1997 36000
1997 90000
1997 NaN
1997 90000
1997 51500
1997 81000
1998 71000
(...)
2020 68000
These are two separate columns of data.
Now I want to count the number of non-NaN observations in column value between 2010 and 2020 per year i.e. the output should look like:
year count
2010 20
2011 31
(...)
2020 9
If any count is zero, it should show up as zero.
I know I can do it with a very simple loop (example below). But this is very inefficient for a large dataset. I was looking into accumarray, but could not figure out how to do it.
N = 300;
%Generate years vector
years = round(1996 + (2020-1996) .* (rand(N,1)));
years = sort(years);
% Generate values vector
values = rand(N,1);
NaN_position = rand(N,1)>.9; %Now put some random NaNs
values(NaN_position) = NaN;
count = 1;
for y=min(years):max(years)
indicator = years == y;
count_vals(count,1) = sum(not(isnan(values(indicator))));
count = count + 1;
end

Let the data be defined as:
years = [1995 1995 1995 1995 1995 1995 1995 1996 1996 1996 1996 1997 1997 1997 1997 1997 1997 1998 2020].';
values = [90000 53000 80000 60000 37000 42000 13102 35000 50000 32000 47000 36000 90000 NaN 90000 51500 81000 71000 68000].';
year_min = 1996;
year_max = 1998;
Then:
result_year = year_min:year_max;
result_count = histcounts(years(~isnan(values)), [result_year year_max+.5]);
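The term year_max+.5 is needed as an extra final edge in the second input of histcounts because, as per the documentation, the last bin includes its right edge. Without it, the edges result_year alone would define only two bins, and the last one, [1997, 1998], would count 1997 and 1998 together.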
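For readers outside MATLAB, the same idea (drop the NaN rows first, then tally per year so that empty years still show up as zero) can be sketched in Python. This is only an illustration using the standard library, not a translation of histcounts; the function name count_per_year is made up for this example.

```python
import math
from collections import Counter

def count_per_year(years, values, year_min, year_max):
    """Count non-NaN values for each year in [year_min, year_max]."""
    # Keep only the years whose value is not NaN, then tally them.
    tally = Counter(y for y, v in zip(years, values) if not math.isnan(v))
    # A Counter returns 0 for missing keys, so empty years appear as zero.
    return [(y, tally[y]) for y in range(year_min, year_max + 1)]

# Same sample data as in the answer above.
years = [1995] * 7 + [1996] * 4 + [1997] * 6 + [1998, 2020]
values = [90000, 53000, 80000, 60000, 37000, 42000, 13102,
          35000, 50000, 32000, 47000,
          36000, 90000, float("nan"), 90000, 51500, 81000,
          71000, 68000]

print(count_per_year(years, values, 1996, 1998))
# [(1996, 4), (1997, 5), (1998, 1)]
```

A single pass over the data replaces the per-year scan of the loop in the question, which is the same reason the histcounts version is faster.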

Related

Finding cumulative sum of MAX values

I need to calculate the cumulative sum of Max value per period (or per category). See the embedded image.
So, first, I need to find max value for each category/month per year. Then I want to calculate the cumulative SUM of these max values. I tried it by setting up max measure (which works fine for the first step - finding max per category/month for a given year) but then I fail at finding a solution to finding cumulative SUM (finding the cumulative Max is easy, but it is not what I'm looking for).
Table1
Year Month MonthlyValue MaxPerYear
2016 Jan 10 15
2016 Feb 15 15
2016 Mar 12 15
2017 Jan 22 22
2017 Feb 19 22
2017 Mar 12 22
2018 Jan 5 17
2018 Feb 16 17
2018 Mar 17 17
Desired Output
Year CumSum
2016 15
2017 37
2018 54
This is a bit similar to this question and this question and this question as far as subtotaling, but also includes a cumulative component as well.
You can do this in two steps. First, calculate a table that gives the max for each year and then use a cumulative total pattern.
CumSum =
VAR Summary =
    SUMMARIZE(
        ALLSELECTED(Table1),
        Table1[Year],
        "Max",
        MAX(Table1[MonthlyValue])
    )
RETURN
    SUMX(
        FILTER(
            Summary,
            Table1[Year] <= MAX(Table1[Year])
        ),
        [Max]
    )
Here's the output:
If you expand to the month level, then it looks like this:
Note that if you only need the subtotal to work leaving each row as a max (15, 22, 17, 54) rather than as a cumulative sum of maxes (15, 37, 54, 54), then you can use a simpler approach:
MaxSum =
SUMX(
    VALUES( Table1[Year] ),
    CALCULATE( MAX( Table1[MonthlyValue] ) )
)
This calculates the max for each year separately and then adds them together.
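To make the two-step logic concrete outside DAX, here is a small Python sketch that works on the Table1 rows from the question. It is an illustration of the pattern (max per year, then a running total), not of how the DAX engine evaluates the measure; the function name cumulative_max_sum is made up for this example.

```python
from itertools import accumulate

# (Year, Month, MonthlyValue) rows from Table1 in the question.
rows = [
    (2016, "Jan", 10), (2016, "Feb", 15), (2016, "Mar", 12),
    (2017, "Jan", 22), (2017, "Feb", 19), (2017, "Mar", 12),
    (2018, "Jan", 5), (2018, "Feb", 16), (2018, "Mar", 17),
]

def cumulative_max_sum(rows):
    """Return [(year, running sum of yearly maxima)] in year order."""
    # Step 1: max MonthlyValue per year (the role of the Summary table).
    max_per_year = {}
    for year, _month, value in rows:
        max_per_year[year] = max(value, max_per_year.get(year, value))
    # Step 2: running total of those maxima, ordered by year.
    ordered = sorted(max_per_year)
    totals = accumulate(max_per_year[y] for y in ordered)
    return list(zip(ordered, totals))

print(cumulative_max_sum(rows))
# [(2016, 15), (2017, 37), (2018, 54)]
```

The last running total, 54, is also what the simpler MaxSum measure returns at the grand-total level.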
External References:
Subtotals and Grand Totals That Add Up “Correctly”
Cumulative Total - DAX Patterns

Calculating which day of the week a date falls on using Gauss's algorithm, ordinal date and modulo arithmetic

After calculating which day of the week the 1st of January falls on using Gauss's algorithm, as well as calculating the ordinal date for a given calendar date, how can the day of the week of the latter date be calculated?
For example, Gauss's algorithm can tell us that, this year, the 1st of January fell on a Sunday, the 7th day of the week. Today is the 22nd of October, with an ordinal day of 295. How can this information be used to calculate that today is a Sunday?
For common years (= non-leap years), 1st of January and 1st of October are on the same day of the week:
Jan 31
Feb 28
Mar 31
Apr 30
May 31
Jun 30
Jul 31
Aug 31
Sep 30
Sum 273 = 39 x 7
See Wikipedia
22nd October is exactly three weeks later than 1st of October.
An approach I've found, which I haven't tested extensively, but seems to work with the dates I've thrown at it, is...
(ordinal day + day of 1st of January - 1) % 7
Where Mon = 1, Tue = 2,..., Sat = 6, Sun = 0.
In the example mentioned in the question:
(295 + 0 - 1) % 7 = 0 (Sunday)
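Since the formula above is described as not extensively tested, here is a short Python sketch that checks it against the standard library's datetime module for every day of a year. The helper names weekday_from_ordinal and check_year are made up for this example, and 2017 and 2020 are chosen only as one common year and one leap year.

```python
from datetime import date, timedelta

def weekday_from_ordinal(ordinal_day, jan1_weekday):
    """Weekday (Sun = 0, Mon = 1, ..., Sat = 6) of the given day of the year.

    jan1_weekday is the weekday of 1 January in the same numbering.
    """
    return (ordinal_day + jan1_weekday - 1) % 7

def check_year(year):
    """Compare the formula with datetime for every day of the given year."""
    jan1 = date(year, 1, 1)
    jan1_weekday = jan1.isoweekday() % 7  # isoweekday: Mon = 1 ... Sun = 7
    day = jan1
    while day.year == year:
        ordinal_day = day.timetuple().tm_yday
        if weekday_from_ordinal(ordinal_day, jan1_weekday) != day.isoweekday() % 7:
            return False
        day += timedelta(days=1)
    return True

print(weekday_from_ordinal(295, 0))        # example from the question: 0 (Sunday)
print(check_year(2017), check_year(2020))  # True True
```

The formula needs no leap-year handling of its own, because the ordinal day already accounts for 29 February.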

gnuplot time axis from two different columns

I'm trying to plot some data from a four-column file. The first column is the record number, the second is the year, the third is the month, and the last is the temperature value. I would like the x axis to take its date from the second and third columns.
The text file looks like this:
1 1990 2 265.78945923
2 1990 3 260.53842163
3 1990 4 265.00366211
4 1990 5 277.61206055
5 1990 6 284.72595215
6 1990 7 291.54879761
7 1990 8 293.61392212
8 1990 9 288.47149658
9 1990 10 284.55172729
12 1991 1 285.98762388
13 1991 2 283.47484293
I'm using a code like this:
set xdata time
set timefmt '%Y %m'
plot 'datafile' u 2:4
But it doesn't work. I would like to have the year and the month on my x axis.
All help appreciated! Thanks

Summing by Column

Suppose we have the following columns:
X Y Z
Category Date Amount
A January 10
A February 20
A March 30
B January 34
B February 45
B March 65
C January 87
C February 98
C March 100
D January 80
D February 90
I want to sum the Amount column by Category and Date. So for Category A, the sum of the amounts would be 10+20+30 = 60 for the dates between January and March. In Oracle BI, how would we do this? Note that some categories might have missing dates, so I want to sum the amounts for only the available dates between January and March. Category D, for example, has March missing, so its total amount would be 80+90 = 170.
When I do the following, I just get the sum of all the amounts:
sum("Z"."Amount")
If the required result has to be achieved through OBIEE Answer, then it can be done in following way.
Create a table with columns - Category, Date, Amount.
Go to Results tab. Edit view of the table.
Click on Total By icon above Category column. Both After and Report-Based Total (when applicable) should be ticked.
The result will look like this:
Category Date Amount
A January 10
February 20
March 30
A Total 60
B January 34
February 45
March 65
B Total 144
C January 87
February 98
March 100
C Total 285
D January 80
February 90
D Total 170
You can do this quite simply by editing the column formula from within the Criteria. When you look at it to begin, your Amount column formula probably looks something like "Z"."Amount". You can edit this slightly to change the aggregation level:
sum("Z"."Amount" by "X"."Category")
That should give you something like:
Category Date Amount
A Jan 60
A Feb 60
A Mar 60
B Jan 144
B Feb 144
B Mar 144
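As a plain illustration of what "sum by Category" means for the sample data (this is not OBIEE syntax, only the grouping logic), here is a short Python sketch. The function name sum_by_category is made up for this example.

```python
from collections import defaultdict

# (Category, Date, Amount) rows from the question.
rows = [
    ("A", "January", 10), ("A", "February", 20), ("A", "March", 30),
    ("B", "January", 34), ("B", "February", 45), ("B", "March", 65),
    ("C", "January", 87), ("C", "February", 98), ("C", "March", 100),
    ("D", "January", 80), ("D", "February", 90),
]

def sum_by_category(rows):
    """Total the amounts per category, using only the rows that exist."""
    totals = defaultdict(int)
    for category, _date, amount in rows:
        totals[category] += amount
    return dict(totals)

print(sum_by_category(rows))
# {'A': 60, 'B': 144, 'C': 285, 'D': 170}
```

Category D has no March row, so its total simply covers the two months that are present.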

Find the time period with the maximum number of overlapping intervals

This is a well-known problem, and I am asking it here.
You are given the life spans of a number of elephants, where a life span runs from the year of birth to the year of death.
You have to find the period during which the maximum number of elephants are alive.
Example:
1990 - 2013
1995 - 2000
2010 - 2020
1992 - 1999
Answer is 1995 - 1999
I tried hard to solve this, but I am unable to do so.
How can I solve this problem?
I already have an approach for queries that ask for the number of elephants alive in a given year: I use a segment tree and, for each elephant's life span, increase every year in that span by 1. Can this be used to solve the above problem?
For above question, I only need the high-level approach, I will code it myself.
Split each date range into start date and end date.
Sort the dates. If a start date and an end date are the same, put the end date first (otherwise you could get an empty date range as the best).
Start with a count of 0.
Iterate through the dates using a sweep-line algorithm:
  If you get a start date:
    Increment the count.
    If the current count is higher than the best count so far, store this start date and set a flag.
  If you get an end date:
    If the flag is set, store the stored start date and this end date with the count as the best interval so far.
    Reset the flag.
    Decrement the count.
Example:
For input:
1990 - 2013
1995 - 2000
2010 - 2020
1992 - 1999
Split and sorted: (S = start, E = end)
1990 S, 1992 S, 1995 S, 1999 E, 2000 E, 2010 S, 2013 E, 2020 E
Iterating through them:
count = 0
lastStart = N/A
1990: count = 1
count = 1 > 0, so set flag
and lastStart = 1990
1992: count = 2
count = 2 > 0, so set flag
and lastStart = 1992
1995: count = 3
count = 3 > 0, so set flag
and lastStart = 1995
1999: flag is set, so
record [lastStart (= 1995), 1999] with a count of 3
reset flag
count = 2
2000: flag is not set
reset flag
count = 1
2010: count = 2
since count = 2 < 3, don't set flag
2013: flag is not set
reset flag
count = 1
2020: flag is not set
reset flag
count = 0
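The question asked only for the high-level approach, but for readers who want to check the trace above, here is a minimal Python sketch of the same sweep. It assumes intervals are given as (start, end) pairs; the function name busiest_period is made up for this example.

```python
def busiest_period(intervals):
    """Return (start, end, count) for the first period with the most overlaps."""
    events = []
    for start, end in intervals:
        events.append((start, 1))  # 1 marks a start
        events.append((end, 0))    # 0 marks an end, so ends sort first on ties
    events.sort()

    count = 0
    best = (None, None, 0)   # best interval so far and its count
    last_start = None
    flag = False
    for year, is_start in events:
        if is_start:
            count += 1
            if count > best[2]:
                last_start = year
                flag = True
        else:
            if flag:
                best = (last_start, year, count)
            flag = False
            count -= 1
    return best

print(busiest_period([(1990, 2013), (1995, 2000), (2010, 2020), (1992, 1999)]))
# (1995, 1999, 3)
```

Sorting dominates the cost, so the sweep runs in O(n log n) for n intervals, regardless of how many years the intervals span.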
How about this?
Say I have all the above data stored in a file. Read it into two arrays separated by the " - ".
Hence, now I have birthYear[] which contains all the birth years and deathYear[] containing all the death years.
so birthYear[] = [1990, 1995, 2010, 1992]
deathYear[] = [2013, 2000, 2020, 1999]
Get the min birth year and the max death year. Create a Hashtable with the Key as a year, and the Value as the count.
Hence,
Hashtable<Integer, Integer> numOfElephantsAlive = new Hashtable<Integer, Integer>();
Now fill the table, whose keys will run from min(birthYear) to max(deathYear):
Iterate through the birth year array and, for each elephant, add every year from its birth year to the corresponding death year to the Hashtable with a count of 1. If the key already exists, add 1 to its count instead. Hence, for the last case:
1992 - 1999
numOfElephantsAlive.put(1992, 1)
numOfElephantsAlive.put(1993, 1)
and so on for every year.
Say, for example, you have a Hashtable that looks like this at the end of it:
Key Value
1995 3
1996 3
1992 2
1993 1
1994 3
1998 1
1997 2
1999 2
Now, you need the range of the Years when the number of elephants were maximum. Hence, let's iterate and find the year with the max value. This is pretty easy. Iterate over the keySet() and get the year.
Now, you need a contiguous range of years. One way to do this:
Do Collections.sort() over the keySet() and when you hit the max value, save all contiguous locations.
Hence, on hitting 3 for our example at 1994, we would check for all the following years with a 3. This will return you your range which is the min-year, max-year combo.
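A compact Python sketch of this year-counting idea follows, using a dict in place of the Hashtable. It assumes an elephant counts as alive in both its birth year and its death year, and it returns only the first contiguous run of years at the maximum; the function name busiest_years is made up for this example.

```python
def busiest_years(birth_years, death_years):
    """Return (first_year, last_year, count) for the first run of peak years."""
    alive = {}  # year -> number of elephants alive in that year
    for born, died in zip(birth_years, death_years):
        # Assumption: the death year itself still counts as alive.
        for year in range(born, died + 1):
            alive[year] = alive.get(year, 0) + 1
    if not alive:
        return None

    peak = max(alive.values())
    # Earliest year at the peak, then extend while the next year is also at the peak.
    first = min(year for year, n in alive.items() if n == peak)
    last = first
    while alive.get(last + 1) == peak:
        last += 1
    return first, last, peak

print(busiest_years([1990, 1995, 2010, 1992], [2013, 2000, 2020, 1999]))
# (1995, 1999, 3)
```

The work grows with the total number of years covered by all life spans, so this is simple but slower than a sweep over sorted endpoints when the spans are long.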
One approach may be:
Iterate through the periods. Keep track of a list of periods up to now. Note: At each step, the number of periods increases by 2 (or 1 if there is no overlap with the existing list of periods).
For example
1990 - 2013
Period List contains 1 period { (1990,2013) }
Count List contains 1 entry { 1 }
1995 - 2000
Period List contains 3 periods { (1990,1995), (1995,2000), (2000,2013) }
Count List contains 3 entries { 1, 2, 1 }
2010 - 2020
Period List contains 5 periods { (1990,1995), (1995,2000), (2000,2010), (2010, 2013), (2013, 2020) }
Count List contains 5 entries { 1, 2, 1, 2, 1 }
1992 - 1999
Period List contains 7 periods { (1990,1992), (1992,1995), (1995,1999), (1999,2000), (2000,2010), (2010, 2013), (2013, 2020) }
Count List contains 7 entries { 1, 2, 3, 2, 1, 2, 1 }
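The final period and count lists above can be reproduced with a short Python sketch. Note that this is a batch variant, not the incremental insertion described: it collects all boundaries first and then counts the intervals covering each elementary period. One difference is that two disjoint intervals would produce a gap period with a count of 0 here. The function name period_counts is made up for this example.

```python
def period_counts(intervals):
    """Split the timeline at every boundary and count intervals per period."""
    # Every start or end year is a boundary between elementary periods.
    bounds = sorted({year for interval in intervals for year in interval})
    periods = list(zip(bounds, bounds[1:]))
    # An interval covers a period when it starts no later and ends no earlier.
    counts = [
        sum(1 for start, end in intervals if start <= lo and hi <= end)
        for lo, hi in periods
    ]
    return periods, counts

periods, counts = period_counts(
    [(1990, 2013), (1995, 2000), (2010, 2020), (1992, 1999)]
)
print(periods)
print(counts)                              # [1, 2, 3, 2, 1, 2, 1]
print(periods[counts.index(max(counts))])  # (1995, 1999)
```

The period with the largest count is the answer, which for the example is 1995 - 1999.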
1) arrange in ascending order year wise starting from the largest series.
2) count the years for largest series for whole data set
3) then identify the largest count.
4) the largest count is your answer for years... this can be done in Algo.
