How to make the following "apply" function in R more efficient? - performance

I am trying to bucket certain features into groups. The data.frame below (grouped) is my "key" (think Excel vlookup):
Original Grouped
1 Features Constant
2 PhoneService Constant
3 PhoneServices Constant
4 Surcharges Constant
5 CallingPlans Constant
6 Taxes Constant
7 LDUsage Noise
8 RegionalUsage Noise
9 LocalUsage Noise
10 Late fees Noise
11 SpecialServices Noise
12 TFUsage Noise
13 VoipUsage Noise
14 CCUsage Noise
15 Credits Credits
16 OneTime OneTime
I then reference my database which has a column (BillSection) that takes on a specific value from grouped$Original, and I want to group it according to grouped$Grouped. I am using the sapply function to perform this operation. Then I cbind the resulting output to my original data.frame.
grouper<-as.character(sapply(as.character(bill.data$BillSection[1:100]), # for the first 100 records of the data.frame bill.data
function(x)grouped[grouped$Original==x,2])) # take the second column, i.e. Grouped, for the corresponding "TRUE" value in Original
cbind(bill.data[1:100,],as.data.frame(grouper))
The above code works as expected, but it's slow when I apply it to my whole database, which exceeds 10,000,000 unique records. Is there an alternative to this method? I know I can use plyr, but it's even slower (I think) than sapply. I was trying to figure it out with data.table but no luck. Any suggestions would be helpful. I am open to coding this in Python, which I am new to, but heard is much faster than R, since I am dealing with large datasets very often. I wanted to know if R can do this fast enough to be useful.
Thanks!

I'm not sure I understand your question, but can you use merge()? i.e. something like...
merge(big.df, group.names.df, by.x='orginal.column.in.big.df',
by.y='original', all.x=T)
NB. Plyr has a parallel option...

Related

How to get powerset from a set with 3600 elements using as little memory as possible

I have been looking for a language and code to help me calculate all possible subsets of a set of 3600 elements. At first my search started with python, then I went through JavaScript and then came to Perl. I know using Perl to calculate all subsets as shown in https://rosettacode.org/wiki/Power_set having 16GB of ram there is a significant memory consumption, but I'm not sure if anything better than perl or this script bellow:
MY MWE:
use ntheory "forcomb";
my #S = qw/1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30/;
forcomb { print "[#S[#_]] " } scalar(#S);
print "\n";
There is no calculator that can handle so much elements in memory.
The number of possible subset starting from a set of 3600 elements is 2^3600.
This number is very big. Consider that
2^10 is close to 1.000
2^20 is close to 1.000.000
2^30 is close to 1.000.000.000
Basically every 10 you add three zeros, so with 2^3600 you have a number with 1200 zeros of different combinations, which is an unimaginable big number.
You can't solve this problem also saving the data to disk and using all the existing computers on the earth.
With all the computers existing on the earth (a number close to 2.000.000.000, so 2^31 computers) and imagine a disk space of a terabyte for each of them (2^40 bytes) you can imagine storing information for a set of 71 elements (71 not 3600) using a single byte to store each number and without considering the extra space to store the set information... take your consideration based on that.
You can eventually imagine giving a sort order to all the possible subsets and coding an algorithm that gives you the nth subset based on that sort. This can be done because you don't need to calculate and store all possible subsets, but calculate just one using some rule. If you are interested we can try to evaluate such solution
For a set s (with size |s|), the size of its power set P(s) is |P(s)| = 2^|s|.
Never-mind the memory. You'd need 2^3600 iterations to calculate each value.
This is totally computationally intractable in this universe.
Take Java (or another compiled language like Pascal with some bit support).
It has BitSet, so 3600 elements are represented with approximately 3600/8 = 450 bytes. All possibilities would be 23600: to much to iterate. One could iterate with a BigInteger, every ith bit representing an element.
Simple iterating with a BigInteger upto 23600 - 1 should make it your (descendants') life's work. Be aware that this kind of problem is something for quantum computing.
But I assume you have a very smart algorithm pruning most possibilities.
It would be nice to have dependencies like in sudoku. Then maybe a logic language or some rule engine might do.
Should 3600 be the seconds in an hour which you have to combine, please consider spending that hour otherwise. 😉

Rapid change detection algorithm

I'm logging temperature values in a room, saving them to the database. I'd like to be alerted when temperature rises suddenly. I can't set fixed values, because 18°C is acceptable in winter and 25°C is acceptable in summer. But if it jumps from 20°C to 25°C during, let's say, 30 minutes and stays like this for 5 minutes (to eliminate false readouts), I'd like to be informed.
My current idea is to take readouts from last 30 minutes (A) and readouts from last 5 minutes (B), calculate median of A and B and check if difference between them is less then my desired threshold.
Is this correct way to solve this or is there a better algorithm? I searched for a specific one but most of them seem overcomplicated.
Thanks!
Detecting changes in a time-series is a well-researched subject, and hundreds if not thousands of papers have been written on this subject. As you've seen many methods are quite advanced, but proved to be quite useful for many use cases. Whatever method you choose, you should evaluate it against real of simulated data, and optimize its parameters for your use case.
As you require, let me suggest a very simple method that in many cases prove to be good enough, and is quite similar to that you considered.
Basically, you have two concerns:
Detecting a monotonous change in a sampled noisy signal
Ignoring false readouts
First, note that medians are not commonly used for detecting trends. For the series (1,2,3,30,35,3,2,1) the medians of 5 consecutive terms is be (3, 3, 3, 3). It is much more common to use averages.
One common trick is to throw the extreme values before averaging (e.g. for each 7 values average only the middle 5). If many false readouts are expected - try to take measurements at a faster rate, and throw more extreme values (e.g. for each 13 values average the middle 9).
Also, you should throw away unfeasible values and replace them with the last measured value (unfeasible means out of range, or non-physical change rate).
Your idea of comparing a short-period measure with a long-period measure is a good idea, and indeed it is commonly used (e.g. in econometrics).
Quoting from "Financial Econometric Models - Some Contributions to the Field [Nicolau, 2007]:
Buy and sell signals are generated by two moving averages of the price
level: a long-period average and a short-period average. A typical
moving average trading rule prescribes a buy (sell) when the
short-period moving average crosses the long-period moving average
from below (above) (i.e. when the original time series is rising
(falling) relatively fast).
When you say "rises suddenly," mathematically you are talking about the magnitude of the derivative of the temperature signal.
There is a nice algorithm to simultaneously smooth a signal and calculate its derivative called the Savitzky–Golay filter. It's explained with examples on Wikipedia, or you can use Matlab to help you generate the convolution coefficients required. Once you have the coefficients the calculation is very simple.

suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is updated every 2 min to firebase real-time database. App has two requirements
1) It should alert the user when temperature exceed 33 degree or drop below 23 degree - This part is done
2) It should alert user when it has big temperature fluctuation after analysing data every 30min - This part i'm confused.
I don't know what algorithm to use to detect big temperature fluctuation over a period of time and alert the user. Can someone help me on this?
For a period of 30 minutes, your app would give you 15 values.
If you want to figure out a big change in this data, then there is one way to do so.
You can use implement the following method:
Calculate the mean and the standard deviation of the values.
Subtract the data you have from the mean and then take the absolute value of the result.
Compare if the absolute value is greater than one standard deviation, if it is greater then you have a big data.
See this example for better understanding:
Lets suppose you have these values for 10 minutes:
25,27,24,35,28
First Step:
Mean = 27 (apprx)
One standard deviation = 3.8
Second Step: Absolute(Data - Mean)
abs(25-27) = 2
abs(27-27) = 0
abs(24-27) = 3
abs(35-27) = 8
abs(28-27) = 1
Third Step
Check if any of the subtraction is greater than standard deviation
abs(35-27) gives 8 which is greater than 3.8
So, there is a big fluctuation. If all the subtracted results are less than standard deviation, then there is no fluctuation.
You can still improvise the result by selecting two or three standard deviation instead of one standard deviation.
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and current value? That's defensible.
If the max allowable absolute value is some multiple of your 33-23 = 10 degrees you're in business.

How to classify a set of samples via a continuous feature?

For example I got below table which is simply a coarse distribution for 20 persons over their age
age count of person
2 1
5 5
8 2
10 3
15 1
16 2
17 1
20 4
21 1
Then by using the same dataset, I could build another 'better' table .
age count of person
10- 8
10s 7
20+ 5
In fact , I could make more tables which contains different age range combination by using the same dataset.
Now I wonder how could I find the best combinations. The possible "goodness functions" we could use to measure if the combination is good or not might come by following three principles:
There should not be too many or too little classes
Ranges of classes should not vary too much.
Distribution should be smooth enough, that is ,number of items covered by each class should not vary too much.
Since this question represents a situation which is just general enough to describe a kind of specific problems , some sophisticated solutions to it should have already been there . But I failed to find them. Anyone could give some suggestions please?
I have go through some classification algorithm like PCA, k-mean or "max entropy based algorithm" but seems they are just too general to cover this specific problem by following all of the above three principles.
I would do the following:
Construct an evaluation function:
double goodness(double firstThreshold, double bucketWidth, int numBuckets)
which returns a goodness score based on your principles. I would then brute force a number of combinations of parameters and pick the combination with the best goodness score. If we try 4-10 values for each parameter then brute force will work, and probably give you nice round numbers for the cutoffs. If you want to get more sophisticated or have it run faster then you can try other search methods like hill-climbing, beam search or simulated annealing but I think that might be overkill for your situation.

slot machine payout calculation

There's this question but it has nothing close to help me out here.
Tried to find information about it on the internet yet this subject is so swarmed with articles on "how to win" or other non-related stuff that I could barely find anything. None worth posting here.
My question is how would I assure a payout of 95% over a year?
Theoretically, of course.
So far I can think of three obvious variables to consider within the calculation: Machine payout term (year in my case), total paid and total received in that term.
Now I could simply shoot a random number between the paid/received gap and fix slots results to be shown to the player but I'm not sure this is how it's done.
This method however sounds reasonable, although it involves building the slots results backwards..
I could also make a huge list of all possibilities, save them in a database randomized by order and simply poll one of them each time.
This got many flaws - the biggest one is the huge list I'm going to get (millions/billions/etc' records).
I certainly hope this question will be marked with an "Answer" (:
You have to make reel strips instead of huge database. Here is brief example for very basic 3-reel game containing 3 symbols:
Paytable:
3xA = 5
3xB = 10
3xC = 20
Reels-strip is a sequence of symbols on each reel. For the calculations you only need the quantity of each symbol per each reel:
A = 3, 1, 1 (3 symbols on 1st reel, 1 symbol on 2nd, 1 symbol on 3rd reel)
B = 1, 1, 2
C = 1, 1, 1
Full cycle (total number of all possible combinations) is 5 * 3 * 4 = 60
Now you can calculate probability of each combination:
3xA = 3 * 1 * 1 / full cycle = 0.05
3xB = 1 * 1 * 2 / full cycle = 0.0333
3xC = 1 * 1 * 1 / full cycle = 0.0166
Then you can calculate the return for each combination:
3xA = 5 * 0.05 = 0.25 (25% from AAA)
3xB = 10 * 0.0333 = 0.333 (33.3% from BBB)
3xC = 20 * 0.0166 = 0.333 (33.3% from CCC)
Total return = 91.66%
Finally, you can shuffle the symbols on each reel to get the reels-strips, e.g. "ABACA" for the 1st reel. Then pick a random number between 1 and the length of the strip, e.g. 1 to 5 for the 1st reel. This number is the middle symbol. The upper and lower ones are from the strip. If you picked from the edge of the strip, use the first or last one to loop the strip (it's a virtual reel). Then score the result.
In real life you might want to have Wild-symbols, free spins and bonuses. They all are pretty complicated to describe in this answer.
In this sample the Hit Frequency is 10% (total combinations = 60 and prize combinations = 6). Most of people use excel to calculate this stuff, however, you may find some good tools for making slot math.
Proper keywords for Google: PAR-sheet, "slot math can be fun" book.
For sweepstakes or Class-2 machines you can't use this stuff. You have to display a combination by the given prize instead. This is a pretty different task, so you may try to prepare a database storing the combinations sorted by the prize amount.
Well, the first problem is with the keyword assure, if you are dealing with random, you cannot assure, unless you change the logic of the slot machine.
Consider the following algorithm though. I think this style of thinking is more reliable then plotting graphs of averages to achive 95%;
if( customer_able_to_win() )
{
calculate_how_to_win();
}
else
no_win();
customer_able_to_win() is your data log that says how much intake you have gotten vs how much you have paid out, if you are under 95%, payout, then customer_able_to_win() returns true; in that case, calculate_how_to_win() calculates how much the customer would be able to win based on your %, so, lets choose a sampling period of 24 hours. If over the last 24 hours i've paid out 90% of the money I've taken in, then I can pay out up to 5%.... lets give that 5% a number such as 100$. So calculate_how_to_win says I can pay out up to 100$, so I would find a set of reels that would pay out 100$ or less, and that user could win. You could add a little random to it, but to ensure your 95% you'll have to have some other rules such as a forced max payout if you get below say 80%, and so on.
If you change the algorithm a little by adding random to the mix you will have to have more of these caveats..... So to make it APPEAR random to the user, you could do...
if( customer_able_to_win() && payout_percent() < 90% )
{
calculate_how_to_win(); // up to 5% payout
}
else
no_win();
With something like that, it will go on a losing streak after you hit 95% until you reach 90%, then it will go on a winning streak of random increments until you reach 95%.
This isn't a full algorithm answer, but more of a direction on how to think about how the slot machine works.
I've always envisioned this is the way slot machines work especially with video poker. Because the no_win() function would calculate how to lose, but make it appear to be 1 card off to tease you to think you were going to win, instead of dealing with a 'fair' game and the random just happens to be like that....
Think of the entire process of.... first thinking if you are going to win, how are you going to win, if you're not going to win, how are you going to lose, instead of random number generators determining if you will win or not.
I worked many years ago for an internet casino in Australia, this one being the only one in the world that was regulated completely by a government body. The algorithms you speak of that produce "structured randomness" are obviously extremely complex especially when you are talking multiple lines in all directions, double up, pick the suit, multiple progressive jackpots and the like.
Our poker machine laws for our state demand a payout of 97% of what goes in. For rudely to be satisfied that our machine did this, they made us run 10 million mock turns of the machine and then wanted to see that our game paid off at what the law states with the tiniest range of error (we had many many machines running a script to auto playing using a script to simulate the click for about a week before we hit the 10 mil).
Anyhow the algorithms you speak of are EXPENSIVE! They range from maybe $500k to several million per machine so as you can understand, no one is going to hand them over for free, that's for sure. If you wanted a single line machine it would be easy enough to do. Just work out you symbols/cards and what pay structure you want for each. Then you could just distribute those payouts amongst non-payouts till you got you respective figure. Obviously the more options there are means the longer it will take to pay out at that respective rate, it may even payout more early in the piece. Hit frequency and prize size are also factors you may want to consider
A simple way to do it, if you assume that people win a constant number of times a time period:
Create a collection of all possible tumbler combinations with how much each one pays out.
The first time someone plays, in that time period, you can offer all combinations at equal probability.
If they win, take that amount off the total left for the time period, and remove from the available options any combination that would payout more than you have left.
Repeat with the reduced combinations until all the money is gone for that time period.
Reset and start again for the next time period.

Resources