I have two vectors, and I would like to use a statistical test to check whether their medians are equal, but I don't know how to do that in RStudio.
Could someone help me?
Thank you very much!
You could use boxplot.
Read its documentation. You can set notch = TRUE to draw a notch around each median; if the notches of two boxes do not overlap, that is rough evidence that the medians differ.
Also, be sure to assign the result of boxplot to a name so you can retrieve the statistics it computes:
info <- boxplot(your.vectors)
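For example, a minimal sketch with two made-up vectors standing in for yours:

x <- rnorm(50)              # placeholder data
y <- rnorm(50, mean = 0.5)  # placeholder data with a shifted median

# notch = TRUE draws a notch around each median; if the notches of two
# boxes do not overlap, that is rough evidence that the medians differ
info <- boxplot(x, y, notch = TRUE, names = c("x", "y"))

info$stats  # five summary statistics (whiskers, hinges, median) per box
info$conf   # lower/upper notch limits around each median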
I realize this is an underspecified question (because I don't know much about the topic; please bear with me), but here's the task I'd like to achieve:
Find a statistically sound algorithm to determine an optimal cut-off value for binarizing a vector, i.e. for filtering out the smallest values. Here's MATLAB code to visualize the problem:
randomdata = rand(1, 100);   % random data between 0 and 1
figure; plot(randomdata);
cutoff = 0.5;                % the cut-off value
line(get(gca, 'xlim'), [cutoff cutoff], 'Color', 'red');   % draw it in red
Thanks
You could try MATLAB's percentile function, prctile:
cutoff = prctile(randomdata, 10);   % the 10th percentile of the data
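% one way to apply the cutoff (this keeps roughly the top 90% of values;
% tune the percentile to taste):
binary = randomdata > cutoff;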
I tried John Fox's "polycor" package, but it does not show the confidence intervals. I have also tried the "psych" package, but no luck: the only things I get are the standard errors and the thresholds. Any help will be greatly appreciated.
Try the cor.ci function in psych. This will find the confidence intervals by bootstrapping.
In addition, cor.plot.upperLowerCi will then display a correlation plot showing the upper and lower confidence values.
library(psych)                               # provides cor.ci, cor.plot.upperLowerCi, and the bfi data
p.c <- cor.ci(bfi[1:200, 1:5], poly = TRUE)  # bootstrapped CIs for polychoric correlations
cor.plot.upperLowerCi(p.c, numbers = TRUE)   # plot showing upper/lower confidence bounds
I am trying to run a decision tree on the standalone version of SystemML on Windows (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/decision-tree.dml), but I keep receiving the error "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10. THIS NODE IS DECLARED AS LEAF!". It seems the code is not computing any split, although I am able to fit a tree on the same data in R. Has anyone used this algorithm before and has any tips on how to resolve the error?
Thank you
This message generally indicates that a split on the best categorical or scale features would not give any additional gain.
I would recommend that you:
1. Investigate the computed gains (best_cat_gain, best_scale_gain).
2. Double-check that the metadata (num_cat_features, num_scale_features) is correctly recognized.
You could simply add print statements to the script to check these. If the metadata is invalid, you might want to verify that the optional input R has the layout described in the header of the script.
If this does not help, please share the input arguments, the format of the input data, etc., and we'll take a closer look.
I have an interesting problem I'm working on right now and wonder if anyone has had success implementing high-performance solutions to it.
I have a set of "intervals", meaning an array of arrays, each of the form
Intervals = [
[min_val_1, max_val_1],
[min_val_2, max_val_2],
...
[min_val_n, max_val_n]
]
where all of these values are real-valued. Now, given a number, I want to ask: which intervals contain this number? I need to answer this very quickly. I can preprocess as much as needed, and space is less of a concern than time. What approach would you recommend? Thanks in advance!
I recommend using an interval tree: after O(n log n) preprocessing, it answers a stabbing query in O(log n + k) time, where k is the number of intervals containing the query point.
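Here is a minimal sketch in base R of a centered interval tree (the function names are mine, and the filtering inside the query scans linearly rather than stopping early, which keeps the sketch short at some cost in constant factors):

# Each node stores the median of all endpoints ("center"), the intervals
# that straddle the center (kept twice: sorted by start ascending, and by
# end descending), and subtrees for intervals entirely below/above it.
build_tree <- function(ivals) {
  if (nrow(ivals) == 0) return(NULL)
  center <- median(c(ivals[, 1], ivals[, 2]))
  left   <- ivals[ivals[, 2] < center, , drop = FALSE]
  right  <- ivals[ivals[, 1] > center, , drop = FALSE]
  mid    <- ivals[ivals[, 1] <= center & ivals[, 2] >= center, , drop = FALSE]
  list(center   = center,
       by_start = mid[order(mid[, 1]), , drop = FALSE],
       by_end   = mid[order(mid[, 2], decreasing = TRUE), , drop = FALSE],
       left     = build_tree(left),
       right    = build_tree(right))
}

# Stabbing query: return all stored intervals that contain x.
query_tree <- function(node, x) {
  if (is.null(node)) return(NULL)
  if (x < node$center) {
    # intervals straddling the center contain x iff they start at or below x
    hits <- node$by_start[node$by_start[, 1] <= x, , drop = FALSE]
    rbind(hits, query_tree(node$left, x))
  } else {
    # for x at or above the center, iff they end at or above x
    hits <- node$by_end[node$by_end[, 2] >= x, , drop = FALSE]
    rbind(hits, query_tree(node$right, x))
  }
}

iv   <- rbind(c(1, 4), c(2, 6), c(5, 8), c(7, 9))
tree <- build_tree(iv)
query_tree(tree, 5.5)   # returns the rows (2, 6) and (5, 8)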
I am trying to find an equation that calculates the "importance" of a Twitter user from #followers and #following.
Things I want to consider:
1. The larger the ratio #followers / #following, the more important the user is.
2. Distinguish 20/20 from 10k/10k (the 10k user is more important even though the ratio is the same).
Considering these two points, I expect the following two inputs to yield similar importance values:
#followers=1000 #following=100
#followers=30k #following=30k
I'm having trouble incorporating the second point. I believe it should be quite simple. Help?
Thanks
One possibility is (#followers / #following) * [log(#followers) - CONST], where CONST is some predefined value, tuned as appropriate. This ensures the ratio keeps its appropriate weight, but the scale matters too.
For your last example, you would need to set CONST ≈ 9.4 to achieve similar results.
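Here is a quick R sketch of this score. The base-2 logarithm is my assumption: it is the base for which CONST ≈ 9.4 makes the two example users come out nearly equal.

# hypothetical scoring function; with log2, CONST = 9.4 roughly balances
# the asker's two examples (1000/100 vs. 30k/30k)
importance <- function(followers, following, CONST = 9.4) {
  (followers / following) * (log2(followers) - CONST)
}

importance(1000, 100)      # ~5.7
importance(30000, 30000)   # ~5.5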
There are many possible answers to this question: you need to decide how much weight the raw number of followers carries relative to the ratio, so that you can combine the two into a single number. For example, the first idea that comes to my mind is to multiply the ratio by the log of #Followers. Something like this:
Importance = (#Followers / #Following) * Log(#Followers)
Based on what you said there, you could do 3*followers^2/following.
But you've described a system where users can increase their importance by following fewer other users. Doesn't seem too awesome.
You could normalize it by the total number of users.
I'd suggest taking logarithms of all the values to get a less dramatic increase at higher values.
(log(#followers)/log(#TotalNumberOfPeopleInTwitter))*(log(#followers)/log(#following))
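A small R sketch of this normalized score (the function name and the total-user count N are placeholders, not real figures):

norm_importance <- function(followers, following, N = 3e8) {
  (log(followers) / log(N)) * (log(followers) / log(following))
}

# The asker's two examples happen to come out nearly equal here
# (about 0.53 each with this N), since log(N) scales both the same way.
norm_importance(1000, 100)
norm_importance(30000, 30000)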