How to sort a list of doubles according to their names in R

I am trying to write a function to calculate the R1 lexical richness measure. The formula is as follows:
R1 = 1 - (F(h) - h^2/(2N))
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequency up to that point. My actual data is in the same format as the data below:
txt <- list(
a = c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.","The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.","His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.","We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
)
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
  # Interpolate the rank-frequency profile and find the fixed point
  # where frequency(rank) == rank, i.e. the Hirsch point h.
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x) split(x, x$group)
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
str(tstat_by)
str(h_values)
F <- list()
R <- list()
temp <- list()
for (Ls in names(tstat_by)) {
  for (item in names(h_values[[Ls]])) {
    # Rows up to the Hirsch point h
    temp[[Ls]][[item]] <- subset(tstat_by[[Ls]][[item]], rank <= h_values[[Ls]][[item]])
    # F(h): cumulative relative frequency up to h
    F[[Ls]][[item]] <- sum(temp[[Ls]][[item]]$frequency) / sum(tstat_by[[Ls]][[item]]$frequency)
    # R1 = 1 - (F(h) - h^2/(2N)); note the parentheses around 2N
    R[[Ls]][[item]] <- 1 - (F[[Ls]][[item]] -
                            h_values[[Ls]][[item]]^2 /
                            (2 * sum(tstat_by[[Ls]][[item]]$frequency)))
  }
}
I have the values I need stored in a list, but in the wrong order. Here is what the for loop produces:
names(R[["a"]])
[1] "text1" "text10" "text11" "text2" "text3" "text4" "text5" "text6" "text7"
[10] "text8" "text9"
but I need it to be in this natural order:
names(R[["a"]])
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8" "text9"
[10] "text10" "text11"
So the question is: how do I get the values sorted by the numeric part of their names?

Order them by the integer values in the element names, after stripping the "text" part.
> R$a <- R$a[order(as.integer(gsub("text", "", names(R$a))))]
> R$a
$text1
[1] 0.8666667
$text2
[1] 0.8510638
$text3
[1] 0.9
$text4
[1] 0.9411765
$text5
[1] 0.8333333
$text6
[1] 0.9166667
$text7
[1] 0.8666667
$text8
[1] 0.8571429
$text9
[1] 0.7741935
$text10
[1] 0.8888889
$text11
[1] 0.8717949
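If you'd rather not hard-code the "text" prefix, a natural-sort helper gives the same ordering. A minimal sketch, assuming the gtools package is installed:
library(gtools)
# mixedorder() sorts embedded numbers numerically, so "text2" precedes "text10".
R$a <- R$a[mixedorder(names(R$a))]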

Related

Find mutual element in different facts in SWI-Prolog

I am trying to take a list of movies as input and find the actors who play in all of those movies.
Question: given a list of movies, display the set of stars that links them, using recursion.
Here is an example of the facts:
% The facts have the form starsin(Movie, Actor):
starsin(a,bob).
starsin(b,bob).
starsin(c,bob).
starsin(a,maria).
starsin(b,maria).
starsin(c,maria).
starsin(a,george).
starsin(b,george).
starsin(c,george).
Example of input and output:
?- sameActors([a,b,c],Y).
Y = bob,maria,george.
Rule written so far:
sameActors(Actors, Movies) :-
    findall(Stars, (member(movie, Actors), starsin(movie, Stars)), Name),
    sum_list(Name, Movies).
I am quite new and can't find a similar solution online. I don't understand what I am doing wrong or what I need to add.
Here is another one (I finally found a way).
No recursion, just setof/3, a relative of findall/3:
Given a database of "actors starring in movies":
starsin(a,bob).
starsin(c,bob).
starsin(a,maria).
starsin(b,maria).
starsin(c,maria).
starsin(a,george).
starsin(b,george).
starsin(c,george).
starsin(d,george).
We do some reflection (described in "setof/3 inside setof/3 not working, but why?"), and then:
subselect(Ax, MovIn) :-
    setof(Mx, starsin(Mx, Ax), MovAx),
    subset(MovIn, MovAx).

actors_appearing_in_movies(MovIn, ActOut) :-
    setof(Ax, subselect(Ax, MovIn), ActOut).
This has the right feel of being an RDBMS operation!
Testing!
Note that for the empty set of movies, we get all the actors. This is marginally right: all the actors star in all the movies of the empty set.
Testing consists in running these goals and observing that they succeed:
actors_appearing_in_movies([],ActOut),
permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a],ActOut),
permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a,b],ActOut),
permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c],ActOut),
permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c,d],ActOut),
permutation([george],ActOut),!.
Bonus Round: In R
Completely unrelated, but I thought about how to do that in R.
After some fumbling:
# Load tidyverse dplyr
library(dplyr)
# Create a data frame ("tibble") with our raw data using `tribble`
t <- tribble(
~movie, ~actor
,"a" , "bob"
,"c" , "bob"
,"a" , "maria"
,"b" , "maria"
,"c" , "maria"
,"a" , "george"
,"b" , "george"
,"c" , "george"
,"d" , "george")
# The function!
actors_appearing_in_movies <- function(data, movies_must) {
  # (movie, actor) pairs of actors active in the movies we are interested in
  t1 <- data %>% filter(is.element(movie, movies_must))
  # (actor, (movies)) pairs of actors and the movies they appear in,
  # restricted to the movies we are interested in
  t2 <- t1 %>% group_by(actor) %>% summarize(movies = list(unique(movie)))
  # Retain only the actors that appear in all requested movies
  t3 <- t2 %>% rowwise() %>% filter(setequal(movies_must, movies))
  # Select only the actor column
  # ("select" works column-wise, not row-wise as in SQL)
  t4 <- t3 %>% select(actor)
  return(t4)
}
Results?
The above approach has a different opinion on who is in the empty movie set:
> actors_appearing_in_movies(t, c())
# A tibble: 0 x 1
# … with 1 variable: actor <chr>
But:
> actors_appearing_in_movies(t, c("a"))
# A tibble: 3 x 1
actor
<chr>
1 bob
2 george
3 maria
> actors_appearing_in_movies(t, c("a","b"))
# A tibble: 2 x 1
actor
<chr>
1 george
2 maria
> actors_appearing_in_movies(t, c("a","b","c"))
# A tibble: 2 x 1
actor
<chr>
1 george
2 maria
> actors_appearing_in_movies(t, c("a","b","c","d"))
# A tibble: 1 x 1
actor
<chr>
1 george
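For comparison, the same filter can be written in base R without dplyr. A sketch with the same semantics, including the empty-movie-set behavior (the function name and layout are mine):
actors_appearing_in_movies_base <- function(data, movies_must) {
  hits <- data[data$movie %in% movies_must, ]   # rows for the wanted movies
  keep <- tapply(hits$movie, hits$actor,
                 function(m) setequal(unique(m), movies_must))
  names(keep)[keep]                             # actors covering all wanted movies
}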

Robust Standard Errors in lm() using stargazer()

I have read a lot about the pain of replicating STATA's easy robust option in R. I replicated the following approaches: StackExchange and the Economic Theory Blog. They work, but the problem I face is printing my results using the stargazer function (which produces the .tex code for LaTeX files).
Here is the illustration to my problem:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
stargazer(reg1)
This prints the R output as .tex code (with non-robust SEs). If I want to use robust SEs, I can do it with the sandwich package as follows:
vcov <- vcovHC(reg1, "HC1")
If I now use stargazer(vcov), only the output of the vcovHC function is printed, not the regression output itself.
With the lmtest package it is possible to print at least the estimates, but not the observations, R2, adjusted R2, residual standard error, or the F-statistic.
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923 6.85521 -0.3719 0.710611
id 0.39634 0.12376 3.2026 0.001722 **
source 1.48164 4.20183 0.3526 0.724960
country -4.00398 4.00256 -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I get an output that includes the following statistics as well?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem, and can you help me out?
How can I use robust standard errors with lm and still apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include them in the stargazer output:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc,
             index = c("state", "year"),
             model = "pooling")  # plm's argument is `model`, not `method`
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without RSE)
stargazer(model, model, type = "text",
          se = list(NULL, robust_se))
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update: I'm not that deep into F-tests. People are discussing these issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error. Following http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity:
"A heteroskedasticity-robust t statistic can be obtained by dividing an OLS estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
So perhaps use a Wald statistic here?
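For what it's worth, lmtest::waldtest accepts a robust covariance estimator, so the overall test can at least be made consistent with the robust standard errors. A minimal sketch, assuming the reg1 fit from the question (not checked against STATA's output):
library(lmtest)
library(sandwich)
# Robust Wald test of the full model against the intercept-only model,
# in place of the usual F-statistic; `. ~ 1` is an update formula.
waldtest(reg1, . ~ 1,
         vcov = function(x) vcovHC(x, type = "HC1"),
         test = "F")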
This is a fairly simple solution using coeftest:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~country)
se_robust <- cl_robust[, 2]
stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I only included cl_robust in the output as a verification that the results are identical.

Calculate time remaining with different length of variables

I will have to admit the title of this question sucks... I couldn't come up with a better description, so let me give an example.
I have about 2700 customers; my software was at one time installed on each of their servers, and 1500 or so still have it. Basically, I run an auto-diagnostics process to weed out people who have uninstalled and to spot people with software problems we can assist with. Currently we cURL each customer's website, looking for a header our software returns.
We have 8 different statuses that are returned
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare; never encountered this yet)
Now basically a cron job starts the process by pulling the domain and product type. It then cURLs the domain and starts cycling through the status colors above.
While this is happening, we have an AJAX page returning the results so we can keep an eye on the status. The major problem is that the Time Remaining is so volatile that it does not give a good estimate. Here is the current math:
# Number of accounts between NOW and when started
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
I could keep a count of all of the green, red, blue, etc. results and average how long each color takes, then use that for the average time, although I don't believe that would give much better results.
With times that vary this much, any suggestions would be appreciated.
Thanks,
Jeff
OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over a period of time.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas as $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x, $y) = $vOne;
        } else {               // time-series
            $x = $vCnt; $y = $vOne;
        }
        $vSumXY += $x * $y;
        $vSumXX += $x * $x;
        $vSumX  += $x;
        $vSumY  += $y;
        $vCnt++;
    }
    // Least-squares slope and intercept: a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    $vTop = ($n * $vSumXY - $vSumX * $vSumY);
    $vBottom = ($n * $vSumXX - $vSumX * $vSumX);
    $a = $vBottom != 0 ? $vTop / $vBottom : 0;
    $b = ($vSumY - $a * $vSumX) / $n;
    return array($a, $b);
}
I take each account and build an array of the time each one takes. The array then runs through this calculation to build the x and y series. Finally, I run the array through the predict function.
/** given x, return the prediction y */
function calcpredict($x) {
    list($a, $b) = $this->calc();
    $y = $a * $x + $b;
    return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a 0.003 difference over 10 accounts and about a 2.6 difference over 2700 accounts. Next will be to calculate the accuracy.
Thanks for trying, guys and gals

How to calculate correlation with a sliding window?

I have a zoo object called aux with yearly data from 1961 to 2009:
     x$nao  x[, 2]
1961 -0.03    63.3
1962  0.20   155.9
1963 -2.98   211.0
I want to calculate the correlation between the two columns using a 20 years sliding window. I am trying to use rollapply, but I don't seem to be able to make it work. I tried several different ways of doing it but always without success...
> rollapply(aux,20, cor(aux[,1],aux[,2],method="pearson"))
Error in match.fun(FUN) : 'cor(aux[, 1], aux[, 2], method = "pearson")' is not a function, character or symbol
> rollapply(aux,20, cor,method="pearson")
Error in FUN(coredata(data)[posns], ...) : supply both 'x' and 'y' or a matrix-like 'x'
> rollapply(aux,20, cor)
Error in FUN(coredata(data)[posns], ...) : supply both 'x' and 'y' or a matrix-like 'x'
Can anybody tell me how to make rollapply work?
Thanks for helping!
Try this.
library(quantmod)
library(TTR)
#Set the seed so results can be duplicated
set.seed(123)
#Build a zoo object with typical price data
var1 <- zoo(cumprod(1 + rnorm(50, 0.01, 0.05)), seq(1961, 2010, 1))  # 50 obs need 50 index values
var2 <- zoo(cumprod(1 + rnorm(50, 0.015, 0.1)), seq(1961, 2010, 1))
dat <- merge(var1=var1, var2=var2)
plot(dat)
grid()
#Calculate the percent returns for the two prices
del1 <- Delt(dat$var1)
del2 <- Delt(dat$var2)
dat <- merge(dat, del1=del1, del2=del2)
dimnames(dat)[[2]][3] <- "del1"
dimnames(dat)[[2]][4] <- "del2"
head(dat)
plot(dat)
#Calculate the correlation between the two returns using a 5 year sliding window
delcor <- runCor(dat$del1, dat$del2, n=5, use="all.obs", sample=TRUE, cumulative=FALSE)
dat <- merge(dat, delcor)
plot(dat$delcor, type="l", main="Sliding Window Correlation of Two Return Series", xlab="", col="red")
grid()
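For reference, rollapply itself can compute the rolling correlation once it is told to pass the whole window instead of one column at a time. A minimal sketch, assuming aux is the two-column zoo object from the question:
library(zoo)
# by.column = FALSE hands the full 20 x 2 window matrix to FUN;
# rollapplyr() right-aligns the window.
roll_cor <- rollapplyr(aux, width = 20,
                       FUN = function(z) cor(z[, 1], z[, 2], method = "pearson"),
                       by.column = FALSE)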

Speed up the analysis

I have 2 data frames in R, for example df and dfrefseq.
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to check, for each gene in df, which gene in dfrefseq lies closest to it.
I first selected "chr1" in both data frames.
Then I calculated, for the first gene in readschr1, the distances between the start-start, start-stop, stop-start, and stop-stop sites.
The sum of these calculations captures the distance. My question is: how can I speed up this analysis? Right now I have tested only one gene against a data frame, but I need to test 2000 genes.
readschr1 <- subset(df, df[, 1] == "chr1")
refseqchr1 <- subset(dfrefseq, dfrefseq[, 1] == "chr1")
names <- list()
read_start_start <- list()
read_start_stop <- list()
read_stop_start <- list()
read_stop_stop <- list()
for (i in 1:nrow(refseqchr1)) {
  startstart <- abs(readschr1[1, 2] - refseqchr1[i, 2])
  startstop  <- abs(readschr1[1, 2] - refseqchr1[i, 3])
  stopstart  <- abs(readschr1[1, 3] - refseqchr1[i, 2])
  stopstop   <- abs(readschr1[1, 3] - refseqchr1[i, 3])
  read_start_start[[i]] <- matrix(startstart)
  read_start_stop[[i]]  <- matrix(startstop)
  read_stop_start[[i]]  <- matrix(stopstart)
  read_stop_stop[[i]]   <- matrix(stopstop)
  names[[i]] <- matrix(refseqchr1[i, 4])
}
table <- cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns <- as.numeric(table[, 2]) + as.numeric(table[, 3]) + as.numeric(table[, 4]) + as.numeric(table[, 5])
test <- cbind(table, sumtotalcolumns)
test1 <- test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data:
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
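Note: on current Bioconductor releases the biocLite.R script has been retired, and installation goes through BiocManager instead:
# Modern replacement for the biocLite() call above:
install.packages("BiocManager")
BiocManager::install("GenomicRanges")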
then
library(GenomicRanges)
gr <- with(df,
           GRanges(factor(chr, levels = paste("chr", 1:4, sep = "")),
                   IRanges(start, stop), genenames = genenames))
grrefseq <- with(dfrefseq,
                 GRanges(factor(chr, levels = paste("chr", 1:4, sep = "")),
                         IRanges(start, stop), genenames = genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
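If you want gene names rather than indices, the integer vector from nearest() can index into dfrefseq. A small sketch (the NA for the chr4 gene, which has no counterpart in dfrefseq, carries through):
# Map each df gene to the name of its nearest dfrefseq gene.
idx <- nearest(gr, grrefseq)
data.frame(gene = df$genenames, nearest_refseq = dfrefseq$genenames[idx])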
You can merge the two separate data.frames together to form one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do when there are cases that do not match. Specifying all = TRUE returns all rows, filling in NAs where there is no match in the other data.frame (i.e. chr2 and chr4 in this case). Once the data.frames have been merged, it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
read_start_start = abs(start.x - start.y)
, read_start_stop = abs(start.x - stop.y)
, read_stop_start = abs(stop.x - start.y)
, read_stop_stop = abs(stop.x - stop.y)
)
zz <- transform(zz,
sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
)
Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but if you'd prefer a base solution, one is sketched after the ddply call below.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns), ])
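And the base-R equivalent promised above, as a sketch:
# Split zz by chr and genenames.x, keep the row with the smallest
# sum_total_columns in each group, then stack the pieces back together.
do.call(rbind,
        lapply(split(zz, list(zz$chr, zz$genenames.x), drop = TRUE),
               function(x) x[which.min(x$sum_total_columns), ]))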
