How to write this in a for loop in R?

I have written this:
pp1<-table(sip$newSS4_1 [sip$newSS4_1==1], sip$HS1C1 [sip$newSS4_1==1])
pp1=round(prop.table(pp1,1), digits=3)
pp1
I have 30 variables to do this for. For example:
pp2<-table(sip$newSS4_2 [sip$newSS4_2==1], sip$HS1C1 [sip$newSS4_2==1])
pp2=round(prop.table(pp2,1), digits=3)
pp2
and pp3...pp30 and so on. I have newSS4_1...newSS4_30 in the data frame already.
How to write this in a loop?
Thanks.

It seems like you are making things hard for yourself.
Instead of creating 30 variables with different names, why not use a single list?
Instead of using column names, why not use column numbers?
pp <- list()
colnums <- grep("newSS4_", names(sip)) # assuming they are in order
for (i in 1:30) {
  cn <- colnums[i]
  pp[[i]] <- table(sip[, cn][sip[, cn] == 1], sip$HS1C1[sip[, cn] == 1])
  pp[[i]] <- round(prop.table(pp[[i]], 1), digits = 3)
}
If you really want to have different variables, you can use
assign(paste0("pp", i), value)
to assign to e.g. pp1, pp2, etc.
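For completeness, a minimal sketch of the assign() route, reusing the loop above (the list version is still the better choice):
for (i in 1:30) {
  cn <- colnums[i]
  tab <- table(sip[, cn][sip[, cn] == 1], sip$HS1C1[sip[, cn] == 1])
  # creates pp1, pp2, ... in the global environment
  assign(paste0("pp", i), round(prop.table(tab, 1), digits = 3))
}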

Related

Stack multiple columns into one

I want to do a simple task but somehow I'm unable to do it. Assume that I have one column like:
a
z
e
r
t
How can I create a new column with the same value twice with the following result:
a
a
z
z
e
e
r
r
t
t
I've already tried to double my column and do something like:
=TRANSPOSE(SPLIT(JOIN(";",A:A,B:B),";"))
but it creates:
a
z
e
r
t
a
z
e
r
t
I was inspired by this answer.
Try this:
=SORT({A1:A5;A1:A5})
Here we use:
SORT
{} to combine data
Taking your comment into account, you may use this formula:
=QUERY(SORT(ArrayFormula({row(A1:A5),A1:A5;row(A1:A5),A1:A5})),"select Col2")
The idea is to use an additional column containing the row number, then sort by that row number, then query to get only the values.
And the join→split method will do the same:
=TRANSPOSE(SPLIT(JOIN(",",ARRAYFORMULA(CONCAT(A1:A5&",",A1:A5))),","))
Here we use the range only twice, so this is easier to use. Also see the Concat + ArrayFormula sample.
A few hundred rows is nothing :)
I created an index from 1 to n, then pasted it twice and sorted by the index. But it's obviously fancier to do it with a formula :)
Assuming your list is in column A and (for now) the repeat count is in C1 (it can be replaced with a literal number in the formula), something simple like this will do (starting in B1):
=INDEX(A:A,INT((ROW()-1)/$C$1)+1)
Simply copy it down as far as you need (it will just give 0 after the last item). No sorting. No arrays. No Sheets/Excel compatibility problems. No heavy calculations.

Compare 43 variables in all possible ways

I am trying to figure out the best way to cross-compare 43 variables (data sets, data).
I need to compare variable 1 with variables 2,3,4,5,6,7....43, then compare variable 2 with variables 1,3,4,5,6,7....43, and so on, up to variable no. 43.
I think I should use some kind of loop, but I am clueless about how to perform this operation efficiently.
I think I just need some kind of pseudo code. Either way I want to do this in a do-file in Stata.
Assuming e.g. variables var1-var43 and that the "comparison" between the first and the second differs from that between the second and the first, which is what your question implies, then
forval i = 1/43 {
    forval j = 1/43 {
        if `i' != `j' {
            <code for comparison between var`i' and var`j'>
        }
    }
}
With other variable names, foreach might be better.
As @NickCox suggested, you could use an O(N×N) nested loop. If that takes too long, which it could if your "43" is actually 1000, then there's a better way: sort each list (indirectly), which is O(N log N), and run a merge-order loop, which is O(N), so altogether it is O(N log N).
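The thread is about Stata, but the merge-order idea is language-agnostic; here is a minimal R sketch, with "comparison" assumed to mean finding the common values of two variables:
merge_common <- function(x, y) {
  x <- sort(x); y <- sort(y)                  # O(N log N) each
  i <- 1; j <- 1; common <- c()
  while (i <= length(x) && j <= length(y)) {  # single O(N) pass
    if (x[i] == y[j]) {
      common <- c(common, x[i]); i <- i + 1; j <- j + 1
    } else if (x[i] < y[j]) {
      i <- i + 1
    } else {
      j <- j + 1
    }
  }
  common
}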

Difficulty Accessing Members of Tuple in Apache Pig

I have a variable titled F.
Describe F returns:
F: {group: bytearray,indexkey: {(indexkey: chararray)}}
Dump F returns:
(321,{(CHOW),(DREW)})
(5011,{(CHOW),(DREW)})
(5825,{(TANNER),(SPITZENBERGER)})
(16631,{(CHOW),(DREW)})
(34299,{(CHOW),(DREW)})
(35044,{(TANNER),(SPITZENBERGER)})
(65623,{(CHOW),(DREW)})
(74597,{(SPITZENBERGER),(TANNER)})
(83499,{(SPITZENBERGER),(TANNER)})
(90257,{(SPITZENBERGER),(TANNER)})
What I need is to produce an output that looks like this (only 1st row as an example):
(321,DREW,{(CHOW)})
I've tried using dereferencing to pull out the first element by using this:
G = FOREACH F generate indexkey.$0;
But, this still returns the whole tuple.
Can anyone suggest a method for doing this? I was under the impression that the dereference operator should allow me to do this.
Thanks in advance!
Daniel
You can't index into bags like that, because bags have no notion of ordering: selecting the "first" item in a bag should be treated as picking a random one.
Either way, if you want only one item instead of all of them, you can use a nested FOREACH to pull a LIMIT of 1:
first = FOREACH F {
    lim = LIMIT indexkey 1;
    GENERATE group, lim;
}
(Disclaimer: I can't test this code right now; if it doesn't work, let me know. Hopefully you can get the gist.)
You can take this a bit further and FLATTEN it to remove the bag of one item entirely, but be careful: if the bag is empty, I think you throw away the entire record in that case.
first = FOREACH F {
    lim = LIMIT indexkey 1;
    GENERATE group, FLATTEN(lim);
}

Is there a more readable way to write for k, v in pairs(my_table) do ... end in lua if I never use k?

Is there a more readable way in lua to write:
for k, v in pairs(my_table) do
  myfunction(v)
end
I'm never using k, so I'd like to take it out of the loop control, so it's clear I'm just iterating over the values. Is there a function like pairs() that only gives me a list of the values?
There is no standard function that only iterates values, but you can write one yourself if you wish. Here is such an iterator:
function values(t)
  local k, v
  return function()
    k, v = next(t, k)
    return v
  end
end
But normally people just use pairs and discard the first variable. It is customary in this case to name the unused variable _ (an underscore) to clearly indicate the intent.
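Applied to the loop from the question, that convention reads:
for _, v in pairs(my_table) do
  myfunction(v)
end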
I've seen people use the _ variable instead of k or i.
Why would you use the pairs() function if you don't want the key/value pairs of the table you're enumerating, then?
For example, this is even shorter to type:
local t = {"asdf", "sdfg", "dfgh"}
for i = 1, #t do
  print(t[i])
end
Otherwise, I always just did this:
local t = {"asdf", "sdfg", "dfgh"}
for _, v in pairs(t) do
  print(v)
end
Edit: for your scenario, where you want to enumerate only the values in a table with non-numeric keys, probably the clearest thing you could do would be to write your own table iterator function like this:
local t = {["asdf"] = 1, ["sdfg"] = 2, ["dfgh"] = 3}
function values(tbl)
  local key = nil
  return function()
    key = next(tbl, key)
    return tbl[key]
  end
end
for value in values(t) do
  print(value)
end
Then it is very explicit that you're only traversing the values of the table t. Like pairs(), this is not guaranteed to traverse in any particular order, since it uses next().
It's your coding style, really. If you can read it and you're consistent with it then it shouldn't matter.
However, I tend to use:
for i,c in
"i" standing for "index" and "c" standing for "child", but "v" for "value" works as well. And even if you're not using the index variable, it's still good practice.
Another thing you might do is:
for n = 1, 10
when dealing with numbers. But once again, it's your coding style, and as long as it's consistent you should be good.
From Lua Style Guide:
The variable consisting of only an underscore "_" is commonly used as a placeholder when you want to ignore the variable:
for _,v in ipairs(t) do print(v) end
Note: This resembles the use of "_" in Haskell, Erlang, OCaml, and Prolog, where "_" takes the special meaning of an anonymous (ignored) variable in pattern matches. In Lua, "_" is only a convention with no inherent special meaning. Semantic editors that normally flag unused variables may avoid doing so for variables named "_" (LuaInspect is such a case).
So I would expect the underscore (_) name to be more readable for unused variables.

add columns to data frame using foreach and %dopar%

In Revolution R 2.12.2 on Windows 7 and Ubuntu 64-bit 11.04 I have a data frame with over 100K rows and over 100 columns, and I derive ~5 columns (sqrt, log, log10, etc) for each of the original columns and add them to the same data frame. Without parallelism using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, it will not access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame because the data frame object is 'not found.'
My question is how can I make this faster? In other words, how to parallelize either the columns or the transformations?
Simplified example:
require(foreach)
require(doSMP)
w <- startWorkers()
registerDoSMP(w)
transform_features <- function()
{
  cols <- c(1, 2, 3, 4) # in my real code I select certain columns (not all)
  foreach(thiscol = cols, mydata) %dopar% {
    name <- names(mydata)[thiscol]
    print(paste('transforming variable ', name))
    mydata[, paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[, thiscol])
    mydata[, paste(name, 'log', sep='_')] <<- log(mydata[, thiscol])
  }
}
n<-10 # I often have 100K-1M rows
mydata <- data.frame(
  a = runif(n, 1, 100),
  b = runif(n, 1, 100),
  c = runif(n, 1, 100),
  d = runif(n, 1, 100)
)
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 8
Notice that if you change %dopar% to %do%, it works fine.
Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.
See When should I use the := operator in data.table?
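A minimal sketch of that idea, assuming the mydata from the question (newer data.table versions also accept a parenthesized expression on the LHS of := in place of with=FALSE):
require(data.table)
DT <- as.data.table(mydata)
for (name in names(DT)) {
  # adds each derived column by reference, without copying DT
  DT[, (paste(name, 'sqrt', sep='_')) := sqrt(get(name))]
  DT[, (paste(name, 'log', sep='_')) := log(get(name))]
}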
Might it be easier if you did something like
n<-10
mydata <- data.frame(
  a = runif(n, 1, 100),
  b = runif(n, 1, 100),
  c = runif(n, 1, 100),
  d = runif(n, 1, 100)
)
mydata_sqrt <- sqrt(mydata)
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')
mydata <- cbind(mydata, mydata_sqrt)
producing something like
> mydata
a b c d a_sqrt b_sqrt c_sqrt d_sqrt
1 29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2 5.037735 12.282458 3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3 80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4 39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5 11.459075 8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6 26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7 42.533937 7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8 41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9 59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
There are two ways you can handle this:
Loop over each column (or, better yet, a subset of the columns) and apply the transformations to create a temporary data frame, return that, and then cbind the list of data frames, as @Henry suggested (see the sketch after this list).
Loop over the transformations, apply each to the data frame, and then return the transformation data frames, cbind, and proceed.
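A minimal sketch of the first approach, assuming the mydata from the question; each worker returns a small data frame and .combine = cbind stitches them together (depending on the backend, you may also need foreach's .export argument to ship mydata to the workers):
newcols <- foreach(thiscol = 1:4, .combine = cbind) %dopar% {
  name <- names(mydata)[thiscol]
  out <- data.frame(sqrt(mydata[, thiscol]), log(mydata[, thiscol]))
  names(out) <- paste(name, c('sqrt', 'log'), sep = '_')
  out
}
mydata <- cbind(mydata, newcols)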
Personally, the way I tend to do things like this is to create a big.matrix object (either in memory or on disk, using the bigmemory package), so that all of the columns can be accessed in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind; a sketch follows below. I tend to do it on disk. Just be sure to run flush() to make sure everything is written to disk.
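A minimal sketch of that approach, assuming the bigmemory package and the 4-column mydata from above (the backing file names are hypothetical):
require(bigmemory)
# pre-allocate room for the 4 original columns plus 4 sqrt columns
bm <- filebacked.big.matrix(nrow(mydata), 8,
                            backingfile = "mydata.bin",
                            descriptorfile = "mydata.desc")
bm[, 1:4] <- as.matrix(mydata)
desc <- describe(bm)
foreach(i = 1:4) %dopar% {
  m <- attach.big.matrix(desc) # each worker attaches to the shared matrix
  m[, i + 4] <- sqrt(m[, i])   # fill a pre-allocated column in place
  NULL
}
flush(bm) # make sure everything is written to disk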
