R: Sort multiple columns by another data.frame?

I'm trying to make sense of how to sort one data.frame based on multiple columns in another. This question does this with vectors. Can someone suggest a way to do the equivalent with data.frames?
Here's some sample data.
x1 <- data.frame(a=1:5, b=letters[1:5], c=rnorm(5))
x2 <- data.frame(a=c(4,4,2), b=c("d", "d", "b"), d=rnorm(3))
So I want to sort x2 by the first two columns of x1. My actual data is much more complicated, but this replicates the idea...

It really depends on what your data looks like. As it stands, you only need one column to sort on, and that is easily done by:
x2[order(match(x2[,1],x1[,1])),]
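For intuition: match returns the position of each x2 key within x1's keys, and order turns those positions into a row permutation. A minimal sketch with hand-made keys (not the question's random data):
match(c("d", "d", "b"), letters[1:5])
# [1] 4 4 2   -- where each x2 key sits in x1
order(c(4, 4, 2))
# [1] 3 1 2   -- so the third row of x2 comes first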
If you need more than one column, this becomes a bit trickier. You will have to specify which column you want to sort on first, and which one second, e.g.:
x1 <- data.frame(a=rep(1:3,2), b=rep(letters[2:4],each=2), c=rnorm(6))
x2 <- data.frame(a=c(3,3,2), b=c("c", "d", "b"), d=rnorm(3))
x2[order(match(
  paste(x2[,1], x2[,2]),
  paste(x1[,1], x1[,2]))
), ]
This sorts on the first column first, and then on the second. Keep in mind that every combination present in x2 must also occur in x1.
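If a combination in x2 has no counterpart in x1, match returns NA for it, and order then puts those rows last by default. A small sketch using the x1 defined just above:
match(paste(c(3, 9), c("c", "z")), paste(x1$a, x1$b))
# [1]  3 NA   -- the (9, "z") key does not occur in x1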

Attach a rank column to the relevant columns of x1:
len <- dim(x1)[1]
x1. <- cbind(x1[,1:2], rank=1:len)
Merge into x2 (this is like a SQL join; see the merge documentation for how to specify what happens if there are ambiguities such as multiple matches or no matches):
x2. <- merge(x2, x1.)
Sort:
x2.[order(x2.[,'rank']),]
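If you prefer, the three steps collapse into one expression (a sketch of the same approach, dropping the helper column at the end):
m <- merge(x2, cbind(x1[, 1:2], rank = seq_len(nrow(x1))))
m[order(m$rank), setdiff(names(m), "rank")]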

This can be done precisely using plyr. Joris' answer would work fairly well, but could potentially missort when combining strings:
> paste ("A A","B")
[1] "A A B"
> paste ("A","A B")
[1] "A A B"
You can get an exact answer using join.keys and match:
x1 <- data.frame(a=rep(1:3,2), b=rep(letters[2:4],each=2), c=rnorm(6))
x2 <- data.frame(a=c(3,3,2), b=c("c", "d", "b"), d=rnorm(3))
library(plyr)
keys <- join.keys(x1, x2, c("a", "b"))
matches <- match(keys$y, keys$x, nomatch = keys$n + 1)
x2[order(matches), ]
This should handle most edge cases, mismatched list sizes, etc. Items without a match in both the index columns are put at the end of the list.


Multiple components in an array slice - equivalent to perl5: @a[0..1,3]

Very basic question, but I can't seem to find anything about multiple ranges in the docs.
How do you select multiple ranges from a Perl 6 array?
my @a = "a","b","c","d";
@a[0..1,3] # expecting array with a, b & d as in Perl 5
This seems to return a sort of nested list. What's the Perl 6 syntax to achieve the result this would yield in Perl 5 (i.e. an array with a, b & d)?
Your question is a little confusing, but assuming you've just got typos or whatever, I'll try to guess what you are asking.
This makes a simple array:
> my @a = "a", "b", "c", "d";
[a b c d]
This makes an anonymous array of a Range from 0..1 and a 3:
> [0..1,3];
[0..1 3]
If you want it to pull values out of the @a array, you have to refer to it:
> @a[0..1,3];
((a b) d)
This pulls the bits you asked for from @a -- the first element is the 0..1 part of @a, (a b) -- (not sure why you want to see c in here..)
That's the nested list -- the two bits you asked for include the list in the first field, and the value d you asked for in the second field.
If you want it flattened instead of nested, you can use .flat:
> @a[0..1,3].flat;
(a b d)
In Raku (formerly known as Perl 6), 0..1 results in a single item, which is a range. In Perl 5, 0..1 immediately expands into two numbers.
One of my most common mistakes in Raku is forgetting to flatten things. The upside is that in Raku, we basically get the equivalent of Perl 5 references for free, which eliminates a lot of messy referencing and dereferencing.

Operate on the y's of a {{x1,y1},{x2,y2},...{xn,yn}} list

I've been scratching my head and I can't figure out a way to conveniently apply an operation to the y values of a list of the form {{x1,y1},{x2,y2},...{xn,yn}}. The list is in this form mostly for plotting with ListPlot[].
The type of operations I'd like to apply would include:
Mathematica Operations. Ex.: LowpassFilter[y's] (not point-by-point, I know)
Generic mathematical point-by-point operations. Ex: y's*10 + 2
I know I can transpose, flippity-flop the list around to target each element, then transpose back, flopity-flip, and overwrite the original list. This becomes tiresome after dealing with each case. I bet there is a clever way to do this. Or what would be the best way to hold values in a list so they can easily be plotted and manipulated?
Thanks
Map[{#[[1]],2+10 #[[2]]}&,{{x1,y1},{x2,y2},...{xn,yn}}]
MapAt[2+10#&,{{x1,y1},{x2,y2},...{xn,yn}},{All,2}]
If you need to operate on the y's as one list, do it like this:
Transpose@MapAt[LowpassFilter[#, 1] &,
  Transpose@{{x1, y1}, {x2, y2}, ... {xn, yn}}, 2]
Suppose you named your list as l, i.e.
l={{x1,y1},{x2,y2},...{xn,yn}}
You can get all ys by:
ylist=l[[All,2]]
{#, 10 #2 + 2} & @@@ l
{{x1, 2 + 10 y1}, {x2, 2 + 10 y2}, ... {xn, 2 + 10 yn}}

Algorithm to find "ordered combinations"

I need an algorithm to find, what I call, "ordered combinations" (Maybe someone knows the real name for this if there is one).
Of course I already tried to come up with an algorithm on my own but I'm really stuck.
How it should work:
Given 2 lists (not sets; order is important here!) that are guaranteed to contain the same elements, find all ordered combinations.
An ordered combination is a 2-tuple, 3-tuple, ..., n-tuple (no limit on n) of elements that appear in the same order in both lists.
It's entirely possible that an element occurs more than once in a list.
But every element from one list is guaranteed to appear at least once in the other list.
It does not matter if the output contains a combination more than once.
I'm not really sure if that makes it clear so here are multiple examples:
(List1, List2, Expected Result, Annotation)
ASDF
ADSF
Result: AS, AD, AF, SF, DF, ASF, ADF
Note: ASD is not a valid result because there is no way to have ascending indices in the second list for this combination
ADSD
ASDD
Result: AD, AS, AD, DD, SD, ASD, ADD
Note: AD appears twice because it can be created from indices 1,2 and 1,4 in the first list, and from 1,3 and 1,4 in the second. But it would also be correct if it appeared only once. Also, D appears twice in order in both lists, which makes ADD a valid combination too.
SDFG
SDFG
Result: SD, SF, SG, DF, DG, FG, SDF, SFG, SDG, DFG, SDFG
Note: Same input; all combinations are possible
ABCDEFG
GFEDCBA
Result: <empty>
Note: There are no combinations that appear in the same order in both lists
QWRRRRRRR
WRQ
Result: WR
Note: The only combination that appears in the same order in both lists is WR
Notes:
While it's a language agnostic algorithm I'd prefer answers that contain either C# or pseudo-code so I can understand them.
I realized that longer combinations are always built from shorter ones. Example: SDF can only be a valid result if SD and DF are possible too. Maybe this helps make the algorithm more performant, by building the longer combinations from the shorter ones.
Speed is of great importance here. This algorithm will be used in real time!
If it's not clear how the algorithm works, drop a comment. I'll add an example to clarify it.
Maybe this problem is already known and solved, but I don't know the proper name for it.
I would describe this problem as enumerating common subsequences of two strings. As a first cut, make a method like this, which chooses the first letter nondeterministically and recurses (Python, sorry).
def commonsubseqs(word1, word2, prefix=''):
    if len(prefix) >= 2:
        print(prefix)
    for letter in set(word1) & set(word2):  # set intersection
        # figure out what's left after consuming the first instance of letter
        remainder1 = word1[word1.index(letter) + 1:]
        remainder2 = word2[word2.index(letter) + 1:]
        # take letter and recurse
        commonsubseqs(remainder1, remainder2, prefix + letter)
If this simple solution is not fast enough for you, then it can be improved as follows. For each pair of suffixes of the two words, we precompute the list of recursive calls. In Python again:
def commonsubseqshelper(table, prefix, i, j):
    if len(prefix) >= 2:
        print(''.join(prefix))
    for (letter, i1, j1) in table[i][j]:
        prefix.append(letter)
        commonsubseqshelper(table, prefix, i1, j1)
        del prefix[-1]  # delete the last item

def commonsubseqs(word1, word2):
    table = [[[(letter, word1.index(letter, i) + 1, word2.index(letter, j) + 1)
               for letter in set(word1[i:]) & set(word2[j:])]
              for j in range(len(word2) + 1)]  # 0..len(word2)
             for i in range(len(word1) + 1)]  # 0..len(word1)
    commonsubseqshelper(table, [], 0, 0)
This polynomial-time preprocessing step improves the speed of enumeration to its asymptotic optimum.

Efficiently sample a data frame avoiding loops

I have a data frame which consists of a first column (experiment.id) and the rest of the columns are values associated with this experiment id. Each row is a unique experiment id. My data frame has columns in the order of 10⁴ - 10⁵.
data.frame(experiment.id=1:100, v1=rnorm(100,1,2),v2=rnorm(100,-1,2) )
This data frame is the source of my sample space. What I would like to do is, for each unique experiment.id (row), randomly sample (with replacement) one of the values v1, v2, ..., v10000 associated with this id and construct a sample s1. In each sample s1 all experiment ids are represented.
Eventually I want to draw 10⁴ samples, s1, s2, ..., s10⁴, and calculate some statistic on them.
What would be the most efficient way (computationally) to perform this sampling process? I would like to avoid for loops as much as possible.
Update:
My question is not only about sampling but also about storing the samples. I guess my real question is whether there is a quicker way to perform the above than
d<-data.frame(experiment.id=1:1000, replicate (10000,rnorm(1000,100,2)) )
results<-data.frame(d$experiment.id,replicate(n=10000,apply(d[,2:10001],1,function(x){sample(x,size=1,replace=T)})))
Here is an expression that chooses one of the columns (excluding the first) for each row. It does not copy the first column; you will need to supply that as a separate step.
For a data frame d:
d[matrix(c(seq(nrow(d)), sample(ncol(d)-1, nrow(d), replace=TRUE)+1), ncol=2)]
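For intuition, the inner matrix(...) pairs each row number with one randomly drawn column number (shifted by 1 to skip the first column). A small sketch, assuming the d from the question:
idx <- matrix(c(seq(nrow(d)), sample(ncol(d) - 1, nrow(d), replace = TRUE) + 1), ncol = 2)
head(idx, 3)   # each row: (row of d, randomly chosen value column of d)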
That's one sample. To get N samples, just multiply the selection (as in John's answer):
mm <- matrix(c(rep(seq(nrow(d)), N), sample(ncol(d)-1, nrow(d)*N, replace=TRUE)+1), ncol=2)
result <- matrix(d[mm], ncol=N)
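A quick shape check (a sketch, with d and N as above): each column of result is one complete sample, one value per experiment.
stopifnot(nrow(result) == nrow(d), ncol(result) == N)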
But you're going to have memory issues.
The shortest and most readable IMHO is still to use apply, making good use of the fact that sample can draw all 10000 values for a row in a single call:
results <- data.frame(experiment.id = d$experiment.id,
                      t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
If the 3 seconds it takes are too slow for your needs then I would recommend you use matrix indexing.
It's possible to do this without any looping whatsoever. If you convert the columns after the first one to a matrix, this gets easy, because a matrix can be addressed either as [row, column] or sequentially via its underlying vector.
mat <- as.matrix(datf[,-1])                # drop the id column, keep the values
nr <- nrow(mat); nc <- ncol(mat)
sel <- sample( 1:nc, nr, replace = TRUE )  # one random column per row
sel <- sel + ((1:nr)-1) * nc               # shift each draw into row i's block of t(mat)'s vector
x <- t(mat)[sel]                           # one sampled value per row
seldatf <- data.frame( datf[,1], x = x )   # reattach the experiment ids
Now, to get lots of samples, it's pretty easy to just multiply the same logic.
ns <- 10                                      # number of samples / row
sel <- sample(1:nc, nr * ns, replace = TRUE ) # ns random columns per row
sel <- sel + rep(((1:nr)-1) * nc, each = ns)  # shift each draw into its row's block
x <- t(mat)[sel]
seldatf <- cbind( datf[,1], data.frame(matrix(x, ncol = ns, byrow = TRUE)) )
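To convince yourself the indexing is right, you can check that every sampled value in row i really comes from row i of mat (a sketch using the objects above):
stopifnot(all(sapply(seq_len(nr), function(i)
  all(unlist(seldatf[i, -1]) %in% mat[i, ]))))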
If you set ns <- 1e5 and have lots of rows, the result is going to be a really big data frame, so watch out for running out of memory. I do a bit of unnecessary copying for readability; you can eliminate that to save memory, and speed, because once you use large amounts of memory you'll be swapping out other running programs, and that is slow. You don't have to assign and save x, mat, or even sel. Not doing so would give you about the fastest answer possible.

What is the optimal way to match list entries after rounding in Mathematica?

I have two lists in Mathematica:
list1 = {{a1, b1, c1}, ... , {an, bn, cn}}
and
list2 = {{d1, e1, f1}, ... , {dn, en, fn}}
the lists contain numerical results, each consisting of roughly 50000 triplets. Each triplet holds two coordinates and the numerical value of some property at those coordinates. The lists have different lengths and the coordinates do not cover quite the same range. My intention is to correlate the numerical values (the third entry) from each list, so I need to scan through the lists and identify the property values whose coordinates match. My output will be something like
list3 = {{ci, fj}, ... , {cl, fm}}
where
{ai, bi}, ..., {al, bl}
will be (roughly) equal to, respectively
{dj, ej}, ..., {dm, em}
By "roughly" I mean the coordinates will match once rounded to some desired accuracy:
list1(2) = Round[{#[[1]], #[[2]], #[[3]]}, {1000, 500, 0.1}] & /@ list1(2)
so after this process I'd have two lists that contain some matching coordinates amongst them. My question is: how do I identify those matches and pick out the pairs of properties in the optimal way?
An example of a 6 element list would be
list1 = {{-1.16371*10^6, 548315., 14903.}, {-1.16371*10^6, 548322., 14903.9},
{-1.16371*10^6, 548330., 14904.2}, {-1.16371*10^6, 548337., 14904.8},
{-1.16371*10^6, 548345., 14905.5}, {-1.16371*10^6, 548352., 14911.5}}
You may want to use something like this:
{Round[{#, #2}], #3} & @@@ Join[list1, list2];
% ~GatherBy~ First ~Select~ (Length@# > 1 &)
This will group all data points that have matching coordinates after rounding. You can use the second argument of Round to specify the fraction to round by.
This assumes that there are not duplicated points within a single list. If there are, you will need to remove those to get useful pairs. Tell me if this is the case and I will update my answer.
Here is another method using Sow and Reap. The same caveats apply. Both of these examples are simply guidelines for how you may implement your functionality.
Reap[
  Sow[#3, {Round[{#, #2}]}] & @@@ Join[list1, list2],
  _,
  List
][[2]] ~Cases~ {_, {_, __}}
To deal with duplicate-after-round elements within each list, you may use Round and GatherBy on each list as follows.
newList1 = GatherBy[{Round[{#, #2}], #3} & @@@ list1, First][[All, 1]];
newList2 = GatherBy[{Round[{#, #2}], #3} & @@@ list2, First][[All, 1]];
and then proceed with:
newList1 ~Join~ newList2 ~GatherBy~ First ~Select~ (Length@# > 1 &)
Here's my approach, relying on Nearest to match the points.
Let's assume that list1 doesn't have fewer elements than list2. (Otherwise you can swap them using {list1, list2} = {list2, list1})
(* extract points *)
points1=list1[[All,{1,2}]];
points2=list2[[All,{1,2}]];
(* build a "nearest-function" for matching them *)
nf=Nearest[points1]
(* two points match only if they're closer than threshold *)
threshold=100;
(* This function will find the match of a point from points2 in points1.
If there's no match, the point is discarded using Sequence[]. *)
match[point_] :=
  With[{m = First@nf[point]},
    If[Norm[m - point] < threshold, {m, point}, Unevaluated@Sequence[]]
  ]
(* find matching point-pairs *)
matches = match /@ points2;
(* build hash tables to retrieve the properties associated with points quickly *)
Clear[values1,values2]
Set[values1[{#1, #2}], #3] & @@@ list1;
Set[values2[{#1, #2}], #3] & @@@ list2;
(* get the property-pairs *)
{values1[#1], values2[#2]} & @@@ matches
An alternative is to use a custom DistanceFunction in Nearest to avoid the use of values1 & values2 and get a shorter program. This may be slower or faster; I didn't test it with large data at all.
Note: How complicated the implementation needs to be really depends on your particular dataset. Does each point from the first set have a match in the second one? Are there any duplicates? How close can points from the same dataset be? Etc. I tried to provide something which can be tweaked to be relatively robust, at the cost of having longer code.
