I have an unsorted set of n-dimensional vectors and, for each of the n dimensions in turn, I am looking for the subsets of vectors that differ in only this dimension's component. How can I do this efficiently?
Example:
[ (1,2,3), (1,3,3), (2,3,3), (1,2,5), (2,2,5), (2,3,4) ]
dim 3 variable: [ (1,2,3), (1,2,5) ] & [ (2,3,3), (2,3,4) ]
dim 2 variable: [ (1,2,3), (1,3,3) ]
dim 1 variable: [ (1,3,3), (2,3,3) ] & [ (1,2,5), (2,2,5) ]
Thanks very much for your help!
EDIT
As requested in a comment, I am now posting my buggy code:
recursive subroutine get_peaks_on_same_axis(indices, result, current_dim, look_at, last_dim, mode, upper, &
                                            num_groups, num_dim)
   ! Group the indices that denote the location of peaks within PEAK_INDICES which have n-1 dimensions in common.
   ! Eventually, RESULT will hold the groups of these peaks.
   ! e.g.: result(1,:) == (3,7,9) <= peak_indices(3), peak_indices(7), and peak_indices(9) belong together

   integer, intent(in)    :: indices(:), current_dim, look_at, last_dim, mode, num_dim
   integer, intent(inout) :: upper(:), num_groups, result(:,:) ! in RESULT: each line holds a group of peaks
   integer                :: i, pos_on_axis, next_dim, aux(0:num_dim-1), stat
   integer, allocatable   :: num_peaks(:), groups(:,:)
   integer, save          :: slot

   if (mode.eq.0) slot = 1

   ! we're only writing to RESULT once group determination has been completed
   if (current_dim.eq.last_dim) then
      ! saving each column of 'groups' of the instance of the subroutine called one level further up
      ! => those are the peaks which have n-1 dimensions in common
      upper(slot) = ubound(indices,1)
      result(slot,1:upper(slot)) = indices
      num_groups = slot ! after the final call it will contain the actual number of peak groups
      slot = slot + 1
      return
   end if

   aux(0:num_dim-2) = (/ (i,i = 2,num_dim) /)
   aux(num_dim-1)   = 1

   associate(peak_indices => public_spectra%intensity(look_at)%peak_indices, &
             ndp          => public_spectra%axes(look_at)%ax_set(current_dim)%num_data_points)

      ! potentially as many peaks as there are points in this dimension
      allocate(num_peaks(ndp), groups(ndp,ubound(indices,1)), stat=stat)
      if (stat.ne.0) call aloerr('spectrum_paraphernalia.f90',763)
      num_peaks(:) = 0

      ! POS_ON_AXIS: ppm value of the peak in dimension DIM, converted to an index on the axis
      ! GROUPS: peaks that have the same axis index in dimension DIM; line: index on axis;
      do i=1,ubound(indices,1)
         pos_on_axis = peak_indices(current_dim,indices(i))
         num_peaks(pos_on_axis) = num_peaks(pos_on_axis) + 1 ! num. of peaks that have this coordinate
         groups(pos_on_axis,num_peaks(pos_on_axis)) = indices(i)
      end do

      next_dim = aux(mod(current_dim+(num_dim-1),num_dim))

      do pos_on_axis=1,ubound(num_peaks,1)
         if (num_peaks(pos_on_axis).gt.0) then
            call get_peaks_on_same_axis(groups(pos_on_axis,1:num_peaks(pos_on_axis)), result, next_dim, look_at, last_dim, &
                                        1, upper, num_groups, num_dim)
         end if
      end do

   end associate
end subroutine get_peaks_on_same_axis
What about the naive way?
Let's assume you have m vectors of length n.
Then you have to compare every vector with every other vector, which gives 1/2*(m^2 - m) = O(m^2) comparisons.
In each comparison you check the two vectors element-wise. If you find one difference, you still have to make sure that there is no second difference. In the best case the two vectors already differ in the first two elements, so you can stop after only two element comparisons. The worst case is one difference or no difference at all, which requires all n element comparisons for that pair.
If there is exactly one difference, store its dimension; otherwise store a sentinel value such as 0 or -1.
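For illustration, here is a minimal sketch of this naive pairwise check in Python (my own code, not from the question; the helper name single_diff_dim is made up), run on the question's example:

from itertools import combinations

def single_diff_dim(u, v):
    # Return the (1-based) dimension in which u and v differ, if they differ
    # in exactly one dimension; otherwise return -1.
    diff_dim = -1
    for d, (a, b) in enumerate(zip(u, v), start=1):
        if a != b:
            if diff_dim != -1:   # second difference found: give up early
                return -1
            diff_dim = d
    return diff_dim

vectors = [(1, 2, 3), (1, 3, 3), (2, 3, 3), (1, 2, 5), (2, 2, 5), (2, 3, 4)]
for u, v in combinations(vectors, 2):          # the O(m^2) pair loop
    d = single_diff_dim(u, v)
    if d != -1:
        print(f"dim {d} variable: {u} {v}")

This reproduces the groupings listed in the question, one pair at a time.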
Related
Given a finite random sequence of bits, how can the minimum number of bit toggles necessary to result in a sorted sequence (i.e. any and all 0's are before any and all 1's) be determined?
Note, homogeneous sequences (e.g. 0000000 and 1111111) are considered sorted by this definition.
Also, this is not technically "sorting" the sequence, because elements are toggled in place rather than swapped with other elements. Is there a better word than "sorting" to describe this operation?
Let Z(n) be the cost of setting the first n bits all to 0.
Let X(n) be the minimum cost of "sorting" the first n bits.
We have:
Z(0) = 0, X(0) = 0
if the ith bit is 0: Z(i) = Z(i-1), X(i) = min( Z(i-1), X(i-1)+1 )
if the ith bit is 1: Z(i) = Z(i-1)+1, X(i) = X(i-1)
The answer is X(n).
It's even easier in code:
def min_toggles(sequence):
    z = 0   # cost of turning every bit seen so far into a 0
    x = 0   # minimum cost of "sorting" the bits seen so far
    for bit in sequence:
        if bit == 0:
            x = min(z, x + 1)
        else:
            z = z + 1
    return x
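A quick sanity check of min_toggles, using two of the bit sequences from the answer further down:

min_toggles([1, 0, 0, 1, 1])   # -> 1 (toggle the leading 1)
min_toggles([0, 1, 0, 0, 0])   # -> 1 (toggle the lone 1)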
One canonical dynamic program is to evaluate, in O(1) time and space per bit, two states for each position: the cost of keeping the bit as it is, and the cost of toggling it, in each case treating that position as the rightmost element of its section (and adding the cost of the toggles that this assignment implies for the rest of the sequence).
Sorry, the above applies to the general problem, where "sorted" could be in either direction. (To fix a single direction, keep only the relevant one of the two "best" updates in the code below.)
Python code:
def f(bits):
    set_bits = sum(bits)
    left_ones = 0
    best = len(bits)
    for i, bit in enumerate(bits):
        right_ones = set_bits - left_ones - bit
        left_zeros = i - left_ones
        right_zeros = len(bits) - i - 1 - right_ones
        # As rightmost 0
        best = min(best, bit + left_ones + right_zeros)
        # As rightmost 1
        best = min(best, (bit ^ 1) + left_zeros + right_ones)
        left_ones += bit
    return best
Test cases (expected results in the comments):
bit_sets = [
    [1,0,0,1,1],    # 1
    [1,0,0,1,0,1],  # 2
    [0,1,1,0,1,0],  # 2
    [1,1,1,1],      # 0
    [0,0,1,1],      # 0
    [0,1,0,0,0],    # 1
    [1,0,1,1,1]     # 1
]

for bits in bit_sets:
    print(f(bits), bits)
I am working on implementing periodic boundary conditions in a finite element problem, and I want to pair the nodes on boundary A with the nodes on boundary B, given a vector trans that translates boundary A onto boundary B. The nodes on boundary A are given in a list g1; those on B in g2. The node coordinates are looked up in mesh%nodes(:,nodenum).
I have decided to do this by building a node-to-node distance matrix, which I realise is not the most efficient way, and to be honest I don't expect to save significant time by optimising this algorithm; the question is more academic.
I know that Fortran stores arrays in column-major order. On the other hand, the matrix will be symmetric, and once it is complete I want to take column slices of it to find the nearest node. So the question is: how should one populate it?
Here is my naive attempt.
subroutine autopair_nodes_in_groups(mesh, g1, g2, pairs, trans)
   type(meshdata)                  :: mesh
   integer(kind=sp)                :: i,j
   integer(kind=sp),dimension(:)   :: g1,g2
   integer(kind=sp),dimension(:,:) :: pairs
   real(kind=dp)                   :: trans(3) ! xyz translate
   real(kind=dp)                   :: dist_mat(size(g1),size(g2))
   real(kind=dp)                   :: p1(3), p2(3)

   dist_mat = -1.0_dp

   ! make a distance matrix
   do j=1,size(g2)
      p2 = mesh%nodes(1:3,g2(j))-trans
      do i=1,j
         p1 = mesh%nodes(1:3,g1(i))
         dist_mat(i,j) = norm2(p1-p2) ! equivalent to norm2(n1pos+trans-n2pos)
         if (i.ne.j) dist_mat(j,i) = dist_mat(i,j) ! fill symmetry
      end do
   end do

   ! Remainder of routine to find nearest nodes
end subroutine autopair_nodes_in_groups
The problem as far as I can tell is that this is efficient in terms of memory access until one symmetrises the array.
To do a fast nearest-neighbor search, you should use a tree structure (e.g. a k-d tree), which has O(log(N)) search complexity per query, instead of looking at all point-to-point distances, which is O(N^2).
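Just to illustrate that suggestion, here is a rough sketch of tree-based pairing in Python with SciPy (not the Fortran you would actually write; coords, g1, g2 and trans are made-up stand-ins for the question's mesh data, with coords holding one row per node):

import numpy as np
from scipy.spatial import cKDTree

# Stand-ins: coords ~ mesh%nodes (one row per node), g1/g2 ~ the two boundary
# node lists, trans ~ the translation that maps boundary A onto boundary B.
coords = np.array([[0., 0., 0.], [0., 1., 0.], [1., 0., 0.], [1., 1., 0.]])
g1, g2 = np.array([0, 1]), np.array([2, 3])
trans = np.array([1., 0., 0.])

tree = cKDTree(coords[g1])                      # build the tree over boundary A once
dist, nearest = tree.query(coords[g2] - trans)  # one O(log N) lookup per node of boundary B
pairs = np.column_stack((g1[nearest], g2))      # one row per (A node, B node) pair
print(pairs)                                    # node 0 pairs with 2, node 1 with 3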
Anyway, regarding symmetric matrix handling, you'll have:
! Storage size of a symmetric matrix
elemental integer function sym_size(N)
   integer, intent(in) :: N
   sym_size = (N*(N+1))/2
end function sym_size

! Compute pointer to an element in the symmetric matrix array
elemental integer function sym_ptr(N,i,j)
   integer, intent(in) :: N,i,j
   integer :: irow,jcol

   ! Column-major storage
   irow = merge(i,j,i>=j)
   jcol = merge(j,i,i>=j)

   ! Skip previous columns
   sym_ptr = N*(jcol-1)-((jcol-1)*(jcol-2))/2

   ! Locate in current column
   sym_ptr = sym_ptr + (irow-jcol+1)
end function sym_ptr
then do your job:
N = size(g2)
allocate(sym_dist_mat(sym_size(N)))
do j=1,size(g2)
   p2 = mesh%nodes(1:3,g2(j))-trans
   do i=j,size(g2)
      p1 = mesh%nodes(1:3,g1(i))
      sym_dist_mat(sym_ptr(N,i,j)) = norm2(p1-p2)
   end do
end do
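As an aside, the packed index formula is easy to sanity-check outside Fortran. A small Python sketch (my own, mirroring sym_ptr with the same 1-based convention):

def sym_ptr(N, i, j):
    # packed, column-major lower-triangle index of element (i, j), 1-based
    irow, jcol = (i, j) if i >= j else (j, i)
    return N*(jcol - 1) - ((jcol - 1)*(jcol - 2))//2 + (irow - jcol + 1)

N = 5
packed = [sym_ptr(N, i, j) for j in range(1, N + 1) for i in range(j, N + 1)]
assert packed == list(range(1, N*(N + 1)//2 + 1))   # fills slots 1..N*(N+1)/2 exactly once
assert sym_ptr(N, 2, 4) == sym_ptr(N, 4, 2)          # (i,j) and (j,i) map to the same slot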
The minloc function should then look something like this (untested):
! minloc-style
pure function symmetric_minloc(N,matrix,dim) result(loc_min)
   integer, intent(in) :: N
   real(8), intent(in) :: matrix(N*(N+1)/2)
   integer, intent(in) :: dim
   real(8) :: dim_min(N),min_column
   integer :: loc_min(N)
   integer :: i,j,ptr

   select case (dim)
      ! Row/column does not matter: it's symmetric!
      case (1,2)
         dim_min = huge(dim_min)
         loc_min = -1
         ptr = 0
         do j=1,N
            ! Diagonal element m(j,j)
            ptr = ptr+1
            min_column = matrix(ptr)
            if (min_column<=dim_min(j)) then
               loc_min(j) = j
               dim_min(j) = min_column
            end if
            ! Lower-diagonal part at column j
            do i=j+1,N
               ptr = ptr+1
               ! Applies to both this column,
               if (matrix(ptr)<=dim_min(j)) then
                  loc_min(j) = i
                  dim_min(j) = matrix(ptr)
               end if
               ! And the i-th column
               if (matrix(ptr)<=dim_min(i)) then
                  loc_min(i) = j
                  dim_min(i) = matrix(ptr)
               end if
            end do
         end do
      case default
         ! Invalid dimension
         loc_min = -1
   end select
end function symmetric_minloc
I have this structure
x = [8349310431, 8349314513]
y = [667984788, 667987788]
z = [148507632380, 153294624079]
map = Hash[x.zip([y, z].transpose).sort]
#=> {
# 8349310431=>[667984788, 148507632380],
# 8349314533=>[667987788, 153294624079]
# }
and I need to compare each key with the rest of the keys. If the difference of the keys is less than 100, you have to compare the first elements that the keys point to, and if that difference of elements is less than 100, the procedure is repeated with the second elements that the keys point to.
Example:
key[0] - key[1] = 8349310431 - 8349314533 = 4102 (in absolute value)
Since that difference is greater than 100, we now subtract the first elements that the keys point to:
element1Key1 - element1Key2 = 667984788 - 667987788 = 3000 (in absolute value)
As this difference is also greater than 100, we repeat with the second elements:
element2Key1 - element2Key2 = 15329460 - 15329462 = 2 (in absolute value)
Since this is less than 100, we stop here and record it, for example in a counter.
Even if the difference is less than 100 already at the key comparison, it cannot stop there; the procedure has to be carried through to the second element that the keys point to.
But how do I do it?
Sorry for my English, I don't really speak it; I hope you understand. Thanks!
Does this make sense?
x = [8349310431, 8349314513]
y = [667984788, 667987788]
z = [15329460, 15329462]
[x, y, z].detect { |a, b| (a-b).abs < 100 } # => [15329460, 15329462]
Just in case, why build a hash?
I have a rather unorthodox homework assignment where I am to write a simple function in which a double value is rounded to an integer using only a while loop.
The main goal is to write something similar to the round function.
I made some progress: the idea was to repeatedly add or subtract a very small double value, so that I would eventually hit a number that becomes an integer:

while (~isinteger(inumberup))
    inumberup = inumberup + realmin('double');
end
However, this results in a never-ending loop. Is there a way to accomplish this task?
I'm not allowed to use round, ceil, floor, for, rem or mod for this question.
Assumption: if statements and the abs function are allowed, since the list of forbidden functions does not include them.
Here's one solution. What you can do is keep subtracting 1 from the input value until you get to a point where it becomes less than 1. The number left after this point is the fractional component of the number (i.e. if our number was 3.4, the fractional component is 0.4). You would then check whether the fractional component, which we will call f, is less than 0.5. If it is, that means you need to round down, so you would subtract f from the input number. If f is greater than or equal to 0.5, you would add (1 - f) to the input number in order to go up to the next highest integer. However, this only handles positive values. For negative values, round in MATLAB rounds halfway cases away from zero, which for negative numbers means towards negative infinity, so what we ought to do is take the absolute value of the input number and do the repeated subtraction on that to find the fractional part.
Once we have f, we check what it is equal to and then, depending on the sign of the number, we either add or subtract accordingly: if the fractional part is less than 0.5, we subtract f when the number is positive and add f when it is negative. If the fractional part is greater than or equal to 0.5, we add (1 - f) when the number is positive and subtract (1 - f) when it is negative.
Therefore, assuming that num is the input number of interest, you would do:
function out = round_hack(num)
    %// Repeatedly subtract 1 until we get a value that is less than 1,
    %// i.e. the fractional part
    %// Also make sure to take the absolute value
    f = abs(num);
    while f > 1
        f = f - 1;
    end

    %// Case where we need to round down
    if f < 0.5
        if num > 0
            out = num - f;
        else
            out = num + f;
        end
    %// Case where we need to round up
    else
        if num > 0
            out = num + (1 - f);
        else
            out = num - (1 - f);
        end
    end
Be advised that this will be slow for larger values of num. I've also wrapped this into a function for ease of debugging. Here are a few example runs:
>> round_hack(29.1)
ans =
29
>> round_hack(29.6)
ans =
30
>> round_hack(3.4)
ans =
3
>> round_hack(3.5)
ans =
4
>> round_hack(-0.4)
ans =
0
>> round_hack(-0.6)
ans =
-1
>> round_hack(-29.7)
ans =
-30
You can check that this agrees with MATLAB's round function for the above test cases.
You can do it without a loop: use num2str to convert the number into a string, then find the position of the '.' in the string and extract the substring from the beginning up to (but not including) the '.'; then convert it back to a number with str2num.
To round, you have to check the value of the first character (converted into a number) after the '.'.
r = rand*100
s = num2str(r)                  % number as a string
idx = strfind(num2str(r),'.')   % position of the decimal point
v = str2num(s(idx+1))           % first digit after the decimal point
if (v < 5)
    rounded_val = str2num(s(1:idx-1))      % round down: keep the integer part
else
    rounded_val = str2num(s(1:idx-1))+1    % round up
end
Hope this helps.
Qapla'
I have a matrix, matrix_logical(50000,100000), which is a sparse logical matrix (a lot of false values, some true). I have to produce a matrix, intersect(50000,50000), that, for each pair (i,j) of rows of matrix_logical(50000,100000), stores the number of columns in which rows i and j both have the value true.
Here is the code I wrote:
% store in advance the nonzeros cols
for i = 1:50000
    nonzeros{i} = num2cell(find(matrix_logical(i,:)));
end

intersect = zeros(50000,50000);
for i = 1:49999
    a = cell2mat(nonzeros{i});
    for j = (i+1):50000
        b = cell2mat(nonzeros{j});
        intersect(i,j) = numel(intersect(a,b));
    end
end
Is it possible to further increase the performance? It takes too long to compute the matrix. I would like to avoid the double loop in the second part of the code.
matrix_logical is sparse in content, but it is not stored as a MATLAB sparse matrix, because that made the performance far worse.
Since the [i,j] entry counts the number of nonzero elements in the element-wise multiplication of rows i and j, you can compute it by multiplying matrix_logical by its transpose (you should convert to a numeric data type first, e.g. matrix_logical = single(matrix_logical)):
inter = matrix_logical * matrix_logical';
And it works for both the sparse and the full representation.
EDIT
In order to calculate numel(intersect(a,b))/numel(union(a,b)) (as asked in your comment), you can use the fact that for two sets a and b you have
length(union(a,b)) = length(a) + length(b) - length(intersect(a,b))
so, you can do the following:
unLen = sum(matrix_logical,2);
tmp = repmat(unLen, 1, length(unLen)) + repmat(unLen', length(unLen), 1);
inter = matrix_logical * matrix_logical';
inter = inter ./ (tmp-inter);
If I understood you correctly, you want a logical AND of the rows:
intersct = zeros(50000, 50000);
for ii = 1:49999
    for jj = ii:50000
        intersct(ii, jj) = sum(matrix_logical(ii, :) & matrix_logical(jj, :));
        intersct(jj, ii) = intersct(ii, jj);
    end
end
Doesn't avoid the double loop, but at least works without the first loop and the slow find command.
Elaborating on my comment, here is a distance function suitable for pdist()
function out = distfun(xi,xj)
    out = zeros(size(xj,1),1);
    for i = 1:size(xj,1)
        out(i) = sum(sum( xi & xj(i,:) )) / sum(sum( xi | xj(i,:) ));
    end
In my experience, sum(sum()) is faster for logicals than nnz(), thus its appearance above.
You would also need to use squareform() to reshape the output of pdist() appropriately:
squareform(pdist(matrix_logical, @distfun));
Note that pdist() includes a 'jaccard' distance measure, but it is actually the Jaccard distance and not the Jaccard index or coefficient, which is the value you are apparently after.
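For what it's worth, the two are directly related: Jaccard index = 1 - Jaccard distance. A tiny illustration in Python/SciPy rather than MATLAB (SciPy's 'jaccard' metric is likewise the distance; the boolean test matrix here is made up):

import numpy as np
from scipy.spatial.distance import pdist, squareform

rows = np.random.rand(6, 20) < 0.3                  # small boolean stand-in for matrix_logical
jaccard_dist = squareform(pdist(rows, 'jaccard'))   # pairwise Jaccard *distance*
jaccard_index = 1.0 - jaccard_dist                  # = intersection/union for each pair of rows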