This is part of the MATLAB Profiler report, showing how these lines are eating up the time. Can it be improved upon?
434 %clean up empty cells in subPoly
228 435 if ~isempty(subPoly)
169 436 subPoly(cellfun(@isempty,subPoly)) = [];
437
438 %remove determined subpoly points from the hull polygon
169 439 removeIndex = zeros(size(extendedPoly,1),1);
169 440 for i=1:length(subPoly)
376 441 for j=1:size(subPoly{i}(:,1))
20515 442 for k=1:size(extendedPoly,1)
6.12 5644644 443 if extendedPoly(k,:)==subPoly{i}(j,:)
30647 444 removeIndex(k,1)=1;
30647 445 end
1.08 5644644 446 end
0.02 20515 447 end
376 448 end
169 449 extendedPoly = extendedPoly(~removeIndex(:,1),:);
169 450 end
Since MATLAB has a tendency to treat everything on the command line as an array of doubles, and assuming that the contents of your arrays and cell arrays are numeric, you can replace
if extendedPoly(k,:)==subPoly{i}(j,:)
removeIndex(k,1)=1;
end
with the equivalent
removeIndex(k,1) = removeIndex(k,1) || all(extendedPoly(k,:)==subPoly{i}(j,:));
which might save a few nano-seconds, though I'll be a bit surprised if it saves much more.
I suspect that if I was a little smarter or more diligent I could probably replace your entire loop nest with a single assignment along the lines of
removeIndex = extendedPoly==subPoly
The trick here is to ensure that all the arrays in the expression have the same dimensions.
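For what it's worth, MATLAB can do this row-wise membership test in a single call: stack the cell contents and pass them to ismember, as in ismember(extendedPoly, vertcat(subPoly{:}), 'rows'), which returns the removal mask directly. The same broadcast-and-reduce idea, sketched below in Python/NumPy on toy data (the array names and values are made up purely to show the shape bookkeeping):
import numpy as np

extended_poly = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
sub_points    = np.array([[1.0, 2.0], [5.0, 6.0]])   # stand-in for the stacked subPoly cells

# Compare every row of extended_poly against every row of sub_points in one
# broadcast expression, then reduce: a row is removed if it matches any sub row.
remove_index = (extended_poly[:, None, :] == sub_points[None, :, :]).all(axis=2).any(axis=1)
extended_poly = extended_poly[~remove_index]
print(extended_poly)   # rows [1. 2.] and [5. 6.] are gone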
You're probably approaching a performance limit within the current nesting strategy. The "slow" line only requires 1 usec per execution.
Generally in a set-matching case like this, you are better off sorting both sets and then performing a single loop through both sets concurrently. (Google "merge join" -- the merge step of merge sort -- for more on this; also see this related question/answer Optimization of timestamp filter in MATLAB - Working with very large datasets)
It's not immediately obvious how to best apply this to your circumstance. If you post an executable example we could look into this more closely.
Without looking at executable code, it may make sense to expand your subPoly cell of vectors into a single, sorted numeric array (called something like sortedElementsToRemove). Then get the sort order from extendedPoly like this: [~, ixsSortExtended] = sort(extendedPoly);.
Now you can use a single loop with two indexes to perform the masking. Something like this (code not tested):
ixExtended = 1;  %Index through the sort order
for ixSub = 1:length(sortedElementsToRemove)
    %Use while to update the second index
    while ...
            (extendedPoly(ixsSortExtended(ixExtended)) < sortedElementsToRemove(ixSub)) && ...
            ixExtended < length(ixsSortExtended)
        ixExtended = ixExtended + 1;
    end
    if (sortedElementsToRemove(ixSub) == extendedPoly(ixsSortExtended(ixExtended)))
        removeIndex(ixsSortExtended(ixExtended)) = true;
    end
end
Related
I'm trying to split a dataset into train and test groups in Python using a method similar to what I'm used to in R (I realize there are other options). So I'm defining an array of row numbers that will make up my train set. I then want to grab the remaining row numbers for my test set using np.delete. Since there are 170 rows total and 136 go to the train set, the test set should have 34 rows. But it's got 80 -- the actual number varies when I change my random seed ... What have I got wrong here?
np.random.seed(222)
marriage = np.random.rand(170,55)
rows,cols = marriage.shape
sample = np.random.randint(0,rows-1,(round(.8*rows)))
train = marriage[sample,:]
test = np.delete(marriage, sample, axis=0)
print(marriage.shape)
print(len(sample))
print(train.shape)
print(test.shape)
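np.random.randint samples with replacement, so sample almost certainly contains duplicate row indices, and np.delete only drops each unique index once; that is why test keeps more than 34 rows and why the count moves with the seed. A short sketch, reusing the names above, of one common way to draw distinct indices instead (np.random.choice with replace=False):
import numpy as np

np.random.seed(222)
marriage = np.random.rand(170, 55)
rows, cols = marriage.shape

sample = np.random.randint(0, rows - 1, round(.8 * rows))
print(len(sample), len(np.unique(sample)))      # 136 draws, but fewer unique indices

# replace=False draws 136 distinct row indices
# (and, unlike randint(0, rows - 1, ...), it can also pick the last row)
sample = np.random.choice(rows, size=round(.8 * rows), replace=False)
train = marriage[sample, :]
test = np.delete(marriage, sample, axis=0)
print(train.shape, test.shape)                  # (136, 55) and (34, 55)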
I have a very large array of permutations, which is using a significant amount of RAM. This is the current code I have, which SHOULD:
Count all the permutations except those where more than one '1' occurs or three '2's appear in a row.
arr = [*1..3].repeated_permutation(30).to_a;
count = 0
arr.each do |x|
if not x.join('').include? '222' and x.count(1) < 2
count += 1
end
end
print count
So basically this results in an array of 3**30 (roughly 200 trillion) permutations, each of which has 30 elements.
I've tried to run it through Terminal but it literally ate through 14GB of RAM, and didn't move for 15 minutes, so I'm not sure whether the process froze while attempting to access more RAM or if it was still computing.
My question being: is there a faster way of doing this?
Thanks!
I am not sure what problem you are trying to solve. If your code is just an example of a more complex problem and you really need to check every single permutation programmatically, then you might want to experiment with lazy:
[*1..3].repeated_permutation(30).lazy.each do |x|
# your condition
end
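For comparison, the same lazy idea in Python: itertools.product yields one tuple at a time, so memory stays flat, although 3**30 iterations is still far too many to ever finish:
import itertools

count = 0
for x in itertools.product((1, 2, 3), repeat=30):   # lazy: one tuple at a time, no giant array
    s = ''.join(map(str, x))
    if '222' not in s and x.count(1) < 2:
        count += 1
print(count)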
Or you might want to make the nested iteration very explicit:
[1,2,3].each do |x1|
[1,2,3].each do |x2|
[1,2,3].each do |x3|
# ...
[1,2,3].each do |x30|
permutation = [x1,x2,x3, ... , x30]
# your condition
end
end
end
end
end
But it feels wrong to me to solve this kind of problem with Ruby enumerables at all. Let's have a look at your strings:
111111111111111111111111111111
111111111111111111111111111112
111111111111111111111111111113
111111111111111111111111111121
111111111111111111111111111122
111111111111111111111111111123
111111111111111111111111111131
...
333333333333333333333333333323
333333333333333333333333333331
333333333333333333333333333332
333333333333333333333333333333
I suggest just using enumerative combinatorics. Look at the patterns and analyse (or count) how often your condition can be true. For example, there are 28 indexes in your string at which a 222 substring could be placed, and only 27 for a 2222 substring... If you place such a substring, how likely is it that there is no 1 in the rest of the string?
I think your problem is a mathematics problem, not a programming problem.
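If you do want the number itself rather than a formula, you can also count without ever materialising a permutation. A rough sketch in Python (not your Ruby), with the condition hard-coded as "at most one 1 and no three 2s in a row", using a small dynamic program over the 30 positions:
def count_valid(length=30):
    # counts[(ones_used, trailing_run_of_2s)] = number of valid prefixes in that state
    counts = {(0, 0): 1}
    for _ in range(length):
        nxt = {}
        for (ones, run), n in counts.items():
            if ones == 0:                        # append a '1' (only one allowed in total)
                key = (1, 0)
                nxt[key] = nxt.get(key, 0) + n
            if run < 2:                          # append a '2' unless it would complete "222"
                key = (ones, run + 1)
                nxt[key] = nxt.get(key, 0) + n
            key = (ones, 0)                      # append a '3', which resets the run of 2s
            nxt[key] = nxt.get(key, 0) + n
        counts = nxt
    return sum(counts.values())

print(count_valid(30))   # finishes in milliseconds instead of walking 3**30 strings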
NB This is an incomplete answer, but I think the idea might give a push toward the proper solution.
I can think of the following approach: let's represent each permutation as a value in base 3 (ternary), padded with zeroes:
1 = 000..00001
2 = 000..00002
3 = 000..00010
4 = 000..00011
5 = 000..00012
...
Now consider we restated the original task, treating zeroes as ones, ones as twos and twos as threes. So far so good.
The whole list of permutations would be represented by:
(0..3**30-1).map { |e| e.to_s(3).rjust(30, '0') }
Now we are to apply your conditions:
def do_calc permutation_count
  (0..3**permutation_count-1).inject(0) do |memo, e|
    x = e.to_s(3).rjust(permutation_count, '0')
    !x.include?('111') && x.count('0') < 2 ? memo + 1 : memo
  end
end
Unfortunately, even for permutation_count == 20 it takes more than 5 minutes to calculate, so probably some additional steps are required. I will be thinking of further optimization. Currently I hope this will give you a hint to find the good approach yourself.
I am trying to use ran1 from Numerical Recipes in my Fortran 90 code. I think a common way is to compile the old routine separately and then use the object file, but here I want to know what changes are necessary to use it directly in my code.
FUNCTION ran1(idum)
INTEGER idum,IA,IM,IQ,IR,NTAB,NDIV
REAL ran1,AM,EPS,RNMX
PARAMETER (IA=16807,IM=2147483647,AM=1./IM,IQ=127773,IR=2836,
! NTAB=32,NDIV=1+(IM-1)/NTAB,EPS=1.2e-7,RNMX=1.-EPS)
! “Minimal” random number generator of Park and Miller with Bays-Durham shuffle and
! added safeguards. Returns a uniform random deviate between 0.0 and 1.0 (exclusive of
! the endpoint values). Call with idum a negative integer to initialize; thereafter, do not
! alter idum between successive deviates in a sequence. RNMX should approximate the largest
! floating value that is less than 1.
INTEGER j,k,iv(NTAB),iy
SAVE iv,iy
DATA iv /NTAB*0/, iy /0/
iy = 0
if (idum.le.0.or.iy.eq.0) then !Initialize.
idum=max(-idum,1)
! Be sure to prevent idum = 0.
do 11 j=NTAB+8,1,-1
! Load the shuffle table (after 8 warm-ups).
k=idum/IQ
idum=IA*(idum-k*IQ)-IR*k
if (idum.lt.0) idum=idum+IM
if (j.le.NTAB) iv(j)=idum! Compute idum=mod(IA*idum,IM) without overflows by
enddo 11
iy=iv(1)
endif
k=idum/IQ
idum=IA*(idum-k*IQ)-IR*k
! Compute idum=mod(IA*idum,IM) without overflows by
if (idum.lt.0) idum=idum+IM ! Schrage’s method.
j=1+iy/NDIV
iy=iv(j) ! Output previously stored value and refill the shuffle table.
iv(j)=idum
ran1=min(AM*iy,RNMX) ! Because users don’t expect endpoint values.
return
END
Your code is malformed. It looks like you copied it manually from the book, but not exactly. The second problem is actually present even in the book.
Firstly, the second line of the parameter statement needs to be a line continuation, not a comment:
PARAMETER (IA=16807,IM=2147483647,AM=1./IM,IQ=127773,IR=2836, &
NTAB=32,NDIV=1+(IM-1)/NTAB,EPS=1.2e-7,RNMX=1.-EPS)
(converted to free form, see the book for the original)
Secondly, the loop is a strange combination of a do loop with numeric label and a do loop with end do:
do 11 j=NTAB+8,1,-1
...
enddo 11
should be
do j=NTAB+8,1,-1
...
enddo
or
do 11 j=NTAB+8,1,-1
...
11 continue
There may be more problems present.
I have files that can be 19 GB or greater; they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep, but I'm not sure it's what I'm looking for. As an example, I will have a 19 GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need, more or less, a boolean: true/false for whether it is inside this file.
Actually, sgrep is what I was looking for. The reason I got confused is that "structured grep" has the same name as "sorted grep", and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a < z) {
        m = (a+z)/2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines and keep the second piece,
        // i.e. the first complete line after offset m.
        if line < target
            a = m + 1
        else
            z = m - 1
    }
    return (line == target)
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
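If it helps, here is the same binary-search idea written out in runnable Python rather than pseudocode. It assumes the file is sorted on its first comma-separated field and that lines are newline-terminated; the function name and file name are made up:
import os

def sorted_file_has_key(path, key):
    # Binary search over byte offsets in a file sorted on its first
    # comma-separated field: O(log filesize) short reads instead of a full scan.
    key = key.encode()

    def line_at(f, offset):
        # Return the first complete line that starts at or after `offset`.
        f.seek(offset)
        if offset > 0:
            f.readline()            # we probably landed mid-line; skip to the next line start
        return f.readline()

    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at(f, mid)
            if line and line.split(b",", 1)[0] < key:
                lo = mid + 1        # the key, if present, lies beyond this point
            else:
                hi = mid            # this line (or EOF) is at or past the key
        line = line_at(f, lo)
        return bool(line) and line.split(b",", 1)[0] == key

# e.g. sorted_file_has_key("huge_sorted.txt", "ABCDEFG")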
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets to match the sort order of the log file, then walk through the file and the target list in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()
    line = file.readln()
    hits = []
    while (not file.eof() and target != null) {
        if line < target
            line = file.readln()
        elsif line > target
            target = target_array.shift()
        else   // line == target
            hits.push(line)
            line = file.readln()
    }
    return hits
}
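Purely for illustration, here is that parallel walk in runnable Python, again assuming the file is sorted on its first comma-separated field (the function and variable names are made up):
def sorted_file_find_keys(path, targets):
    # One sequential pass over a sorted file, checking a sorted list of keys.
    targets = sorted(t.encode() for t in targets)
    hits, i = set(), 0
    with open(path, "rb") as f:
        for line in f:
            if i >= len(targets):
                break                           # nothing left to look for
            field = line.split(b",", 1)[0]
            while i < len(targets) and targets[i] < field:
                i += 1                          # this key cannot appear later in a sorted file
            if i < len(targets) and targets[i] == field:
                hits.add(targets[i].decode())
    return hits

# e.g. sorted_file_find_keys("huge_sorted.txt", ["ABCDEFG", "ZZZZZZZ"])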
I have a parser written in Perl which parses a file of fixed-length records. Part of each record consists of several strings (also fixed length) made of numbers only. Each character in the string is encoded as a number, not as an ASCII char; i.e., if I have the string 12345, it's encoded as 01 02 03 04 05 (instead of 31 32 33 34 35).
I parse the record with unpack, and this particular part is unpacked as @array = unpack "C44", $s. Then I recover the needed string with a simple join, like $m = join("", @array).
I was wondering if that's an optimal way to decode. The files are quite big, millions of records, and obviously I tried to see if it's possible to optimize. The profiler shows that most of the time is spent parsing the records (i.e., reading, writing and other stuff is not a problem), and within the parsing most of the time is taken by these joins. I remember from other sources that join is quite an efficient operation. Any ideas whether it's possible to speed the code up more, or is it already optimal? Perhaps it would be possible to avoid the intermediate array in some clever way, e.g., use a pack/unpack combination instead?
Edited: code example
The code I am trying to optimise looks like this:
while (read(READ, $buf, $rec_l) == $rec_l) {
    my @s = unpack "A24 C44 H8", $buf;
    my $msisdn = substr $s[0], 0, 11;
    my $address = join("", @s[4..14]);
    my $imsi = join("", @s[25..39]);
    my $ts = localtime(hex($s[45]));
}
Untested (I'll come back and edit when I'm less busy) but this should work if I've done all of the math correctly, and be faster:
my ($msisdn, $address, $imsi, $ts) =
    unpack "A11 x13 x3 a11 x10 a15 x5 N", $buf;
$address |= "0" x 11;
$imsi |= "0" x 15;
$ts = localtime($ts);
As always in Perl, faster is less readable :-)
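For what it's worth, the |= lines work because OR-ing a raw digit byte with ASCII "0" (0x30) turns 0x01..0x09 into the characters '1'..'9'. The same idea in Python, on made-up sample bytes, purely to illustrate the trick:
raw = bytes([0x01, 0x02, 0x03, 0x04, 0x05])        # one digit per byte, as in the question
decoded = bytes(b | 0x30 for b in raw).decode()    # OR each byte with 0x30 ('0') -> "12345"
print(decoded)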
join("", unpack("C44", $s))
I don't believe this change would speed up your code. Everything depends on how often you call the join function while reading one whole file. If you're working in chunks, try to increase their size. If you're doing some operation between the unpack and the join on this array, try to fold it into a single map operation. If you post your source code it will be easier to identify the bottleneck.
I'm a pack/unpack noob, but how about skipping the join by altering your sample code like so:
my $m = unpack "H*", $s ;
quick test:
#!/usr/bin/perl
use strict ;
use Test::More tests => 1 ;
is( unpack("H*", "\x12\x34\x56"),"123456");