I tried to write a perl version of the following algorithm
Here is the code that I have:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $s1 = 'GATTACCA';
my $s2 = 'AGTGGGCGGGGAGAGAGAGAGAGG';
my $dist = levdist($s1, $s2);
sub levdist
{
my ( $seq1, $seq2 ) = (#_)[0,1];
my $l1 = length($s1);
my $l2 = length($s2);
my #s1 = split '', $seq1;
my #s2 = split '', $seq2;
for (my $i = 0; $i <= $l1; $i++) {
my $distances->[$i]->[0] = $i;
}
for (my $j = 0; $j <= $l2; $j++) {
my $distances->[0]->[$j] = $j;
}
for (my $i = 1; $i <= $l1; $i++) {
for (my $j = 1; $j <= $l2; $j++) {
if ( $s1[$i-1] eq $s2[$j-1] ) {
my $cost = 0;
} else {
my $cost = 1;
}
my $distances->[$i]->[$j] = minimum($distances->[$i-1]->[$j-1] + my $cost,
$distances->[$i]->[$j-1]+1,
$distances->[$i-1]->[$j]+ 1 )
}
}
my $min_distance = my $distances->[$l1]->[$l2];
for (my $i = 0; $i <= $l1; $i++) { my $min_distance = minimum($min_distance, my $distances->[$i]->[$l2]);
}
for (my $j = 0; $j <= $l2; $j++ ) {
my $min_distance = minimum($min_distance, my $distances->[$l1]->[$j]);
}
return $min_distance;
}
sub minimum
{
my $min = shift #_;
foreach ( #_ ) {
if ( $_ < $min ) {
$min = $_;
}
}
return $min;
}
This throws the following error:
Global symbol "$distances" requires explicit package name at ./levenshtein.pl line 33.
Global symbol "$distances" requires explicit package name at ./levenshtein.pl line 34.
Global symbol "$distances" requires explicit package name at ./levenshtein.pl line 35.
When I alter the code to look like:
my $distances->[$i]->[$j] = minimum(my $distances->[$i-1]->[$j-1] + my $cost,
my $distances->[$i]->[$j-1]+1,
my $distances->[$i-1]->[$j]+ 1
I get the following set of errors:
"my" variable $distances masks earlier declaration in same statement at
./levenshtein.pl line 33 (#1)
(W misc) A "my", "our" or "state" variable has been redeclared in the
current scope or statement, effectively eliminating all access to the
previous instance. This is almost always a typographical error. Note
that the earlier variable will still exist until the end of the scope
or until all closure references to it are destroyed.
"my" variable $distances masks earlier declaration in same statement at
./levenshtein.pl line 34 (#1)
"my" variable $distances masks earlier declaration in same statement at
./levenshtein.pl line 35 (#1)
I feel like I am in a catch-22. I get an error if I declare the variable or not. Any insights would be appreciated.
Thanks,
Use my to declare a variable once, in its proper scope. It will get cleaned up when it falls out of scope.
Use library functions so as not to reinvent the wheel. List::More
You should also use better variable names. $i,$l1,$i1 they are hard to read, but easy to introduce bugs.
It would be more perlish use the range operator, so instead of writing
for (my $i = 0; $i <= $l1; $i++) { you could use for my $i ( 0 .. $l1 ) {
But you probably need (0 .. $li-1) instead of (0 .. $li) since perl arrays are 0-based by default, so you may be introducing a bug.
I would recommend you write a test file to validate the results of your algorithm.
Here is a working (compiling/running, not necessarily correct) version:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use List::Util qw( min max );
my $s1 = 'GATTACCA';
my $s2 = 'AGTGGGCGGGGAGAGAGAGAGAGG';
my $dist = levdist($s1, $s2);
print "Distance between '$s1' and '$s2' is $dist\n";
sub levdist {
my ( $seq1, $seq2 ) = (#_);
my $l1 = length($s1);
my $l2 = length($s2);
my #s1 = split '', $seq1;
my #s2 = split '', $seq2;
my $distances;
for (my $i = 0; $i <= $l1; $i++) {
$distances->[$i]->[0] = $i;
}
for (my $j = 0; $j <= $l2; $j++) {
$distances->[0]->[$j] = $j;
}
for (my $i = 1; $i <= $l1; $i++) {
for (my $j = 1; $j <= $l2; $j++) {
my $cost;
if ( $s1[$i-1] eq $s2[$j-1] ) {
$cost = 0;
} else {
$cost = 1;
}
$distances->[$i]->[$j] = min($distances->[$i-1]->[$j-1] + $cost,
$distances->[$i]->[$j-1]+1,
$distances->[$i-1]->[$j]+ 1 )
}
}
my $min_distance = $distances->[$l1]->[$l2];
for (my $i = 0; $i <= $l1; $i++) {
$min_distance = min($min_distance, $distances->[$i]->[$l2]);
}
for (my $j = 0; $j <= $l2; $j++ ) {
$min_distance = min($min_distance, $distances->[$l1]->[$j]);
}
return $min_distance;
}
Output
Distance between 'GATTACCA' and 'AGTGGGCGGGGAGAGAGAGAGAGG' is 6
Have a read:
perldoc -f my
perldoc List::More
perldoc Test::More
Related
I have a collection of arrays that include numbers from 1 to 10. The size of each array is 5. For example
[1,2,3,4,5]
[3,4,9,8,2]
[1,5,7,9,2]
[1,2,5,9,7]
...........
[3,8,1,6,9]
What algorithm should I use to find repeated triads in these arrays?
For example one of the results should be 1,2,5 since this triad is included in some of the arrays. I don't mind how many times some triad is repeated. I am looking the n most often (could be 3 or 4 or something else).
[1,2,3] is the same with [3,1,2] and each number is allowed only once. [3,3,4] is not valid.
This problem gets harder if we assume arrays of 10 or more numbers, so that each array could have a combination of triads. Just food for thought
[1,3,5,19,45,11,12,13,9,31]
[1,3,5,32,46,15,12,18,29,37]
result : (1,3,5) (1,3,12) (3,5,12) etc
I have completely reviewed my response :
**Bugs fixed** in : `array function computeRepettition(array $a);`
**Avoid** increment of repetition if triad was already found in pass-1
**Return** an array of arrays, and the number of repetition of each triad is set in '**numberOfRepetition**', the triad in self is the key of the array
**Support** number composed of 2 digits or more
**New** `array function iCombination(array $a);` reduce the number of probability for finding triad, since the order it is not important, and repetition of number is not allowed
**Update** of `array function detectRepetition(array $a);` detects all triad that can be found
<?php
define ("MIN_LENGTH_VECTOR" , 3 );
define ("KEY_SEPERATOR" , '-');
$src2D = array (
array(1,3,5,19,45,11,12,13,9, 100,31),
array(1,3,5,32,46,15,100, 12,18,29,37),
array(1222,32222,5,3222222,4622222,1522222,100, 12,182222,292222,372222));
var_dump (computeRepetition ($src2D));
function computeRepetition ($src2D) {
$repetition = array ();
for ($i=0 ; $i<count($src2D)-1 ; $i++) {
foreach ($repetition as &$rep) {
$rep['escape'] = TRUE;
}
for ($j=$i+1 ; $j<count($src2D) ; $j++) {
$t = buildTruth ($src2D[$i], $src2D[$j]);
$r = detectRepetition ($t);
if (is_null ($r)) continue;
$comb = iCombination ($r);
foreach ($comb as $cb) {
if (isset ($repetition[$cb]['escape']) && $repetition[$cb]['escape'] === TRUE) continue;
if (array_key_exists ($cb, $repetition)) {
$repetition[$cb]['numberOfRepetition']++;
} else {
$repetition[$cb]['numberOfRepetition'] = 2;
$repetition[$cb]['escape'] = FALSE;
}
}
}
}
return $repetition;
}
function detectRepetition ($t) {
$a = array ();
foreach ($t as $key => $value) {
if ($value === TRUE) {
$a[] = $key;
}
}
if (count($a) < MIN_LENGTH_VECTOR) return NULL;
return $a;
}
function iCombination ($array) {
$res = array ();
sort ($array, SORT_NUMERIC);
for ($i = 0 ; $i < count ($array) - 2 ; $i++) {
for ($k = $i + 1 ; $k < count ($array) - 1 ; $k++) {
for ($l = $k + 1 ; $l < count ($array) ; $l++) {
$res[] = $array[$i] . KEY_SEPERATOR . $array[$k] . KEY_SEPERATOR . $array[$l];
}
}
}
return $res;
}
function buildTruth ($vec1, $vec2) {
foreach ($vec1 as $v) {
$t[$v] = FALSE;
}
foreach ($vec2 as $v) {
if (isset ($t[$v]) && $t[$v] === FALSE ) $t[$v] = TRUE ;
}
return $t;
}
I'd go with brute force. Depending on how you define 'triad' (is [1,2,3] different than [3,2,1]? Is [3,3,3] admissible?) There are from C(10,3)=120 to 1000 possible triads, and each array generates C(5,3)=10 triads.
Prepare a (hash) table of counters for triads, iterate over the arrays updating the counters as you go, select triads with largest counts.
Are there any ways to make this program faster? I am thinking about some faster tools for user input etc.
Here is my code:
sub partia {
my ( $u, $v ) = #_;
if ( $u == $v ) { return $u }
if ( $u == 0 ) { return $v }
if ( $v == 0 ) { return $u }
if ( ~$u & 1 ) {
if ( $v & 1 ) {
return partia( $u >> 1, $v );
}
else {
return partia( $u >> 1, $v >> 1 ) << 1;
}
}
if ( ~$v & 1 ) {
return partia( $u, $v >> 1 );
}
if ( $u > $v ) {
return partia( ( $u - $v ) >> 1, $v );
}
return partia( ( $v - $u ) >> 1, $u );
}
sub calosc {
$t = <>;
while ($t) {
#tab = split( /\s+/, <> );
print( partia( $tab[0], $tab[1] ), "\n" );
$t--;
}
}
calosc();
How does program works :
Generally it returns greatest common divisor for 2 numbers inputed by user. It's mostly Stein's algorithm.
INPUT :
First line:
How many pairs user wants to check.[enter]
Second line:
first number [space] second number[enter]
OUTPUT:
GCD[enter]
In Python I would use things like :
from sys import stdin
t=int(stdin.readline())
instead of
t=input()
Is there any way to do it?
Your solution — Recursive Stein's Algorithm
It appears that you're simply trying to get the GCD of two numbers, and wanting to do so quickly.
You're apparently using the recursive version of the Binary GCD Algorithm. Typically speaking, it is much better to use an iterative algorithm for both speed and scalability. However, I would assert that it is almost certainly worth it to try the much simpler Euclidean algorithm first.
Alternatives — Iterative Stein's Algorithm and Basic Euclidean Algorithm
I've adapted your script to take 3 number pairs from the __DATA__ block as input. The first pair are just two small numbers, then I have two numbers from the Fibonacci Sequence, and finally two larger numbers including some shared powers of two.
I then coded two new subroutines. One of them uses the Iterative Stein's Algorithm (the method your using), and the other is just a simple Euclidean Algorithm. Benchmarking your partia subroutine versus my two subroutine for 1 million iterations report that the iterative is 50% faster, and that Euclid is 3 times faster.
use strict;
use warnings;
use Benchmark;
#use Math::Prime::Util::GMP qw(gcd);
# Original solution
# - Stein's Algorithm (recursive)
sub partia {
my ( $u, $v ) = #_;
if ( $u == $v ) { return $u }
if ( $u == 0 ) { return $v }
if ( $v == 0 ) { return $u }
if ( ~$u & 1 ) {
if ( $v & 1 ) {
return partia( $u >> 1, $v );
}
else {
return partia( $u >> 1, $v >> 1 ) << 1;
}
}
if ( ~$v & 1 ) {
return partia( $u, $v >> 1 );
}
if ( $u > $v ) {
return partia( ( $u - $v ) >> 1, $v );
}
return partia( ( $v - $u ) >> 1, $u );
}
# Using Euclidian Algorithm
sub euclid {
my ( $quotient, $divisor ) = #_;
return $divisor if $quotient == 0;
return $quotient if $divisor == 0;
while () {
my $remainder = $quotient % $divisor;
return $divisor if $remainder == 0;
$quotient = $divisor;
$divisor = $remainder;
}
}
# Stein's Algorithm (Iterative)
sub stein {
my ($u, $v) = #_;
# GCD(0,v) == v; GCD(u,0) == u, GCD(0,0) == 0
return $v if $u == 0;
return $u if $v == 0;
# Remove all powers of 2 shared by U and V
my $twos = 0;
while ((($u | $v) & 1) == 0) {
$u >>= 1;
$v >>= 1;
++$twos;
}
# Remove Extra powers of 2 from U. From here on, U is always odd.
$u >>= 1 while ($u & 1) == 0;
do {
# Remove all factors of 2 in V -- they are not common
# Note: V is not zero, so while will terminate
$v >>= 1 while ($v & 1) == 0;
# Now U and V are both odd. Swap if necessary so U <= V,
# then set V = V - U (which is even). For bignums, the
# swapping is just pointer movement, and the subtraction
# can be done in-place.
($u, $v) = ($v, $u) if $u > $v;
$v -= $u;
} while ($v != 0);
return $u << $twos;
}
# Process 3 pairs of numbers
my #nums;
while (<DATA>) {
my ($num1, $num2) = split;
# print "Numbers = $num1, $num2\n";
# print ' partia = ', partia($num1, $num2), "\n";
# print ' euclid = ', euclid($num1, $num2), "\n";
# print ' stein = ', stein($num1, $num2), "\n";
# print ' gcd = ', gcd($num1, $num2), "\n\n";
push #nums, [$num1, $num2];
}
# Benchmark!
timethese(1_000_000, {
'Partia' => sub { partia(#$_) for #nums },
'Euclid' => sub { euclid(#$_) for #nums },
'Stein' => sub { stein(#$_) for #nums },
# 'GCD' => sub { gcd(#$_) for #nums },
});
__DATA__
20 25 # GCD of 5
89 144 # GCD of Fibonacci numbers = 1
4789084 957196 # GCD of 388 = 97 * 2 * 2
Outputs:
Benchmark: timing 1000000 iterations of Euclid, Partia, Stein...
Euclid: 9 wallclock secs ( 8.31 usr + 0.00 sys = 8.31 CPU) # 120279.05/s (n=1000000)
Partia: 26 wallclock secs (26.00 usr + 0.00 sys = 26.00 CPU) # 38454.14/s (n=1000000)
Stein: 18 wallclock secs (17.36 usr + 0.01 sys = 17.38 CPU) # 57544.02/s (n=1000000)
Module Solution — Math::Prime::Util::GMP qw(gcd)
The fastest solutions are likely to be C implementations of these algorithms though. I therefore recommend finding already coded versions like that provided by Math::Prime::Util::GMP.
Running benchmarks including this new function shows that it is twice again as fast as the basic Euclidean algorithm that I programmed:
Benchmark: timing 1000000 iterations of Euclid, GCD, Partia, Stein...
Euclid: 8 wallclock secs ( 8.32 usr + 0.00 sys = 8.32 CPU) # 120264.58/s (n=1000000)
GCD: 3 wallclock secs ( 3.93 usr + 0.00 sys = 3.93 CPU) # 254388.20/s (n=1000000)
Partia: 26 wallclock secs (25.94 usr + 0.00 sys = 25.94 CPU) # 38546.04/s (n=1000000)
Stein: 18 wallclock secs (17.55 usr + 0.00 sys = 17.55 CPU) # 56976.81/s (n=1000000)
Unless I've completely forgotten what I'm doing (no promises) - this algorithm looks like it keeps dividing it's terms by 2 in each recurse, which means your algorithm is O(log-base2-N). Unless you can find a constant-time algorithm, you've probably got the best one at the moment.
Now #ikegami has mentioned micro-optimizations...if you want to make those, I suggest that you check out Devel::NYTProf for an awesome Perl profiler that should be able to tell you where you're spending time in your algorithm, so you can target your microoptimisations.
I would like to print a database-table to STDOUT. If the table-width is greater than the screen-width I would like to cut the column each with the same percentage (unless a table-width has reached min_width) until the table fits in the screen. I've tried to solve this with the posted subroutine. Does somebody know a shorter and more elegant algorithm to solve this problem?
sub cal_size {
my ( $maxcols, $ref ) = #_;
# $maxcols => screen width
# $ref => ref to an AoA; holds the table
my ( $max ) = cal_tab( $ref );
# $max => ref to an array; holds the length of the longest string of each column
# $tab = 2;
if ( $max and #$max ) {
my $sum = sum( #$max ) + $tab * #$max;
$sum -= $tab;
my #max_tmp = #$max;
my $percent = 0;
while ( $sum > $maxcols ) {
$percent += 0.5;
if ( $percent > 99 ) {
return;
}
my $count = 0;
for my $i ( 0 .. $#max_tmp ) {
# $min_width => columns width should not be less than $min_width if possible
next if $min_width >= $max_tmp[$i];
# no need to cut if the column width id less than min_width
next if $min_width >= minus_x_percent( $max_tmp[$i], $percent );
# don't cut if column width become less than min_width
$max_tmp[$i] = minus_x_percent( $max_tmp[$i], $percent );
$count++;
last if $sum <= $maxcols;
}
$min_width-- if $count == 0 and $min_width > 1;
# if no cuts but $sum > $maxcols reduce $min_width
$sum = sum( #max_tmp ) + $tab * #max_tmp;
$sum -= $tab;
}
my $rest = $maxcols - $sum;
while ( $rest > 0 ) { # distribute the rest
my $count = 0;
for my $i ( 0 .. $#max_tmp ) {
if ( $max_tmp[$i] < $max->[$i] ) {
$max_tmp[$i]++;
$rest--;
$count++;
last if $rest < 1;
}
}
last if $count == 0;
last if $rest < 1;
}
$max = [ #max_tmp ] if #max_tmp;
}
return $max;
}
sub minus_x_percent {
my ( $value, $percent ) = #_;
return int $value - ( $value * 1/100 * $percent );
}
This problem would be simple if it wasn't for the lower limit for the field widths. Once a field cannot get any smaller only the larger ones are eligible for scaling, so the calculation varies depending on whether all, none, or some of the fields have been scaled down to their limit.
The scaling has several bands, one per unique field width. As the fields are scaled down in equal proportion the smallest of them will be the first to hit the minimum field size limit. After that only the columns bigger than the smallest size can be reduced any further until the second smallest also reaches the limit.
This continues until all columns have reached their minimum size, after which the available space is just divided equally between the columns.
This program implements the calculations for that algorithm, and I think does what you want.
Note that the returned field widths are floating-point values and you must round them as you see fit.
use strict;
use warnings;
use List::Util 'max';
my $min_col_width = 10;
my $tab = 2;
my $widths = recalc_widths(80, [ 10, 15, 20, 25, 30 ]);
print join ' ', map sprintf('%.3f', $_), #$widths;
print "\n";
sub recalc_widths {
my ($target, $widths) = #_;
$target -= (#$widths - 1) * $tab;
my #sorted_widths = sort { $a <=> $b } #$widths;
my $num_limited = 0;
my $adjustable_total_width = 0;
$adjustable_total_width += $_ for #sorted_widths;
while (#sorted_widths) {
my $boundary = $sorted_widths[0];
my $scale = ($target - $num_limited * $min_col_width) / $adjustable_total_width;
if ($boundary * $scale >= $min_col_width) {
return [ map max($_ * $scale, $min_col_width), #$widths ];
}
while (#sorted_widths and $sorted_widths[0] == $boundary) {
shift #sorted_widths;
$adjustable_total_width -= $boundary;
$num_limited++;
}
}
return [ ($target / $num_limited) x $num_limited ];
}
output
10.000 10.333 13.778 17.222 20.667
I think it's better to use a proper module from CPAN. If you can calc the width of each column, then configure each row print with Text::Format
5 monkey share n peaches, they cannot distribute equally. So the first monkey dump 1 peach, and total number of peaches can be divided by 5, and the first monkey took his part.
Then is the second monkey, -1 peach, can be divided by 5 and took his part.
Until the fifth monkey finished all the steps. There may be some peaches still left.
Give the minimum number of peaches that satisfy this condition.
perl code 1:
#!/usr/bin/perl -w
for $n (0..10000){ #this is basic idea but code is too messy !
if( ($n-1) % 5 == 0 ){
$remain = 4/5 * ($n -1 );
if( ($remain - 1) % 5 == 0){
$remain = 4/5 * ($remain -1 );
if( ($remain - 1) % 5 == 0){
$remain = 4/5 * ($remain -1 );
if( ($remain - 1) % 5 == 0){
$remain = 4/5 * ($remain -1 );
if( ($remain - 1) % 5 == 0){
$remain = 4/5 * ($remain -1 );
print "remain: $remain original: $n\n";
}
}
}
}
}
}
perl code 2:
sub doit($){
($n) = #_;
if( ($n - 1) % 5 ==0 ){ #if can be distributed by 5 monkey
$n = ($n - 1) * 4/5; #commit distribute
return $n;
}else{
return -1; #fail
}
}
for $n (0..10000){ #restriction
$r = $n; #"recursively" find solution
$o = $n; #backup n
$count = 0;
for ($i = 0; $i < 5; $i++){ #assume there is 5 monkey, it can be changed
$r = doit($r);
if($r == -1){ #skip once fail
last;
}
$count++;
}
if($count == 5){ # if pass 5 test, then you found the number !
print "now ".$r."\n";
print "origin ".$o."\n";
}
}
I am thinking to cut some code. But felt hard. Can anyone help ?
First of all, you really should use strict and warnings pragmas at the top of your scripts. Your $n usage is especially worrisome. In the future, if you declare variables with my but use the same name, you convey the fact that they will represent the same quantity, without the fear that they might collide.
Anyway here is a slightly polished, and more importantly strict and warnings safe version:
#!/usr/bin/env perl
use strict;
use warnings;
sub doit {
my ($n) = #_;
if( ($n - 1) % 5 ==0 ){ #if can be distributed by 5 monkey
$n = ($n - 1) * 4/5; #commit distribute
return $n;
} else {
return undef; #fail
}
}
OUTER: for my $n (0..10000){ #restriction
my $r = $n; #"recursively" find solution
for (1..5){ #assume there is 5 monkey, it can be changed
$r = doit($r);
next OUTER unless defined $r;
}
# if code gets here, then it passed 5 test, then you found the number !
print "now: $r\torigin: $n\n";
}
And now, if you really want to be fun with it (don't use this in production, readability first! ):
#!/usr/bin/env perl
use strict;
use warnings;
OUTER: for my $n (0..10000){
my $r = $n;
$r = ($r - 1) % 5 ? next OUTER : 4/5 * ($r - 1) for (1..5);
print "now: $r\torigin: $n\n";
}
or even golfed:
for(0..10000){$r=$n=$_;map$r*=--$r%5?next:4/5,1..5;print"now: $r\torigin: $n\n"}
Consider this solution:
sub share {
($_[0] - 1) % 5 == 0 ? ($_[0]-1)/5*4 : die "unable to share";
}
for my $i (1..10000) {
eval {
share(share(share(share(share($i)))));
};
unless ($#) {
print "solution: $i\n";
last;
}
}
I'm sure there is a monad lurking within.
I'm not 100% sure I understand your question, but instead of searching for the answer, start with the last monkey. The minimum peaches he could take is 1, and even though there could be some left, to get the minimum, assume there are 0 left. Now, calculate how many peaches the second to last monkey saw, and so on.
There is no need to loop, if you start from the last monkey
# set numPeaches to what the last monkey had
$numPeaches = 1;
# then, figure out how many the second to last monkey had, and add to numPeaches
# and, so on ...
# until you get to the first monkey
I have two sets of integers A and B (size of A less than or equal to B), and I want to answer the question, "How close is A to B?". The way I want to answer this question is by producing a measure of how far you have to go from a given a in A to find a b in B.
The specific measure I want to produce does the following: for each a, find the closest b, the only catch being that once I match a b with an a, I can no longer use that b to match any other a's. (EDIT: the algorithm I'm trying to implement will always prefer a shorter match. So if b is the nearest neighbor to more than one a, pick the a nearest to b. I'm not sure what to do if more than one a has the same distance to b, right now I'm picking the a that precedes b, but that's quite arbitrary and not necessarily optimal.) The measure that I'll for make these sets, the final product, is a histogram showing the number of pairs in the vertical axis and the distance of the pairs in the x-axis.
So if A = {1, 3, 4} and B = {1, 5, 6, 7}, I will get the following a,b pairs: 1,1, 4,5, 3,6. For these data, the histogram should show one pair with distance zero, one pair with distance 1, and one pair with distance 3.
(The actual size of these sets has an upper bound around 100,000 elements, and I read them in from disk already sorted low to high. The integers range from 1 to ~20,000,000. EDIT: also, the elements of A and B are unique, i.e. no repeated elements.)
The solution I've come up with feels a bit clunky. I'm using Perl, but the problem is more or less language agnostic.
First I make a hash, with one key for each number that appears in the union of A and B and values indicating whether each number appears in A, B, or both, e.g. $hash{5} = {a=>1, b=>1} if the number 5 appears in both data-sets. (If it only appeared in A, you'd have $hash{5} = {a=>1}.)
Next, I iterate over A to find all the hash elements that appear in A and B, mark them in the measure, and remove them from the hash.
Then, I sort all the hash keys and make each element of the hash point to its nearest neighbors, like a linked list, where a given hash element now looks like $hash{6} = {b=>1, previous=>4, next=>8}. The linked list doesn't know whether the next and previous elements are in A or B.
Then I loop over pair distances starting at d=1, and find all pairs with distance d, mark them, remove them from the hash, until there are no more elements of A to match.
The loop looks like this:
for ($d=1; #a > 0; $d++) {
#left = ();
foreach $a in #a {
$next = $a;
# find closest b ahead of $a, stop searching if you pass $d
while (exists $hash{$next}{next} && $next - $a < $d) {
$next = $hash{$next}{next};
}
if ($next is in B && $next - $a == $d) {
# found a pair at distance $d
mark_in_measure($a, $next);
remove_from_linked_list($next);
remove_from_linked_list($a);
next;
}
# do same thing looking behind $a
$prev = $a;
...
# you didn't find a match for $a
push #left, $a;
}
#a = #left;
}
This loop obviously prefers pairs that match b's that appear later than a's; I can't tell whether there's a sensible way to decide whether later is better than prior (better in terms of getting closer pairs). The main optimization I'm interested in is processing time.
Sounds like you have a particular case of the Assignment Problem (finding a minimum matching in a weighted bipartite graph).
The algorithm to solve the assignment problem is too slow for you at O(N^3) but I'm pretty sure there you can shave some of this complexity off by exploiting your particular weight function or how you only want a histogram instead of the exact matching.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use diagnostics;
# http://www.hungarianalgorithm.com/solve.php?c=3-2-6-22--7-2-2-18--13-8-4-12--23-18-14-2&random=1
# https://www.topcoder.com/community/data-science/data-science-tutorials/assignment-problem-and-hungarian-algorithm/
# http://www.cse.ust.hk/~golin/COMP572/Notes/Matching.pdf
my #mat;
my #out_mat;
my $spaces = 6;
my $precision = 0;
my $N = 10;
my $M = 12;
my $r = 100;
my #array1; my #array2;
for my $i (1..$N) {
push #array1, sprintf( "%.${precision}f", rand($r) );
}
for my $i (1..$M) {
push #array2, sprintf( "%.${precision}f", rand($r) );
}
##array1 = ( 1, 3, 4); # $mat[i]->[j] = abs( array1[i] - array2[j] )
##array2 = ( 1, 5, 6, 7);
# 1 5 6 7
# 1 [ 0* 4 5 6 ]
# 3 [ 2 2* 3 4 ]
# 4 [ 3 1 2* 3 ]
my $min_size = $#array1 < $#array2 ? $#array1 : $#array2;
my $max_size = $#array1 > $#array2 ? $#array1 : $#array2;
for (my $i = 0; $i < #array1; $i++){
my #weight_function;
for (my $j = 0; $j < #array2; $j++){
my $dif = sprintf( "%.${precision}f", abs ($array1[$i] - $array2[$j]) );
#my $dif = sprintf( "%.${precision}f", ($array1[$i] - $array2[$j])**2 );
push #weight_function, $dif;
}
push #mat, \#weight_function;
}
# http://cpansearch.perl.org/src/TPEDERSE/Algorithm-Munkres-0.08/lib/Algorithm/Munkres.pm
Algorithm::Munkres::assign(\#mat,\#out_mat);
print "\n\#out_mat index = [";
for my $index (#out_mat) {
printf("%${spaces}d", $index);
}
print " ]\n";
print "\#out_mat values = [";
my %hash;
for my $i (0 .. $max_size){
my $j = $out_mat[$i];
last if ( $i > $min_size and $#array1 < $#array2 );
next if ( $j > $min_size and $#array1 > $#array2 );
my $dif = $mat[$i]->[$j];
printf( "%${spaces}.${precision}f", $dif );
$hash{ $dif } { $i } { 'index_array1' } = $i;
$hash{ $dif } { $i } { 'index_array2' } = $j;
$hash{ $dif } { $i } { 'value_array1' } = $array1[$i];
$hash{ $dif } { $i } { 'value_array2' } = $array2[$j];
}
print " ]\n\n";
my $soma_da_dif = 0;
foreach my $min_diferenca ( sort { $a <=> $b } keys %hash ){
foreach my $k ( sort { $a <=> $b } keys %{$hash{$min_diferenca}} ){
$soma_da_dif += $min_diferenca;
my $index_array1 = $hash{ $min_diferenca } { $k } { 'index_array1' };
my $index_array2 = $hash{ $min_diferenca } { $k } { 'index_array2' };
my $value_array1 = $hash{ $min_diferenca } { $k } { 'value_array1' };
my $value_array2 = $hash{ $min_diferenca } { $k } { 'value_array2' };
printf( " index (%${spaces}.0f,%${spaces}.0f), values (%${spaces}.${precision}f,%${spaces}.${precision}f), dif = %${spaces}.${precision}f\n",
$index_array1, $index_array2, $value_array1, $value_array2, $min_diferenca );
}
}
print "\n\nSum = $soma_da_dif\n";
#-------------------------------------------------#
#------------------ New-Package ------------------#
{ # start scope block
package Algorithm::Munkres;
use 5.006;
use strict;
use warnings;
require Exporter;
our #ISA = qw(Exporter);
our #EXPORT = qw( assign );
our $VERSION = '0.08';
...
... <---- copy all the 'package Algorithm::Munkres' here
...
return $minval;
}
1; # don't forget to return a true value from the file
} # end scope block