transitive reduction algorithm: pseudocode?

I have been looking for an algorithm to perform a transitive reduction on a graph, but without success. There's nothing in my algorithms bible (Introduction to Algorithms by Cormen et al.) and whilst I've seen plenty of transitive closure pseudocode, I haven't been able to track down anything for a reduction. The closest I've got is that there is one in "Algorithmische Graphentheorie" by Volker Turau (ISBN 978-3-486-59057-9), but unfortunately I don't have access to this book! Wikipedia is unhelpful and Google has yet to turn up anything. :^(
Does anyone know of an algorithm for performing a transitive reduction?

See Harry Hsu, "An algorithm for finding a minimal equivalent graph of a digraph", Journal of the ACM, 22(1):11-16, January 1975. The simple cubic algorithm below (using an N x N path matrix) suffices for DAGs, but Hsu generalizes it to cyclic graphs.
// reflexive reduction
for (int i = 0; i < N; ++i)
    m[i][i] = false;

// transitive reduction
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        if (m[i][j])
            for (int k = 0; k < N; ++k)
                if (m[j][k])
                    m[i][k] = false;

The basic gist of the transitive reduction algorithm I used is:

foreach x in graph.vertices
    foreach y in graph.vertices
        foreach z in graph.vertices
            delete edge xz if edges xy and yz exist

The transitive closure algorithm I used in the same script is very similar, but the last line is:

            add edge xz if edges xy and yz OR edge xz exist

This is based on the reference provided by Alan Donovan, which says you should use the path matrix (which has a 1 if there is a path from node i to node j) rather than the adjacency matrix (which has a 1 only if there is an edge from node i to node j).
Some sample Python code follows to show the differences between the solutions:
def prima(m, title=None):
    """ Prints a matrix to the terminal """
    if title:
        print title
    for row in m:
        print ', '.join([str(x) for x in row])
    print ''

def path(m):
    """ Returns a path matrix """
    p = [list(row) for row in m]
    n = len(p)
    for i in xrange(0, n):
        for j in xrange(0, n):
            if i == j:
                continue
            if p[j][i]:
                for k in xrange(0, n):
                    if p[j][k] == 0:
                        p[j][k] = p[i][k]
    return p

def hsu(m):
    """ Transforms a given directed acyclic graph into its minimal equivalent """
    n = len(m)
    for j in xrange(n):
        for i in xrange(n):
            if m[i][j]:
                for k in xrange(n):
                    if m[j][k]:
                        m[i][k] = 0
m = [[0, 1, 1, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 0, 0]]

prima(m, 'Adjacency matrix')
hsu(m)
prima(m, 'After Hsu')

p = path(m)
prima(p, 'Path matrix')
hsu(p)
prima(p, 'After Hsu')
Output:
Adjacency matrix
0, 1, 1, 0, 0
0, 0, 0, 0, 0
0, 0, 0, 1, 1
0, 0, 0, 0, 1
0, 1, 0, 0, 0
After Hsu
0, 1, 1, 0, 0
0, 0, 0, 0, 0
0, 0, 0, 1, 0
0, 0, 0, 0, 1
0, 1, 0, 0, 0
Path matrix
0, 1, 1, 1, 1
0, 0, 0, 0, 0
0, 1, 0, 1, 1
0, 1, 0, 0, 1
0, 1, 0, 0, 0
After Hsu
0, 0, 1, 0, 0
0, 0, 0, 0, 0
0, 0, 0, 1, 0
0, 0, 0, 0, 1
0, 1, 0, 0, 0

The Wikipedia article on transitive reduction points to an implementation within GraphViz (which is open source). Not exactly pseudocode, but maybe someplace to start?
LEDA includes a transitive reduction algorithm. I don't have a copy of the LEDA book anymore, and this function might have been added after the book was published. But if it's in there, then there will be a good description of the algorithm.
Google points to an algorithm that somebody suggested for inclusion in Boost. I didn't try to read it, so it may not be correct.
Also, this might be worth a look.

The algorithm from "girlwithglasses" forgets that a redundant edge could span a chain of three edges. To correct it, compute Q = R x R+ where R+ is the transitive closure, and then delete from R all edges that show up in Q. See also the Wikipedia article.
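As a minimal sketch of that correction (plain Python on 0/1 matrices; Warshall's algorithm supplies the closure, and the function name is my own):

def reduce_by_closure(R):
    """ Returns R minus every edge that appears in Q = R x R+ """
    n = len(R)
    C = [row[:] for row in R]
    # Warshall's algorithm: C becomes the transitive closure R+
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] = C[i][j] or (C[i][k] and C[k][j])
    # Q[i][j] is set when an edge i->k is followed by a path k->j
    Q = [[int(any(R[i][k] and C[k][j] for k in range(n)))
          for j in range(n)] for i in range(n)]
    return [[int(R[i][j] and not Q[i][j]) for j in range(n)] for i in range(n)]

On the five-node example used elsewhere on this page, this produces the same matrix as running hsu on the path matrix.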

Depth-first algorithm in pseudo-python:
def df(edges, vertex0, child0, done):
    if child0 in done:
        return
    for child in child0.children:
        edges.discard((vertex0, child))
        df(edges, vertex0, child, done)
    done.add(child0)

for vertex0 in vertices:
    done = set()
    for child in vertex0.children:
        df(edges, vertex0, child, done)
The algorithm is sub-optimal, but deals with the multi-edge-span problem of the previous solutions. The results are very similar to what tred from graphviz produces.

Ported to Java / JGraphT, the Python sample on this page from @Michael Clerx:
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.jgrapht.DirectedGraph;

public class TransitiveReduction<V, E> {

    final private List<V> vertices;
    final private int[][] pathMatrix;

    private final DirectedGraph<V, E> graph;

    public TransitiveReduction(DirectedGraph<V, E> graph) {
        super();
        this.graph = graph;
        this.vertices = new ArrayList<V>(graph.vertexSet());

        int n = vertices.size();
        int[][] original = new int[n][n];

        // initialize matrix with zeros
        // --> 0 is the default value for int arrays

        // initialize matrix with edges
        Set<E> edges = graph.edgeSet();
        for (E edge : edges) {
            V v1 = graph.getEdgeSource(edge);
            V v2 = graph.getEdgeTarget(edge);
            int v_1 = vertices.indexOf(v1);
            int v_2 = vertices.indexOf(v2);
            original[v_1][v_2] = 1;
        }

        this.pathMatrix = original;
        transformToPathMatrix(this.pathMatrix);
    }

    // (package visible for unit testing)
    static void transformToPathMatrix(int[][] matrix) {
        // compute path matrix
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix.length; j++) {
                if (i == j) {
                    continue;
                }
                if (matrix[j][i] > 0) {
                    for (int k = 0; k < matrix.length; k++) {
                        if (matrix[j][k] == 0) {
                            matrix[j][k] = matrix[i][k];
                        }
                    }
                }
            }
        }
    }

    // (package visible for unit testing)
    static void transitiveReduction(int[][] pathMatrix) {
        // transitively reduce
        for (int j = 0; j < pathMatrix.length; j++) {
            for (int i = 0; i < pathMatrix.length; i++) {
                if (pathMatrix[i][j] > 0) {
                    for (int k = 0; k < pathMatrix.length; k++) {
                        if (pathMatrix[j][k] > 0) {
                            pathMatrix[i][k] = 0;
                        }
                    }
                }
            }
        }
    }

    public void reduce() {
        int n = pathMatrix.length;
        int[][] transitivelyReducedMatrix = new int[n][];
        for (int i = 0; i < n; i++) {
            // copy the rows themselves; arraycopy on an int[][] would only copy row references
            transitivelyReducedMatrix[i] = pathMatrix[i].clone();
        }
        transitiveReduction(transitivelyReducedMatrix);

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (transitivelyReducedMatrix[i][j] == 0) {
                    // System.out.println("removing " + vertices.get(i) + " -> " + vertices.get(j));
                    graph.removeEdge(graph.getEdge(vertices.get(i), vertices.get(j)));
                }
            }
        }
    }
}
Unit test:
import java.util.Arrays;

import org.junit.Assert;
import org.junit.Test;

public class TransitiveReductionTest {

    @Test
    public void test() {
        int[][] matrix = new int[][] {
            {0, 1, 1, 0, 0},
            {0, 0, 0, 0, 0},
            {0, 0, 0, 1, 1},
            {0, 0, 0, 0, 1},
            {0, 1, 0, 0, 0}
        };

        int[][] expected_path_matrix = new int[][] {
            {0, 1, 1, 1, 1},
            {0, 0, 0, 0, 0},
            {0, 1, 0, 1, 1},
            {0, 1, 0, 0, 1},
            {0, 1, 0, 0, 0}
        };

        int[][] expected_transitively_reduced_matrix = new int[][] {
            {0, 0, 1, 0, 0},
            {0, 0, 0, 0, 0},
            {0, 0, 0, 1, 0},
            {0, 0, 0, 0, 1},
            {0, 1, 0, 0, 0}
        };

        System.out.println(Arrays.deepToString(matrix) + " original matrix");

        int n = matrix.length;

        // calc path matrix
        int[][] path_matrix = new int[n][];
        {
            for (int i = 0; i < n; i++) {
                path_matrix[i] = matrix[i].clone(); // copy the rows, not just the row references
            }
            TransitiveReduction.transformToPathMatrix(path_matrix);
            System.out.println(Arrays.deepToString(path_matrix) + " path matrix");
            Assert.assertArrayEquals(expected_path_matrix, path_matrix);
        }

        // calc transitive reduction
        {
            int[][] transitively_reduced_matrix = new int[n][];
            for (int i = 0; i < n; i++) {
                transitively_reduced_matrix[i] = path_matrix[i].clone();
            }
            TransitiveReduction.transitiveReduction(transitively_reduced_matrix);
            System.out.println(Arrays.deepToString(transitively_reduced_matrix) + " transitive reduction");
            Assert.assertArrayEquals(expected_transitively_reduced_matrix, transitively_reduced_matrix);
        }
    }
}
Test output:
[[0, 1, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 1], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0]] original matrix
[[0, 1, 1, 1, 1], [0, 0, 0, 0, 0], [0, 1, 0, 1, 1], [0, 1, 0, 0, 1], [0, 1, 0, 0, 0]] path matrix
[[0, 0, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0]] transitive reduction

Related

Golang sudoku algorithm not working

I'm very new to Golang. I'm trying to solve a sudoku with a backtracking algorithm.
When I run my program there are no errors, but it only displays an incomplete grid, with empty cells. Here is my code:
package main

import "fmt"

var sudoku = [9][9]int{
    {9, 0, 0, 1, 0, 0, 0, 0, 5},
    {0, 0, 5, 0, 9, 0, 2, 0, 1},
    {8, 0, 0, 0, 4, 0, 0, 0, 0},
    {0, 0, 0, 0, 8, 0, 0, 0, 0},
    {0, 0, 0, 7, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 2, 6, 0, 0, 9},
    {2, 0, 0, 3, 0, 0, 0, 0, 6},
    {0, 0, 0, 2, 0, 0, 9, 0, 0},
    {0, 0, 1, 9, 0, 4, 5, 7, 0},
}

func main() {
    IsValid(sudoku, 0)
    Display(sudoku)
}

func Display(sudoku [9][9]int) {
    var x, y int
    for x = 0; x < 9; x++ {
        fmt.Println("")
        if x == 3 || x == 6 {
            fmt.Println(" ")
        }
        for y = 0; y < 9; y++ {
            if y == 3 || y == 6 {
                fmt.Print("|")
            }
            fmt.Print(sudoku[x][y])
        }
    }
}

func AbsentOnLine(k int, sudoku [9][9]int, x int) bool {
    var y int
    for y = 0; y < 9; y++ {
        if sudoku[x][y] == k {
            return false
        }
    }
    return true
}

func AbsentOnRow(k int, sudoku [9][9]int, y int) bool {
    var x int
    for x = 0; x < 9; x++ {
        if sudoku[x][y] == k {
            return false
        }
    }
    return true
}

func AbsentOnBloc(k int, sudoku [9][9]int, x int, y int) bool {
    var firstX, firstY int
    firstX = x - (x % 3)
    firstY = y - (y % 3)
    for x = firstX; x < firstX+3; x++ {
        for y = firstY; y < firstY+3; y++ {
            if sudoku[x][y] == k {
                return false
            }
        }
    }
    return true
}

func IsValid(sudoku [9][9]int, position int) bool {
    if position == 9*9 {
        return true
    }
    var x, y, k int
    x = position / 9
    y = position % 9
    if sudoku[x][y] != 0 {
        return IsValid(sudoku, position+1)
    }
    for k = 1; k <= 9; k++ {
        if AbsentOnLine(k, sudoku, x) && AbsentOnRow(k, sudoku, y) && AbsentOnBloc(k, sudoku, x, y) {
            sudoku[x][y] = k
            if IsValid(sudoku, position+1) {
                return true
            }
        }
    }
    sudoku[x][y] = 0
    return false
}
I'm getting this in the console:
900|100|005
005|090|201
800|040|000
000|080|000
000|700|000
000|026|009
200|300|006
000|200|900
001|904|570
I don't understand why it's not completing the grid. Does anyone have any ideas?
I don't know Golang, but I have written a sudoku-solving algorithm using backtracking.
Your code only iterates over the board once. You start with position = 0; your code then iterates over the board, and if a position has the value zero you try the values 1-9. If that doesn't work you go to the next position, and when position = 81 your code stops.
You add new values to the board with your IsValid function, but you are not iterating over the new board again to see whether those new values help your AbsentOn... functions return a different answer than in the previous iteration. You have to iterate over your board again and again until you are sure there are no 0-valued cells left.
That is the reason you have too many 0s on the board at the end of your program: your program iterates only once, and it cannot solve your example sudoku in a single pass. It has to add new values to the board and make the sudoku easier with every iteration.
Another problem is that your code does not give feedback. For example, it assigns 1 to an empty cell. That seems okay at first, but it doesn't mean the final value of that cell has to be 1. It may have to change, because in a later iteration you may realize there is another cell that can only take the value 1, so now you have to go back to the first cell and find a new value other than 1. Your code fails to do that as well. That's why humans pencil possible values next to a cell when they are not sure.
It looks like your problem is with the algorithm; you have to understand the backtracking algorithm. You could try it first in another language that you know well and then port it to Golang (I wrote mine in C++). Other than that, your Golang code is easy to read and I don't see any Golang-related problems.
Your IsValid function changes the contents of the sudoku. The problem is that, in your code as it is, it changes only a copy of the sudoku. You need to pass the sudoku as a pointer if the function should change the actual variable.
Here are the changes you need in your code; it is only five characters:
func main() {
    IsValid(&sudoku, 0)
    Display(sudoku)
}

// ...

func IsValid(sudoku *[9][9]int, position int) bool {
    // ...
    if AbsentOnLine(k, *sudoku, x) && AbsentOnRow(k, *sudoku, y) && AbsentOnBloc(k, *sudoku, x, y) {
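The underlying rule: in Go, arrays (unlike slices) are value types, so passing a [9][9]int hands the function a copy of all 81 cells, and writes inside the function never reach the caller. A minimal standalone illustration (my own example, not part of the code above):

package main

import "fmt"

func modifyCopy(a [3]int)    { a[0] = 99 } // receives a copy; the caller's array is untouched
func modifyViaPtr(a *[3]int) { a[0] = 99 } // receives a pointer; the caller's array changes

func main() {
    arr := [3]int{1, 2, 3}
    modifyCopy(arr)
    fmt.Println(arr) // [1 2 3]
    modifyViaPtr(&arr)
    fmt.Println(arr) // [99 2 3]
}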

Looking for non-recursive algorithm for visiting all k-combinations of a multiset in lexicographic order

More specifically, I'm looking for an algorithm A that takes as its inputs
a sorted multiset M = {a1, a2, …, an} of non-negative integers;
an integer 0 ≤ k ≤ n = |M|;
a "visitor" callback V (taking a k-combination of M as input);
(optional) a sorted k-combination K of M (DEFAULT: the k-combination {a1, a2, …, ak}).
The algorithm will then visit, in lexicographic order, all the k-combinations of M, starting with K, and apply the callback V to each.
For example, if M = {0, 0, 1, 2}, k = 2, and K = {0, 1}, then executing A(M, k, V, K) will result in the application of the visitor callback V to each of the k-combinations {0, 1}, {0, 2}, {1, 2}, in this order.
A critical requirement is that the algorithm be non-recursive.
Less critical is the precise ordering in which the k-combinations are visited, so long as the ordering is consistent. For example, colexicographic order would be fine as well. The reason for this requirement is to be able to visit all k-combinations by running the algorithm in batches.
In case there are any ambiguities in my terminology, in the remainder of this post I give some definitions that I hope will clarify matters.
A multiset is like a set, except that repetitions are allowed. For example, M = {0, 0, 1, 2} is a multiset of size 4. For this question I'm interested only in finite multisets. Also, for this question I assume that the elements of the multiset are all non-negative integers.
Define a k-combination of a multiset M as any sub-multiset of M of size k. E.g. the 2-combinations of M = {0, 0, 1, 2} are {0, 0}, {0, 1}, {0, 2}, and {1, 2}.
As with sets, the ordering of a multiset's elements does not matter. (e.g. M can also be represented as {2, 0, 1, 0}, or {1, 2, 0, 0}, etc.) but we can define a canonical representation of the multiset as the one in which the elements (here assumed to be non-negative integers) are in ascending order. In this case, any collection of k-combinations of a multiset can itself be ordered lexicographically by the canonical representations of its members. (The sequence of all 2-combinations of M given earlier exhibits such an ordering.)
UPDATE: below I've translated rici's elegant algorithm from C++ to JavaScript as faithfully as I could, and put a simple wrapper around it to conform to the question's specs and notation.
function A(M, k, V, K) {
    if (K === undefined) K = M.slice(0, k);

    var less_than = function (a, b) { return a < b; };

    function next_comb(first, last,
                       /* first_value */ _, last_value,
                       comp) {
        if (comp === undefined) comp = less_than;

        // 1. Find the rightmost value which could be advanced, if any
        var p = last;
        while (p != first && !comp(K[p - 1], M[--last_value])) --p;
        if (p == first) return false;

        // 2. Find the smallest value which is greater than the selected value
        for (--p; comp(K[p], M[last_value - 1]); --last_value) ;

        // 3. Overwrite the suffix of the subset with the lexicographically
        //    smallest sequence starting with the new value
        while (p !== last) K[p++] = M[last_value++];
        return true;
    }

    while (true) {
        V(K);
        if (!next_comb(0, k, 0, M.length)) break;
    }
}
Demo:
function print_it (K) { console.log(K); }
A([0, 0, 0, 0, 1, 1, 1, 2, 2, 3], 8, print_it);
// [0, 0, 0, 0, 1, 1, 1, 2]
// [0, 0, 0, 0, 1, 1, 1, 3]
// [0, 0, 0, 0, 1, 1, 2, 2]
// [0, 0, 0, 0, 1, 1, 2, 3]
// [0, 0, 0, 0, 1, 2, 2, 3]
// [0, 0, 0, 1, 1, 1, 2, 2]
// [0, 0, 0, 1, 1, 1, 2, 3]
// [0, 0, 0, 1, 1, 2, 2, 3]
// [0, 0, 1, 1, 1, 2, 2, 3]
A([0, 0, 0, 0, 1, 1, 1, 2, 2, 3], 8, print_it, [0, 0, 0, 0, 1, 2, 2, 3]);
// [0, 0, 0, 0, 1, 2, 2, 3]
// [0, 0, 0, 1, 1, 1, 2, 2]
// [0, 0, 0, 1, 1, 1, 2, 3]
// [0, 0, 0, 1, 1, 2, 2, 3]
// [0, 0, 1, 1, 1, 2, 2, 3]
This, of course, is not production-ready code. In particular, I've omitted all error-checking for the sake of readability. Furthermore, an implementation for production would probably structure things differently. (E.g. the option to specify the comparator used by next_comb becomes superfluous here.) My main aim was to keep the ideas behind the original algorithm as clear as possible in a piece of functioning code.
I checked the relevant sections of TAoCP, but this problem is at most an exercise there. The basic idea is the same as Algorithm L: try to "increment" the least significant positions first, filling the positions after the successful increment to have their least allowed values.
Here's some Python that might work but is crying out for better data structures.
def increment(M, K):
    M = list(M)  # copy them
    K = list(K)
    for x in K:  # compute the difference
        M.remove(x)
    for i in range(len(K) - 1, -1, -1):
        candidates = [x for x in M if x > K[i]]
        if len(candidates) < len(K) - i:
            M.append(K[i])
            continue
        candidates.sort()
        K[i:] = candidates[:len(K) - i]
        return K
    return None

def demo():
    M = [0, 0, 1, 1, 2, 2, 3, 3]
    K = [0, 0, 1]
    while K is not None:
        print(K)
        K = increment(M, K)
In iterative programming, to generate combinations of size K you would need K for loops. First we remove the repetitions from the sorted input, then we create an array that represents the for-loop indices. While the index array doesn't overflow we keep generating combinations.
The adder function simulates the progression of the counters in a stack of nested for loops. There is a little bit of room for improvement in the implementation below.
N = size of the distinct input
K = pick size

for (var v_0 = 0; v_0 < N - (K - 1); v_0++) {
    ...
    for (var v_i = v_{i-1} + 1; v_i < N - (K - (i + 1)); v_i++) {    // i = 0 .. K-1
        ...
        for (var v_{K-1} = v_{K-2} + 1; v_{K-1} < N; v_{K-1}++) {
            combo = [ array[v_0], ..., array[v_{K-1}] ];
        }
        ...
    }
    ...
}
Here's the working source code in JavaScript:

function adder(arr, max) {
    var k = arr.length;
    var n = max;
    var carry = false;
    var i;
    do {
        for (i = k - 1; i >= 0; i--) {
            arr[i]++;
            if (arr[i] < n - (k - (i + 1))) {
                break;
            }
            carry = true;
        }
        if (carry === true && i < 0) {
            return false; // overflow
        }
        if (carry === false) {
            return true;
        }
        carry = false;
        for (i = i + 1; i < k; i++) {
            arr[i] = arr[i - 1] + 1;
            if (arr[i] >= n - (k - (i + 1))) {
                carry = true;
            }
        }
    } while (carry === true);
    return true;
}

function nchoosekUniq(arr, k, cb) {
    // make the array a distinct set
    var set = new Set();
    for (var i = 0; i < arr.length; i++) { set.add(arr[i]); }
    arr = [];
    set.forEach(function(v) { arr.push(v); });
    //
    var n = arr.length;
    // create index array
    var iArr = Array(k);
    for (var i = 0; i < k; i++) { iArr[i] = i; }
    // find unique combinations
    do {
        var combo = [];
        for (var i = 0; i < iArr.length; i++) {
            combo.push(arr[iArr[i]]);
        }
        cb(combo);
    } while (adder(iArr, n) === true);
}

var arr = [0, 0, 1, 2];
var k = 2;
nchoosekUniq(arr, k, function(set) {
    var s = "";
    set.forEach(function(v) { s += v; });
    console.log(s);
}); // 01, 02, 12

Sort list based on pair-wise affiliation of its elements

Given a list of elements, say [1,2,3,4], and their pair-wise affiliation, say
[[0,   0.5, 1,   0.1],
 [0.5, 0,   1,   0.9],
 [1,   1,   0,   0.2],
 [0.1, 0.9, 0.2, 0  ]]
For those familiar with graph theory, this is basically an adjacency matrix.
What is the fastest way to sort the list such that distance in the list correlates with the pair-wise affiliation as well as possible, i.e. pairs of nodes with a high affiliation end up close to each other?
Is there a way to do this (even a greedy algorithm would be fine) without going too much into MDS and ordination theory?
As a bonus question:
Note that some pair-wise affiliations can be represented perfectly, like for the list [1,2,3] and a pair-wise affiliation:
[[0, 0, 1],
 [0, 0, 1],
 [1, 1, 0]]
the perfect order would be [1,3,2]. But some affiliations can't, like this one:
[[0, 1, 1],
 [1, 0, 1],
 [1, 1, 0]]
where any order is equally good/bad.
Is there a way to tell the quality of an ordering? In the sense of how well it represents the pair-wise affiliations?
Here's a lightly tested algorithm that takes the adjacency matrix, sets up the elements/nodes in order of appearance, and then tries to find an equilibrium. Since it's 1D, I just picked a really simple attractive-force formula. Maybe adding a repulsive force would improve it.
/*
 * Sort the nodes of an adjacency matrix
 * @return {Array<number>} sorted list of node indices
 */
function sort1d(mat) {
    var n = mat.length;
    // equilibrium total force threshold
    var threshold = 1 / (n * n);
    var map = new Map(); // <index, position>
    // initial positions
    for (var i = 0; i < n; i++) {
        map.set(i, i);
    }
    // find an equilibrium (local minimum)
    var prevTotalForce;
    var totalForce = n * n;
    do {
        prevTotalForce = totalForce;
        totalForce = 0;
        for (var i = 0; i < n; i++) {
            var posi = map.get(i);
            var force = 0;
            for (var j = i + 1; j < n; j++) {
                var posj = map.get(j);
                var weight = mat[i][j];
                var delta = posj - posi;
                force += weight * (delta / n);
            }
            // force = Sum[j=i+1..n]( W_ij * (D_ij / n) )
            map.set(i, posi + force);
            totalForce += force;
        }
        console.log(totalForce, prevTotalForce);
    } while (totalForce < prevTotalForce && totalForce >= threshold);
    var list = [];
    // Map to List<[position, index]>
    map.forEach(function(v, k) { list.push([v, k]); });
    // sort list by position
    list.sort(function(a, b) { return a[0] - b[0]; });
    // return sorted indices
    return list.map(function(vk) { return vk[1]; });
}

var mat = [
    [0, 0.5, 1, 0.1],
    [0.5, 0, 1, 0.9],
    [1, 1, 0, 0.2],
    [0.1, 0.9, 0.2, 0]
];
var mat2 = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0]
];
console.log(sort1d(mat));  // [2, 0, 1, 3]
console.log(sort1d(mat2)); // [0, 1, 2]
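For the bonus question, one simple score (my own suggestion, not a canonical metric) is the affinity-weighted sum of pairwise distances in the ordering: lower is better, because high-affinity pairs placed far apart contribute the most. A sketch that reuses mat and sort1d from above:

/*
 * Score an ordering against the affinity matrix.
 * Each pair contributes affinity * distance; lower is better.
 * @return {number}
 */
function orderingCost(mat, order) {
    var pos = []; // pos[node] = rank of the node in the ordering
    order.forEach(function(node, rank) { pos[node] = rank; });
    var cost = 0;
    for (var i = 0; i < mat.length; i++) {
        for (var j = i + 1; j < mat.length; j++) {
            cost += mat[i][j] * Math.abs(pos[i] - pos[j]);
        }
    }
    return cost;
}

console.log(orderingCost(mat, sort1d(mat)));  // compare against ...
console.log(orderingCost(mat, [0, 1, 2, 3])); // ... the trivial order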

How to separate Ruby inlineC code to multiple files?

I would like to move the MATCH macro and the bitmap to a separate file, since I use these in many places and would like to avoid repeating code. How can that be done?
require 'inline'

# Class to calculate the Levenshtein distance between two
# given strings.
# http://en.wikipedia.org/wiki/Levenshtein_distance
class Levenshtein
  BYTES_IN_INT = 4

  def self.distance(s, t)
    return 0 if s == t
    return t.length if s.length == 0
    return s.length if t.length == 0

    v0 = "\0" * (t.length + 1) * BYTES_IN_INT
    v1 = "\0" * (t.length + 1) * BYTES_IN_INT

    l = self.new
    l.distance_C(s, t, s.length, t.length, v0, v1)
  end

  # >>>>>>>>>>>>>>> RubyInline C code <<<<<<<<<<<<<<<

  inline do |builder|
    # Macro for matching nucleotides including ambiguity codes.
    builder.prefix %{
      #define MATCH(A,B) ((bitmap[A] & bitmap[B]) != 0)
    }

    # Bitmap for matching nucleotides including ambiguity codes.
    # For each value bits are set from the left: bit pos 1 for A,
    # bit pos 2 for T, bit pos 3 for C, and bit pos 4 for G.
    builder.prefix %{
      char bitmap[256] = {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1,14, 4,11, 0, 0, 8, 7, 0, 0,10, 0, 5,15, 0,
        0, 0, 9,12, 2, 2,13, 3, 0, 6, 0, 0, 0, 0, 0, 0,
        0, 1,14, 4,11, 0, 0, 8, 7, 0, 0,10, 0, 5,15, 0,
        0, 0, 9,12, 2, 2,13, 3, 0, 6, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      };
    }

    builder.prefix %{
      unsigned int min(unsigned int a, unsigned int b, unsigned int c)
      {
        unsigned int m = a;
        if (m > b) m = b;
        if (m > c) m = c;
        return m;
      }
    }

    builder.c %{
      VALUE distance_C(
        VALUE _s,      // string
        VALUE _t,      // string
        VALUE _s_len,  // string length
        VALUE _t_len,  // string length
        VALUE _v0,     // score vector
        VALUE _v1      // score vector
      )
      {
        char *s = (char *) StringValuePtr(_s);
        char *t = (char *) StringValuePtr(_t);
        unsigned int s_len = FIX2UINT(_s_len);
        unsigned int t_len = FIX2UINT(_t_len);
        unsigned int *v0 = (unsigned int *) StringValuePtr(_v0);
        unsigned int *v1 = (unsigned int *) StringValuePtr(_v1);
        unsigned int i = 0;
        unsigned int j = 0;
        unsigned int cost = 0;

        for (i = 0; i < t_len + 1; i++)
          v0[i] = i;

        for (i = 0; i < s_len; i++)
        {
          v1[0] = i + 1;

          for (j = 0; j < t_len; j++)
          {
            cost = (MATCH(s[i], t[j])) ? 0 : 1;
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost);
          }

          for (j = 0; j < t_len + 1; j++)
            v0[j] = v1[j];
        }

        return UINT2NUM(v1[t_len]);
      }
    }
  end
end
builder.prefix is just a method call, so you could create a Ruby method which calls it with your macro and character array, and then add that method to any class which wants to use those C snippets inline, e.g.:
module MixinCommonC
  def add_match_macro inline_builder
    inline_builder.prefix %{
      #define MATCH(A,B) ((bitmap[A] & bitmap[B]) != 0)
    }
  end
end
In your example code you would make the following changes to use it.
At the start of the class:

class Levenshtein
  extend MixinCommonC

(It's extend and not include, because the inline method is being called against the class, so it can only access class methods inside the block.)
Where you currently have the first call to builder.prefix:

add_match_macro( builder )
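The bitmap can move into the same module with a second helper (add_bitmap is a name I've made up; its %{...} body should carry the full 256-entry table from the question, abbreviated here):

module MixinCommonC
  def add_match_macro inline_builder
    inline_builder.prefix %{
      #define MATCH(A,B) ((bitmap[A] & bitmap[B]) != 0)
    }
  end

  # Hypothetical companion helper: emits the lookup table that the
  # MATCH macro depends on.
  def add_bitmap inline_builder
    inline_builder.prefix %{
      char bitmap[256] = {
        0 /* ..., remaining 255 entries exactly as in the question */
      };
    }
  end
end

The class would then call add_bitmap(builder) right after add_match_macro(builder) inside the inline block.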

SIMD optimization puzzle

I want to optimize the following function using SIMD (SSE2 and such):
int64_t fun(int64_t N, int size, int* p)
{
    int64_t sum = 0;
    for (int i = 1; i < size; i++)
        sum += (N/i) * p[i];
    return sum;
}
This seems like an eminently vectorizable task, except that the needed instructions just aren't there ...
We can assume that N is very large (10^12 to 10^18) and that size ~ sqrt(N). We can also assume that p can only take the values -1, 0, and 1, so we don't need a real multiplication; the (N/i)*p[i] can be done with four instructions (pcmpgt, pxor, psub, pand), if we could just somehow compute N/i.
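To make the four-instruction remark concrete, here is a scalar sketch of the same trick (my own illustration; an SSE2 version would apply it lane-wise with pcmpgt/pxor/psub/pand):

#include <stdint.h>

/* Multiply q by p, where p is guaranteed to be -1, 0 or 1, without a mul. */
static int64_t mul_by_pm1(int64_t q, int64_t p)
{
    int64_t neg     = -(int64_t)(p < 0);   /* all-ones if p == -1 (pcmpgt)    */
    int64_t nonzero = -(int64_t)(p != 0);  /* all-ones if p != 0              */
    int64_t v = (q ^ neg) - neg;           /* conditional negate (pxor, psub) */
    return v & nonzero;                    /* zero out when p == 0 (pand)     */
}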
This is as close as I could get to vectorizing that code. I don't really expect it to be faster. I was just trying my hand at writing SIMD code.
#include <stdint.h>

int64_t fun(int64_t N, int size, const int* p)
{
    int64_t sum = 0;
    int i;
    for (i = 1; i < size; i++) {
        sum += (N/i) * p[i];
    }
    return sum;
}

typedef int64_t v2sl __attribute__ ((vector_size (2*sizeof(int64_t))));

int64_t fun_simd(int64_t N, int size, const int* p)
{
    int64_t sum = 0;
    int i;
    v2sl v_2 = { 2, 2 };
    v2sl v_N = { N, N };
    v2sl v_i = { 1, 2 };
    union { v2sl v; int64_t a[2]; } v_sum;
    v_sum.a[0] = 0;
    v_sum.a[1] = 0;
    for (i = 1; i < size-1; i += 2) {
        v2sl v_p = { p[i], p[i+1] };
        v_sum.v += (v_N / v_i) * v_p;
        v_i += v_2;
    }
    sum = v_sum.a[0] + v_sum.a[1];
    for (; i < size; i++) {
        sum += (N/i) * p[i];
    }
    return sum;
}

typedef double v2df __attribute__ ((vector_size (2*sizeof(double))));

int64_t fun_simd_double(int64_t N, int size, const int* p)
{
    int64_t sum = 0;
    int i;
    v2df v_2 = { 2, 2 };
    v2df v_N = { N, N };
    v2df v_i = { 1, 2 };
    union { v2df v; double a[2]; } v_sum;
    v_sum.a[0] = 0;
    v_sum.a[1] = 0;
    for (i = 1; i < size-1; i += 2) {
        v2df v_p = { p[i], p[i+1] };
        v_sum.v += (v_N / v_i) * v_p;
        v_i += v_2;
    }
    sum = v_sum.a[0] + v_sum.a[1];
    for (; i < size; i++) {
        sum += (N/i) * p[i];
    }
    return sum;
}

#include <stdio.h>

static const int test_array[] = {
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0,
    1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0
};

#define test_array_len (sizeof(test_array)/sizeof(int))
#define big_N (1024 * 1024 * 1024)

int main(int argc, char *argv[]) {
    int64_t res1;
    int64_t res2;
    int64_t res3;
    v2sl a = { 123, 456 };
    v2sl b = { 100, 200 };
    union { v2sl v; int64_t a[2]; } tmp;
    a = a + b;
    tmp.v = a;
    printf("a = { %ld, %ld }\n", tmp.a[0], tmp.a[1]);
    printf("test_array size = %zd\n", test_array_len);
    res1 = fun(big_N, test_array_len, test_array);
    printf("fun() = %ld\n", res1);
    res2 = fun_simd(big_N, test_array_len, test_array);
    printf("fun_simd() = %ld\n", res2);
    res3 = fun_simd_double(big_N, test_array_len, test_array);
    printf("fun_simd_double() = %ld\n", res3);
    return 0;
}
The derivative of 1/x is -1/x^2, which means that the larger x gets, the more often N/x == N/(x + 1).
For a known value of N/x (call that value r), we can determine the next value of x (call it x') such that N/x' < r:
x' = N/(r - 1)
And since we are dealing with integers:
x' = ceiling(N/(r - 1))
So, the loop becomes something like this:
int64_t sum = 0;
int i = 1;
int64_t r = N;
while (i < size)
{
    int64_t s = (N + (r - 1) - 1)/(r - 1);  /* ceiling(N/(r - 1)) */
    while (i < s && i < size)
    {
        sum += r*p[i];  /* N/i == r throughout this run */
        ++i;
    }
    r = N/s;
}
return sum;
For sufficiently large N, you will have many long runs of identical values of N/i. Granted, you will hit a divide by zero (when r == 1) if you aren't careful.
I suggest you do this with floating point SIMD operations - either single or double precision depending on your accuracy requirements. Conversion from int to float or double is relatively fast using SSE.
The cost is concentrated in computing the divisions. There is no opcode in SSE2 for integer division, so you would have to implement a division algorithm yourself, bit by bit. I do not think it would be worth the effort: SSE2 allows you to perform two instances in parallel (you use 64-bit numbers, and SSE2 registers are 128-bit), but I find it likely that a handmade division algorithm would be at least twice as slow as the CPU's idiv opcode.
(By the way, do you compile in 32-bit or 64-bit mode? The latter will be more comfortable with 64-bit integers.)
Reducing the overall number of divisions looks like a more promising way. One may note that for positive integers x and y, floor(x/(2y)) = floor(floor(x/y)/2). In C terms, once you have computed N/i (truncated division) you just have to shift it right by one bit to obtain N/(2*i). Used properly, this makes half of your divisions almost free (that "properly" also includes accessing the billions of p[i] values in a way that does not wreak havoc with the caches, so it does not seem very easy).
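A scalar sketch of that halving trick (my own illustration; it trades a table of cached quotients for half the divisions, and ignores the memory cost the caveat above alludes to):

#include <stdint.h>
#include <stdlib.h>

int64_t fun_half_divs(int64_t N, int size, const int *p)
{
    int64_t *q = malloc((size_t)size * sizeof *q);  /* q[i] caches N/i */
    int64_t sum = 0;
    for (int i = 1; i < size; i++) {
        if (i % 2 == 0)
            q[i] = q[i / 2] >> 1;  /* floor(N/i) == floor(floor(N/(i/2)) / 2) */
        else
            q[i] = N / i;          /* a real division only for odd i */
        sum += q[i] * p[i];
    }
    free(q);
    return sum;
}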
