Building a binary matrix with geohashes in R

I want to build a binary matrix in R for each geohash that I have in a dataframe, using the alphabet letters and the digits 0-9.
That is, each character position of a geohash should be marked with 1 where it matches the corresponding letter or number in the vocabulary, and 0 where it does not, so that a complete binary matrix is built for each geohash.
The reason why I want to build these matrices is that I want to apply an encoder/decoder deep learning algorithm for event prediction.
Thank you
alphabetandnumbers <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
                        'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
                        's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
                        0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
names(df2sub)
t2 <- table(alphabetandnumbers, seq_along(df2sub$geohash))
t2[t2 > 1] <- 1
t2[1:1000]
I also tried this tactic, without any success:
V1 <- df2sub[['geohash']]
V2 <- array(alphabetandnumbers, dim = length(alphabetandnumbers))
m <- as.matrix(V1)
id <- cbind(rowid = as.vector(t(row(m))),
            colid = as.vector(t(m)))
id <- id[complete.cases(id), ]
id
out <- matrix(0, nrow = nrow(m), ncol = max(m))
out[id] <- 1
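For reference, here is a minimal sketch of the per-character binary (one-hot) matrix described above, written in Python purely to illustrate the idea and assuming every geohash is a lower-case string over the 36-symbol vocabulary:

import numpy as np

vocabulary = list("abcdefghijklmnopqrstuvwxyz0123456789")
char_to_index = {ch: i for i, ch in enumerate(vocabulary)}

def one_hot_geohash(geohash):
    # One row per character of the geohash, one column per vocabulary symbol;
    # the matching column gets a 1, every other column stays 0.
    matrix = np.zeros((len(geohash), len(vocabulary)), dtype=int)
    for row, ch in enumerate(geohash):
        matrix[row, char_to_index[ch]] = 1
    return matrix

# one_hot_geohash("u4pruyd") returns a 7 x 36 binary matrix.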

Related

Algorithm to translate list operations to be index based

Assume you have an unsorted list of distinct items. for example:
['a', 'z', 'g', 'i', 'w', 'p', 't']
You also get a list of Insert and Remove operations. Insert operations are composed of the item to insert and the index to insert it at. For example: Insert('s', 5)
Remove operations are expressed using the element to remove. For example: Remove('s')
So a list of operations may look like this:
Insert('s', 5)
Remove('p')
Insert('j', 0)
Remove('a')
I am looking for the most efficient algorithm that can translate the list of operations so that they are index based. That means that there is no need to modify the insert operations, but the remove operations should be replaced with a remove operation stating the current index of the item to be removed (not the original one).
So the output of the example should look like this:
Starting set: ['a', 'z', 'g', 'i', 'w', 'p', 't']
Insert('s', 5) (list is now: ['a', 'z', 'g', 'i', 'w', 's', 'p', 't'])
Remove(6) (list is now: ['a', 'z', 'g', 'i', 'w', 's', 't'])
Insert('j', 0) (list is now: ['j', 'a', 'z', 'g', 'i', 'w', 's', 't'])
Remove(1) (list is now: ['j', 'z', 'g', 'i', 'w', 's', 't'])
Obviously, we can scan for the next item to remove in the set after each operation, and that would mean the entire algorithm would take O(n*m) where n is the size of the list, and m is the number of operations.
The question is - is there a more efficient algorithm?
You can make this more efficient if you have access to all of the remove operations ahead of time and their number is significantly (context-defined) smaller than the length of the object list.
Maintain a list of the items of interest: those that will eventually be removed. Look up their initial positions, either in the original list or at the moment they are inserted. Whenever an insertion is made at position n, every tracked item at or past that position gets its index increased by one; whenever a deletion is made, every tracked item past the deleted position gets its index decreased by one.
This is little different from the obvious method; it is merely quantitatively faster, replacing the n in the O(n*m) complexity with the potentially much smaller number of tracked items.
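A minimal sketch of that bookkeeping, assuming the operations arrive as ("insert", item, index) and ("remove", item) tuples (the tuple format and the translate_operations name are illustrative, not given in the question):

def translate_operations(initial, ops):
    # Track only the items that will eventually be removed.
    of_interest = {op[1] for op in ops if op[0] == "remove"}
    # Current index of every tracked item already present in the list.
    pos = {item: i for i, item in enumerate(initial) if item in of_interest}

    translated = []
    for op in ops:
        if op[0] == "insert":
            _, item, index = op
            # Everything at or after the insertion point shifts one to the right.
            for k in pos:
                if pos[k] >= index:
                    pos[k] += 1
            if item in of_interest:
                pos[item] = index
            translated.append(op)
        else:  # remove
            index = pos.pop(op[1])
            # Everything after the removed position shifts one to the left.
            for k in pos:
                if pos[k] > index:
                    pos[k] -= 1
            translated.append(("remove", index))
    return translated

# translate_operations(['a', 'z', 'g', 'i', 'w', 'p', 't'],
#                      [("insert", 's', 5), ("remove", 'p'),
#                       ("insert", 'j', 0), ("remove", 'a')])
# yields [("insert", 's', 5), ("remove", 6), ("insert", 'j', 0), ("remove", 1)],
# matching the worked example in the question.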

Use dynamic programming to merge two arrays such that the number of repetitions of the same element is minimised

Let's say we have two arrays m and n containing characters from the set {a, b, c, d, e}. Assume each character in the set has a cost associated with it; take the costs to be a=1, b=3, c=4, d=5, e=7.
For example:
m = ['a', 'b', 'c', 'd', 'd', 'e', 'a']
n = ['b', 'b', 'b', 'a', 'c', 'e', 'd']
Suppose we would like to merge m and n to form a larger array s.
An example of s array could be
s = ['a', 'b', 'c', 'd', 'd', 'e', 'a', 'b', 'b', 'b', 'a', 'c', 'e', 'd']
or
s = ['b', 'a', 'd', 'd', 'd', 'b', 'e', 'c', 'b', 'a', 'b', 'a', 'c', 'e']
If two or more identical characters are adjacent to each other, a penalty is applied equal to: the number of adjacent characters of the same type * the cost for that character. Consider the second example for s above, which contains the sub-array ['d', 'd', 'd']. In this case a penalty of 3*5 = 15 is applied, because the cost associated with d is 5 and the number of repetitions of d is 3.
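To make the penalty concrete, here is a small sketch (Python, with a hypothetical penalty helper) that scores a candidate merged array under this rule:

costs = {'a': 1, 'b': 3, 'c': 4, 'd': 5, 'e': 7}

def penalty(s):
    # Sum run_length * cost over every run of two or more identical
    # adjacent characters.
    total, run = 0, 1
    for prev, cur in zip(s, s[1:]):
        if cur == prev:
            run += 1
        else:
            if run > 1:
                total += run * costs[prev]
            run = 1
    if run > 1:
        total += run * costs[s[-1]]
    return total

# For the second example of s above, the only run is ['d', 'd', 'd'],
# so penalty(s) == 3 * 5 == 15.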
Design a dynamic programming algorithm which minimises the cost associated with s.
Does anyone have any resources, papers, or algorithms they could share to help point me in the right direction?

If I use a simple substitution algorithm on a unique string, will the output be always unique?

This is sort of like hashing, but simpler. Something like:
function getUniqueId(input) {
    var map = {
            A: 'Q', B: 'W', C: 'E',
            D: 'R', E: 'T', F: 'Y',
            G: 'U', H: 'I', I: 'O',
            J: 'P', K: 'A', L: 'S',
            M: 'D', N: 'F', O: 'G',
            P: 'H', Q: 'J', R: 'K',
            S: 'L', T: 'Z', U: 'X',
            V: 'C', W: 'V', X: 'B',
            Y: 'N', Z: 'M',
            a: 'q', b: 'w', c: 'e',
            d: 'r', e: 't', f: 'y',
            g: 'u', h: 'i', i: 'o',
            j: 'p', k: 'a', l: 's',
            m: 'd', n: 'f', o: 'g',
            p: 'h', q: 'j', r: 'k',
            s: 'l', t: 'z', u: 'x',
            v: 'c', w: 'v', x: 'b',
            y: 'n', z: 'm',
            0: '3', 1: '4', 2: '5',
            3: '6', 4: '7', 5: '8',
            6: '9', 7: '0', 8: '1',
            9: '2',
        },
        output = "";
    for (var i = 0; i < input.length; i++) {
        if (map[input[i]] !== undefined) {
            output += map[input[i]];
        }
    }
    return output;
}
I only encode/hash/substitute A-Z, a-z, and 0-9.
Provided that the input string is always unique, will the output string be always unique as well?
If the hash is created by applying the same substitution function to every character of a unique string, you can be sure that the output string will always be unique only if the substitution function is injective.
This property preserves distinctness, which is exactly what is needed to preserve uniqueness.
To check this property, you need to show: if a ≠ b, then f(a) ≠ f(b). For your map this is obvious for the numbers; for the letters it might be harder to see, but it can be checked with a simple bash script that collects all target values and verifies that none appears twice:
echo "A: Q, B: W, C: E, D: R, E: T, F: Y, G: U, H: I, I: O, J: P, K: A, L: S, M: D, N: F, O: G, P: H, Q: J, R: K, S: L, T: Z, U: X, V: C, W: V, X: B, Y: N, Z: M" | tr ',' '\n' | cut -d':' -f2 | sort | tr -d ' \n'
Which indeed outputs
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Because your function doesn't change positions of characters and every character is subjected to this one-to-one mapping, unique input will always produce a unique output.
This is of course only true if you are certain that the input string is in the domain A-Z, a-z, 0-9; otherwise it won't work, because unmapped characters are silently dropped ("#A" => "Q", "A" => "Q").
I guess it depends on how you define "simple substitution".
If you strictly apply "substitute a with emen" (and nothing else), the result of applying this substitution (which I would call "simple", but not reversible) to the strings "cat" and "cement" is "cement" in both cases.
EDIT: You only have single-letter substitutions, so "all" you need to ensure is that there's no "destination character" that is mapped to by two (or more) "source characters".
The easiest way of doing this is probably to generate a string that contains all the possible mappable characters (so, the alphabet in uppercase and lowercase, and all the numbers) in whatever sorting order you prefer (sorting by ASCII value is probably easiest). Then you generate the encoded version, sort that and compare them.
Since you're mapping all the source characters, and you have a string composed only of the destination characters, if their sorted versions are identical, you do not have any duplication and thus you can guarantee that two distinct strings also have distinct encoded forms.
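As an illustration of that check, here is a sketch in Python, assuming the substitution table is available as a plain dictionary mapping single-character strings to single-character strings:

def is_injective(mapping):
    # No two source characters may share the same destination character.
    values = list(mapping.values())
    return len(values) == len(set(values))

def check_by_sorting(mapping):
    # The sorted-string comparison described above: encode every mappable
    # character once, sort, and compare against the sorted domain. For a map
    # whose destinations are drawn from its own domain, equality means the
    # substitution is a permutation, so no destination is duplicated.
    domain = "".join(sorted(mapping))
    encoded = "".join(sorted(mapping[c] for c in mapping))
    return domain == encoded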

How to calculate total weight of paths of directed weighted graph in DFS in one iteration?

G = (V,E) - a directed weighted graph.
D -> G (w:4)
D -> C (w:2)
D -> E (w:2)
C -> F (w:5)
C -> A (w:4)
B -> D (w:3)
B -> E (w:10)
G -> F (w:1)
E -> G (w:6)
A -> D (w:1)
A -> B (w:2)
I use DFS to find all simple path between START=A node to END=F node:
def find_all_paths(self, start, end, path=[]):
    path = path + [start]
    if start == end:
        return [path]
    if start not in self.edges:
        return []
    paths = []
    for node in self.edges[start]:
        if node not in path:
            paths.extend(self.find_all_paths(node, end, path))
    return paths
Result:
['A', 'D', 'G', 'F']
['A', 'D', 'C', 'F']
['A', 'D', 'E', 'G', 'F']
['A', 'B', 'D', 'G', 'F']
['A', 'B', 'D', 'C', 'F']
['A', 'B', 'D', 'E', 'G', 'F']
['A', 'B', 'E', 'G', 'F']
I need to get result like this:
['A', 'D', 'G', 'F'], TOTAL_WEIGHT_OF_PATH = 6
['A', 'D', 'C', 'F'], TOTAL_WEIGHT_OF_PATH = 8
['A', 'D', 'E', 'G', 'F'], TOTAL_WEIGHT_OF_PATH = 10
....
....
Where TOTAL_WEIGHT_OF_PATH is sum of weights for each edge in path.
Of course I could just compute the TOTAL_WEIGHT_OF_PATH value after getting the result of the DFS, but I need to calculate it during the DFS steps so I can cut off the search with a condition based on TOTAL_WEIGHT_OF_PATH (e.g. TOTAL_WEIGHT_OF_PATH should be < MAX_WEIGHT_OF_PATH).
Well, notice that the TOTAL_WEIGHT_OF_PATH (TWOP) to any node V (other than the root) is the TWOP to the preceding node U plus the weight of the edge (U, V). The TWOP to the root is 0.
TWOP(V) = TWOP(U) + weight(U, V)
Any time you are expanding a new node on a path, you just need to store the TWOP to this node along with it, so you don't need to recalculate it every time.
Note that if you visit a node again via a different path, you need to calculate a new weight for it.
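A sketch of that change applied to the DFS above, assuming the edge weights can be looked up as self.weights[(u, v)] (adapt the lookup to however the graph actually stores them):

def find_all_paths_weighted(self, start, end, max_weight, path=None, weight=0):
    if path is None:
        path = []
    path = path + [start]
    if start == end:
        return [(path, weight)]
    if start not in self.edges:
        return []
    paths = []
    for node in self.edges[start]:
        if node in path:
            continue
        new_weight = weight + self.weights[(start, node)]
        # Cut off the search as soon as the running total reaches the limit.
        if new_weight < max_weight:
            paths.extend(self.find_all_paths_weighted(node, end, max_weight,
                                                      path, new_weight))
    return paths

# Each returned entry is (path, TOTAL_WEIGHT_OF_PATH), e.g.
# (['A', 'D', 'G', 'F'], 6) for the graph above when max_weight is large enough.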

Finding number of occurrences of char from a short List/Array in a infinitely large List/Array

I have been working on a practical situation that requires an algorithm, and have made a generic problem out of it. Consider two arrays:
Source[10] = {'a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o' , 'g', 'm'}
Target[N] = {'a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o', 'g', 'm', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a',
             'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v',
             'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', .... }
We need to have an efficient algorithm to find the frequency of occurrences of characters from Source in Target.
I have thought of hashing the complete Target list and then iterating through the Source and doing the lookup in the hashed list. Can people comment on / validate this approach?
If your character set is reasonably limited, you can use character codes as indexes into an array of counts. Let's say you have 16-bit characters. You can do this:
int[] counts = new int[65536];
foreach (char c in Target)
counts[c]++;
With the array of counts in hand, you can easily find the frequency by looking up a code from the Source in the counts array.
This solution is asymptotically as fast as it could possibly get, but it may not be the most memory-efficient one.
I don't know what a hashed list is, so I can't comment on that. For efficiency, I would suggest turning the target array into a multiset. Guava has a nice implementation of such a thing (although the Java Collections Framework does not). So does Apache Commons (where it's called a Bag). You can then simply iterate through the source and look up the frequency of each element in the multiset. As described in this thread, using a multiset is easier than using a HashMap from elements to frequencies, although it does require using a third-party library.
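The same multiset idea, sketched with Python's collections.Counter standing in for the Bag/Multiset (an illustrative sketch, not tied to either library):

from collections import Counter

def source_frequencies(source, target):
    # One pass over the large target array to build the multiset of counts,
    # then an O(1) lookup per source character.
    counts = Counter(target)
    return {c: counts[c] for c in source}

# source_frequencies(['a', 'v', 'x'], ['a', 'v', 'a', 'p'])
# returns {'a': 2, 'v': 1, 'x': 0}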
