How to use the trained char-rnn to generate words?

When the char-rnn is trained, the weights of the network are fixed. If I use the same first char, how can I get different sentences? For example, the two sentences "What is wrong?" and "What can I do for you?" have the same first char "W". Can the char-rnn generate both of these different sentences?

Yes, you can get different results from the same state by sampling. Take a look at min-char-rnn by Andrej Karpathy. The sample code is at line 63:
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes
Starting from the same hidden vector h and seed char seed_ix, you get a deterministic distribution p over the next char. But the result is random, because the code uses np.random.choice instead of np.argmax. If the distribution is highly peaked at some char, you'll still get the same outcome most of the time, but usually several next chars have appreciable probability, and any of them can be sampled, changing the whole generated sequence.
Note that this isn't the only possible sampling procedure: temperature-based sampling is more popular. You can take a look at, for instance, this post for an overview.
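As a rough sketch (not part of Karpathy's gist), temperature sampling only rescales the scores before the softmax; T < 1 makes the distribution sharper and more deterministic, T > 1 makes it flatter and more diverse:

import numpy as np

def sample_with_temperature(y, temperature=1.0):
    # y: 1-D array of unnormalized scores (logits), one per character.
    # Dividing by the temperature before the softmax sharpens (T < 1)
    # or flattens (T > 1) the distribution.
    scaled = y / temperature
    p = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    p = p / np.sum(p)
    return np.random.choice(len(p), p=p)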

Related

Find more independent seed values for a 64 bit LCG (MMIX (by Knuth))

I'm using a 64 bit LCG (MMIX (by Knuth)). It generates a certain block of random numbers inside my code, which uses them to perform some operations. My code runs on a single core and I would like to parallelize the work to reduce the execution time.
Before thinking about more advanced methods, I'd like to simply execute several identical copies of the code in parallel (the code repeats the same task over a certain number of independent simulations, so I can simply split the simulations between identical copies and run them in parallel).
My only problem now is to find a seed for each copy; in particular, to avoid the possibility of unwanted non-trivial correlations between the data generated in different copies, I have to be sure that the random numbers generated in the various copies don't overlap. To do so, starting from a certain seed in the first copy, I need a way to find a value (the next seed) that is very distant from it, not in absolute value but in the pseudo-random sequence (such that going from the first seed to the second requires a huge number of LCG steps).
My first attempt was this:
starting from the LCG relation between two consecutive numbers in the sequence, I_(j+1) = (a*I_j + b) mod c (with multiplier a, increment b and modulus c = 2^64), the n-step relation is

I_n = ( a^n * I_0 + b*(a^n - 1)/(a - 1) ) mod c

So, in principle, I could evaluate this relation with, say, n = 2^40 and I_0 equal to the value of the first seed, and obtain a new seed 2^40 steps further along the LCG sequence.
The problem is that, doing so, I necessarily overflow when calculating a^n. In fact for MMIX (by Knuth) a ~ 2^62 and I use unsigned long long int (64 bit). Note that the only problem here is the fraction in the above relation. If there were only sums and multiplications, I could ignore the overflow, because I'm using 2^64 as the modulus c (64 bit generator) and the wrap-around then does the reduction for me, thanks to the modular properties

(x + y) mod c = ((x mod c) + (y mod c)) mod c
(x * y) mod c = ((x mod c) * (y mod c)) mod c

So, starting from a certain value (the first seed), how can I find a second one that is a huge number of steps away in the LCG pseudo-random sequence?
[EDIT]
r3mainer's solution is perfectly suited for Python code. I'm now trying to implement it in C using unsigned __int128 variables. I have only one problem: in principle I should compute

(a^n - 1) mod (c*(a - 1))

Say, for simplicity, I want to compute

a^n mod (c*(a - 1))

with n = 2^40 and c*(a - 1) ~ 2^126. I proceed with a loop: starting with temp = a, in each iteration I compute temp = temp*temp and then temp mod (c*(a - 1)). The problem is the second step (temp = temp*temp): temp can in principle be any number < c*(a - 1) ~ 2^126, so if temp is a big number, say > 2^64, the square exceeds 2^128 - 1 and overflows before the next modulo operation. Is there a way to avoid this? For now the only solution I see is to perform each multiplication with a loop over bits, as suggested here: c code: prevent overflow in modular operation with huge modules (modules near the overflow treshold)
Is there another way to perform the modulo operation during the multiplication?
(Note that because c = 2^64, the mod(c) operation does not have the same problem: the overflow point for unsigned long long variables coincides with the modulus.)
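For reference, here is a sketch in Python of the shift-and-add approach from the linked question: by reducing modulo the modulus after every doubling, the intermediate values stay below twice the modulus, which is the property a fixed-width C port would rely on (in Python itself big-integer arithmetic makes this unnecessary).

def mulmod(a, b, mod):
    # (a * b) % mod computed by binary decomposition of b,
    # reducing after each step so intermediates stay < 2*mod
    result = 0
    a %= mod
    while b > 0:
        if b & 1:
            result = (result + a) % mod
        a = (a + a) % mod   # double a, reduce immediately
        b >>= 1
    return result

def powmod(a, n, mod):
    # a**n % mod using the overflow-safe mulmod above
    result = 1
    a %= mod
    while n > 0:
        if n & 1:
            result = mulmod(result, a, mod)
        a = mulmod(a, a, mod)
        n >>= 1
    return result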
Any LCG of the form x[n+1] = (x[n] * a + c) % m can be skipped to an arbitrary position very quickly.
Starting with a seed value of zero, the first few iterations of the LCG will give you this sequence:
x₀ = 0
x₁ = c % m
x₂ = (c(a + 1)) % m
x₃ = (c(a² + a + 1)) % m
x₄ = (c(a³ + a² + a + 1)) % m
It's pretty easy to see that each term is actually the sum of a geometric series, which can be calculated with a simple formula:
x_n = (c(a^{n-1} + a^{n-2} + ... + a + 1)) % m
= (c * (a^n - 1) / (a - 1)) % m
The (a^n - 1) term can be calculated quickly by modular exponentiation, but dividing by (a-1) is a bit tricky because (a-1) and m are both even (i.e., not coprime), so we can't calculate the modular multiplicative inverse of (a-1) mod m directly.
Instead, calculate (a^n-1) mod m*(a-1), then perform a straightforward (non-modular) division of the result by a-1. In Python, the calculation would go something like this:
def lcg_skip(m, a, c, n):
    # Calculate nth term of LCG sequence with parameters m (modulus),
    # a (multiplier) and c (increment), assuming an initial seed of zero
    a1 = a - 1
    t = pow(a, n, m * a1) - 1
    t = (t * c // a1) % m
    return t

def test(nsteps):
    m = 2**64
    a = 6364136223846793005
    c = 1442695040888963407
    #
    print("Calculating by brute force:")
    seed = 0
    for i in range(nsteps):
        seed = (seed * a + c) % m
    print(seed)
    #
    print("Calculating by fast method:")
    # Calculate nth term by modular exponentiation
    print(lcg_skip(m, a, c, nsteps))

test(1000000)
So to create LCGs with non-overlapping output sequences, all you would need to do is use initial seed values generated by lcg_skip() with values of n that are far enough apart.
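As a usage sketch (the worker count and stride here are my own example values), per-worker starting seeds that cannot overlap within 2^40 draws could be generated like this:

# give each of 4 hypothetical workers a starting seed 2**40 steps apart
m = 2**64
a = 6364136223846793005
c = 1442695040888963407
stride = 2**40
worker_seeds = [lcg_skip(m, a, c, k * stride) for k in range(4)]
print(worker_seeds)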
Well, for an LCG it is a known property that you can jump forward and backward in O(log2(N)) time, where N is the distance between the jump points; see the paper by F. Brown, "Random Number Generation with Arbitrary Stride," Trans. Am. Nucl. Soc. (Nov. 1994).
It means that if you have LCG parameters (a, c) satisfying the Hull–Dobell theorem, the whole period is 2^64 numbers before the sequence repeats itself. So for Nt threads you choose a jump distance of 2^64 / Nt, start all threads with the same seed, and after initializing each LCG jump it forward by (2^64 / Nt)*threadId; then you are completely safe from RNG correlations due to overlapping sequences.
For the simplest case of common 64-bit unsigned modulo math, as implemented in NumPy, the code below should work fine:
import numpy as np

class LCG(object):
    UZERO: np.uint64 = np.uint64(0)
    UONE : np.uint64 = np.uint64(1)

    def __init__(self, seed: np.uint64, a: np.uint64, c: np.uint64) -> None:
        self._seed: np.uint64 = np.uint64(seed)
        self._a   : np.uint64 = np.uint64(a)
        self._c   : np.uint64 = np.uint64(c)

    def next(self) -> np.uint64:
        self._seed = self._a * self._seed + self._c
        return self._seed

    def seed(self) -> np.uint64:
        return self._seed

    def set_seed(self, seed: np.uint64) -> None:
        self._seed = seed

    def skip(self, ns: np.int64) -> None:
        """
        Signed argument - skip forward as well as backward

        The algorithm here to determine the parameters used to skip ahead is
        described in the paper F. Brown, "Random Number Generation with Arbitrary Stride,"
        Trans. Am. Nucl. Soc. (Nov. 1994). This algorithm is able to skip ahead in
        O(log2(N)) operations instead of O(N). It computes parameters
        A and C which can then be used to find x_N = A*x_0 + C mod 2^M.
        """
        nskip: np.uint64 = np.uint64(ns)

        a: np.uint64 = self._a
        c: np.uint64 = self._c

        a_next: np.uint64 = LCG.UONE
        c_next: np.uint64 = LCG.UZERO

        while nskip > LCG.UZERO:
            if (nskip & LCG.UONE) != LCG.UZERO:
                a_next = a_next * a
                c_next = c_next * a + c

            c = (a + LCG.UONE) * c
            a = a * a

            nskip = nskip >> LCG.UONE

        self._seed = a_next * self._seed + c_next

#%%
np.seterr(over='ignore')

seed = np.uint64(1)
rng64 = LCG(seed, np.uint64(6364136223846793005), np.uint64(1))

print(rng64.next())
print(rng64.next())
print(rng64.next())

#%%
rng64.skip(-3) # back by 3
print(rng64.next())
print(rng64.next())
print(rng64.next())

rng64.skip(-3) # back by 3
rng64.skip(2)  # forward by 2
print(rng64.next())
Tested in Python 3.9.1, x64 Win 10

A faster alternative to all(a(:,i)==a,1) in MATLAB

It is a straightforward question: Is there a faster alternative to all(a(:,i)==a,1) in MATLAB?
I'm thinking of an implementation that benefits from short-circuit evaluation throughout. I mean, all() definitely benefits from short-circuit evaluation, but a(:,i)==a doesn't.
I tried the following code:
% example for the input matrix
m = 3;                 % m and n aren't necessarily equal to those values.
n = 5000;              % It's only possible to know in advance that 'm' << 'n'.
a = randi([0,5],m,n);  % the maximum value of 'a' isn't necessarily equal to
                       % 5 but it's possible to state that every element in
                       % 'a' is a positive integer.

% all, equal solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
    %%%%%%%%%% all and equal solution %%%%%%%%%
    ax_boo = all(a(:,i)==a,1);
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc

% alternative solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
    %%%%%%%%%%% alternative solution %%%%%%%%%%%
    ax_boo = a(1,i) == a(1,:);
    for k = 2:m
        ax_boo(ax_boo) = a(k,i) == a(k,ax_boo);
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc
but intuitively any for-loop solution within the MATLAB environment will naturally be slower. I'm wondering if there is a MATLAB built-in function written in a faster language.
EDIT:
After running more tests I found out that implicit expansion does have a performance impact when evaluating a(:,i)==a. If the matrix a has more than one row, all(repmat(a(:,i),[1,n])==a,1) may be faster than all(a(:,i)==a,1) depending on the number of columns (n). For n = 5000, the explicit expansion with repmat proved to be faster.
But I think that a generalization of Kenneth Boyd's answer is the "ultimate solution" if all elements of a are positive integers. Instead of dealing with a (an m x n matrix) in its original form, I will store and deal with adec (a 1 x n matrix):
exps = ((0):(m-1)).';
base = max(a,[],[1,2]) + 1;
adec = sum( a .* base.^exps , 1 );
In other words, each column will be encoded to one integer. And of course adec(i)==adec is faster than all(a(:,i)==a,1).
EDIT 2:
I forgot to mention that the adec approach has a functional limitation: even storing adec as uint64, the inequality base^m < 2^64 + 1 must hold.
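For instance, with the example above (base = max(a,[],[1,2]) + 1 = 6), uint64 storage limits the number of rows to m <= 24, since 6^24 is roughly 2^62 (below 2^64) while 6^25 already exceeds 2^64.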
Since your goal is to count the number of columns that match, my example converts the binary encoding to integer decimals, then you just loop over the possible values (with 3 rows there are 8 possible values) and count the number of matches.
a_dec = 2.^(0:(m-1)) * a;
num_poss_values = 2 ^ m;
num_matches = zeros(num_poss_values, 1);
for i = 1:num_poss_values
    num_matches(i) = sum(a_dec == (i - 1));
end
On my computer, using 2020a, here are the execution times for your first two options and the code above:
Elapsed time is 0.246623 seconds.
Elapsed time is 0.553173 seconds.
Elapsed time is 0.000289 seconds.
So my code is 853 times faster!
I wrote my code so it will work with m being an arbitrary integer.
The num_matches variable contains the number of columns that add up to 0, 1, 2, ...7 when converted to a decimal.
As an alternative you can use the third output of unique:
[~, ~, iu] = unique(a.', 'rows');
for i = 1:n
    ax_boo = iu(i) == iu;
end
As indicated in a comment:
ax_boo isolates the indices of the columns I have to sum in a row vector b. So, basically the next line would be something like c = sum(b(ax_boo),2);
It is a typical usage of accumarray:
[~, ~, iu] = unique(a.', 'rows');
C = accumarray(iu,b);
for i = 1:n
    c = C(i);
end

Can someone explain this piece of code that recognises a digit from the Coursera Machine Learning course

This is a snippet from the predict function of exercise 4 of the Coursera machine learning course. What it does is store the predicted digit from a trained neural network in p. Can someone explain how it does this?
function p = predict(Theta1, Theta2, x)
    p = 0;
    h1 = sigmoid(double([1 x]) * Theta1');
    h2 = sigmoid([1 h1] * Theta2');
    [dummy, p] = max(h2, [], 2);
end
x = 1x784 matrix of pixel intensity values.
Theta1 = 100x785 matrix.
Theta2 = 10x101 matrix.
I have already trained the network and have gotten the optimum value of Theta1 and Theta2. What I want to know is how that last line of code takes the forward propagated values and stores 1/2/3/4/5/6/7/8/9/10 in p. Whichever digit is stored is the predicted digit.
Sigmoid function:
function g = sigmoid(z)
    g = 1 ./ (1 + e.^-z);
end
The last line simply returns the index of the neuron with the highest value. In MATLAB/Octave,
[M, I] = max(A, [], dim)
stores in I the indices of the elements of A with the highest values along dimension dim. In your case, h2 holds the activations of the output neurons, and by construction of your neural network the classification is simply the index of the one with the highest value:
cl(x) = arg max_i f_i(x)

Homework: Implementing Karp-Rabin; For the hash values modulo q, explain why it is a bad idea to use q as a power of 2?

I have a two-fold homework problem: implement Karp-Rabin and run it on a test file, and then the second part:
For the hash values modulo q, explain why it is a bad idea to use q as a power of 2. Can you construct a terrible example e.g. for q=64
and n=15?
This is my implementation of the algorithm:
def karp_rabin(text, pattern):
    # setup
    alphabet = 'ACGT'
    d = len(alphabet)
    n = len(pattern)
    d_n = d**n
    q = 2**32-1
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
            positions.append(i)

    return ' '.join(map(str, positions))
...The second part of the question is referring to this part of the code/algo:
pattern_hash = kr_hash(pattern)
for i in range(0, len(text) - n + 1):
    text_hash = update_hash() if i else kr_hash(text[i:n])
    # the modulo q used to check if the hashes are congruent
    if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
        positions.append(i)
I don't understand why it would be a bad idea to use q as a power of 2. I've tried running the algorithm on the test file provided (which is the genome of E. coli) and there's no discernible difference.
I tried looking at the formula for how the hash is derived (I'm not good at math), trying to find common factors that would be really bad for powers of two, but found nothing. I feel like if q is a power of 2 it should cause a lot of hash collisions, so you'd need to compare strings much more often, but I didn't find anything along those lines either.
I'd really appreciate help on this since I'm stumped. If someone wants to point out what I can do better in the first part (code efficiency, readability, correctness etc.) I'd also be thrilled to hear your input on that.
There is a problem if q divides some power of d, because then only a few characters contribute to the hash. For example in your code d=4, if you take q=64 only the last three characters determine the hash (d**3 = 64).
I don't really see a problem if q is a power of 2 but gcd(d,q) = 1.
Your implementation looks a bit strange because instead of
if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
you could also use
if pattern_hash == text_hash and pattern == text[i:i+n]:
which would be better because you get fewer collisions.
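To see the first point concretely, here is a small demonstration (my own, using the question's alphabet): with d = 4 and q = 64, any two strings that agree on their last three characters get the same hash mod q, whatever their prefixes are.

# d**k mod q == 0 for k >= 3 when d = 4 and q = 64, so every character
# except the last three is multiplied by a term that vanishes mod q
d, q = 4, 64
m = {c: i for i, c in enumerate('ACGT')}

def kr_hash(s):
    n = len(s)
    return sum(d**(n - i - 1) * m[s[i]] for i in range(n))

print(kr_hash('AAAAAGCT') % q)  # same value...
print(kr_hash('TTTTTGCT') % q)  # ...despite different prefixes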
The Thue–Morse sequence has, among its properties, that its polynomial hash quickly becomes zero when a power of 2 is used as the hash modulus, for any polynomial base (d). So if you try to search for a short Thue–Morse sequence inside a longer one, you will get a great many hash collisions.
For example, your code, slightly adapted:
def karp_rabin(text, pattern):
    # setup
    alphabet = '01'
    d = 15
    n = len(pattern)
    d_n = d**n
    q = 32
    m = {char:i for i,char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q : #and pattern == text[i:i+n]:
            positions.append(i)

    return ' '.join(map(str, positions))

print(karp_rabin('0110100110010110100101100110100110010110011010010110100110010110', '0110100110010110'))
outputs a lot of positions, although only three of them are proper matches.
Note that I have dropped the and pattern == text[i:i+n] check. Obviously, if you restore it the result will be correct, but the algorithm will do much more work checking this additional condition than it would for other values of q. In fact, because there are so many collisions, the whole idea of the algorithm stops working: you could almost as effectively write a simple algorithm that checks every position for a match.
Also note that your implementation is quite strange. The whole idea of polynomial hashing is to take the modulo each time you update the hash; otherwise your pattern_hash and text_hash become very big numbers. In other languages this might mean arithmetic overflow, but in Python it invokes big-integer arithmetic, which is slow and once again defeats the point of the algorithm.
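For illustration, here is a minimal sketch (my adaptation, not part of the answer) of the same search with the modulo applied at every step, so both hashes stay below q:

def karp_rabin_mod(text, pattern, d=4, q=2**31 - 1):
    # polynomial rolling hash with the modulo applied at each step,
    # so pattern_hash and text_hash never exceed q
    m = {c: i for i, c in enumerate('ACGT')}
    n = len(pattern)
    d_n = pow(d, n - 1, q)          # d^(n-1) mod q, used to drop the leading char
    positions = []

    def kr_hash(s):
        h = 0
        for ch in s:
            h = (h * d + m[ch]) % q
        return h

    pattern_hash = kr_hash(pattern)
    text_hash = kr_hash(text[:n])
    for i in range(len(text) - n + 1):
        if i:
            # slide the window: drop text[i-1], append text[i+n-1]
            text_hash = ((text_hash - d_n * m[text[i - 1]]) * d + m[text[i + n - 1]]) % q
        if pattern_hash == text_hash and pattern == text[i:i + n]:
            positions.append(i)
    return positions

print(karp_rabin_mod('ACGTACGTGACG', 'ACGT'))  # expected: [0, 4]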

Evolutionary Algorithm to guess a string, messed up by replication

I am working on a Python script to test out genetic programming.
As an exercise I have made a simple script that tries to guess
a string, without the whole population part.
My code is:
# acts as a gene
# it has three operations:
# Mutation : One character is changed
# Replication: a sequencepart is duplicated
# Extinction : A sequencepart is lost
# Crossover : the sequence is crossed with another Sequence

import random

class StringGene:
    def __init__(self, s):
        self.sequence = s
        self.allowedChars = "ABCDEFGHIJKLMOPQRSTUVWXYZ/{}[]*()+-"

    def __str__(self):
        return self.sequence

    def Mutation(self):
        x = random.randint(0, len(self.sequence)-1)
        r = random.randint(0, len(self.allowedChars)-1)
        d = self.sequence
        self.sequence = d[:x-1] + self.allowedChars[r] + d[x:]

    def Replication(self):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(self.sequence)-1)
        self.sequence = self.sequence[:x1] + self.sequence[x1:x2] + self.sequence[x2:]
        self.sequence = self.sequence[:32]

    def Extinction(self):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(self.sequence)-1)
        self.sequence = self.sequence[:x1] + self.sequence[x2:]

    def CrossOver(self, s):
        x1 = random.randint(0, len(self.sequence)-1)
        x2 = random.randint(0, len(s)-1)
        self.sequence = self.sequence[:x1+1] + s[x2:]
        #x1 = random.randint(0, len(self.sequence)-1)
        #self.sequence = s[:x2] + self.sequence[x1+1:]

if __name__ == "__main__":
    import itertools

    def hamdist(str1, str2):
        if (len(str2) > len(str1)):
            str1, str2 = str2, str1
        str2 = str2.ljust(len(str1))
        return sum(itertools.imap(str.__ne__, str1, str2))

    g = StringGene("Hi there, Hello World !")
    g.Mutation()
    print "gm: " + str(g)
    g.Replication()
    print "gr: " + str(g)
    g.Extinction()
    print "ge: " + str(g)

    h = StringGene("Hello there, partner")
    print "h: " + str(h)
    g.CrossOver(str(h))
    print "gc: " + str(g)

    change = 0
    oldres = 100
    solutionstring = "Hello Daniel. Nice to meet you."
    best = StringGene("")
    res = 100
    print solutionstring
    while (res > 0):
        g.Mutation()
        g.Replication()
        g.Extinction()
        res = hamdist(str(g), solutionstring)
        if res < oldres:
            print "'" + str(g) + "'"
            print "'" + str(best) + "'"
            best = g
            oldres = res
        else:
            g = best
            change = change + 1
    print "Solution:" + str(g) + " " + str(hamdist(solutionstring, str(g))) + str(change)
I use a crude Hamming distance as a measure of how far the solution string
differs from the current one. However, I want the guess to be able to vary
in length, so I introduced replication and deletion of parts of the string.
Now, however, the string grows indefinitely and the solution string is never
found. Can you point out where I went wrong?
Can you suggest improvements?
cheers
Your StringGene objects are mutable, which means that when you do an operation like best = g, you are making both g and best reference the same object. Since after that first step you only have a single object, every mutation gets applied permanently, whether or not it's successful, and all comparisons between g and best are comparisons between the same object.
You either need to implement a copy operator, or make instances immutable, and have each mutation operator return a modified version of the 'gene'.
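For example, a minimal sketch of the copy-operator route (the clone name is mine, not from the question's code):

import copy

class StringGene:
    def __init__(self, s):
        self.sequence = s
        self.allowedChars = "ABCDEFGHIJKLMOPQRSTUVWXYZ/{}[]*()+-"

    def clone(self):
        # return an independent copy, so assigning it elsewhere does not alias self
        return copy.deepcopy(self)

    # ... Mutation, Replication, Extinction, CrossOver as before ...

In the search loop, best = g would then become best = g.clone(), and the rollback g = best would become g = best.clone(), so mutations applied to g never touch best.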
Also, if the first mutation fails to improve the string, you set g to best, which is an empty string, throwing away your starting string entirely.
Finally, the canonical test string is "Methinks it is like a weasel".
The simplest thing might be to limit how long the guessed string is allowed to be. Don't allow guesses above a certain length.
I had a look at your code and I'm not good enough in Python to find any bugs, but it might be that you're simply referencing or indexing the array incorrectly, resulting in always adding new characters to the guess string, so your string is always increasing in length... I don't know if that's the bug, but things like that have happened to me before, so double-check your array indices. ;)
I think your fitness function is too simple. I would play with using two variables, one for the size difference and the other your "hamdist". The larger the size difference is, the more it affects the total fitness; so add the two together with some percentage constant.
I'm also not very familiar with Python, but it looks to me like this is not what you're doing.
First of all, what you are doing is a genetic algorithm, not genetic programming (which is a related, but a different concept).
I don't know Python, but it looks like you have a major problem in your extinction function. As far as I can tell, if x1 > x2 it causes the string to increase in size instead of decreasing (the part between x1 and x2 is effectively doubled). What would happen in the replication function when x1 > x2, I can't tell without knowing Python.
Also keep in mind that maintaining a population is key to effectively solving problems with genetic algorithms. Crossover is the essential part of the algorithm, and it makes little or no sense if it is not performed between population members (also, the more varied the population, the better, most of the time). The code you presented depends on mutations of a single specimen to achieve the expected result, and is thus highly unlikely to produce anything useful faster than a simple brute-force method.
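As an illustration of the population idea, here is a minimal sketch of my own (Python 3, fixed-length target for simplicity, not a fix of the question's code):

import random

TARGET = "Methinks it is like a weasel"
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz "
POP_SIZE = 200
MUTATION_RATE = 0.02

def fitness(s):
    # number of positions already matching the target
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s):
    return "".join(random.choice(CHARS) if random.random() < MUTATION_RATE else ch
                   for ch in s)

def crossover(a, b):
    # single-point crossover between two population members
    cut = random.randrange(len(TARGET))
    return a[:cut] + b[cut:]

population = ["".join(random.choice(CHARS) for _ in TARGET) for _ in range(POP_SIZE)]
generation = 0
while max(fitness(s) for s in population) < len(TARGET):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 5]        # keep the fittest 20% as parents
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children             # elitism: parents survive unchanged
    generation += 1

best = max(population, key=fitness)
print("found %r after %d generations" % (best, generation))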
