corpus extraction with changing data type R - data-extraction

I have a corpus of text files that contain only text. I want to extract the n-grams from each text and save them, under the original file name, in a 3-column matrix.
library(tokenizers)
myTokenizer <- function(x, n, n_min) {
corp<-"this is a full text "
tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
}
corp <- tm_map(corp,content_transformer(function (x) myTokenizer(x, n=3, n_min=1)))
writeCorpus(corp)

Since I don't have your corpus, I created one of my own using the crude dataset from tm. There is no need to use tm_map, as that keeps the data in a corpus format. The tokenizers package can handle this.
What I do is store all your desired matrices in a list object via lapply and then use sapply to store the data in the crude directory as separate files.
Do realize that the matrices as specified in your function will be character matrices. This means that columns 1 and 2 will be characters, not numbers.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
myTokenizer <- function(x, n, n_min) {
tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
M[, 3] <- tok
M[, 2] <- lengths(strsplit(M[, 3], "\\W+")) # counts the words
M[, 1] <- seq_along(tok) # running index; also safe when tok is empty
return(M)
}
my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)
# make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices),
function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))
Outcome of the first file:
"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"
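To make the row layout above easier to follow, here is a minimal pure-Python sketch of the same expansion: at each word position, every n-gram from n_min to n words is emitted in turn. The function name ngram_rows is hypothetical, and the plain whitespace split is an assumption standing in for the tokenizers package's real tokenization, so this only illustrates the ordering and the three columns.

```python
def ngram_rows(text, n=3, n_min=1):
    """Return (gram, num_words, words) rows for every n-gram of n_min..n words."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    rows = []
    for start in range(len(words)):
        for size in range(n_min, n + 1):
            if start + size > len(words):
                break  # not enough words left for a longer n-gram
            gram = " ".join(words[start:start + size])
            rows.append((len(rows) + 1, size, gram))
    return rows


for row in ngram_rows("Diamond Shamrock Corp said"):
    print(row)
```

Unlike the R character matrix, the first two fields stay integers here because each row is a tuple rather than a cell in a single-typed matrix.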

I would recommend creating a document-term matrix (DTM). You will probably need this in your downstream tasks anyway. From that you could also extract the information you want, although it is probably not reasonable to assume that a term (incl. n-grams) comes from only a single document (at least this is what I understood from your question; please correct me if I am wrong). In practice, I would expect one term to have several documents associated with it, and this kind of information is usually stored in a DTM.
An example with text2vec below. If you could elaborate further how you want to use your terms, etc. I could adapt the code according to your needs.
library(text2vec)
# I have set up two texts that do not overlap in any term, just as an example;
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
# a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1 1 0 0 0 0 0 1 1 1 0 1
# d2 0 0 1 1 1 1 1 0 0 0 1 0
library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
v = dtm[d, ]
v = v[v!=0]
v = data.frame(number = 1:length(v)
,term = names(v))
v$n = stri_count_fixed(v$term, "_")+1
write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
# number term n
# 1 1 a 1
# 2 2 a_text 2
# 3 3 here 1
# 4 4 here_a 2
# 5 5 here_a_text 3
# 6 6 text 1
read.csv("v_d2.csv")
# number term n
# 1 1 and 1
# 2 2 and_another 2
# 3 3 and_another_one 3
# 4 4 another 1
# 5 5 another_one 2
# 6 6 one 1
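As a rough illustration of what the document-term matrix holds, the sketch below rebuilds the same 2 x 12 table in plain Python. The helper names ngrams and build_dtm are hypothetical, and whitespace splitting is an assumption; text2vec's own tokenizer and vocabulary pruning are not reproduced here.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Yield n-grams of n_min..n_max tokens, joined with underscores."""
    for start in range(len(tokens)):
        for size in range(n_min, n_max + 1):
            if start + size <= len(tokens):
                yield "_".join(tokens[start:start + size])


def build_dtm(docs):
    """Return (sorted vocabulary, {doc name: list of term counts})."""
    grams = {name: list(ngrams(text.split())) for name, text in docs.items()}
    vocab = sorted({g for doc in grams.values() for g in doc})
    dtm = {name: [doc.count(term) for term in vocab] for name, doc in grams.items()}
    return vocab, dtm


vocab, dtm = build_dtm({"d1": "here a text", "d2": "and another one"})
print(vocab)
print(dtm["d1"])
print(dtm["d2"])
```

Each row of the matrix answers "which terms occur in this document", and each column answers "which documents contain this term", which is why a term shared by several documents needs no special handling.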

Related

Duplicate Strings with Ambiguity

I have a large (5-10 million) set of strings with the restricted alphabet of nucleotide symbols (A,T,C, and G) along with a wildcard symbol N. Each string has an integer associated with it.
I want to find all the unique strings and, for each, sum their integer values. The 'representative' string for a set of equal strings should be the one with the highest integer value. For example, given:
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
I want the output to be:
NTG 14
NAG 6
ATG 2
TTT 2
With a dataset of this size pairwise comparisons are not feasible. Any ideas?
I assumed that your target output wasn't accurate. It seems more appropriate to match "ATG" to "ANG" (which I have done) instead of matching "ANG" to "NTG" (your stated goal). This solution addresses your given sample set, but may not be helpful for your desired application given the significant difference in scale.
Code:
import re
test = """
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
"""
test = [x.split(" ") for x in test.upper().split("\n") if x != ""]
#print(test)
index = 0
while index < len(test):
    seq = test[index]
    seq_regex = seq[0].replace("N", ".")
    no_match_li = [x for x in test if len(re.findall(seq_regex, x[0])) == 0]
    match_li = [int(x[1]) for x in test if len(re.findall(seq_regex, x[0])) != 0]
    #print(no_match_li, match_li)
    test = [[seq[0], sum(match_li)]] + no_match_li
    index += 1
test = sorted(test, key=lambda x: x[1], reverse=True)
for seq in test:
    print(seq[0], seq[1])
Output:
NTG 11
NAG 6
ANG 5
TTT 2
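One detail worth spelling out: the regex above treats N as a wildcard only in the pattern string, not in the string being searched, so "NTG" and "ANG" are never matched to each other. The sketch below is a symmetric check under the assumption that N on either side matches any symbol; the name compatible is hypothetical and this is not part of the answer's code.

```python
def compatible(a, b, wildcard="N"):
    """True if a and b have equal length and agree wherever neither has a wildcard."""
    return len(a) == len(b) and all(
        x == y or x == wildcard or y == wildcard for x, y in zip(a, b)
    )


print(compatible("NTG", "ANG"))  # wildcards on both sides, no conflicting symbol
print(compatible("NAG", "NTG"))  # A and T conflict in the second position
```

Note that this relation is not transitive: "NAG" and "NTG" are each compatible with "ANG" but not with each other, which is why the grouping in the question is ambiguous without an extra rule such as "process strings in descending order of their integer value".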

All possible N choose K WITHOUT recursion

I'm trying to create a function that is able to go through a row vector and output the possible combinations of an n choose k without recursion.
For example: 3 choose 2 on [a,b,c] outputs [a,b; a,c; b,c]
I found this: How to loop through all the combinations of e.g. 48 choose 5 which shows how to do it for a fixed n choose k and this: https://codereview.stackexchange.com/questions/7001/generating-all-combinations-of-an-array which shows how to get all possible combinations. Using the latter code, I managed to make a very simple and inefficient function in matlab which returned the result:
function [ combi ] = NCK(x,k)
%x - row vector of inputs
%k - number of elements in the combinations
combi = [];
letLen = 2^length(x);
for i = 0:letLen-1
temp=[0];
a=1;
for j=0:length(x)-1
if (bitand(i,2^j))
temp(a) = x(j+1); % a is the running position within this combination
a=a+1;
end
end
if (nnz(temp) == k)
combi=[combi; temp];
end
end
combi = sortrows(combi);
end
This works well for very small vectors, but I need this to be able to work with vectors of at least 50 in length. I've found many examples of how to do this recursively, but is there an efficient way to do this without recursion and still be able to do variable sized vectors and ks?
Here's a simple function that will take a permutation of k ones and n-k zeros and return the next combination of nchoosek. It's completely independent of the values of n and k, taking the values directly from the input array.
function [nextc] = nextComb(oldc)
nextc = [];
o = find(oldc, 1); %// find the first one
z = find(~oldc(o+1:end), 1) + o; %// find the first zero *after* the first one
if length(z) > 0
nextc = oldc;
nextc(1:z-1) = 0;
nextc(z) = 1; %// make the first zero a one
nextc(1:nnz(oldc(1:z-2))) = 1; %// move previous ones to the beginning
else
nextc = zeros(size(oldc));
nextc(1:nnz(oldc)) = 1; %// start over
end
end
(Note that the else clause is only necessary if you want the combinations to wrap around from the last combination to the first.)
If you call this function with, for example:
A = [1 1 1 1 1 0 1 0 0 1 1]
nextCombination = nextComb(A)
the output will be:
A =
1 1 1 1 1 0 1 0 0 1 1
nextCombination =
1 1 1 1 0 1 1 0 0 1 1
You can then use this as a mask into your alphabet (or whatever elements you want combinations of).
C = ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k']
C(find(nextCombination))
ans = abcdegjk
The first combination in this ordering is
1 1 1 1 1 1 1 1 0 0 0
and the last is
0 0 0 1 1 1 1 1 1 1 1
To generate the first combination programmatically,
n = 11; k = 8;
nextCombination = zeros(1,n);
nextCombination(1:k) = 1;
Now you can iterate through the combinations (or however many you're willing to wait for):
for c = 2:nchoosek(n,k) %// start from 2; we already have 1
nextCombination = nextComb(nextCombination);
%// do something with the combination...
end
For your example above:
nextCombination = [1 1 0];
C(find(nextCombination))
for c = 2:nchoosek(3,2)
nextCombination = nextComb(nextCombination);
C(find(nextCombination))
end
ans = ab
ans = ac
ans = bc
Note: I've updated the code; I had forgotten to include the line to move all of the 1's that occur prior to the swapped digits to the beginning of the array. The current code (in addition to being corrected above) is on ideone here. Output for 4 choose 2 is:
allCombs =
1 2
1 3
2 3
1 4
2 4
3 4
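For comparison, here is a Python sketch of the same mask-stepping idea. It is a direct port of nextComb under 0-based indexing; the names next_comb and combinations_iter are hypothetical, and math.comb assumes Python 3.8 or later.

```python
from math import comb


def next_comb(mask):
    """Return the next 0/1 mask with the same number of ones (wraps at the end)."""
    if 1 not in mask:
        return list(mask)
    first_one = mask.index(1)
    try:
        z = mask.index(0, first_one + 1)  # first zero after the first one
    except ValueError:
        k = sum(mask)  # last combination reached: start over
        return [1] * k + [0] * (len(mask) - k)
    ones_to_keep = sum(mask[:z - 1])  # ones that move back to the front
    nxt = list(mask)
    nxt[:z] = [0] * z
    nxt[z] = 1
    nxt[:ones_to_keep] = [1] * ones_to_keep
    return nxt


def combinations_iter(items, k):
    """Yield every k-element combination of items without recursion."""
    mask = [1] * k + [0] * (len(items) - k)
    for _ in range(comb(len(items), k)):
        yield [item for item, bit in zip(items, mask) if bit]
        mask = next_comb(mask)


for combo in combinations_iter("abc", 2):
    print("".join(combo))
```

Only the current mask is held in memory, so the memory cost stays at n entries even when the number of combinations is far too large to store.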

MATLAB identify adjacent regions in 3D image

I have a 3D image, divided into contiguous regions where each voxel has the same value. The value assigned to this region is unique to the region and serves as a label. The example image below describes the 2D case:
     1 1 1 1 2 2 2
     1 1 1 2 2 2 3
Im = 1 4 1 2 2 3 3
     4 4 4 4 3 3 3
     4 4 4 4 3 3 3
I want to create a graph describing adjacency between these regions. In the above case, this would be:
    0 1 0 1
A = 1 0 1 1
    0 1 0 1
    1 1 1 0
I'm looking for a speedy solution to do this for large 3D images in MATLAB. I came up with a solution that iterates over all regions, which takes 0.05s per iteration. Unfortunately, this will take over half an hour for an image with 32'000 regions. Does anybody know a more elegant way of doing this? I'm posting the current algorithm below:
labels = unique(Im); % assuming labels go continuously from 1 to N
A = zeros(numel(labels)); % N-by-N adjacency matrix
for ii=labels.' % unique returns a column vector; iterate over its elements
% border mask to find neighbourhood
dil = imdilate( Im==ii, ones(3,3,3) );
border = dil - (Im==ii);
neighLabels = unique( Im(border>0) );
A(ii,neighLabels) = 1;
end
imdilate is the bottleneck I would like to avoid.
Thank you for your help!
I came up with a solution which is a combination of Divakar's and teng's answers, as well as my own modifications and I generalised it to the 2D or 3D case.
To make it more efficient, I should probably pre-allocate the r and c, but in the meantime, this is the runtime:
For a 3D image of dimension 117x159x126 and 32000 separate regions: 0.79s
For the above 2D example: 0.004671s with this solution, 0.002136s with Divakar's solution, 0.03995s with teng's solution.
I haven't tried extending the winner (Divakar) to the 3D case, though!
noDims = length(size(Im));
validim = ones(size(Im))>0;
labels = unique(Im);
if noDims == 3
Im = padarray(Im,[1 1 1],'replicate', 'post');
shifts = {[-1 0 0] [0 -1 0] [0 0 -1]};
elseif noDims == 2
Im = padarray(Im,[1 1],'replicate', 'post');
shifts = {[-1 0] [0 -1]};
end
% get value of the neighbors for each pixel
% by shifting the image in each direction
r=[]; c=[];
for i = 1:numel(shifts)
tmp = circshift(Im,shifts{i});
r = [r ; Im(validim)];
c = [c ; tmp(validim)];
end
A = sparse(r,c,ones(size(r)), numel(labels), numel(labels) );
% make symmetric, delete diagonal
A = (A+A')>0;
A(1:size(A,1)+1:end)=0;
Thanks for the help!
Try this out -
Im = padarray(Im,[1 1],'replicate');
labels = unique(Im);
box1 = [-size(Im,1)-1 -size(Im,1) -size(Im,1)+1 -1 1 size(Im,1)-1 size(Im,1) size(Im,1)+1];
mat1 = NaN(numel(labels),numel(labels));
for k2=1:numel(labels)
a1 = find(Im==k2);
for k1=1:numel(labels)
a2 = find(Im==k1);
t1 = bsxfun(@plus,a1,box1);
t2 = bsxfun(@eq,t1,permute(a2,[3 2 1]));
mat1(k2,k1) = any(t2(:));
end
end
mat1(1:size(mat1,1)+1:end)=0;
If it works for you, share with us the runtimes as comparison? Would love to see if the coffee brews any faster than half an hour!
Below is my attempt.
Im = [1 1 1 1 2 2 2;
1 1 1 2 2 2 3;
1 4 1 2 2 3 3;
4 4 4 4 3 3 3;
4 4 4 4 3 3 3];
% mark the borders
validim = zeros(size(Im));
validim(2:end-1,2:end-1) = 1;
% get value of the 4-neighbors for each pixel
% by shifting the images 4 times in each direction
numNeighbors = 4;
adj = zeros([prod(size(Im)),numNeighbors]);
shifts = {[0 1] [0 -1] [1 0] [-1 0]};
for i = 1:numNeighbors
tmp = circshift(Im,shifts{i});
tmp(validim == 0) = nan;
adj(:,i) = tmp(:);
end
% mark neighbors where it does not eq Im
imDuplicates = repmat(Im(:),[1 numNeighbors]);
nonequals = adj ~= imDuplicates;
% neglect the border
nonequals(isnan(adj)) = 0;
% get these neighbor values and the corresponding Im value
compared = [imDuplicates(nonequals == 1) adj(nonequals == 1)];
% construct your 'A' % possibly could be more optimized here.
labels = unique(Im);
A = zeros(numel(labels));
for i = 1:size(compared,1)
A(compared(i,1),compared(i,2)) = 1;
end
@Lisa
Your reasoning is elegant, though it obviously gives wrong answers for labels on the edges.
Try this simple label matrix:
Im =
1 2 2
3 3 3
3 4 4
The resulting adjacency matrix, according to your code, is:
A =
0 1 1 0
1 0 1 1
1 1 0 1
0 1 1 0
which claims an adjacency between labels "2" and "4": obviously wrong. This happens simply because you are reading padded Im labels based on "validim" indices, which now doesn't match the new Im and goes all the way down to the lower borders.
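To make the edge issue concrete, here is a small pure-Python sketch of neighbour comparison with explicit bounds checks, so nothing wraps around the image border. The function name region_adjacency is hypothetical, it assumes 4-connectivity and labels running from 1 to N, and it handles only the 2D case; it is meant to show the logic, not to be fast on large 3D volumes.

```python
def region_adjacency(im):
    """4-connected adjacency matrix for a 2D label image with labels 1..N."""
    n = max(max(row) for row in im)
    adj = [[0] * n for _ in range(n)]
    rows, cols = len(im), len(im[0])
    for r in range(rows):
        for c in range(cols):
            a = im[r][c]
            # Look only right and down; symmetry covers left and up.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols and im[rr][cc] != a:
                    b = im[rr][cc]
                    adj[a - 1][b - 1] = adj[b - 1][a - 1] = 1
    return adj


small = [[1, 2, 2],
         [3, 3, 3],
         [3, 4, 4]]
for row in region_adjacency(small):
    print(row)
```

On the 3 x 3 label matrix above, labels 2 and 4 are not reported as adjacent, because no comparison ever reaches past the last row or column.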

Efficient way of finding rows in which A>B

Suppose M is a matrix where each row represents a randomized sequence of a pool of N objects, e.g.,
1 2 3 4
3 4 1 2
2 1 3 4
How can I efficiently find all the rows in which a number A comes before a number B?
e.g., A=1 and B=2; I want to retrieve the first and the second rows (in which 1 comes before 2)
There you go:
[iA jA] = find(M.'==A);
[iB jB] = find(M.'==B);
sol = find(iA<iB)
Note that this works because, according to the problem specification, every number is guaranteed to appear once in each row.
To find rows of M with a given prefix (as requested in the comments): let prefix be a vector with the sought prefix (for example, prefix = [1 2]):
find(all(bsxfun(@eq, M(:,1:numel(prefix)).', prefix(:))))
Something like the following code should work. It checks whether A comes before B in each row.
temp = [1 2 3 4;
3 4 1 2;
2 1 3 4];
A = 1;
B = 2;
orderMatch = zeros(1,size(temp,1));
for i = 1:size(temp,1)
match1= temp(i,:) == A;
match2= temp(i,:) == B;
aIndex = find(match1,1);
bIndex = find(match2,1);
if aIndex < bIndex
orderMatch(i) = 1;
end
end
solution = find(orderMatch);
orderMatch will be [1,1,0], because the first two rows have 1 coming before 2 but the third row does not, so solution is [1 2].
UPDATE
Added the find call on orderMatch to return row indices, as suggested by Luis.
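The same per-row test can be sketched in a few lines of Python. The function name rows_where_before is hypothetical, and the sketch relies on the stated guarantee that every number appears exactly once in each row (list.index raises ValueError otherwise).

```python
def rows_where_before(matrix, a, b):
    """Return 1-based indices of the rows in which a appears before b."""
    return [
        i
        for i, row in enumerate(matrix, start=1)
        if row.index(a) < row.index(b)
    ]


M = [[1, 2, 3, 4],
     [3, 4, 1, 2],
     [2, 1, 3, 4]]
print(rows_where_before(M, 1, 2))
```

The indices are 1-based here only to match MATLAB's find; drop start=1 for ordinary Python indexing.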

Print (or output to file) table of number of steps for Euclid's algorithm

I'd like to print (or send to a file in a human-readable format like below) arbitrary size square tables where each table cell contains the number of steps required to solve Euclid's algorithm for the two integers in the row/column headings like this (table written by hand, but I think the numbers are all correct):
1 2 3 4 5 6
1 1 1 1 1 1 1
2 1 1 2 1 2 1
3 1 2 1 2 3 1
4 1 1 2 1 2 2
5 1 2 3 2 1 2
6 1 1 1 2 2 1
The script would ideally allow me to choose the start integer (1 as above or 11 as below or something else arbitrary) and end integer (6 as above or 16 as below or something else arbitrary and larger than the start integer), so that I could do this too:
11 12 13 14 15 16
11 1 2 3 4 4 3
12 2 1 2 2 2 2
13 3 2 1 2 3 3
14 4 2 2 1 2 2
15 4 2 3 2 1 2
16 3 2 3 2 2 1
I realize that the table is symmetric about the diagonal and so only half of the table contains unique information, and that the diagonal itself is always a 1-step algorithm.
See this and for a graphical representation of what I'm after, but I'd like to know the actual number of steps for any two integers which the image doesn't show me.
I have the algorithms (there's probably better implementations, but I think these work):
The step counter:
def gcd(a,b):
    """Step counter."""
    if b > a:
        x = a
        a = b
        b = x
    counter = 0
    while b:
        c = a % b
        a = b
        b = c
        counter += 1
    return counter
The list builder:
def gcd_steps(n):
    """List builder."""
    print("Table of size", n - 1, "x", n - 1)
    list_of_steps = []
    for i in range(1, n):
        for j in range(1, n):
            list_of_steps.append(gcd(i,j))
    print(list_of_steps)
    return list_of_steps
but I'm totally hung up on how to write the table. I thought about a double nested for loop with i and j and stuff, but I'm new to Python and haven't a clue about the best way (or any way) to go about writing the table. I don't need special formatting like something to offset the row/column heads from the table cells as I can do that by eye, but just getting everything to line up so that I can read it easily is proving too difficult for me at my current skill level, I'm afraid. I'm thinking that it probably makes sense to print/output within the two nested for loops as I'm calculating the numbers I need which is why the list builder has some print statements as well as returning the list, but I don't know how to work the print magic to do what I'm after.
Try this. The program computes the data row by row and writes each row as soon as it is available, in order to limit memory usage.
import sys, os
def gcd(a,b):
    k = 0
    if b > a:
        a, b = b, a
    while b > 0:
        a, b = b, a%b
        k += 1
    return k
def printgcd(name, a, b):
    f = open(name, "wt")
    s = ""
    for i in range(a, b + 1):
        s = "{}\t{}".format(s, i)
    f.write("{}\n".format(s))
    for i in range(a, b + 1):
        s = "{}".format(i)
        for j in range (a, b + 1):
            s = "{}\t{}".format(s, gcd(i, j))
        f.write("{}\n".format(s))
    f.close()
printgcd("gcd-1-6.txt", 1, 6)
The preceding won't return a list with all computed values, since they are discarded on purpose. That is easy to add, however. Here is a solution with a hash table:
def printgcd2(name, a, b):
    f = open(name, "wt")
    s = ""
    h = { }
    for i in range(a, b + 1):
        s = "{}\t{}".format(s, i)
    f.write("{}\n".format(s))
    for i in range(a, b + 1):
        s = "{}".format(i)
        for j in range (a, b + 1):
            k = gcd(i, j)
            s = "{}\t{}".format(s, k)
            h[i, j] = k
        f.write("{}\n".format(s))
    f.close()
    return h
And here is another with a list of lists
def printgcd3(name, a, b):
    f = open(name, "wt")
    s = ""
    u = [ ]
    for i in range(a, b + 1):
        s = "{}\t{}".format(s, i)
    f.write("{}\n".format(s))
    for i in range(a, b + 1):
        v = [ ]
        s = "{}".format(i)
        for j in range (a, b + 1):
            k = gcd(i, j)
            s = "{}\t{}".format(s, k)
            v.append(k)
        f.write("{}\n".format(s))
        u.append(v)
    f.close()
    return u
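If the goal is a table that lines up on screen rather than a tab-separated file, fixed-width formatting is one option. The sketch below is an alternative, not part of the answer above: the names euclid_steps and format_table are hypothetical, and the column width is derived from the largest heading on the assumption that no step count has more digits than that.

```python
def euclid_steps(a, b):
    """Number of division steps Euclid's algorithm needs for a and b."""
    if b > a:
        a, b = b, a
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps


def format_table(start, end):
    """Return the step table for start..end as right-aligned text."""
    nums = range(start, end + 1)
    width = len(str(end)) + 1  # assumption: step counts are no wider than headings
    lines = [" " * width + "".join(f"{j:>{width}}" for j in nums)]
    for i in nums:
        cells = "".join(f"{euclid_steps(i, j):>{width}}" for j in nums)
        lines.append(f"{i:>{width}}" + cells)
    return "\n".join(lines)


print(format_table(1, 6))
print(format_table(11, 16))
```

Returning a string instead of printing inside the loops keeps the formatting reusable: the same text can be printed or written to a file with a single write call.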

Resources