I have already implemented my algorithm using cell arrays of strings in MATLAB, but I can't seem to make it work when reading from a file.
In MATLAB, I create a cell array of strings for each line; let's call each one line.
So I get
line= 'string1' 'string2' etc
line= 'string 5' 'string7'...
line=...
and so on. I have hundreds of lines to read.
What I'm trying to do is compare the words in the first line to that same line.
Then I combine the first and second lines and compare the words in the second line to the combined cell. I accumulate each cell I read and compare it with the last cell read.
Here is my code:
for each line= a,b,c,d,...
for(i=1:length(a))
for(j=1:length(a))
AA=ismember(a,a)
end
combine=[a,b]
[unC,i]=unique(combine, 'first')
sorted=combine(sort(i))
for(i=1:length(sorted))
for(j=1:length(b))
AB=ismember(sorted,b)
end
end
combine1=[a,b,c]
.....
When I read my file, I create a while loop which reads the whole script until the end, so how can I implement my algorithm if all my cells of strings have the same name?
while ~feof(fid)
    out = fgetl(fid);
    if isempty(out) || strncmp(out, '%', 1) || ~ischar(out)
        continue
    end
    line = regexp(out, ' ', 'split');
end
Suppose your data file is called data.txt and its content is:
string1 string2 string3 string4
string2 string3
string4 string5 string6
A very easy way to retain only the first unique occurrence is:
% Parse everything in one go
fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s');
fclose(fid);
unique(out{1})
ans =
'string1'
'string2'
'string3'
'string4'
'string5'
'string6'
As already mentioned, this approach might not work if:
your data file has irregularities
you actually need the comparison indices
EDIT: solution for performance
% Parse in bulk and split (assuming you don't know the maximum
% number of strings in a line, otherwise you can use textscan alone)
fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s','Delimiter','\n');
out = regexp(out{1},' ','split');
fclose(fid);
% Preallocate unique comb
comb = unique([out{:}]); % you might need to remove empty strings from here
% preallocate idx
m = size(out,1);
idx = false(m,size(comb,2));
% Loop for number of lines (rows)
for ii = 1:m
idx(ii,:) = ismember(comb,out{ii});
end
Note that the resulting idx is:
idx =
1 1 1 1 0 0
0 1 1 0 0 0
0 0 0 1 1 1
The advantage of keeping it in this form is that you save space compared to a cell array (which imposes 112 bytes of overhead per cell). You can also store it as a sparse array to potentially reduce storage further.
Another thing to note: even if the logical array is longer than the (e.g. double) array it is indexing, you can still use it as long as the excess elements are false (and by construction of the above problem, idx satisfies this requirement).
An example to clarify:
A = 1:3;
A([true false true false false])
which returns elements 1 and 3 of A.
I am trying to write a function that returns True if a given string has exactly 6 consecutive characters with the same value, and False otherwise; if the run is longer or shorter than 6, it should return False:
I am not allowed to use lists, sets, or import any packages. I am restricted to while loops, for loops, and basic mathematical operations.
Two example runs are shown below:
Enter a string: 367777776
True
Enter a string: 3677777777776
False
Note that although I entered numbers, the argument is actually a string within the function, for example: consecutive('3777776')
I tried to convert the string into ASCII codes and then filter out the numbers there. However, I got stuck.
def consecutive(x):
storage= ' '
acc=0
count=0
for s in x:
storage+= str(ord(s)) + ' '
acc+=ord(s)
if acc == acc:
count+=1
for s in x-1:
return count
My intention is to compare the previous character's ASCII code to the current character's in the string. If the codes don't match, I add to an accumulator; the accumulator will count the number of duplicates. From there, I will use an if-else statement to see if the count is greater or less than 6. However, I have a hard time translating my thoughts into Python code.
Can anyone assist me?
That's a pretty good start!
A few comments:
Variables storage and acc play the same role, and are a little more complicated than they have to be. All you want to know when you arrive at character s is whether or not s is identical to the previous character. So, you only need to store the previously seen character.
Condition acc == acc is always going to be True. I think you meant acc == s?
When you encounter an identical character, you correctly increase the count with count += 1. However, when we change characters, you should reset the count.
With these comments in mind, I fixed your code, then blanked out a few parts for you to fill. I've also renamed storage and acc to previous_char which I think is more explicit.
def has_6_consecutive(x):
previous_char = None
count = 0
for s in x:
if s == previous_char:
???
elif count == 6:
???
else:
???
previous_char = ???
???
You could use recursion. Loop over all the characters, and for each one check to see if the next 6 are identical. If so, return true. If you get to the end of the array (or even within 6 characters of the end), return false.
For more info on recursion, check this out: https://www.programiz.com/python-programming/recursion
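A minimal sketch of that lookahead idea (written as a plain scan rather than recursion; has_exactly_6 is a placeholder name). To return True only for runs of exactly six, the run must be bounded by a different character (or the string boundary) on both sides:
def has_exactly_6(s):
    for i in range(len(s) - 5):
        # check whether the 6 characters starting at i are identical
        if all(s[j] == s[i] for j in range(i, i + 6)):
            before_ok = i == 0 or s[i - 1] != s[i]
            after_ok = i + 6 == len(s) or s[i + 6] != s[i]
            if before_ok and after_ok:
                return True
    return False

print(has_exactly_6('367777776'))      # True
print(has_exactly_6('3677777777776'))  # False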
would something like this be allowed?
def consecF(n):
    consec = 1
    prev = n[0]
    for i in n[1:]:  # start from the second character so the first isn't counted twice
        if i == prev:
            consec += 1
        else:
            if consec == 6:  # a run just ended: was it exactly 6?
                return True
            consec = 1
        prev = i
    return consec == 6  # also check the final run
n = "12111123333221"
print(consecF(n))
You can try a two-pointer approach, where the left pointer is fixed at the first instance of some digit and the right one is advanced as long as the same digit is seen.
def consecutive(x):
left = 0
while left != len(x):
right = left
while right < len(x) and x[right] == x[left]:
right += 1
length = (right - 1) - left + 1 # from left to right - 1 inclusive, x[left] repeated
if length == 6: # found desired length
return True
left = right
return False # no segment found
tests = [
'3677777777776',
'367777776'
]
for test in tests:
print(f"{test}: {consecutive(test)}")
Output
3677777777776: False
367777776: True
You should store the current sequence of repeated chars.
def consecutive(x):
sequencechar = ' '
repetitions = 0
for ch in x:
if ch != sequencechar:
if repetitions == 6:
break
sequencechar = ch
repetitions = 1
else:
repetitions += 1
return repetitions == 6
I would rather not have given the entire solution, but this is still a simple problem. However, one has to take care of a few points.
As you see, the current sequence of repeated characters is tracked, and when a sequence ends and a new one starts, the loop breaks out if a correct sequence (exactly six repetitions) was found.
Also, after the for loop ends normally, the last sequence is checked by the return statement (this is not done inside the loop).
Integer :: NBE,ierr,RN,i,j
Real(kind=8), allocatable :: AA1(:,:),AA2(:,:)
NBE=40
RN=3*NBE-2
Allocate(AA1(3*NBE,3*NBE),AA2(3*NBE,RN),stat=ierr)
If (ierr .ne. 0) Then
print *, 'allocate steps failed 1'
pause
End If
Do i=1,3*NBE
Do j=1,3*NBE
AA1(i,j)=1
End Do
End Do
I want to remove columns 97 and 113 from the matrix AA1, with the result becoming AA2. I just want to know whether any Fortran command can perform this operation.
Alexander Vogt's answer gives the concept of using a vector subscript to select the elements of the array for inclusion. That answer constructs the vector subscript array using
[(i,i=1,96),(i,i=98,112),(i,i=114,3*NBE)]
Some may consider
AA2 = AA1(:,[(i,i=1,96),(i,i=98,112),(i,i=114,3*NBE)])
to be less than clear in reading. One could use a "temporary" index vector
integer selected_columns(RN)
selected_columns = [(i,i=1,96),(i,i=98,112),(i,i=114,3*NBE)]
AA2 = AA1(:,selected_columns)
but that doesn't address the array constructor itself being unpleasant, especially in more complicated cases. Instead, we can create a mask and use our common techniques:
logical column_wanted(3*NBE)
integer, allocatable :: selected_columns(:)
! Create a mask of whether a column is wanted
column_wanted = .TRUE.
column_wanted([97,113]) = .FALSE.
! Create a list of indexes of wanted columns
selected_columns = PACK([(i,i=1,3*NBE)],column_wanted)
AA2 = AA1(:,selected_columns)
Here's a simple one-liner:
AA2 = AA1(:,[(i,i=1,96),(i,i=98,112),(i,i=114,3*NBE)])
Explanation:
(Inner part) Construct a temporary array for the indices [1,...,96,98,...,112,114,...,3*NBE]
(Outer part) Copy the matrix and only consider the columns in the index array
OK, I yield to @IanBush... Even simpler would be to do three dedicated assignments:
AA2(:,1:96) = AA1(:,1:96)
AA2(:,97:111) = AA1(:,98:112)
AA2(:,112:) = AA1(:,114:)
I don't have a Fortran compiler here at home, so I can't test it, but I'd do something like this:
i = 0
DO j = 1, 3*NBE
IF (j == 97 .OR. j == 113) CYCLE
i = i + 1
AA2(:, i) = AA1(:, j)
END DO
The CYCLE statement means that the rest of the loop body is skipped and the next iteration starts. Thereby i does not get incremented, so when j=96 then i=96, when j=98 then i=97, and when j=114 then i=112.
A few more words: Due to Fortran's memory layout, you want to cycle over the first index the fastest, and so forth. So your code would run faster if you changed it to:
Do j=1,3*NBE ! Outer loop over second index
Do i=1,3*NBE ! Inner loop over first index
AA1(i,j)=1
End Do
End Do
(Of course, such an easy initialisation can be done even more simply with just AA1(:,:) = 1 or just AA1 = 1.)
I am trying to implement a slightly modified version of the Rabin-Karp algorithm. My idea is that if I compute a hash value of the given pattern with a positional weight for each letter, then I don't have to worry about anagrams: I can just pick a part of the string, calculate its hash value, and compare it with the hash value of the pattern. This is unlike the traditional approach, where the hash values of both the substring and the pattern are calculated and, on a match, the strings are further checked character by character to make sure it isn't a spurious match such as an anagram. Here is my code below.
string = "AABAACAADAABAABA"
pattern = "AABA"
#string = "gjdoopssdlksddsoopdfkjdfoops"
#pattern = "oops"
#get hash value of the pattern
def gethashp(pattern):
sum = 0
#I multiply each letter of the pattern by a weight
#So for e.g. CAT will be C*1 + A*2 + T*3 and the resulting
#value will be unique for the word CAT and won't match if the
#letters are rearranged
for i in range(len(pattern)):
sum = sum + ord(pattern[i]) * (i + 1)
return sum % 101 #some prime number 101
def gethashst(string):
sum = 0
for i in range(len(string)):
sum = sum + ord(string[i]) * (i + 1)
return sum % 101
hashp = gethashp(pattern)
i = 0
def checkMatch(string,pattern,hashp):
global i
#check that we still have a full pattern-length window
#(comes in handy when nearing the end of the string)
if len(string[:len(pattern)]) == len(pattern):
#assign the substring to string2
string2 = string[:len(pattern)]
#get the hash value of the substring
hashst = gethashst(string2)
#if both the hashvalue matches
if hashst == hashp:
#print the index of the first character of the match
print("Pattern found at {}".format(i))
#delete the first character of the string
string = string[1:]
#increment the index
i += 1 #keep a count of the index
checkMatch(string,pattern,hashp)
else:
#if no match or end of string,return
return
checkMatch(string,pattern,hashp)
The code is working just fine. My question is: is this a valid way of doing it? Can there be any instance where the logic might fail? All the Rabin-Karp implementations I have come across don't use this logic; instead, for every hash match they further check character by character to ensure it isn't a spurious match. So is it wrong to do it this way? My view is that with this code, as soon as the hash value matches you never have to check both strings character by character and can just move on to the next position.
It's not necessary that only anagrams collide with the hash value of the pattern; any other string with the same hash value can collide too. The hash value can act as a liar, so a character-by-character match is required.
For example, in your case you are taking the sum mod 101, so there are only 101 possible hash values. Take any 102 distinct patterns; by the pigeonhole principle at least two of them must have the same hash. If you use one of them as the pattern, the presence of the other string in the text would make your output wrong if you skip the character match.
Moreover, even with the hash you used, two anagrams can have the same hash value, which can be found by solving two linear equations.
For example,
DCE = 4*1 + 3*2 + 5*3 = 25
CED = 3*1 + 5*2 + 4*3 = 25
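As a quick sanity check, re-implementing gethashp from the question (here with the actual ord values rather than the illustrative 4/3/5 weights) shows the two anagrams really do collide, even before the mod (409 in both cases):
def gethashp(pattern):
    # same positional-weight hash as in the question, condensed
    return sum(ord(pattern[i]) * (i + 1) for i in range(len(pattern))) % 101

print(gethashp("DCE"), gethashp("CED"))    # 5 5 (both 409 % 101)
print(gethashp("DCE") == gethashp("CED"))  # True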
Here is my code:
#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
oneago = nil
min = 100000000000 #index
max = 0 #index
thisrow = (1..seq2.size).to_a + [0]
seq1.size.times do |x|
twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
seq2.size.times do |y|
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
thisrow[y] = [delcost, addcost, subcost].min
if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
end
end
end
return thisrow[seq2.size - 1], min, max
end
There has to be some way to get the starting and ending index of the substring, seq1, within the parent string, seq2, right?
I'm not entirely sure how this algorithm works, even after reading the wiki article on it. I understand the highest-level explanation, as it finds the insertion, deletion, and transposition differences (the lines in the second loop), but beyond that I'm a bit lost.
Here is an example of something that I want to be able to do with this:
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
the indexes should be:
start: 5
end: 18 (last char of string)
Ideally, the search_string will never be modified. But I guess I could take out all the whitespace characters (since there are only... 3? \n, \r and \t), store the indexes of each whitespace character, get the indexes of my substring, and then re-add the whitespace characters, compensating the substring's indexes as I offset them by the whitespace that was originally there in the first place. But if this could all be done in the same method, that would be amazing, as the algorithm is already O(n^2).. =(
At some point, I'd like to allow only whitespace characters to split up the substring (s1)... but one thing at a time.
I don't think this algorithm is the right choice for what you want to do. The algorithm is simply calculating the distance between two strings in terms of the number of modifications you need to make to turn one string into another. If we rename your function to dlmatch for brevity and only return the distance, then we have:
dlmatch("hello there", "uh, \n\thello\n\t there"
=> 7
meaning that you can convert one string into the other in 7 steps (effectively by removing seven characters from the second). The problem is that 7 steps is a pretty big difference:
dlmatch("hello there", "panda here"
=> 6
This would actually imply that "hello there" and "panda here" are closer matches than the first example.
If what you are trying to do is "find a substring that mostly matches", I think you are stuck with an O(n^3) algorithm: feed the first string against a series of substrings of the second string, then select the substring that gives you the closest match.
Alternatively, you may be better off trying to do pre-processing on the search string and then doing regexp matching with the substring. For example, you could strip off all special characters and then build a regexp that looks for words in the substring that are case insensitive and can have any amount of whitespace between them.
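Here is a sketch of that preprocessing idea, in Python for brevity (the same approach ports directly to Ruby); fuzzy_find is a hypothetical helper that matches the words of the substring with any run of whitespace between them:
import re

def fuzzy_find(substring, search_string):
    words = substring.split()
    # allow any run of whitespace (including \n, \t) between words
    pattern = r'\s+'.join(re.escape(w) for w in words)
    m = re.search(pattern, search_string, re.IGNORECASE)
    return (m.start(), m.end()) if m else None  # end index is exclusive

print(fuzzy_find("hello there", "uh,\n\thello\n\t there"))  # (5, 18)
On the example from the question this reports start 5 and (exclusive) end 18, matching the indexes asked for.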
I have about 30 500MB files, one word per line. I have a script that does this, in pseudo-bash:
for i in *; do
echo "" > everythingButI
for j in *-except-$i; do
cat $j >> everythingButI
sort everythingButI | uniq > tmp
mv tmp everythingButI
done
comm $i everythingButI -2 -3 > uniqueInI
percentUnique=$(wc -l uniqueInI) / $(wc -l $i) * 100
echo "$i is $percentUnique% Unique"
done
It computes the 'uniqueness' of each file (the files are already sorted and unique within each file).
So if I had files:
file1  file2  file3
a      b      1
c      c      c
d      e      e
f      g
       h
file1 would be 75% unique (because 1/4 of its lines are found in another file), file2 would be 60% unique, and file3 would be 33.33% unique. But make it 30 files at 500 MB a pop, and it takes a while to run.
I'd like to write a python script that does this much, much faster, but I'm wondering what the fastest algorithm for this would actually be. (I only have 2GB of RAM on the PC also.)
Anyone have opinions about algorithms, or know of a faster way to do this?
EDIT: Since each of the inputs is already internally sorted and deduplicated, you actually need an n-way merge for this, and the hash-building exercise in the previous version of this post is rather pointless.
The n-way merge is kind of intricate if you're not careful. Basically, it works something like this:
Read in the first line of each file, and initialize its unique lines counter and total lines counter to 0.
Do this loop body:
Find the least value among the lines read.
If that value is not the same as the one from any of the other files, increment that file's unique lines counter.
For each file, if the least value equals the last value read, read in the next line and increment that file's total lines counter. If you hit end of file, you're done with that file: remove it from further consideration.
Loop until you have no files left under consideration. At that point, you should have an accurate unique lines counter and total lines counter for each file. Percentages are then a simple matter of multiplication and division.
I've left out the use of a priority queue that's in the full form of the merge algorithm; that only becomes significant if you have a large enough number of input files.
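A minimal sketch of the loop described above, assuming every input file is already sorted, internally deduplicated, and non-empty (paths are placeholders; a plain min scan stands in for the priority queue):
def uniqueness(paths):
    files = {p: open(p) for p in paths}
    current = {}                          # path -> current line, stripped
    unique = {p: 0 for p in paths}
    total = {p: 0 for p in paths}
    for p, f in files.items():           # prime with the first line of each file
        raw = f.readline()
        if raw:
            current[p] = raw.rstrip('\n')
            total[p] += 1
    while current:
        least = min(current.values())     # least value among the lines read
        holders = [p for p, v in current.items() if v == least]
        if len(holders) == 1:             # no other file shares this line
            unique[holders[0]] += 1
        for p in holders:                 # advance every file sitting at the least value
            raw = files[p].readline()
            if raw:
                current[p] = raw.rstrip('\n')
                total[p] += 1
            else:                         # end of file: drop it from consideration
                files[p].close()
                del current[p]
    return {p: 100.0 * unique[p] / total[p] for p in paths}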
Use a modified N/K-way sort algorithm that treats the entire set of compare files in a pass. Only counting and advancing needs to be done; the merging portion itself can be skipped.
This utilizes the fact that the input is already sorted. If they aren't already sorted, sort them and store them on disk sorted :-) Let the operating system file buffers and read-ahead be your friend.
Happy coding.
With a little bit of cleverness I believe this could also be extended to tell the difference in percent between all the files in a single pass. Just need to keep track of the "trailing" input and counters for each set of relationships (m-m vs. 1-m).
Spoiler code that seems to work for me on the data provided in the question...
Of course, I haven't tested this on really large files or, really, at all. "It ran". The definition of "unique" above was simpler than I was initially thinking about so some of the previous answer doesn't apply much. This code is far from perfect. Use at your own risk (of both computer blowing up and boredom/disgust for not cranking something out better!). Runs on Python 3.1.
import os
import itertools
# see: http://docs.python.org/dev/library/itertools.html#itertools-recipes
# modified for 3.x and eager lists
def partition(pred, iterable):
t1, t2 = itertools.tee(iterable)
return list(itertools.filterfalse(pred, t1)), list(filter(pred, t2))
# all files here
base = "C:/code/temp"
names = os.listdir(base)
for n in names:
print("analyzing {0}".format(n))
# {name => file}
# files are removed from here as they are exhausted
files = dict([n, open(os.path.join(base,n))] for n in names)
# {name => number of shared items in any other list}
shared_counts = {}
# {name => total items this list}
total_counts = {}
for n in names:
shared_counts[n] = 0
total_counts[n] = 0
# [name, currentvalue] -- remains mostly sorted and is
# always a very small n so sorting should be lickity-split
vals = []
for n, f in files.items():
# assumes no files are empty
vals.append([n, str.strip(f.readline())])
total_counts[n] += 1
while len(vals):
vals = sorted(vals, key=lambda x:x[1])
# if two low values are the same then the value is not-unique
# adjust the logic based on definition of unique, etc.
low_value = vals[0][1]
lows, highs = partition(lambda x: x[1] > low_value, vals)
if len(lows) > 1:
for lname, _ in lows:
shared_counts[lname] += 1
# all lowest items discarded and refetched
vals = highs
for name, _ in lows:
f = files[name]
val = f.readline()
if val != "":
vals.append([name, str.strip(val)])
total_counts[name] += 1
else:
# close files as we go. eventually we'll
# dry-up the 'vals' and quit this mess :p
f.close()
del files[name]
# and what we want...
for n in names:
unique = 1 - (shared_counts[n]/total_counts[n])
print("{0} is {1:.2%} unique!".format(n, unique))
Retrospectively I can already see the flaws! :-) The sorting of vals is there for a legacy reason that no longer really applies. In practice, just a min would work fine here (and would likely be better for any relatively small set of files).
Here's some really ugly pseudo-code that does the n-way merge:
#!/usr/bin/python
import os, commands
def findmin(linesread):
min = ""
indexes = []
for i in range(len(linesread)):
if linesread[i] != "":
min = linesread[i]
indexes.append(i)
break
for i in range(indexes[0]+1, len(linesread)):
if linesread[i] < min and linesread[i] != "":
min = linesread[i]
indexes = [i]
elif linesread[i] == min:
indexes.append(i)
return min, indexes
def genUniqueness(path):
wordlists = []
linecount = []
log = open(path + ".fastuniqueness", 'w')
for root, dirs, files in os.walk(path):
if root.find(".git") > -1 or root == ".":
continue
if root.find("onlyuppercase") > -1:
continue
for i in files:
if i.find('lvl') >= 0 or i.find('trimmed') >= 0:
wordlists.append( root + "/" + i );
linecount.append(int(commands.getoutput("cat " + root + "/" + i + " | wc -l")))
print root + "/" + i
whandles = []
linesread = []
numlines = []
uniquelines = []
for w in wordlists:
whandles.append(open(w, 'r'))
linesread.append("")
numlines.append(0)
uniquelines.append(0)
count = range(len(whandles))
for i in count:
linesread[i] = whandles[i].readline().strip()
numlines[i] += 1
while True:
(min, indexes) = findmin(linesread)
if len(indexes) == 1:
uniquelines[indexes[0]] += 1
for i in indexes:
linesread[i] = whandles[i].readline().strip()
numlines[i] += 1
if linesread[i] == "":
numlines[i] -= 1
whandles[i] = 0
print "Expiring ", wordlists[i]
if not any(linesread):
break
for i in count:
log.write(wordlists[i] + "," + str(uniquelines[i]) + "," + str(numlines[i]) + "\n")
print wordlists[i], uniquelines[i], numlines[i]