I want to know an approach to compress a string - algorithm

I want to know how to compress a string. For example, given the string ABCABCABC, I can find the substring ABC, which occurs repeatedly, so it can be compressed to 3ABC. Similarly, if the string ABCABCBC is given, ABC is the most frequently occurring substring, so it compresses to 2ABC1BC. As you can see, I am only considering adjacent (consecutive) occurrences of substrings.
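For example, a minimal greedy sketch of that idea (just to illustrate the scheme, not my real attempt):

def compress(s):
    out = []
    i = 0
    literal = ''  # pending characters with no adjacent repeat
    while i < len(s):
        best_len, best_reps = 0, 0
        # Try every block size that could repeat at least twice from i,
        # and keep the one that covers the most characters.
        for block in range(1, (len(s) - i) // 2 + 1):
            reps = 1
            while s[i + reps * block:i + (reps + 1) * block] == s[i:i + block]:
                reps += 1
            if reps >= 2 and reps * block > best_reps * best_len:
                best_len, best_reps = block, reps
        if best_len:
            if literal:
                out.append('1' + literal)
                literal = ''
            out.append(str(best_reps) + s[i:i + best_len])
            i += best_reps * best_len
        else:
            literal += s[i]
            i += 1
    if literal:
        out.append('1' + literal)
    return ''.join(out)

print(compress('ABCABCABC'))  # 3ABC
print(compress('ABCABCBC'))   # 2ABC1BC

My actual attempt so far: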

s = 'FLFLAFLAFLAF'

# Return the index of str_check in str, or 9999 if it does not occur.
def check(str_check, str):
    try:
        index = str.index(str_check)
    except Exception:
        index = 9999
    return index

# Recursively emit "<count><substring>" pieces for the best-counted
# substring inside sub_str, using the global count/sub_list tables.
def max_index(i, j, sub_str):
    if i >= 0 and j <= len(count):
        try:
            j = count.index(max(count[i:j]))
        except Exception:
            j = -1
    try:
        sub_str_index = int(sub_str.index(sub_list[j]))
    except Exception:
        sub_str_index = -1
    temp = int(count[j])
    if len(sub_str) == 0:
        return ""
    elif count[j] == 1 or int(count[j]) * len(sub_list[j]) > len(sub_str):
        for i in range(len(sub_str)):
            count[j + i] = 0
        return "1" + sub_str
    else:
        #count[j]=0
        consumed = len(sub_str[0:sub_str_index]) + (int(temp) * len(sub_list[j]))
        return (max_index(0, sub_str_index, sub_str[0:sub_str_index])
                + str(temp) + sub_list[j]
                + max_index(consumed, len(sub_str), sub_str[consumed:len(sub_str)]))

# All distinct substrings of s, sorted.
sub_list = sorted(set([s[i:i+j] for j in range(1, len(s)+1) for i in range(len(s)-j+1)]))
length = len(s)
length1 = len(sub_list)

# count[i] = number of adjacent repetitions found for sub_list[i].
count = []
for i in range(length1):
    cnt = 0
    j = 0
    while j < length:
        k = check(sub_list[i], s[j:])
        if k == 9999:
            break
        if s[j+k:j+k+len(sub_list[i])] == sub_list[i]:
            if cnt >= 1 and s[j+k-len(sub_list[i]):j+k] == sub_list[i]:
                j = j + k + 1
                cnt = cnt + 1
            elif cnt == 0:
                j = j + k + 1
                cnt = cnt + 1
            else:
                j = length
        else:
            j = length
    count.append(cnt)

print "SUBLIST"
print sub_list
print "count"
count1 = count
print count1

j = len(count)
i = 0
max_ele = max(count)
while max_ele in count:
    max_in = count.index(max(count[i:j]))
    print max_index(i, j, s)
    count = count1
    count[max_in] = 0


Tool/Algorithm for text comparison after every key hit

I am struggling to find a text comparison tool or algorithm that can compare an expected text against the current state of the text being typed.
I will have a test subject type a text that they have in front of their eyes. My idea is to compare the current state of the text against the expected text whenever something is typed. That way I want to find out when and what the subject does wrong (I also want to find errors that are not in the resulting text but were only present in the intermediate text for some time).
Can someone point me in a direction?
Update #1
I have access to the typing data in a csv format:
This is example output data of me typing "foOBar". Every line has the form (timestamp, Key, Press/Release)
17293398.576653,F,P
17293398.6885,F,R
17293399.135282,LeftShift,P
17293399.626881,LeftShift,R
17293401.313254,O,P
17293401.391732,O,R
17293401.827314,LeftShift,P
17293402.073046,O,P
17293402.184859,O,R
17293403.178612,B,P
17293403.301748,B,R
17293403.458137,LeftShift,R
17293404.966193,A,P
17293405.077869,A,R
17293405.725405,R,P
17293405.815159,R,R
In Python
Given your input csv file (I called it keyboard.csv)
17293398.576653,F,P
17293398.6885,F,R
17293399.135282,LeftShift,P
17293399.626881,LeftShift,R
17293401.313254,O,P
17293401.391732,O,R
17293401.827314,LeftShift,P
17293402.073046,O,P
17293402.184859,O,R
17293403.178612,B,P
17293403.301748,B,R
17293403.458137,LeftShift,R
17293404.966193,A,P
17293405.077869,A,R
17293405.725405,R,P
17293405.815159,R,R
The following code:
Reads the file's content and stores it in a list named steps
For each step in steps, recognizes what happened:
If it was a shift press or release, it sets a flag (shift_on) accordingly
If it was an arrow press, it moves the cursor (the index of current where we insert characters); if the cursor is at the start or at the end of the string it shouldn't move, which is why we use min() and max()
If it was a letter/number/symbol, it inserts it into current at the cursor position and increments cursor
Here you have it
import csv

steps = []  # list of all actions performed by user
expected = "Hello"

with open("keyboard.csv") as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        steps.append((float(row[0]), row[1], row[2]))

# Now we parse the information
current = []      # text written by the user
shift_on = False  # is shift pressed
cursor = 0        # where is the cursor in the current text

for step in steps:
    time, key, action = step
    if key == 'LeftShift':
        if action == 'P':
            shift_on = True
        else:
            shift_on = False
        continue
    if key == 'LeftArrow' and action == 'P':
        cursor = max(0, cursor - 1)
        continue
    if key == 'RightArrow' and action == 'P':
        cursor = min(len(current), cursor + 1)
        continue
    if action == 'P':
        if shift_on is True:
            current.insert(cursor, key.upper())
        else:
            current.insert(cursor, key.lower())
        cursor += 1
        # Now you can join current into a string
        # and compare current with expected
        print(''.join(current))  # printing current (just to see what's happening)
    else:
        # What to do when a key is released?
        # Depends on your needs...
        continue
To compare current and expected, have a look at a string-diff routine such as difflib in the standard library.
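For instance, a minimal sketch with difflib:

import difflib

def diff_report(current, expected):
    # Print the edit operations that would turn current into expected.
    sm = difflib.SequenceMatcher(None, current, expected)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            print(tag, repr(current[i1:i2]), '->', repr(expected[j1:j2]))

diff_report("foOBar", "foobar")  # replace 'OB' -> 'ob'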
Note: by playing around with the code above and a few more flags you can make it recognize also symbols. This will depend on your keyboard. In mine Shift + 6 = &, AltGr + E = € and Ctrl + Shift + AltGr + è = {. I think this is a good point to start.
Update
Comparing 2 texts isn't a difficult task and you can find tons of pages on the web about it.
Anyway I wanted to present you an object oriented approach to the problem, so I added the compare part that I previously omitted in the first solution.
This is still rough code, without any real validation of the input. But, as you asked, it points you in a direction.
class UserText:
    # Initialize UserText:
    # - empty text
    # - cursor at beginning
    # - shift off
    def __init__(self, expected):
        self.expected = expected
        self.letters = []
        self.cursor = 0
        self.shift = False

    # compares a and b and returns a
    # list containing the indices of
    # mismatches between a and b
    def compare(a, b):
        err = []
        for i in range(min(len(a), len(b))):
            if a[i] != b[i]:
                err.append(i)
        return err

    # Parse a command given in the
    # form (time, key, action)
    def parse(self, command):
        time, key, action = command
        output = ""
        if action == 'P':
            if key == 'LeftShift':
                self.shift = True
            elif key == 'LeftArrow':
                self.cursor = max(0, self.cursor - 1)
            elif key == 'RightArrow':
                self.cursor = min(len(self.letters), self.cursor + 1)
            else:
                # Else, a letter/number was pressed. Let's
                # add it to self.letters in cursor position
                if self.shift is True:
                    self.letters.insert(self.cursor, key.upper())
                else:
                    self.letters.insert(self.cursor, key.lower())
                self.cursor += 1

                ########## COMPARE WITH EXPECTED ##########
                output += "Expected: \t" + self.expected + "\n"
                output += "Current: \t" + str(self) + "\n"
                errors = UserText.compare(str(self), self.expected[:len(str(self))])
                output += "\t\t"
                i = 0
                for e in errors:
                    while i != e:
                        output += " "
                        i += 1
                    output += "^"
                    i += 1
                output += "\n[{} errors at time {}]".format(len(errors), time)
                return output
        else:
            if key == 'LeftShift':
                self.shift = False
        return output

    def __str__(self):
        return "".join(self.letters)

import csv

steps = []  # list of all actions performed by user
expected = "foobar"

with open("keyboard.csv") as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        steps.append((float(row[0]), row[1], row[2]))

# Now we parse the information
ut = UserText(expected)
for step in steps:
    print(ut.parse(step))
The output for the csv file above was:
Expected:       foobar
Current:        f
[0 errors at time 17293398.576653]
Expected:       foobar
Current:        fo
[0 errors at time 17293401.313254]
Expected:       foobar
Current:        foO
                  ^
[1 errors at time 17293402.073046]
Expected:       foobar
Current:        foOB
                  ^^
[2 errors at time 17293403.178612]
Expected:       foobar
Current:        foOBa
                  ^^
[2 errors at time 17293404.966193]
Expected:       foobar
Current:        foOBar
                  ^^
[2 errors at time 17293405.725405]
I found the solution to my own question around a year ago. Now I have time to share it with you:
In their 2003 paper 'Metrics for text entry research: An evaluation of MSD and KSPC, and a new unified error metric', R. William Soukoreff and I. Scott MacKenzie propose three major new metrics: 'total error rate', 'corrected error rate' and 'not corrected error rate'. These metrics have become well established since the publication of this paper. These are exactly the metrics I was looking for.
If you are trying to do something similar to what I did, e.g. compare the writing performance on different input devices, this is the way to go.
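For reference, those metrics reduce to simple ratios over keystroke classes. A minimal sketch of the formulas as I understand them from the paper; the hard part, classifying the keystrokes from the input stream, is left out:

def error_rates(C, INF, IF):
    # C   = correct keystrokes
    # INF = incorrect keystrokes that were not fixed
    # IF  = incorrect keystrokes that were fixed during entry
    total = float(C + INF + IF)
    return {
        'total_error_rate':         100 * (INF + IF) / total,
        'corrected_error_rate':     100 * IF / total,
        'not_corrected_error_rate': 100 * INF / total,
    }

print(error_rates(C=25, INF=2, IF=3))  # total ~16.7%, corrected 10.0%, not corrected ~6.7%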

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The bytesize of the comment is provided in the 2_bytes_of_comment_size. Just scanning for the magic number is insufficient, because those four bytes can also occur inside the comment itself. So I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be - and then look for this pattern in there.
So far, I came up with this
def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]
    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)
    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If we are - bingo.
    if unpacked[0] == eocd_signature_int && comment_size = unpacked[-1]
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # if the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end
    end_location -= 1 # Shift the window back, by one byte, and try again.
  end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... A bit at a loss here.
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end

Ruby - How to subtract numbers from two files and save the result at a specified position in one of them?

I have 2 txt files with different strings and numbers in them, split with ;
Now I need to subtract:
((number at position 2 in file1) - (number at position 25 in file2)) = result
Then I want to replace the (number at position 2 in file1) with the result.
I tried my code below, but it only appends a number at the end of the file, and what gets appended is not the result of the calculation.
def calc
  f1 = File.open("./file1.txt", File::RDWR)
  f2 = File.open("./file2.txt", File::RDWR)
  f1.flock(File::LOCK_EX)
  f2.flock(File::LOCK_EX)
  f1.each.zip(f2.each).each do |line, line2|
    bg = line.split(";").compact.collect(&:strip)
    bd = line2.split(";").compact.collect(&:strip)
    n = bd[2].to_i - bg[25].to_i
    f2.print bd[2] << n
    #puts "#{n}" Only for testing
  end
  f1.flock(File::LOCK_UN)
  f2.flock(File::LOCK_UN)
  f1.close && f2.close
end
Use something like this:
lines1 = File.readlines('file1.txt').map(&:to_i)
lines2 = File.readlines('file2.txt').map(&:to_i)
result = lines1.zip(lines2).map { |value1, value2| value1 - value2 }
File.write('file1.txt', result.join(?\n))
This code loads both files into memory, then calculates the result and writes it to the first file.
FYI: If you want to use your own code, just save the result to another file (e.g. result.txt) and at the end copy it over the original file.
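If you need exactly the per-field behaviour from the question - subtract the number at position 25 in file2 from the number at position 2 in file1 and write the result back into position 2 - a sketch of the same idea in Python (field positions taken from the question and assumed zero-based):

rows1 = [line.rstrip('\n').split(';') for line in open('file1.txt')]
rows2 = [line.rstrip('\n').split(';') for line in open('file2.txt')]

for fields1, fields2 in zip(rows1, rows2):
    # (number at position 2 in file1) - (number at position 25 in file2)
    fields1[2] = str(int(fields1[2]) - int(fields2[25]))

with open('file1.txt', 'w') as f1:
    f1.write('\n'.join(';'.join(fields) for fields in rows1))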

mysterious "'str' object is not callable" python error

I am currently making my first python effort, a modification of some code written by a friend. I am using python 2.6.6. The original piece of code, which works, extracts information from a log file of data from donations made by credit card to my nonprofit. My new version, should it one day work, will perform the same task for donations that were made by paypal. The log files are similar, but have different field names and other differences.
The error messages I'm getting are:
Traceback (most recent call last):
  File "../logparse-paypal-1.py", line 196, in <module>
    convert_log(sys.argv[1], sys.argv[2], access_ids)
  File "../logparse-paypal-1.py", line 170, in convert_log
    output = [f(record, access_ids) for f in output_fns]
TypeError: 'str' object is not callable
I've read some of the posts on this forum related to this error message, but so far I'm still at sea. I can't find any consequential differences between the portions of my code that relate to the likely problem object (access_ids) and the code that I started with. All I did with the access_ids table was remove some lines that printed out the problems the script finds with the table that cause it to ignore some data. Perhaps I changed a character or something while doing that, but I've looked and so far can't find anything.
The portion of the code that is producing these error messages is the following:
# Use the output functions configured above to convert the
# transaction record into a list of outputs to be emitted to
# the CSV output file.
print "Converting %s at %s to CSV" % (record["type"], record["time"])
output = [f(record, access_ids) for f in output_fns]
j = 0
while j < len(output):
    os.write(csv_fd, output[j])
    if j < len(output) - 1:
        os.write(csv_fd, ",")
    else:
        os.write(csv_fd, "\n")
    j += 1
convert_count += 1

print "Converted %d approved transactions to CSV format, skipped %d non-approved transactions" % (convert_count, skip_count)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
        print
        print "  INPUT_FILE       Silent post log containing transaction records (must exist)"
        print "  OUTPUT_FILE      Filename for the CSV file to be created (must not exist, will be created)"
        print "  ACCESS_IDS_FILE  List of Access IDs and email addresses (optional, must exist if specified)"
        sys.exit(-1)
    access_ids = {}
    if len(sys.argv) > 3:
        access_ids = load_access_ids(sys.argv[3])
    convert_log(sys.argv[1], sys.argv[2], access_ids)
Line 170 is this one:
output = [f(record, access_ids) for f in output_fns]
and line 196 is this one:
convert_log(sys.argv[1], sys.argv[2], access_ids)
The access_ids definition, possibly related to the problem, is this:
def access_id_fn(record, access_ids):
    if "payer_email" in record and len(record["payer_email"]) > 0:
        if record["payer_email"] in access_ids:
            return '"' + access_ids[record["payer_email"]] + '"'
        else:
            return ""
    else:
        return ""
AND
def load_access_ids(filename):
    print "Loading Access IDs from %s..." % filename
    access_ids = {}
    for line in open(filename, "r"):
        line = line.rstrip()
        access_id, email = [s.strip() for s in line.split(None, 1)]
        if not email_address.match(email):
            continue
        if email in access_ids:
            access_ids[string.strip(email)] = string.strip(access_id)
    return access_ids
Thanks in advance for any advice with this.
Dave
I'm not seeing anything right off hand, but you did mention that the log files were similar and I take that to mean that there are differences between the two.
Can you post a line from each?
I would double check the data in the log files and make sure what you think is being read in is correct. This definitely appears to me like a piece of data is being read in, but somewhere it is breaking what the code is expecting.
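For what it's worth, a classic way to get exactly this traceback is a function list that accidentally contains a function name (a string) instead of the function object. A minimal reproduction with hypothetical names:

def type_fn(record, access_ids):
    return record.get("type", "")

# Note the quotes: the name of the function, not the function itself.
output_fns = [type_fn, "access_id_fn"]

record, access_ids = {"type": "donation"}, {}
output = [f(record, access_ids) for f in output_fns]
# TypeError: 'str' object is not callable

So it is worth checking how output_fns is built in your version.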

Prevent or delete duplicates in textscraper?

I have a code that parses through text files in a folder, and saves a predefined number of words around certain search words.
For example, it looks for words such as "date" and "year". If it finds both in the same sentence it will save the sentence twice. Furthermore, if it finds the same word used a few times in a sentence, it will also save it multiple times.
This way the scraper saves an enormous amount of unnecessary duplicate text.
I see two possible solutions:
If the next search match falls within the padding (the saved group of words) of the previous one, it will not be saved.
If a group of, say, seven words around the search match is also part of the preceding group, it will not be saved, or will be deleted.
Everything I've tried has utterly failed thus far:
#helper
def indices text, index, word
  padding = 200
  bottom_i = index - padding < 0 ? 0 : index - padding
  top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
  return bottom_i, top_i
end

#script
base_text = File.open("base.txt", 'w')
Dir::mkdir("summaries") unless File.exists?("summaries")
Dir.chdir("summaries")
Dir.glob("*.txt").each do |textfile|
  whole_file = File.open(textfile, 'r').read
  puts "Currently summarizing " + textfile + "..."
  curr_i = 0
  str = nil
  whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
    if i_match = whole_file.index(match, curr_i)
      top_bottom = indices(whole_file, i_match, match)
      base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
      curr_i += i_match
    end
  end
  puts "Done summarizing " + textfile + "."
end
base_text.close
Why not do something where you keep track of what you're looking for:
search_words = %w( year date etc )
And then downcase the search string, and start an index.
def summarize(str)
  search_str = str.downcase
  ind = 0
Then find the smallest index offset of your search words in the search_str, and remove everything up to (ind + offset - delta), go up to (ind + delta) into matches, and continue in the while loop. Something like:
  matches = []
  while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
    ind += offset
    matches.push str[ind - delta, delta * 2]  # delta = half the window size, defined elsewhere
    search_str = search_str[(offset + delta)..-1]
  end
  matches
end
Preferably something better than:
whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
  if i_match = whole_file.index(match, curr_i)
    top_bottom = indices(whole_file, i_match, match)
    base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
    curr_i += i_match + 50
  end
end
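One compact way to get the no-duplicates behaviour is to skip any match that starts inside the window already saved for the previous match. In Python the idea looks roughly like this (names are illustrative):

import re

def summarize(text, words, padding=200):
    # Save padding characters around each match, but skip matches that
    # start inside the previously saved window, so nothing is stored twice.
    pattern = re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)
    snippets = []
    last_end = -1
    for m in pattern.finditer(text):
        if m.start() < last_end:
            continue  # falls inside the previous window: would duplicate text
        start = max(0, m.start() - padding)
        last_end = min(len(text), m.end() + padding)
        snippets.append(text[start:last_end])
    return snippets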
