Prevent or delete duplicates in textscraper? - ruby

I have code that parses text files in a folder and saves a predefined number of words around certain search words.
For example, it looks for words such as "date" and "year". If it finds both in the same sentence it will save the sentence twice. Furthermore, if it finds the same word used a few times in a sentence, it will also save it multiple times.
This way the scraper saves an enormous amount of unnecessary duplicate text.
I see two possible solutions:
If the next search match falls inside the padding (the surrounding group of words) of the previous match, it is not saved.
If a group of, say, seven words around a search match also appears in the preceding group, it is not saved (or it is deleted afterwards).
Everything I've tried has utterly failed thus far:
#helper
def indices text, index, word
padding = 200
bottom_i = index - padding < 0 ? 0 : index - padding
top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
return bottom_i, top_i
end
#script
base_text = File.open("base.txt", 'w')
Dir::mkdir("summaries") unless File.exist?("summaries")
Dir.chdir("summaries")
Dir.glob("*.txt").each do |textfile|
whole_file = File.open(textfile, 'r').read
puts "Currently summarizing " + textfile + "..."
curr_i = 0
str = nil
whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
if i_match = whole_file.index(match, curr_i)
top_bottom = indices(whole_file, i_match, match)
base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
curr_i += i_match
end
end
puts "Done summarizing " + textfile + "."
end
base_text.close

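Why not keep track of what you're looking for:
search_words = %w( year date etc )
Then downcase the search string and start an index. The word list and the padding (delta) are passed in as arguments, because a method defined with def cannot see outer local variables.
def summarize(str, search_words, delta = 200)
search_str = str.downcase
ind = 0 # position in str where search_str currently starts
Then find the smallest index offset of your search words in search_str, save the text from (ind + offset - delta) up to (ind + offset + delta) into matches, drop everything up to (offset + delta) from search_str, and continue in the while loop. Something like:
matches = []
# compact drops words that were not found, so min never sees nil
while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
ind += offset # absolute position of this match in str
start = [ind - delta, 0].max # clamp so a negative start does not wrap around
matches.push str[start, ind - start + delta]
# skip this match's trailing padding so hits inside it are not saved again
search_str = search_str[(offset + delta)..-1].to_s
ind += delta
end
matches
end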
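The first idea from the question (skip a match that falls inside the previous match's padding) can also be sketched by merging overlapping windows instead of skipping matches. This is only an illustration: the method name merged_snippets, its parameters and the sample text are hypothetical and not part of the original script.

```ruby
# Hypothetical helper: collects one snippet per cluster of nearby matches.
# Windows that overlap the previous window are merged into it, so a
# sentence containing both "year" and "date" is saved only once.
def merged_snippets(text, pattern, padding = 200)
  ranges = []
  text.scan(pattern) do
    m = Regexp.last_match
    from = [m.begin(0) - padding, 0].max
    to = [m.end(0) + padding, text.length].min
    if ranges.any? && from <= ranges.last[1]
      # overlaps the previous window: extend it instead of adding a new one
      ranges.last[1] = [ranges.last[1], to].max
    else
      ranges << [from, to]
    end
  end
  ranges.map { |from, to| text[from...to] }
end

text = "0123456789year0123456789date" + "-" * 40 + "year!!"
snippets = merged_snippets(text, Regexp.union("year", "date"), 6)
snippets.each { |s| puts s }
```

With a padding of 6, the first two matches are close enough to share one snippet, while the third match is far away and gets its own.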

Preferably something better than:
whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
if i_match = whole_file.index(match, curr_i)
top_bottom = indices(whole_file, i_match, match)
base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
curr_i += i_match + 50
end
end


How to read by the number at each iteration of the loop in Ruby?

How do I read one number per loop iteration? It matters that this works incrementally: rather than reading the whole line once and converting it to an array, each iteration should take one number from the line in the file and work with it. What is the right way to do this?
input.txt :
5
1 7 5 2 3
Work with 2nd line of the file.
fin = File.open("input.txt", "r")
fout = File.open("output.txt", "w")
n = fin.readline.to_i
heap_min = Heap.new(:min)
heap_max = Heap.new(:max)
for i in 1..n
a = fin.read.to_i #code here <--
heap_max.push(a)
if heap_max.size > heap_min.size
tmp = heap_max.top
heap_max.pop
heap_min.push(tmp)
end
if heap_min.size > heap_max.size
tmp = heap_min.top
heap_min.pop
heap_max.push(tmp)
end
if heap_max.size == heap_min.size
heap_max.top > heap_min.top ? median = heap_min.top : median = heap_max.top
else
median = heap_max.top
end
fout.print(median, " ")
end
If you're 100% sure that the numbers in your file are separated by spaces, you can try this:
a = fin.gets(' ', -1).to_i
Read the 2nd line of a file:
line2 = File.readlines('input.txt')[1]
Convert it to an array of integers:
array = line2.split(' ').map(&:to_i).compact
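The gets(' ') approach can be tried without touching the file system. The following is a minimal sketch that uses a StringIO as a stand-in for the opened input.txt; it assumes the numbers are separated by exactly one space, and the variable seen is only there to show what was read.

```ruby
require 'stringio'

# StringIO stands in for File.open("input.txt", "r") in this sketch
fin = StringIO.new("5\n1 7 5 2 3\n")
n = fin.readline.to_i

seen = []
n.times do
  token = fin.gets(' ')  # reads up to and including the next space
  break if token.nil?    # the line held fewer numbers than announced
  seen << token.to_i     # to_i ignores the trailing space or newline
end

p seen
```

Inside the loop, token.to_i is the value that would be pushed onto the heap in the question's code.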

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The bytesize of comment is provided in the 2_bytes_of_comment_size. Just scanning for the magic number is insufficient, because I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be, and then look for this pattern in there.
So far, I have come up with this:
def locate_eocd_signature(in_str)
# We have to scan from the _very_ tail. We read the very minimum size
# the EOCD record can have (up to and including the comment size), using
# a sliding window. Once our end offset matches the comment size we found our
# EOCD marker.
eocd_signature_int = 0x06054b50
unpack_pattern = 'VvvvvVVv'
minimum_record_size = 22
end_location = minimum_record_size * -1
loop do
# If the window is nil, we have rolled off the start of the string, nothing to do here.
# We use negative values because if we used positive slice indices
# we would have to detect the rollover ourselves
break unless window = in_str[end_location, minimum_record_size]
window_location = in_str.bytesize + end_location
unpacked = window.unpack(unpack_pattern)
# If we found the signature, pick up the comment size, and check if the size of the window
# plus that comment size is where we are in the string. If we are - bingo.
if unpacked[0] == eocd_signature_int && (comment_size = unpacked[-1])
assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
# if the comment size is where we should be at - we found our EOCD
return assumed_eocd_location if assumed_eocd_location == window_location
end
end_location -= 1 # Shift the window back, by one byte, and try again.
end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... I'm a bit at a loss here.
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
last_i = 0
found_at_indices = []
while last_i = in_string.index(of_substring, last_i)
found_at_indices << last_i
last_i += of_substring.bytesize
end
found_at_indices
end
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
eocd_signature = 0x06054b50
eocd_signature_str = [eocd_signature].pack('V')
unpack_pattern = 'VvvvvVVv'
minimum_record_size = 22
str_size = in_str.bytesize
indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
indices.each do |check_at|
maybe_record = in_str[check_at..str_size]
# If the record is smaller than the minimum - we will never recover anything
break if maybe_record.bytesize < minimum_record_size
# Now we check if the record ends with the combination
# of the comment size and an arbitrary byte string of that size.
# If it does - we found our match
*_unused, comment_size = maybe_record.unpack(unpack_pattern)
if (maybe_record.bytesize - minimum_record_size) == comment_size
return check_at # Found the EOCD marker location
end
end
# If we haven't caught anything, return nil deliberately instead of returning the last statement
nil
end
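As a variation on the same idea, the search can also start from the tail with String#rindex, so the last candidate is checked first. This is a sketch under assumptions, not the library's code: the name locate_eocd_from_tail and the constants are hypothetical, and the buffer is treated as binary so that string indices are byte offsets.

```ruby
EOCD_SIGNATURE = [0x06054b50].pack('V') # the 4-byte magic number
MIN_EOCD_SIZE = 22                      # record size without the comment

# Hypothetical helper: returns the byte offset of the EOCD record or nil.
def locate_eocd_from_tail(in_str)
  buf = in_str.b # binary copy, so indices are byte offsets
  search_from = buf.bytesize - MIN_EOCD_SIZE
  while search_from >= 0 && (at = buf.rindex(EOCD_SIGNATURE, search_from))
    # the comment size is the last 2 bytes of the fixed part of the record
    comment_size = buf.byteslice(at + 20, 2).unpack1('v')
    return at if at + MIN_EOCD_SIZE + comment_size == buf.bytesize
    search_from = at - 1 # decoy signature, keep looking further back
  end
  nil
end

# A fake tail whose comment itself contains the signature as a decoy
comment = EOCD_SIGNATURE + "z" * 30
eocd = [0x06054b50, 0, 0, 0, 0, 0, 0, comment.bytesize].pack('VvvvvVVv') + comment
buf = "junk".b + eocd
location = locate_eocd_from_tail(buf)
```

The decoy inside the comment is rejected because the comment size read at that position does not line up with the end of the buffer.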

Ruby - How to subtract numbers of two files and save the result in one of them on a specified position?

I have 2 txt files containing different strings and numbers, separated by ;
Now I need to subtract:
((number on position 2 in file1) - (number on position 25 in file2)) = result
Then I want to replace the (number on position 2 in file1) with the result.
I tried the code below, but it only appends a number at the end of the file, and the appended number is not the result of the calculation.
def calc
f1 = File.open("./file1.txt", File::RDWR)
f2 = File.open("./file2.txt", File::RDWR)
f1.flock(File::LOCK_EX)
f2.flock(File::LOCK_EX)
f1.each.zip(f2.each).each do |line, line2|
bg = line.split(";").compact.collect(&:strip)
bd = line2.split(";").compact.collect(&:strip)
n = bd[2].to_i - bg[25].to_i
f2.print bd[2] << n
#puts "#{n}" Only for testing
end
f1.flock(File::LOCK_UN)
f2.flock(File::LOCK_UN)
f1.close && f2.close
end
Use something like this:
lines1 = File.readlines('file1.txt').map(&:to_i)
lines2 = File.readlines('file2.txt').map(&:to_i)
result = lines1.zip(lines2).map { |value1, value2| value1 - value2 }
File.write('file1.txt', result.join(?\n))
This code loads both files into memory, then calculates the result and writes it to the first file. Note that it treats each line as a single number.
FYI: If you want to keep your own code, write the result to another file (e.g. result.txt) and copy it over the original file at the end.
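For lines that hold several ;-separated fields, the same read-everything-then-write approach could look roughly like the sketch below. The helper subtract_fields is hypothetical, the field positions are assumed to be zero-based array indices as in the question's code, and the sample lines are made up for illustration.

```ruby
# Hypothetical helper: for each pair of lines, subtracts field idx2 of the
# second file's line from field idx1 of the first file's line and returns
# the updated lines of the first file (without trailing newlines).
def subtract_fields(lines1, lines2, idx1, idx2, sep = ';')
  lines1.zip(lines2).map do |line1, line2|
    next line1.chomp if line2.nil? # no partner line: keep it unchanged
    fields1 = line1.chomp.split(sep, -1)
    fields2 = line2.chomp.split(sep, -1)
    fields1[idx1] = (fields1[idx1].to_i - fields2[idx2].to_i).to_s
    fields1.join(sep)
  end
end

file1_lines = ["a;b;10;c\n", "d;e;7;f\n"] # e.g. File.readlines('file1.txt')
file2_lines = ["x;3\n", "y;10\n"]         # e.g. File.readlines('file2.txt')
updated = subtract_fields(file1_lines, file2_lines, 2, 1)
puts updated
# To save: File.write('file1.txt', updated.join("\n") + "\n")
```

For the files in the question, the call would use 2 and 25 as the two indices, assuming those positions are counted from zero.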

Join array of strings into 1 or more strings each within a certain char limit (+ prepend and append texts)

Let's say I have an array of Twitter account names:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
And a prepend and append variable:
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
How can I turn this into an array of as few strings as possible each with a maximum length of 140 characters, starting with the prepend text, ending with the append text, and in between the Twitter account names all starting with an #-sign and separated with a space. Like this:
tweets = ['Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday', 'Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday', 'Check out these cool people: #example18 #example19 #example20 #FollowFriday']
(The order of the accounts isn't important so theoretically you could try and find the best order to make the most use of the available space, but that's not required.)
Any suggestions? I'm thinking I should use the scan method, but haven't figured out the right way yet.
It's pretty easy using a bunch of loops, but I'm guessing that won't be necessary when using the right Ruby methods. Here's what I came up with so far:
# Create one long string of #usernames separated by a space
tmp = twitter_accounts.map!{|a| a.insert(0, '#')}.join(' ')
# alternative: tmp = '#' + twitter_accounts.join(' #')
# Number of characters left for mentioning the Twitter accounts
length = 140 - (prepend + append).length
# This method would split a string into multiple strings
# each with a maximum length of 'length' and it will only split on empty spaces (' ')
# ideally strip that space as well (although .map(&:strip) could be used too)
tweets = tmp.some_method(' ', length)
# Prepend and append
tweets.map!{|t| prepend + t + append}
P.S.
If anyone has a suggestion for a better title let me know. I had a difficult time summarizing my question.
The String rindex method has an optional parameter where you can specify where to start searching backwards in a string:
arr = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
str = arr.map{|name|"##{name}"}.join(' ')
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
max_chars = 140 - prepend.size - append.size
until str.size <= max_chars do
p str.slice!(0, str.rindex(" ", max_chars))
str.lstrip! #get rid of the leading space
end
p str unless str.empty?
I'd make use of reduce for this:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
prepend = 'Check out these cool people:'
append = '#FollowFriday'
# Extra -1 is for the space before `append`
max_content_length = 140 - prepend.length - append.length - 1
content_strings = string.reduce([""]) { |result, target|
result.push("") if result[-1].length + target.length + 2 > max_content_length
result[-1] += " ##{target}"
result
}
tweets = content_strings.map { |s| "#{prepend}#{s} #{append}" }
Which would yield:
"Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday"
"Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday"
"Check out these cool people: #example18 #example19 #example20 #FollowFriday"
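Both answers can be wrapped into one reusable method. The following is a sketch only: build_tweets and its limit parameter are hypothetical names, and it assumes that prepend and append already carry their own separating spaces, as in the question.

```ruby
# Hypothetical helper: greedily packs "#name" tags into as few tweets as
# fit within limit characters, each wrapped in prepend and append.
def build_tweets(names, prepend, append, limit = 140)
  room = limit - prepend.length - append.length
  groups = names.each_with_object([[]]) do |name, acc|
    tag = "##{name}"
    candidate = (acc.last + [tag]).join(' ')
    if candidate.length > room && !acc.last.empty?
      acc << [tag] # current tweet is full, start a new one
    else
      acc.last << tag
    end
  end
  groups.reject(&:empty?).map { |tags| prepend + tags.join(' ') + append }
end

names = (1..20).map { |i| "example#{i}" }
tweets = build_tweets(names, 'Check out these cool people: ', ' #FollowFriday')
tweets.each { |t| puts t }
```

A single name that is longer than the available room still gets a tweet of its own, which would then exceed the limit; handling that case is left out of the sketch.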

Radix sort not working in Lua

Firstly I should mention I've not been coding very long at all, although that much is probably obvious from my code :P
I'm having two problems. Firstly, the sort isn't functioning correctly, although it does sort the numbers by their length. Any help here would be appreciated.
Secondly, it changes both the table it receives and the table it returns (not sure why). How do I prevent it from changing the table it receives?
I'd prefer it if people didn't post fully optimised, ready-made code, as I'm not going to learn or understand anything that way.
function radix_sort(x)
pass, bucket, maxstring = 0, x, 2
while true do
pass = pass + 1
queue = {}
for n=#bucket,1,-1 do
key_length = string.len(bucket[n])
key = bucket[n]
if pass == 1 and key_length > maxstring then
maxstring = key_length
end
if key_length == pass then
pool = string.sub(key, 1,1)
if queue[pool + 1] == nil then
queue[pool + 1] = {}
end
table.insert(queue[pool + 1], key)
table.remove(bucket, n)
end
end
for k,v in pairs(queue) do
for n=1,#v do
table.insert(bucket, v[n])
end
end
if pass == maxstring then
break
end
end
return bucket
end
I made a lot of changes to get this working, so hopefully you can look through and pick up on them. I tried to comment as best I could.
function radix_sort(x)
local pass, maxstring = 0, 0
-- to avoid overwriting x, copy into bucket like this
-- it also gives the chance to init maxstring
local bucket={}
for n=1,#x,1 do
-- since we can, convert all entries to strings for string functions below
bucket[n]=tostring(x[n])
key_length = string.len(bucket[n])
if key_length > maxstring then
maxstring = key_length
end
end
-- not a fan of "while true ... break" when we can set a condition here
-- pass is incremented inside the loop, so "<" gives exactly maxstring passes
while pass < maxstring do
pass = pass + 1
-- init both queue and all queue entries so ipairs doesn't skip anything below
queue = {}
for n=1,10,1 do
queue[n] = {}
end
-- go through bucket entries in order for an LSD radix sort
for n=1,#bucket,1 do
key_length = string.len(bucket[n])
key = bucket[n]
-- for string.sub, start at end of string (LSD sort) with -pass
if key_length >= pass then
pool = tonumber(string.sub(key, pass*-1, pass*-1))
else
pool = 0
end
-- add to appropriate queue, but no need to remove from bucket, reset it below
table.insert(queue[pool + 1], key)
end
-- empty out the bucket and reset, use ipairs to call queues in order
bucket={}
for k,v in ipairs(queue) do
for n=1,#v do
table.insert(bucket, v[n])
end
end
end
return bucket
end
Here's a test run:
> input={55,2,123,1,42,9999,6,666,999,543,13}
> output=radix_sort(input)
> for k,v in pairs(output) do
> print (k , " = " , v)
> end
1 = 1
2 = 2
3 = 6
4 = 13
5 = 42
6 = 55
7 = 123
8 = 543
9 = 666
10 = 999
11 = 9999
pool = string.sub(key, 1,1)
always looks at the first character; perhaps you meant string.sub(key, pass, pass), since the second and third arguments are the start and end positions.
