Join array of strings into 1 or more strings each within a certain char limit (+ prepend and append texts) - ruby

Let's say I have an array of Twitter account names:
twitter_accounts = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
And a prepend and append variable:
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
How can I turn this into an array of as few strings as possible, each with a maximum length of 140 characters, starting with the prepend text, ending with the append text, and with the Twitter account names in between, each prefixed with a #-sign and separated by spaces? Like this:
tweets = ['Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday', 'Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday', 'Check out these cool people: #example18 #example19 #example20 #FollowFriday']
(The order of the accounts isn't important so theoretically you could try and find the best order to make the most use of the available space, but that's not required.)
Any suggestions? I'm thinking I should use the scan method, but haven't figured out the right way yet.
It's pretty easy using a bunch of loops, but I'm guessing that won't be necessary when using the right Ruby methods. Here's what I came up with so far:
# Create one long string of #usernames separated by a space
tmp = twitter_accounts.map!{|a| a.insert(0, '#')}.join(' ')
# alternative: tmp = '#' + twitter_accounts.join(' #')
# Number of characters left for mentioning the Twitter accounts
length = 140 - (prepend + append).length
# This (hypothetical) method would split a string into multiple strings,
# each with a maximum length of `length`, splitting only on spaces (' ')
# and ideally stripping that space as well (although .map(&:strip) could be used too)
tweets = tmp.some_method(' ', length)
# Prepend and append
tweets.map!{|t| prepend + t + append}
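(For what it's worth, scan can stand in for some_method here - a sketch, assuming each #username itself fits within length; the map! above then adds the prepend/append texts:)
# Greedy, length-limited match that has to end at a space or end-of-string;
# anchoring on \S also strips the separating space. The 'length - 1' accounts
# for the one character \S already consumed:
tweets = tmp.scan(/\S.{0,#{length - 1}}(?= |\z)/)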
P.S.
If anyone has a suggestion for a better title let me know. I had a difficult time summarizing my question.

The String rindex method has an optional parameter where you can specify where to start searching backwards in a string:
arr = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
str = arr.map{|name|"##{name}"}.join(' ')
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
max_chars = 140 - prepend.size - append.size
until str.size <= max_chars do
  p str.slice!(0, str.rindex(" ", max_chars))
  str.lstrip! # get rid of the leading space
end
p str unless str.empty?
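To turn those printed chunks into finished tweets, the same loop can collect each slice and wrap it with the prepend/append text - a minimal sketch building on the variables above:
tweets = []
until str.size <= max_chars do
  tweets << prepend + str.slice!(0, str.rindex(" ", max_chars)) + append
  str.lstrip! # drop the space the slice stopped at
end
tweets << prepend + str + append unless str.empty?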

I'd make use of reduce for this:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
prepend = 'Check out these cool people:'
append = '#FollowFriday'
# Extra -1 is for the space before `append`
max_content_length = 140 - prepend.length - append.length - 1
content_strings = string.reduce([""]) { |result, target|
  result.push("") if result[-1].length + target.length + 2 > max_content_length
  result[-1] += " ##{target}"
  result
}
tweets = content_strings.map { |s| "#{prepend}#{s} #{append}" }
Which would yield:
"Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday"
"Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday"
"Check out these cool people: #example18 #example19 #example20 #FollowFriday"

Related

How to get a block at an offset in the IO.foreach loop in ruby?

I'm using the IO.foreach loop to find a string using regular expressions. I want to append the next block (next line) to the file_names list. How can I do that?
file_names = [""]
IO.foreach("a.txt") { |block|
  if block =~ /^file_names*/
    dir = # get the next block
    file_names.append(dir)
  end
}
Actually my input looks like this:
file_names[174]:
 name: "vector"
 dir_index: 1
 mod_time: 0x00000000
 length: 0x00000000
file_names[175]:
 name: "stl_bvector.h"
 dir_index: 2
 mod_time: 0x00000000
 length: 0x00000000
I have a list of file_names, and I want to capture the name, dir_index, mod_time and length properties of each entry and put them into the file_names array at the index given by the file_names index in the text.
You can use #each_cons to get the values of the next 4 rows from the text file:
files = IO.foreach("text.txt").each_cons(5).with_object([]) do |block, o|
  if block[0] =~ /file_names.*/
    o << block[1..4].map{ |e| e.split(':')[1] }
  end
end
puts files
#=> "vector"
# 1
# 0x00000000
# 0x00000000
# "stl_bvector.h"
# 2
# 0x00000000
# 0x00000000
Keep in mind that the files array contains subarrays of 4 elements. If the : symbol occurs later in the lines, you could replace the third line of my code with this:
o << block[1..4].map{ |e| e.partition(':').last.strip}
I also added #strip in case you want to remove the whitespace around the values. With this line changed, the actual array will look something like this:
p files
#=>[["\"vector\"", "1", "0x00000000", "0x00000000"], ["\"stl_bvector.h\"", "2", "0x00000000", "0x00000000"]]
(the values don't contain the \ escape character, that's just the way #p shows it).
Another option: if you know the pattern (1 filename line followed by 4 value lines) persists through the entire text file and the file always starts with a filename, you can replace #each_cons with #each_slice and drop the regex completely, which will also speed up the entire process:
IO.foreach("text.txt").each_slice(5).with_object([]) do |block, o|
  o << block[1..4].map{ |e| e.partition(':').last.strip }
end
It's actually pretty easy to carve up a series of lines based on a pattern using slice_before:
File.readlines("data.txt").slice_before(/\Afile_names/)
Now you have an array of arrays that looks like:
[
  [
    "file_names[174]:\n",
    " name: \"vector\"\n",
    " dir_index: 1\n",
    " mod_time: 0x00000000\n",
    " length: 0x00000000\n"
  ],
  [
    "file_names[175]:\n",
    " name: \"stl_bvector.h\"\n",
    " dir_index: 2\n",
    " mod_time: 0x00000000\n",
    " length: 0x00000000"
  ]
]
Each of these groups could be transformed further, like for example into a Ruby Hash using those keys.
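For instance, a minimal sketch (assuming the field lines keep the key: value shape shown above) that maps each group to a Hash:
records = File.readlines("data.txt").slice_before(/\Afile_names/).map do |group|
  _header, *fields = group
  # Split each " key: value" line into a pair and collect the pairs into a Hash
  fields.map { |line| line.strip.split(': ', 2) }.to_h
end
records.first #=> {"name"=>"\"vector\"", "dir_index"=>"1", "mod_time"=>"0x00000000", "length"=>"0x00000000"}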

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The bytesize of the comment is provided in the 2_bytes_of_comment_size field. Just scanning for the magic number is insufficient, since those four bytes can also occur inside the comment itself. So I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be - and then look for this pattern in there.
So far, I came up with this:
def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]
    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)
    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If it is - bingo.
    if unpacked[0] == eocd_signature_int && comment_size = unpacked[-1]
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # If the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end
    end_location -= 1 # Shift the window back by one byte, and try again.
  end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... A bit at a loss here.
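(As an aside: unpack does have a "rest of the string" specifier - 'a*' grabs all remaining bytes as one binary string. A sketch, where data is a hypothetical buffer starting at a candidate signature, not a variable from the code above:)
# 'a*' consumes everything left after the fixed fields, so the comment can be
# captured in the same unpack call ('data' is hypothetical):
sig, *_mid, comment_size, comment = data.unpack('VvvvvVVva*')
# A real EOCD record is one where the declared and actual comment sizes agree
found = sig == 0x06054b50 && comment.bytesize == comment_size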
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
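For example, running it against a small string shows the behavior:
all_indices_of_substr_in_str("ab", "ab-ab-ab") #=> [0, 3, 6]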
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end

Reversing and Splitting in Python

I have a file "names.txt". The contents are
"Smith,RobJones,MikeJane,SallyPetel,Brian"
and I want to read "names.txt" and make a new file "names2.txt" that looks like:
"Rob Smith Mike Jones Sally Jane Brian Petel"
I know I should be using .rstrip('\n') and .split(',')
So far I have:
namesfile = input('Enter name of file: ') #open names.txt
openfile = open(namesfile, 'r')
This will do exactly that. You might be able to polish it and make it more elegant, and I encourage you to do so:
import re

with open('names.txt') as f:
    # Split the names
    names = re.sub(r'([A-Z])(?![A-Z])', r',\1', f.read()).split(',')
    # Filter empty results
    names = [n for n in names if n != '']
    # Swap pairs with each other
    for i in range(len(names)):
        if (i + 1) % 2 == 0:
            names[i], names[i-1] = names[i-1], names[i]
    print(' '.join(names))

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it onto the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time thing it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
    print(current_dt)
    print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use Python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
  if(d==1) print "<DT>"s  # for printing lines with no tags
  s=substr($0,5); tags="" # Copying the line after "<DT>". You'll know why
  d=1
}
/^<DD>/{
  d=0
  m=match(s,/>/) # Find the end of the HREF descriptor (first match of ">")
  for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
  td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
  if(td==0){ # No description exists
    tags=substr(tags,2)
    tagdes=""
  }
  else{ # Description exists
    tagdes=substr(tags,td)
    tags=substr(tags,2,td-2)
  }
  print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
  print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT:
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

Prevent or delete duplicates in textscraper?

I have a code that parses through text files in a folder, and saves a predefined number of words around certain search words.
For example, it looks for words such as "date" and "year". If it finds both in the same sentence it will save the sentence twice. Furthermore, if it finds the same word used a few times in a sentence, it will also save it multiple times.
This way the scraper saves an enormous amount of unnecessary duplicate text.
I see two possible solutions:
If the next search-match is in the padding, in the group of words, of the previous one, it will not be saved.
If a group of, say, seven words around the search match is also part of the preceding group, it will not be saved (or will be deleted).
Everything I've tried has utterly failed thus far:
# helper
def indices(text, index, word)
  padding = 200
  bottom_i = index - padding < 0 ? 0 : index - padding
  top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
  return bottom_i, top_i
end

# script
base_text = File.open("base.txt", 'w')
Dir::mkdir("summaries") unless File.exists?("summaries")
Dir.chdir("summaries")
Dir.glob("*.txt").each do |textfile|
  whole_file = File.open(textfile, 'r').read
  puts "Currently summarizing " + textfile + "..."
  curr_i = 0
  str = nil
  whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
    if i_match = whole_file.index(match, curr_i)
      top_bottom = indices(whole_file, i_match, match)
      base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
      curr_i += i_match
    end
  end
  puts "Done summarizing " + textfile + "."
end
base_text.close
Why not do something where you keep track of what you're looking for:
search_words = %w( year date etc )
And then downcase the search string, and start an index.
# search_words and delta are parameters here, since a method body can't see outer locals
def summarize(str, search_words, delta = 100)
  search_str = str.downcase
  ind = 0
Then find the smallest index (offset) of your search words in search_str, push the 2*delta-character window around the match into matches, chop the consumed part off search_str, and continue the while loop. Something like:
  matches = []
  while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
    ind += offset
    matches.push str[[ind - delta, 0].max, delta * 2]
    # Chop off what we've consumed and keep ind aligned with the original string
    search_str = search_str[(offset + delta)..-1].to_s
    ind += delta
  end
  matches
end
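A hypothetical call (the filename and search words are invented for illustration):
matches = summarize(File.read("summaries/report.txt"), %w[year date])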
Preferably something better than:
whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
  if i_match = whole_file.index(match, curr_i)
    top_bottom = indices(whole_file, i_match, match)
    base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
    curr_i += i_match + 50
  end
end
