Ruby Truncate Words + Long Text

I have the following function, which accepts text and a word count; if the number of words in the text exceeds the word count, the text gets truncated with an ellipsis.
# Truncate the passed text. Used for headlines and such
def snippet(thought, wordcount)
  thought.split[0..(wordcount - 1)].join(" ") + (thought.split.size > wordcount ? "..." : "")
end
However, what this function doesn't take into account is extremely long words, for instance:
"Helloooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
world!"
I was wondering whether there's a better way to approach this so that it takes both the word count and the overall text length into consideration in an efficient way.

Is this a Rails project?
Why not use the following helper:
truncate("Once upon a time in a world far far away", :length => 17)
If not, just reuse the code.
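If you only need the behaviour and not the view helper, ActiveSupport's String#truncate does the same thing and also takes a :separator option so the cut lands on a word boundary instead of mid-word. A quick sketch, assuming ActiveSupport is available:
require 'active_support/core_ext/string/filters'

"Once upon a time in a world far far away".truncate(27)
# => "Once upon a time in a wo..."
"Once upon a time in a world far far away".truncate(27, separator: ' ')
# => "Once upon a time in a..."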

This is probably a two-step process:
1. Truncate the string to a max length (no need for a regex for this).
2. Using a regex, take at most the desired number of words from the truncated string.
Edit:
Another approach is to split the string into words and loop through the array adding up the lengths. When you find the overrun, join elements 0 .. the index just before the overrun (see the sketch after the hint below).

Hint: the regex ^(\s*.+?\b){5} will match the first 5 "words".
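A minimal sketch of that accumulate-the-lengths idea (char_limited_words is just an illustrative name; the ellipsis handling is an assumption):
# Collect words until adding the next one (plus a joining space) would
# exceed max_chars, stopping early once max_words have been taken.
def char_limited_words(text, max_words, max_chars)
  words = text.split
  taken = []
  length = 0
  words.each do |word|
    break if taken.size == max_words
    extra = taken.empty? ? word.length : word.length + 1
    break if length + extra > max_chars
    length += extra
    taken << word
  end
  taken.join(' ') + (taken.size < words.size ? '...' : '')
end

char_limited_words("hello world this is the world", 5, 20)
# => "hello world this is..."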

The logic for checking both word and char limits becomes too convoluted to clearly express as one expression. I would suggest something like this:
def snippet(str, max_words, max_chars, omission = '...')
  # need at least one char plus the ellipsis
  max_chars = 1 + omission.size if max_chars <= omission.size
  words = str.split
  omit = words.size > max_words || str.length > max_chars ? omission : ''
  snip = words[0...max_words].join(' ')
  snip = snip[0...(max_chars - omission.size)] if snip.length > max_chars
  snip + omit
end
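For example (a quick check of the method above, with the default '...' omission):
snippet("hello world this is the world", 3, 100)
# => "hello world this..."
snippet("Supercalifragilisticexpialidocious world", 5, 20)
# => "Supercalifragilis..."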
As others have pointed out, Rails' String#truncate offers almost the functionality you want (truncate to fit a length at a natural boundary), but it doesn't let you independently specify a max character length and a word count.

First 20 characters:
>> "hello world this is the world".gsub(/.+/) { |m| m[0, 20] + (m.size > 20 ? '...' : '') }
=> "hello world this is ..."
First 5 words:
>> "hello world this is the world".gsub(/.+/) { |m| m.split[0...5].join(' ') + (m.split.size > 5 ? '...' : '') }
=> "hello world this is the..."

Related

How do I get Ruby to search for a pattern on the tail of a local file?

Say I have a file blah.rb which is constantly written to somehow and has patterns like :
bagtagrag" " hellobello " blah0 blah1 " trag kljesgjpgeagiafw blah2 " gneo" whatttjtjtbvnblah3
Basically, it's garbage. But I want to check for the blah that keeps coming up and find the latest value, i.e. the number following the blah.
Hence, something like:
grep "blah"{$1} | tail var/test/log
My file is at location var/test/log and, as you can see, I need to get the number following the blah.
def get_last_blah(filename)
  # code to get the number after the last blah in the tail of the file
end
def display_the_last_blah()
puts get_last_blah("var/test/log")
end
Now, I could just keep reading the file and performing something akin to a string pattern search on the entire file again and again. Having obtained the last occurrence, I can then get the number. But what if I only want to look at the newly added text at the tail and not the entire text?
Moreover, is there a quick one-liner or smart command to get this?
Use open to read the file and Enumerable#grep to search for the desired text with a regular expression, as the following code does:
def get_last_blah(filename)
  open(filename) { |f| f.grep(/.*blah(\d).*$/) { $1 }.last.to_i }
end
puts get_last_blah('var/test/log')
# => 3
The method returns the number following the last "blah" in the file. It reads the entire file, but the result is the same as if it were done with tail.
If you want to use a proper tail, take a look at the File::Tail gem.
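A rough sketch of what that could look like, assuming the file-tail gem's documented File::Tail module (installed with gem install file-tail):
require 'file/tail'

File.open('var/test/log') do |log|
  log.extend(File::Tail)   # mix in the tailing behaviour from the file-tail gem
  log.backward(10)         # start roughly 10 lines from the end
  log.tail do |line|       # blocks, yielding each new line as it is appended
    puts $1 if line =~ /blah(\d+)/
  end
end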
I presume you wish to avoid reading the entire file each time; rather, you want to start at the end and work backward until you find the last string of interest. Here's a way to do that.
Code
BLOCK_SIZE = 30
MAX_BLAH_NBR = 123

def doit(fname, blah_text)
  @f = File.new(fname)
  @blah_text = blah_text
  @chars_to_read = BLOCK_SIZE + @blah_text.size + MAX_BLAH_NBR.to_s.size
  ptr = @f.size
  block_size = BLOCK_SIZE
  loop do
    return nil if ptr.zero?
    ptr -= block_size
    if ptr < 0
      block_size += ptr
      ptr = 0
    end
    blah_nbr = read_block(ptr)
    (@f.close; return blah_nbr.to_i) if blah_nbr
  end
end

def read_block(ptr)
  @f.seek(ptr)
  @f.read(@chars_to_read)[/.*#{@blah_text}(\d+)/, 1]
end
Demo
Let's first write something interesting to a file.
MY_FILE = 'my_file.txt'
text =<<_
Now is the time
for all blah2 to
come to the aid of
their blah3, blah4 enemy or
perhaps do blagh5 something
else like wash the dishes.
_
File.write(MY_FILE, text)
Now run the program:
p doit(MY_FILE, "blah") #=> 4
We expected it to return 4 and it did.
Explanation
doit first instructs read_block to read up to 37 characters, beginning BLOCK_SIZE (30) characters from the end of the file. That's at the beginning of the string
"ng\nelse like wash the dishes.\n"
which is 30 characters long. (I'll explain the "37" in a moment.) read_block finds no text matching the regex (like "blah3"), so returns nil.
As nil was returned, doit makes the same request of read_block, but this time starting BLOCK_SIZE characters closer to the beginning of the file. This time read_block reads the 37-character string:
"y or\nperhaps do blagh5 something\nelse"
but, again, it does not match the regex, so it returns nil to doit. Notice that it re-read the seven characters, "ng\nelse", that it read previously. This overlap is necessary in case one 30-character block ended with "...bla" and the next one began with "h3...". Hence the need to read more characters (here 37) than the block size.
read_block next reads the string:
"aid of\ntheir blah3, blah4 enemy or\npe"
and finds that "blah4" matches the regex (not "blah3", because the regex is being "greedy" with .*), so it returns "4" to doit, which converts that to the number 4, which it returns.
doit would return nil if the regex did not match any text in the file.

How to slice a string into chunks of less than 500 characters, but not cut within < > brackets?

I would like to split a string into chunks of less than 500 characters each and put them into an array. This is easily solved.
My problem is that the string contains html code and the split should happen outside of the < > brackets. Anyone knows how to do it?
That's what I currently got.
while article.length > 0 do
  textarr << article[0, 499]
  article[0, 499] = ""
end
Can someone tell me how to check that the split does not cut into the html code?
Thanks
textarr = article.scan(/.{1,500}(?![^<>]*>)/m)
splits your string into chunks of up to 500 characters (as many as possible), reducing the size of a chunk when necessary so that it never ends in the middle of a <...> tag (that is, never at a point where the next angle bracket in the string would be a closing >).
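A quick irb-style check with a much smaller limit (10 instead of 500), just to illustrate that no chunk ends inside a tag:
article = "Hello <b>world</b> again"
article.scan(/.{1,10}(?![^<>]*>)/m)
# => ["Hello <b>w", "orld</b> a", "gain"]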
Assuming you have well-formed HTML (no unencoded < outside tags), you can check if you have cut a tag by finding an unmatched <. You could use a regex:
"Lorem <b>ipsum</b" =~ /<[^>]*\Z/
# => 14
"Lorem <b>ipsum</b>" =~ /<[^>]*\Z/
# => nil
In order to modify your splitting so that it doesn't cut tags you could use this regex to take variable-length chunks (noting that =~ returns the index at which the match occurs, or nil if there is no match):
def chunk_length(chunk)
  chunk =~ /<[^>]*\Z/ || chunk.length
end

textarr = []
start = 0
while start < article.length
  length = chunk_length(article[start, 499])
  # probably should check for length == 0 here in case you get a really long tag!
  textarr << article[start, length]
  start += length
end
The check for length == 0 might be necessary if you have very long tags; suppose you have something pathological like
<div class="lots of classes" style="some: 'raw css';" data-attribute="more stuff" ...
that could be longer than 500 characters on its own. Then, you will get to the point where article[start, 499] begins with < but does not contain the closing >, so that =~ returns 0 (because it matches at the start of the string) and you will get trapped in an infinite loop.
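One possible guard (my own sketch, not part of the original loop): fall back to a hard cut when a chunk starts with a tag longer than the chunk itself:
while start < article.length
  length = chunk_length(article[start, 499])
  length = 499 if length.zero?   # pathological tag longer than the chunk: cut it anyway
  textarr << article[start, length]
  start += length
end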

How does this Ruby app know to select the middle third of sentences?

I am currently following Beginning Ruby by Peter Cooper and have put together my first app, a text analyzer. However, whilst I understand all of the concepts and the way in which they work, I can't for the life of me understand how the app knows to select the middle third of sentences sorted by length from this line:
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
I have included the whole app for context. Any help is much appreciated, as so far everything else is making sense.
#analyzer.rb --Text Analyzer
stopwords = %w{the a by on for of are with just but and to the my I has some in do}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join
#Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
#Count the words, sentances, and paragraphs
word_count = text.split.length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
#Make a list of words in the text that aren't stop words,
#count them, and work out the percentage of non-stop words
#against all words
all_words = text.scan(/\w+/)
good_words = all_words.select {|word| !stopwords.include?(word)}
good_percentage = ((good_words.length.to_f / all_words.length.to_f)*100).to_i
#Summarize the text by cherry picking some choice sentances
sentances = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentances_sorted = sentences.sort_by { |sentence| sentance.length }
one_third = sentences_sorted.length / 3
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentances = ideal_sentences.select{ |sentence| sentence =~ /is|are/ }
#Give analysis back to user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis."
Obviously I am a beginner so plain English would help enormously, cheers.
It gets a third of the number of sentences with one_third = sentences_sorted.length / 3; then the line you posted, ideal_sentances = sentences_sorted.slice(one_third, one_third + 1), says "grab a slice of the sentence list starting at the index one third of the way in, and continue for one third of the length plus one".
Make sense?
The slice method, if you look it up in the Ruby API, says this:
If passed two Fixnum objects, returns a substring starting at the
offset given by the first, and a length given by the second.
(Here the receiver is an array of sentences, so it returns a subarray, but the start/length idea is the same.)
This means that if you have the list broken into three pieces
ONE | TWO | THREE
slice(one_third, one_third + 1)
starts one third of the way in from the beginning
| TWO | THREE (this is what you are looking at now)
and then takes one_third + 1 elements from that point, which gives you
| TWO |
sentences is a list of all sentences. sentances_sorted is that list sorted by sentence length, so the middle third will be the sentences closest to the average length. slice() grabs that middle third of the list, starting from the position given by one_third and taking one_third + 1 elements from that point.
Note that the correct spelling is 'sentence' and not 'sentance'. I mention this only because you have some code errors that result from spelling it inconsistently.
I was stuck on this when I first started too. In plain English, you have to realize that the slice method can take 2 parameters here.
The first is the index. The second is how long slice goes for.
So let's say you start off with 6 sentences.
one_third = 2
slice(one_third, one_third+1)
1/3 of 6 is 2.
1) here one_third = 2 means you start at index 2, which is the third element
2) then it takes one_third + 1 = 3 elements from that point
so it affects indexes 2 through 4 (the third, fourth and fifth sentences)
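A concrete irb-style check of that example, using six dummy "sentences" already sorted by length:
sentences_sorted = ["s1", "s22", "s333", "s4444", "s55555", "s666666"]
one_third = sentences_sorted.length / 3            # => 2
sentences_sorted.slice(one_third, one_third + 1)   # => ["s333", "s4444", "s55555"]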

Truncate string to the first n words

What's the best way to truncate a string to the first n words?
n = 3
str = "your long long input string or whatever"
str.split[0...n].join(' ')
=> "your long long"
str.split[0...n] # note that there are three dots, which excludes n
=> ["your", "long", "long"]
You could do it like this:
s = "what's the best way to truncate a ruby string to the first n words?"
n = 6
trunc = s[/(\S+\s+){#{n}}/].strip
if you don't mind making a copy.
You could also apply Sawa's Improvement (wish I was still a mathematician, that would be a great name for a theorem) by adjusting the whitespace detection:
trunc = s[/(\s*\S+){#{n}}/]
If you have to deal with an n that is greater than the number of words in s then you could use this variant:
s[/(\S+(\s+)?){,#{n}}/].strip
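A quick check with the s and n from above (n = 6):
s[/(\S+\s+){#{n}}/].strip
# => "what's the best way to truncate"
s[/(\s*\S+){#{n}}/]
# => "what's the best way to truncate"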
You can use str.split.first(n).join(' ')
with n being any number.
Contiguous white spaces in the original string are replaced with a single white space in the returned string.
For example, try this in irb:
>> a='apple orange pear banana pineaple grapes'
=> "apple orange pear banana pineaple grapes"
>> b=a.split.first(2).join(' ')
=> "apple orange"
This syntax is very clear (it doesn't use regular expressions or array slicing by index). If you program in Ruby, you know that clarity is an important stylistic choice.
A shorthand for join is *
So this syntax str.split.first(n) * ' ' is equivalent and shorter (more idiomatic, less clear for the uninitiated).
You can also use take instead of first
so the following would do the same thing
a.split.take(2) * ' '
If it's Rails 4.2 or later (which has truncate_words), it could be the following:
string_given.squish.truncate_words(number_given, omission: "")

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
require 'amatch'

module FuzzyFinder
  # Index the document: yields (or returns) [character offset, word] pairs.
  def scanner(input)
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos += word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end

  # Returns [best score, list of candidate start offsets] for text within doc.
  def find(text, doc)
    index = scanner(doc)
    sstr = text.gsub(/\W/, '')
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen * 2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      # Append following words until the candidate is at least as long as the target.
      while str.length < minlen
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0, minlen) if str.length > minlen
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end
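A hypothetical usage sketch (document.txt and the fragment are made up; find returns the best score together with the candidate start offsets):
include FuzzyFinder

document = File.read('document.txt')   # hypothetical file name
fragment = "a sentence that came from the document, with artifacts"
score, offsets = find(fragment, document)
puts "best Levenshtein score #{score} at character offset(s) #{offsets.inspect}"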
Obviously there are numerous improvements possible and probably necessary! A few off the top:
Process the document once and store the results, possibly in a database.
Determine a usable length of string for an initial check; process against that initial substring first before trying to match the entire fragment.
Following up on the previous, precalculate starting fragments of that length.
A simple one is fuzzy_match
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborate one (though you wouldn't guess it from this example) is levenshtein, which computes the number of differences.
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
It depends on the artifacts that can end up in the substring. In the simpler case, where they are not part of [a-z], you can parse the substring and then use Regexp#match on the document:
document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam
(Here, as we do not set any parentheses in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.)
If you are only interested in the start position, you can use =~ operator:
start_pos = document =~ re
I have used none of them, but I found some libraries just by searching for 'diff' on rubygems.org. All of them can be installed with gem. You might want to try them. I myself am interested, so if you already know these or try them out, it would be helpful if you leave a comment.
diff
diff-lcs
differ
difflcs
pretty_diff
diffy
kronk
khtmldiff
gdiff
ruby_diff
tdiff
diffrenderer
diffplex
dbdiff
diff_dirs
rsyncdiff
wdiff
diff4all
davidtrogers-htmldiff
edouard-htmldiff
diff2xml
dirdiff
rrdiff
nokogiri-diff
pretty-diff
easy_diff
smartdiff
