How do I cut phrases off a string in Ruby?

I wasn't sure about my question's name. I have an HTML page I got using Nokogiri. Now I want to cut some tags off that page. I tried using Ruby's delete method after converting the HTML to a string, though that deletes every occurrence of the letters I entered, not the phrase. The best result I got was using .gsub('<stuff>', ''), though it still leaves some space. Is it possible to actually cut stuff off a string? Specific phrases?
Another question - Can I remove spaces?
What I've done so far:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.example.com/"))
tester = doc.css(".example").to_s.gsub('<div class="example">', '')

I'd suggest trying to do it at the XML tree level rather than by string editing; the Nokogiri API gives you some tools for doing this, as sketched below.
Another approach might be selecting the data you want, with CSS or XPath, rather than deleting the parts you don't want.
There's also an XPath function, normalize-space(), for normalising whitespace in strings; there's an example in this question.
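A minimal sketch of both approaches, reusing the question's .example selector (the "div.unwanted" selector is a hypothetical stand-in for whatever tags you want to cut):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.example.com/"))

# Tree-level removal: delete the unwanted nodes, then serialize what's left.
doc.css("div.unwanted").each { |node| node.remove }
cleaned = doc.to_html

# Or select only what you want and take its text, stripping stray whitespace.
text = doc.css(".example").map { |node| node.text.strip }.join(" ")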
Some nokogiri help:
Intro article on Engineyard
Railscast/Asciicasts
Official tutorials

Check out Nokogiri's Tutorials. In particular, you want to read "Modifying an HTML / XML Document", Changing text contents.
Nokogiri's XML accessors are very friendly because you don't need to use XPath. You can also use CSS accessors, which help a lot for people who aren't in XML all day long.
In that particular example, they're using the at_css method, which searches for the first occurrence of the target. You have many alternate methods, which are synonyms: at, %, at_css and at_xpath handle "find the first one" cases. search, css, xpath, / similarly handle "find all occurrences".
For instance:
require 'nokogiri'
html = '<h1>Snap, Crackle and Pop</h1>'
doc = Nokogiri::HTML(html)
h1 = doc.at('h1')
h1.content = h1.content[0, h1.content.length - 3] + '...'
puts doc.to_html
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html><body><h1>Snap, Crackle and ...</h1></body></html>
That creates a new HTML document in Nokogiri, searches for the first H1, and trims the trailing three characters in its contents, replacing them with an ellipsis.

Related

How do I write a CSS selector that looks for an element starting with text in a case-insensitive way?

I'm using Rails 5.0.1 with Nokogiri. How do I select a CSS element whose text starts with a certain string in a case-insensitive way? Right now I can search in a case-sensitive way using
doc.css("#select_id option:starts-with('ABC')")
but I would like to know how to disregard case when looking for an option that starts with certain text.
Summary: It's ugly. You're better off just using Ruby:
doc.css('select#select_id > option').select{ |opt| opt.text =~ /^ABC/i }
Details
Nokogiri uses libxml2, which uses XPath to search XML and HTML documents, so Nokogiri transforms CSS-like expressions into XPath. For example, this is what Nokogiri actually searches for, given your selector:
Nokogiri::CSS.xpath_for("#select_id option:starts-with('ABC')")
#=> ["//*[#id = 'select_id']//option[starts-with(., 'ABC')]"]
The expression you wrote is not actually CSS. There is no :starts-with() pseudo-class in CSS, not even proposed in Selectors 4. What does exist is the starts-with() function in XPath, and Nokogiri is (somewhat surprisingly) allowing you to mix XPath functions into your CSS and carrying them over to the XPath it uses internally.
The libxml2 library is limited to XPath 1.0, and in XPath 1.0 case-insensitive searches are done by translating all characters to lowercase. The XPath expression you'd want is thus:
//select[@id='select_id']/option[starts-with(translate(.,'ABC','abc'),'abc')]
(Assuming you only care about those characters!)
I'm not sure that you CAN write CSS+XPath in a way that Nokogiri would produce that expression. You'd need to use the xpath method and feed it that query.
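For example, a sketch of feeding that query to the xpath method, building the translate() alphabets in Ruby so the whole A-Z range is covered (doc is assumed to be the parsed document):
upper = ('A'..'Z').to_a.join
lower = ('a'..'z').to_a.join
options = doc.xpath(
  "//select[@id='select_id']/option" +
  "[starts-with(translate(., '#{upper}', '#{lower}'), 'abc')]"
)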
Finally, you can create your own custom CSS pseudo-classes and implement them in Ruby. For example:
class MySearch
  def insensitive_starts_with(nodes, str)
    nodes.find_all{ |n| n.text =~ /^#{Regexp.escape(str)}/i }
  end
end
doc.css( "select#select_id > option:insensitive_starts_with('ABC')", MySearch )
...but all this gives you is re-usability of your search code.

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something in this style will return just the contents of the match:
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming there is one internal tag you want to handle; what would happen if you had two internal tags? This is why everyone is telling you to use an HTML parser, which you really should do.
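To make concrete what that indexing returns, here it is against the sample markup from the question:
html = '<title><h2>foobar</h2></title>'
html[/<title>(.*?)<\/title>/, 1]
#=> "<h2>foobar</h2>" (the internal tags are still there, as noted above)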

Getting all links of a webpage using Ruby

I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:
/href="https?:[^"]*|href='https?:[^']*/i
Then, I can use gsub to remove the href part:
str.gsub(/href=['"]/, '')
This works fine, but I'm not sure if it's efficient in terms of performance. Is this OK to use, or should I work with a more specific parser (Nokogiri, for example)? Which way is better?
Thanks!
Using regular expressions is fine for a quick and dirty script, but Nokogiri is very simple to use:
require 'nokogiri'
require 'open-uri'
fail("Usage: extract_links URL [URL ...]") if ARGV.empty?
ARGV.each do |url|
  doc = Nokogiri::HTML(open(url))
  hrefs = doc.css("a").map do |link|
    if (href = link.attr("href")) && !href.empty?
      URI::join(url, href)
    end
  end.compact.uniq
  STDOUT.puts(hrefs.join("\n"))
end
If you want just the method, refactor it a little bit to your needs:
def get_links(url)
  Nokogiri::HTML(open(url).read).css("a").map do |link|
    if (href = link.attr("href")) && href.match(/^https?:/)
      href
    end
  end.compact
end
I'm a big fan of Nokogiri, but why reinvent the wheel?
Ruby's URI module already has the extract method to do this:
URI::extract(str[, schemes][,&blk])
From the docs:
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
You could use Nokogiri to walk the DOM and pull all the tags that have URLs, or have it retrieve just the text and pass it to URI.extract, or just let URI.extract do it all.
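A sketch of those two routes (the URL is a placeholder; restricting the schemes to http/https is an assumption, since URI.extract would otherwise also return mailto: matches):
require 'nokogiri'
require 'open-uri'
require 'uri'

doc = Nokogiri::HTML(open('http://www.example.com/').read)

# Walk the DOM and pull href attributes from anchor tags...
hrefs = doc.css('a[href]').map { |a| a['href'] }

# ...or hand the page's visible text to URI.extract and let it find URL-shaped strings.
urls = URI.extract(doc.text, %w[http https])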
And, why use a parser, such as Nokogiri, instead of regex patterns? Because HTML, and XML, can be formatted in a lot of different ways and still render correctly on the page or effectively transfer the data. Browsers are very forgiving when it comes to accepting bad markup. Regex patterns, on the other hand, work in very limited ranges of "acceptability", where that range is defined by how well you anticipate the variations in the markup, or, conversely, how well you anticipate the ways your pattern can go wrong when presented with unexpected patterns.
A parser doesn't work like a regex. It builds an internal representation of the document and then walks through that. It doesn't care how the file/markup is laid out; it does its work on the internal representation of the DOM. Nokogiri relaxes its parsing to handle HTML, because HTML is notorious for being poorly written. That helps us, because Nokogiri can fix up most invalid HTML. Occasionally I'll encounter something that is SO badly written that Nokogiri can't fix it correctly, so I'll have to give it a minor nudge by tweaking the HTML before passing it in; I'll still use the parser, though, rather than try to use patterns.
Mechanize uses Nokogiri under the hood but has built-in niceties for parsing HTML, including links:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://example.com/')
page.links_with(:href => /^https?/).each do |link|
  puts link.href
end
Using a parser is almost always better than using regular expressions for parsing HTML. This is an often-asked question here on Stack Overflow, with this being the most famous answer. Why is that? Because constructing a robust regular expression that can handle real-world variations of HTML, some valid and some not, is very difficult, and ultimately more complicated than a simple parsing solution that will work for just about any page that will render in a browser.
Why don't you use groups in your pattern?
E.g.:
/https?:\/\/(.+)/i
The first group will then already be the link you searched for.
Can you put groups in your regex? That would reduce it to one regular expression instead of two.
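A sketch of that single-pass idea (the sample string is hypothetical): with a capture group, String#scan returns only the captured part, so the separate gsub step goes away.
str = '<a href="http://www.example.com/" class="l">'
links = str.scan(/href=["'](https?:\/\/[^"']+)["']/i).flatten
#=> ["http://www.example.com/"]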

How do I count a substring using a regex in Ruby?

I have a very large XML file which I load as a string, so my XML looks like:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I convert the ID to a regex?
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how I can achieve this.
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
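Dropped into the question's code, that looks like this (a sketch, assuming the XML is already loaded into xml):
count = xml.scan(/article ID="[1-9]{7}" contentstatus="Changed"/).length
puts count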
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
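In Nokogiri that might look like the following sketch (the filename is hypothetical); note that count() evaluates to a number rather than a node set:
require 'nokogiri'

doc = Nokogiri::XML(File.read('publications.xml'))
# XPath count() returns a Float, so cast it if you want a clean integer.
changed = doc.xpath('count(//article[@contentstatus="Changed"])').to_i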
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'

xml = <<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
  <volume contentstatus="Unchanged" idID="0b0000648151c35d">
    <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
    <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
    <article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
  </volume>
</publication>
EOT

doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex with HTML or XML is that it breaks really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line endings after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser, like Nokogiri, can even do fixups if the XML is broken, in order to try to make sense of it.
Your current string looks almost right; remove the errant / from around the numbers, and make it a regex literal rather than a string so scan treats it as a pattern:
searchstr = /article ID="[1-9]{7}" contentstatus="Changed"/

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts link }
The problem is, it is not outputting the websites' links.
Those links actually have class=l, not class="l". By the way, to figure this out I added some logging to the method so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. Then I looked for the string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
  puts "- #{link}"
end
I also improved your regex. You are looking for text that starts with the opening of an a tag (<a), then some characters you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means the .+ should match as short a string as possible.
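A quick sketch of why the lazy ? matters, using a hypothetical two-link string:
s = '<a href="http://a.example/" class=l><a href="http://b.example/" class=l>'
s.scan(/<a.+?href="(.+?)".+?class=l/)
#=> [["http://a.example/"], ["http://b.example/"]]
s.scan(/<a.+href="(.+)".+class=l/)
#=> [["http://b.example/"]] (greedy: one match swallows both tags)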
To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only match the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)
