I'd like to get the actual url strings from the hyperlinks. I'd like my result to be stripped of html.
So, if one of my input strings is
<a href="http://target.com/resource.tar.gz">resource</a>
I'd like to get:
http://target.com/resource.tar.gz
How can I do this?
In Hpricot you access attributes of an element using square brackets (like you would when accessing elements in a Hash). So, to use your example:
doc = Hpricot('<a href="http://target.com/resource.tar.gz">resource</a>')
puts doc.at('a')['href'] # => http://target.com/resource.tar.gz
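For what it's worth, Nokogiri (used in later answers) supports the same bracket access, so an equivalent sketch would be:
require 'nokogiri'
doc = Nokogiri::HTML('<a href="http://target.com/resource.tar.gz">resource</a>')
puts doc.at('a')['href'] # => http://target.com/resource.tar.gz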
Is there a method to extract just the last word from the URL example below? I would like to be able to use this as a heading on a page, i.e the "Account" page.
I found that by using request.path it will give me the path without the root but I'm not sure how to get just the last path name.
/users/1234/account
Try:
request.path.split('/').last
If you want "Account" (instead of "account"), call the capitalize method on the result.
I am not familiar with Ruby, but you can try this approach.
Try splitting request.path with '/' as the separator and take the last element from the resulting array.
/users/1234/account will be split into ['users', '1234', 'account'] (in Ruby the array also starts with an empty string for the leading '/', but the last element is still 'account').
Even though this doesn't answer your question directly, I hope it gives you a start
A URL is a simple string consisting of a scheme showing how to connect to a site, the host where the resource is located, and a path to that resource. You can use File.basename to get the last part of that path, just as we would with a file on disk:
File.basename('/users/1234/account')
=> "account"
Suppose you have a URL like https://www.google.com/user/lastword
If you want to store the last word of the URL (here, lastword) in a variable, assign the URL to finalVal and use the following:
getLastWordFromUrl = finalVal.split("/").last
I'm basically writing my own Markdown parser. I want to detect a URL in a string and wrap it with an anchor tag if it's a valid URL. For example:
string = 'here is a link: http://google.com'
# if string matches regex (which it does)
# should return:
'here is a link: <a href="http://google.com">http://google.com</a>'
# but this would remain unchanged:
string = 'here is a link: google.com'
How can I achieve this?
Bonus points if you can point me to the code in an existing Ruby markdown parser that I can use as an example.
In general: use a regular expression to find URLs and wrap them in your HTML:
urls = %r{(?:https?|ftp|mailto)://\S+}i
html = str.gsub urls, '<a href="\0">\0</a>'
Note that this particular solution will turn this text:
See more at http://www.google.com.
…into…
See more at <a href="http://www.google.com.">http://www.google.com.</a>
So you may want to play with the regex a bit to figure out where the URL should really end.
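One possible tweak, purely as a sketch and not part of the original answer, is to peel trailing punctuation off each match before wrapping it:
html = str.gsub(urls) do |match|
  url      = match.sub(/[.,;:!?]+\z/, '') # the URL without any trailing punctuation
  trailing = match[url.length..-1]        # whatever punctuation was stuck on the end
  %Q{<a href="#{url}">#{url}</a>#{trailing}}
end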
You can use this jQuery plugin:
http://www.jquery.gr/linker/
I'm creating JSON in an ExpressionEngine template and pointing the Ruby JSON library at the relevant URL. The template looks like this:
[
{exp:mylib:mytag channel="mychannel" backspace="1"}
{"entry_id":"{entry_id}","title":"{title}"},
{/exp:mylib:mytag}
]
When the tag returns data, everything is fine, my Ruby code works perfectly with the array of objects. However, when the tag returns no data (because there are no appropriate entries), Ruby complains that the JSON string is not the required 2 octets in length. I would expect the output to be [], i.e. an empty but valid JSON array. However, visiting the URL in Firefox/Firebug and wget confirms that the response coming back from the URL is zero bytes in length, with status 200 OK.
I tested further by creating a template without tags and just a pair of empty square brackets, with the same result: zero bytes.
Is a pair of empty square brackets somehow a reserved token in the EE template language? Is there some clever optimisation going on that assumes that no-one could ever want a pair of square brackets in an html page?
Are you developing your own add-on, or using the built-in ExpressionEngine tags?
Using the native channel entries tag, you can use an {if no_results} conditional to control what gets output when there are no matching results:
{exp:channel:entries channel="channel_name"}
  {if no_results} ...{/if}
{/exp:channel:entries}
Many third-party add-ons also support the same type of {if no_results} conditional.
You might also have a look at the third-party ExpressionEngine JSON add-on, which may be able to give you some inspiration on how to approach your situation.
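Until the template reliably emits [], a defensive workaround on the Ruby side is to treat an empty body as an empty array. A minimal sketch (the URL is a placeholder for your EE template):
require 'json'
require 'open-uri'

raw = open('http://example.com/entries.json').read # placeholder URL
entries = raw.strip.empty? ? [] : JSON.parse(raw)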
I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:
/href="https?:[^"]*|href='https?:[^']*/i
Then, I can use gsub to remove the href part:
str.gsub(/href=['"]/, '')
This works fine, but I'm not sure if it's efficient in terms of performance. Is this OK to use or I should work with a more specific parser (nokogiri, for example)? Which way is better?
Thanks!
Using regular expressions is fine for a quick and dirty script, but Nokogiri is very simple to use:
require 'nokogiri'
require 'open-uri'
fail("Usage: extract_links URL [URL ...]") if ARGV.empty?
ARGV.each do |url|
  doc = Nokogiri::HTML(open(url))
  hrefs = doc.css("a").map do |link|
    if (href = link.attr("href")) && !href.empty?
      URI::join(url, href)
    end
  end.compact.uniq
  STDOUT.puts(hrefs.join("\n"))
end
If you want just the method, refactor it a little bit to your needs:
def get_links(url)
  Nokogiri::HTML(open(url).read).css("a").map do |link|
    if (href = link.attr("href")) && href.match(/^https?:/)
      href
    end
  end.compact
end
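A possible invocation, assuming the same require 'nokogiri' and require 'open-uri' as above (the URL is a placeholder):
puts get_links('http://example.com/')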
I'm a big fan of Nokogiri, but why reinvent the wheel?
Ruby's URI module already has the extract method to do this:
URI::extract(str[, schemes][,&blk])
From the docs:
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
You could use Nokogiri to walk the DOM and pull all the tags that have URLs, or have it retrieve just the text and pass it to URI.extract, or just let URI.extract do it all.
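A sketch of the second option, pulling the visible text with Nokogiri and handing it to URI.extract (the URL is a placeholder):
require 'nokogiri'
require 'open-uri'
require 'uri'

doc = Nokogiri::HTML(open('http://example.com/'))
puts URI.extract(doc.text, %w[http https]) # scan only the page's visible text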
And, why use a parser, such as Nokogiri, instead of regex patterns? Because HTML, and XML, can be formatted in a lot of different ways and still render correctly on the page or effectively transfer the data. Browsers are very forgiving when it comes to accepting bad markup. Regex patterns, on the other hand, work in very limited ranges of "acceptability", where that range is defined by how well you anticipate the variations in the markup, or, conversely, how well you anticipate the ways your pattern can go wrong when presented with unexpected patterns.
A parser doesn't work like a regex. It builds an internal representation of the document and then walks through that. It doesn't care how the file/markup is laid out, it does its work on the internal representation of the DOM. Nokogiri relaxes its parsing to handle HTML, because HTML is notorious for being poorly written. That helps us because with most non-validating HTML Nokogiri can fix it up. Occasionally I'll encounter something that is SO badly written that Nokogiri can't fix it correctly, so I'll have to give it a minor nudge by tweaking the HTML before I pass it to Nokogiri; I'll still use the parser though, rather than try to use patterns.
Mechanize uses Nokogiri under the hood but has built-in niceties for parsing HTML, including links:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://example.com/')
page.links_with(:href => /^https?/).each do |link|
  puts link.href
end
Using a parser is almost always better than using regular expressions for parsing HTML. This is an often-asked question here on Stack Overflow, with this being the most famous answer. Why is this the case? Because constructing a robust regular expression that can handle real-world variations of HTML, some valid, some not, is very difficult and ultimately more complicated than a simple parsing solution that will work for just about all pages that will render in a browser.
Why don't you use groups in your pattern?
e.g.
/http[s]?:\/\/(.+)/i
So the first group will already be the link you searched for.
Can you put groups in your regex? That would reduce your regular expressions to one instead of two.
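For example, a sketch of the capture-group idea (not the original poster's exact pattern):
str.scan(%r{href=['"](https?://[^'"]+)}i).flatten # => ["http://...", ...]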
I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts "#{link}" }
The problem is that it is not outputting the websites' links.
Those links actually have class=l not class="l". By the way, to figure this out I added some logging to the method so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the right string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
  puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means that the .+ should match as short a string as possible.
To put it bluntly, the problem is that you're using regexes: HTML is what is known as a context-free language, while regular expressions can only match the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
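A minimal sketch of that approach with Hpricot (assuming Google still serves class=l anchors, which the last answer below notes it may not):
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.google.com/search?q=ruby"))
(doc / "a.l").each { |link| puts link["href"] } # walk the parsed document instead of the raw source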
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)