Find a URL in a document using regex in Ruby

I have been trying to find a URL in an HTML document. This has to be done with a regex, since the URL is not inside any HTML tag, so I can't use Nokogiri for it. To get the HTML I used HTTParty, like this:
require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc
That outputs the HTML. To get to the URL I used the .split() method. The full code is:
require 'httparty'
doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]
puts "https:#{doc}.ngrok.io"
I want to do this with a regex instead, since ngrok might update their localhost HTML page, and then this code would stop working. How do I do it?

If I understood correctly, you want to find all URLs matching "https://(any subdomain).ngrok.io", right?
If so, you can use String#scan with a regexp. Here is an example:
# get your body (replace with your HTTP request)
body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
puts body
# Use scan and you're done
urls = body.scan(%r{https://[0-9A-Za-z.-]+\.ngrok\.io})
puts urls
The result is an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"].
Call .uniq on it if you want to get rid of duplicates.
This doesn't handle all edge cases, but it's probably enough for what you need.
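To tie the scan and the .uniq call together, here is a minimal sketch that wraps them in one method. The names NGROK_URL and ngrok_urls are made up for this example, and the sample string stands in for the real response body.

```ruby
# Illustrative constant name: matches https://<subdomain(s)>.ngrok.io
NGROK_URL = %r{https://[0-9A-Za-z.-]+\.ngrok\.io}

# Hypothetical helper: returns every distinct ngrok URL found in the text.
def ngrok_urls(text)
  text.scan(NGROK_URL).uniq
end

# Sample input standing in for the HTML you fetched.
sample = "see https://subdomain.ngrok.io, https://subdomain-1.subdomain.ngrok.io " \
         "and https://subdomain.ngrok.io again"

found = ngrok_urls(sample)
puts found
```

With HTTParty you would probably pass the response body instead of the sample string, e.g. ngrok_urls(HTTParty.get("http://127.0.0.1:4040").body).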

Related

Ruby - How can I follow a .php link through a request and get the redirect link?

First, I want to make clear that I am not familiar with Ruby at all.
I'm building a Discord bot in Go as an exercise. The bot fetches Urban Dictionary definitions and sends them to whoever asked in Discord.
However, UD doesn't have an official API, so I'm using this. It's a Heroku app written in Ruby. From what I understand, it scrapes the UD page for the given search.
I want to add a random command to my bot, but the API doesn't support it, so I want to add it myself.
As I see it, this shouldn't be hard, since http://www.urbandictionary.com/random.php only redirects you to a normal link on the site. If I can follow the redirect and get the "normal" link, I can pass it to the existing scraper, which can then return a result just as for any other link.
I have no idea how to follow the redirect, and I was hoping to get some pointers or samples.
Here's the "Ruby" way, using net/http and uri:
require 'net/http'
require 'uri'
uri = URI('http://www.urbandictionary.com/random.php')
response = Net::HTTP.get_response(uri)
response['Location']
# => "http://www.urbandictionary.com/define.php?term=water+bong"
Urban Dictionary is using an HTTP redirect (302 status code, in this case), so the "new" URL is passed back in an HTTP header (Location). To get a better idea of what the above is doing, here's a way using just curl and a system call:
`curl -I 'http://www.urbandictionary.com/random.php'`. # Get the headers using curl -I
split("\r\n"). # Split on line breaks
find{|header| header =~ /^Location/}. # Get the 'Location' header
split(' '). # Split on spaces
last # Get the last element in the split array
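If you want something a bit more defensive than reading response['Location'] directly, here is a rough sketch. The method names redirect_target and resolve_random are my own, not part of net/http. It checks that the response really is a redirect and resolves a relative Location against the request URL.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the absolute redirect target as a String,
# or nil when the response is not a redirect or has no Location header.
def redirect_target(response, request_uri)
  return nil unless response.is_a?(Net::HTTPRedirection)

  location = response['Location']
  return nil if location.nil?

  # Location may be relative, so resolve it against the URL we requested.
  URI.join(request_uri.to_s, location).to_s
end

# Hypothetical wrapper: performs one request and reports where it redirects to.
def resolve_random(url = 'http://www.urbandictionary.com/random.php')
  uri = URI(url)
  redirect_target(Net::HTTP.get_response(uri), uri)
end

# Usage (makes a real HTTP request, so it is left commented out):
# puts resolve_random
```

Note that this follows only one hop. If the target redirects again, you would need to repeat the call on the returned URL.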

How can I get both header and web page content in a single call?

I’m using Rails 4.2.7. I know how to use OpenURI to get the headers from a URL …
open(url){|f| pp f.meta }
and I know how to get the contents of the URL
open(url).read
So how can I get both headers and contents in one call, preferably storing headers into one variable and contents into another?
You just have to reuse the result of the open call:
f = open(url)
pp f.meta
pp f.read
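On Ruby 3.0 and later, open-uri no longer patches Kernel#open for URLs, so the same idea is written with URI.open there. A minimal sketch, where the method name fetch_headers_and_body is just an illustrative choice:

```ruby
require 'open-uri'

# Illustrative helper (not part of open-uri): one request, two return values.
def fetch_headers_and_body(url)
  URI.open(url) do |f|
    # f.meta is a Hash of response headers with lowercased keys;
    # f.read returns the response body as a String.
    [f.meta, f.read]
  end
end

# Usage (performs a real HTTP request, so it is left commented out):
# headers, body = fetch_headers_and_body('http://example.com/')
# puts headers['content-type']
# puts body.length
```

The block form also closes the underlying IO object once both values have been read.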

Assign a variable to xpath scrapy

I'm using Scrapy to crawl a web page that has 10+ links, which I extract with LinkExtractor. Everything works fine, but while crawling the extracted links I need to get the page URL. I have no other way to get the URL but to use
response.request.url
How do I assign that value in
il.add_xpath('url', response.request.url)
If I do it like this I get an error:
File "C:\Python27\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
exceptions.ValueError: Invalid XPath: http://www.someurl.com/news/45539/title-of-the-news
For the description it is like this (just for reference):
il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')
Thanks
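The loader comes with two ways of adding values to the item: add_xpath and add_value. add_xpath expects an XPath expression, which is why passing it a URL raises "Invalid XPath". To store a literal value, use add_value, something like: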
...
il.add_value('url', response.url) # yes, response also has the url attribute

Capturing URL parameters after hashtag with Sinatra

I'm trying to capture the URL parameters from the following URL with Sinatra: http://localhost:4567/token#access_token=7nuf5lgupiya8fd6rz4yzkzvwwo2ria&scope=user_read
I've tried a few code blocks to do this:
get '/token' do
  puts params['access_token']
end
and
get '/:token' do |token|
  puts token
end
and
get '/token#:token' do |token|
  puts token
end
However none of these work. In the first block I get an empty string, in the second block I get the string "token", and in the third block I get "Sinatra doesn't know this ditty".
What would be the appropriate solution in this example?
Is the URL you wrote correct? Everything after a # is the URL fragment, which the browser never sends to the server, so Sinatra cannot see it. I think it needs to be
http://localhost:4567/token?access_token=7nuf5lgupiya8fd6rz4yzkzvwwo2ria&scope=user_read
with a ? instead of a # after /token. With that change, you should be able to access all the query parameters in the params hash.
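If you do end up with the full URL as a string on the Ruby side (that is an assumption on my part, for example if client-side code posts it back to you), a small sketch of pulling the values out of the fragment with the standard uri library could look like this. The method name fragment_params is made up for the example.

```ruby
require 'uri'

# Hypothetical helper: parses "key=value&key2=value2" pairs out of the
# fragment of a URL and returns them as a Hash (empty if no fragment).
def fragment_params(url)
  fragment = URI.parse(url).fragment
  return {} if fragment.nil?

  URI.decode_www_form(fragment).to_h
end

url = 'http://localhost:4567/token#access_token=7nuf5lgupiya8fd6rz4yzkzvwwo2ria&scope=user_read'
values = fragment_params(url)
puts values['access_token']
puts values['scope']
```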

Extracting the anchor from a URL in ruby

I looked through the documentation of the URI class in ruby but couldn't find a way to extract the anchor (HTML) from the instance. For example, in
http://example.com/index.php?q=something#anchor
I would like to get the anchor text. The trivial solution is to manipulate the text with regular expressions, but if there is a method for it, that would be much better.
URI objects provide a fragment attribute, e.g.:
>> uri = URI("http://example.com/index.php?q=something#anchor")
>> uri.fragment
=> "anchor"

Resources