Ruby - How can I follow a .php link through a request and get the redirect link? - ruby

Firstly I want to make clear that I am not familiar with Ruby, at all.
I'm building a Discord Bot in Go as an exercise, the bot fetches UrbanDictionary definitions and sends them to whoever asked in Discord.
However, UD doesn't have an official API, so I'm using this. It's a Heroku app written in Ruby. From what I understand, it scrapes the UD page for the given search.
I want to add a random-definition feature to my bot, but the API doesn't support it, so I want to add it myself.
As I see it, this shouldn't be hard, since http://www.urbandictionary.com/random.php only redirects you to a normal page on the site. If I can follow that link to the "normal" one, I can grab the resulting URL and pass it to the existing scraper, which can then return a result just as it does for any other link.
I have no idea how to follow the redirect, and I was hoping to get some pointers or samples.

Here's the "ruby" way using net/http and uri
require 'net/http'
require 'uri'
uri = URI('http://www.urbandictionary.com/random.php')
response = Net::HTTP.get_response(uri)
response['Location']
# => "http://www.urbandictionary.com/define.php?term=water+bong"
Urban Dictionary uses an HTTP redirect (a 302 status code, in this case), so the "new" URL is passed back in an HTTP header (Location). To get a better idea of what the above is doing, here's a way to do it using just curl and a system call:
`curl -I 'http://www.urbandictionary.com/random.php'`. # Get the headers using curl -I
split("\r\n"). # Split on line breaks
find{|header| header =~ /^Location/}. # Get the 'Location' header
split(' '). # Split on spaces
last # Get the last element in the split array
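If the redirect target might itself redirect, or if Location comes back as a relative path, a small helper can keep following until it reaches a non-redirect response. This is only a sketch: the names resolve_location and final_url and the hop limit of 5 are my own assumptions, not part of the answer above.

```ruby
require 'net/http'
require 'uri'

# Resolve a Location header value (absolute or relative) against the
# URL that produced it.
def resolve_location(base_url, location)
  URI.join(base_url, location).to_s
end

# Follow redirects for up to `limit` hops and return the final URL.
# The limit guards against redirect loops; 5 is an arbitrary choice.
def final_url(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(url))
  return url unless response.is_a?(Net::HTTPRedirection)

  final_url(resolve_location(url, response['Location']), limit - 1)
end
```

Calling final_url('http://www.urbandictionary.com/random.php') should then give a URL you can hand to the scraper, assuming the site still answers with a redirect.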

Related

Find a url in a document using regex in ruby

I have been trying to find a URL in an HTML document. This has to be done with a regex, since the URL is not inside any HTML tag, so I can't use Nokogiri for it. To get the HTML I used HTTParty, like this:
require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc
That outputs the HTML code. To get the URL, I used the .split() method to reach it. The full code is:
require 'httparty'
doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]
puts "https:#{doc}.ngrok.io"
I want to do this with a regex instead, since ngrok might update their localhost HTML page, and then this code won't work anymore. How do I do it?
If I understood correctly, you want to find all hostnames matching "https://(any subdomain).ngrok.io", right?
If so, you can use String#scan with a regexp. Here is an example:
# get your body (replace with your HTTP request)
body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
puts body
# Use scan and you're done
urls = body.scan(%r{https://[0-9A-Za-z.-]+\.ngrok\.io})
puts urls
It will result in an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"]
Call .uniq if you want to get rid of duplicates
This doesn't handle ALL edge cases but it's probably enough for what you need
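To apply this to the page fetched in the question, the scan and the deduplication can be wrapped in one small method. This is a sketch under assumptions: the constant NGROK_URL and the method name ngrok_urls are made up for illustration, and the pattern only covers https hosts ending in .ngrok.io.

```ruby
# Assumed pattern: any https host made of letters, digits, dots and
# hyphens that ends in .ngrok.io
NGROK_URL = %r{https://[0-9A-Za-z.-]+\.ngrok\.io}

# Return every distinct ngrok URL found in the given text.
def ngrok_urls(body)
  body.scan(NGROK_URL).uniq
end
```

With HTTParty you would pass in the response body, for example ngrok_urls(HTTParty.get('http://127.0.0.1:4040').body), and take .first if you only expect one tunnel.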

Ruby on Sinatra: Imitate a request based on a parameter

I am currently developing a Ruby API based on Sinatra. This API mostly receives GET requests from an existing social platform which supports external API integration.
The social platform fires off GET requests in the following format (only relevant parameters shown):
GET /{command}
Parameters: command and text
Where text is a string that the user has entered.
In my case, params[:text] is in fact a series of commands, delimited by a space. What I want to achieve is, for example: If params[:text]="corporate finance"
Then I want my API to interpret the request as a GET request to
/{command}/corporate/finance
instead of requesting /{command} with a string as a parameter containing the rest of the request.
Can this be achieved on my side? Nothing can be changed in terms of the initial request from the social platform.
EDIT: I think a better way of explaining what I am trying to achieve is the following:
GET /list?text=corporate finance
Should hit the same endpoint/route as
GET /list/corporate/finance
This must not affect the initial GET request from the social platform as it expects a response containing text to display to the user. Is there a neat, best practice way of doing this?
get "/" do
  text = params[:text].split.join "/"
  redirect "#{params[:command]}/#{text}"
end
might do the trick. Didn't check though.
EDIT: ok, the before filter was stupid. Basically you could also route to "/" and then redirect. Or, even better:
get "/:command" do
  text = params[:text].split.join "/"
  redirect "#{params[:command]}/#{text}"
end
There are many possible ways of achieving this. You should check the routes section of the Sinatra docs (https://github.com/sinatra/sinatra).
The answer by three should do the trick, and to get around the fact that the filter will be invoked with every request, a conditional like this should do:
before do
if params[:text]
sub_commands = params[:text].split.join "/"
redirect "#{params[:command]}/#{sub_commands}"
end
end
I have tested it in a demo application and it seems to work fine.
The solution was to use the call! method.
I used a regular expression to intercept calls which match /something with no further parameters (i.e. /something/something else). I think this step can be done more elegantly.
From there, I split up my commands:
get %r{^\/\w+$} do
sub_commands = params[:text] ? "/" + params[:text].split.join("/") : ""
status, headers, body = call! env.merge("PATH_INFO" => "/#{params[:command]}#{sub_commands}")
[status, headers, body]
end
This achieves exactly what I needed, as it activates the correct endpoint as if the URL had been typed in the usual format, i.e. /command/subcommand1/subcommand2 etc.
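The path rewriting itself can be pulled out into a plain Ruby method, which makes it easy to check without starting Sinatra. This is just a sketch: expanded_path is a hypothetical helper name, and it does not URL-escape the segments.

```ruby
# Build "/command/sub1/sub2" from a command name and a space-delimited
# text parameter. A nil or empty text yields just "/command".
def expanded_path(command, text)
  segments = [command, *text.to_s.split]
  "/" + segments.join("/")
end
```

Inside the route, the result would replace the hand-built string, e.g. call! env.merge("PATH_INFO" => expanded_path(params[:command], params[:text])).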
Sorry, I completely misunderstood your question, so I replace my answer with this:
require 'sinatra'
get '/list/?*' do
"yep"
end
With this, the following routes all lead to the same handler:
http://localhost:4567/list
http://localhost:4567/list/corporate/finance
http://localhost:4567/list?text=corporate/finance
You need to add a route for each command, or replace the command with a * and choose your output with a case/when.
The parameters entered by the user are available in the params hash.

Difficulties using UNIX cURL to scrape Ajax Wicket Information

I have been instructed to write UNIX shell scripts that scrape certain websites. We use Fiddler to trace the HTTP requests, then we write the cURL commands accordingly. For the most part, scraping most websites seems to be fairly simple, but I've run into a situation where I'm having difficulty capturing certain information.
I need to be somewhat generic in saying that I cannot provide the website address that I am actually looking at, however I can post some of the requests and responses to provide context.
Here's the situation:
The website starts with a search screen. You enter your search query and the website returns a list of results.
I need to choose the first result from the result page.
I need to capture EVERYTHING on the page from the first result.
Everything up until this point is working fine
Here's the problem:
The page returned has hyperlinks that are wickets. When these links are pressed, a window pops up within the page - it is not actually a window like a pop up created by javascript, it is more comparable to what you see when you 'compose a message' or 'poke' someone on Facebook ( am I the only one who still does that? ).
I need to capture the contents of that pop up window. There are usually multiple wicket links on a given page. Handling that should be easy enough with a loop, but I need to figure out the proper way to cURL those wickets first.
Here is the cURL command I'm currently using to attempt to scrape the wickets.
(I'm explicitly defining the referrer URL, Accept, and Wicket-Ajax boolean, as these were the items sent in the header when I traced the site.) Link is the URL, which looks like this:
http://www.someDomainName.com/searches/?x=as56f1sa65df1&random=0.121345151
(I believe the random value is populated by some JavaScript; I'm not sure whether it's needed or even possible to recreate. I'm currently sending one of the random values that I received on one particular occasion.)
/bin/curl -v3 -b COOKIE -c COOKIE -H "Accept: text/xml" -H "Referer: $URL$x" -H "Wicket-Ajax: true" -sLf "$link"
Here is the response I get:
<ajax-response><redirect><![CDATA[home.page;jsessionid=6F45DF769D527B98DD1C7FFF3A0DF089]]></redirect>
</ajax-response>
I am expecting an XML document with actual content to be returned. Any insight into this issue would be greatly appreciated. Please let me know if you need more information.
Thanks,
Paul

Totally stuck trying to get HTTPS data using Ruby on Windows

I'm using Ruby 1.9.3 and trying to write a Google Play scraper loosely based on this one. I am having a really hard time with the HTTPS part of it.
Basically, using Nokogiri::HTML(open("https://play.google.com/store/#{type}/details?id=#{id}")) (as in the original gem) failed on Windows, for reasons explained on this thread.
So, I tried implementing the solution from that same thread, but it is really not working at all. I've even stopped trying with HTTPS for now, because there must be something basic I am missing on even just HTTP.
Here's the code I currently have:
url = URI.parse( "http://google.com/" )
http = Net::HTTP.new( url.host, url.port )
http.use_ssl = true if url.port == 443
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
res, data = http.get ("http://google.com/")
puts data
In this case, I get nothing. Not even "nil", just no output at all.
However, when I just do a straight Net::HTTP.get_print URI('http://www.google.com'), I get the output, no problems.
Any help would be most appreciated. The real solution I am looking for is a simple way to scrape Google Play pages when using Windows -- this is just a step on the way there. So, if you know of a simpler way to accomplish this, I'd love to hear about it.
The reason you are getting nil is that data doesn't have anything assigned to it: http.get returns a single response object, so read the body from res.body instead. This line only assigns to res:
res, data = http.get("http://google.com/")
Also, Google must be accessed using http://www.google.com with the www otherwise all you get back is a 301 redirect message and Net::HTTPMovedPermanently object.
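Putting those two points together, a fetch could look roughly like the following. Treat it as a sketch: build_client and fetch_body are names I made up, TLS is switched on from the URL scheme rather than the port, and certificate verification is left at the default instead of VERIFY_NONE.

```ruby
require 'net/http'
require 'uri'

# Build a Net::HTTP client for the URL, enabling TLS only for https.
# Nothing is sent over the network until a request is made.
def build_client(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Perform a GET and return the body. http.get returns one response
# object, so the body is read from it rather than from a second value.
def fetch_body(url)
  uri = URI.parse(url)
  response = build_client(url).get(uri.request_uri)
  response.body
end
```

For example, puts fetch_body('http://www.google.com/') should print the page, while a redirect response would still need to be followed separately.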

Grab Facebook signed_request with Sinatra

I'm trying to figure out whether or not a user likes our brand page. Based off of that, we want to show either a like button or some 'thank you' text.
I'm working with a sinatra application hosted on heroku.
I tried the code from this thread: Decoding Facebook's signed request in Ruby/Sinatra
However, it doesn't seem to grab the signed_request and I can't figure out why.
I have the following methods:
get "/tab" do
  @encoded_request = params[:signed_request]
  @json_request = decode_data(@encoded_request)
  @signed_request = Crack::JSON.parse(@json_request)
  erb :index
end
# used by Canvas apps - redirect the POST to be a regular GET
post "/tab" do
  @encoded_request = params[:signed_request]
  @json_request = decode_data(@encoded_request)
  @signed_request = Crack::JSON.parse(@json_request)
  redirect '/tab'
end
I also have the helper methods from that thread, as they seem to make sense to me:
helpers do
def base64_url_decode(payload)
encoded_str = payload.gsub('-','+').gsub('_','/')
encoded_str += '=' while !(encoded_str.size % 4).zero?
Base64.decode64(encoded_str)
end
def decode_data(signed_request)
# split returns [signature, payload]; only the payload is Base64-decoded
_encoded_sig, payload = signed_request.split('.')
base64_url_decode(payload)
end
end
However, when I just do
@encoded_request = params[:signed_request]
and read that out in my view with:
<%= @encoded_request %>
I get nothing at all.
Shouldn't this return at least something? My app seems to be crashing because well, there's nothing to be decoded.
I can't seem to find a lot of information about this around the internet so I'd be glad if someone could help me out.
Are there better ways to know whether or not a user likes our page? Or, is this the way to go and am I just overlooking something obvious?
Thanks!
The hint should be in your app crashing because there's nothing to decode.
I suspect the parameters get lost when redirecting. Think about it at the HTTP level:
The client posts to /tab with the signed_request in the params.
The app parses the signed_request and stores the result in instance variables.
The app redirects to /tab, i.e. sends a response with code 302 (or similar) and a Location header pointing to /tab. This completes the request/response cycle and the instance variables get discarded.
The client makes a new request: a GET to /tab. Because of the way redirects work, this will no longer have the params that were sent with the original POST.
The app tries to parse the signed_request param but crashes because no such param was sent.
The simplest solution would be to just render the template in response to the POST instead of redirecting.
If you really need to redirect, you need to carefully pass along the signed_request as query parameters in the redirect path. At least that's a solution I've used in the past. There may be simpler ways to solve this, or libraries that handle some of this for you.
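For the redirect variant, the query string is best built with URI.encode_www_form so the value is escaped properly. A minimal sketch, assuming the /tab route from the question; tab_redirect_path is a hypothetical helper name:

```ruby
require 'uri'

# Build a redirect target that carries signed_request along as a
# query parameter, so the follow-up GET can still read it from params.
def tab_redirect_path(signed_request)
  "/tab?#{URI.encode_www_form(signed_request: signed_request)}"
end
```

In the POST handler this would be used as redirect tab_redirect_path(params[:signed_request]). The GET handler then finds the value in params[:signed_request] as before.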

Resources