I am trying to crawl websites with Ruby. The way I am implementing it is that I send a request for a page, collect all the links (href attributes) on the page, and then generate another GET request for each of them. The problem is that I want to stay logged in during the whole process. I wrote some code as follows.
def start_crawling
  uri = URI(@host + "/login")
  @visited.push @host + "/login"
  req = Net::HTTP::Post.new(uri)
  req.set_form_data({
    'email' => 'test',
    'password' => 'test'
  })
  Net::HTTP.start(uri.hostname, uri.port) do |http|
    res = http.request req
    puts uri
    puts res.code
    content = res.body
    puts content
    puts res.response
    cookie = res.response['Set-Cookie'] # this gives nothing
    puts cookie
    puts res["Set-Cookie"] # prints nothing here
    hrefs = get_href_tag_array_from_html(content)
    send_get_requests(hrefs, cookie)
  end
end
def send_get_requests(hrefs, cookie)
  while not hrefs.empty?
    href = hrefs.pop
    href = @host + href if not href.start_with?("http")
    next if @visited.include?(href)
    puts "href: " + href
    uri = URI(href)
    Net::HTTP.start(uri.host, uri.port) do |http|
      req = Net::HTTP::Get.new uri
      res = http.request req
      puts "------------------href: #{href}---------------------------"
      puts res.code
      puts res.message
      puts res.class.name
      puts "Cookie: "
      puts res['Set-Cookie'] # this works and prints cookies
      puts res.body
      puts "------------------end of: #{href}---------------------------"
      new_hrefs = get_href_tag_array_from_html(res.body)
      hrefs += new_hrefs
    end
    @visited.push href
  end
end
I want to start crawling from the login page. Ideally I want to stay logged in during the whole crawling procedure. I don't know much about session/cookie stuff, but I guess if I can get the cookie from the previous response and send it with the next request, I should be able to stay logged in. However, I cannot get any cookie from the login response. The response is a 302 redirect, as I would expect. I checked it in the browser and the 302 response header does contain a Set-Cookie field, and that cookie is used for the next GET request to redirect to the home page, but in my code I cannot get the cookie field at all.
When I send a GET request and get the response, I can read the cookie field out of it, but when I send the POST request to the login page, I cannot get any cookie. Is there any fundamental difference between GET and POST requests in this respect?
Any idea how I can get this cookie field? Or do I have some basic misunderstanding in solving this crawling problem? Thanks.
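For reference, the kind of cookie forwarding I have in mind is roughly the sketch below (it assumes the login response really does carry a Set-Cookie header; the "/home" path is just a placeholder):
res = http.request(req)                         # POST to the login page, as above
cookie = res['Set-Cookie']                      # e.g. "session=abc123; Path=/; HttpOnly"
session = cookie && cookie.split(';').first     # keep only "session=abc123"
get = Net::HTTP::Get.new(URI(@host + "/home"))  # placeholder next page
get['Cookie'] = session if session              # attach the cookie to the next request
res = http.request(get)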
Related
I use Ruby's net/http library to get the HTML response, but I can't get the body of the page when the status code is 3xx.
Page Body:
<div class="flash-container">
  <div class="flash flash-success">
    Your email address has been changed successfully.
    ×
  </div>
</div>
Request:
require 'net/http'
require 'uri'
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Post.new(uri.request_uri)
request.set_form_data({
  'email' => email,
  'email-confirm' => email_confirm,
  'password' => password
})
request['Cookie'] = 'ACCOUNT_SESSID=' + token
response = http.request(request)
Response:
response.code # '302'
response.body # ''
You'll likely need to follow the redirect (the 302 code). The Ruby docs have a great example of doing this.
I've included it below, along with a check to return the body if it exists. If you never want to follow the redirect, you could change the else branch to return response.code, an empty string, false, or whatever's appropriate. Here's the full example:
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0
  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    # return the redirect response itself if it already carries a body
    if response.body && !response.body.empty?
      response
    else
      location = response['location']
      warn "redirected to #{location}"
      fetch(location, limit - 1)
    end
  else
    response.value
  end
end
The code is pretty straightforward: it calls itself recursively whenever the response from Net::HTTP.get_response is a redirect, following the new location.
You can follow up to ten redirects with this approach, which should be ample, though you should adjust the limit to suit your circumstances.
Then, when you run fetch(your_url), it should follow the redirects until it lands on a page and can return the body, e.g.:
res = fetch(your_url)
res.body
Let me know how you get on with this, or if you've any questions!
I've been trying to make an API call to my server to delete a user record held in a dev database. When I use Fiddler to call the URL with the DELETE operation, I can delete the user record immediately. When I call that same URL, again with the DELETE operation, from my script below, I get this error:
{"Message":"The requested resource does not support http method 'DELETE'."}
I have changed the URL in my script below. The URL I am using is definitely correct. I suspect that there is a logical error in my code that I haven't caught. My script:
require 'net/http'
require 'json'
require 'pp'
require 'uri'
def deleteUserRole
  # prepare request
  url = "http://my.database.5002143.access" # dev
  uri = URI.parse(url)
  request = Net::HTTP::Delete.new(uri.path)
  http = Net::HTTP.new(uri.host, uri.port)
  # send the request
  response = http.request(request)
  puts "response: \n"
  puts response.body
  puts "response code: " + response.code + "\n \n"
  # parse response
  buffer = response.body
  result = JSON.parse(buffer)
  status = result["Success"]
  if status == true
    puts "passed"
  else
    puts "failed"
  end
end
deleteUserRole
It turns out that I was typing in the wrong command. I needed to change this line:
request = Net::HTTP::Delete.new(uri.path)
to this line:
request = Net::HTTP::Delete.new(uri)
By typing uri.path I was excluding part of the URL from the API call. When I was debugging, I would type puts uri and that would show me the full URL, so I was certain the URL was right. The URL was right, but I was not including the full URL in my DELETE call.
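To illustrate the difference with a made-up URL (purely a sketch, the endpoint is hypothetical):
require 'net/http'
require 'uri'
uri = URI.parse("http://example.com/api/users/42?force=true")  # hypothetical endpoint
uri.path          # => "/api/users/42"           (the query string is dropped)
uri.request_uri   # => "/api/users/42?force=true"
req = Net::HTTP::Delete.new(uri)  # Ruby 2.0+: keeps the full request target and sets the Host header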
If you omit the parameters that need to be passed with the DELETE request, it won't work.
You can do it like this:
uri = URI.parse('http://localhost/test')
http = Net::HTTP.new(uri.host, uri.port)
attribute_url = '?'
attribute_url << body.map{|k,v| "#{k}=#{v}"}.join('&')
request = Net::HTTP::Delete.new(uri.request_uri+attribute_url)
response = http.request(request)
Here body is a hash in which you define the query params; the code above joins them into the URL before the request is sent.
For example: body = { :resname => 'res', :bucket_name => 'bucket', :uploaded_by => 'upload' }
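If the values might contain spaces or other characters that need escaping, URI.encode_www_form builds the same query string with proper encoding (a sketch based on the code above):
require 'net/http'
require 'uri'
body = { :resname => 'res', :bucket_name => 'bucket', :uploaded_by => 'upload' }
uri = URI.parse('http://localhost/test')
uri.query = URI.encode_www_form(body)   # "resname=res&bucket_name=bucket&uploaded_by=upload"
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Delete.new(uri.request_uri)
response = http.request(request)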
I understand that you can use a proxy with Ruby's Net::HTTP. However, I have no idea how to do this with a bunch of proxies. I need Net::HTTP to switch to another proxy and send another POST request after every POST request. Also, is it possible to make Net::HTTP switch to another proxy if the previous proxy is not working? If so, how?
The code I'm trying to work this into:
require 'net/http'
sleep(8)
http = Net::HTTP.new('URLHERE', 80)
http.read_timeout = 5000
http.use_ssl = false
path = 'PATHHERE'
data = '(DATAHERE)'
headers = {
  'Referer' => 'REFERER HERE',
  'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8',
  'User-Agent' => '(USERAGENTHERE)'
}
resp, data = http.post(path, data, headers)
# Output on the screen -> we should get either a 302 redirect (after a successful login) or an error page
puts 'Code = ' + resp.code
puts 'Message = ' + resp.message
resp.each {|key, val| puts key + ' = ' + val}
puts data
Given an array of proxies, the following example will make a request through each proxy in the array until it receives a "302 Found" response. (This isn't actually a working example because Google doesn't accept POST requests, but it should work if you insert your own destination and working proxies.)
require 'net/http'
destination = URI.parse "http://www.google.com/search"
proxies = [
  "http://proxy-example-1.net:8080",
  "http://proxy-example-2.net:8080",
  "http://proxy-example-3.net:8080"
]
# Create your POST request_object once
request_object = Net::HTTP::Post.new(destination.request_uri)
request_object.set_form_data({"q" => "stack overflow"})
proxies.each do |raw_proxy|
  proxy = URI.parse raw_proxy
  # Create a new http_object for each new proxy
  http_object = Net::HTTP.new(destination.host, destination.port, proxy.host, proxy.port)
  # Make the request
  response = http_object.request(request_object)
  # If we get a 302, report it and break
  if response.code == "302"
    puts "#{proxy.host}:#{proxy.port} responded with #{response.code} #{response.message}"
    break
  end
end
You should also probably do some error checking with begin ... rescue ... end each time you make a request. If you don't do any error checking and a proxy is down, control will never reach the line that checks for response.code == "302" -- the program will just fail with some type of connection timeout error.
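For instance, the same loop with dead proxies skipped might look roughly like this (the exact exception classes depend on how a proxy fails, so treat it as a sketch):
proxies.each do |raw_proxy|
  proxy = URI.parse raw_proxy
  http_object = Net::HTTP.new(destination.host, destination.port, proxy.host, proxy.port)
  begin
    response = http_object.request(request_object)
  rescue Errno::ECONNREFUSED, Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    warn "#{proxy.host}:#{proxy.port} failed: #{e.class}: #{e.message}"
    next  # skip the dead proxy and try the next one
  end
  if response.code == "302"
    puts "#{proxy.host}:#{proxy.port} responded with #{response.code} #{response.message}"
    break
  end
end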
See the Net::HTTPHeader docs for other methods that can be used to customize the Net::HTTP::Post object.
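For example, a few of those methods applied to the request object built above (the values are just placeholders):
request_object['User-Agent'] = 'my-crawler/1.0'   # placeholder user agent
request_object['Accept-Language'] = 'en-US'
request_object.basic_auth('user', 'secret')       # only if the destination requires basic auth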
I'm trying to write a Ruby script to use the API of the image gallery site Piwigo; this requires you to log in first with one HTTP POST and then upload an image with another POST.
This is what I've got so far, but it doesn't work and just returns a 401 error. Can anyone see where I'm going wrong?
require 'net/http'
require 'pp'
http = Net::HTTP.new('mydomain.com', 80)
path = '/piwigo/ws.php'
data = 'method=pwg.session.login&username=admin&password=password'
resp, data = http.post(path, data, {})
if (resp.code == '200')
  cookie = resp.response['set-cookie']
  data = 'method=pwg.images.addSimple&image=image.jpg&category=7'
  headers = { "Cookie" => cookie }
  resp, data = http.post(path, data, headers)
  puts resp.code
  puts resp.message
end
Which gives this response when run:
$ ruby piwigo.rb
401
Unauthorized
There is a Perl example on their API page which I was trying to convert to Ruby: http://piwigo.org/doc/doku.php?id=dev:webapi:pwg.images.addsimple
By using the nice_http gem: https://github.com/MarioRuiz/nice_http
NiceHttp will take care of your cookies so you don't have to do anything
require 'nice_http'
path = '/piwigo/ws.php'
data = '?method=pwg.session.login&username=admin&password=password'
http = NiceHttp.new('http://example.com')
resp = http.get(path + data)
if resp.code == 200
  resp = http.post(path)
  puts resp.code
  puts resp.message
end
Also if you want you can add your own cookies by using http.cookies
You can use a gem called mechanize. It handles cookies transparently.
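A minimal sketch with mechanize (the Piwigo endpoint and parameters are copied from the question, so adjust them as needed):
require 'mechanize'
agent = Mechanize.new
# mechanize stores the session cookie from this response automatically
agent.post('http://mydomain.com/piwigo/ws.php',
           'method'   => 'pwg.session.login',
           'username' => 'admin',
           'password' => 'password')
# the next request reuses the cookie jar, so the session stays authenticated
resp = agent.post('http://mydomain.com/piwigo/ws.php',
                  'method'   => 'pwg.images.addSimple',
                  'image'    => 'image.jpg',
                  'category' => '7')
puts resp.code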
I've got a piece of Ruby code that I've written to follow a series of potential redirects until it reaches the final URL:
def self.obtain_final_url_in_chain url
  logger.debug "Following '#{url}'"
  uri = URI url
  http = Net::HTTP.start uri.host, uri.port
  response = http.request_head url
  case response.code
  when "301"
    obtain_final_url_in_chain response['location']
  when "302"
    obtain_final_url_in_chain response['location']
  else
    url
  end
end
You call obtain_final_url_in_chain with the url and it should eventually return the final url.
I'm trying it with this URL: http://feeds.5by5.tv/master
Based on http://web-sniffer.net/ this should be redirected to http://5by5.tv/rss as a result of a 301 redirect. Instead though I get a 404 for http://feeds.5by5.tv/master.
The above code is returning 200 for other URLs though (e.g. http://feeds.feedburner.com/5by5video).
Does anyone know why this is happening please? It's driving me nuts!
Thanks.
According to the docs for Net::HTTP#request_head, you want to pass the path, not the full url, as the first parameter.
With that and a few other changes, here's one way to rewrite your method:
def obtain_final_url_in_chain(url)
  uri = URI url
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.request_head uri.path
  end
  case response
  when Net::HTTPRedirection
    obtain_final_url_in_chain response['location']
  else
    url
  end
end
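For example (what it returns depends on how the feed host happens to redirect when you run it):
final_url = obtain_final_url_in_chain("http://feeds.5by5.tv/master")
puts final_url   # with the 301 in place, this should print the redirect target, e.g. http://5by5.tv/rss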