Access session cookie in Scrapy spiders

I am trying to access the session cookie within a spider. I first log in to a social network from a spider:
def parse(self, response):
    return [FormRequest.from_response(response,
                                      formname='login_form',
                                      formdata={'email': '...', 'pass': '...'},
                                      callback=self.after_login)]
In after_login, I would like to access the session cookies in order to pass them to another module (Selenium here) to further process the page with an authenticated session.
I would like something like this:
def after_login(self, response):
    # process response
    .....
    # access the cookies of that session to access another URL in the
    # same domain with the authenticated session.
    # Something like:
    session_cookies = XXX.get_session_cookies()
    data = another_function(url, session_cookies)
Unfortunately, response.cookies does not return the session cookies.
How can I get the session cookies? I was looking at the cookies middleware (scrapy.contrib.downloadermiddleware.cookies) and at scrapy.http.cookies, but there doesn't seem to be any straightforward way to access the session cookies.
Some more details about my original question:
Unfortunately, I tried your idea but I didn't see the cookies, although I know for sure that they exist, since the scrapy.contrib.downloadermiddleware.cookies middleware does print out the cookies! These are exactly the cookies that I want to grab.
So here is what I am doing:
The after_login(self, response) method receives the response variable after proper authentication, and then I access a URL with the session data:
def after_login(self, response):
    # testing to see if I can get the session cookies
    cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
    cookieJar.extract_cookies(response, response.request)
    cookies_test = cookieJar._cookies
    print "cookies - test:", cookies_test
    # URL access with authenticated session
    url = "http://site.org/?id=XXXX"
    request = Request(url=url, callback=self.get_pict)
    return [request]
As the output below shows, there are indeed cookies, but I fail to capture them with cookieJar:
cookies - test: {}
2012-01-02 22:44:39-0800 [myspider] DEBUG: Sending cookies to: <GET http://www.facebook.com/profile.php?id=529907453>
Cookie: xxx=3..........; yyy=34.............; zzz=.................; uuu=44..........
So I would like to get a dictionary containing the keys xxx, yyy, etc. with the corresponding values.
Thanks :)

A classic example is a login server that provides a new session ID after a successful login. This new session ID should then be sent with the next request.
Here is code, picked up from the source, that seems to work for me. It reads the value of the first cookie in the first Set-Cookie header:
print 'cookie from login', response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
Code:
def check_logged(self, response):
    # value of the first cookie in the first Set-Cookie header
    tmpCookie = response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
    print 'cookie from login', tmpCookie
    cookieHolder = dict(SESSION_ID=tmpCookie)
    #print response.body
    if "my name" in response.body:
        yield Request(url="<<new url for another server>>",
                      cookies=cookieHolder,
                      callback=self."<<another function here>>")
    else:
        print "login failed"
        return
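One caveat with split("=")[1] is that it truncates a cookie value that itself contains "=" (base64-style session IDs often do). A minimal sketch of a more forgiving parse, assuming the header looks like the usual "name=value; attr; attr" form; the function name and the sample header are made up for illustration:

```python
def session_cookie_from_set_cookie(header_value):
    """Return (name, value) for the cookie carried by one Set-Cookie header."""
    # Header values may arrive as bytes; decode them before splitting
    if isinstance(header_value, bytes):
        header_value = header_value.decode("latin-1")
    # Keep only the "name=value" part, dropping attributes such as Path
    pair = header_value.split(";", 1)[0]
    # partition() splits on the first "=" only, so "=" inside the value survives
    name, _, value = pair.partition("=")
    return name.strip(), value.strip()


# Hypothetical header whose value contains "=" characters
name, value = session_cookie_from_set_cookie(b"SESSION_ID=abc=123==; Path=/; HttpOnly")
print(name, value)
```

The resulting pair can be turned into the same kind of dict as cookieHolder with dict([(name, value)]).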

Maybe this is overkill, but I don't know how you are going to use those cookies, so it might be useful (an excerpt from real code; adapt it to your case):
from scrapy.http.cookies import CookieJar
class MySpider(BaseSpider):
    def parse(self, response):
        cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
        cookieJar.extract_cookies(response, response.request)
        request = Request(nextPageLink, callback=self.parse2,
                          meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
        cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
CookieJar has some useful methods.
If you still don't see the cookies, maybe they are not there?
UPDATE:
Looking at the CookiesMiddleware code:
class CookiesMiddleware(object):
    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = request.headers.getlist('Cookie')
            if cl:
                msg = "Sending cookies to: %s" % request + os.linesep
                msg += os.linesep.join("Cookie: %s" % c for c in cl)
                log.msg(msg, spider=spider, level=log.DEBUG)
So, try request.headers.getlist('Cookie')

This works for me:
response.request.headers.get('Cookie')
It seems to return all the cookies that were added to the request by the middleware, session cookies or otherwise.
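The value that comes back is a single "name=value; name=value" string (possibly bytes, or None when no cookies were sent), not the dictionary the question asks for. A minimal sketch of the conversion, using only the standard library; the helper name and the sample header are assumptions for illustration:

```python
def cookie_header_to_dict(header_value):
    """Split a Cookie request header ('a=1; b=2') into {'a': '1', 'b': '2'}."""
    if header_value is None:
        return {}
    # Header values may arrive as bytes; decode them before splitting
    if isinstance(header_value, bytes):
        header_value = header_value.decode("latin-1")
    cookies = {}
    for pair in header_value.split(";"):
        # Split on the first "=" only, so values containing "=" stay intact
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical header value, shaped like the log output in the question
print(cookie_header_to_dict(b"xxx=3abc; yyy=34def; zzz=q=1"))
```

Inside a callback this would be called as cookie_header_to_dict(response.request.headers.get('Cookie')).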

As of 2021 (Scrapy 2.5.1), this is still not particularly straightforward. But you can access downloader middlewares (like CookiesMiddleware) from within a spider via self.crawler.engine.downloader:
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
def after_login(self, response):
    downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
    # pick the cookies middleware out of the enabled downloader middlewares
    cookies_mw = next(mw for mw in downloader_middlewares if isinstance(mw, CookiesMiddleware))
    # the jar for this request's 'cookiejar' meta key (None is the default jar)
    jar = cookies_mw.jars[response.meta.get('cookiejar')].jar
    cookies_list = [vars(cookie) for domain in jar._cookies.values() for path in domain.values() for cookie in path.values()]
    # or
    cookies_dict = {cookie.name: cookie.value for domain in jar._cookies.values() for path in domain.values() for cookie in path.values()}
    ...
Both output formats above can be passed to other requests using the cookies parameter.
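As far as I can tell, the .jar attribute used above is a standard-library http.cookiejar.CookieJar, which can be iterated directly, so the private _cookies mapping may not be needed. A sketch under that assumption, run here against a plain standard-library jar; the helper names and cookie values are made up, and the second helper only shapes the data the way Selenium's driver.add_cookie() expects without calling Selenium:

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, path="/"):
    """Build a stdlib Cookie for the demo (all values here are made up)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def jar_to_dict(jar):
    """Flatten a cookie jar into {name: value} using public iteration."""
    return {cookie.name: cookie.value for cookie in jar}


def jar_to_selenium_cookies(jar):
    """Shape cookies as the dicts accepted by Selenium's driver.add_cookie()."""
    return [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in jar
    ]


jar = CookieJar()
jar.set_cookie(make_cookie("xxx", "3abc", "site.org"))
jar.set_cookie(make_cookie("yyy", "34def", "site.org"))
print(jar_to_dict(jar))
print(jar_to_selenium_cookies(jar))
```

Note that Selenium generally only accepts a cookie for the domain of the page currently loaded, so the browser has to visit the site before the cookies are added.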

Related

Reading Withings API ruby

I have been trying for days to pull down activity data from the Withings API using the OAuth Ruby gem. Regardless of what method I try, I consistently get back a 503 error response (not enough params), even though I copied the example URI from the documentation, having of course swapped out the userid. Has anybody had any luck with this in the past? I hope it is just something stupid I am doing.
class Withings
  API_KEY = 'REMOVED'
  API_SECRET = 'REMOVED'
  CONFIGURATION = { site: 'https://oauth.withings.com', request_token_path: '/account/request_token',
                    access_token_path: '/account/access_token', authorize_path: '/account/authorize' }
  before do
    @consumer = OAuth::Consumer.new API_KEY, API_SECRET, CONFIGURATION
    @base_url ||= "#{request.env['rack.url_scheme']}://#{request.env['HTTP_HOST']}#{request.env['SCRIPT_NAME']}"
  end
  get '/' do
    @request_token = @consumer.get_request_token oauth_callback: "#{@base_url}/access_token"
    session[:token] = @request_token.token
    session[:secret] = @request_token.secret
    redirect @request_token.authorize_url
  end
  get '/access_token' do
    @request_token = OAuth::RequestToken.new @consumer, session[:token], session[:secret]
    @access_token = @request_token.get_access_token oauth_verifier: params[:oauth_verifier]
    session[:token] = @access_token.token
    session[:secret] = @access_token.secret
    session[:userid] = params[:userid]
    redirect "#{@base_url}/activity"
  end
  get '/activity' do
    @access_token = OAuth::AccessToken.new @consumer, session[:token], session[:secret]
    response = @access_token.get("http://wbsapi.withings.net/v2/measure?action=getactivity&userid=#{session[:userid]}&startdateymd=2014-01-01&enddateymd=2014-05-09")
    JSON.parse(response.body)
  end
end
For other API endpoints I get an error response of 247 - The userid provided is absent, or incorrect. This is really frustrating. Thanks
So I figured out the answer after a copious amount of Googling and gaining a better understanding of both the Withings API and the OAuth library I was using. Basically, Withings uses query strings to pass in API parameters. I thought I was passing these parameters correctly when making API calls, but apparently I needed to explicitly set the OAuth library to use the query string scheme, like so:
http_method: :get, scheme: :query_string
This is appended to my OAuth consumer configuration and all worked fine immediately.
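For anyone wondering what the query string scheme changes: the oauth_* parameters travel in the URL next to the API parameters instead of in an Authorization header. Below is a rough Python sketch of OAuth 1.0 HMAC-SHA1 signing done that way, using only the standard library. It is an illustration of the general scheme, not the Ruby gem's implementation; the function names and all credentials are made up, and the nonce and timestamp are passed in so the result is reproducible.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def enc(text):
    """Percent-encode as OAuth 1.0 requires (only unreserved characters stay)."""
    return quote(str(text), safe="~")


def sign_in_query_string(method, url, params, consumer_key, consumer_secret,
                         token, token_secret, nonce, timestamp):
    """Return a URL whose query string carries both API and OAuth parameters."""
    all_params = dict(params)
    all_params.update({
        "oauth_consumer_key": consumer_key,
        "oauth_token": token,
        "oauth_nonce": nonce,
        "oauth_timestamp": str(timestamp),
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_version": "1.0",
    })
    # Signature base string: METHOD & encoded URL & encoded, sorted parameters
    pairs = sorted((enc(k), enc(v)) for k, v in all_params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    base_string = "&".join([method.upper(), enc(url), enc(normalized)])
    # Signing key: encoded consumer secret & encoded token secret
    key = f"{enc(consumer_secret)}&{enc(token_secret)}".encode("ascii")
    digest = hmac.new(key, base_string.encode("ascii"), hashlib.sha1).digest()
    all_params["oauth_signature"] = base64.b64encode(digest).decode("ascii")
    # Everything, including the signature, goes into the query string
    query = "&".join(f"{enc(k)}={enc(v)}" for k, v in all_params.items())
    return url + "?" + query


# Hypothetical credentials and userid; endpoint and dates are from the question
signed_url = sign_in_query_string(
    "GET", "http://wbsapi.withings.net/v2/measure",
    {"action": "getactivity", "userid": "12345",
     "startdateymd": "2014-01-01", "enddateymd": "2014-05-09"},
    consumer_key="my-key", consumer_secret="my-secret",
    token="my-token", token_secret="my-token-secret",
    nonce="abc123", timestamp=1400000000,
)
print(signed_url)
```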

How do you save a persistent session with a cookie in Ruby?

All I'm looking for is something very simple. I want to check whether the client has my cookie in their browser. If so, I want to load the last session ID from the cookie and access the session on the server side. If not, I want to create the server-side session and store the session ID in the cookie.
So my question has two parts. Which parts of the session code/variable/ID are needed to retrieve it in the future? And what's the simplest method for achieving a continuous session via cookie/session?
I am not using rails. It's a standard CGI script.
#!/usr/bin/ruby
require 'cgi'
require 'cgi/session'
require 'cgi/session/pstore'
require 'json'
cgi = CGI.new("html4")
# !! TODO !! if cookie has "session_id" then open sess from "session_id"
sess = CGI::Session.new( cgi,
  'database_manager' => CGI::Session::PStore, # use PStore
  'session_key' => '_rb_sess_id', # custom session key
  'session_expires' => Time.now + 30 * 60, # 30 minute timeout
  'prefix' => 'pstore_sid_')
if sess.has_key? "friend_list"
  @friend_list = sess['friend_list']
  sess['access_count'] += 1
else
  @friend_list = JSON.parse( IO.read( "fbget.json" ) )
  sess['friend_list'] = @friend_list
  sess['access_count'] = 1
end
I wanted to save the json list on the server side via a session instance. I figure putting that in a cookie would be a very bad security mistake.
Through my searching online I haven't found any non-rails info on saving session access info via the cookie. Only how to use each individually.
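Independent of the language, the flow being asked about is: read the session ID from the cookie, look it up in a server-side store, and only create a new session (and send a new cookie) when the lookup fails. A minimal Python sketch of that flow, assuming an in-memory dict as the store; the function name and store layout are made up, and only the cookie name comes from the script above:

```python
import uuid
from http.cookies import SimpleCookie

SESSION_KEY = "_rb_sess_id"  # cookie name taken from the script above


def load_or_create_session(cookie_header, store):
    """Return (session_id, session_dict, set_cookie_header_or_None)."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(SESSION_KEY)
    if morsel is not None and morsel.value in store:
        # Known ID: reuse the server-side data, no new cookie needed
        return morsel.value, store[morsel.value], None
    # Missing or unknown ID: create a fresh session and tell the client
    session_id = uuid.uuid4().hex
    store[session_id] = {"access_count": 0}
    out = SimpleCookie()
    out[SESSION_KEY] = session_id
    out[SESSION_KEY]["max-age"] = 30 * 60  # 30 minute timeout
    out[SESSION_KEY]["httponly"] = True
    return session_id, store[session_id], out.output(header="Set-Cookie:")


store = {}
# First visit: no cookie yet, so a session is created and a header is returned
sid, session, header = load_or_create_session(None, store)
session["access_count"] += 1
# Later visit: the browser sends the cookie back and the same session is found
sid2, session2, header2 = load_or_create_session(f"{SESSION_KEY}={sid}", store)
print(sid == sid2, session2["access_count"], header2)
```

Only the opaque ID goes into the cookie; data such as the friend list stays in the server-side store, which matches the security concern raised above.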

Selenium Webdriver getting a cookie value

I am trying to get a cookie value but keep getting an error of <Selenium::WebDriver::Driver:0x13a0e0e8 browser=:firefox>
I am calling
@browser.cookie_named("configsession").each do |cookie|
  puts cookie[:name]
end
Is there something I'm doing wrong?
The methods for working with cookies are defined in Selenium::WebDriver::Options (see the API docs).
To access these cookie methods, you need to call the manage method on the driver:
@browser.manage
To get a cookie based on its name, you need to do:
@browser.manage.cookie_named("configsession")
Note that cookie_named returns the single cookie that matches, as a hash. Therefore, you can get values of the cookie by doing:
cookie = @browser.manage.cookie_named("configsession")
cookie[:name]
#=> "configsession"
If you want to get the names of all the cookies on the page, use the all_cookies method:
driver.manage.all_cookies.each do |cookie|
  puts cookie[:name]
end
This worked for me (Java):
Cookie cookie = driver.manage().getCookieNamed("sitename.session");
String cookieVal = cookie.getValue();
Set<Cookie> cook = driver.manage().getCookies();
for (Cookie cooks : cook) {
    System.out.println(cooks.getName());
}
Cookie t = driver.manage().getCookieNamed("_gid");
if (t != null) {
    String s1 = t.getValue();
    System.out.println("The Cookie value is : " + s1);
}

How to parse HTTP response using Ruby

I've written a short snippet which sends a GET request, performs auth, and checks whether there is a 200 OK response (when auth succeeds). Now, one thing I noticed with this specific GET request is that the response is always 200, irrespective of whether auth succeeds or not.
The difference is in the HTTP response. When auth fails, the first response is 200 OK, just the same as when auth succeeds, and then there is a second step: the page gets redirected back to the login page.
I am just trying to make a quick script which can check user/password pairs against my web application and tell me which auth attempts passed and which didn't.
How should I check this? The sample code is like this:
def funcA(u, p)
  print_A("#{ip} - '#{u}' : '#{p}' - Pass")
end
def try_login(u, p)
  # double quotes so that #{u} and #{p} are interpolated
  path = "/index.php?uuser=#{u}&ppass=#{p}"
  r = send_request_raw({
    'URI' => path,
    'method' => 'GET'
  })
  if (r and r.code.to_i == 200)
    check = true
  end
  if check == true
    funcA(u, p)
  else
    out = "#{ip} - '#{u}' - Fail"
    print_B(out)
  end
  return check, r
end
end
Update:
I also tried adding a new check for matching a 'Success/Fail' keyword coming in HTTP response. It didn't work either. But I now noticed that the response coming back seems to be in a different form. The Content-Type in response is text/html;charset=utf-8 though. And I am not doing any parsing so it is failing.
Success Response is in form of:
{"param1":1,"param2"="Auth Success","menu":0,"userdesc":"My User","user":"uuser","pass":"ppass","check":"success"}
Fail response is in form of:
{"param1":-1,"param2"="Auth Fail","check":"fail"}
So now I need some pointers on how to parse this response.
Many Thanks.
I do this with "net/http":
require 'net/http'
uri = URI(url)
connection = Net::HTTP.start(uri.host, uri.port)
@response = Net::HTTP.get_response(URI(url))
@httpStatusCode = @response.code
connection.finish
If there's a redirect after a 200, then it must be a JavaScript or meta redirect. So just look for that in the response body.
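Since the status code is 200 either way, the decision has to come from the body. The bodies shown in the update look like JSON with a "check" field, so one option is to parse the body and test that field. A small Python sketch of the idea, assuming the body is valid JSON (the samples below use ":" throughout; the "param2"= form as pasted in the question would not parse) and with a made-up function name:

```python
import json


def auth_succeeded(body):
    """Return True when the JSON body reports check == 'success'."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML login page)
        return False
    return isinstance(data, dict) and data.get("check") == "success"


# Sample bodies modelled on the ones in the question, written as valid JSON
ok_body = '{"param1":1,"param2":"Auth Success","menu":0,"check":"success"}'
fail_body = '{"param1":-1,"param2":"Auth Fail","check":"fail"}'
print(auth_succeeded(ok_body), auth_succeeded(fail_body))
```

The Content-Type header does not matter here, because the check only looks at the body text.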

How to pass cookies from one page to another using curl in Ruby?

I am doing a video crawler in ruby. In there I have to log in to a page by enabling cookies and download pages. For that I am using the CURL library in ruby. I can successfully log in, but I can't download the pages inside that with curl. How can I fix this or download the pages otherwise?
My code is
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_post(1st url,field)
curl.perform
curl = Curl::Easy.perform(2nd url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_get
code = curl.body_str
What I've seen in writing my own similar "post-then-get" script is that ruby/Curb (I'm using version 0.7.15 with ruby 1.8) seems to ignore the cookiejar/cookiefile fields of a Curl::Easy object. If I set either of those fields and the http_post completes successfully, no cookiejar or cookiefile file is created. Also, curl.cookies will still be nil after your curl.http_post, however, the cookies ARE set within the curl object. I promise :)
I think where you're going wrong is here:
curl = Curl::Easy.perform(2nd url)
The curb documentation states that this creates a new object. That new object doesn't have any of your existing cookies set. If you change your code to look like the following, I believe it should work. I've also removed the curl.perform for the first url since curl.http_post already implicitly does the "perform". You were basically http_post'ing twice before trying your http_get.
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.http_post(1st url,field)
curl.url = 2nd url
curl.http_get
code = curl.body_str
If this still doesn't seem to be working for you, you can verify if the cookie is getting set by adding
curl.verbose = true
Before
curl.http_post
Your Curl::Easy object will dump all the headers that it gets in the response from the server to $stdout, and somewhere in there you should see a line stating that it added/set a cookie. I don't have any example output right now but I'll try to post a follow-up soon.
HTTPClient automatically enables cookies, as does Mechanize.
From the HTTPClient docs:
clnt = HTTPClient.new
clnt.get_content(url1) # receives Cookies.
clnt.get_content(url2) # sends Cookies if needed.
Posting a form is easy too:
body = { 'keyword' => 'ruby', 'lang' => 'en' }
res = clnt.post(uri, body)
Mechanize makes this sort of thing really simple (It will handle storing the cookies, among other things).
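The same principle (keep one client object alive so the cookies from the login are sent with later requests) can be sketched in Python with only the standard library. The helper name is made up, and the URLs and form fields in the comments are placeholders, so the network calls are left commented out:

```python
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_session():
    """Return (opener, jar); every request made via opener shares the jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_session()
# Hypothetical usage (URLs and form fields are placeholders):
# opener.open("https://example.com/login", data=b"user=me&pass=secret")  # POST, stores cookies
# page = opener.open("https://example.com/videos").read()  # sends the cookies back
print(len(jar))  # number of cookies currently stored
```

Creating a second opener for the second URL would start with an empty jar, which is the same mistake as calling Curl::Easy.perform on a new object.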
