I'm trying to log into Google so that I can scrape and migrate a private Google Group.
It doesn't seem to log in over SSL. Any ideas appreciated. I'm using Mechanize; the code is below:
group_signin_url = "https://login page to Google, with referrer URL to a private group here"
user = ENV['GOOGLE_USER']
password = ENV['GOOGLE_PASSWORD']

scraper = Mechanize.new
scraper.user_agent = Mechanize::AGENT_ALIASES["Linux Firefox"]
scraper.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE # skip SSL certificate verification

page = scraper.get group_signin_url

# Fill in the Google sign-in form and submit it
google_form = page.form
google_form.Email = user
google_form.Passwd = password
group_page = scraper.submit(google_form, google_form.buttons.first)
pp group_page
I worked with Ian (the OP) on this, and we felt we should close this thread with some answers based on what we found after spending more time on the problem.
1) You can't scrape a Google Group with Mechanize. We managed to get logged in, but the content of Google Group pages is all rendered in-browser, so HTTP requests such as those issued by Mechanize come back with a few links and no actual content.
We found that we could get the page content by using Selenium (we drove Firefox via Selenium's Ruby bindings).
2) The HTML element IDs/classes in Google Groups are obfuscated, but we found that these Selenium commands will pull out the bits you need (until Google changes them); there's a short sketch after the list:
message snippets (click on them to expand messages): find_elements(:class, 'GFP-UI5CCLB')
elements with the name of the author: find_elements(:class, 'GFP-UI5CA1B')
elements with the content of the post: find_elements(:class, 'GFP-UI5CCKB')
elements containing the date: find_elements(:class, 'GFP-UI5CDKB') (then use the element's [:title] attribute for a full-length date string)
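To make this concrete, here is a minimal sketch of pulling those elements out with Selenium's Ruby bindings. The group URL is a hypothetical placeholder, it assumes the browser session is already signed in, and it relies on the obfuscated class names listed above (so it will break whenever Google regenerates them):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'https://groups.google.com/forum/#!forum/your-group-name' # hypothetical group URL

# Click each message snippet to expand the full message
driver.find_elements(:class, 'GFP-UI5CCLB').each(&:click)

authors  = driver.find_elements(:class, 'GFP-UI5CA1B').map(&:text)
contents = driver.find_elements(:class, 'GFP-UI5CCKB').map(&:text)
dates    = driver.find_elements(:class, 'GFP-UI5CDKB').map { |e| e.attribute('title') }

authors.zip(dates, contents).each do |author, date, content|
  puts "#{author} (#{date}): #{content}"
end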
3) I have some Ruby code here which scrapes the content programmatically and uploads it into a Discourse forum (which is what we were migrating to).
It's hacky, but it kind of works. I recently migrated two commercially important Google Groups using this script. I'm up for taking on 'We Scrape Your Google Group' type work; please PM me.
I'm trying to make an app which iterates through my own posts and gets a list of the users who favorited each post. Afterwards I would like the application to follow each of those users, if I am not already following them. I am using Ruby for this.
This is my code now:
@client = Twitter::REST::Client.new(config)

# Warning: this redefines the constant to disable SSL certificate verification
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

user = @client.user
tweets = @client.user_timeline(user).take(20)
num_of_tweets = tweets.length
puts "tweets found: #{tweets.length}"

tweets.each do |tweet|
  puts tweet # iterating through my posts here
end

Any suggestions?
That information isn't exposed in the Twitter API, either through a timeline collection or via the endpoint representing a single tweet. That's why the twitter gem, which provides a usable interface around the REST API, cannot give you what you're after.
Third-party sites such as Favstar do display that information, but as far as I know their API does not expose the relevant users in any manageable way.
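For the second half of your plan, following users, the twitter gem does work; here is a minimal sketch, assuming you already have a list of user IDs from some other source (since the favoriters themselves aren't available):

require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
  config.access_token        = ENV['TWITTER_ACCESS_TOKEN']
  config.access_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
end

user_ids  = [12345, 67890]          # hypothetical IDs obtained elsewhere
following = client.friend_ids.to_a  # IDs of the accounts you already follow

user_ids.each do |id|
  client.follow(id) unless following.include?(id)
end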
Previously I used Mechanize for parsing, but now I'm parsing a website that uses JavaScript, which Mechanize doesn't support, so I switched to Selenium. I have to take information about companies from this website, but I can get the information only after clicking a JavaScript link. I did that with Selenium: my parser clicks the link, then collects the information, and here the problems appear.
As you understand, I need to save the collected information to the database, and I can do this properly only if the information is stored in separate variables (e.g. address=.., phone=.., email=.., etc.). I select the necessary information with SelectorGadget and Selenium collects it (driver.find_element(:css, ..)), but the information about all the companies is located in a single selector (.p2 div), so I cannot save the location in one variable, the phone in another variable, and so on. So my question is: is it possible to divide this text and save the pieces in separate variables?
Photos that illustrate the process:
i.imgur.com/J5dcGZD.png
i.imgur.com/MaBWICZ.png
i.imgur.com/ZDNXhLt.png
Photo with part of the HTML:
http://i.imgur.com/NUa1X97.png
Here is an example page of the site. The site is in Russian, so translate it with Google Translate.
The parser itself (it saves a bunch of text from each company into the contacts variable):
require 'rubygems'
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get "http://www.ypag.ru/cat/komp249/page3880.html"

loop do
  # Expand every company entry by clicking its JavaScript link
  driver.find_elements(:css, ".p2 div a").each { |link| link.click }

  # The selector returns name, region and contacts elements in sequence,
  # so take them in groups of three (assuming one group per company)
  driver.find_elements(:css, ".p3 a, .firm, .p2 div").each_slice(3) do |name, region, contacts|
    print name.text.center(100)
    puts region.text
    puts contacts.text
  end

  # Follow the "next page" link; stop after the last page
  link = driver.find_element(:xpath, "/html/body/table[5]/tbody/tr/td/a[2]")[:href]
  break if link == "http://www.ypag.ru/cat/komp249/page3780.html"
  driver.get link
end
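For the actual question, splitting that combined text into separate variables, here is a minimal sketch. The assumed layout (address on the first line, a phone-like line, a line containing an e-mail address) is a guess based on the screenshots; adjust the patterns to match the real text:

lines = contacts.text.split("\n").map(&:strip).reject(&:empty?)

address = lines.first                                 # assume the address comes first
phone   = lines.find { |l| l =~ /\+?[\d\s()-]{6,}/ }  # first line that looks like a phone number
email   = lines.find { |l| l.include?('@') }          # first line containing an e-mail address

puts "address: #{address}, phone: #{phone}, email: #{email}"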
Has anyone battled 500 errors with the Google Spreadsheet API for Google domains?
I have copied the code in this post (2-legged OAuth): http://code.google.com/p/google-gdata/source/browse/trunk/clients/cs/samples/OAuth/Program.cs, substituted in my domain's API ID and secret and my own credentials, and it works.
So it appears my domain setup is fine (at least for the Contacts/Calendar APIs).
However, swapping the code out for a new Spreadsheet service/query instead, it fails: the remote server returned an internal server error (500).
var ssq = new SpreadsheetQuery();
ssq.Uri = new OAuthUri("https://spreadsheets.google.com/feeds/spreadsheets/private/full", "me", "mydomain.com");
ssq.OAuthRequestorId = "me@mydomain.com"; // can do this instead of using OAuthUri for queries
var feed = ssservice.Query(ssq); // boom: 500
Console.WriteLine("ss:" + feed.Entries.Count);
I am befuddled.
I had to make sure to use the "correct" class:
not
//using SpreadsheetQuery = Google.GData.Spreadsheets.SpreadsheetQuery;
but
using SpreadsheetQuery = Google.GData.Documents.SpreadsheetQuery;
It seems you need the Google Docs API to query for spreadsheets, but the Spreadsheet API to query inside a spreadsheet; nowhere on the internet until now will you find this undeniably important titbit. Google sucks hard on that one.
I am trying to use Ruby and Mechanize to parse data on Foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But then, when I run this code, the following error popped up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that the two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open Foursquare's page.
Has anyone had this problem before, where the form field names change every time you open a webpage? How should I solve it?
Our project is about parsing data on Foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
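As a minimal sketch of that route (the method names follow the foursquare2 README; the keys are assumed to be the ones you get when registering your app):

require 'foursquare2'

client = Foursquare2::Client.new(
  :client_id     => ENV['FOURSQUARE_CLIENT_ID'],
  :client_secret => ENV['FOURSQUARE_CLIENT_SECRET']
)

# Userless venue search; user-specific calls need an OAuth token instead
venues = client.search_venues(:ll => '40.7,-74', :query => 'coffee')
venues.venues.each { |venue| puts venue.name }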
Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the name that changes on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However, like dwhalen said, using the REST API is probably a much better way. That's why it's there.
Let me set the stage for what I'm trying to accomplish. In a physics class I'm taking, my teacher always likes to brag about how impossible it is to cheat in her class, because all of her assignments are done through WebAssign. The way WebAssign works is this: everyone gets the same questions, but the numbers used in each question are random variables, so each student has different numbers and thus a different answer. So I've been writing Ruby scripts that solve the questions for people when they input their specific numbers.
I would like to automate this process using Mechanize. I've used Mechanize plenty of times before, but I'm having trouble logging in to the site. I submit the form and it returns the same page I was just on. You can take a look at the site's source code at http://webassign.net, and I've also tried the login at http://webassign.net/login.html with no luck either.
Let me follow all of this up with some Ruby code that doesn't do what I want it to:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.webassign.net/login.html")
form = page.forms.last
puts "Enter your username"
form.WebAssignUsername = gets.chomp
puts "Enter your password (Don't worry, we don't save this)"
form.WebAssignPassword = gets.chomp
form.WebAssignInstitution = "trinityvalley.tx"
form.submit #=> Returns original page
If anyone really takes an interest in getting this to work, I would be more than happy to send them a working username and password.
The site could be checking that the Login post variable is set (see the login button). Try adding form.Login = "Login".
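A minimal sketch of that suggestion, reusing the form from the question (form['Login'] = ... is just another way of setting a field by name):

form['Login'] = 'Login' # set the Login post variable the button would normally send
page = form.submit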
Have you tried to use agent.submit(form, form.buttons.first) instead of form.submit?
This worked for me when I tried to submit a form. I tried using form.submit first and it kept returning the original page.
Try setting the user agent:
agent = Mechanize.new do |a|
a.user_agent_alias = 'Mac Safari'
end
Some sites seem to require that.
Your question is a little ambiguous: you say you're not having any luck, but what is the problem exactly? Are you getting an entirely different response than when you view the page in a browser? If so, do what @cam says and analyze the headers; you can do it in Firefox via an extension, or natively in Chrome. Either way, try to mimic the headers you see in the browser in your Mechanize user agent. Here is a script I used to mimic the iTunes request headers when I was data-mining the App Store:
def mimic_itunes(mech_agent)
  # Add the iTunes-specific headers to every outgoing request
  mech_agent.pre_connect_hooks << lambda { |headers|
    headers[:request]['X-Apple-Store-Front'] = X_APPLE_STOREFRONT
    headers[:request]['X-Apple-Tz'] = X_APPLE_TZ
    headers[:request]['X-Apple-Validation'] = X_APPLE_VALIDATION
  }
  mech_agent.user_agent = 'iTunes/9.1.1 (Windows; Microsoft Windows 7 x64 Business Edition (Build 7600)) AppleWebKit/531.22.7'
  mech_agent
end
Note: the constants in the example are just strings; what they actually are isn't that important, as long as you know you can put any string there.
Using this approach, you should be able to alter/add any headers that the web application might need.
If this is not the problem that you are having, then post more in-depth details of what exactly is happening.