I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible with the ClientLogin interface, so all the code samples are useless.
My Ruby script is embedded at the end of the post. The response to the authentication request is a 200 OK, but there are no cookies in the response headers and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."
I got the same output with wget. See the bash script at the end of this message.
I don't know how to work around this. Am I missing something? Any ideas?
Thanks in advance.
John
Here is the ruby script:
# a ruby script
require 'net/https'
http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'
email = 'john@gmail.com'
password = 'topsecret'
# form inputs from the login page
data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"}
# Post the request and print out the response to retrieve our authentication token
resp, data = http.post(path, data, headers)
puts resp
resp.each {|h, v| puts h+'='+v}
#warning: peer certificate won't be verified in this SSL session
Here is the bash script:
# A bash script for wget; building the arguments as an array keeps the quoting intact
CMD=()
CMD+=(--keep-session-cookies --save-cookies cookies.tmp)
CMD+=(--no-check-certificate)
CMD+=(--post-data='Email=john@gmail.com&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI')
CMD+=(--user-agent='Mozilla')
CMD+=(https://www.google.com/accounts/ServiceLoginAuth)
echo "${CMD[@]}"
wget "${CMD[@]}"
wget --load-cookies="cookies.tmp" "http://groups.google.com/group/mygroup/topics?tsc=2"
Have you tried Mechanize for Ruby?
The Mechanize library automates interaction with websites: you could log in to Google and browse your private Google group, saving what you need.
Here is an example where Mechanize is used for Gmail scraping, and below is a rough sketch of the same idea for your case.
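The sketch below is untested and only illustrates the approach; the login URL and the "Email"/"Passwd" field names are assumptions based on the form your script posts to, and Google may have changed them.

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'

# Load the login page and fill in the form with your credentials.
login_page = agent.get('https://www.google.com/accounts/ServiceLogin')
form = login_page.form_with(:action => /ServiceLoginAuth/)
form['Email']  = 'john@gmail.com'
form['Passwd'] = 'topsecret'
agent.submit(form)

# The agent keeps the session cookies, so the private group page should now load.
group_page = agent.get('http://groups.google.com/group/mygroup/topics?tsc=2')
puts group_page.body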
I did this previously by logging in manually with Firefox and then using Chickenfoot to automate the browsing and scraping.
I found this PHP solution for scraping private Google Groups.
I need a Slack bot that's able to receive and save files sent from Slack chatrooms.
The problem is that Slack doesn't send the file contents, but an array of links pointing to the file. Most of them, including the download link, are private and cannot be accessed by the bot. It does send one public link, but that link points at the file preview, which does not contain the file itself (here's an example).
How can I access uploaded files via the bot?
You can access private URLs from your bot by providing an access token in the HTTP header when you are making your cURL request.
Your token needs to have the scope files.read in order to get access.
The format is:
Authorization: Bearer A_VALID_TOKEN
Replace A_VALID_TOKEN with your Slack access token.
I just tested it with a simple PHP script to retrieve a file by its "url_private" and it works nicely.
Source: Slack API documentation / file object / Authentication
Example for using the Python requests library to fetch an example file:
import requests
url = 'https://slack-files.com/T0JU09BGC-F0UD6SJ21-a762ad74d3'
token = 'xoxp-8853424449-8820034832-8891394196-faf6f0'
requests.get(url, headers={'Authorization': 'Bearer %s' % token})
For those wanting to accomplish this with Bash and cURL, here's a helpful function! It will download the file to the current directory with a filename that uniquely identifies it, even if the file has the same name as others in your file listing.
function slack_download {
URL="$1";
TOKEN="$2"
FILENAME=`echo "$URL" | sed -r 's/.*\/(T.+)\/([^\/]+)$/\1-\2/'`;
curl -o "$FILENAME" -H "Authorization: Bearer $TOKEN" "$URL";
}
# Usage:
# Downloads as ./TJOLLYDAY-FANGBEARD-NSFW_PIC.jpg
slack_download "https://files.slack.com/files-pri/TJOLLYDAY-FANGBEARD/NSFW_PIC.jpg" xoxp-12345678901-01234567890-123456789012-abcdef0123456789abcdef0123456789
Tested with Python 3 - just replace SLACK_TOKEN with your token.
It downloads the file and writes it to an output file.
#!/usr/bin/env python3
# Usage: python3 download_files_from_slack.py <URL>
import os
import re
import sys

import requests

url = " ".join(sys.argv[1:])
token = 'SLACK_TOKEN'
resp = requests.get(url, headers={'Authorization': 'Bearer %s' % token})

# Pull the original filename out of the Content-Disposition header.
disposition = resp.headers['content-disposition']
fname = re.findall("filename=(.*?);", disposition)[0].strip("'").strip('"')

assert not os.path.exists(fname), "File already exists. Please remove/rename and re-run"

with open(fname, mode="wb") as out_file:
    out_file.write(resp.content)
How can I access the source code of a page protected by a (login) form with lynx, w3m, links, etc.?
lynx -source -auth=user:pass domain.com
lynx -source -accept_all_cookies -auth=user:pass domain.com
lynx -accept_all_cookies -auth=user:pass domain.com
all fail for me.
Thanks.
What about:
lynx --source -accept_all_cookies -auth=user:pass "domain.com"
The -- and the quotation marks sometimes make the difference for me.
If there is a login form in front of the page, you cannot get past it with lynx or similar applications.
You should actually write a script. Use something like the Mechanize module, in either Perl or Python!
Something like this:
import mechanize

browser = mechanize.Browser()
browser.open('http://example.com/login')   # URL of the page with the login form
browser.select_form(nr=0)                  # select the first form on the page
browser.form['username'] = 'USERNAME'      # use the form's actual field names
browser.form['password'] = 'PASSWORD'
browser.submit()
I have recently started working for a website that needs to crawl products from several stores/sites...
I am a bit new to Python and Scrapy, which the original code was written in, so when testing crawlers and XPaths I use Scrapy and also open another console to test with Nokogiri (a Ruby gem).
On one particular site I failed to extract some content using Scrapy, but I found that I can get this content from the same URL, using the same XPath, with Nokogiri.
Here is the code snippet used in both cases:
Scrapy
yield Request(product_url,
              headers={'User-Agent': 'curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3'},
              callback=self.parse_item)

def parse_item(self, response):
    script = response.xpath('//script[contains(text(),"var ProductViewJSON")]')
    yield {
        'url': response.url,
        'script length': len(script),
        'script': script,
    }
It produces the following result:
{"url": "http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html", "script length": 0, "script": []},
Nokogiri
require 'nokogiri'
require 'open-uri'
html_data = open('http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html', 'User-Agent' => 'curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3').read
nokogiri_object = Nokogiri::HTML(html_data)
script = nokogiri_object.xpath('//script[contains(text(),"var ProductViewJSON")]')
script.length # produces 1
Can anybody help me explain this? Please note that this Scrapy code used to run; I've just been told that it stopped working, and the main issue was the need to add the headers.
I hope I was clear enough, thanks for your interest :)
Edit
I've tried to parse the URL from the Scrapy shell, using the same User-Agent as the spider's request and as Nokogiri's, and it worked for me: it found the element matching the XPath. But it still doesn't work from within the spider...
The cause for this is the User-Agent you use.
I tried the site with a simple scrapy shell (with the default User-Agent) and I get the following response:
>>> response.body
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html" on this server.<P>\nReference #18.3f496768.1453197808.1ef09a53\n</BODY>\n</HTML>\n'
So change the User-Agent in your Request (or set it once through Scrapy's settings) and you should be ready to gather your information.
As you can see, the server returns an "Access Denied" page for User-Agents that are not a browser -- just like your cURL agent.
If I start the shell with the following User-Agent:
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36'
and execute your XPath, I get the following result:
>>> response.xpath('//script[contains(text(),"var ProductViewJSON")]')
[<Selector xpath='//script[contains(text(),"var ProductViewJSON")]' data=u'<script type="text/javascript">\r\n\tvar Pr'>]
I have a Ruby script that posts data to a URL:
require 'httparty'
data = {}
data[:client_id] = '123123'
data[:key] = '123321'
url = "http://someserver.com/endpoint/"
response = HTTParty.post(url, :body => data)
Now I am using Charles to sniff the HTTP traffic. This works great from the browser, but not from the terminal where I run my script:
$ ruby MyScript.rb
How can I tell Ruby or my Terminal.app to use the Charles proxy at http://localhost:88888?
Update: Another solution would be to see the request before it is sent, so that I would not necessarily need the proxy.
Setting the proxy as timmah suggested should work.
Anyway, 88888 is not a valid port! I think you want to use 8888 (the Charles proxy default port).
So the right commands would be:
export http_proxy=localhost:8888
ruby MyScript.rb
If your script were to use https://, you would also (or instead) need to specify an HTTPS proxy like so:
export https_proxy=localhost:8888
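If you'd rather configure the proxy inside the script than via environment variables, HTTParty also accepts per-request proxy options; here's a minimal sketch, assuming Charles is listening on localhost:8888:

require 'httparty'

data = { :client_id => '123123', :key => '123321' }
url = "http://someserver.com/endpoint/"

# Route this request through the local Charles proxy.
response = HTTParty.post(url,
                         :body => data,
                         :http_proxyaddr => 'localhost',
                         :http_proxyport => 8888)
puts response.code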
I'm trying to write a jira-ruby script (to be used only from the command line) to mark some JIRA issues closed automatically.
I borrowed an example from here because I'm using the 'jira-ruby' gem.
This works, but it pops up a browser asking you to click "Allow" to get the access_token. I would like to do this programmatically, but I don't think the API was built for this purpose. The access_token changes every time, and this script will run periodically in a cron job, so we need a way around this. Any idea what other ways there are to do this?
require 'jira'
@jira = JIRA::Client.new({:site => 'http://localhost:2990', :context_path => '/jira', :consumer_key => 'test-jira', :private_key_file => "rsakey.pem"})
if ARGV.length == 0
# If not passed any command line arguments, open a browser and prompt the
# user for the OAuth verifier.
request_token = @jira.request_token
puts "Opening #{request_token.authorize_url}"
system "open #{request_token.authorize_url}"
puts "Enter the oauth_verifier: "
oauth_verifier = gets.strip
access_token = @jira.init_access_token(:oauth_verifier => oauth_verifier)
puts "Access token: #{access_token.token} secret: #{access_token.secret}"
elsif ARGV.length == 2
# Otherwise assume the arguments are a previous access token and secret.
access_token = @jira.set_access_token(ARGV[0], ARGV[1])
else
# Script must be passed 0 or 2 arguments
raise "Usage: #{$0} [ token secret ]"
end
# Show all projects
projects = @jira.Project.all
projects.each do |project|
puts "Project -> key: #{project.key}, name: #{project.name}"
end
issue = @jira.Issue.find('DEMO-1')
puts issue
I know there's a way to use long-lived access tokens, but I'm not really sure if JIRA supports it.
I was using the jira-ruby gem at first but I found the performance terrible. I ended up just going with curl instead, as I only needed to require the JSON gem, which is less bloated. Have your JIRA administrators create a user with admin access whose password will never change, and then do the following to find "DEMO-1":
require 'json'
username = "admin"
password = "abc123"
issue = JSON.parse(%x[curl -u #{username}:#{password} \"http://jira/rest/api/latest/issue/DEMO-1\"])
Here is a link to the JIRA REST API documentation; just choose the version of JIRA you are using. This will bypass any issues with OAuth and the pop-up.
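Since the original goal was to mark issues closed, the same curl-plus-JSON approach works against the transitions resource as well. This is a rough sketch, not a drop-in solution: the name of the closing transition ("Close Issue", "Done", etc.) depends on your workflow, so it is looked up first instead of hard-coding an id.

require 'json'

username = "admin"
password = "abc123"
issue_key = "DEMO-1"
transitions_url = "http://jira/rest/api/latest/issue/#{issue_key}/transitions"

# Find the id of the closing transition available from the issue's current status.
transitions = JSON.parse(%x[curl -s -u #{username}:#{password} "#{transitions_url}"])
close = transitions["transitions"].find { |t| t["name"] =~ /close|done/i }

# Apply the transition, which marks the issue closed.
payload = { "transition" => { "id" => close["id"] } }.to_json
%x[curl -s -u #{username}:#{password} -X POST -H "Content-Type: application/json" -d '#{payload}' "#{transitions_url}"]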