Wget does not fetch google search results - bash

I noticed that when running wget https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo and similar queries, I don't get the search results, only the Google homepage.
There seems to be some redirect within the Google page. Does anyone know a fix so wget would work?

You can use this curl command to pull Google query results:
curl -sA "Chrome" -L 'http://www.google.com/search?hl=en&q=time' -o search.html
For an HTTPS URL:
curl -k -sA "Chrome" -L 'https://www.google.com/search?hl=en&q=time' -o ssearch.html
The -A option sets a custom user-agent ("Chrome" here) in the request to Google, and -k tells curl to skip HTTPS certificate verification.

#q=foo is your hint, as that's a fragment ID, which never gets sent to the server. I'm guessing you just took this URL from your browser URL-bar when using the live-search function. Since it is implemented with a lot of client-side magic, you cannot rely on it to work; try using Google with live search disabled instead. A URL pattern that seems to work looks like this: http://www.google.com/search?hl=en&q=foo.
However, I do notice that Google returns 403 Forbidden when called naïvely with wget, indicating that they don't want that. You can easily get past it by setting some other user-agent string, but do consider all the implications before doing so on a regular basis.
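Two shell-level details compound the 403 issue above. A minimal local sketch (using the URL from the question) of what the server could ever receive:

```shell
# Unquoted, bash splits the original command at each '&' and runs the
# pieces separately, so wget never sees the full query string; always
# quote such URLs. Independently, '#q=foo' is a fragment: HTTP clients
# strip it before the request is sent, so it never reaches Google.
url='https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo'
printf '%s\n' "${url%%#*}"   # the most the server could ever receive
```

Combining the working search URL pattern with a user-agent override, a wget equivalent of the curl commands above might look like `wget -qO search.html -U "Chrome" 'http://www.google.com/search?hl=en&q=foo'` (a sketch, not tested against Google's current behavior).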

Related

curl 1020 error when trying to scrape page using bash script

I'm trying to write a bash script to access a journal overview page on SSRN.
I'm trying to use curl for this, which works for me on other webpages, but it returns error code 1020 if I try to run the following code:
curl https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128
I thought it might have to do with the question mark in the URL, but I got it to work with other pages that contained question marks.
It probably has something to do with what the page allows. However, I can access the page using R's rvest package, so I think it should also work from bash.
Looks like the site has blocked access via curl. Change the user agent and it should work fine, i.e.
curl --user-agent 'Chrome/79' "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128"

Get request with curl works but Net::HTTP and OpenURI fail

My Rails application regularly polls partners' ICS files and sometimes it fails for no reason whatsoever. When I do:
curl https://www.airbnb.es/calendar/ical/234892374.ics?s=23412342323
(params #'s faked here)
I get output matching the content of the ICS file. Just opening it in the browser works fine as well.
When I use:
Net::HTTP.get(URI(a.ics_link))
I get a "503 Service Temporarily Unavailable" response. I also tried the same with OpenURI with similar results.
Why is it that the server is treating requests from curl or a browser differently?
Is there some way to get Ruby to get around this?
It's an https issue... not sure why, but switch your URL in Ruby to https and it should work.

ajax request and robots.txt

A website has a URL http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME. When I click the URL, it executes the ajax function.
When I put the URL in the address bar, it gives a redirect error because the URL doesn't actually take you anywhere, but it definitely still executes the ajax function.
When I use the command line bash call: firefox -new-window http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME, it opens an empty page except for the line "Bad user...". After some digging I found that the robots.txt file has "Disallow: /wp-admin/". I am assuming this is why it isn't working from the command line. I have used wget -e robots=off URL before, but there isn't anything to download here, so it doesn't apply.
What type of URL is this? (I believe it's dynamic or formula, but not sure)
I want to get the same results with the command line as when I plug the URL into the address bar. Ideas?
It's nothing special; it just displays that HTML no matter what. HTTP servers don't have to serve files: the endpoint could be written in C++, Java, Python, or Node.js (probably not).
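On the command-line side of the question: robots.txt is purely advisory, so it cannot be what blocks curl; the usual pitfall is quoting. A sketch using the placeholder URL from the question:

```shell
# robots.txt only asks well-behaved crawlers to stay away; the server
# does not enforce it, and curl never reads it. What does matter is
# quoting: unquoted, the shell may mangle '?' and treat '&' as a job
# separator.
url='http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME'
printf '%s\n' "$url"   # intact query string, ready to hand to curl
```

The actual request is then just `curl -s "$url" -o response.html` (untested sketch; example.com is the question's placeholder).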

how curl retrieves a url with # and ! symbols in it?

I was considering using curl to retrieve a page from a URL (http://bbs.byr.cn/#!board/JobInfo?p=3) but ended up getting a notice from bash like
$ curl bbs.byr.cn/#!article/JobInfo/102321
bash: !article/JobInfo/102321: event not found
this url is accessible in my browser window, how can I write a curl command line that works on this url?
In general this is not possible: the stuff after the hash (#) is never sent to the server and is only handled by JavaScript on the client side, and curl cannot execute JavaScript. You can put the URL in quotes to get the static part of the page, but that is surely not what you want.
If you observe the traffic of that page in Firebug, you will see that the URL http://bbs.byr.cn/board/JobInfo?p=3 is downloaded. You can download this file to get your results.
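To make the original command line work at all, single-quote the URL; otherwise bash treats `!` as history expansion ("event not found"). A local sketch of what would then actually be requested:

```shell
# Single quotes keep '#' and '!' literal, so bash no longer reports
# "event not found".
url='http://bbs.byr.cn/#!board/JobInfo?p=3'
# An HTTP client strips the fragment before sending the request, so
# this is all the server would ever see:
printf '%s\n' "${url%%#*}"
```

Which is why fetching the data URL found via Firebug, e.g. `curl -s 'http://bbs.byr.cn/board/JobInfo?p=3' -o jobinfo.html`, is the way to go.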

google image search shell api

I'm looking for something like an API for Google image search that I can use from the bash shell.
I want to get a list of links and resolution info for a query string.
Ideally I would curl or wget some page and then parse the results.
But I cannot find any parseable page variant.
I'm trying $> curl "http://images.google.com/images?q=apple" and getting nothing.
Any ideas?
There are APIs for Google's searches: http://code.google.com/apis/imagesearch, although I don't know how you would meet the referrer/branding licensing requirements.
It seems that Google Images does not like curl (403 error code).
To avoid the 403 error, you need to fake the user agent, like this:
wget -qO- "http://images.google.com/images?q=apple" -U "Firefox on Ubuntu Gutsy: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
Still, I guess this is not enough, since you get a load of JavaScript code that needs to be executed somehow.
My 2 cents.
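As a rough illustration of the "curl, then parse" approach the question asks for, here is a hypothetical sketch that pulls src URLs out of saved HTML. The sample markup and the pattern are made up for the demonstration; real Google results pages are built with JavaScript, so this only applies to whatever static HTML you actually receive:

```shell
# Made-up sample standing in for a downloaded results page.
cat > sample.html <<'EOF'
<img src="http://img.example/apple1.jpg"><img src="http://img.example/apple2.png">
EOF
# Crude extraction: grab each src="..." attribute, then strip the
# surrounding src=" and " to leave bare URLs, one per line.
grep -oE 'src="[^"]+"' sample.html | sed -E 's/^src="|"$//g'
```

Resolution info is not in the URLs themselves; if the page carries width/height attributes, the same grep/sed approach can be extended to capture those too.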