google image search shell api - bash

I'm looking for something like API for google-image search using in bash shell.
I want to gel list of links and resolution-info for some query string.
The ideal will be curling or wgeting any page and than parsing results.
But I cannot find any parseble page variant.
I'm trying $> curl "http://images.google.com/images?q=apple" and get nothing.
Any ideas?

There are APIs for Google's searches; http://code.google.com/apis/imagesearch although I don't know how you would meed the referrer/branding licensing requirements.

It seems that Google Images does not like curl (403 error code).
To avoid the 403 error, you need to fake the user agent, like this:
wget -qO- "http://images.google.com/images?q=apple" -U "Firefox on Ubuntu Gutsy: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
Still, I guess this is not enough since you get a load of javascript code, that needs to be executed somehow.
My 2 cents.

Related

curl 1020 error when trying to scrape page using bash script

I'm trying to write a bash script to access a journal overview page on SSRN.
I'm trying to use curl for this, which works for me on other webpages, but it returns error code: 1020 for me if I try to run the following codes:
curl https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128
I thought it might have to do with the question mark in the URL, but I got it to work with other pages that contained question marks.
It probably has something to do with what the page's allows to do. However, I can also access the page using R's rvest package, so I think it should work in general also using bash.
Looks like the site has blocked access via curl. Change the user agent and it should work fine i.e.
curl --user-agent 'Chrome/79' "https://papers.ssrn.com/sol3/papersstract_id=1925128"

Wget does not fetch google search results

I noticed when running wget https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo and similar queries, I don't get the search results, but the google homepage.
There seems to be some redirect within the google page. Does anyone know a fix to wget so it would work?
You can use this curl commands to pull Google query results:
curl -sA "Chrome" -L 'http://www.google.com/search?hl=en&q=time' -o search.html
For using https URL:
curl -k -sA "Chrome" -L 'https://www.google.com/search?hl=en&q=time' -o ssearch.html
-A option sets a custom user-agent Chrome in request to Google.
#q=foo is your hint, as that's a fragment ID, which never gets sent to the server. I'm guessing you just took this URL from your browser URL-bar when using the live-search function. Since it is implemented with a lot of client-side magic, you cannot rely on it to work; try using Google with live search disabled instead. A URL pattern that seems to work looks like this: http://www.google.com/search?hl=en&q=foo.
However, I do notice that Google returns 403 Forbidden when called naïvely with wget, indicating that they don't want that. You can easily get past it by setting some other user-agent string, but do consider all the implications before doing so on a regular basis.

Using Libcurl to authenticate ntlm proxy without pass

i'm testing some network simple process to understand better and know how to work with NTLM.
Following this (ntlm-proxy-without-password) Q&A i found hot to uthenticate my transaction via ntml using the log information of the current user.
The command is this: curl.exe -U : --proxy-ntlm --proxy myproxy.com:8080 http://www.google.com
Know i have to do the same thing using libcurl since i need to achieve that result into the application i'm developing. There is a way to do this?
Following this Q&A i found hot to
This solved like a charm
curl_easy_setopt(ctx, CURLOPT_PROXYUSERPWD, ":");

bug in google drive SDK JS api (TypeError: Cannot read property 'sl' of undefined)

A few weeks ago we started noticing strange errors from the google client API or google drive JS api (not sure which, the URL reference is below), they have increased in frequency over the last few days
TypeError: Cannot read property 'sl' of undefined
This seems to be affecting windows Chrome mostly - a typical example of the user agent from our error logs is
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31)
from what I could see, the only line with .sl is this:
if(!this.b.headers.Authorization){var f=(0,_.Hx)(_.p,_.p);f&&f[_.Ak.pl.sl]&&(c=f[_.Ak.pl.sl].split(/\w+/))}
this comes from
https://apis.google.com/_/scs/apps-static/_/js/k=oz.gapi.en.uSTvEdNXb7o.O/m=client/rt=j/sv=1/d=1/ed=1/am=UQ/rs=AItRSTOm1KS5pZVEepZkn9qQJeuQZC_Qjw/cb=gapi.loaded_0
I know this is intentionally cryptic, so it's beyond me to suggest how to fix it, but I would appreciate if someone looks into this as the frequency seems to be increasing. Perhaps a guard around _Ak.pl to check if it's not null before executing .sl ?
I managed to resolve the problem which was reported. The issue is due to the authorize settings. Some settings seems to be not working for the app. Now the app works with following settings:
gapi.auth.authorize({client_id: clientId, scope: scopes, immediate: false}, handleAuthResult);
Previously the app was configured to run offline.
Note: In the code, clientId and scopes are variables, handleAuthResult is an associated function.

Non User Agent Denied

I'm reeealy stuck and way over my head.
I've got a e-commerce site set up with woocommerce and realex (the merchant company).
I'm having problems with the response URL, it is being denied when programmatic (non user-agent) requests are being made.
try:
$ wget curl http://fifty2printsolutions.com/?wc-api=WC_Gateway_Realex_Redirect
But when you set a valid user agent:
try:
$ curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.112 Safari/534.30" http://fifty2printsolutions.com/?wc-api=WC_Gateway_Realex_Redirect
it works.
This points to a server configuration issue, it seems like programmatic (non user-agent) requests are being denied by his server with a 406 response.
At least this is what the extension author has told me
How can I got about fixing this? I'm with host papa, the support is a bit lack luster.
Any help would be greatly appreciated.
I think you might be dealing with an Apache mod_security issue here. From my experimentation, as long as I have something in the User-Agent, it will process ok. However, if I include the words libwww or perl in the User-Agent, or leave it blank, it doesn't work.
Realex Epage Redirect uses "epage.cgi / libwww-perl" (or similar, I don't have it in front of me) as the User-Agent. This hits both of those rules, so it is getting blocked. We've come across this before and the only solution is to ask the administrators of the site to modify the mod_security rules to allow this. We once changed the User-Agent string to something else, but incredibly, several shopping carts suddenly stopped working and we had to roll the change out.
Hope this helps
Owen

Resources