Using Scrapy to download images from a Google image search

I am trying to download Google Images results for a particular search.
Currently, if I have the URL, my code will download the first 10 images.
However, my question is: how would I get the URL for a particular search on Google?
When I look at the URL for any search on Google, it looks very complicated, and it is hard to understand how it was constructed.

http://www.google.com/m/search?q=hello&site=images
This URL pulls up the mobile website, which is static and easier to harvest images from. All parts of the query string are self-explanatory.

The q= parameter of the URL is the actual search string. Note that some characters are converted, such as a space becoming a plus sign.
It is easy enough to build such a URL yourself, e.g. https://www.google.com/search?q=a+search
For image search, add tbm=isch: https://www.google.com/search?q=a+search&tbm=isch
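That means the search URL can be generated rather than copied from the browser. A minimal Python sketch (the function name is just for illustration):

    # Build a Google image-search URL from a plain search term.
    # urlencode() handles the character conversion mentioned above
    # (spaces become "+", etc.); tbm=isch selects image results.
    from urllib.parse import urlencode

    def google_image_search_url(term):
        return "https://www.google.com/search?" + urlencode({"q": term, "tbm": "isch"})

    print(google_image_search_url("hello world"))
    # https://www.google.com/search?q=hello+world&tbm=isch

The resulting URL can then be passed to the existing image-downloading code as its start URL.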

Related

Octoparse and relative Xpath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows for scraping locally. Octoparse auto-detects the Title, Title_URL, and Content webpage data and correctly sets up the Pagination, Scroll Page, and Loop Item workflow to extract those fields. The problem is that it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, because these pieces appear to be embedded via an iframe.
I can add Date and Podcast time duration manually using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this copies the same value into every record. When I try to fix this by using Octoparse's Relative XPath setting to loop over each item with //span[@class="cp-episode-date"], so that each record gets its own value, it returns no values at all, even though that same relative XPath finds all occurrences when I search with Chrome's DevTools. I saw what might be a helpful post on Stack Exchange about this, but I was not able to make sense of it.
//span[@class="cp-episode-date"] is a relative XPath, since it finds multiple Date items in Chrome DevTools, but it is not complete, and I am not sure how to implement the iframe traversal that Octoparse's Relative XPath setting needs for the custom Date and Podcast time duration fields I added. I even installed the SelectorsHub Chrome extension, but it did not bring up the nested SelectorsHub panel for querying the XPath the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I already have below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When the absolute XPath is used - //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When the relative XPath is used - //span[@class="cp-episode-date"]
There are plenty of iframes inside the webpage, and I don't know whether Octoparse can handle this. Choose another starting point.
For example, use Apple Podcasts:
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates can be recovered with the following XPath:
//div[@class="l-row"]//time[@class]/@aria-label
Another possibility: scrape the following page:
https://feeds.captivate.fm/the-website-coach/
Dates can be recovered with the following XPath:
//h4/text()
Even easier: get the data directly from this URL (a .json file):
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
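For completeness, a short Python sketch of that last suggestion, reading the episode dates straight from the JSON endpoint. It assumes the third-party requests package and the usual iTunes Search API field names (wrapperType, releaseDate, trackName), so treat it as a starting point rather than a finished scraper:

    # Fetch the lookup JSON and print each episode's release date and title.
    import requests

    url = ("https://itunes.apple.com/lookup"
           "?id=1587503231&media=podcast&entity=podcastEpisode&limit=100")
    data = requests.get(url, timeout=30).json()

    for item in data.get("results", []):
        # The first result describes the podcast itself; episode entries
        # are marked as podcastEpisode.
        if item.get("wrapperType") == "podcastEpisode":
            print(item.get("releaseDate"), "-", item.get("trackName"))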

Crawl depth for URLs added through metadata-and-url feed

We need to add specific URLs through a metadata-and-url feed and prevent the GSA from following links found on those pages. URLs found on these pages must be ignored even if they match the Follow Patterns rules.
Is it possible to specify a crawl depth for URLs added through a metadata-and-url feed, or is there some other way to prevent the GSA from following URLs found on specific pages?
You can't solve this problem with just a metadata-and-URL feed. The GSA is going to crawl the links that it finds, unless you can specify patterns to block them.
There are a couple possible solutions I can think of.
You could replace the metadata-and-URL feed with a content feed. You'd then have to fetch whatever you want to index and include that in the feed. Your fetch program could remove all of the links, or it could "break" relative links by specifying an incorrect URL for each of the documents. You'd then have to rewrite the incorrect URLs back to the correct URLs in your search result display page. I've done the second approach before, and that's pretty easy to do.
You could use a crawl proxy to block access to any of the links you don't want the GSA to follow.
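To make the first option above more concrete, here is a rough standard-library Python sketch of the "remove the links before feeding" step; building the actual content feed XML is left out, and a real implementation would need more care with entities, comments, and self-closing tags:

    # Drop href attributes from <a> tags so the fed content has nothing to follow.
    from html.parser import HTMLParser

    class LinkStripper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.out = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                attrs = [(k, v) for k, v in attrs if k.lower() != "href"]
            self.out.append("<%s%s>" % (tag, "".join(' %s="%s"' % (k, v) for k, v in attrs)))

        def handle_endtag(self, tag):
            self.out.append("</%s>" % tag)

        def handle_data(self, data):
            self.out.append(data)

    stripper = LinkStripper()
    stripper.feed('<p>See <a href="/internal/page">this page</a> for details.</p>')
    print("".join(stripper.out))
    # <p>See <a>this page</a> for details.</p>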
The easiest method to prevent this is to add a robots meta tag, <meta name="robots" content="nofollow">, to the HEAD section of your HTML.
This will prevent the GSA (and any other search engine) from following any links on the page.
If you can't add the relevant nofollow meta tags to your content, then you can handle this using your follow and crawl patterns.
From the official documentation:
Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.
For example, the following URL patterns cause the search appliance to crawl the top three subdirectories on the site www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
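If it helps to see the depth idea in action, here is a quick check with Python's re module. The doubled backslashes above look like escaping from the pattern file; a plain regular expression needs only a single backslash before each dot, as below:

    # Which depth pattern, if any, matches each URL?
    import re

    patterns = {
        1: r"www\.mysite\.com/[^/]*$",
        2: r"www\.mysite\.com/[^/]*/[^/]*$",
        3: r"www\.mysite\.com/[^/]*/[^/]*/[^/]*$",
    }

    urls = [
        "www.mysite.com/index.html",              # depth 1
        "www.mysite.com/docs/intro.html",         # depth 2
        "www.mysite.com/docs/guide/a/deep.html",  # depth 4: no pattern matches
    ]

    for url in urls:
        matched = [depth for depth, p in patterns.items() if re.search(p, url)]
        print(url, "->", matched or "excluded from the crawl")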

Google Sitelinks don't appear with the domain in some search cases

I'm facing a weird problem with Google Search. When I search for my website using the keywords "dardasha newspaper", I get the expected result: my site comes first, with sitelinks included.
https://www.google.com/search?q=dardasha+newspaper&ie=utf-8&oe=utf-8
But when I search for my website using the keywords "جريدة دردشة" (Arabic for "Dardasha newspaper"), I get the correct result but with no sitelinks:
https://www.google.com/search?q=dardasha+newspaper&ie=utf-8&oe=utf-8#q=%D8%AC%D8%B1%D9%8A%D8%AF%D8%A9+%D8%AF%D8%B1%D8%AF%D8%B4%D8%A9
And this is even though my website's language is Arabic, the language of the second search. Why are the search results different depending on the keywords used?
Google expands a result with sitelinks when you search for the website's domain or something very close to it.
Your website is www.dardashanewspaper.com, and you searched for "dardasha newspaper", which is essentially the domain name.
Another problem is that Google thinks "dardasha" in Arabic is درداشا, not دردشة.

How does Market Samurai and Long Tail Pro handle retrieving the top 10 Google search results for a keyword?

I'm curious to know how Market Samurai, Long Tail Pro, and other software packages handle retrieving the top 10 Google search results without running into limits. It appears that these packages use the user's own Google account. Google Custom Search limits users to 100 queries per day on the free tier, but people tend to do keyword research on hundreds or even thousands of keywords per day without paying Google anything extra.
Are they paying extra for this service, are they using a different API (perhaps the AdWords API), or are they scraping the Google search results page (a violation of the TOS)? I would really like to know. Thanks.
I have done this in one of my projects (in Java).
It is very simple: in Java there is a library called jsoup, and with it you can send a GET request to Google, for example:
https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=<your url encoded search term>
This returns the HTML of the Google search results for your term.
Using jsoup you can then find a specific HTML tag with a specific class or id, which lets you extract the URL, title, and description of each Google search result.
For a working example check here; in that example you can extract Google search result links for a custom search term.
I hope this helps.
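The answer above uses Java, but the same idea in Python looks roughly like the sketch below (requests plus BeautifulSoup instead of jsoup). The "a h3" selector is a guess at Google's current result markup and will break when Google changes it, and, as the question notes, scraping the results page is against Google's terms of service:

    # Fetch a results page and print the title and link of each organic result.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urlencode

    query = "web scraping"
    resp = requests.get(
        "https://www.google.com/search?" + urlencode({"q": query}),
        headers={"User-Agent": "Mozilla/5.0"},  # without a browser-like UA Google may refuse
        timeout=30,
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    for title in soup.select("a h3"):   # result titles sit inside their link
        link = title.find_parent("a")
        print(title.get_text(strip=True), "->", link.get("href"))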

Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than RapidMiner's 'Crawl Web' operator).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler also lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPaths from each of them?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need some input to keep my motivation going. I would like to know what operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping in RapidMiner without saving the HTML pages locally is a two-step process:
Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference: instead of the Crawl Web operator, use the Process Documents from Web operator. There will be no option to specify an output directory, because the results are loaded into the ExampleSet. The ExampleSet will contain the links matching the crawling rules.
Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 on, with the following difference: put the Extract Information subprocess inside the Process Documents from Web operator created previously. The ExampleSet will then contain the links and the attributes matching the XPath queries.
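Not a RapidMiner answer, but for comparison, the whole "read a URL list and collect XPath matches without saving the pages" idea also fits in a few lines of Python with the third-party requests and lxml packages (the file name and XPath below are placeholders):

    # Fetch each URL from a plain-text list and print its XPath matches.
    import requests
    from lxml import html

    XPATH = "//h1/text()"          # placeholder query

    with open("urls.txt") as f:    # one URL per line
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        tree = html.fromstring(requests.get(url, timeout=30).content)
        for match in tree.xpath(XPATH):
            print(url, "->", match)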
I have pretty much the same problem as you, and maybe these posts from the RapidMiner forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)
