Crawl depth for URLs added through metadata-and-url feed - google-search-appliance

We need to add specific URLs through a metadata-and-URL feed and prevent the GSA from following links found on those pages. URLs found on these pages must be ignored even if they match the Follow Patterns rules.
Is it possible to specify a crawl depth for URLs added through a metadata-and-URL feed, or is there some other way to prevent the GSA from following URLs found on specific pages?

You can't solve this problem with just a metadata-and-URL feed. The GSA is going to crawl the links that it finds, unless you can specify patterns to block them.
There are a couple of possible solutions I can think of.
You could replace the metadata-and-URL feed with a content feed. You'd then have to fetch whatever you want to index and include that in the feed. Your fetch program could remove all of the links, or it could "break" relative links by specifying an incorrect URL for each of the documents. You'd then have to rewrite the incorrect URLs back to the correct URLs in your search result display page. I've done the second approach before, and that's pretty easy to do.
You could use a crawl proxy to block access to any of the links you don't want the GSA to follow.
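The "remove all of the links" option for a content feed can be sketched as follows. This is only an illustration, assuming the fetch program is written in Python and that the pages use ordinary anchor tags; the function name strip_links is hypothetical, and a regex this simple will not cope with a ">" inside an attribute value or with links in other elements such as frames or image maps.

```python
import re

# Matches opening <a ...> tags and closing </a> tags, case-insensitively.
# The whitespace-or-">" requirement keeps tags such as <abbr> untouched.
_ANCHOR_TAG = re.compile(r"</?a(\s[^>]*)?>", re.IGNORECASE)


def strip_links(html: str) -> str:
    """Remove anchor tags but keep the visible link text."""
    return _ANCHOR_TAG.sub("", html)


if __name__ == "__main__":
    page = '<p>See <a href="/other.html">this page</a> for details.</p>'
    print(strip_links(page))
```

The stripped HTML would then be placed in the content feed in place of the fetched page, so the appliance has no links to follow.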

The easiest way to prevent this is to add a robots nofollow meta tag to the "HEAD" section of your HTML:
<meta name="robots" content="nofollow">
This will prevent the GSA (and any other search engine that honors the tag) from following any links on the page.

Since you say that you can't add the relevant nofollow meta tags to your content, you can handle this using your follow and crawl patterns.
From the official documentation:
Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.
For example, the following URL patterns cause the search appliance to crawl the top three subdirectories on the site www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
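To see which URLs those three patterns would admit, they can be tried out locally. This is a rough sketch, assuming Python's re module behaves closely enough to the appliance's regexp matching for this simple case; the doubled backslashes of the GSA syntax are written as single backslashes in Python raw strings, and the helper name within_depth is hypothetical.

```python
import re

# The three documented patterns, minus the "regexp:" prefix.
DEPTH_PATTERNS = [
    re.compile(r"www\.mysite\.com/[^/]*$"),
    re.compile(r"www\.mysite\.com/[^/]*/[^/]*$"),
    re.compile(r"www\.mysite\.com/[^/]*/[^/]*/[^/]*$"),
]


def within_depth(url: str) -> bool:
    """Return True if any of the depth-limiting patterns matches the URL."""
    return any(p.search(url) for p in DEPTH_PATTERNS)


if __name__ == "__main__":
    for url in (
        "http://www.mysite.com/index.html",
        "http://www.mysite.com/a/b/page.html",
        "http://www.mysite.com/a/b/c/page.html",
    ):
        print(url, within_depth(url))
```

Each pattern allows one more path segment than the previous one, so a URL nested more deeply than the longest pattern matches none of them.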

Related

Google Custom Search API (CSE) - Retrieve only discussions

I would like to use the Google custom search API for searching only in discussions like using the query string &tbm=dsc.
Unfortunately there is no tbm parameter given in the API documentation.
Is it not possible to limit the search results to discussions only?
No, there is currently not a way to do the discussion search with CSE/GSS. The only special search is image which is documented in the API reference. You could use Labels and Refinements to limit your search to specific sites and/or patterns.
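For the image search mentioned above, a request URL can be assembled roughly like this. It is a minimal sketch, assuming the Custom Search JSON API endpoint and its key, cx, q and searchType parameters; the function name and the key and engine ID values are placeholders, and nothing here adds a discussions filter.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key: str, engine_id: str, query: str, images: bool = False) -> str:
    """Build a Custom Search request URL, optionally restricted to images."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    if images:
        # Image search is the only special search type referred to above.
        params["searchType"] = "image"
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_cse_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "hello world", images=True))
```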
Limiting search results for Google Custom Search to only discussion websites is not possible. Just in case, remember that Google Custom Search is for searching over one website or a collection of websites. If your collection is all discussion sites, well, that doesn't seem to be the purpose of Google Custom Search. However, there may be some useful workarounds/solutions.
Workaround 0
Find or generate a collection of discussion sites you're interested in and create a custom search based on that. This would accomplish (almost) the same results you are after.
Workaround 1
You might be able to perform a redirection with refinement labels. This example redirects to a Google Scholar search. You might be able to accomplish the same result using &tbm=dsc.
<CustomSearchEngine>
<Title>Universities</Title>
<Context>
<Facet>
<FacetItem title="Papers">
<Label name="papers" mode="FILTER"/>
<Redirect url="http://scholar.google.com/scholar?q=$q"/>
</FacetItem>
</Facet>
</Context>
</CustomSearchEngine>

Using Scrapy to download images from a google search

I am trying to download Google images for a particular search.
Currently, if I have the URL, my code will download the first 10 images.
However, my question is: how would I get the URL for a particular search on Google?
When I look at the URL for any search on Google, it looks very complicated, and it is hard to understand how the URL was created.
http://www.google.com/m/search?q=hello&site=images
This URL pulls up the mobile website, which is static and is easier to harvest images from. All parts of the query are self-explanatory.
The &q= part of the URL is the actual search string. Note that some characters are converted, such as a space becoming a plus sign.
Easy enough to fake by doing https://www.google.com/search?q=a+search
For image search https://www.google.com/search?q=a+search&tbm=isch
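Building such a URL in code might look like the sketch below. It assumes Python's standard urllib.parse module, whose urlencode turns a space into a plus sign as described above; the function name google_search_url is hypothetical, and the tbm=isch parameter is taken from the image search URL in this answer.

```python
from urllib.parse import urlencode


def google_search_url(query: str, images: bool = False) -> str:
    """Build a Google search URL for the query, optionally for image search."""
    params = {"q": query}
    if images:
        params["tbm"] = "isch"  # image search, as in the URL above
    # urlencode applies quote_plus, so "a search" becomes "a+search".
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(google_search_url("a search"))
    print(google_search_url("a search", images=True))
```

The resulting string can be handed to the spider as its start URL.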

Can rapidminer extract xpaths from a list of URLS, instead of first saving the HTML pages?

I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure if the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program. (It has more options than the 'Crawl Web' operator in RapidMiner.)
I've seen the following tutorials from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of those URLs?
I've also looked at other tools for extracting html from pages, but I've been unable to figure out how they work (or even install) since I'm not a programmer. Rapidminer on the other hand is easy to install, the operator descriptions make sense but I've been unable to connect them in the right order.
I need to have some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files.' I've looked at 'process documents from web' but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages locally using RapidMiner is a two-step process:
Step 1 Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan with the following difference:
instead of Crawl Web operator use the Process Documents from Web
operator. There will not be an option to specify the output
directory, because the results will be loaded into the ExampleSet.
ExampleSet will contain links matching the crawling rules.
Step 2 Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html but only from 7:40 with the following difference:
put the Extract Information subprocess inside the Process Documents from Web which has been created previously.
ExampleSet will contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)

Pretty URLs Vs. Duplicate Content

I'm trying to clear up a grey area about this much talked about topic...
Like most devs, I've made some pretty URLs with mod_rewrite. My sites internal links point to the pretty URLs and things are working nicely.
But, I can still access the old URL if I point to it directly.
Now, this is most certainly going to cause duplicate content issues so after doing some research it seems that 301 redirects are the way to go.
But.... and here's the grey bit...
If you are working on a site with thousands of URLs, what's the best practice to achieve this? I don't want to list 1k+ lines in .htaccess. I thought of a regexp in my rewrite rule, but my pretty URLs have names from the database in them... and I can't access that from .htaccess :)
Have I hit a dead end? Is there a way around this? Would Google's canonical tag be a possibility??
Well, I don't know if this is the "definitive" answer, but I have a bunch of "functional" URLS like:
http://www.flipscript.com/product.aspx?cid=7&pid=42&ds=asdjlf8i7sdfkhsjfd978
but I remap the URLs, link to them and list them in my site map as:
http://www.flipscript.com/ambigram-ring.aspx
I haven't seen ANY evidence that different URLs pointing to the same content within the same domain have any negative impact on SEO.
In fact, over the past year, I have climbed to the #1 position on Google with this in place for my primary keyword.
My theory about why this should be so is that Google applies the duplicate content penalty for entire "clone sites", not for just linking with different URLs to the same content within a single site.
A quick and dirty way would be to re-route everything on the site via a PHP file that checks to see if the path is still valid, querying the database if necessary. Use a 301 redirect if the path has permanently moved. Soon enough these "grey urls" should hardly ever be requested, and indexes should be updated across search engines. At which point you can remove the router.
If you could specify what your "grey url" looks like I may be able to suggest a better alternative.
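The routing idea can be sketched like this. The answer proposes a PHP file; this sketch shows the same decision logic in Python for illustration only. The old URL shape /product.php?id=..., the pretty URL shape /products/..., and the SLUGS dictionary standing in for the database query are all hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical stand-in for the database: product id -> pretty URL slug.
SLUGS = {"42": "blue-widget"}


def route(url: str):
    """Return (status, location) for a requested URL.

    Old-style URLs get a 301 to their pretty equivalent, unknown ids
    get a 404, and anything else is served normally (200).
    """
    parts = urlsplit(url)
    if parts.path == "/product.php":
        product_id = parse_qs(parts.query).get("id", [None])[0]
        slug = SLUGS.get(product_id)
        if slug is None:
            return 404, None
        return 301, "/products/" + slug
    return 200, None


if __name__ == "__main__":
    print(route("/product.php?id=42"))
    print(route("/products/blue-widget"))
```

Because the lookup happens in application code rather than in .htaccess, the database-derived names are available when the redirect target is built.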
"Would Google's canonical tag be a possibility??" -- Why not?
--> It automatically transfers page rank
--> Google recommends canonical tag even if the content differs slightly but is more or less similar.
--> Too many 301 redirects to pages within site are bad for SEO (my personal experience with Bing).
--> Too many 301 redirects increase the effective load time of content for your users (especially bad if the ping times from their location to your server are high).

mod_rewrite and redundant / old urls, some SEO best practices needed

Having a look at how google perceives our site at the moment and coming up short...
Basically, we use a bog-standard structure of URL rewriting to make them look SEO friendly.
for instance, a product URL takes the shape of any string_([0-9]+).html and so forth. of course, this allows us to link to whatever we want before the product id... which we have done. In the past, a product page was Product_Name_79.html and then became Brand_Name_Product_Name_79.html. apache does not really care and id 79 gets passed on in either case. However, google now has 2 versions of this product cached under different URLs - and that's not a good thing as it continues to arrive at the first URL and spider it.
same thing applies to our rewrite rules for brands and categories, some of which had been dropped and some of which have been modified.
there are over 11k urls in site:domain whereas our sitemap gets some 5.8k only. how would you prevent spiders from fetching older versions of urls that you no-longer link to (considering it's not a manual process and often such urls can be very dynamic).
eg, Mens_Merrell_Trail_Running_Shoes__50-100__10____024/ is a dynamic url for the merrell brand, narrowed down by items in trail running shoes that cost between 50 and 100 and size 10 with gender set to men's.
if we decide to nofollow any size and money filter urls, that leaves google still being able to access them through its old cache...
what is the best practice for disallowing a particular type of urls? as the combinations above are nearly infinite, i cannot produce a list and it certainly cannot be backdated against what brands and categories google may hold for us historically.
shall we add noindex when such filters are applied? shall we export them to robots.txt? do nothing in the hope that google stops returning?
to put it into perspective, we have 2600 product page urls that are now redundant / disabled, what would you do with them? redirect to homepage, brand page, 404, do nothing?
thanks for any advice
i think you're looking for rel="canonical", google should start ignoring your links if they're really not linked to. You can check any incoming links with a tool like this: http://www.seomoz.org/linkscape.
Also if your old urls match (or don't match) a consistent pattern you could set up a 301 redirect in apache either for pages matching the old pattern or not matching the new pattern...
hope this helps!
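For the product pages, where the trailing id is the only part that matters, the redirect decision could look like the sketch below. It is an illustration in Python rather than an Apache rule, assuming product URLs end in _<id>.html as described in the question; the CANONICAL table is a hypothetical stand-in for whatever the site uses to generate its current product URLs.

```python
import re

# Captures the numeric id at the end of a product URL such as
# /Product_Name_79.html or /Brand_Name_Product_Name_79.html.
PRODUCT_URL = re.compile(r"^/.*_([0-9]+)\.html$")

# Hypothetical lookup: product id -> current canonical path.
CANONICAL = {79: "/Brand_Name_Product_Name_79.html"}


def redirect_target(path: str):
    """Return the path to 301-redirect to, or None if no redirect is needed."""
    match = PRODUCT_URL.match(path)
    if match is None:
        return None  # not a product URL
    canonical = CANONICAL.get(int(match.group(1)))
    if canonical is None or canonical == path:
        return None  # unknown product, or already the canonical URL
    return canonical


if __name__ == "__main__":
    print(redirect_target("/Product_Name_79.html"))
    print(redirect_target("/Brand_Name_Product_Name_79.html"))
```

The same canonical path could also be emitted in a rel="canonical" link on the page instead of, or in addition to, the redirect.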
Just be sure to set up redirects for any URL you change. Also, I don't recommend using rel=nofollow since it indicates to Google that your site is not trustworthy.
