How to re-crawl documents that have an error status - google-search-appliance

We had an issue yesterday that prevented the GSA crawler from logging in to our website to crawl. Because of this, many of the URLs are indexed as the login page. I see a lot of results on the search page titled "Please log in" (the title of the login page). Also, when I check Index Diagnostics, the crawl status for these URLs is "Retrying URL: Connection reset by peer during fetch.".
Now the login problem is resolved. Once a page is re-crawled, its crawl status changes to successful, the page content is picked up, and the search results show the proper title. But since I cannot control what is being crawled, there are pages that still haven't been re-crawled and still have the problem.
The affected pages do not share a uniform URL pattern that I could use to force a re-crawl. Hence my question:
Is there a way to force a re-crawl based on the crawl status ("Retrying URL: Connection reset by peer during fetch.")? If that is too specific, how about a re-crawl based on the crawl status type (Errors/Successful/Excluded)?

Export all the error URLs as a CSV file using "Index > Diagnostics > Index Diagnostics".
Open the CSV, apply a filter on the crawl status column, and collect the URLs that have the error you are looking for.
Copy those URLs, go to "Content Sources > Web Crawl > Freshness Tuning > Recrawl these URL Patterns", paste them, and click Recrawl.
That's it. You are done!
PS: If there are many error URLs (more than 10,000, if I am not wrong), you may not be able to get all of them in a single CSV file. In that case, you can work in batches.
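The filter-and-batch steps above can also be scripted. Here is a minimal Python sketch, assuming the exported CSV has header columns named `URL` and `Crawl Status` (hypothetical names; check the header row of your actual export). The sample rows are made up and only stand in for an Index Diagnostics export.

```python
import csv
import io

def urls_with_status(csv_text, status_substring, url_col="URL", status_col="Crawl Status"):
    """Return URLs whose crawl status contains status_substring.

    url_col and status_col are assumed column names, not confirmed GSA headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[url_col] for row in reader if status_substring in row[status_col]]

def batches(items, size=10000):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Made-up sample standing in for an Index Diagnostics export.
sample = (
    "URL,Crawl Status\n"
    "http://www.example.com/a,Retrying URL: Connection reset by peer during fetch.\n"
    "http://www.example.com/b,Crawled: New Document\n"
    "http://www.example.com/c,Retrying URL: Connection reset by peer during fetch.\n"
)
failed = urls_with_status(sample, "Connection reset by peer")
for batch in batches(failed, size=2):
    # Each batch can be pasted into "Recrawl these URL Patterns".
    print("\n".join(batch))
```

For a real export, read the file contents into a string first and pass that in place of `sample`.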
Regards,
Mohan

You can use this to submit a batch of URLs for recrawling:
https://github.com/google/gsa-admin-toolkit/blob/master/interactive-feed-client.html
I have tested it with batches of 80K URLs at once.
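For context, here is a rough Python sketch of the kind of feed document such a client might submit. It is not taken from the linked toolkit page. It assumes the web feed layout from the GSA Feeds Protocol as I recall it (datasource `web`, feedtype `metadata-and-url`, and a `crawl-immediately` attribute on each record), so verify the element and attribute names against the Feeds Protocol documentation for your appliance version. Submitting the XML to the appliance is left out.

```python
import xml.etree.ElementTree as ET

def build_recrawl_feed(urls):
    """Build a web feed document that asks the appliance to crawl each URL.

    Element and attribute names are assumptions based on the GSA Feeds
    Protocol; they are not confirmed by the original answer.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = "web"
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        ET.SubElement(group, "record", {
            "url": url,
            "mimetype": "text/html",
            "crawl-immediately": "true",
        })
    body = ET.tostring(root, encoding="unicode")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        + body
    )

# Hypothetical URLs that need a re-crawl.
feed_xml = build_recrawl_feed(["http://www.example.com/a", "http://www.example.com/c"])
print(feed_xml)
```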

Related

Wrong Pages In sitemap.xml of magento

How do I resolve the "Wrong Pages" error found in Magento's sitemap.xml? After I set "Use Categories Path for Product URLs" to "No"...
It seems from your question that you previously had some URLs whose paths have now changed, so you want the earlier URLs to be removed. If this is the case, then there are several options:
First, Google itself eventually removes URLs that are no longer found.
Second, you can use Google's Remove URLs tool (https://www.google.com/webmasters/tools/url-removal), but only if you have access to Webmaster Tools.
Third, add the URLs to robots.txt so that Google no longer crawls them.
Hope this answers your question.
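For the robots.txt option, the Disallow lines can be generated from a list of old URLs. This is a minimal Python sketch with a made-up example URL. It checks the result with Python's standard-library parser, which may not match Googlebot's own parsing in every detail.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def disallow_rules(old_urls):
    """Build robots.txt lines that disallow the path of each old URL."""
    lines = ["User-agent: *"]
    lines += ["Disallow: " + urlparse(u).path for u in old_urls]
    return lines

# Hypothetical old product URL that should no longer be crawled.
old_urls = ["http://www.example.com/old-category/product-a.html"]
rules = disallow_rules(old_urls)
print("\n".join(rules))

# Check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(rules)
old_allowed = parser.can_fetch("Googlebot", old_urls[0])
new_allowed = parser.can_fetch("Googlebot", "http://www.example.com/product-a.html")
print(old_allowed, new_allowed)
```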

CKAN errors in links and updates to resources after a change in domain name

A policy change forced me to change our domain name for CKAN. Someone had the bright idea to use the domain name for a different landing site and then redirect to the CKAN server from there.
I have updated the data records in the database to reflect the amended URL in the resource tables (resource, resource_revision, resource_view).
it was https://www.blah.....
the new url is http://hub.blah......
Some resources are now downloadable at the new domain; their pages show the correct domain name, and the links and buttons work correctly.
Some resources on the page show the old URL and link to the old URL, where no document is accessible. The document is present at the new URL, and I can see that the records for the resource in the database reflect the new URL.
The previewers of the dataset show the data as I would expect on the pages that have the correct URL, but they do not work on the pages where the URL displayed is incorrect.
New uploads are showing correctly on pages and correctly in the database.
I have restarted Solr, nginx, and Apache, and rebooted the web server and the database server.
It looks like the out-of-date pages are cached, but I can't find a way to refresh them and force the several hundred pages to re-query the database to get the correct information.
I see the same issue when the page is accessed from outside and from inside my corporate network.
Does anyone have any more ideas?
My best guess would be that the URLs are wrong in the Solr index. Try rebuilding the whole Solr index with this paster command (run in the virtualenv):
paster --plugin=ckan search-index rebuild -c production.ini
Adapt to your config file accordingly.
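To confirm which datasets still point at the old domain after a rebuild, something like the following Python sketch may help. It assumes a dataset dict shaped like the "result" of CKAN's package_show action (a "resources" list whose items have "id" and "url"). The host names and the dataset below are made up for illustration.

```python
from urllib.parse import urlparse

def resources_on_host(package, host):
    """Return (resource id, url) pairs whose URL host equals `host`."""
    return [
        (res.get("id"), res.get("url"))
        for res in package.get("resources", [])
        if urlparse(res.get("url") or "").netloc == host
    ]

# Hypothetical dataset dict, shaped like the "result" of package_show.
package = {
    "name": "example-dataset",
    "resources": [
        {"id": "r1", "url": "https://www.old.example/dataset/file1.csv"},
        {"id": "r2", "url": "http://hub.new.example/dataset/file2.csv"},
    ],
}
stale = resources_on_host(package, "www.old.example")
print(stale)
```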
After a bit of a headache and a good night's sleep, I added the following to production.ini:
ckan.datastore.sqlalchemy.pool_size=10
ckan.datastore.sqlalchemy.max_overflow=20
This worked for me.

Google Webmaster error - blocked urls by robots.txt but no such file

My hosting company made a mistake: while trying to limit search agents, they blocked all of them, including Google, with robots.txt. After I discovered this, I changed the robots.txt content to Allow: / and waited a week for Google to see the change, but nothing happened. Then I completely removed the robots.txt file, and I can still see this error:
As a result, my site, which used to get 1000-1200 visits per day, has now dropped to 200. Please help me solve this. How can I let Google know that nothing stops it from crawling the site? Is it possible that all 5000 URLs that are now blocked have been removed from the Google index?
What you need to do is create a robots.txt that allows your whole site to be indexed and upload it to the site root. Then go to Webmaster Tools -> Crawl -> Fetch as Google and click the red button labeled "FETCH".
Wait a few seconds or just refresh the page, then click "Submit to index" and choose "URL and all linked pages".
Let me know if that helps. I'm pretty sure it will.
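Before fetching, the new robots.txt can be sanity-checked locally. This is a small Python sketch using the standard-library parser. It shows how Python reads the rules, which is only an approximation of what Googlebot does, and the page URL is made up.

```python
from urllib.robotparser import RobotFileParser

def googlebot_allowed(robots_lines, url):
    """Return True if the given robots.txt lines let Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("Googlebot", url)

# The blocking file from the hosting company versus an allow-all file.
blocking = ["User-agent: *", "Disallow: /"]
allow_all = ["User-agent: *", "Allow: /"]
page = "http://www.example.com/some/page.html"  # hypothetical page

blocked_result = googlebot_allowed(blocking, page)
allowed_result = googlebot_allowed(allow_all, page)
print(blocked_result, allowed_result)
```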

Magento Backend Catalog page keeps refreshing

I have a Magento install on a staging environment. Everything is working except the catalog backend page and the frontend layered navigation. Look at the image below. There is a backend within the backend. When visiting the page, it refreshes endlessly. See this identical problem.
My first guess is that there is a bug in the template file, but an identical template file on my local machine does not cause any issues. Additionally, the databases of my local and staging sites are identical apart from the core_config_url. The only difference is that locally I am running Apache and on staging I am running nginx.
The second issue, which I imagine is related, is that the filters on the frontend catalog page don't work. They are visible, but clicking on them reloads the page without changing the products.
Any help would be appreciated. Thanks
UPDATE: After switching from nginx to Apache, the issue disappeared. I would still like to figure out what is causing the problem.
I take for granted that you had already tried different browsers/clearing your browser cache and Magento cache (empty the cache directory), etc.
Are you sure you put the correct value in the cookie_domain setting? Many users who had the same problem seem to have set an incorrect value there.
Take a look here and let me know.
I have found that usually this is caused by a server-side error on the ajax request. If there is any kind of error response returned, it will just continue to spin. Either check your Chrome console for a 500 response, or look in your server's error logs.

Magento Homepage Keeps on redirecting to a 404 page

I have been on this issue for almost a week now and have been researching all over the net for answers, but I could not find one.
Problem:
Each time I access the homepage of my site, it results in a 404 error.
ex.
http://www.domain.com ---> redirects to 404
http://www.domain.com/home ---> enters to the cms page that i set as my homepage
I have already run the Magento cleanup scripts, but they did not solve the problem.
ERROR MESSAGE:
Whoops, our bad...
The page you requested was not found, and we have a fine guess why.
If you typed the URL directly, please make sure the spelling is correct.
If you clicked on a link to get here, the link is outdated.
What can you do?
Have no fear, help is near! There are many ways you can get back on track with Magento Store.
Go back to the previous page.
Use the search bar at the top of the page to search for your products.
Follow these links to get you back on track!
Store Home | My Account
I've got a similar error and would like to post my solution here. The case was exactly the same for me: all the pages, categories, etc. worked perfectly, but the home page showed a 404 error.
I looked into the core_url_rewrite table and discovered that there was an entry with an empty «request_path» field. This entry was rewriting my base URL, and that was the reason for the 404 in my case. I just deleted it.
Hope this helps other people.
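The inspect-then-delete step described above can be illustrated with a small Python sketch. It uses an in-memory SQLite table as a stand-in for Magento's MySQL core_url_rewrite table and models only two of its columns, with made-up rows. On a real store, run the equivalent queries against MySQL after taking a backup.

```python
import sqlite3

# In-memory SQLite table standing in for Magento's MySQL core_url_rewrite;
# only two of the real table's columns are modeled here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE core_url_rewrite (request_path TEXT, target_path TEXT)")
conn.executemany(
    "INSERT INTO core_url_rewrite VALUES (?, ?)",
    [
        ("", "catalog/product/view/id/1"),             # made-up offending row
        ("shoes.html", "catalog/category/view/id/3"),  # made-up normal row
    ],
)

# Inspect the rows with an empty request_path before deleting anything.
offending = conn.execute(
    "SELECT request_path, target_path FROM core_url_rewrite WHERE request_path = ''"
).fetchall()
print(offending)

# Delete only the rows with an empty request_path, then count what is left.
conn.execute("DELETE FROM core_url_rewrite WHERE request_path = ''")
remaining = conn.execute("SELECT COUNT(*) FROM core_url_rewrite").fetchone()[0]
print(remaining)
```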
If you are using the Enterprise version, you should check the enterprise_url_rewrite table. The queries below should help you:
select * from enterprise_url_rewrite where request_path="";
delete from enterprise_url_rewrite where request_path="";
Check the following:
1- System->Configuration->general->Web->secure
Base URL: http://www.your-site.com/
(be sure you are in the right shop site from the top left of the system)
2- System->Configuration->general->Web->Default Pages
Default Web URL: cms (yes just cms)
CMS Home Page: select your cms page
Default No-route URL: cms/index/noRoute
3- Check that your server has rewrites enabled; otherwise set
System->Configuration->general->Web->Search Engines Optimization
Use Web Server Rewrites: No
4- If something is wrong with your .htaccess, replace the file with the one from your original installer.
Because Stack Overflow has this stupid rule that I must have a certain number of "points" before I can chime in on discussions, I need to post this as a "new answer":
open-ecommerce.org's #2 also solved the problem for me
2- System->Configuration->general->Web->Default Pages Default Web URL:
cms (yes just cms) CMS Home Page: select your cms page Default
No-route URL: cms/index/noRoute
For me this was set to "index". No clue how it worked before, or why it was set that way, but it broke after updating, and this was the fix for me.
Note that I also truncated the 'core_url_rewrite' table while troubleshooting. If you don't change URLs at all, or not often, then that's no big deal. But if you change URLs often, you'll lose your 301 redirect history (all old URLs will 404 going forward).
Open-ecommerce.org's solution #2 worked for me.
I initially set my "Default Web URL" to the URL of the homepage, similar to what I did with the base URL, but I was wrong.
I changed it to "CMS"
From there, it all works.
Thanks!
