GSA crawl vs. content feed: which is the better approach? - google-search-appliance

I have been running GSA with content crawling for a good while and have always seen issues with search results: the expected results are either missing or show up in the wrong places. This could be due to a wrong configuration or something else. However, it has been working.
Since the last update of the website, the sorting of results is a mess and I am unable to find a way out of it. The pattern of the last-modified date (meta tag) is not different from new pages; I guess this causes great inconsistency in the content. The results always start with old content, no matter whether I sort by date or relevance.
I am thinking of switching to a content feed and feeding all content from the database to GSA. But I would like opinions: is this the better approach, or is crawling still the better option?

You have to tell GSA which date to use for sorting the results.
By default, GSA inspects the "Last-Modified" response header (while crawling the web content) to update the sort date, i.e. <FS name="date" value="YYYY-MM-DD">. If your application is not sending the "Last-Modified" response header, then you have to configure "Document Dates" in the GSA admin console. This helps GSA extract the date from your metadata and update the FS date accordingly.
You can read about document dates configuration here.
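Before changing any configuration, it may help to check whether your pages send the header at all. The following is only a diagnostic sketch, not anything GSA provides: findLastModified is a made-up name, and the headers object is assumed to be a plain name/value map taken from whatever HTTP client you use.

```javascript
// Hypothetical helper: looks for a Last-Modified header in a plain
// object of response headers (names compared case-insensitively).
// Returns a Date, or null when the header is missing or unparsable.
function findLastModified(headers) {
  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase() === 'last-modified') {
      const parsed = new Date(value);
      return Number.isNaN(parsed.getTime()) ? null : parsed;
    }
  }
  return null;
}
```

If this returns null for your pages, the crawler has no header to take the date from, which is the case where the Document Dates setting comes into play.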
Regarding your question on which is better, web crawl or content feed:
Feeds are meant for documents that need special handling.
Read this to understand when/why to use feeds. If your GSA can crawl the content over the web, you should go with web crawl.
Regards,
Mohan.

Related

Google Custom Search API returning HTML documents instead of images

I started using the Google Custom Search API for a project, the idea is to search for images, and I wanted to use the Custom Search because the Google Images API is deprecated.
I already enabled image search on the CSE console
My query is like this:
https://www.googleapis.com/customsearch/v1?key=APIKEY&cx=CSECX&q=flower&alt=json&searchType=image&num=1&start=NUMBER
Where NUMBER is a random value between 1 and 20
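For reference, the request URL above could be assembled roughly like this. It is only a sketch: buildImageSearchUrl and randomStart are illustrative names, and APIKEY and CSECX are the same placeholders as in the URL above.

```javascript
// Builds the Custom Search request URL with the parameters shown above.
// `start` is passed in so the caller controls the result offset.
function buildImageSearchUrl(apiKey, cx, query, start) {
  const params = new URLSearchParams({
    key: apiKey,
    cx: cx,
    q: query,
    alt: 'json',
    searchType: 'image',
    num: 1,
    start: start,
  });
  return 'https://www.googleapis.com/customsearch/v1?' + params.toString();
}

// Random integer between 1 and 20 inclusive, as described above.
function randomStart() {
  return Math.floor(Math.random() * 20) + 1;
}
```

A call would look like buildImageSearchUrl('APIKEY', 'CSECX', 'flower', randomStart()).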
Sometimes, it returns results like this:
{u'kind': u'customsearch#result', u'title': u'Flower Wallpaper Tumblr #6790199', u'displayLink': u'7-themes.com', u'htmlTitle': u'<b>Flower</b> Wallpaper Tumblr #6790199', u'snippet': u'Flower Wallpaper Tumblr', u'htmlSnippet': u'<b>Flower</b> Wallpaper Tumblr', u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg', u'image': {u'thumbnailWidth': 150, u'byteSize': 808360, u'height': 1200, u'width': 1920, u'contextLink': u'http://7-themes.com/6790199-flower-wallpaper-tumblr.html', u'thumbnailLink': u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSad0z_Wla0nRHAcQrjO5jLQkFjcoqnNHhejjuGmdA1AW2BqIVEpLARAk0s', u'thumbnailHeight': 94}}
Highlighting the interesting part:
u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg'
So it seems that the URL is http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg and the mimetype is image/jpeg, but if you go to the URL, you'll see it's not an image but an HTML document.
Of course, I could handle this as an exception, but I don't want to waste daily API requests (out of a limit of 100 per day) because the API didn't give me an image when I explicitly asked for one.
So, the question is: Is this normal behaviour, or misconfiguration/misuse on my part? If it is the latter, how could I fix it?
Thanks for your attention
After a little bit of reading, my best guess is that some servers are doing a resource redirect to prevent external sources from hotlinking directly to a resource. The file in question is advertised as an image, but accessing it from an external server will provide an HTML document instead. This is not a URL redirect, so it isn't detected by clients (including the Google crawler) until the resource is downloaded.
This sort of resource redirect is done on Apache servers using the .htaccess file and the RewriteEngine, with a technique similar to the one described here, although that particular technique can't be used to bait-and-switch images for HTML documents.
In short, if a server is lying about what type of file it's hosting, Google can't do anything about that. You can confirm that this is not an issue with the custom search API by performing the same query on the normal web search interface -- notice that clicking the image loads an HTML document rather than the image itself.
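Since the API can't catch this for you, one workaround is to look at the Content-Type the server actually sends before you treat a result as an image. Here is a minimal sketch of that check; isImageContentType is a made-up name, and the header value is assumed to come from whatever HTTP client you use to request the link.

```javascript
// Returns true when a Content-Type header value names an image media type.
// Parameters such as "; charset=utf-8" are ignored.
function isImageContentType(contentType) {
  if (typeof contentType !== 'string') return false;
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType.startsWith('image/');
}
```

Keep in mind that hotlink protection usually depends on the request's Referer, so the check is only meaningful when it is made the same way your application will eventually load the image.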

Google is indexing future dates on my codeigniter datepicker

If I do a Google search for "harbour holidays 2 strand", it returns my client's site, www.padstow-self-catering.co.uk.
The problem is that Google has decided to index a future date, which can be seen from the datepicker on the right sidebar. Nearly all searches for specific holiday properties have this issue, and the future date is different for each.
I have no idea why this is happening.
The "2016" comes from the URL that's been indexed:
http://www.padstow-self-catering.co.uk/properties/map/46/2016/12
Presumably somewhere there's a page with that URL on, and Google indexed that.
Personally I'd probably make the date picker parameters URL parameters instead:
http://www.padstow-self-catering.co.uk/properties/map/46?year=2016&month=12
... or remove it entirely from your links. Either way, it's not that Google has "decided to index" pages for a future date - they're just pages.
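To illustrate the difference between the two URL styles, here is a small sketch that turns the path-style date into query parameters. The function name and the assumption that the last two path segments are a four-digit year and a month are mine; your routing may differ.

```javascript
// Moves a trailing /YYYY/MM path suffix into ?year=YYYY&month=MM.
// URLs that do not end in that pattern are returned unchanged.
function toQueryStyleUrl(url) {
  const parsed = new URL(url);
  const match = parsed.pathname.match(/^(.*)\/(\d{4})\/(\d{1,2})\/?$/);
  if (!match) return url;
  parsed.pathname = match[1];
  parsed.searchParams.set('year', match[2]);
  parsed.searchParams.set('month', match[3]);
  return parsed.toString();
}
```

With the indexed URL above as input, this produces the query-parameter form shown above.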
(Note: I work for Google, but have nothing to do with web search. This answer should not be seen as being an "insider" post in any way, nor as representing the view of Google.)

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or find some other way of seeing the actual requests a page and its javascript components send and/or receive. You'll see that, for the search page you referred to, it fetches two parts: First, the "featured" results that come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck parsing these URLs with Hpricot.
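To make the pattern concrete, the two URLs can be built for an arbitrary query like this. It is written in JavaScript only to show how the "q" parameter slots in; the same construction applies in Ruby before handing the URL to Hpricot. The function name is made up, the endpoints and parameters are copied from the URLs above, and they may well change or disappear on Apple's side.

```javascript
// Builds the "featured" and full-results URLs observed above for a query.
// URLSearchParams encodes spaces as "+", matching the q=mac+mini form.
function buildAppleSearchUrls(query) {
  const featured = new URL('http://www.apple.com/global/scripts/search_featured.php');
  featured.search = new URLSearchParams({ q: query, section: 'global', geo: 'uk' }).toString();

  const results = new URL('http://www.apple.com/search/service/nph-search10');
  results.search = new URLSearchParams({ site: 'uk_www', filter: '1', snum: '50', q: query }).toString();

  return { featured: featured.toString(), results: results.toString() };
}
```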

How to write a Google Analytics filter to record site searches

I want to record all my website searches with Google Analytics, but the problem is that my search links look like this:
**www.mywebsite.com/search/category/your+query+here**
From what I found out, I must give GA the query parameter (mywebsite.com/search.php?q=your+query+here), but I have none (and don't want any).
Is there a way to rewrite the URL with a Google Analytics filter? If yes, how?
Yes, you can create a custom filter that rewrites the URL /search/<category>/<query> to ?q=<query>&c=<category>.
Go to Analytics Settings › Filter Manager, and click Add Filter. Choose Custom Filter in the Filter Type drop-down list, select the Search and Replace radio button, and then fill in the two Request URI fields (search string and replace string) with the corresponding values. For further details, see the 'How do I create a filter?' page in the Google Analytics Help Center.
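The rewrite itself can be tried out in plain JavaScript before touching the filter. This only illustrates the search/replace idea; the exact pattern is my assumption, and the filter fields in GA have their own syntax for the expression and the captured groups, so check the help page for that.

```javascript
// Rewrites /search/<category>/<query> to ?q=<query>&c=<category>.
// Request URIs that do not match the pattern are returned unchanged.
function rewriteSearchUri(requestUri) {
  return requestUri.replace(/^\/search\/([^\/?]+)\/([^\/?]+)\/?$/, '?q=$2&c=$1');
}
```

For /search/category/your+query+here this returns ?q=your+query+here&c=category.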
Keep in mind! Since past visitor data cannot be reprocessed, always keep a 'raw' profile that you do not apply filters against. For further details, see the chapter 'Best Practices for Filters & Profiles' in the presentation 'Filters in Google Analytics'.
Site Search is processed BEFORE Filters are applied.
I went through a week of testing to realize this. Yes, the Filter logic is correct, but as of Nov 1, 2009 this will not work with Site Search.
We accomplished this by appending the ?search= parameter to the page URI in the GA script. Then we strip the search params in the Profile Settings, and we get the pure URIs in the content section as well as the searches tracked in Site Search.
I know this is old, but to expand on the previously accepted answer, use a 'virtual url' in the _trackPageview call, so for www.mywebsite.com/search/category/your+query+here have
gat._trackPageview( "/search/content/your+query+here?query=your+query+here&cat=category" )
This means that URLs won't be changed, so everything else works (as noted in the previous answer) - if you really want to you could remove the search params, but unless you're running into a URL limit I'd probably prefer to keep them present so they can be seen in the content reports.
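If you don't want to hard-code the virtual URL, it can be derived from the current path. A rough sketch, assuming the /search/<category>/<query> layout from the question; toVirtualPageview is a made-up name, and the query is passed through as-is because it is already URL-encoded in the path.

```javascript
// Appends ?query=<query>&cat=<category> to a /search/<category>/<query> path.
// Paths that do not match are returned unchanged.
function toVirtualPageview(pathname) {
  const match = pathname.match(/^\/search\/([^\/]+)\/([^\/]+)\/?$/);
  if (!match) return pathname;
  const [, category, query] = match;
  return `${pathname}?query=${query}&cat=${category}`;
}

// In the page, with the tracker object from the snippet above:
// gat._trackPageview(toVirtualPageview(window.location.pathname));
```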

Web Programming with AJAX, Problem with caching (I think)

Web programmer here - using AJAX (HTML, CSS, JavaScript, AJAX, PHP, MySQL), but for some reason Internet Explorer is acting up (surprise surprise).
AJAX is updating query results on the HTML page, via a PHP script that queries a MySQL Database.
Everything is working fine, except when I use Internet Explorer 8.0.
There are several php scripts, which allow for the data to be ordered according to certain criteria, and for testing purposes I have attached the mktime field (current time, in the format HH:MM:SS) to the beginning of the results for each query.
When I use IE, these times appear to remain constant, whereas with ALL other browsers these times are correct and display the current time.
I think the issue has something to do with caching or something along those lines anyway.
Any thoughts or suggestions welcome...
Here is an article on the caching issue.
If your request is a GET change it to a POST, this will prevent the results being cached.
GET requests are cached in IE; switch it to a POST request and it won't be cached anymore.
Instead of switching to POST, which can be ugly if you're not really using it to update or create content, you should append a random number to the query string, as in http://domain.com/ajax/some-request?r=123456. If this number is unique for every request you won't have caching problems.
What I have done is keep the "GET" and add a new dummy query parameter to the query string, as follows:
./BaseServlet?sname=3d_motor&calcdir=20110514&dummyParam=datetime
I set dummyParam to the value of a Date object in the javascript, so that every time the URL is generated the browser treats it as a new URL and fetches new (fresh) results.
var d = new Date();
url = url + '&dummyParam='+d.valueOf();
So instead of generating random numbers, this is an easy way!
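The same idea can be wrapped in a small helper that also copes with URLs that have no query string yet. This is just a sketch: addCacheBuster is a made-up name, the parameter name dummyParam is taken from the snippet above, and URLs containing a # fragment are not handled.

```javascript
// Appends a cache-busting parameter to a URL.
// `stamp` defaults to the current time in milliseconds, so each call
// normally yields a URL the browser has not cached yet.
function addCacheBuster(url, stamp = Date.now()) {
  const separator = url.includes('?') ? '&' : '?';
  return `${url}${separator}dummyParam=${stamp}`;
}
```

Calling addCacheBuster(url) right before each AJAX request keeps the request a GET while still getting around IE's cache.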

Resources