Google Custom Search API returning HTML documents instead of images - google-api

I started using the Google Custom Search API for a project. The idea is to search for images, and I chose Custom Search because the Google Images API is deprecated.
I have already enabled image search in the CSE console.
My query looks like this:
https://www.googleapis.com/customsearch/v1?key=APIKEY&cx=CSECX&q=flower&alt=json&searchType=image&num=1&start=NUMBER
where NUMBER is a random value between 1 and 20.
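In Python the call looks roughly like this (a minimal sketch assuming the requests library; APIKEY and CSECX are placeholders for my real values):

    import random
    import requests

    # Sketch of the call described above; APIKEY and CSECX are placeholders.
    params = {
        "key": "APIKEY",
        "cx": "CSECX",
        "q": "flower",
        "alt": "json",
        "searchType": "image",
        "num": 1,
        "start": random.randint(1, 20),  # NUMBER: random value between 1 and 20
    }
    response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    item = response.json()["items"][0]
    print(item["link"], item["mime"])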
Sometimes, it returns results like this:
{u'kind': u'customsearch#result', u'title': u'Flower Wallpaper Tumblr #6790199', u'displayLink': u'7-themes.com', u'htmlTitle': u'<b>Flower</b> Wallpaper Tumblr #6790199', u'snippet': u'Flower Wallpaper Tumblr', u'htmlSnippet': u'<b>Flower</b> Wallpaper Tumblr', u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg', u'image': {u'thumbnailWidth': 150, u'byteSize': 808360, u'height': 1200, u'width': 1920, u'contextLink': u'http://7-themes.com/6790199-flower-wallpaper-tumblr.html', u'thumbnailLink': u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSad0z_Wla0nRHAcQrjO5jLQkFjcoqnNHhejjuGmdA1AW2BqIVEpLARAk0s', u'thumbnailHeight': 94}}
Highlighting the interesting part:
u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg'
So it seems that the URL is http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg and the MIME type is image/jpeg, but if you go to the URL, you'll see it's not an image but an HTML document.
Of course, I could catch this as an exception, but I don't want to waste daily API requests (out of a limit of 100 per day) just because the API didn't give me an image when I explicitly asked for one.
So, the question is: is this normal behaviour, or a misconfiguration/misuse on my part? If it's the latter, how can I fix it?
Thanks for your attention.

After a little bit of reading, my best guess is that some servers are doing a resource redirect to prevent external sources from hotlinking directly to a resource. The file in question is advertised as an image, but accessing it from an external server will provide an HTML document instead. This is not a URL redirect, so it isn't detected by clients (including the Google crawler) until the resource is downloaded.
This sort of resource redirect is done on Apache servers using the .htaccess file and the RewriteEngine, with a technique similar to the one described here, although that particular technique can't be used to bait-and-switch images for HTML documents.
In short, if a server is lying about what type of file it's hosting, Google can't do anything about that. You can confirm that this is not an issue with the custom search API by performing the same query on the normal web search interface -- notice that clicking the image loads an HTML document rather than the image itself.
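Google can't fix the lie, but you can detect it on your side before using the link: checking the actual Content-Type costs a request to the hosting server, not Custom Search quota. A rough sketch, assuming the requests library:

    import requests

    def is_real_image(url, timeout=10):
        """Check whether a URL actually serves an image rather than an HTML page."""
        try:
            # Some hosts mishandle HEAD, so use a streamed GET and close the
            # connection without downloading the body.
            resp = requests.get(url, stream=True, timeout=timeout)
            content_type = resp.headers.get("Content-Type", "")
            resp.close()
            return content_type.startswith("image/")
        except requests.RequestException:
            return False

    # Example: the link from the result above claims image/jpeg but serves HTML.
    print(is_real_image("http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg"))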

Related

Retrieve images from Instagram that contain more than one tag

I'm retrieving images from Instagram, but I can't figure out why, with this call:
https://api.instagram.com/v1/tags/trytag/media/recent?access_token=ACCESS_TOKEN
I can only retrieve images tagged with the single tag trytag.
If I post an image with more tags, the call above can't retrieve it.
The Instagram v1 API doesn't currently allow searching by multiple tags. You could query for each tag separately and combine your results, or alternatively use a third-party search like Mixagram, which is most likely doing the same thing but doesn't require an enormous amount of storage space upfront.
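If you go the query-each-tag-and-combine route, the combining step is just an intersection on media IDs. A rough sketch, assuming the requests library and the v1 endpoint from the question (pagination ignored for brevity):

    import requests

    def media_for_tag(tag, access_token):
        """Fetch recent media for a single tag (first page only)."""
        url = "https://api.instagram.com/v1/tags/{}/media/recent".format(tag)
        resp = requests.get(url, params={"access_token": access_token})
        return {item["id"]: item for item in resp.json().get("data", [])}

    def media_with_all_tags(tags, access_token):
        """Keep only media that appear in the results for every tag."""
        results = [media_for_tag(tag, access_token) for tag in tags]
        common_ids = set.intersection(*(set(r) for r in results))
        return [results[0][media_id] for media_id in common_ids]

    # Example: images tagged with both "trytag" and "anothertag"
    matches = media_with_all_tags(["trytag", "anothertag"], "ACCESS_TOKEN")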

How to crawl/index the links on a single page: Google Search Appliance

I'm new to the GSA and don't have full admin access to the system, so I have to forward requests to ICT Services to have changes made to our crawls and collections.
I hope someone can help with this question:
I have a single web page with a list of links to about 180 documents (most of which are stored in the same subdirectory, /docs/, which contains some 2,400 documents). The rest are scattered across the site in a number of other subdirectories, e.g. /finance/, /hr/, etc.
At the moment I either get the single web page indexed and none of the 180 linked documents, or I get the page plus all 2,400 documents in the /docs/ subdirectory.
I want to be able to crawl/index just this page and the 180 linked documents, and create a separate collection from them.
Is there a simple way to do this?
Regards
Henry
Another possible solution is to use a robots.txt file to disallow crawling of the other pages you don't want. This would be a lot of work if you have to enumerate all of them, though.
Your best bet is to see if there is some common URL pattern you can use to specify only the 180 pages you do want. For example, are the pages you do want all PDFs, while the files you do not want are all some other type? If you can find something that is common to all the pages you want and isn't true for the other pages, you can use that to formulate a pattern (maybe using regex) to do what you want.
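Purely as an illustration, suppose (hypothetically) the 180 documents you want are all PDFs directly under /docs/ while the unwanted ones are not; a rule along these lines would select them. It's sketched here as a Python regex just to show the idea, since the exact pattern you enter follows the GSA's own pattern syntax:

    import re

    # Hypothetical rule: keep only PDFs directly under /docs/ linked from the list page.
    wanted = re.compile(r"^https?://intranet\.example\.com/docs/[^/]+\.pdf$")

    print(bool(wanted.match("http://intranet.example.com/docs/report-001.pdf")))   # True
    print(bool(wanted.match("http://intranet.example.com/docs/archive/old.doc")))  # False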
Instead of configuring a URL pattern under Start URLs and Follow Patterns,
configure the complete URLs. Take the 180 document URLs plus the single web page URL and put all 181 URLs under Start URLs and Follow Patterns. By configuring complete URLs, you keep the GSA from crawling the other URLs in the application, since there is no common URL pattern under Follow Patterns.
Create a new collection and place all 180 document URLs plus the single web page
URL (or a generic pattern matching the 181 URLs) in that collection under "Include Content Matching the Following Patterns".
I assume that you do not want to index the other 2,400 documents on the GSA.
Hope this helps.
Regards,
Mohan.
You would be better off using a metadata-and-URL feed for this.
It lets you control whether the GSA follows the links in your 180 pages if you feed them in, or whether it indexes your list page if you feed just that. You do this by specifying noindex or nofollow.
You'll still need to have your follow and crawl patterns and collections set up correctly, but it's the easiest way to control what gets indexed.
You don't necessarily need to write code for this either; you can use curl and hand-craft the XML.
The documentation is pretty good and easy to follow: Feeds Protocol Developer's Guide.
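For illustration, here's a rough sketch of what such a feed and push might look like, done in Python rather than curl. The host name, datasource name, and URLs are made up, and the feed format follows the Feeds Protocol Developer's Guide:

    import requests

    # Hypothetical GSA host and datasource name; the 181 URLs would come from the list page.
    FEEDERGATE = "http://gsa.example.com:19900/xmlfeed"
    DATASOURCE = "list_page_docs"
    urls = [
        "http://intranet.example.com/links-page.html",
        "http://intranet.example.com/docs/report-001.pdf",
        # ... the remaining 179 document URLs
    ]

    # mimetype is required by the feed DTD for each record.
    records = "\n".join(
        '    <record url="{}" mimetype="text/html" crawl-immediately="true"/>'.format(u)
        for u in urls
    )
    feed_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        '<gsafeed>\n'
        '  <header>\n'
        '    <datasource>{}</datasource>\n'
        '    <feedtype>metadata-and-url</feedtype>\n'
        '  </header>\n'
        '  <group>\n{}\n  </group>\n'
        '</gsafeed>'
    ).format(DATASOURCE, records)

    # Push the feed to the GSA feedergate (multipart POST, same idea as the documented curl -F call).
    requests.post(FEEDERGATE, files={
        "feedtype": (None, "metadata-and-url"),
        "datasource": (None, DATASOURCE),
        "data": ("feed.xml", feed_xml, "text/xml"),
    })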

GSA crawl vs. content feed: which is the better approach?

I have been running the GSA with content crawling for quite some time and have always seen issues with the search results; the expected results are never there, or are found in the wrong places. This could be due to a wrong config or something else. However, it has been working.
Since the last update of the website, the sorting of results is a mess and I am unable to find a way out of it. The last-modified date (meta tag) pattern is no different on the new pages; I guess due to this there is great inconsistency in the content, and the search always starts from old content no matter whether I sort by date or relevance.
I am thinking of switching to a content feed and feeding all content from the database to the GSA. But I'd like opinions on whether this is the better approach or whether crawling is still the better option...
You have to tell the GSA which date to use for sorting the results.
By default, the GSA inspects the "Last-Modified" response header (while crawling the web content) to set the sort date, i.e. <FS name="date" value="YYYY-MM-DD">. If your application is not sending the "Last-Modified" response header, then you have to configure "Document Dates" in the GSA admin console. That will help the GSA extract the date from your metadata and update the FS date accordingly.
You can read about document dates configuration here.
Regarding your question on which is better, web crawl or content feed:
feeds are meant for crawling documents which need special handling.
Read this to understand when/why to use feeds. If your GSA can crawl the content over the web, you should go with a web crawl.
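If you do end up going the content feed route, note that each record can carry its own date explicitly, which sidesteps the Last-Modified guesswork. A rough sketch of a single record built as a Python string (the URL, date, and body are invented; see the feeds documentation for the full record format):

    # One content-feed record with an explicit last-modified date (invented values).
    # The "date" meta element can then be picked up via the Document Dates configuration.
    record = (
        '<record url="http://www.example.com/news/article-42.html"\n'
        '        mimetype="text/html"\n'
        '        last-modified="Tue, 06 Nov 2012 10:00:00 GMT">\n'
        '  <metadata>\n'
        '    <meta name="date" content="2012-11-06"/>\n'
        '  </metadata>\n'
        '  <content><![CDATA[<html><body>Article body here...</body></html>]]></content>\n'
        '</record>'
    )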
Regards,
Mohan.

filtering kml in a static map

I'm developing a desktop application, not a web application.
The software environment is Windows and VB10.
In my user interface I have a browser control where I want to show a map, issuing an address like http://maps.google.com/maps?q= followed by a URL pointing to a KML file with my data.
The problem is: is it possible to filter the data in the KML file in order to show only a subset of it?
Basically you have two options:
1. Pass parameters to a service which generates your filtered KML on the fly.
2. Do it in JavaScript in your browser interface.
Based on your question, I am going to assume option one is out. For option two there are tons of examples on the web, but basically you need to parse the KML yourself and write JavaScript code to handle it however it needs to be done to achieve your filtering; you cannot pass the KML URL to Google Maps directly and achieve any of this behaviour.
Possibly useful example: http://www.gpsvisualizer.com/examples/google_folders.html
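Whichever option you pick, the core step is the same: parse the KML and keep only the Placemarks you want. A minimal sketch of that filtering step in Python (assuming a simple KML where each Placemark has a <name> element; your filter criterion will differ):

    import xml.etree.ElementTree as ET

    KML_NS = "http://www.opengis.net/kml/2.2"

    def filter_kml(kml_text, keep):
        """Return KML containing only the Placemarks for which keep(name) is true."""
        ET.register_namespace("", KML_NS)
        root = ET.fromstring(kml_text)
        for parent in list(root.iter()):
            for placemark in parent.findall("{%s}Placemark" % KML_NS):
                name = placemark.findtext("{%s}name" % KML_NS, default="")
                if not keep(name):
                    parent.remove(placemark)
        return ET.tostring(root, encoding="unicode")

    # Example: keep only Placemarks whose name contains "Milan"
    # filtered = filter_kml(open("mydata.kml").read(), lambda n: "Milan" in n)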
UPDATE
Based on conversation in the comments:
The only other thing I can think of is to create your own map page with the JavaScript to do what you want on it (like http://gpsvisualizer.com/examples/google_folders.html linked above) and then embed it in your app instead of the Google map, essentially encapsulating the features you want. So instead of maps.google.com/maps?q= your app would use myMapURL.com/MyMap?querystring, which is your Google Maps wrapper with the desired filtering. Otherwise I think you are out of luck based on your current setup.

How to write a Google Analytics filter to record site searches

I want to record all my website searches with Google Analytics, but the problem is that my search links look like this:
www.mywebsite.com/search/category/your+query+here
From what I found out, I must give GA a query parameter (mywebsite.com/search.php?q=your+query+here), but I have none (and don't want any).
Is there a way to rewrite the URL with a Google Analytics filter? If yes, how?
Yes, you can create a custom filter that rewrites URL /search/<category>/<query> to ?q=<query>&c=<category>.
Go to Analytics Settings › Filter Manager, and click Add Filter. Choose Custom Filter in the Filter Type drop-down list, select the Search and Replace radio button, and then fill in the two Request URI fields with the corresponding values. For further details, see the ’How do I create a filter?’ page in the Google Analytics Help Center.
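The Search String / Replace String pair is essentially a regular expression with capture groups. Purely as an illustration of the mapping (shown in Python; the exact syntax you type into the two Request URI fields follows GA's filter documentation):

    import re

    # Illustration of the rewrite the filter performs on the Request URI.
    # /search/<category>/<query>  ->  /search?q=<query>&c=<category>
    uri = "/search/category/your+query+here"
    rewritten = re.sub(r"^/search/([^/]+)/(.+)$", r"/search?q=\2&c=\1", uri)
    print(rewritten)  # /search?q=your+query+here&c=category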
Keep in mind! Since past visitor data cannot be reprocessed, always keep a ’raw’ profile that you do not apply filters against. For further details, see chapter ’Best Practices for Filters & Profiles’ in presentation ’Filters in Google Analytics’.
Site Search is processed BEFORE Filters are applied.
I went through a week of testing to realize this. Yes, the Filter logic is correct, but as of Nov 1, 2009 this will not work with Site Search.
We accomplished this by appending the ?search= parameter to the page URI in the GA script. Then we strip the search params in the Profile Settings, and we get the pure URIs in the content section as well as the searches tracked in Site Search.
I know this is old, but to expand on the previously accepted answer: use a 'virtual URL' in the _trackPageview call, so for www.mywebsite.com/search/category/your+query+here you would have
pageTracker._trackPageview("/search/content/your+query+here?query=your+query+here&cat=category");
This means that the actual URLs won't be changed, so everything else keeps working (as noted in the previous answer). If you really want to, you could remove the search params, but unless you're running into a URL limit I'd prefer to keep them so they can be seen in the content reports.
