Enriching Google Search Appliance's indexing with additional URLs

Is there a way to customize how the Google Search Appliance indexes HTML documents? Basically, assuming I have a mapping of keywords to URLs, I'd like the indexer to treat occurrences of the keywords that it finds within HTML documents as if they were links to their respective URLs.
For example, if the keyword/URL mapping were
ABC -> http://alpha.intra.net/beta/charlie
FOOBAR -> http://barbar.intra.net/foo
TANGO -> http://proj.intra.net/tango
XYZ -> http://xxx.intra.net/yotta/zuul
And the document were
<html><body>
Toby was talking about partnering with the folks over in ABC
on the tango project.
But I think the people over in FOOBAR would be a better fit.
</body></html>
The indexer would pull out:
http://alpha.intra.net/beta/charlie
http://proj.intra.net/tango
http://barbar.intra.net/foo
Alternatively, is there a stage before indexing where I could preprocess the HTML to insert such links?

What you are asking for is not possible. You can't tell the GSA, "if you see keyword X, index the URL that X maps to."
However, nothing prevents you from building a proxy that sits between the GSA and the website you index, so that you can perform this transformation on the HTML documents served to the GSA. All you would have to do then is configure the GSA to use that proxy server when crawling the relevant URL pattern.
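As a rough illustration, here is a minimal sketch of such a rewriting step in Python, using only the standard library; the upstream host, the port, and the keyword map are assumptions for the example, not real configuration:

    import re
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical keyword-to-URL mapping from the question.
    KEYWORD_URLS = {
        "ABC": "http://alpha.intra.net/beta/charlie",
        "FOOBAR": "http://barbar.intra.net/foo",
        "XYZ": "http://xxx.intra.net/yotta/zuul",
    }
    UPSTREAM = "http://content.intra.net"  # assumed site the GSA crawls

    class RewritingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # Fetch the original page from the upstream server.
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            # Wrap each keyword occurrence in an anchor tag so the GSA
            # discovers the mapped URL while crawling this page.
            for keyword, url in KEYWORD_URLS.items():
                html = re.sub(r"\b%s\b" % re.escape(keyword),
                              '<a href="%s">%s</a>' % (url, keyword), html)
            body = html.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), RewritingProxy).serve_forever()

A production version would also need to avoid rewriting keywords that appear inside tags or attributes, but the core idea is the same.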

Related

How to index files while uploading to Google Drive and then search using indexing?

Is it possible to index file content during upload to Drive and later search against it using the Google API?
That way the search should be faster. I am also thinking of extracting content (using vision) and using it later for searching via the search APIs.
The search function for the files.list response is very limited. You can't search on all the fields, so unless you want to add some special tag to the name of the file, I don't think there is any way you are going to be able to index them for faster searching.
Yes, you can.
You need to duplicate the file content and set it as contentHints.indexableText within your POST body.
From https://developers.google.com/drive/api/v3/reference/files/create
contentHints.indexableText (string): Text to be indexed for the file to improve fullText queries. This is limited to 128KB in length and may contain HTML elements.
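For illustration, a minimal sketch using the google-api-python-client library; the file name and the extracted text are placeholders, and drive is assumed to be an already-authorized Drive v3 service object:

    from googleapiclient.http import MediaFileUpload

    # Text you extracted yourself (e.g. via OCR), duplicated into
    # contentHints so Drive's fullText search can match it.
    extracted_text = "text pulled out of the file by your own extraction step"

    file_metadata = {
        "name": "scan-001.png",
        "contentHints": {"indexableText": extracted_text},  # limited to 128KB
    }
    media = MediaFileUpload("scan-001.png", mimetype="image/png")

    created = drive.files().create(
        body=file_metadata,
        media_body=media,
        fields="id",
    ).execute()
    print("Created file:", created["id"])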

FileNet Social Collaboration - search by comments

We have social collaboration enabled on our FileNet system. I can add comments, tag, like, and track how many times a document has been downloaded. These features are nice. When I tag a document, I can search for documents by the tag text.
Ex: If I tag a document as, say, "test", I can use a search template to search for the document by its tag value, i.e. test.
When I comment, I can't search document based on Comment Text.
Say I added a comment "good doc". I can't search by that text. Instead I have to provide an integer value, and the search becomes "get all documents which have number of comments = 1". I don't want this behavior; I should be able to search on the comment text itself.
Can anybody help on this?
One way to achieve this would be to use CBR on the property; see how to enable CBR on a property.
The property will then be full-text searchable using the CONTAINS statement; see the documentation.
Optionally (though I'm not sure, as I've never personally used it), the SATISFIES operator might be exactly what you're looking for, according to the documentation.
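As an untested illustration, a CBR query against a hypothetical CBR-enabled string property named CommentText might look like this in FileNet P8 SQL:

    SELECT d.This FROM Document d WHERE CONTAINS(d.CommentText, 'good doc')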

Crawl depth for URLs added through metadata-and-url feed

We need to add specific URLs through a metadata-and-url feed and prevent the GSA from following links found on those pages. URLs found on these pages must be ignored even if they match the Follow Patterns rules.
Is it possible to specify a crawl depth for URLs added through a metadata-and-url feed, or is there some other way to prevent the GSA from following URLs found on specific pages?
You can't solve this problem with just a metadata-and-URL feed. The GSA is going to crawl the links that it finds, unless you can specify patterns to block them.
There are a couple of possible solutions I can think of.
You could replace the metadata-and-URL feed with a content feed. You'd then have to fetch whatever you want indexed and include that content in the feed. Your fetch program could remove all of the links, or it could "break" relative links by specifying an incorrect URL for each of the documents, rewriting those incorrect URLs back to the correct ones in your search result display page. I've done the second approach before, and it's pretty easy to do.
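A minimal sketch of the content-feed approach in Python; the datasource name and the example URL are assumptions for the example:

    import re
    import urllib.request
    from xml.sax.saxutils import escape

    # Skeleton of a GSA content feed with one record; the datasource
    # name "nofollow_pages" is an assumption for the example.
    FEED_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
    <gsafeed>
      <header>
        <datasource>nofollow_pages</datasource>
        <feedtype>incremental</feedtype>
      </header>
      <group>
        <record url="{url}" mimetype="text/html">
          <content>{content}</content>
        </record>
      </group>
    </gsafeed>"""

    def build_content_feed(url):
        # Fetch the page ourselves instead of letting the GSA crawl it.
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # Strip <a> wrappers but keep the anchor text, so the GSA finds
        # no links to follow when it indexes this record.
        html = re.sub(r"(?is)<a\b[^>]*>(.*?)</a>", r"\1", html)
        return FEED_TEMPLATE.format(url=escape(url, {'"': "&quot;"}),
                                    content=escape(html))

    print(build_content_feed("http://www.mysite.com/some/page.html"))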
You could use a crawl proxy to block access to any of the links you don't want the GSA to follow.
The easiest method to prevent this is to add the following to the "HEAD" section of your HTML:

    <meta name="robots" content="nofollow">

This will prevent the GSA (and any other search engine) from following any links on the page.
Since you say that you can't add the relevant nofollow meta tags to your content, you can handle this using your follow and crawl patterns instead.
From the official documentation:
Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.
For example, the following URL patterns cause the search appliance to crawl the top three subdirectories on the site www.mysite.com:
regexp:www\.mysite\.com/[^/]*$
regexp:www\.mysite\.com/[^/]*/[^/]*$
regexp:www\.mysite\.com/[^/]*/[^/]*/[^/]*$

Google Sitelinks don't appear with the domain in some search cases

I'm facing a weird problem with Google Search. When I search for my website using the keywords "dardasha newspaper", I get the expected, correct result: my site comes first, with sitelinks included.
https://www.google.com/search?q=dardasha+newspaper&ie=utf-8&oe=utf-8
But when I search for my website using the keywords "جريدة دردشة" ("Dardasha newspaper" in Arabic), I get the correct result but with no sitelinks:
https://www.google.com/search?q=dardasha+newspaper&ie=utf-8&oe=utf-8#q=%D8%AC%D8%B1%D9%8A%D8%AF%D8%A9+%D8%AF%D8%B1%D8%AF%D8%B4%D8%A9
This is despite my website's language being Arabic, the language used in the second search. Why are the search results different depending on the keywords used?
Google expands a result with sitelinks when you search for the website's domain name, or something very close to it.
Your website is www.dardashanewspaper.com, and you searched for "dardasha newspaper", which is essentially the domain name.
Another problem is that Google thinks "dardasha" in Arabic is درداشا, not دردشة.

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with ElasticSearch yet, but from what I've read I think it suits most of my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into the search engine and offer a front-end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages from one host often contain only a little unique content; a large part of every page is common. The footer, header, menu, etc. are the same on every page.
Does ElasticSearch have some built-in intelligence that can filter out the common parts and search only the real content?
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and the article goes into great detail. There is nothing native in Elastic that I know of with intelligence to help you eliminate duplicates and the like.
The strategy to adopt here is to create a unique key per document. Taking a checksum of the content with SHA-1 or a similar algorithm will do the job of producing that key. Make the checksum the document ID so that only one copy of each page exists at any point in time. Use the _create API to index if you don't want new duplicates to be indexed (more efficient), and if you want the newest copy to become the document, use normal indexing.
If you need to modify the original document when a duplicate is discovered, use an upsert.
I have explained a great deal of this in this blog.
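A minimal sketch of that checksum strategy with the official elasticsearch Python client; the index name, host, and field names are assumptions for the example:

    import hashlib
    from elasticsearch import Elasticsearch, ConflictError

    es = Elasticsearch("http://localhost:9200")

    def index_page(url, content):
        # Use a checksum of the content as the document ID, so pages with
        # identical content collide on the same ID.
        doc_id = hashlib.sha1(content.encode("utf-8")).hexdigest()
        try:
            # op_type="create" fails with a version conflict (HTTP 409) if
            # the ID already exists, so duplicates are rejected rather than
            # re-indexed. Drop op_type to let the newest copy win instead.
            es.index(index="pages", id=doc_id, op_type="create",
                     document={"url": url, "content": content})
        except ConflictError:
            pass  # duplicate content; skip it

    index_page("http://example.com/a", "some scraped page text")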