Google Search Appliance isn't displaying document's title - google-search-appliance

For some reason, our Google Search Appliance isn't displaying the title of some of our larger files (even though they have a title property). Instead, it's showing the file path. For example, it does this for 3 Word documents that are about 4 MB each, but it doesn't do it for a PowerPoint file that is around 5 MB. Any idea what causes this, and is there a workaround to get the title to display?

The GSA takes the title from the meta tags or from the title it can find in the document. If it cannot find a suitable title tag, it falls back to the file path. Suitability can depend on length, format, character encoding, position, etc.
This used to be well documented, but I am struggling to find it now, apart from vague mentions here: https://support.google.com/gsa/answer/4411411?hl=en
Also, as a less important check, make sure the files do not exceed the maximum file size set in the configuration (see the same page).

You can read about how GSA determines the title of documents here.
I don't think file size matters here unless you have specified a very low value under Crawl and Index > Index Settings.

You might be missing the "&getfields=title" query string in your search call. To get all fields, set the query string to "&getfields=*".
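A minimal sketch of what such a search call could look like with getfields appended. The host name gsa.example.com and the collection and front end names are assumptions for illustration; substitute the values from your own appliance.

```javascript
// Hypothetical appliance host; replace with your own GSA host name.
const GSA_HOST = "http://gsa.example.com";

// Build a GSA search URL that requests fields via the getfields parameter.
function buildSearchUrl(query, fields) {
  const params = new URLSearchParams({
    q: query,
    site: "default_collection", // assumed collection name
    client: "default_frontend", // assumed front end name
    output: "xml_no_dtd",       // ask for raw XML results
    getfields: fields,          // "title" for one field, "*" for all fields
  });
  return `${GSA_HOST}/search?${params.toString()}`;
}

console.log(buildSearchUrl("annual report", "*"));
```

Requesting the raw XML this way makes it easy to check which fields the appliance actually returns for the documents that show a file path instead of a title.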

Related

Several occurrences of same Anchor Tag string

I have several occurrences of the same Anchor Tag string, /sn1/, in one document. But when signing I can see only one generated field, near the first match. The documentation says that a sign tab is created in every place a match is found in a document. What am I doing wrong, or should the strings be unique?
Update: My document is in Hebrew (RTL), which is probably connected with the problem: I tested another document, this time in English, and got multiple fields at the anchor string instances with no problem.
Well, it was an incorrect conversion via the libreconv gem that prevented the additional sign tabs from appearing. The problem was in the initial conversion from .docx to .pdf using that gem. I do not recommend it, at least for RTL documents.

How to index files while uploading on google drive and then search using indexing?

Is it possible to index file content during upload to drive and later search against it using Google API?
In this way it should be faster. Also, I am thinking of extracting the content (using vision) and using it to search later via the search APIs.
The search function for the files.list response is very limited. You can't search on all the fields, so unless you want to add some special tag to the name of the file, I don't think there is any way you are going to be able to index them for faster searching.
Yes you can.
You need to duplicate the file content and set it as contentHints.indexableText within your POST body.
From https://developers.google.com/drive/api/v3/reference/files/create
contentHints.indexableText (string): Text to be indexed for the file to improve fullText queries. This is limited to 128KB in length and may contain HTML elements.
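As a rough sketch of the metadata part of such a POST body: the helper names, the file name, and the sample text below are hypothetical, and the upload call itself is left out. The only facts taken from the reference are the contentHints.indexableText field and its 128KB limit.

```javascript
// Limit quoted from the files.create reference: 128KB of indexable text.
const MAX_INDEXABLE_BYTES = 128 * 1024;

// Trim text so its UTF-8 encoding fits the limit without splitting a character.
function truncateUtf8(text, maxBytes) {
  const encoder = new TextEncoder();
  let bytes = 0;
  let out = "";
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const size = encoder.encode(ch).length;
    if (bytes + size > maxBytes) break;
    bytes += size;
    out += ch;
  }
  return out;
}

// Build the metadata object for a files.create request (hypothetical helper).
function buildFileMetadata(name, extractedText) {
  return {
    name,
    contentHints: {
      indexableText: truncateUtf8(extractedText, MAX_INDEXABLE_BYTES),
    },
  };
}

const metadata = buildFileMetadata("scan-001.png", "invoice 2019 acme");
console.log(JSON.stringify(metadata));
```

Once the file is uploaded with this metadata, a files.list call with a fullText query (for example q set to fullText contains 'invoice') should be able to match the indexed text.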

FileNet Social Collaboration - search by comments

We have social collaboration enabled on our FileNet system. I can add comment, tag, like and track how many times a document has been downloaded. These features are nice. When I tag a document, I can search documents by the tag text.
Ex: If I tag a document as, say, "test", I can use a search template to search for the document by its tag value, i.e. test.
When I comment, I can't search for documents based on the comment text.
Say I added the comment "good doc". I can't search by that text. Instead, I need to provide an integer value, such as 1, and the search then behaves like "get all documents whose number of comments = 1". I don't want this behavior; I should be able to search on the comment text.
Can anybody help on this?
One way to achieve this would be to use CBR on the property. See how to enable CBR on a property
The property will then be full-text searchable using the CONTAINS statement, see doc.
Optionally (but I'm not sure, as I've never personally used it), the satisfies operator might be exactly what you're looking for, according to the documentation.

CouchDB, all_docs and filter design documents with endkey

First, this question (filtering design documents out of _all_docs) already seemed to be solved, as described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and it worked at first. However, suddenly, in a different setup (actually just a different deployment), the query only returns an empty collection []. It seems like the ordering changed: without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot manage to filter out the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter on the other hand will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore and is there a more convenient, still working way?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc id starts with lowercase letters (default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of "ABC". You will see it show up before the design doc, and your trick to filter out design docs will work in this case.
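To make the ordering concrete, here is a small sketch that sorts a few made-up ids the way ASCII order would and then emulates endkey="_". The ids are hypothetical; JavaScript's default string sort compares UTF-16 code units, which matches ASCII order for plain ASCII ids like these.

```javascript
// Hypothetical ids: three regular documents styles plus one design document.
const ids = ["abc", "_design/app", "ABC", "42", "zebra"];

// Default sort() compares strings by code unit, i.e. ASCII order here.
const ordered = [...ids].sort();
console.log(ordered);
// [ '42', 'ABC', '_design/app', 'abc', 'zebra' ]

// Emulate ?endkey="_": keep only ids that sort at or before "_".
const upToUnderscore = ordered.filter((id) => id <= "_");
console.log(upToUnderscore);
// [ '42', 'ABC' ]
```

Ids starting with digits or uppercase letters come back, the design doc is cut off, and everything starting with a lowercase letter is cut off as well. A database whose ids all start with lowercase letters therefore returns an empty result.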
However, I recommend you stop using the _all_docs view altogether. Instead, use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function (doc) {
  // Emit every document keyed by its id; no value is needed.
  emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
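For completeness, a sketch of how that map function might be packaged in a design document and where the view would then be queried. The design document name "app", the view name "all", and the database name "mydb" are assumptions chosen for illustration.

```javascript
// Hypothetical names: design document "app", view "all", database "mydb".
const designDoc = {
  _id: "_design/app",
  views: {
    all: {
      // CouchDB stores the map function as a string.
      map: "function (doc) { emit(doc._id, null); }",
    },
  },
};

// PUT the design document here, then GET the view path to list the docs.
const putPath = `/mydb/${designDoc._id}`;
const viewPath = `/mydb/${designDoc._id}/_view/all`;
console.log(putPath);
console.log(viewPath);
```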
Also, please look at Unicode Collation order. This is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

How is the search snippet and meta "description" of a post determined?

When you search on Stack Overflow, in most cases a search snippet (the first 40 words or so of a post / question) is shown. In some cases more text is shown, and this text includes the search terms. Both blocks of text end with an ellipsis.
If you look at the meta tag "description" or "og:description", a similar text is included, thus allowing Google to index correctly.
My questions:
What search engine is Stack Overflow using (Elasticsearch / Lucene)?
How and when is the search snippet determined (in real time during the search, or when saving a post / question)?
How and when is the meta description determined?
I ask these questions because I want to avoid coding my own algorithm to determine the first 40 words or so of an HTML article (in our case a blog post).
thx
Marc
Stack Overflow uses Elasticsearch.
Elasticsearch has a highlighting feature that takes care of these things.
The snippet is determined at search time, to find the snippet that is most likely to be relevant for the user's query.
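A sketch of what a search body using highlighting could look like for a blog. The field name "body" and the sizes are assumptions, and this is not a description of Stack Overflow's actual query; it only shows the general shape of the highlight section.

```javascript
// Build an Elasticsearch search body that asks for one highlighted snippet.
// The field name "body" is a hypothetical mapping for the post text.
function buildSnippetQuery(terms) {
  return {
    query: { match: { body: terms } },
    highlight: {
      fields: {
        body: {
          fragment_size: 200,     // approximate snippet length in characters
          number_of_fragments: 1, // one snippet per hit
          no_match_size: 200,     // if nothing matches, return the start of the field
        },
      },
    },
  };
}

console.log(JSON.stringify(buildSnippetQuery("design documents"), null, 2));
```

With no_match_size set, hits without a matching fragment still get a snippet taken from the beginning of the field, which is one way to get "first N words" behaviour without writing a separate truncation routine.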
