Google Search Appliance: Using entity recognition in tandem with document dates - google-search-appliance

Is it possible to use a defined entity as a document date for sorting?
I have created an entity which successfully shows up in dynamic navigation and also responds to the search:
inmeta:gsaentity_pubdate
I have entered a host pattern in the document dates section with a Locate Date In of Meta Tag, and Meta Tag Name of gsaentity_pubdate.
When checking the resulting search XML with sort=D:S:d1 or sort=D:L:d1, the results come back with no date:
<FS NAME="date" VALUE=""/>

I've come to the conclusion that it is not possible to use entity recognition-generated metadata in certain areas of the GSA as of firmware 7.0.
The getfields= parameter does not return gsaentity_ metadata fields in the results (except when using getfields=*; see Michael's comment); it only selects metadata fields present in the original document. This indicates a logical separation between the two kinds of metadata in the GSA: metadata generated by entity recognition cannot be used in many places where a document's own metadata can.

Related

elasticsearch - Tag data with lookup table values

I’m trying to tag my data according to a lookup table.
The lookup table has these fields:
• Key - represents the field name in the data I want to tag.
In the real data the field is a subfield of the "Headers" field.
An example for the "Key" field:
"Server*" (the * is a wildcard)
• Value - represents the wanted value of the field mentioned above.
The value in the lookup table is only part of a string in the real data value.
An example for the "Value" field:
"Avtech"
• Vendor - the value I want to add to the real data if a combination of field and value is found in a document.
An example of a combination in the real data:
"Headers.Server : Linux/2.x UPnP/1.0 Avtech/1.0"
A match for that document in the lookup table would be:
Key = Server (with wildcards on both sides)
Value = Avtech (with wildcards on both sides)
Vendor = Avtech
So basically I'll need to add a field to that document with the value "Avtech".
The subfields in "Headers" are dynamic fields that change from document to document.
If a match is not found, I'll need to set the tag field to the value "Unknown".
I've tried to use the enrich processor, with the lookup table as the source data, "Value" as the match field, and "Vendor" as the enrich field.
In the enrich processor I didn't know how to refer to the field, since it's dynamic, and I wanted to search for the value anywhere in the "Headers" subfields.
Also, I don't think there will be a match between the "Value" in the lookup table and the value of the Headers subfield, since the "Value" field in the lookup table is a substring with wildcards on both sides.
I could use some help accomplishing what I'm trying to do, and with how to search with wildcards inside an enrich processor,
or if you have other ideas besides the enrich processor, such as parent-child or the terms lookup mechanism.
Thanks!
Adi.
There are two ways to accomplish this:
Using the combination of Logstash & Elasticsearch
Using only the Elasticsearch ingest node
Constraint: you need to know the position of the Vendor term occurring in the Header field.
Approach 1
If so, you can use the grok filter to extract the term and, based on the term found, do a lookup to get the corresponding value.
Reference
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_static.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
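Independently of which tool hosts it, the tagging logic the question describes can be sketched in plain Python: scan the dynamic "Headers" subfields, substring-match each lookup row, and tag the vendor or fall back to "Unknown". The lookup rows and header names below are illustrative, not taken from a real index:

```python
import re

# Hypothetical lookup table: each row gives a header-key pattern, a value
# substring, and the vendor tag to apply on a match.
LOOKUP = [
    {"key": "Server", "value": "Avtech", "vendor": "Avtech"},
]

def tag_vendor(headers):
    """Return the vendor of the first lookup row whose key matches a
    header field name and whose value substring appears in that field's
    value; return "Unknown" when nothing matches."""
    for row in LOOKUP:
        # "*Server*"-style wildcards on both sides amount to a substring
        # match on the field name.
        key_re = re.compile(re.escape(row["key"]), re.IGNORECASE)
        for field, value in headers.items():
            if key_re.search(field) and row["value"] in value:
                return row["vendor"]
    return "Unknown"

print(tag_vendor({"Server": "Linux/2.x UPnP/1.0 Avtech/1.0"}))  # -> Avtech
print(tag_vendor({"X-Powered-By": "PHP"}))                      # -> Unknown
```

Whatever extracts the term (grok in Logstash, or a grok ingest processor) is effectively implementing the inner loop above.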
Approach 2
Create an index consisting of KV pairs. In the ingest node, create a pipeline consisting of a grok processor followed by an enrich processor. The grok works the same way as in Approach 1, and you seem to have already got the enrich part working.
Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html
If you are able to isolate the subfield within the Header where the term of interest is present, it would make things easier for you.
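As a rough sketch of Approach 2, the pipeline could look like the JSON body below (shown as a Python dict); the grok pattern, policy name, and field names are illustrative assumptions, and the grok pattern assumes the vendor is the last "Name/version" token in the Server header:

```python
# Illustrative body for PUT _ingest/pipeline/tag-vendor: grok extracts a
# candidate vendor term, enrich looks it up, and set supplies a default.
pipeline = {
    "processors": [
        {
            "grok": {
                "field": "Headers.Server",
                # Capture the word before the final "/<number>".
                "patterns": ["%{GREEDYDATA}%{WORD:vendor_term}/%{NUMBER}$"],
                "ignore_failure": True,
            }
        },
        {
            "enrich": {
                # Assumes an enrich policy "vendor-policy" built on the
                # lookup index, matching on the extracted term.
                "policy_name": "vendor-policy",
                "field": "vendor_term",
                "target_field": "Vendor",
                "ignore_missing": True,
            }
        },
        {
            "set": {
                # Default tag when no lookup row matched.
                "if": "ctx.Vendor == null",
                "field": "Vendor",
                "value": "Unknown",
            }
        },
    ]
}
```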

Dealing with Empty Fields

I am new to StormCrawler and Elasticsearch in general. I am currently using StormCrawler 2.0 to index website data (including non-HTML items such as PDFs and Word documents) into Elasticsearch. In some cases the metadata of a PDF or Word document does not contain a title, so the field is stored blank/null in Elasticsearch. This is unfortunately causing issues in the webapp I am using to display search results (search-ui). Is there a way I can have StormCrawler insert a default value of "Untitled" into the title field if none exists in the metadata?
I understand that elasticsearch has a null_value field parameter, but if I understand correctly that parameter cannot be used for text fields and only helps with searching.
Thanks!
One option would be to write a custom ParseFilter that gives an arbitrary value to any missing key, or to any key with an empty value. The StormCrawler code has quite a few examples of ParseFilters; see also the wiki.
The same could be done as a custom bolt placed between the parser and the indexer: grab the metadata and normalise to your heart's content.
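StormCrawler ParseFilters and bolts are written in Java against its Metadata API, but the normalisation being suggested boils down to this (a language-neutral sketch in Python, treating metadata as a dict of multi-valued string keys; the defaults table is illustrative):

```python
# Defaults to fill in when a key is missing or holds only empty values.
DEFAULTS = {"title": ["Untitled"]}

def normalise(metadata):
    """Return a copy of the metadata with defaults applied to any
    expected key that is absent or contains only blank strings."""
    out = dict(metadata)
    for key, default in DEFAULTS.items():
        values = out.get(key)
        if not values or all(not v.strip() for v in values):
            out[key] = list(default)
    return out

print(normalise({"title": [""]}))  # -> {'title': ['Untitled']}
```

A real ParseFilter would perform the same check in its filter() method before the document reaches the indexing bolt.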

TFS - Changing Description field Datatype

In TFS, what the consequences of changing the Description field's datatype from HTML to Plaintext?
HTML
Supports the ability to capture rich-text data and to use
longer text descriptions such as the Description or Repro Steps
fields. An HTML field differs from a PlainText field in that an HTML
field is strongly typed to HTML for richer displays of information.
HTML fields are automatically indexed for full-text search when
full-text search is available. See Full-Text and partial word
searches.
These are just different field data types. The field data type determines the kind and size of data that you can store in the field. A field can have only one type defined within a team project collection.
PlainText
Supports entry of a text string that can contain more than 255 Unicode
characters, such as the Title field. These fields are automatically
indexed for full-text search, when full-text search is available.
For more details about fields, refer to Field data types and attributes.

ES custom dynamic mapping field name change

I have a use case which is a bit similar to the ES example of dynamic_template where I want certain strings to be analyzed and certain not.
My document fields don't have such a convention and the decision is made based on an external schema. So currently my flow is:
I grab the inputs document from the DB
I grab the appropriate schema (same database, currently using logstash for import)
I adjust the name in the document accordingly (using logstash's ruby mutator):
if not analyzed I don't change the name
if analyzed I change it to ORIGINALNAME_analyzed
This handles the analyzed/not_analyzed problem thanks to the dynamic_template I set, but now the user doesn't know which fields are analyzed, so there's no easy way for them to write queries because they don't know the actual name of the field.
I wanted to use field name aliases but apparently ES doesn't support them. Are there any other mechanisms I'm missing I could use here like field rename after indexation or something else?
For example, this ancient thread mentions that field.sub.name can be queried as just name, but I'm guessing this changed when dots in field names were disallowed some time ago, since I cannot get it to work.
Let the user only create queries with the original name. I believe you have some code that converts this user query to Elasticsearch query. When converting to Elasticsearch query, instead of using the field name provided by the user alone use both the field names ORIGINALNAME as well as ORIGINALNAME_analyzed. If you are using a match query, convert it to multi_match. If you are using a term query, convert it to a bool should query. I guess you get where I am going with this.
Elasticsearch won't mind if a field does not exist. This could be a problem if there is already a field with _analyzed appended in its original name, but with some tricks that can be fixed too.
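The rewrite described above can be sketched as follows (field names and the `_analyzed` suffix are taken from the question; the helper functions themselves are hypothetical):

```python
def expand_match(field, text):
    """Expand a single-field match into a multi_match that also covers
    the _analyzed variant of the field."""
    return {
        "multi_match": {
            "query": text,
            "fields": [field, f"{field}_analyzed"],
        }
    }

def expand_term(field, value):
    """Expand a term query into a bool-should over both field variants."""
    return {
        "bool": {
            "should": [
                {"term": {field: value}},
                {"term": {f"{field}_analyzed": value}},
            ],
            "minimum_should_match": 1,
        }
    }

print(expand_match("description", "fast ssd"))
```

The same expansion pattern applies to any other query type the user is allowed to issue.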

difference between source and fields

What's the difference between source and fields?
According to the documentation, both are used to list the fields we want returned from the index.
fields is best used for fields that are stored; when a field is not stored, it behaves similarly to source.
So if all the fields you want in the result are stored, filtering with "fields" is faster than using source.
fields can also be used to get metadata fields, if they are stored.
However, one limitation of fields is that it can only fetch leaf fields, i.e. it cannot be used on nested fields/objects.
An article on Found (now part of Elastic) provides a good explanation.
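In practice the two options look like this in a search body (index and field names are illustrative; note that newer Elasticsearch versions renamed the top-level `fields` option to `stored_fields`):

```python
# Ask for values served from stored fields (fast when the fields are
# actually stored; otherwise they are pulled from _source anyway).
stored_fields_query = {
    "query": {"match_all": {}},
    "fields": ["title", "date"],
}

# Ask Elasticsearch to slice the requested keys out of the _source JSON.
source_filter_query = {
    "query": {"match_all": {}},
    "_source": ["title", "date"],
}
```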
