I was able to use a Sphinx RT index successfully, but I have two issues.
The first is: how do I use auto-increment for the ID in an RT index?
The second is: how do I get the text fields back? The documentation says "you should explicitly enumerate all the text fields", and I'm not sure how to do that.
I'm using PHP to query the RT index and I can see the results except for the text fields. I'm using the same index definition as in the Sphinx documentation:
index rt
{
type = rt
path = /usr/local/sphinx/data/rt
rt_field = title
rt_field = content
rt_attr_uint = gid
}
Sphinx doesn't have "auto-increment" IDs. You could run a query to find the max ID and then add one, but that's not 'safe' if multiple clients are inserting; there is no way to lock the index.
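For illustration, a sketch of that max-id approach over SphinxQL (assuming a MySQL client connected to searchd's SphinxQL listener, commonly port 9306, and the rt index shown above):

SELECT id FROM rt ORDER BY id DESC LIMIT 1;
-- suppose this returns 41; the client then inserts with id = 41 + 1
INSERT INTO rt (id, title, content, gid) VALUES (42, 'some title', 'some content', 1);

The race is easy to see: two clients can both read 41 and both try to insert 42, so this is only safe with a single writer or an external sequence (e.g. one handed out by the database you're mirroring).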
Fields are not stored in the index. So you CAN'T get them back out. They are tokenized and indexed, but not stored.
The 'enumerate' comment means you need to list all the fields in the index definition (unlike disk indexes, which automatically make a column a field if it's not defined as an attribute).
Attributes, on the other hand, ARE stored and can be retrieved. If you want a column to be both searchable and retrievable, you need to insert it twice: once as a field, and again as an attribute.
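For example, a sketch of what that looks like for content (the attribute name content_stored is made up; rt_attr_string needs Sphinx 1.10-beta or later):

index rt
{
    type = rt
    path = /usr/local/sphinx/data/rt
    rt_field = title
    rt_field = content
    rt_attr_string = content_stored
    rt_attr_uint = gid
}

INSERT INTO rt (id, title, content, content_stored, gid)
VALUES (1, 'my title', 'my text', 'my text', 1);

The field copy of the text gets tokenized for matching; the string-attribute copy comes back in the result set.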
(Note that Sphinx is not really intended to be a 'database', but rather an index to one, so it's designed around the case where it's 'mirroring' the data.)
Related
I’m trying to tag my data according to a lookup table.
The lookup table has these fields:
• Key: represents the field name in the data I want to tag.
In the real data the field is a subfield of the “Headers” field.
An example of the “Key” field:
“*Server*” (* is a wildcard).
• Value: represents the wanted value of the field mentioned above.
The value in the lookup table is only part of a string in the real data value.
An example of the “Value” field:
“Avtech”.
• Vendor: the value I want to add to the real data if a combination of field and value is found in a document.
An example of a combination in the real data:
“Headers.Server : Linux/2.x UPnP/1.0 Avtech/1.0”
A match for that document in the lookup table would be:
Key= Server (with wildcard on both sides).
Value= Avtech (with wildcards on both sides)
Vendor= Avtech
So basically I'll need to add a field to that document with the value “Avtech”.
The subfields in “Headers” are dynamic fields that change from document to document.
If a match is not found, I'll need to set the tag field to the value “Unknown”.
I've tried to use the enrich processor, using the lookup table as the source data, with “Value” as the match field and “Vendor” as the enrich field.
In the enrich processor I didn't know how to reference the field, since it's dynamic, and I wanted to search for the value anywhere in the “Headers” subfields.
Also, I don't think there will be a match between the “Value” in the lookup table and the value of the Headers subfield, since the “Value” field in the lookup table is a substring with wildcards on both sides.
I could use some help accomplishing what I'm trying to do, and with how to search with wildcards inside an enrich processor.
Or, if you have another idea besides the enrich processor, such as the parent-child or terms lookup mechanisms, that would be welcome too.
Thanks!
Adi.
There are two ways to accomplish this:
Using a combination of Logstash & Elasticsearch
Using only the Elasticsearch ingest node
Constraint: you need to know the position of the Vendor term occurring in the Header field.
Approach 1
If so, you can use the grok filter to extract the term and, based on the term found, do a lookup to get the corresponding value. A sketch follows the references below.
Reference
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_static.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
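For illustration, a rough sketch of the Logstash side, assuming the header lands in a [Headers][Server] field and the vendor list is small enough to inline in the pattern (both the field name and the vendor alternation are assumptions, not your actual data):

filter {
  grok {
    # capture a known vendor token anywhere in the header string
    match => { "[Headers][Server]" => "(?<Vendor>Avtech|SomeOtherVendor)" }
    tag_on_failure => ["_vendor_not_found"]
  }
  if "_vendor_not_found" in [tags] {
    mutate { add_field => { "Vendor" => "Unknown" } }
  }
}

For a larger lookup table, the jdbc_static / jdbc_streaming filters from the references can replace the hard-coded alternation.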
Approach 2
Create an index consisting of KV pairs. On the ingest node, create a pipeline consisting of a grok processor followed by an enrich processor. The grok part works the same way as in Approach 1, and you seem to have the enrich part working already. A sketch follows the reference below.
Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html
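For illustration, a sketch of such a pipeline (the pipeline name, policy name, and Headers.Server field are assumptions; the enrich policy is expected to match on the extracted token):

PUT _ingest/pipeline/vendor-tagging
{
  "processors": [
    {
      "grok": {
        "field": "Headers.Server",
        "patterns": ["(?<vendor_candidate>Avtech|SomeOtherVendor)"],
        "ignore_failure": true
      }
    },
    {
      "enrich": {
        "policy_name": "vendor-policy",
        "field": "vendor_candidate",
        "target_field": "Vendor",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "if": "ctx.Vendor == null",
        "field": "Vendor",
        "value": "Unknown"
      }
    }
  ]
}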
If you are able to isolate the subfield within Headers where the term of interest is present, it will make things easier for you.
I have a hash_file field in my index, and want to prevent inserting duplicate documents by checking this field.
How can I check at insert time (not before the insert)?
How can I do this with the bulk API?
PS: I use version 6.8.
Why not use the hash_file field's value as the document ID, so that there is exactly one document for each given hash value and you do not need to worry about checking for duplicates? Unless, of course, you specifically need the documents to have some other type of ID that you are going to use later.
If you decide to use the hash value as the _id, though, keep in mind that _id is limited to 512 bytes in size, and larger values will be rejected.
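For example, with the bulk API on 6.8 you can use a create action and put the hash into _id, so a duplicate fails with a 409 for that item instead of being indexed twice (the index name and fields here are made up):

POST _bulk
{ "create" : { "_index" : "files", "_type" : "_doc", "_id" : "<hash_file value>" } }
{ "hash_file" : "<hash_file value>", "filename" : "a.txt" }

The bulk response reports errors per item, so one duplicate does not fail the rest of the batch.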
Hope this helps.
I want the search response to contain only documents with specified doc IDs. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, the approach there is to create an additional field in the document and then search by that field. Is there another way to deal with this?
Lucene's docIds are intended to be internal keys only. You should not use them as search keys or store them for later use. These IDs are subject to change without warning: they will change when documents are updated or reindexed, and can change at other times, such as during segment merges.
If you want your documents to have a unique identifier, you should generate that key separately from the docId and index it as a field in your document.
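For illustration, a sketch in Java (the my_id field name is made up; TermInSetQuery is available in recent Lucene versions, and on older ones a BooleanQuery of TermQuery clauses does the same job):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// at index time: store your own identifier as a non-tokenized field
Document doc = new Document();
doc.add(new StringField("my_id", "doc-42", Field.Store.YES));
writer.addDocument(doc);  // writer is your IndexWriter

// at search time: filter by your identifiers, never by Lucene's internal docIds
Query byIds = new TermInSetQuery("my_id", new BytesRef("doc-42"), new BytesRef("doc-7"));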
I am using Elasticsearch as a document database, and each record I create has a GUID that the system uses as the record ID. Business people want to offer a feature that lets users have their own automatic file-name convention based on the date and how many records have been created so far that day/month.
What I need is to prevent duplicate user file names. Is there a way to set up an indexed field to be unique, like a SQL unique constraint?
You'd need to use the field that is supposed to be unique as the ID for your documents. By default, a new document with an existing ID overwrites the existing document with the same ID, but you can switch to op_type=create in order to get back an error if a document with the same ID already exists.
There's no way to get the same behaviour with arbitrary fields, though; only the _id field works that way. I would probably consider handling this logic in the application layer instead of within Elasticsearch.
One solution is to use the uniqueId field's value as the document ID and use op_type=create when storing the documents in ES. This way you can make sure your uniqueId field has a unique value and will not be overwritten by another document with the same value.
On this, the Elasticsearch documentation says:
The index operation also accepts an op_type that can be used to force a create operation, allowing for "put-if-absent" behavior. When create is used, the index operation will fail if a document by that id already exists in the index.
Here is an example of using the op_type parameter:
$ curl -XPUT 'http://localhost:9200/es_index/es_type/unique_a?op_type=create' -d '{
"user" : "kimchy",
"uniqueId" : "unique_a"
}'
Running the above request the first time succeeds, but running it again will give you an error.
You can use the column you want the unique constraint on as the _id.
Here is a sample river that uses PostgreSQL. You can change the database driver/DB URL according to your setup.
curl -XPUT localhost:9200/_river/simple_jdbc_river/_meta -d '{
  "type": "jdbc",
  "jdbc": {
    "strategy": "simple",
    "poll": "1s",
    "driver": "org.postgresql.Driver",
    "url": "jdbc:postgresql://DB-URL/DB-INSTANCE",
    "user": "USERNAME",
    "password": "PASSWORD",
    "sql": "select t.id as _id, t.name from topic as t",
    "digesting": true
  },
  "index": {
    "index": "jdbc",
    "type": "topic_jdbc_river1"
  }
}'
As of ES 7.5, there is no such "constraint" for ensuring the uniqueness of a custom field in the mapping.
But you can still work around it by using your own application UUID explicitly as the _id (which is implicitly unique) to achieve your goal:
PUT <your_index_name>/_doc/<your_app_uuid>
{
"a_field": "a_value"
}
Another approach might be to build the string you store in the supposedly unique field around an auto-incrementing integer. This way you ensure from the start that your field values are unique.
You would put your file name together like this:
<current day/month>_<auto-incremented integer>
Auto-incrementing integers are not supported by Elasticsearch per se, but you can mimic them. If you happen to use Node.js, you can use the es-sequence module.
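If you want to stay inside Elasticsearch, a common trick is a counter document bumped via a scripted upsert (a sketch; the counters index, document ID, and seq field are made up, and the URL shape is the 7.x update API):

POST counters/_update/filenames-2020-06?retry_on_conflict=5&_source=true
{
  "script": { "source": "ctx._source.seq += 1", "lang": "painless" },
  "upsert": { "seq": 1 }
}

The response's get._source.seq is the next integer for that counter; concatenate it into the file name. Concurrent updates to the same counter document are applied one at a time on the primary shard, so each caller sees a distinct value.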
I have a problem that happens only rarely with FT search, but once it happens, it stays. I use the following search term in the FT search box in Lotus Notes:
[Tags] = "foo"
In most applications this search term works fine, but for some applications it gives the error "query is not understandable".
It does not matter if I replace the value; e.g. [Tags] = "boo" produces the same result, and so does FIELD Tags = "boo". For the record, [Tag] = "foo" works fine, so it seems to be an issue with the field or field name.
The problem only happens in some applications. Once it starts happening, no views can be searched using that query, and I get the error message every time I search.
It does not help to remove, compact and re-create the FT index.
I get the same error in XPages when using the same search query in a view data source.
I have seen this problem with other field names as well, in other applications.
If I remove the FT index, the search query works.
Creating a new copy of the "broken" database does not resolve the problem.
I tried keeping only one document in the database and creating a new FT index. The document in the view does not have the field "Tags", and it still does not work. (There are other forms in the db with the field name "Tags".)
This is a real show-stopper for me, as I have built some of my XPages around search values from specific fields.
From my own investigation of this problem, I think it has to do with some sort of bug in the FT index. There seems to be some data contained in documents or forms that causes the FT index to not work correctly.
I am looking for a solution, as I have not found a way to repair the index once it has become broken.
Update:
It does not help to follow this procedure
https://www-304.ibm.com/support/docview.wss?uid=swg21261002
Here is my debug info
[1078:0002-2250] IN FTGSearch
[1078:0002-2250] option = 0x400219
[1078:0002-2250] Query: ( FIELD Tags = "foo")
[1078:0002-2250] OUT FTGSearch error = F09
[1078:0002-2250] FTGSearch: found=0, returned=0, start=0, count=0, limit=0
It sounds like you need to fix the UNK table with a compact. Here is the listing of compact options; use a copy-style compact, not in-place:
http://www-01.ibm.com/support/docview.wss?uid=swg21084388
If the Tags field is sometimes numeric, I would advise looking at the database design. The UNK table is a table of all fields in the NSF. The first time a field name is used, it is stored in the UNK table with that data type. Full-text searching uses that data type and only that data type. If you have a field Tags on more than one form in a database, once numeric and once text, you're in for big trouble with full-text searches: the data type used in searches depends on which data type the field had on the first document saved that had that field. Even if you delete all documents that have it as numeric, you won't change the UNK table without the compact. It sounds like that's what you have here. Ensure the database never stores Tags as numeric, delete or change all docs where it is stored as numeric, and then compact.
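For reference, a copy-style compact can be run from the Domino server console (the database path is a placeholder):

load compact apps/mydb.nsf -c

Copy-style compaction rebuilds the database into a new file, which is what rewrites the UNK table; the in-place variants leave it as it is.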
Thank you all for answering. I learned a whole lot about UNK tables and FT index today.
The problem was that I had a numeric field called "Tags" in a form that I hadn't looked at; I really didn't think it would contain a field by that name.
After using the DDE search I found all instances of the Tags field and could easily locate the problem form. I renamed the field in the form, removed the FT index, ran compact -c, and recreated the FT index. Now everything is working fine.
One other thing to note: I have several databases with the same design, but only a few of them had the FT index problem. The reason is probably that some of these databases were created after the form with the faulty Tags field was created.
I am so happy to have solved this.
Lessons learned:
If you plan to use a full-text index in your application, make sure you do not use the same field name with different field types in different forms.
From now on I will probably use shared fields more :-)
One more thing we discovered:
You actually do not need NotesPeek to find out which field type is stored in the UNK table. You can use the "Fields" button in the search bar: if you select the field and the right-hand box displays "contains", you know the UNK table has a text field type set.