I want to use the synonym token filter in Elasticsearch for an index. I downloaded the Prolog version of WordNet 3.0 and found the wn_s.pl file, which Elasticsearch can understand. However, the file contains synonyms for all sorts of words and phrases, while I am really only interested in supporting synonyms for nouns. Is there a way to extract only those entries?
Given that the format of wn_s.pl is
s(112947045,1,'usance',n,1,0).
s(200001742,1,'breathe',v,1,25).
A very crude way of doing that is to execute the following in your terminal, which keeps only the lines from that file that contain the ',n,' string.
grep ",n," wn_s.pl > wn_s_nouns_only.pl
The file wn_s_nouns_only.pl will then contain only the entries that are marked as nouns.
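If it helps, here is a minimal sketch of wiring the filtered file into an index via a synonym token filter with the wordnet format. The index, filter, and analyzer names are made up, and the synonyms_path assumes the file was copied under the node's config directory.

import requests

# Example index using the noun-only WordNet file as a synonym source.
# "analysis/wn_s_nouns_only.pl" is resolved relative to the Elasticsearch
# config directory, so the file needs to be placed there first.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "noun_synonyms": {
                    "type": "synonym",
                    "format": "wordnet",
                    "synonyms_path": "analysis/wn_s_nouns_only.pl"
                }
            },
            "analyzer": {
                "noun_synonym_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "noun_synonyms"]
                }
            }
        }
    }
}

requests.put("http://localhost:9200/my_index", json=settings)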
Related
I'm trying to index Word documents in my Elasticsearch environment. I tried using the Elasticsearch ingest-attachment plugin, but it seems like it's only possible to ingest base64-encoded data.
My goal is to index whole directories of Word files. I tried using FSCrawler, but it sadly currently contains a bug when indexing Word documents. I would be really thankful if someone could explain a way to index directories containing Word documents.
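In case a sketch helps while waiting for an answer, one way around the base64 requirement could look like this: walk the directory, base64-encode each Word file, and push it through an ingest pipeline that uses the attachment processor. The pipeline, index, and path names below are only examples.

import base64
import os
import requests

ES = "http://localhost:9200"

# One-time setup: an ingest pipeline that runs the attachment processor
# on the base64 payload stored in the "data" field.
requests.put(f"{ES}/_ingest/pipeline/attachment", json={
    "description": "Extract text from Word documents",
    "processors": [{"attachment": {"field": "data"}}]
})

# Walk the directory, base64-encode each Word file, and index it
# through that pipeline.
directory = "/path/to/word/files"  # example path
for name in os.listdir(directory):
    if not name.endswith((".doc", ".docx")):
        continue
    with open(os.path.join(directory, name), "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    requests.put(
        f"{ES}/docs/_doc/{name}?pipeline=attachment",
        json={"filename": name, "data": encoded},
    )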
I am using Elasticsearch 6.8 for text searching, and I realised that the Elasticsearch tokenizer breaks text into words using the delimiters listed here: http://unicode.org/reports/tr29/#Default_Word_Boundaries. I am using match_phrase to search one of the fields in my document, and I'd like to remove one delimiter used by the tokenizer.
I did some searching and found solutions like using keyword rather than text. This solution would have a big impact on my search function because it doesn't support partial queries.
Another solution is to use a keyword query with wildcards to support partial queries. But this may hurt query performance, and I would still like to use the tokenizer for the other delimiters.
A third option is to use tokenize_on_chars to define all the characters used to tokenize text. But this requires me to list every other delimiter, so I am looking for something like tokenize_except_chars.
So is there an easy way for me to take one character out of the delimiters the tokenizer uses in Elasticsearch 6.8?
I found that Elasticsearch supports protected_words, which can do the job. More info can be found at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analysis-word-delimiter-tokenfilter.html
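For anyone landing here, a sketch of what that could look like: a whitespace tokenizer followed by a word_delimiter filter whose protected_words lists the tokens that must not be split. The index name, analyzer name, and example words are illustrative.

import requests

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    # Tokens listed here are passed through without splitting.
                    "protected_words": ["wi-fi", "c++"]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_word_delimiter"]
                }
            }
        }
    }
}

requests.put("http://localhost:9200/my_index", json=settings)

# Check how a phrase is tokenized by the new analyzer.
r = requests.post(
    "http://localhost:9200/my_index/_analyze",
    json={"analyzer": "my_analyzer", "text": "Wi-Fi router"},
)
print(r.json())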
My documents have an analyzed field url with content looking like this
http://sub.example.com/data/11/222/333/filename.txt
I would like to find all documents whose filename starts with an underscore. I've tried multiple approaches (wildcard, pattern, query_string, span queries), but I never got the right result. I expect this is because the underscore is a term separator. How can I write such a query? Is it possible at all without changing the field to not analyzed (which I cannot do at the moment)?
It's ElasticSearch 1.5, but we'll be migrating to at least 2.4 in the foreseeable future.
You might be able to write a script that would do that, but it would be amazingly slow.
Your best bet (even though you say you can't right now) is changing the field from analyzed to a multi-field. This way you could have both analyzed and not-analyzed versions to work with.
You could use the Reindex API to migrate all the data from the old version to the new version (assuming you're using ES 2.3 or greater).
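A sketch of what the multi-field plus wildcard approach could look like on the 2.x line; the field name matches the question, while the index name, type name, and query pattern are only illustrative.

import requests

ES = "http://localhost:9200"

# Multi-field mapping (ES 2.x syntax): "url" stays analyzed for normal
# search, "url.raw" keeps the original string untouched. Existing documents
# need to be reindexed before the new sub-field is populated.
mapping = {
    "properties": {
        "url": {
            "type": "string",
            "fields": {
                "raw": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}
requests.put(f"{ES}/my_index/_mapping/my_type", json=mapping)

# With the raw sub-field in place, a wildcard query can match URLs that
# contain a path segment starting with an underscore; a stricter pattern
# may be needed if directories can also start with one.
query = {"query": {"wildcard": {"url.raw": "*/_*"}}}
print(requests.post(f"{ES}/my_index/_search", json=query).json())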
Is there a way to perform a search in a document that I don't want to be stored anywhere? I've got some experience with Sphinx search and ElasticSearch, and it seems they both operate on a database of some kind. I want to search for a word in a single piece of text, in a string variable.
I ended up using nltk and pymorphy, just tokenizing my text and comparing the stems/normalized morphological forms from pymorphy with the search terms. No need for any heavy full-text search weaponry.
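For reference, a rough sketch of that approach, assuming Russian text, nltk, and pymorphy2; the sample sentence and search term are made up.

import nltk
import pymorphy2

# nltk.download("punkt")  # needed once for word_tokenize

morph = pymorphy2.MorphAnalyzer()

def normalize(word):
    # Take the most probable normal (dictionary) form of the word.
    return morph.parse(word)[0].normal_form

def contains(text, search_terms):
    # Tokenize the text and compare normalized forms instead of raw strings.
    tokens = {normalize(t.lower()) for t in nltk.word_tokenize(text)}
    wanted = {normalize(t.lower()) for t in search_terms}
    return bool(tokens & wanted)

print(contains("Кошки спят на диване", ["кошка"]))  # True: "кошки" normalizes to "кошка"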
Is there any technique to eliminate duplicate documents while searching in Elasticsearch? How can I compare the values among the different documents in the search results? Is any script available?
You can use the More Like This API to look for documents that match a specified document's field values. Some customization may be required.
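For example, a more_like_this query can take an already indexed document as the reference and return similar ones; the index, field names, and document id below are illustrative.

import requests

query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "body"],
            # Use an already indexed document as the reference.
            "like": [{"_index": "my_index", "_id": "1"}],
            "min_term_freq": 1,
            "min_doc_freq": 1
        }
    }
}

r = requests.post("http://localhost:9200/my_index/_search", json=query)
for hit in r.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])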