I have ten .txt files and generated a corpus from these files.
After creating my corpus, I need to generate a vector space model (VSM).
For the VSM, I preprocessed the corpus to remove stop words, numbers, punctuation, etc.
Now I need to read the corpus and generate synonyms for each keyword/word present in it.
I am unable to use WordNet to get synonyms for the words in each file of my text corpus.
I want to generate synonyms of each word and append those keywords to the same file.
I'm trying to index Word documents in my Elasticsearch environment. I tried using the Elasticsearch ingest-attachment plugin, but it seems it can only ingest base64-encoded data.
My goal is to index whole directories of Word files. I tried using FSCrawler, but it currently contains a bug when indexing Word documents. I would be really thankful if someone could show me a way to index directories containing Word documents.
When we talk about an inverted index, we usually talk about indexing unstructured text documents. But documents in Elasticsearch are in JSON format; they are "key"-"value" pairs. So I want to know what the inverted index of JSON documents looks like. In other words, when we run a search like "select * from table where name = john", what does ES do?
An inverted index basically stores a relationship between terms and the documents/fields they were found in. Those terms can come from unstructured text, but not only: a JSON document also contains text, which ES analyzes and indexes.
From a 30,000-foot perspective, ES parses each JSON document it receives, iterates over all of its fields, and analyzes/tokenizes the value of each field. The tokens that come out of this analysis process are then indexed into the inverted index.
Long story short, it doesn't have to be unstructured text that gets indexed into an inverted index; it can also be a JSON document, which may contain both structured and unstructured text, as well as numbers, dates, etc.
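The process described above can be illustrated with a toy inverted index. This is a deliberately crude sketch (lowercase-and-split stands in for ES's analysis chain), and the sample documents are invented for illustration:

```python
from collections import defaultdict

def tokenize(value):
    """Crude stand-in for ES analysis: lowercase and split on whitespace."""
    return str(value).lower().split()

def index_docs(docs):
    """Build a term -> set of (doc_id, field) mapping, roughly mimicking
    how ES fills its inverted index from the fields of JSON documents."""
    inverted = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for token in tokenize(value):
                inverted[token].add((doc_id, field))
    return inverted

docs = {
    1: {"name": "john doe", "city": "Paris"},
    2: {"name": "jane roe", "city": "paris"},
}
inverted = index_docs(docs)

# A query like name = "john" becomes a lookup of the term "john",
# restricted to postings whose field is "name":
hits = {doc for doc, field in inverted["john"] if field == "name"}
# hits == {1}
```

In real ES the postings are kept per field (each field has its own inverted index), but the principle is the same: analyzed tokens point back at the documents that contain them.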
Given the following dataset input format: TextA TextB
Is it possible to use a single Hadoop MapFile to provide indexing (binary-search support) on the first column (TextA) and also on the second one (TextB)?
The idea would be to have the same data folder, but with different index files.
You can't: the data file MUST be sorted by key.
If you look at how MapFile is implemented, you will see why it cannot work:
The large data file is sorted by key.
The index file contains every Nth key and is loaded into memory.
When you do a get, the two neighboring keys in the index file are found. Then a binary search is done in the large data file (that's why it must be sorted by key).
How would you meet the sorting requirement for both columns with a single data file?
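The two-level lookup described above can be sketched in a few lines. This is an in-memory illustration of the scheme, not MapFile's actual code: `data` plays the role of the sorted data file, `index` the role of the sparse index file (here every 128th key), and `get` shows why the binary search falls apart the moment the data isn't sorted by key.

```python
import bisect

# Sorted records (key, value), as a MapFile's data file must be.
data = [(f"key{i:04d}", f"value{i}") for i in range(1000)]

# Sparse in-memory index: every 128th key with its position,
# analogous to MapFile's index file.
index = [(data[i][0], i) for i in range(0, len(data), 128)]
index_keys = [k for k, _ in index]

def get(key):
    """Find the index entry at or before `key`, then binary-search
    only the bounded slice of the sorted data."""
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first indexed key
    start = index[i][1]
    end = index[i + 1][1] if i + 1 < len(index) else len(data)
    block_keys = [k for k, _ in data[start:end]]
    j = bisect.bisect_left(block_keys, key)
    if j < len(block_keys) and block_keys[j] == key:
        return data[start + j][1]
    return None
```

If you needed lookups on TextB as well, you would have to write a second MapFile whose data file is sorted by TextB, i.e. duplicate the data, not just the index.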
Is there a way to perform a search in a document that I don't want to be stored anywhere? I've got some experience with Sphinx and Elasticsearch, and it seems they both operate on a database of some kind. I want to search for a word in a single piece of text, in a string variable.
I ended up using nltk and pymorphy, just tokenizing my text and comparing stems/normalized morphological forms from pymorphy with the search terms. No need for any heavy full-text search weaponry.
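For English text the same idea can be sketched with NLTK alone (pymorphy is aimed at Russian morphology). This assumes `nltk` is installed; the tokenization here is a simple regex rather than a full tokenizer, and the function name is my own:

```python
import re
from nltk.stem import PorterStemmer  # pymorphy's normal forms play this role for Russian

stemmer = PorterStemmer()

def contains(text, query):
    """True if any stemmed token of `text` matches the stemmed query,
    so 'cats' matches a search for 'cat', 'running' matches 'run', etc."""
    stems = {stemmer.stem(tok) for tok in re.findall(r"\w+", text.lower())}
    return stemmer.stem(query.lower()) in stems
```

The whole "index" lives in a throwaway set, so nothing is stored anywhere, which is exactly what the question asked for.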
I want to use the synonym tokenfilter in Elasticsearch for an index. I downloaded the Prolog version of WordNet 3.0, and found the wn_s.pl file that Elasticsearch can understand. However, it seems that the file contains synonyms for all sorts of words and phrases, while I am really only interested in supporting synonyms for nouns. Is there a way to extract those type of entries?
Given that the format of wn_s.pl is
s(112947045,1,'usance',n,1,0).
s(200001742,1,'breathe',v,1,25).
A very raw way of doing that would be to run the following in your terminal, keeping only the lines from that file that contain the string ',n,':
grep ",n," wn_s.pl > wn_s_nouns_only.pl
The file wn_s_nouns_only.pl will only have the entries that are marked as nouns.
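If you want to be stricter than a substring match (in principle `,n,` could also appear inside a quoted word), you can parse the part-of-speech field explicitly. A small sketch, assuming the `s(...)` entry format shown above, with apostrophes inside words doubled as in Prolog quoted atoms:

```python
import re

# Matches entries like s(112947045,1,'usance',n,1,0). and captures
# the part-of-speech field that follows the quoted word.
ENTRY = re.compile(r"^s\(\d+,\d+,'(?:[^']|'')*',(\w+),\d+,\d+\)\.")

def noun_lines(lines):
    """Yield only the wn_s.pl entries whose part-of-speech field is 'n'."""
    for line in lines:
        m = ENTRY.match(line)
        if m and m.group(1) == "n":
            yield line
```

Writing the yielded lines to `wn_s_nouns_only.pl` gives the same result as the grep, but keyed on the actual part-of-speech field.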