Difference between Gensim's FastText and Facebook's FastText

I came across the original implementation of FastText here, in which you can use fasttext.train_unsupervised to generate word vectors (see this link for an example). However, it turns out that gensim also supports FastText, and its API is similar to that of word2vec. See the example here.
Is there a difference between the two implementations? The documentation was not clear to me: do they both implement the paper Enriching Word Vectors with Subword Information? And if so, why would one use gensim's FastText over Facebook's fasttext?

Gensim intends to match the Facebook implementation, but with a few known or intentional differences. Specifically, Gensim doesn't implement:
the -supervised option, & specific-to-that-mode autotuning/quantization/pretrained-vectors options
word-multigrams (as controlled by the -wordNgrams parameter to fasttext)
the plain softmax option for loss-optimization
Regarding options to -loss, I'm relatively sure that despite Facebook's command-line options docs indicating that the fasttext default is softmax, it is actually ns except when in -supervised mode, just like word2vec.c & Gensim. See for example this source code.
I suspect a future contribution to Gensim that adds wordNgrams support would be welcome, if that mode is useful to some users, and to match the reference implementation.
So far the choice of Gensim has been to avoid any supervised algorithms, so the -supervised mode is less-likely to appear in any future Gensim. (I'd argue for it, though, if a working implementation was contributed.)
The plain softmax mode is so much slower on typical large output vocabularies that few non-academic projects would want to use it over hs or ns. (It may still be practical with a smaller number of output labels, as in -supervised mode, though.)
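For a concrete feel of how the two APIs compare in practice, here is a minimal sketch (corpus.txt and the toy sentences are placeholders, and the gensim parameter names follow the 4.x API, where older releases used size instead of vector_size):
# Facebook's reference implementation (pip install fasttext)
import fasttext
fb_model = fasttext.train_unsupervised('corpus.txt', model='skipgram', dim=100, minn=3, maxn=6)
print(fb_model.get_word_vector('example'))

# Gensim's implementation (pip install gensim)
from gensim.models import FastText
sentences = [['this', 'is', 'an', 'example'], ['and', 'another', 'sentence']]
gs_model = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1, min_n=3, max_n=6)
print(gs_model.wv['example'])  # subword n-grams also give vectors for out-of-vocabulary words
Note that Facebook's train_unsupervised reads a file from disk, while gensim also accepts an in-memory iterable of tokenized sentences.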

I found one difference, in gensim's documentation:
word_ngrams (int, optional) – In Facebook’s FastText, “max length of word ngram” - but gensim only supports the default of 1 (regular unigram word handling).
This means that gensim only supports unigrams, but no bigrams or trigrams.

Related

Algorithm for multiple extended string matching

I need to implement an algorithm for multiple extended string matching in text.
Extended means the presence of wildcards (any number of characters instead of a star), for example:
abc*def //matches abcdef, abcpppppdef etc.
Multiple means that the search is going on simultaneously for multiple string patterns (not a separate search for each pattern), for example:
abc*def
abc
whatever
some*string
QUESTION:
What is a fast algorithm that can do multiple extended string matching?
Preferably, optimized for SIMD instructions and multicore implementation. Open source implementation (C/C++/Python) would be great as well.
Thank you
I think it might make sense to start by reading the following section of the Wikipedia article on regular expressions: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times. You can then perform a literature review on algorithms implementing regular expression pattern matching.
In terms of practical implementation, there is a large variety of regular expression (regex) engines in the form of libraries, focused on one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend reading this paper, along with the two others linked within the article, for details on algorithms, implementation and benchmarks. Google has also released RE2/J, a linear-time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library, TRE, which offers too many cool features to list here; you can read about them all on this page.
P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.
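If you just need something working quickly in Python, one simple approach (sketched below with made-up patterns and text) is to translate each wildcard pattern into a regex, combine them all into a single alternation, and scan the text once; for guaranteed linear-time behaviour you would swap the built-in re module for bindings to one of the engines above, such as RE2.
import re

patterns = ['abc*def', 'abc', 'whatever', 'some*string']  # '*' = any number of characters

def wildcard_to_regex(pattern):
    # Escape the literal pieces and turn each '*' into a non-greedy '.*?'
    return '.*?'.join(re.escape(part) for part in pattern.split('*'))

# One named group per pattern, so we can tell which pattern matched.
combined = re.compile('|'.join('(?P<p%d>%s)' % (i, wildcard_to_regex(p))
                               for i, p in enumerate(patterns)))

text = 'xx abcpppppdef yy whatever some-long-string'
for match in combined.finditer(text):
    print(patterns[int(match.lastgroup[1:])], '->', match.group())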

Evaluating language identification methods

Part of my thesis work is to evaluate a number of language detection methods that are already available and then finally implement one of them.
For this I have chosen the following methods:
N-Gram-Based Text Categorization by Cavnar and Trenkle
Statistical Identification of Language by Ted Dunning
Using compression-based language models for text categorization by Teahan and Harper
Character Set Detection
A composite approach to language/encoding detection
I have to first evaluate the methods and preferably present a table with the accuracy of each of them. My question is: in order to find the accuracy of each of these methods, do I need to go ahead and build the language models using training data, then test them and record the accuracy, or is there another approach I can follow here? Though most of the original papers already include such accuracy tables, I am not sure whether it is acceptable for my thesis to simply take them and present them in the report.
Appreciate any thoughts on this.
I would also suggest asking your thesis advisor. Implementing all of them would be a lot of work, and it is very difficult to really compare them without being able to test them. If I remember correctly, the last three have not been well evaluated in the literature, so it would be difficult to compare their results. I have implemented (and evaluated) only the first one of those myself. One big question is also how large a part of your thesis this language-identification evaluation and implementation is meant to be.
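To give a feel for what implementation plus evaluation involves, here is a toy sketch of the first method on the list, Cavnar and Trenkle's rank-order ("out-of-place") n-gram profiles; the training snippets are made up and far too small for real use, so treat it only as an illustration of the setup:
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    # Character n-grams (length 1..n_max), kept as a rank-ordered profile.
    grams = Counter()
    text = ' ' + text.lower() + ' '
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_profile, lang_profile, max_penalty=1000):
    # Cavnar & Trenkle distance: how far each n-gram's rank is out of place.
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

# Tiny made-up training data; a real evaluation needs a proper corpus per language.
training = {'en': 'this is some english text used for training',
            'fr': 'ceci est un texte en francais pour entrainer le modele'}
profiles = {lang: ngram_profile(txt) for lang, txt in training.items()}

def identify(text):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

print(identify('another short english sentence'))  # should print 'en' with this toy data
Accuracy would then simply be the fraction of held-out test documents for which identify() returns the correct label, which is exactly the kind of table you would report.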

Sentiment Analysis of given text

This topic has many threads, but I am posting another one. Each of those posts may describe a way to do sentiment analysis, but I have not found one that works for me.
I want to implement sentiment analysis, so I would like someone to show me a way to do it. From my research, I gather that a Bayesian approach is commonly used: count positive and negative words and compute the probability of a sentence being positive or negative from a bag-of-words model.
That only covers individual words, though; I guess some language processing is needed as well. Is there anyone with more knowledge of this? If so, can you point me to some algorithms, with links for reference, so that I can implement them? Anything in particular that may help my analysis is welcome.
Also, can you recommend a language to work with? Some say Java is comparatively slow, so they don't recommend working with it.
Any type of help is much appreciated.
First of all, sentiment analysis is done on various levels, such as document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering and it's not limited to bag of words. You can find many other useful features in different applications from the tutorial I linked. What language processing you need to do depends on what features you want to use. You may need POS-tagging if POS information is needed for your features for example.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline) and these are frequently used in the literature, about which you can also find a pretty comprehensive list in the link. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
For rule-based methods, Pointwise Mutual Information is frequently used, and some kinds of scoring-based methods, etc.
Hope this helps.
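As a concrete starting point for the Naive Bayes baseline mentioned above, a minimal bag-of-words sketch with NLTK could look like the following (the training sentences are made up; a real experiment needs a labelled corpus and better feature engineering):
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    # Simplest possible features: presence of each lowercased token.
    return {word: True for word in text.lower().split()}

# Tiny made-up training set; replace with a real labelled corpus.
train = [('I love this movie it is great', 'pos'),
         ('what a wonderful brilliant film', 'pos'),
         ('I hate this movie it is terrible', 'neg'),
         ('what an awful boring film', 'neg')]

classifier = NaiveBayesClassifier.train([(bag_of_words(text), label) for text, label in train])

print(classifier.classify(bag_of_words('a brilliant and wonderful movie')))  # should print 'pos'
classifier.show_most_informative_features(5)
The same feature dictionaries can be converted and fed to the SVM and MaxEnt classifiers mentioned above.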
For text analysis there is no language stronger than SNOBOL. In SNOBOL4, for example, a Fortran interpreter takes only 60 lines.
NLTK offers really good algorithms for sentiment analysis. It is open source, so you can look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years but lately also started with Java; a lot of very popular open source software, such as Lucene, Solr, Hadoop and Neo4j, is written in Java.

Ruby Text Analysis

Is there any Ruby gem, or anything else, for text analysis? Word frequency, pattern detection and so forth (preferably with an understanding of French).
The generalization of word frequencies is language models, e.g. uni-grams (single-word frequencies), bi-grams (frequencies of word pairs), tri-grams (frequencies of word triples), and so on; in general: n-grams.
You should look for an existing toolkit for Language Models — not a good idea to re-invent the wheel here.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed, because you have to process huge corpora) and produce output in the standard ARPA n-gram file format (typically a text format).
Check the following thread, which contains more details and links:
Building openears compatible language model
Once you have generated your language model with one of these toolkits, you will need either a Ruby gem which makes the language model accessible in Ruby, or you will need to convert the ARPA format into your own format.
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info
Last but not least, check Google's online N-gram tool. They built n-grams from the books they digitized, also available in French and other languages!
The Mendicant Bug: NLP Resources for Ruby contains lots of useful Ruby NLP links.
I had tried using the Ruby Linguistics stuff a long time ago, and remember having a lot of problems with it... I don't recommend jumping into that.
If most of your text analysis involves stuff like counting ngrams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt stuff to the idiosyncrasies of the problem you are trying to solve.
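To illustrate how little code the do-it-yourself route takes, here is a sketch of word-frequency and bigram counting (shown in Python for brevity; the same few lines translate almost directly to Ruby using String#scan, each_cons and a Hash with a default of 0):
import re
from collections import Counter

text = "le chat noir et le chien noir"  # made-up French sample

words = re.findall(r"\w+", text.lower())  # tokenize
unigrams = Counter(words)                 # word frequencies
bigrams = Counter(zip(words, words[1:]))  # frequencies of word pairs

print(unigrams.most_common(3))  # e.g. [('le', 2), ('noir', 2), ('chat', 1)]
print(bigrams.most_common(2))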
As with the Stanford parser gem, it's possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so it's probably not the best way to solve the problem.
I wrote the gem words_counted for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented and can be found in the readme on Github.

Is there an implementation of a Croatian word stemming algorithm?

I'm searching for an implementation of a Croatian word stemming algorithm, ideally in Java, but I would also accept any other language.
Is there a community of English-speaking developers somewhere who are developing search applications for the Croatian language?
Thanks,
Slavic languages are highly inflective. The most accurate and fast approach would be a combination of rules and large mappings/dictionaries.
Work has been done, but it has been held back. The Croatian morphological lexicon will help, but it's behind a slow API. More work can be found between Bosnian, Serbian and Croatian, than just Croatian alone.
Large mappings aren't always convenient (and one could effectively build a better rule transformer from the mapping/dictionaries/corpus).
Implementing this with Hunspell and affix files could be a great way to get community and Java support. E.g., Google search: hr_hr.aff
Not tested: one should be able to reverse all the words, build a trie of the ending characters, traverse it using some rules (e.g. LCS) and build an accurate statistical transformer using corpus text.
The best I can do is some Python:
import hunspell

# Load the Croatian dictionary and affix files (paths depend on your system).
hs = hunspell.HunSpell('/usr/share/myspell/hr_HR.dic',
                       '/usr/share/myspell/hr_HR.aff')

# The following should return ['hrvatska']:
print(hs.stem('hrvatski'))
Here you can find a recent implementation done at FFZG in Python: a stemmer for Croatian.
We performed basic evaluation of the stemmer on a lemmatized newspaper corpus as gold standard with a precision of 0.986 and recall of 0.961 (F1 0.973) for adjectives and nouns. On all parts of speech we obtained precision of 0.98 and recall of 0.92 (F1 0.947).
It is released under a GNU licence, but feel free to contact the author for further help (I only know the original author, Nikola, but not his student).
