Any way to combine 2 n-gram language models into 1?

I have 2 n-gram language models (model_A and model_B).
They were trained on different corpora, so their vocabularies differ.
They are smoothed with backoff and stored in ARPA format, so I have 2 ARPA files, ARPA_A and ARPA_B.
Now I want to interpolate them, i.e. for any phrase ABC:
model_C(ABC) = 0.5 * model_A(ABC) + 0.5 * model_B(ABC)
How can I merge ARPA_A and ARPA_B into one file (ARPA_C)?

Yes, it is possible to combine two n-gram language models, assuming you are using the OpenFst-based OpenGrm NGram library:
ngrammerge --use_smoothing --normalize --alpha=3 --beta=2 earnest.aa.mod earnest.ab.mod >earnest.merged.mod
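Since your models are in ARPA text format, you would first convert them to OpenGrm's FST model format and convert the result back afterwards. A hedged sketch of the full round trip using the OpenGrm NGram command-line tools (file names are placeholders; check the exact mixture-weight semantics of --alpha/--beta against your installed version before relying on them for a 0.5/0.5 interpolation):

```shell
# convert the ARPA files to OpenGrm n-gram FST models
ngramread --ARPA ARPA_A > A.mod
ngramread --ARPA ARPA_B > B.mod

# merge the two models with equal mixture weights
ngrammerge --normalize --alpha=1 --beta=1 A.mod B.mod > C.mod

# write the merged model back out as an ARPA file
ngramprint --ARPA C.mod > ARPA_C
```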

Related

'Duplicate' NGram values in topic list created using bertopic

I've set the CountVectorizer to examine unigrams through trigrams (ngram_range=(1, 3)). This seems very useful. However, I'm seeing "duplicate" terms, e.g. the terms "justice", "India", "gate", and "along" appear to overlap significantly across topic keywords. I'm using these vocabularies in a broad search across all of the terms to pick target documents for further processing, and it appears that one phrase is "pushing out" other terms that could otherwise surface, so I'm not sure what I'm "missing". Am I thinking about this correctly? Would it be a "good thing" here if "india gate" and "justice khanna" were combined into single terms?
Also, how can I combine these into a single term in BERTopic so that these overlaps don't occur?
In BERTopic, there is the diversity parameter that allows you to fine-tune the topic representations. The underlying algorithm for this is called MaximalMarginalRelevance. It is a value between 0 and 1 that indicates how diverse keywords in a single topic should be compared to one another. A value of 1 indicates high diversity and 0 indicates little diversity. It works as follows:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Get documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train BERTopic and apply MMR
topic_model = BERTopic(diversity=0.4)
topics, probs = topic_model.fit_transform(docs)
Do note that in the upcoming version the diversity parameter will be removed and replaced as follows:
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Extracting data from text file in AMPL without adding indexes

I'm new to AMPL and I have data in a text file in matrix form from which I need to use certain values. However, I don't know how to use the matrices directly without having to manually add column and row indexes to them. Is there a way around this?
So the data I need to use looks something like this, with hundreds of rows and columns (and several more matrices like this), and I would like to use it as a parameter with index i for rows and j for columns.
t=1
0.0 40.95 40.36 38.14 44.87 29.7 26.85 28.61 29.73 39.15 41.49 32.37 33.13 59.63 38.72 42.34 40.59 33.77 44.69 38.14 33.45 47.27 38.93 56.43 44.74 35.38 58.27 31.57 55.76 35.83 51.01 59.29 39.11 30.91 58.24 52.83 42.65 32.25 41.13 41.88 46.94 30.72 46.69 55.5 45.15 42.28 47.86 54.6 42.25 48.57 32.83 37.52 58.18 46.27 43.98 33.43 39.41 34.0 57.23 32.98 33.4 47.8 40.36 53.84 51.66 47.76 30.95 50.34 ...
I'm not aware of an easy way to do this. The closest thing is probably the table format given in section 9.3 of the AMPL Book. This avoids needing to give indices for every term individually, but it still requires explicitly stating row and column indices.
AMPL doesn't seem to do a lot with position-based input formats, probably because it defaults to treating index sets as unordered so the concept of "first row" etc. isn't meaningful.
If you really wanted to do it within AMPL, you could probably put together a work-around along these lines:
declare a single-index param with length equal to the total size of your matrix (e.g. if your matrix is 10 x 100, this param has length 1000)
edit the beginning and end of your "matrix" data file to turn it into appropriate format for a single-index parameter indexed from 1 to n
then define your matrix something like this:
param m{i in 1..nrows, j in 1..ncols} := x[(i-1)*ncols + j];
(not tested; check that rows and columns are the right way around for your data!)
But you're probably better off editing the input file into one of the standard AMPL matrix formats. AMPL isn't really designed for data wrangling - you can do it in a pinch but if you're doing this kind of thing repeatedly it may be less trouble to code it in a general-purpose language e.g. Python.
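As the last paragraph suggests, that conversion is only a few lines in a general-purpose language. A rough sketch in Python that turns a whitespace-separated matrix into AMPL's table format (the function name and sample rows are made up for illustration):

```python
def matrix_to_ampl_param(lines, name="m"):
    """Turn rows of whitespace-separated numbers into an AMPL
    table-format param block with 1-based row/column indices."""
    rows = [line.split() for line in lines if line.strip()]
    ncols = len(rows[0])
    # header row lists the column indices: "param m : 1 2 3 :="
    header = f"param {name} : " + " ".join(str(j) for j in range(1, ncols + 1)) + " :="
    # each data row is prefixed with its row index
    body = [f"{i} " + " ".join(row) for i, row in enumerate(rows, start=1)]
    return "\n".join([header] + body) + ";"

print(matrix_to_ampl_param(["0.0 40.95 40.36", "38.14 44.87 29.7"]))
# param m : 1 2 3 :=
# 1 0.0 40.95 40.36
# 2 38.14 44.87 29.7;
```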

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tfidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which tokens must appear, whereas no_above expects a maximum relative frequency, e.g. 0.5. Afterwards you can then construct your corpus with the filtered dictionary. According to the gensim docs it is also possible to construct a TfidfModel with only a dictionary.

SSRS 2008: Using StDevP from multiple fields / Combining multiple fields in general

I'd like to calculate the standard deviation over two fields from the same dataset.
example:
MyFields1 = 10, 10
MyFields2 = 20
What I want is the standard deviation of (10, 10, 20); the expected result is ~4.7.
In SSRS I'd like to have something like this:
=StDevP(Fields!MyField1.Value + Fields!MyField2.Value)
Unfortunately this isn't possible, since (Fields!MyField1.Value + Fields!MyField2.Value) returns a single value and not a list of values. Is there no way to combine two fields from the same dataset into some kind of temporary dataset?
The only solutions I have are:
Create a new dataset that contains all values from both fields. But this is very annoying because I need about twenty of those, and I have six report parameters that need to filter every query, so it would probably get slow and hard to maintain.
Write the formula by hand. But I don't really know how yet; StDevP is not that trivial to me. This is how I did it with Avg, which is mathematically simpler:
=(SUM(Fields!MyField1.Value)+SUM(Fields!MyField2.Value))/2
found here: http://social.msdn.microsoft.com/Forums/is/sqlreportingservices/thread/7ff43716-2529-4240-a84d-42ada929020e
Btw. I know that it's odd to make such a calculation, but this is what my customer wants and I have to deliver somehow.
Thanks for any help.
StDevP is the population standard deviation.
An expression such as
=StDevP(Fields!MyField1.Value + Fields!MyField2.Value)
evaluates fine for me, but it is the deviation of the single value (Fields!MyField1.Value + Fields!MyField2.Value), which is always 0.
You can look here for the formula:
standard deviation (wiki)
I believe you need to calculate this over some group (or the full dataset); to do that, set the scope in your StDevP:
=StDevP(Fields!MyField1.Value + Fields!MyField2.Value, "MyDataSet1")
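Note that the scoped expression above gives the deviation of the row-wise sums, not of all values pooled together. If the pooled deviation over both columns is what's needed, the hand-written formula the asker mentions can be built purely from sums, sums of squares, and counts, which are all available as SSRS aggregates. A quick arithmetic check of that approach on the example values, sketched in Python rather than as an SSRS expression:

```python
from math import sqrt

# pooled population standard deviation over both columns,
# using only sums, sums of squares, and counts -- the kind
# of aggregates an SSRS expression can compute
field1 = [10, 10]
field2 = [20]

n = len(field1) + len(field2)
total = sum(field1) + sum(field2)
total_sq = sum(x * x for x in field1) + sum(x * x for x in field2)

# StDevP = sqrt(E[x^2] - E[x]^2)
stdevp = sqrt(total_sq / n - (total / n) ** 2)
print(round(stdevp, 2))  # 4.71, matching the expected ~4.7
```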

Find item by arbitrary query

Problem: I have an item in the database called "AABGng-LS 4х4 0.66kV". AABG is the vendor, ng-LS is the type, 4х4 is the cable cross-section, and 0.66 kV is the voltage. Users must be able to find this item with queries such as:
AABG ng LS 4х4 660 V
AABGng-LS-660 4х4
AABG ng-LS 0.66 4*4
How can this be solved (algorithm)? I prefer Ruby, but an algorithm in any language can be suggested.
The problem you are describing is that of a search index. Building one yourself involves a lot of steps to get working: normalizing, stemming, matching, etc.
I would advise you to have a look at Lucene-based search indexes such as Elasticsearch or Solr.
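If you do want to roll a simple version yourself, a rough sketch of the normalize-and-match idea in Python (the token rules, the kV-to-V rewrite, and the scoring threshold are all ad-hoc assumptions for this example, not a general solution):

```python
import re

def normalize(s):
    """Lowercase, unify separators, and rewrite kilovolts as volts."""
    # Cyrillic 'х' and '*' are both used as the dimension separator
    s = s.lower().replace("х", "x").replace("*", "x")
    # unit normalization so "0.66kV" and "660 V" agree
    s = re.sub(r"(\d+(?:\.\d+)?)\s*kv\b",
               lambda m: str(int(float(m.group(1)) * 1000)) + "v", s)
    # split into letter runs and numbers
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", s)

def score(query, item):
    """Fraction of query tokens found inside the collapsed item string."""
    item_joined = "".join(normalize(item))
    q = normalize(query)
    return sum(1 for t in q if t in item_joined) / len(q)

item = "AABGng-LS 4х4 0.66kV"
for q in ["AABG ng LS 4х4 660 V", "AABGng-LS-660 4х4", "AABG ng-LS 0.66 4*4"]:
    print(q, round(score(q, item), 2))
```

All three example queries score high against the item (the third loses one token because a bare "0.66" without a unit is not rewritten), so ranking items by this score surfaces the right one. A real search index does this far more robustly.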
