Spam filter quality using Python

Create a function compute_quality_for_corpus(corpus_dir) that evaluates the filter quality based on the information contained in the files !truth.txt and !prediction.txt in the given corpus directory.
Any suggestions?
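A minimal sketch of one way to do it, assuming each of !truth.txt and !prediction.txt holds one "<filename> <label>" pair per line (e.g. "0001 SPAM") and that plain accuracy is an acceptable quality measure; substitute whatever scoring formula your assignment actually specifies:

import os

def read_classification(path):
    # Assumed format: one "<filename> <label>" pair per line.
    result = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            name, label = line.split()
            result[name] = label
    return result

def compute_quality_for_corpus(corpus_dir):
    truth = read_classification(os.path.join(corpus_dir, '!truth.txt'))
    prediction = read_classification(os.path.join(corpus_dir, '!prediction.txt'))
    # Plain accuracy: fraction of files whose predicted label matches the truth.
    correct = sum(1 for name, label in truth.items() if prediction.get(name) == label)
    return correct / len(truth)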

Related

When is it appropriate to use tensor-based DBs (e.g. Marqo) instead of keyword-based ones (e.g. Elasticsearch)?

I'm implementing end-user search for an e-commerce website. The catalogue contains images, text, and prices of different items. LLMs are all the hype at the moment, but I'm not sure how well proven their performance is compared to keyword-based search for e-commerce.
I've tried tensor-based search and it appears to perform well, but it's hard to benchmark search against relevance, so I'm not sure about putting it into production.
What frameworks are people using to determine when to use tensor/vector-based search vs keyword-based search?

Can you provide additional tags for documents using TaggedLineDocument?

When training a doc2vec model on a corpus of TaggedDocument objects, you can provide a list of tags for each document. When the doc2vec model is trained, it learns a vector representation for each tag. For example, you could have one tag representing the document and another representing some classification that can be shared between documents.
How would one provide additional tags when streaming a corpus using TaggedLineDocument?
The TaggedLineDocument class only considers documents to be one per line, with a single tag that is their line-number.
If you want more tags, you'll have to provide your own iterable which does that. It should only be a few lines of code, depending on where your other tags come from; a rough sketch follows after the note below. You can use the source for TaggedLineDocument (which is itself only 9 lines of Python code) as a model to build on:
https://github.com/RaRe-Technologies/gensim/blob/e4199cb4e9a90df44ca59c1d0505b138caa21951/gensim/models/doc2vec.py#L1126
Note: while supplying more than one tag per document is a natural extension of the original 'Paragraph Vectors' approach, and often can provide benefits, sometimes it also 'dilutes' the salience of each tag's vector – which becomes a special concern as the average number of tags per document grows, or the model acquires many more tags than unique documents. So be sure to comparatively evaluate whether any multiple-tag strategy is helping or hurting in your different use cases, and whether things like pre-known categories work better as extra tags or as known labels for some later classification step.
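As a concrete illustration, here is a minimal sketch of such an iterable, modelled on the TaggedLineDocument source linked above. The class name and the labels argument are made up for this example; it assumes you have one extra tag (e.g. a category) per line of the source file:

from gensim.models.doc2vec import TaggedDocument

class TaggedLineDocumentWithLabels:
    def __init__(self, path, labels):
        self.path = path        # path to a file with one document per line
        self.labels = labels    # one extra tag (e.g. a category) per line

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for line_no, line in enumerate(fin):
                words = line.split()
                # first tag: unique per-document id; second tag: shared category
                yield TaggedDocument(words, [line_no, self.labels[line_no]])

You would then pass an instance of this class to Doc2Vec in place of a TaggedLineDocument.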

Google Cloud Natural Language API: adding your own classifier

I have been searching for how to create a new entity in the Google Natural Language API, and found nothing. Can anybody help with how to create a new classifier such that if I pass a sentence I can detect, say, 'python' as a programming language? Currently the API is classifying 'python' as 'other'.
I have also looked into the Cloud AutoML API for my solution and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It was giving me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need to do is closer to content classification than entity detection. My understanding from the description you provided is that you have content (maybe words, phrases, or short sentences) and you want to classify it into some labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this; an illustration follows below.
The number it provides in eval is not a sentiment score; it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that 'java' is a programmingLanguage with a probability of 1 (so it's very certain about it).
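To make the "good training set" idea concrete: a classification training set is essentially rows of example text paired with the label you want predicted. The rows below are only an illustration of that shape, with made-up content and labels; check the AutoML documentation for the exact CSV layout (column order, optional train/validation/test column) your version expects:

"learn python in 30 days",programmingLanguage
"java streams tutorial",programmingLanguage
"golden retriever training tips",other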

Latent Semantic Indexing with gensim

In order to use the latent semantic indexing method from gensim, I want to begin with a small "classic" example like:
import logging, gensim, bz2

# load the word-id/word dictionary produced by the wiki preprocessing step
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
# load the TF-IDF corpus stored in Matrix Market format
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# train an LSI model with 400 topics on that corpus
lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
etc.
My question is: how do I get the corpus iterator 'wiki_en_tfidf.mm'? Must I download it from somewhere? I have searched the Internet but did not find anything. Help, please?
The first page of search results includes a link to:
https://radimrehurek.com/gensim/wiki.html
which says "First let’s load the corpus iterator and dictionary, created in the second step above."
Step 2 is:
"Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don't even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:"
$ python -m gensim.scripts.make_wiki
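For reference, the script is run against a downloaded Wikipedia dump and an output prefix of your choosing. The dump file name and prefix below are just placeholders, and the exact output file names can vary between gensim versions, so check the script's own usage message:

$ python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki_en

After it finishes (which can take several hours for the full English Wikipedia), it leaves files with the given prefix on disk (e.g. wiki_en_wordids.txt and wiki_en_tfidf.mm, possibly bz2-compressed depending on your gensim version), which are exactly what the snippet above loads.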

Defining new language grammar rules?

Can you help me with how I could edit the .tagger file using Stanford NLP? I have a problem here: I can't open and edit the file to define the grammar rules for a new language to generate part-of-speech tags.
The .tagger files are serialized statistical models used by a maximum-entropy-based sequence tagger. You can't edit them in any meaningful way.
If you want to create part-of-speech tags for a new language, you will have to create training data consisting of a large set of sentences in that language, with the correct part-of-speech tag assigned to each word, and then train a new part-of-speech tagging model, roughly as sketched below.
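As a rough sketch of what that looks like with the Stanford tagger: the training data is plain text with each token joined to its tag by a separator, and training is driven by a properties file. The file names below are placeholders, and the flags reflect the tagger's documentation at the time of writing, so double-check them against your version.

Training data (one sentence per line, token_TAG pairs):
The_DT dog_NN barks_VBZ ._.

Generate a template properties file, then edit it (at minimum set model, trainFile, and tagSeparator to match your data) and train:

$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops > myLanguage.props
$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myLanguage.props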

Resources