Vowpal Wabbit - n-grams in a selected namespace

I have a sample:
text1_Namespace1: text
text2_Namespace2: text2
I want to generate new n-gram features only from text1 in Namespace1, without creating extra interactions for text2.
Can VW selectively generate n-grams for a particular namespace (Namespace1)?

As vw -h says, you can generate n-grams for a single namespace 'foo' using --ngram fN (e.g. --ngram f2 for bigrams, --ngram f3 for trigrams etc).
Note that in VW, only the first character of a namespace name is significant for the purpose of namespace interactions and generating ngrams. The general advice is to use either one-character namespace names or make sure that each namespace starts with a different character.

It works! Even a setup like this is accepted:
vw -d test.data --loss_function logistic --skips b2 --ngram b2 --ngram g2 --skips g1
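For reference, a minimal sketch of what this looks like in practice (the data and namespace names below are made up). Given input where the two namespaces start with different characters, e.g.

1 |first this is the text to expand
0 |second other text that should stay as unigrams

running

vw -d sample.data --ngram f2

generates bigram features only inside namespaces whose names begin with 'f'; the 'second' namespace keeps plain unigrams.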

Related

How to create boxed text without title in Sphinx?

There are several directives in Sphinx to create a boxed text, e.g. .. note:: or .. topic::, but they all include a title. Is it possible to create a boxed text without a title?
You can use the generic admonition directive with an escaped whitespace as its title:
.. admonition:: \ \
You can use a `Generic Admonition`_
.. hint::
| Generic Admonition still requires a title, but an escaped whitespace can be used!
|
| This can be convenient, because the box uses a style consistent with other admonitions.
|
| And might spare you from having to fiddle with CSS while Rome burns :)
.. _Generic Admonition: https://docutils.sourceforge.io/docs/ref/rst/directives.html#generic-admonition
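For reference, this is what the escaped-whitespace trick from the answers above can look like in a source file (the body text is just a placeholder):

.. admonition:: \ \

   This text is rendered in an admonition box without a visible title.

The box then picks up the usual admonition styling of your theme, consistent with the other admonitions.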

How can I expand stanford coreNLP spanish model/dictionary

I just ran a "hello world" using Stanford CoreNLP to get named entities from text, but some places are not recognized properly. "Ixhuatlancillo" and "Veracruz", both cities that should be labeled as LUG (place), are labeled as ORG.
I would like to expand the Spanish model or dictionary to add places (cities) from México and to add person names. How can I do this?
Thanks in advance.
The fastest and easiest way would be to use the regexner annotator. You can use this to manually build a dictionary.
Here is an example rule format (columns separated by tabs; the first column can be any number of words):
system administrator TITLE MISC 2
token sequence tag tags-that-can-be-overwritten priority
The rule above would mark "system administrator" in text as TITLE.
For your case:
Veracruz LUG MISC,ORG,PERS 2
This allows the dictionary to overwrite MISC, ORG, and PERS. Without the extra tags in the third column, it won't overwrite previously assigned NER tags.
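For example, a small mapping file covering both place names from the question might look like this (columns separated by tabs; the file name new_spanish.rules matches the command below):

Ixhuatlancillo	LUG	MISC,ORG,PERS	2
Veracruz	LUG	MISC,ORG,PERS	2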
You might use a command like this to run it:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -props StanfordCoreNLP-spanish.properties -regexner.mapping /path/to/new_spanish.rules -regexner.ignorecase -regexner.validpospattern "^(NN|JJ|NNP).*" -outputFormat text -file sample-text.txt
Note that -regexner.ignorecase makes the matching case-insensitive, and -regexner.validpospattern says that only token sequences with the specified POS tag pattern should be matched.
All of this being said, I just ran on the sentence:
Ella fue a Veracruz.
and it tagged it properly. Could you let me know what sentence you ran on that caused an incorrect tag for Veracruz?

Elasticsearch custom tokenizer: don't split time on ":"

For example, I have a log line like this:
11:22:33 user:abc&game:cde
If I use the standard tokenizer, this log will be split into:
11 22 33 user abc game cde
But 11:22:33 is a time and I don't want to split it; I want a custom tokenizer that splits the line into:
11:22:33 user abc game cde
So how should I set up the tokenizer?
You can use the pattern tokenizer in order to achieve that.
A tokenizer of type pattern that can flexibly separate text into terms via a regular expression
Read more here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
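As a sketch only (the index name, analyzer name, and the exact regex here are assumptions, not taken from the docs), you could define a pattern tokenizer that splits on whitespace, "&", and any ":" that is not surrounded by digits, so times like 11:22:33 stay intact:

PUT my-logs
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "log_tokenizer": {
          "type": "pattern",
          "pattern": "[\\s&]+|(?<!\\d):(?!\\d)"
        }
      },
      "analyzer": {
        "log_analyzer": {
          "type": "custom",
          "tokenizer": "log_tokenizer"
        }
      }
    }
  }
}

With such an analyzer, "11:22:33 user:abc&game:cde" comes out as the tokens 11:22:33, user, abc, game, cde.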

Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that uses spaces only to delimit the items the system expects. For instance,
/a/b/c sanferro 2
/d/e/f ginger 2
However, I run into errors when trying forms such as:
/a/b/c san ferro 2
Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output.
How can I encode spaces? I've tried enclosing them in double quotes, but that doesn't work.
Typically you use CoNLL style data to train a CRF. Here is an example:
-DOCSTART- O
John PERSON
Smith PERSON
went O
to O
France LOCATION
. O

Jane PERSON
Smith PERSON
went O
to O
Hawaii LOCATION
. O
A "\t" character separates the tokens and the tags. You put a blank space in between the sentences. You use the special symbol "-DOCSTART-" to indicate where a new document starts. Typically you provide a large set of sentences. This is the case when you are training a CRF.
If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml
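For completeness, a rough sketch of training on such a tab-separated file (the property names come from the CRF FAQ linked above; the file names are placeholders, and many more feature options exist):

# ner.prop
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC

which you would then run with something like:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.prop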

Given a large list of URLs, what's the best data-mining method for grouping the URLs together into patterns or regexes?

I've got a list of 1 million URLs and I'd like to cluster similar URLs together. The output of the process would be a list of regular expressions or patterns. Ideally I'd like to use Ruby to derive the data. My initial thoughts flow toward using a Machine Learning classifier, but I'm not sure where to start or what data mining technique to use.
Possible example:
Input:
http://www.example.com/folder-A/file.html
http://www.example.com/folder-A/dude.html
http://www.example.com/folder-B/huh.html
http://www.example.com/folder-C/what-ever.html
Output:
http://www\.example\.com/folder-A/[a-z]+\.html
http://www\.example\.com/folder-[A-C]/[-a-z]+\.html
This program:
#!/usr/bin/env perl
use strict;
use warnings;

# The following is a CPAN module requiring independent installation:
use Regexp::Assemble;

my @url_list = qw(
    http://www.example.com/folder-A/file.html
    http://www.example.com/folder-A/dude.html
    http://www.example.com/folder-B/huh.html
    http://www.example.com/folder-C/what-ever.html
);

my $asm = Regexp::Assemble->new;
for my $url (@url_list) {
    $asm->add($url);
}

# Strip the "(?-xism:" wrapper and trailing ")" that Regexp::Assemble
# puts around the assembled pattern, leaving just the bare regex.
my $pat = $asm->re;
for ($pat) {
    s/^.*?://;
    s/\)$//;
}
print "$pat\n";
when run, duly prints out:
http://www.example.com/folder-(?:A/(?:dud|fil)e|C/what-ever|B/huh).html
Is that what you were looking for?
You can use this automaton library (http://www.brics.dk/automaton/) to build the union ("or") of several strings and then minimize the automaton; that way you end up with a single generalized regular expression.
A simpler solution is to use prefix optimization to factor out the common first part; for that, look at this example: http://code.google.com/p/graph-expression/wiki/RegexpOptimization.
Unfortunately all of this is written for Java, but of course the generated regexp can be used in any regular expression engine.
If you are asking how you should parse a URL with a regular expression then take a look at the IETF's RFC 2396.
RFC 2396, URI Generic Syntax, August 1998
B. Parsing a URI Reference with a Regular Expression
As described in Section 4.3, the generic URI syntax is not
sufficient to disambiguate the components of some forms of URI.
Since the "greedy algorithm" described in that section is identical
to the disambiguation method used by POSIX regular expressions, it
is natural and commonplace to use a regular expression for parsing
the potential four components and fragment identifier of a URI
reference.
The following line is the regular expression for breaking-down a
URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
12 3 4 5 6 7 8 9
The numbers in the second line above are only to assist
readability; they indicate the reference points for each
subexpression (i.e., each paired parenthesis). We refer to the
value matched for subexpression <n> as $<n>. For example, matching
the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as
is the case for the query component in the above example.
Therefore, we can determine the value of the four components and
fragment as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
and, going in the opposite direction, we can recreate a URI
reference from its components using the algorithm in step 7 of
Section 5.2.
From there you should be able to compare the fragments of the URL and identify patterns.
Your question is a bit vague, but it sounds like something you could do with a map/reduce type setup. Partition your data into smaller chunks, group each chunk by "root" (whatever you mean by that; I assume "authority" or maybe "scheme" + "authority"), and then merge the groups in the reduce stage.
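For the grouping step, here is a minimal sketch (in Perl, to match the program earlier in this thread; the same idea ports directly to Ruby). It buckets URLs read from standard input by scheme + authority, using the scheme and authority pieces of the RFC 2396 regex quoted above; the merge/reduce stage and per-bucket pattern generation are left out:

#!/usr/bin/env perl
use strict;
use warnings;

# Bucket URLs by their "root" (scheme + authority). Each bucket could then
# be fed to something like Regexp::Assemble, as in the answer above.
my %groups;
while (my $url = <STDIN>) {
    chomp $url;
    # Scheme and authority sub-patterns of the RFC 2396 regex.
    my ($scheme, $authority) = $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?};
    my $key = ($scheme // '?') . '://' . ($authority // '?');
    push @{ $groups{$key} }, $url;
}

for my $key (sort keys %groups) {
    printf "%s  (%d URLs)\n", $key, scalar @{ $groups{$key} };
}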
