Ignore words for lemmatizer

Ignore words for lemmatizer - stanford-nlp

I would like to use Stanford CoreNLP for lemmatization but I have some words not to be lemmatized. Is there a way to provide this ignore list to the tool? I am following this code, and when the program calls this.pipeline.annotate(document);then, that's it; it would be hard to replace the occurrences. One solution is that create a mapping list in which each word to be ignored is paired with lemmatize(word) (i.e., d = {(w1, lemmatize(w1)), (w2, lemmatize(w2), ...} and do the post processing with this mapping list. But it should be easier than this, I guess.
Thanks for the help.

I think I found the solution with my friend's help.
for(CoreMap sentence: sentences) {
// Iterate over all tokens in a sentence
for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
System.out.print(token.get(OriginalTextAnnotation.class) + "\t");
System.out.println(token.get(LemmaAnnotation.class));
}
}
You can get original form of the word by calling token.get(OriginalTextAnnotation.class).

Related

django-haystack with Elasticsearch: how can I get fuzzy (word similarity) search to work?

I'm using elasticsearch==2.4.1 and django-haystack==3.0 with Django==2.2 using an Elasticsearch instance version 2.3 on AWS.
I'm trying to implement a "Did you mean...?" using a similarity search.
I have this model:
class EquipmentBrand(SafeDeleteModel):
name = models.CharField(
max_length=128,
null=False,
blank=False,
unique=True,
)
The following index:
class EquipmentBrandIndex(SearchIndex, Indexable):
text = fields.EdgeNgramField(document=True, model_attr="name")
def index_queryset(self, using=None):
return self.get_model().objects.all()
def get_model(self):
return EquipmentBrand
And I'm searching like this:
results = SearchQuerySet().models(EquipmentBrand).filter(content=AutoQuery(q))
When name is "Example brand", these are my actual results:
q='Example brand" -> Found
q='bra" -> Found
q='xam' -> Found
q='Exmple' -> *NOT FOUND*
I'm trying to get the last example to work, i.e. finding the item if the word is similar.
My goal is to suggest items from the database in case of typos.
What am I missing to make this work?
Thanks!

I don't think you want to be using EdgeNgramField. "Edge" n-grams, from the Elasticsearch Docs:
emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
It's intended for autocomplete. It only matches string that are prefixes of the target. So, when the target document include "example", searches that work would be "e", "ex", "exa", "exam", ...
"Exmple" is not one of those strings. Try using plain NgramField.
Also, please consider upgrading. So much has been fixed and improved since ES 2.4.1

Is it possible to sort with gsub (anonymous function) like ruby in Scala?

Im new in Scala. I need to know if is possible do something like this in Scala:
input2.lines.sort_by { |l| l.gsub(/.*?\+(.*?)\+(.*)\n/,"\\2\n").to_i }
Please help

It looks like you're trying to sort strings by a sub-section within each string. To do that you first need a regex with a capture group to select the region you're interested in.
val re = ".*\\+.*\\+(\\d+)".r
Now you can extract and modify what was captured and use the result as the sorting rule.
lines.sortBy{case re(n) => n.toInt}

Jackrabbit XPath Query: UUID with leading number in path

I have what I think is an interesting problem executing queries in Jackrabbit when a node in the query path is a UUID that start with a number.
For example, this query work fine as the second node starts with a letter, 'f':
/*/JCP/feeadeaf-1dae-427f-bf4e-842b07965a93/label//*[#sequence]
This query however does not, if the first 'f' is replaced with '2':
/*/JCP/2eeadeaf-1dae-427f-bf4e-842b07965a93/label//*[#sequence]
The exception:
Encountered "-" at line 1, column 26.
Was expecting one of:
<IntegerLiteral> ...
<DecimalLiteral> ...
<DoubleLiteral> ...
<StringLiteral> ...
... rest omitted for brevity ...
for statement: for $v in /*/JCP/2eeadeaf-1dae-427f-bf4e-842b07965a93/label//*[#sequence] return $v
My code in general
def queryString = queryFor path
def queryManager = session.workspace.queryManager
def query = queryManager.createQuery queryString, Query.XPATH // fails here
query.execute().nodes
I'm aware my query, with the leading asterisk, may not be the best, but I'm just starting out with querying in general. Maybe using another language other than XPATH might work.
I tried the advice in this post, adding a save before creating the query, but no luck
Jackrabbit Running Queries against UUID
Thanks in advance for any input!

A solution that worked was to try and properly escape parts of the query path, namely the individual steps used to build up the path into the repository. The exception message was somewhat misleading, at least to me, as in made me think that the hyphens were part of the root cause. The root problem was that the leading number in the node name created an illegal XPATH query as suggested above.
A solution in this case is to encode the individual steps into the path and build the rest of the query. Resulting in the leading number only being escaped:
/*/JCP/_x0032_eeadeaf-1dae-427f-bf4e-842b07965a93//*[#sequence]
Code that represents a list of steps or a path into the Jackrabbit repository:
import org.apache.commons.lang3.StringUtils;
import org.apache.jackrabbit.util.ISO9075;
class Path {
List<String> steps; //...
public String asQuery() {
return steps.size() > 0 ? "/*" + asPathString(encodedSteps()) + "//*" : "//*";
}
private String asPathString(List<String> steps) {
return '/' + StringUtils.join(steps, '/');
}
private List<String> encodedSteps() {
List<String> encodedSteps = new ArrayList<>();
for (String step : steps) {
encodedSteps.add(ISO9075.encode(step));
}
return encodedSteps;
}
}
Some more notes:
If we escape more of the query string as in:
/_x002a_/JCP/_x0032_eeadeaf-1dae-427f-bf4e-842b07965a93//_x002a_[#sequence]
Or the original path encoded as a whole as in:
_x002f_a_x002f_fffe4dcf0-360c-11e4-ad80-14feb59d0ab5_x002f_2cbae0dc-35e2-11e4-b5d6-14feb59d0ab5_x002f_c
The queries do not produce the wanted results.
Thanks to #matthias_h and #LarsH

An XML element name cannot start with a digit. See the XML spec's rules for STag, Name, and NameStartChar. Therefore, the "XPath expression"
/*/JCP/2eeadeaf-1dae-427f-bf4e-842b07965a93/label//*[#sequence]
is illegal, because the name test 2eead... isn't a legal XML name.
As such, you can't just use any old UUID as an XML element name nor as a name test in XPath. However if you put a legal NameStartChar on the front (such as _), you can probably use any UUID.
I'm not clear on whether you think you already have XML data with an element named <2eead...> (and are trying to query that element's descendants); if so, whatever tool produced it is broken, as it emits illegal XML. On the other hand if the <2eead...> is something that you yourself are creating, then presumably you have the option of modifying the element name to be a legal XML name.

Selenium Webdriver + Ruby regex: Can I use regex with find_element?

I am trying to click an element that changes per each order like so
edit_div_123
edit_div_124
edit_div_xxx
xxx = any three numbers
I have tried using regex like so:
#driver.find_element(:css, "#edit_order_#{\d*} > div.submit > button[name=\"commit\"]").click
#driver.find_element(:xpath, "//*[(#id = "edit_order_#{\d*}")]//button").click
Is this possible? Any other ways of doing this?

You cannot use Regexp, like the other answers have indicated.
Instead, you can use a nifty CSS Selector trick:
#driver.find_element(:css, "[id^=\"edit_order_\"] > div.submit > button[name=\"commit\"]").click
Using:
^= indicates to find the element with the value beginning with your criteria.
*= says the criteria should be found anywhere within the element's value
$= indicates to find the element with with your criteria at the end of the value.
~= allows you to find the element based on a single criteria when the actual value has multiple space-seperated list of values.
Take a look at http://net.tutsplus.com/tutorials/html-css-techniques/the-30-css-selectors-you-must-memorize/ for some more info on other neat CSS tricks you should add to your utility belt!

You have no provided any html fragment that you are working on. Hence my answer is just based on the limited inputs provided your question.
I don't think WebDriver APIs support regex for locating elements. However, you can achieve what you want using just plain XPath as follows:
//*[starts-with(#id, 'edit_div_')]//button
Explanation: Above xpath will try to search all <button> nodes present under all elements whose id attribute starts with string edit_div_
In short, you can use starts-with() xpath function in order to match element with id format as edit_div_ followed by any number of characters

No, you can not.
But you should do something like this:
function hasClass(element, className) {
var re = new RegExp('(?:^|\\s+)' + className + '(?:\\s+|$)');
return re.test(element.className);
}

This worked for me
#driver.find_element(:xpath, "//a[contains(#href, 'person')]").click

mustache/hogan i18n and taking word-order into account

Found a couple of questions (and answers) on this: How is internationalization configured for Hogan.js?
,etc.
but non in particular that take word order into account. I need the ability to:
step 1. given a key -> lookup a sentence in a particular language.
step 2. this sentence may contain {{var}} , which need to be
substituted by json-values.
step 2. alone is general mustache-templating.
step 1. alone could be done with several techniques, but I prefer techniques that don't involve any specialized code outside of the Mustache/Hogan engine (in combination with a i18n-resource bundle of course) . Hogan seems to support this with something like: (from url above)
var template = "{{#i18n}}Name{{/i18n}}: {{username}}",
context = {
username: "Jean Luc",
i18n: function (i18nKey) {return translatedStrings[i18nKey];}
};
However to combine 1. and 2. in this example I would want translatedStrings[i18nKey] to return a string which potentially contains {{<some expansion>}} as well.
Someone knows of an elegant way to do this?
Rationale:
Often languages differ a lot in word order, etc. which makes for complex templates without this ability.

The latest version of Hogan.js will handle Mustache tags inside the result returned from a lambda. One minor change to the code in your question however, is that the result of the lambda should be a function in order to modify the string:
var translatedStrings = { name: "Nom {{rank}}" };
var template = "{{#i18n}}name{{/i18n}}: {{username}}",
context = {
username: "Jean Luc",
rank: 'Captain',
i18n: function() {
return function (i18nKey) {return translatedStrings[i18nKey];};
}
};
document.write(Hogan.compile(template).render(context)); // Nom Captain: Jean Luc
I created a jsfiddle that demonstrates this with the latest version.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Ignore words for lemmatizer - stanford-nlp

Related

django-haystack with Elasticsearch: how can I get fuzzy (word similarity) search to work?

Is it possible to sort with gsub (anonymous function) like ruby in Scala?

Jackrabbit XPath Query: UUID with leading number in path

Selenium Webdriver + Ruby regex: Can I use regex with find_element?

mustache/hogan i18n and taking word-order into account

Categories

Resources