Add feature extractor to Stanford NER - stanford-nlp

From http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html, to add a new extractor, the last step is:
Add code to NERFeatureFactory for this feature. First decide which
classes (hidden states) are involved in the feature. If only the
current class, you add the feature extractor to the featuresC code, if
both the current and previous class, then featuresCpC, etc.
Do we only have to add a string to the feature collection, such as featuresCpCnC.add(getWord(c) + "-PNSEQW");, and StanfordNER will then parse the string into a real feature? In that case, how do I specify the specific class/field, e.g., title and author, in the feature string? When I dump the features into a text file (using exportFeatures or printFeatures), I only find features with a generic class, like June-PSEQW|CpC, while I want something like June-DateField-DateField-PSEQW|CpC, which means (class[t-1]==DateField)*(class[t]==DateField)*(word[t-1]=="June").

I believe this is expected behavior -- are there performance issues that indicate that training is not working as expected?
To elaborate, in the most general case a featurizer f(x,y) takes both the input x and the output y, and constructs a feature vector for that particular pair. However, in many NLP applications, the features only really depend on the input x, so the featurizer interface we expose is just f(x), and we implicitly join the features with the output class in the backend (see, e.g., page 10 on "Block Feature Vectors"). In this case, it seems reasonable that we'd only print f(x), and not the full f(x,y).
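To make the "implicit join" concrete, here is a minimal Python sketch of the idea. It is not Stanford NER's actual code: the function names and the "|CpC=" string format are made up for illustration, and only the general technique (conjoining each input-only feature with the labels of the clique) is taken from the explanation above.

```python
from collections import Counter

def input_features(words, t):
    """f(x): features that depend only on the input (hypothetical names)."""
    feats = [words[t] + "-WORD"]
    if t > 0:
        feats.append(words[t - 1] + "-PSEQW")
    return feats

def joint_features(words, t, prev_label, label):
    """f(x, y): conjoin every input feature with the clique's labels.

    The "|CpC=" format is an illustrative assumption, not the library's.
    """
    clique = prev_label + "-" + label
    return Counter(f + "|CpC=" + clique for f in input_features(words, t))

words = ["in", "June", "2008"]
vec = joint_features(words, 2, "DateField", "DateField")
for name in sorted(vec):
    print(name)
```

The weight vector has one block per label combination, so the model still learns a separate weight for (class[t-1]==DateField)*(class[t]==DateField)*(word[t-1]=="June") even though the exported feature name only shows the f(x) part.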

Related

What is the essential difference between Document and Collection in YAML syntax?

Warning: This question is more philosophical than practical, but I think it is worth asking and answering in a practical context (a forum like StackOverflow, rather than the Software Engineering Stack Exchange site), because of how YAML is actually used de facto and the way its specification has evolved and gained features over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, which seems pretty official, or at least an accurate and reliable source for understanding the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marker ("---").
If so, it's hard to ignore the fact that the concept/model/idea of a 'Document' in YAML is no longer an external definition or "meta"-directive that helps the human/parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file in itself does not necessarily recognize that it is a file, or that it is part of a file system that wraps it, by definition, AFAIK).
However, when YAML allows multi-Document YAML files that gather collections of Documents in a single YAML file (perhaps in a way that is similar/analogous to the HTTP pipelining approach of the HTTP protocol), the concept/model/idea/goal of a Document receives a new, wider definition/character de facto, as a part of the YAML grammar and its productions, and not just as an assistive concept of the YAML specification or a format description that helps to describe the specification.
If so, with the Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar and well-used data structure of a Collection (an array of items)?
I'm asking because I've seen in this link (here) a snippet (the second example) which describes a YAML stream that is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated with three dashes), gathered together in the same YAML stream/file, instead of writing a file that has a "Collection" of logs represented with the array data type. Why did he choose to do this? Is his choice fitting, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as anchors, tags, and references. I guess every Document provides a guarantee that all these identifiers form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two pretty similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or to see the two as different layers of collection schemes. Is this a correct, accurate way to view it?
And even if I am right, why in the above example (the logs document from the link), where there are no duplications or collisions, nor any identifiers/anchors or compound structures used, implied, or expected, does the author still choose to represent the collection's items as separate documents? Is this just a poorly chosen example? Or am I missing something, and this is a redundancy in the specification, or syntactic sugar that evolved due to practical needs?
Because the example appears on a website that looks serious, with official information written by professionals who dealt with the essence of the language, its definition, and the theory and philosophy behind it (as opposed to practical uses in the wild), and also in light of the other, meticulous examples I have seen there, I prefer not to assume that the example is simply imperfect or unfit; there may be a good reason to write it this way rather than another in this specific case.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars, which get ! instead. These non-specific tags are later resolved to specific tags, thereby determining which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
    String name;
    int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that the path to the node, not its content, defines how it will be deserialized.
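The two resolution strategies can be contrasted with a small Python sketch. It does not use a real YAML library: resolve_by_content and resolve_by_path are hypothetical helpers, and SCHEMA is a made-up stand-in for the Config class above.

```python
def resolve_by_content(text):
    """Guess a type from the scalar's text alone (scripting-language style)."""
    if text in ("true", "false"):
        return text == "true"
    try:
        return int(text)
    except ValueError:
        return text

# Hypothetical schema standing in for the Config class: path -> expected type.
SCHEMA = {("name",): str, ("count",): int}

def resolve_by_path(path, text):
    """Convert the scalar to whatever type the schema demands at this path."""
    return SCHEMA[path](text)  # int("five") raises ValueError

print(repr(resolve_by_content("42")))         # content decides: an integer
print(repr(resolve_by_path(("name",), "42")))  # path decides: a string
```

With the path-based resolver, resolve_by_path(("count",), "five") raises an error instead of silently producing a string, which mirrors the behavior described for the statically typed case.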
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primarily focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as a configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about edge cases like a string-tagged Document that starts with a folded string, making even the following "---" and "..." just characters of the overall string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
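As a rough illustration of such a stream merger, here is a Python sketch that treats each stream as opaque text and only appends the document end marker. merge_streams is a hypothetical helper, and the sketch assumes every input stream is syntactically valid YAML on its own.

```python
def merge_streams(streams):
    """Concatenate YAML streams, closing each with a document end marker.

    The merger never parses the documents; it only makes sure the last
    document of each stream is closed before the next stream begins.
    """
    parts = []
    for stream in streams:
        if not stream.endswith("\n"):
            stream += "\n"
        parts.append(stream + "...\n")
    return "".join(parts)

merged = merge_streams(["a: 1", "- x\n"])
print(merged, end="")
```

Because ... is harmless when no document is open, the merger does not need to know whether a stream already ended with a marker of its own.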
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

How do I balance script-oriented OpenType features with other OpenType features using DirectWrite?

Full disclosure: I'm working on my libui GUI framework's text API. This wraps DirectWrite on Windows, Core Text on OS X, and Pango (which uses HarfBuzz for OpenType shaping) on other Unixes. One of the text formatting attributes I want to specify is a collection of OpenType features to use, which all three provide; DirectWrite's is IDWriteTypography.
Now, when you draw some text with these libraries, by default you'll get a few useful OpenType features enabled, such as the standard ligatures (liga) like the f+i ligature. I thought this was font-specific, but it turns out this is specific to the script of the text being shaped. Microsoft provides guidelines for all the scripts supported by OpenType (under "Script-specific Development"), and I can see rather complex logic for doing it all in HarfBuzz itself to confirm it.
On Core Text and Pango, if I enable other attributes, they'll be added on top of these defaults. But with DirectWrite, in particular IDWriteTextLayout::SetTypography(), doing so removes the defaults:
The program that produces this output can be found here.
Obviously my first option would be to ask how to get the default features on DirectWrite. Someone did so already on this site, though, and the answer seems to be "no".
I am guessing that DirectWrite is allowing me to be in complete control of the list of features to apply to some text. This is nice, except that I can't do this with the other APIs unless I explicitly disable the default features somehow! Of course, I don't know if this list will ever change, so hardcoding it might not be the best idea.
Even if hardcoding is an option, I could just grab HarfBuzz's list for each script, but a) it's rather complicated, and b) there are multiple possible shapers for a script, depending on (I think) version compatibility (for instance, Myanmar).
So why not use HarfBuzz's lists to recreate the default list of features for DirectWrite anyway? It seems to want to be accurate to other shapers anyway, so this should work, right? Well, I would need to do two things: figure out what script to use, and figure out which attributes to use on which characters for scripts where the position of a character in the word matters.
DirectWrite provides an interface IDWriteTextAnalyzer that provides facilities to perform shaping. I could use this, but it seems the script data is returned in a DWRITE_SCRIPT_ANALYSIS structure, and the description for the script ID says "The zero-based index representation of writing system script.".
This doesn't help, so I wrote a program to just dump the script numbers for text I type in. Running it on the input string
لللللللللللللاااااااااالا abcd محمد ابن بطوطة‎‎ Отложения датского яруса
yields the output
0 - 26 script 3 shapes 0
26 - 5 script 49 shapes 0
31 - 14 script 3 shapes 0
45 - 2 script 1 shapes 1
47 - 25 script 22 shapes 0
I cannot match these script numbers to anything in any of the Windows headers: if there is a defined number for Arabic, Latin, or Cyrillic in any API, they don't match these. And even if I did get a mapping between script and script number, that still doesn't give me the data to apply intra-word features.
What about Uniscribe? Well, the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts. Are the DirectWrite values the same as the Uniscribe ones? And it seems like I can at least figure the initial and final states of words by looking at the fLinkBefore and fLinkAfter fields, but is this enough to properly apply attributes per-script?
HarfBuzz does have an experimental DirectWrite backend that isn't intended to be used by real programs; I'm not yet sure whether it has the same feature-clobbering I specified above. If I find out, I'll update this part here.
Finally, if I enter the following equivalent test case to the first one above in something like kaxaml:
<Page
  xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
  xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
  <Grid>
    <FlowDocumentPageViewer>
      <FlowDocument FontFamily="Constantia" FontSize="48">
        <Paragraph>
          afford afire aflight 1/4<LineBreak/>
          <Run Typography.Fraction="1">afford afire aflight 1/4</Run>
        </Paragraph>
      </FlowDocument>
    </FlowDocumentPageViewer>
  </Grid>
</Page>
I see the ligatures being applied properly, even in the latter case:
(The fraction at the end is just to prove that that attribute is being applied.) If I assume XAML uses DirectWrite, then that proves my first option (simply overlaying my custom attributes on top of the defaults) should be possible... (I make this assumption based on the idea that XAML provides a strikingly similar API to Direct2D for drawing 2D graphics, and has a lot of holes filled in where I had to manually write a lot of glue code to do the same things with vanilla Direct2D, so I assume whatever is possible in XAML is possible with Direct2D, and by extension DirectWrite since they were technically introduced together...)
At this point I'm completely lost. I want to at least be predictable across platforms, and I'm not sure how programs are even supposed to, let alone going to, use OpenType features directly anyway. Do I have unreasonable expectations of text layout APIs? Will I have to drop IDWriteTextLayout and do all the text shaping and layout myself if I want this?
Or do I have to drop vanilla Windows 7 support and upgrade to the Platform Update DirectWrite feature set? Or even Windows 7 entirely?
After some discussions with Peter Sikking and Ebrahim Byagowi, I went and debugged a more general-purpose program I built quickly to test things, and I figured out what's going on internally.
First, however, I will say this applies to Uniscribe and DirectWrite equally.
As it turns out, DirectWrite is always providing a set of default OpenType features, regardless of what feature set I use! The situation is that the list of default features provided differs depending on whether I load my own features or not, and depending on the shaping engine. For the latn script in horizontal writing mode and for English, this is done with the "generic engine".
If I don't provide any features, the generic engine will load script-specific features. For horizontal latn, this list is
locl
ccmp
rlig
rclt
calt
liga
clig
If I do provide features, the generic engine will use the same default list for all scripts:
locl
ccmp
rclt
rlig
mark
mkmk
dist
So I don't know what to do about this. I could probably just provide liga and a few others myself in libui code (marked as a HACK of course), but this is still weird. I'm not sure what the motivation is either. Either way, this explains the behavior I'm seeing.
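The workaround I have in mind (re-adding the dropped defaults myself) could look roughly like this Python sketch. The two lists are the ones observed above for horizontal latn with the generic engine; features_to_request is a hypothetical helper, and the sketch assumes those lists stay the same, which nothing guarantees.

```python
# Default lists as observed in the debugging session above (latn, horizontal).
DEFAULTS_NO_USER_FEATURES = ["locl", "ccmp", "rlig", "rclt", "calt", "liga", "clig"]
DEFAULTS_WITH_USER_FEATURES = ["locl", "ccmp", "rclt", "rlig", "mark", "mkmk", "dist"]

def features_to_request(user_features):
    """Return {tag: value} to pass explicitly so the script defaults survive.

    user_features maps OpenType tags to values (0 disables, nonzero enables).
    """
    if not user_features:
        return {}  # let the engine pick its script-specific defaults
    # Defaults that disappear as soon as any user feature is supplied.
    dropped = [t for t in DEFAULTS_NO_USER_FEATURES
               if t not in DEFAULTS_WITH_USER_FEATURES]
    merged = {tag: 1 for tag in dropped}
    merged.update(user_features)  # explicit user choices win
    return merged

print(features_to_request({"frac": 1}))
```

For latn this re-adds calt, liga and clig; other scripts would need their own observed lists, which is exactly why this remains a HACK.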
Assuming your question is, in general, about programming, or at least concerns programming, I will try to answer some of the questions you raise.
would I have to drop the use of IDWriteTextLayout entirely in my code if I want to be able to add typographical features on top of the defaults?
It depends. If the IDWriteTextLayout interface suits your project's tasks in every way except for the ease of varying DirectWrite's default typographic features, learn what you need to about typography and create an IDWriteTypography instance suited to your needs. Developing a custom text layout for the program may require substantial time and effort, especially if the program is supposed to render bidirectional text, complex scripts, inline objects, etc.
It may happen that the tasks of your project require you to develop a text layout engine for reasons other than just controlling the typographic features used in rendered text. For example, your manager/customer may ask for customized line-breaking opportunities or a glyph advance justification algorithm. In this scenario, you will call the IDWriteTextAnalyzer::GetGlyphs method. This method has the parameters DWRITE_TYPOGRAPHIC_FEATURES ** features, const UINT32 * featureRangeLengths, UINT32 featureRanges, and these parameters enable you to supersede the set of "default" typography features for a range of the text to be rendered (see my answer to the other question What are the default typography settings used by IDWriteTextLayout?). Only affected features will be altered; the other features keep their "default" values. Moreover, if you omit these parameters in a GetGlyphs call for the next text range (for example, by passing NULL, NULL, 0), the features altered in the previous GetGlyphs call will not be altered by the call for this next range.
the documentation for the equivalent SCRIPT_ANALYSIS type says that its script ID is an "[opaque] value" whose "value for this member is undefined and applications should not rely on its value being the same from one release to the next". And while I can get a language code to identify the script by, there's still no defined value other than LANG_ENGLISH for "Western" (Latin?) scripts.
Strictly speaking, this is not an interrogative statement, but I guess you are dissatisfied with how these Unicode script IDs are defined and how one can use the API with so vaguely defined structures and constants.
It may be off topic, but I'll risk a hypothesis about the origin of the "Unicode script ID" values. As of 2010-07-17, Unicode, Inc. published Unicode version 6.0. The standard contained the document
http://www.unicode.org/Public/6.0.0/ucd/PropertyValueAliases.txt, with a section containing a list of scripts. The list went like this:
# Script (sc)
sc ; Arab ; Arabic
sc ; Armi ; Imperial_Aramaic
etc.
The Arabic script is #1, the Cyrillic script is #20, and the Latin script is #47 in this list. Furthermore, elsewhere I saw this list starting with the scripts Common and Inherited. That puts the Arabic script in 3rd place, the Cyrillic in 22nd, and the Latin in 49th. These ordinals look familiar, don't they?
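The arithmetic behind this hypothesis is easy to check against the dump in the question. The Python sketch below only uses the numbers quoted in this thread; mapping the dumped runs to Arabic, Latin and Cyrillic is my reading of the input string, and the offset of 2 is the assumption being tested.

```python
# 1-based positions in the Unicode 6.0 script list, as quoted above.
POSITION_IN_LIST = {"Arabic": 1, "Cyrillic": 20, "Latin": 47}

# Script numbers from the dump in the question (Arabic, Latin, Cyrillic runs).
OBSERVED_ID = {"Arabic": 3, "Latin": 49, "Cyrillic": 22}

OFFSET = 2  # assumed: Common and Inherited are placed before the list

for name, position in POSITION_IN_LIST.items():
    matches = position + OFFSET == OBSERVED_ID[name]
    print(name, position + OFFSET, OBSERVED_ID[name], matches)
```

All three scripts line up under that assumption, but this remains a guess about an undocumented value, so it should not be relied on in code.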
Fortunately, we need not rely on the "Unicode script ID" values; we need script properties, not script IDs or abbreviations. The API is self-consistent in that it gives actual script properties for the text range, when we pass to a GetScriptProperties method the number derived from an AnalyzeScript call.

The system cannot find the file specified in uft 12.01

I was trying to use the Insight feature of UFT to avoid depending on a build configured with libraries from the development side for a Flex-based application. When I tried the "GetVisibleText" method, UFT 12.01 returned "The system cannot find the file specified". However, I was able to click on different buttons on the same page (for example, button x and button y) as I wished, so UFT does distinguish the objects. My purpose is to check the dynamic text objects on the page. Note: "GetRoProperty" returned nothing, and there is only one property, called "similarity", which returns a constant value every time, regardless of the page.
UFT's Insight technology uses images to identify objects; the fact that it identifies button x does not mean that it has any intrinsic understanding that the button contains the text "x".
In Insight the similarity property is used in order to decide how dissimilar a control has to be from the captured image in order for it not to constitute a match. Similarity isn't a regular identification property as we are used to. This is why you get the same value for each test object (it doesn't mean that the specific object supports this property).
Regarding GetVisibleText, UFT uses OCR in order to extract the text. You can specify which language you're expecting in the last parameter.
In any case none of these things should fail due to not being able to find a file. I have two thoughts on the matter:
Are you using descriptive programming to identify the InsightObject (see the link further on)? If so, perhaps the image file you specified isn't found.
What OCR mechanism are you using (Tools ⇒ Options ⇒ GUI Testing ⇒ Text Recognition)? Perhaps the mechanism you're using isn't installed correctly and this is causing the failure; try using a different OCR mechanism.
You can read a bit more about Insight here.

OpenNLP, Training Named Entity Recognition on unsupported languages: clarifications needed

I want to experiment with NER on a specific domain, namely the extraction of location names from travel offers in Italian.
So far I've got that I need to prepare the training set by myself, so I'm going to put the
<START:something><END>
tags in some offers from my training set.
But looking at the OpenNLP documentation on how to train for NER, I ended up with a couple of questions:
1) When defining the START/END tags, am I free to use whatever name I like inside the tags (where I wrote "something" a few lines above), or is there a restricted set I am bound to?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering I want to train a model on the Italian language, which is NOT supported, is there any additional task to be done before I can train for NER?
1) Yes, you can specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
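For preparing the training file, a small helper like the following Python sketch can render tokenized sentences with the <START:type> ... <END> markup (one sentence per line, tokens separated by spaces). The annotate function, the sample sentence, and the "location" type name are made up for illustration; they are not part of OpenNLP.

```python
def annotate(tokens, spans):
    """Render one training sentence in the <START:type> ... <END> format.

    spans is a list of (start, end, type) token ranges with end exclusive;
    the ranges are assumed not to overlap.
    """
    starts = {start: name for start, _, name in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append("<START:" + starts[i] + ">")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # entity that runs to the end of the sentence
    return " ".join(out)

tokens = "Volo da Roma a New York".split()
line = annotate(tokens, [(2, 3, "location"), (4, 6, "location")])
print(line)
```

Using a second type name in some spans (for example a hypothetical "hotel") would produce a training file with multiple types, as described in the answer above.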
2) I think that the "lang" parameter has the same meaning/use as in other commands (e.g. opennlp TokenizerTrainer -lang it ...)

VS2010 syntax coloring: how to obtain the previous classification type

I'm trying to play with the new syntax coloring capabilities of VS2010 based on Noah Richards' diff coloring sample. The goal is to create syntax coloring for SpecFlow (http://www.specflow.org).
In my case, finding the syntax elements is fairly complex and not line-level. Therefore, when I implement GetClassificationSpans, I don't want to re-parse the entire file, but rather take the state at the beginning of the changed text and parse the content from that point on.
I thought that I could get the previous classifications as ClassificationTags. I did this using the IBufferTagAggregatorFactoryService class.
It works, but I'm not sure whether this is the best way to go. Should I create only one tag aggregator for the entire classifier class, or can I create one every time GetClassificationSpans is called? Should I create a special tag to remember the parsing state?
Maybe this is not the right way to go anyway; I'm also interested in other suggestions.
Br,
Gaspar
Edit: I've found a good article series on the topic: http://www.hill30.com/MikeFeingoldBlog/index.php/2009/07/31/django-editor-in-vs-2010-part-1-colors/
Essentially, you'll have to remember the state yourself. Most VS language services keep a state cookie for the beginning of each line that they update on text change.
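The per-line state cookie idea can be sketched independently of the editor API. The Python below is a toy, not Visual Studio code: the grammar (only /* ... */ comments), the state names, and the LineStateCache class are all assumptions made for illustration.

```python
def scan_line(state, line):
    """Return the lexer state after `line`, given the state at its start.

    Toy grammar (an assumption for illustration): '/*' opens a comment that
    may span lines, '*/' closes it.
    """
    i = 0
    while i < len(line):
        if state == "code" and line.startswith("/*", i):
            state, i = "comment", i + 2
        elif state == "comment" and line.startswith("*/", i):
            state, i = "code", i + 2
        else:
            i += 1
    return state

class LineStateCache:
    """Keeps the lexer state at the start of every line (the 'cookie')."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.start_state = ["code"] * (len(self.lines) + 1)
        self.rescan(0, full=True)

    def rescan(self, first, full=False):
        """Re-scan from line `first`; stop once the cached cookie is unchanged.

        Returns the number of lines that had to be scanned.
        """
        scanned = 0
        for n in range(first, len(self.lines)):
            end_state = scan_line(self.start_state[n], self.lines[n])
            scanned += 1
            if not full and end_state == self.start_state[n + 1]:
                break  # later lines start in the same state as before
            self.start_state[n + 1] = end_state
        return scanned

    def replace_line(self, n, text):
        self.lines[n] = text
        return self.rescan(n)

cache = LineStateCache(["a", "/* b", "c */", "d"])
print(cache.start_state)
print(cache.replace_line(1, "b"), cache.start_state)
```

An edit that does not change the state at the end of its line costs a single line scan; only edits that open or close a multi-line construct ripple further down the file.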
At any point, getting classifications (through either a classifier aggregator or tag aggregator) will always result in a call into the current classifiers/taggers, so it won't be returning any type of cached state (or the "last" classifications returned). The editor doesn't really cache this information, and just acts as a dumb pass-through for the information your classifier provides to the visible lines being formatted.
Also, if you do it from a classifier (provided by either an IClassifierProvider or ITaggerProvider), you are setting yourself up for some nasty recursion, especially if your classifier responds to GetClassificationSpans by calling into the aggregator (which then calls back into your classifier for some earlier text, etc.). If your classifier needs to consume other classifications to work correctly (and not its own classifications), the only safe way to write that is to:
Implement your "classifier" as an ITagger<IClassificationTag>, and provide it from an IViewTaggerProvider.
Grab an ITagAggregator<IClassificationTag> from an IBufferTagAggregatorFactoryService, but only once.
Implement IDisposable on your tagger and dispose the tag aggregator in Dispose().

Resources