Is there a way to implement custom data detectors in NSTextView (like the one that recognizes dates or telephone numbers)?
I think there is an API reference, but the overall process is not documented. Is there anything that can help me understand what the right thing to implement is?
Latent Semantic Mapping (LSM) is well suited for training/evaluation/categorization of text (think spam filtering). The LSMSmartCategorizer sample code shows how to train and use an LSM map against news feeds.
You can also try the NSRegularExpression/NSDataDetector classes (available starting with Lion). They are designed to match patterns in text input. Once the matches are available, iterate over the results (with a custom block) and apply highlighting or style changes.
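Since those are Cocoa APIs, here is only a platform-neutral sketch of that match-and-iterate pattern, written in Python; the phone-number regex and sample text are invented for illustration, and in a real NSTextView you would use the Cocoa classes above and apply temporary attributes to the matched ranges:

    # Sketch of the pattern described above: run a matcher over the text,
    # then iterate over the matches and mark each range for re-styling.
    import re

    # A made-up "custom detector": a regex for US-style phone numbers.
    phone = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

    text = "Call 555-123-4567 tomorrow, or 555-987-6543 after lunch."
    for match in phone.finditer(text):   # iterate over all matches
        start, end = match.span()        # the range to highlight
        print("highlight %d..%d: %s" % (start, end, match.group()))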
Hope it helps.
LibShortText is an open source tool for short-text classification and analysis.
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
I have tried to figure out whether it also works with languages other than English (e.g. German), but I didn't find a hint.
Does anyone know the answer? Thank you in advance.
I think so (but it may need some extra preprocessing). LIBSVM and LIBLINEAR are both language-agnostic, and since LibShortText is built on top of LIBLINEAR, it should work for other languages too.
According to this paper, it has internal pre-processing methods to extract features.
libshorttext.converter: For given short texts, LibShortText follows the bag-of-word model to generate features. Users apply procedures in this library to pre-process short texts by tokenization, stemming (optional), and stop-word removal (optional). The library also allows users to choose between unigram and bigram features.
However, it looks like its stemming and stop-word removal only support English. So if you want better features extracted from non-English text, you might want to use your own pre-processing methods, for example using nltk.
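As a rough illustration (this is not LibShortText's actual pipeline, just a hand-rolled equivalent), tokenization, stemming, stop-word removal, and unigram/bigram feature generation for German text could look like this with nltk; the extract_features function and the sample sentence are made up:

    # Hand-rolled German pre-processing in the spirit of libshorttext.converter:
    # tokenize, stem, drop stop words, then emit unigram + bigram features.
    # Requires: pip install nltk, then nltk.download('punkt') and
    # nltk.download('stopwords') once.
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize
    from nltk.util import bigrams

    stemmer = SnowballStemmer("german")
    stop_words = set(stopwords.words("german"))

    def extract_features(text):
        tokens = [t.lower() for t in word_tokenize(text, language="german")]
        tokens = [stemmer.stem(t) for t in tokens
                  if t.isalpha() and t not in stop_words]
        # Unigrams plus bigrams, mirroring LibShortText's feature options.
        return tokens + ["%s %s" % (a, b) for a, b in bigrams(tokens)]

    print(extract_features("Die Katze sitzt auf der Matte"))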
When a batch is created, documents should be separated automatically, without using separator sheets or barcode separators.
How can I classify documents as invoices and supporting documents?
In our project we receive many invoices with supporting documents, so the scanning operator has to insert the separator sheets manually; to avoid this, we want to classify the supporting documents automatically.
In general, the approach would be to enable separation in the project and then train your classes with examples to be used by the layout or content classifiers.
However, as I'm sure you've seen, the obstacle with invoices is that they are different enough between vendors that it would not reliably classify all to an Invoice class. Similarly with "Supporting Documents" which are likely to be very different from each other, so unfortunately there isn't a completely easy answer without separator sheets (or barcode stickers affixed to supporting docs).
What you might want to do is write code in one of the separation events, such as the Document_AfterSeparate event. Despite the name, the document has not yet been split at this point, but the classifiers have run. See the Scripting Help topic "Server Script Events Sequence > Document Separation > Standard Document Separation" for more detail. Setting the SplitPage property on the CDocPage (pXDoc.CDoc.Pages.ItemByIndex(lPage).SplitPage) allows you to use your own logic to determine which pages to separate.
For example, if you know that you will always have single-page invoices, you can split on the first page and classify accordingly. Or you can search for something that indicates the end of an invoice, such as "Total" or other characteristics. There is an example of how you can use locators to help separation in the Scripting Help topic "Script Samples > Use Locator Results for Standard Document Separation". The example uses a Barcode Locator, but the same concept works with a Format Locator or anything else.
Without separator sheets you will need smart classification software like Kofax Transformation Module (KTM). It's kind of expensive, so you will need to verify the cost savings and ROI.
Is there a data structure within LiveCode that can be used as a "holder" for associated data, letting me handle it collectively? I come from a Java/JavaScript/C background, so I am looking for a class- or struct-style data structure.
I've found examples of Groups, which seem to have some of this functionality, but it feels a bit like I'm bending the language to meet my needs.
As a specific example, suppose I had an image field on my screen that would randomly display an image and, when pressed, play an associated sound clip. I'd expect to create a list of "structures" that contained the path to the image and the path to the associated sound clip, and use that data to populate the image field and to decide what sound clip to play.
Would a Group be the correct structure to use in this case? Or am I approaching this in a way that isn't really fitting with the way LiveCode works?
It takes a little getting used to, but the xTalk world is much simpler and more open than any ordinary procedural language. So much of what you once had to manage is no longer required.
So when splash21 said that you could store all your image and sound references in a custom property, he was really saying that the LiveCode environment contains intrinsic, high level functionality that makes these sorts of things instantly accessible, and the only thing required of you is to call for them, and they simply work.
The only way to appreciate this is to make a few simple programs to really see what is possible. Make your application. Everything you mentioned can be accomplished with perhaps a dozen lines of code in a single handler. I recommend that you join the LiveCode use-list and forums. The community is vibrant and eager to help, frequently with full-blown solutions to specific problems, but more importantly as guides and mentors to new users.
Craig Newman
Arrays in LiveCode are actually associative arrays (like hash maps). A key is associated with a value, and the value may itself be an array.
Chapter 5.5.7 of the User's Guide says
Array elements may contain nested or sub-elements, making them multi-dimensional. This type of array is ideal for processing hierarchical data structures such as trees or XML. To access a sub-element, simply declare it using an additional set of square brackets.
put "ABC" into myVariable["myKeyName"][“aChildElement”]
see also
How to store pictures in a stack?
Dave - I'm hoping to get a struct-like container implemented in the near future. Meanwhile you can, as splash21 mentioned, use custom properties (or better yet, custom property sets) to do what you want. This will give you a pseudo-struct for each object, and you can implement the file and sound specifications in the properties. If you use that in conjunction with a behavior object, you'll end up very close to a real inheritable class formation.
I have a huge number of documents (mainly PDFs and DOCs) that I want to classify so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the document) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you want. If the categories are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static method to train on a collection of documents, each of which requires a category tag and the block of text to train on. Then you can initialize the DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you do this, you can (I think) write the model to a file so you never have to do the training again.
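If you just want to see the shape of that train-then-classify workflow, here is a minimal sketch of the same idea in Python with scikit-learn rather than OpenNLP; the categories, sample texts, and file name are made up:

    # Same workflow as DocumentCategorizerME: train on (category, text) pairs,
    # persist the model, then classify new documents.
    # Requires: pip install scikit-learn
    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # A small, manually tagged sample per category, as suggested above.
    train_texts = ["invoice total amount due net 30",
                   "meeting agenda and minutes for the board"]
    train_labels = ["finance", "meetings"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Write the model to a file so you never have to train again.
    with open("doccat_model.pickle", "wb") as f:
        pickle.dump(model, f)

    print(model.predict(["please pay the attached invoice"]))  # ['finance']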
This post on extracting keywords and classifying webpages is related and may be helpful. In your example, it sounds like you could use tags in lieu of the keyword-extraction piece (although you may want to use both in combination). Weka is easy to use; I would definitely recommend giving it a look.
I'm trying to play with the new syntax coloring capabilities of VS2010 based on Noah Richards' diff coloring sample. The goal is to create syntax coloring for SpecFlow (http://www.specflow.org).
In my case, finding the syntax elements is fairly complex and not line-based. Therefore, when I implement GetClassificationSpans I don't want to re-parse the entire file, but rather take the state at the beginning of the changed text and parse the content from that point on.
I thought I could get the previous classifications as ClassificationTags. I did this using the IBufferTagAggregatorFactoryService class.
It works, but I'm not sure whether this is the best way to go. Should I create only one tag aggregator for the entire classifier class, or can I create one every time GetClassificationSpans is called? Should I create a special tag to remember the parsing state?
Maybe this is not the right way to go at all; I'm also interested in other suggestions.
Br,
Gaspar
Edit: I've found a good article series on the topic: http://www.hill30.com/MikeFeingoldBlog/index.php/2009/07/31/django-editor-in-vs-2010-part-1-colors/
Essentially, you'll have to remember the state yourself. Most VS language services keep a state cookie for the beginning of each line that they update on text change.
At any point, getting classifications (through either a classifier aggregator or a tag aggregator) will always result in a call into the current classifiers/taggers, so it won't return any type of cached state (or the "last" classifications returned). The editor doesn't really cache this information; it just acts as a dumb pass-through from the information your classifier provides to the visible lines being formatted.
Also, if you do this from a classifier (provided by either an IClassifierProvider or an ITaggerProvider), you are setting yourself up for some nasty recursion, especially if your classifier responds to GetClassificationSpans by calling into the aggregator (which then calls back into your classifier for some earlier text, and so on). If your classifier needs to consume other classifications to work correctly (and not its own classifications), the only safe way to write it is to:
Implement your "classifier" as an ITagger<IClassificationTag>, and provide it from an IViewTaggerProvider.
Grab an ITagAggregator<IClassificationTag> from an IBufferTagAggregatorFactoryService, but only once.
Implement IDisposable on your tagger and dispose the tag aggregator in Dispose().