Importing previously annotated document to IBM Knowledge Studio - watson-discovery

I am doing some research on building custom models for entity extraction. For this I have used some public datasets and wanted to see how they perform in IBM Knowledge Studio. But I am finding it difficult to find a way to load a public dataset (which is already annotated) into Knowledge Studio.
The documentation says previously annotated documents can be imported, but it doesn't specify the format:
https://console.bluemix.net/docs/services/watson-knowledge-studio/create-project.html#create-project
The documentation also says the annotations can come from a UIMA analysis engine, but I can't find any good examples that show the format of the file.
Can anyone help with this?

Watson Knowledge Studio can handle XMI files that are exported from Watson Explorer Content Analytics, Content Analytics Studio or Apache UIMA. You can find some information in the document below.
https://console.bluemix.net/docs/services/watson-knowledge-studio/preannotation.html#preannotation

Hope this helps others.
There is no clear documentation on the input format when you are trying to import existing annotated data into Knowledge Studio.
We did a workaround: we manually annotated a few files in Knowledge Studio and exported that data. We analysed the exported data and wrote custom programs that transform the existing annotated text into the format that Knowledge Studio accepts.
Then we imported this data back into Knowledge Studio.

Related

Is there a common solution to generate msoffice, libreoffice and pdf documents on the fly?

I would like to generate office documents (msoffice, oo) and pdf on the fly from one source document. Currently I am thinking about OpenDocument files as templates and libreoffice-headless as the converter.
Does anybody have experience with this topic, and is there a (commercial?) ready-to-use solution?
A commercial solution is Docmosis, which has downloadable and cloud-service solutions using MS Word/OpenOffice documents as templates and providing template-population features, load balancing, doc/docx/odt/pdf/rtf/html production and quite a few other features. One of its key features is to generate point-in-time output in multiple formats (from the same template and data) as you mentioned. It has at least one Ruby example to show the population features. Please note I work for the company that created Docmosis.
Another option is the open source JOD Reports.
I hope that helps.

Need some clarification regarding getting started with HTML Agility Pack

My background:
I am a newbie when it comes to HTML scrubbing. It has been about four years since my only coding work with C# for HTML. My other C# coding, equally long ago, was for forms to manipulate data in SQL Server databases.
What I have done to try to get started with HTML Agility Pack (HAP):
I have spent several days trying to make sense of instructions found from various online sources about how to get started with HTML Agility Pack. Some of what I have found so far is listed below:
www.4guysfromrolla.com/articles/011211-1.aspx
olussier.net/2010/03/30/easily-parse-html-documents-in-csharp/
stackoverflow.com/questions/846994/how-to-use-html-agility-pack
shatalov.su/en/articles/web/parser_1.php
still more referred to below...
My Results so far:
I have found the material to be quite confusing with each source seeming to tell me something different. All my attempts have come to dead ends.
So that you can efficiently sort out my confusion and reply to my specific situation, I will describe my project, my environment and my questions in three sections below:
My Project
I am tasked with creating a process to scrub data from HTML files. I know the files well. The files will reside in a folder on the local machine's file system. The HTML file(s) will be created elsewhere by a process we do not own and will be placed in that local folder. (FYI - Though it is not a part of my question, I expect to create a project or app that will be run on a schedule to perform the scrubbing task and then input the collected data into a database table.)
My Environment
As stated above the html file(s) to be processed will reside on the local machine.
I have newly installed Visual Studio 2010 Professional on this machine to code for this project.
The HTML Agility Pack is now accessible to this machine on a file share.
Under REGEDIT, HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP lists the following, indicating the versions of the .NET Framework installed on this machine:
CDF
V2.0.50727
V3.0
V3.5
V4
V4.0
My Questions
1.) I am told by some sites to download HTML Agility Pack and to use the file "HtmlAgilityPack.dll"; however, the zip file contains nine folders, each with a different copy of this file. Which one do I want?
Here are the names of the folders:
Net20
Net40
Net40-client
Net45
sl3-wp
sl4
sl4-windowsphone71
sl5
winrt45
2.) An answer to a forum question “How to I use the HTML Agility Pack” at stackoverflow.com/questions/846994/how-to-use-html-agility-pack instructs the questioner to “Download and build the HTML Agility Pack Solution”, and directs the questioner to the site htmlagilitypack.codeplex.com which then has a link to nuget.org/packages/HtmlAgilityPack which says to ‘install’ the HTMLAgilityPack by running the command “PM> Install-Package HtmlAgilityPack” in the “Package Manager Console”
What does all this mean? Other sites say to put the dll in the bin folder. What is that telling me to do?
Please explain in more detail to get me started.
3.) Assuming I am using C# what kind of project should I create?
4.) Please direct me to any other resources that you believe are applicable to my project.
Looks like you can create a .NET 4.0 project, given the .NET framework versions you have installed on your machine. What type of project depends on how you'd like your application to run. I'd personally opt for creating a C# Class Library project that contains the load html and scrub code and then host that in whatever mechanism you want to use to actually open the files.
To open a file from the file system, use either File.OpenRead or File.ReadAllText from System.IO.File. You can pass the stream or the file contents to the HtmlDocument.Load/LoadHtml methods.
// Requires: using System.IO; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
// Option 1: read the whole file into a string
string contents = File.ReadAllText("PathToFileName");
doc.LoadHtml(contents);
// Option 2: load from a stream instead
// (named "stream" because "contents" is already declared above)
using (var stream = File.OpenRead("PathToFileName"))
{
    doc.Load(stream);
}
Possibilities for hosting are plentiful: a Console Application (can be invoked from the command line or through the Task Scheduler), a Windows Service (can be loaded in Windows, run in the background even when nobody is logged on to the machine, and can potentially use the FileSystemWatcher to automatically pick up the files), or a Windows Forms/WPF application which will let the user select the files to process and then show the results somehow.
As for how to use it, this is one of the primary issues with the Html Agility Pack. New ways of using it have been added over time, so the library now offers several APIs. You could take the old-fashioned XPath query route (which was the original API) or you can use the Linq-to-HTML/XML route (which is the newer way). Neither is better than the other; they both have their distinct advantages. The XPath solution allows you to store the queries in a text file easily, so it's great for a configurable system, while the Linq-to-HTML version is a little easier on the eyes from a developer perspective.
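As a rough illustration of the FileSystemWatcher idea, here is a minimal console sketch that stays within what .NET 4.0 and Visual Studio 2010 support. The drop folder path and the ProcessFile method are placeholders invented for this example; the same watcher code could be moved into a Windows Service later.

```csharp
using System;
using System.IO;

// Hypothetical console host: watches a folder and reports each new .html file.
public static class HtmlFolderWatcher
{
    public static void Main(string[] args)
    {
        // Assumed drop folder; pass the real one as the first argument.
        string folder = args.Length > 0 ? args[0] : @"C:\Incoming";

        using (var watcher = new FileSystemWatcher(folder, "*.html"))
        {
            // Created fires when a new file appears in the watched folder.
            watcher.Created += (sender, e) => ProcessFile(e.FullPath);
            watcher.EnableRaisingEvents = true;

            Console.WriteLine("Watching {0}. Press Enter to stop.", folder);
            Console.ReadLine();
        }
    }

    public static void ProcessFile(string path)
    {
        // Placeholder: the scrubbing code would be called from here.
        // The producing process may still be writing the file when the
        // event fires, so real code would likely wait or retry before opening it.
        Console.WriteLine("New file: {0}", path);
    }
}
```

For a job that runs on a schedule, the watcher is not needed: the same ProcessFile call can simply loop over Directory.GetFiles(folder, "*.html") each time the Task Scheduler starts the program.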
As for how to download it, there are a number of options here as well.
You can indeed download the sources from the CodePlex website. Regardless of how you proceed, you might want to do that anyway: it allows you to look under the hood and figure out why something works the way it does, even if you don't compile the library yourself.
You can download the binaries from CodePlex and store them with your project. Before the creation of services such as NuGet, this was the only easy way for developers to distribute their libraries.
I'd personally choose to go the NuGet route. When you're using Visual Studio 2012, NuGet is already integrated with Visual Studio. When you're using Visual Studio 2010, you'll have to install the NuGet extension to get the same functionality. Once installed, you can open the NuGet Package Manager Console from within Visual Studio. With a Visual Studio Solution open and your freshly created Class Library selected in the Solution Explorer, you then enter the Install-Package HtmlAgilityPack command to let Visual Studio download and install the proper version of the HTML Agility Pack for your project. No worries about which library to select; Visual Studio will do that for you.
How to use it now that you've installed the library completely depends on what type of HTML scrubbing you're after and whether you choose the XPath or the Linq-to-HTML route. But it generally comes down to loading the HTML Document:
HtmlDocument doc = new HtmlDocument();
doc.Load(/* path to file or stream */); // or: doc.LoadHtml(/* string */);
And after loading the file and catching any parsing errors that might occur, proceed to query the HTML using XPath like the contents are actually XML (the XML/XPath documentation from MSDN actually applies here):
var nodes = doc.DocumentNode.SelectNodes("//table/tr/td");
Or the same query using Linq-to-HTML:
// SelectMany flattens the nested sequences so the result is a flat list of td nodes
var nodes = doc.DocumentNode.Descendants("table")
    .SelectMany(table => table.Elements("tr"))
    .SelectMany(tr => tr.Elements("td"));
Or use the Linq-to-Html with Linq query syntax:
var tds = from tables in doc.DocumentNode.Descendants("table")
from tr in tables.Elements("tr")
from td in tr.Elements("td")
select td;
You can make the queries as wild as you want. The syntax is either similar to the standard XPathNavigator syntax in the .NET Framework (using SelectNodes/SelectSingleNode/ChildNodes etc.) or the Linq-to-XML syntax (using .Descendants/.Ancestors/.Element(s) and standard Linq).
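To tie the pieces together, here is a small self-contained sketch of the XPath route. It assumes the HtmlAgilityPack package is referenced (for example via NuGet); the TableCellReader class, the ReadCells method and the sample markup are made up for illustration. One detail worth guarding against is that SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class TableCellReader
{
    // Returns the trimmed text of every <td> found under table/tr.
    public static List<string> ReadCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//table/tr/td");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(td => td.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string html = "<table><tr><td>Alice</td><td>42</td></tr></table>";
        foreach (string cell in ReadCells(html))
        {
            Console.WriteLine(cell);
        }
    }
}
```

Keeping the XPath string in one place like this also makes it easy to move it into a configuration file later, which is the advantage of the XPath route mentioned earlier.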
See also:
Linq to XML documentation
XPathNavigator/IXPathNavigable documentation

MvcScaffolding a new project and add it to the solution (Large Scale Generation)

If I want to do large-scale generation and define
Application = Framework (binary core components) + Generated Code + Custom Code
How would I go about creating a code generation framework using scaffolding to generate multiple projects and associated files from some metadata (let's say a DSL model defined in a solution folder)?
I know that I can use MvcScaffolding PowerShell cmdlets to add files to the current or another project.
I need to know if I can add a new project (Class library, Web application) to the current solution from some kind of project template, apply source transformation and possibly merge some custom data. That would allow additional files to be added, and I would prefer that both creating the project and adding some initial files be done in one PowerShell line based on some input parameters (let's say the name of some DSL model, XSD schema, XML data).
Could I just create a new solution and invoke some scaffolders? Are there scaffolders at a solution level?
I would like to have a scaffolding framework resembling software factories (Web service software factory). Any samples, ideas, articles?
Thanks
Rad
I don't see any reason why not.
Your T4 templates can access EnvDTE and so do all sorts of fun VS automation, and of course the .ps1 powershell scripts can (I guess - I am no powershell guru) do pretty much anything you yourself can do on your box.
But out of interest why would you want to generate whole projects? i.e. are you sure that is time saving?

Visual Studio extensions for code generation...what's the best way

So we have this tool. It's a web page: we drop a large piece of text in TextBox A (say SQL), run the tool, and it generates the guts of a code file in TextBox B (say a custom view class model).
It's written in C#.
I know there are several ways to create visual studio extensions.
What I'd like to know is, what's the best/easiest/fastest way to take a C# dll that has a method that takes text in and returns text out, and turn it into a Visual Studio extension that takes text in, creates a file, adds it to the project and puts the text into it?
We're using VS2008 and VS2010, and I'm okay if the best solution only works on 2010.
The closest I've found by googling so far is this:
http://visualstudiomagazine.com/articles/2009/03/01/generate-code-from-custom-file-formats.aspx
but that is for custom file formats only; I want to generate *.cs and *.rdlc and similar files.
I'd also like to be able to add methods to an existing class.
Tutorial walkthroughs or better approaches would be greatly appreciated.
VS package Builder is the answer. Lots easier.

File format or parsing guidance for Visual Studio SUO files.

Where can I find a file format spec, or guidance for parsing, .suo files? I'd like to extract breakpoint information from them.
The MSDN topic Solution User Options (.Suo) File briefly describes how storage streams are read from and written to this structured storage file, but this information is very scant, especially for someone with my limited structured storage experience.
There's little hope to ever get any useful info out of a .suo file. Even if you do manage to reverse-engineer its (complicated) format, your hard work will be for naught with the next release or service pack for Visual Studio.
The file stores IDE state. That state is also accessible from the extensibility interfaces. Use macros to get ahead. Look up the EnvDTE namespace in the MSDN library to get started.
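As a hedged sketch of the EnvDTE route, the breakpoints of a running IDE can be listed through the Debugger.Breakpoints collection. This assumes a reference to the EnvDTE assembly and a running Visual Studio instance with the solution loaded; the ProgID "VisualStudio.DTE.10.0" is an assumption that corresponds to Visual Studio 2010 and would differ for other versions. Inside a macro or add-in, the DTE object is already provided and the GetActiveObject call is unnecessary.

```csharp
using System;
using System.Runtime.InteropServices;
using EnvDTE; // Visual Studio automation assembly (must be referenced)

public static class BreakpointLister
{
    public static void Main()
    {
        // Attach to a running Visual Studio instance.
        // The ProgID below is an assumption (Visual Studio 2010).
        DTE dte = (DTE)Marshal.GetActiveObject("VisualStudio.DTE.10.0");

        // Each Breakpoint exposes the file, line and enabled state, among others.
        foreach (Breakpoint bp in dte.Debugger.Breakpoints)
        {
            Console.WriteLine("{0}:{1} enabled={2}", bp.File, bp.FileLine, bp.Enabled);
        }
    }
}
```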
