NPOI - Determine heading before a paragraph - npoi

I'm attempting to write a parser to extract details from a word document using NPOI. I'm able to retrieve details from each table in the document but I need to be able to identify which section of the document the table comes from in order to differentiate between them. While I can identify all of the lines that have the specific heading type I need, I can't work out how to tell which heading precedes which table.
Can anybody offer any advice? If it's not possible with NPOI, can anybody recommend another way to do it?

If you are parsing word document. I'll suggest you to use OpenXMlpowertool by Eric white, download it from NuGet package manager or download directly from net.
Here is the code snippet i have used to parse document , code snippet is very small, clean and stable. You must first debug it to understand it working which will help you to customize it for yourself. it will read all text, paragraphs , bullets and contents etc. go through the documentation of Eric White for more details but below code snippet is the most you'll need to parse and top of it you can build your functionality.
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;
private static WordprocessingDocument _wordDocument;
_wordDocument = WordprocessingDocument.Open(wordFileStream, false); // stream wordFileStream in constructor
// To get header and footer use this
var headerList = _wordDocument.MainDocumentPart.HeaderParts.ToList();
var footerList = _wordDocument.MainDocumentPart.FooterParts.ToList();
private void GetDocumentBodyContents()
{
List<string> allList = new List<string>();
List<string> allListText = new List<string>();
try
{
//RevisionAccepter.AcceptRevisions(_wordDocument);
XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
XElement body = root.LogicalChildrenContent().First();
OutputBlockLevelContent(_wordDocument, body);
}
catch (Exception ex)
{ }
}
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
try
{
string currentItem = string.Empty, currentItemText = string.Empty, numberText = string.Empty;
foreach (XElement blockLevelContentElement in
blockLevelContentContainer.LogicalChildrenContent())
{
if (blockLevelContentElement.Name == W.p)
{
currentItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
//currentItemText = blockLevelContentElement
// .LogicalChildrenContent(W.r)
// .LogicalChildrenContent(W.t)
// .Select(t => (string)t)
// .StringConcatenate();
currentItemText = blockLevelContentElement
.LogicalChildrenContent(W.r)
.Select(t =>
{
if (t.LogicalChildrenContent(W.br).Count() > 0)
{
//Adding line Break for Steps because it is truncated when typecaste with String
t.SetElementValue(W.br, "<br />");
}
return (string)t;
}
).StringConcatenate();
continue;
}
// If element is not a paragraph, it must be a table.
foreach (var row in blockLevelContentElement.LogicalChildrenContent())
{
foreach (var cell in row.LogicalChildrenContent())
{
// Cells are a block-level content container, so can call this method recursively.
OutputBlockLevelContent(wordDoc, cell);
}
}
}
}
catch (Exception ex)
{
}
}

Related

Combining forms while retaining form fonts in itext7

I am trying to fill and combine multiple forms without flattening(need to keep them interactive for users). However I notice a problem. I have PDF files that contain the forms I am trying to fill. The form fields have their fonts set in adobe PDF. I notice after I combine the forms the fields lose their original fonts. Here is my program.
using iText.Forms;
using iText.Kernel.Pdf;
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
namespace PdfCombineTest
{
class Program
{
static void Main(string[] args)
{
Stream file1;
Stream file2;
using (var stream = new FileStream("./pdf-form-1.pdf", FileMode.Open, FileAccess.Read))
{
file1 = Program.Fill(stream, new[] { KeyValuePair.Create("Text1", "TESTING"), KeyValuePair.Create("CheckBox1", "Yes") });
}
using (var stream = new FileStream("./pdf-form-2.pdf", FileMode.Open, FileAccess.Read))
{
file2 = Program.Fill(stream, new[] { KeyValuePair.Create("Text2", "text 2 text") });
}
using (Stream output = Program.Combine(new[] { file1, file2 }))
{
using (var fileStream = File.Create("./output.pdf"))
{
output.CopyTo(fileStream);
}
}
}
public static Stream Combine(params Stream[] streams)
{
MemoryStream copyStream = new MemoryStream();
PdfWriter writer = new PdfWriter(copyStream);
writer.SetSmartMode(true);
writer.SetCloseStream(false);
PdfPageFormCopier formCopier = new PdfPageFormCopier();
using (PdfDocument combined = new PdfDocument(writer))
{
combined.InitializeOutlines();
foreach (var stream in streams)
{
using (PdfDocument document = new PdfDocument(new PdfReader(stream)))
{
document.CopyPagesTo(1, document.GetNumberOfPages(), combined, formCopier);
}
}
}
copyStream.Seek(0, SeekOrigin.Begin);
return copyStream;
}
public static Stream Fill(Stream inputStream, IEnumerable<KeyValuePair<string, string>> keyValuePairs)
{
MemoryStream outputStream = new MemoryStream();
PdfWriter writer = new PdfWriter(outputStream);
writer.SetCloseStream(false);
using (PdfDocument document = new PdfDocument(new PdfReader(inputStream), writer))
{
PdfAcroForm acroForm = PdfAcroForm.GetAcroForm(document, true);
acroForm.SetGenerateAppearance(true);
IDictionary<string, iText.Forms.Fields.PdfFormField> fields = acroForm.GetFormFields();
foreach (var kvp in keyValuePairs)
{
fields[kvp.Key].SetValue(kvp.Value);
}
}
outputStream.Seek(0, SeekOrigin.Begin);
return outputStream;
}
}
}
I've noticed after several hours of debugging that PdfPageFormCopier excludes the default resources which contain fonts when merging form fields, is there a way around this? The project I'm working on currently does this process in ItextSharp and it works as intended. However we are looking to migrate to iText7.
Here are links to some sample pdf's I made I can't upload the actual pdf's I'm working with but these display the same problem.
https://www.dropbox.com/s/pukt91d4xe8gmmo/pdf-form-1.pdf?dl=0
https://www.dropbox.com/s/c52x6bc99gnrvo6/pdf-form-2.pdf?dl=0
So my solution was to modify the PdfPageFormCopier class from iText. The main issue is in the function below.
public virtual void Copy(PdfPage fromPage, PdfPage toPage) {
if (documentFrom != fromPage.GetDocument()) {
documentFrom = fromPage.GetDocument();
formFrom = PdfAcroForm.GetAcroForm(documentFrom, false);
}
if (documentTo != toPage.GetDocument()) {
documentTo = toPage.GetDocument();
formTo = PdfAcroForm.GetAcroForm(documentTo, true);
}
if (formFrom == null) {
return;
}
//duplicate AcroForm dictionary
IList<PdfName> excludedKeys = new List<PdfName>();
excludedKeys.Add(PdfName.Fields);
excludedKeys.Add(PdfName.DR);
PdfDictionary dict = formFrom.GetPdfObject().CopyTo(documentTo, excludedKeys, false);
formTo.GetPdfObject().MergeDifferent(dict);
IDictionary<String, PdfFormField> fieldsFrom = formFrom.GetFormFields();
if (fieldsFrom.Count <= 0) {
return;
}
IDictionary<String, PdfFormField> fieldsTo = formTo.GetFormFields();
IList<PdfAnnotation> annots = toPage.GetAnnotations();
foreach (PdfAnnotation annot in annots) {
if (!annot.GetSubtype().Equals(PdfName.Widget)) {
continue;
}
CopyField(toPage, fieldsFrom, fieldsTo, annot);
}
}
Specifically the line here.
excludedKeys.Add(PdfName.DR);
If you walk the the code in the CopyField() function eventually you will end in the PdfFormField class. You can see the constructor below.
public PdfFormField(PdfDictionary pdfObject)
: base(pdfObject) {
EnsureObjectIsAddedToDocument(pdfObject);
SetForbidRelease();
RetrieveStyles();
}
The function RetrieveStyles() will try to set the font for the field based on the default appearance. However that will not work. Due to the function below.
private PdfFont ResolveFontName(String fontName) {
PdfDictionary defaultResources = (PdfDictionary)GetAcroFormObject(PdfName.DR, PdfObject.DICTIONARY);
PdfDictionary defaultFontDic = defaultResources != null ? defaultResources.GetAsDictionary(PdfName.Font) :
null;
if (fontName != null && defaultFontDic != null) {
PdfDictionary daFontDict = defaultFontDic.GetAsDictionary(new PdfName(fontName));
if (daFontDict != null) {
return GetDocument().GetFont(daFontDict);
}
}
return null;
}
You see it is trying to see if the font exists in the default resources which was explicitly excluded in the PdfPageFormCopier class. It will never find the font.
So my solution was to create my own class that implements the IPdfPageExtraCopier interface. I copied the code from the PdfPageFormCopier class and removed the one line excluding the default resources. Then I use my own copier class in my code. Not the prettiest solution but it works.

How to get a CNContact phone number(s) as string in Xamarin.ios?

I am attempting to retrieve the names and phone number(s) of all contacts and show them into tableview in Xamarin.iOS. I have made it this far:
var response = new List<ContactVm>();
try
{
//We can specify the properties that we need to fetch from contacts
var keysToFetch = new[] {
CNContactKey.PhoneNumbers, CNContactKey.GivenName, CNContactKey.FamilyName, CNContactKey.EmailAddresses
};
//Get the collections of containers
var containerId = new CNContactStore().DefaultContainerIdentifier;
//Fetch the contacts from containers
using (var predicate = CNContact.GetPredicateForContactsInContainer(containerId))
{
CNContact[] contactList;
using (var store = new CNContactStore())
{
contactList = store.GetUnifiedContacts(predicate, keysToFetch, out
var error);
}
//Assign the contact details to our view model objects
response.AddRange(from item in contactList
where item?.EmailAddresses != null
select new ContactVm
{
PhoneNumbers =item.PhoneNumbers,
GivenName = item.GivenName,
FamilyName = item.FamilyName,
EmailId = item.EmailAddresses.Select(m => m.Value.ToString()).ToList()
});
}
BeginInvokeOnMainThread(() =>
{
tblContact.Source = new CustomContactViewController(response);
tblContact.ReloadData();
});
}
catch (Exception e)
{
Console.WriteLine(e);
throw;
}
and this is my update cell method
internal void updateCell(ContactVm contact)
{
try
{
lblName.Text = contact.GivenName;
lblContact.Text = ((CNPhoneNumber)contact.PhoneNumbers[0]).StringValue;
//var no = ((CNPhoneNumber)contact.PhoneNumbers[0]).StringValue;
//NSString a = new NSString("");
// var MobNumVar = ((CNPhoneNumber)contact.PhoneNumbers[0]).ValueForKey(new NSString("digits")).ToString();
var c = (contact.PhoneNumbers[0] as CNPhoneNumber).StringValue;
}
catch(Exception ex)
{
throw ex;
}
}
I would like to know how to retrieve JUST the phone number(s) as a string value(s) i.e. "XXXXXXXXXX". Basically, how to call for the digit(s) value.
this line of code
lblContact.Text = ((CNPhoneNumber)contact.PhoneNumbers[0]).StringValue;
throw a run time exception as specified cast is not valid
Yes, exception is correct. First of all, you don't need any casts at all, contact.PhoneNumbers[0] will return you CNLabeledValue, so you just need to write it in next way
lblContact.Text = contact.PhoneNumbers[0].GetLabeledValue(CNLabelKey.Home).StringValue
//or
lblContact.Text = contact.PhoneNumbers[0].GetLabeledValue(CNLabelPhoneNumberKey.Mobile).StringValue
or you may try, but not sure if it works
contact.PhoneNumbers[0].Value.StringValue
I got solution. Here is my code if any one required.
CNLabeledValue<CNPhoneNumber> numbers =
(Contacts.CNLabeledValue<Contacts.CNPhoneNumber>)contact.PhoneNumbers[0];
CNPhoneNumber number = numbers.Value;
string str = number.StringValue;
lblContact.Text = str;

HTML Agility pack throw in "An unhandled exception of type 'System.ArgumentNullException' occurred in mscorlib.dll" Exception

I'm going to parsing a website content Website HTML Content. I use follow codes:
private async void loadButton_Click(object sender, RoutedEventArgs e)
{
string address = "http://www.iiees.ac.ir/fa/eqcatalog";
loadButton.IsEnabled = false;
//string htmlCode = await DownloadStringAsync(address, Encoding.UTF8);
string htmlCode = await GetDataAsync(address, Encoding.UTF8);
paragraph1.Inlines.Add(new Run(htmlCode));
loadButton.IsEnabled = true;
}
private async Task<string> GetDataAsync(string address, Encoding encoding)
{
string htmlCode = await DownloadStringAsync(address, encoding);
List<string> options = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (var table in doc.DocumentNode.SelectNodes("//table"))
{
if (table.Id == "CatalogForm")
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
if (row.Attributes.Any(a => a.Value == "hidecircle"))
{
foreach (HtmlNode td in row.SelectNodes("TD"))
{
foreach (HtmlNode option in td.SelectNodes("option"))
{
options.Add(option.InnerText);
}
}
}
}
}
}
var sb = new StringBuilder();
options.ForEach(str => sb.AppendLine(str));
return sb.ToString();
}
But throw in "An unhandled exception of type 'System.ArgumentNullException' occurred in mscorlib.dll" exception.
For more information, I upload my project here.
Download project
I created a sample from your code, and I found the following:
I am not particularly familiar with xpath, but seems it is case sensitive. Searching for "TD" will return nothing, while searching for "td" will.
SelectNodes will return null whenever nothing was found, and a foreach that depends on it will always throw an exception. So it is better to check whether the result of SelectNodes is not null before trying to loop over it.
foreach (HtmlNode td in row.SelectNodes("td"))
{
var ops = td.SelectNodes("option");
if (ops != null)
{
foreach (HtmlNode option in ops)
{
options.Add(option.InnerText);
}
}
}

MSBuild WriteCodeFragment Task

It's not clear to me what the purpose of the WriteCodeFragment task is (https://msdn.microsoft.com/en-us/library/microsoft.build.tasks.writecodefragment.aspx). Google just seems to turn-up documentation and the one example I've seen isn't definitive.
Can someone clarify? Thanks.
It looks like it's only good for populating a sourcecode file (of your choice of language) with attributes (read: to embed build information into a project).
This is the sourcecode for the task:
https://github.com/Microsoft/msbuild/blob/master/src/XMakeTasks/WriteCodeFragment.cs
The specific GenerateCode() method for reference. The type that it's principally enumerating looks to be attribute-specific (read: declarative and non-functional):
/// <summary>
/// Generates the code into a string.
/// If it fails, logs an error and returns null.
/// If no meaningful code is generated, returns empty string.
/// Returns the default language extension as an out parameter.
/// </summary>
[SuppressMessage("Microsoft.Globalization", "CA1305:SpecifyIFormatProvider", MessageId = "System.IO.StringWriter.#ctor(System.Text.StringBuilder)", Justification = "Reads fine to me")]
private string GenerateCode(out string extension)
{
extension = null;
bool haveGeneratedContent = false;
CodeDomProvider provider;
try
{
provider = CodeDomProvider.CreateProvider(Language);
}
catch (ConfigurationException ex)
{
Log.LogErrorWithCodeFromResources("WriteCodeFragment.CouldNotCreateProvider", Language, ex.Message);
return null;
}
catch (SecurityException ex)
{
Log.LogErrorWithCodeFromResources("WriteCodeFragment.CouldNotCreateProvider", Language, ex.Message);
return null;
}
extension = provider.FileExtension;
CodeCompileUnit unit = new CodeCompileUnit();
CodeNamespace globalNamespace = new CodeNamespace();
unit.Namespaces.Add(globalNamespace);
// Declare authorship. Unfortunately CodeDOM puts this comment after the attributes.
string comment = ResourceUtilities.FormatResourceString("WriteCodeFragment.Comment");
globalNamespace.Comments.Add(new CodeCommentStatement(comment));
if (AssemblyAttributes == null)
{
return String.Empty;
}
// For convenience, bring in the namespaces, where many assembly attributes lie
globalNamespace.Imports.Add(new CodeNamespaceImport("System"));
globalNamespace.Imports.Add(new CodeNamespaceImport("System.Reflection"));
foreach (ITaskItem attributeItem in AssemblyAttributes)
{
CodeAttributeDeclaration attribute = new CodeAttributeDeclaration(new CodeTypeReference(attributeItem.ItemSpec));
// Some attributes only allow positional constructor arguments, or the user may just prefer them.
// To set those, use metadata names like "_Parameter1", "_Parameter2" etc.
// If a parameter index is skipped, it's an error.
IDictionary customMetadata = attributeItem.CloneCustomMetadata();
List<CodeAttributeArgument> orderedParameters = new List<CodeAttributeArgument>(new CodeAttributeArgument[customMetadata.Count + 1] /* max possible slots needed */);
List<CodeAttributeArgument> namedParameters = new List<CodeAttributeArgument>();
foreach (DictionaryEntry entry in customMetadata)
{
string name = (string)entry.Key;
string value = (string)entry.Value;
if (name.StartsWith("_Parameter", StringComparison.OrdinalIgnoreCase))
{
int index;
if (!Int32.TryParse(name.Substring("_Parameter".Length), out index))
{
Log.LogErrorWithCodeFromResources("General.InvalidValue", name, "WriteCodeFragment");
return null;
}
if (index > orderedParameters.Count || index < 1)
{
Log.LogErrorWithCodeFromResources("WriteCodeFragment.SkippedNumberedParameter", index);
return null;
}
// "_Parameter01" and "_Parameter1" would overwrite each other
orderedParameters[index - 1] = new CodeAttributeArgument(String.Empty, new CodePrimitiveExpression(value));
}
else
{
namedParameters.Add(new CodeAttributeArgument(name, new CodePrimitiveExpression(value)));
}
}
bool encounteredNull = false;
for (int i = 0; i < orderedParameters.Count; i++)
{
if (orderedParameters[i] == null)
{
// All subsequent args should be null, else a slot was missed
encounteredNull = true;
continue;
}
if (encounteredNull)
{
Log.LogErrorWithCodeFromResources("WriteCodeFragment.SkippedNumberedParameter", i + 1 /* back to 1 based */);
return null;
}
attribute.Arguments.Add(orderedParameters[i]);
}
foreach (CodeAttributeArgument namedParameter in namedParameters)
{
attribute.Arguments.Add(namedParameter);
}
unit.AssemblyCustomAttributes.Add(attribute);
haveGeneratedContent = true;
}
StringBuilder generatedCode = new StringBuilder();
using (StringWriter writer = new StringWriter(generatedCode, CultureInfo.CurrentCulture))
{
provider.GenerateCodeFromCompileUnit(unit, writer, new CodeGeneratorOptions());
}
string code = generatedCode.ToString();
// If we just generated infrastructure, don't bother returning anything
// as there's no point writing the file
return haveGeneratedContent ? code : String.Empty;
}
That only other example that I found was this:
(Using WriteCodeFragment MSBuild Task)
<Target Name="BeforeBuild">
<ItemGroup>
<AssemblyAttributes Include="AssemblyVersion">
<_Parameter1>123.132.123.123</_Parameter1>
</AssemblyAttributes>
</ItemGroup>
<WriteCodeFragment Language="C#" OutputFile="BuildVersion.cs" AssemblyAttributes="#(AssemblyAttributes)" />
</Target>
Plugging this into an actual build renders:
//------------------------------------------------------------------------------
// <auto-generated>
// This code was generated by a tool.
// Runtime Version:4.0.30319.42000
//
// Changes to this file may cause incorrect behavior and will be lost if
// the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------
using System;
using System.Reflection;
[assembly: AssemblyVersion("123.132.123.123")]
// Generated by the MSBuild WriteCodeFragment class.

Force tags to update from external event

this is a somewhat exotic question but I will try anyway.
I am writing a Visual Studio 2010 Extension using MEF. At some point in my code, I am required to provide some error glyphs ( like brakepoints ) whenever a document is saved. The problem is that I can't seem to find how to force GetTags of my ITagger to be called without having the user writing anything.
I can catch the document save event but I am missing the "link" between this and somehow getting GetTags method to be called. Any ideas?
Here is some code :
internal class RTextErrorTag : IGlyphTag
{}
class RTextErrorGlyphTagger : ITagger<RTextErrorTag>
{
private IClassifier mClassifier;
private SimpleTagger<ErrorTag> mSquiggleTagger;
private EnvDTE._DTE mMSVSInstance = null;
private List<object> mDTEEvents = new List<object>();
internal RTextErrorGlyphTagger(IClassifier classifier, IErrorProviderFactory squiggleProviderFactory, ITextBuffer buffer, IServiceProvider serviceProvider)
{
this.mClassifier = classifier;
this.mSquiggleTagger = squiggleProviderFactory.GetErrorTagger(buffer);
mMSVSInstance = serviceProvider.GetService(typeof(EnvDTE._DTE)) as EnvDTE._DTE;
var eventHolder = mMSVSInstance.Events.DocumentEvents;
mDTEEvents.Add(eventHolder);
eventHolder.DocumentSaved += new EnvDTE._dispDocumentEvents_DocumentSavedEventHandler(OnDocumentSaved);
}
void OnDocumentSaved(EnvDTE.Document Document)
{
//fire up event to renew glyphs - how to force GetTags method?
var temp = this.TagsChanged;
if (temp != null)
temp(this, new SnapshotSpanEventArgs(new SnapshotSpan() )); //how to foce get
}
IEnumerable<ITagSpan<RTextErrorTag>> ITagger<RTextErrorTag>.GetTags(NormalizedSnapshotSpanCollection spans)
{
foreach (SnapshotSpan span in spans)
{
foreach (var error in this.mSquiggleTagger.GetTaggedSpans(span))
{
yield return new TagSpan<RTextErrorTag>(new SnapshotSpan(span.Start, 1), new RTextErrorTag());
}
}
}
public event EventHandler<SnapshotSpanEventArgs> TagsChanged;
}

Resources