I am a newbee to PDFClown and need help in parsing my pdf contents.
My PDF has huge number of MarkedContents which is displayed when converted as Stream.
But i am not able to parse them into objects to extract the Path Information contained within, which is my objective.
Here is my code -
if(level.Contents[i] is MarkedContent)
{
PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;
PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();
ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();
//Here i am getting only two Content Objects, where as the stream has so many distinct Marked Contents
for (int k = 0; k < markerContentObjList.Count; k++)
{
}
}
Below is the DOM Inspector screenshot and Stream data
In Short
There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.
The First Content Stream
The screen shot you provided shows your first page content stream:
The second content stream of that page exhibits the same issues as this one:
Non-Matching Starts and Ends of Marked Content Sequences
If we look at the marked content operators, we see
/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC
As you can see, there are two EMC operators for the first BDC. This is invalid. Confer ISO 32000-2 section 14.6 Marked content.
Invalid Fill Operator
Furthermore, there is a Fill operator directly following a text object:
BT
...
ET
f
This also is invalid, path painting operators are only allowed after a path object or a clipping path object, not after a text object. Confer ISO 32000-2 Figure 9 Graphics objects.
A Related PDF Clown Issue
Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.
Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.
Why PDF Clown Stops at the End of the First Stream
As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.
This is due to the PDF Clown issue explained above: Based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block without checking whether it matches alright.
Thus, it matches opening and closing operators in your stream like this:
[Start of page content]
. q
. . /OC /Heading BDC
. . EMC
. EMC
. /OC /Drawing BDC
. EMC
Q
So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.
I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.
I discovered a bug in my mediawiki.
When I try to open a page with a + or a & in the name, it end up in a "The Site can't be reached"-Error. In the logs i can see, that i get a high number of HTTP 301 Codes when reaching such a Page. It also translates the characters:
+ into %2B
& into %26
But is does seem like it's not getting translated back? I'm also not using any mod_rewrite Code, well atleast none that I know of.
https://www.mediawiki.org/wiki/Manual%3aShort_URL#URL_like_-_example.com.2FPage_title
This describes my Problem (under Troubleshooting), but / works fine. But as I mentiond, I don't use any URL rewriting.
I would appreciate some help, thanks :)
Edit: I've just tested some more characters that are legal for pagetitles, seems like = isn't working either.
It's translated from = into %3D
I have read all of the questions on here about this topic and none of them provided me with a workable solution, so I'm asking this one.
I am running a legitimate copy of Excel 2013 in Windows 7. I record a macros where I insert a picture, and in the open file dialog I paste this URL: http://ecx.images-amazon.com/images/I/41u%2BilIi00L._SL160_.jpg (simply a picture of a product on Amazon). This works as expected.
The resulting macros looks like this:
Sub insertImage()
'
' percent Macro
'
'
ActiveSheet.Pictures.Insert( _
"http://ecx.images-amazon.com/images/I/41u+ilIi00L._SL160_.jpg").Select
End Sub
However, when I attempt to run this, the Insert line breaks with the following error:
Run-time error '1004':
Unable to get the Insert property of the Picture class
I am trying to insert a number of pictures into an excel document and I am using the ActiveSheet.Pictures.Insert method to do this. I have been experiencing this issue there, so I recreated it in a way others could replicate to facilitate getting an answer...
An interesting thing to note is:
http://ecx.images-amazon.com/images/I/41u%2BilIi00L._SL160_.jpg 'This is what I pasted
http://ecx.images-amazon.com/images/I/41u+ilIi00L._SL160_.jpg 'This is what the recorded macros recorded
It looks like Excel automatically resolved the %2B to a +. I tried making that change, but to no success.
Another interesting thing to note is that sometimes this does work and sometimes it doesn't. This url is a case where it does not work. Here's one where it does: http://ecx.images-amazon.com/images/I/51mXQ-IjigL._SL160_.jpg
Why would Excel generate a macros it can't run? More importantly, how can I avoid this error and get on with my work!? Thanks!
Try this workaround:
Sub RetrieveImage()
Dim wsht As Worksheet: Set wsht = ThisWorkbook.ActiveSheet
wsht.Shapes.AddPicture "http://ecx.images-amazon.com/images/I/41u+ilIi00L._SL160_.jpg", _
msoFalse, msoTrue, 0, 0, 100, 100
End Sub
All fields are required, which is kind of a bummer since you cannot get the default size. The location offsets and sizes are in pixels/points. Also, the % turning to + is just alright, as % would cause it to be not recognized (dunno why).
Result:
Let us know if this helps.
I'm experiencing the same issue.
After some digging I found out Excel does a HTTP HEAD request before getting the image.
If this HEAD request is unsuccessful Excel will return the exact error messages mentioned in this discussion.
Your could easily test this using Fiddler.
I've been trying to get this to work, but I'm very frustrated at this point. I am a beginner in this field, so maybe I'm just making mistakes.
What I need to do is to take in a website .html and store it into a txt file. Now the problem is that this website is in Russian (encoding windows-1251) and Silverlight only supports 3 encodings. So in order to bypass that limitation, I got my hands on an encoding class that transfers the stream into a byte array and then tries to pull the correctly encoded string from the text. The problem with this is that
1) I try to ensure that webClient recieves a Unicode encoded stream, because the other ones do not seem to create a retrievable string, but it still doesn't seem to work.
WebClient wc = new WebClient();
wc.Encoding = System.Text.Encoding.Unicode;
wc.DownloadStringCompleted += new DownloadStringCompletedEventHandler(wc_LoadCompleted);
wc.DownloadStringAsync(new Uri(site));
2) I fear that when I store the html into a txt file using streamWriter, the encoding is, yet again, somehow screwed up.
3) The encoding class is not doing its job.
Encoding rus = Encoding.GetEncoding(1251);
Encoding eng = Encoding.Unicode;
byte[] bytes = rus.GetBytes(string);
textBlock1.Text = eng.GetString(bytes);
Can anyone offer any help on this matter? This huge detriment to my project. Thanks in advance,
Since you want to handle an encoding alien to Silverlight you should start with downloading using OpenReadAsync and OpenReadCompleted.
Now you should be able to take the Stream provided by the event args Result property and supply it directly to the encoding component you have acquired to generate the correct string result.
At work we make pretty extensive use of Visio drawing as support for documentation. Unfortunately vsd files don't play nicely with our wiki or documentation extraction tools like javadoc, doxygen or naturaldocs. While it is possible to convert Visio files to images manually, it's just a hassle to keep the image current and the image files are bound to get out of date. And let's face it: Having generated files in revision control feels so wrong.
So I'm looking for a command line tool that can convert a vsd file to jpeg, png, gif or any image that can be converted to an image that a browser can display. Preferably it will run under unix, but windows only is also fine. I can handle the rest of the automation chain, cron job, image to image conversion and ssh, scp, multiple files, etc.
And that's why I'm turning to you: I can't find such a tool. I don't think I can even pay for such a tool. Is my Google-fu completely off? Can you help me?
I mean, it has got to be possible. There has to be a way to hook into Visio with COM and get it to save as image. I'm using Visio 2007 by the way.
Thanks in advance.
I slapped something together quickly using VB6, and you can download it at:
http://fournier.jonathan.googlepages.com/Vis2Img.exe
You just pass in the input visio file path, then the output file path (visio exports based on file extension) and optionally the page number to export.
Also here is the source code I used, if you want to mess with it or turn it into a VBScript or something, it should work, though you'd need to finish converting it to late-bound code.
hope that helps,
Jon
Dim TheCmd As String
Const visOpenRO = 2
Const visOpenMinimized = 16
Const visOpenHidden = 64
Const visOpenMacrosDisabled = 128
Const visOpenNoWorkspace = 256
Sub Main()
' interpret command line arguments - separated by spaces outside of double quotes
TheCmd = Command
Dim TheCmds() As String
If SplitCommandArg(TheCmds) Then
If UBound(TheCmds) > 1 Then
Dim PageNum As Long
If UBound(TheCmds) >= 3 Then
PageNum = Val(TheCmds(3))
Else
PageNum = 1
End If
' if the input or output file doesn't contain a file path, then assume the same
If InStr(1, TheCmds(1), "\") = 0 Then
TheCmds(1) = App.Path & "\" & TheCmds(1)
End If
If InStr(1, TheCmds(2), "\") = 0 Then
TheCmds(2) = App.Path & "\" & TheCmds(2)
End If
ConvertVisToImg TheCmds(1), TheCmds(2), PageNum
Else
' no good - need an in and out file
End If
End If
End Sub
Function ConvertVisToImg(ByVal InVisPath As String, ByVal OutImgPath As String, PageNum As Long) As Boolean
ConvertVisToImg = True
On Error GoTo PROC_ERR
' create a new visio instance
Dim VisApp As Visio.Application
Set VisApp = CreateObject("Visio.Application")
' open invispath
Dim ConvDoc As Visio.Document
Set ConvDoc = VisApp.Documents.OpenEx(InVisPath, visOpenRO + visOpenMinimized + visOpenHidden + visOpenMacrosDisabled + visOpenNoWorkspace)
' export to outimgpath
If Not ConvDoc.Pages(PageNum) Is Nothing Then
ConvDoc.Pages(PageNum).Export OutImgPath
Else
MsgBox "Invalid export page"
ConvertVisToImg = False
GoTo PROC_END
End If
' close it off
PROC_END:
On Error Resume Next
VisApp.Quit
Set VisApp = Nothing
Exit Function
PROC_ERR:
MsgBox Err.Description & vbCr & "Num:" & Err.Number
GoTo PROC_END
End Function
Function SplitCommandArg(ByRef Commands() As String) As Boolean
SplitCommandArg = True
'read through command and break it into an array delimited by space characters only when we're not inside double quotes
Dim InDblQts As Boolean
Dim CmdToSplit As String
CmdToSplit = TheCmd 'for debugging command line parser
'CmdToSplit = Command
Dim CharIdx As Integer
ReDim Commands(1 To 1)
For CharIdx = 1 To Len(CmdToSplit)
Dim CurrChar As String
CurrChar = Mid(CmdToSplit, CharIdx, 1)
If CurrChar = " " And Not InDblQts Then
'add another element to the commands array if InDblQts is false
If Commands(UBound(Commands)) <> "" Then ReDim Preserve Commands(LBound(Commands) To UBound(Commands) + 1)
ElseIf CurrChar = Chr(34) Then
'set InDblQts = true
If Not InDblQts Then InDblQts = True Else InDblQts = False
Else
Commands(UBound(Commands)) = Commands(UBound(Commands)) & CurrChar
End If
Next CharIdx
End Function
F# 2.0 script:
//Description:
// Generates images for all Visio diagrams in folder were run according to pages names
//Tools:
// Visio 2010 32bit is needed to open diagrams (I also installed VisioSDK32bit.exe on my Windows 7 64bit)
#r "C:/Program Files (x86)/Microsoft Visual Studio 10.0/Visual Studio Tools for Office/PIA/Office14/Microsoft.Office.Interop.Visio.dll"
open System
open System.IO
open Microsoft.Office.Interop.Visio
let visOpenRO = 2
let visOpenMinimized = 16
let visOpenHidden = 64
let visOpenMacrosDisabled = 128
let visOpenNoWorkspace = 256
let baseDir = Environment.CurrentDirectory;
let getAllDiagramFiles = Directory.GetFiles(baseDir,"*.vsd")
let drawImage fullPathToDiagramFile =
let diagrammingApplication = new ApplicationClass()
let flags = Convert.ToInt16(visOpenRO + visOpenMinimized + visOpenHidden + visOpenMacrosDisabled + visOpenNoWorkspace)
let document = diagrammingApplication.Documents.OpenEx(fullPathToDiagramFile,flags)
for page in document.Pages do
let imagePath = Path.Combine(baseDir, page.Name + ".png")
page.Export (imagePath)
document.Close()
diagrammingApplication.Quit()
let doItAll =
Array.iter drawImage getAllDiagramFiles
doItAll
You can try "Visio to image" converter
http://soft.postpdm.com/visio2image.html
Tested with MS Visio 2007 and 2010
There has to be a way to hook into Visio with COM and get it to save as image.
Why not try writing something yourself, then, if you know how to use COM stuff? After all, if you can't find anything already made to do it, and you know you can figure out how to do it yourself, why not write something to do it yourself?
EDIT: Elaborating a bit on what I stated in my comment: writing a script of some sort does seem to be your best option in this situation, and Python, at least, would be quite useful for that, using the comtypes library found here: http://starship.python.net/crew/theller/comtypes/ Of course, as I said, if you prefer to use a different scripting language, then you could try using that; the thing is, I've only really used COM with VBA and Python at this point (As an aside, Microsoft tends to refer to "Automation" these days rather than specifically referencing COM, I believe.) The nice thing about Python is that it's an interpreted language, and thus you just need a version of the interpreter for the different OSes you're using, with versions for Windows, OSX, Linux, Unix, etc. On the other hand, I doubt you can use COM on non-Windows systems without some sort of hack, so you may very well have to parse the data in the source files directly (and even though Visio's default formats appear to use some form of XML, it's probably one of those proprietary formats Microsoft seems to love).
If you haven't used Python before, the Python documentation has a good tutorial to get people started: http://docs.python.org/3.1/tutorial/index.html
And, of course, you'll want the Python interpreter itself: http://python.org/download/releases/3.1/ (Note that you may have to manually add the Python directory to the PATH environment variable after installation.)
When you write the script, you could probably have the syntax for running the script be something like "python visioexport.py <source/original file[ with path]>[ <new file[ with path]>]" (assuming the script file is in your Python directory), with the new file defaulting to a file of the same name and in the same folder/directory as the original (albeit with a different extension; in fact, if you wish, you could set it up to export to multiple formats, with the format defaulting to that of whatever default extension you choose and being specified by an alternate extension of you specify one in the file name. As well, you could likely set it up so that if you only have the new file name after the source file, no path specified, it'll save with that new file name to the source file's directory. And, of course, if you don't specify a path for the source file, just a file name, you could set it up to get the file from the current directory).
On the topic of file formats: it seems to me that converting to SVG might be the best thing to do, as it would be more space-efficient and would better reflect the original images' status as vectored images. On the other hand, the conversion from a Visio format to SVG is not perfect (or, at least, it wasn't in Visio 2003; I can't find a source of info similar to this one for Visio 2007), and as seen here, you may have to modify the resultant XML file (though that could be done using the script, after the file is exported, via parts of the Python standard library). If you don't mind the additional file size of bitmaps, and you'd rather not have to include additional code to fix resultant SVG files, then you probably should just go with a bitmap format such as PNG.