I have some Oracle Reports (.rdf) that I am considering converting to BIRT reports. Is there a way to convert .rdf files to BIRT report design files?
A fully automated solution is probably not possible. You can partially automate the conversion process as follows:
1. Convert the RDF files to XML.
2. Extract the report query.
3. Convert the XML to BIRT (or JRXML) using XSLT.
XML Conversion
The first step is fairly simple, using Cygwin:
cd /path/to/reports/
mkdir xml
for i in *.rdf; do
rwconverter.exe batch=yes source="$i" dest=xml/"$i".xml dtype=xmlfile \
userid=scott/tiger@xe
done
Extraction
The second step is also relatively easy, using XMLStarlet (rename its xml.exe to starlet.exe to avoid a conflict with Oracle's xml.exe):
starlet.exe sel -t -v "/report/data/dataSource/select" filename.rdf.xml
You can also use xmllint, but its output includes the surrounding select element and CDATA wrapper, which you'd have to strip separately:
xmllint --xpath /report/data/dataSource/select filename.rdf.xml
Format Conversion
The third step is challenging. Create an XSL template that reads the RDF layouts (e.g., <displayInfo x="0.74377" y="0.97913" width="1.29626" height="1.62695" />). Then convert those layouts to the corresponding format used by the destination report engine (such as BIRT or JasperReports).
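If hand-writing XSLT feels heavy, the same mapping can be prototyped in plain Ruby first (a sketch only: the displayInfo element comes from the rwconverter output above, the coordinates appear to be in inches, and the emitted markup is a hypothetical placeholder, not the real BIRT schema):

# Sketch: pull the layout geometry out of the converted RDF XML.
require 'rexml/document'

doc = REXML::Document.new(File.read('filename.rdf.xml'))

doc.elements.each('//displayInfo') do |di|
  x, y, w, h = %w[x y width height].map { |a| di.attributes[a].to_f }
  # Placeholder output; a real converter would emit BIRT report items here.
  puts format('<item x="%.4fin" y="%.4fin" width="%.4fin" height="%.4fin"/>', x, y, w, h)
end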
You wouldn't get a 100% solution, but an 80% solution could significantly reduce the amount of monotonous, error-prone work required to convert the reports.
I once had a job converting *.rdf files to Jasper reports. Since I don't know BIRT, I have no idea whether this approach would work with BIRT too.
Anyway, I exported the *.rdf files as XML, parsed the output with Perl, and wrote the Jasper definitions, also as XML. For most reports this worked pretty well. There is, however, one report I remember that I couldn't translate automatically: a report where the result sets of two queries were laid out side by side.
I haven't found any tools which do this. Some might call this a business opportunity.
RDF files, as far as I can tell, are in some funky binary format. To even have a shot at this, I'd probably convert them first into REX files, which are supposed to be portable XML. Then it is a matter of transforming from the REX structure to the BIRT structure. Honestly, I have no idea how well documented the REX format is, but since you know how the reports look from the visual side, you may be able to make sense of it.
Good luck!
I have already done a lot of research and realized that clear information about how to generate or convert to PDF/A-1a is really rare. I found some information on converting to PDF/A-1a via Ghostscript, but I couldn't get it working. So maybe some necessary conditions are missing from the data in the first place: conditions like proper PDF metadata, structured (tagged) content for readability by a screen reader, alternative text for pictures, and a declaration of the language of the text. I need a working Ghostscript command, with the corresponding gs version and the mandatory file conditions, to generate or convert to PDF/A-1a. PDF/A-1b is no use to me because I'm already able to convert to that.
Thanks for any help.
I have a legacy VB application that still has some life in it, and I am wanting to translate it to another language.
I plan to write a Ruby script, possibly utilising a parser, to extract all strings from the three million lines of source, replace them with constants, and move them to a string resource file that can be used to provide translations.
Is anyone aware of a script/library that could be used to intelligently extract the strings?
I'm not aware of any existing off-the-shelf tool that you could use. We created a tool like this at my work and it worked well. The FRM file format is quite simple (although only briefly documented). We wrote a tool that (1) extracted all strings from control definitions and (2) generated the code to reload them at runtime during Form_Load.
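For what it's worth, the scanning half of such a tool is small. Here is a rough Ruby sketch, assuming VB6 .frm sources (the directory layout and the list of string-bearing properties are assumptions):

# Scan .frm files for literal strings in control definitions, e.g.
#    Caption         =   "Customer Details"
PATTERN = /^\s*(Caption|Text|ToolTipText)\s*=\s*"(.*)"\s*$/

Dir.glob('src/**/*.frm') do |path|
  # VB6 sources are typically ANSI, not UTF-8.
  File.readlines(path, encoding: 'Windows-1252').each_with_index do |line, idx|
    if (m = PATTERN.match(line))
      puts "#{path}:#{idx + 1} #{m[1]} = #{m[2].inspect}"
    end
  end
end

A real tool would then also rewrite those lines and emit the constants into a resource file, as described above.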
I have a very large PDF file (300,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby and import the resulting data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MB; for simple text in tables that's huuuuge, unless you have 200,000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do the same certainly is a matter of hours, days, or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with those yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, more inconvenient methods available too, but this one might already do the job:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only pages 1-10 (as a proof of concept, to see whether it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a custom encoding vector), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
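Once the text dump looks right, reassembling the rows in Ruby is not much work. A hedged sketch, assuming columns are separated by runs of two or more spaces and relying on the overflow behaviour described in the question (a short line is a name/address fragment whose remaining columns follow on the next line):

rows = []
pending = nil

File.foreach('first-10-pages-from-Monster-PDF-sanitized.txt') do |line|
  cells = line.rstrip.split(/\s{2,}/).reject(&:empty?)
  next if cells.empty?

  if pending                 # glue the leftover columns onto the short line
    rows << pending + cells
    pending = nil
  elsif cells.size == 5      # Name | Address | Cash | Year | Holder
    rows << cells
  else
    pending = cells          # wait for the continuation line
  end
end

rows.each { |r| p r }        # from here, insert into MySQL as you like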
"At the very least, could anyone point me to a Ruby PDF library for this task?"
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure whether any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
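For example, a PDF::Reader test drive is only a few lines (gem install pdf-reader; the file name is a placeholder, and whether it copes with a 300 MB file is exactly what you would be testing):

require 'pdf/reader'

reader = PDF::Reader.new('Monster.pdf')
# Look at the first ten pages only; page.text gives a plain-text rendering.
reader.pages.first(10).each_with_index do |page, i|
  puts "--- page #{i + 1} ---"
  puts page.text
end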
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. Whether rows and columns can be reliably recovered depends on the PDF itself.
The Java libraries are the most robust and may do more than just extract text, so I would look into JRuby with iText or PDFBox.
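If you go the JRuby route, the glue code is short. A sketch assuming PDFBox 1.x (the jar path is a placeholder):

require 'java'
require 'pdfbox-1.8.16.jar'   # plus its dependencies on the classpath

doc = org.apache.pdfbox.pdmodel.PDDocument.load(java.io.File.new('Monster.pdf'))
begin
  stripper = org.apache.pdfbox.util.PDFTextStripper.new
  stripper.start_page = 1     # limit the range for a first try
  stripper.end_page = 10
  puts stripper.get_text(doc)
ensure
  doc.close
end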
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?
Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
Thanks in advance.
Edit:
Thanks for all the great tools! I'll definitely check them out.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.
Check out Atomiq. It finds duplicated code that is ripe for extraction to a single location.
http://www.getatomiq.com/
If you're using Eclipse, you can use the copy/paste detector (CPD): https://olex.openlogic.com/packages/cpd.
You don't say what language you are using, which is going to affect what tools you can use.
For Python there is CloneDigger. It also supports Java, but I have not tried that. It can find code duplication both within a single file and across files, and gives you the result as a diff-like report in HTML.
See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different sequences of statements.
CloneDR handles many languages, including Java (1.4, 1.5, 1.6) and C# up to C# 4.0. You can see sample clone-detection reports at the website, including one for C#.
ReSharper does this automagically: it suggests when it thinks code should be extracted into a method, and will do the extraction for you.
Check out PMD; once you have configured it (which is fairly simple), you can run its copy/paste detector to find duplicate code.
Someone with basic Office skills can do the following in about a minute:
1. Run an ordinary code formatter to unify the style, preferably without line wrapping.
2. Feed the code text into Microsoft Excel as a single column.
3. Search and replace all double spaces with single ones, and make similar normalizing replacements.
4. Sort the column.
At this point, duplicated lines will already stand out well. To go further:
5. Add a comparator formula in a second column and a counter in a third.
6. Copy and paste as values again, sort, and see the most repeated lines.
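For what it's worth, the same normalize, sort, and count pipeline is a few lines of Ruby (the input file name is a placeholder):

# Normalize whitespace, then report the most repeated non-empty lines.
counts = Hash.new(0)
File.foreach('MyForm.cs') do |line|
  normalized = line.strip.gsub(/\s+/, ' ')
  counts[normalized] += 1 unless normalized.empty?
end
counts.sort_by { |_, n| -n }.first(20).each { |l, n| puts "#{n}x #{l}" }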
There is an analysis tool called Simian which I haven't tried yet. Supposedly it can be run on any kind of text and point out duplicated items. It can be used via a command-line interface.
Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd
I am so close to having this work. I am trying to embed one Jasper subreport directly into the XML of the main report. You'd think this would be easy, but I can't find a single example of doing it; everyone seems to use files or resources or whatever. I have one report working straight from a string, and I want it to contain its subreport.
Anyone? Syntax? Thanks!
The only way I know of to do this with JasperReports is to use a separate .jrxml file for the subreport and include it in the main report using the subreport element.
Another option you have for embedded data is to use subdatasets, but as far as I know they're only useful for charts.
As it sounds like you control the code surrounding the generation of the report, you could come up with a simple format to define multiple reports in the one string, and then have your code extract each report at runtime.
When we've needed to deal with a single file but have subreports for a JasperReport, we've used Zip files: we simply zipped up the main report and all its necessary subreports, then unzipped them into a temporary directory when needed (all in code, of course).
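The idea is the same in any language. A hedged Ruby sketch of the unzip-to-a-temp-dir step using the rubyzip gem (file names are placeholders; compiling and filling the reports is left to JasperReports itself):

require 'zip'
require 'tmpdir'

Dir.mktmpdir('reports') do |dir|
  Zip::File.open('invoice-bundle.zip') do |zip|
    # Assumes a flat archive: main.jrxml plus its subreport .jrxml files.
    zip.each { |entry| entry.extract(File.join(dir, entry.name)) }
  end
  # ...compile/fill the main report from dir, resolving subreports
  # relative to dir; the tmpdir is cleaned up automatically afterwards.
end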