Can I output an XML document with elements and attributes from Power Query/M?

I am using Power Query/M to transform some records. I can easily output this to JSON, but I am looking to output in XML format. I have found Xml.Document but I have not found any instructions on how to handle elements versus attributes.
in Text.FromBinary(Xml.Document(record));

Related

How do file converters work in general (Word to PDF, XML to JSON, Word to TXT, etc.)?

I've used many types of file converters, such as Word to PDF, XML to JSON, Word to TXT, etc.
How do they work on the backend? Are there specific guidelines that each of them follows? Are there similarities in the way they are implemented?
I tried searching for this, but most of the articles take me to a web app that can convert the document, and none of them explains how it's done.
All of them work by parsing the first document into a data structure, then generating a document in the other format from that data structure using recursion.
Parsing itself is a giant topic that people take courses on in computer science. But long story short, it proceeds by breaking the document into tokens, and then fitting the tokens into a parse tree using one of a standard set of methods. They have all sorts of fancy names like Recursive Descent and LALR(1). That's where most of the theory you'd want to learn is.
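To make the token and parse tree steps concrete, here is a minimal recursive descent sketch. The toy grammar (nested lists of integers) and all function names are made up for illustration; they are not taken from any particular converter.

```python
import re

# Hypothetical toy grammar, chosen only for illustration:
#   value := INT | list
#   list  := "[" [ value { "," value } ] "]"
TOKEN_RE = re.compile(r"\s*(\d+|[\[\],])")

def tokenize(text):
    """Break the input into a flat list of tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_value(tokens, i):
    """Parse one value starting at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[i]
    if token.isdigit():
        return int(token), i + 1
    if token == "[":
        return parse_list(tokens, i + 1)
    raise ValueError(f"unexpected token {token!r}")

def parse_list(tokens, i):
    """Parse the remainder of a list, just after its opening bracket."""
    items = []
    if i < len(tokens) and tokens[i] == "]":
        return items, i + 1
    while True:
        item, i = parse_value(tokens, i)  # recursion handles nesting
        items.append(item)
        if i >= len(tokens):
            raise ValueError("missing closing bracket")
        if tokens[i] == "]":
            return items, i + 1
        if tokens[i] != ",":
            raise ValueError(f"expected ',' or ']', got {tokens[i]!r}")
        i += 1

def parse(text):
    """Tokenize, then fit the tokens into a tree of nested Python lists."""
    tokens = tokenize(text)
    tree, i = parse_value(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing tokens after value")
    return tree
```

Each grammar rule becomes one function, and a rule that refers to another rule becomes a function call; that is the whole idea behind recursive descent.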
For example if you're writing a JSON to XML converter, you'd first need to parse that JSON. A JSON Parser shows how you could write that, from scratch, using recursive descent. Once written you just need to write a recursive function that takes each data type and does something appropriate with it to generate text in the format that you want.
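As a rough sketch of that second half, the following uses Python's standard json module for the parsing step and a single recursive function for generation. The element names "root" and "item" are arbitrary choices, and the sketch assumes every JSON key is already a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET

def build(parent, value):
    """Recursively attach one parsed JSON value to an XML element."""
    if isinstance(value, dict):
        for key, child in value.items():
            build(ET.SubElement(parent, key), child)
    elif isinstance(value, list):
        # XML has no array type, so each entry gets an "item" element
        # (an arbitrary naming choice for this sketch).
        for child in value:
            build(ET.SubElement(parent, "item"), child)
    elif value is None:
        pass  # leave the element empty
    elif isinstance(value, bool):
        parent.text = "true" if value else "false"
    else:
        parent.text = str(value)

def json_to_xml(json_text, root_name="root"):
    """Parse JSON into Python data, then generate XML text from it."""
    root = ET.Element(root_name)
    build(root, json.loads(json_text))
    return ET.tostring(root, encoding="unicode")
```

A real converter also has to decide how to handle keys that are not legal XML names and whether some values should become attributes rather than elements; this sketch ignores both.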
Incidentally you can also write a "document converter" that converts from a document format to the same document format. Why would someone want to do that? The two most common use cases are to prettify or minify code. Despite the fact that only one format is being dealt with, the principles of how you do it are exactly the same.
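A same-format converter can be very short when a parser and a generator already exist. This sketch uses JSON and Python's standard json module; the function names are made up for illustration.

```python
import json

def prettify(json_text):
    """Parse, then regenerate with indentation."""
    return json.dumps(json.loads(json_text), indent=2)

def minify(json_text):
    """Parse, then regenerate with no optional whitespace."""
    return json.dumps(json.loads(json_text), separators=(",", ":"))
```

Both functions go through the same parsed data structure; only the generation step differs.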

Extract data from the same tag in different XML responses

I am trying to extract data from an XML response using a regex. The problem is that the XML responses differ but use the same tag. How do I extract the value from both of them?
This is first xml
This is second xml
As you can see, both contain a tag named "AcctId", but it holds different data in each.
In the first case the value you're looking for is under CAAcctId tag
In the second case the value you're looking for is under AcctId tag
Just amend your regex to check the previous line and it should start working as you expect.
Also, given that you're getting XML, it might make more sense to go for the XPath Extractor, which allows executing arbitrary XPath queries to fetch data from XML/XHTML responses. It will be more readable, robust and reliable than trying to parse XML with regular expressions, which are sensitive to markup changes.
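To illustrate why a path query is more robust than a regex here, the sketch below uses Python's xml.etree.ElementTree, which supports a limited subset of XPath. The two responses are hypothetical stand-ins, since the actual XML from the question is not shown; the only assumption carried over is that "AcctId" appears under "CAAcctId" in one response and elsewhere in the other.

```python
import xml.etree.ElementTree as ET

# Hypothetical responses; the question's real XML was not included.
FIRST = "<Rs><CAAcctId><AcctId>111</AcctId></CAAcctId></Rs>"
SECOND = "<Rs><Acct><AcctId>222</AcctId></Acct></Rs>"

def acct_ids(xml_text, path):
    """Return the text of every element matched by the path expression."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]

# Selects AcctId only when its parent is CAAcctId.
under_ca = acct_ids(FIRST, ".//CAAcctId/AcctId")

# Selects AcctId wherever it appears in the document.
anywhere = acct_ids(SECOND, ".//AcctId")
```

The path names the parent element explicitly, so it keeps working if whitespace, attribute order, or line breaks in the response change.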

Partial Indexing of an XML file (Bleve)

I am evaluating a couple different libraries to see which one will best fit what I need.
Right now I am looking at Bleve, but I am happy to use any library.
I am looking to index full files except specific ones which are in XML format. For those I only want Bleve to index specific tags as most of the tags are worthless to search. I am trying to evaluate if this is possible but, being new to Bleve, I am not sure what part I need to customize.
The documentation is very good, but I can't seem to find this answer. All I need is an explanation with keywords and steps; no code is required. I just need a push, as I have spent hours spinning my wheels with Google searches and I am getting nowhere.
There are probably many ways to approach this. Here's one.
Bleve indexes documents which are collections of key/value metadata pairs.
In your case, a document could be represented by 2 key/value pairs: name of .xml file (to uniquely identify the document) and content of the file.
type Doc struct {
    Name string // name of the .xml file, uniquely identifies the document
    Body string // content of the file
}
The issue is that the body is XML, and Bleve doesn't support XML out of the box.
A way to address this is to pre-process the XML file by stripping the unwanted tags and their content. You can do that with the encoding/xml package from the Go standard library.
For an example of a similar task you can see the code of https://github.com/blevesearch/fosdem-search/
In there they index a file in a custom format (https://github.com/blevesearch/fosdem-search/blob/master/fosdem.ical) by parsing it into a form they can submit to Bleve for indexing (https://github.com/blevesearch/fosdem-search/blob/master/ical.go).
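The pre-processing step might look roughly like the following. It is sketched in Python with xml.etree.ElementTree purely for illustration; the same idea applies with Go's encoding/xml. The function names and the set of wanted tags are assumptions, and the resulting dictionary only mirrors the shape of the Doc struct above.

```python
import xml.etree.ElementTree as ET

def extract_searchable_text(xml_text, wanted_tags):
    """Return the text of the wanted elements only, one per line."""
    root = ET.fromstring(xml_text)
    parts = []
    for element in root.iter():  # walks the tree in document order
        if element.tag in wanted_tags:
            # itertext() also collects text from nested child elements.
            text = "".join(element.itertext()).strip()
            if text:
                parts.append(text)
    return "\n".join(parts)

def to_doc(name, xml_text, wanted_tags):
    """Build a name/body pair with the unwanted tags already stripped."""
    return {"Name": name, "Body": extract_searchable_text(xml_text, wanted_tags)}
```

Note that if one wanted tag is nested inside another, its text is collected twice; whether that matters depends on how the body is indexed.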

Need to implement bulk PDF extraction using Tesseract API

I have a large number of PDF documents from which I need to extract text. I use the extracted text for further processing. I did this for a small subset of documents using the Tesseract API in a linear approach and got the required output. However, this takes a very long time when I have a large number of documents.
I tried to use Hadoop's processing (MapReduce) and storage (HDFS) capabilities to solve this issue. However, I am having trouble integrating the Tesseract API into the MapReduce approach. As Tesseract converts the files into intermediate image files, I am confused as to how those intermediate image files can be handled inside HDFS.
I have searched and unsuccessfully tried a few options earlier, such as:
I extracted text from the PDFs by extending the FileInputFormat class into my own PdfInputFormat class in Hadoop MapReduce, using Apache PDFBox to extract the text. However, for scanned PDFs that contain images, this solution does not give me the required results.
I found a few answers on the same topic suggesting that -Fuse will help, or that one should generate the image files locally and then upload them to HDFS for further processing. I am not sure if this is the correct approach.
Would like to know approaches around this.
This is an approach I found for processing multiple PDFs to extract text using the Hadoop framework, and then using this text for further processing:
Put all the PDFs to be converted to text in one folder.
Create one text file per PDF containing the path to that PDF. E.g. if I have 10 PDFs to convert, then 10 text files are generated, each containing the unique path to its respective PDF.
These text files are given as input to the MapReduce program.
Because each input file is very small, the framework generates only one input split per input file. E.g. if I have 10 PDFs as input, the framework generates 10 input splits.
From each input split, one line (record) is read by the RecordReader and passed to a mapper as a value. So if there are 10 records (line == file path) across the input text files, the mapper runs 10 times. As I have one record per input split, one mapper is assigned to handle that input split.
As I have 10 input splits, 10 mappers run in parallel.
Inside the mapper, Ghostscript generates images from the PDF whose file name is passed in the mapper's value. The images are converted to text using Tesseract inside the mapper itself, giving the text of each PDF. This is the mapper's output.
This output is passed to the reducer to do other analytics work as required.
This is the current solution. Would like feedback on this.
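The "one text file per PDF" step described above could be sketched as follows. This is only an illustration on a local file system with made-up function and directory names; in the real setup the paths would have to be ones the mappers can reach, and the files would then be uploaded as the job input.

```python
from pathlib import Path

def write_manifests(pdf_dir, manifest_dir):
    """Write one small text file per PDF, each holding that PDF's path."""
    pdf_dir, manifest_dir = Path(pdf_dir), Path(manifest_dir)
    manifest_dir.mkdir(parents=True, exist_ok=True)
    manifests = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        manifest = manifest_dir / (pdf.stem + ".txt")
        # One line per file, so each input split carries exactly one record.
        manifest.write_text(str(pdf.resolve()) + "\n")
        manifests.append(manifest)
    return manifests
```

Keeping each manifest to a single line is what lets the framework hand exactly one PDF to each mapper.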

Parse and write a Content MathML rational number containing sep with Boost ptree

I am trying to write and read/parse Content MathML XML files with Boost ptree (property_tree). I cannot work out how to write or parse this code:
<?xml version="1.0" encoding="utf-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<cn type="rational">1<sep/>2</cn>
</math>
The problem is the "<sep/>". When I use get_value() or get() with string or int, I get "12". I cannot separate the 1 and the 2. How can I get or write the two separate values "1" and "2"?
Boost Property Tree is not an XML parser.
Instead, it's a settings persistence utility that helps to (de)serialize a certain set of hierarchical data types to a number of (partly interchangeable) formats.
Note that the feature set for each format is not the same.
Specifically for your goal you need a parser that handles mixed-content elements (elements containing both text and sub-elements, mixed). There's a surprising number of XML parsers that don't handle this. Boost Property Tree is (uses?) such a parser.
So, you should look at another library to get you this.
What XML parser should I use in C++?
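To show what mixed-content support looks like in a parser that has it, here is a sketch using Python's xml.etree.ElementTree rather than a C++ library; it is meant only to illustrate the concept. In that API the text before a child element is stored on the parent (text) and the text after it is stored on the child (tail). The function names are made up for this example.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

DOC = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<cn type="rational">1<sep/>2</cn>'
    "</math>"
)

def read_rational(xml_text):
    """Return (numerator, denominator) from the first cn element."""
    root = ET.fromstring(xml_text)
    cn = root.find(f"{{{MATHML_NS}}}cn")
    sep = cn.find(f"{{{MATHML_NS}}}sep")
    # "1" precedes <sep/>, so it is cn.text; "2" follows it, so it is sep.tail.
    return int(cn.text), int(sep.tail)

def write_rational(numerator, denominator):
    """Generate a math document holding one rational cn element."""
    ET.register_namespace("", MATHML_NS)  # emit xmlns=... without a prefix
    math = ET.Element(f"{{{MATHML_NS}}}math")
    cn = ET.SubElement(math, f"{{{MATHML_NS}}}cn", {"type": "rational"})
    cn.text = str(numerator)
    sep = ET.SubElement(cn, f"{{{MATHML_NS}}}sep")
    sep.tail = str(denominator)
    return ET.tostring(math, encoding="unicode")
```

A C++ library with a comparable node model (separate text nodes on either side of the child element) would let you do the same thing.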

Resources