Reading XML files with StAX / Kettle (Pentaho) - etl

I'm doing an ETL process with Pentaho (Spoon / Kettle) where I'd like to read an XML file and store element values in a DB.
This works just fine with the "Get data from XML" component...but the XML file is quite big, several gigabytes, and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM
parsers that need in memory processing and even the purging of parts
of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach
to solve use cases with very big and complex data structures and the
need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it just doesn't seem to work out as planned. I'm testing this with an XML file which only has one element group. The file is read and then mapped/inserted into a table...but now I get multiple rows in the table where all the values are "undefined", and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
Flow goes like:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
// keep the id value when the CHARACTERS event for /labels/label/id arrives
var id;
if (xml_data_type_description.equals("CHARACTERS") &&
    xml_path.equals("/labels/label/id")) {
  id = xml_data_value;
}
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How can I use StAX to read an XML file and store the element values in a DB?
I've been trying to look for examples but haven't found much. The above example uses a "Filter Rows" component before inserting the rows. I don't quite understand why it's being used; can't I just map the values I need? It might be that this problem occurs because I don't use, or don't know how to use, the Filter Rows component.
Cheers!

I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.
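For intuition about why the StAX step emits so many rows (and hence the 92 rows with mostly "undefined" values), here is a minimal plain-Java StAX sketch that reads the same kind of file event by event. It is only an illustration of the event stream, not the Kettle step itself, and it assumes an input file named labels.xml with /labels/label/id elements, matching the path in your script:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxLabelReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("labels.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            String currentElement = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    currentElement = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                    // Every CHARACTERS event arrives on its own -- this is why the Kettle
                    // step produces one row per event rather than one row per <label>.
                    if ("id".equals(currentElement)) {
                        System.out.println("id = " + reader.getText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "label".equals(reader.getLocalName())) {
                    // End of one <label>: in the transformation this is where the collected
                    // fields for the group would be flattened into a single output row.
                    System.out.println("-- end of label --");
                }
            }
            reader.close();
        }
    }
}

In the Kettle version, the Modified Java Script Value (or the Filter Rows / Row Denormaliser combination described above) plays the role of the if-blocks: it picks out only the CHARACTERS events you care about and collapses each group of them into one row before the DB insert.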

Related

Generate new format from a non-system generated report using Power Query

I have an Excel file which is in a non-system-generated report format.
I wish to do some calculations on it and generate a new output.
Given the report format below:
1) When I load this Excel file in the query, how can I create a new column that copies the first value found (1#51) down into the following records while they are empty, and then, once a new value (1#261) is detected, copies that value down into the subsequent null records, and so on until the end?
2) The final aim is to generate a new output that automatically matches/calculates the money to be assigned to the different references, as shown below:
References A ~ E share the 3 bank refs (28269, 28542 & RMP). I was thinking of reading the same data source a few times: the first time to read columns A ~ O (QueryRef) and a second time to read columns A and Q ~ V (QueryBank).
After this I have no idea how to allocate the money from QueryBank to QueryRef based on the sum of Total AR.
E.g.,
The total amount of BankRef 28269, $57,044.67, is sufficient to cover Ref#A $10,947.12.
BankRef 28269 is still sufficient to cover Ref#B $27,647.60.
BankRef 28269 has only $18,449.95 left, so the balance of 28269 is allocated to Ref#C.
The remaining balance of Ref#C needs BankRef 28542 to cover it, i.e. $1,812.29.
Ref#D is then allocated the remaining balance of BankRef 28542, i.e. $4,595.32.
Ref#D still has $13,350.03 unallocated, so this uses BankRef RMP.
Ref#E only needs $597.66, and BankRef RMP is sufficient to cover it.
I am not sure whether my case study above can be solved using Power Query, as I am still a newbie at Power Query, or whether this is too complicated to handle, so that we would need to write a program to automate matching this kind of scenario.
Attached is the sample source file and output:
https://www.dropbox.com/sh/dyecwcdz2qg549y/AACzezsXBwAf8eHUNxxLD1eWa?dl=0
Any advice/opinion/guidance is very much appreciated.
Answering question one:
Power Query has a feature called Fill, with Down and Up variants.
For a selected column it copies each non-empty value into all the empty rows below (or above) it, until the next non-empty row is found, and so on.
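For question two, the waterfall allocation itself may be easier to express in code than in Power Query if the matching gets unwieldy. Below is a rough Java sketch of the greedy matching; only BankRef 28269's $57,044.67 and the Ref#A/Ref#B amounts are taken from the example above, the other figures are placeholders, and all names are illustrative:

import java.math.BigDecimal;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class WaterfallAllocation {
    public static void main(String[] args) {
        // Amount owed per reference (Total AR), in the order it should be covered.
        Map<String, BigDecimal> refs = new LinkedHashMap<>();
        refs.put("A", new BigDecimal("10947.12"));
        refs.put("B", new BigDecimal("27647.60"));
        // ... C, D and E would come from the source file

        // Money available per bank ref, in the order it should be consumed.
        Map<String, BigDecimal> banks = new LinkedHashMap<>();
        banks.put("28269", new BigDecimal("57044.67"));
        banks.put("28542", new BigDecimal("6407.61"));   // placeholder amount
        banks.put("RMP", new BigDecimal("14000.00"));    // placeholder amount

        // Greedy "waterfall": fill each reference from the current bank ref and
        // move on to the next bank ref once the current one is exhausted.
        Iterator<Map.Entry<String, BigDecimal>> bankIt = banks.entrySet().iterator();
        Map.Entry<String, BigDecimal> bank = bankIt.next();
        for (Map.Entry<String, BigDecimal> ref : refs.entrySet()) {
            BigDecimal needed = ref.getValue();
            while (needed.signum() > 0) {
                if (bank.getValue().signum() == 0) {
                    if (!bankIt.hasNext()) {
                        System.out.printf("Ref#%s is short by %s%n", ref.getKey(), needed);
                        break;
                    }
                    bank = bankIt.next();
                }
                BigDecimal take = needed.min(bank.getValue());
                System.out.printf("Ref#%s gets %s from BankRef %s%n", ref.getKey(), take, bank.getKey());
                needed = needed.subtract(take);
                bank.setValue(bank.getValue().subtract(take));
            }
        }
    }
}

The same greedy walk could in principle be written in Power Query with List.Accumulate, but a small program like this is usually easier to follow and to test.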

MarkLogic - get a list of all unique document structures in a MarkLogic database

I want to get a list of all distinct document structures with a count in a Marklogic database.
e.g. a database with these 3 documents:
1) <document><name>Robert</name></document>
2) <document><name>Mark</name></document>
3) <document><fname>Robert</fname><lname>Smith</lname></document>
Would return that there are two unique document structures in the database, one used by 2 documents, and the other used by 1 document.
I am using this XQuery and am getting back the list of unique sequences of elements correctly:
for $i in distinct-values(
  for $document in doc()
  return <div>{ distinct-values(
    for $element in $document//*/*/name()
    return <div>{ $element }</div>) }</div>)
return $i
I appreciate that this code will not handle duplicate element names but that is OK for now.
My questions are:
1) Is there a better/more efficient way to do this? I am assuming yes.
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
If you have a finite list of elements for which you need to do this, consider co-occurrence or other similar solutions: https://docs.marklogic.com/cts:value-co-occurrences
This requires a range index on each element in question.
MarkLogic works best when you use indexes whenever possible. The other solution I can think of is that you actually create a hash/checksum for the values of the target content for each document in question and store this with the document (or in a triple if you happen to have a licence for semantics). Then you would already have a key for the unique combinations.
1) Is there a better/more efficient way to do this? I am assuming yes.
If it were up to me, I would build the document structure in a consistent fashion (like you're doing), then hash it, and attach the hash to each document as a collection. Then I could count the docs in each collection. I can't see any efficient way (using indexes) to get the counts without first writing to the document content or metadata (a collection is a type of metadata) and then querying against the indexes.
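If it helps, here is what the "hash the structure" idea can look like on the client side. This is just a plain-Java sketch (not tied to any MarkLogic API, and the names are illustrative): it collects the distinct element paths of a document, much like functx:distinct-element-paths further down, sorts them and hashes them, giving you the value you would attach to the document as a collection (or store in a triple):

import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeSet;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StructureHash {
    // Hex hash over the sorted set of element paths in a document.
    public static String structureHash(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Deque<String> path = new ArrayDeque<>();
        TreeSet<String> paths = new TreeSet<>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                path.addLast(reader.getLocalName());
                paths.add(String.join("/", path));     // e.g. "document/fname"
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                path.removeLast();
            }
        }
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(String.join("\n", paths).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // The first two documents from the question hash the same; the third differs.
        System.out.println(structureHash("<document><name>Robert</name></document>"));
        System.out.println(structureHash("<document><name>Mark</name></document>"));
        System.out.println(structureHash("<document><fname>Robert</fname><lname>Smith</lname></document>"));
    }
}

Documents with identical element structure end up in the same collection, and cts:frequency over the collection lexicon (see question 3 below) then gives the per-structure counts.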
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
After you get the counts for each collection, you could retrieve one doc from each collection and walk through it to build an empty XML structure. XSLT would probably be a good way to do this if you already know XSLT.
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
Turn on the collection lexicon on your database. Then do something like the following:
for $collection in cts:collections()
return ($collection, cts:frequency($collection))
Not sure I follow exactly what you are after, but I am wondering if this is more what you are looking for: functx:distinct-element-paths($doc)
http://www.xqueryfunctions.com/xq/functx_distinct-element-paths.html
Here's a quick example:
xquery version "1.0-ml";
import module namespace functx = "http://www.functx.com" at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";
let $doc := <document><fname>Robert</fname><lname>Smith</lname></document>
return
functx:distinct-element-paths($doc)
Outputs the following strings (which could be parsed, of course):
document
document/fname
document/lname
There are existing 3rd-party tools that may work, depending on the size of the data and the coverage required (is 100% sampling needed?).
Search for "Generate Schema from XML" --
Such tools will look at a sample set and infer a schema (xsd, dtd, rng etc).
They do an accurate job, but not always in the same way a human would.
If they do not have native ML integration then you need to expose a service or export the data for analysis.
Once you HAVE a schema, load it into MarkLogic, and you can query the schema (and the elements validated by it) directly and programmatically in ML.
If you find a 'generate schema' tool that is implemented in XSLT, XQuery, or JavaScript you may be able to import and execute it in-server.

Get data's source in Kettle

When I use Kettle, I was wondering how to get a table column's source column. For example, after I have merged two tables into one table based on the primary key, given any column in the output table, I want to determine which table it belongs to and get the original column name in the original table. Thank you for helping and sorry for my poor English...
http://i.stack.imgur.com/xoR0s.png
Given any field in table3 (suppose a field named A in table3), I want to know where it comes from without the graphical view (from Java code or other ways), i.e. the original table name (here input1 or input2) and the original column name (maybe B in input1, which is represented as A in table3). Besides, I use MySQL.
There are a couple of ways to do this:
1) Manually. If you right-click on the output step and choose Show Output fields (or whatever it's called), you will see the "origin step" for each of the outgoing fields. You can do the same for input fields. Then you can trace them back to those origin steps, and repeat the process of viewing the input fields at those steps, and seeing those fields' origins, and so on. This is probably not what you're looking for.
2) With code. Prior to 6.0, you'd need to programmatically perform the same operations as are listed in option 1 above. In 6.0 there is the Data Lineage capability, which offers the LineageClient API that can find the origin fields for the specified output fields. For more information see my blog post describing the Data Lineage capability. Also I put a Gremlin Console in the PDI Marketplace, to make the use of LineageClient easier (and you can visually see the lineage graph too).

How to get sorted rows out of Cassandra when using RandomPartitioner and Hector as client?

To improve my skills with Hector and Cassandra I'm trying different methods to query data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages ordered by time, with the most recently posted message first.
In plain SQL it is possible to use 'order by'. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating a secondary index on a column with a timestamp as its value, but I can't figure out how to obtain the data. I'm sure that I would have to use at least two queries.
My column family looks like this:
create column family messages
  with comparator = UTF8Type
  and key_validation_class = LongType
  and compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64}
  and column_metadata = [
    {column_name: message, validation_class: UTF8Type},
    {column_name: index, validation_class: DateType, index_type: KEYS}
  ];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible I'd like to know how it's done with the CQL syntax and without.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
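To make the row-per-time-period idea concrete, here is a minimal Hector sketch. It assumes a redesigned column family with UTF8Type row keys (one row per day, e.g. "2012-05-14"), a LongType comparator so the message timestamp is the column name, and UTF8 values; the class and method names here are just illustrative:

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.SliceQuery;

public class MessageDao {
    private static final String CF = "messages";
    private final Keyspace keyspace;

    public MessageDao(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // Write: one row per day bucket, one column per message, column name = timestamp.
    public void addMessage(String dayBucket, long timestamp, String text) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert(dayBucket, CF,
                HFactory.createColumn(timestamp, text, LongSerializer.get(), StringSerializer.get()));
    }

    // Read: a reversed slice returns the newest columns (messages) first.
    public ColumnSlice<Long, String> latestMessages(String dayBucket, int count) {
        SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
        query.setColumnFamily(CF);
        query.setKey(dayBucket);
        query.setRange(null, null, true, count);   // reversed = newest first
        return query.execute().get();
    }
}

The client then asks for today's bucket first and, if it needs more than count messages, walks backwards one day bucket at a time.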
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
            select new { FirstCode = record.Substring(0, 2),
                         OtherCode = record.Substring(18, 4) };
For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.
I don't think there's a better way out of the box.
One could define a flat-file LINQ provider which could make the whole thing much simpler, but as far as I know, no one has done so yet.
