XSLT Performance Issue - performance

The XSLT transformation is run from .NET code using the API provided by Saxon. I am using the Saxon 9 Home Edition API. The XSLT version is 2.0 and it generates XML output. The input file size is 123 KB.
The XSLT adds attributes to the input XML file depending on certain scenarios. A total of 7 modes are used in this XSLT. The value of an attribute generated in one mode is used in another mode, which is why multiple modes are needed.
The output is generated correctly, but the XSLT takes around 10 seconds to execute. When the same XSLT was executed in 'Altova XMLSpy 2013', it took around 3-4 seconds.
Is there a way to reduce this 10-second execution time? What could be causing it to take this long?
The XSLT is available for download at the link below.
XSLT Link

Without having a source document to run this against (and therefore to make measurements) it's very hard to be definitive about where the inefficiencies are, but the most obvious at first glance is the weird testing of element names in patterns like:
match="*[name()='J' or name()='H' or name()='F' or name()='D' or name()='B' or name()='I' or name()='G' or name()='E' or name()='C' or name()='A' or name()='X' or name()='Y' or name()='O' or name()='K' or name()='L' or name()='M' or name()='N']"
which in Saxon would be vastly more efficient if written the natural way as
match="J|H|F|D|B|I|G|E|C|A|X|Y|O|K|L|M|N"
It's also more likely to be correct that way, since comparing name() against a string is sensitive to the chosen prefix, and XSLT code really ought to work whatever namespace prefix the source document author has chosen.
The reason the latter is much more efficient is that Saxon organizes the source tree for rapid matching of elements by name (meaning namespace URI plus local name, excluding prefix). When you match by name in this way, the matching template rule can be found by a quick hash table lookup. If you use predicates that have to be evaluated by expanding the name (with prefix) as a string and comparing the string, not only does each comparison take longer, it can't be optimized with a hash lookup.
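As a rough illustration of the difference (an analogy in Python, not Saxon's actual implementation), the sketch below contrasts a chain of string comparisons on the prefixed name with a single dictionary lookup on the expanded name. The namespace URI, the Element class and the function names are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

NS = "urn:example"  # hypothetical namespace URI, not taken from the question


class Element:
    """Toy element: the prefix is a lexical detail, the URI is the identity."""

    def __init__(self, prefix: str, uri: str, local: str) -> None:
        self.prefix, self.uri, self.local = prefix, uri, local

    def lexical_name(self) -> str:
        # What name() would return: prefix plus local name.
        return f"{self.prefix}:{self.local}" if self.prefix else self.local

    def expanded_name(self) -> Tuple[str, str]:
        # Namespace URI plus local name; the prefix is excluded.
        return (self.uri, self.local)


def match_by_string(element: Element, names: Iterable[str]) -> bool:
    """Like *[name()='...' or name()='...']: one string comparison per name."""
    return any(element.lexical_name() == name for name in names)


def build_index(local_names: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Like match="J|H|...": one hash table entry per expanded name."""
    return {(NS, local): f"rule for {local}" for local in local_names}


def match_by_index(element: Element, index: Dict[Tuple[str, str], str]) -> Optional[str]:
    """A single hash lookup, whatever the number of names in the pattern."""
    return index.get(element.expanded_name())
```

With two elements that differ only in prefix, say Element("a", NS, "J") and Element("b", NS, "J"), the index lookup finds the same rule for both, whereas a comparison against the string 'a:J' only matches the first.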


What is the essential difference between Document and Collection in YAML syntax?

Warning: This question is more philosophical than practical, but I think it is worth asking and answering in a practical context (a forum like StackOverflow, rather than the SoftwareEngineering Stack Exchange site), because of the way YAML is actually used in practice and the way its specification has evolved and gained features over time. Let's ask:
Unlike formats/languages/protocols such as JSON, the YAML format allows you (according to this link, which seems fairly official, or at least an accurate and reliable source for understanding the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marker ("---").
If so, it's hard to ignore the fact that the concept of a 'Document' in YAML is no longer an external definition, or a "meta"-directive that helps the human/parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file in itself does not necessarily recognize that it's a file, or that it's part of a file system that wraps it, AFAIK).
However, when YAML allows multi-Document files that gather collections of Documents in a single YAML file (perhaps in a way analogous to the pipelining approach of the HTTP protocol), the concept of a Document receives a new, wider definition de facto, as part of the YAML grammar and its productions, and not just as an assistive concept that helps to describe the specification.
If so, with Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar and well-used data structure of a Collection (an array of items)?
I'm asking because I've seen at this link (here) a snippet (the second example) that describes a YAML sequence which is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated with three dashes), gathered together in the same YAML sequence/file, instead of writing a file that has a "Collection" of logs represented as an array. Why did he choose to do this? Is his choice fitting, correct, ideal?
I can speculate that the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as Anchors, Tags and References. I guess every Document provides a guarantee that all these identifiers form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two fairly similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or to see the two as different layers of collection schemes. Is this a correct, accurate way to view it?
And even if I am right, why in the above example (the log documents from the link), where there is no use of, and no implied or expected need for, duplicates, collisions, identifiers/anchors or compound structures at all, does the author still choose to represent the collection's items as separate documents? Is this just a poorly chosen example? Or am I missing something, and this is a redundancy in the specification, or syntactic sugar that evolved due to practical needs?
Because the example appears on a website that looks serious, with official information written by professionals who deal with the essence of the language, its definition, and the theory and philosophy behind it (as opposed to practical uses in the wild), and also in light of the other meticulous examples I have seen there, I prefer not to assume that the example is simply imperfect or ill-fitting; there may be a good reason for writing it this way rather than another in this specific case.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars, which get ! instead. These non-specific tags are later resolved to specific tags, thereby defining which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
    String name;
    int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that it is the path to the node, not its content, that defines how it will be deserialized.
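To make the contrast concrete, here is a minimal sketch of the two resolution strategies in plain Python. It is not how PyYAML or any Java library is implemented; the SCHEMA table is a hypothetical stand-in for the Config class above, and both function names are made up for illustration.

```python
import re
from typing import Any, Dict, Tuple


def resolve_by_content(text: str) -> Any:
    """Scripting-language style: guess the type from the scalar's text alone."""
    if text in ("true", "false"):
        return text == "true"
    if re.fullmatch(r"[-+]?[0-9]+", text):
        return int(text)
    return text


# Hypothetical schema standing in for the Config class: path -> expected type.
SCHEMA: Dict[Tuple[str, ...], type] = {("name",): str, ("count",): int}


def resolve_by_path(path: Tuple[str, ...], text: str) -> Any:
    """Static-type style: the path to the node decides the target type."""
    target = SCHEMA[path]
    if target is int:
        return int(text)  # raises ValueError for text such as "five"
    return text
```

Under these assumptions, resolve_by_content("42") yields an integer, while resolve_by_path(("name",), "42") keeps the string "42", and resolve_by_path(("count",), "five") raises an error instead of falling back to a string.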
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primarily focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as a configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about edge cases, such as a string-tagged Document that starts with a folded string, which would make even the following "---" and "..." just characters of that string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
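A stream merger of the kind described above could be sketched as follows. This is only an illustration in Python that treats the streams as opaque text and never parses them; the function name merge_streams is made up, and the sketch assumes a YAML 1.2 processor on the receiving side, which accepts a new document directly after a ... line.

```python
from typing import Iterable, List

DOCUMENT_END = "..."  # the YAML document end marker


def merge_streams(streams: Iterable[str]) -> str:
    """Concatenate YAML streams as plain text, closing each with a '...' line.

    The streams are never parsed. An extra '...' after a stream that is
    already closed is harmless, because the marker does nothing when no
    document is open.
    """
    parts: List[str] = []
    for stream in streams:
        if not stream.endswith("\n"):
            stream += "\n"  # the marker must start on a line of its own
        parts.append(stream + DOCUMENT_END + "\n")
    return "".join(parts)
```

The point of the design is that the merger does not need to know whether the last document of each input is still open: appending the marker is safe either way.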
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

Is there some standard for handling sets of ranges of numbers?

When you open the print dialog of an editor, you typically get a field for specifying which pages you want to print - which can be multiple ranges, e.g.: "5,11,31-33"
Now, there are other scenarios in which this kind of input from a user is relevant - especially in configuration files for sequential or iterative processes where you want to qualify which iterations or elements a certain action or feature should apply to.
However, I'm not aware of a name for this kind of string, nor of an accepted standard format/convention for it (i.e. can you add spaces? Can you use semicolons instead of commas? Must the ranges be sorted? Are overlaps allowed and are they maintained or discarded? Can ranges use ".." instead of "-"? Can you range down instead of up? etc.).
Is there some such convention or such standard?
My motivation is twofold: I need to parse such ranges in a piece of code I'm looking at, and I want both to do it correctly (or rather per convention) and to look for parsing functionality in existing libraries. Right now I don't even have a name to go on.
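To make the open questions concrete, here is a minimal Python sketch of one possible interpretation. It does not follow any standard; every choice in it is an assumption: commas separate items, "-" denotes an inclusive ascending range, surrounding spaces are ignored, overlaps are merged, descending ranges are rejected and negative numbers are not supported. The name parse_ranges is made up.

```python
from typing import List, Set


def parse_ranges(spec: str) -> List[int]:
    """Parse a string such as "5,11,31-33" into a sorted list of integers."""
    result: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty item in range specification")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part}")
            result.update(range(start, end + 1))  # inclusive on both ends
        else:
            result.add(int(part))
    return sorted(result)
```

Under these assumptions, "5,11,31-33" yields 5, 11, 31, 32 and 33. A different answer to any of the questions above (for instance keeping overlaps, or allowing "..") would change the parser.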

Searching a list of keywords from text files in folders

I have compiled a list of db object names, one name per line, in a text file. I want to know, for each name, where it is being used. The search target is a group of folders containing sub-folders of source code.
Before I give up looking for a tool to do this and start creating my own, perhaps you can point me to an existing one.
Ideally, it should be a Windows desktop application. I have not used grep before.
Use grep (there are plenty of ports of this command to Windows; search the web).
Alternatively, use Agent Ransack.
See our Source Code Search Engine. It indexes a large code base according to the atoms (tokens) of the language(s) of interest, and then uses that index to quickly execute structured queries stated in terms of language elements. It is a kind of super-grep, but it isn't fooled by comments or string literals, and it automatically ignores whitespace. This means you get a lot fewer false positive hits than you get with grep.
If you had an identifier "foo", the following query would find all mentions:
I=foo
For C and Java, you can constrain the types of identifier accesses to Use, Read, Write or Defines.
D=bar*
would find only declarations of identifiers which started with the letters "bar".
You can write more complex queries using sequences of language tokens:
'int' I=*baz* '['
for C, would find declarations of any variable whose name contains the letters "baz" and which is apparently declared as an array.
You can see the hits in a GUI, and one-click navigate to a source code view of any hit.
It is a Windows application. It handles a wide variety of languages: C#, C++, Java, ... and many more.
I created an SSIS package to load my 500+ source code files, which are spread across several levels of folders belonging to several projects, into a table, with 1 row per line of each file (10K+ lines in total).
I then ran a select statement against it, cross-applying the table that holds the list of 5K+ db object keywords, with the help of RegEx for MS-SQL (http://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/). The query took almost 1.5 hours to complete.
I know it's long-winded, but this is exactly what I need. Thank you for your efforts in guiding me. I would be happy to explain the details further, should anyone be interested in using my method.
insert
    dbo.DbObjectUsage
select
    do.Id as DbObjectId,
    fl.Id as FileLineId
from
    dbo.FileLine as fl -- 10K+
cross apply
    dbo.DbObject as do -- 5K+
where
    dbo.RegExIsMatch('\b' + do.name + '\b', fl.Line, 0) != 0
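For comparison, the same whole-word search can be sketched outside the database in a few lines of Python. This is a different technique from the query above, not a translation of it: each line is split into word tokens and every token is checked against a set, so the cost no longer grows with the number of keywords. It assumes the object names consist only of letters, digits and underscores, and that matching is case-sensitive; find_usages and its return shape are made up for illustration.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

TOKEN = re.compile(r"\w+")  # a run of word characters, like a \b...\b match


def find_usages(names: Iterable[str], root: Path) -> Dict[str, List[Tuple[str, int]]]:
    """Map each name to the (file path, line number) pairs where it occurs."""
    wanted = set(names)
    hits: Dict[str, List[Tuple[str, int]]] = {name: [] for name in wanted}
    for path in sorted(root.rglob("*")):  # walks all sub-folders
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_no, line in enumerate(text.splitlines(), start=1):
            # set() avoids reporting the same line twice for one name
            for token in set(TOKEN.findall(line)):
                if token in wanted:
                    hits[token].append((str(path), line_no))
    return hits
```

A call might look like find_usages(Path("names.txt").read_text().split(), Path("src")), where both paths are placeholders for your own keyword list and source root.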

Eliminating code duplication in a single file

Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
Thanks in advance.
Edit:
Thanks for all the great tools! I'll definitely check them out.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.
Check out Atomiq. It finds duplicate code that is ripe for extraction to one location.
http://www.getatomiq.com/
If you're using Eclipse, you can use the copy paste detector (CPD) https://olex.openlogic.com/packages/cpd.
You don't say what language you are using, which is going to affect what tools you can use.
For Python there is CloneDigger. It also supports Java but I have not tried that. It can find code duplication both with a single file and between files, and gives you the result as a diff-like report in HTML.
See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different sequences of statements.
The CloneDR handles many languages, including Java (1.4,1.5,1.6) and C# especially up to C#4.0. You can see sample clone detection reports at the website, also including one for C#.
Resharper does this automagically: it suggests when it thinks code should be extracted into a method, and will do the extraction for you.
Check out PMD. Once you have configured it (which is fairly simple), you can run its copy paste detector to find duplicate code.
Someone with some Office skills can do the following sequence in about 1 minute:
use an ordinary formatter to unify the code style, preferably without line wrapping
feed the code text into Microsoft Excel as a single column
search and replace all double spaces with single ones and do other replacements
sort the column
At this point the duplicates will already be easy to spot. But to go further:
add a comparator formula to the 2nd column and a counter to the 3rd
copy and paste the values again, sort and see the most repetitive lines
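The same idea can be sketched without Excel. The Python snippet below is only a rough equivalent of the steps above: it collapses whitespace, counts identical lines and lists the most repeated ones first. It works line by line, so it will not find multi-line clones the way the dedicated tools do; repeated_lines and min_count are made-up names.

```python
import re
from collections import Counter
from typing import List, Tuple


def repeated_lines(source: str, min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (line, occurrences) pairs for lines seen at least min_count times."""
    # Collapse runs of whitespace so formatting differences do not hide repeats.
    normalized = (re.sub(r"\s+", " ", line).strip() for line in source.splitlines())
    counts = Counter(line for line in normalized if line)  # skip blank lines
    repeated = [(line, n) for line, n in counts.items() if n >= min_count]
    # Most repetitive first, ties broken alphabetically for a stable order.
    return sorted(repeated, key=lambda item: (-item[1], item[0]))
```

In practice you would probably raise min_count or filter out trivial lines such as a lone closing brace, which repeat in any code base.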
There is an analysis tool, called Simian, which I haven't yet tried. Supposedly it can be run on any kind of text and point out duplicated items. It can be used via a command line interface.
Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd

Dynamically updating RDF File

Is it possible to update an RDF file dynamically from user-generated input through a web form? The exact scenario would be SKOS concept definitions being created and updated through user input to HTML forms.
I was considering XPath, but is there a better / generally accepted / best-practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file, you use a Graph which is stored as a named graph in a Triple Store that supports triple-level updates (i.e. you can change individual Triples in a Graph). For example, you could use a store like Virtuoso or a Jena-based store (Jena SDB/TDB) to do this; basically any store that supports the SPARUL language or has its own equivalent.
2 - Using a fixed RDF file and altering it
From your mention of XPath I assume that you are intending to store your file as RDF/XML. While XPath would potentially work for this, it's going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files, then there'll be no guarantees over how the RDF has been serialized into RDF/XML, so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML, then you can keep it in a format that your XPath will work on.
From my point of view, the simplest way to take this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory and then persist the whole thing back to disk when the user is done (or at regular intervals, or whatever is appropriate to your application).
