Performance of XElement LINQ query against attribute or element

Performance of XElement LINQ query against attribute or element - performance

I'm using LINQ to XML to search the XML below (small section of final document) initially on the Country name attibute and was wondering if there would be any performance benefit in creating the name as a child element rather than attribute
<Countries>
<Country name="United Kingdom">
<Grades>
<Grade>PA</Grade>
<Grade>FE</Grade>
</Grades>
</Country>
</Countries>
Thanks

In terms of LINQ processing time the difference should be very very small, it depends on the shape of the document. If you are looking for an attribute on element with many attributes it's going to be slower, so if you know you will have just that one attribute you're looking for it will be fast. Same goes for elements. So if the sample above is representative, the faster one would be the attribute as there's just one attribute, but there would be two elements if you moved the name to the element.
Second consideration which is probably more important is parsing speed. You will need to parse the document first to be able to search it. Parsing speed depends mainly on the number of characters it has to process. So the longer the input document (in bytes), the longer it's going to take to parse it. In this sense attributes are a bit shorter than elements (usually). Also the bookkeeping the parser has to do is little bit less for attributes than elements (especially if you have just one attribute on an element).
But as with anything about performance: Measure It. That's the ultimate answer.

Related

Is there a way to change the Search API facet count to show a total word count instead of the count of matching fragments (documents)?

I'm creating an application using Marklogic 8 and the search API. I need to create facets based on MarkLogic defined collections, but instead of the facet count giving a tally of the number of fragments (documents) which contain X number of occurrences of the keyword search performed, I need the facet count to reflect the total number of times the keyword appears in all documents in the collection.
Right now, I'm using search:search() to process the query and return a element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency() which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(QName("http://www.tei-c.org/ns/1.0", "TEI"),
"", "item-frequency",
cts:and-query((
fn:collection("KirchlicheDogmatik/volume4/part3"),
cts:word-query("lehre"))))

The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.

Relation between two texts with different tags

I'm currently having a problem with the conception of an algorithm.
I want to create a WYSIWYG editor that goes along the current [bbcode] editor I have.
To do that, I use a div with contenteditable set to true for the WYSIWYG editor and a textarea containing the associated bbcode. Until there, no problem. But my concern is that if a user wants to add a tag (for example, the [b] tag), I need to know where they want to include it.
For that, I need to know exactly where in the bbcode I should insert the tags. I thought of comparing the two texts (one with html tags like <span>, the other with bbcode tags like [b]), and that's where I'm struggling.
I did some research but couldn't find anything that would help me, or I did not understand it correctly (maybe did I do a wrong research). What I could find is the Jaccard index, but I don't really know how to make it work correctly.
I also thought of another alternative. I could just take the code in the WYSIWYG editor before the cursor location, and split it every time I encounter a html tag. That way, I can, in the bbcode editor, search for the first occurrence, then search for the second occurrence starting at the last index found, and so on until I reach the place where the cursor is pointing at.
I'm not sure if it would work, and I find that solution a bit dirty. Am I totally wrong or should I do it this way?
Thanks for the help.

A popular way of determining what is the level of the similarity between the two texts is computing the mentioned Jaccard similarity. Citing Wikipedia:
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
If you have a large number of texts though, computing the full Jaccard index of every possible combination of two texts is super computationally expensive. There is another way to approximate this index that is called minhashing. What it does is use several (e.g. 100) independent hash functions to create a signature and it repeats this procedure many times. This whole process has a nice property that the probability (over all permutations) that T1 = T2 is the same as J(A,B).
Another way to cluster similar texts (or any other data) together is to use Locality Sensitive Hashing which by itself is an approximation of what KNN does, and is usually worse than that, but is definitely faster to compute. The basic idea is to project the data into low-dimensional binary space (that is, each data point is mapped to a N-bit vector, the hash key). Each hash function h must satisfy the sensitive hashing property prob[h(x)=h(y)]=sim(x,y) where sim(x,y) in [0,1] is the similarity function of interest. For dots products it can be visualized as follows:
we can now ask what would be the has of the indicated point (in this case it's 101) and everything that is close to this point has the same hash.
EDIT to answer the comment
No, you asked about the text similarity and so I answered that. You basically ask how can you predict the position of the character in text 2. It depends on whether you analyze the writer's style or just pure syntax. In any of those two cases, IMHO you need some sort of statistics that will tell where it is likely for this character to occur given all the other data/text. You can go with n-grams, RNNs, LSTMs, Markov Chains or any other form of sequential data analysis.

Indexing by float or double field algorithm

I have a task to perform fast search in huge in-memory array of objects by some object's fields. I need to select the subset of objects satisfying some criteria.
The criteria may be specified as a floating point value or range of such values (eg. 2.5..10).
The problem is that the float property to be searched on is not quite uniformly distributed; it could contain few objects with value range 10-20 (for example) and another million objects with values 0-1, and another million with values 100-150.
So, how possible is it to build index for effective searching those objects? Code samples are welcome.

If the in memory array is ordered then binary search would be my first attempt. Wikipedia entry has example code as well.
http://en.wikipedia.org/wiki/Binary_search_algorithm

If you're doing lookups only, a single sort followed by multiple binary searches is good.
You could also try a perfect hash algorithm, if you want the ultimate in lookup speed and little more.
If you need more than just lookups, check out treaps and red-black trees. The former are fast on average, while the latter are decent performers with a low operation duration variability.

You could try a range tree, for the range requirement.

I fail to see what the distribution of values has to do with building an index (with the possible exception of exact duplicates). Since the data fits in memory, just extract all the fields with their original position, sort them, and use a binary search as suggested by #MattiLyra.
Are we missing something?

XPath: / quicker than //?

If I got it right, / means that the node right to it must be an immediate child of the node left to it, e.g. /ul/li returns li items which are immediate children of a ul item which is the document root. //ul//li returns li items which are descendants of any ul item which is somewhere in the document.
Now: Is /ul/li faster than //ul//li, even if the result set is the same?

Generally speaking, yes, of course!
/ul/li visits at most (number_of_ul * number_of_li nodes), with a maximum depth of 2. //ul//li could potentially visit every node in the document.
However, you may be using a document system with some kind of indexing, or you could have a document where the same number of nodes ends up getting visited, or whatever, which could either make // not as slow or or the same speed as or possibly even faster than /ul/li. I guess you could also potentially have a super-dumb XPath implementation that visits every node anyway.
You should profile your specific scenario rather than ask which is faster. "It depends" is the answer.

There are probably at least 50 implementations of XPath, and their performance varies dramatically, by several orders of magnitude. It's therefore meaningless to ask questions about XPath performance without reference to a specific implementation.
Generally the advice given is to use as specific a path as you can: /a/b/c/d is better than //d. However, this advice is not always correct. Some products will execute //d faster because they have gone to the trouble of creating an index: this is particularly true if you are running against an XML database. Also, performance isn't everything. When you are dealing with complex vocabularies like FpML, the specific path to an element can easily be ten steps with names averaging 20 characters, so that's a 200-character XPath, and it can be very hard to spot when you get it wrong. Programmer performance matters more than machine performance.

What's a simple way to design a memoization system with limited memory?

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometime the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.

I'd do it by creating a subclass to DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps corresponding to when the property was accessed the last 100 times, and the sixth column contains the variable size.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If no, it returns the value and stores the timestamp. If yes, it performs the computation using the handle from column 1 inside a TIC/TOC statement to measure how expensive the computation is (which is stored in col4, and the timestamp is stored in col5). Then it checks whether there is enough space for storing the result. If yes, it stores the data, otherwise it checks size, as well as the product of how many times data were accessed with how long it would take to regenerate, to decide what to cull.
In addition, there is an 'add' property, that adds a row to the cell array, creates a dynamic property (using addprops) of the same name as the function handle, and sets the get-method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters with a set method that will remove previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties, that can tell how many properties there are (# of rows in the cell array), how they're called (first col of the cell array), etc.

Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java:
Easy, simple to use LRU cache in java
http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
http://www.codeproject.com/KB/java/lru.aspx

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio