Increase performance of large tables from Plone search results

The traditional way for Zope to handle large search results is batched output: the first batch of items is displayed, and you click a "next" link to fetch the following chunk from the server.
Nowadays there are cool JavaScript solutions which allow client-side sorting and filtering of tables, e.g. DataTables. These work fine; but if the table is large and Zope generates the complete HTML, it sometimes takes a long time before the page loads (the search itself seems reasonably fast, but the TAL engine is the performance bottleneck).
So, how is this best tackled?
Generate the whole table from JSON? (needs JavaScript for anything to work)
Use the standard paging, and replace it with a client-side table solution if JavaScript is available?
Provide the data of pages 2+ as JSON?
Provide all data as JSON?
Let the table engine load contents for the next page or for filtering?
Is there some plug-in solution to apply such enhancements to standard views (like folder contents)?
I have a page which contains about 1600 items and takes 60s+ to load, which certainly needs to be improved ...
Any pointers and/or code snippets? Thank you!

You'll have to roll your own customizations. The new folder contents in Plone 5 works only with JSON data; however, it does not return the whole folder result set at once; it still does paging server-side. I've actually never encountered the use case of sorting thousands of results client-side. I usually do that server-side and bring in the data with AJAX.
You'll want to generate JSON data structures yourself from catalog results (brains). Something like this in a view, perhaps:
import json

from Products.CMFCore.utils import getToolByName
from Products.Five.browser import BrowserView


class FolderJSONView(BrowserView):
    """Return the contents of the current folder as a JSON list."""

    def __call__(self):
        catalog = getToolByName(self.context, 'portal_catalog')
        query = {
            'path': {
                'query': '/'.join(self.context.getPhysicalPath()),
                'depth': 1,
            }
        }
        result = []
        for brain in catalog(**query):
            result.append({
                'id': brain.id,
                'uid': brain.UID,
                'title': brain.Title,
                'url': brain.getURL(),
            })
        self.request.response.setHeader('Content-Type', 'application/json')
        return json.dumps(result)
Along with some JavaScript; there are plenty of libraries for presenting JSON data in tables.
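For instance, a minimal sketch wiring such a view up to DataTables (the view name "folder-json", the table id, and the column set are illustrative assumptions, not part of Plone):

// Assumes jQuery and DataTables are loaded, a <table id="folder-listing">
// element exists, and the JSON view above is registered as "folder-json"
// (a hypothetical name).
$(function () {
    $('#folder-listing').DataTable({
        ajax: {
            url: 'folder-json',  // hypothetical browser view name
            dataSrc: ''          // the view returns a plain JSON array
        },
        columns: [
            { data: 'id', title: 'ID' },
            {
                data: 'title',
                title: 'Title',
                // Render the title as a link to the item.
                render: function (data, type, row) {
                    return '<a href="' + row.url + '">' + data + '</a>';
                }
            }
        ],
        deferRender: true  // create DOM rows only as they are displayed
    });
});

With deferRender, DataTables builds table rows only as they are paged into view, so the browser no longer has to render all 1600 rows up front.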
Things you can investigate if you're interested:
https://github.com/plone/plone.app.content/blob/master/plone/app/content/browser/vocabulary.py
https://github.com/plone/mockup/tree/master/mockup/patterns/structure

Related

Umbraco 8 - Get Children Of Node Using ContentAtXPath() Method

I've been refactoring an existing Umbraco project to use more performant querying when getting back document data, as everything was previously being returned using LINQ. I have been using a combination of Umbraco's querying via XPath and Examine.
I am currently stumped on trying to get child documents using the Umbraco.ContentAtXPath() method. What I would like to do is get child documents based on a path I pass to the method. This is what I have currently:
IEnumerable<IPublishedContent> umbracoPages = Umbraco.ContentAtXPath("//*[@isDoc]/descendant::/About[@isDoc]");
Running this returns an "Object reference not set to an instance of an object." error, and I'm unable to see exactly where I'm going wrong (I'm new to this form of querying in Umbraco).
Ideally, I'd like to enhance the querying to also carry out sorting using the non-LINQ approach, as demonstrated here.
Up until Umbraco 8, content was cached in an XML file, which made XPath perfect for querying content efficiently. In v8, however, the so-called "NuCache" is neither file based nor XML based, so the XPath query support is only there for ... well ... old times' sake, I guess? Either way it's probably not going to be super efficient and (I'd advise) not something to "aim for". That said, I of course don't know what you are changing from (LINQ can be a lot of things) :-/
It certainly depends on how big your dataset is.
As Umbraco has moved away from the XML-backed cache, you should look into LINQ queries against your content models. Make sure you use ModelsBuilder to generate the models.
On a small dataset LINQ will be much quicker than Examine. On a large dataset Examine/Lucene will be much more consistent in performance.
Querying NuCache is pretty fast in Umbraco 8, only beaten by an Examine search.
Assuming you're using Models Builder, and your About page is a child of Home page, you could use:
var homePage = (HomePage) Model.Root();
var aboutPage = homePage?.Children<AboutPage>().FirstOrDefault();
var umbracoPages = aboutPage?.Children();
Where HomePage is your home page Document Type Alias and AboutPage is your About page Document Type alias.

DOM-based data storage, best choices/performances?

This question is about loading data embedded in the DOM structure.
I am using jQuery 2, but the question is valid for any other framework or plain JavaScript code.
There are two scenarios:
When the data is loaded once (with the page) and no data refresh is needed.
When the data is refreshed by some event.
The average performance may differ between the two.
Suppose a typical case of scenario 2, where a page fragment must be reloaded with new HTML and new data. So $('#myDiv').load('newHtmlFragment') will be used anyway... And, for a jQuery programmer using AJAX, there are several ways to load "DOM-based data":
by HTML: expressing all data in the "newHtmlFragment" HTML. Suppose many paragraphs, each like <p id="someId" data-X="someContent">...more content</p>. There is some "verbosity overhead" for each data-X1="contentX1" data-X2="contentX2" ..., and it is not elegant for a web-service script unless it is XHTML-oriented (I am using PHP, my data is an array, and I prefer to use json_encode).
by jQuery evaluation: using the same $('#myDiv').load('newHtmlFragment') only for <p id="someId">...more content</p>, with no data-X. A second AJAX call loads a jQuery script like $('#someId').data(...) and evaluates it. This adds overhead for node selection and data inclusion, but with big per-item data, each item can be encoded as JSON.
by pure JSON: similar to "by jQuery evaluation", but the second AJAX call loads a JSON object like var ALLDATA={'someId1':{...}, 'someId2':{...}, ...}. This adds the overhead of a static function that executes something like $('#myDiv p').each(function(){... foreach ... $(this).data('x',ALLDATA[id]['x']);}) to apply the data to the selection, but with big data, everything can be encoded as JSON.
The question: what is the best choice? Does it depend on the scenario or on some other context parameter? Are there significant performance trade-offs?
PS: a complete answer needs to address the issue of performance... If there are no significant performance differences, the best choice comes down to "good programming style" and software engineering considerations.
More context, if you need it as a reference to answer: my practical problem is in scenario 1, and I am using the second option, "by jQuery evaluation", executing:
$('#someId1').data({'bbox':[201733.2,7559711.5,202469.4,7560794.9],'2011-07':[3,6],'2011-08':[2,3],'2011-10':[4,4],'2012-01':[1,2],'2012-02':[12,12],'2012-03':[3,6],'2012-04':[6,12],'2012-05':[3,4],'2012-06':[2,4],'2012-07':[3,5],'2012-08':[10,11],'2012-09':[7,10],'2012-10':[1,2],'2012-12':[2,2],'2013-01':[6,10],'2013-02':[19,26],'2013-03':[2,4],'2013-04':[5,8],'2013-05':[4,5],'2013-06':[4,4]});
$('#someId2').data({'bbox':[197131.7,7564525.9,198932.0,7565798.1],'2011-07':[39,51],'2011-08':[2,3],'2011-09':[4,5],'2011-10':[13,14],'2011-11':[40,42],'2011-12':[21,25],'2012-01':[10,11],'2012-02':[26,31],'2012-03':[27,35],'2012-04':[8,10],'2012-05':[24,36],'2012-06':[4,7],'2012-07':[25,30],'2012-08':[9,11],'2012-09':[42,52],'2012-10':[4,7],'2012-11':[17,22],'2012-12':[7,8],'2013-01':[21,25],'2013-02':[5,8],'2013-03':[8,11],'2013-04':[28,40],'2013-05':[55,63],'2013-06':[1,1]});
$('#...').data(...); ... 50 more similar lines ...
This question can be discussed from different aspects. Two that I can think of right now would be software engineering and end-user experience. A comprehensive solution covering the first can also cover the latter, but since it is usually not possible to come up with such a solution (due to its cost), the two hardly overlap.
Software engineering point of view
In this POV it is strongly suggested that different parts of the system be as isolated as possible. This means you had better postpone the marriage of data and view as late as you can. It helps you divide your developers into two separate groups: those who know server-side programming and have no idea how HTML (or any other interface-layer technology) works, and those who are solely experienced in HTML and JavaScript. This division alone is a blessing for management, and it helps greatly in big projects where teamwork is essential. It also helps the maintenance and expansion of the system, all the things software engineering aims at.
User experience point of view
As good as the previous solution sounds, it comes with (solvable) drawbacks. As you mentioned in your question, if we are to load view and data separately, it increases the number of requests we have to send to the server to retrieve them. This imposes two problems: first, the overhead that comes with each request, and second, the delay the user has to wait for each request to be answered. The second is more obvious, so let's start with that. With all the advances in the Internet and bandwidth, our users' ever-growing expectations still force us to consider this delay. One way to reduce the number of requests would be your first choice: data within HTML fragments. Multiple requests also have an overhead problem. This can be explained by the HTTP protocol's handshake (on both the client side and the server side) and by the fact that each request leads to loading the session on the server, which at large scale can be considerable. So again, your first option could be the answer to this problem.
Tie breaker
Having both sides of the story said, what then? The ultimate solution is a combination of both, where data and view are married on the client but downloaded at the same time. To be honest, I don't know of such a library. But the principle is simple: you need a mechanism to package data and empty HTML fragments within the same response and combine them into what the user will see on the client. This solution is costly to implement, but it is the sort of cost that, once paid, you benefit from for a lifetime.
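To sketch the principle (the endpoint and the response shape here are assumptions, not an existing library): the server answers one AJAX call with both the empty HTML fragment and its data, and the client marries them:

// Hypothetical combined response, e.g.:
//   { "html": "<p id=\"someId1\">...more content</p>",
//     "data": { "someId1": { "x": 1 } } }
$.getJSON('fragment-with-data.php', function (payload) {
    // Inject the markup first...
    $('#myDiv').html(payload.html);
    // ...then attach each item's data to the freshly created nodes.
    $.each(payload.data, function (id, itemData) {
        $('#' + id).data(itemData);
    });
});

This keeps the data/view separation in the code while paying for only one round trip.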
It depends on how you are going to use the data that is stored. Consider this example:
Example
You have 300 items in a database (say, server access logs of the last 300 days). Now you want to display 300 <div> tags, each tag representing one database item.
Now there are 2 options to do this:
<div id="id-1" data-stat1="stat1" data-stat2="stat2" data-stat3="stat3">(...)</div>
<!-- again, this is repeated 300 times, with ids id-1 ... id-300 -->
<script>
// Example on how to show the "stat1" value in all <div>s
function showStat1() {
    for (var i = 1; i <= 300; i += 1) {
        var theID = '#id-' + i;
        jQuery(theID).text(jQuery(theID).data('stat1'));
    }
}
</script>
OR
<div id="id-1">(...)</div>
<!-- repeat this 300 times, for all DB items -->
<script>
data = { // JSON data which is displayed in the <div> tags
'1': ['stat1', 'stat2', 'stat3'],
// same for all 300 items
}
// Example on how to show "stat1" value in all <div>s
function showStat1() {
for(var i=1; i<=300; i+= 1) {
var theID = '#id-' + i;
jQuery(theID).text(data[i][0]);
}
}
</script>
Which scenario is better?
Case 1:
The data is directly coded into the DOM elements, which makes this an easy-to-implement solution. You can see in my examples that the first code block takes a lot less code to generate, and the code is less complex.
This solution is very powerful when you want to store data that is directly related to a DOM element (as you do not have to create a logical connection between JSON and DOM): it is very simple to access the data for the correct element.
However, your main concern is performance, and this is NOT the way to go. Every time you access data, you first have to select the correct DOM element and attribute via JavaScript, which takes a considerable amount of time. So the simplicity costs you a lot of performance overhead when you want to read/write data in the element.
Case 2:
This scenario very cleanly separates the DISPLAY from the DATA storage. It has major advantages compared to the first case.
A) Data should not be mingled with the display elements: imagine you want to switch from <div> to <table> and suddenly you have to rewrite all the JavaScript to select the correct table cells.
B) Data is directly accessible without traversing the DOM tree. Imagine you want to calculate an average or a sum of all values. In the first case you need to loop over all DOM elements and read the value from each; in the second you simply loop over a normal array.
C) The data can be loaded or refreshed via AJAX. You transfer only the JSON object, not all the HTML that is required for displaying the data.
D) Performance is way better, mostly because of the points above. JavaScript is a lot better at handling simple arrays or (not too complex) JSON objects than it is at filtering data out of the DOM tree.
In general this solution is more than twice as fast as the first case. You will also discover, even though it is not obvious, that the second case is easier to code and maintain! The code is simply easier to read and understand, as you clearly separate the data from the UI elements...
Performance comparison
You can compare the two solutions on this JSPerf scenario:
http://jsperf.com/performance-on-data-storage
Taking it further
For implementation of the second case I generally use this approach:
I generate the HTML code which will serve as the UI. It often happens that I have to generate HTML via JavaScript, but here I assume that the DOM tree will not change after the page is loaded.
At the end of the HTML I include a script tag with a JSON object for the initial page display.
Via the jQuery ready event I then parse the JSON object and update the UI elements as required (e.g. populating the table with data).
When data is loaded dynamically, I simply transfer a new JSON object with the new data and use exactly the same JavaScript function as in step 3 to display the new results.
Example of the HTML file:
<div>ID: <span class="the-id">-</span></div>
<div>Date: <span class="the-date">-</span></div>
<div>Total page hits: <span class="the-value">-</span></div>
<button type="button" class="refresh">Load new data</button>
<script src="my-function.js"></script>
<script>
function initial_data() {
    return {"id":"1", "date":"2013-07-30", "hits":"1583"};
}
</script>
The "my-function.js" file:
jQuery(function initialize_ui() {
    // Initialize the global variables we will use in the app
    window.my_data = initial_data();
    window.app = {};
    app.the_id = jQuery('.the-id');
    app.the_date = jQuery('.the-date');
    app.the_value = jQuery('.the-value');
    app.btn_refresh = jQuery('.refresh');
    // Add event handler: when the user clicks the refresh button, load new data
    app.btn_refresh.click(refresh_data);
    // Display the initial data
    render_data();
});

function render_data() {
    app.the_id.text(my_data.id);
    app.the_date.text(my_data.date);
    app.the_value.text(my_data.hits);
}

function refresh_data() {
    // For example, fetch and display the values for date 2013-07-29
    jQuery.post(url, {"date": "2013-07-29"}, function (text) {
        my_data = jQuery.parseJSON(text);
        render_data();
    });
}
I did not test this code, and crucial error handling and other optimizations are missing, but it helps to illustrate the concepts I am trying to describe.

Best practices for filtering data

The way I see it, when building dynamic charts with filters, each time the user requests filtered data I can
Execute a new MySQL query, and use MySQL to do the filtering.
SELECT date,
       SUM(IF(`column` = 'condition', 1, 0)) AS count
...
Execute a new MySQL query, and use the server-side language (PHP in my case) to filter.
function getData($link, $condition) {
    $resultSet = mysqli_query($link, "SELECT date, column ... ");
    $count = 0;
    while ($row = mysqli_fetch_assoc($resultSet)) {
        if ($row['column'] == $condition) {
            $count++;
        }
    }
    return $count;
}
Initially execute a single MySQL query, pass all the data to the client, and use JavaScript & d3 to do the filtering.
I expect the answer is not black-and-white. For instance, if some filter is rarely requested, it may not make sense to make the other 95% of users wait for the relevant data, and thus the filter would necessitate a new data call. But I'm really thinking about edge cases: situations where filters are used regularly, but idiosyncratically. At times like this, is it better to put the filtering logic in the front end, in the back end, or within my database queries?
Generally, if the filtering can be done on the front end, it should be done there. The advantages are:
Doesn't matter if your server goes down
Saves you bandwidth costs
Saves the user waiting for round trip time
The disadvantages are that it may be slower and more complicated than it would be on the back end. However, depending on data volume, there are a lot of cases (like your examples) where JavaScript is plenty good enough. d3 even has a built-in filter function:
// remove anything that isn't cake
d3.selectAll('whatever')
    .filter(function(d) { return d.type != 'cake'; })
    .remove();
If you need more complex filtering, such as basic aggregates, you can use Crossfilter (also from Mike Bostock) or the excellent d3+crossfilter wrapper dc.js.
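For instance, a minimal Crossfilter sketch (the record fields "type" and "amount" are made up for illustration):

// Filter records along one dimension and aggregate the rest.
var cf = crossfilter([
    { type: 'cake', amount: 3 },
    { type: 'pie',  amount: 5 },
    { type: 'cake', amount: 2 }
]);
var byType = cf.dimension(function (d) { return d.type; });
byType.filter('cake');  // keep only the cakes

console.log(byType.top(Infinity));  // the filtered records
// A groupAll on another dimension respects the type filter:
var byAmount = cf.dimension(function (d) { return d.amount; });
var total = byAmount.groupAll().reduceSum(function (d) { return d.amount; });
console.log(total.value());  // 5 (= 3 + 2)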

How to cache Zend Lucene search results in Code Igniter?

I'm not sure if this is the best way to go about this, but my aim is to have pagination of my Lucene search results.
I thought it would make sense to run the search, store all the results in the cache, and then have a page function on my results controller that could return any particular subset of results from the cached results.
Is this a bad approach? I've never used caching of any sort, so I don't know where to begin. The CI Caching Driver looked promising, but everything throws a server error. I don't know if I need to install APC, or Memcached, or what else to do.
Help!
Lucene is a search engine that is built for scale. You can push it pretty far before the need arises to cache the search results. I would suggest you use the default settings and run with it.
If you still feel the need for a cache, first look at this Lucene FAQ, and then the next level would perhaps be something along the lines of memcache.
Hope it helps!
Zend Search Lucene indexes on the file system and, as the user above has stated, is built for scale. Unless you are indexing hundreds of thousands of documents, caching is not really necessary, especially since all you would effectively be doing is taking data from one file and storing it in another.
On the other hand, if you are only storing, say, the product ID in your search index and then selecting the products from the database when you get a result, it is well worth caching. This can easily be achieved by using Zend_Cache.
A basic example of Zend Db caching is here:
$frontendOptions = array(
    'automatic_serialization' => true
);
$backendOptions = array(
    'cache_dir' => YOUR_CACHE_PATH_ON_THE_FILE_SYSTEM,
    'file_name_prefix' => 'my_cache_prefix',
);
$cache = Zend_Cache::factory(
    'Core',
    'File',
    $frontendOptions,
    $backendOptions
);
Zend_Db_Table_Abstract::setDefaultMetadataCache($cache);
This should be added to your bootstrap file in an _initDbCache (call it whatever you want) method.
Of course that is a very simple implementation and does not achieve full result caching; more information on Zend caching with Zend Db can be found here.

How to cache a collection in Magento?

I have a collection that takes significant time to load. What I would like is to cache it (APC, Memcache). It is not possible to cache the entire object (it cannot be unserialized and it is over 1 MB). I'm thinking that caching the collection data ($col->getData()) is the way to go, but I have found no way to rebuild the object based on this array. Any clues?
Collections already have some caching built in, but they need a little prompting, so put this in the constructor of a collection:
$cache = Mage::app()->getCacheInstance();
$prefix = "SomeUniqueValue";
$this->initCache($cache, $prefix, array(Mage_Catalog_Model_Product::CACHE_TAG));
Choose tags appropriate to the content of the collection so that it will be flushed automatically. This builds an ID based on the query being executed, which is most useful when the collection is filtered, ordered, or paged, as it avoids a version conflict.
Generally this hardly gets used because when you retrieve data you almost always end up displaying it, probably as HTML, so it makes sense to cache the output instead. Block caching is widely used and better documented.
I really don't know, but I searched all the files named "Collection.php" that contain the word "cache" and got a few results. The most promising example to look at might be Mage_Sales_Model_Entity_Quote_Item_Collection (the _getProductCollection() method). It looks like Varien_Data_Collection (which is a parent class of any Magento collection) has a few cache-related methods: initCache() and _getCacheInstance().
Can't say I have used them before, but they might be useful someday.
Good luck.
You can get more information here: Can I use Magento's Caching layer as a Key/Value Store?
I'll be posting more info there as I find it.
