How do I debug an inefficient JCR query in FreeMarker template? - freemarker

I have a Magnolia FreeMarker template that is repeatedly executing (1,400+ times per page load) a simple but unnecessary select by ID JCR query. For example:
select BUNDLE_DATA FROM BUNDLE WHERE NODE_ID = ?
Because of the complexity of the FreeMarker code, I'm unable to tell which line(s) is causing the repeated query.
How do I debug JCR queries in Magnolia so that I can determine the FreeMarker code that is causing the repeated queries?

Going at DB level is going too deep. What the query above is, is JCR request to get all the properties of node w/ ID equal to NODE_ID, but as you say, it is meaningless in terms of telling you where/what operation is causing it.
Could be anything from:
ctx.getJCRSession('some_workspace').getNode('some_path')
to:
node.getNode('some_subpath')
to:
searchfn.searchPages('...')
or even:
cmsfn.children(node)
and possibly few more.
To further complicate the things, you would see query on the DB only if local (in-memory) jcr cache didn't contain the item that your template has requested so you will not even catch all the requests for content at DB level reliably.
One thing is for sure, requesting 1k+ nodes for single template rendering is in most cases indication that you are doing something wrong (unless you indeed want to render some kind of summary or overview going over thousands nodes).
Easiest way forward is to stop to think about logic in your template first. If it contains multiple components, you try to nail it down to given component to limit the search space, eg by removing components one by one and observing the changes.
If it boils down to some JCR query being executed, you can configure debug level logging directly on:
info.magnolia.templating.functions.SearchTemplatingFunctions
This is assuming you were using searchfn to do the search from the template. For other usecases debugging would be bit harder as ordinary JCR Node API is used to get also all other content (e.g. configuration related), not just one that is rendered on the page.
Likely case is that it is a single command or loop in your template causing it so timing execution of different parts of template can also give you clue and help narrow down the problem. If you share the template itself, it might be possible (but not guaranteed) that we can spot in it something and give you more precise indicators.

Related

Elasticsearch high level REST client - Indexing has latency

we have started using the high level REST client finally, to ease the development of queries from backend engineering perspective. For indexing, we are using the client.update(request, RequestOptions.DEFAULT) so that new documents will be created and existing ones modified.
The issue that we are seeing is, the indexing is delayed, almost by 5 minutes. I see that they use async http calls internally. But that should not take so long, I looked for some timing options inside the library, didn't find anything. Am I missing anything or the official documentation is missing for this?
Since refresh_interval: 1 in your index settings, it means it is never refreshed unless you do it manually, which is why you don't see the data just after it's been updated.
You have three options here:
A. You can call the _update endpoint with the refresh=true (or refresh=wait_for) parameter to make sure that the index is refreshed just after your update.
B. You can simply set refresh_interval: 1s (or any other duration that makes sense for you) in your index settings, to make sure the index is automatically refreshed on a regular basis.
C. You can explicitly call index/_refresh on your index to refresh it whenever you think is appropriate.
Option B is the one that usually makes sense in most use cases.
Several reference on using the refresh wait_for but I had a hard time finding what exactly needed to be done in the rest high level client.
For all of you that are searching this answer:
IndexRequest request = new IndexRequest(index, DOC_TYPE, id);
request.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);

Connecting NiFi to ElasticSearch

I'm trying to solve one task and will appreciate any help - links to documentation, or links to forums, or other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =) .
So, I have the following task:
Initial part of my system collects data each 5-15 min from different DB sources. Then I remove duplicates, remove junk, combine data from different sources according to logic and then redirect it to second part of the system as several streams.
As far as I know, "NiFi" can do this task in the best way =).
Currently I can successfully get information from InfluxDB by "GetHTTP" processor. However I can't configure same kind of processor for getting information from Elastic DB with all necessary options. I'd like to receive data each 5-15 minutes for time period from "now-minus-<5-15 minutes>" to "now". (depends on scheduler period) with several additional filters. If I understand it right, this can be achieved either by subscription to "_index" or by regular requests to DB with desired interval.
I know that NiFi has several specific Processors designed for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as GetHTTP and PostHTTP Processors. However, unfortunately, I have lack of information or even better - examples - how to configure their "Properties" for my purposes =(.
What's the difference between FetchElasticsearchHttp, QueryElasticsearchHttp? Which one fits better for my task? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as I need?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHttp or InvokeHttp. However the ESHttp processors let you put in just the stuff you're looking for, and it will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
For your use case, you likely need InvokeHttp in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents, see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.

Application level caching in AEM

We are working on a site in AEM 6.1 which has news and events content with most pages having info on recent and related news/events based on tagging that are dynamic. We are using dispatcher. Please suggest on some caching techniques that could be implemented at application level apart from the dispatcher. Thanks.
Aim of implementing the caching on dispatcher is to allow less hits on your app server and serve as much as possible from web server. In short improving response time from your app server. But in some cases we can't cache too much on web server if results change on app server frequently.
On app server we can have following solutions implemented to get results quickly on top of having dispatcher in place.
Make sure your content hierarchy where you are ingesting news items have as less number of article as possible. Divide your hierarchy based on following structure. Year >> Month >> Day >> Hour (this can be ignored if content flow is less) >> news items.
Having this structure in place, write path based query so that you don't have to traverse in whole content hierarchy.
There is a concept of transient node in CQ, for each news item which is getting created in CQ, update the transient node with newly created item. Means for recent news you don't have to traverse content structure just refer to transient node which has reference to newly created news item.
You could also write a cron job which gets executed in background and takes care of collating views namely top recent news.
To complement the answer of Rupesh I would say that definitely use dispatcher cache as much as you can and for using local caching strategies in AEM try using guava cache it is a very good and easy to use tool there is also a lot of information on how you can set it up and use it for your specific needs. Hope it helps.
I would suggest the following:
For recent news/events, write a scheduler
(https://sling.apache.org/documentation/bundles/scheduler-service-commons-scheduler.html) that will compute the list of recent news/events and write it to a specific node as properties, example:
/tmp/recent
news [/path/to/news1,/path/to/news2]
events [/path/to/event1,/path/to/event2]
Most recent always at the end of the array. Your code need to limit it to the amount of max recent you want to have.
Let's say you want to have the last 5 changed and a 6th page is changed, then you just pop and push(new_page_path)
This could run once a day or at the frequency which you feel is the best depending on your requirements.
If you need instant update, then you can additionally write a Listener when a page is changed/deleted and update the recent list. In this case I would suggest putting the code that deal with updating the recent list into a service and use that service in both the scheduler and the listener.
Listener and scheduler need to run on both author and publisher and on publisher trigger dispatcher cache invalidation for /tmp/recent afterwards.
In order to render the recent list without having to invalidate the whole pages, I would suggest you use SSI for that, that means have a component in your page that will render an SSI include to /tmp/recent.news.html or /tmp/recent.events.html depending on whether you want to render recent news or events.
Give the node /tmp/recent the resourceType for handling the "news" and "events" selector and implement that resourceType to render the content.
For the related
Use the Tag Manager (https://docs.adobe.com/docs/en/cq/5-6-1/javadoc/com/day/cq/tagging/TagManager.html) "find" method to lookup for all news/events having the same tag as the current page. I assume your news and events pages have a dedicated template or resource type.
Also I would suggest having a dedicated component that would include that content using SSI include. Let's say your page has 2 tags, ns/tag1 and ns/tag2, then you could perform the SSI include like this:
SSI include /etc/tags/ns/tag1.related_news.html
SSI include /etc/tags/ns/tag1.related_events.html
SSI include /etc/tags/ns/tag2.related_news.html
SSI include /etc/tags/ns/tag2.related_events.html
depending on what you want to include
Write a component under /apps/cq/tagging/components/tag (sling:resourceSuperType= /libs/cq/tagging/components/tag) that will provide the rendering for the "related_news" and "related_events" selector and list all related pages.
The advantage with this approach is that you can share the related page for each tag and whenever the tag is changed/deleted then the cache gets invalidated automatically.
In both cases (recent and related) configure the dispatcher to cache the output.

Micro Services and noSQL - Best practice to enrich data in micro service architecture

I want to plan a solution that manages enriched data in my architecture.
To be more clear, I have dozens of micro services.
let's say - Country, Building, Floor, Worker.
All running over a separate NoSql data store.
When I get the data from the worker service I want to present also the floor name (the worker is working on), the building name and country name.
Solution1.
Client will query all microservices.
Problem - multiple requests and making the client be aware of the structure.
I know multiple requests shouldn't bother me but I believe that returning a json describing the entity in one single call is better.
Solution 2.
Create an orchestration that retrieves the data from multiple services.
Problem - if the data (entity names, for example) is not stored in the same document in the DB it is very hard to sort and filter by these fields.
Solution 3.
Before saving the entity, e.g. worker, call all the other services and fill the relative data (Building Name, Country name).
Problem - when the building name is changed, it doesn't reflect in the worker service.
solution 4.
(This is the best one I can come up with).
Create a process that subscribes to a broker and receives all entities change.
For each entity it updates all the relavent entities.
When an entity changes, let's say building name changes, it updates all the documents that hold the building name.
Problem:
Each service has to know what can be updated.
When a trailing update happens it shouldnt update the broker again (recursive update), so this can complicate to the microservices.
solution 5.
Keeping everything normalized. Fileter and sort in ElasticSearch.
Problem: keeping normalized data in ES is too expensive performance-wise
One thing I saw Netflix do (which i like) is create intermediary services for stuff like this. So maybe a new intermediary service that can call the other services to gather all the data then create the unified output with the Country, Building, Floor, Worker.
You can even go one step further and try to come up with a scheme for providing as input which resources you want to include in the output.
So I guess this closely matches your solution 2. I notice that you mention for solution 2 that there are concerns with sorting/filtering in the DB's. I think that if you are using NoSQL then it has to be for a reason, and more often then not the reason is for performance. I think if this was done wrong then yeah you will have problems but if all the appropriate fields that are searchable are properly keyed and indexed (as #Roman Susi mentioned in his bullet points 1 and 2) then I don't see this as being a problem. Yeah this service will only be as fast as the culmination of your other services and data stores, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video that I saw this in (https://www.youtube.com/watch?v=StCrm572aEs)... its a long video but very informative.
It is hard to advice on the Solution N level, but certain problems can be avoided by the following advices:
Use globally unique identifiers for entities. For example, by assigning key values some kind of URI.
The global ids also simplify updates, because you track what has actually changed, the name or the entity. (entity has one-to-one relation with global URI)
CAP theorem says you can choose only two from CAP. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is MapReduce approach, which can distribute the load of figuring out those things.
Think carefully about the balance of normalization / denormalization. If your services operate on URIs, you can have a service which turns URIs to labels (names, descriptions, etc), but you do not need to keep the redundant information everywhere and update it. Do not do preliminary optimization, but try to keep data normalized as long as possible. This way, worker may not even need the building name but it's global id. And the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys, shared between services, as part of separation of concerns.
Focus on the underlying model, not the JSON to and from. Right modelling of the data in your system(s) gains you more than saving JSON calls.
As for NoSQL, take a look at Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading it's documentation may help to come up with suitable architecture for your distributed microservices system. (Of course, this applies if you have essentially parallel system)
First of all, thanks for your question. It is similar to Main Problem Of Document DBs: how to sort collection by field from another collection? I have my own answer for that so i'll try to comment all your solutions:
Solution 1: It is good if client wants to work with Countries/Building/Floors independently. But, it does not solve problem you mentioned in Solution 2 - sorting 10k workers by building gonna be slow
Solution 2: Similar to Solution 1 if all client wants is a list enriched workers without knowing how to combine it from multiple pieces
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: Gonna be working, most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have x20 data.
Large complexity. 20 entities -> 20 different procedures to update related data
High cohesion. All your services must know each other. Data model change will propagate to every service because of update procedures
Questionable eventual consistency. It can be done so data will be consistent after failures but it is not going to be easy
Solution 5: Kind of answer :-)
But - you do not want everything. Keep separated services that serve separated entities and build other services on top of them.
If client wants enriched data - build service that returns enriched data, as in Solution 2.
If client wants to display list of enriched data with filtering and sorting - build a service that provides enriched data with filtering and sorting capability! Likely, implementation of such service will contain ES instance that contains cached and indexed data from lower-level services. Point here is that ES does not have to contain everything or be shared between every service - it is up to you to decide better balance between performance and infrastructure resources.
This is a case where Linked Data can help you.
Basically the Floor attribute for the worker would be an URI (a link) to the floor itself. And Any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
'#id': '/workers/87373',
name: 'John',
floor: {
'#id': '/floors/123'
}
}
floor = {
'#id': '/floor/123',
'level': 12,
building: { '#id': '/buildings/87' }
}
building = {
'#id': '/buildings/87',
name: 'John's home',
city: { '#id': '/cities/908' }
}
This way all the client has to do is append the BASE URL (like api.example.com) to the #id and make a simple GET call.
To remove the extra calls burden from the client (in case it's a slow mobile device), we use the gateway pattern with micro-services. The gateway can expand those links with very little effort and augment the return object. It can also do multiple calls in parallel.
So the gateway will make a GET /floor/123 call and replace the floor object on the worker with the reply.

REST - complex applications

I'm struggling to apply RESTful principles to a new web application I'm working on. In particular, it's the idea that to be RESTful, each HTTP request should carry enough information by itself for its recipient to process it to be in complete harmony with the stateless nature of HTTP.
The application allows users to search for medications. The search accepts filters as input, for example, return discontinued medicines, include complimentary therapy etc..etc. In total there are around 30 filters that can be applied.
Additionally, patient details can be entered including the patients age, gender, current medications etc.
To be Restful, should all this information be included with every request? This seems to place a huge overhead on the network. Also, wouldn't the restrictions on URL length, at least for GET, make this unfeasible?
The "Filter As Resource" is a perfect tact for this.
You can PUT the filter definition to the filter resource, and it can return the filter ID.
PUT is idempotent, so even if the filter is already there, you just need to detect that you've seen the filter before, so you can return the proper ID for the filter.
Then, you can add a filter parameter to your other requests, and they can grab the filter to use for the queries.
GET /medications?filter=1234&page=4&pagesize=20
I would run the raw filters through some sort of canonicalization process, just to have a normalized set, so that, e.g. filter "firstname=Bob lastname=Eubanks" is identical to "lastname=Eubanks firstname=Bob". That's just me though.
The only real concern is that, as time goes on, you may need to obsolete some filters. You can simply error out the request should someone make a request with a missing or obsolete filter.
Edit answering question...
Let's start with the fundamentals.
Simply, you want to specify a filter for use in queries, but these filters are (potentially) involved and complicated. If it was simple /medications/1234, this wouldn't be a problem.
Effectively, you always need to send the filter to the query. The question is how to represent that filter.
The fundamental issue with things like sessions in REST systems is that they're typically managed "out of band". When you, say, go and create a medication, you PUT or POST to the medications resource, and you get a reference back to that medication.
With a session, you would (typically) get back a cookie, or perhaps some other token to represent that session. If your PUT to the medications resource created a session also, then, in truth, your request created two resources: a medication, and a session.
Unfortunately, when you use something like a cookie, and you require that cookie for your request, the resource name is no longer the true representation of the resource. Now it's the resource name (the URL), and the cookie.
So, if I do a GET on the resource named /medications/search, and the cookie represents a session, and that session happens to have a filter in it, you can see how in effect, that resource name, /medications/search, isn't really useful at all. I don't have all of the information I need to make effective use, because of the side effect of the cookie and the session and the filter therein.
Now, you could perhaps rewrite the name: /medications/search?session=ABC123, effectively embedding the cookie in the resource name.
But now you run in to the typical contract of sessions, notably that they're short lived. So, that named resource is less useful, long term, not useless, just less useful. Right now, this query gives me interesting data. Tomorrow? Probably not. I'll get some nasty error about the session being gone.
The other problem is that sessions typically are not managed as a resource. For example, they're usually a side effect, vs explicitly managed via GET/PUT/DELETE. Sessions are also the "garbage heap" of web app state. In this case, we're just kind of hoping that the session is properly populated with what is needed for this request. We actually don't really know. Again, it's a side effect.
Now, let's turn it on its head a little bit. Let's use /medications/search?filter=ABC123.
Obviously, casually, this looks identical. We just changed the name from 'session' to 'filter'. But, as discussed, Filters, in this case, ARE a "first class resource". They need to be created, managed, etc. the same as a medication, a JPEG, or any other resource in your system. This is the key distinction.
Certainly, you could treat "sessions" as a first class resource, creating them, putting stuff in them directly, etc. But you can see how, at least from a clarity point of view, a "first class" session isn't really a good abstraction for this case. Using a session, its like going to the cleaners and handing over your entire purse or briefcase. "Yea, the ticket is in there somewhere, dig out what you want, give me my clothes", especially compared to something explicit like a filter.
So, you can see how, at 30,000 feet, there's not a lot of difference in the case between a filter and a session. But when you zoom in, they're quite different.
With the filter resource, you can choose to make them a persistent thing forever and ever. You can expire them, you can do whatever you want. Sessions tend to have pre-conceived semantics: short live, duration of the connection, etc. Filters can have any semantics you want. They're completely separate from what comes with a session.
If I were doing this, how would I work with filters?
I would assume that I really don't care about the content of a filter. Specifically, I doubt I would ever query for "all filters that search by first name". At this juncture, it seems like uninteresting information, so I won't design around it.
Next, I would normalize the filters, like I mentioned above. Make sure that equivalent filters truly are equivalent. You can do this by sorting the expressions, ensuring fieldnames are all uppercase, or whatever.
Then, I would store the filter as an XML or JSON document, whichever is more comfortable/appropriate for the application. I would give each filter a unique key (naturally), but I would also store a hash for the actual document with the filter.
I would do this to be able to quickly find if the filter is already stored. Since I'm normalizing it, I "know" that the XML (say) for logically equivalent filters would be identical. So, when someone goes to PUT, or insert a new filter, I would do a check on the hash to see if it has been stored before. I may well get back more than one (hashes can collide, of course), so I'll need to check the actual XML payloads to see whether they match.
If the filters match, I return a reference to the existing filter. If not, I'd create a new one and return that.
I also would not allow a filter UPDATE/POST. Since I'm handing out references to these filters, I would make them immutable so the references can remain valid. If I wanted a filter by "role", say, the "get all expire medications filter", then I would create a "named filter" resource that associates a name with a filter instance, so that the actual filter data can change but the name remain the same.
Mind, also, that during creation, you're in a race condition (two requests trying to make the same filter), so you would have to account for that. If your system has a high filter volume, this could be a potential bottleneck.
Hope this clarifies the issue for you.
To be Restful, should all this information be included with every request?
No. If it looks like your server is sending (or receiving) too much information, chances are that there are one or more resources you haven't yet identified.
The first and most important step in designing a RESTful system is to identify and name your resources. How would you do that for your system?
From your description, here's one possible set of resources:
User - a user of the system (maybe a doctor or patient (?) - Role might need to be exposed as a resource here)
Medication - the stuff in the bottle, but it also might represent the kind of bottle (quantity and contents), or it might represent a particular bottle - depending on if you're a pharmacy or just a help desk.
Disease - the condition for which a Patient might want to take a Medication.
Patient - a person who might take a Medication
Recommendation - a Medication that might be beneficial to a Patient based on a Disease they suffer from.
Then you could look for relationships among resources;
User has and belongs to many Roles
Medication has and belongs to many Diseases
Disease has many Recommendations.
Patient has and belongs to many Medications and Diseases (poor chap)
Patient has many Recommendations
Recommendation has one Patient and has one Disease
The specifics are probably not right for your particular problem, but the idea is simple: create a network of relationships among your resources.
At this point it might be helpful to think about URI structure, although keep in mind that REST APIs must be hypertext-driven:
# view all Recommendations for the patient
GET http://server.com/patients/{patient}/recommendations
# view all Recommendations for a Medication
GET http://servier.com/medications/{medication}/recommendations
# add a new Recommendation for a Patient
PUT http://server.com/patients/{patient}/recommendations
Because this is REST, you'll spend most of your time defining the media types used to transfer representations of your resources between client and server.
By exposing more resources, you can cut down on the amount of data that needs to be transferred during each request. Also notice there are no query parameters in the URIs. The server can be as stateful as it needs to be to keep track of it all, and each request can be fully self-contained.
REST is for APIs, not (typical) applications. Don't try to wedge a fundamentally stateful interaction into a stateless model just because you read about it on wikipedia.
To be Restful, should all this information be included with every request? This seems to place a huge overhead on the network. Also, wouldn't the restrictions on URL length, at least for GET, make this unfeasible?
The size of parameters is usually insignificant compared to the size of resources the server sends. If you're using such large parameters that they are a network burden, place them on the server once and then use them as resources.
There are no significant restrictions on URL length -- if your server has such a limit, upgrade it. It's probably years old and chock-full of security vulnerabilities anyway.
No all of that does not have to be in every request.
Each resource (medication, patient history, etc) should have a canonical URI that uniquely identifies it. In some applications (eg, Rails-based ones) this will be something like "/patients/1234" or "/drugs/5678" but the URL format is unimportant.
A client that has previously obtained the URI for a resource (such as from a search, or from a link embedded in another resource) can retrieve it using this URI.
Are you working on a RESTful API that other apps will use to search your data? Or are you building a end-user focused web application where users will log in and perform these searches?
If your users are logging in, then you're already stateful as you'll have some type of session cookie to maintain the logged in state. I would go ahead and create a session object that contains all the search filters. If a user hasn't set any filters, then this object will be empty.
Here's a great blog post about using GET vs POST. It mentions a URL length limit set by Internet Explorer of 2,048 characters, so you want to use POST for long requests.
http://carsonified.com/blog/dev/the-definitive-guide-to-get-vs-post/

Resources