How to ignore "stop words" while sorting in MarkLogic? - sorting

Is there any way to ignore "stop words" while sorting.
For example:
I have words like
dixit
singla
the marklogic
On sorting in descending order the result should be
singla, the marklogic, dixit
As in the above example the is ignored.
Any way to achieve this?
Update:
Stop word can occur at any place.
for example
the MarkLogic
MarkLogic is the best
the MarkLogic is awesome
while sorting should not consider any stop word in the text.
Above is just a small example to describe the problem.
In actual I am using search:search API.
For sorting, I am using sort-order search options.
The element on which I have to perform sorting is dynamic. There are approx 30-35 elements.
Is there any way to customize the collation at this level like to configure some words (stop words) which will be ignored while sorting.

There is no standard collation URI that is going to do this for you (at least none that I've ever seen). You can do it dynamically, of course, by sorting on the result of a function invocation, but if you want it done efficiently at scale (and available to search:search), then you need to materialize the sortable string into your document. I've often done this as an attribute on the element:
<title sortable="Great Gatsby, The">The Great Gatsby</title>
Then you put a range index on the title/#sortable attribute.
You can also use the "envelope pattern" where materialized metadata like this is maintained in its own section of the document with the original kept in its own section. For things like this, I think it's a bit more elegant to decorate the elements directly, to keep the context.

If I understand your question correctly you're trying to get rid of the definite article when sorting your result-set.
In order to do this you need to use some additional functions and create a 'sort' criteria. My solution would look like this (I'm also including some sample documents so that you can test this just by copy-pasting):
(:
xdmp:document-insert("/peter.xml", <person><firstName>Peter</firstName><lastName>O'Toole</lastName><age>60</age></person>);
xdmp:document-insert("/john.xml", <person><firstName>John</firstName><lastName>Adams</lastName><age>18</age></person>);
xdmp:document-insert("/simon.xml", <person><firstName>Simon</firstName><lastName>Petrov</lastName><age>22</age></person>);
xdmp:document-insert("/mark.xml", <person><firstName>Mark</firstName><lastName>the Lord</lastName><age>25</age></person>);
:)
for $person in /person
let $sort := fn:reverse(fn:tokenize($person/lastName, ' '))[1]
order by $sort
(: return $person :)
return $person/lastName/text()
Notice that now the sort order is going to be
- Adams
- the Lord
- O'Toole
- Petrov
I hope this will help.

Related

IFTTT JavaScript filter - How to make case insensitive searches + How to search Include and Exclude sets of terms

First off I'm a total novice for Javascript, so please go gently. I'm aware of how people feel about having to now pay for IFTTT, but it's perfect for what I need.
I am using a more expansive version of this code below to capture certain keywords from Tweets to then generate emails if the search returns a positive result. This search works very nicely, except it is case sensitive which is a problem.
Yes, I know you can manipulate the twitter search to pick up specific words or phrases. I am very proficient in achieving searches this way. I am casting a wide net to pick up approx 120 search words or phrases which is too long to achieve through "OR" Twitter search parameters alone which is why I'm using this.
Q1 - I have tried adding item.toLowerCase() and just .toLowerCase() in various parts of the code so it wouldn't matter if the sentence case of the search term is different to that of the original tweet text case. I just can't get it to work though. I've seen various posts on here but I can't get any of them to work in IFTTT. I believe IFTTT doesn't accept REGEX either, which is annoying.
Any advice of how to get this code running so it's case-insensitive for text within IFTTT?
Q2 - I have approx 120 search terms for the tweet text to return positive results. There is a lot of junk that comes through with that. Does anyone know how to add a second layer of 'and exclude' search terms?
I have something like 300-400 words and specific phrases which would be used to stop the email from being triggered - so it'd be something like "IF tweet text contains a, b, c BUT text ALSO contains x, y, z... do not send the email"
let str=Twitter.newTweetFromSearch.Text;
let searchTerms=[
"Northbound",
"Westbound",
"Southbound",
"Eastbound"
]
let foundOne=0;
if(searchTerms.some(function(v){return str.indexOf(v)>=0;})){
foundOne=1;
}
if(foundOne==0){
Email.sendMeEmail.skip();
}
I have looked at the Twitter API, but that is a step too far for my coding ability which is why I'm using IFTTT.
Any help is very much appreciated
Thank you.
I'm playing with IFTTT Filter myself at the moment, so here are some thoughts about solving your solution.
If you want to do a case insensitive seatch on the original text, convert the original text to lowercase, then have all your search terms in lowercase.
Plus I think you want to iterate over the searchTerms array, and use the includes() method. Ok, just realised that .some() does the iteration for you, but I prefer includes() over indexof().
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
"northbound",
"westbound",
"southbound",
"eastbound"
]
let foundOne=0;
if(searchTerms.some(function(term){return str.includes(term);})){
foundOne=1;
}
if(foundOne==0){
Email.sendMeEmail.skip();
}
Or you could just skip having the foundOne variable, and do the search in the if() statement.
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
"northbound",
"westbound",
"southbound",
"eastbound"
]
if(!searchTerms.some(function(term){return str.includes(term);})){
Email.sendMeEmail.skip();
}

How to perform Sorting using flow steps in WebMethods (Software AG)?

I am new to this Middleware and I tried my level best to perform sorting using the flow steps in designer but couldn't make it.Can anybody help me out by giving me direction for how to complete my work?(like the flow steps in order and where i can put the conditions and all)
Thanks.
No need to over-complicate it - use the utilities - pub.document:sortDocuments is what you are looking for.
If you receive stringList as input - convert this into a documentList. This can be done using pub.list:stringListToDocumentList (set the key to 'value')
Use pub.document:sortDocuments to sort the documentList. Remember to specify the key as 'value' once again and compareStringAs as 'numeric'. The order can also be set (ascending/descending)
What do you want to Sort? For Document-Lists you will find a built-in services in the WmPublic Folder.
For String-Lists i would use a Java-Service for Sorting.
Logic behind Sorting in webMethods is same as all other languages. You need LOOP to iterate every string in stringList, BRANCH to compare the two number and then map to the compare result to new StringList.
What format do you have the numbers in? Are they in a flat file or in a string list etc.

Solr query conundrum

I've recently swapped from using Lucene for Sitecore to Solr.
For the most part it has been smooth, but the way I was writing some queries (using Sitecore.ContentSearch.Linq) abstraction now don't seem to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (000)
X
Y
Z
Regions (ID: 111)
Region 1 (ID: 221)
A
B
Region 2 (ID: 222)
D
My code worked on Lucene, but now doesn't on Solr. It should find all "global" and a single region's content, excluding all other region's content. So as an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field for each item in the index called "_path" which is a multivalued string field of IDs, so as an example, Region 1's _path field value would be [000, 111, 221 ].
When I write this using the Linq abstraction it comes out as below which doesn't return results.
-_path:(111) OR _path:(221)
But _path:(111) does return result. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant) it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is I can't add these extra brackets as I'm working in an abstraction Linq so the brackets will be "optimized" out.
Any advice would be awesome! Cheers.
The problem here is, lucene's negative queries don't work like you think they do. They only remove results from what has been found. -_path:111 doesn't find all documents which aren't in 111, it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which from your heirarchy, I assume is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: Don't use Lucene's AND/OR/NOT syntax, if you can help it. Use +/-. +/- syntax actually expresses how the query operates, AND/OR/NOT doesn't. It attempts to shoehorn it into a different, SQL-like retrieval model and leads to some unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it just gets some results, but just seems to get the same results as _path:221: The reason is -_path:111 gets no results, so your query is, in practice, something like: (nothing) OR (_path:221), which is equivalent to _path:221
If it really does get the results you expect (I'm guessing it probably does): Something is translating your query into something like: (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure in this case. Essentially, it puts a match-all in front of any lonely negative queries it finds, allowing them to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. But lucene is an inverted index, it does well with finding matches on a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much have to do a full scan of the index.

Find HTML Tags in Properties

My current issue is to find HTML-Tags inside of property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by #jcr:score
It looks like there is a problem with the chars < and > because this query finds everything which has strong in it's property. It finds <strong>Some Text</strong> but also This is a strong man.
Also the Query Builder API didn't helped me.
Is there a possibility to solve it with a XPath or SQL Query or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
If you have an specific jcr:primaryType and the targeted properties you can do something like this
select * from nt:unstructured where text like '%<strong>%'
I tested it , but you need to know the properties you are intererested in.
This is jcr-sql syntax
Start using predicates like a champ this way all of this will make sense to you!
HTML Encode <strong>
HTML Decimal <strong>
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(#text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored on simple properties on the node in question. The query will then be a lot simpler with just a property = value query, than lots of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with a OSGI service, that can accurately save these properties for you. Every time the HTML is changed the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.

Sorting by counting the intersection of two lists in MongoDB

We have a posting analyzing requirement, that is, for a specific post, we need to return a list of posts which are mostly related to it, the logic is comparing the count of common tags in the posts. For example:
postA = {"author":"abc",
"title":"blah blah",
"tags":["japan","japanese style","england"],
}
there are may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
so basically, postB gets 2 counts, postC gets 1 counts when comparing to the tags in the postA. postD gets 0 and will not be included in the result.
My understanding for now is to use map/reduce to produce the result, I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way like custom sorting function to work it out? I'm currently using the pymongodb as I'm python developer.
You should create an index on tags:
db.posts.ensure_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({_id: {$ne: postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by intersection in Python:
key = lambda post: len(tag for tag in post['tags'] if tag in postA['tags'])
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately there's no way to sort and limit by the size of the intersection using Mongo itself.

Resources