CouchDB, _all_docs and filtering design documents with endkey

First, this question - filtering design documents out of _all_docs - already seemed to be solved, as described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and it worked at first. However, suddenly in a different setup (actually just a different deploy), the query only returns an empty collection []. It seems like the ordering changed: without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot manage to filter out the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter on the other hand will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore and is there a more convenient, still working way?

/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc ids start with lowercase letters (the default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of "ABC": you will see it show up before the design doc, and your trick to filter design docs would work in this case.
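A quick way to convince yourself of this ordering is a plain code-point sort, which is effectively what _all_docs uses (a minimal Python sketch; the "_design/app" id is made up):

# '_' (0x5F) sorts after uppercase letters but before lowercase ones,
# so a design doc id lands between "ABC" and "abc".
print(sorted(["abc", "_design/app", "ABC"]))
# -> ['ABC', '_design/app', 'abc']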
However, I recommend you stop using the _all_docs view altogether. Instead, use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function (doc) {
  emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
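For illustration, creating and querying such a view over HTTP might look like this (a sketch in Python, assuming a local CouchDB and a database named mydb; the design doc and view names app/all are made up):

import requests

BASE = "http://localhost:5984/mydb"

# Store the map function above in a design document.
requests.put(BASE + "/_design/app", json={
    "views": {"all": {"map": "function (doc) { emit(doc._id, null); }"}}
})

# Design docs are skipped by the view, so this returns only
# regular documents - no startkey/endkey tricks needed.
rows = requests.get(BASE + "/_design/app/_view/all",
                    params={"include_docs": "true"}).json()["rows"]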
Also, please look at the Unicode collation order; this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

Related

Kibana (elasticsearch visualization) - how to add a plot based on a sub-string of a field?

I have a field in my logs called json_path containing data like /nfs/abc/123/subdir/blah.json and I want to create a count plot on part of the string (abc here), i.e. the third chunk when splitting on the token /. I have tried all sorts of online answers, but they're all partial answers (nothing I can easily understand how to use or integrate). I've tried running POST/GET queries in the Console, which all failed due to syntax errors I couldn't manage to debug (they were complaining about newline control characters, when there were none that I could see, even in a text editor that explicitly shows control characters). I also tried Management -> Index Patterns -> Scripted Field, but after adding my code there, basically the whole of Kibana crashed (stopped working temporarily) until I removed that Scripted Field.
All this elasticsearch and kibana stuff is annoyingly difficult, all the docs expect you to be an expert in their tool, rather than just an engineer needing to visualize some data.
I don't really want to add a new data field in my log-generation code, because then all my old logs will be unsupported (which have the relevant data, it just needs that bit of string processing before data viz). I know I could probably back-annotate the old logs, but the whole Kibana/elasticsearch experience is just frustrating and I don't use it enough to justify learning such detailed procedures (I actually learned a bunch of this stuff a year ago, and then promptly forgot it due to lack of use).
You cannot plot on a substring of a field unless you extract that substring into a new field. I understand the frustration of learning a new product, but to achieve what you want you need that substring value in a new field. Scripted fields are generally used to modify a field. To extract a substring from a field, I'd recommend using an Ingest Node processor such as the grok processor. This will add a new field which you can use to plot in Kibana visualizations.
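To make that concrete, here is a hedged sketch of such an ingest pipeline, created over HTTP from Python (the pipeline name json_path_split, the target field name dataset, and the exact grok pattern are all assumptions based on the /nfs/abc/... example):

import requests

pipeline = {
    "description": "Extract a path segment from json_path",
    "processors": [{
        "grok": {
            "field": "json_path",
            # Captures "abc" from /nfs/abc/123/subdir/blah.json
            # into a new "dataset" field.
            "patterns": ["^/%{WORD}/%{WORD:dataset}/"],
        }
    }],
}
requests.put("http://localhost:9200/_ingest/pipeline/json_path_split",
             json=pipeline)

# Documents indexed with ?pipeline=json_path_split will then carry a
# "dataset" field that Kibana visualizations can aggregate on.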

Kofax: Separate Main Invoice from Supporting Document without using Separator sheet

When a batch gets created, documents should get separated automatically, without using a separator sheet or barcode separator.
How can I classify documents into invoices and supporting documents?
In our project we get many invoices with supporting documents, so the scanning person has to insert the separator sheets manually; to avoid this, we want to automatically classify the supporting documents.
In general the concept would be that you would enable separation in the project and then train your classes with examples to be used for the layout or content classifiers.
However, as I'm sure you've seen, the obstacle with invoices is that they are different enough between vendors that it would not reliably classify all to an Invoice class. Similarly with "Supporting Documents" which are likely to be very different from each other, so unfortunately there isn't a completely easy answer without separator sheets (or barcode stickers affixed to supporting docs).
What you might want to do is write code in one of the separation events, like the Document_AfterSeparate event. Despite the name, the document has not yet been split at this point, but the classifiers have run. See the Scripting Help topic "Server Script Events Sequence > Document Separation > Standard Document Separation" for more detail. Setting the SplitPage property on the CDocPage (pXDoc.CDoc.Pages.ItemByIndex(lPage).SplitPage) will allow you to use your own logic to determine which pages to separate.
For example if you know that you will always have single page invoices, you can split on the first page and classify accordingly. Or you can try to search for something that indicates the end of the invoice like "Total" or other characteristics. There is an example of how you can use locators to help separation in the Scripting Help topic "Script Samples > Use Locator Results for Standard Document Separation". The example uses a Barcode Locator, but the same concept works if you wanted to try it with a Format Locator or anything else.
Without separator sheets you will need smart classification software like Kofax Transformation Module (KTM). It's kind of expensive, so you will need to verify the cost savings and ROI.

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with ElasticSearch yet, but from what I've read I think it suits most of my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into the search engine and offer a front-end interface to search the scraped content. I'm building a sort of vertical search engine.
But as we all know, web pages of one host often only contain a little unique content; a great part of each page is common. The footer, header, menu etc. are the same on every page.
Does ElasticSearch have some built-in intelligence that can filter out the common parts and only search the real content?
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy to adopt here is to create a unique key per document: taking a checksum using SHA-1 or a similar algorithm will do the job of generating the unique key. Make this the document ID so that only one copy of a page exists at any point in time. Then use the _create API to index if you don't want new duplicates to be indexed (more efficient), and in case you want the new version to become the document, use normal indexing.
In case you need to modify the original document on discovery of a duplicate document, use upsert.
I have explained a great deal of this in this blog.
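A minimal sketch of the checksum-as-ID idea (assuming a recent Elasticsearch on localhost; the index name pages is made up):

import hashlib
import requests

def index_page(content: str) -> bool:
    # The content checksum becomes the document ID, so identical
    # pages always map to the same ID.
    doc_id = hashlib.sha1(content.encode("utf-8")).hexdigest()
    # _create fails with 409 Conflict when the ID already exists,
    # so known duplicates are skipped rather than re-indexed.
    resp = requests.put(
        "http://localhost:9200/pages/_create/" + doc_id,
        json={"content": content},
    )
    return resp.status_code == 201  # True only for newly indexed pages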

to_tsquery() validation

I'm currently developing a website that allows searching a PostgreSQL database. The search works with to_tsquery() and I'm trying to find a way to validate the input before it is sent as a query.
Other than that, I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find you articles that have all 3 words, regardless of where they might appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the tsquery parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably will involve table access) each time it validates, you could use the input in a smaller query that only parses it. Here is a runnable version of that idea (sketched with Python and psycopg2; any driver with similar error handling would do):
import psycopg2

def is_valid_query(conn, user_input):
    try:
        with conn.cursor() as cur:
            # Parses the input without touching any tables.
            cur.execute("SELECT to_tsquery(%s)", (user_input,))
        return True
    except psycopg2.DatabaseError:
        conn.rollback()  # clear the aborted transaction state
        return False
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
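Continuing the psycopg2 sketch above, the two-step approach might look like this (the documents table and body column are made up for illustration):

import re

PHRASE_RE = re.compile(r"hello|i\s+like\s+cats", re.IGNORECASE)

def search_with_phrase(conn):  # conn: a psycopg2 connection, as above
    with conn.cursor() as cur:
        # Step 1: loose, index-friendly query.
        cur.execute(
            "SELECT id, body FROM documents WHERE to_tsvector(body) "
            "@@ to_tsquery('hello | (i & like & cats)')"
        )
        rows = cur.fetchall()
    # Step 2: keep only rows where the word or the exact phrase occurs.
    return [row for row in rows if PHRASE_RE.search(row[1])]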
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in ts_vectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.
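For what it's worth, later PostgreSQL releases (9.6+) added native phrase search via phraseto_tsquery() and the <-> (FOLLOWED BY) operator, which makes the regex step unnecessary; reusing the cursor and the hypothetical documents table from the sketch above:

# "hello" OR the exact phrase "i like cats", evaluated in PostgreSQL.
# || combines two tsqueries with OR.
cur.execute(
    "SELECT id, body FROM documents WHERE to_tsvector(body) @@ "
    "(to_tsquery('hello') || phraseto_tsquery('i like cats'))"
)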

What is a proper way of naming message properties in i18n?

We have a website which should be translated into different languages. Some of the wording is in message properties files, ready for translation. I now want to add the rest of the text to these files.
What is a good way to name the text blocks?
<view>.<type>.<name>
We mostly have webpages, and some of the elements/modules are repeated on several pages.
As far as I know, no "standard" exists, so it is pretty hard to tell what is a proper and what is an improper way of naming resource keys. However, based on my experience, I can recommend this way:
property file name: <module>.properties
resource keys: <view or dialog>[.<sub-context>].<control-type>.<name>
One may discuss whether it is proper to put every string from one module into one property file - it is probably right if updates don't happen often and there are not many messages. Otherwise you might think about one file per view.
As for the key naming strategy: it is important for the Translator (sounds like the film with the honorable governor Arnold S., doesn't it?) to have a Context. The translation may actually depend on it; i.e. in Polish you would translate a message in a different way if it is a page/dialog/whatever title, and in a totally different way if it is text on a button.
One example of such resource key could be:
preferences.password_area.label.username=User name
It gives enough hints to the Translator about what it actually is, which should result in a correct translation...
We have come up with the following key naming convention (Java, btw) using dot notation and camel case:
Label Keys (form labels, page/form/app titles, etc...i.e., not full sentences; used in multiple UI locations):
If the label represents a Java field (i.e., a form field) and matches the form label: label.nameOfField
Else: label.sameAsValue
Examples:
label.firstName = First Name
label.lastName = Last Name
label.applicationTitle = Application Title
label.editADocument = Edit a Document
Content Keys:
projectName.uiPath.messageOrContentType.n.*
Where:
projectName is the short name of the project (or a derived name from the Java package)
uiPath is the UI navigation path to the content key
messageOrContentType (e.g., added, deleted, updated, info, warning, error, title, content, etc.) should be added based on the type of content. Example messages: (1) The page has been updated. (2) There was an error processing your request.
n.* handles the following cases: when there are multiple content areas on a single page (e.g., when the content is separated by an image, etc.), when content is in multiple paragraphs, or when content is in an (un)ordered list - a numeric identifier should be appended. Example: ...content.1, ...content.2
When there are multiple content areas on a page and one or more need to be further broken up (based on the HTML example above), a secondary numeric identifier may be appended to the key. Example: ...content.1.1, ...content.1.2
Examples:
training.mySetup.myInfo.content.1 = This is the first sentence of content 1. This is the second sentence of content 1. This content will be surrounded by paragraph tags.
training.mySetup.myInfo.content.2 = This is the first sentence of content 2. This is the second sentence of content 2. This content will also be surrounded by paragraph tags.
training.mySetup.myInfo.title = My Information
training.mySetup.myInfo.updated = Your personal information has been updated.
Advantages / Disadvantages:
+ Label keys can easily be reused; location is irrelevant.
+ For content keys that are not reused, locating the page on the UI will be simple and logical.
- It may not be clear to translators where label keys reside on the UI. This may be a non-issue for translators who do not navigate the pages, but may still be an issue for developers.
- If content keys must be used in more than one location on the UI (which is highly likely), the key name choice will not make sense in the other location(s). In our case, management is not concerned with a duplication of values for content areas, so we will be using different keys (to demonstrate the location on the UI) in this case.
Feedback on this convention - especially feedback that will improve it - would be much appreciated since we are currently revamping our resource bundles! :)
I'd propose the convention below:
functionalcontext.subcontext.key
logicalcontext.subcontext.key
This way you can logically group all the common messages in a super context (id in the example below). There are a few things that aren't specific to any functional context (like lastName, etc.), which you can group into a logical context.
order.id=Order Id
order.submission.submit=Submit Order
name.last=Last Name
The method that I have personally used, and that I've liked most so far, is using the sentence to localise as the key. For example (please replace T with the right syntax depending on the language):
print(T("Hello world"))
In this case T will search for the key "Hello world". If it is not found, the key itself is returned; otherwise, the value of the key.
In this way, you do not need to edit the messages (in your default language), at least unless you need to use parameters.... It saved me a LOT of dev time.
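A minimal sketch of that T() lookup in Python (the dictionary layout is made up; a real implementation would usually load it from translation files):

# Per-language translation tables keyed by the default-language sentence.
TRANSLATIONS = {
    "it": {"Hello world": "Ciao mondo"},
}

def T(sentence, lang="it"):
    # Fall back to the sentence itself when no translation exists,
    # so untranslated keys still render in the default language.
    return TRANSLATIONS.get(lang, {}).get(sentence, sentence)

print(T("Hello world"))    # -> "Ciao mondo"
print(T("Goodbye world"))  # -> "Goodbye world" (not translated yet)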
