Cluster list of comments - text-analysis

I am a beginner who has just discovered the great Carrot2 framework.
I am trying to use the Java API to cluster a list of Facebook comments (~100 comments of about 10-200 characters each) with the LingoClusteringAlgorithm. Can I use the comment text as the "title" field and leave the "snippet", "url" and "query" fields empty? Or is there a better way (e.g. using the comment text twice, for both "title" and "snippet")?

At least one of "title" or "snippet" must be non-empty, so you can put the comment text in "snippet" and leave "title" blank. It shouldn't matter which of the two you leave blank; the results should be the same in both cases.
The "url" field is used only for display purposes and does not affect the clustering results. You can leave it empty, or put the direct link to the post there if you plan to use it in the UI.

Related

FileNet Social Collaboration - search by comments

We have social collaboration enabled on our FileNet system. I can add comments, tags and likes, and track how many times a document has been downloaded. These features are nice. When I tag a document, I can search for documents by the tag text.
Example: if I tag a document as "test", I can use a search template to find the document by its tag value, i.e. test.
When I comment, however, I can't search for the document by the comment text.
Say I added the comment "good doc". I can't search by that text. Instead I have to provide an integer value such as 1, and the search then means "get all documents whose number of comments = 1". I don't want this behavior; I want to be able to search on the comment text.
Can anybody help on this?
One way to achieve this would be to use CBR (content-based retrieval) on the property. See how to enable CBR on a property.
The property will then be full-text searchable using the CONTAINS statement, see doc.
Optionally (but I'm not sure, as I've never personally used it), the satisfies operator might be exactly what you're looking for according to the documentation.

Google Place API keyword vs name

I am using the Google Places API and I am making a request that looks like this.
https://maps.googleapis.com/maps/api/place/radarsearch/json?location=[LOCATION]&radius=500&keyword=[STORE_NAME]&key=[API_KEY]
The issue is that when I use the keyword parameter in the request, the value I provide is matched against the name, the address and any other content Google has for a place. How can I give the store name and have it searched only in the name field? I don't care about matches in the address.
Let me know if this doesn't make sense.
Places API Radar Search has just been deprecated, and will be retired in a year.
Please see full details in blog post Removing Place Add, Delete & Radar Search features.
On a related note, for both Radar Search and Nearby Search, the name parameter is just equivalent to keyword:
name — A term to be matched against all content that Google has indexed for this place. Equivalent to keyword. The name field is no longer restricted to place names. Values in this field are combined with values in the keyword field and passed as part of the same search string. We recommend using only the keyword parameter for all search terms.
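Since neither parameter restricts matching to the place name, one possible workaround is to filter the response on the client. The following is only a sketch, not something the API offers: it assumes a Nearby Search style response in which each result carries a name field, and filterByName and the sample data are hypothetical.

```javascript
// Keep only places whose name contains the store name (case-insensitive).
// Assumes each result object has a string `name` field.
function filterByName(results, storeName) {
  const needle = storeName.trim().toLowerCase();
  return results.filter(
    (place) =>
      typeof place.name === "string" &&
      place.name.toLowerCase().includes(needle)
  );
}

// Hypothetical sample: the second place matches "acme" only in its address.
const samplePlaces = [
  { name: "Acme Coffee", vicinity: "12 Main St" },
  { name: "Corner Bakery", vicinity: "5 Acme Road" },
];

console.log(filterByName(samplePlaces, "acme").map((p) => p.name));
```

The request itself would still use keyword; the filter only discards results that matched on the address or other content.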

CouchDB, all_docs and filter design documents with endkey

First, this question (filtering design documents out of all_docs) already seemed to be solved, as described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and it worked at first. However, in a different setup (actually just a different deploy), the query suddenly returns only an empty collection []. It seems the ordering has changed: without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot manage to filter out the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter on the other hand will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore and is there a more convenient, still working way?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc id starts with lowercase letters (default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of "ABC". You will see it show up before the design doc, and your trick to filter out design docs would work in this case.
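The ordering described above can be reproduced outside CouchDB with a small sketch. It compares strings by code unit, which matches ASCII order for ASCII-only ids; the ids are made up, and the filter only imitates what endkey="_" does, it is not CouchDB code.

```javascript
// Compare two strings by UTF-16 code unit (ASCII order for ASCII-only ids).
function asciiCompare(a, b) {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

// Hypothetical document ids, including one design document.
const ids = ["abc", "_design/app", "ABC", "zebra", "42"];
const sortedIds = [...ids].sort(asciiCompare);

// Imitates endkey="_": keep only ids that sort at or before "_".
const beforeUnderscore = sortedIds.filter((id) => id <= "_");

console.log(sortedIds);        // digits, uppercase, "_...", lowercase
console.log(beforeUnderscore); // only the ids that sort before "_"
```

With only lowercase ids in the database, the second list would be empty, which matches the empty result seen with endkey="_".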
However, I recommend you stop using the _all_docs view altogether. Instead use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function (doc) {
  emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
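As a rough sketch of how such a view might be stored and queried: the map function lives as a string inside a design document, and the view is read from a path under that design document. The names "app" and "by_id", the database name and the viewPath helper are placeholders, not anything from the question.

```javascript
// Map function from the answer, stored as a string in the design document.
const mapAllDocs = "function (doc) { emit(doc._id, null); }";

// Hypothetical design document; "app" and "by_id" are placeholder names.
const designDoc = {
  _id: "_design/app",
  views: {
    by_id: { map: mapAllDocs },
  },
};

// Build the query path for a view of a design document in a database.
function viewPath(db, ddocId, viewName, includeDocs) {
  const ddocName = ddocId.replace(/^_design\//, "");
  const base = `/${db}/_design/${ddocName}/_view/${viewName}`;
  return includeDocs ? `${base}?include_docs=true` : base;
}

console.log(viewPath("mydb", designDoc._id, "by_id", true));
```

Because the view emits null as the value, adding include_docs=true is what brings the document bodies back with each row.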
Also, please look at Unicode Collation order, this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

How is the search snippet and meta "description" of a post determined?

When you search on Stack Overflow, in most cases a search snippet (the first 40 words or so of a post / question) is shown. In some cases more text is shown, and this text includes the search terms. Both blocks of text end with an ellipsis.
If you look at the meta tag "description" or "og:description", a similar text is included, which lets Google index the page correctly.
My questions:
What search engine is Stack Overflow using (Elasticsearch / Lucene)?
How and when is the search snippet determined (in real time during the search, or when a post / question is saved)?
How and when is the meta description determined?
I ask these questions because I want to avoid writing my own algorithm to extract the first 40 words or so of an HTML article (in our case a blog post).
thx
Marc
Stack Overflow uses Elasticsearch.
Elasticsearch has a highlighting feature that takes care of these things.
The snippet is determined at search time, in order to find the fragment that is most likely to be relevant to the user's query.
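To illustrate what search-time highlighting looks like, here is a minimal sketch of a search request body that asks Elasticsearch for one highlighted fragment per hit. The field name "body", the fragment size and the buildSearchBody helper are assumptions for the example; nothing here reflects Stack Overflow's actual configuration.

```javascript
// Build a search request body that matches `terms` in `field` and asks
// for a single highlighted fragment of that field per hit.
function buildSearchBody(terms, field) {
  return {
    query: { match: { [field]: terms } },
    highlight: {
      fields: {
        // fragment_size is in characters; 150 is an arbitrary choice.
        [field]: { fragment_size: 150, number_of_fragments: 1 },
      },
    },
  };
}

const searchBody = buildSearchBody("carrot2 clustering", "body");
console.log(JSON.stringify(searchBody, null, 2));
```

Each hit in the response would then carry a highlight section with the fragment, so the snippet does not have to be precomputed when the post is saved.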

Marklogic Autocomplete feature

I am following the MarkLogic tutorial for the Oscars example to develop an application for the documents I have ingested into the database.
What I don't understand is that although the search box performs autocomplete, it does not do so for element values. That is, if I type Cha... it should start suggesting all names starting with Cha, such as Charles, Charley etc.
As shown in the figure, I can type Decade (which gets autocompleted) and then select one of 1920s, 1930s, etc.
But I don't want to specify the field name like that. I just want to type an actor name and get suggestions for it.
I have looked in the documentation, which says that the search:suggest function can do this, but I am new to XQuery and don't know how to proceed.
Do I need to modify this function or add something to it? How?
If you enter a full-text search term, then autocomplete works on words and phrases from the full-text index. If you prepend a search field keyword, then the autocomplete limits to that.
I don't know the search field keywords by name, but I'd guess they are award:, decade: and winners:. So, if you type in decade:, then autocomplete should come up with decades only.
--edit--
Based on your comment, it sounds like you want to change the source for autocomplete when you don't specify a search field. That is very easy. If you start the wizard to create an Oscar Example application, that option is on the first screen. You can also revisit the same wizard from the Application Builder after creation to apply changes.
Just open that wizard, go to the Search step, and look for the 'Advanced Settings' button. In the middle of the overlay screen there should be a caption called 'Suggestions', and below it a drop-down to specify the Default Source, which is the source for autocompletion when you don't prefix your search term. Change that to 'name' if you want unprefixed terms to autocomplete against actor names.
HTH!

Resources