Chart multiple fields in Kibana - elasticsearch

I am trying to create a pie chart in Kibana (V2.3.1) which displays values from multiple fields.
Let's say I have documents representing humans, with the following fields (each representing whether a finger is bent or straight):
Human 1:
human.right_arm.thumb = bent
human.right_arm.pinky = straight
human.left_arm.thumb = straight
human.left_arm.pinky = half-bent
Human 2:
human.right_arm.thumb = straight
human.right_arm.pinky = bent
human.left_arm.thumb = half-bent
human.left_arm.pinky = half-bent
Now I want to create a pie chart of the status of all the fingers. The result would look like:
bent (= 2) = 25% coverage of the pie
straight (= 3) = 37.5% coverage of the pie
half-bent (= 3) = 37.5% coverage of the pie
In Kibana I can only split a chart on one field. So how do I combine the results of all the fingers? And how can I get the same status breakdown, but for all the thumbs only?
I think scripted fields are the way to go, but I cannot figure out how: as far as I can see, the aggregation only combines the results of single fields, while it should represent a set of fields ("all fingers" or "all thumbs").
I searched the web and found similar issues but never a clear answer.
If necessary I can make changes in Logstash. We use the ruby/code filter to define these fields.
Note: Sadly I am not able to update our ELK stack to a newer version.

Can you make the state of the finger a separate aggregatable field? Then you'll be able to create a pie chart with a count metric and split the slices by terms, choosing the field that holds the finger state.
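Since you mentioned you can make changes in Logstash's ruby filter, one way to get such a field (an untested sketch; the field names and the legacy event['...'] API of Logstash 2.x are assumptions) is to emit one event per finger, so that every finger becomes its own countable document:

filter {
  ruby {
    code => "
      # collect each finger into an array of small sub-documents
      fingers = []
      ['right_arm', 'left_arm'].each do |arm|
        ['thumb', 'pinky'].each do |name|
          state = event['[human][' + arm + '][' + name + ']']
          fingers << { 'arm' => arm, 'name' => name, 'state' => state } unless state.nil?
        end
      end
      event['finger'] = fingers
    "
  }
  # one event per array element: each finger becomes a separate document
  split {
    field => "finger"
  }
}

A terms split on finger.state then yields exactly your 2/3/3 pie, and adding a query filter like finger.name:thumb restricts the same chart to the thumbs (provided the strings are mapped not_analyzed so the fields are aggregatable).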
Otherwise, this scripted field might work (not tested, since I don't have the necessary setup):
def fingerState = doc['whatever the field is called'].value;
if (fingerState != null)
{
    int index = fingerState.lastIndexOf('=');
    if (index > 0)
    {
        return fingerState.substring(index + 1);
    }
}
return fingerState; // this will return the whole value if for some reason this format isn't consistent
As for the second question, you can do something similar (the original answer illustrated this with a screenshot), but for this to work you need to make the finger state aggregatable.
Hope this works and that it's compatible with your version of ELK (I'm using 5.2).

Related

Time-sensitive Cloudant view not always returning correct results

I have a view on a Cloudant database that is designed to show events that are happening in the next 24 hours:
function (doc) {
    // activefrom and activeto are in UTC
    // set start to local time in UTC
    var m = new Date();
    var start = m.getTime();
    // end is start plus 24 hours of milliseconds
    var end = start + (24 * 60 * 60 * 1000);
    // only want approved disruptions for today that are not changed conditions
    if (doc.properties.status === 'Approved' && doc.properties.category != 'changed' && doc.properties.activefrom && doc.properties.activeto) {
        if (doc.properties.activeto > start && doc.properties.activefrom < end) {
            emit([doc.properties.category, doc.properties.location], doc.properties.timing);
        }
    }
}
This works fine for most of the time but every now and then the view does not show the expected results.
If I edit the view, even just by adding a comment, the output changes to the expected results. If I then re-edit the view and remove the change, the results become incorrect again.
Is this because of the time-sensitive nature of the view? Is there a better way to achieve the same result?
The date that is indexed by your MapReduce function is the time at which the server doing the work performs the indexing operation.
Cloudant views are not necessarily generated at the point that data is added to the database. Sometimes, depending on the amount of work the cluster is having to do, the Cloudant indexer is not triggered until later. Documents can even remain unindexed until the view is queried. In that circumstance, the date in your index would not be "the time the document was inserted" but "the time the document was indexed/queried", which is probably not your intention.
Not only that, different shards (copies) of the database may process the view build at different times, giving you inconsistent results depending on which server you asked!
You can solve the problem by indexing something from your source document instead. E.g. if your document looked like:
{
    "timestamp": 1519980078159,
    "properties": {
        "category": "books",
        "location": "Rome, IT"
    }
}
You could generate an index using the timestamp value from your document; the view you create would then be consistent across all shards and deterministic.
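For example, a deterministic map function over that document shape could look like the sketch below; the 24-hour window then moves out of the index and into the query, e.g. ?startkey=<now>&endkey=<now plus 24h>:

function (doc) {
  // index the document's own timestamp instead of calling new Date(),
  // so every shard builds exactly the same rows
  if (doc.properties && doc.properties.status === 'Approved' &&
      doc.properties.category !== 'changed' && doc.timestamp) {
    emit(doc.timestamp, [doc.properties.category, doc.properties.location]);
  }
}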

lucene.net, document boost not working

I am a beginner developing my very first project with Lucene.NET (3.0.3): an address search utility. I am using the standard analyzer and query parser (suppose I have a single field, stored and analyzed).
Sample data (every row is a document with a single field, the postcode and street columns concatenated):
UB6 9AH Greenford Road something
UB6 9AP Greenford Road something
UB1 3EB Greenford Road something
PR8 3JT Greenford Road something
HA1 3QD something Greenford Road
SM1 1JY something Greenford Road something
Searching
StringBuilder customQuery = new StringBuilder();
// phrase matching, boosted by the word count
customQuery.Append(_searchFieldName + ":\"" + searchTerm + "\"^" + wordsCount);
// prefix match for each word
foreach (var word in words.Where(word => !string.IsNullOrEmpty(word)))
{
    customQuery.Append(" +" + _searchFieldName + ":" + word + "*");
}
Query query = _parser.Parse(customQuery.ToString());
_searcher.Search(query, collector);
All of the above (searching) works fine.
Question
If I search for "Greenford road", I may want the row with 'SM1' to come up first (i.e. I want to prioritise results by postcode).
I have tested query-time boosting and it works fine, but I may sometimes have a long list of priority postcodes, so I don't want to loop over each postcode and set its priority at query time.
I want document (index-time) boosting.
But whatever document boost I set at indexing time, it doesn't affect my search results:
doc.Add(new Field(SearchFieldName, SearchField, Field.Store.YES, Field.Index.ANALYZED));
if (condition == true)
{
    doc.Boost = 2; // or 5 or 200 etc. (nothing works)
}
Please help. I tried to understand similarity and scoring, but there is too much mathematics in there for me.
I recently had this problem myself, and I think it might be due to wildcard queries (it was in my case, at least). There is another post that explains the issue better and provides a possible solution:
Lucene .net Boost not working when using * wildcard
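For reference, the usual explanation is that wildcard and prefix queries are rewritten to a constant-score query by default, which bypasses the norms where index-time boosts are stored, so doc.Boost never reaches the score. Below is a sketch of the workaround, untested against 3.0.3 and with the property and constant names as I recall them from the Lucene.NET API:

// rewrite wildcard/prefix queries into a scoring BooleanQuery so that
// index-time boosts (stored in the norms) influence the score again
_parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
Query query = _parser.Parse(customQuery.ToString());
_searcher.Search(query, collector);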

How to achieve dimensional charting on large dataset?

I have successfully used a combination of crossfilter, dc and d3 to build multivariate charts for smaller datasets.
My current system handles 1.5 million txns a day, and I want to use the same combination to show dimensional charts on this large dataset (spanning 6 months). I cannot push data of this size to the frontend for obvious reasons.
The txn data has second-level granularity, but this level of granularity is not required in the visualization. If the txn data is rolled up to a granularity of a day at the backend, and only the day-based aggregation is pushed to the frontend, the IO traffic and the size of the data handed to crossfilter/dc shrink drastically, and dc can then show its visualization magic.
Taking this idea forward, I decided to reduce the size of the data by lowering the granularity of the time-series data from milliseconds to a day, pre-aggregating it across the various dimensions using the GROUP BY query below (this is similar to what crossfilter does, but at the backend):
SELECT TRUNC(DATELOGGED) AS DTLOGGED, CODE, ACTION,
       COUNT(*) AS TXNCOUNT,
       GROUPING_ID(TRUNC(DATELOGGED), CODE, ACTION) AS GROUPING_ID
FROM   AAAA
GROUP  BY GROUPING SETS(TRUNC(DATELOGGED),
                        (TRUNC(DATELOGGED), CURR_CODE),
                        (TRUNC(DATELOGGED), ACTION));
Sample output rows are below. Rows aggregated by (TRUNC(DATELOGGED), CODE) share GROUPING_ID 1, and rows aggregated by (TRUNC(DATELOGGED), ACTION) share GROUPING_ID 2.
//group by DTLOGGED, CODE
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":69,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":20,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":254,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":961,"GROUPING_ID":1},
//group by DTLOGGED, ACTION
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":373600,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":48978,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":402311,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":54910,"GROUPING_ID":2},
//group by DTLOGGED
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":460732,"GROUPING_ID":3},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":496060,"GROUPING_ID":3}];
Questions:
These rows are disjoint, i.e. they are not like usual rows where each row has valid values for both CODE and ACTION.
After a selection is made in one of the graphs, the redrawing either removes the other graphs or shows no data on them.
Please give me any troubleshooting help, or suggest better ways to solve this.
http://jsfiddle.net/universallocalhost/5qJjT/3/
There are a couple of things going on in this question, so I'll try to separate them:
Crossfilter works with tidy data
http://vita.had.co.nz/papers/tidy-data.pdf
This means that you will need to come up with a naive method of filling in the nulls you're seeing (or, if need be, omit the nulled values in your initial query of the data). If you want to get really fancy, you could even infer the null values from the other data. Whatever your solution, you need to make your data tidy before putting it into crossfilter.
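For example, a naive way to tidy your GROUPING SETS output (field names taken from the sample rows above) is to split the disjoint rows into one clean array per grouping before anything reaches crossfilter:

// split the disjoint rows by GROUPING_ID so that no chart ever sees
// a row whose CODE or ACTION column is just an empty placeholder
var byCode = rows.filter(function(d) { return d.GROUPING_ID === 1; });
var byAction = rows.filter(function(d) { return d.GROUPING_ID === 2; });
var byDay = rows.filter(function(d) { return d.GROUPING_ID === 3; });

Each subset is tidy on its own; whether you feed them to one crossfilter instance or several then becomes a separate design decision.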
Groups and Filtering Operations
txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    if (d.GROUPING_ID === 1) {
        return d.TXNCOUNT;
    } else {
        return 0;
    }
});
This is a filtering operation done on the reduction. This is something that you should separate. Allow that filtering to occur elsewhere (either in the visual, crossfilter itself, or in the query on the data).
This means your reduceSums become:
var txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    return d.TXNCOUNT;
});
And if you would like the user to select which group to display:
var groupId = cfdata.dimension(function(d) { return d.GROUPING_ID; });
var groupIdGroup = groupId.group(); // this is an interesting name

dc.pieChart("#group-chart")
    .width(250)
    .height(250)
    .radius(125)
    .innerRadius(50)
    .transitionDuration(750)
    .dimension(groupId)
    .group(groupIdGroup)
    .renderLabel(true);
For an example of this working:
http://jsfiddle.net/b67pX/

Criteria and sorting according to it

Hi, I have two problems related to Hibernate Criteria.
I have the following Product, which can contain many colors, and I wish to find the products which contain at least RED and GREEN.
Product class:
    String id;
    name;
    style;
    List<Color> colors;
Color class:
    id;
    color;
1) Every time I do a retrieval, each product appears once per color it has; for example, product A with red, green and blue appears 3 times. I have used FetchMode.SELECT but it doesn't seem to change anything. The only solution I can think of is inserting the results into a HashSet and overriding the hashCode and equals methods to compare the primary key only.
2) How do I return query results sorted by closest match to my search? For example, if I search for a style and the colors red and green, products that match the style and both red and green should come first.
1) You need to make the results distinct; it is not a matter of changing the FetchMode. Please take a look at this article:
setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY)
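In context that would look roughly like this (a sketch; the session variable and entity names are assumed from the question):

// the join against colors duplicates each Product row once per color;
// the transformer collapses the duplicates back into distinct root entities
List<Product> products = session.createCriteria(Product.class)
        .createAlias("colors", "cs")
        .setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY)
        .list();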
2) Well... there is no Criteria function that automatically finds and orders by closest match. Anyway, the simplest way to get something similar is to use addOrder with createAlias instead of a fetch mode:
ct.createAlias("colors", "cs")
  .add(Restrictions.like("style", value + "%"))
  .add(Restrictions.in("cs.color", colorsArray))
  .addOrder(Order.asc("style"))
  .addOrder(Order.asc("cs.color"));
I cannot cover every kind of matching method here; please refer to the various expressions on Restrictions here.

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which returns true if the two input articles are related and false if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve that as well, but the main point that is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();

function check($str1, $str2) {
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    } else {
        return FALSE;
    }
}

$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
    return pow($number, 2);
}

function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words

    // document 1: count the frequency of each word
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1 + square($temp);
    }
    $rating1 = sqrt($rating1);

    // document 2: count the frequency of each word that also occurs in document 1
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2 + square($temp);
    }
    $rating2 = sqrt($rating2);

    $skalarprodukt = 0;
    for ($m = 0; $m < count($words) - 1; $m++) {
        $skalarprodukt = $skalarprodukt + (array_shift($document1) * array_shift($document2));
    }
    if (($rating1 * $rating2) == 0) {
        return FALSE;
    }
    $kosinusmass = $skalarprodukt / ($rating1 * $rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    } else {
        return TRUE;
    }
}
?>
I would also like to say that I know there are lots of clustering algorithms, but every site gives only the mathematical description, which is a bit difficult for me to understand. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say across all your articles there are only 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represents the word 'asked', and that word appears 5 times in the article, vector[0] will be 5:
for word in article.text:
    article.histogram[indexLookup[word]] += 1
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
Also, you should not just put every word into your histogram. In general, you'll want to include the ones that are used semi-frequently: not the ones in every article, and not the ones in only one article. This saves you a bit of overhead on your histogram and increases the value of your relations.
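A quick sketch of that filtering, continuing the python-ish pseudocode above (doc_freq, mapping each word to the number of articles containing it, and article_count are assumed to exist):

# keep words that occur in more than one article but not in most articles
vocab = [w for w, df in doc_freq.items() if 1 < df < 0.5 * article_count]
indexLookup = {w: i for i, w in enumerate(vocab)}  # word -> histogram position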
By the way, this technique is described in more detail here
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
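For illustration, the Lucene side of that is only a few lines (a sketch against the Lucene 3.x contrib API; the directory variable and the "text" field name are assumptions):

IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "text" });   // compare articles on this field
Query query = mlt.like(currentDocId);         // build a query from the current article
TopDocs related = searcher.search(query, 10); // the top-10 most similar articles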
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate the cluster centers by taking the mean of the document vectors in each cluster, and iterate (see the sketch below). This costs only K x (number of documents) per iteration. I believe there are also heuristics for reducing the number of computations needed for hierarchical clustering.
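To make the K-means loop concrete, here is a minimal sketch over the word-count vectors from the bag-of-words answer (plain Python with Euclidean distance; all names are illustrative):

import random

def kmeans(vectors, k, iterations=10):
    # choose an initial K documents as cluster centers
    centers = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # assignment step: every document goes to its nearest center
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[dists.index(min(dists))].append(v)
        # update step: each center becomes the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if a cluster went empty
                centers[i] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return clusters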
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but it seems too arbitrary to catch all useful clusterings of the texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn), write the 20K x 20K matrix to a file, and use an external program to perform the clustering and identify the nearest neighbors for each article, which you could then load into the related table. I would do the clustering in R - there's a nice tutorial on clustering data in a file, running R from PHP.
