Carrot2 + ElasticSearch Basic Flow of Information

I am using Carrot2 and Elasticsearch. I had an Elasticsearch server running with a lot of data when I installed the Carrot2 plugin.
I wanted to get answers to a few basic questions:
Will clustering work only on newly indexed documents or even old documents?
How can I specify which fields to look at for clustering?
The curl command is working and giving some results. How can I make the equivalent request, which takes JSON as input, against a REST API URL of the form localhost:9200/article-index/article/_search_with_clusters?.....
Appreciate any help.

Yes, if you want to use the plugin straight off the ES installation, you need to make REST calls of your own. I believe you are using Python. Take a look at requests. It is a delightful REST tool for Python.
To make POST requests you can do the following :
import json
import requests

url = 'http://localhost:9200/article-index/article/_search_with_clusters'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)
Find more information in the requests documentation.

Will clustering work only on newly indexed documents or even old documents?
It will work even on old documents.
How can I specify which fields to look at for clustering?
Here's an example using the Shakespeare dataset. The query asks which of Shakespeare's plays are about war.
$ curl -XPOST http://localhost:9200/shakespeare/_search_with_clusters?pretty -d '
{
    "search_request": {
        "query": {"match" : { "_all": "war" }},
        "size": 100
    },
    "max_hits": 0,
    "query_hint": "war",
    "field_mapping": {
        "title": ["_source.play_name"],
        "content": ["_source.text_entry"]
    },
    "algorithm": "lingo"
}'
Running this, you'll get back plays like Richard, Henry... The title is what Carrot2 uses to derive the cluster names, and the text entry is what it uses to build the clusters.
The curl command is working and giving some results. How can I make the equivalent request, which takes JSON as input, against a REST API URL of the form localhost:9200/article-index/article/_search_with_clusters?.....
Typically you would use the Elasticsearch client libraries for your language of choice.
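For example, here is a minimal sketch in Python using the requests library, in the spirit of the earlier answer. The index and URL come from the question; the title/content field names and the query text are assumptions, so adjust them to match your actual mapping:

import json
import requests

# Clustering endpoint exposed by the Carrot2 plugin (URL from the question)
url = 'http://localhost:9200/article-index/article/_search_with_clusters'

# Assumed request body: run a search and tell Carrot2 which fields feed the
# cluster labels (title) and the cluster contents (content)
payload = {
    "search_request": {
        "query": {"match": {"_all": "war"}},
        "size": 100
    },
    "query_hint": "war",
    "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"]
    },
    "algorithm": "lingo"
}

r = requests.post(url, data=json.dumps(payload))
response = r.json()

# The plugin adds a "clusters" section next to the regular hits; the exact
# shape can vary between plugin versions, hence the defensive .get() calls
for cluster in response.get("clusters", []):
    print(cluster.get("label"), len(cluster.get("documents", [])))

The body is exactly the same JSON you would pass to curl with -d; requests just posts it over HTTP for you.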

Related

Elasticsearch - List all sources sending messages to ES

I am trying to get a list that shows me all sources Elasticsearch is receiving messages from. I am pretty new to this topic and trying to get deeper into it. Basically, I am looking for a way to see the total number of sources sending logs to my central logging solution and, ideally, also get a list of the source names.
Does anyone have an idea how to get such information by querying Elasticsearch?
Yes, this is possible, though the solution depends on how your data looks.
Users typically index data in Elasticsearch so that it contains more than just the raw log lines. This is done automatically if you're using Filebeat. Otherwise, you'd do something (add a field using Logstash, rely on a host field in syslog, etc) to ensure you have a field that contains your "source" identifier:
{
    "message": "my super valuable logline",
    "source": "my_kinda_awesome_app"
}
Given a document like the one above, you can identify all sources (and record counts!) with a terms aggregation like:
{
    "aggs": {
        "my_sources": {
            "terms": { "field": "source" }
        }
    }
}
Kibana makes this all easier since you don't need to know/write ES queries and can do stuff visually.
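If you do want to run the aggregation from code rather than Kibana, here is a minimal sketch in Python using requests. The index pattern is a placeholder, and depending on your mapping you may need to aggregate on source.keyword instead of source:

import requests

# Placeholder index pattern; point this at your actual log index
url = 'http://localhost:9200/logs-*/_search'

query = {
    "size": 0,  # skip the individual hits, we only want the aggregation
    "aggs": {
        "my_sources": {
            "terms": {"field": "source", "size": 100}  # up to 100 distinct sources
        }
    }
}

r = requests.post(url, json=query)
buckets = r.json()["aggregations"]["my_sources"]["buckets"]

# One bucket per distinct source, with the number of documents it sent
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])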

Fuzzy search by default in kibana

I'm trying to do fuzzy searches in Kibana through its UI (ideally by default). I know how to make such a request in the DEV TOOLS section. The problem is having that option by default. Is it possible? I'd also like to save all the requests that I enter (by default).
Please find below the search I am trying to incorporate to get the results.
GET /_search
{
    "query": {
        "fuzzy" : { "NOM" : "COUT" }
    }
}
PS: I know that there is a Lucene syntax for sophisticated requests.
Thanks a lot for your help!

Using Elastic Query DSL in Kibana Discover to enable more_like_this etc

The Kibana documentation says:
When lucene is selected as your query language you can also submit queries using the Elasticsearch Query DSL.
However, whenever I try to enter such a query in the Discover pane, I get a parse error. These are queries that work fine in the Dev Tools pane.
For example, if I try even a simple query like this:
{"query":{"match_phrase":{"summary":"stochastic noise"}}}
I get the following error:
Discover: [parsing_exception] no [query] registered for [query], with { line=1 & col=356 }
Error: [parsing_exception] no [query] registered for [query], with { line=1 & col=356 }
at respond (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:111:161556)
at checkRespForFailure (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:111:160796)
at http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:105:285566
at processQueue (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:132456)
at http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:133349
at Scope.$digest (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:144239)
at Scope.$apply (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:147018)
at done (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:100026)
at completeRequest (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:104697)
at XMLHttpRequest.xhr.onload (http://<mydomain>:5601/bundles/vendors.bundle.js?v=16602:58:105435)
(I've removed my domain above and replaced with <mydomain>)
The above query works fine and returns results using cURL on the command line, or using
GET /_search
{
    "query": {
        "match_phrase": {
            "summary": "stochastic noise"
        }
    }
}
in the Dev Tools console.
I'm hoping to use the more_like_this query from the Discover panel, so (I think) I will need to use the Query DSL and not just use the straight lucene query syntax. But if there's a way to use the specialty queries like that using straight lucene (or kuery) that would be great.
The reason is simply that the input box only supports whatever you would include inside the query section, so if you input this, it will work:
{"match_phrase":{"summary":"stochastic noise"}}
It makes sense if you think about it, i.e. the aggs section makes no sense in the Discover pane and the from/size attributes are already taken care of by the default settings.
If you look at the full query DSL, you'll see that there are several sections: query, aggs, from, size, _source, highlight, etc. In the Discover pane, you should only specify whatever goes into the query section, nothing else.

elasticsearch faking index per user - how are routing values inferred when updating?

I'm using the fake "index per user with aliases" approach suggested by the docs. ES version 1.6.0 sometimes fails to behave as expected.
Checking the alias:
curl localhost:9200/testbig/_alias/<userId>

{
    "<indexname>": {
        "aliases": {
            "<userId>": {
                "filter": {"term": {"userId": "<userId>"}},
                "index_routing": "<userId>",
                "search_routing": "<userId>"
            }
        }
    }
}
But trying to update a document:
curl -XPOST localhost:9200/<userId>/<type>/<id>/_update -d '{"doc":{"userId":"<userId>","field1":"val1"}}'
I get
{ "error": "ElasticsearchIllegalArgumentException[Alias [<userId>] has
index routing associated with it [<userId>], and was provided with
routing value [<DIFFERENTuserId>], rejecting operation]",
"status": 400 }
In case anyone else suffers a similar issue, here is what causes it:
If you start by using actual separate indexes for each user, it's OK to have records with the same id, i.e. paths like
localhost:9200/userid1/type/id1
localhost:9200/userid2/type/id1
but when the userids are just aliases, these correspond, of course, to the same document. Hence the routing clash on subsequent updates.
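To make the clash concrete, here is a small sketch of the filtered-alias setup using Python and requests, assuming ES 1.x-era APIs. The index, alias, type, and id names are placeholders; both aliases deliberately point at the same physical index, which is exactly the situation described above:

import requests

base = 'http://localhost:9200'

# One shared physical index behind all the per-user aliases
requests.put(base + '/testbig')

# One filtered alias per user, with index and search routing set to the user id
alias_actions = {
    "actions": [
        {"add": {"index": "testbig", "alias": "userid1",
                 "filter": {"term": {"userId": "userid1"}}, "routing": "userid1"}},
        {"add": {"index": "testbig", "alias": "userid2",
                 "filter": {"term": {"userId": "userid2"}}, "routing": "userid2"}}
    ]
}
requests.post(base + '/_aliases', json=alias_actions)

# With real per-user indexes these would be two unrelated documents. With
# aliases they both resolve to testbig/type/id1 in the same physical index,
# just indexed with different routing values, so a later _update through one
# alias can find a document whose routing came from the other alias and fail
# with the "rejecting operation" error above.
requests.put(base + '/userid1/type/id1', json={"userId": "userid1", "field1": "a"})
requests.put(base + '/userid2/type/id1', json={"userId": "userid2", "field1": "b"})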

Register an ElasticSearch query with the Percolator using the Java API

I am trying to use Elasticsearch's Percolator feature; doing this via the curl examples from the documentation is straightforward enough, as is percolating a document using the Java API. What I can't find out is how to register a query with the percolator using the Java API - how is this done?
Using the example from the documentation, how would I do this in Java?
curl -XPUT localhost:9200/_percolator/test/kuku -d '{
    "query" : {
        "term" : {
            "field1" : "value1"
        }
    }
}'
_percolator is just an index. You register queries with it by indexing queries as you normally would index documents:
// Static imports assumed:
//   import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
//   import static org.elasticsearch.index.query.QueryBuilders.termQuery;
client.prepareIndex("_percolator", "test", "kuku")
        .setSource(jsonBuilder().startObject()
                .field("query", termQuery("field1", "value1"))
                .endObject())
        .setRefresh(true)
        .execute().actionGet();
You can also check elasticsearch integration tests for more examples.
EDIT: The link above is dead; you might want to take a look at the official documentation concerning the integration tests.
I have also added a gist of the old PercolatorTests class.
