I'm using a simple query string over the following text:
Jiboia de três metros é capturada em avenida de Governador
Note: this is the content of my message field.
My query string (no results)
My query string (1 result)
Is there a trick for handling Latin characters?
My mapping:

You need to tell Elasticsearch how to handle your characters.
I put together an example using a custom analyzer like this:
curl -XPOST "" -d'
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "custom_filter" : {
          "type" : "word_delimiter",
          "type_table": ["ê => ALPHA", "Ê => ALPHA"]
        }
      },
      "analyzer" : {
        "custom_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "custom_filter"]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "msg" : {
          "type" : "string",
          "analyzer" : "custom_analyzer"
        }
      }
    }
  }
}'
I simply created an analyzer whose word_delimiter filter knows that ê and Ê must be treated as alphabetic characters.
After that, I just run a search against my msg field:
curl -XPOST "" -d'
And it works :D

I found the problem.
I was using JavaScript's atob function to decode the message that gets indexed in Elasticsearch.
The atob function does not handle my Latin characters well and corrupts them.
I replaced atob with the native Buffer class in Node.js.
Note: the default analyzer works perfectly with Latin characters!
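To illustrate the kind of corruption described above, here is a minimal Python sketch. It assumes the message was UTF-8 text that had been Base64-encoded (the post does not say this explicitly). Decoding the bytes one byte per character, which is roughly what atob does, mangles the accented letter, while decoding them as UTF-8, which is what Buffer can do, restores it. The variable names are made up for the example.

```python
import base64

# Assumption: the original message was UTF-8 text, Base64-encoded before transport.
original = "Jiboia de três metros"
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

raw_bytes = base64.b64decode(encoded)

# atob-like behaviour: each byte becomes one character (a Latin-1 view of the bytes).
atob_like = raw_bytes.decode("latin-1")

# Buffer-like behaviour: the byte sequence is interpreted as UTF-8.
buffer_like = raw_bytes.decode("utf-8")

print(atob_like)    # the accented letter is broken into two wrong characters
print(buffer_like)  # the original text
```

If the broken string is what ends up in the index, a search for três cannot match it, no matter which analyzer is configured.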


How do I configure Elasticsearch to use the icu_tokenizer?

I'm trying to search a text indexed by elasticsearch and the icu_tokenizer but can't get it working.
My test case is to tokenize the sentence “Hello. I am from Bangkok”, in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized to the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ. (Sample from Elasticsearch - The Definitive Guide)
Searching for any of the last four words fails for me. Searching for either of the space-separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
  "icu" : {
    "settings" : {
      "index" : {
        "creation_date" : "1474010824865",
        "analysis" : {
          "analyzer" : {
            "nfkc_cf_normalized" : [ "icu_normalizer" ],
            "tokenizer" : "icu_tokenizer"
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "tALRehqIRA6FGPu8iptzww",
        "version" : {
          "created" : "2040099"
        }
      }
    }
  }
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
  "first_name" : "John",
  "last_name" : "Doe",
  "about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
  "query" : {
    "match" : {
      "about" : "กรุงเทพฯ"
    }
  }
}'
This returns nothing ("hits" : [ ]).
Performing the same search with one of สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index. How should it be done?
The missing part is:
"mappings": {
  "employee" : {
    "properties": {
      "about": {
        "type": "text",
        "analyzer": "icu_analyzer"
      }
    }
  }
}
In the mapping, you have to specify which analyzer the document field should use.
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_analyzer" : {
          "char_filter": [
          ],
          "tokenizer" : "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "employee" : {
      "properties": {
        "about": {
          "type": "text",
          "analyzer": "icu_analyzer"
        }
      }
    }
  }
}
Test the custom analyzer using the following JSON DSL:
POST /icu/_analyze
{
  "text": "สวัสดี ผมมาจากกรุงเทพฯ",
  "analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
My suggestion: the Dev Tools console in Kibana can help you craft queries effectively.

After using the Elasticsearch JDBC Importer, 'asciifolding' is not working as expected

Using the Elasticsearch JDBC importer with this configuration:
echo '{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "ip/db",
        "user" : "myuser",
        "password" : "a7sdf7hsdf8hn78df",
        "sql" : "SELECT title, body, source_id, time_order, type, blablabla...",
        "index" : "importeditems",
        "type" : "item",
        "elasticsearch.host": "_eth0_",
        "detect_json" : false
    }
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
I've indexed some documents correctly with the form:
"title":"Tiempo de Opinión: Puede comenzar un ciclo",
"body":"Sebas Álvaro nos trae cada lunes historias y anécdotas de la montaña<!-- com -->",
I'm trying to ignore the accents (for example, opinión in the title has an ó), so that if a user searches for "tiempo de opinión" or "tiempo de opinion" with a match_phrase, the query matches the documents whether or not they contain the accent.
So after using the importer and indexing everything, I changed my index settings to a default analyzer with an asciifolding filter.
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
  "analysis": {
    "analyzer": {
      "default": {
        "tokenizer" : "standard",
        "filter": [ "lowercase", "asciifolding"]
      }
    }
  }
}'
curl -XPOST 'localhost:9200/importeditems/_open'
Then I run a match_phrase query for "tiempo de opinion" (no accent) and for "tiempo de opinión" (with accent):
# No accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
  "query": {
    "match_phrase" : {
      "title" : "tiempo de opinion"
    }
  }
}'
# With accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
  "query": {
    "match_phrase" : {
      "title" : "tiempo de opinión"
    }
  }
}'
But neither query returns a match, even though matching documents exist (if I match_phrase the phrase tiempo de, it returns some hits containing tiempo de opinión).
I think the problem is due to the JDBC Importer, because I tried to reproduce the error without it (creating another index, adding entries by hand, and changing that index's settings to asciifolding as well) and everything works as expected. You can see this working example right here.
If I check the settings of the index created after using the importer (importeditems)
curl -XGET 'localhost:9200/importeditems/_settings?pretty=true'
This outputs:
{
  "importeditems" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457533907278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
... and if I check the settings of the manually created index (test):
curl -XGET 'localhost:9200/test/_settings?pretty=true'
I get exactly the same output:
{
  "test" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457603253278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
Can someone please tell me why this does not work when I use the Elasticsearch JDBC Importer, but does work when I add raw data?
I finally solved the issue by first changing the settings by adding the analysis module:
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
  "analysis": {
    "analyzer": {
      "default": {
        "tokenizer" : "standard",
        "filter": [ "lowercase", "asciifolding"]
      }
    }
  }
}'
curl -XPOST 'localhost:9200/importeditems/_open'
... and then importing all the data again.
It's strange, because as I stated in the post, I did exactly the same in both cases (with the JDBC Importer and with the raw data):
Index data
Change index settings
Make the query with match_phrase
And it worked with the raw data (test) but not with the data I loaded through the importer (importeditems). The only explanation I can think of is that importeditems held more than 12 GB and needed time to re-create the content with asciifolding applied, which would be why the changes were not reflected right after asciifolding was activated.
Anyway, if someone runs into the same issue, and especially if you are working with a large amount of data, remember to set the analyzer first and then index all the data.
According to the docs:
Queries can find only terms that actually exist in the inverted index,
so it is important to ensure that the same analysis process is applied
both to the document at index time, and to the query string at search
time so that the terms in the query match the terms in the inverted
index.
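To make the quoted rule concrete, here is a small Python sketch. It is not Elasticsearch code: analyze_plain and analyze_folded are hypothetical stand-ins for an analyzer without and with asciifolding, and phrase_matches is a naive phrase check. It assumes the documents were tokenized before the asciifolding analyzer was set, while the queries are analyzed with it afterwards.

```python
import re
import unicodedata

def analyze_plain(text):
    """Stand-in for an analyzer without folding: lowercase, keep accents."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text.lower()))

def analyze_folded(text):
    """Stand-in for lowercase + asciifolding: lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", stripped)

def phrase_matches(query_terms, doc_terms):
    """True if query_terms occur as a consecutive run inside doc_terms."""
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

title = "Tiempo de Opinión: Puede comenzar un ciclo"
old_terms = analyze_plain(title)   # terms stored before the analyzer change
new_terms = analyze_folded(title)  # terms stored after re-importing the data

for query in ("tiempo de", "tiempo de opinion", "tiempo de opinión"):
    q = analyze_folded(query)      # queries always go through the new analyzer
    print(query, phrase_matches(q, old_terms), phrase_matches(q, new_terms))
```

Under these assumptions the sketch shows the same pattern as in the question: against the old terms only "tiempo de" matches, and after indexing again with folding all three queries match.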

How to find most used phrases in elasticsearch?

I know that you can find the most used terms in an index using facets.
For example on following inputs:
"A B C"
"AA B"
term facet returns this:
But I'm wondering whether it is possible to list the following:
AA B:2
A B:1
Is there such a feature in Elasticsearch?
As mentioned in ramseykhalaf's comment, a shingle filter would produce tokens of length "n" words.
{
  "settings" : {
    "analysis" : {
      "filter" : {
      },
      "analyzer" : {
        "shingle_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["standard", "lowercase", "shingle", "filter_stop"]
        }
      }
    }
  },
  "mappings" : {
    "type" : {
      "properties" : {
        "letters" : {
          "type" : "string",
          "analyzer" : "shingle_analyzer"
        }
      }
    }
  }
}
See this blog post for full details.
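For anyone unsure what shingles are, here is a rough Python sketch of the idea, outside Elasticsearch. It only builds two-word shingles and counts them, which is what a terms facet over a shingled field would approximate. The three sample documents are made up for the example, and the real shingle filter has more options (for instance, it can also emit single words).

```python
from collections import Counter

def shingles(text, size=2):
    """Return every run of `size` consecutive lowercase words in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

# Hypothetical documents, not the ones from the question.
docs = ["A B C", "AA B", "AA B D"]

# Count how often each two-word phrase appears across all documents.
counts = Counter(s for doc in docs for s in shingles(doc))

for phrase, count in counts.most_common():
    print(f"{phrase}:{count}")
```

Here the phrase "aa b" comes out on top with a count of 2, and every other phrase appears once.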
I'm not sure if Elasticsearch will let you do this the way you want natively. But you might be interested in checking out Carrot2 - http://search.carrot2.org to accomplish what you want (and probably more).

Elasticsearch search for words having '#' character

For example, I am right now searching like this:
But I am getting all the results with 'sachin' and not '#sachin'. Also, I am writing a regular expression for getting the count of terms. The facet looks like this:
"facets": {
  "content": {
    "terms": {
      "field": "content",
      "size": 1000,
      "all_terms": false,
      "regex": "#sachin",
      "regex_flags": [
      ]
    }
  }
}
This is not returning any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried to escape it with \ and \\, but it did not work. Can anyone help me in this regard?
This article gives information on how to save # and @ using custom analyzers:
curl -XPUT 'http://localhost:9200/twitter' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "tweet_filter" : {
          "type" : "word_delimiter",
          "type_table": ["# => ALPHA", "@ => ALPHA"]
        }
      },
      "analyzer" : {
        "tweet_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "tweet_filter"]
        }
      }
    }
  },
  "mappings" : {
    "tweet" : {
      "properties" : {
        "msg" : {
          "type" : "string",
          "analyzer" : "tweet_analyzer"
        }
      }
    }
  }
}'
This isn't dealing with facets, but redefining the type of those special characters in the analyzer could help.
Another approach worth considering is to index a special (e.g. "reserved") word instead of the hash symbol, for example HASHSYMBOLCHAR. Make sure that you replace the '#' characters in the query as well.
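A minimal Python sketch of the reserved-word approach, under the assumption that the analyzer drops '#' the way the standard one does. The functions protect_hash and simple_analyze are hypothetical helpers written for this example, not part of any Elasticsearch client; simple_analyze is only a rough stand-in for the analyzer.

```python
import re

PLACEHOLDER = "HASHSYMBOLCHAR"  # reserved word suggested above

def protect_hash(text):
    """Replace every '#' with the reserved word before analysis."""
    return text.replace("#", PLACEHOLDER)

def simple_analyze(text):
    """Rough stand-in for an analyzer that lowercases and drops punctuation."""
    return re.findall(r"\w+", text.lower())

tagged = "Great innings by #sachin today"
untagged = "Great innings by sachin today"

# Without protection, '#sachin' and 'sachin' collapse into the same term.
print(simple_analyze("#sachin"), simple_analyze(tagged), simple_analyze(untagged))

# With protection applied to both documents and query, only the tagged one matches.
query_term = simple_analyze(protect_hash("#sachin"))[0]
print(query_term in simple_analyze(protect_hash(tagged)))
print(query_term in simple_analyze(protect_hash(untagged)))
```

The important part is that the same replacement runs on the text being indexed and on the query, otherwise the terms will not line up.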

Index fields with hyphens in Elasticsearch

I'm trying to work out how to configure elasticsearch so that I can make query string searches with wildcards on fields that include hyphens.
I have documents that look like this:
"name":"Crew t-shirt navy large",
"description":"This is a t-shirt",
I have tried to use a word_delimiter filter and a whitespace tokenizer:
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "tags_filter" : {
          "type" : "word_delimiter",
          "type_table": ["- => ALPHA"]
        }
      },
      "analyzer" : {
        "tags_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["tags_filter"]
        }
      }
    }
  },
  "mappings" : {
    "yacht1" : {
      "properties" : {
        "tags" : {
          "type" : "string",
          "analyzer" : "tags_analyzer"
        }
      }
    }
  }
}
But these are the searches (for tags) and their results:
deck* -> match
deck-* -> no match
deck-clo* -> no match
Can anyone see where I'm going wrong?
Thanks :)
The analyzer is fine (though I'd lose the filter), but your search analyzer isn't specified, so the standard analyzer is used to search the tags field. It strips out the hyphen and then tries to query against the result (run curl "localhost:9200/_analyze?analyzer=standard" -d "deck-*" to see what I mean).
Basically, "deck-*" is being searched for as "deck *". There is no word that is just "deck", so it fails.
"deck-clo*" is being searched for as "deck clo*". Again, there is no word that is just "deck" or that starts with "clo", so the query fails.
I'd make the following modifications:
"analysis" : {
  "analyzer" : {
    "default" : {
      "tokenizer" : "whitespace",
      "filter" : ["lowercase"] <--- you don't need this, just thought it was a nice touch
    }
  }
}
Then get rid of the special analyzer on the tags field:
"mappings" : {
  "yacht1" : {
    "properties" : {
      "tags" : {
        "type" : "string"
      }
    }
  }
}
Let me know how it goes.
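To illustrate the explanation above about "deck-*" being searched as "deck *", here is a simplified Python model. It is only a sketch, not how Elasticsearch is implemented: standard_like_terms is a rough stand-in for the standard analyzer, wildcard_search treats every piece before the last as an exact term and the last piece as a prefix, and the tag value "deck-clothing navy" is made up because the question does not show one.

```python
import re

def whitespace_terms(text):
    """Whitespace tokenizer plus lowercase: hyphens stay inside the token."""
    return text.lower().split()

def standard_like_terms(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics.
    A trailing empty piece is kept, so "deck-" becomes ["deck", ""]."""
    return re.split(r"[^a-z0-9]+", text.lower())

def wildcard_search(query, indexed_tokens, query_analyzer):
    """Simplified model: pieces before the last must exist as exact terms,
    and the last piece is matched as a prefix."""
    *exact, prefix = query_analyzer(query.rstrip("*"))
    return (all(term in indexed_tokens for term in exact)
            and any(tok.startswith(prefix) for tok in indexed_tokens))

# Hypothetical tag value, indexed with the whitespace tokenizer.
indexed = whitespace_terms("deck-clothing navy")

for query in ("deck*", "deck-*", "deck-clo*"):
    print(query,
          wildcard_search(query, indexed, standard_like_terms),
          wildcard_search(query, indexed, whitespace_terms))
```

In this model, analyzing the query with the standard-like splitter reproduces the match / no match / no match pattern from the question, while using the whitespace analyzer on both sides makes all three queries match.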
