elasticsearch data increase & duplicate at each restart - elasticsearch

I'm using elasticsearch with angularjs and oracle on windows 7.
it's working more & more finer ( thanks to stackoverflower help ). I have a problem with elasticsearch: the number of elements in my document is increasing and i don't know why/how.
My oracle table indexed by elasticsearch contain 12010 elements, now i got 84070 elements in elastic document (frequently checked by curl _count): so it duplicate the data 7 times now. I re-indexed the table few days ago but i remove elasticsearch "data" folder before.
data seems to increase each time i restart windows.
Thanks for help.
This is how i install and index my data :
I do this only the first time :
unzip elastic in folder : D:\work\elasticsearch-1.3.1\
install web interface : >plugin -install mobz/elasticsearch-head
install jdbc : >plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.0.0/elasticsearch-river-jdbc-1.3.0.0-plugin.zip
copy "ojdbc6-11.2.0.3.jar" to "D:\work\elasticsearch-1.3.1\plugins\jdbc"
service.bat install
service.bat start
creating index
curl -XPOST 'localhost:9200/donnees'
mapping :
curl -XPUT 'localhost:9200/donnees/specimens/_mapping' -d '{
"specimens" : {
"_all" : {"enabled" : true},
"_index" : {"enabled" : true},
"_id" : {"index": "not_analyzed", "store" : false},
"properties" : {
"O_OCCURRENCEID" : {"type" : "string", "store" : "no","index": "not_analyzed" } ,
....
"I_INSTITUTIONCODE" : {"type" : "string", "store" : "yes","index": "analyzed" }
}
}}'
query oracle and index data :
curl -XPUT 'localhost:9200/_river/donnees_s/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"index" : "donnees",
"type" : "specimens",
"url" : "jdbc:oracle:thin:#localhost:1523:recolnat",
"user" : "user",
"password" : "password",
"sql" : "select * from all_specimens_data"
}
}'
( is this correct ?? it doesn't work if i replace "curl -XPUT 'localhost:9200/_river/donnees_s/_meta'" by "curl -XPUT 'localhost:9200/donnees/specimens/_meta' which i use to query )
test :
curl -XGET 'http://localhost:9200/donnees/specimens/_count?q=*'
=> 12010
curl -XGET 'http://localhost:9200/donnees/specimens/_search?q=P00009359'
=> return data ok

Resolved thanks to Konstantin V. Salikhov.
Each time elasticsearch service start it query the database with the sql provided to the _river and get the data ( see me previous "query oracle and index data : "). If the data don't have an "_id" column _river can't determine which records it have already loaded and the data is duplicated each time.
To avoid duplicate i edit my "all_specimens_data" table in database ( who is in fact a view to avoid modification o database) and rename "O_OCCURRENCEID" to "_id", "O_OCCURRENCEID" is my primary key UUID.
hope this help other

Related

Kibana Create Index Pattern : strange behaviour of wildcard

I have just one index in elasticsearch, with name aa-bb-YYYY-MM.
Documents in this index contain a field i want to use as date field.
Those documents have been inserted from a custom script (not using logstash).
When creating the index pattern in kibana:
If i enter aa-bb-*, the date field is not found.
If i enter aa-*, the date field is not found.
If i enter aa*, the date field is found, and i can create the index pattern.
But i really need to group indexes by the first two "dimensions".I tried using "_" instead "-", with the same result.
Any idea of what is going on?
Its working for me. I'm on the latest build on the 5.0 release branch (just past the beta1 release). I don't know what version you're on.
I created this index and added 2 docs;
curl --basic -XPUT 'http://elastic:changeme#localhost:9200/aa-bb-2016-09' -d '{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"test" : {
"properties" : {
"date" : { "type" : "date"},
"action" : {
"type" : "text",
"analyzer" : "standard",
"fields": {
"raw" : { "type" : "text", "index" : "not_analyzed" }
}
},
"myid" : { "type" : "integer"}
}
}
}
}'
curl -XPUT 'http://elastic:changeme#localhost:9200/aa-bb-2016-09/test/1' -d '{
"date" : "2015-08-23T00:01:00",
"action" : "start",
"myid" : 1
}'
curl -XPUT 'http://elastic:changeme#localhost:9200/aa-bb-2016-09/test/2' -d '{
"date" : "2015-08-23T14:02:30",
"action" : "stop",
"myid" : 1
}'
and I was able to create the index pattern with aa-bb-*

Adding mapping on elasticsearch using curl

Hi I am trying to add elasticsearch mapping but I am facing an exception
{"error":"InvalidTypeNameException[mapping type name [mapping] can't
start with '']","status":400}
I am using 0.99.10 version of elasticSearchIndex.
curl -XPUT 'http://localhost:9200/twitter/_mapping/tweet' -d '{"tweet" : {"properties" : {"message" : {"type" : "string", "store" : true }}}}'
When you include the document type in the url you should start your payload by properties field, like this:
curl -XPUT 'http://localhost:9200/twitter/_mapping/tweet' -d '{"properties" : {"message" : {"type" : "string", "store" : true }}}'
As described here

Elasticsearch JDBC importer not importing entry correctly

Having the following mapping:
curl -XPUT 'localhost:9200/borrador' -d '{
"mappings": {
"item": {
"dynamic": "strict",
"properties" : {
"body" : { "type": "string" },
"source_id" : { "type": "integer" },
}}}}'
I'm trying to import my DB to Elasticsearch using the Elasticsearch-JDBC importer.
This is the script I'm using:
#!/bin/sh
bin=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/bin
lib=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/lib
echo "Indexando base de datos..."
echo '{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mydbip/mydbname",
"user" : "username",
"password" : "pw",
"sql" : "select source_id, body, id as _id from table_name",
"index" : "borrador",
"type" : "item"
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
Most of the rows of the table are indexed correctly, but the following row from that DB is giving me an error and it's not indexing correctly:
This is the error that shows up:
[ERROR][org.xbib.elasticsearch.helper.client.BulkTransportClient][elasticsearch[importer][listener][T#1]]
bulk [957] failed with 1 failed items, failure message = failure in
bulk execution:
[3499]: index [borrador], type [item], id [14327140], message [MapperParsingException[failed to parse [body]]; nested:
IllegalArgumentException[unknown property [records]];]
As you can see in this case, this specific row has a json format string ({"format":"MS Excel","price":"750","records":"577","recordType":"records"}<!-- com -->) instead of the normal string that has the other entries that are indexing correctly.
What is happening? I would like to store that as a normal string. It's problem of the mapping as it's reading it as a json or something? Even if I remove the "dynamic": "strict", or the entire mapping, it still gives me the error. Thanks in advance.
By default the JDBC importer tries to detect JSON strings in your data and will parse them. You need to modify the configuration of your importer with the detect_json setting and set it to false:
{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mydbip/mydbname",
"user" : "username",
"password" : "pw",
"sql" : "select source_id, body, id as _id from table_name",
"index" : "borrador",
"type" : "item",
"detect_json": false <--- add this
}
}

Two elasticsearch jdbc river, index data count not match database data count

The table agent_task_base has 12000000 rows
curl -XPUT 'localhost:9200/river/myjdbc_river1/meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "...",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index1",
"type" : "my_jdbc_type1"
}
}'
curl -XPUT 'localhost:9200/river/myjdbc_river2/meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "..",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index2",
"type" : "my_jdbc_type2"
}
}'
two river execute together, but final result is
my_jdbc_index1 has 10000000+ rows
my_jdbc_index2 has 11000000+ rows
Why????
There is an issue on github of elasticsearch-jdbc-river (#143) which describes the sam problem as you described above. Try to reduce the max bulk requests and let elasticsearch indexing again.
For more details see: https://github.com/jprante/elasticsearch-river-jdbc/issues/143#issuecomment-29550301
I hope this will help
I just figured this out after much trial and error, as i was experiencing the same issue
what worked for me was defining the jdbc river parameters bulk_size and max_bulk_requests
curl -XPUT 'localhost:9200/river/myjdbc_river1/meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "...",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index1",
"type" : "my_jdbc_type1",
"bulk_size" : 160,
"max_bulk_requests" : 5
}
}'
bulk size of 160 seemed to be my magic number, bulk size of 500 was too high for my local install, and would return a java.sql exception closing the database connection, but was ok for my web server environment
bottom line is you can tinker with these numbers to tune performance, but by setting them you should see your index doc count match your sql result count

Issues when replicating from couchbase bucket to elasticsearch index?

This issue seems to be related to using the XDCR in couchbase. If I had the following simple objects
1: { "name" : "Mark", "age" : 30}
2: { "name" : "Bill", "age" : "forty"}
and set up an elasticsearch index as such
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
"couchbaseDocument" : {
"dynamic_templates": [
{
"store_generic": {
"match": "*",
"mapping": {
"store": "yes"
}
}
}
]
}
}'
I can then add the two objects to this index using the REST API
curl -XPUT localhost:9200/test/couchbaseDocument/1 -d '{
"name" : "Mark",
"age" : 30
}'
curl -XPUT localhost:9200/test/couchbaseDocument/2 -d '{
"name" : "Bill",
"age" : "forty"
}'
They are now both searchable (despite the fact the "age" is long for one and string for the other.
If, however, I stored these two objects in a couchbase bucket (rather than straight to elasticsearch) and set up the XDCR the first object replicates fine but the second fails with the following error
failed to execute bulk item (index) index {[test][couchbaseDocument][2], source[{"doc":{"name":"Bill","age":"forty"},"meta":{"id":"2","rev":"8-00000b9360d0a0bf0000000000000000","expiration":0,"flags":0}}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [doc.age]
I can't figure out why it works via the REST API but not when couchbase replicates the same objects.
I followed the answer and used the following mapping to get things to work via XDCR
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
"couchbaseDocument" : {
"properties" : {
"doc": {
"properties" : {
"name" : {"type" : "string", "store" : "yes"},
"age" : {"type" : "string", "store" : "yes"}
}
}
}
}
}'
Now all the objects (despite having different types for the same fields) are replicated and searchable. I don't think there was any need to include the dynamic_templates approach I initially tried. The mapping works.
It's something you have to solve on elasticsearch side.
If the same field name can contain both numeric values and string values, you should create a mapping first which says that age is a String.
So elasticsearch won't try to auto guess type for this field.
Hope this helps

Resources