Two Elasticsearch JDBC rivers: index document count does not match database row count - elasticsearch

The table agent_task_base has 12000000 rows
curl -XPUT 'localhost:9200/_river/myjdbc_river1/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "...",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index1",
"type" : "my_jdbc_type1"
}
}'
curl -XPUT 'localhost:9200/_river/myjdbc_river2/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "..",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index2",
"type" : "my_jdbc_type2"
}
}'
The two rivers run at the same time, but the final result is:
my_jdbc_index1 has 10000000+ rows
my_jdbc_index2 has 11000000+ rows
Why do the counts differ?

There is an issue on GitHub for elasticsearch-river-jdbc (#143) that describes the same problem as yours. Try reducing max_bulk_requests and let Elasticsearch index the data again.
For more details see: https://github.com/jprante/elasticsearch-river-jdbc/issues/143#issuecomment-29550301
I hope this will help

I just figured this out after much trial and error, as I was experiencing the same issue.
What worked for me was defining the JDBC river parameters bulk_size and max_bulk_requests:
curl -XPUT 'localhost:9200/_river/myjdbc_river1/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "...",
"user" : "...",
"password" : "...",
"sql" : "select * from agenttask_base where status=1",
"index" : "my_jdbc_index1",
"type" : "my_jdbc_type1",
"bulk_size" : 160,
"max_bulk_requests" : 5
}
}'
A bulk size of 160 seemed to be my magic number. A bulk size of 500 was too high for my local install and caused a java.sql exception that closed the database connection, although it was fine in my web server environment.
Bottom line: you can tinker with these numbers to tune performance, but once they are set you should see your index doc count match your SQL result count.
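As a rough way to check that last point, the count reported by Elasticsearch can be compared with the SQL row count. This is only a minimal Python sketch: it assumes the response body has the usual _count shape with a top-level "count" field, and the function name and the sample numbers are made up for illustration.

```python
import json


def missing_docs(count_response: str, sql_row_count: int) -> int:
    """Return how many SQL rows are not in the index (0 means the counts match)."""
    body = json.loads(count_response)
    indexed = body["count"]  # assumed top-level field of a _count response
    return sql_row_count - indexed


# Hand-written sample body, not real output from the cluster in the question.
sample_response = '{"count": 950, "_shards": {"total": 5, "successful": 5, "failed": 0}}'

print(missing_docs(sample_response, 1000))  # 50
```

In practice the response string would come from a request such as curl 'localhost:9200/my_jdbc_index1/_count', and the row count from running the river's SQL wrapped in a select count(*).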

Related

Kibana Create Index Pattern : strange behaviour of wildcard

I have just one index in elasticsearch, with name aa-bb-YYYY-MM.
Documents in this index contain a field I want to use as the date field.
Those documents have been inserted from a custom script (not using Logstash).
When creating the index pattern in Kibana:
If I enter aa-bb-*, the date field is not found.
If I enter aa-*, the date field is not found.
If I enter aa*, the date field is found, and I can create the index pattern.
But I really need to group indexes by the first two "dimensions". I tried using "_" instead of "-", with the same result.
Any idea what is going on?
It's working for me. I'm on the latest build of the 5.0 release branch (just past the beta1 release). I don't know what version you're on.
I created this index and added 2 docs:
curl --basic -XPUT 'http://elastic:changeme@localhost:9200/aa-bb-2016-09' -d '{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"test" : {
"properties" : {
"date" : { "type" : "date"},
"action" : {
"type" : "text",
"analyzer" : "standard",
"fields": {
"raw" : { "type" : "text", "index" : "not_analyzed" }
}
},
"myid" : { "type" : "integer"}
}
}
}
}'
curl -XPUT 'http://elastic:changeme@localhost:9200/aa-bb-2016-09/test/1' -d '{
"date" : "2015-08-23T00:01:00",
"action" : "start",
"myid" : 1
}'
curl -XPUT 'http://elastic:changeme@localhost:9200/aa-bb-2016-09/test/2' -d '{
"date" : "2015-08-23T14:02:30",
"action" : "stop",
"myid" : 1
}'
and I was able to create the index pattern with aa-bb-*

Elasticsearch JDBC importer not importing entry correctly

Having the following mapping:
curl -XPUT 'localhost:9200/borrador' -d '{
"mappings": {
"item": {
"dynamic": "strict",
"properties" : {
"body" : { "type": "string" },
"source_id" : { "type": "integer" }
}}}}'
I'm trying to import my DB to Elasticsearch using the Elasticsearch-JDBC importer.
This is the script I'm using:
#!/bin/sh
bin=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/bin
lib=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/lib
echo "Indexando base de datos..."
echo '{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mydbip/mydbname",
"user" : "username",
"password" : "pw",
"sql" : "select source_id, body, id as _id from table_name",
"index" : "borrador",
"type" : "item"
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
Most of the rows of the table are indexed correctly, but one row from that DB produces an error and is not indexed.
This is the error that shows up:
[ERROR][org.xbib.elasticsearch.helper.client.BulkTransportClient][elasticsearch[importer][listener][T#1]]
bulk [957] failed with 1 failed items, failure message = failure in
bulk execution:
[3499]: index [borrador], type [item], id [14327140], message [MapperParsingException[failed to parse [body]]; nested:
IllegalArgumentException[unknown property [records]];]
As you can see, this specific row contains a JSON-formatted string ({"format":"MS Excel","price":"750","records":"577","recordType":"records"}<!-- com -->) instead of the plain string found in the other entries, which index correctly.
What is happening? I would like to store it as a normal string. Is it a problem with the mapping, or is the value being read as JSON? Even if I remove "dynamic": "strict", or the entire mapping, I still get the error. Thanks in advance.
By default the JDBC importer tries to detect JSON strings in your data and will parse them. You need to modify the configuration of your importer with the detect_json setting and set it to false:
{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mydbip/mydbname",
"user" : "username",
"password" : "pw",
"sql" : "select source_id, body, id as _id from table_name",
"index" : "borrador",
"type" : "item",
"detect_json": false <--- add this
}
}
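To find out in advance which rows are affected, one option is to scan the column for values that parse as a JSON object. The Python sketch below is only an illustration: the importer's own detection logic may differ, and the row ids and the helper name are hypothetical.

```python
import json


def parses_as_json_object(value: str) -> bool:
    """Return True if the string is valid JSON whose top level is an object."""
    try:
        return isinstance(json.loads(value), dict)
    except (ValueError, TypeError):
        return False


# Hypothetical (id, body) pairs standing in for rows of table_name.
rows = [
    (1, "plain text body"),
    (2, '{"format":"MS Excel","price":"750","records":"577","recordType":"records"}'),
]

suspect_ids = [row_id for row_id, body in rows if parses_as_json_object(body)]
print(suspect_ids)  # [2]
```

Parsed as an object, the second body exposes keys such as "records", which is consistent with the "unknown property [records]" message in the error.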

Elastic search csv river module not working

I am trying to index a CSV file in Elasticsearch:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"filename_pattern" : ".*\\.csv$",
"first_line_is_header" : "true",
"field_separator" : ",",
"field_id" : "_id"
},
"index" : {
"index" : "my_csv_data_1",
"type" : "csv_type_1",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'
After indexing, searching http://localhost:9200/my_csv_data_1/_search
returns:
{
error: "IndexMissingException[[my_csv_data_1] missing]",
status: 404
}
Any thoughts, or did I miss anything?

Elasticsearch: how to use elasticsearch-river-jdbc to keep in sync (MySQL)

I am working with elasticsearch-river-jdbc. When I update the MySQL database, I want my Elasticsearch data to be updated automatically. This is the code I used to create the river:
curl -XPUT '127.0.0.1:9200/_river/my_jdbc_river/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mysql://localhost:3306/myapp_development",
"user" : "root",
"password" : "",
"sql" : "select * from users",
"autocommit" : "true"
}
}'
But when I update MySQL, nothing changes in the Elasticsearch data.
What am I doing wrong?
Simply add 'schedule' as documented here:
elasticsearch-river-jdbc#time-scheduled-execution-of-jdbc-river
{
"type" : "jdbc",
"schedule" : "0 0-59 0-23 ? * *",
"jdbc" : [ {
"url" : "jdbc:mysql://localhost:3306/ZZZZ",
"user" : "root",
"password" : "ZZZ",
"sql" : "Select …"
}]
}
(This one will get updates every minute)
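For reference, the schedule string can be read field by field. The sketch below assumes the Quartz-style layout (seconds, minutes, hours, day of month, month, day of week, optional year); the helper name is made up for illustration.

```python
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]


def describe_schedule(expression: str) -> dict:
    """Map each part of a Quartz-style cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 space-separated fields, got %d" % len(parts))
    # zip stops at the shorter sequence, so "year" is dropped for 6-field expressions.
    return dict(zip(QUARTZ_FIELDS, parts))


fields = describe_schedule("0 0-59 0-23 ? * *")
print(fields["seconds"], fields["minutes"], fields["hours"])  # 0 0-59 0-23
```

Read this way, the expression fires at second 0 of every minute of every hour, which matches the "every minute" description.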

elasticsearch data increase & duplicate at each restart

I'm using Elasticsearch with AngularJS and Oracle on Windows 7.
It's working better and better (thanks to Stack Overflow help). I have a problem with Elasticsearch: the number of documents in my index keeps increasing and I don't know why or how.
My Oracle table indexed by Elasticsearch contains 12010 rows, but I now have 84070 documents in Elasticsearch (checked frequently with curl _count), so the data has been duplicated 7 times. I re-indexed the table a few days ago, but I removed the Elasticsearch "data" folder beforehand.
The data seems to increase each time I restart Windows.
Thanks for your help.
This is how I install and index my data:
I do this only the first time:
unzip elastic in folder : D:\work\elasticsearch-1.3.1\
install web interface : >plugin -install mobz/elasticsearch-head
install jdbc : >plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.0.0/elasticsearch-river-jdbc-1.3.0.0-plugin.zip
copy "ojdbc6-11.2.0.3.jar" to "D:\work\elasticsearch-1.3.1\plugins\jdbc"
service.bat install
service.bat start
creating index
curl -XPOST 'localhost:9200/donnees'
mapping :
curl -XPUT 'localhost:9200/donnees/specimens/_mapping' -d '{
"specimens" : {
"_all" : {"enabled" : true},
"_index" : {"enabled" : true},
"_id" : {"index": "not_analyzed", "store" : false},
"properties" : {
"O_OCCURRENCEID" : {"type" : "string", "store" : "no","index": "not_analyzed" } ,
....
"I_INSTITUTIONCODE" : {"type" : "string", "store" : "yes","index": "analyzed" }
}
}}'
query oracle and index data :
curl -XPUT 'localhost:9200/_river/donnees_s/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"index" : "donnees",
"type" : "specimens",
"url" : "jdbc:oracle:thin:@localhost:1523:recolnat",
"user" : "user",
"password" : "password",
"sql" : "select * from all_specimens_data"
}
}'
(Is this correct? It doesn't work if I replace "curl -XPUT 'localhost:9200/_river/donnees_s/_meta'" with "curl -XPUT 'localhost:9200/donnees/specimens/_meta'", which is the path I use to query.)
test :
curl -XGET 'http://localhost:9200/donnees/specimens/_count?q=*'
=> 12010
curl -XGET 'http://localhost:9200/donnees/specimens/_search?q=P00009359'
=> return data ok
Resolved thanks to Konstantin V. Salikhov.
Each time the Elasticsearch service starts, it queries the database with the SQL provided to the _river and fetches the data (see my previous "query oracle and index data" step). If the data doesn't have an "_id" column, the river can't determine which records it has already loaded, and the data is duplicated each time.
To avoid duplicates, I edited my "all_specimens_data" table in the database (which is in fact a view, to avoid modifying the database) and renamed "O_OCCURRENCEID" to "_id"; "O_OCCURRENCEID" is my primary key UUID.
I hope this helps others.
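The effect of a stable "_id" can be illustrated with a small simulation. This is not the river's code, just a Python sketch in which a dict stands in for the index: documents without an id get a random one on every run (an assumption that mirrors auto-generated ids), while documents with an id overwrite the previous copy. All names and sample rows are hypothetical.

```python
import uuid


def run_import(index, rows, id_column=None):
    """Simulate one river run: store every row in `index`, keyed by document id."""
    for row in rows:
        # Without an id column a fresh random id is generated, so nothing is overwritten.
        doc_id = row[id_column] if id_column else uuid.uuid4().hex
        index[doc_id] = row


rows = [
    {"O_OCCURRENCEID": "uuid-1"},
    {"O_OCCURRENCEID": "uuid-2"},
    {"O_OCCURRENCEID": "uuid-3"},
]

without_id, with_id = {}, {}
for _ in range(7):  # seven service restarts, each one re-running the same SQL
    run_import(without_id, rows)
    run_import(with_id, rows, id_column="O_OCCURRENCEID")

print(len(without_id), len(with_id))  # 21 3
```

With 12010 rows and seven runs, the same arithmetic gives the 84070 documents reported in the question.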

Resources