I have a dataset of more than a million rows. I have integrated Elasticsearch with MySQL using Logstash.
When I send the following URL from Postman,
http://localhost:9200/persondetails/Document/_search?q=*
I get the following:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "persondetails",
"_type": "Document",
"_id": "%{idDocument}",
"_score": 1,
"_source": {
"iddocument": 514697,
"#timestamp": "2017-08-31T05:18:46.916Z",
"author": "vaibhav",
"expiry_date": null,
"#version": "1",
"description": "ly that",
"creation_date": null,
"type": 1
}
},
{
"_index": "persondetails",
"_type": "Document_count",
"_id": "AV4o0J3OJ5ftvuhV7i0H",
"_score": 1,
"_source": {
"query": {
"term": {
"author": "rishav"
}
}
}
}
]
}
}
This is wrong, since my table has more than 1 million rows, yet the total here shows only 2. I am unable to find the mistake.
When I query http://localhost:9200/_cat/indices?v it shows this:
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   persondetails 4FiGngZcQfS0Xvu6IeHIfg 5   1   2          1054         125.4kb    125.4kb
This is my logstash.conf file
input {
jdbc {
jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/persondetails"
jdbc_user => "root"
jdbc_password => ""
schedule => "* * * * *"
jdbc_validate_connection => true
jdbc_driver_library => "/usr/local/Cellar/logstash/5.5.2/mysql-connector-java-3.1.14/mysql-connector-java-3.1.14-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
statement => "SELECT * FROM Document"
type => "persondetails"
}
}
output {
elasticsearch {
#protocol=>http
index =>"persondetails"
document_type => "Document"
document_id => "%{idDocument}"
hosts => ["http://localhost:9200"]
}
stdout { codec => rubydebug }
}
From your result, it looks like there is an issue in your Logstash configuration: the document_id is not being resolved, so every row overwrites the same document, and effectively there is only one document in your index with the literal ID "%{idDocument}".
See the following _source snippet from the result of the search query you provided:
"_source": {
"iddocument": 514697,
"#timestamp": "2017-08-31T05:18:46.916Z",
"author": "vaibhav",
"expiry_date": null,
"#version": "1",
"description": "ly that",
"creation_date": null,
"type": 1
}
The small size of the index also suggests there aren't any more documents. You should check whether your jdbc input is actually emitting the "idDocument" field.
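For example, since the _source above shows the field as iddocument (the jdbc input lowercases column names by default), a minimal sketch of the corrected output section, assuming that lowercase field name is what actually arrives in the event, would be:
output {
  elasticsearch {
    hosts         => ["http://localhost:9200"]
    index         => "persondetails"
    document_type => "Document"
    # reference the lowercase column name emitted by the jdbc input
    document_id   => "%{iddocument}"
  }
}
With a resolvable document_id, each row is indexed as its own document instead of overwriting the previous one.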
Related
I have a sample CSV file in S3 with 3 columns and no header. During the data transfer from the S3 CSV to Elasticsearch, I want to give each column a name (in my case id, name, and age for columns 0 to 2 respectively).
Input Sample.csv
1,myname,23
2,myname2,24
The expected output should be the following docs in the ES index:
[{
"_index": "user_detail",
"_type": "user_detail_type",
"_id": "1",
"_score": 1.0,
"_source": {
"id": "1",
"name": "myname",
"age": "23"
}
},
{
"_index": "user_detail",
"_type": "user_detail_type",
"_id": "2",
"_score": 1.0,
"_source": {
"id": "2",
"name": "myname2",
"age": "24"
}
}]
The Logstash config that I have written is:
input {
s3 {
bucket => "users"
region => "us-east-1"
watch_for_new_files => false
prefix => "user.csv"
}
}
filter {
# Need help here
}
output {
elasticsearch {
hosts => "localhost:9200"
index => "user_detail"
document_type => "user_detail_type"
document_id => "%{id}"
}
}
Question:
What should I write in the filter section (or change elsewhere in the config) to map column[0] => id, column[1] => name, and column[2] => age during the Elasticsearch insertion?
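A minimal sketch of the filter section using the csv filter, assuming the file is plain comma-separated and id, name and age are the names you want, might look like this:
filter {
  csv {
    separator => ","
    # assign names to the three positional columns in order
    columns   => ["id", "name", "age"]
  }
}
Once the columns are named, the document_id => "%{id}" in your output should resolve to the value of the first column.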
I'm using the elasticsearch filter in my Logstash pipeline. I correctly find the result using:
filter{
if [class] == "DPAPIINTERNAL" {
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
}
}
}
As you can see, I'm using "query_template", which is:
{
"query": {
"query_string": {
"query": "class:DPAPI AND request.aw:%{[aw]}"
}
},
"_source": ["end_point", "vittorio"]
}
This tells Elasticsearch to look up the log with that specific class whose "aw" matches the one in the DPAPIINTERNAL log.
Perfect! But now that I have found the result, I want to take some fields from it and attach them to my DPAPIINTERNAL log; for instance, I want to take "end_point" and add it under the new key "vittorio" inside my log.
This is not happening and I don't understand why.
Here is the log that the query matches:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "dp_audit-2017.02.16",
"_type": "logs",
"_id": "AVpHoPHPuEPlW12Qu",
"_score": 1,
"_source": {
"svc": "dp-1.1",
"request": {
"method": "POST|PATCH|DELETE",
"aw": "prova",
"end_point": "/bank/6311",
"app_instance": "7D1-D233-87E1-913"
},
"path": "/home/vittorio/Documents/dpapi1.json",
"#timestamp": "2017-02-16T15:53:33.214Z",
"#version": "1",
"host": "Vito",
"event": "bank.add",
"class": "DPAPI",
"ts": "2017-01-16T19:20:30.125+01:00"
}
}
]
}
}
You need to specify the fields parameter in your elasticsearch filter, like this:
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
fields => { "[request][end_point]" => "vittorio" }
}
Note that since end_point is a nested field, you need to modify the _source in your query template like this:
"_source": ["request.end_point"]
The problem is simply that you don't have to specify the "new" field in the query_template.
"_source": ["request"] # here you specify the field you want from the query result.
and then
filter{
if [class] == "DPAPIINTERNAL" {
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
fields => {"request" => "new_key"} # here you add the fields and will tell elastich filter to put request inside new_key
}
}
}
That worked for me!
Hi, I am using the following script file in Logstash 2.x. I have over 186,000 records in a MySQL database table, but when running this .conf file only one document is loaded into the Elasticsearch index.
input {
jdbc {
jdbc_connection_string => "jdbc:mysql://localhost/elasticsearch"
jdbc_user => "root"
jdbc_password => "empower"
#jdbc_validate_connection => true
jdbc_driver_library => "/home/wtc082/Documents/com.mysql.jdbc_5.1.5.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
statement => "SELECT * FROM index_part_content_local;"
#schedule => "* * * * *"
#codec => "json"
}
}
output {
elasticsearch {
index => "mysqltest"
document_type => "mysqltest_type"
document_id => "%{id}"
hosts => "localhost:9200"
}
}
When I use this query, only one document is indexed:
GET mysqltest/_search
{
"query": {
"match_all": {}
}
}
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "mysqltest",
"_type": "mysqltest_type",
"_id": "%{id}",
"_score": 1,
"_source": {
"partnum": "",
"property1": "",
"property2": "",
"color": "",
"size": "",
"dim": "",
"thumburl": "",
"catid": "6575",
"subcatid": "3813",
"termid": "31999",
"longdesc": "<ul><li>Equipment and Parts<li>GC32-XD Parts<li>D/V Lock Plate Screw</ul>",
"hier1desc": "Heavy Duty Tools / Equipment",
"hier2desc": "Other Heavy Duty Equipment",
"hier3desc": "Hose Crimping Equipment & Accessories",
"aaiabrandid": "BBSC",
"aaiabrandname": "Gates",
"brandimageurl": "es-logo-sm.jpg",
"linecode": "GAT",
"descrp": "D/V Lock Plate Screw",
"#version": "1",
"#timestamp": "2016-12-20T09:16:40.075Z"
}
}
]
}
}
OK, as you can see, the ID of your document is the verbatim value "%{id}", which means that apparently you don't have an id column in your table, so all records from your database are indexed under the same document ID, and that is why you only see one document.
In your elasticsearch output, you need to make sure to use a field that is the primary key of your table:
document_id => "%{PRIMARY_KEY}"
Fix that and it will work.
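For illustration, if the primary key of index_part_content_local were a column named termid (a hypothetical choice here; substitute your table's actual primary key), the output section would become:
output {
  elasticsearch {
    index => "mysqltest"
    document_type => "mysqltest_type"
    # termid is only an example; use the real primary key column of your table
    document_id => "%{termid}"
    hosts => "localhost:9200"
  }
}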
Environment
DB: Sybase
Logstash: 2.2.0 with JDBC Plugin, Elasticsearch Output plugin
SQL Query:
select res.id as 'res.id', res.name as 'res.name', tag.name as 'tag.name'
from Res res, ResTags rt, Tags tag
where res.id *= rt.resrow and rt.tagid *= tag.id
SQL Result:
res.id | res.name | tag.name
0 | result0 | null
0 | result0 | tagA
1 | result1 | tagA
1 | result1 | tagB
2 | result2 | tagA
2 | result2 | tagC
Index Mapping:
{
"mappings": {
"res": {
"properties": {
"id": { "type": "long"},
"name": { "type": "string" },
"tags": {
"type": "nested",
"properties": { "tagname": { "type": "string" }}
}
}
}
}
}
Conf File:
input {
jdbc {
jdbc_driver_library => "jtds-1.3.1.jar"
jdbc_driver_class => "Java::net.sourceforge.jtds.jdbc.Driver"
jdbc_connection_string => "jdbc:jtds:sybase://hostname.com:1234/schema"
jdbc_user => "george"
jdbc_password => "monkey"
jdbc_fetch_size => 100
statement_filepath => "/home/george/sql"
}
}
output {
elasticsearch {
action => "update"
index => "myres"
document_type => "res"
document_id => "%{res.id}"
script_lang => "groovy"
hosts => [ "my.other.host.com:5921" ]
upsert => ' {
"id" : %{res.id},
"name" : "%{res.name}",
"tags" :[{ "tagname": "%{tag.name}" }]
}'
script => '
if (ctx._source.res.tags.containsValue(null)) {
// if null has been added replace it with actual value
cts._source.res.tags = [{"tagname": "%{tag.name}" }];
else {
// if you find the tag, then do nothing
if (ctx._source.res.tags.containsValue("%{tag.name}")) {}
else {
// if the value you try to add is not null
if (%{tag.name} != null)
// add it as a new object into the tag array
ctx._source.res.tags += {"tagname": "%{tag.name}"};
}
}
'
}
}
The GOAL is to load the multiple rows returned from the database into ES, accumulating the tags as new objects (this is a simplified example, so add_tag and filters don't do the job, as I have a JSON structure deeper than 2 levels: nested of nested, etc.).
The desired outcome after the bulk upload into ES would be:
{
"hits": {
"total": 3,
"max_score": 1,
"hits": [ {
"_index": "myres",
"_type": "res",
"_id": 0,
"_score": 1,
"_source": {
"res": {
"id":0,
"name": "result0",
"tags": [{"tagname": "tagA"}],
"#version": "2",
"#timestamp": "2016-xx-yy..."
}
}
},
{
"_index": "myres",
"_type": "res",
"_id": 1,
"_score": 1,
"_source": {
"res": {
"id":1,
"name": "result1",
"tags": [{"tagname": "tagA"},{"tagname": "tagB"}],
"#version": "2",
"#timestamp": "2016-xx-yy..."
}
}
},
{
"_index": "myres",
"_type": "res",
"_id": 2,
"_score": 1,
"_source": {
"res": {
"id":2,
"name": "result2",
"tags": [{"tagname": "tagA"},{"tagname": "tagC"],
"#version": "2",
"#timestamp": "2016-xx-yy..."
}
}
}
...
ISSUE: if the script in the output section of the conf is not commented out, the error below pops up. If the script is left out, then only the initial tags are imported (as expected) and the subsequent ones are not.
It looks like the script is not working within the elasticsearch output.
ERROR message:
[400] {"error":"ActionRequestValidationException[Validation Failed:
1: script or doc is missing;
2: script or doc is missing;
3: script or doc is missing;],"status":400]} {:class=> ... bla bla ...}
NOTES
To avoid wasting people's time: doc_as_upsert => true also does not work as expected; it just keeps overwriting and retains only the latest row from the DB.
Also, the JDBC river plugin for ES does not support nested-of-nested structures, so that does not work either.
I've been struggling with a problem for a while now, so I thought I would run it by Stack Overflow.
My document type has a title, a language field (used for filtering) and a grouping ID field (I'm leaving out all the other fields to keep this to the point).
When I search for documents, I want to find all documents containing the text in the title, but only one document for each unique grouping ID.
I've been looking at the top_hits aggregation, and from what I can see it should be able to solve my problem.
When running this query against my index:
{
"query": {
"match": {
"title": "dingo"
}
},
"aggs": {
"top-tags": {
"terms": {
"field": "groupId",
"size": 1000000
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"_source": {
"include": [
"*"
]
},
"size": 1
}
}
}
}
}
}
I get the following response (all results are in the same language):
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"doc_count": 2,
"top_tag_hits": {
"hits": {
"total": 2,
"max_score": 1.4983996,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
"_score": 1.4983996,
"_source": {
"groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"title": "wombat, dingo and zetapunga actionfigures",
}
}]
}
}
},
{
"key": "F11799ABD0C14B98ADF2554C84FF0DA0",
"doc_count": 1,
"top_tag_hits": {
"hits": {
"total": 1,
"max_score": 1.30684,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "42562A25E4434A0091DE0C79A3E7F3F4",
"_score": 1.30684,
"_source": {
"groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
"title": "awesome dingo raptor"
}
}]
}
}
}]
}
}
}
This is exactly what I expected (two hits in one bucket, but only one document retrieved for that bucket). However, when I try this in NEST, I can't seem to retrieve all of the documents.
My query looks like this:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(100000) //We sadly need all products
)
.TopHits("top_tag_hits", thagd => thagd
.Size(1)
.Source(ssd => ssd.Include("*")))
));
var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)
Inspecting the aggregations in the debugger reveals a "groupId" aggregation with 2 buckets (matching what I see in my "raw" query against the index), just without any apparent way to retrieve the documents.
So my question is: how do I retrieve the top hit for each bucket? Or am I doing this completely wrong? Is there some other way to achieve what I am trying to do?
EDIT
After the help I received, I was able to retrieve my results with the following:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
);
var groupIdAggregation = result.Aggs.Terms("groupId");
var topHits =
groupIdAggregation.Items.Select(key => key.TopHits("top_tag_hits"))
.SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>()).ToList();
Your NEST query tries to run the Terms aggregation and TopHits side by side, while your original query runs Terms first and then calls TopHits for each bucket.
You simply have to nest your TopHits agg inside Terms in your NEST query to make it work.
This should fix it:
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
By the way, you don't have to use Include("*") to include all fields; just remove that option. Also, specifying .Size(0) should bring back ALL possible terms for you.