I am quite new to big data tools like Hadoop. I want to replay a publicly available cluster trace (https://github.com/google/cluster-data) on YARN or a YARN simulator.
One way to do this is to feed the input into YARN via Gridmix.
The format in which Gridmix (https://hadoop.apache.org/docs/r2.8.3/hadoop-gridmix/GridMix.html) takes input is basically the output from Rumen.
Rumen (https://hadoop.apache.org/docs/r2.8.3/hadoop-rumen/Rumen.html), in turn, takes the JobHistory logs generated by a MapReduce cluster as input.
The Google trace is not a MapReduce trace. However, I was wondering whether I can transform it into the same format that Gridmix takes as input, so that I can then use Gridmix.
Can anyone point me to the input format of Gridmix (or the output format of Rumen)?
Or suggest another way to do what I want?
Thanks.
The output of Rumen consists of two files:
1. a job-trace file,
2. a cluster-topology file.
Both files are in JSON format. The job-trace file has the following format:
{
"jobID" : "job_1546949851050_53464",
"user" : "mammut",
"computonsPerMapInputByte" : -1,
"computonsPerMapOutputByte" : -1,
"computonsPerReduceInputByte" : -1,
"computonsPerReduceOutputByte" : -1,
"submitTime" : 1551801585141,
"launchTime" : 1551801594958,
"finishTime" : 1551801630228,
"heapMegabytes" : 200,
"totalMaps" : 2,
"totalReduces" : 1,
"outcome" : "SUCCESS",
"jobtype" : "JAVA",
"priority" : "NORMAL",
"directDependantJobs" : [ ],
"mapTasks" : [ {
"inputBytes" : 25599927,
...}]
...
}
The cluster-topology file looks like this:
{
"name" : "<root>",
"children" : [ {
"name" : "rack-01",
"children" : [ {
"name" : "",
"children" : null
}, {
"name" : "",
"children" : null
}, {
"name" : "",
"children" : null
} ]
}, {
"name" : "default-rack",
"children" : [ {
"name" : "x",
"children" : null
} ]
} ]
}
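As a rough illustration of the transformation asked about in the question, the sketch below builds a job-trace-shaped record from one generic trace row. Only the output field names come from the job-trace excerpt above. The input row layout (job_id, user, submit_ms and so on) and the helper name to_job_trace_record are hypothetical, and the fields elided in the excerpt (such as the per-task details) are not filled in, so Gridmix may well need more than this produces.

```python
import json


def to_job_trace_record(row):
    """Map one hypothetical trace row onto job-trace fields shown in the excerpt."""
    return {
        "jobID": row["job_id"],
        "user": row["user"],
        # -1 mirrors the excerpt, where the computon fields are unset.
        "computonsPerMapInputByte": -1,
        "computonsPerMapOutputByte": -1,
        "computonsPerReduceInputByte": -1,
        "computonsPerReduceOutputByte": -1,
        # Timestamps are epoch milliseconds, as in the excerpt.
        "submitTime": row["submit_ms"],
        "launchTime": row["launch_ms"],
        "finishTime": row["finish_ms"],
        "totalMaps": row["num_maps"],
        "totalReduces": row["num_reduces"],
        "outcome": "SUCCESS",
        "priority": "NORMAL",
        "directDependantJobs": [],
    }


# Hypothetical input row; a real trace would need its own column mapping.
example_row = {
    "job_id": "job_1546949851050_53464",
    "user": "mammut",
    "submit_ms": 1551801585141,
    "launch_ms": 1551801594958,
    "finish_ms": 1551801630228,
    "num_maps": 2,
    "num_reduces": 1,
}

record = to_job_trace_record(example_row)
print(json.dumps(record, indent=2))
```

How to map a non-MapReduce task onto totalMaps and totalReduces is a modelling decision that this sketch leaves open.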
We're using Logstash to sync Elasticsearch, and we have around 3 million documents. The sync takes 3 to 4 hours. Currently all we can see is that it started and stopped. Is there any way to see how many records Logstash has processed?
If you're using Logstash 5 or higher, the Logstash Monitoring API can help you. It lets you see and monitor what's happening inside Logstash as it processes events. If you hit the Pipeline stats API, you'll get the total number of processed events per stage and plugin (input/filter/output):
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
You'll get this type of response in which you can clearly see at any time how many events have been processed:
{
"pipelines" : {
"test" : {
"events" : {
"duration_in_millis" : 365495,
"in" : 216485,
"filtered" : 216485,
"out" : 216485,
"queue_push_duration_in_millis" : 342466
},
"plugins" : {
"inputs" : [ {
"id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-1",
"events" : {
"out" : 216485,
"queue_push_duration_in_millis" : 342466
},
"name" : "beats"
} ],
"filters" : [ {
"id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-2",
"events" : {
"duration_in_millis" : 55969,
"in" : 216485,
"out" : 216485
},
"failures" : 216485,
"patterns_per_field" : {
"message" : 1
},
"name" : "grok"
}, {
"id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-3",
"events" : {
"duration_in_millis" : 3326,
"in" : 216485,
"out" : 216485
},
"name" : "geoip"
} ],
"outputs" : [ {
"id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-4",
"events" : {
"duration_in_millis" : 278557,
"in" : 216485,
"out" : 216485
},
"name" : "elasticsearch"
} ]
},
"reloads" : {
"last_error" : null,
"successes" : 0,
"last_success_timestamp" : null,
"last_failure_timestamp" : null,
"failures" : 0
},
"queue" : {
"type" : "memory"
}
}
}
}
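To turn that response into a progress figure, something like the following sketch may help. It assumes the response has the shape shown above. The helper names fetch_stats and summarize_pipeline_events are made up for illustration, and the expected total of 3 million documents comes from the question rather than from Logstash.

```python
import json
from urllib.request import urlopen


def fetch_stats(url="http://localhost:9600/_node/stats/pipelines"):
    """Fetch the pipeline stats as a dict (same endpoint as the curl call)."""
    with urlopen(url) as resp:
        return json.load(resp)


def summarize_pipeline_events(stats):
    """Return {pipeline name: events counters} from a stats response."""
    return {name: p["events"] for name, p in stats["pipelines"].items()}


# Trimmed copy of the response above, so the sketch runs without Logstash.
sample = json.loads(
    '{"pipelines": {"test": {"events": '
    '{"in": 216485, "filtered": 216485, "out": 216485}}}}'
)

EXPECTED_TOTAL = 3_000_000  # assumption: document count from the question

summary = summarize_pipeline_events(sample)
for name, events in summary.items():
    done = events["out"] / EXPECTED_TOTAL
    print(f"{name}: in={events['in']} out={events['out']} ({done:.1%} of expected)")
```

Against a running instance, summarize_pipeline_events(fetch_stats()) could be called periodically to watch the counters grow.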
I tried to get the value of the first field named version using JSONPath, but my solution didn't work.
I tried expressions like $..version and $.container..version.
My response is below:
{
"container" : {
"version" : 8,
"updatedBy" : "user111",
"updatedOn" : "2017-08-17T16:00:24Z",
"id" : 16,
"dataEnt" : {
"dataEntid" : "dataEntid-000032",
"dataEnttype" : "21"
},
"impact" : [ ],
"operationalFocus" : false,
"periodicity" : {
"version" : 0,
"updatedBy" : "unknown",
"updatedOn" : "2017-03-31T16:44:08Z",
"step" : 1,
"period" : 31084132,
"_VALIDATION" : {
"valid" : true,
"saveAll" : true,
"reasons" : [ ],
"details" : {
"period" : {
"valid" : true,
"saveAll" : true,
"risks" : [ ],
"rmiCode" : null,
"rmiMessage" : null
},
"version" : {
"valid" : true,
"saveAll" : true,
"risks" : [ ],
"rmiCode" : null,
"rmiMessage" : null
},
"step" : {
"valid" : true,
"saveAll" : true,
"risks" : [ ],
"rmiCode" : null,
"rmiMessage" : null
}
},
"rmiCode" : null,
"rmiMessage" : null
},
"_META" : { }
}
First of all, the JSON you pasted is invalid: it's missing 2 curly brackets at the end (the root object and the container object are not closed). If this is not a copy/paste error on SO but an actual data problem, you may need to correct that first.
If I understood correctly, you want the value from this field in the variable:
"version" : 8
If so, JSON path should be:
$.container.version
or
container.version
if you prefer a relative path to an absolute one.
A path like $..version or $.container..version selects multiple version fields: besides "version" : 8, it also matches "version" : 0 in the periodicity property and the version object inside _VALIDATION.
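The difference between the direct path and the recursive one can be reproduced without a JSONPath library. The sketch below is a plain-Python stand-in, not a JSONPath implementation: find_all is a hypothetical helper that mimics what $..version does, and doc is a trimmed copy of the response above with the missing brackets closed.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth (like $..key)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending, since matches can be nested inside any value.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Trimmed copy of the response, keeping only the three "version" fields.
doc = {
    "container": {
        "version": 8,
        "periodicity": {
            "version": 0,
            "_VALIDATION": {"details": {"version": {"valid": True}}},
        },
    }
}

direct = doc["container"]["version"]   # counterpart of $.container.version
recursive = find_all(doc, "version")   # counterpart of $..version

print(direct)
print(recursive)
```

The direct lookup yields the single value 8, while the recursive search returns all three version entries, which is why the recursive expressions did not give the expected result.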
The following expression will get you the desired result.
Variable: ContainerVersion
JSON Expression: $..container.version
Now the stored version value can be called using: ${ContainerVersion}
If there are multiple "version" fields, you can load all of their values with the following expression:
$..container.version[*]
You can then refer to the variables as ${Var_1}, ${Var_2}, etc.
Add a debug sampler to see the loaded variable names and their corresponding values.
Hope the above helps...
I have a valid multiline JSON file. I want to parse it so that its keys become field names and its values become field values.
Is it possible to do this automatically?
input {
  file {
    path => "/home/logstash/xunit.json"
    codec => json
  }
}
output {
  stdout {}
  elasticsearch {
    protocol => "http"
    codec => "json"
    host => "kibana.dev"
    port => "9200"
  }
}
With this config I can see that something was added, but the fields from my JSON do not appear. Is it possible to extract name, severity, status, and the start and stop dates?
My JSON example:
[
{
"uid" : "441d1d1dd296fe60",
"name" : "test_buylinks",
"title" : "Test buylinks",
"time" : {
"start" : 1419621623182,
"stop" : 1419621640491,
"duration" : 17309
},
"severity" : "NORMAL",
"status" : "FAILED"
},
{
"uid" : "a88c89b377aca0c9",
"name" : "test_buylinks",
"title" : "Test buylinks",
"time" : {
"start" : 1419621623182,
"stop" : 1419621640634,
"duration" : 17452
},
"severity" : "NORMAL",
"status" : "FAILED"
},
{
"uid" : "32c3f8b52386c85c",
"name" : "test_buylinks",
"title" : "Test buylinks",
"time" : {
"start" : 1419621623185,
"stop" : 1419621640826,
"duration" : 17641
},
"severity" : "NORMAL",
"status" : "FAILED"
}
]
I have inserted 3 records in my ElasticSearch index as follows:
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "w bridgewater",
"raw_name" : "W BRIDGEWATER"
},
{ "language" : "ENG",
"name" : "west bridgewater",
"raw_name" : "West Bridgewater"
}
],
"id" : 1,
"streetNames" : [ { "language" : "ENG",
"name" : "cram rd",
"raw_name" : "Cram Rd"
} ]
}'
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "bridgewater corners",
"raw_name" : "BRIDGEWATER CORNERS"
},
{ "language" : "ENG",
"name" : "bridgewater center",
"raw_name" : "Bridgewater Center"
}
],
"id" : 2,
"streetNames" : [ { "language" : "ENG",
"name" : "valley view rd",
"raw_name" : "Valley View Rd"
} ]
}'
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "bridgewater",
"raw_name" : "Bridgewater"
},
{ "language" : "ENG",
"name" : "windsor",
"raw_name" : "Windsor"
}
],
"id" : 3,
"streetNames" : [ { "language" : "ENG",
"name" : "valley view rd",
"raw_name" : "Valley View Rd"
} ]
}'
And I perform a search as follows:
curl -XGET 'http://127.0.0.1:9200/geoindex_test/STREET/_search?pretty=1' -d '
{
"query" : {
"match" : { "cityNames.name" : "bridgewater" }
}
}'
I thought ElasticSearch would return the third record (id == 3) as the best match (record 3 is the only exact match to "bridgewater"), but instead it returns the record for id 1 (w bridgewater) as the best match. What am I doing wrong?
I imagine this is happening because you are using inner objects, which are collapsed into a single object for search purposes. So when you query the name field of Object 1, for example, you are querying against ["w bridgewater", "west bridgewater"] and not against discrete fields as you might expect.
Since 'bridgewater' appears twice in Objects 1 and 2 (in two name fields each) versus once in Object 3, those items rank higher in the search. Object 1 is ultimately picked because the fields that 'bridgewater' appears in are shorter strings than in Object 2 ("w bridgewater" vs "bridgewater corners").
Instead of inner objects, use nested objects (http://www.elasticsearch.org/guide/reference/mapping/nested-type/). Setting the score mode to "max" in the nested query will then make things match in a more intuitive manner for you.
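As a sketch of what that could look like, the request bodies below are built as Python dicts and printed as JSON. The typed mapping layout is an assumption that matches the older Elasticsearch versions implied by the URLs in the question, so the exact shape may differ on your version. The index would also have to be recreated with this mapping before the documents are indexed again.

```python
import json

INDEX_TYPE = "STREET"  # type name taken from the curl commands in the question

# Mapping body: declare cityNames as nested so each entry is matched on its own.
# Assumption: typed mappings, as used by older Elasticsearch versions.
mapping_body = {
    "mappings": {
        INDEX_TYPE: {
            "properties": {
                "cityNames": {"type": "nested"}
            }
        }
    }
}

# Search body: wrap the original match query in a nested query and score each
# document by its best-matching cityNames entry.
query_body = {
    "query": {
        "nested": {
            "path": "cityNames",
            "score_mode": "max",
            "query": {
                "match": {"cityNames.name": "bridgewater"}
            },
        }
    }
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(query_body, indent=2))
```

The printed bodies can be passed to curl with -d in the same way as the requests in the question.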
I need to create a map/array for auto complete from a JSON response and I am looking for the best, most efficient way to do it in Ruby and Rails 3. A portion of the response is below and the working code I have is before it. What is the one line of code I need to create locations for me?
# Need help making this more efficient
# (assumes `locations` is not already defined elsewhere)
locations = []
response_fields = JSON.parse(response.body)
predictions = response_fields['predictions']
predictions.each do |prediction|
  locations << prediction['description']
end
Sample response from API:
{
"predictions" : [
{
"description" : "Napa, CA, United States",
"id" : "cf268f9fb9a1b46aed72d59ab85ed40f982763c6",
"matched_substrings" : [
{
"length" : 4,
"offset" : 0
}
],
"reference" : "CjQvAAAAqZWNGzqtJf3awNuQNQdnZpl4dBVVXFPrPdz29r1jo1GMWYFuz3KRlK9HgdgszOThEhDeYz_vYgcOPJTaYehF11bUGhR8yH9zqMGV9kenZIo9OTBrSwftgg",
"terms" : [
{
"offset" : 0,
"value" : "Napa"
},
{
"offset" : 6,
"value" : "CA"
},
{
"offset" : 10,
"value" : "United States"
}
],
"types" : [ "locality", "political", "geocode" ]
},
You can shorten your code like this:
locations = JSON.parse(response.body)['predictions'].map { |p| p['description'] }