aws datapipeline: waiting for dependencies - hadoop

I have a data pipeline that gets stuck in pending mode, every time with "Waiting on dependencies".
Here I am using a Hive Activity, which needs an input and an output. In my case, all my data lives in the Hadoop infrastructure, so I really don't need an S3 input and an S3 output. However, there is no way to remove them; Data Pipeline errors out. Worse, the pipeline gets stuck at this point in spite of a precondition that the S3 node "exists". Every time I run this pipeline I have to manually mark the S3 node finished, and things work after that.
{
  Name: #S3node1_2014-08-01T13:59:50
  Description:
  Status: WAITING_ON_DEPENDENCIES
  Waiting on: #RunExpertCategories_2014-08-01T13:59:50
}
Any insights would help. AWS Datapipeline documentation does not go into detail.

If you set "stage": "false" on the HiveActivity then the input and output nodes will not be required.
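For illustration, a minimal HiveActivity fragment of a pipeline definition with staging disabled might look like this (the id matches the activity named in the status dump above; the script and cluster reference are hypothetical):

```json
{
  "id": "RunExpertCategories",
  "type": "HiveActivity",
  "stage": "false",
  "hiveScript": "INSERT OVERWRITE TABLE output_table SELECT * FROM input_table;",
  "runsOn": { "ref": "EmrClusterResource" }
}
```

With staging off, the activity no longer expects S3 input/output nodes, though the staging-directory variables are then unavailable inside the Hive script.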


Issues with mkdbfile in a simple "read a file > Create a hashfile job"

Hello DataStage-savvy people.
Two days in a row, the same DataStage job failed (it hung rather than stopping).
The job tries to create a hashfile using the command /logiciel/iis9.1/Server/DSEngine/bin/mkdbfile /[path to hashfile]/[name of hashfile] 30 1 4 20 50 80 1628
(last trace in the log)
Something to consider (or maybe not ? ) :
The [name of hashfile] directory exists (and was last modified at the time of execution), but the file D_[name of hashfile] does not.
I am trying to understand what happened so I can prevent the same incident from happening on the next run (tonight).
Before this, the job had been in production for ages and we had never had an issue with it.
Using Datastage 9.1.0.1
Did you check the job log to see if it captured an error? When a DataStage job executes a system command via a command execution stage or similar, the stdout of the called command is captured and added to a message in the job log. So if the mkdbfile command produces any output (success messages, errors, etc.), it should be captured and logged. Depending on the return code, the event may not be flagged as an error in the job log, but the output should be there.
If there is no logged message revealing why the file was not created, a couple of things to check:
-- Was the disk holding the target directory possibly out of space at that time?
-- Do you have any antivirus software that scans directories on your system at certain times? If so, it can interfere with I/O. If a scan ran at the same time you had the problem, you may wish to update the AV software settings to exclude the directory you were writing the dbfile to.

storm rebalance command not updating the number of workers for a topology

I tried executing the following command on Storm 1.1.1:
storm rebalance [topologyName] -n [number_of_workers]
The command runs successfully, but the number of workers remains unchanged. I tried reducing the number of workers too; that didn't work either.
I have no clue what's happening. Any pointer would be helpful.
FYI: I have implemented custom scheduling. Is it because of that?
You can always check Storm's source code behind that CLI. Or code the rebalance yourself (tested against 1.0.2):
RebalanceOptions rebalanceOptions = new RebalanceOptions();
rebalanceOptions.set_num_workers(newNumWorkers);
// "foo" is the topology name; client is a Nimbus.Client,
// e.g. obtained via NimbusClient.getConfiguredClient(conf).getClient()
client.rebalance("foo", rebalanceOptions);

Filebeat duplicating events

I am running a basic ELK stack setup using Filebeat > Logstash > Elasticsearch > Kibana, all on version 5.2.
When I remove Filebeat and configure Logstash to look directly at a file, it ingests the correct number of events.
If I delete the data and re-ingest the same log file using Filebeat to pass its contents to Logstash, over 10% more events are created. I have checked a number of these to confirm the duplicates are being created by Filebeat.
Has anyone seen this issue, or have any suggestions as to why this would happen?
First, I need to understand what you mean by removing Filebeat.
Possibility 1:
If you uninstalled and installed it again, then Filebeat will obviously read the data from the path again (which you have re-ingested) and post it to Logstash -> Elasticsearch -> Kibana (assuming the old data has not been removed from the Elasticsearch node), hence the duplicates.
Possibility 2:
You just stopped Filebeat, configured Logstash for direct file input, and restarted Filebeat, and perhaps the registry file was not updated properly during shutdown. As you know, Filebeat reads line by line and records in the registry file the last line it has successfully published to Logstash/Elasticsearch/Kafka etc. If any of those output servers has difficulty processing the load coming from Filebeat, Filebeat waits until the servers are available again, then reads the registry file and resumes publishing from the line after the last recorded one. If the recorded offset is stale, previously shipped lines get published again, hence the duplicates.
A sample registry file looks like:
{
  "source": "/var/log/sample/sample.log",
  "offset": 88,
  "FileStateOS": {
    "inode": 243271678,
    "device": 51714
  },
  "timestamp": "2017-02-03T06:22:36.688837822-05:00",
  "ttl": -2
}
As you can see, the registry maintains an offset and timestamp per file, and a stale registry is one common cause of duplicates.
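If you want to inspect (or, in a pinch, delete) the registry to reset Filebeat's read state, its location is configured in filebeat.yml; for Filebeat 5.x the option looks like this (the path shown is a common package default, verify it for your install):

```yaml
filebeat.registry_file: /var/lib/filebeat/registry
```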
For further reference, you can follow the links below:
https://discuss.elastic.co/t/filebeat-sending-old-logs-on-restart/46189
https://discuss.elastic.co/t/deleting-filebeat-registry-file/46112
https://discuss.elastic.co/t/filebeat-stop-cleaning-registry/58902
Hope that helps.

Job with multiple tasks on different servers

I need a job with multiple tasks, run on different machines one after another (not simultaneously). While the current job is running, another identical job can arrive in the queue, but it must not start until the previous one has finished. So I came up with this 'solution', which might not be the best, but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDb or Redis) with the following structure:
{
  hostname: 'host where to execute the task',
  running: FALSE,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
Hosts:
-- search for jobs with a matching hostname and running == FALSE
-- set running = TRUE and execute the task indicated by that job's task number
-- upon finishing, set running = FALSE, check whether there are further tasks to perform, and if so increase the task number and set the hostname to the machine named in the next task
Because jobs can accumulate, imagine jobs queued for one host like this: A, B, A.
Since I have to run all the jobs for the specified machine, how do I avoid starting the third job (the second A) while the first A is still running?
{
  _id: ObjectId("xxxx"), // unique, generated by MongoDB, indexed, sortable
  hostname: 'host where to execute the task',
  running: FALSE,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
The question is how would the next available "worker" know whether it's safe for it to start the next job on a particular host.
You probably need some sort of sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, you can let it generate _id, which is already unique, indexed, and in time order, since its first four bytes are a timestamp.
You can now query to see if there is a job to run for a particular host like so:
// pseudo code - shell syntax, not actual code
var jobToRun = db.queue.findOne({hostname: <myHostName>}, {}, {sort: {_id: 1}});
if (jobToRun.running == FALSE) {
    myJob = db.queue.findAndModify({
        query: {_id: jobToRun._id, running: FALSE},
        update: {$set: {running: TRUE}}
    });
    if (myJob == null) {
        print("Someone else already grabbed it");
    } else {
        /* now we know that we updated this and we can run it */
    }
} else {
    /* sleep and try again */
}
This checks for the oldest/earliest job for a specific host, then looks to see whether that job is running. If yes, do nothing (sleep and try again); otherwise try to "lock" it by doing findAndModify on _id and running: FALSE, setting running to TRUE. If the document is returned, this process succeeded with the update and can start the work. Since two threads can try this at the same time, getting back null means another thread already changed the document to running, so wait and start again.
I would advise using a timestamp somewhere to indicate when a job started "running" so that if a worker dies without completing a task it can be "found" - otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you remove the job once it is finished, rather than setting running back to FALSE; if you do set running back to FALSE so that other tasks can be done, you will probably also want to update the tasks array to indicate what has been done.
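The claim step in the pseudocode above generalizes beyond MongoDB; below is a minimal in-memory analogue in Python (all names hypothetical), where a lock-protected compare-and-set plays the role of findAndModify, returning None when the oldest job for the host is already running:

```python
import threading

class JobQueue:
    """Toy analogue of the MongoDB queue; claim() mimics findAndModify."""
    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = []  # dicts in arrival order (like _id order)

    def add(self, job):
        with self._lock:
            self._jobs.append(job)

    def claim(self, hostname):
        """Atomically mark the oldest job for hostname as running.
        Returns the job, or None if that job is already running
        (FIFO per host: we never skip ahead to a later job)."""
        with self._lock:
            for job in self._jobs:
                if job["hostname"] == hostname:
                    if job["running"]:
                        return None  # someone else grabbed it; sleep and retry
                    job["running"] = True
                    return job
            return None  # nothing queued for this host

q = JobQueue()
q.add({"hostname": "aaa", "running": False, "task": 1})
q.add({"hostname": "aaa", "running": False, "task": 2})

first = q.claim("aaa")   # claims task 1
second = q.claim("aaa")  # None: task 1 still running, so task 2 must wait
```

The None return is exactly the "someone else already grabbed it" branch of the pseudocode: the caller sleeps and retries instead of running a second job for the same host.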

Ruby - Elastic Search & RabbitMQ - data import being lost, script crashing silently

Stackers
I have a lot of messages in a RabbitMQ queue (running on localhost in my dev environment). The payload of the messages is a JSON string that I want to load directly into Elasticsearch (also running on localhost for now). I wrote a quick Ruby script to pull the messages from the queue and load them into ES; it is as follows:
#! /usr/bin/ruby
require 'bunny'
require 'json'
require 'elasticsearch'

# Connect to RabbitMQ to collect data
mq_conn = Bunny.new
mq_conn.start
mq_ch = mq_conn.create_channel
mq_q = mq_ch.queue("test.data")

# Connect to ElasticSearch to post the data
es = Elasticsearch::Client.new log: true

# Main loop - collect the message and stuff it into the db.
mq_q.subscribe do |delivery_info, metadata, payload|
  begin
    es.index index: "indexname",
             type: "relationship",
             body: payload
  rescue
    puts "Received #{payload} - #{delivery_info} - #{metadata}"
    puts "Exception raised"
    exit
  end
end
mq_conn.close
There are around 4,000,000 messages in the queue.
When I run the script, I see a bunch of messages, say 30, being loaded into Elastic Search just fine. However, I see around 500 messages leaving the queue.
root#beep:~# rabbitmqctl list_queues
Listing queues ...
test.data 4333080
...done.
root#beep:~# rabbitmqctl list_queues
Listing queues ...
test.data 4332580
...done.
The script then exits silently without reporting an exception. The begin/rescue block never triggers, so I don't know why the script is finishing early or losing so many messages. Any clues on how I should debug this next?
I've added a simple, working example here:
https://github.com/elasticsearch/elasticsearch-ruby/blob/master/examples/rabbitmq/consumer-publisher.rb
It's hard to debug your example without samples of the test data.
The Elasticsearch "river" feature is deprecated and will eventually be removed. You should definitely invest time in writing your own custom feeder if RabbitMQ and Elasticsearch are a central part of your infrastructure.
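If you do write a custom feeder, its core is batching payloads into bulk requests; here is a small Python sketch of just the batching step (the index and type names are taken from the question's script, the helper name is hypothetical):

```python
import json

def to_bulk_actions(payloads, index="indexname", doc_type="relationship"):
    """Build an Elasticsearch bulk-API body: for each payload, an
    action/metadata line followed by the document source line."""
    lines = []
    for payload in payloads:
        doc = json.loads(payload)  # fail fast on malformed payloads
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = to_bulk_actions(['{"a": 1}', '{"b": 2}'])
```

The body can then be POSTed to the _bulk endpoint, and the RabbitMQ messages acknowledged only after the bulk call succeeds, so nothing is lost if the feeder crashes mid-batch.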
Answering my own question: I then learned that this is a crazy and stupid way to load a queue of indexing instructions into Elasticsearch. I created a river and can drain instructions much faster than I could with a ropey script. ;-)
