Event based triggering of a Pig job - hadoop

I have a system that ingests application log data into Hadoop using Flume. I index this data in Elasticsearch by running a Pig script that loads it from Hadoop into ES. I now need to automate this task so that either the script is triggered every time a new line is appended, or, whenever it is triggered, it loads only the newly written lines. Can anyone tell me how to achieve this?

Related

Apache NiFi job is not terminating automatically

I am new to Apache NiFi. I am trying to import data from MongoDB and put it into HDFS. I have created two processors, one for MongoDB and one for HDFS, and configured them correctly. The job runs successfully and stores the data in HDFS, but it should terminate automatically on success. Instead it keeps running and creates too many files in HDFS. I want to know how to create an on-demand job in NiFi and how to determine that a job was successful.
GetMongo will continue to pull data from MongoDB based on the configured properties such as Query, Projection, and Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is change the Run Schedule and/or the Scheduling Strategy. You can find them by right-clicking the processor and clicking Configure. By default, Run Schedule is 0 sec, which means the processor runs continuously. Changing it to, say, 60 min will make the processor run once every hour. This will still read the same documents from MongoDB again every hour, but since you mentioned that you want to run it only once, I'm suggesting this approach.

Reindexing/Updating Elasticsearch using Logstash on Jenkins

I would like to automate two things using a Jenkins job: first, updating Elasticsearch with the latest data on demand, and second, recreating the index and feeding data into it.
I am using the jdbc input plugin to fetch data from two different databases (PostgreSQL and Microsoft SQL Server). When the Jenkins job is triggered on demand, Logstash should run the config file and perform the tasks described above. We also have a cron job running Logstash on the same server (AWS) where the on-demand Logstash job would run. The issue is that the job triggered via Jenkins starts another Logstash process alongside the one the cron job is already running on the AWS server. This ends up starting multiple Logstash processes that are not terminated once the on-demand work is done.
Is there a way to handle this scenario? Is there a way to terminate the Logstash process started by the Jenkins job, or is there some sort of queue that would help us handle our on-demand Logstash requests?
PS: I am new to the ELK stack.

How to get resources used for FINISHED hadoop jobs from YARN logs using job names?

I have a Unix shell script that runs multiple Hive scripts. I have given a job name to every Hive query inside the Hive scripts.
At the end of the shell script, I want to retrieve the resources used by the Hive queries (memory, containers), looked up by job name, from the YARN logs/applications whose app status is 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN History Server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster level information on jobs executed and then how to get information on individual jobs.
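As a rough illustration of the approach above, here is a minimal Python sketch that lists jobs from the History Server and keeps only those whose names you care about. The host name is hypothetical, and the helper names `fetch_jobs` and `jobs_by_name` are made up for this example; the endpoint path and the `jobs`/`job` response shape follow the linked documentation, but check them against your Hadoop version before relying on this.

```python
import json
from urllib.request import urlopen

# Hypothetical History Server address; adjust host and port for your cluster.
HISTORY_SERVER = "http://historyserver.example.com:19888"


def fetch_jobs(base_url=HISTORY_SERVER):
    """Fetch the job list from the MapReduce History Server REST API."""
    with urlopen(f"{base_url}/ws/v1/history/mapreduce/jobs") as resp:
        return json.load(resp)


def jobs_by_name(payload, wanted_names):
    """Return summary rows for jobs whose name is in wanted_names.

    Assumes the documented response shape: {"jobs": {"job": [...]}}.
    """
    jobs = (payload.get("jobs") or {}).get("job") or []
    wanted = set(wanted_names)
    fields = ("id", "name", "state", "mapsTotal", "reducesTotal")
    return [
        {key: job.get(key) for key in fields}
        for job in jobs
        if job.get("name") in wanted
    ]
```

You would call `jobs_by_name(fetch_jobs(), ["my_hive_query_name"])` at the end of the shell script (for example via `python script.py`). The job list itself does not carry memory figures; the same documentation describes a per-job counters endpoint under `/ws/v1/history/mapreduce/jobs/{jobid}/counters` that you can query with the returned `id` values for more detail.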

Spark saveToEs asynchronously

We have a Spark Streaming job that writes its output to Elasticsearch. When Elasticsearch is slow for any reason, the Spark job waits indefinitely, which causes data to accumulate and has a snowballing effect on the streaming job itself. The only way to make the streaming job stable again is to restart it.
Is there a way to make Spark writes to Elasticsearch asynchronous? I looked at https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html and do not see any option for async writes.

How to access Hive log information

I am trying to analyze the performance of Hive queries. Although I am able to run Hive queries from Java, I still need to access the log information generated after each query. Instead of using a hack that reads the latest log on disk and extracts the numbers with a regex, I am looking for a more graceful method, if one is already available.
Any pointers will be helpful. Thanks in advance.
-lg
Query execution details such as Status, Finished at, and Finished in are displayed in the JobTracker, and you can access the JobTracker programmatically. Related info is in this question:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
Once Hive starts running a query, a corresponding MapReduce job starts. The logs of this Hadoop job can be found on the tasktracker on which each task runs.
Use the JobClient API to retrieve these logs programmatically.
