I am building a new feature in my product where clients are able to set a deletion policy, say 30 days or 60 days.
After the number of days set by the client, I will delete some information from my database/S3 bucket.
My clients upload documents; if they upload in the test environment I keep the files on the file system, and if they are in the production environment I upload the file to an S3 bucket.
So I tried to build this feature in two ways:
1 - A Spring Boot cron job (a rough sketch is shown below)
2 - Apache NiFi (I went through many documents; there are many built-in processors in Apache NiFi that process/delete files)
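For context, option 1 roughly looks like the following. This is only a sketch: DocumentRepository, Document and findExpired are hypothetical names standing in for my actual persistence layer, and scheduling still has to be enabled with @EnableScheduling on the application class.

    import java.io.File;
    import java.time.Instant;
    import com.amazonaws.services.s3.AmazonS3;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class RetentionCleanupJob {

        private final DocumentRepository repository; // hypothetical repository of uploaded documents
        private final AmazonS3 s3;                   // AWS SDK S3 client (used for production uploads)

        public RetentionCleanupJob(DocumentRepository repository, AmazonS3 s3) {
            this.repository = repository;
            this.s3 = s3;
        }

        // Runs every day at 02:00; each document record carries its client's retention period (30/60 days).
        @Scheduled(cron = "0 0 2 * * *")
        public void purgeExpiredDocuments() {
            for (Document doc : repository.findExpired(Instant.now())) {
                if (doc.isStoredOnS3()) {
                    s3.deleteObject(doc.getBucket(), doc.getKey()); // production: delete from S3
                } else {
                    new File(doc.getLocalPath()).delete();          // test: delete from the file system
                }
                repository.delete(doc);                             // remove the database record as well
            }
        }
    }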
So which would be the more efficient way to handle this situation: a Java cron job or Apache NiFi?
In my company, we have a continuous learning process.
Every 5-10 minutes we create a new model in HDFS.
A model is a folder of several files:
model, ~1 GB (binary file)
model metadata, ~1 KB (text file)
model features, ~1 KB (CSV file)
...
On the other hand, we have hundreds of model-serving instances that need to download the model to the local file system every 5-10 minutes and serve from it.
Currently we are using WebHDFS from our service (the Java FileSystem client), but it probably creates load on our Hadoop cluster, since it redirects requests to the concrete DataNodes.
We are considering using the HttpFS service. Does it have a caching capability, so that the first request would pull a folder into the service's memory and subsequent requests would use the already downloaded results?
What other technology/solution could be used for such a use case?
We have found a nice solution.
It can be used in front of Hadoop to reduce the read load, or in front of Google Cloud Storage/S3 buckets to reduce cost.
We simply set up a couple of Nginx servers and configured them as a proxy with a 2-minute file cache.
That way, only the Nginx machines download the data from the Hadoop cluster, and all the serving machines (which might be hundreds) pull the data from the Nginx servers, where it is already cached.
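As a rough illustration of the client side (the host name and paths below are made up), each serving instance then issues a plain HTTP GET against the Nginx proxy instead of talking to the Hadoop cluster directly:

    import java.io.InputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class ModelFetcher {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // The proxy caches this response for ~2 minutes, so hundreds of instances
            // hitting the same URL cause only one read from the Hadoop cluster.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://nginx-cache.internal/models/latest/model.bin")) // hypothetical URL
                    .build();
            HttpResponse<InputStream> response =
                    client.send(request, HttpResponse.BodyHandlers.ofInputStream());
            try (InputStream in = response.body()) {
                Files.copy(in, Path.of("/local/models/model.bin"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }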
I have a couple of EC2 servers set up, with the same EFS mounted on each of these instances.
I have also set up Apache NiFi independently on each of the two machines. Now, when I try to build a data flow to copy files from the EFS-mounted folder, I get duplicate files on both servers.
Is there some way in Apache NiFi to weed out the duplicate items, since both instances are firing at the same time? Cron scheduling is not enough, as at some point the servers will still collide.
For detecting duplicate files you can use the DetectDuplicate processor. Note that it tracks what it has already seen through a Distributed Map Cache service, so both NiFi instances need to point at the same cache for duplicates to be detected across the two servers.
I would like to automate the process of updating Elasticsearch with the latest data on demand and, secondly, recreating the index and feeding it data using a Jenkins job.
I am using the JDBC input plugin to fetch data from two different databases (PostgreSQL and Microsoft SQL Server). When the Jenkins job is triggered on demand, Logstash should run the config file and perform the tasks described above. We also have a cron job running on the same (AWS) server where the on-demand Logstash job would run. The issue is that the job triggered via Jenkins starts another Logstash process alongside the Logstash already running under cron. This ends up spawning multiple Logstash processes that are never terminated once the on-demand work is done.
Is there a way to achieve this scenario? Is there a way to terminate the Logstash started via the Jenkins job, or is there some sort of queue that would help us handle our on-demand Logstash requests?
PS: I am new to the ELK stack.
I have a number of applications that are running in different data centers, developed and maintained by different vendors. Each application has a web service that exposes relevant log data (audit data, security data, data related to cost calculations, performance data, ...) consolidated for the application.
My task is to get data from each system into a setup of Elasticsearch, Kibana and Logstash so I can create business reports or just view data the way I want to.
Assuming I have a JBoss application server for integrating with these "expose log" services, what is the best way to feed Elasticsearch? Some Logstash plugin that calls each service? Some Logstash plugin used by JBoss? Or some other way?
The best way is to set up the Logstash shipper on the server where the logs are created.
This will then ship them to a Redis server.
Another Logstash instance will then pull the data from Redis, index it, and ship it to Elasticsearch.
Kibana will then provide an interface to Elasticsearch, which is where the goodness happens.
I wrote a post on how to install Logstash a little while ago. Versions may have been updated since, but it's still valid:
http://www.nightbluefruit.com/blog/2013/09/how-to-install-and-setup-logstash/
Does your JBoss application server write its logs to a file?
In my experience, my JBoss applications (on multiple servers) write their logs to files. Then I use Logstash to read the log files and ship all the logs to a central server. You can refer to here.
So what you can do is set up a Logstash shipper in each data center.
If you do not have permission to do this, you may want to write a program that gets the logs from the different web services and saves them to a file, and then set up Logstash to read that log file (a rough sketch is shown below). So far, Logstash does not have any plugin that can call web services.
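A minimal sketch of such a program (the endpoint URL, poll interval and output path are made up for illustration): it simply polls the web service and appends each response as one line to a file that the Logstash file input can read.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Polls a (hypothetical) "expose log" web service and appends each response
    // as a line to a file that Logstash reads with its file input.
    public class LogPoller {
        public static void main(String[] args) throws IOException, InterruptedException {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://app1.example.com/api/logs")) // hypothetical endpoint
                    .build();
            Path out = Path.of("/var/log/app1/exposed.log");        // file that Logstash tails

            while (true) {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                Files.writeString(out, response.body() + System.lineSeparator(),
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                Thread.sleep(60_000);                               // poll once a minute
            }
        }
    }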
I have a web application and one remote cluster (there can be one or more). These clusters can be on different machines.
I want to perform the following operations from my web application:
1. HDFS actions:
Create a new directory
Remove files from HDFS (Hadoop Distributed File System)
List files present on HDFS
Load a file onto HDFS
Unload a file
2. Job-related actions:
Submit MapReduce jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
I need a tool that can help me do these tasks from the web application, via an API, REST calls, etc. I'm assuming that the tool will be running on the same machine as the web application and can point to a particular remote cluster.
Though as a last option (since there can be multiple, disparate clusters, it would be difficult to ensure that each of them has the plug-in, library, etc. installed), I'm wondering whether there is some Hadoop library or plug-in that sits on the cluster, allows access from remote machines, and performs the mentioned tasks.
The best framework that provides everything you have listed here is Spring Data - Apache Hadoop. It has Java Scripting API based implementations to do the following (a sketch is shown after the list):
1. HDFS actions:
Create a new directory
Remove files from HDFS (Hadoop Distributed File System)
List files present on HDFS
Load a file onto HDFS
Unload a file
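For illustration, here is a sketch of those operations using the plain Hadoop FileSystem client that Spring Data - Apache Hadoop builds on (the NameNode address and paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsActions {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical cluster address
            try (FileSystem fs = FileSystem.get(conf)) {
                fs.mkdirs(new Path("/data/incoming"));                        // create a new directory
                fs.delete(new Path("/data/old"), true);                       // remove files (recursively)
                for (FileStatus status : fs.listStatus(new Path("/data"))) {  // list files
                    System.out.println(status.getPath());
                }
                fs.copyFromLocalFile(new Path("/tmp/report.csv"),             // load a file onto HDFS
                        new Path("/data/incoming/report.csv"));
                fs.copyToLocalFile(new Path("/data/incoming/report.csv"),     // unload a file back to local disk
                        new Path("/tmp/report-copy.csv"));
            }
        }
    }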
It also has Spring scheduling based implementations to do the following (see the sketch after the list):
2. Job-related actions:
Submit MapReduce jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
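And a sketch of the job side using the plain MapReduce Job API (WordCountMapper and WordCountReducer are hypothetical job classes, and the paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobActions {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical cluster address

            Job job = Job.getInstance(conf, "word-count");
            job.setJarByClass(JobActions.class);
            job.setMapperClass(WordCountMapper.class);   // hypothetical mapper
            job.setReducerClass(WordCountReducer.class); // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/incoming"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-out"));

            long started = System.currentTimeMillis();
            job.submit();                                // submit the MapReduce job

            while (!job.isComplete()) {                  // poll how much of the job has completed
                System.out.printf("map %.0f%% reduce %.0f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5_000);
            }
            System.out.println("Finished in " + (System.currentTimeMillis() - started)
                    + " ms, successful: " + job.isSuccessful()); // time taken by the job
        }
    }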