Counting number of files in a directory with Nifi - apache-nifi

Using Apache NiFi, I am passing files to a directory. I want to count the number of files in this directory, wait until all of the files I need are present, and then run the ExecuteStreamCommand processor to process the data in that directory. (Right now, ExecuteStreamCommand doesn't wait long enough for all of the files to arrive before the process begins, so I want to add this wait.)
To start, I just want to know how to count the number of files in a directory. I am using ListFile to retrieve the names of the files, but I am not sure how to count them in NiFi.
Thanks

If you are using ExecuteStreamCommand to run a shell command on the files, you could easily add something like ls -1 | wc -l to the same or an additional ExecuteStreamCommand processor to count the number of files in the directory.
We usually caution against this approach, however, because there are edge cases where a file can be present in the directory but not yet "complete" because some external process is still writing it. The usually recommended model is to write files in with a temporary filename like .file1, .file2, and rename each to file1, file2, etc. upon successful completion. The ListFile processor supports numerous settings to avoid detecting these files until they are ready for processing.
We also usually recommend having the external process set some boolean flag rather than waiting for an explicit count, unless that count will never change.
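For illustration, here is a minimal shell sketch of the temporary-name-and-rename pattern described above. The inbox path, file names, and source path are made up; the only assumption about ListFile is that it can be configured to ignore dot-files, as noted above.

#!/usr/bin/env bash
# Hypothetical writer-side sketch: stage each file under a dotted name,
# then rename it once the write has completed.
INBOX=/data/inbox

stage_file() {
  local name="$1"
  cp "/some/source/$name" "$INBOX/.$name"   # in-progress copy, hidden from ListFile
  mv "$INBOX/.$name" "$INBOX/$name"         # rename is atomic on the same filesystem
}

# Count only the completed (renamed) files; ls -1 skips dot-files by default.
ls -1 "$INBOX" | wc -l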

Related

How to calculate processing time of files in jmeter for ssh protocol (source to destination file processing)

Let me put my question this way: I have 10 files that need to be processed from a source directory to a destination directory. I have put the 10 files into the source directory using a JMeter put command over the SSH protocol, and now our system processes these files into the destination directory. Now I need to measure our system's performance by timing how long each file takes to move from source to destination. Please suggest how I can achieve this using JMeter.
Use a While Controller to "wait" until all 10 files appear in the "destination" directory.
Use a Transaction Controller to measure the overall time from uploading the files until the last file appears in the destination directory.

I want to get the latest file names under each directory of GCS

I want to know the path to the latest file under each directory using gsutil ls.
Running the command in a loop like this is very slow:
for dir in "${dir_list[@]}"; do
  file+=$(gsutil ls -R "${dir}" | tail -n 1)
done
Is there another way? I want the final output to be:
gs://bucket/dir_a/latest.txt
gs://bucket/dir_b/latest.txt
gs://bucket/dir_c/latest.txt
gs://bucket/dir_d/latest.txt
There isn't another strategy, for a good reason: directories don't exist in GCS. So you need to scan all the files, get their metadata, keep the one that is the latest, and do that for each "similar prefix".
A prefix is what you would call a directory, "/path/to/prefix/". That's why in GCS you can only search by prefix, not by file pattern.
So you could build a custom app which, for each different prefix (directory), creates a concurrent process (fork) dedicated to that prefix. That way you get parallelization. It's not so simple to write, but you can do it!
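As a rough illustration of that idea, here is a minimal bash sketch that runs one listing per prefix in parallel. The prefix list is assumed to be known in advance, and the per-prefix command is the same gsutil ls -R | tail -n 1 from the question.

#!/usr/bin/env bash
# Hypothetical sketch: one background process (fork) per prefix.
dir_list=(
  gs://bucket/dir_a/
  gs://bucket/dir_b/
  gs://bucket/dir_c/
  gs://bucket/dir_d/
)

for dir in "${dir_list[@]}"; do
  # Each prefix gets its own concurrent listing.
  gsutil ls -R "${dir}" | tail -n 1 &
done

wait   # block until every listing has finished

Note that the output order is not guaranteed, since each listing finishes on its own schedule.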

Bash union of two directories in one statement

I'm trying to run a command that takes one location input (intended for a single directory of files), but I need to run it on files in several locations. While I'd normally run it on */*.type, I'm looking for some way to run the command over (*/dirA/*.type AND dirB/*.type).
I basically need all files of *.type within a directory structure, but they're not all at the same directory level (or I'd just do */*/*.type or something to grab them all). Unfortunately they're in a particular layout for a reason, which I can't just reorganize to make this command run.
Is there any bash shortcut/command/whatever-it's-called that I could use to get both sets of files at once?
You can use brace expansion:
dir{A,B}/*.type
For example, running this with the ls command:
root@do:/tmp# ls dir{A,B}/*.type
dirA/test.type dirB/test.type
If the command works when you pass one wildcard in, that means it is expecting a list of file names. In that case you can pass it two wildcards just as easily.
command */dirA/*.type dirB/*.type
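For completeness, a small sketch showing that brace expansion also handles the mixed-depth case from the question; command is just a placeholder name here.

# Brace expansion happens before globbing, so this single pattern:
command {*/dirA,dirB}/*.type
# first expands to
#   command */dirA/*.type dirB/*.type
# and each glob is then matched against the filesystem as usual.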

How to tar a folder while files inside the folder might being written by some other process

I am trying to create a script for a cron job. I have a folder of around 8 GB containing thousands of files. I am trying to create a bash script which first tars the folder and then transfers the tarred file to an FTP server.
But I am not sure what happens if, while tar is tarring the folder, some other process is accessing files inside it or writing to them.
It is fine for me if the tarred file does not contain the most recent changes made while tar was running.
Please suggest the proper way. Thanks.
tar will happily tar "whatever it can". But you will probably have some surprises when untarring, because tar also stores the size of each file before tarring it. So expect some surprises.
A very unpleasant surprise would be: if a file is truncated while being read, tar will "fill" it with NUL characters to match its recorded size. This can give very unpleasant side effects. In some cases tar, when untarring, will say nothing and silently add as many NUL characters as it needs to match the size (in fact, on unix it doesn't even need to do that: the OS does it, see "sparse files"). In other cases, if truncation occurred while the file was being tarred, tar will complain about an Unexpected End of File when untarring (as it expected XXX bytes but read fewer than that), but will still report that the file should be XXX bytes (and unix OSes will then create it as a sparse file, with NUL chars magically appended at the end to match the expected size when you read it).
(To see the NUL chars, an easy way is to less thefile, or cat -v thefile | more on a very old unix; look for any ^#.)
On the contrary, if files are only appended to (logs, etc.), then the side effect is less problematic: you will only miss some recent bits of them (which you say you're OK with), and you won't have that unpleasant "fill with NUL characters" side effect. tar may complain when untarring the file, but it will untar it.
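If it helps, here is a hedged sketch of how a cron script might distinguish the benign "file changed as we read it" case from a real failure. It assumes GNU tar, whose documented exit status is 1 when some files differed during reading and 2 on a fatal error; the paths are made up.

#!/usr/bin/env bash
# Hypothetical sketch, assuming GNU tar exit codes (0 ok, 1 files differed, 2 fatal).
tar -czf /tmp/backup.tar.gz -C /data myfolder
status=$?

if [ "$status" -eq 2 ]; then
  echo "tar hit a fatal error, aborting the upload" >&2
  exit 1
fi
# status 0 or 1: the archive was written; files that changed mid-read may be truncated.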
I think tar fails (and does not create the archive) when an archived file is modified during archiving. As Etan said, the solution depends on what you finally want in the tarball.
To avoid a tar failure, you can simply COPY the folder elsewhere before calling tar. But in that case you cannot be confident in the consistency of the backed-up directory: the copy is NOT an atomic operation, so some files will be up to date while others will be outdated. Whether that is a severe issue depends on your situation.
If you can, I suggest you control how these files are created. For example: if only recent files are appended to and files older than 1 day are never changed, then you can easily back up only the old files and the backup will be consistent.
More generally, you have to accept losing the latest data AND being inconsistent (each file is backed up at a different time), or you have to act at a different level. I suggest:
Configure the software that produces the data so you can choose a consistency point
Or use OS/virtualization features. For example, it's possible to take a consistent snapshot of the storage on some virtual storage systems...
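As a concrete illustration of the copy-then-tar workaround mentioned above, here is a minimal cron-style sketch. All paths and names are made up, and it inherits the caveat that the snapshot copy itself is not atomic, so it can still mix old and new file versions.

#!/usr/bin/env bash
# Hypothetical sketch of "copy first, then tar" for a cron job.
SRC=/data/myfolder
SNAP=/tmp/myfolder.snapshot
ARCHIVE=/tmp/myfolder.$(date +%Y%m%d%H%M).tar.gz

# Copy the folder so tar never reads files that are still being written in place.
rm -rf "$SNAP"
cp -a "$SRC" "$SNAP"

tar -czf "$ARCHIVE" -C /tmp "$(basename "$SNAP")"
# ...then transfer "$ARCHIVE" to the FTP server...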

Check whether copying into a directory is still in progress in shell

I have created a script which picks up files from a directory, inbox. I have to include handling in the script for whether files are still arriving from another process; if files are coming in through another process, then my script should wait until they have finished being copied.
For this I have created a flag:
CHECK_COPY_PROCESS=$(ps -ef | grep -E 'cp|mv|scp' | grep inbox)
If the flag CHECK_COPY_PROCESS contains some value, the script goes into a waiting state. The problem is that if some files are being copied from within a subdirectory of that inbox directory, the copy process shows up as something like cp file_name .., so the logic above does not work.
You could use Basile's tip of using lsof, in conjunction with awk (grepping only on the first column, which is the command name).
Example:
lsof +D /path/to/inbox | awk '$1 ~ /mv|cp|scp/'
This is not tested as I currently don't have big files that take a while to copy on my machine.
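For illustration, a small sketch of a wait loop built around that lsof check; the inbox path and the five-second polling interval are made-up values.

#!/usr/bin/env bash
# Hypothetical wait loop: block while any cp/mv/scp still has a file open under the inbox.
INBOX=/path/to/inbox

while lsof +D "$INBOX" 2>/dev/null | awk '$1 ~ /^(cp|mv|scp)$/ {found=1} END {exit !found}'; do
  sleep 5
done

# No copier is holding files open any more; safe to start processing.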
