Split rsyslog files

I am trying to find a way to make rsyslog write logs received from the same host into multiple files simultaneously - for example, write to two files and switch between them on every log line.
I have one host that sends an insane amount of data (600 GB a day), and I want to write these logs to multiple files instead of one.
This is because Splunk, which reads the file for indexing purposes, can't utilize multiple pipelines (i.e. multiple CPU threads) on the same file, and this causes a bottleneck.
I was thinking about making rsyslog write to a new file every second, but this does not sound optimal.
Any suggestion is much appreciated :)
I have not been able to find any documentation that describes this ability in rsyslog.

The solution is to use part of the subsecond timestamp (the last two digits of the nanoseconds field) to split the stream randomly across 100 files at the same time.
template(name="template_nano_file_format" type="list") {
    property(name="$.path")
    property(name="fromhost-ip")
    constant(value="/")
    property(name="$year")
    property(name="$month")
    property(name="$day")
    constant(value="-")
    property(name="$hour")
    constant(value="-")
    property(name="timegenerated" dateFormat="subseconds" regex.expression="..$" regex.type="ERE")
    constant(value=".log")
}
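The template only builds the file name; to actually write the files it has to be referenced from an omfile action via dynaFile. A minimal sketch of that wiring, assuming $.path is set inside a ruleset (the ruleset name and base path here are placeholders, not from the original config):

ruleset(name="remote_in") {
    set $.path = "/var/log/remote";
    action(type="omfile" dynaFile="template_nano_file_format")
}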

Related

Limit the size of the Kismet PCAP output file?

I am running some Kismet captures, and I need to continuously parse the output PCAP files. To do this, I need Kismet to save the current file and start a new one periodically (I use inotify-tools to detect newly created files).
The problem is that I cannot find a way to make Kismet do this. In the man pages I found that the -m option allows limiting the file size by packet size, so I ran it like this:
sudo kismet -c wlan0 -m 10
But that did not create multiple files; it just carried on putting all traffic into one file.
Are there any other ways to make Kismet split its output into different files? I don't really care what criterion is used (time, packet count, file size... I'll take anything).
Thanks!
I think you can modify this in the kismet.conf file.
There is an option that says
writeinterval=300
It means the pcap file will be saved every 300 seconds, i.e. a new file is started every 300 seconds. If you want a shorter interval, you can change it.
Hope it helps

FTP FileWatcher

So, I am in this little predicament where I am stuck watching a few FTP folders to see if they have new files added to them. If they do, it needs to raise an event with the file name, thereby telling something else to download that file.
This is a pretty simple object to make; I was just curious if anyone knew how expensive this operation would be.
I plan on using the NLST command because I don't need file size information, and there will be no sub-directories in the folder. Each file in the folder will have exactly 25 characters in its name.
There could be anywhere from 10 to 'maybe' a couple thousand (max around 2000) files per folder (usually on the lower end, 100-300, but currently growing).
The files are anywhere from 250kb to a very VERY unlikely 10mb (usually within the 250kb to 4mb range).
There possibly could be up to a few hundred folders (in which case I could change the watch frequency depending on number of folders), but currently there are only a few (6-10ish).
There also would be multiple logins for the ftp server, different logins would have access to different folders.
I am not asking for an implementation, just whether anyone has some first- or second-hand knowledge about FTP and how this could affect my network.
I am not opposed to putting in file retention times or changing the frequency at which I check for new files.
Do you have any control over the remote servers? FTP isn't really optimized for this, and you could probably do a lot better with some sort of dedicated mini-server. You could use file system monitoring on the remote side and just send out the filenames when they arrive rather than continuously polling. You'd only need to have one connection open too, rather than the two that FTP requires.
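Not an implementation, but to get a feel for what the polling involves, here is a minimal sketch using Perl's Net::FTP, whose ls() method issues NLST under the hood; the host, credentials, and folder names are placeholders:

use strict;
use warnings;
use Net::FTP;

my %seen;   # names already reported

my $ftp = Net::FTP->new('ftp.example.com') or die "connect: $@";
$ftp->login('user', 'password') or die 'login: ' . $ftp->message;

for my $folder ('/incoming/a', '/incoming/b') {
    # ls() sends NLST: bare file names only, no size information
    for my $name ($ftp->ls($folder)) {
        next if $seen{"$folder/$name"}++;
        print "new file: $folder/$name\n";   # raise your event here
    }
}
$ftp->quit;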

Verify whether ftp is complete or not?

I have an application that polls a folder continuously. Once any file is FTP'd to the folder, the application has to move it to some other folder for processing.
Here, we don't have any option to verify whether the FTP transfer is complete or not.
The command lsof was suggested in technical forums. It has a file descriptor (FD) column that gives the file's status.
Since this is a FreeBSD command and not present in old versions of Linux, I want to clarify its usage.
Can you guys tell us your experience with file verification, and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old without a corresponding sentinel file should be removed - it was a failed transmission.
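A minimal sketch of the listener side in Perl; the directory and handler are placeholders, and the one-day cleanup is left out for brevity:

use strict;
use warnings;

my $dir = '/incoming';
opendir my $dh, $dir or die "opendir: $!";
for my $sentinel (grep { /\.sentinel$/ } readdir $dh) {
    (my $data = $sentinel) =~ s/\.sentinel$//;
    next unless -e "$dir/$data";
    process_file("$dir/$data");              # your real handler goes here
    unlink "$dir/$data", "$dir/$sentinel";   # delete both when done
}
closedir $dh;

sub process_file { print "processing $_[0]\n" }   # placeholder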
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
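The check itself is a one-liner in Perl; the path is a placeholder and N is the tunable quiet period (five minutes here):

my $path = '/incoming/datafile';                          # file being watched
my $quiet_secs = 5 * 60;                                  # N = 5 minutes
my $ready = ( time() - (stat $path)[9] ) > $quiet_secs;   # [9] = mtime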
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP transfer crashed in the middle and you may end up processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the missing end element. Depending on your file's size and format, this can be trivial or very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use it, note the -F option, which formats the output for processing by other programs rather than for human readability.
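As a sketch of how that check might be scripted from Perl (assuming lsof is on the PATH; lsof exits non-zero when it finds no process holding the file open):

# true if lsof finds no process with $path open
sub file_is_closed {
    my ($path) = @_;
    return system('lsof', '-t', $path) != 0;   # -t: terse output, PIDs only
}

print "ready\n" if file_is_closed('/incoming/datafile');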
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method: you can also check whether the FTP session is still active. This works if every peer has its own FTP user account. As long as the user has not logged off from FTP, assume the files are not complete.

Is file still being uploaded?

I'm writing an app that takes files uploaded via SFTP into a specific directory and moves them to S3.
I have a problem where my cron job starts uploading a file to S3 when it has not been completely uploaded via SFTP yet. I have thought of every way to try to wait until the file is complete, but I have no way of knowing (that I know of).
I'm hoping that the collective genius of SO would be able to shed some light on this!
There are a number of ways to handle this:
Change the upload process to upload the data file itself (e.g., data.txt) followed by a sentinel file (e.g., data.txt.sentinel). Then wait for the sentinel before processing the data file and deleting them both. Data files older than N days with no corresponding sentinel can just be deleted. This is only good if you can change the uploader.
If you can evaluate the content of the file to check completeness, this is another way. For example, if you're only uploading HTML files, you could check that it ends with </html>. Not always possible unless you can control what's being uploaded.
The not-been-modified-for-a-while method. Basically, if the file hasn't been modified for N minutes, you can assume the upload has been finished. This may still result in the processing of incomplete files where the transfer has failed partway through.
All these methods have their advantages and drawbacks and you will have to decide which is the best for you. We try to opt for number 1 where we can influence the uploading side.
And remember that N is configurable in the above scenarios. You need to balance the risk that too small an N will have you processing an incomplete file in option 3 against the fact that too large a value of N will delay the processing of said file.
Is there any way you can add a step after the SFTP transfer? The idea is to SFTP the files to a temporary directory, then once that's done have the same client execute (via SSH) a script to mv the files over to the directory the cron job is looking at. mv is atomic on many local Unix filesystems, so the cron job will only ever see either the old file or the new one.
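The post-transfer step could be as simple as a single command run by the client once the SFTP transfer finishes; a sketch, with host and paths as placeholders:

ssh user@server 'mv /uploads/incoming/data.txt /uploads/ready/data.txt'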
Of course, if you can execute a script after the SFTP transfer you can just have the script do the transfer to S3, without the cron job ;)
We are using pure-ftpd for a very similar process. Rather than having a cron job do the uploads, we use the upload script option of pure-ftpd, which triggers a script every time an upload is complete. You might consider using a similar mechanism if it is available with your FTP server.

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get a file listing; however, the directory contains more than 250,000 files, and this results in a long time (longer than 4 minutes) to perform the readdir, using over 80 MB of RAM. As this is intended to be a recurring job running every 5 minutes, this lag time is not acceptable.
More info:
Another job fills the directory being scanned (once per day).
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file count limit is intended to allow downstream processing to keep up as Perl pushes data into a database, which triggers a complex workflow.
Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and using 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
    # do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while ( defined( my $file = readdir $dir ) ) {
    # do stuff.
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
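For example, a sketch with the counter in place, capped at the 1000 files per run mentioned in the question (the directory path and handler are placeholders):

use strict;
use warnings;

opendir my $dir, '/some/dir' or die "opendir: $!";
my $limit = 1000;
my $count = 0;
while (defined(my $file = readdir $dir)) {
    next if $file eq '.' or $file eq '..';
    process($file);               # your per-file work
    last if ++$count >= $limit;   # stop after 1000 files this run
}
closedir $dir;

sub process { print "processing $_[0]\n" }   # placeholder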
The solution may lie at the other end: in the script that fills the directory...
Why not create a directory tree to store all those files, so that you have lots of directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
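A minimal sketch of what the writer side might do (the root directory and the two-level scheme are illustrative):

use File::Path qw(make_path);

# "mynicefile.txt" -> /data/m/my/mynicefile.txt
sub sharded_path {
    my ($root, $name) = @_;
    my $dir = join '/', $root, substr($name, 0, 1), substr($name, 0, 2);
    make_path($dir);               # create the shard directories if missing
    return "$dir/$name";
}

print sharded_path('/data', 'mynicefile.txt'), "\n";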
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just listing, as you have seen).
A solution to that design problem is to have sub-directories for each possible first letter of the file names, and have all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
You say that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k files in one directory?
Basically, to speed it up you don't need anything Perl-specific, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation where something like this would be required), you're much better off finding a better filesystem to handle it than finding some "magical" Perl module that would scan it faster.
Probably not. I would guess most of the time is spent reading the directory entries.
However, you could preprocess the entire directory listing, creating one file per 1000 entries. Then your process could read one of those listing files each time and not incur the expense of reading the entire directory.
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, you will need an event loop to handle the timers; I recommend EV.)
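A sketch of that structure, with the watched directory and handler as placeholders and the 5-minute/1000-file numbers taken from the question:

use strict;
use warnings;
use EV;
use Linux::Inotify2;

my @queue;

my $inotify = Linux::Inotify2->new or die "inotify: $!";
$inotify->watch('/watched/dir', IN_CLOSE_WRITE, sub {
    push @queue, $_[0]->fullname;          # file finished writing
});

# feed inotify events into the EV loop
my $io = EV::io $inotify->fileno, EV::READ, sub { $inotify->poll };

# every 300 seconds, drain up to 1000 queued files
my $timer = EV::timer 300, 300, sub {
    process_file($_) for splice @queue, 0, 1000;
};

sub process_file { print "processing $_[0]\n" }    # placeholder

EV::run;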
