Logstash close file descriptors? - elasticsearch

BACKGROUND:
We have rsyslog creating log files directories like: /var/log/rsyslog/SERVER-NAME/LOG-DATE/LOG-FILE-NAME
So multiple servers are spilling out their logs of different dates to a central location.
Now to read these logs and store them in elasticsearch for analysing I have my logstash config file something like this:
file{
path => /var/log/rsyslog/**/*.log
}
ISSUE :
Now as number of log files in the directory increase, logstash opens file descriptors (FD) for new files and will not release FDs for already read log files.
Since log files are generated per date, once it is read, it is of no use after that since it will not be updated after that date.
I have increased the file openings limit to 65K in /etc/security/limits.conf
Can we make logstash close the handle after some time so that number of file handles opened do not increase too much ??

I think you may have hit this bug: http://github.com/elastic/logstash/issues/1604. Do you have the same symptoms? Exceptions in logs after some time? If you run sudo lsof | grep java | wc -l do you see the descriptors steadily increasing over time? (some of them might close, but some will stay open and their number will increase)

I've been tracking this issue for some time, and I don't know that it's properly solved.
We were in a similar boat, perhaps bigger: Logstash couldn't open handles for hundreds of thousands of log files on a box, even though very few of them written to actively. LOGSTASH-271 captured this issue, and there were some attempts to patch Logstash, including PR #1260.
It seems a fix may have made it's way into Logstash 1.5 with PR #1545, but I've never tested this personally. We ended up forking the underlying library Logstash uses to implement the file input, called FileWatch, into FFileWatch, which adds an "eviction mechanism".
The basic idea behind this approach is to only keep files open while they're being written. Normally, Logstash will open a handle on the file and keep it open forever, but FFileWatch adds an option to close the handle if the file has not changed recently (eviction_interval). I then created a custom build of Logstash using the forked gem.
Obviously this is less than ideal, but it worked for us. Eventually we dropped Logstash entirely for picking up log files, although we still use it further down the log processing pipeline. We implemented our own lightweight log shipper (Franz), which does not suffer from this issue.

Related

iMessage app storage location - why is chat.db-wal updated instantly but chat.db takes awhile?

So playing around with iMessages and thinking of ways to back them up and various things.
I found their location at ~/Library/Messages.
There are three files
1. chat.db
2. chat.db-wal
3. chat.db-shm
If I run a node script that watches for file changes while sending a iMessage to someone I see chat.db-wal is changed instantly but chat.db takes awhile to update.
I would like to get the messages as soon as possible, but I am not sure I can read the .db-wal file. Anyone know if I can read that file? Or why the .db file seems to take longer to update?
Thanks.
Everything is fine. Your data is there. This is just how SQLite works.
In order to support ACID transactions, where your data is guaranteed to be stored properly in the case of crashes or power-offs, SQLite first writes your data into a "write-ahead log" (the *-wal file). When the database is properly closed, or the write-ahead log gets too full, SQLite will update the database file with the contents of the log.
SQLite, when reading, will consult the write-ahead log first, even if multiple connections are using the same database. Data in the log is still "in the database".
SQLite should apply the log to the database as part of closing the database. If it does not, you can run PRAGMA wal_checkpoint; to manually checkpoint the log file.
Corollary to this: do not delete the -wal file, especially if you have not cleanly closed the database last time you used it.
More information about write-ahead logging in SQLite can be found in the SQLite documentation.

Can I delete .dmp and .phd files of Liberty Profile server?

In the folder <WAS Liberty Profile root>\<profile>\usr\servers\defaultServer there are many files named core.*.dmp and heapdump.*.phd. The size of these files is between 130 MB and 1.3 GB when my deployed app uses 4 MB.
Can I delete these files *.dmp and *.phd?
What are these files for?
Short answer: yes, it's safe to delete them, but you should find out why they're appearing, as it could indicate that your application is not running correctly.
If your dump files were created a long time ago, or you know you were debugging an OutOfMemoryException or have been running server javadump --include=heap,system then go ahead and delete the files. If, however, you keep getting new dump files and don't know why then read on.
The core and heapdump files contain a snapshot of the memory of the application from a specific point in time. Usually you do this to capture the state of your application at the point where something goes wrong so that you can examine it with analysis tools and try to work out what went wrong.
For example, by default the IBM JVM will perform a dump when an OutOfMemoryException is thrown. This allows you to look at the dump file and see what's using up all the memory.
If you have a corresponding javacore file, the fourth line or so should say why the memory dump was made.
e.g. 1TISIGINFO Dump Requested By User (00100000) Through com.ibm.jvm.Dump.javaDumpToFile (caused by running server javadump)
or 1TISIGINFO Dump Event "user" (00004000) received (caused by running kill -3)
If it's a "user" event, then something's asking the JVM to create a dump. If not, and you're still not sure what's causing it, check your jvm.options file for any -Xdump options which can be used to cause the JVM to create a dump in response to certain events. More information on that in the Knowledge Center.

PostgreSQL statistics issue - could not rename temporary statistics file

I am running PotgreSQL 9.4 on Windows, and constantly get the error,
2015-06-15 09:35:36 EDT LOG could not rename temporary statistics file "pg_stat_tmp/global.tmp" to "pg_stat_tmp/global.stat": Permission denied
I also see constant 200-800k writes to global.stat and global.tmp. I have seen other users with the same issue, but no solution.
It is a big database server, with 300g of data, and 6,000 databases.
I tried setting,
track_activities=off
In the config file, but it did not seem to have any affect.
Any help for the error, or reducing the write?
After my initial answer, I decided to research the operation of the stats collector and in particular what it is doing with the files in pg_stat_tmp. I've substantially re-written the answer as a result.
What are the global.stat / global.tmp files used for?
Postgresql contains functionality to collect statistics and status information about its operation. The function is described in Section 27.2 of the manual.
This information is collated by the stats collector process. It is made available to the other postgresql processes via the global.stat file. The first time you run a query that accesses this data within a transaction, the backend which you are connected to will read the global.stat file and cache the result, using it until the end of the transaction.
To keep this file up to date, the stats collector process periodically re-writes it with updated information. It typically does this several times a second. The process is as follows:
Create a new file global.tmp
Write data to this file
Rename global.tmp as global.stat, overwriting the previous global.stat
The global.tmp and global.stats files are written into the directory configured by the stats_temp_directory configuration parameter. Normally this is set to $PGDATA/pg_stat_tmp.
On shutdown, the stats file is written into the file $PGDATA/global/pgstat.stat, and the files in the tmp dir above are removed. This file is then read and removed when the database is started up again.
Why is the stats collector processor creating so much I/O load?
Normally, the amount of data written to the global.stats is relatively modest and writing it does not generate that much I/O traffic. However under some circumstances it does seem to get very bloated. When this happens the amount of load generated can start to get excessive as the entire file is rewritten more than once a second.
I have had one experience where it grew by a factor or 10 or more, compared to other similar servers. This machine did have an unusually large number of databases (for our application at least - 30-40 databases - but nothing like the 6000 you say you have). It is possible that having a large number of databases exacerbates this.
Some of the references below talk about a pattern of creating / dropping lots of tables causing bloat in these files, and that perhaps autovacuum is not running aggressively enough to remove the associated bloat. You may wish to consider your autovac settings.
Why do I get 'Permission Denied' errors on Windows?
After examining the postgresql source code I think there may be a race condition in accessing the global.stats file which could happen at any time, but is exacerbated by the size of the file.
The default mode of operation in Windows is that it is not possible to rename or remove a file while another process has it open. This is different to Linux (or Unix) where a file can be renamed or removed while other processes are accessing it.
In the sequence above you can see that if one of the backend processes is reading the file at the same time as the stats collector is rewriting it, then the backend process may still have the file open at the time the rename is attempted. That leads to the 'Permission Denied' error you are seeing.
Naturally when the file becomes very large, then the amount of time taken to read it becomes more significant, therefore the probability of the stats collector process attempting a rename while a backend still has it open increases.
However, since the file is frequently being rewritten, the impact of these errors is relatively mild. It just means that this particular update fails, leading the the backends getting slightly out of date statistics. The next update will probably succeed.
Note that Windows does offer a file opening mode which does allow files to be deleted or renamed while they are opened by another process, however as far as I could tell, this mode is not used by Postgresql. I could not find any bug report on this - seems like it should be reported.
In summary, these errors are a side effect of the main problem, which is the excessive size of the global.stat file.
I've turned track_activities off but the file is still being written - Why?
From what I can see, track_activites affects only one of the sets of information that the stats collector is collecting.
In addition, it looks as though the stats collector process is started regardless of these settings, and will continue to re-write the file. The settings appear to control only the collection of fresh data.
My conclusion is that once the file has become bloated, it will remain so and continue to be re-written, even once all of the stats collection options are turned off.
What can I do to avoid this problem?
Once the file has become bloated, it seems that the easiest way to get the database back into a good working state is to remove the file, using the following steps:
Stop the database
When the DB is stopped, the pg_stat_tmp directory is empty and a file $PGDATA/global/pgstat.stat is written. We renamed this file to pgstat.stat.old.
Start the database. It creates a fresh set of pgstat files. After confirming the server was operating correctly you can remove the old file you have renamed.
This is the process we used when one of our servers suffered from this problem.
Needless to say be very careful when manually manipulating any files under the Postgresql Data directory.
After this you may want to monitor the server to see if it the file becomes bloated again. If it does then here are some additional ideas to consider:
As mentioned above I have seen some references to this file becoming bloated if autovacuum is not running aggressively enough. You may wish to tune the autovacuum settings
Disabling any of the track_xxx options described in the Section 18.9.1 of the manual which are not required may help
It is possible to place the pg_stats_tmp directory in a tmpfs filesystem (or whatever equivalent RAM based filesystem is available in windows). Doing so should eliminate I/O as a concern for these files.
References:
Postgres stats collector showing high disk I/O
Too much I/O generated by postgres stats collector process
stats collector suddenly causing lots of IO
Here might be a solution for your problem. https://wiki.postgresql.org/wiki/May_2015_Fsync_Permissions_Bug
Another possibility could be antivirus settings. Try to turn it off temporarily.
It happened to me few days ago. I rebooted the machine, but the error did not disappeared.
Don't know why, but performing a vacuum analyze verbose did the trick, and the error has stoped to show up.

logstash forwareder doesn't release file handle

I am running logstash forwareder to ship logs.
Forwarder,logstash,elasticsearch all are on localhost.
I have one UI application whose log files is read by shipper. When forwarder is running, archiving of log file doesn't work. logs are appended in same file. I have configured log file to archive every minute, so I can see the change. As soon as I stop forwarder, log file archiving starts working.
I guess forwarder keep holding file handle that's why file does not get archived.
Please help me on this.
regards,
Sunil
Running on windows? There exists known unfixed bugs.
See https://github.com/elasticsearch/logstash/issues/1944
for some kind of work around.

Skipping bad input files in hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.
The log files are uploaded daily from servers using S3, but it seems that a some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.
Is there any way to have hadoop skip over the bad files?
There's a who bunch of record skipping configuration properties you can use to do this - see the mapred.skip. prefixed properties on http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
There's also a nice blog post entry about this subject and these config properties:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
That said, if you file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.
Chris White's comment suggesting writing your own RecordReader and InputFormat is exactly right. I recently faced this issue and was able to solve it by catching the file exceptions in those classes, logging them, and then moving on to the next file.
I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/

Resources