Filter int, float and character from a text file in a Shell Script - bash

Suppose I have a text file, which contains data like this.
Below output generated from du - sh /home/*
1.5G user1
2.5G user2
And so on...
Now if I want that those files size be stored in an array and compared to 5 GB if the user is consuming more than 5 Gb. What can I do?
The du command shows the usage of each folder in home directory. So if i want myself to be notified that some user is consuming more than 5 GB. Because there is a long list of users. It will be tedious to identify each user's disk usage. I want a shell script to identify the usage for each directory in home. And then I will put mail function to notify myself for exceeded limits.
Note : Don't want to implement quota as I just want to monitor the usage.

Use du's -t (--threshold) option to specify you only want to know about directories with more than a certain amount of data in them:
$ du -sh -t 5G /home/*
If you're picky about precisely how big a gigabyte is, note that 5G uses multiples of 1024; you may prefer -t 5GB for multiples of 1000, or even -t 5000M to mix them.
For lots of users, you're probably better off writing that using -d 1 instead of -s to avoid the shell having to expand the * into a very long list:
$ du -h -d 1 -t 5G /home/

Related

Discrepancy between the size of file created and size displayed by du -sh [duplicate]

This question already has answers here:
Size() vs ls -la vs du -h which one is correct size?
(3 answers)
Closed last month.
I had to create a random file of 10GB size, which I can using dd or fallocate, but the size shown by du -sh is twice the one I created:
$ dd bs=1MB count=10000 if=/dev/zero of=foo
10000+0 records in
10000+0 records out
10000000000 bytes (10 GB, 9.3 GiB) copied, 4.78419 s, 2.1 GB/s
$ du -sh foo
19G foo
$ ls -sh foo
19G foo
$ fallocate -l 10G bar
$ du -sh bar
20G bar
$ ls -sh bar
20G bar
Can someone please explain me this apparent discrepancy?
On wikipedia, it mentions about GPFS ...
The system stores data on standard block storage volumes, but includes an internal RAID layer that can virtualize those volumes for redundancy and parallel access much like a RAID block storage system.
I conclude that there is at least one non-visible duplicate for every file, and therefore each file actually uses twice the amount of space than the actual content of a single file. So the underlying RAID imposes the double-usage.
That could explain it, because I have created a similar massive file for other purposes, also using dd, on an ext4 filesystem, but the OS reports my file size matching the dd creation size, as per design intent (no RAID in effect on my drive).
The fact that you indicate that stat does report the correct file size as per dd's actions, confirms what I put forward above.

Top command redirection

I am trying to analyse a randomly crashing application.To get some details of the process i am implementing a script like below
procID=$(ps -aef|grep app|awk '{print $2}')
top -p $procID -b -d 10 > test.log
In this case top command will execute for every 10 sec and write to test.log. Planning to run it indefinitely, but I want to flush out the contents of test.log each time top command writes some value to it. How can I modify the script accordingly ?
To avoid your log getting too big, I suggest you use logrotate instead. In short, logrotate creates a cronjob that will "rotate" your log file upon a condition (one file per day, rotate when the file reaches a given size, etc.). You can also choose to keep the X last rotated logs.
This is what is used for lastmessage and the other linux log files (hence the .log.1, .log.2 etc.)
Here is a basic way to make it work:
Create a configuration for your file in /etc/logrotate.d. For example:
# /etc/logrotate.d/test
/var/log/test.log {
size 4M
create 770 somegroup someuser
rotate 4
missingok
copytruncate
}
What it says:
size 4M: rotate file after it reaches 4M in size
create ...: create new file with the given permisions (replace somegroup someuser by your usergroup and username)
rotate 4: keep the last 4 log files (test.log, test.log.1 etc)
missingOk an copyTruncate: avoid some errors (see the doc for more info)
Check that your file is correct by executing:
sudo logrotate /etc/logrotate.d/test
enjoy.
More about logrotate: https://support.rackspace.com/how-to/understanding-logrotate-utility/

finding size of a file using ls and du .what is difference [duplicate]

This question already has answers here:
Size() vs ls -la vs du -h which one is correct size?
(3 answers)
Closed 8 years ago.
There is a file named today.log in my server.
ls -l today.log showing 400GB.
du -sh today.log. showing 240GB
What is the difference between ls and du ...
du shows how much disk the file uses. ls shows how big the file is. These two values can be different. Files with holes can take up less space than their size. Most files do not completely fill the blocks of the filesystem, so they take up more space than their size. A file with a single byte still takes up at least one full block. (512 or 1024 bytes, typically.) As an examle, consider a file with a single byte at position 183738475 (randomly typed numbers). That file can be stored on disk using a single block (whenever the kernel queries the filesystem for bytes other than the single byte in the file, the filesystem reports them as being zero, and there is no need to store anything. Not all filesystems work this way.) But the size of the file is 183738475, so ls will report that and du will report how many blocks are used by the filesystem. du -h will report the number of blocks used times the block size converted to a human readable format. Keep in mind that the actual numbers will vary depending on your filesystem. For example:
$ echo > foo; ls -l foo |awk '{print $5}'; du foo; du -h foo
1
8 foo
4.0K foo
This file is one byte in size but consumes 8 blocks on disk, and the block size is 512 so those 8 blocks consume 4k. (My filesystem has been optimized for large files, and small files waste a lot of space.)

How to zgrep the last line of a gz file without tail

Here is my problem, I have a set of big gz log files, the very first info in the line is a datetime text, e.g.: 2014-03-20 05:32:00.
I need to check what set of log files holds a specific data.
For the init I simply do a:
'-query-data-'
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
BUT HOW to do the same with the last line without process the whole file as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info, those logs are created with the data time of it's initial record, so if I want to query logs at 14:00:00 I have to search, also, in files created BEFORE 14:00:00, as a file would be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data that allow you to seek to in a program aware of that extra information. While it exists in the standard, the vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. The GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
————— No Random Access —————— ——————— Random Access ———————
FORMAT SIZE RATIO WRITE READ SIZE RATIO WRITE SEEK
————————— ————————————————————————————— —————————————————————————————
(original) 7211M 1.0000 - 0:06 7211M 1.0000 - 0:00
bzip2 96M 0.0133 48:31 3:15 97M 0.0134 47:39 0:00
gzip 79M 0.0109 0:59 0:22
dictzip 605M 0.0839 1:36 (fail)
xz -0 25M 0.0034 1:14 0:12 25M 0.0035 1:08 0:00
xz 14M 0.0019 16:32 0:11 14M 0.0020 16:44 0:00
Timing tests were not comprehensive, I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can access randomly a gzipped file if you previously create an index for each file ...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S option supervise a still-growing file and creates an index for it as it is growing - this can be useful for gzipped rsyslog files as reduces to zero in the practice the time of index creation.
-t tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please, note that if the index doesn't exists, this will consume the same time as a complete decompression: but as the index is reusable, next searches will be greatly reduced in time!
This tool is based on zran.c demonstration code from original zlib, so there's no out-of-the-rules magic!

Errno::ENOMEM: Cannot allocate memory - cat

I have a job running on production which process xml files.
xml files counts around 4k and of size 8 to 9 GB all together.
After processing we get CSV files as output. I've a cat command which will merge all CSV files to a single file I'm getting:
Errno::ENOMEM: Cannot allocate memory
on cat (Backtick) command.
Below are few details:
System Memory - 4 GB
Swap - 2 GB
Ruby : 1.9.3p286
Files are processed using nokogiri and saxbuilder-0.0.8.
Here, there is a block of code which will process 4,000 XML files and output is saved in CSV (1 per xml) (sorry, I'm not suppose to share it b'coz of company policy).
Below is the code which will merge the output files to a single file
Dir["#{processing_directory}/*.csv"].sort_by {|file| [file.count("/"), file]}.each {|file|
`cat #{file} >> #{final_output_file}`
}
I've taken memory consumption snapshots during processing.It consumes almost all part of the memory, but, it won't fail.
It always fails on cat command.
I guess, on backtick it tries to fork a new process which doesn't get enough memory so it fails.
Please let me know your opinion and alternative to this.
So it seems that your system is running pretty low on memory and spawning a shell + calling cat is too much for the few memory left.
If you don't mind loosing some speed, you can merge the files in ruby, with small buffers.
This avoids spawning a shell, and you can control the buffer size.
This is untested but you get the idea :
buffer_size = 4096
output_file = File.open(final_output_file, 'w')
Dir["#{processing_directory}/*.csv"].sort_by {|file| [file.count("/"), file]}.each do |file|
f = File.open(file)
while buffer = f.read(buffer_size)
output_file.write(buffer)
end
f.close
end
You are probably out of physical memory, so double check that and verify your swap (free -m). In case you don't have a swap space, create one.
Otherwise if your memory is fine, the error is most likely caused by shell resource limits. You may check them by ulimit -a.
They can be changed by ulimit which can modify shell resource limits (see: help ulimit), e.g.
ulimit -Sn unlimited && ulimit -Sl unlimited
To make these limit persistent, you can configure it by creating the ulimit setting file by the following shell command:
cat | sudo tee /etc/security/limits.d/01-${USER}.conf <<EOF
${USER} soft core unlimited
${USER} soft fsize unlimited
${USER} soft nofile 4096
${USER} soft nproc 30654
EOF
Or use /etc/sysctl.conf to change the limit globally (man sysctl.conf), e.g.
kern.maxprocperuid=1000
kern.maxproc=2000
kern.maxfilesperproc=20000
kern.maxfiles=50000
I have the same problem, but instead of cat it was sendmail (gem mail).
I found problem & solution here by installing posix-spawn gem, e.g.
gem install posix-spawn
and here is the example:
a = (1..500_000_000).to_a
require 'posix/spawn'
POSIX::Spawn::spawn('ls')
This time creating child process should succeed.
See also: Minimizing Memory Usage for Creating Application Subprocesses at Oracle.

Resources