Find files padded with zeros after fsarchiver backup - bash

My SSD was dying so I tried to back up my /home with fsarchiver, but during the process I got a bunch of errors like: file has been truncated: padding with zeros.
Now I'm trying to locate those files, so I'm searching for a bash/python/perl... script that lets me find non-empty files whose last n bytes are 'padded with zeros'.
Thank you in advance for your help and please excuse my English.

This script takes a list of files on the command line and reports the names of those whose last ten bytes are all zeros:
#!/bin/sh
for fname in "$@"
do
    if [ -s "$fname" ] && [ "$(tail -c10 "$fname" | tr -d '\000' | wc -c)" -eq 0 ]
    then
        echo "Truncated file: $fname"
    fi
done
It works by first checking that the file is non-empty ([ -s "$fname" ]), then taking the last ten bytes of the file (tail -c10 "$fname"), removing any zero bytes (tr -d '\000'), and counting how many bytes are left (wc -c). If all of the last ten bytes are zeros, there will be no bytes left, and the file is reported as truncated.
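For a quick manual check of a single suspicious file, the same pipeline can be run directly (the file name here is just an example):
tail -c10 suspect.dat | tr -d '\000' | wc -c    # prints 0 if the last 10 bytes are all zero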
If you want to use something other than 10 bytes in your test, adjust the tail option to suit.
If you want to test all the files in some directory or filesystem, find can assist. If the above script is in an executable file named padded.sh, then run:
find path/to/suspect/files -size +10c -exec padded.sh {} +

Related

Writing word count in a text file in bash script

I am very new to bash scripting. How can I write the file size, word count and character count into the text file itself? My current code is:
#!/bin/bash
echo "start"
cat file    # print the file
wc -m file  # character count
wc -w file  # word count
wc -c file  # size in bytes
echo "end"
I want to append the terminal output to my text file. The text file should look like this:
...text
The size of this file: x, word count: x, character count: x. How can I do that?
You mean append the file's metadata to the same file?
Using GNU wc:
#!/bin/bash
if [[ $# -lt 1 ]]; then
    echo "Usage: ${0##*/} FILE ..." >&2
    exit 1
fi
for file do
    wc -cwm "$file" |
        awk -v file="$file" \
            '{print "File size: "$3", word count: "$1", character count: "$2 >> file}'
done
Also works on Busybox wc. You can provide multiple files.
Note that if wc is given multiple flags, it always prints the numbers in the same order, regardless of the order the flags are given.
From man wc (GNU)
The options below may be used to select which counts are printed, always in the following order: newline, word, character, byte, maximum line length.
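So, for example, these two invocations should print identical columns with GNU wc (the file name is illustrative):
wc -cwm notes.txt   # words, characters, bytes
wc -mwc notes.txt   # same output, despite the different flag order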
However, POSIX says this:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
If the -m option is specified, the number of characters shall replace the field in this format.
source: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html
I haven't tested on other platforms, but apparently a POSIX-compliant wc would need two separate invocations to get both the byte and character count. The reasoning for using a single invocation was to avoid the file changing between counts.
If you know the file is ASCII (and not UTF-8 for example), you could just reuse the byte count for the character count.
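A portable sketch along those lines (this loop mirrors the script above, but is an assumption and not tested on any particular platform); it calls wc separately per count, so in principle the file could change between calls:
for file do
    chars=$(wc -m < "$file")   # character count
    words=$(wc -w < "$file")   # word count
    bytes=$(wc -c < "$file")   # byte count (file size)
    printf 'File size: %s, word count: %s, character count: %s\n' \
        "$bytes" "$words" "$chars" >> "$file"
done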

How to calculate percentage difference with the cmp command

I am aware of the cmp command in Linux, which is used to do a byte-by-byte comparison. Could we build upon this to get a percentage difference?
Example: I have two files, a1.jpg and a2.jpg.
So when I compare these two files using cmp, could I get the percentage of difference between them?
Example: a1.jpg has 1000 bytes and a2.jpg has 1021 (taking the bigger file as reference).
So I could get the percentage difference between the two files, i.e. number of bytes differing / total bytes in the larger file.
Looking for some shell script snippet. Thanks in advance.
You could create a script file with the following content - let us call it percmp.sh:
#!/bin/sh
DIFF=$(cmp -l "$1" "$2" | wc -l)
SIZE_A=$(wc -c "$1" | awk '{print $1}')
SIZE_B=$(wc -c "$2" | awk '{print $1}')
if [ "$SIZE_A" -gt "$SIZE_B" ]
then
    MAX=$SIZE_A
else
    MAX=$SIZE_B
fi
echo "$DIFF/$MAX*100" | bc -l
Be sure that it is saved with Unix line endings.
Then you run it with the two file names as arguments. For example, assuming percmp.sh and the two files are in the same folder you run the command:
sh percmp.sh FILE1.jpg FILE2.jpg
Otherwise you specify the full path of both the script and the files.
The code does exactly what you need. For reference:
#!/bin/sh tells how to interpret the file
cmp -l lists all the differing bytes (see the small example after this list)
wc -l counts the number of rows (in the code: the length of the list of differing bytes -> the number of differing bytes)
wc -c gives the size of a file in bytes
awk does the text parsing (here, to get ONLY the size of the file)
-gt means Greater Than
bc -l performs the division
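A tiny illustration of the cmp -l | wc -l idea (file names and contents are made up): two four-byte files that differ in one byte yield exactly one line of cmp -l output.
printf 'abcd' > x.bin
printf 'abXd' > y.bin
cmp -l x.bin y.bin | wc -l    # prints 1: one differing byte position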
Hope I helped!

Can I limit the size of tar files created from a directory so that each file is included atomically (whole files only)?

I have a directory having multiple files with different sizes
a.txt
b.txt
c.txt
I want to create multiple tar files with some maximum fixed size (say 100 MB), such that each whole file is either included in a tar or not included at all (if a file's size is greater than the fixed size, maybe throw an error).
I am aware of the split approach (sketched below):
Creating a tar
Splitting it into chunks of the desired size
The problem with the above method is that the resulting pieces can't be extracted individually.
Could anyone help with a solution (or provide an alternative)?
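For reference, the tar-then-split approach mentioned above presumably looks something like this (archive and prefix names are made up); the pieces only become usable after being concatenated back together:
tar -cf whole.tar /path/to/dir              # one big archive
split -b 100M whole.tar whole.tar.part.     # fixed-size chunks, not individually extractable
# extraction requires reassembly first: cat whole.tar.part.* | tar -xf -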
The following script takes two or more arguments. The first is the total size
of the set of files to use. The remaining arguments are passed to find so you
can give a directory as an argument. The script assumes that filenames
are "well behaved" and won't contain spaces, newlines, or anything
else that would confuse the shell.
The script prints out the largest files that will fit in the given size. This is not necessarily the set of files that fills the available space best, but finding that efficiently in the general case is the knapsack problem, which I am unlikely to solve here.
You could change the sort -rn to sort -n to start with the smallest
files first.
#!/bin/sh
avail=$1
shift
used=0
find "$@" -type f -print | xargs wc -c | sort -rn | while read size fn; do
    if expr $used + $size '>' $avail >/dev/null; then
        continue
    fi
    used=$(expr $used + $size)
    echo $fn
done
The output of the script can be passed to pax(1) to create the actual archive.
For example (assuming you have called the script fitfiles):
sh fitfiles 10000000 *.txt | pax -w -x ustar -v | xz > wdb.tar.xz
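If GNU tar is preferred over pax, the same file list can presumably be fed to it via -T - (read file names from standard input); the archive name is made up:
sh fitfiles 10000000 *.txt | tar -czvf wdb.tar.gz -T -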

Split large gzip file while adding header line to each split

I want to automate the process of splitting a large gzip file into smaller gzip files, each split containing 10000000 lines (the last split will hold the leftover lines and will be less than 10000000).
Here is how I am doing it at the moment: I repeat the commands, calculating the number of leftover lines as I go.
gunzip -c large_gzip_file.txt.gz | tail -n +10000001 | head -n 10000000 > split1_.txt
gzip split1_.txt
gunzip -c large_gzip_file.txt.gz | tail -n +20000001 | head -n 10000000 > split2_.txt
gzip split2_.txt
I continue this by repeating as shown all the way until the end. Then I open these files and manually add the header line. How can this be automated?
I searched online and saw awk and other solutions, but didn't see any for gzip or for a scenario similar to this.
I would approach it like this:
gunzip the file
use head to get the first line and save it off to another file
use tail to get the rest of the file and pipe it to split to produce files of 10,000,000 lines each
use sed to insert the header into each file, or just cat the header with each file
gzip each file
You'll want to wrap this in a script or a function to make it easier to rerun at a later time. Here's an attempt at a solution, lightly tested:
#!/bin/bash
set -euo pipefail
LINES=10000000
file=$(basename "$1" .gz)
gunzip -k "${file}.gz"
head -n 1 "$file" > header.txt
tail -n +2 "$file" | split -l "$LINES" - "${file}.part."
rm -f "$file"
for part in "${file}".part.*; do
    [[ $part == *.gz ]] && continue # ignore partial results of previous runs
    gzip -c header.txt "$part" > "${part}.gz"
    rm -f "$part"
done
rm -f header.txt
To use:
$ ./splitter.sh large_gzip_file.txt.gz
I would further improve this by using a temporary directory (mktemp -d) for the intermediate files and ensuring the script cleans up after itself at exit (with a trap). Ideally it would also sanity check the arguments, possibly accepting a second argument indicating the number of lines per part, and inspect the contents of the current directory to ensure it doesn't clobber any preexisting files.
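A minimal sketch of that temporary-directory-plus-trap idea (a fragment only, reusing the variable names from the script above):
tmpdir=$(mktemp -d) || exit 1           # private scratch area for intermediate files
trap 'rm -rf "$tmpdir"' EXIT            # clean up even if the script exits early
head -n 1 "$file" > "$tmpdir/header.txt"
tail -n +2 "$file" | split -l "$LINES" - "$tmpdir/${file}.part."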
I don't think awk is for splitting a gzip file into smaller pieces; it's for text processing. Below is my way to solve your issue; hope it helps:
step1:
gunzip -c large_gzip_file.txt.gz | split -l 10000000 - split_file_
The split command can split a file into pieces; you can specify the size of each piece and also provide a prefix for all pieces.
The large gzip file will be split into multiple files with the name prefix split_file_.
step2:
save header content into file header_file.csv
step3:
for f in split_file*; do
    cat header_file.csv "$f" > "$f.new"
    mv "$f.new" "$f"
done
Here I assume you work in the directory containing the split files; if not, replace split_file* with the absolute path, for example /path/to/split_file*. This iterates over all files matching the pattern split_file* and adds the header content to the beginning of each matching file.
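Since the question wants the pieces gzipped as well, presumably a final pass like this would finish the job (same split_file_ prefix as above):
for f in split_file*; do
    gzip "$f"    # yields split_file_aa.gz, split_file_ab.gz, ...
done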

Split big compressed log files into compressed chunks of X lines while doing inline compression

My situation is the following: a big (10GB) compressed file containing some files (~60) with a total uncompressed size of 150GB.
I would like to be able to slice big compressed log files into parts that have a certain number of lines in them (i.e. 1 million).
I don't want to use split since it involves totally decompressing the original file, and I don't have that much disk space available.
What I am doing so far is this:
#!/bin/bash
SAVED_IFS=$IFS
IFS=$(echo -en "\n\b")
for file in `ls *.rar`
do
echo Reading file: $file
touch $file.chunk.uncompressed
COUNTER=0
CHUNK_COUNTER=$((10#000))
unrar p $file | while read line;
do
echo "$line" >> $file.chunk.uncompressed
let COUNTER+=1
if [ $COUNTER -eq 1000000 ]; then
CHUNK_COUNTER=`printf "%03d" $CHUNK_COUNTER;`
echo Enough lines \($COUNTER\) to create a compressed chunk \($file.chunk.compressed.$CHUNK_COUNTER.bz2\)
pbzip2 -9 -c $file.chunk.uncompressed > $file.chunk.compressed.$CHUNK_COUNTER.bz2
# 10# is to force bash to count in base 10, so that 008+ are valid
let CHUNK_COUNTER=$((10#$CHUNK_COUNTER+1))
let COUNTER=0
fi
done
#TODO need to compress lines in the last chunk too
done
IFS=$SAVED_IFS
What I don't like about it is that I am limited by the speed of writing and then reading the uncompressed chunks (~15MB/s).
The speed of reading the uncompressed stream directly from the compressed file is ~80MB/s.
How can I adapt this script to stream a limited number of lines per chunk while writing directly to a compressed file?
You can pipe the output to a loop in which you use head to chop the files.
$ unrar p $file | ( while :; do i=$[$i+1]; head -n 10000 | gzip > split.$i.gz; done )
The only thing you still have to work out is how to terminate the loop, since this will go on generating empty files. This is left as an exercise to the reader.
Zipping an empty file will give some output (for gz, it's 26 bytes) so you could test for that:
$ unrar p $file |
( while :; do
i=$[$i+1];
head -n 10000 | gzip > split.$i.gz;
if [ `stat -c %s split.$i.gz` -lt 30 ]; then rm split.$i.gz; break; fi;
done )
If you don't mind wrapping the file in a tar file, then you can use tar to split and compress the file for you.
You can use tar -M --tape-length 1024 to create 1 megabyte files. Do note that at the end of each volume tar will ask you to press enter before it starts writing to the file again. So you will have to wrap it with your own script and move the resulting file before doing so.
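A sketch of how that might be scripted with GNU tar (the volume size and script name are made up); the -F/--new-volume-script hook runs at each volume change, so the script can move the finished chunk aside instead of you pressing enter:
# ~100 MB volumes; ./rotate-volume.sh is a hypothetical script that renames chunk.tar out of the way
tar -c -M --tape-length=102400 -F ./rotate-volume.sh -f chunk.tar big_logs/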
