Split large gzip file while adding header line to each split - shell

I want to automate splitting a large gzip file into smaller gzip files, each containing 10000000 lines (the last split holds whatever is left over and will have fewer than 10000000 lines).
Here is how I am doing it at the moment; I repeat these commands by hand, working out the line offset for each split.
gunzip -c large_gzip_file.txt.gz | tail -n +10000001 | head -n 10000000 > split1_.txt
gzip split1_.txt
gunzip -c large_gzip_file.txt.gz | tail -n +20000001 | head -n 10000000 > split2_.txt
gzip split2_.txt
I repeat this all the way to the end of the file. Then I open each split and manually add the header line. How can this be automated?
I searched online and saw awk and other solutions, but nothing for gzip files or for a scenario like this one.

I would approach it like this:
gunzip the file
use head to get the first line and save it off to another file
use tail to get the rest of the file and pipe it to split to produce files of 10,000,000 lines each
use sed to insert the header into each file, or just cat the header with each file
gzip each file
You'll want to wrap this in a script or a function to make it easier to rerun at a later time. Here's an attempt at a solution, lightly tested:
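#!/bin/bash
set -euo pipefail
lines_per_part=10000000   # not named LINES: bash uses that for the terminal height
file=$(basename "$1" .gz)
# Decompress into the current directory, leaving the original .gz in place
gunzip -c "$1" >"$file"
head -n 1 "$file" >header.txt
tail -n +2 "$file" | split -l "$lines_per_part" - "${file}.part."
rm -f "$file"
for part in "${file}.part."* ; do
    [[ $part == *.gz ]] && continue # ignore partial results of previous runs
    # gzip -c concatenates header and part into one compressed stream
    gzip -c header.txt "$part" >"${part}.gz"
    rm -f "$part"
done
rm -f header.txt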
To use:
$ ./splitter.sh large_gzip_file.txt.gz
I would further improve this by using a temporary directory (mktemp -d) for the intermediate files and ensuring the script cleans up after itself at exit (with a trap). Ideally it would also sanity check the arguments, possibly accepting a second argument indicating the number of lines per part, and inspect the contents of the current directory to ensure it doesn't clobber any preexisting files.
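A minimal sketch of those improvements, assuming bash and a header on the first line of the file. The function name split_gz_with_header and its optional second argument are illustrative, not part of the script above. It takes the header with read and hands the rest of the same stream to split, so the input is decompressed only once and all intermediate files live in a temporary directory:

```bash
#!/bin/bash
# Hypothetical helper: split_gz_with_header INPUT.gz [LINES_PER_PART]
# The body runs in a subshell ( ... ), so the EXIT trap and the
# variables stay local to a single call.
split_gz_with_header() (
    set -eu
    input=$1
    per_part=${2:-10000000}
    [ -f "$input" ] || { echo "no such file: $input" >&2; exit 1; }
    base=$(basename "$input" .gz)

    tmpdir=$(mktemp -d)
    trap 'rm -rf "$tmpdir"' EXIT   # clean up however the subshell ends

    # Single decompression pass: read takes the header line,
    # split consumes everything after it.
    gunzip -c "$input" | {
        IFS= read -r header
        printf '%s\n' "$header" >"$tmpdir/header"
        split -l "$per_part" - "$tmpdir/$base.part."
    }

    for part in "$tmpdir/$base.part."*; do
        [ -e "$part" ] || continue          # input had a header only
        out=$(basename "$part").gz
        if [ -e "$out" ]; then
            echo "refusing to overwrite $out" >&2
            exit 1
        fi
        cat "$tmpdir/header" "$part" | gzip >"$out"
    done
)
```

Called as split_gz_with_header large_gzip_file.txt.gz, this would write large_gzip_file.txt.part.aa.gz, large_gzip_file.txt.part.ab.gz and so on into the current directory, and it stops rather than overwrite an existing part.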

I don't think awk is the right tool for splitting a gzip file into smaller pieces; it is meant for text processing. Below is my way to solve your issue, hope it helps:
step1:
gunzip -c large_gzip_file.txt.gz | tail -n +2 | split -l 10000000 - split_file_
The split command breaks its input into pieces; you can specify the size of each piece and a prefix for the piece names. tail -n +2 drops the header line first, so it does not end up twice in the first piece.
The large gzip file will be split into multiple files whose names start with split_file_
step2:
save the header line into the file header_file.csv:
gunzip -c large_gzip_file.txt.gz | head -n 1 > header_file.csv
step3:
for f in split_file_*; do
cat header_file.csv "$f" > "$f.new"
mv "$f.new" "$f"
done
Here I assume you are working in the directory that holds the split files; if not, replace split_file_* with the absolute path, for example /path/to/split_file_*. The loop iterates over all files matching split_file_* and adds the header to the beginning of each one.

Related

Shell script to copy the files

I have worked very little with scripts, so I don't know how to do this.
I need to create a script (in Ubuntu) that copies only files where a certain user modified more than 20 lines at a given time.
I know that to copy a file elsewhere I use this code:
$ ls dir1/
dir2/
$ cp -r dir1/ dir1.copy
$ ls dir1.copy
dir2/
And to count lines: wc -l file1
But how could I check if a user has modified more than 20 lines in a file (eg a simple txt, for example today)?
Thank you in advance !
In the first place, if by "modifying lines in a file" you mean "adding lines to a file", then you can do something about it. If you are literally talking about modifying lines in files, there is nothing you can do to track that activity without setting up some version control first.
So, assuming we are talking about files in which your users will be adding lines, a workaround for that may consist of setting up some scheduled tasks to check the line numbers of those files "at a given time" (as you said) and compare that value to a previous result, and then if there are more than 20 additional lines than from the last value, copy the files elsewhere.
First things first, counting the lines of your files is something you have already mentioned and it is right: I will propose using wc -l too.
At this point you will need two things: a place (typically a file) to periodically save the line count of each file, and a trigger that starts copying the files once more than 20 lines have been added.
So for example, in this case you can set up a cron job, like this (i.e. to run every hour):
0 */1 * * * cat ${FILE} |wc -l > /tmp/${FILE}_counter
That one will check the number of lines of a given file and send the output to a temporary file that we will be using soon. In case you have multiple files you can easily script that and make a loop, like this:
#!/bin/bash
for FILE in file1 file2 file3; do
cat ${FILE} |wc -l > /tmp/${FILE}_counter
done
Don't forget to add the path to the script in the cron job if you do that this way. After that, you will have something like this in your /tmp directory:
/tmp/file1_counter
/tmp/file2_counter
/tmp/file3_counter
...
At this point you only need a trigger, which can be another script, to compare the current number of lines of a file at a given time and start copying it elsewhere in case there are more than 20 additional lines than in the previous check. Consider this:
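#!/bin/bash
FILE=$1   # assumed here: the file to check is passed as the first argument
LAST_VALUE=$(cat /tmp/${FILE}_counter)
CURRENT_VALUE=$(cat ${FILE} |wc -l)
if [ ${CURRENT_VALUE} -gt $(expr ${LAST_VALUE} + 20) ]; then
    # Your cp stuff here
    :   # no-op placeholder; a branch holding only a comment is a syntax error
fi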
Of course you can add a loop here too in case of handling multiple files:
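#!/bin/bash
for FILE in file1 file2 file3; do
    # Read both values inside the loop so they refer to the current FILE
    LAST_VALUE=$(cat /tmp/${FILE}_counter)
    CURRENT_VALUE=$(cat ${FILE} |wc -l)
    if [ ${CURRENT_VALUE} -gt $(expr ${LAST_VALUE} + 20) ]; then
        # Your cp stuff here
        :   # no-op placeholder; a branch holding only a comment is a syntax error
    fi
done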
Then you only have to add this last script to a cron job too, and you should be done.
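The counting and the comparison could also be folded into a single step. A rough sketch, assuming the counters may live in a directory of your choice; copy_if_grown, its arguments and the default threshold of 20 are names made up for this example:

```bash
#!/bin/bash
# Hypothetical helper: copy_if_grown FILE DEST_DIR STATE_DIR [THRESHOLD]
# Copies FILE to DEST_DIR when it has gained more than THRESHOLD lines
# since the previous call, then records the new line count.
copy_if_grown() {
    file=$1
    dest=$2
    state=$3
    threshold=${4:-20}
    counter="$state/$(basename "$file")_counter"

    current=$(( $(wc -l <"$file") ))    # arithmetic strips any padding
    if [ -f "$counter" ]; then
        last=$(cat "$counter")
    else
        last=$current                   # first run: only record a baseline
    fi

    if [ $((current - last)) -gt "$threshold" ]; then
        cp "$file" "$dest/"
    fi
    echo "$current" >"$counter"
}
```

Because the counter is rewritten on every call, growth is measured between two consecutive runs, which matches the hourly cron setup described above.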
Hope you find this useful.
You can use diff to compare 2 files. With the -u0 option, it will show you the added/deleted/modified lines, prefixed with "+" or "-". You can then count the lines starting with "+" or "-" with grep and its -c option.
So for the number of lines added or modified, which begin with "+" :
diff -u0 $file_before $file_after | grep -c '^+'
and this will count the deleted or modified lines, which start with "-" :
diff -u0 $file_before $file_after | grep -c '^-'
Note that this format has 2 header lines, which also start with "+" and "-", so each count is one too high unless you take that into account.
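One way to handle those two header lines is to drop them before counting. A small sketch; count_changed is a hypothetical helper, and -U 0 is used as the POSIX spelling of a unified diff with zero context lines:

```bash
#!/bin/bash
# Hypothetical helper: count_changed SIGN OLD NEW
# SIGN is + (added or modified lines) or - (deleted or modified lines).
count_changed() {
    # tail -n +3 skips the "--- old" / "+++ new" header lines;
    # "|| true" because grep -c exits non-zero when the count is 0.
    diff -U 0 "$2" "$3" | tail -n +3 | grep -c "^$1" || true
}
```

For example, count_changed + "$file_before" "$file_after" prints the number of added or modified lines, which can then be compared against 20.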

Can I limit the size of files to be included in tar in a directory to include atomic files?

I have a directory having multiple files with different sizes
a.txt
b.txt
c.txt
I want to create multiple tar files, each with a fixed maximum size (say 100 MB), such that every file is either included whole in a tar or not included at all (if a single file is larger than the limit, maybe throw an error).
I am aware of the split approach:
Creating tar
Splitting with desired chunk size
The problem with the above method is that the resulting tar files can't be extracted individually.
Could anyone help with a solution (or provide an alternative)?
The following script takes two or more arguments. The first is the total size
of the set of files to use. The remaining arguments are passed to find so you
can give a directory as an argument. The script assumes that filenames
are "well behaved" and won't contain spaces, newlines, or anything
else that would confuse the shell.
The script prints out the largest files that will fit in the given size. This
is not necessarily the most space consuming set of files, but finding that
efficiently in the general case is the Knapsack problem, which I am unlikely to
solve here.
You could change the sort -rn to sort -n to start with the smallest
files first.
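#!/bin/sh
avail=$1
shift
used=0
# "$@" passes the remaining arguments (files or directories) on to find
find "$@" -type f -print | xargs wc -c | sort -rn | while read size fn; do
    # wc prints a "total" line when given several files; skip it
    [ "$fn" = total ] && continue
    if expr $used + $size '>' $avail >/dev/null; then
        continue
    fi
    used=$(expr $used + $size)
    echo $fn
done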
The output of the script can be passed to pax(1) to create the actual archive.
For example (assuming you have called the script fitfiles):
sh fitfiles 10000000 *.txt | pax -w -x ustar -v | xz > wdb.tar.xz
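To get several archives instead of one file list, the same greedy idea can be repeated until every file is placed. A rough sketch under the same "well behaved" filename assumption; pack_into_tars and its PREFIX argument are hypothetical, and it budgets by file size only, so each archive comes out somewhat larger than the sum of its files because of tar's own headers and padding:

```bash
#!/bin/bash
# Hypothetical helper: pack_into_tars MAX_BYTES PREFIX FILE...
# Writes PREFIX1.tar, PREFIX2.tar, ... containing whole files only.
pack_into_tars() {
    max=$1
    prefix=$2
    shift 2

    # Fail early if any single file can never fit.
    for f in "$@"; do
        if [ "$(( $(wc -c <"$f") ))" -gt "$max" ]; then
            echo "larger than $max bytes: $f" >&2
            return 1
        fi
    done

    n=1
    used=0
    batch=
    for f in "$@"; do
        size=$(( $(wc -c <"$f") ))
        if [ -n "$batch" ] && [ $((used + size)) -gt "$max" ]; then
            # $batch is unquoted on purpose: names are assumed to
            # contain no whitespace or glob characters.
            tar -cf "$prefix$n.tar" $batch
            n=$((n + 1))
            used=0
            batch=
        fi
        batch="$batch $f"
        used=$((used + size))
    done
    if [ -n "$batch" ]; then
        tar -cf "$prefix$n.tar" $batch
    fi
}
```

Files are packed in the order given, so each resulting archive can be extracted on its own. Sorting the arguments by size first would usually fill the archives more tightly.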

extracting the first 10KB chunk from a text file and then moving the new file to a different location

I have all the text files at bigdata/*.txt. The file names have the format languageName-xxx-10MB.txt. I want to cut each of these files from 10MB down to its first 10KB, looping over all of them, and place the newly formed file in ../smalldata/. The new file name should have the format languageName-xxx-10KB.txt
I have tried both operations independently. The first is looping over all the files in bigdata/ using
#!/bin/bash
for entry in bigdata/*
do
echo "$entry"
done
I get the output as
bigdata/lang1-xxx-10MB.txt
bigdata/lang2-xxx-10MB.txt
.
.
bigdata/langn-xxx-10MB.txt
I have also tried using the head command that gets me the first 10KB of the files using
head -c 10240 lang1-xxx-10MB.txt > ../smalldata/lang1-xxx-10KB.txt
I am looking for a way to combine these two tasks in a single loop.
While inside the directory containing your bigdata and smalldata directories, you can run
#!/bin/bash
cd bigdata
for entry in *
do
head -c 10240 "$entry" > "../smalldata/${entry//10MB/10KB}"
done
The ${entry//10MB/10KB} part is known as parameter expansion; this form replaces every occurrence of 10MB in the value of entry with 10KB.
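To illustrate the expansion on its own (bash syntax; the file name is just the example from the question, and the variable names all, first and stripped are only for this demo):

```bash
#!/bin/bash
entry="lang1-xxx-10MB.txt"

# ${var//pattern/replacement} replaces every match of pattern
all="${entry//10MB/10KB}"

# ${var/pattern/replacement} replaces only the first match
first="${entry/10MB/10KB}"

# ${var%suffix} strips a trailing match; this form also works in POSIX sh
stripped="${entry%-10MB.txt}-10KB.txt"

echo "$all" "$first" "$stripped"
```

For this input all three expansions give lang1-xxx-10KB.txt; they differ only when 10MB appears more than once or somewhere other than the end of the name.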
#!/bin/bash
BIGDATA=bigdata/*
SMALLDATA=smalldata
for entry in $BIGDATA
do
    filename=$(basename "$entry")
    newname=$(echo "$filename" | awk -F '-' '{print $1 "-" $2 "-10KB.txt"}')
    # 10240 bytes = 10KB, as asked for in the question
    head -c 10240 "$entry" > "$SMALLDATA/$newname"
done

How to quickly check a .gz file without unzip? [duplicate]

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a mac you need to use the < with zcat:
zcat < CONN.20111109.0057.gz|head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which prints the 1st line, skips the next 4 lines, prints the 6th line, and so on through the whole file.
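The first~step address used by sed here is a GNU extension. Where GNU sed is not available, a portable sketch with awk makes the same selection; every_nth is a made-up helper name:

```bash
#!/bin/bash
# Hypothetical helper: every_nth N FILE.gz
# Prints line 1, then every Nth line after it (1, N+1, 2N+1, ...).
every_nth() {
    gunzip -c "$2" | awk -v n="$1" '(NR - 1) % n == 0'
}
```

For example, every_nth 5 file.gz > subFile keeps lines 1, 6, 11 and so on.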
If you want to use zcat, this will show the first 10 rows
zcat your_filename.gz | head
Let's say you want the first 16 rows
zcat your_filename.gz | head -n 16
This awk snippet lets you show not only the first few lines but any range you specify. It also adds line numbers, which I needed when debugging an error message that pointed to a certain line far down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
print NR,$0;
if (NR>=to)
exit 1
}

Merging, then splitting files

Using a for loop, I can merge all of the files in a directory that end with *.txt:
for filename in *.txt; do
cat "${filename}"
echo
done > output.txt
After doing this, I will run output.txt through various scripts, in which the text will be changed considerably. After that, I want to split the files, at the same places at which they were merged, into different files (output01.txt, output02.txt, etc.).
How can I split the files at the same place they were merged?
This cannot be based on line number, because the scripts will add \t in places.
I think a solution that might work is to place "#########" at the end of each of the initial *.txt files before merging them, but I don't know how to get BASH to split the files again at that mark.
Instead of that for loop for concatenating, you can just use cat *.txt.
Anyway, why don't you just perform the scripts on each file independently within the for loop?
If you really want to combine and re-segregate, you can use:
for filename in *.txt; do
cat "${filename}"
echo "#####"
done > output.txt
# Pass output.txt through whatever
awk 'BEGIN { fileno = 1; file = sprintf("output%02d.txt", fileno) };
{ if($1 ~ /#####/) { fileno++;
file = sprintf("output%02d.txt", fileno);
next }
else print >file
}' output.txt
The canonical answer would be:
tar c *.txt > output.txt
You could split/unmerge them exactly by doing
tar xf output.txt # in the current directory
tar x -C /tmp/splitfiles/ -f output.txt
Now if you really want to do stuff like that in a loop and extract to stdout/a pipe, you could:
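while IFS= read -r fname
do
    # extract the named member to a pipe
    tar -xOf output.txt "$fname" | myprogram "$fname"
done < <(tar tf output.txt) # redirect the whole loop, not each read, or the list restarts every iteration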
However, that would possibly not be very efficient. You could consider just doing
while IFS= read -r fname
do
    # handle extracted file
    myprogram "/tmp/splitfiles/$fname"
    unlink "/tmp/splitfiles/$fname" # drop the temp file
done < <(tar x -v -C /tmp/splitfiles/ -f output.txt) # redirect the whole loop so tar runs only once
This will be completely asynchronous (so if extraction or even the transmission of the archive is slow, the first files can already be processed while waiting for more data to arrive).
See also my other answer https://stackoverflow.com/a/8341221/85371 (look for the older answer part, since that question was changed to be very specific later)
As Fredrik wrote here you can use csplit to split your merged file.

Resources