Split a large gz file into smaller ones filtering and distributing content - bash

I have a gzip file of size 81G; the uncompressed file is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging between 1 and 10. I want to split the file into 10 subfiles, where all rows whose first-column value is 1 go into one file, all rows whose first-column value is 2 go into a second file, and so on. While doing that I don't want to put column 3 and column 5 in the new subfiles. The file is tab-separated. For example:
col_1  col_2  col_3    col_4  col_5   col_6
1      7464   sam      NY     0.738   28.9
1      81932  Dave     NW     0.163   91.9
2      162    Peter    SD     0.7293  673.1
3      7193   Ooni     GH     0.746   6391
3      6139   Jess     GHD    0.8364  81937
3      7291   Yeldish  HD     0.173   1973
The file above should result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also I don't know how to remove col_3 and col_5 from the new subfiles.
Like I said, the gzip file is 81G, so I am looking for an efficient solution. Insights will be appreciated.

You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 "csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'

Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
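For completeness, dropped into the OP's batch-script skeleton this might look like the sketch below; the small memory request is an assumption on my part, since the pipeline streams line by line and does not need anything close to 500G:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 4G        # streaming pipeline; large memory is not needed (assumed value)
#SBATCH --time 12:00:00
#SBATCH -c 1

# keep fields 1, 2, 4 and 6 onwards, then write each row to a gzipped file named after field 1
zcat file.csv.gz |
  cut -f1,2,4,6- |
  awk '{ print | ("gzip -c > " $1 ".csv.gz") }'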

Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk adds a number at the front so the header line sorts before the rest during the sort phase, and adds the line number so that lines with the same original first-field value keep their original input order. Then we sort (header flag first, then the original first field, then line number), cut away the two fields added in the first step, and use awk to robustly and portably create the separate output files, ensuring that each output file starts with a copy of the header. We close each output file as we go, so the script will work for any number of output files with any awk, and will work efficiently even for a large number of output files with GNU awk. It also ensures that each output file name is quoted in the gzip command so the shell doesn't perform word splitting or filename expansion (globbing) on it.
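To see what the decorate/sort/undecorate stages are doing, it can help to run just the first awk on a small sample and look at the intermediate stream (a sketch; sample.gz is a hypothetical small test file):
gunzip -c sample.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
head -4
# 0  1  col_1  col_2  col_3  col_4  col_5  col_6   <- header gets key 0, so it sorts first
# 1  2  1      7464   sam    NY     0.738  28.9
# 1  3  1      81932  Dave   NW     0.163  91.9
# 1  4  2      162    Peter  SD     0.7293 673.1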

Related

Cut and sort delimited dates from stdout via pipe

I am trying to split some strings from stdout to get the dates from it, but I have two cases
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
Which should produce: 20201007T033103Z which is the nearest date to now (newest)
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
Should get the second date (after .to.) not the first one, and print only the newest date: 20200908T033103Z
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and not covering the first, also I am not sure about the date sorting logic.
Here is a sample data
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz
To get the most recent date from your sample data you can use this awk:
awk '{
sub(/^(.*\.to|[^.]+)\./, "")
gsub(/\..+$|[TZ]/, "")
}
$0 > max {
max = $0
}
END {
print max
}' file
20201004033103
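If you'd rather keep the timestamp in its original form (the digits-only output above drops the T and Z), a slight variation of the same idea works, since the ISO-style strings compare correctly as plain text (a sketch):
awk '{
  sub(/^(.*\.to|[^.]+)\./, "")   # strip everything up to the relevant timestamp
  sub(/\..*$/, "")               # strip everything after it
}
$0 > max { max = $0 }
END { print max }' file
20201004T033103Z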

Count lines of processed data while parsing it

I'm trying to find a way to count the number of lines that have been processed while parsing the incoming data from millions of small files.
Sample data as an example; tab is the separator:
CLIENT1.test.com /var DIR 21213412 user1 root default 2000-03-04 18:30:59.000000 PROC_MGMT
CLIENT1.test.com /usr DIR 212112 user1 root default 2006-02-11 08:30:00.000000 PROC_MGMT
CLIENT2.test.com /var/tmp/test.txt ACTIVE FILE 4000 sysuser sysuser NA 2001-04-11 03:00:09.000000 DEFAULT
CLIENT3.test.com /test.out PASSIVE FILE 4000 atuser atgroup group 2012-05-04 02:30:59.000000 AUTOMAT
CLIENT4.test.com /opt DIR 542016 dbuser dbgroup Default 2000-03-04 18:30:59.000000 SYSTEM
My code currently looks something like this:
PATTERN="mssg1|mssg2|mssg3|...|mssgN"
SERVER=my_server_name
find <path> -type f -name "*.txt" -print0 | \
xargs -0 awk -v PAT="$PATTERN" '$0!~PAT' | \
awk '{gsub(/\t/",") {print}}' | \
awk -v SRV="$SERVER" 'BEGIN {FS=OFS=","} {$1=SRV OFS $1;} {if ($4 !~ /DIR/) $4=","$4;} {print}' | \
awk 'BEGIN {FS=OFS=","} {if ($9 == "") $9="01/01/1970 00:00:00 AM"; else {gsub("[:-]"," ",$9); $9=strftime("%m/%d%/Y %r", maketime($9))};} {print}' > /tmp/outputFile.log
I can count the total number of lines of all the incoming files by running a for loop and wc -l (which I guess will be quite slow) and put it as yyyy number of lines.
What I'm looking for is to count the number of lines that I've already processed. So that I can show something like
echo "Processed xxxx lines out of yyyy lines"
where xxxx is divisible by 1000. For example:
Processed 1000 lines out of 1000000 lines.
Processed 2000 lines out of 1000000 lines.
Processed 3000 lines out of 1000000 lines.
.........
Processed 1000000 lines out of 1000000 lines.
Done.
Can I add a counter to the awk statements that I'm using?
My code is bash based running on RHEL 6.7.
The following awk program unifies your entire pipeline.
It is possible to give a running count of your records, but it is not possible to print the total number of lines unless you know beforehand how many lines there are. You do know how many files there are, so you can use that as a counter.
PATTERN="mssg1|mssg2|mssg3|...|mssgN"
SERVER=my_server_name
find <path> -type f -name "*.txt" -print0 | \
xargs -0 awk -v PAT="$PATTERN" -v SRV="$SERVER" -v OUT=/tmp/outputFile.log '
BEGIN {FS=OFS=","}
(FNR==1){f++}
# print progress
(NR%1000==0){ print "Processed "NR" lines and "f-1" files out of "ARGC-2 }
# skip line matching pattern
($0~PAT){next}
# substitute all tabs, prepend SRV and redefine fields
# after this point, we inserted a new field before everything
{ gsub(/\t/,","); $0=SRV OFS $0 }
# redefine $6 which automatically redefines fields
# after this line, $4 will be an empty field and $5 will be the old $4
($4 !~ /DIR/){ $4 = OFS $4 }
# process field 9
{ if ($9 == "") $9="01/01/1970 00:00:00 AM"
else { gsub("[-:]"," ",$9); $9=strftime("%m/%d%/Y %r", maketime($9))} }
# print to output file
{ print $0 > OUT }
END{ print "Total lines processed: "NR
print "Total files processed: "f }'
A general recommendation regarding dates: avoid anything that is not sortable. Your format "mm/dd/yyyy", when ASCII-sorted, is not sorted by date, while "yyyy-mm-dd" is. Also, AM and PM in times don't make a lot of sense.
https://xkcd.com/1179/
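For example, the same $9 handling could emit a sortable ISO-style timestamp instead of "%m/%d/%Y %r" (a sketch, assuming GNU awk for mktime/strftime, leaving the rest of the program unchanged):
# process field 9 into a sortable "yyyy-mm-dd HH:MM:SS" form
{ if ($9 == "") $9 = "1970-01-01 00:00:00"
  else { gsub("[-:]"," ",$9); $9 = strftime("%Y-%m-%d %H:%M:%S", mktime($9)) } }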
If you want to output the status in a status file, you do
xargs ... | awk ' ...
END{ print "Total lines processed: "NR > "status.txt"
print "Total files processed: "f > "status.txt" }'

awk or shell command to count occurence of value in 1st column based on values in 4th column

I have a large file with records like below :
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the number of people (names in col 1) that have both apple and oranges. The command should use as little memory as possible and should be fast. Any help appreciated!
Output :
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use in:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file); they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
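A minimal sketch of such a wrapper (the script name both-fruits.sh is hypothetical):
#!/bin/sh
# prints each name the first time it is seen with both fruits;
# reads the named files, or standard input if none are given
awk -F, '
  $4 == "apple"   { if (apple[$1]++  == 0 && $1 in orange) print $1 }
  $4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple)  print $1 }
' "$@"
Then either ./both-fruits.sh data or cat data | ./both-fruits.sh works.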
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice.
In the first pass, add the key to an array if the last field is apple (-F, sets , as the input field separator).
In the second pass, check if the last field is oranges and the first field is a key of array a.
To print only number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk for details on two file processing and awk idioms
I did a work around and used only grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > "names.having.both.apple&orange"
comm -12 shows only the common names between the 2 files.
Solution from Jonathan also worked.
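A variant of the same idea avoids the temporary files by using process substitution, and deduplicates with sort -u in case a name lists the same fruit more than once (a sketch, assuming bash):
comm -12 <(grep apple  file | cut -d, -f1 | sort -u) \
         <(grep orange file | cut -d, -f1 | sort -u)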
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

Multiple Big file sort

I have two files in which each line is ordered by timestamp, but they have different structures. I want to merge the two files into one single file ordered by timestamp. They look like:
file A(less than 2G)
1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858
file B(less than 15G)
1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858
How can I accomplish this? Is there any way to make it as fast as possible?
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.
The above is more robust and efficient since it uses the default separator for sort+cut (tab), truly sorts only on the first key (the other answer would sort on the whole line despite the -k1, since sort's field separator, tab, isn't present in the line), uses a stable sort algorithm (sort -s) to preserve input order, and uses cut to strip off the added key field, which is more efficient than invoking awk again since awk does field splitting etc. on each record, which isn't needed just to remove the leading field(s).
Alternatively you might find something like this more efficient:
$ cat tst.awk
{ currRec = $0; currKey = $NF }
NR>1 {
print prevRec
printf "%s", saved
while ( (getline < "fileB") > 0 ) {
if ($NF < currKey) {
print
}
else {
saved = $0 ORS
break
}
}
}
{ prevRec = currRec; prevKey = currKey }
END {
print prevRec
printf "%s", saved
while ( (getline < "fileB") > 0 ) {
print
}
}
$ awk -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the 2 files and doesn't require a subsequent pipe to sort and cut.
Just check the logic, as I didn't think about it very much, and be aware that this is a rare situation where getline might be appropriate for efficiency, but make sure to read http://awk.freeshell.org/AllAboutGetline to understand all its caveats if you're ever considering using it again.
Try this-
awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'
Output-
1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
This concatenates the two files and puts the timestamp at the start of each line. It then sorts by the timestamp and removes that dummy column.
This will be slow for big files, though.
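If the sort step is the bottleneck on large inputs, GNU sort can be given a larger in-memory buffer and a faster temporary directory; a sketch, using the tab-decorated form from the earlier answer (the -S size and -T path are placeholders to adjust for your machine):
awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB |
  sort -s -n -k1,1 -S 4G -T /fast/tmp |
  cut -f2-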

AWK: Compare two CSV files

I have two CSV files and I want to compare them using AWK and generate a new file.
file1.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc123","C:/pro/xyz"
"abc124","C:/pro/in"
file2.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc125","C:/pro/xyz"
"abc126","C:/pro/in"
output.csv:
"file1","file2","Diff"
"abc121","abc121","Match"
"abc122","abc122","Match"
"abc123","","Unmatch"
"abc124","","Unmatch"
"","abc125","Unmatch"
"","abc126","Unmatch"
One way with awk:
script.awk:
BEGIN {
FS = ","
}
NR>1 && NR==FNR {
a[$1] = $2
next
}
FNR>1 {
print ($1 in a) ? $1 FS $1 FS "Match" : "\"\"" FS $1 FS "Unmatch"
delete a[$1]
}
END {
for (x in a) {
print x FS "\"\"" FS "Unmatch"
}
}
Output:
$ awk -f script.awk file1.csv file2.csv
"abc121","abc121",Match
"abc122","abc122",Match
"","abc125",Unmatch
"","abc126",Unmatch
"abc124","",Unmatch
"abc123","",Unmatch
I didn't use awk alone, but if I understood the gist of what you're asking correctly, I think this long one-liner should do it...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 file1.csv file2.csv | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e '1d' -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
Description:
The join portion takes the two CSV files, joins them on the first column (default behavior of join) and outputs all four fields (-o 1.1 2.1 1.2 2.2), making sure to include rows that are unmatched for both files (-a 1 -a 2).
The awk portion takes that output and replaces combination of the 3rd and 4th columns to either "Match" or "Unmatch" based on if they do in fact match or not. I had to make an assumption on this behavior based on your example.
The sed portion deletes the "no","loc" header from the output (-e '1d') and replaces empty fields with open-close quote marks (-e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'). This last part might not be necessary for you.
EDIT:
As tripleee points out, the above fails if the two initial files are unsorted. Here's an updated command to fix that. It punts the header line and sorts each file before passing them to join...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 <( sed 1d file1.csv | sort ) <( sed 1d file2.csv | sort ) | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
