Multiple Big file sort - shell

I have two files, each ordered line by line by timestamp but with a different structure. I want to merge these files into one single file ordered by timestamp. They look like:
file A (less than 2G):
1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858
file B (less than 15G):
1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858
How can I accomplish this? Is there any way to make it as fast as possible?

$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
I originally posted the above as just a comment under @VM17's answer but they suggested I make it a new answer.
The above is more robust and efficient since it uses the default separator for sort+cut (tab), it truly sorts only on the first key (VM17's would use the whole line despite the -k1 since sort's field separator, tab, isn't present in the line), it uses a stable sort algorithm (sort -s) to preserve input order, and it uses cut to strip off the added key field, which is more efficient than invoking awk again since awk does field splitting etc. on each record and that isn't needed just to remove the leading field.
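Also, since both inputs are already sorted by timestamp, you could let sort merge the two keyed streams instead of re-sorting everything. A sketch (the process substitution assumes bash, and sort -m expects each input already ordered on the key, which the timestamp ordering gives us):

sort -m -s -n -k1,1 \
    <(awk -F, -v OFS='\t' '{print $NF, $0}' fileA) \
    <(awk -F, -v OFS='\t' '{print $NF, $0}' fileB) |
cut -f2-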
Alternatively you might find something like this more efficient:
$ cat tst.awk
{ currRec = $0; currKey = $NF }
NR>1 {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        if ($NF < currKey) {
            print
        }
        else {
            saved = $0 ORS
            break
        }
    }
}
{ prevRec = currRec; prevKey = currKey }
END {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        print
    }
}
$ awk -F, -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the 2 files and doesn't require a subsequent pipe to sort and cut.
Just check the logic as I didn't think about it very much, and be aware that this is a rare situation where getline might be appropriate for efficiency, but make sure to read http://awk.freeshell.org/AllAboutGetline to understand all its caveats if you're ever considering using it again.

Try this-
awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'
Output-
1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
This concatenates the two files and puts the timestamp at the start of each line. It then sorts by that timestamp and removes the dummy column.
This will be slow for big files, though.
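If you do end up sorting files of this size, GNU sort's external-sort options can help. A hedged sketch reusing the decorate/sort/cut pipeline from above (-S, -T and --parallel are GNU coreutils flags; the temp directory path is just a placeholder):

awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB |
sort -s -n -k1,1 -S 4G -T /path/to/fast/tmp --parallel=4 |
cut -f2-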

Related

awk: comparing two files containing numbers

I'm using this command to compare two files and print out lines in which $1 is different:
awk -F, 'NR==FNR {exclude[$1];next} !($1 in exclude)' old.list new.list > changes.list
The files I'm working with have been sorted numerically with sort -n.
old.list:
30606,10,57561
30607,100,26540
30611,300,35,5.068
30612,100,211,0.035
30613,200,5479,0.005
30616,100,2,15.118
30618,0,1257,0.009
30620,14,8729,0.021
new.list:
30606,10,57561
30607,100,26540
30611,300,35,5.068
30612,100,211,0.035
30613,200,5479,0.005
30615,50,874,00.2
30616,100,2,15.118
30618,0,1257,0.009
30620,14,8729,0.021
30690,10,87,0.021
30800,20,97,1.021
Result
30615,50,874,00.2
30690,10,87,0.021
30800,20,97,1.021
I'm looking for a way to tweak my command and make awk print lines only if $1 from new.list is not only unique but also > $1 from the last line of the old.list
Expected result:
30690,10,87,0.021
30800,20,97,1.021
because 30690 and 30800 ($1) > 30620 ($1 from the last line of old.list)
In this case, 30615,50,874,00.2 would not be printed because 30615 is admittedly unique to new.list, but it's also < 30620 ($1 from the last line of old.list).
awk -F, '{if ($1 #from new.list > $1 #from_the_last_line_of_old.list) print }'
something like that, but I'm not sure it can be done this way?
Thank you
You can use the awk you have, then pipe through sort to sort numerically from high to low, then pipe to head to get the first:
awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' old new | sort -nr | head -n1
30690,10,87,0.021
Or, use a second pass to find the max in awk and an END block to print:
awk -F, 'FNR==NR{seen[$1]; next}
(!($1 in seen)) {uniq[$1]=$0; max= $1>max ? $1 : max}
END {print uniq[max]}' old new
30690,10,87,0.021
After a cup of coffee and reading your edit, just do this:
awk -F, 'FNR==NR{ref=$1; next} $1>ref' old new
30690,10,87,0.021
30800,20,97,1.021
Since you are only interested in values greater than the last line of old, there is no need to even look at the other lines of that file;
just read through the first file and grab the last $1 (since it is already sorted) and then compare it to $1 in the new file. If old is not sorted, or you just want to save that step, you can do:
FNR==NR{ref=$1>ref ? $1 : ref; next}
If you need to uniquify the values in new, you can do that as part of the sort step you are already doing:
sort -t, -k 1,1 -n -u new
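Putting the two pieces together, a sketch (the process substitution assumes bash); with the sample files above this should print just the 30690 and 30800 lines:

awk -F, 'FNR==NR{ref=$1; next} $1>ref' old <(sort -t, -k1,1 -n -u new)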
Single-pass awk solution:
mawk 'BEGIN { ___ = log(!(_^= FS = ",")) # set def. value to -inf
} NR==FNR ? __[___=$_] : ($_ in __)<(+___<+$_)' old.txt new.txt
30690,10,87,0.021
30800,20,97,1.021
Since both files are sorted, this command should be more efficient than the other solutions here:
awk -F, 'NR==FNR{x=$1}; $1>x{x=$1; print}' <(tail -n1 old) new
It reads only one line from old
It prints only lines where new.$1 > old[last].$1
It prints only lines with unique $1

Split a large gz file into smaller ones filtering and distributing content

I have a gzip file of size 81G; the uncompressed file is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging between 1 and 10. I want to split the file into 10 subfiles, where all rows whose first column is 1 go into one file, all rows whose first column is 2 go into a second file, and so on. While doing that I don't want to put column 3 and column 5 in the new subfiles. The file is tab separated. For example:
col_1   col_2   col_3   col_4   col_5   col_6
1       7464    sam     NY      0.738   28.9
1       81932   Dave    NW      0.163   91.9
2       162     Peter   SD      0.7293  673.1
3       7193    Ooni    GH      0.746   6391
3       6139    Jess    GHD     0.8364  81937
3       7291    Yeldish HD      0.173   1973
The file above will result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also I don't know how to remove col_3 and col_5 from the new subfiles.
Like I said, the gzip file is 81G, so I am looking for an efficient solution. Insights will be appreciated.
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 ".csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk adds a number at the front so the header line sorts before the rest during the sort phase, and adds the line number so that lines with the same original first-field value keep their original input order. We then sort (header flag first, then the original first field, then the original line number), cut away the 2 added fields, and use awk to robustly and portably create the separate output files, ensuring that each output file starts with a copy of the header. Closing each output file as we go means the script works for any number of output files with any awk, and stays efficient even when there are many output files. Each output file name is also single-quoted inside the gzip command so the shell doesn't perform word splitting or filename expansion on it.
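Either way, a quick sanity check of the results might look something like this (a sketch, assuming the output files are named 1.csv.gz, 2.csv.gz, ... as above):

for f in [0-9]*.csv.gz; do
    printf '== %s ==\n' "$f"
    zcat "$f" | head -n 3
done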

Combining multiple lines into a single based on column values [duplicate]

This question already has answers here:
Merge values for same key
(4 answers)
Closed 4 years ago.
I have a file with the records below.
$File.txt
APPLE,A,10
APPLE,A,20
APPLE,A,30
GRAPE,B,12
GRAPE,B,13
I want the output to be as given below:
APPLE,A,10|20|30,
GRAPE,B,12|13,
I have tried the method below and got the required output, but I'm looking for something simpler.
awk -F"," '{if(NR<2){if(!seen[$1]++){printf "%-8s|",$3}}else{if(seen[$1]++){printf "%-12s|",$3}else{ printf ",\n%-12s|",$3}}}' File1.txt | awk -F"|" '{for(i=1;i<NF-1;i++){ printf "%-12s|",$i}printf "%-12s,\n", $(NF-1)}'|sed 's/ //g' > O1.txt
awk -F"," '{print $1","$2","}' File1.txt | uniq > O2.txt
paste -d'\0' O2.txt O1.txt
something like this?
$ awk -F, '{k=$1 FS $2; a[k]=((k in a)?a[k]"|":k FS)$3}
END {for(k in a) print a[k] FS}' file
APPLE,A,10|20|30,
GRAPE,B,12|13,
To remove the trailing comma, remove the FS in the print statement. If your file is already sorted, this can be simplified further.
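For example, if the keys are grouped as in the sample (all APPLE rows together, then all GRAPE rows), a sketch of that simplification that needs no array at all:

$ awk -F, '{ k = $1 FS $2 }
      k != prev { if (out != "") print out FS; out = k FS $3; prev = k; next }
      { out = out "|" $3 }
      END { if (out != "") print out FS }' file
APPLE,A,10|20|30,
GRAPE,B,12|13,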
You need something like the following with just standalone awk:
awk -F, 'BEGIN { OFS = FS }{ key = $1","$2 }{ unique[key] = unique[key]?(unique[key]"|"$3):($3) }
END { for (i in unique) print i, unique[i] }' file
If you think you need the extra , at the end, just append "," in the print statement in the END clause.
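That is, the only change would be in the END block, something like:

END { for (i in unique) print i, unique[i] "," }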

awk or shell command to count occurrence of value in 1st column based on values in 4th column

I have a large file with records like below:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the number of people (names in col 1) that have both apple and oranges. The command should use as little memory as possible and should be fast. Any help appreciated!
Output :
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use in:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file); they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
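For example, a minimal wrapper along those lines (the script name fruit-both.sh is just illustrative):

#!/bin/sh
# Print names that have both an apple and an oranges entry.
# Reads the file names given as arguments, or standard input if none.
awk -F, '
    $4 == "apple"   { if (apple[$1]++ == 0 && $1 in orange) print $1 }
    $4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }
' "$@"

Then ./fruit-both.sh data and cat data | ./fruit-both.sh behave the same.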
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice
In first pass, add key to an array if last field is apple (-F, would set , as input field separator)
In second pass, check if last field is oranges and if first field is a key of array a
To print only the number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk for details on two file processing and awk idioms
I did a workaround and used only the grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > "names.having.both.apple&orange"
comm -12 shows only the common names between the 2 files.
Solution from Jonathan also worked.
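The same idea without the intermediate files, if your shell is bash (a sketch using process substitution):

comm -12 <(grep "apple" file | cut -d"," -f1 | sort) <(grep "orange" file | cut -d"," -f1 | sort)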
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

AWK split file by separator and count

I have a large 220 MB file. The file is grouped by horizontal rows of "---". This is what I have so far:
cat test.list | awk -v ORS="" -v RS="-------------------------------------------------------------------------------" '{print $0;}'
How do I take this and print to a new file every 1000 matches?
Is there another way to do this? I looked at split and csplit, but the "----" rows do not occur predictably, so I have to match them and then split on a count of the matches.
I would like the output files to contain groups of 1000 matches per file.
To output the first 1000 records to outputfile0, the next to outputfile1, etc., just do:
awk 'NR%1000 == 1{ file = "outputfile" i++ } { print > file }' ORS= RS=------ test.list
(Note that I truncated the dashes in RS for simplicity.)
Unfortunately, using a value of RS that is more than a single character produces unspecified results, so the above cannot be the solution. Perhaps something like twalberg's solution is required:
awk '/^----$/ { if(!(c%1000)) count+=1; c+=1; next }
{print > ("outputfile"count)}' c=1 count=1
Not tested, but something along these lines might work:
awk 'BEGIN { fileno=1; matchcount=0 }
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount=0; } }
{ print $0 > ("output_file_" fileno) }' < test.list
It might be cleaner to put all that in, say, split.awk and use awk -f split.awk test.list instead...
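For instance, a sketch of such a split.awk (same logic as above, untested against real data):

# split.awk: bump the output file number after every 1000 separator rows
BEGIN { fileno = 1 }
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount = 0 } }
{ print > ("output_file_" fileno) }

Run it as: awk -f split.awk test.list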
