I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that the rate of demuxing larger files (which correlates with more unique barcodes used) is significantly slower (10-20X). Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.
Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?
I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
Is there a better strategy?
I'm not married to awk, so approaches in other scripting or compiled languages are welcome.
Specific Example
Data Generation (FASTQ with barcodes)
The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18 character word using the non-ambiguous DNA alphabet.
1024 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq
16384 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
awk script for demultiplexing
Note that in this case I'm writing to 2 files for each unique barcode.
demux.awk
#!/usr/bin/awk -f
BEGIN {
if (length(outdir) == 0 || length(prefix) == 0) {
print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
exit 1;
}
print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
if (NR%4 == 1) {
match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
print bx[2] >> outdir"/"prefix"."bx[1]".umi";
}
print >> outdir"/"prefix"."bx[1]".fastq";
if (NR%40000 == 0) {
printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
}
}
END {
printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}
Usage
awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
or similarly for the cells.16K.fastq.
Assuming you're the only one running awk, you can verify the approximate number of open files using
lsof | grep "awk" | wc -l
Observed Behavior
Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.
Without seeing any sample input/output or the script you're currently executing it's very much guesswork but if you currently have the barcode in field 1 and are doing (assuming GNU awk so you don't have your own code managing the open files):
awk '{print > $1}' file
then if managing open files really is your problem you'll get a significant improvement if you change it to:
sort file | awk '$1!=f{close(f);f=$1} {print > f}'
The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., etc. since you haven't shown us any of that yet.
If that's not all you need then edit your question to include the missing MCVE.
Given your updated question with your script and the info that the input is 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record and using NUL to separate the 4-line blocks then using NUL as the record separator for sort and the subsequent awk:
$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"
mkdir -p "$outdir" || exit 1
awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
NR%4 == 1 { print $2 OFS $3 OFS }
{ print $0 (NR%4 ? RS : "\0") }
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
BEGIN {
if ( (outdir == "") || (prefix == "") ) {
print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
exit 1
}
print "[INFO] Initiating demuxing..." | "cat>&2"
outBase = outdir "/" prefix "."
}
{
bx1 = $1
bx2 = $2
fastq = $3
if ( bx1 != prevBx1 ) {
close(umiOut)
close(fastqOut)
umiOut = outBase bx1 ".umi"
fastqOut = outBase bx1 ".fastq"
prevBx1 = bx1
}
print bx2 > umiOut
print fastq > fastqOut
if (NR%10000 == 0) {
printf "[INFO] %d reads processed\n", NR | "cat>&2"
}
}
END {
printf "[INFO] %d total reads processed\n", NR | "cat>&2"
}
'
When run against input files generated as you describe in your question:
$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq
the results are:
$ time ./tst.sh cells.1K.fastq 2>/dev/null
real 0m55.333s
user 0m56.750s
sys 0m1.277s
$ ls cells.1K.fastq_out | wc -l
2048
$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total
$ time ./tst.sh cells.16K.fastq 2>/dev/null
real 1m6.815s
user 0m59.058s
sys 0m5.833s
$ ls cells.16K.fastq_out | wc -l
32768
$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total
Related
Using Bash, I'm wanting to get a list of email addresses from a CSV file to do a recursive grep search on it for a bunch of directories looking for a match in specific metadata XML files, and then also tallying up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
address#somewhere.com,21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
something#that.com,17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1
For example, if we put address#somewhere.com in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost how to do the looping and also the writing of the result back to the CSV file...
So for instance, with another#here.com if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
#!/bin/bash
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<identifier>SomeTitleNameGoesHere</identifier>
<mediatype>audio</mediatype>
<collection>opensource_movies</collection>
<description>example <br /></description>
<subject>testing</subject>
<title>Some Title Name Goes Here</title>
<uploader>another#here.com</uploader>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator]email#address.com[/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
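As a minimal sketch of those two points (the implicit per-line loop and assignment to a field), the helper below overwrites the tally of one address. The name set_tally and its arguments are made up for illustration, and it assumes no field contains a comma inside quotes.

```shell
# Sketch only: set_tally is a hypothetical helper, not part of the script below.
# usage: set_tally CSV_FILE EMAIL NEW_TALLY
set_tally() {
    awk -F, -v OFS=, -v email="$2" -v n="$3" '
        # awk runs this rule once per input line: that is the implicit loop.
        # Assigning to $6 rebuilds the line with OFS between the fields.
        NR > 1 && $1 == email { $6 = n }
        # The bare pattern 1 is always true, so every line gets printed.
        1
    ' "$1"
}
```

Called as set_tally accounts.csv another#here.com 17, it prints the whole file with only that row's last field changed to 17; redirect to a temporary file and move it over accounts.csv to write the result back.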
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing a few minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
The following script sets each tally to the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
cmd | getline c;
close(cmd);
$6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w options to your grep command.
-F searches literal strings; without it, searching for a.b#c would give false positives for things like axb#c and a-b#c.
-w matches only whole words; without it, searching for b#c would give a false positive for ab#c. This isn't 100% safe, as a-b#c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
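To try the effect of these options without a directory tree, here is a small sketch that feeds a few literal lines to grep and counts the matching ones. The helper name count_matches is hypothetical; the sample strings are the ones used in the explanation above.

```shell
# Sketch only: count_matches is a hypothetical helper for trying out grep options.
# usage: count_matches "GREP_OPTIONS" PATTERN LINE...
count_matches() {
    opts=$1 pattern=$2
    shift 2
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, hence the "|| true".
    # $opts is left unquoted on purpose so an empty value expands to nothing.
    printf '%s\n' "$@" | grep -c $opts -e "$pattern" || true
}
```

With these inputs, count_matches "" 'a.b#c' 'axb#c' 'a-b#c' 'a.b#c' prints 3 (the dot matches any character), while the same call with -F prints 1. count_matches -Fw 'b#c' 'ab#c' prints 0, but count_matches -Fw 'b#c' 'a-b#c' prints 1, which is the remaining false positive mentioned above.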
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
NR == FNR {
# store the filenames for each email
if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
next
}
FNR > 1 {$6 = length(tally[$1])}
1
' - accounts.csv
Here is a solution using a single awk command to achieve this. This solution will perform much better than the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. It also does not invoke any external command or spawn a subshell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
tmp = s
gsub(/[&+.]/, "\\\\&", tmp)
return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
# save escaped email address in array em skipping header row
if (FNR > 1)
em[esc($1)] = 0
# save each row in rec array
rec[++n] = $0
next
}
# this block will execute for each XML file
{
# loop each email and save count of matched email in array em
# PS: gsub returns the number of substitutions
for (i in em)
em[i] += gsub(i, "&")
}
END {
# print header row
print rec[1]
# from 2nd row onwards split row into columns using comma
for (i=2; i<=n; ++i) {
split(rec[i], a, FS)
# 6th column is the count of occurrence from array em
print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
}
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and replaces the data in accounts.new.csv for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=${file_old/csv/new.csv}
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
# Skip first line
if [[ $x -gt 1 ]]; then
# Read data into variables
IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
# Reset tally
tally=$cnt
# Change line number $x in new file
sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
-i ${file_new}
fi
((x++))
done < ${file_old}
The input and output:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
2 address#somewhere.com
1 something#that.com
$ cat accounts.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,-1,blah
something#that.com,bar2,foo3,-1,blah
another#here.com,bar4,foo5,-1,blah
# output
$ ./test.sh
$ cat accounts.new.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,2,blah
something#that.com,bar2,foo3,1,blah
another#here.com,bar4,foo5,0,blah
The following silly hard-coding of what ought to be some kind of loop or parallel construct, works nominally, but it is poor mawk syntax. My good mawk syntax attempts have all failed, using for loops in mawk (not shown) and gnu parallel (not shown).
It really needs to read the CSV file from disk just 1 time, not one time per column, because I have a really big CSV file (millions of rows, thousands of columns). My original code worked fine-ish (not shown) but it read the whole disk file again for every column and it was taking hours and I killed it after realizing what was happening. I have a fast solid state disk using a GPU connector slot so disk reads are blazing fast on this device. Thus CPU is the bottleneck here. Code silliness is even more of a bottleneck if I have to hard-code 4000 lines of basically the same statements except for column number.
The code is column-wise making counts of non-numeric values. I need some looping (for-loop) or parallel (preferred) because while the following works correctly on 2 columns, it is not a scalable way to write mawk code for thousands of columns.
tail -n +1 pht.csv | awk -F"," '(($1+0 != $1) && ($1!="")){cnt1++}; (($2+0 != $2) && ($2!="")){cnt2++} END{print cnt1+0; print cnt2+0}'
2
1
How can the "column 1 processing; column 2 processing;" duplicate code be reduced? How can looping be introduced? How can gnu parallel be introduced? Thanks much. New to awk, I am. Not new to other languages.
I keep expecting some clever combo of one or more of the following bash commands is going to solve this handily, but here I am many hours later with nothing to show. I come with open hands. Alms for the code-poor?
seq 1 2 ( >>2 for real life CSV file)
tail (to skip the header or not as needed)
mawk (nice-ish row-wise CSV file processing, with that handy syntax I showed you in my demo for finding non-numerics easily in a supposedly all-numeric CSV datafile of jumbo dimensions)
tr (removes newline which is handy for transpose-ish operations)
cut (to grab a column at a time)
parallel (fast is good, and I have mucho cores needing something to work on, and phat RAM)
Sorry, I am absolutely required to not use CSV specific libraries like python pandas or R dataframes. My hands are tied here. Sorry. Thank you for being so cool about it. I can only use bash command lines in this case.
My mawk can handle 32000+ columns so NF is not a problem here, unlike some other awk I've seen. I have less than 32000 columns (but not by that much).
Datafile pht.csv contains the following 3x2 dataset:
cat pht.csv
8,T1,
T13,3,T1
T13,,-6.350818276405334473e-01
I don't have access to mawk, but you can do something equivalent to this:
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) a[i]++}
END {for(i=1;i in a;i++) print a[i]}' file
It shouldn't take more than a few minutes even for a million records.
A regex test will not recognize exponential notation, so for that you need to go back to the $1+0!=$1 test mentioned in the comments. Note that you don't have to check for the null string separately.
None of the solutions so far parallelize. Let's change that.
Assume you have a solution that works in serial and can read from a pipe:
doit() {
# This solution gives 5-10 MB/s depending on system
# Changed so it now also treats '' as zero
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "@s\n" }';
}
export -f doit
doit() {
# Somewhat faster - but regards empty fields as zero
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit
To parallelize this we need to split the big file into chunks and pass each chunk to the serial solution:
# This will spawn a process for each core
parallel --pipe-part -a pht.csv --block -1 doit > blocksums
(You need version 20161222 or later to use '--block -1').
To deal with the header we compute result of the header, but we negate the result:
head -n1 pht.csv | doit | perl -pe 's/(^|,)/$1-/g' > headersum
Now we can simply sum up the headersum and the blocksums:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = ",";print "@s\n" }'
Or if you prefer the output line by line:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = "\n";print "@s\n" }'
Is this what you're trying to do?
$ awk -v RS='[\n,]' '($1+0) != $1' file | sort | uniq -c
1 T1
2 T13
The above uses GNU awk for multi-char RS and should run in seconds for an input file like you describe. If you don't have GNU awk you could do:
$ tr ',' $'\n' < file | awk '($1+0) != $1' | sort | uniq -c
1 T1
2 T13
I'm avoiding the approach of using , as a FS since then you'd have to use $i in a loop which would cause awk to do field splitting for every input line which adds on time but you could try it:
$ awk -F, '{for (i=1;i<=NF;i++) if (($i+0) != $i) print $i}' file | sort | uniq -c
1 T1
2 T13
You could do the unique counting all in awk with an array indexed by the non-numeric values but then you potentially have to store a lot of data in memory (unlike with sort which uses temp swap files as necessary) so YMMV with that approach.
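A minimal sketch of that all-in-awk variant follows. The function name count_nonnumeric is made up, it skips empty fields explicitly (an assumption about what you want counted), and it keeps one array entry per distinct non-numeric value, so the memory caveat above applies.

```shell
# Sketch only: count_nonnumeric is a hypothetical helper.
# usage: count_nonnumeric CSV_FILE
count_nonnumeric() {
    awk -F, '
        {
            for (i = 1; i <= NF; i++)
                # Skip empty fields; count each distinct non-numeric value.
                if ($i != "" && ($i + 0) != $i)
                    seen[$i]++
        }
        END {
            # Traversal order of "for (v in seen)" is unspecified,
            # so the output is sorted afterwards.
            for (v in seen)
                print seen[v], v
        }
    ' "$1" | sort -k2
}
```

The trailing sort only orders the already-aggregated lines, so it handles one line per distinct value rather than one per field.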
I solved it independently. What finally did it for me was the dynamic variable creation examples at the following URL. http://cfajohnson.com/shell/cus-faq-2.html#Q24
Here is the solution I developed. Note: I have added another column with some missing data for a more complete unit test. Mine is not necessarily the best solution, which is TBD. All I know at the moment is that it works correctly on the small CSV shown. The best solution will also need to run really fast on a 40 GB CSV file (not shown haha).
$ cat pht.csv
8,T1,
T13,3,T1
T13,,0
$ tail -n +1 pht.csv | awk -F"," '{ for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }'
2
1
1
ps. Honestly I am not satisfied with my own answer. They say that premature optimization is the root of all evil. Well that maxim does not apply here. I really, really want gnu parallel in there, instead of the for-loop if possible, because I have a need for speed.
Final note: Below I am sharing performance timings of sequential and parallel versions, and best available unit test dataset. Special thank you to Ole Tange for big help developing code to use his nice gnu parallel command in this application.
Unit test datafile, final version:
$ cat pht2.csv
COLA99,COLB,COLC,COLD
8,T1,,T1
T13,3,T1,0.03
T13,,-6.350818276405334473e-01,-0.036
Timing on big data (not shown) for sequential version of column-wise non-numeric counts:
ga#ga-HP-Z820:/mnt/fastssd$ time tail -n +2 train_all.csv | awk -F"," '{ for(i=1; i<=NF; i++){ cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }' > /dev/null
real 35m37.121s
Timing on big data for parallel version of column-wise non-numeric counts:
# Correctness - 2 1 1 1 is the correct output.
#
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m14.253s
doit1() {
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "@s\n" }';
}
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m59.960s
doit2() {
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit1
parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums
if [ $csvheader -eq 1 ]
then
head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
cat headersum blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
else
cat blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
fi
NEW: Here is the ROW-wise (not column-wise) counts in sequential code:
tail -n +2 train_all.csv | awk -F"," '{ cnt=0; for(i=1; i<=NF; i++){ cnt+=(($i+0)!=$i) && ($i!="") } print cnt; }' > train_all_cnt_nonnumerics_rowwwise.out.txt
Context: Project is machine learning. This is part of a data exploration. ~25x parallel speedup seen on Dual Xeon 32 virtual / 16 physical core shared memory host using Samsung 950 Pro SSD storage: (32x60) seconds sequential time, 74 sec parallel time. AWESOME!
Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find lines in common with your 2 files, using awk :
awk 'a[$0]++' file1 file2
Will output 3 10 5
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
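To make the post-increment behaviour concrete, here is a small sketch with two one-line helpers (the names repeats and first_seen are invented for this example). The only difference between them is the negation of the old value returned by a[$0]++.

```shell
# Sketch only: both helper names are hypothetical.
# usage: repeats FILE...     -> prints a line each time it was already seen before
repeats() { awk 'a[$0]++' "$@"; }

# usage: first_seen FILE...  -> prints each distinct line once, at its first occurrence
first_seen() { awk '!a[$0]++' "$@"; }
```

On the file1 and file2 from the question, repeats file1 file2 prints 3, 10 and 5, and first_seen file1 file2 prints 9 lines, one per key that ends up in the array.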
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too, but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The options -1 and -2 suppress the lines unique to the first file and to the second file, respectively.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc#slave ~]$ sort a > sorted_1
[gc#slave ~]$ sort b > sorted_2
[gc#slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but it will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1 : new version ensures intra-file duplicates are excluded from count, so only cross-file duplicates would show up in the final stats :
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 over-lapping lines in this example. It avoids spending any time performing per row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482
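For readers who find the underscore-only variable names hard to follow, here is a plainer sketch of the same frequency-of-frequencies idea. The function name freq_of_freq is made up, and unlike the UPDATE 1 version it does not exclude duplicates within a single file.

```shell
# Sketch only: freq_of_freq is a hypothetical helper.
# usage: freq_of_freq FILE...
freq_of_freq() {
    awk '
        { count[$0]++ }                 # occurrences of each distinct line
        END {
            for (line in count)
                freq[count[line]]++     # how many lines occurred that often
            for (n in freq)
                print n, freq[n]
        }
    ' "$@" | sort -rn
}
```

On the file1 and file2 from the question this prints "2 3" and then "1 6": three lines occur twice and six occur once.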
I have two columns in a file, and I want to automate summing both values per row
for example
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the column headers per row, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
echo $line_num
line_num=$[$line_num+1]
arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of the modulus operator:
awk '!(NR%2){print $1+$2}' infile
awk is probably faster, but the idiomatic bash way to do this is something like:
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
The <( ) process substitution is used in case variables set in the while loop are needed outside the loop. Otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input like so:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
Why not run:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
You can use Perl or Python instead of awk if you prefer.
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:
awk '
BEGIN{ max = 0 }
{
if( NR%2 == 0 ){
sum = $1 + $2;
if( sum > max ) { max = sum }
}
}
END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN{ max = 0 }
{
sum = $1 + $2;
if( sum > max ) { max = sum }
}
END{ print max }'
I want to randomly 80/20 split a file using awk.
I have read and tried the option found HERE, in which something like the following is proposed:
$ awk -v N=`cat FILE | wc -l` 'rand()<3000/N' FILE
works great if you want a random selection.
However, is it possible to alter this awk in order to split the one file into two files of 80/20 (or any other) proportion?
With gawk, you'd write
gawk '
BEGIN {srand()}
{f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}
' file
Example:
seq 100 > 100.txt
gawk 'BEGIN {srand()} {f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}' 100.txt
wc -l 100.txt*
100 100.txt
23 100.txt.20
77 100.txt.80
200 total
To ensure 20 lines in the "20" file:
$ paste -d $'\034' <(seq $(wc -l < "$file") | sort -R) "$file" \
| awk -F $'\034' -v file="$file" '{
f = file ($1 <= 20 ? ".20" : ".80")
print $2 > f
}'
$ wc -l "$file"*
100 testfile
20 testfile.20
80 testfile.80
200 total
\034 is the ASCII FS character, unlikely to appear in a text file.
sort -R to shuffle the input may not be portable. It's in GNU and BSD sort though.
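If sort -R is not available, one alternative is selection sampling inside awk. This is a different technique from the paste/sort pipeline above, and the function name split_exact is hypothetical. Each line is picked with probability (lines still needed) / (lines still unread), which yields exactly K picked lines as long as K does not exceed the line count.

```shell
# Sketch only: split_exact is a hypothetical helper.
# usage: split_exact FILE K   -> writes FILE.20 (exactly K lines) and FILE.80 (the rest)
split_exact() {
    awk -v n="$(wc -l < "$1")" -v k="$2" -v file="$1" '
        BEGIN { srand() }
        {
            # n - NR + 1 lines are still unread (including this one),
            # k - picked lines are still needed.
            if (rand() * (n - NR + 1) < k - picked) {
                picked++
                print > (file ".20")
            } else {
                print > (file ".80")
            }
        }
    ' "$1"
}
```

For a 100-line file, split_exact testfile 20 leaves 20 lines in testfile.20 and 80 in testfile.80, and both keep the original line order. srand() seeds from the clock, so two runs within the same second may give the same split.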