Print first few and last few lines of file through a pipe with "..." in the middle - bash

Problem Description
This is my file
1
2
3
4
5
6
7
8
9
10
I would like to send the cat output of this file through a pipe and receive this
% cat file | some_command
1
2
...
9
10
Attempted solutions
Here are some solutions I've tried, with their output
% cat temp | (head -n2 && echo '...' && tail -n2)
1
2
...
% cat temp | tee >(head -n3) >(tail -n3) >/dev/null
1
2
3
8
9
10
# I don't know how to get the ...
% cat temp | sed -e 1b -e '$!d'
1
10
% cat temp | awk 'NR==1;END{print}'
1
10
# Can only get 2 lines

An awk:
awk -v head=2 -v tail=2 'FNR==NR && FNR<=head
FNR==NR && cnt++==head {print "..."}
NR>FNR && FNR>(cnt-tail)' file file
Or if a single pass is important (and memory allows), you can use perl:
perl -0777 -lanE 'BEGIN{$head=2; $tail=2;}
END{say join("\n", @F[0..$head-1],("..."),@F[-$tail..-1]);}' file
Or, an awk that is one pass:
awk -v head=2 -v tail=2 'FNR<=head
{lines[FNR]=$0}
END{
print "..."
for (i=FNR-tail+1; i<=FNR; i++) print lines[i]
}' file
Or, nothing wrong with being a caveman direct like:
head -2 file; echo "..."; tail -2 file
Any of these prints:
1
2
...
9
10
In terms of efficiency, here are some stats.
For small files (i.e., less than 10 MB or so) all of these finish in under 1 second, and the 'caveman' approach takes 2 ms.
I then created a 1.1 GB file with seq 99999999 >file
Two-pass awk: 50 seconds
One-pass perl: 10 seconds
One-pass awk: 29 seconds
'Caveman': 2 ms
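For reference, numbers like these can be gathered with bash's time keyword; for example, timing the 'caveman' approach on the 1.1 GB file created above:
time { head -2 file; echo "..."; tail -2 file; }   # wall-clock time for the 'caveman' approach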

You may consider this awk solution:
awk -v top=2 -v bot=2 'FNR == NR {++n; next} FNR <= top || FNR > n-bot; FNR == top+1 {print "..."}' file{,}
1
2
...
9
10
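The file{,} at the end is just bash brace expansion that passes the same file name twice, so awk reads the file once to count the lines (the ++n pass) and once to print:
$ echo file{,}
file file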

Two single pass sed solutions:
sed '1,2b
3c\
...
N
$!D'
and
sed '1,2b
3c\
...
$!{h;d;}
H;g'
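Both are single-pass and stream-friendly; for example, feeding the first one from a pipe as in the question:
% cat temp | sed '1,2b
3c\
...
N
$!D'
1
2
...
9
10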

Assumptions:
as OP has stated, a solution must be able to work with a stream from a pipe
the total number of lines coming from the stream is unknown
if the total number of lines is less than the sum of the head/tail offsets then we'll print duplicate lines (we can add more logic if OP updates the question with more details on how to address this situation)
A single-pass awk solution that implements a queue in awk to keep track of the most recent N lines; the queue allows us to limit awk's memory usage to just N lines (as opposed to loading the entire input stream into memory, which could be problematic when processing a large volume of lines/data on a machine with limited available memory):
h=2 t=3
cat temp | awk -v head=${h} -v tail=${t} '
{ if (NR <= head) print $0
lines[NR % tail] = $0
}
END { print "..."
if (NR < tail) i=0
else i=NR
do { i=(i+1)%tail
print lines[i]
} while (i != (NR % tail) )
}'
This generates:
1
2
...
8
9
10
Demonstrating the overlap issue:
$ cat temp4
1
2
3
4
With h=3;t=3 the proposed awk code generates:
$ cat temp4 | awk -v head=${h} -v tail=${t} '...'
1
2
3
...
2
3
4
Whether or not this is the 'correct' output will depend on OP's requirements.
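If that overlap turns out to be unwanted, here is a sketch of the extra logic mentioned above (an assumption about the desired behaviour, since OP hasn't specified it): when the stream has no more than head+tail lines, just print whatever follows the head and skip the "...":
h=3 t=3
cat temp4 | awk -v head=${h} -v tail=${t} '
{ if (NR <= head) print $0            # print the leading lines as they arrive
  lines[NR % tail] = $0               # rolling buffer of the last "tail" lines
}
END {
  if (NR <= head + tail) {            # short stream: nothing was skipped
      for (i = head + 1; i <= NR; i++) print lines[i % tail]
  } else {                            # normal case: same queue replay as above
      print "..."
      i = NR
      do { i = (i + 1) % tail
           print lines[i]
      } while (i != (NR % tail))
  }
}'
1
2
3
4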

I suggest with bash:
(head -n 2; echo "..."; tail -n 2) < file
Output:
1
2
...
9
10
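Worth noting: the < file redirection is doing real work here. On a seekable file, head (at least GNU head) leaves the offset just after the lines it printed, so the following tail still sees the rest; from a true pipe, head reads ahead and consumes the lines tail would need, which is exactly what happened in the question's first attempt:
% seq 10 | (head -n 2; echo "..."; tail -n 2)   # from a pipe: tail is likely to see nothing
1
2
...
% (head -n 2; echo "..."; tail -n 2) < file     # from the file itself: works
1
2
...
9
10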

Related

Get n lines from file which are equal spaced

I have a big file with 1000 lines. I want to get 110 lines from it.
The lines should be evenly spread across the input file.
For example, reading 4 lines from a file with 10 lines:
Input File
1
2
3
4
5
6
7
8
9
10
outFile:
1
4
7
10
Use:
sed -n '1~9p' < file
The -n option stops sed from outputting anything on its own. '1~9p' tells sed to print line 1 and then every 9th line after it (the p at the end orders sed to print).
To get close to 110 lines you have to print every 9th line (1000/110 ~ 9).
Update: This answer will print 112 lines, if you need exactly 110 lines, you can limit the output just using head like this:
sed -n '1~9p' < file | head -n 110
$ cat tst.awk
NR==FNR { next }
FNR==1 { mod = int((NR-1)/tgt) }
!( (FNR-1)%mod ) { print; cnt++ }
cnt == tgt { exit }
$ wc -l file1
1000 file1
$ awk -v tgt=110 -f tst.awk file1 file1 > file2
$ wc -l file2
110 file2
$ head -5 file2
1
10
19
28
37
$ tail -5 file2
946
955
964
973
982
Note that this will not produce the exact output you posted in your question for your posted input file, because that would require an algorithm that doesn't always use the same interval between output lines. You could dynamically calculate mod and adjust it as you parse your input file if you like (a sketch of that idea follows), but the above may be good enough.
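For completeness, here is one way that dynamic variant might look (a sketch, not the code from this answer, saved here as dyn.awk): recompute the next target line from a fractional step so the first and last lines are always included. It assumes tgt is at least 2 and reads the file twice, as above; shown with the question's 10-line sample saved as file:
$ cat dyn.awk
NR == FNR { tot = NR; next }                   # first pass: count the lines
FNR == 1  { step = (tot - 1) / (tgt - 1) }     # fractional interval between picks
FNR >= 1 + printed * step { print; printed++ } # print when we reach the next target line
printed == tgt { exit }
$ awk -v tgt=4 -f dyn.awk file file
1
4
7
10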
With awk you can do:
awk -v interval=3 '(NR-1)%interval==0' file
where interval is the difference in line count between consecutive printed lines. The value is essentially the total number of lines in the file divided by the number of lines you want printed.
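For the 1000-line/110-line case from the question that works out to interval = int(1000/110) = 9, with head trimming the couple of extra lines, just as in the sed answer above:
awk -v interval=9 '(NR-1)%interval==0' file | head -n 110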
I often like to use a combination of shell and awk for these sorts of things
#!/bin/bash
filename=$1
toprint=$2
awk -v tot=$(expr $(wc -l < $filename)) -v toprint=$toprint '
BEGIN{ interval=int((tot-1)/(toprint-1)) }
(NR-1)%interval==0 {
print;
nbr++
}
nbr==toprint{exit}
' $filename
Some examples:
$ ./spread.sh 1001lines 5
1
251
501
751
1001
$ ./spread.sh 1000lines 110 |head -n 3
1
10
19
$ ./spread.sh 1000lines 110 |tail -n 3
964
973
982

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros.
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
This seems to be a pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set input field separator as \n (newline)
-v RS='(1\n)+' # set input record separator to one or more occurrences of 1 followed by a newline
NF # execute the block if at least one field is found
print NF-1 # print the number of fields minus 1 to get the count of 0s
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
It "resets" the line counter ($.) whenever a 1 is seen, so $.-1 is the number of 0 lines since the previous 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option, it defaults to whitespace, meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the file's end and the last 1 will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
if you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
The same logic actually looks better in awk:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
More strange (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It reads the file with the record separator set to the character 1 (the -0x31),
with autosplit -a (which splits each record into the array @F),
and prints the number of elements in @F, e.g. say @F+0 (or say scalar @F).
Unfortunately, after the final 1 (acting as record separator) it sees one more, empty record - and therefore prints the trailing 0.
It is an incorrect solution, shown only as an alternative curiosity.
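If you want to stay with this approach anyway, one small tweak (not part of the original one-liner) is to skip records that autosplit into no fields, which drops that trailing empty record:
perl -0x31 -laE 'say scalar @F if @F' <file
3
5
5
2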
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is a matter of printing the counters that appear before the 0 on each line. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field (the count) when the second field is 0.
The simplest solution would be to use sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piped into awk '{print length}' you can get the count of 0 for each interval:
Output:
3
5
5
2

Running zcat on multiple files using a for loop

I'm very new to terminal/bash, and perhaps this has been asked before but I wasn't able to find what I'm looking for perhaps because I'm not sure exactly what to search for to answer my question.
I'm trying to format some files for genetic analysis and while I could write out the following command for every sample file, I know there is a better way:
zcat myfile.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fastq.gz
zcat myfile.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > myfile.1.fastq.gz
I have the following files:
-bash-3.2$ ls
BB001.fastq BB013.fastq.gz IN014.fastq.gz RV006.fastq.gz SL083.fastq.gz
BB001.fastq.gz BB014.fastq.gz INA01.fastq.gz RV007.fastq.gz SL192.fastq.gz
BB003.fastq.gz BB015.fastq.gz INA02.fastq.gz RV008.fastq.gz SL218.fastq.gz
BB004.fastq.gz IN001.fastq.gz INA03.fastq.gz RV009.fastq.gz SL276.fastq.gz
BB006.fastq.gz IN002.fastq.gz INA04.fastq.gz RV010.fastq.gz SL277.fastq.gz
BB008.fastq.gz IN007.fastq.gz INA05.fastq.gz RV011.fastq.gz SL326.fastq.gz
BB009.fastq.gz IN010.fastq.gz INA1M.fastq.gz RV012.fastq.gz SL392.fastq.gz
BB010.fastq.gz IN011.fastq.gz RV003.fastq.gz SL075.fastq.gz SL393.fastq.gz
BB011.fastq.gz IN012.fastq.gz RV004.fastq.gz SL080.fastq.gz SL395.fastq.gz
BB012.fastq.gz IN013.fastq.gz RV005.fastq.gz SL081.fastq.gz
and I would like to apply the two zcat functions to each file, creating two new files from each one without writing it out 50 times. I've used for loops in R quite a bit but don't know where to start in bash. I can say in words what I want and hopefully someone can give me a hand coding it!:
for FILENAME.fastq.gz in all files in cd
zcat FILENAME.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > FILENAME.2.fastq.gz
zcat FILENAME.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > FILENAME.1.fastq.gz
Thanks a ton in advance for your help!
*****EDIT*****
My notation was a bit off, here's the final, correct for loop:
for fname in *.fastq.gz
do
gzcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.2.fastq.gz"
gzcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.1.fastq.gz"
done
*****FOLLOWUP QUESTION*****
When I run the following:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
I get this error:
cat: ./CleanedSeparate/XhoI/*.1.fastq.gz: No such file or directory
cat: ./CleanedSeparate/MseI/*.2.fastq.gz: No such file or directory
Obviously I'm not using * correctly. Any tips on where I'm going wrong?
for fname in *.fastq.gz
do
zcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >"${fname%.fastq.gz}.2.fastq.gz"
zcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >"${fname%.fastq.gz}.1.fastq.gz"
done
Key points:
for fname in *.fastq.gz
This loops over every file in the current directory ending in .fastq.gz. If the files are in a different directory, then use:
for fname in /path/to/*.fastq.gz
where /path/to/ is whatever the path should be to get to those files.
zcat "$fname"
This part is straightforward. It substitutes in the file name as the argument for zcat.
"${fname%.fastq.gz}.1.fastq.gz"
This is a little bit trickier. To get the desired output file name, we need to insert the .1 into the original filename. The easiest way to do this in bash is to remove the .fastq.gz suffix from the file name with ${fname%.fastq.gz} where the % is bash-speak meaning remove what follows from the end. Then, we add on the new suffix .1.fastq.gz and we have the correct file name.
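For example, with one of the file names from the listing above:
$ fname=BB001.fastq.gz
$ echo "${fname%.fastq.gz}.1.fastq.gz"
BB001.1.fastq.gz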
Creating the new files in a different directory
As per the follow-up question, this does not work:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
The problem is that, in the for statement, the shell is looking for *.1.fastq.gz in the current directory. But they aren't there; they are in ./CleanedSeparate/XhoI/. Instead, run:
dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
base=${fname#$dir1/}
base=${base%.1.fastq.gz}
echo "base=$base"
cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done
Notice here that the for statement is given the correct directory in which to find the files.
You can use something like:
for fspec in *.fastq.gz ; do
echo "${fspec}"
done
That will simply echo the file being processed but you can do anything you want to ${fspec}, including using it for a couple of zcat commands.
In order to get the root of the file name (for creating the other files), you can use the pattern deletion feature of bash to remove the trailing bit:
for fspec in *.fastq.gz ; do
froot=${fspec%%.fastq.gz}
echo "Transform ${froot}.fastq.gz into ${froot}.1.fastq.gz"
done
In addition, for your specific need, it appears you want to send the first four lines of an eight-line group to one file and the other four lines to a second file.
I tend to just use sed for simple tasks like that since it's likely to be faster. You can get the first line group (first four lines of the eight) with:
sed -n 'p;n;p;n;p;n;p;n;n;n;n'
and the second (second four lines of the eight) with:
sed -n 'n;n;n;n;p;n;p;n;p;n;p'
using the p print-current and n get-next commands.
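A quick way to convince yourself of the pattern, using an eight-line stream:
$ seq 8 | sed -n 'p;n;p;n;p;n;p;n;n;n;n'
1
2
3
4
$ seq 8 | sed -n 'n;n;n;n;p;n;p;n;p;n;p'
5
6
7
8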
Hence the code then becomes something like:
for fsrc in *.fastq.gz ; do
fdst1="${fsrc%%.fastq.gz}.1.fastq.gz"
fdst2="${fsrc%%.fastq.gz}.2.fastq.gz"
echo "Processing ${fsrc}"
# For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
zcat "${fsrc}" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip >"${fdst1}"
zcat "${fsrc}" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip >"${fdst2}"
done

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering whether there is a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find lines in common with your 2 files, using awk :
awk 'a[$0]++' file1 file2
This will output 3, 10 and 5.
Now, just pipe this to wc to get the number of common lines:
awk 'a[$0]++' file1 file2 | wc -l
This will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Also be aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values seen across all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too, but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The -1 and -2 flags suppress the lines unique to file1 and to file2 respectively, leaving only the lines common to both.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of the cat:
sort file1 file2 | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but it will only report duplicates in the second file that also exist in the first file. (If duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1 : new version ensures intra-file duplicates are excluded from count, so only cross-file duplicates would show up in the final stats :
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 over-lapping lines in this example. It avoids spending any time on per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the 1st array (a plain-awk sketch of that follows the demo below).
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482
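If you do want those overlapping rows rather than just the summary, a plainer sketch of the same idea (ordinary awk arrays, not part of the original mawk) is to mark each line once per file and print the keys seen in more than one file, together with how many files contain them:
awk 'FNR == 1 { delete seen }          # new input file: reset the per-file "seen" marks
     !seen[$0]++ { cnt[$0]++ }         # count each distinct line at most once per file
     END { for (line in cnt)
               if (cnt[line] > 1)      # present in more than one file
                   print cnt[line], line
     }' file1 file2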

Problems in mapping indices using awk

Hi all, I have these data files
File1
1 The hero
2 Chainsaw and the gang
3 .........
4 .........
where the first field is the id and the second field is the product name
File 2
The hero 12
The hero 2
Chainsaw and the gang 2
.......................
From these two files I want to have a third file
File 3
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................
As you can see, I am just appending the index read from file 1.
I used this method
awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File 3
where I am creating an associative array using File 1 and just doing a lookup using product names from file 2.
However, my files are huge; I have around 20 million product names and this process is taking a lot of time. Any suggestions on how I can speed it up?
You can use this awk:
awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
The script you posted won't produce the output you want from the input files you posted so let's fix that first:
$ cat file1
1 The hero
2 Chainsaw and the gang
$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2
$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
Now, is that really too slow or were you doing some pre or post-processing and that was the real speed issue?
The obvious speed-up, if your "file2" is sorted, is to delete the corresponding map[] value whenever the key changes, so your map[] gets smaller every time you use it, e.g. something like this (untested):
$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2
Alternative approach when populating map[] uses too much time/memory and file2 is sorted:
$ awk '
{ key=$0
sub(/[[:space:]]+[^[:space:]]+$/,"",key)
if (key != prev) {
cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
cmd | getline val
close(cmd)
}
print $0, val
prev = key
}' file2
From comments you're having scaling problems with your lookups. The general fix for that is to merge sorted sequences:
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
<( sort -t $'\t' -k2 file1) \
<( sort -t $'\t' -sk1,1 file2)
I gather Windows can't do process substitution, so you have to use temporary files:
sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey
If you need to preserve the value lookup sequence use nl to put line numbers on the front and sort on those at the end.
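A sketch of that nl idea (untested, and assuming the same tab-separated file1/file2 layout as the join command above): tag file2's lines with their original line numbers, join on the product name, then sort the result back into file2 order and strip the numbers:
nl -ba -w1 -s$'\t' file2 |                       # prepend original line numbers
  sort -t $'\t' -k2,2 |                          # sort by the product-name key for join
  join -t $'\t' -1 2 -2 2 -o 2.1,1.2,2.3,1.1 \
       <(sort -t $'\t' -k2 file1) - |            # merge with the id lookup on the name field
  sort -t $'\t' -k1,1n |                         # restore file2's original order
  cut -f2-                                       # drop the temporary line numbers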
If your issue is performance then try this perl script:
#!/usr/bin/perl -l
use strict;
use warnings;
my %h;
open my $fh1 , "<", "file1.txt";
open my $fh2 , "<", "file2.txt";
open my $fh3 , ">", "file3.txt";
while (<$fh1>) {
my ($v, $k) = /(\d+)\s+(.*)/;
$h{$k} = $v;
}
while (<$fh2>) {
my ($k, $v) = /(.*)\s+(\d+)$/;
print $fh3 "$k $v $h{$k}" if exists $h{$k};
}
Save the above script as, say, script.pl and run it as perl script.pl. Make sure file1.txt and file2.txt are in the same directory as the script.
