how to paste n times the same column? - bash

I have two files of different lengths, e.g. file1 reads
A
B
C
D
E
and file2
1
I am looking for a way to create file3 like:
A 1
B 1
C 1
D 1
E 1
I know that if file1 and file2 had the same length a simple paste file1 file2 > file3 would solve the problem.

take 1
If file2 only has one line, I would do
awk -v f2="$(< file2)" '{print $0, f2}' file1
If file2 contains, say, 3 lines and you want the output to look like:
a 1
b 2
c 3
d 1
e 2
then I would do
awk '
NR==FNR {f2[FNR]=$0; n=FNR; next}
{print $0, f2[((FNR-1)%n)+1]}
' file2 file1
take 2
Here's a crazy way to use paste and a process substitution that repeats file2 so that it's the same length as file1
printf "%s\n" {A..Z} >|file1
seq 1 3 >| file2
paste file1 <(
lf1=$(wc -l < file1)
lf2=$(wc -l < file2)
for (( i=0; i <= lf1/lf2; i++)); do
cat file2
done | head -n $lf1
)
A 1
B 2
C 3
D 1
E 2
F 3
G 1
H 2
I 3
J 1
K 2
L 3
M 1
N 2
O 3
P 1
Q 2
R 3
S 1
T 2
U 3
V 1
W 2
X 3
Y 1
Z 2

One way with awk:
awk 'NR==FNR{a[NR]=$0;next}{x=a[FNR]?a[FNR]:x;$2=x}1' file2 file1 > file3
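The one-liner above is terse, so here is a commented sketch of the same idea. It assumes file2 is non-empty and no longer than file1, and that its last line should be repeated for the remaining lines of file1. The function name paste_fill is made up for illustration. Two things differ from the one-liner: the membership test `FNR in a` replaces the truthiness test, so a value such as 0 in file2 is not skipped, and the value is appended to the line instead of overwriting field 2.

```shell
# paste_fill is a hypothetical helper name, not part of the original answer.
# Usage: paste_fill file1 file2
paste_fill() {
    awk '
        NR == FNR { a[FNR] = $0; next }   # first file read (file2): remember each line
        FNR in a  { x = a[FNR] }          # file2 has a line at this position: use it
        { print $0, x }                   # otherwise x keeps the last value seen
    ' "$2" "$1"
}
```

With a one-line file2, every line of file1 gets that single value appended.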

Related

awk insert rows of one file as new columns to every nth rows of another file

Let's keep n=3 here, and say I have two files:
file1.txt
a b c row1
d e f row2
g h i row3
j k l row4
m n o row5
o q r row6
s t u row7
v w x row8
y z Z row9
file2.txt
1 2 3
4 5 6
7 8 9
I would like to merge the two files into a new_file.txt:
new_file.txt
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
Currently I do this as follows (there are also slow bash for or while loop solutions, of course): awk '1;1;1' file2.txt > tmp2.txt and then something like awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' tmp2.txt file1.txt > new_file.txt for the case listed in my question.
Or put these in one line: awk '1;1;1' file2.txt | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - file1.txt > new_file.txt. But these do not look elegant at all...
I am looking for a more elegant one liner (perhaps awk) that can effectively do this.
In the real case, let's say for example I have 9 million rows in input file1.txt and 3 million rows in input file2.txt and I would like to append columns 2 and 3 of the first row of file2.txt as the new last columns of the first 3 rows of file1.txt, columns 2 and 3 of the second row of file2.txt as the same new last columns of the next 3 rows of file1.txt, etc, etc.
Thanks!
Try this; see mywiki.wooledge - Process Substitution for details on the <() syntax.
$ # transforming file2
$ cut -d' ' -f2-3 file2.txt | sed 'p;p'
2 3
2 3
2 3
5 6
5 6
5 6
8 9
8 9
8 9
$ # then paste it together with required fields from file1
$ paste -d' ' <(cut -d' ' -f1-3 file1.txt) <(cut -d' ' -f2-3 file2.txt | sed 'p;p')
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
Speed comparison, time shown for two consecutive runs
$ perl -0777 -ne 'print $_ x 1000000' file1.txt > f1
$ perl -0777 -ne 'print $_ x 1000000' file2.txt > f2
$ du -h f1 f2
95M f1
18M f2
$ time paste -d' ' <(cut -d' ' -f1-3 f1) <(cut -d' ' -f2-3 f2 | sed 'p;p') > t1
real 0m1.362s
real 0m1.154s
$ time awk '1;1;1' f2 | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - f1 > t2
real 0m12.088s
real 0m13.028s
$ time awk '{
if (c==3) c=0;
printf "%s %s %s ",$1,$2,$3;
if (!c++){ getline < "f2"; f4=$2; f5=$3 }
printf "%s %s\n",f4,f5
}' f1 > t3
real 0m13.629s
real 0m13.380s
$ time awk '{
if (c==3) c=0;
main_fields=$1 OFS $2 OFS $3;
if (!c++){ getline < "f2"; f4=$2; f5=$3 }
printf "%s %s %s\n", main_fields, f4, f5
}' f1 > t4
real 0m13.265s
real 0m13.896s
$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical
$ diff -s t1 t4
Files t1 and t4 are identical
Awk solution:
awk '{
if (c==3) c=0;
main_fields=$1 OFS $2 OFS $3;
if (!c++){ getline < "file2.txt"; f4=$2; f5=$3 }
printf "%s %s %s\n", main_fields, f4, f5
}' file1.txt
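c - counter tracking the position within each group of n (here 3) lines of file1.txt; a new record of file2.txt is read whenever it wraps around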
getline < file - reads the next record from file
f4=$2; f5=$3 - contain the values of the 2nd and 3rd fields from currently read record of file2.txt
The output:
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
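A possible generalisation of the getline approach, as a sketch only: the group size becomes a parameter, and the return value of getline is checked so that a short or unreadable file2 produces an empty suffix instead of silently reusing stale values. The function name append_every_n and the awk variables n, f2, line and extra are assumptions made for this example. Reading into a variable (`getline line < f2`) leaves $0 of file1 untouched, so the main fields do not need to be saved first.

```shell
# append_every_n is a hypothetical helper name.
# Usage: append_every_n n file1 file2
append_every_n() {
    awk -v n="$1" -v f2="$3" '
        (FNR - 1) % n == 0 {
            # first line of a new group: fetch the next record of file2
            if ((getline line < f2) > 0) {
                split(line, f, " ")
                extra = f[2] " " f[3]
            } else {
                extra = ""               # file2 ran out or could not be read
            }
        }
        { print $1, $2, $3, extra }      # first three fields of file1 plus the suffix
    ' "$2"
}
```

For the question's data this would be called as append_every_n 3 file1.txt file2.txt.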
This is still a lot slower than Sundeep's cut&paste code on the 100,000 lines test (8s vs 21s on my laptop) but perhaps easier to understand than the other Awk solution. (I had to play around for a bit before getting the indexing right, though.)
awk 'NR==FNR { a[FNR] = $2 " " $3; next }
{ print $1, $2, $3, a[1+int((FNR-1)/3)] }' file2.txt file1.txt
This simply keeps (the pertinent part of) file2.txt in memory and then reads file1.txt and writes out the combined lines. That also means it is limited by available memory, whereas Roman's solution will scale to basically arbitrarily large files (as long as each line fits in memory!) but is slightly slower (I get 28s real time for Roman's script with Sundeep's 100k test data).
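Since the indexing is the fiddly part, a small sketch that prints the row-to-index mapping may help when checking it by hand. The helper name group_index and its two parameters (group size n and number of rows) are invented for this illustration; the expression is the one used in the script above, with 3 replaced by n.

```shell
# group_index is a hypothetical helper: for rows 1..rows of file1, print which
# line of file2 each row maps to under the 1 + int((FNR - 1) / n) rule.
# Usage: group_index n rows
group_index() {
    awk -v n="$1" -v rows="$2" 'BEGIN {
        for (r = 1; r <= rows; r++)
            printf "%d -> %d\n", r, int((r - 1) / n) + 1
    }'
}

group_index 3 7
```

With n=3, rows 1 to 3 map to line 1 of file2, rows 4 to 6 to line 2, and row 7 to line 3.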

join command leaving out a row of numbers

I have two files, I want to take out the rows which have common data in the third column. But it is leaving out a row which should be matched.
File1
b b b
4 5 3
c c c
File2
1 2 3 4
a b c d
e f g h
i j k l
l m n o
The output is:
c c c a b d
The command used is:
join -1 3 -2 3 --nocheck-order File1.txt File2.txt
It is missing out the row with 3 as the common field, even after placing the --nocheck-order
Edit:
Expected output:
c c c a b d
3 4 5 1 2 4
As an alternative to 2 sort commands (can be very expensive for big files) and then a join, you can use this single awk command to get your output:
awk 'FNR == NR{a[$3]=$0; next} $3 in a{print $3, a[$3], $1, $2, $4}' file1 file2
3 4 5 3 1 2 4
c c c c a b d
Explanation:
NR == FNR { # While processing the first file
a[$3] = $0 # store the whole line in array a using $3 as key
next
}
$3 in a { # while processing the 2nd file, when $3 is found in array
print $3,a[$3],$1,$2,$4 # print relevant fields from file2 and the remembered
# value from the first file.
}
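The command above prints the key twice, because a[$3] holds the whole line of file1. If the layout from the question's expected output is wanted (key first, then the remaining fields of each file), one possible variant stores only the non-key fields. This is a sketch under the assumptions that file1 is non-empty with three columns, file2 has four, and keys are unique in file1; join_on_third is a made-up name.

```shell
# join_on_third is a hypothetical helper name.
# Usage: join_on_third file1 file2
join_on_third() {
    awk '
        FNR == NR  { rest[$3] = $1 " " $2; next }       # file1: store the non-key fields under the key
        $3 in rest { print $3, rest[$3], $1, $2, $4 }   # file2: key, file1 remainder, file2 remainder
    ' "$1" "$2"
}
```

Output follows the order of file2, so no sorting is needed.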
You need to sort your inputs (e.g. using process substitution):
$ join -1 3 -2 3 <(sort -k3 1.txt) <(sort -k3 2.txt)
3 4 5 1 2 4
c c c a b d
This is equivalent to:
$ sort -k3 1.txt > 1-sorted.txt
$ sort -k3 2.txt > 2-sorted.txt
$ join -1 3 -2 3 1-sorted.txt 2-sorted.txt
3 4 5 1 2 4
c c c a b d

Sorting lines in one file given the order in another file

Given a file1:
13 a b c d
5 f a c d
7 d c g a
14 a v s d
and a file2:
7 x
5 c
14 a
13 i
I would like to sort file1 considering the same order of the first column in file2, so that the output should be:
7 d c g a
5 f a c d
14 a v s d
13 a b c d
Is it possible to do this in bash or should I use some "higher" language like python?
Use awk to put the line number from file2 as an extra column in front of file1. Sort the result by that column. Then remove that prefix column
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' file2 file1 | sort -k 1,1n | cut -d' ' -f2-
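For comparison, a sort-free alternative (a different technique from the decorate, sort, and strip pipeline above) is sketched here: it indexes file1 by its first column and then walks the keys of file2 in order. It assumes file1 fits in memory and that its first-column keys are unique; the name order_by is invented for the example.

```shell
# order_by is a hypothetical helper name.
# Usage: order_by file2 file1
order_by() {
    awk '
        NR == FNR { order[++n] = $1; next }   # file2: remember the keys in their order
        { line[$1] = $0 }                     # file1: index each line by its first column
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in line) print line[order[i]]   # skip keys missing from file1
        }
    ' "$1" "$2"
}
```

Lines of file1 whose key does not appear in file2 are dropped by this sketch.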
Simple solution
for S in $(awk '{print $1}' file2); do grep "^$S " file1; done

how to use awk to merge files with common fields and print in another file

I have read all the related questions, but I am still quite confused...
I have two tab-separated files.
file1 (breaks added for readability):
a 15 bac
g 10 bac
h 11 bac
r 33 arq
t 12 euk
file2 (breaks added for readability):
0 15 h 3 5 2 gf a a g e g s s g g
p 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Output desired (breaks added for readability):
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Just that. I need to print the complete file2, but replace its first column with the third column of file1 whenever $2 of file2 is the same as $2 of file1...
file1 is larger than file2, but it could still happen that $2 from file2 is not present in file1; in that case, print ND in the first column.
I'm sure it must be simple, but I have trouble handling two files in awk. Please, if someone could help me...
Using this awk command:
awk 'FNR==NR{a[$2]=$3;next} {$1=(a[$2])?a[$2]:"ND"} 1' file1 file2
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Explanation:
FNR==NR - Execute this block for first file in input i.e. file1
a[$2]=$3 - Populate an associative array a with key as $2 and value as $3 from file1
next - Read next line until EOF on first file
Now operating in file2
$1=(a[$2])?a[$2]:"ND" - Overwrite $1 with a[$2] if $2 is found in array a, otherwise by literal string "ND"
1 - print the output
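One caveat with the truthiness test (a[$2]) is that a stored value of 0 or an empty string counts as not found, and the lookup also creates an empty entry in the array. A hedged variant using the `in` operator avoids both. It also sets FS and OFS to a tab, on the assumption (taken from the question) that the files really are tab-separated. The function name relabel is made up.

```shell
# relabel is a hypothetical helper name; tab-separated input is assumed.
# Usage: relabel file1 file2
relabel() {
    awk 'BEGIN { FS = OFS = "\t" }
        FNR == NR { a[$2] = $3; next }          # file1: map column 2 to column 3
        { $1 = ($2 in a) ? a[$2] : "ND" } 1     # file2: replace column 1, default to ND
    ' "$1" "$2"
}
```

As with the original command, this relies on file1 being non-empty so that FNR==NR identifies the first file.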
You could try with join + awk command as below:
join -t ' ' -a2 -1 2 -2 2 test1.txt test2.txt | awk 'BEGIN { start = 5; end = 18 } { if (NF == 16) { temp = $1; $1 = "ND " $2; $2 = temp; print } else { printf("%s %s ", $3, $1); for (i=start; i<=end; i++) printf ("%s ", $i); printf("\n");}}'

Get n last records and change particular columns on them

I have a file like this
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
* a
0 b
I want to delete a and b from the last two records in the END{} section
Result:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
How can I get n last lines and change fields on them with awk?
Here's one way using any awk:
awk -v count=$(wc -l <file.txt) 'NR > count - 2 { $2 = "" }1' file.txt
Results:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
Or, to apply the awk operation to all records except the last 2 lines of the input file as a shell script, try ./script.sh file.txt. Contents of script.sh:
command=$(awk -v count="$(wc -l <"$1")" 'NR <= count - 2 { $2 = "" }1' "$1")
echo -e "$command"
Results:
1 "45554323" p b
2 "34534567" f a
3 "76546787" u b
2 "56765435" f a
* a
0 b
If you know the value of n, the line number after which you want to delete the last item on the line/column (here 4), this will work:
awk '{if (NR>4) NF=NF-1}1' data.txt
will give:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
NF = NF -1 makes awk think there is one less field on the line than there is, which is how it doesn't display the last column/item on the line once that condition is met. NR refers to the current line number in the file being read.
awk can't know the number of lines in a file unless it goes through it once, or is given that information (e.g., wc -l). An alternative approach would be to save the last n lines in a buffer (sort of a sliding window/tape-delay type analogy, you are always printing n lines behind) and then process the final n lines in the END block.
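The sliding-window idea described above could look roughly like this. It is a sketch, not a tested production script: the helper name drop_last_field_tail and the awk variable n (number of trailing lines to modify, assumed to be at least 1) are invented here, and it relies on decrementing NF, as the other answers do.

```shell
# drop_last_field_tail is a hypothetical helper name.
# Usage: drop_last_field_tail n file
drop_last_field_tail() {
    awk -v n="$1" '
        NR > n { print buf[NR % n] }     # the line read n records ago can be printed unchanged
        { buf[NR % n] = $0 }             # circular buffer holding the most recent n lines
        END {
            start = (NR > n) ? NR - n + 1 : 1
            for (i = start; i <= NR; i++) {
                $0 = buf[i % n]          # re-split the buffered line into fields
                if (NF > 0) NF--         # drop its last field
                print
            }
        }
    ' "$2"
}
```

The file is read only once and at most n lines are held in memory, so this also works on a pipe where wc -l or a second pass is not available.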
This doesn't exactly answer your question but it produces the output you require:
$ gawk '{if (NF < 3) print $1; else print}' input.txt
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
$ cat file
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
* a
0 b
$ awk 'BEGIN{ARGV[ARGC++]=ARGV[ARGC-1]} NR==FNR{nr++; next} FNR>(nr-2) {NF--} 1' file
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
or if you don't mind manually specifying the file name twice:
awk 'NR==FNR{nr++; next} FNR>(nr-2) {NF--} 1' file file
