I have two sample files, test1.txt and test2.txt. How do I join them appropriately?
$> cat test1.txt
1 USA
2 CANADA
3 MEXICO
4 BAHAMAS
5 CUBA
$> cat test2.txt
MEXICO Mexico-city
USA Washington
CANADA Ottawa
CUBA Havanna
BAHAMAS Nassau
$> join -j 2 -o '1.1,1.2,2.2' < (sort -k2 test1.txt) < (sort -k1 test2.txt)
ksh: 0403-057 Syntax error: `(' is not expected.
Expected output:
1 USA Washington
2 CANADA Ottawa
3 MEXICO Mexico-city
4 BAHAMAS Nassau
5 CUBA Havanna
The first solution is bad! When test1.txt is large, you will have too much overhead calling grep and cut for each line:
while read -r line; do
key=${line#* }
printf "%s %s\n" "${line}" $(grep "${key}" test2.txt | cut -d" " -f2-)
done < test1.txt
You should get your join working or use awk:
awk 'FNR==NR { towns[$1]=$2; next;} $2 in towns {print $0 " " towns[$2];}' test2.txt test1.txt
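As an aside on the original error: ksh88 does not support process substitution at all, and even in shells that do (bash, ksh93), no space is allowed between "<" and "(". Also, since the key is field 2 of test1.txt but field 1 of test2.txt, "-1 2 -2 1" is needed rather than "-j 2". A self-contained sketch of the corrected join, assuming bash; the trailing sort -n restores the numeric order of the first file:

```shell
#!/bin/bash
# Recreate the sample files from the question.
printf '%s\n' '1 USA' '2 CANADA' '3 MEXICO' '4 BAHAMAS' '5 CUBA' > test1.txt
printf '%s\n' 'MEXICO Mexico-city' 'USA Washington' 'CANADA Ottawa' \
              'CUBA Havanna' 'BAHAMAS Nassau' > test2.txt

# Join on field 2 of test1.txt and field 1 of test2.txt.
# Note: no space between '<' and '(' in process substitution.
join -1 2 -2 1 -o '1.1,1.2,2.2' <(sort -k2 test1.txt) <(sort -k1 test2.txt) |
    sort -n
```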
Related
I want to count the number of occurrences of part of a filename when doing ls.
For example if my directory has the following files:
apple.cool_test1
banana.cool_test1
banana.cool_test2
cherry.cool_test1
cherry.cool_test2
cherry.cool_test3
I want the result like this:
1 apple
2 banana
3 cherry
So I tried "ls | sort | uniq -c", but how do I extract the first part of the filename? Can I use "." as the field separator?
Give this one-liner a try:
$ awk -F'.' '{a[$1]++}END{for(x in a)print a[x],x}' file
1 apple
2 banana
3 cherry
You can extract the first part with cut or awk:
$ printf '%s\n' * | cut -d'.' -f1 | uniq -c
1 apple
2 banana
3 cherry
$ printf '%s\n' * | awk -F'.' '{print $1}' | uniq -c
1 apple
2 banana
3 cherry
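One caveat worth adding (not stated in the answers above): uniq -c only counts adjacent duplicates, so it depends on sorted input. Shell glob expansion is already sorted, which is why the pipelines above work; if the names come from an unsorted source, insert a sort first:

```shell
# Unsorted input: without the sort, uniq -c would count 'banana' twice.
printf '%s\n' banana.cool_test1 apple.cool_test1 banana.cool_test2 |
    cut -d'.' -f1 | sort | uniq -c
```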
So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the unique lines in file 1, I use 'comm -23' after running 'sort -u' on these two files. Additionally, I would like '11 a' and '123 a' in file 2 to be treated as subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. These subsets are considered the same as their superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) >output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}";do
awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: Any ways I could show the row numbers from original file at output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {
    vals[$2][$1]
    next
}
$2 in vals {
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {
            next
        }
    }
}
{ print FNR, $0 }
$ awk -f tst.awk file2 file1
2 2 b
3 3 c
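If your awk lacks arrays of arrays (a gawk extension), the same idea can be sketched in portable awk by collecting the numbers seen for each letter in a single space-separated string. A sketch, with the sample files recreated inline:

```shell
# Same logic as tst.awk, but portable: numbers seen for each letter are
# stored in one space-separated string instead of a subarray.
cat > file1 <<'EOF'
1 a
2 b
3 c
44 d
EOF
cat > file2 <<'EOF'
11 a
123 a
3 b
445 d
EOF
awk '
NR==FNR { vals[$2] = vals[$2] " " $1; next }
{
    n = split(vals[$2], nums, " ")
    for (j = 1; j <= n; j++)
        if (index(nums[j], $1) == 1)  # file1 number is a prefix -> subset
            next
    print FNR, $0
}' file2 file1
```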
Here are two files where I need to eliminate the data that they do not have in common:
a.txt:
hello world
tom tom
super hero
b.txt:
hello dolly 1
tom sawyer 2
miss sunshine 3
super man 4
I tried:
grep -f a.txt b.txt >> c.txt
And this:
awk '{print $1}' test1.txt
because I need to check only if the first word of the line exists in the two files (even if not at the same line number).
But then what is the best way to get the following output in the new file?
output in c.txt:
hello dolly 1
tom sawyer 2
super man 4
Use awk where you iterate over both files:
$ awk 'NR == FNR { a[$1] = 1; next } a[$1]' a.txt b.txt
hello dolly 1
tom sawyer 2
super man 4
NR == FNR is only true for the first file making { a[$1] = 1; next } only run on said file.
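To see the mechanics: NR counts records across all input files, while FNR restarts at 1 for each file, so NR == FNR holds only while the first file is being read. A quick illustration (the demo file names here are made up):

```shell
# Two throwaway files to show how NR and FNR diverge on the second file.
printf 'x\ny\n' > demo1
printf 'z\n' > demo2
awk '{ print FILENAME, NR, FNR }' demo1 demo2
# prints:
# demo1 1 1
# demo1 2 2
# demo2 3 1
```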
Use sed to generate a sed script from the input, then use another sed to execute it.
sed 's=^=/^=;s= .*= /p=' a.txt | sed -nf- b.txt
The first sed turns your a.txt into
/^hello /p
/^tom /p
/^super /p
which prints (p) whenever a line contains hello, tom, or super at the beginning of line (^) followed by a space.
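Putting it together with the sample data, here is a self-contained rerun of the pipeline, writing the result to c.txt as in the question:

```shell
# Recreate the sample files from the question.
cat > a.txt <<'EOF'
hello world
tom tom
super hero
EOF
cat > b.txt <<'EOF'
hello dolly 1
tom sawyer 2
miss sunshine 3
super man 4
EOF
# First sed builds the /^word /p script; second sed (-f-) reads it
# from stdin and applies it to b.txt.
sed 's=^=/^=;s= .*= /p=' a.txt | sed -nf- b.txt > c.txt
cat c.txt
```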
This combines grep, cut and sed with process substitution:
$ grep -f <(cut -d ' ' -f 1 a.txt | sed 's/^/^/') b.txt
hello dolly 1
tom sawyer 2
super man 4
The output of the process substitution is this (piping to cat -A to show spaces):
$ cut -d ' ' -f 1 a.txt | sed 's/^/^/;s/$/ /' | cat -A
^hello $
^tom $
^super $
We then use this as input for grep -f, resulting in the above.
If your shell doesn't support process substitution, but your grep supports reading from stdin with the -f option (it should), you can use this instead:
$ cut -d ' ' -f 1 a.txt | sed 's/^/^/;s/$/ /' | grep -f - b.txt
hello dolly 1
tom sawyer 2
super man 4
When I have two files, such as file A
012
658
458
895
235
and file B
1
2
3
4
5
how could they be joined in bash? The output should just be
1012
2658
3458
4895
5235
Really, I just want to bind by column, as with cbind in R.
Assuming both files have the same number of lines, you can use the paste command:
paste --delimiters='' fileB fileA
The default delimiter for the paste command is TAB, so '' makes sure no delimiter is inserted.
Like this maybe:
paste -d'\0' B A
Or, if you like awk:
awk 'FNR==NR{A[FNR]=$0;next} {print $0,A[FNR]}' OFS='' A B
Using pure Bash and no external commands:
while read -u 3 A && read -u 4 B; do
echo "${B}${A}"
done 3< File_A.txt 4< File_B.txt
grep "run complete" *.err | awk -F: '{print $1}'|sort > a
ls ../bam/*bam | grep -v temp | awk -F[/_] '{print $3".err"}' | sort > b
diff <(grep "run complete" *.err | awk -F: '{print $1}'|sort) <(ls ../bam/*bam | grep -v temp | awk -F[/_] '{print $3".err"}' )
paste a b
I have to update a particular column value in a file for particular Unique IDs.
My file-name and sample contents are given below:
Names.txt
J017 0001 Amit 10th
J011 2341 Kuldeep 11th
J004 1254 Ramand 12th
I have to update the 4th column value to something. I tried the logic below, but it did not work:
stu="";
for i in `echo "J017, J058 and J107. " |egrep -o '[jJ][0-9]{3}' `
do
stu="$stu|$i ";
awk -v I=$i '/$I/{$4="LEFT";print $0}' Names.txt >tmp
done
egrep -v `echo "$stu" | sed "s/^|//g" ` Names.txt >>tmp
mv tmp Names.txt
The above awk command did not give the expected result. Please help me fix the error.
To answer your specific question about why this:
awk -v I=$i '/$I/{$4="LEFT";print $0}'
doesn't work: you don't access awk variables by prefixing them with a "$", just as you don't in C or most other languages (shell being an exception). This is how you would write the above to execute the way you are trying to get it to execute:
awk -v I=$i '$0 ~ I{$4="LEFT";print $0}'
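A quick self-contained check of the corrected command against a recreated Names.txt (note that the shell variable is also quoted here, which the original omitted):

```shell
# Recreate the sample file from the question.
cat > Names.txt <<'EOF'
J017 0001 Amit 10th
J011 2341 Kuldeep 11th
J004 1254 Ramand 12th
EOF
i=J017
awk -v I="$i" '$0 ~ I {$4="LEFT"; print $0}' Names.txt
# prints: J017 0001 Amit LEFT
```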
Having said that, your shell script is completely the wrong way to do what you want. Try this instead (uses GNU awk for patsplit() but match()/substr() in other awks would work just as well):
$ cat tst.sh
awk -v ids="J017, J058 and J107. " '
BEGIN {
    patsplit(ids,idsA,/[jJ][0-9]{3}/)
    for (i=1; i in idsA; i++)
        stu = stu (i==1 ? "^" : "|") idsA[i]
    stu = stu "$"
}
$1 ~ stu { $4 = "LEFT" }
{ print }
' "$@"
$ ./tst.sh file
J017 0001 Jagdeep LEFT
J011 2341 Kuldeep 11th
J004 1254 Ramand 12th
#!/bin/bash
FILE='Names.txt'
COLUMNS=(J017 J011 J004)
REPLACE='LEFT'
OUT=$(
IFS="|"
awk -v R="$REPLACE" -v E="${COLUMNS[*]}" '$1 ~ E{$4 = R;print $0}' "$FILE"
)
echo "$OUT" > "$FILE"
Run with:
bash script.sh
Input:
J017 0001 Jagdeep 10th
J011 2341 Kuldeep 11th
J004 1254 Ramand 12th
Result:
J017 0001 Jagdeep LEFT
J011 2341 Kuldeep LEFT
J004 1254 Ramand LEFT
Names.txt
J017 0001 Jagdeep 10th
J011 2341 Kuldeep 11th
J004 1254 Ramand 12th
awk
awk '/[jJ][0-9][0-9][0-9]/ {$4="LEFT"}1' Names.txt
J017 0001 Jagdeep LEFT
J011 2341 Kuldeep LEFT
J004 1254 Ramand LEFT
This might work for you:
echo "J017, J111 and J004. " |
grep -o "J[0-9]\{3\}" |
awk 'FNR==NR{key[$1];next};$1 in key{$4="LEFT"}1' - Names.txt