I have file.txt with 3 columns.
1 A B
2 C D
3 E F
I want to combine column #1 with column #2, and then column #1 with column #3. The result should look like this:
1A
2C
3E
1B
2D
3F
I am doing this with:
cut -f 1,2 file.txt > tmp1
cut -f 1,3 file.txt > tmp2
cat *tmp * > final_file
But I am getting repeated lines! If I check the final output with:
cat * | sort | uniq -d
there are plenty of repeated lines, and there are none in the original file.
Can anyone suggest another way of doing this? I believe the approach I am using is too convoluted, and that is why I am getting such weird output.
pzanoni@vicky:/tmp$ cat file.txt
1 A B
2 C D
3 E F
pzanoni@vicky:/tmp$ cut -d' ' -f1,2 file.txt > result
pzanoni@vicky:/tmp$ cut -d' ' -f1,3 file.txt >> result
pzanoni@vicky:/tmp$ cat result
1 A
2 C
3 E
1 B
2 D
3 F
I'm using bash
This preserves the order with one pass through the file:
awk '
{print $1 $2; pass2 = pass2 sep $1 $3; sep = "\n"}
END {print pass2}
' file.txt
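For reference, on the sample file.txt above this should reproduce the requested output exactly:
$ awk '{print $1 $2; pass2 = pass2 sep $1 $3; sep = "\n"} END {print pass2}' file.txt
1A
2C
3E
1B
2D
3F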
The reason this (cat tmp* * > final_file) is wrong:
I assume *tmp was a typo
I assume that at this point the directory only contains "tmp1" and "tmp2"
Look at how those wildcards will be expanded:
tmp* expands to "tmp1" and "tmp2"
* also expands to "tmp1" and "tmp2"
So your command line becomes cat tmp1 tmp2 tmp1 tmp2 > final_file and hence you get all the duplicated lines.
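A quick way to see what the shell will actually pass to cat is to expand the same wildcards with echo (assuming, as above, that the directory contains only tmp1 and tmp2):
$ echo tmp* *
tmp1 tmp2 tmp1 tmp2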
cat file.txt | awk '{print $1 $2 "\n" $1 $3};'
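Note that this prints the pairs interleaved per input line rather than grouped by column; on the sample file it gives:
1A
1B
2C
2D
3E
3F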
Related
diff and similar tools seem to compare files, not content that happens to be in the form of lines in files. That is, they consider the position of each line in the file as significant and part of the comparison.
What about when you just don't care about position? I simply want to compare two lists, more like a set operation, without any regard to position. Here each line can be considered a list element. So I'm looking for the difference between the lines in file1 and file2, and between file2 and file1.
I don't want to see positional information, or do any pairwise comparison, just a result set for each operation. For example:
SET1: a b c d f g
SET2: a b c e g h
SET1 - SET2 = d f
SET2 - SET1 = e h
Can I do this easily in bash? Obviously it's fine to sort the lists first (or not), but sorting is not intrinsically a prerequisite to working with sets.
Assuming you want to do full-line string comparisons and consider counts of lines rather than just appearances of lines as differences, this might do what you want (untested):
awk '
NR==FNR {                      # first pass: count every line of file1
    set1[$0]++
    next
}
$0 in set1 {                   # second pass: line also present in file1
    both[$0]++
    if ( --set1[$0] == 0 ) {
        delete set1[$0]
    }
    next
}
{                              # second pass: line only present in file2
    set2[$0]++
}
END {
    for ( str in both ) {
        printf "Both: %s (%d)\n", str, both[str]
    }
    for ( str in set1 ) {
        printf "Set1: %s (%d)\n", str, set1[str]
    }
    for ( str in set2 ) {
        printf "Set2: %s (%d)\n", str, set2[str]
    }
}
' file1 file2
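For example, with the SET1/SET2 contents above stored one element per line in file1 and file2, the output would be along these lines (the for-in loops print in no particular order):
Both: a (1)
Both: b (1)
Both: c (1)
Both: g (1)
Set1: d (1)
Set1: f (1)
Set2: e (1)
Set2: h (1)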
For simple line-oriented comparisons, the comm command might be all you need:
$ tail a.txt b.txt
==> a.txt <==
a
b
c
d
f
g
==> b.txt <==
a
b
c
e
g
h
$ comm -23 <(sort a.txt) <(sort b.txt)
d
f
$ comm -13 <(sort a.txt) <(sort b.txt)
e
h
Also, it's probably worth enabling the --unique flag on sort in order to remove duplicate lines:
comm -23 <(sort --unique a.txt) <(sort --unique b.txt)
If your fields are all on one line, separated by spaces, you may use a four-step process:
split
sort
compare
aggregate
tr " " "\n" file1 | sort > file1.tmp
tr " " "\n" file2 | sort > file2.tmp
diff file1.tmp file2.tmp | tee /tmp/result
and now, aggregate the results:
echo "file1: $(diff file1.tmp file2.tmp | egrep "^>" | sed 's/^> //' | tr "\n" " " )"
echo "file2: $(diff file1.tmp file2.tmp | egrep "^<" | sed 's/^< //' | tr "\n" " " )"
rm /tmp/result file1.tmp file2.tmp
Finally, to match your field format exactly, pipe the two echoes through | tr -s " " to squeeze repeated spaces.
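For instance, the first echo would then read:
echo "only in file1: $(diff file1.tmp file2.tmp | egrep "^<" | sed 's/^< //' | tr "\n" " ")" | tr -s " "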
Something like this works, but comm -23 is much nicer
#!/bin/bash
LINESX=$(cat "$1")
LINESY=$(cat "$2")
for LINEX in $LINESX
do
    match=0   # reset for each line of the first file
    for LINEY in $LINESY
    do
        if [ "$LINEX" = "$LINEY" ]
        then
            match=1
            break
        fi
    done
    # For intersection
    #if [ $match -eq 1 ]
    #then
    #    echo "$LINEX"
    #fi
    # For difference
    if [ $match -eq 0 ]
    then
        echo "$LINEX"
    fi
done
So basically, these are the two files I need to compare:
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the lines unique to file 1, I use the 'comm -23' command after running 'sort -u' on these 2 files. Additionally, I would like '11 a' and '123 a' in file 2 to be treated as subsets of '1 a' in file 1, and similarly '445 d' as a subset of '44 d'. These subsets are considered the same as their superset, so the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code:
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) > output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[#]}";do
awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there any way I could show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {                         # first file given (file2): remember its keys
    vals[$2][$1]                  # indexed by letter, then by number
    next
}
$2 in vals {                      # file1 line whose letter also occurs in file2
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) { # some file2 number starts with this file1 number
            next                  # so treat the line as matched and skip it
        }
    }
}
{ print FNR, $0 }                 # otherwise print it with its row number
$ awk -f tst.awk file2 file1
2 2 b
3 3 c
I have 2 text files:
1st file:
1 C
1 D
1 B
1 A
2nd file:
B
C
D
A
I want to sort the first file like this:
1 B
1 C
1 D
1 A
Can you help me with a bash script (or a command)?
I solved the sort problem (I eliminated the first column) and used this script:
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' ids.txt resultpartial.txt | sort -k 1,1n | cut -d' ' -f2-
Now I want to add the first column back (like before):
1 .....
What about simply ignoring the first file and doing this?
echo -n > result-file.txt # empty result file if already created
while read line; do
echo "1 $line" >> result-file.txt
done < file2.txt
This only makes sense if your files really follow this specific format.
Assuming that the "sort" field contains no duplicated values:
awk 'FNR==NR {line[$2] = $0; next} {print line[$1]}' file1 file2
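Applied to the two sample files above, this should print:
1 B
1 C
1 D
1 A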
I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.
Using paste seems like the obvious choice, but unfortunately you can't specify a multi-character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'
given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FN""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
Another solution using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
#printf '%s - %s\n' "$line1" "$line2"
printf '%s\n' $(($line1 - $line2))
done 3<f1.txt 4<f2.txt
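Assuming f1.txt and f2.txt hold the /tmp/a.txt and /tmp/b.txt contents shown above, the subtraction form prints:
-3
-3
-3
Uncomment the first printf instead to get the "1 - 4" style output.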
Given input file
z
b
a
f
g
a
b
...
I want to output the number of occurrences of each string, for example:
z 1
b 2
a 2
f 1
g 1
How can this be done in a bash script?
You can sort the input and pass to uniq -c:
$ sort input_file | uniq -c
2 a
2 b
1 f
1 g
1 z
If you want the numbers on the right, use awk to switch them:
$ sort input_file | uniq -c | awk '{print $2, $1}'
a 2
b 2
f 1
g 1
z 1
Alternatively, do the whole thing in awk:
$ awk '
{
++count[$1]
}
END {
for (word in count) {
print word, count[word]
}
}
' input_file
f 1
g 1
z 1
a 2
b 2
cat text | sort | uniq -c
should do the job
Try:
awk '{ freq[$1]++; } END{ for( c in freq ) { print c, freq[c] } }' test.txt
Where test.txt would be your input file.
Here's a bash-only version (requires bash version 4), using an associative array.
#! /bin/bash
declare -A count
while read -r val ; do
    count[$val]=$(( ${count[$val]} + 1 ))
done < your_input_file   # change this as needed
for key in "${!count[@]}" ; do
    echo "$key ${count[$key]}"
done
This might work for you:
cat -n file |
sort -k2,2 |
uniq -cf1 |
sort -k2,2n |
sed 's/^ *\([^ ]*\).*\t\(.*\)/\2 \1/'
This outputs the number of occurrences of each string, in the order in which the strings first appear.
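On the sample input above this should give exactly the requested output:
z 1
b 2
a 2
f 1
g 1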
You can use sort filename | uniq -c.
Have a look at the Wikipedia page on uniq.