Subtracting row-values from two different text files - bash

I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.

Using paste seems like the obvious choice, but unfortunately you can't specify a multi-character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
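One caveat: since the entries are decimals, they may be negative, and sed's s/-/ - / would then hit the first minus sign instead of paste's delimiter. Picking a delimiter that cannot occur in the data avoids this (a sketch, assuming '@' never appears in either file and the first rows are -1.5 and 2.25):
$ paste -d'@' file1 file2 | sed 's/@/ - /'
-1.5 - 2.25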
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'

given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FN""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
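The FNR==NR condition is the standard two-file idiom: NR counts records across all inputs while FNR resets for each file, so the pattern is true only while the first file is being read. Here is the arithmetic version spelled out with comments (same program, just expanded):
$ awk '
    FNR == NR {            # still reading the first file
        a[FNR] = $0        # store each line, keyed by line number
        next               # skip the second block
    }
    {                      # now reading the second file
        print a[FNR] + $1  # pair up line N of each file and add
    }' /tmp/a.txt /tmp/b.txt
5
7
9
(The FNR"" in the one-liners simply forces a string subscript; plain FNR works just as well, since awk array subscripts are strings anyway.)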

Another solution using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
    #printf '%s - %s\n' "$line1" "$line2"
    printf '%s\n' "$((line1 - line2))"
done 3<f1.txt 4<f2.txt
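The 3< and 4< redirections attach each file to its own descriptor, so a single loop can read both files in lockstep. For example, assuming f1.txt contains 4 5 6 and f2.txt contains 1 2 3 (one number per line, hypothetical contents), the loop prints:
3
3
3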

Related

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the unique lines in file 1, I use 'comm -23' after running both files through 'sort -u'. Additionally, I would like '11 a' and '123 a' in file 2 to be treated as subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. These subsets are considered the same as their superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow, so here is my code:
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) > output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[#]}";do
awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there any way I could show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {                           # first input (file2): index its lines
    vals[$2][$1]                    # letter -> set of numbers
    next
}
$2 in vals {                        # file1 line whose letter appears in file2
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {   # a file2 number starts with this number
            next                    # subset found: suppress the line
        }
    }
}
{ print FNR, $0 }                   # unique line: print with its row number
$ awk -f tst.awk file2 file1
2 2 b
3 3 c
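The subset test relies on index(), which returns the 1-based position of a substring (0 if absent), so index(i,$1) == 1 is true exactly when the file2 number starts with the file1 number. A quick check:
$ awk 'BEGIN { print index("11","1"), index("123","1"), index("445","44"), index("3","2") }'
1 1 1 0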

How to print lines with the specified word in the path?

Let's say I have file abc.txt which contains the following lines:
a b c /some/path/123/path/120
a c b /some/path/312/path/098
a p t /some/path/123/path/321
a b c /some/path/098/path/123
and numbers.txt:
123
321
123
098
I want to print the whole line that contains "123" only in the third place, under "/some/path/123/path". I don't want to print the line "a c b /some/path/312/path" or "a b c /some/path/098/path/123/". I want to save all lines with "123" in the third place to a new file.
I tried several methods, and the best way seems to be awk. Here is my example code, which is not working correctly:
for i in `cat numbers.txt | xargs`
do
cat abc.txt | awk -v i=$i '$4 ~ /i/ {print $0}' > ${i}_number.txt;
done
because it also catches, for example, "a b c /some/path/098/path/123/".
Example:
For number "123" I want to save only one line from abc.txt in 123_number.txt:
a b c /some/path/123/path/120
For number "312" I want to save only one line from abc.txt in 312_number.txt:
a c b /some/path/312/path/098
This can be accomplished in a single awk call:
$ awk -F'/' 'NR==FNR{a[$0];next} ($4 in a){f=$4"_number.txt";print >>f;close(f)}' numbers.txt abc.txt
$ cat 098_number.txt
a b c /some/path/098/path/123
$ cat 123_number.txt
a b c /some/path/123/path/120
a p t /some/path/123/path/321
Keep the numbers in an array and use it to match lines, appending matching lines to the corresponding files.
If your files are huge, you may speed up the process using sort:
sort -t'/' -k4 abc.txt | awk -F'/' 'NR==FNR{a[$0];next} ($4 in a){if($4!=p){close(f);f=(p=$4)"_number.txt"};print >>f}' numbers.txt -
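Spelled out with comments (the same program; the trailing - tells awk to read the sorted stream from stdin):
sort -t'/' -k4 abc.txt |
awk -F'/' '
    NR == FNR { a[$0]; next }      # first input: remember every number
    ($4 in a) {                    # third path component is a wanted number
        if ($4 != p) {             # key changed in the sorted stream:
            close(f)               # close the previous output file
            f = (p = $4) "_number.txt"
        }
        print >> f                 # at most one file open at a time
    }' numbers.txt -
Sorting groups all lines with the same number together, so the close() after every print in the unsorted version becomes one close() per number.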

Adding column values from multiple different files

I have ~100 files and would like to perform an arithmetic operation on the second column of each (e.g. sum them up), adding the first-row value of one file to the first-row value of the next, and so on for every row of column 2.
In my actual files I have ~30 000 rows so any kind of manual manipulation with the rows is not possible.
fileA
1 1
2 100
3 1000
4 15000
fileB
1 7
2 500
3 6000
4 20000
fileC
1 4
2 300
3 8000
4 70000
output:
1 12
2 900
3 15000
4 105000
I used the script below and ran it as script.sh listofnames.txt (all the files have the same name but live in different directories, so I refer to each one through $line, read from the list of directory names). This gives me a syntax error, and I am looking for another way to define the "sum".
while IFS='' read -r line || [[ -n "$line" ]]; do
awk '{"'$sum'"+=$3; print $1,$2,"'$sum'"}' ../$line/file.txt >> output.txt
echo $sum
done < "$1"
$ paste fileA fileB fileC | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
1 12
2 900
3 15000
4 105000
or if you wanted to do it all in awk:
$ awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' fileA fileB fileC
1 12
2 900
3 15000
4 105000
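The same program expanded with comments (note that in the END block FNR still holds the line count of the last file read, so this assumes all files have the same number of rows):
awk '
    { key[FNR] = $1       # remember the row label from column 1
      sum[FNR] += $2 }    # accumulate column 2 across all files
    END {
        for (i = 1; i <= FNR; i++)
            print key[i], sum[i]
    }' fileA fileB fileC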
If you have a list of directories in a file named "foo", and the file you're interested in within every directory is named "bar", then you can do:
IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
cmd "${files[#]}"
where cmd is awk or paste or anything else you want to run on those files. Look:
$ cat foo
abc
def
ghi klm
$ IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
$ awk 'BEGIN{ for (i=1;i<ARGC;i++) print "<" ARGV[i] ">"; exit}' "${files[@]}"
<abc/bar>
<def/bar>
<ghi klm/bar>
So if your files are all named file.txt and your directory names are stored in listofnames.txt then your script would be:
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )
followed by whichever of these you prefer:
paste "${files[#]}" | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' "${files[#]}"

Sort a file like another file

I have 2 text files:
1st file:
1 C
1 D
1 B
1 A
2nd file:
B
C
D
A
I want to sort the first file like this:
1 B
1 C
1 D
1 A
Can you help me with a bash script (or a command)?
I solved the sort problem (I eliminated the first column) and used this script:
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' ids.txt resultpartial.txt | sort -k 1,1n | cut -d' ' -f2-
Now I want to add the first column back, as before:
1 .....
What about just ignoring the first file and doing this?
echo -n > result-file.txt # empty result file if already created
while read -r line; do
    echo "1 $line" >> result-file.txt
done < file2.txt
This makes sense as long as your files really follow this specific format.
Assuming that the "sort" field contains no duplicated values:
awk 'FNR==NR {line[$2] = $0; next} {print line[$1]}' file1 file2
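Applied to the two files above:
$ awk 'FNR==NR {line[$2] = $0; next} {print line[$1]}' file1 file2
1 B
1 C
1 D
1 A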

Merge columns cut & cat

I have file.txt with 3 columns.
1 A B
2 C D
3 E F
I want to join column 1 with column 2, and then append column 1 joined with column 3 at the end. The result should look like this:
1A
2C
3E
1B
2D
3F
I am doing this by
cut -f 1,2 file.txt > tmp1
cut -f 1,3 file.txt > tmp2
cat *tmp * > final_file
But I am getting repeated lines! If I check the final output with:
cat * | sort | uniq -d
there are plenty of repeated lines and there are none in the primary file.
Can anyone suggest other way of doing this? I believe the one I am trying to use is too complex and that's why I am getting such a weird output.
pzanoni@vicky:/tmp$ cat file.txt
1 A B
2 C D
3 E F
pzanoni@vicky:/tmp$ cut -d' ' -f1,2 file.txt > result
pzanoni@vicky:/tmp$ cut -d' ' -f1,3 file.txt >> result
pzanoni@vicky:/tmp$ cat result
1 A
2 C
3 E
1 B
2 D
3 F
I'm using bash
This preserves the order with one pass through the file:
awk '
{print $1 $2; pass2 = pass2 sep $1 $3; sep = "\n"}
END {print pass2}
' file.txt
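Applied to the sample file:
$ awk '{print $1 $2; pass2 = pass2 sep $1 $3; sep = "\n"} END {print pass2}' file.txt
1A
2C
3E
1B
2D
3F
The column-2 pairs are printed immediately; the column-3 pairs are buffered in pass2 and emitted once at the END.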
The reason this (cat tmp* * > final_file) is wrong:
I assume *tmp was a typo
I assume as this point the directory only contains "tmp1" and "tmp2"
Look at how those wildcards will be expanded:
tmp* expands to "tmp1" and "tmp2"
* also expands to "tmp1" and "tmp2"
So your command line becomes cat tmp1 tmp2 tmp1 tmp2 > final_file and hence you get all the duplicated lines.
awk '{print $1 $2 "\n" $1 $3}' file.txt
