Sort a file like another file - bash

I have 2 text files:
1st file:
1 C
1 D
1 B
1 A
2nd file:
B
C
D
A
I want to sort the first file like this:
1 B
1 C
1 D
1 A
Can you help me with a script in bash (or a command)?

I solved the sort problem (I eliminated the first column) and used this script:
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' ids.txt resultpartial.txt | sort -k 1,1n | cut -d' ' -f2-
Now I want to add the first column back (like before):
1 .....

What about simply ignoring the first file and doing this?
echo -n > result-file.txt # empty result file if already created
while read line; do
echo "1 $line" >> result-file.txt
done < file2.txt
That would make sense when your files' format really is this specific (a constant first column).

Assuming that the "sort" field contains no duplicated values:
awk 'FNR==NR {line[$2] = $0; next} {print line[$1]}' file1 file2
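With the question's two sample files saved as file1 (the "1 X" lines) and file2 (the desired order), this should print the reordered file with its first column intact:
1 B
1 C
1 D
1 A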

Related

Bash: replacing a column by another and using AWK to print specific order

I have a dummy file that looks like so:
a ID_1 S1 S2
b SNP1 1 0
c SNP2 2 1
d SNP3 1 0
I want to replace the contents of column 2 by the corresponding line number. My file would then look like so:
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
I can do this with the following command:
cut -f 1,3-4 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3}'
My question is, is there a better way of doing this? In particular, the real file I am working on has 2303 columns. Obviously I don't want to have to write:
cut -f 1,3-2303 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3,$4,$5 ETC}'
Is there a way to tell awk to print from column 2 to the last column without having to write all the names?
Thanks
I think this should do it:
$ awk '{$2=FNR} 1' file.txt
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
This changes the second column and prints the modified record. The default OFS is a single space, which is what you need here.
The above command is an idiomatic way to write
awk '{$2=FNR} {print $0}' file.txt
You can think of a simple awk program as awk 'cond1{action1} cond2{action2} ...'
action1 is executed only if cond1 evaluates to true, and so on. If the action portion is omitted, awk prints the input record by default. 1 is simply one way to write an always-true condition.
See the Idiomatic awk section of https://stackoverflow.com/tags/awk/info for more such idioms.
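For instance, using the question's file.txt, the cond{action} pairs and the always-true shortcut look like this:
awk 'NR == 1 {print "first: " $0} NR > 1 {print "rest: " $0}' file.txt
awk '1' file.txt    # always-true condition, prints every record unchanged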
The following awk may also help you here.
awk '{sub(/.*/,FNR,$2)} 1' Input_file
Output will be as follows.
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Explanation: this uses awk's sub function to substitute everything in $2 (the second field) with FNR, awk's built-in variable holding the current line number of the input file; the trailing 1 then prints the current line.

Subtracting row values from two different text files

I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.
Using paste seems like the obvious choice, but unfortunately you can't specify a multi-character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'
given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
Another solution using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
    #printf '%s - %s\n' "$line1" "$line2"
    printf '%s\n' "$((line1 - line2))"
done 3<f1.txt 4<f2.txt
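For example, with the numeric /tmp/a.txt and /tmp/b.txt shown above used as f1.txt and f2.txt, this should print:
-3
-3
-3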

Exclude a column when pasting two data files

I have one file "dat1.txt" which is like:
0 5.71159e-01
1 1.92632e-01
2 -4.73603e-01
and another file "dat2.txt" which is:
0 5.19105e-01
1 2.29702e-01
2 -3.05675e-01
To combine these two files into one I use
paste dat1.txt dat2.txt > data.txt
But I do not want the 1st column of the 2nd file in the output file. How do I modify the unix command?
If your files are in sorted order along column 1, you could try:
join dat[12].txt
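Both sample files are already sorted on column 1, so join merges them on that field and should print:
0 5.71159e-01 5.19105e-01
1 1.92632e-01 2.29702e-01
2 -4.73603e-01 -3.05675e-01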
You could try this in awk itself:
$ awk 'FNR==NR {a[FNR]=$0;next} {print a[FNR],$2}' dat1.txt dat2.txt
0 5.71159e-01 5.19105e-01
1 1.92632e-01 2.29702e-01
2 -4.73603e-01 -3.05675e-01
Use cut to remove the first column and then pipe to paste.
cut -d' ' -f 1 --complement dat2.txt | paste dat1.txt - > data.txt
Note that the - in the paste command means to read from stdin in place of the second file.
cut's --complement option is a GNU extension and isn't available in BSD/macOS cut; awk can do the same job.
awk '{for (i=2; i<=NF; i++) printf "%s%s", $i, (i<NF ? OFS : ORS)}' dat2.txt | paste dat1.txt - > data.txt
paste dat1.txt <(cut -d" " -f2- dat2.txt)
This uses cut to remove column 1 and process substitution to feed its output to paste.
Output:
0 5.71159e-01 5.19105e-01
1 1.92632e-01 2.29702e-01
2 -4.73603e-01 -3.05675e-01

How to find the difference between the values of two fields from two files and print only if there is a difference >10 using shell

Let's say I have two files a.txt and b.txt. The content of a.txt and b.txt is as follows:
a.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
These files have various fields separated by "|" and can have any number of lines. Also, assume that both files are sorted, so lines can be matched one-to-one between the two files. Now, I want to compare fields 8 and 9 of each pair of corresponding rows, and if either difference is greater than 10, print the lines; otherwise remove the lines from the files.
I.e., in the given example, for the first line I take |10-11| = 1 (field 8 from a.txt and b.txt) and similarly |0-0| = 0 for field 9; both differences are <10, so we delete this line from the files.
For the second line, the field 8 difference is |11-22| = 11, which is greater than 10, so we print this line (no need to check |19-18|, since it is enough for either field's difference to exceed 10).
So the output is
a.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
You can do this with awk:
awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt
Explanation
The -F sets the field separator to the pipe symbol. The stuff in curly braces after FNR==NR applies only to the processing of a.txt. It says to save the whole line in array x[] indexed by line number (FNR) and also to save the eighth field in array eight[] also indexed by line number. Likewise field 9 is saved in array nine[].
The second set of curly braces applies to processing file b. It calculates the differences d1 and d2. If either exceeds 10 in absolute value, the a.txt line is appended to newa and the b.txt line to newb.
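With the sample a.txt and b.txt, only the second pair of lines qualifies (|11 - 22| = 11 > 10), so each output file should end up holding a single line:
$ cat newa
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
$ cat newb
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00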
You can write a bash shell script that does it:
while true; do
read -r lineA <&3 || break
read -r lineB <&4 || break
vara_8=$(echo "$lineA" | cut -f8 -d "|")
varb_8=$(echo "$lineB" | cut -f8 -d "|")
vara_9=$(echo "$lineA" | cut -f9 -d "|")
varb_9=$(echo "$lineB" | cut -f9 -d "|")
if (( vara_8-varb_8 > 10 || vara_8-varb_8 < -10
|| vara_9-varb_9 > 10 || vara_9-varb_9 < -10 )); then
echo "$lineA" >> newA.txt
echo "$lineB" >> newB.txt
fi
done 3<a.txt 4<b.txt
For short files
Use the method provided by Mark Setchell. Seen below in an expanded and slightly modified version:
parse.awk
BEGIN { FS = "|" }

# absolute value helper (awk has no built-in abs)
function abs(v) { return v < 0 ? -v : v }

FNR==NR {
    x[FNR] = $0
    m[FNR] = $8
    n[FNR] = $9
    next
}

{
    if(abs(m[FNR] - $8) > 10 || abs(n[FNR] - $9) > 10) {
        print x[FNR] >> "newa"
        print $0 >> "newb"
    }
}
Run it like this:
awk -f parse.awk a.txt b.txt
For huge files
The method above reads a.txt into memory. If the file is very large, this becomes infeasible and streamed parsing is called for.
It can be done in a single pass, but that requires careful handling of the multiplexed lines from a.txt and b.txt (a rough sketch of this follows below). A less error-prone approach is to identify the relevant line numbers first and then extract those lines into new files; an example of that approach is shown after the sketch.
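A minimal single-pass sketch, assuming both files have exactly the same number of lines, that b.txt is readable from the current directory, and that the newa/newb output names from above are reused; the getline-from-file pattern keeps only one line of each file in memory at a time:
awk -F'|' '{
    # read the matching line from b.txt; stop if it runs out
    if ((getline bline < "b.txt") <= 0) exit
    split(bline, b, "|")
    d1 = $8 - b[8]; d2 = $9 - b[9]
    if (d1 > 10 || d1 < -10 || d2 > 10 || d2 < -10) {
        print $0    >> "newa"
        print bline >> "newb"
    }
}' a.txt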
First you need to identify the matching lines:
# Extract fields 8 and 9 from a.txt and b.txt
paste <(awk -F'|' '{print $8, $9}' OFS='\t' a.txt) \
<(awk -F'|' '{print $8, $9}' OFS='\t' b.txt) |
# Check if the fields match the criteria and print the line number
awk '$1 - $3 > n || $3 - $1 > n || $2 - $4 > n || $4 - $2 > n { print NR }' n=10 > linesfile
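For the sample a.txt and b.txt, only the second pair of lines meets the criterion, so linesfile should contain a single line number:
$ cat linesfile
2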
Now we are ready to extract the lines from a.txt and b.txt, and as the numbers are sorted, we can use the extract.awk script proposed here (repeated for convenience below):
extract.awk
BEGIN {
    getline n < linesfile
    if(length(ERRNO)) {
        print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}

NR == n {
    print
    if(!(getline n < linesfile)) {
        if(length(ERRNO))
            print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}
Extract the lines (can be run in parallel):
awk -v linesfile=linesfile -f extract.awk a.txt > newa
awk -v linesfile=linesfile -f extract.awk b.txt > newb

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on whether the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next statement then skips to the next input record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
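To illustrate the swap: with file1 as the dictionary, each name in file2 is looked up in file1 instead, so for the sample data this should print:
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file1 file2
Allen 0
Alex 1
Aaron 0
ralph 0
robin 1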
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file2 file1 | sed $'s/$/\t1/'
grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while IFS= read -r; do
    if [[ $REPLY = $'\t'* ]] ; then
        # tab-prefixed lines come from comm's "common" column: present in both files
        printf "%s\t1\n" "${REPLY#?}"
    else
        # unprefixed lines are unique to file1: not present in file2
        printf "%s\t0\n" "${REPLY}"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed.
If you're familiar with Python, this only needs a bit of output formatting.
#!/usr/bin/env python
f1 = open('file1').read().splitlines()
f2 = open('file2').read().splitlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print('%s\t%d' % (n, c))
