Re-arrange files starting from a desired file with an awk one-liner - sorting

Let's assume we have these files, in their regular sorted order, in a unix/linux folder:
A.txt
B.txt
C.txt
D.txt
a.txt
b.txt
c.txt
Running the command with D.txt must sort them as:
D.txt -> begin with selected file
a.txt
b.txt
c.txt
A.txt -> append remaining files in order
B.txt
C.txt
Running the command with b.txt must sort them as:
b.txt
c.txt
A.txt
B.txt
C.txt
D.txt
a.txt
I tried the script below, but awk doesn't seem to re-arrange the output the way I want; the command always prints the regular sorted order.
#!/bin/sh
export fname="$1"
find . -maxdepth 1 -type f | sort | awk -v ref="./$fname" '($0 > ref) {print $0} ($0 < ref) {print $0}'

Buffer lines until ref is seen, print ref and successive lines as they arrive, and print the buffer at the end:
awk -v ref="./$fname" '$0==ref{f=1} f{print;next} {buf=buf $0 ORS} END{printf "%s",buf}'
Demo:
$ seq 5 | awk -v ref=3 '$0==ref{f=1} f{print;next} {buf=buf $0 ORS} END{printf "%s",buf}'
3
4
5
1
2
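For the original problem, the same trick plugs straight into the find | sort pipeline; a sketch, keeping the $fname variable from the question:
#!/bin/sh
fname="$1"
find . -maxdepth 1 -type f | sort |
awk -v ref="./$fname" '$0==ref{f=1} f{print;next} {buf=buf $0 ORS} END{printf "%s",buf}'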

Related

How to get files in directory A but not B and vice versa using bash comm?

I'm trying to use comm to get the files in folder A that are not in B, and vice versa:
comm -3 <(find /Users/rob/A -type f -exec basename {} ';' | sort) <(find "/Users/rob/B" -type f -exec basename {} ';' | sort)
I'm using basename {} ';' to exclude the directory path, but this is the output I get:
IMG_5591.JPG
IMG_5591.jpeg
IMG_5592.JPG
IMG_5592.jpeg
IMG_5593.JPG
IMG_5593.jpeg
IMG_5594.JPG
IMG_5594.jpeg
There's a tab in the names of the files from the first directory, therefore all entries are considered different. What am I doing wrong?
The leading tabs are not being generated by the find|basename code; they are being generated by comm.
comm generates 1 to 3 columns of output depending on the flags; the 2nd column has a leading tab while the 3rd column has 2 leading tabs.
In this case the OP's code suppresses column #3 (-3, the lines common to the 2 sources), so comm generates 2 columns of output, with the 2nd column having a leading tab.
One easy fix:
comm --output-delimiter="" <(find...|sort...) <(find...|sort...)
If for some reason your comm does not support the --output-delimiter flag:
comm <(find...|sort...) <(find...|sort...) | tr -d '\t'
This assumes the file names do not include embedded tabs; otherwise replace the tr with your favorite code to strip leading white space, e.g.:
comm <(find...|sort...) <(find...|sort...) | sed 's/^[[:space:]]*//'
Demo ...
$ cat file1
a.txt
b.txt
$ cat file2
b.txt
c.txt
$ comm file1 file2
a.txt
		b.txt
	c.txt
# 2x tabs (\t) before 'b.txt' (3rd column), 1x tab (\t) before 'c.txt' (2nd column):
$ comm file1 file2 | od -c
0000000 a . t x t \n \t \t b . t x t \n \t c
0000020 . t x t \n
# OP's scenario:
$ comm -3 file1 file2
a.txt
	c.txt
# 1x tab (\t) before 'c.txt' (2nd column):
$ comm -3 file1 file2 | od -c
0000000 a . t x t \n \t c . t x t \n
Removing the leading tabs:
$ comm --output-delimiter="" -3 file1 file2
a.txt
c.txt
$ comm -3 file1 file2 | tr -d '\t'
a.txt
c.txt
$ comm -3 file1 file2 | sed 's/^[[:space:]]*//'
a.txt
c.txt
If basename causes issues, you can use find's -printf:
#!/bin/bash
find_basename() {
    find "$1" -type f -printf "%P\n" | sort
}
comm -3 <(find_basename /Users/rob/A) <(find_basename /Users/rob/B)
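For reference, -printf "%P\n" prints each path relative to the starting point, so files in subdirectories keep their relative path instead of colliding the way bare basenames can. A quick sketch with made-up names:
$ find /Users/rob/A -type f -printf "%P\n" | sort
IMG_5591.JPG
IMG_5592.jpeg
sub/IMG_5593.JPG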

copying columns from different files into a single file using awk

I have more than 500 files, each with two columns, "Gene short name" and "FPKM". The number of rows is the same, and the "Gene short name" column is common to all the files. I want to create a matrix whose first column is the gene short name (it can be taken from any of the files) and whose remaining columns hold the FPKM values.
I have used this command, which works well, but how can I use it for 500 files?
paste -d' ' <(awk -F'\t' '{print $1}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 72_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 75_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 78_genes.fpkm.txt) > col.txt
sample data (files are tab separated):
head 69_genes.fpkm.txt
gene_short_name FPKM
DDX11L1 0.196141
MIR1302-2HG 0.532631
MIR1302-2 0
WASH7P 4.51437
Expected outcome
gene_short_name FPKM FPKM FPKM FPKM
DDX11L1 0.196141 0.206591 0.0201256 0.363618
MIR1302-2HG 0.532631 0.0930007 0.0775838 0
MIR1302-2 0 0 0 0
WASH7P 4.51437 3.31073 3.23326 1.05673
MIR6859-1 0 0 0 0
FAM138A 0.505155 0.121703 0.105235 0
OR4G4P 0.0536387 0 0 0
OR4G11P 0 0 0 0
OR4F5 0.0390888 0.0586067 0 0
Also, I want to change the name "FPKM" to "filename_FPKM".
Given the input
$ cat a.txt
a 1
b 2
c 3
$ cat b.txt
a I
b II
c III
$ cat c.txt
a one
b two
c three
you can loop:
cut -f1 a.txt > result.txt
for f in a.txt b.txt c.txt
do
    cut -f2 "$f" | paste result.txt - > tmp.txt
    mv {tmp,result}.txt
done
$ cat result.txt
a 1 I one
b 2 II two
c 3 III three
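For the original 500-file case, the same loop can be driven by a glob instead of a hard-coded list, and a sed on line 1 handles the requested FPKM -> filename_FPKM rename. A sketch, assuming tab-separated files matching *_genes.fpkm.txt in the current directory:
cut -f1 69_genes.fpkm.txt > result.txt   # gene_short_name column from any one file
for f in *_genes.fpkm.txt
do
    # prefix the FPKM header with the file name, then paste the column on
    cut -f2 "$f" | sed "1s/^/${f}_/" | paste result.txt - > tmp.txt
    mv tmp.txt result.txt
done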
In awk, using #Micha's data for clarity:
$ awk '
BEGIN { FS=OFS="\t" }        # set the field separators
FNR==1 {
    $2=FILENAME "_" $2       # on the first record of each file, rename $2
}
NR==FNR {                    # process the first file
    a[FNR]=$0                # hash the whole record into a
    next
}
{                            # process the other files
    a[FNR]=a[FNR] OFS $2     # append $2 to the end of the record
}
END {                        # in the end
    for(i=1;i<=FNR;i++)      # print all records
        print a[i]
}' a.txt b.txt c.txt
Output:
a a.txt_1 b.txt_I c.txt_one
b 2 II two
c 3 III three
(The sample files have no header row, so the rename hits the first data line; with the real data the first row holds the headers, and $2 becomes e.g. 69_genes.fpkm.txt_FPKM, as requested.)
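The same awk needs no changes for the real 500-file case; the shell glob supplies all the files (assuming the *_genes.fpkm.txt naming from the question):
awk 'BEGIN{FS=OFS="\t"} FNR==1{$2=FILENAME "_" $2} NR==FNR{a[FNR]=$0; next} {a[FNR]=a[FNR] OFS $2} END{for(i=1;i<=FNR;i++) print a[i]}' *_genes.fpkm.txt > col.txt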

Comparing file content in two different directories

I have four files in two directories: 1.txt and 2.txt in one directory, and 3.txt and 4.txt in another. I want to compare the first pattern starting with the word "Query" in these text files and match up the files across the two directories.
How can I do it?
Example:
1.txt
ABC
Query : JKLTER
2.txt
ABC
Query : PCA
3.txt
Query :JKLTER
XYSH
Query : ABC
4.txt
GFHHH
Using the command, I could derive these two files from the directories based just on the first matched pattern (starting with Query).
Output:
Matched files : 1.txt 3.txt
I have something that is hopefully close enough - else you can diddle around with it a bit to get it closer.
So, if you use GNU awk to find the first line containing the word Query in all the files in a directory and then print the last word on that line and the name of the current file, you will get this for your first directory d1:
awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d1/*txt
JKLTER d1/1.txt
PCA d1/2.txt
And this for the second directory d2:
awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d2/*txt
JKLTER d2/3.txt
You can then pass the output of each of those commands to join to have it join lines wherein the first field matches:
join <(awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d1/*txt) <(awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d2/*txt)
Output
JKLTER d1/1.txt d2/3.txt
You can get rid of the leading directory by changing into each directory before running awk:
join <(cd d1; awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' *txt) <(cd d2; awk -F'[ :]*' '/Query/{print $NF,FILENAME;nextfile}' *txt)
Output
JKLTER 1.txt 3.txt
You can get rid of the common field used by join like this:
join <(...) <(...) | awk '{$1="";print}'
Output
1.txt 3.txt
If you only have text files and nothing else in each subdirectory, and there are actually spaces after the colon following the word Query, my solution can be simplified to:
join <(cd d1; awk '/Query/{print $NF,FILENAME; nextfile}' *) <(cd d2; awk '/Query/{print $NF,FILENAME;nextfile}' *) | awk '{print $2,"matches",$3}'
Output
1.txt matches 3.txt
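One caveat: join expects both inputs to be sorted on the join field. That happens to hold for the sample data, but the awk output comes out in directory order, not sorted order, so with more files it is safer to sort each side first; a sketch:
join <(cd d1; awk '/Query/{print $NF,FILENAME; nextfile}' * | sort) \
     <(cd d2; awk '/Query/{print $NF,FILENAME; nextfile}' * | sort) |
awk '{print $2,"matches",$3}'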

how to sort a file with multi-character delimiter using linux shell sort?

I have a file whose delimiter is "|||".
abc|||123|||999|||5|||Just for you | Jim|||20
cef|||7|||210|||6|||Go away | R&B|||30
mmm|||89|||320|||16|||Traveling Light|George Winston|||21
The delimiter "|||" can't be replaced with "|" or "||", because the data itself may contain "|" or "||".
Could someone tell me how to sort column 2 with delimiter "|||" ?
The following method fails:
sort -t$'|||' -nrk2 a.txt > b.txt
sort: multi-character tab `|||'
Thank you!
You could protect the single | characters in the data, change the ||| delimiter to |, sort with sort, and then change everything back:
# change | to __BAR__ writing the result to b.txt
sed 's#\([^|]\)|\([^|]\)#\1__BAR__\2#g' a.txt > b.txt
# change ||| to | in b.txt
sed -i 's#|||#|#g' b.txt
# do the sorting with the | delimiter, writing the result to c.txt
sort -t'|' -nr -k2,2 -k3,3 -k4,4 b.txt > c.txt
# change everything back in c.txt:
# | to |||
sed -i 's#|#|||#g' c.txt
# __BAR__ to |
sed -i 's#__BAR__#|#g' c.txt
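An alternative sketch that avoids the round-trip: map the three-pipe delimiter to a single byte the data cannot contain (here \x01, which is an assumption about the input; \xHH escapes are a GNU sed extension), sort on that, and map it back:
sed 's/|||/\x01/g' a.txt | sort -t$'\x01' -nr -k2,2 | sed 's/\x01/|||/g' > b.txt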

how to find matching records from 3 different files in unix

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only matching records in these 3 files.
so the matching record in the above example will be JJTP#yahoo.com
The output should be
JJTP#yahoo.com
If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com
Here is an awk that mixes jaypal's and sudo_o's solutions.
It will not give false positives, since it tests for uniqueness of the lines within each file:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have an unknown number of files, this could be an option:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
ARGC stores the number of files read by awk, plus 1.
comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
Here is how you'd do it with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com
To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if they do, you may need to create de-duplicated temporary files first.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".
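A fixed-string variant of the same chaining idea, assuming GNU grep and, like the sort method, no duplicate lines within a file (-x matches whole lines, -F disables regex interpretation, -f - reads the patterns from stdin):
grep -xFf Test1.txt Test2.txt | grep -xFf - Test3.txt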
