I have the following files:
A-111.txt
A-311.txt
B-111.txt
B-311.txt
C-111.txt
C-312.txt
D-112.txt
D-311.txt
I want to merge lines of files with the same basename (same letter before the dash) when there is a match in column 4. I have many files, so I want to do it in a loop.
So far I have this:
for f1 in *-1**.txt; do f2="${f1/-1/-3}"; awk -F"\t" 'BEGIN { OFS=FS } NR==FNR { a[$4,$4]=$0 ; next } ($4,$4) in a { print a[$4,$4],$0 }' "$f1" "$f2" > $f1_merged.txt; done
It works for files A and B as intended, but not for files C and D.
Can someone help me improve the code, please?
EDIT - here's the above code formatted legibly:
for f1 in *-1**.txt; do
f2="${f1/-1/-3}"
awk -F"\t" '
BEGIN {
OFS = FS
}
NR == FNR {
a[$4, $4] = $0
next
}
($4, $4) in a {
print a[$4, $4], $0
}
' "$f1" "$f2" > $f1_merged.txt
done
EDIT - after Ed Morton kindly formatted my code, the error is:
awk: cmd. line:7: fatal: cannot open file 'C-311.txt' for reading (No such file or directory)
awk: cmd. line:7: fatal: cannot open file 'D-312.txt' for reading (No such file or directory)
EDIT - all lines, not only the first one, should be compared
Input file A-111.txt
ID      Chr  bp        db_SNP     REF  ALT
A-111   1    4367323   rs1490413  G    A
A-111   1    12070292  rs730123   G    A
A-111   22   47836412  rs2040411  G    A
A-111   22   49876931  rs4605480  T    C
Input file A-311.txt
ID      Chr  bp        db_SNP      REF  ALT
A-311   Y    17053771  rs17269816  C    T
A-311   Y    22665262  rs2196155   A    G
A-311   1    4367323   rs1490413   G    A
A-311   1    12070292  rs730123    G    A
Desired output file
ID      Chr  bp        db_SNP     REF  ALT  ID      Chr  bp        db_SNP     REF  ALT
A-111   1    4367323   rs1490413  G    A    A-311   1    4367323   rs1490413  G    A
A-111   1    12070292  rs730123   G    A    A-311   1    12070292  rs730123   G    A
Would you please try the following:
#!/bin/bash
prefix="ref_" # prefix to declare array variable names
declare -A bases # array to count files for the basename
for f in *-[0-9]*.txt; do # loop over the target files
base=${f%%-*} # extract the basename
declare -n ref="$prefix$base" # indirect reference to an array named "$base"
ref+=("$f") # create a list of filenames for the basename
(( bases[$base]++ )) # count the number of files for the basename
done
for base in "${!bases[@]}"; do # loop over the basenames
if (( ${bases[$base]} == 2 )); then # check if the number of files are two
declare -n ref="$prefix$base" # indirect reference
awk -F'\t' -v OFS='\t' '
NR==FNR { # read 1st file
f0[$4] = $0 # store the record keyed by $4
next
}
$4 in f0 { # read 2nd file and check if f0[$f4] is defined
print f0[$4], $0 # if match, merge the records and print
}' "${ref[0]}" "${ref[1]}" > "${base}_merged.txt"
fi
done
First extract the basenames such as "A", "B", ..., then create a list
of associated filenames. For instance, the array "A" will be assigned
('A-111.txt' 'A-311.txt'). At the same time, the array bases counts
the files for each basename.
Then loop over the basenames, make sure the number of associated files
is two, and compare the 4th columns of the files. If they match, concatenate
the lines to generate a new file.
The awk script searches the 4th field across the lines; on a match, it pastes the corresponding lines of the two files together.
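As a quick sanity check, the grouping logic alone can be run on a scratch copy of the files (the filenames created below are just examples):

```shell
# bash (needed for associative arrays); run in a scratch directory
cd "$(mktemp -d)" || exit 1
touch A-111.txt A-311.txt C-111.txt C-312.txt   # example filenames

declare -A bases                 # file count per basename
for f in *-[0-9]*.txt; do
  base=${f%%-*}                  # strip everything from the first dash on
  (( bases[$base]++ ))
done

for base in "${!bases[@]}"; do
  printf '%s: %d files\n' "$base" "${bases[$base]}"
done
```

Each basename should report exactly two files before the awk merge step is attempted.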
Related
Suppose I have a file A that contains the column numbers that need to be removed (I really have over 500 columns in my input file fileB),
fileA:
2
5
And I want to remove those columns(2 and 5) from fileB:
a b c d e f
g h i j k l
in Linux to get:
a c d f
g i j l
What should I do? I found out that I could eliminate printing those columns with the code:
awk '{$2=$5="";print $0}' fileB
however, there are two problems with this approach: first, it does not really remove the columns, it just replaces them with empty strings; second, instead of manually typing in the column numbers, how can I read them from another file?
Original Question:
Suppose I have a file A contains the column numbers need to be removed,
file A:
223
345
346
567
And I want to remove those columns (223, 345, 346 and 567) from file B in Linux. What should I do?
If your cut have the --complement option then you can do:
cut --complement -d ' ' -f "$(echo $(<FileA))" fileB
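POSIX allows cut's -f list to be blank- or comma-separated, but if you prefer an explicitly comma-separated list, paste -sd, can build it from fileA. A quick run on the sample data (GNU cut assumed, since --complement is a GNU extension):

```shell
cd "$(mktemp -d)" || exit 1
printf '2\n5\n' > fileA                      # column numbers to drop
printf 'a b c d e f\ng h i j k l\n' > fileB

# paste -sd, joins fileA's lines with commas: "2,5"
cut --complement -d ' ' -f "$(paste -sd, fileA)" fileB
# prints:
# a c d f
# g i j l
```

This also sidesteps any word-splitting concerns with the `$(echo $(<FileA))` trick when fileA is large.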
$ cat tst.awk
NR==FNR {
badFldNrs[$1]
next
}
FNR == 1 {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
if ( !(inFldNr in badFldNrs) ) {
out2in[++numOutFlds] = inFldNr
}
}
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = out2in[outFldNr]
printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
}
}
$ awk -f tst.awk fileA fileB
a c d f
g i j l
One awk idea:
awk '
FNR==NR { skip[$1] ; next } # store field #s to be skipped
{ line="" # initialize output variable
pfx="" # first prefix will be ""
for (i=1;i<=NF;i++) # loop through the fields in this input line ...
if ( !(i in skip) ) { # if field # not mentioned in the skip[] array then ...
line=line pfx $i # add to our output variable
pfx=OFS # prefix = OFS for 2nd-nth fields to be added to output variable
}
if ( pfx == OFS ) # if we have something to print ...
print line # print output variable to stdout
}
' fileA fileB
NOTE: OP hasn't provided the input/output field delimiters; OP can add the appropriate FS/OFS assignments as needed
This generates:
a c d f
g i j l
Using awk
$ awk 'NR==FNR {col[$1]=$1;next} {for(i=1;i<=NF;++i) if (i != col[i]) printf("%s ", $i); printf("\n")}' fileA fileB
a c d f
g i j l
I am working on bash script that loops multi-column data filles and executes integrated AWK code to operate with the multi-column data.
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
while read -r d; do
awk -F ", *" ' # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
if (n) { # calculate the results of previous file
f= # apply this equation to rescore data using values of $3 and $2
f[suffix] = f # store the results in the array
n=$1 # take ID of the column
}
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix)
sub(/^.*_/, "", suffix)
n = 1 # count of samples
min = 0 # lowest value of $3 (assuming all $3 < 0)
}
FNR > 1 {
s += $3
s2 += $3 * $3
++n
if ($3 < min) min = $3 # update the lowest value
}
print "ID" prefix, rescoring
for (i in n)
printf "%s %.2f\n", i, f[i]
}' "${d}_"*/input.csv > "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
Briefly, the workflow should process each line of the input.csv located inside the ${d} folder that has been correctly identified by my bash script:
# input.csv located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # this is dG(min)
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
My AWK script is expected to reduce each line of each CSV file to two columns, keeping in the output: i) the number from the first column of the input.csv (the ID of the processed line) plus the name of the folder ($d) containing the CSV file, and ii) the result of the math operation (f) applied to the numbers in the POP and dG columns of the input.csv:
f(ID) = sqrt(((dG(ID)+10)/10)^2 + ((POP(ID)-240)/240)^2)
where dG(ID) is the dG value ($3) of the "rescored" line of input.csv, and POP(ID) is its POP value ($2). Eventually, the output.csv containing the information for one input.csv should be in the following format:
# output.csv
ID, rescore value
1 10V1_cne_lig12, f(ID1)
2 10V1_cne_lig12, f(ID2)
3 10V1_cne_lig12, f(ID3)
4 10V1_cne_lig12, f(ID4)
While the bash part of my code (dealing with looping over the CSVs in the distinct directories) works correctly, I am stuck with the AWK code, which does not correctly assign the ID of the lines, so I cannot apply the demonstrated math operations to the $2 and $3 columns of the line with a given ID.
given the input file: folder/file
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
this script
$ awk -F', *' -v OFS=', ' '
FNR==1 {path=FILENAME; sub(/\/[^/]+$/,"",path); print $1,"rescore value"; next}
{print $1" "path, sqrt((($3+10)/10)^2+(($2-240)/240)^2)}' folder/file
will produce
ID, rescore value
1 folder, 0.596625
2 folder, 1.05873
3 folder, 1.11285
4 folder, 0.697402
Not sure what the rest of your code does, but I guess you can integrate it in.
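The formula itself is easy to verify in isolation; plugging the first sample row (POP=142, dG=-5.65) into the expression reproduces the 0.596625 shown above:

```shell
awk 'BEGIN {
  dG = -5.65; POP = 142                    # first data row of input.csv
  printf "%.6f\n", sqrt(((dG+10)/10)^2 + ((POP-240)/240)^2)
}'
# prints 0.596625
```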
What I want:
Extract the common lines from n large files.
Append the original line numbers of each file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should be a different file
1 3 2 apple
1, 3 and 2 in the output are the original line numbers of File1.txt, File2.txt and File3.txt where the common line apple exists
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:3:apple
File3.txt:2:apple
Associate each unique line with a space-separated list of line numbers indicating where it is seen in each file, and print these next to each other at the end if the line is found in all three files.
awk '{
n[$0] = n[$0] FNR OFS
c[$0]++
}
END {
for (r in c)
if (c[r] == 3)
print n[r] r
}' file1 file2 file3
If the number of files is unknown, refer to Ravinder's answer, or just change the hardcoded 3 in the END block with ARGC-1 as shown there.
GNU awk specific approach that works with any number of files:
#!/usr/bin/gawk -f
BEGINFILE {
nfiles++
}
{
lines[$0][nfiles] = FNR
}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (line in lines) {
if (length(lines[line]) == nfiles) {
for (file = 1; file <= nfiles; file++)
printf "%d\t", lines[line][file]
print line
}
}
}
Example:
$ ./showlines file[123].txt
1 3 2 apple
Could you please try the following, written and tested with GNU awk. It makes use of the ARGC value, which gives us the total number of elements passed to the awk program.
awk '
{
a[$0]=(a[$0]?a[$0] OFS:"")FNR
count[$0]++
}
END{
for(i in count){
if(count[i]==(ARGC-1)){
print i,a[i]
}
}
}
' file1.txt file2.txt file3.txt
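Note that ARGC counts the program text itself as one element, hence the ARGC-1 above; a one-liner confirms this (the files need not exist, since nothing is read in a BEGIN block):

```shell
awk 'BEGIN { print ARGC - 1 }' file1.txt file2.txt file3.txt
# prints 3
```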
A perl solution
perl -ne '
$h{$_} .= "$.\t"; # append current line number and tab character to value in a hash with key current line
$. = 0 if eof; # reset line number when end of file is reached
END{
while ( ($k,$v) = each %h ) { # loop over hash entries
if ( $v =~ y/\t// == 3 ) { # if value contains 3 tabs
print $v.$k # print value concatenated with key
}
}
}' file1.txt file2.txt file3.txt
This is a strange question, I have been looking around and I wasn't able to find anything to match with what I wish to do.
What I'm trying to do is;
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all files in one file matching the same amount of the file B
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, and in file C I have to duplicate 1 line, so I can match the same number of lines as file B.
I was thinking of counting how many lines each file has first:
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then, if a count is greater than file B's, remove lines; else, add lines.
But I have got lost, as I'm not sure how to continue with this; I have never seen anyone try to do this.
Can anyone point me to an idea?
thanks.
Could you please try the following. I have used # as a separator; you could change it as per your need.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
FS=OFS="#"
}
FNR<=file2_lines{
$1=$1?$1:prev_first
$3=$3?$3:prev_third
print
prev_first=$1
prev_third=$3
}'
Example of running above code:
Lets say following are Input_file(s):
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run above code in form of script following will be the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
you can get the first n lines of a file with the head command resp. sed.
you can generate new lines with echo.
i'm going to use sed, as it allows in-place editing of a file (so you don't have to deal with temporary files):
#!/bin/bash
fix_numlines() {
local filename=$1
local wantlines=$2
local havelines=$(grep -c . "${filename}")
head -${wantlines} "${filename}"
if [ $havelines -lt $wantlines ]; then
for i in $(seq $((wantlines-havelines))); do echo; done
fi
}
lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
if you want columnated output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 ' # n parameter is the file index number
BEGIN { # ... which defines the record count
srand(seed) # random record is printed when not enough records
}
{
a[ARGIND][c[ARGIND]=FNR]=$0 # hash all data to a first
}
END {
for(r=1;r<=c[n];r++) # loop records
for(f=1;f<=ARGIND;f++) # and fields for below output
printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c # -v n=2 means the second file ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1 with c[f].
$ gawk ' # remember GNU awk only
NR==FNR { # count given files records
bnr=FNR
next
}
{
print # output records of a b c
if(FNR==bnr) # ... up to bnr records
nextfile # and skip to next file
}
ENDFILE { # if you get to the end of the file
if(bnr>FNR) # but bnr not big enough
for(i=FNR;i<bnr;i++) # loop some
print # and duplicate the last record of the file
}' b a b c # first the file to count then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
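For example, stretching a 2-line file to 4 lines repeats its last line (the sample data is made up):

```shell
cd "$(mktemp -d)" || exit 1
printf 'x\ny\n' > FileC

toLength() {
  { head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}

toLength 4 FileC
# prints:
# x
# y
# y
# y
```

The inner head passes the file through (truncating if too long), yes then repeats the last line forever, and the outer head cuts the stream at exactly n lines.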
To set all files to the length of FileB and show them side by side use
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
As observed by the user umläute, the side-by-side output makes things even easier. However, they used empty lines to pad out short files. The following solution repeats the last line to make short files longer.
stretch() {
cat "$1"
yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk where we read each file only a single time:
awk -v n=2 '
BEGIN{ while(1) {
for(i=1;i<ARGC;++i) {
if (b[i]=(getline tmp < ARGV[i])) a[i] = tmp
}
if (b[n]) for(i=1;i<ARGC;++i) print a[i] > ARGV[i]".new"
else {break}
}
}' f1 f2 f3 f4 f5 f6
This works in the following way:
the lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files in the standard read record, fields sequentially, but we use the BEGIN block where we read the files in parallel.
We do an infinite loop while(1) where we will break out if the lead-file has no more input.
Per cycle, we read a new line of each file using getline. If the file i has a new line, store it in a[i], and set the outcome of getline into b[i]. If file i has reached its end, keep the last line in mind.
Check the outcome of the lead file with b[n]. If we still read a line, print all the lines to the files f1.new, f2.new, ..., otherwise, break out of the infinite loop.
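A quick run on three tiny files (with n=2, so f2 is the lead) shows f1 being truncated and f3 being padded, both to f2's two lines; the filenames and contents here are illustrative:

```shell
cd "$(mktemp -d)" || exit 1
printf 'a1\na2\na3\n' > f1
printf 'b1\nb2\n'     > f2
printf 'c1\n'         > f3

awk -v n=2 'BEGIN {
  while (1) {
    for (i=1; i<ARGC; ++i)
      if (b[i] = (getline tmp < ARGV[i])) a[i] = tmp   # on EOF, a[i] keeps the last line
    if (b[n]) for (i=1; i<ARGC; ++i) print a[i] > (ARGV[i] ".new")
    else break
  }
}' f1 f2 f3

cat f3.new
# prints:
# c1
# c1
```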
I have an issue where I want to parse through the output from a file and grab the nth occurrence of text between two patterns, preferably using awk or sed.
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done, essentially the output would be
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or, more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big :
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
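With the sample input from the question, the counter-based approach prints exactly the third block (a quick check):

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' category 1 s t done \
              category 2 n d done \
              category 3 r d done \
              category 4 t h done > file.txt

awk -v n=3 '/^category/{l++} l==n' file.txt
# prints:
# category
# 3
# r
# d
# done
```

l increments at each block header, so l==n holds only from the third "category" line up to (but not including) the fourth.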
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range. However, you can easily modify this to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and append a null character to the start of the pattern. In the example, if the pattern then contains three leading null characters, pull the pattern out of the hold space. Now create a loop and print the contents of the pattern space until the last pattern is matched. When this last pattern is found, sed will quit. If it's not found, sed will continue to read the next line of input and continue in its loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3