I want to catch the rows containing "Will_Liu>" from massive_data.txt when n < m, or n==0, or m==0. A portion of the prototype output is below.
cat massive_data.txt
Will_Liu> set Name.* xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
. ... .... OK
. ... .... OK
m name-m .... Not_OK
============================================
Total: m name attempted, n name set OK
In the above output, "m" and "n" are variables. If n < m, or n==0, or m==0, print the rows containing "Will_Liu>";
if n==m and both of them are != 0, just skip and ignore that situation.
So far I can only use "grep" and "sed" to grab the key lines, like this:
cat test.txt
Will_Liu> set Name_group1 xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
============================================
Total: 3 name attempted, 2 name set OK
Will_Liu> set Name_group2 yyy
============================================
Id Name Para status
============================================
1 name-4 xxxxx OK
2 name-5 xxxxx Not_OK
3 name-6 xxxxx Not_OK
============================================
Total: 3 name attempted, 1 name set OK
I can use the "sed" and "grep" commands like this:
sed -n "/Total: 3 name attempted,/p" test.txt
Total: 3 name attempted, 2 name set OK
Total: 3 name attempted, 1 name set OK
grep -B 9 "Total: 3 name attempted" test.txt | sed -n '/Will_Liu>/p'
Will_Liu> set Name_group1 xxx
Will_Liu> set Name_group2 yyy
In the grep command, the 9 is 3+6; the 3 is the number of data rows (m) and the 6 is based on the fixed structure of each block.
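If m were known ahead of time, I suppose the offset could be parameterized in the shell (a sketch, assuming m is set beforehand):
m=3
grep -B "$((m + 6))" "Total: $m name attempted" test.txt | sed -n '/Will_Liu>/p'
But m and n change from block to block in massive_data.txt.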
So how can I introduce two variables to define "m" and "n" and improve my code to get the expected result from massive_data.txt?
My expected output:
Will_Liu> set Name1 xxx
Will_Liu> set Name2 yyy
Will_Liu> set Name3 zzz
. . .
. . .
. . .
In general, any previous line you want to print matches some other pattern. In these cases it is better to store the last candidate to be printed and, when you reach your condition, decide what to do with it. For example:
awk '/^Will_Liu/ {
    last_will = $0
}
/^Total/ {
    m = $2; n = $5
    if (m > n || (m == 0 && n == 0))
        print last_will
}' file
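Run against the test.txt sample above, both groups have n < m, so both Will_Liu lines print:
$ awk '/^Will_Liu/{last_will=$0} /^Total/{m=$2; n=$5; if (m>n || (m==0 && n==0)) print last_will}' test.txt
Will_Liu> set Name_group1 xxx
Will_Liu> set Name_group2 yyy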
In cases where you really don't have any pattern to select the last candidate to print, and you have to derive the line number to print from a math operation on the matched line's data, you could make two passes over the file, use tac to invert the input, keep all candidate lines in an array, or take a similar approach. These approaches can be inefficient. For example, storing all lines, which is not recommended for your case:
awk '{ line[NR] = $0 }
/^Total/ {
    m = $2; n = $5
    if (m > n || (m == 0 && n == 0))
        print line[NR-(m+5)]
}' file
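For completeness, a hedged sketch of the tac variant mentioned above: reversing the input puts each Total line before its Will_Liu line, so the decision can be made first; a second tac restores the original order.
tac file | awk '
    /^Total/    { m = $2; n = $5; want = (m > n || (m == 0 && n == 0)) }
    /^Will_Liu/ { if (want) print; want = 0 }
' | tac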
Hi, I have a situation similar to "Grep group of lines", but slightly different.
I have a file in the format of:
> xxxx AB=AAA NNN xxxx CD=DDD xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA JJJ xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA NNN xxxx CD=FFF xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
(Each item starting with > does not necessarily contain the same number of xxx lines; each xxx line is a string of all capital letters. The only cue that an item's record is complete is that the next line starts with >.)
First, I want to grep all items with AB=EEE FFF into a resulting file like below:
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=TTT xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
Then I have a csv file with a list of CD values, and I want to grep all items whose CD=xxx value matches a line in the csv file.
A sample of an item is:
>sp|P01023|A2MG_HUMAN Alpha-2-macroglobulin OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTV
SASLESVRGNRSLFTDLEAENDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTV
MVKNEDSLVFVQTDKSIYKPGQTVKFR
AB in my example refers to OS here, and CD in my example refers to GN (so it's a single string containing capital letters and/or numbers).
My csv file looks like this (with ~1000 lines):
A2M
AIF1
Thanks a lot!
Your question doesn't have much in the way of testable sample data but something like this might be a starting point:
awk -v s1='AB=EEE FFF' -v s2='CD' -v out='out.dat' '
    /^>/ {
        if ( ok = index($0, s1) )
            for ( i = 1; i <= NF; i++ )
                if ( index($i, s2"=") == 1 )
                    print substr( $i, index($i, "=") + 1 )
    }
    ok { print > out }
' in.dat |
grep -Fx -f - in.csv > out.csv
use awk to process in.dat:
- look for lines starting with > and, if found:
  - set the ok flag based on presence/absence of the desired string s1 (the flag remains set until re-tested at the next > line)
  - if s1 is present, search for a field that starts with string s2 followed by =
    - if found, write the part after = to stdout (for efficiency, one could break out of the for loop here)
- if the ok flag is set, copy the line to out.dat
awk's stdout is piped into grep:
- use grep to search for the fixed strings listed in awk's output that match an entire line of in.csv, and save the results to out.csv
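If the second step should instead pull the full records whose CD value appears in the csv, a two-file awk sketch (assuming the CD= form from the mock sample and one value per line in the csv) could look like this:
awk '
    NR == FNR { want[$1]; next }    # first file: CD values to keep
    /^>/ {                          # a > line starts a new record
        keep = 0
        for (i = 1; i <= NF; i++)
            if (index($i, "CD=") == 1 && substr($i, 4) in want)
                keep = 1
    }
    keep                            # print lines while the flag is set
' in.csv in.dat
For the real data the prefix would be GN= and the substr offset would change accordingly.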
I was parsing some data in bash and I could not figure out how to do it. I need to merge the lines together, so it looks like this:
70:54:D2:8D:82:9A 1 Internet
...
I have these 3 file outputs.
Mac addresses:
70:54:D2:8D:82:9A
F8:8E:85:84:4F:55
F4:6D:04:B0:C2:18
10:FE:ED:78:2A:44
Channel numbers:
1
4
1
8
and SSIDs:
Internet
ASUS
Free-WiFi
NetFree
Is there a simple way of doing so? Thanks in advance.
EDIT: It seems like someone already asked this question here
You can use the paste command to append the lines of the files together...
paste -d " " macs channels SSIds
Here's a full example...
echo "123" > 1
echo "abc" > 2
echo "##$" > 3
paste -d " " 1 2 3
123 abc ##$
echo "456" >> 1
paste -d " " 1 2 3
123 abc ##$
456
So you can see that if the line counts don't match up, you'll get slightly skewed output, so you'll want to make sure the lines are 1:1.
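A quick guard for that, as a sketch using the file names from the question:
# refuse to paste unless all three files have the same number of lines
if [ "$(wc -l < macs)" -eq "$(wc -l < channels)" ] &&
   [ "$(wc -l < channels)" -eq "$(wc -l < SSIds)" ]; then
    paste -d " " macs channels SSIds
else
    echo "line counts differ" >&2
fi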
$ paste 1.txt 2.txt 3.txt
70:54:D2:8D:82:9A 1 Internet
F8:8E:85:84:4F:55 4 ASUS
F4:6D:04:B0:C2:18 1 Free-WiFi
10:FE:ED:78:2A:44 8 NetFree
Using the tab-delimited files below, I am trying to validate the header in line 1 and then store that field count in a variable $header to use in a couple of if statements. If $header equals 10, the file has the expected number of fields; if $header is less than 10, the file is missing headers, and the missing header fields should be printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first-line field comparison on its own. If you maintain an array of expected field names, you can easily split the first line into fields, compare against the expected number of fields, and output the identity of any missing field whenever you read fewer than the expected number of fields from a given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$#"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[#]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot on its own, bash is more than capable of parsing text.
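As a minimal illustration of the IFS trick the script relies on (splitting only on tabs while filling the array):
line=$'Index\tChr\tStart'
oldIFS="$IFS"; IFS=$'\t'   # word-split on tabs only
fld=( $line )              # unquoted expansion splits on IFS
IFS="$oldIFS"
echo "${#fld[@]}"          # prints 3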
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
    for (n = 1; n <= NF; n++)
        a[$n]
    nextfile
}
NF == (n-1) {
    print FILENAME " file has expected number of fields"
    nextfile
}
{
    for (i = 1; i <= NF; i++)
        b[$i]
    print FILENAME " is missing header for: "
    for (i in a)
        if (!(i in b))
            print i
    nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (collected in a) that the following files should have; each later file's header fields (collected in b) are compared against it. Note that file1 is listed twice on the command line: the first pass only gathers the expected headers, and the second pass validates file1 itself.
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
    if [[ $(head -1 "$f" | awk '{print NF}') -eq 10 ]]; then
        echo "File $f has all the fields on its header"
    else
        echo "File $f is missing" $(echo "Index Chr Start End Ref Alt Freq Qual Score Input $(head -1 "$f")" |
                                    tr ' \t' '\n' | sort | uniq -c | awk '$1 == 1 {print $2}')
    fi
done
Output:
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header
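The uniq -c trick works because each expected header name appears twice when it is present in the file and only once when it is missing; a tiny demo:
$ printf 'Score\nInput\nScore\n' | sort | uniq -c
      1 Input
      2 Score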
What I want to do is simply insert a running number into every other line of a huge file:
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
To get the next output:
xxx 1 xxxx xxxxx
xxx xxxx xxxx
xxx 2 xxxx xxxxx
xxx xxxx xxxx
xxx 3 xxxx xxxxx
I tried something with awk '{print NR % 2==1 etc ...}' but it doesn't work.
Any suggestion?
Many thanks in advance
You're on the right track
awk 'NR%2 { $1 = $1" "++i}; 1;' file.txt
NR%2 evaluates to true for odd-numbered lines. The assignment replaces the first field with its own value followed by a counter that starts at 0 and is incremented before being concatenated. The 1; always evaluates to true and applies the default action (print) to the line. The longer-but-clearer equivalent is NR%2 { $1 = $1" "++i }; { print }.
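For instance, a quick check against four dummy lines:
$ printf 'a b c\na b c\na b c\na b c\n' | awk 'NR%2 { $1 = $1" "++i }; 1;'
a 1 b c
a b c
a 2 b c
a b c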
perl -lane 'if ($. % 2 == 1){$n++; print "$F[0] $n @F[1..$#F]"} else{print}' file.txt
produces the output:
xxx 1 xxxxx xxxx
xxx xxxxx xxxx
xxx 2 xxxxx xxxx
xxx xxxxx xxxx
xxx 3 xxxxx xxxx
Explanation:
-n loops over every line of the input file, puts the line in the $_ variable, and does not automatically print every line
-l removes newlines before processing, and adds them back afterwards
-a autosplit mode – splits input lines into the @F array
-e execute the perl code
$. is the line number
@F is the array of words in each line, indexed starting with 0
$#F is the index of the last element of @F
@F[1..$#F] is an array slice of elements 1 through the last
Consider the following three files with headers in the first row:
file1:
id name in1
1 jon 1
2 sue 1
file2:
id name in2
2 sue 1
3 bob 1
file3:
id name in3
2 sue 1
3 adam 1
I want to merge these files to get the following output, merged_files:
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.
This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files. It's not identical, but it is very similar. (So similar that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.)
The join command, together with sed (and usually sort too, but the data is already sufficiently sorted), is the primary tool needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by the :. Then, using bash's process substitution, we can write:
$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$
The sed command (three times) adds the : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that when there isn't a match in a file, the line will still be included in the output; the -e 0 means that if there isn't a match in a file, a 0 is generated in the output; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 the output is the join column (0), then the second column (the 1) from the two files. The second join has 3 columns in the input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and it contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.
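To see how -a, -e, and -o interact in isolation, here is a tiny standalone demo with two hypothetical files x and y:
$ printf '1:a\n2:b\n' > x
$ printf '2:B\n3:C\n' > y
$ join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 x y
1:a:0
2:b:B
3:0:C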
The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5, except that sorts the label line after the data; there are workarounds for that if necessary).
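If a specific order is required, one hedged sketch for the 'label line sorts after the data' problem (assuming the merged result was saved to a hypothetical file merged.txt) is to sort everything except the header:
head -n 1 merged.txt                 # emit the header untouched
tail -n +2 merged.txt | sort -k1,1   # sort only the data rows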
Code for GNU awk:
{
    if ($1 == "id") { v[i++] = $3; next }
    b[$1,$2] = $1 " " $2
    c[i-1, $1 " " $2] = $3
}
END {
    printf ("id name")
    for (x in v) printf (" %s", v[x])
    printf ("\n")
    for (y in b) {
        printf ("%s", b[y])
        for (z in v)
            if (c[z,b[y]] == 0) printf (" 0")
            else printf (" %s", c[z,b[y]])
        printf ("\n")
    }
}
$ cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1
$ awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1
This awk script will do what you want:
$1=="id"&&$2=="name"{
ins[$3]= 1;
lastin = $3;
}
$1!="id"||$2!="name" {
ids[$1] = 1;
names[$2] = 1;
a[$1,$2,lastin]= $3
used[$1,$2] = 1;
}
END {
printf "id name"
for (i in ins) {
printf " %s", i
}
printf "\n"
for (id in ids) {
for (name in names) {
if (used[id,name]) {
printf "%s %s", id, name
for (i in ins) {
printf " %d", a[id,name,i]
}
printf "\n"
}
}
}
}
Assuming your files are called list1, list2, etc., and the awk file is script.awk, you can run it like this
$ cat list* | awk -f script.awk
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
I am sure there is a much shorter and simpler way to do it, but this is all I could come up with at 1:30 am :)