Take a string from multiple files, copy it to a new file, and print the filename in the second column in bash

I have multiple files containing this information:
sP12345.txt
COMMENT Method: conceptual translation.
FEATURES Location/Qualifiers
source 1..3024
/organism="H"
/isolate="sP12345"
/isolation_source="blood"
/host="Homo sapiens"
/db_xref="taxon:11103"
/collection_date="31-Mar-2014"
/note="genotype: 3"
sP4567.txt
COMMENT Method: conceptual translation.
FEATURES Location/Qualifiers
source 1..3024
/organism="H"
/isolate="sP4567"
/isolation_source="blood"
/host="Homo sapiens"
/db_xref="taxon:11103"
/collection_date="31-Mar-2014"
/note="genotype: 2"
Now I would like to take the /note="genotype: 3" line, copy only the number that comes after genotype: to a new text file, and print the filename it was taken from as column 2.
Expected Output:
3 sP12345
2 sP4567
I tried this code, but it only prints the first column and not the filename:
awk -F'note="genotype: ' -v OFS='\t' 'FNR==1{++c} NF>1{print $2, c}' *.txt > output_file.txt

You may use:
awk '/\/note="genotype: /{gsub(/^.* |"$/, ""); f=FILENAME; sub(/\.[^.]+$/, "", f); print $0 "\t" f}' sP*.txt
3 sP12345
2 sP4567

$ awk -v OFS='\t' 'sub(/\/note="genotype:/,""){print $0+0, FILENAME}' sP12345.txt sP4567.txt
3 sP12345.txt
2 sP4567.txt

You can do:
awk '/\/note="genotype:/{split($0,a,": "); print a[2]+0,"\t",FILENAME}' sP*.txt
3 sP12345.txt
2 sP4567.txt

With the samples you have shown, please try the following GNU awk code.
awk -v RS='/note="genotype: [0-9]*"' '
RT{
  gsub(/.*: |"$/,"",RT)
  print RT,FILENAME
  nextfile
}
' *.txt
Explanation: A simple explanation would be: pass all .txt files to the GNU awk program. Then set RS (the record separator) to /note="genotype: [0-9]*", as per the shown samples and requirement. In the main program, use gsub (global substitution) to remove everything up to the colon followed by a space, as well as the trailing ", from the value of RT. Then print the value of RT followed by the current file's name. nextfile jumps straight to the next file, skipping the rest of the current file's contents to save some time.
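For reference, here is a tiny made-up example (GNU awk only, since RT is a gawk extension) showing how RT holds whatever text matched RS for each record:
printf 'a1b22c' | gawk -v RS='[0-9]+' 'RT{print NR, RT}'
which should print:
1 1
2 22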

Related

Prepend text to specific line numbers with variables

I have spent hours trying to solve this. There are a bunch of answers as to how to prepend to all lines or to specific lines, but not with variable text and a variable line number.
while [ $FirstVariable -lt $NextVariable ]; do
#sed -i "$FirstVariables/.*/$FirstVariableText/" "$PWD/Inprocess/$InprocessFile"
cat "$PWD/Inprocess/$InprocessFile" | awk 'NR==${FirstVariable}{print "$FirstVariableText"}1' > "$PWD/Inprocess/Temp$InprocessFile"
FirstVariable=$[$FirstVariable+1]
done
Essentially I am looking for a particular string delimiter, figuring out where the next one is, and prepending the first result to the following lines... Note that I already figured out the logic; I am just having issues prepending the line with the variables.
Example:
This >
Line1:
1
2
3
Line2:
1
2
3
Would turn into >
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
You can do all that using the awk one-liner below.
Assuming your pattern starts with Line, the script below can be used.
> awk '{if ($1 ~ /Line/ ){var=$1;print $0;}else{ if ($1 !="")print var $1}}' $PWD/Inprocess/$InprocessFile
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
Here is how the above script works:
If the first field of a record contains the word Line, it is copied into an awk variable var and the record is printed. For subsequent records, if the record is not empty, var is prepended to it and the result is printed, producing the desired output.
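If you would rather stay in plain bash, the same idea (remember the latest header line, prepend it to every following non-empty line) could be sketched roughly like this; the variable name header and the file name inputfile are only placeholders:
#!/usr/bin/env bash
header=""
while IFS= read -r line; do
  if [[ $line == Line* ]]; then
    header=$line                      # remember the current header line
    printf '%s\n' "$line"
  elif [[ -n $line ]]; then
    printf '%s%s\n' "$header" "$line" # prepend the remembered header
  fi
done < inputfile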
If you need to pass the variables dynamically from shell to awk, you can use the -v option, like below:
awk -v var1="$FirstVariable" -v var2="$FirstVariableText" 'NR==var1{print var2}1' "$PWD/Inprocess/$InprocessFile" > "$PWD/Inprocess/Temp$InprocessFile"
The way you addressed the problem is by mixing bash and awk to process the file: you use bash to extract a line, and then use awk to manipulate that one line. The whole thing can actually be done with a single awk script:
awk '/^Line/{str=$1; print; next}{print (NF ? str $0 : "")}' inputfile > outputfile
or
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}{gsub(FS,OFS $1)}1' inputfile > outputfile

Compare lines within a file in bash

input.txt file
12345678,Manoj,23,Developer
12345678,Manoj,34,Developer
12345678,Manoj,67,Developer
12345679,Vijay,12,Tester
12345679,Vijay,98,Tester
12345676,Samrat,100,Manager
12345676,Samrat,25,Manager
12345676,Samrat,28,Manager
Desired output file
12345678,Manoj,23,Developer,0
12345678,Manoj,34,Developer,1
12345678,Manoj,67,Developer,2
12345679,Vijay,12,Tester,0
12345679,Vijay,98,Tester,1
12345676,Samrat,100,Manager,0
12345676,Samrat,25,Manager,1
12345676,Samrat,28,Manager,2
Explanation
Here the first value, i.e. 12345678, is the same in the first 3 lines of my input file, so append ,0, ,1 and ,2 respectively to those 3 lines, and similarly for the following lines.
How can this be done in a shell script?
Edit to the desired output
Is it also possible to change the number format of the desired output to the following?
12345678,Manoj,23,Developer,0000000
12345678,Manoj,34,Developer,0000001
12345678,Manoj,67,Developer,0000002
12345679,Vijay,12,Tester,0000000
12345679,Vijay,98,Tester,0000001
12345676,Samrat,100,Manager,0000000
12345676,Samrat,25,Manager,0000001
12345676,Samrat,28,Manager,0000002
New:
Is it possible to start the numbering from 0000019? Is there any other option to initialize a variable, like a=5, a=19 or a=39, and increment from there?
12345678,Manoj,23,Developer,0000019
12345678,Manoj,34,Developer,0000020
12345678,Manoj,67,Developer,0000021
12345679,Vijay,12,Tester,0000019
12345679,Vijay,98,Tester,0000020
12345676,Samrat,100,Manager,0000019
12345676,Samrat,25,Manager,0000020
12345676,Samrat,28,Manager,0000021
Using awk:
$ awk 'BEGIN{FS=OFS=",";RS="\r?\n"}{print $0,a[$1]++}' file
Output:
12345678,Manoj,23,Developer,0
12345678,Manoj,34,Developer,1
12345678,Manoj,67,Developer,2
12345679,Vijay,12,Tester,0
12345679,Vijay,98,Tester,1
12345676,Samrat,100,Manager,0
12345676,Samrat,25,Manager,1
12345676,Samrat,28,Manager,2
Edit:
As the requirements changed and a lot of commenting took place, here is the final version (revision one, as the requirements differed between the comments and the OP; knocking on wood):
$ awk 'BEGIN{FS=","}{sub(/\r$/,"");printf "%s,%07d" ORS,$0,a[$1]++}' file
Explained:
$ awk '
BEGIN {
FS=","
# ORS="\r\n" # uncomment if Windows line-endings are desired
}
{
sub(/\r$/,"") # remove Windows line-endings (ie. \r from \r\n)
printf "%s,%07d" ORS,$0,a[$1]++ # output zeropadded running count on $1
}' file
Tested with gawk, mawk, busybox awk and the original-awk (awk version 20121220). Oh, and recycled my Solaris box 5 years ago. ;D
Update to fix a line-ending error I was previously unaware of.
Use this; it will work on both \r\n and \n line endings, and the output will end in \n:
awk -F, 'sub(/\r$/,"") ($(NF+1)=sprintf("%07d",a[$2]++))' OFS=, input.txt
Output:
12345678,Manoj,23,Developer,0000000
12345678,Manoj,34,Developer,0000001
12345678,Manoj,67,Developer,0000002
12345679,Vijay,12,Tester,0000000
12345679,Vijay,98,Tester,0000001
12345676,Samrat,100,Manager,0000000
12345676,Samrat,25,Manager,0000001
12345676,Samrat,28,Manager,0000002
I wrote it like that for conciseness; it's functionally equal to:
awk 'BEGIN{FS=OFS=","}{sub(/\r$/,"");$(NF+1)=sprintf("%07d",a[$2]++)}1' input.txt
If you have ruby installed:
ruby -aF, -pe 'BEGIN{a=Hash.new(-1)};sub(/\r?$/, "," + "%07d" % (a[$F[1]]+=1))' input.txt
Same output.
Btw, if you want it to start with 19, you can use this (add 19+ to the value):
awk 'sub(/\r$/,"") ($(NF+1)=sprintf("%07d",19+a[$2]++))' FS=, OFS=, input.txt
Or this(initialize with 18):
ruby -aF, -pe 'BEGIN{a=Hash.new(18)};sub(/\r?$/, "," + "%07d" % (a[$F[1]]+=1))' input.txt
These all use $2 (column 2) as the key; since in your samples $1 and $2 are related, either one would work.
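If you would rather pass the starting offset in from the shell than hard-code it, a possible variant (the names start and count are only illustrative) is to supply it with -v:
awk -v start=19 'BEGIN{FS=","}{sub(/\r$/,"");printf "%s,%07d\n",$0,start+count[$1]++}' input.txt
which should begin with:
12345678,Manoj,23,Developer,0000019
12345678,Manoj,34,Developer,0000020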
Could you please try the following (without editing the line, it simply prints it along with the additional array count value):
awk 'BEGIN{FS=OFS=","} {printf("%s,%07d\n",$0,count[$2]++)}' Input_file
Using Perl
$ cat manoj.txt
12345678,Manoj,23,Developer
12345678,Manoj,34,Developer
12345678,Manoj,67,Developer
12345679,Vijay,12,Tester
12345679,Vijay,98,Tester
12345676,Samrat,100,Manager
12345676,Samrat,25,Manager
12345676,Samrat,28,Manager
$ perl -F, -lane ' $F[$#F]=~s/\r//g; $F[$#F+1]=sprintf("%07d",$kv{$F[0]}++);$,=","; print @F ' manoj.txt
12345678,Manoj,23,Developer,0000000
12345678,Manoj,34,Developer,0000001
12345678,Manoj,67,Developer,0000002
12345679,Vijay,12,Tester,0000000
12345679,Vijay,98,Tester,0000001
12345676,Samrat,100,Manager,0000000
12345676,Samrat,25,Manager,0000001
12345676,Samrat,28,Manager,0000002
$

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
Here's a solution using awk:
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
awk uses -F (or FS, the field separator), and you must use OFS (the output field separator).
sub(/srchtarget/, "replacementstring", stringToFix) uses a regular expression to look for four 0s at the front (^) of the third field ($3).
The 1 is shorthand for the print statement. A longhand version of the script would be
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
It's all related to awk's /pattern/{action} idiom.
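As a quick, made-up illustration of that idiom: a bare pattern selects lines and falls back to the default action (print), and a bare 1 is a pattern that is always true:
printf 'aa\nb\ncc\n' | awk 'length($0)>1'
aa
cc
printf 'aa\nb\ncc\n' | awk '1'
aa
b
cc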
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data |sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/'
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin:
    print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awks substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just takes 6 characters starting at position 5 in field 3 and assigns the result back to field 3.
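If the third field is not always exactly 10 characters wide, a slightly more general sketch of the same substr idea keeps the last 6 characters instead of hard-coding position 5 (assuming the field has at least 6 characters):
awk -F, -v OFS=, '{$3=substr($3,length($3)-5)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh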

Awk changes tabs to spaces

Data:
Sandnes<space>gecom<tab>Hansen<tab>Ola<space>Timoteivn<space>10
I am substituting a specific column's value (e.g. the 2nd column) with a variable in a file. So I am using the command:
varz="zipval"
awk -v VAR=$varz '{$2=VAR}1' OutputFile.log
After processing, awk converts all the tabs to spaces, so I used OFS="\t".
But then it converts every space to a tab:
Sandnes<tab>gecom<tab>Hansen<tab>zipval<tab>Timoteivn<tab>10
How can I handle this?
Thanks
Your problem is that awk splits your input on FS=[ \t]+ and then reassembles it with OFS=' ' or OFS='\t'. I don't think you can get around doing an extra split. Something like this works:
<data awk -v VAR="$varz" 'BEGIN { FS=OFS="\t" } { split($1, a, " +"); $1 = a[1]" "VAR } 1'
Output:
Sandnes zipval^IHansen^IOla Timoteivn 10
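To see the rebuild behaviour described above in isolation (a toy example, not from the original answer): assigning to any field, even to itself, makes awk rejoin $0 with OFS, which is what collapses the original whitespace:
printf 'a   b\tc\n' | awk '{$1=$1}1'
a b c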
Use this script to pass the column number to your awk script:
varz="zipval"
awk -v VAR=$varz -v N=6 '{sub($N, VAR)}1' OutputFile.log
The below works fine on my machine:
> setenv var "hi"
> echo "1 2 3 4 5 6 7" | awk -v var1=$var '{$6=var1}1'
1 2 3 4 5 hi 7
>
You didn't post your desired output or even tell us which specific text you wanted replaced ("2nd column" could mean several things), so this is a guess, but assuming your input file has tab-separated fields, you just need to quote your shell variable and assign FS as well as OFS:
varz="zipval"
awk -v VAR="$varz" 'BEGIN{FS=OFS="\t"} {$2=VAR} 1' OutputFile.log
I'd also recommend you don't use all-upper case for your variable name since that's used to identify awk builtin variables (NR, NF, etc.).

Deleting the first two lines of a file using BASH or awk or sed or whatever

I'm trying to delete the first two lines of a file by just not printing them to another file. I'm not looking for something fancy. Here's my (failed) attempt at awk:
awk '{ (NR > 2) {print} }' myfile
That throws out the following error:
awk: { NR > 2 {print} }
awk: ^ syntax error
Example:
contents of 'myfile':
blah
blahsdfsj
1
2
3
4
What I want the result to be:
1
2
3
4
Use tail:
tail -n+3 file
from the man page:
-n, --lines=K
output the last K lines, instead of the last 10; or use -n +K
to output lines starting with the Kth
How about:
tail +3 file
OR
awk 'NR>2' file
OR
sed '1,2d' file
You're nearly there. Try this instead:
awk 'NR > 2 { print }' myfile
awk is rule based: the rule appears bare (i.e., outside the braces) before the block it would execute if the rule matches.
Also, as Jaypal has pointed out, if all you want to do in awk is print the line that matches the rule, you can even omit the action, simplifying the command to:
awk 'NR > 2' myfile
awk is based on pattern{action} statements. In your case, the pattern is NR>2 and the action you want to perform is print. This action is also the default action of awk.
So even though
awk 'NR>2{print}' filename
would work fine, you can shorten it to
awk 'NR>2' filename.
