sed options N P h d - shell

I found this script on a server. Could you explain what the options below are for, and why N, P, h and d are used?
sed '/Text1/{N;N;N;N;N;N;N;N;N;N;N;N;N;P;h;
s/.*\(Text2\)/\1/;P;g;
s/.*\(Text3\)/\1/;P;g;
s/.*\(Text4\)/\1/;P;g;
s/.*\(Text5\)/\1/;P;g;
s/.*\(Text6\)/\1/;P;g;
s/.*\(Text7\)/\1/;P;g;
s/.*\(Text8\)/\1/;P;g;
s/.*\(Text9\)/\1/;P;g;
s/.*\(Text10\)/\1/;P;g;
s/.*\(Text11\)/\1/;P;g;
s/.*\(Text12\)/\1/;P;g;
s/.*\(Text13\)/\1/;P;g;
s/.*\(Text14\)/\1/;P;d;}'
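For orientation, here is a minimal sketch of how these commands cooperate, assuming GNU sed. The four sample lines and the function name sed_demo are made up for illustration; the script above follows the same pattern with thirteen N commands and thirteen substitutions.

```shell
# Hypothetical demo: sample input and function name are illustrative only.
sed_demo() {
  printf '%s\n' 'Text1 start' 'aaa Text2 bbb' 'ccc Text3 ddd' 'unrelated' |
  sed '/Text1/{
    # N: append the next input line to the pattern space (done twice here)
    N
    N
    # P: print the pattern space up to its first newline ("Text1 start")
    P
    # h: save a copy of the whole pattern space in the hold space
    h
    # cut everything before Text2, then print the new first line
    s/.*\(Text2\)/\1/
    P
    # g: restore the saved copy, then repeat the cut for Text3
    g
    s/.*\(Text3\)/\1/
    P
    # d: discard the pattern space so it is not printed again
    d
  }'
}

sed_demo
```

With this input the sketch should print "Text1 start", "Text2 bbb", "Text3 ddd" and "unrelated", one per line: the block is collected with N, each piece of interest is printed with P, and h/g reset the pattern space between cuts.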


Using sed to add line above a set of lines

EDIT BELOW
I'm new to bash scripting, sorry if this has been answered elsewhere, couldn't find it in any searches I've done.
I'm using sed -i to add a line above a matching line, for example:
for EFP in *.inp; do
sed -i "/^O */i FRAGNAME=H2ODFT" $EFP
done
and it works as expected. But I would like it to add the line only when the pattern matches across multiple lines, like so:
O
C
O
C
FRAGNAME=H2ODFT
O
H
H
FRAGNAME=H2ODFT
O
H
H
Notice there's no added line above the two O's that are followed by C's.
I tried the following:
for FILE in *.inp; do
sed -i "/^O*\nH*\nH */i FRAGNAME=H2ODFT" $EFP
done
I expected the line to show up above each group of three lines that goes O - H - H, but nothing happened: sed passed through the file as if the pattern were nowhere in the document.
I've looked elsewhere and thought of using awk, but I can't wrap my head around it.
Any help would be greatly appreciated!
EDIT
Thanks for the help. And sorry for being a bit unclear. I've tried a ton of things, too many to put in this post. I've tried awk, perl and sed solutions, but they're not working.
My input has a series of O's C's and H's, with cartesian coordinates assigned to them:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
I'm trying to input a new line above a specific set of three lines, the OHH lines.
The awk solution posted didn't work, because it would add extra lines where there shouldn't be when the stage gets reset. I'm looking for the following output:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
The ^tsed was a typo and should've been an indent instead of ^t
Here is a Ruby one-liner that does that:
ruby -e 'lines=$<.read.split(/\R/)
lines.each_with_index{|line,i|
three_line_tag=lines[i..i+2].map{|sl| sl.split[0] }.join
puts "FRAGNAME=H2ODFT" if three_line_tag == "OHH"
puts line
}
' file
Or with any awk, using the same method:
awk '{lines[NR]=$0}
END{
for(i=1;i<=NR;i++) {
tag=""
for(j=0;j<=2;j++) {
split(lines[i+j],arr)
tag=tag arr[1]
}
if (tag=="OHH")
print "FRAGNAME=H2ODFT"
print lines[i]
}
}
' file
Or Perl:
perl -0777 -pe 's/(^\h*O\h.*\R^\h*H\h.*\R^\h*H\h.*\R?)/FRAGNAME=H2ODFT\n$1/gm' file
Any of these prints:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
===
Edit in place:
Read THIS about awk and that is generally applicable.
Any of these scripts as written write to stdout.
You can redirect the output to a new file:
someutility input_file >new_file
Some utilities (such as perl, ruby, GNU awk and GNU sed) can replace a file in place. If yours cannot, you must not do:
someutil 'prints to STDOUT' file >file
since file will be destroyed before fully read.
Instead you would do:
someutil 'prints to STDOUT' file > tmp && mv tmp file
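The temporary-file pattern above can be wrapped in a small helper. This is only a sketch: the function name edit_in_place is hypothetical, and it assumes mktemp is available and that the filter takes the input file as its last argument.

```shell
# Hypothetical helper: run a filter that writes to stdout, and replace the
# input file only if the filter succeeded.
edit_in_place() {
  file=$1; shift
  tmp=$(mktemp) || return 1
  if "$@" "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"      # keep the original untouched on failure
    return 1
  fi
}

# Example: prefix every line of a scratch file with its line number.
demo_file=$(mktemp)
printf 'a\nb\n' > "$demo_file"
edit_in_place "$demo_file" awk '{ print NR ": " $0 }'
cat "$demo_file"
```

Note that the replaced file takes on the permissions of the temporary file, so adjust them afterwards if that matters.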
This might work for you (GNU sed):
sed -Ei -e ':a;N;s/\n/&/2;Ta;/^O(\n.)\1$/i FRAGNAME=H2ODFT' -e 'P;D' file1 file2
This slides a 3-line window through the file and, whenever the window matches the required pattern, inserts the desired line of text above it.
N.B. The \1 back-reference matches a repeat of the second line, so the pattern as written expects lines 2 and 3 to be the same single character. The script is also split into two pieces because the i command must be terminated by a newline, which the end of each -e option provides.
An alternative version of the same solution:
cat <<\! | sed -Ef - -i file{1..100}
:a
N
s/\n/&/2
Ta
/^O(\n.)\1$/i FRAGNAME=H2ODFT
P
D
!
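The pattern in the scripts above only matches single-character lines. A possible adaptation of the same 3-line window to the coordinate input from the question is sketched below. It assumes GNU sed (for T, the one-line i form and \n inside brackets) and that each line starts with the element symbol followed by a space; the function name tag_water is hypothetical.

```shell
# Hypothetical wrapper around the 3-line-window idea, for lines such as
# "O 30.226 35.908 36.744". Requires GNU sed.
tag_water() {
  sed -E \
    -e ':a;N;s/\n/&/2;Ta;/^O [^\n]*\nH [^\n]*\nH [^\n]*$/i FRAGNAME=H2ODFT' \
    -e 'P;D' "$@"
}

printf '%s\n' 'O 1 1 1' 'H 2 2 2' 'C 3 3 3' 'O 4 4 4' 'H 5 5 5' 'H 6 6 6' |
  tag_water
```

Here only the second O should get the FRAGNAME=H2ODFT line above it, because the first O is followed by H and then C.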
If the input files aren't large enough to cause memory issues, you can slurp the entire file and then perform the substitution. For example:
perl -0777 -pe 's/^O\nH\nH\n/FRAGNAME=H2ODFT\n$&/gm' ip.txt
If this works for you, you can add the -i option for in-place editing. The regex ^O*\nH*\nH * shown in the question isn't clear. ^O\nH\nH\n will match three lines containing exactly O, H and H. Adjust as needed.
I know you requested a sed solution, but I have a solution based on awk.
We initialize the awk program with a stage which, over time, tracks progress towards "OHH".
If we receive the next expected letter, we grow the stage until we reach OHH; then we print the required string and reset the stage.
If the sequence breaks, we print whatever we accumulated in the stage and reset it.
awk '
BEGIN { stage="" }
/^O$/ { if (stage=="") { stage="O\n"; next } }
/^H$/ { if (stage=="O\n") { stage="O\nH\n"; next } }
/^H$/ { if (stage=="O\nH\n") { print "FRAGNAME=H2ODFT\nO\nH\nH"; stage=""; next } }
{ print stage $1; stage="" }
' < sample.txt
Where sample.txt contains:
O
C
O
C
O
H
H
O
H
H
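A variant of the same stage idea that copes with coordinate lines and with an O that interrupts a sequence in progress might look like the sketch below. It keys on the first field rather than the whole line; the names tag_ohh, release and held are hypothetical, and it assumes a POSIX awk.

```shell
# Hypothetical stage machine: hold back O and O-H until we know whether
# the run completes as O-H-H.
tag_ohh() {
  awk '
    # Print any lines still held back and start over.
    function release() { printf "%s", held; held = ""; stage = "" }

    # An O always starts a fresh candidate, even if one was in progress.
    $1 == "O" { release(); held = $0 "\n"; stage = "O"; next }
    $1 == "H" && stage == "O" { held = held $0 "\n"; stage = "OH"; next }
    # A second H completes O-H-H: emit the tag, the held lines, this line.
    $1 == "H" && stage == "OH" {
      print "FRAGNAME=H2ODFT"
      printf "%s", held
      print
      held = ""; stage = ""
      next
    }
    # Anything else breaks the sequence.
    { release(); print }
    END { release() }
  ' "$@"
}

printf '%s\n' 'O 1 1 1' 'H 2 2 2' 'C 3 3 3' 'O 4 4 4' 'H 5 5 5' 'H 6 6 6' |
  tag_ohh
```

Because held lines are flushed as soon as the sequence breaks, no line is dropped or duplicated when the stage resets.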

Add one column from one file to the end of multiple files

I want to append one column (column 7) of one file (motherfile) as the last column of many other files (child1.c, child2.c, child3.c and so on).
motherfile
38 WAT1 1 TIP3 OH2 OT -0.834000 15.9994 0
39 WAT1 1 TIP3 H1 HT 0.417000 1.0080 0
40 WAT1 1 TIP3 H2 HT 0.417000 1.0080 0
41 WAT1 2 TIP3 OH2 OT -0.834000 15.9994 0
42 WAT1 2 TIP3 H1 HT 0.417000 1.0080 0
child1.c
O -5.689000 -0.628000 -10.423000
H -6.663000 -0.744000 -10.224000
H -5.166000 -1.340000 -9.957000
O 11.405000 3.612000 1.674000
H 11.331000 4.609000 1.663000
child2.c
O -4.689000 -0.628000 -10.423000
H -5.663000 -0.744000 -10.224000
H -6.166000 -1.340000 -9.957000
O 1.4405000 3.612000 1.674000
H 14.331000 4.609000 1.663000
and so on ...
I tried to use
awk '{f1 = $0; getline<"motherfile"; print f1, $7}' < child1.c > newchild1.c
but this only adds the column to one file, and I want to add it to many files.
Note that each new child file (e.g. newchild1.c) needs to look like this:
O -5.689000 -0.628000 -10.423000 -0.834000
H -6.663000 -0.744000 -10.224000 0.417000
H -5.166000 -1.340000 -9.957000 0.417000
O 11.405000 3.612000 1.674000 -0.834000
H 11.331000 4.609000 1.663000 0.417000
In awk, print statements can be redirected to a file using > or >>. The following example reads column 7 of motherfile into memory and writes each child file, with the saved column appended, to a new file whose name is the original name prepended with the string new.
awk 'NR==FNR{a[FNR]=$7;next}{print $0,a[FNR] > ("new" FILENAME)}' motherfile child1.c child2.c ...
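As a self-contained sketch of this approach, the snippet below builds two-line versions of the sample files in a scratch directory and runs the same kind of awk program on them. The variable names workdir, charge and out are illustrative, and the close() call is an addition (not part of the answer above) intended to avoid running out of open files when there are many children.

```shell
# Scratch directory with cut-down copies of the sample files.
workdir=$(mktemp -d)
(
  cd "$workdir" || exit 1
  printf '%s\n' '38 WAT1 1 TIP3 OH2 OT -0.834000 15.9994 0' \
                '39 WAT1 1 TIP3 H1 HT 0.417000 1.0080 0' > motherfile
  printf '%s\n' 'O -5.689000 -0.628000 -10.423000' \
                'H -6.663000 -0.744000 -10.224000' > child1.c
  printf '%s\n' 'O -4.689000 -0.628000 -10.423000' \
                'H -5.663000 -0.744000 -10.224000' > child2.c

  awk '
    # First file: remember column 7 by line number.
    NR == FNR { charge[FNR] = $7; next }
    # Start of each child file: close the previous output, pick a new name.
    FNR == 1  { if (out != "") close(out); out = "new" FILENAME }
    # Append the saved column to every child line.
    { print $0, charge[FNR] > out }
  ' motherfile child1.c child2.c

  cat newchild1.c newchild2.c
)
```

The subshell keeps the cd from affecting the calling shell; FILENAME is a bare file name here, which is why the program is run from inside the directory.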

Incorrect alarm raised by pocketsphinx in android

I have implemented pocketsphinx in my Android application to recognize voice commands, with a custom dictionary and word list. Here is my implementation:
private void setupRecognizer(File assetsDir) throws IOException {
recognizer = defaultSetup()
.setAcousticModel(new File(assetsDir, "en-us-ptm"))
.setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
//.setRawLogDir(assetsDir)
.setKeywordThreshold(1e-10f)
.setBoolean("-allphone_ci", true)
.getRecognizer();
recognizer.addListener(this);
// Create keyword-activation search.
// recognizer.addKeyphraseSearch(KWS_SEARCH, KEYPHRASE);
File menuGrammar = new File(assetsDir, "target-words.gram");
recognizer.addKeywordSearch(KWS_SEARCH, menuGrammar);
}
For words list:
yalp /1-0/
yaalp /1-0/
yeelp /1e-1/
yelp /1e-1/
and grammar:
yalp Y AE L P
yaalp Y AA L P
yealp Y EH L P
yeelp Y IY L P
yelp Y EH L P
But I am getting wrong results: even when I am not speaking, or when I merely make a sound (even clapping), onPartialResult reports words like Yelp or Yealp. I tried tuning setKeywordThreshold() (1e-10f, 1e-20f, 1e-30f, etc.) and giving the entries in the word list different thresholds such as 1-0 or 1e-1, but nothing fixes this. Can someone suggest why this produces wrong results?
Here is the image of my asset:

Regex for First Line (Only) that Contains a String

I have a bunch of phone numbers with one per line:
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
I want to grab the first one that contains the letter "c" upper or lower case.
So far, I have this /^.*[C].*$/i and that matches C (202) 456-1111, [Cell] (505) 555-1234 and c 12346567s. How do I return only the first? In other words, the match should only be C (202) 456-1111.
I have been blindly putting question marks everywhere without success.
I am using Ruby if it makes a difference http://www.rubular.com/r/h6ReB9IN8t
Edit: Here is another question that Hrishi pointed to but I cannot figure out how to adapt it to match the whole line.
Try the match method. Here is an example:
list = <<EOF
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOF
Update
# match the first line containing the letter "c", even when it is part of a word
puts list.match(/^.*C.*$/i)
# match the first line where "c" stands alone rather than inside a word
puts list.match(/^\W*C\W.*$/i)
I'd go about this a bit differently. I prefer to reduce regular expressions to very simple patterns:
str = <<EOT
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOT
Finding the right line to work with is easily done using either select or find:
str.split("\n").select{ |s| s[/c/i] }.first # => "C (202) 456-1111"
str.split("\n").find{ |s| s[/c/i] } # => "C (202) 456-1111"
I'd recommend find because it only returns the first occurrence.
Once the desired string is found, use scan to grab the numbers:
str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/) # => ["202", "456", "1111"]
Then join them. When you have phone numbers stored in a database you don't really want them to be formatted, you just want the numbers. Formatting occurs later when you're outputting them again.
phone_number = str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/).join # => "2024561111"
When you need to output the number, break it into the right grouping based on the regional phone-number representation. You should have some idea where the person is located, because you've usually also got their country code. Based on that you know how many digits you should have, plus the groups:
area_code, prefix, number = phone_number[0 .. 2], phone_number[3 .. 5], phone_number[6 .. 9] # => ["202", "456", "1111"]
Then output them so they're displayed correctly:
"(%s) %s-%s" % [area_code, prefix, number] # => "(202) 456-1111"
As for your original pattern /^.*[C].*$/i, there are some misunderstandings about regex in it:
^.* says "start at the beginning of the string and find zero or more characters", which is no more effective than saying /[C]/.
Using [C] creates an unnecessary character set that means "find one of the letters in the set 'C'". It does nothing useful, so just use C, as in /C/.
.*$ artificially finds the end of the string as well, but since you're not capturing it nothing is accomplished, so don't bother with it. The regex is now /C/.
Since you want to match upper and lower case, use /C/i or /c/i. (Or you could use /[cC]/, but why?)
Instead:
To find a "c" or "C" anywhere in the string, just use /c/i. That's all that's needed. http://rubular.com/r/uPyxACOWls
To find "c", "C" or "cell" or "Cell", you can use /c(?:ell)?/i. http://rubular.com/r/TkSRPWG2y6
To find "c", "C", "cell" or "Cell" as a separate word, use word-break markers like /\bc(?:ell)?\b/i. http://rubular.com/r/Smo0bFs9w8
You can get a whole lot more complicated, but if you're not accomplishing anything with the additional pattern information, you're just wasting the regex-engine's CPU-time, and slowing your code. A confused regex-engine can waste a LOT of CPU-time, so be efficient and aware of what you're asking it to do.
EDIT Added two more ways of handling this. The last one is preferable.
This will do what you want. It will search for matches of your regex, and then get the first one. Please note that this will produce an error if string does not have any matches.
string = "[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s"
puts string.match(/^(.*[C].*)$/i).captures.first
puts string.match(/^(.*[C].*)$/i)
puts string[/^(.*[C].*)$/i]
Ruby Docs String#match.
Split the string by the new line characters, and select the substring which matches your requirements and grab the first one:
str = '[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s'
p str.split(/\n/).select{|el| el =~ /^.*[C].*$/i}[0]
or use match:
p str.match(/^.*[C].*$/i)[0]
EDITED:
Or, in case you want to find the first chunk that exactly starts with C try this:
p str.match(/^C.*$/)[0]

Speed up the analysis

I have 2 dataframes in R for example df and dfrefseq.
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to find, for each gene in df, which gene in dfrefseq lies closest to it.
I first selected "chr1" in both data frames.
Then, for the first gene in readschr1, I calculated the distances between the start-start, start-stop, stop-start and stop-stop sites.
The sum of these four distances is my measure of how far apart two genes are. My question is: how can I speed up this analysis? So far I have tested only one gene against the data frame, but I need to test 2000 genes.
readschr1 <- subset(df,df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1")
names<-list()
read_start_start<-list()
read_start_stop<-list()
read_stop_start<-list()
read_stop_stop<-list()
for (i in 1:nrow(refseqchr1)) {
startstart<-abs(readschr1[1,2] - refseqchr1[i,2])
startstop<-abs(readschr1[1,2] - refseqchr1[i,3])
stopstart<-abs(readschr1[1,3] - refseqchr1[i,2])
stopstop<-abs(readschr1[1,3] - refseqchr1[i,3])
read_start_start[[i]]<- matrix(startstart)
read_start_stop[[i]]<- matrix(startstop)
read_stop_start[[i]]<- matrix(stopstart)
read_stop_stop[[i]]<- matrix(stopstop)
names[[i]]<-matrix(refseqchr1[i,4])
}
table<-cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns<-as.numeric(table[,2]) + as.numeric(table[,3])+ as.numeric(table[,4]) + as.numeric(table[,5])
test<-cbind(table, sumtotalcolumns)
test1<-test[order(as.numeric(test[, "sumtotalcolumns"])), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
then
library(GenomicRanges)
gr <- with(df,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
You can merge the two separate data.frames into one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do with rows that do not match. Specifying all = TRUE returns all rows and fills in NAs where there is no match in the other data.frame, i.e. chr2 and chr4 in this case. Once the data.frames have been merged, it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
read_start_start = abs(start.x - start.y)
, read_start_stop = abs(start.x - stop.y)
, read_stop_start = abs(stop.x - start.y)
, read_stop_stop = abs(stop.x - stop.y)
)
zz <- transform(zz,
sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
)
Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])
