Perl does not match multiple lines

I want to match:
Start Here Some example
text covering a few
lines. End Here
So I do
$ perl -nle 'print $1 if /(Start Here.*?)End Here/s'
then paste the text above and press Ctrl-D. It won't match from the command line, but it does in a script file. Why?

Set the input record separator ($/) to undef with the -0777 command-line switch, so that Perl reads the whole input as a single record.
perl -0777nle 'print $1 if /(Start Here.*?)End Here/s' <<THE_END
Start Here Some example
text covering a few
lines. End Here
THE_END
man perlrun
-0[octal/hexadecimal]
specifies the input record separator ($/) as an octal or
hexadecimal number. […] Any value 0400 or above will cause Perl to slurp files whole, but by convention the value 0777 is the one normally used for this purpose.
man perlvar
IO::Handle->input_record_separator( EXPR )
$INPUT_RECORD_SEPARATOR
$RS
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. […] You may set it to […] "undef" to read through the end of file.

As others have explained, you're reading your file a line at a time, so matches over multiple lines are never going to work.
Reading files a line at a time is often the best approach, so we can use the "flip-flop" (range) operator to do this:
$ perl -nle 'print if /Start Here/ .. /End Here/' your_file_here

Related

Replace single character in fasta header with awk or sed

I am working in bash with a fasta file with headers that begin with a ">" and end with either a "C" or a "+". Like so:
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
I'd like to use awk (gsub?) or sed to change the last character of the header to a "+" if it is a "C". Basically I want all of the sequences to end in "+". No C's.
Desired output:
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425+
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
Nothing needs to change with the sequences. I think this is pretty straightforward, but I'm struggling to use other posts to do this myself. I know that awk '/^>/ && /C$/{print $0}' will print the headers that begin with ">" and end with "C", but I'm not sure how to replace all of those "C"s with "+"s.
Thanks for your help!

I think this would be easier to do in sed:
sed '/^>/ s/C$/+/'
Translation: on lines starting with ">", replace "C" at the end of the line with "+". Note that if the "C" isn't matched, there isn't an error, it just doesn't replace anything. Also, unlike awk, sed automatically prints each line after processing it.
If you really want to use awk, the equivalent would be:
awk '/^>/ {sub("C$","+",$0)}; {print}'
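A small check of both commands, as a sketch on made-up data: the sample below is a shortened, hypothetical FASTA snippet whose first sequence line deliberately ends in "C", to show that only header lines are changed.

```shell
#!/bin/sh
# Hypothetical two-record FASTA sample; the first sequence line ends in C
# on purpose, so we can see that non-header lines are left alone.
fasta='>chr1:35031657-35037706+
GGTGGACTAGCC
>chr1:71979382-71985425C
agattaaatgaac'

# sed: address /^>/ restricts the substitution to header lines.
sed_out=$(printf '%s\n' "$fasta" | sed '/^>/ s/C$/+/')

# awk: same idea, with sub() applied to $0 on header lines only.
awk_out=$(printf '%s\n' "$fasta" | awk '/^>/ { sub(/C$/, "+") } { print }')

printf '%s\n' "$sed_out"
```

Both variables should hold the same text, with the second header now ending in "+" and the sequence line GGTGGACTAGCC untouched.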

Use this Perl one-liner:
perl -pe 's{^(>.*)C$}{$1+}' input.fasta > output.fasta
Or, to change the file in-place:
perl -i.bak -pe 's{^(>.*)C$}{$1+}' input.fasta
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.
s{^(>.*)C$}{$1+} : Change the line that starts with > (= fasta header) and ends with C to the same line with C changed to +.
^ marks the beginning of the line and $ marks the end of the line. .* means any character repeated 0 or more times. (>.*) captures the pattern inside, which is the entire line minus the C, and stores it in capture variable $1.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

You might harness GNU AWK for this task in the following way. Let the content of file.txt be
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
then
awk 'BEGIN{FPAT=".";OFS=""}$1==">"{$NF="+"}{print}' file.txt
gives output
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425+
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
Explanation: I inform GNU AWK that a field is any single character using FPAT and that the output field separator is the empty string using OFS. For each line whose 1st field, that is its 1st character, is >, I change the value of the last field ($NF) to +. Note that this also applies to headers ending with +, but that is not a problem, as it changes + to +. Each line, changed or not, is printed.
(tested in GNU Awk 5.0.1)

How to extract only the English words and leaving the Devanagari words in bash script?

The text file is like this,
#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर#_
The desired text file should be like,
#
1
8IU
underscore
$that
%redyellow
$
#_
This is what I have tried so far, using awk
awk -F"[अ-ह]*" '{print $1}' filename.txt
And the output that I am getting is,
#
1
$that
%red
$
and using this awk -F"[अ-ह]*" '{print $1,$2}' filename.txt and I am getting an output like this,
#
1 े
ं
ो
$that
%red yellow
$ ि
ं
Is there any way to solve this in a bash script?

Using perl:
$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
#_
-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.
The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.

Using awk you can do:
awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow
See the documentation on bracket expressions: https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html
The range [\x00-\x7F] matches all values numerically between 0 and 127, which is the defined range of the ASCII character set. The complemented character list [^\x00-\x7F] matches any character that is not in the ASCII range.

tr is a very good fit for this task:
LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt
It sets the POSIX C locale so that only characters of the US-ASCII set are valid.
It then instructs tr to delete (-d) the complement (-c) of [:cntrl:][:graph:], that is, every character that is neither a control character nor a visible one. Because LC_ALL=C puts every locale category in the C locale, all non-ASCII bytes are discarded.
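One thing worth checking before relying on this: the space character belongs to neither [:cntrl:] nor [:graph:], so it is deleted too. The sketch below uses a few lines from the question plus one made-up line containing a space, and compares the class set above with [:cntrl:][:print:], which keeps spaces. It assumes the script itself is saved as UTF-8.

```shell
#!/bin/sh
# Sample lines from the question, plus a hypothetical line with a space.
sample='#एक
अधोरेखाunderscore
%redएकyellow
two words'

# The set from the answer: control and visible characters only.
ascii_only=$(printf '%s\n' "$sample" | LC_ALL=C tr -c -d '[:cntrl:][:graph:]')

# Variant: [:print:] is [:graph:] plus the space character.
keep_spaces=$(printf '%s\n' "$sample" | LC_ALL=C tr -c -d '[:cntrl:][:print:]')

printf '%s\n---\n%s\n' "$ascii_only" "$keep_spaces"
```

In the first result the last line becomes "twowords"; in the second it stays "two words". The Devanagari bytes are removed in both.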

Adding file paths to Latex figures?

In the below text I would like to add figs/01/ to each of the 3 files. As you can see the files can either be pdf,png or not have an extension and sometimes the \includegraphics breaks over several lines.
My current thinking is
cat figs.tex | ruby -ne 'puts $_.gsub(/\\includegraphics\[.*?\]\{.*?\}/) { |x| x.do_something_here }'
but it is a chicken-and-egg problem, because I would need to search again within the match for the part to replace.
Question
Can anyone see how to solve such a situation?
\begin{figure}[ht]
\centerline{ \includegraphics[height=55mm]{plotLn} \includegraphics[height=55mm]{plotLnZoom.pdf}}
\caption{Funktionen $f(x) = \ln(x)$ \ref{examg0} (bl)}
\end{figure}
\begin{example}[Parameterfremstilling for ret linje]\label{tn6.linje}
\begin{think}
Givet linjen $\,m\,$,
\includegraphics[trim=1cm 11.5cm 1cm
11.5cm,width=0.60\textwidth,clip]{vektor8.png}
\end{think}

You can read the whole file in one shot (instead of the default behaviour that reads the file line by line). To do that you need the switch -0777 (special value for the record separator). This solves the problem of a pattern that spreads over multiple lines.
You can also replace the -n option and puts with -p to automatically print the result.
ruby -0777 -pe 'gsub(/\\includegraphics\[[^\]]*\]{\K/,"figs/01/")' figs.tex
You can omit $_, by default gsub is applied to it. (You can even impress your friends removing the space between -pe and the quote ')
About the pattern: \K drops everything matched to its left from the match result, so the match result here is only an empty string at the position where the replacement string is inserted.
Note that the ruby command line options come from Perl:
perl -0777 -pe 's!\\includegraphics\[[^\]]*\]{\K!figs/01/!g' figs.tex

Trying to delete lines from file with sed -- what am I doing wrong?

I have a .csv file where I'd like to delete the lines between line 355686 and line 1048576.
I used the following command in Terminal (on Mac OS X):
sed -i.bak -e '355686,1048576d' trips3.csv
This produces a file called trips3.csv.bak -- but it still has a total of 1,048,576 lines when I reopen it in Excel.
Any thoughts or suggestions you have are welcome and appreciated!

I suspect the problem is that Excel is using carriage return (\r, octal 015) to separate records, while sed assumes lines are separated by linefeed (\n, octal 012); this means that sed will treat the entire file as one really long line. I don't think there's an easy way to get sed to recognize CR as a line delimiter, but it's easy with perl:
perl -n -015 -i.bak -e 'print if $. < 355686 || $. > 1048576' trips3.csv
(Note: if 1048576 is the number of "lines" in the file, you can leave off the || $. > 1048576 part.)
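Before reaching for a different tool, it can help to confirm which terminator the file really uses. The sketch below is one possible check on a tiny made-up file with CR-only line endings; it also shows an alternative route (not the perl approach above) of translating CR to LF with tr so that an ordinary sed range works. The file and variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical four-row file written with CR-only line endings.
sample=$(mktemp)
printf 'row1\rrow2\rrow3\rrow4\r' > "$sample"

# Count each kind of terminator; $(( )) strips any padding wc adds.
lf_count=$(( $(tr -cd '\n' < "$sample" | wc -c) ))
cr_count=$(( $(tr -cd '\r' < "$sample" | wc -c) ))

# Translate CR to LF, after which a plain line-range delete behaves as expected.
trimmed=$(tr '\r' '\n' < "$sample" | sed '2,3d')

printf 'LF: %s, CR: %s\n' "$lf_count" "$cr_count"
printf '%s\n' "$trimmed"
rm -f "$sample"
```

A count of zero linefeeds alongside a non-zero count of carriage returns would support the diagnosis; here the delete leaves row1 and row4.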

Not sure about the OS X sed implementation; however, the GNU sed implementation, when passed the -i flag with a backup extension, first copies the original file to the specified backup and then modifies the original file in place. You should expect to see a reduced number of lines in the original file, trips3.csv.

Some incantation that should do the job (if you have Ruby installed, obviously)
ruby -pe 'exit if $. >= 355686' < trips3.csv > output.csv
If you prefer Perl/Python, just follow the documentation to do something similar and you should be fine. :)
Also, I'm using one of the Ruby one-liners, by Dave.
EDIT: Sorry, forgot to say that you need '> output.csv' to redirect stdout to a file.

awk '!(NR>=355686 && NR<=1048576)' your_file
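Whether the boundary lines themselves are deleted depends on the comparison operators. A scaled-down sketch, with ten numbered lines standing in for the real file and lines 4 through 7 as the hypothetical range:

```shell
#!/bin/sh
# Scaled-down stand-in: ten numbered lines, target range 4 through 7.
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# sed ranges are inclusive: lines 4, 5, 6 and 7 are deleted.
sed_kept=$(printf '%s\n' "$numbers" | sed '4,7d')

# awk with >= and <= matches sed's inclusive behaviour.
awk_inclusive=$(printf '%s\n' "$numbers" | awk '!(NR >= 4 && NR <= 7)')

# awk with > and < keeps both boundary lines (4 and 7).
awk_strict=$(printf '%s\n' "$numbers" | awk '!(NR > 4 && NR < 7)')

printf 'sed:       %s\n' "$(printf '%s' "$sed_kept" | tr '\n' ' ')"
printf 'inclusive: %s\n' "$(printf '%s' "$awk_inclusive" | tr '\n' ' ')"
printf 'strict:    %s\n' "$(printf '%s' "$awk_strict" | tr '\n' ' ')"
```

The first two results agree (1 2 3 8 9 10), while the strict form also keeps lines 4 and 7.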

How to append to specific lines in a flat file using shell script

I have a flat file that contains something like this:
11|30646|654387|020751520
11|23861|876521|018277154
11|30645|765418|016658304
Using shell script, I would like to append a string to certain lines in this file, if those lines contain a specific string.
For example, in the above file, for lines containing 23861, I would like to append a string "Processed" at the end, so that the file becomes:
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
I could use sed to append the string to all lines in the file, but how do I do it for specific lines ?

I'd do it this way
sed '/|23861|/{s/$/|Something/;}' file
This is similar to Marcelo's answer but doesn't require extended expressions and is, I think, a little cleaner.
First, match lines having 23861 between pipes. In a basic regular expression an unescaped | is a literal pipe; do not write \| here, because GNU sed treats \| as alternation and the address would then match every line.
/|23861|/
Then, on those lines, replace the end-of-line with the string |Something
{s/$/|Something/;}
If you want to do more than one of these you could simply list them
sed '/|23861|/{s/$/|Something/;};/|30645|/{s/$/|SomethingElse/;}' file

Use the following awk-script:
$ awk '/23861/ { $0=$0 "|Processed" } {print}' input
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
or, using sed:
$ sed 's/\(.*23861.*$\)/\1|Processed/' input
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
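Matching 23861 anywhere on the line can also hit records where those digits appear inside another field. The sketch below contrasts that with a comparison on the second field only; the third record is made up so that its last field happens to contain 23861.

```shell
#!/bin/sh
# Hypothetical sample: record 2 has 23861 in field 2, record 3 only
# contains those digits inside a longer field 4.
records='11|30646|654387|020751520
11|23861|876521|018277154
11|30645|765418|012386104'

# Whole-line match: appends to records 2 and 3.
loose=$(printf '%s\n' "$records" | awk '/23861/ { $0 = $0 "|Processed" } { print }')

# Field-exact match: split on | and compare field 2 as a string.
exact=$(printf '%s\n' "$records" | awk -F'|' '$2 == "23861" { $0 = $0 "|Processed" } { print }')

printf '%s\n---\n%s\n' "$loose" "$exact"
```

With the field-exact form only the second record gains the |Processed suffix.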

Use the substitution command:
sed -i~ -E 's/(\|23861\|.*)/\1|Processed/' flat.file
(Note: the -i~ performs the substitution in-place. Just leave it out if you don't want to modify the original file.)

You can use the shell
while read -r line
do
case "$line" in
*23861*) line="$line|Processed";;
esac
echo "$line"
done < file > tempo && mv tempo file

sed is just a stream version of ed, which has a similar command set but was designed to edit files in place (allegedly interactively, but you wouldn't want to use it that way unless all you had was one of these). Something like
field_2_value=23861
appended_text='|processed'
line_match_regex="^[^|]*|$field_2_value|"
ed "$file" <<EOF
g/$line_match_regex/s/$/$appended_text/
wq
EOF
should get you there.
Note that the $ in .../s/$/... is not expanded by the shell, unlike $line_match_regex and $appended_text, because there's no such thing as $/ in the shell. Instead it's passed through as-is to ed, which interprets it as the text to substitute ($ being regex-speak for "end of line").
The syntax to do the same job in sed, should you ever want to do this to a stream rather than a file in place, is very similar except that you don't need the leading g before the regex address:
sed -e "/$line_match_regex/s/$/$appended_text/" "$input_file" >"$output_file"
You need to be sure that the values you put in field_2_value and appended_text never contain slashes, because ed's g and s commands use those for delimiters.
If they might, and you're using bash or some other shell that allows ${name//search/replace} parameter expansion syntax, you could fix them up on the fly by substituting \/ for every / during expansion of those variables. Because bash also uses / as a substitution delimiter and also uses \ as a character escape, this ends up looking horrible:
appended_text='|n/a'
ed "$file" <<EOF
g/${line_match_regex//\//\\/}/s/$/${appended_text//\//\\/}/
wq
EOF
but it does work. Note that both ed and sed require a trailing / after the replacement text in s/search/replace/ while bash's ${name//search/replace} syntax doesn't.
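In a shell without ${name//search/replace}, one alternative (a sketch, not the method used above) is to pipe the value through sed to escape the slashes first. The sample records and the escaped_text name are made up for the illustration; other characters that are special in a replacement, such as & and \, would need similar treatment and are not handled here.

```shell
#!/bin/sh
# Hypothetical values; the appended text contains a slash on purpose.
field_2_value=23861
appended_text='|n/a'
line_match_regex="^[^|]*|$field_2_value|"

# Escape every / as \/ so it cannot end the s/// command early.
# Using : as the delimiter here avoids escaping the slash being searched for.
escaped_text=$(printf '%s\n' "$appended_text" | sed 's:/:\\/:g')

records='11|30646|654387|020751520
11|23861|876521|018277154'

# \$ keeps the end-of-line anchor away from the shell inside double quotes.
result=$(printf '%s\n' "$records" | sed -e "/$line_match_regex/s/\$/$escaped_text/")

printf '%s\n' "$result"
```

Only the record whose second field is 23861 should end in |n/a.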
