Ruby one liner to replace only lines that match, discard others - ruby

Looking for the Ruby one-liner equivalent that prints a substitution only if the line matches the regular expression:
echo -e "Line 1\nLine 2\nLine 3" | perl -ne "print if s/Line 2/Line 2 replaced, others discarded/g"
Input:
Line 1
Line 2
Line 3
Output:
Line 2 replaced, others discarded

As far as I know, there is no direct equivalent of that Perl -ne idiom in Ruby, so it will be a little longer:
echo -e "Line 1\nLine 2\nLine 3" | ruby -e 'puts $<.read.lines.map {|l| l =~ /Line 2/ ? l.gsub(/Line 2/, "Line 2 replaced, others discarded") : nil }.compact'
Where:
$<, also known as ARGF (docs), is the stream for the file arguments or standard input
$<.read reads it all into one string
$<.read.lines splits it on newline characters and returns an array
map {|l| ... } collects the result of the block expression into a new array
l =~ /Line 2/ checks whether the string matches the regex
l.gsub(/Line 2/, "Line 2 replaced") replaces every "Line 2" with "Line 2 replaced"
.compact removes nil values from the array (returns a new array without nils)
puts [] prints each element of the array on its own line
Ruby is probably not the best choice for this task; I would choose sed or do it in a text editor. Most text editors can find and replace by regex nowadays.
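For comparison, a sed sketch of the same behaviour: with -n, the p flag on the s command prints only the lines where the substitution actually happened:
echo -e "Line 1\nLine 2\nLine 3" | sed -n 's/Line 2/Line 2 replaced, others discarded/p'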

Related

Bash regex to match multiple blocks of indented content and print all of them

I'm trying to do some regex matching in bash.
I'd like to match multiple blocks of indented (space or tab) content, with each block starting with a keyword.
Some other content could be present in the file.
Using this sample content:
keyword aaa match1
Some other content
keyword ccc match2
    indentend content
    matching
Some other content
    with indendation
keyword ddd match2
    indented content still matching
I managed to use this: (^keyword.*(?:\n^\h+.*)*), which seems to be sort of okay; everything is matching as expected:
https://regex101.com/r/kvMlKK/1
Expected output would be to print every match:
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
Unfortunately I did not find a way to print all matches in bash. I can use grep/sed/awk/perl without any problem (edit: I meant I have access to all these commands in the environment I am working in).
Edit:
grep -E --include \*.md '(^keyword.*(?:\n^\h+.*)*)' $(dirname "$0")/../_inbox/draft.md
Using grep it does not return the full match, only the first line, because of the lack of multi-line matching support, I guess.
I am not familiar with awk/sed; I did not get any meaningful results with them (even if it seems better to use them for multi-line matching).
Edit: if this could work on multiple files, that would be awesome.
Thanks for your help!
You can do it in pure bash by looping, because bash regex matching doesn't support multi-line matching.
#!/bin/bash
# Flag to track whether inside indented block
indented=0
# Read input line by line
while IFS= read -r line; do
    # Check if line starts with keyword
    reg="^[ \t]*keyword"
    if [[ $line =~ $reg ]]; then
        # Print line
        printf "%s\n" "$line"
        # Set flag to indicate inside indented block
        indented=1
    else
        # Check if line starts with whitespace and inside indented block
        reg="^[ \t]+.*"
        if [[ $line =~ $reg && $indented -eq 1 ]]; then
            # Print line
            printf "%s\n" "$line"
        else
            # Reset flag to indicate outside indented block
            indented=0
        fi
    fi
done < "input"
You can do it in awk too:
awk '/^[ \t]*keyword/{print;while(getline line) if(line~/^[ \t]+.*/) print line;else break}' input
Or use sed:
sed -n '/^[ \t]*keyword/{:start;p;n;/^[ \t]/b start;}' input
Using awk:
$ awk '!/^[\t ]/{p=0} /^keyword/{p=1} p' file
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
$
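Coming back to the asker's original wish to stay with grep: assuming GNU grep built with PCRE support, multi-line matching is possible by treating the whole file as one record with -z and printing only the matched parts with -o (a sketch along the lines of the question's regex, not a drop-in tested command):
grep -Pzo '(?m)^keyword.*(?:\n\h+.*)*' draft.md | tr '\0' '\n'
For several files, adding -r --include='*.md' and a directory should work the same way, with each match prefixed by its file name.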

sed insert line after a match only once [duplicate]

UPDATED:
Using sed, how can I insert (NOT substitute) a new line on only the first match of a keyword in each file?
Currently I have the following, but it inserts on every line containing Matched Keyword, and I want it to insert the New Inserted Line only for the first match found in the file:
sed -ie '/Matched Keyword/ i\New Inserted Line' *.*
For example:
Myfile.txt:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
changed to:
Line 1
Line 2
Line 3
New Inserted Line
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
You can sort of do this in GNU sed:
sed '0,/Matched Keyword/s//New Inserted Line\n&/'
But it's not portable. Since portability is good, here it is in awk:
awk '/Matched Keyword/ && !x {print "Text line to insert"; x=1} 1' inputFile
Or, if you want to pass a variable to print:
awk -v "var=$var" '/Matched Keyword/ && !x {print var; x=1} 1' inputFile
These both insert the text line before the first occurrence of the keyword, on a line by itself, per your example.
Remember that with both sed and awk, the matched keyword is a regular expression, not just a keyword.
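If the keyword does contain regex metacharacters (dots, slashes, brackets), one workaround sketch is to use awk's index(), which does a literal substring match instead of a regex match (the keyword here is just an illustrative example):
awk 'index($0, "Matched. Keyword") && !x {print "New Inserted Line"; x=1} 1' inputFile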
UPDATE:
Since this question is also tagged bash, here's a simple solution that is pure bash and doesn't require sed:
#!/bin/bash
n=0
while read line; do
    if [[ "$line" =~ 'Matched Keyword' && $n = 0 ]]; then
        echo "New Inserted Line"
        n=1
    fi
    echo "$line"
done
As it stands, this works as a filter in a pipe. You can easily wrap it in something that acts on files instead.
If you want one with sed*:
sed '0,/Matched Keyword/s//Matched Keyword\nNew Inserted Line/' myfile.txt
*only works with GNU sed
This might work for you:
sed -i -e '/Matched Keyword/{i\New Inserted Line' -e ':a;n;ba}' file
You're nearly there! Just create a loop to read from the Matched Keyword to the end of the file.
After inserting a line, the remainder of the file can be printed out by:
Introducing a loop placeholder :a (here a is an arbitrary name).
Printing the current line and fetching the next into the pattern space with the n command.
Redirecting control back using the ba command, which is essentially a goto to the a placeholder. The end-of-file condition is naturally taken care of by the n command, which terminates any further sed commands if it tries to read past the end of the file.
With a little help from bash, a true one liner can be achieved:
sed $'/Matched Keyword/{iNew Inserted Line\n:a;n;ba}' file
Alternative:
sed 'x;/./{x;b};x;/Matched Keyword/h;//iNew Inserted Line' file
This uses the Matched Keyword as a flag in the hold space and once it has been set any processing is curtailed by bailing out immediately.
If you want to append a line after the first match only, use awk instead of sed, as below:
awk '{print} /Matched Keyword/ && !n {print "New Inserted Line"; n++}' myfile.txt
Output:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
New Inserted Line
Line 4
This line contains the Matched Keyword and other stuff
Line 6

In bash how can I get the last part of a string after the last hyphen [duplicate]

I have this variable:
A="Some variable has value abc.123"
I need to extract this value, i.e. abc.123. Is this possible in bash?
Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.
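Since NF is just a number, you can also do arithmetic on it; for example, $(NF-1) picks the second-to-last field (this prints value for the sample string above):
echo "$A" | awk '{print $(NF-1)}'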
Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)
Some examples using parameter expansion
A="Some variable has value abc.123"
echo "${A##* }"
abc.123
Shortest suffix match on " " (space) removed:
echo "${A% *}"
Some variable has value
Shortest suffix match on . (dot) removed:
echo "${A%.*}"
Some variable has value abc
Longest suffix match on " " (space) removed:
echo "${A%% *}"
Some
Read more Shell-Parameter-Expansion
The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove the shortest trailing match of " *" (strip the last word)
${A%% *} - remove the longest trailing match of " *" (strip the last words)
${A#* } - remove the shortest leading match of "* " (strip the first word)
${A##* } - remove the longest leading match of "* " (strip the first words)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
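Combining the two is a common idiom, for example to get a file's base name without its directory or last extension (a small sketch with a made-up path):
F="/usr/bin/archive.tar.gz"
base="${F##*/}"     # archive.tar.gz  (drop the leading directories)
echo "${base%.*}"   # archive.tar     (drop the last extension)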
How do you know where the value begins? If it always starts at the 5th word, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.
As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123
More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces", i.e. "not the space char"
the * means "match 0 or more instances of the preceding pattern (which is [^ ])", and the $ means "match the end of the line". So this matches the last word, from after the last space through to the end of the line, i.e. abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements being separated by the default IFS (Internal Field Separator) char, which is space:
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -d '' -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
What makes this pattern so useful is that it also lets you easily do the opposite: obtain all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually [ \t]). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'

shell: how to read a certain column in a certain line into a variable

I want to extract the first column of the last line of a text file. Instead of writing the content of interest to another file and reading it in again, can I just use some command to read it into a variable directly?
For example, if my file is like this:
...
123 456 789 (this is the last line)
What I want is to read 123 into a variable in my shell script. How can I do that?
One approach is to extract the line you want, read its columns into an array, and emit the array element you want.
For the last line:
#!/bin/bash
# ^^^^- not /bin/sh, to enable arrays and process substitution
read -r -a columns < <(tail -n 1 "$filename") # put last line's columns into an array
echo "${columns[0]}" # emit the first column
Alternately, awk is an appropriate tool for the job:
line=2
column=1
var=$(awk -v line="$line" -v col="$column" 'NR == line { print $col }' <"$filename")
echo "Extracted the value: $var"
That said, if you're looking for a line close to the start of a file, it's often faster (in a runtime-performance sense) and easier to stick to shell builtins. For instance, to take the third column of the second line of a file:
{
read -r _ # throw away first line
read -r _ _ value _ # extract third value of second line
} <"$filename"
This works by using _s as placeholders for values you don't want to read.
I guess by "first column" you mean "first word", right?
If it is guaranteed that the last line doesn't start with a space, you can do:
tail -n 1 YOUR_FILE | cut -d ' ' -f 1
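To capture that into a variable, as the question asks, wrap it in command substitution:
var=$(tail -n 1 YOUR_FILE | cut -d ' ' -f 1)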
You could also use sed:
$> var=$(sed -nr '$s/(^[^ ]*).*/\1/p' "file.txt")
The -nr tells sed not to output data by default (-n) and to use extended regular expressions (-r, to avoid having to escape the parentheses, which would otherwise be written \( \)). The $ is an address that selects the last line. The regular expression anchors the beginning of the line with ^, matches everything that is not a space with [^ ]*, and puts the result into a capture group ( ); the rest of the line, .*, is dropped by replacing the whole line with the capture group \1, and p prints the result.
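Another option, assuming an awk implementation such as gawk or mawk where the fields of the last record read are still available in the END block, would be:
var=$(awk 'END {print $1}' "file.txt")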

I'm puzzled here about awk, sed, etc

I've been trying for a while to work this out with no success so far.
I have some command output that I need to chew up to make it suitable for further processing.
The text I have is:
1/2 [3] (27/03/2012 19:32:54) word word word word 4/5
What I need is to extract only the numbers 1/2 [3] 4/5, so it will look like:
1 2 3 4 5
So, basically I was trying to exclude all characters that are not digits, like "/", "[", "]", etc.
I tried awk with FS and tried using regexps, but none of my attempts were successful.
I would then add something to it like
first:1 second:2 third:3 .... etc
Please keep in mind that I'm talking about a file that contains a lot of lines with the same structure, but I have already thought about using awk to sum every column with
awk '{sum1+=$1 ; sum2+=$2 ;......etc} END {print "first:"sum1 " second:"sum2.....etc}'
But first I will need to extract only the relevant numbers.
The date in between "( )" can be omitted completely, but it consists of numbers too, so filtering merely by digits won't be enough as it will match them as well.
Hope you can help me out
Thanks in advance!
This: sed -r 's/[(][^)]*[)]/ /g; s/[^0-9]+/ /g' should work. It makes two passes, removing parenthesized expressions first and then replacing all runs of non-digits with single spaces.
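Applied to the sample line, that should give:
$ echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' | sed -r 's/[(][^)]*[)]/ /g; s/[^0-9]+/ /g'
1 2 3 4 5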
You can do something like sed -e 's/(.*)//' -e 's/[^0-9]/ /g'. It deletes the round brackets and everything inside them, then substitutes all non-digit characters with a space. To get rid of extra spaces you can feed it to column -t:
$ echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' | sed -e 's/(.*)//' -e 's/[^0-9]/ /g' | column -t
1 2 3 4 5
TXR:
#(collect)
#one/#two [#three] (#date #time) #(skip :greedy) #four/#five
#(filter :tonumber one two three four five)
#(end)
#(bind (first second third fourth fifth)
#(mapcar (op apply +) (list one two three four five)))
#(output)
first:#first second:#second third:#third fourth:#fourth fifth:#fifth
#(end)
data:
1/2 [3] (27/03/2012 19:32:54) word word word word 4/5
10/20 [30] (27/03/2012 19:32:54) word word 40/50
run:
$ txr data.txr data.txt
first:11 second:22 third:33 fourth:44 fifth:55
Easy to add some error checking:
#(collect)
# (cases)
#one/#two [#three] (#date #time) #(skip :greedy) #four/#five
# (or)
#line
# (throw error `badly formatted line: #line`)
# (end)
# (filter :tonumber one two three four five)
#(end)
#(bind (first second third fourth fifth)
#(mapcar (op apply +) (list one two three four five)))
#(output)
first:#first second:#second third:#third fourth:#fourth fifth:#fifth
#(end)
$ txr data.txr -
foo bar junk
txr: unhandled exception of type error:
txr: ("badly formatted line: foo bar junk")
Aborted
TXR is for robust programming. There is strong typing, so you can't treat strings as numbers just because they contain digits. Variables have to be bound before use, and so misspelled variables do not silently default to zero or blank, but rather produce an unbound variable <name> in <file>:<line> type error. Text extraction is performed with lots of specific context to guard against misinterpreting input in one format as being in another format.
see below, if it is what you want:
kent$ echo "1/2 [3] (27/03/2012 19:32:54) word word word word 4/5"|sed -r 's/\([^)]*\)//g; s/[^0-9]/ /g'
1 2 3 4 5
if you want it to look better:
kent$ echo "1/2 [3] (27/03/2012 19:32:54) word word word word 4/5"|sed -r 's/\([^)]*\)//g; s/[^0-9]/ /g;s/ */ /g'
1 2 3 4 5
awk '{ first+=gensub("^([0-9]+)/.*","\\1","g",$0)
       second+=gensub("^[0-9]+/([0-9]+) .*","\\1","g",$0)
       third+=gensub("^[0-9]+/[0-9]+ \\[([0-9]+)\\].*","\\1","g",$0)
       fourth+=gensub("^.* ([0-9]+)/[0-9]+ *$","\\1","g",$0)
       fifth+=gensub("^.* [0-9]+/([0-9]+) *$","\\1","g",$0)
     }
     END { print "first: " first " second: " second " third: " third " fourth: " fourth " fifth: " fifth
     }' file
Might work for you.
This will give you the digits extracted out, excluding the text in parentheses:
digits=$(echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' |\
sed 's/(.*)//' | grep -o '[0-9][0-9]*')
echo $digits
or pure sed solution:
echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' |\
sed -e 's/(.*)//' -e 's/[^0-9]/ /g' -e 's/[ \t][ \t]*/ /g'
OUTPUT:
1 2 3 4 5
One pass with awk is sufficient if you set a fancy field separator: any one of slash, space, open bracket, or close bracket separates a field:
awk -F '[][/ ]' '
{s1+=$1; s2+=$2; s3+=$4; s4+=$(NF-1); s5+=$NF}
END {printf("first:%d second:%d third:%d fourth:%d fifth:%d\n", s1, s2, s3, s4, s5)}
'
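Fed just the sample line from the question, this should print:
first:1 second:2 third:3 fourth:4 fifth:5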

Resources