Can a repeated piece of regular expression create multiple groups? - ruby

I'm using Ruby's regular expressions to deal with text such as
${1:aaa|bbbb}
${233:aaa | bbbb | ccc ccccc }
${34: aaa | bbbb | cccccccc |d}
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
${3443:a aa|bbbb|cccccccc|d}
${353:aa a| b b b b | c c c c c c c c | dddddd}
I want to get the trimmed text between the pipes. For example, for the first line of the example above I want to get aaa and bbbb; for the second line I want aaa, bbbb and ccc ccccc. I have written a regular expression and a piece of Ruby code to test it:
array = "${33:aaa|bbbb|cccccccc}".scan(/\$\{\s*(\d+)\s*:(\s*[^\|]+\s*)(?:\|(\s*[^\|]+\s*))+\}/)
puts array
Now my problem is that the (?:\|(\s*[^\|]+\s*))+ part can't create multiple groups. I don't know how to solve this, because the number of items in each line is variable. Can anyone help?

When you repeat a capturing group in a regular expression, the capturing group only stores the text matched by its last iteration. If you need to capture multiple iterations, you'll need to use more than one regex. (.NET is the only exception to this. Its CaptureCollection provides the matches of all iterations of a capturing group.)
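A quick demonstration of that behavior against the asker's own data:
"${33:aaa|bbbb|cccccccc}" =~ /\$\{\d+:([^|}]+)(?:\|([^|}]+))+\}/
$1 #=> "aaa"
$2 #=> "cccccccc"  (the earlier "bbbb" iteration was overwritten)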
In your case, you could do a search-and-replace to replace ^\d+: with nothing. That strips off the number and colon at the start of your string. Then call split() using the regex \s*\|\s* to split the string into the elements delimited by vertical bars.
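In Ruby that could look like this (a minimal sketch of the idea just described, extended to also strip the ${ and } delimiters):
line = "${33:aaa|bbbb|cccccccc}"
line.sub(/\A\$\{\d+:/, '').sub(/\}\z/, '').split(/\s*\|\s*/).map(&:strip)
#=> ["aaa", "bbbb", "cccccccc"]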

Why don't you split your string?
str = "${233:aaa | bbbb | ccc ccccc }"
str.split(/\d+|\$|\{|\}|:|\|/).select { |v| !v.empty? }.map { |v| v.strip }.join(', ')
#=> "aaa, bbbb, ccc ccccc"

This might help you
Script
a = [
  '${1:aaa|bbbb}',
  '${233:aaa | bbbb | ccc ccccc }',
  '${34: aaa | bbbb | cccccccc |d}',
  '${343: aaa | bbbb | cccccccc |dddddd ddddddddd}',
  '${3443:a aa|bbbb|cccccccc|d}',
  '${353:aa a| b b b b | c c c c c c c c | dddddd}'
]
a.each do |input|
  puts input
  input.scan(/[:|]([^|}]+)/).flatten.each do |s|
    puts s.gsub(/(^\s+|\s+$)/, '') # trim
  end
end
Output
${1:aaa|bbbb}
aaa
bbbb
${233:aaa | bbbb | ccc ccccc }
aaa
bbbb
ccc ccccc
${34: aaa | bbbb | cccccccc |d}
aaa
bbbb
cccccccc
d
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
aaa
bbbb
cccccccc
dddddd ddddddddd
${3443:a aa|bbbb|cccccccc|d}
a aa
bbbb
cccccccc
d
${353:aa a| b b b b | c c c c c c c c | dddddd}
aa a
b b b b
c c c c c c c c
dddddd

Instead of trying to do everything at once, divide and conquer:
DATA.each do |line|
  line =~ /:(.+)\}/
  items = $1.strip.split( /\s* \| \s*/x )
  p items
end
__END__
${1:aaa|bbbb}
${233:aaa | bbbb | ccc ccccc }
${34: aaa | bbbb | cccccccc |d}
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
${3443:a aa|bbbb|cccccccc|d}
${353:aa a| b b b b | c c c c c c c c | dddddd}
If you want to do it with a single regex, you can use scan, but this seems more difficult to grok:
DATA.each do |line|
  items = line.scan( /[:|] ([^|}]+) /x ).flatten.map { |i| i.strip }
  p items
end

Related

2-word links using XPath

I need to find links on a page consisting of two words. How can this be done with XPath?
<div class="navbar">
  <p>
    <a>Aaa aaa</a>
    <a>Bbb</a>
    <a>Ccc ccc</a>
    <a>Ddd</a>
    <a>Eee</a>
    <a>Fff fff ff</a>
  </p>
</div>
If you can differentiate the strings by the count of spaces, you could use this XPath-1.0 expression:
/div/p/a[string-length(normalize-space(.))-string-length(translate(normalize-space(.),' ',''))=1]
This matches all two-word strings: exactly one space left after normalize-space means exactly two words.
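From Ruby you could apply the same expression with the nokogiri gem (a sketch; the gem, and test.html containing the markup above, are assumed):
require 'nokogiri'

doc = Nokogiri::HTML(File.read('test.html'))
expr = "//a[string-length(normalize-space(.)) - " \
       "string-length(translate(normalize-space(.), ' ', '')) = 1]"
doc.xpath(expr).each { |a| puts a.text }
#=> Aaa aaa
#=> Ccc ccc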
My two cents: remove the non-space characters and test what remains.
XPath 1: //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "]
echo -e 'cat //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "] \n bye' | xmllint --shell test.html
/ > cat //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "]
-------
Aaa aaa
-------
Ccc ccc
/ > bye
Using the length of the remaining spaces instead:
XPath 2: //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1]
echo -e 'cat //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1] \n bye' | xmllint --shell test.html
/ > cat //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1]
-------
Aaa aaa
-------
Ccc ccc
/ > bye

remove lines in file from stream

cat file_a
aaa
bbb
ccc
cat file_b
ddd
eee
fff
cat file_x
bbb
ccc
ddd
eee
I want to cat file_a file_b | remove_from_stream_what_is_in(file_x)
Result:
aaa
fff
If there is no basic filter to do this with, then I wonder if there is a way with ruby -ne '...'.
Try:
$ cat file_a file_b | grep -vFf file_x
aaa
fff
-v inverts the match, removing the matching lines from the output.
-F tells grep to treat the match patterns as fixed strings, not regular expressions.
-f file_x tells grep to get the match patterns from the lines of file_x.
Other options that you may want to consider are:
-w tells grep to match only complete words.
-x tells grep to match only complete lines.
Given sample files created like this:
IO.write('file_a', %w| aaa bbb ccc |.join("\n")) #=> 11
IO.write('file_b', %w| ddd eee fff |.join("\n")) #=> 11
IO.write('file_x', %w| bbb ccc ddd eee |.join("\n")) #=> 15
it is a simple array difference in Ruby:
IO.readlines('file_a', chomp: true) + IO.readlines('file_b', chomp: true) -
IO.readlines('file_x', chomp: true)
#=> ["aaa", "fff"]

Bash replace column of a csv based on identifier and replacement column from another CSV

I have 2 CSV files, where the 1st one is my main CSV containing all the columns I need. The 2nd CSV contains 2 columns: the 1st column is an identifier, and the 2nd column is the replacement value. For example:
Main.csv
aaa 111 bbb 222 ccc 333
ddd 444 eee 555 fff 666
iii 777 jjj 888 kkk 999
lll 101 eee 201 nnn 301
replacement.csv
bbb abc
jjj def
eee ghi
I want the result to look like the following. The 3rd column of main.csv is the identifier, matched against the 1st column of replacement.csv; wherever they match, the 5th column of main.csv should be replaced with the 2nd column of replacement.csv. main.csv can contain repeated identifiers, and every occurrence should get the appropriate replacement value:
aaa 111 bbb 222 abc 333
ddd 444 eee 555 ghi 666
iii 777 jjj 888 def 999
lll 101 eee 201 ghi 301
I tried a code like this
while read col1 col2 col3 col4 col5 col6
do
    while read col7 col8
    do
        if[$col7==col3]
        then
            col5=col8
        fi
    done < RepCSV
done < MainCSV > MainCSV
But it did not work.
I'm quite new to bash, so any help will be appreciated. Thanks in advance.
Using awk:
$ awk '
NR==FNR {                   # process the first file
    a[$1]=$2                # hash $2 into a, $1 as key
    next                    # move to the next record
}
{                           # second file
    $5=($3 in a?a[$3]:$5)   # replace $5 based on $3
}1' replacement main
aaa 111 bbb 222 abc 333
ddd 444 eee 555 ghi 666
iii 777 jjj 888 def 999
lll 101 eee 201 ghi 301
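The same two-pass lookup reads naturally in Ruby as well (a sketch, assuming the whitespace-separated files shown above):
# Build the replacement map: identifier => new value
repl = File.readlines('replacement.csv', chomp: true).map(&:split).to_h

# Rewrite column 5 of main.csv wherever column 3 is a known identifier
File.readlines('main.csv', chomp: true).each do |line|
  cols = line.split
  cols[4] = repl[cols[2]] if repl.key?(cols[2])
  puts cols.join(' ')
end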

how to replace the last special character in each row

cat test1
a a 1 a aa 1 1 111 bb b
a1b a 11 b b b
1 asd fdg 1 bb b
I want to replace the last "1" in each row with #, keeping the rest of the data the same.
My expected result:
cat expected_result
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b
Can this be done with sed? I don't know how to select the last "1" in each row. Thanks.
Method 1:
1([^1]*)$ matches the last 1 on the line and everything after:
$ sed -E 's/1([^1]*)$/#\1/' test1
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b
Method 2:
(.*)1 matches everything on the line up to and including the last 1:
$ sed -E 's/(.*)1/\1#/' test1
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b
This works because sed's regular expressions are greedy (more precisely, leftmost-longest). The leftmost-longest match of (.*)1 will match from the beginning of the line through the last 1 on the line.
You can try this. * is greedy, so \(.*\)1 matches up to the last occurrence of 1 on each line, and \1# replaces that last 1 with #. To replace the last occurrence of some other character x with y, the same pattern becomes 's/\(.*\)x/\1y/'.
sed 's/\(.*\)1/\1#/' filename
Output:
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b
A solution using rev and awk:
rev Input_file | awk '{sub(/1/,"#");print}' | rev
Output will be as follows.
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b
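For comparison, the same last-occurrence trick in plain Ruby (a sketch, reading the test1 file shown in the question):
File.readlines('test1', chomp: true).each do |line|
  # [^1]*$ pins the match to the final "1" on the line
  puts line.sub(/1([^1]*)$/, '#\1')
end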
1([^1]*$) matches the last 1 on the line and everything after it:
$ cat v1
a a 1 a aa 1 1 111 bb b
a1b a 11 b b b
1 asd fdg 1 bb b
$ sed -r 's/1([^1]*$)/#\1/' v1
a a 1 a aa 1 1 11# bb b
a1b a 1# b b b
1 asd fdg # bb b

How can I compare two 2D-array files with bash?

I have two 2D-array files to read with bash.
What I want to do is extract the elements inside both files.
The two files have different numbers of rows and columns, such as:
file1.txt (nx7)
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
.
.
.
file2.txt (mx3)
DESC W S
AAA 100 100
CCC 135 135
EEE 789 789
.
.
.
Here is what I want to do:
Extract the element in DESC column of file2.txt then find the corresponding element in file1.txt.
Extract the W,S elements in such row of file2.txt then find the corresponding W,S elements in such row of file1.txt.
If [W1==W2 && S1==S2]; then echo "${DESC[colindex]} ok"; else echo "${DESC[colindex]} NG"
How can I read this kind of file as a 2D array with bash or is there any convenient way to do that?
bash does not support 2D arrays. You can simulate them by generating 1D array variables like array1, array2, and so on.
Assuming DESC is a key (i.e. has no duplicate values) and does not contain any spaces:
#!/bin/bash
# read data from file1 into simulated 2D arrays data0, data1, ...
idx=0
while read -a data$idx; do
    let idx++
done <file1.txt
# process data from file2
while read desc w2 s2; do
    for ((i=0; i<idx; i++)); do
        v="data$i[1]"                 # DESC column of row i
        [ "$desc" = "${!v}" ] && {
            w1="data$i[4]"            # W column
            s1="data$i[5]"            # S column
            if [ "$w2" = "${!w1}" -a "$s2" = "${!s1}" ]; then
                echo "$desc ok"
            else
                echo "$desc NG"
            fi
            break
        }
    done
done <file2.txt
For brevity, optimizations such as taking advantage of sort order are left out.
If the files actually contain the header NO DESC ID TYPE ... then use tail -n +2 to discard it before processing.
A more elegant solution is also possible, which avoids reading the entire file in memory. This should only be relevant for really large files though.
If the row order does not need to be preserved (the files can be sorted), maybe this is enough:
join -2 2 -o 1.1,1.2,1.3,2.5,2.6 <(tail -n +2 file2.txt|sort) <(tail -n +2 file1.txt|sort) |\
sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/' |\
sed '/ OK$/!s/\([^ ]*\) .*/\1 NG/'
For file1.txt
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
and file2.txt
DESC W S
AAA 000 100
CCC 135 135
EEE 789 000
FCK xxx 135
produces:
AAA NG
CCC OK
EEE NG
Explanation:
skip the header line in both files: tail -n +2
sort both files
join the needed columns from both files into one table; only lines with a DESC value present in both files appear in the result, like this:
AAA 000 100 100 100
CCC 135 135 135 135
EEE 789 000 789 789
in lines where columns 2 and 4 and columns 3 and 5 agree, replace everything after the first column with OK
in the remaining lines, replace everything after the first column with NG
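If stepping outside pure bash is an option, the same lookup is short in Ruby too (a sketch, assuming the headers shown above):
# Build DESC => [W, S] from file1.txt, skipping the header row
ref = {}
File.readlines('file1.txt', chomp: true).drop(1).each do |line|
  cols = line.split
  ref[cols[1]] = [cols[4], cols[5]]
end

# Compare each data row of file2.txt against it
File.readlines('file2.txt', chomp: true).drop(1).each do |line|
  desc, w, s = line.split
  puts(ref[desc] == [w, s] ? "#{desc} ok" : "#{desc} NG") if ref.key?(desc)
end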
