2-word links using XPath

I need to find links on a page consisting of two words. How can this be done with XPath?
<div class="navbar">
  <p>
    <a>Aaa aaa</a>
    <a>Bbb</a>
    <a>Ccc ccc</a>
    <a>Ddd</a>
    <a>Eee</a>
    <a>Fff fff ff</a>
  </p>
</div>

If you can differentiate the strings by the count of spaces, you could use this XPath-1.0 expression:
/div/p/a[string-length(normalize-space(.))-string-length(translate(normalize-space(.),' ',''))=1]
This matches all two-word strings.
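One way to try this out is with xmllint (just a sketch: it assumes the snippet is saved as page.html, and since xmllint's HTML parser wraps the fragment in html/body, the path is anchored with // rather than /div/p):
xmllint --html --xpath '//div[@class="navbar"]/p/a[string-length(normalize-space(.)) - string-length(translate(normalize-space(.), " ", "")) = 1]' page.html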

My 2 cents: remove the non-space characters and check what is left.
XPath 1, comparing the remaining spaces directly: //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "]
echo -e 'cat //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "] \n bye' | xmllint --shell test.html
/ > cat //a[translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","") = " "]
-------
Aaa aaa
-------
Ccc ccc
/ > bye
Or, using the length of the remaining spaces:
XPath 2: //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1]
echo -e 'cat //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1] \n bye' | xmllint --shell test.html
/ > cat //a[string-length(translate(normalize-space(.),"abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ","")) = 1]
-------
Aaa aaa
-------
Ccc ccc
/ > bye

Related

Remove lines in file from stream

cat file_a
aaa
bbb
ccc
cat file_b
ddd
eee
fff
cat file_x
bbb
ccc
ddd
eee
I want to cat file_a file_b | remove_from_stream_what_is_in(file_x)
Result:
aaa
fff
If there is no basic filter to do this with, then I wonder if there is a way with ruby -ne '...'.
Try:
$ cat file_a file_b | grep -vFf file_x
aaa
fff
-v means remove matching lines.
-F tells grep to treat the match patterns as fixed strings, not regular expressions.
-f file_x tells grep to get the match patterns from the lines of file_x.
Other options that you may want to consider are:
-w tells grep to match only complete words.
-x tells grep to match only complete lines.
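For instance, -x matters when one line is a prefix of another; a quick illustration (with made-up data, not the files above):
$ printf 'aaa\naa\n' | grep -vFf <(printf 'aa\n')     # without -x, the pattern "aa" also matches inside "aaa", so nothing is left
$ printf 'aaa\naa\n' | grep -vxFf <(printf 'aa\n')    # with -x, only the exact line "aa" is removed
aaa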
Create the test files first:
IO.write('file_a', %w| aaa bbb ccc |.join("\n"))     #=> 11
IO.write('file_b', %w| ddd eee fff |.join("\n"))     #=> 11
IO.write('file_x', %w| bbb ccc ddd eee |.join("\n")) #=> 15
Then, from Ruby:
IO.readlines('file_a', chomp: true) + IO.readlines('file_b', chomp: true) -
IO.readlines('file_x', chomp: true)
#=> ["aaa", "fff"]
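The question also asked about ruby -ne; a minimal one-liner in that spirit (a sketch, assuming file_x comfortably fits in memory):
cat file_a file_b | ruby -ne 'BEGIN { $skip = File.readlines("file_x", chomp: true) }; print unless $skip.include?($_.chomp)'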

How can I compare two 2D-array files with bash?

I have two 2D-array files to read with bash.
What I want to do is extract the elements from both files and compare them.
The two files have different numbers of rows and columns, such as:
file1.txt (nx7)
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
.
.
.
file2.txt (mx3)
DESC W S
AAA 100 100
CCC 135 135
EEE 789 789
.
.
.
Here is what I want to do:
Extract the element in the DESC column of file2.txt, then find the corresponding element in file1.txt.
Extract the W and S elements of that row of file2.txt, then find the corresponding W and S elements in the matching row of file1.txt.
if [ "$W1" == "$W2" ] && [ "$S1" == "$S2" ]; then echo "${DESC[colindex]} ok"; else echo "${DESC[colindex]} NG"; fi
How can I read this kind of file as a 2D array with bash or is there any convenient way to do that?
bash does not support 2D arrays. You can simulate them by generating 1D array variables like array1, array2, and so on.
Assuming DESC is a key (i.e. has no duplicate values) and does not contain any spaces:
#!/bin/bash
# read data from file1
idx=0
while read -a data$idx; do
    let idx++
done <file1.txt
# process data from file2
while read desc w2 s2; do
    for ((i=0; i<idx; i++)); do
        v="data$i[1]"
        [ "$desc" = "${!v}" ] && {
            w1="data$i[4]"
            s1="data$i[5]"
            if [ "$w2" = "${!w1}" -a "$s2" = "${!s1}" ]; then
                echo "$desc ok"
            else
                echo "$desc NG"
            fi
            break
        }
    done
done <file2.txt
For brevity, optimizations such as taking advantage of sort order are left out.
If the files actually contain the header NO DESC ID TYPE ... then use tail -n +2 to discard it before processing.
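For example, the read loops can take their input from process substitutions, which drops the header and keeps the loop in the current shell (so the data$idx arrays survive); a sketch for the first loop:
idx=0
while read -a data$idx; do
    let idx++
done < <(tail -n +2 file1.txt)    # and likewise: done < <(tail -n +2 file2.txt)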
A more elegant solution is also possible, one that avoids reading the entire file into memory, but that should only matter for really large files.
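As a hypothetical sketch of that direction (not the answer's actual code), awk can hold just file1's DESC/W/S columns in a lookup table and stream file2.txt against it:
awk '
  # first pass (file1.txt): remember W and S for each DESC, skipping the header
  NR == FNR { if (FNR > 1) seen[$2] = $5 " " $6; next }
  # second pass (file2.txt): compare each row against the lookup
  FNR > 1   { print $1, (seen[$1] == ($2 " " $3) ? "ok" : "NG") }
' file1.txt file2.txt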
If the row order does not need to be preserved (i.e. the files can be sorted), maybe this is enough:
join -2 2 -o 1.1,1.2,1.3,2.5,2.6 <(tail -n +2 file2.txt|sort) <(tail -n +2 file1.txt|sort) |\
sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/' |\
sed '/ OK$/!s/\([^ ]*\) .*/\1 NG/'
For file1.txt
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
and file2.txt
DESC W S
AAA 000 100
CCC 135 135
EEE 789 000
FCK xxx 135
produces:
AAA NG
CCC OK
EEE NG
Explanation:
skip the header line in both files - tail -n +2
sort both files
join the needed columns from both files into one table; the result contains only the lines that share a common DESC field, like:
AAA 000 100 100 100
CCC 135 135 135 135
EEE 789 000 789 789
in the lines where columns 2 and 4 and columns 3 and 5 are equal, replace everything after the first column with OK
in the remaining lines, replace everything after the first column with NG

How to do natural sort of "uniq -c" output in descending/ascending order?

How to do natural sort on uniq -c output?
When the counts are <10, the uniq -c | sort output looks fine:
alvas@ubi:~/testdir$ echo -e "aaa\nbbb\naa\ncd\nada\naaa\nbbb\naa\nccd\naa" > test.txt
alvas@ubi:~/testdir$ cat test.txt
aaa
bbb
aa
cd
ada
aaa
bbb
aa
ccd
aa
alvas@ubi:~/testdir$ cat test.txt | sort | uniq -c | sort
1 ada
1 ccd
1 cd
2 aaa
2 bbb
3 aa
but when the counts reach 10 or more, or even run into the hundreds or thousands, the sort gets messed up because it is sorting the counts as strings rather than as integers:
alvas@ubi:~/testdir$ echo -e "aaa\nbbb\naa\nnaa\nnaa\naa\nnaa\nnaa\nnaa\nnaa\nnaa\nnaa\nnaa\nnaa\nnnaa\ncd\nada\naaa\nbbb\naa\nccd\naa" > test.txt
alvas@ubi:~/testdir$ cat test.txt | sort | uniq -c | sort
10 naa
1 ada
1 ccd
1 cd
1 nnaa
2 aaa
2 bbb
4 aa
How to do natural sort of "uniq -c" output in descending/ascending order?
Use -n in your sort command, so that it sorts numerically. Also -r allows you to reverse the result:
$ sort test.txt | uniq -c | sort -n
1 ada
1 ccd
1 cd
1 nnaa
2 aaa
2 bbb
4 aa
10 naa
$ sort test.txt | uniq -c | sort -nr
10 naa
4 aa
2 bbb
2 aaa
1 nnaa
1 cd
1 ccd
1 ada
From man sort:
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons

Show diff between two files in specific format

Here is the question: I have two files:
file1:
aaa
bbb
ccc
ddd
file2:
bbb
ddd
How can I use diff to get this output (only the differences)?
aaa
ccc
If what you want is the records unique to file1, then:
$ comm -23 <(sort file1) <(sort file2)
aaa
ccc
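For context, comm prints three columns (lines only in the first file, lines only in the second, lines common to both), and -2 -3 suppress the last two, leaving just the records unique to file1. The grep recipe from the earlier question gives the same result here and does not need sorted input (a sketch):
$ grep -vxFf file2 file1
aaa
ccc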

Can a repeated piece of regular expression create multiple groups?

I'm using Ruby's regular expressions to deal with text such as
${1:aaa|bbbb}
${233:aaa | bbbb | ccc ccccc }
${34: aaa | bbbb | cccccccc |d}
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
${3443:a aa|bbbb|cccccccc|d}
${353:aa a| b b b b | c c c c c c c c | dddddd}
I want to get the trimmed text between the pipes. For example, for the first line of the example above I want to get aaa and bbbb; for the second line I want aaa, bbbb and ccc ccccc. I have written a regular expression and a piece of Ruby code to test it:
array = "${33:aaa|bbbb|cccccccc}".scan(/\$\{\s*(\d+)\s*:(\s*[^\|]+\s*)(?:\|(\s*[^\|]+\s*))+\}/)
puts array
My problem is that the (?:\|(\s*[^\|]+\s*))+ part doesn't create multiple groups. I don't know how to solve this, because the number of items in each line is variable. Can anyone help?
When you repeat a capturing group in a regular expression, the capturing group only stores the text matched by its last iteration. If you need to capture multiple iterations, you'll need to use more than one regex. (.NET is the only exception to this. Its CaptureCollection provides the matches of all iterations of a capturing group.)
In your case, you could do a search-and-replace to replace ^\d+: with nothing. That strips off the number and colon at the start of your string. Then call split() using the regex \s*\|\s* to split the string into the elements delimited by vertical bars.
Why don't you split your string?
str = "${233:aaa | bbbb | ccc ccccc }"
str.split(/\d+|\$|\{|\}|:|\|/).select { |v| !v.empty? }.map { |v| v.strip }.join(', ')
#=> "aaa, bbbb, ccc ccccc"
This might help you
Script
a = [
  '${1:aaa|bbbb}',
  '${233:aaa | bbbb | ccc ccccc }',
  '${34: aaa | bbbb | cccccccc |d}',
  '${343: aaa | bbbb | cccccccc |dddddd ddddddddd}',
  '${3443:a aa|bbbb|cccccccc|d}',
  '${353:aa a| b b b b | c c c c c c c c | dddddd}'
]

a.each do |input|
  puts input
  input.scan(/[:|]([^|}]+)/).flatten.each do |s|
    puts s.gsub(/(^\s+|\s+$)/, '') # trim surrounding whitespace
  end
end
Output
${1:aaa|bbbb}
aaa
bbbb
${233:aaa | bbbb | ccc ccccc }
aaa
bbbb
ccc ccccc
${34: aaa | bbbb | cccccccc |d}
aaa
bbbb
cccccccc
d
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
aaa
bbbb
cccccccc
dddddd ddddddddd
${3443:a aa|bbbb|cccccccc|d}
a aa
bbbb
cccccccc
d
${353:aa a| b b b b | c c c c c c c c | dddddd}
aa a
b b b b
c c c c c c c c
dddddd
Instead of trying to do everything at once, divide and conquer:
DATA.each do |line|
  line =~ /:(.+)\}/
  items = $1.strip.split( /\s* \| \s*/x )
  p items
end
__END__
${1:aaa|bbbb}
${233:aaa | bbbb | ccc ccccc }
${34: aaa | bbbb | cccccccc |d}
${343: aaa | bbbb | cccccccc |dddddd ddddddddd}
${3443:a aa|bbbb|cccccccc|d}
${353:aa a| b b b b | c c c c c c c c | dddddd}
If you want to do it with a single regex, you can use scan, but this seems more difficult to grok:
DATA.each do |line|
  items = line.scan( /[:|] ([^|}]+) /x ).flatten.map { |i| i.strip }
  p items
end
