I'm parsing an HTML file successfully with xmllint, but when I combine two or more XPath expressions I get only one occurrence instead of all of them.
When I run the expressions separately I get something like this:
Expression:
xmllint --html --xpath "//h3[contains(text(),'Rodada')]/../following-sibling::div//span[contains(@class,'partida-desc')][2]/text()" 2012-campeonato-brasileiro.html 2>/dev/null
Result:
Couto Pereira - Curitiba - PR
Aflitos - Recife - PE
Serra Dourada - Goiania - GO
But when I combine the expressions:
prefix="//h3[contains(text(),'Rodada')]/../following-sibling::div"
xmllint --html --xpath "normalize-space(concat($prefix//span[contains(@class,'partida-desc')]/text(),';',$prefix//div[contains(@class,'pull-left')]//img/@title,';',$prefix//div[contains(@class,'pull-right')]//img/@title,';',$prefix//strong/span/text(),';',$prefix//span[contains(@class,'partida-desc')][2]/text()))" 2012-campeonato-brasileiro.html 2>/dev/null
Result:
Sáb, 19/05/2012 18:30 - Jogo: 3 ;Palmeiras - SP;Portuguesa - SP;1 x 1; Pacaembu - Sao Paulo - SP
It works but stops at the first result. I can't make it parse the whole file.
To run this example, you can download the HTML from here:
curl https://www.cbf.com.br/futebol-brasileiro/competicoes/campeonato-brasileiro-serie-a/2012 --compressed > /tmp/2012-campeonato-brasileiro.html
In XPath 1.0, when you pass a node-set to a string function such as normalize-space() or concat(), only the string value of the first node in the node-set is used.
In XPath 2.0 and later you can use e.g. //foo/normalize-space(), //foo/concat(.//bar, ';', .//baz), or string-join(//foo, ';').
With pure XPath 1.0 you would need to iterate in the host language (e.g. shell, XSLT, or Java) and do the concatenation there.
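For example, here is a minimal shell sketch of that host-language iteration, assuming each match sits in its own li element under the div (the structure used in the asker's script further down); only two of the five fields are shown:
n=$(xmllint --html --xpath "count(//h3[contains(text(),'Rodada')]/../following-sibling::div/ul/li)" 2012-campeonato-brasileiro.html 2>/dev/null)
for ((i = 1; i <= n; i++)); do
  # restrict every sub-expression to the i-th li, so "only the first node" is exactly the value we want
  p="(//h3[contains(text(),'Rodada')]/../following-sibling::div/ul/li)[$i]"
  xmllint --html --xpath "normalize-space(concat($p//span[contains(@class,'partida-desc')]/text(),';',$p//span[contains(@class,'partida-desc')][2]/text()))" 2012-campeonato-brasileiro.html 2>/dev/null
done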
concat() will operate only on the first node of a node-set.
The following command adds more processing to take advantage of xmllint's interactive shell:
echo -e "cd //h3[contains(text(),'Rodada')]/../following-sibling::div \n cat .//span[contains(@class,'partida-desc')]/text() | .//div[contains(@class,'pull-left')]//img/@title | .//div[contains(@class,'pull-right')]//img/@title | .//strong/span/text() | .//span[contains(@class,'partida-desc')][2]/text() \nbye\n" | \
xmllint --html --shell 2012-campeonato-brasileiro.html 2>/dev/null | \
tr -s ' ' | grep -v '^ *$' | \
gawk 'BEGIN{ RS="(\n -------){3,3}"; FS="\n -------\n"; OFS=";"} {if(NR>2) { print gensub(/\n/,"","g",$1),gensub(/title="([^"]+)"/,"\\1","g",$2),gensub(/title="([^"]+)"/,"\\1","g",$3),$4,$5}}'
Result:
Sáb, 19/05/2012 21:00 - Jogo: 4 ; Figueirense - SC; Náutico - PE;2 x 1; Orlando Scarpelli - Florianopolis - SC
Dom, 20/05/2012 16:00 - Jogo: 8 ; Ponte Preta - SP; Atlético - MG;0 x 1; Moisés Lucarelli - Campinas - SP
Dom, 20/05/2012 16:00 - Jogo: 5 ; Corinthians - SP; Fluminense - RJ;0 x 1; Pacaembu - Sao Paulo - SP
Dom, 20/05/2012 16:00 - Jogo: 7 ; Botafogo - RJ; São Paulo - SP;4 x 2; João Havelange - Rio de Janeiro - RJ
Dom, 20/05/2012 16:00 - Jogo: 6 ; Internacional - RS; Coritiba - PR;2 x 0; Beira-Rio - Porto Alegre - RS
Dom, 20/05/2012 18:30 - Jogo: 1 ; Vasco da Gama - RJ; Grêmio - RS;2 x 1; São Januário - Rio de Janeiro - RJ
Dom, 20/05/2012 18:30 - Jogo: 2 ; Bahia - BA; Santos - SP;0 x 0; Pituaçu - Salvador - BA
.... (more records)
More cleanup might be needed since the fields contain leading/trailing spaces.
Note: the HTML needs to be converted to Unix newlines first:
dos2unix 2012-campeonato-brasileiro.html
Thanks for your answers!
Considering my alternatives, this is my best solution so far:
#!/bin/bash
RODADAS=$(xmllint --html --xpath "//h3[contains(text(),'Rodada ')]/text()" $1 2>/dev/null)
while read i
do
  for x in {1..10}
  do
    PREFIX="//h3[contains(text(), '$i')]/../following-sibling::div/ul/li[$x]";
    xmllint --html --xpath "normalize-space(concat($PREFIX//span[contains(@class,'partida-desc')]/text(),';',$PREFIX//div[contains(@class,'pull-left')]//img/@title,';',$PREFIX//div[contains(@class,'pull-right')]//img/@title,';',$PREFIX//strong/span/text(),';',$PREFIX//span[contains(@class,'partida-desc')][2]/text()))" $1 2>/dev/null;
  done
done <<< "$RODADAS"
Run:
./html-csv.sh 2012-campeonato-brasileiro.html
Result:
Sáb, 01/12/2012 19:30 - Jogo: 373 ;Santos - SP;Palmeiras - SP;3 x 1; Vila Belmiro - Santos - SP
Dom, 02/12/2012 17:00 - Jogo: 372 ;Fluminense - RJ;Vasco da Gama - RJ;1 x 2; João Havelange - Rio de Janeiro - RJ
Dom, 02/12/2012 17:00 - Jogo: 374 ;São Paulo - SP;Corinthians - SP;3 x 1; Pacaembu - Sao Paulo - SP
Given master.yml ...
containers:
  - name: project-a
    items:
      - CCC
  - name: project-z
    items:
      - CCC
      - DDD
... and an update.yml ...
- name: project-z
  items:
    - CCC
    - EEE
... I want to merge it into the correct entry. This would give:
containers:
  - name: project-a
    items:
      - CCC
  - name: project-z
    items:
      - CCC
      - DDD
      - EEE
The following yq 4 command works if the update is for project-a:
yq ea 'select(fileIndex==0) * {"containers":select(fileIndex==1)}' master.yml update.yml
but, with the provided project-z update, it incorrectly replaces the first array entry (you end up with two project-zs).
After a deep dive into the manuals, I have this:
yq ea 'select(fi==1).0.name as $p | (select(fi==0).containers.[] | select(.name==$p)) = (select(fi==1) | .0 | . headComment="AUTO-UPDATED") | select(fi==0)' master.yml update.yml
which replaces rather than merges project-z: it first searches for the entry whose name matches update.yml and then completely replaces its content.
I understand the root cause to be that the data is formatted as a list where it should be a dictionary (name is unique).
A dictionary like the one below would merge trivially and would be better in the future too (see the sketch after it)!
containers:
  project-a:
    items:
      - CCC
  project-z:
    items:
      - CCC
      - DDD
      - EEE
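For reference, a minimal sketch of that trivial merge, assuming update.yml is also rewritten in this dictionary form with a containers: parent (not the list form above); *+ appends the items arrays, so a duplicate CCC would still need a separate unique step:
yq ea 'select(fi == 0) *+ select(fi == 1)' master.yml update.yml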
This is similar to another Stack Overflow question for which kev had a really neat solution (I think).
I've since added his solution to the yq docs here: https://mikefarah.gitbook.io/yq/operators/multiply-merge#merge-arrays-of-objects-together-matching-on-a-key
In your case, this will only work if the second file matches the structure of the first (that is, it also has 'containers' as a parent):
yq eval-all '
  (
    ((.containers[] | {.name: .}) as $item ireduce ({}; . *+ $item )) as $uniqueMap
    | ( $uniqueMap | to_entries | .[]) as $item ireduce([]; . + $item.value)
  ) as $mergedArray
  | select(fi == 0) | .containers = $mergedArray
' file.yaml file2.yaml
containers:
  - name: project-a
    items:
      - CCC
  - name: project-z
    items:
      - CCC
      - DDD
      - CCC
      - EEE
Basically it works by reducing the entries into a map merged by name (as you mention, this would be much easier if the data were already keyed that way) and then converting that map back into an array.
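To illustrate, for the two sample files the intermediate $uniqueMap would look roughly like this (a sketch of the intermediate value, not actual command output) before it is flattened back into an array:
project-a:
  name: project-a
  items:
    - CCC
project-z:
  name: project-z
  items:
    - CCC
    - DDD
    - CCC
    - EEE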
Disclaimer: I wrote yq
I have data like this:
AA_MAF EA_MAF ExAC_MAF
- - -
G:0.001445 G:0.0044 -
- - -
- - C:0.277
C:0.1984 C:0.1874 C:0.176
G:0.9296 G:0.9994 G:0.993&C:8.237e-06
C:0.9287 C:0.9994 C:0.993&T:5.767e-05
I need to split every column on : and &, that is, separate the letters (A, C, G, T) from the frequencies that follow them. This is very complicated and I'm not sure it is possible to solve.
The required output is tab-separated:
AA_MAF AA_MAF EA_MAF EA_MAF ExAC_MAF ExAC_MAF ExAC_MAF ExAC_MAF
- - - - - -
G 0.001445 G 0.0044 - - - -
- - - - - -
- - C 0.277 - -
C 0.1984 C 0.1874 C 0.176 - -
G 0.9296 G 0.9994 G 0.993 C 8.24E-006
C 0.9287 C 0.9994 C 0.993 T 5.77E-005
If a field is empty, substitute - for it.
My try was:
awk -v OFS="\t" '{for(i=1; i<=NF; i++) {sub(":","\t",$i); sub("&","\t",$i)}}1' IN_FILE | awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "-" }1'
If the trailing dashes are not required, you could use this command:
$ awk -F'[ \t:&]+' -v OFS='\t' '{$1=$1}1' file
AA_MAF EA_MAF ExAC_MAF
- - -
G 0.001445 G 0.0044 -
- - -
- - C 0.277
C 0.1984 C 0.1874 C 0.176
G 0.9296 G 0.9994 G 0.993 C 8.237e-06
C 0.9287 C 0.9994 C 0.993 T 5.767e-05
If you need the trailing dashes:
$ awk -F'[ \t:&]+' -v OFS='\t' '{$1=$1;for(i=NF+1;i<=8;i++)$i="-"}1' file
AA_MAF EA_MAF ExAC_MAF - - - - -
- - - - - - - -
G 0.001445 G 0.0044 - - - -
- - - - - - - -
- - C 0.277 - - - -
C 0.1984 C 0.1874 C 0.176 - -
G 0.9296 G 0.9994 G 0.993 C 8.237e-06
C 0.9287 C 0.9994 C 0.993 T 5.767e-05
awk '{ for (i=1; i<=NF; i++) {
         v1 = v2 = $i
         # if the field is letter:frequency, keep what is before ":" in v1 and what is after it in v2
         if ($i ~ /:/ ) { gsub(/:.*/, "", v1); gsub( /.*:/, "", v2) }
         printf( "%s%s%s%s", v1, OFS, v2, OFS)
       }
       print ""
     }' YourFile
For each field, check whether it contains ":". If it does, split the field into the part before and the part after the colon; if not, duplicate the value. Then print both values with a separator between them, for every field on the line. Do this for each line (including the header).
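The script above only splits on ":"; a minimal sketch that also handles the "&"-joined pairs (an assumption based on the sample data, and without padding rows to a fixed column count) could look like this:
awk -v OFS='\t' '{
  out = ""
  for (i = 1; i <= NF; i++) {
    n = split($i, pairs, /&/)            # fields like G:0.993&C:8.237e-06 hold several letter:frequency pairs
    for (j = 1; j <= n; j++) {
      if (pairs[j] ~ /:/) {              # letter:frequency -> two output fields
        split(pairs[j], kv, /:/)
        out = out kv[1] OFS kv[2] OFS
      } else {                           # a lone "-" (or a header name) is duplicated
        out = out pairs[j] OFS pairs[j] OFS
      }
    }
  }
  sub(/\t$/, "", out)                    # drop the trailing tab
  print out
}' YourFile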
I have two text files with the following line format:
Value - Value - Number
I need to merge these files into a new one that contains only the lines with a common Value - Value pair, followed by the two Number values.
For example if I have these files:
File1.txt
Jack - Mark - 12
Alex - Ryan - 15
Jack - Ryan - 22
File2.txt
Paul - Bill - 11
Jack - Mark - 18
Jack - Ryan - 20
The merged file will contain:
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
How can I do this?
awk to the rescue!
awk -F' - ' 'BEGIN{OFS=FS}
     # first file: remember the Number keyed by the Value pair
     NR==FNR{a[$1,$2]=$3;next}
     # second file: print only the common pairs, followed by both Numbers
     ($1,$2) in a{print $1,$2,a[$1,$2],$3}' file1 file2
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
Alternatively, with decorate/join/undecorate:
$ join <(sort file1 | sed 's/ - /-/') <(sort file2 | sed 's/ - /-/') |
sed 's/-/ - /'
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
I need to split the following string into two parts in AWK. Each of the lines below is one complete input line:
Dec 10 03:38:49 cgnat1.dd.com 1 2015 Dec 9 14:38:47 02-g4-adsl - - NAT44 - [UserbasedW - 100.70.92.248 vrf-testnet - 222.222.34.125 - 38912 39935 - - ]
Dec 10 03:38:52 cgnat2.dd.com 1 2015 Dec 9 14:38:51 01-g2-adsl - - NAT44 - [UserbasedW - 100.70.21.77 vrf-testnet - 222.222.34.42 - 1024 2047 - - ][UserbasedW - 100.70.21.36 vrf-testnet - 222.222.34.38 - 64512 65535 - - ]
First part:
Dec 10 03:38:49 cgnat1.dd.com 1 2015 Dec 9 14:38:47 02-g4-adsl - - NAT44 -
Second part:
[UserbasedW - 100.70.92.248 vrf-testnet - 222.222.34.125 - 38912 39935 - - ]
How can I achieve this?
awk -F '[' '{print $1 "\n" FS $2}'
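Note that this keeps only the first bracketed group in $2; if a line can contain several [...] blocks, as in the second sample, a small sketch using index()/substr() keeps everything from the first [ onward (the cgnat.log file name is just a placeholder):
awk '{
  i = index($0, "[")            # position of the first bracket
  print substr($0, 1, i - 1)    # first part: everything before it
  print substr($0, i)           # second part: the rest, including all [...] blocks
}' cgnat.log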