I'm successfully parsing an HTML file with xmllint, but when I combine two or more XPath expressions I get only one occurrence instead of all of them.
When I run the expressions separately I get something like this:
Expression:
xmllint --html --xpath "//h3[contains(text(),'Rodada')]/../following-sibling::div//span[contains(@class,'partida-desc')][2]/text()" 2012-campeonato-brasileiro.html 2>/dev/null
Result:
Couto Pereira - Curitiba - PR
Aflitos - Recife - PE
Serra Dourada - Goiania - GO
But when I combine the expressions:
prefix="//h3[contains(text(),'Rodada')]/../following-sibling::div"
xmllint --html --xpath "normalize-space(concat($prefix//span[contains(@class,'partida-desc')]/text(),';',$prefix//div[contains(@class,'pull-left')]//img/@title,';',$prefix//div[contains(@class,'pull-right')]//img/@title,';',$prefix//strong/span/text(),';',$prefix//span[contains(@class,'partida-desc')][2]/text()))" 2012-campeonato-brasileiro.html 2>/dev/null
Result:
Sáb, 19/05/2012 18:30 - Jogo: 3 ;Palmeiras - SP;Portuguesa - SP;1 x 1; Pacaembu - Sao Paulo - SP
It works but stops at the first result. I can't make it parse the whole file.
To run this example, you can download the HTML like this:
curl https://www.cbf.com.br/futebol-brasileiro/competicoes/campeonato-brasileiro-serie-a/2012 --compressed > /tmp/2012-campeonato-brasileiro.html
In XPath 1.0, when you pass a node-set to a function such as normalize-space or concat, only the string value of the first node in the node-set is used.
In XPath 2 and later you can use e.g. //foo/normalize-space() or //foo/concat(.//bar, ';', .//baz) or string-join(//foo, ';').
With pure XPath 1.0 you would need to iterate in the host language (e.g. shell or XSLT or Java) and then concatenate in the host language.
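As a rough sketch of iterating in the host language (plain shell here), one can ask xmllint for count() of the repeating element and then evaluate the string expression once per index. The function name each_row and the li/span layout are made up for illustration; they are not the structure of the page in the question.

```shell
# Hypothetical helper: print one "first;second" line per <li> in an HTML file.
# Assumes each <li> holds two <span> children; adjust the paths for real markup.
each_row() {
  file=$1
  # count() returns a number, which xmllint prints as plain text.
  n=$(xmllint --html --xpath 'count(//li)' "$file" 2>/dev/null)
  i=1
  while [ "$i" -le "${n:-0}" ]; do
    # (//li)[$i] selects the i-th item in document order, so concat()
    # is handed a single-node node-set on every pass.
    xmllint --html --xpath \
      "normalize-space(concat((//li)[$i]/span[1], ';', (//li)[$i]/span[2]))" \
      "$file" 2>/dev/null
    i=$((i + 1))
  done
}
```

It would be called as each_row page.html. Note that this starts one xmllint process per row, so it may be slow on large files.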
Concat will operate on the first node of a nodeset.
The following command adds more processing to take advantage of the xmllint shell:
echo -e "cd //h3[contains(text(),'Rodada')]/../following-sibling::div \n cat .//span[contains(@class,'partida-desc')]/text() | .//div[contains(@class,'pull-left')]//img/@title | .//div[contains(@class,'pull-right')]//img/@title | .//strong/span/text() | .//span[contains(@class,'partida-desc')][2]/text() \nbye\n" | \
xmllint --html --shell 2012-campeonato-brasileiro.html 2>/dev/null | \
tr -s ' ' | grep -v '^ *$' | \
gawk 'BEGIN{ RS="(\n -------){3,3}"; FS="\n -------\n"; OFS=";"} {if(NR>2) { print gensub(/\n/,"","g",$1),gensub(/title="([^"]+)"/,"\\1","g",$2),gensub(/title="([^"]+)"/,"\\1","g",$3),$4,$5}}'
Result
Sáb, 19/05/2012 21:00 - Jogo: 4 ; Figueirense - SC; Náutico - PE;2 x 1; Orlando Scarpelli - Florianopolis - SC
Dom, 20/05/2012 16:00 - Jogo: 8 ; Ponte Preta - SP; Atlético - MG;0 x 1; Moisés Lucarelli - Campinas - SP
Dom, 20/05/2012 16:00 - Jogo: 5 ; Corinthians - SP; Fluminense - RJ;0 x 1; Pacaembu - Sao Paulo - SP
Dom, 20/05/2012 16:00 - Jogo: 7 ; Botafogo - RJ; São Paulo - SP;4 x 2; João Havelange - Rio de Janeiro - RJ
Dom, 20/05/2012 16:00 - Jogo: 6 ; Internacional - RS; Coritiba - PR;2 x 0; Beira-Rio - Porto Alegre - RS
Dom, 20/05/2012 18:30 - Jogo: 1 ; Vasco da Gama - RJ; Grêmio - RS;2 x 1; São Januário - Rio de Janeiro - RJ
Dom, 20/05/2012 18:30 - Jogo: 2 ; Bahia - BA; Santos - SP;0 x 0; Pituaçu - Salvador - BA
.... (more records)
More cleanup might be needed since the fields contain leading/trailing spaces.
Note: the HTML file needs to be converted to Unix line endings:
dos2unix 2012-campeonato-brasileiro.html
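For the remaining cleanup, a minimal sketch that trims blanks around each ;-separated field could be appended to the pipeline. The name trim_fields is hypothetical, and the sketch assumes the fields themselves never contain a semicolon.

```shell
# Hypothetical helper: strip leading/trailing blanks from every
# ;-separated field read from standard input.
trim_fields() {
  awk 'BEGIN { FS = OFS = ";" }
       {
         for (i = 1; i <= NF; i++) {
           gsub(/^[ \t]+|[ \t]+$/, "", $i)   # trim both ends of field i
         }
         print                               # record is rebuilt with OFS
       }'
}
```

Usage: put | trim_fields after the gawk stage of the command above.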
Thanks for your answers!
Considering my alternatives, this is my best solution so far.
#!/bin/bash
# Collect the round headings ("Rodada ..."), one per line.
RODADAS=$(xmllint --html --xpath "//h3[contains(text(),'Rodada ')]/text()" "$1" 2>/dev/null)
while read -r i
do
  # Evaluate the combined expression once per list item of the round (li[1]..li[10]).
  for x in {1..10}
  do
    PREFIX="//h3[contains(text(), '$i')]/../following-sibling::div/ul/li[$x]"
    xmllint --html --xpath "normalize-space(concat($PREFIX//span[contains(@class,'partida-desc')]/text(),';',$PREFIX//div[contains(@class,'pull-left')]//img/@title,';',$PREFIX//div[contains(@class,'pull-right')]//img/@title,';',$PREFIX//strong/span/text(),';',$PREFIX//span[contains(@class,'partida-desc')][2]/text()))" "$1" 2>/dev/null
  done
done <<< "$RODADAS"
Run:
./html-csv.sh 2012-campeonato-brasileiro.html
Result:
Sáb, 01/12/2012 19:30 - Jogo: 373 ;Santos - SP;Palmeiras - SP;3 x 1; Vila Belmiro - Santos - SP
Dom, 02/12/2012 17:00 - Jogo: 372 ;Fluminense - RJ;Vasco da Gama - RJ;1 x 2; João Havelange - Rio de Janeiro - RJ
Dom, 02/12/2012 17:00 - Jogo: 374 ;São Paulo - SP;Corinthians - SP;3 x 1; Pacaembu - Sao Paulo - SP
I have a text file looking like
text_a_3 xxx yyy
- - - - - - - - - - -
text_b_2 xyx zyz
- - - - - - - - - - -
text_b_3 xxy zyy
- - - - - - - - - - -
text_a_2 foo bar
- - - - - - - - - - -
text_a_1 foo bla
- - - - - - - - - - -
text_b_1 bla bla
I want to sort this file numerically, based on the first field, so that my output would look like:
text_a_1 foo bla
- - - - - - - - - - -
text_a_2 foo bar
- - - - - - - - - - -
text_a_3 xxx yyy
- - - - - - - - - - -
text_b_1 bla bla
- - - - - - - - - - -
text_b_2 xyx zyz
- - - - - - - - - - -
text_b_3 xxy zyy
I thought sort would do the job. I thus tried
sort -n name_of_my_file
sort -k1 -n name_of_my_file
But it gives
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
- - - - - - - - - - -
text_a_1 foo bla
text_a_2 foo bar
text_a_3 xxx yyy
text_b_1 bla bla
text_b_2 xyx zyz
text_b_3 xxy zyy
The option --field-separator is not of any help.
Is there any way to achieve this with a one-line, sort-based command?
Or is the only solution to extract the lines containing text, sort them, and insert the delimiter lines afterwards?
Using GNU awk only, relying on its built-in sort function asort() and with the record separator set to the dashes line:
awk -v RS='- - - - - - - - - - -\n' '
{a[++c]=$0}
END{
asort(a)
for(i=1;i<=c;i++)
printf "%s%s",a[i],(i==c?"":RS)
}' name_of_my_file
The script first stores each record of the input file in the array a. Once the whole file has been read, the array is sorted and then printed, using the input record separator as the delimiter between records.
When the line delimiters are all on the even lines, you can use
paste -d'\r' - - < yourfile | sort -n | tr '\r' '\n'
I actually prefer removing the delimiters first, sorting, and adding them back afterwards, so please reconsider your requirements:
grep -Ev '^(- )*- *$' yourfile | sort -n | sed '$!s/$/\n- - - - - - - - - - -/'
The following sort + awk combination may help.
sort -t"_" -k2 -k3 Input_file | awk '/^-/ && !val{val=$0} !/^-/{if(prev){print prev ORS val};prev=$0} END{print prev}'
Here is the same solution in multi-line form:
sort -t"_" -k2 -k3 Input_file |
awk '
/^-/ && !val{
val=$0}
!/^-/{
if(prev){
print prev ORS val};
prev=$0
}
END{
print prev
}'
I need to split the following string into two parts in AWK. Each entry below is a single line.
Dec 10 03:38:49 cgnat1.dd.com 1 2015 Dec 9 14:38:47 02-g4-adsl - - NAT44 - [UserbasedW - 100.70.92.248 vrf-testnet - 222.222.34.125 - 38912 39935 - - ]
Dec 10 03:38:52 cgnat2.dd.com 1 2015 Dec 9 14:38:51 01-g2-adsl - - NAT44 - [UserbasedW - 100.70.21.77 vrf-testnet - 222.222.34.42 - 1024 2047 - - ][UserbasedW - 100.70.21.36 vrf-testnet - 222.222.34.38 - 64512 65535 - - ]
First part:
Dec 10 03:38:49 cgnat1.dd.com 1 2015 Dec 9 14:38:47 02-g4-adsl - - NAT44 -
Second part:
[UserbasedW - 100.70.92.248 vrf-testnet - 222.222.34.125 - 38912 39935 - - ]
How can I achieve this?
awk -F '[' '{print $1 "\n" FS $2}'
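With -F '[', $2 only holds the text up to the next '[', so on a line with several bracketed groups (like the second sample line) the later groups are not printed. A small sketch that splits at the first bracket only and keeps everything after it; the function name split_at_bracket is hypothetical:

```shell
# Hypothetical helper: print the text before the first "[" on one line
# and everything from that "[" onwards on the next line.
split_at_bracket() {
  awk '{
    pos = index($0, "[")              # position of the first "[" (0 if none)
    if (pos == 0) { print; next }     # no bracket: pass the line through
    print substr($0, 1, pos - 1)      # first part
    print substr($0, pos)             # second part, all bracket groups
  }'
}
```

Usage: split_at_bracket < logfile, where logfile stands for whatever file holds the lines above.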
From Server A I want to grep a couple of lines with a certain watermark from a logfile on Server B.
The logfile contains at least 3 lines matching the watermark:
..
13:20 - xxx - line 1
[empty line]
13:20 - xxx - line 2
13:20 - xxx - line 3
..
The function that is going to be run over SSH:
function functssh {
grep $time $logfile
}
Main (from Server A):
RESULT=$(ssh#serverB "$(typeset -f); functssh $time $filename" 2>/dev/null)
echo $RESULT
Gives this output:
13:20 - xxx - line 2
13:20 - xxx - line 3
Line 1 is missing! I have no idea why it gets cropped.
This is my ls command and post-processing:
ls -l $pwd | tail -n +2 | cut -c1-10,50-999999 | sed 's/./& /g' |
sed 's/\(.\{7\}\)/& /g' | sed 's/\(.\{30\}\)/&/g'
This is the output:
- r w x r - - r - - a d d . o l d
I want to remove all the spaces within the filename so that I end up with something like this (keep in mind that the spacing in the permissions is kept):
- r w x r - - r - - add.old
You simply don't. There's a whole universe of articles out there detailing why you should not be parsing ls output, but use combinations of tools like your shell's (most likely very comprehensive) file name globbing, find, and stat.
For example:
for name in *; do printf '%s:%s\n' "$(stat -c '%A' -- "$name")" "$name"; done
EDIT: stat gives you a lot of formats to help you achieve your desired output, and now that this gives you an unambiguous output (still, this can go wrong with things like newlines in file names), you can just use sed on the stat output in isolation. See man stat.
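To illustrate running sed on the stat output in isolation, here is a minimal sketch that spaces out only the permission string and leaves the file name alone. The names space_perms and list_spaced are hypothetical, and GNU stat is assumed.

```shell
# Hypothetical helper: put a space between the characters of a mode string.
space_perms() {
  printf '%s\n' "$1" | sed -e 's/./& /g' -e 's/ $//'
}

# Hypothetical wrapper: spaced permissions, then the untouched file name.
# Assumes GNU stat (-c '%A'); BSD/macOS stat uses a different format option.
list_spaced() {
  for name in *; do
    [ -e "$name" ] || continue   # skip the literal * when nothing matches
    printf '%s %s\n' "$(space_perms "$(stat -c '%A' -- "$name")")" "$name"
  done
}
```

Calling list_spaced in a directory prints one line per entry. Because the name never passes through sed, spaces inside it stay as they are.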
Writing bullet-proof code to parse ls output is tricky if the file names are not reasonably clean. With the caveat that your file names must not contain newlines and should avoid other control characters (but more or less anything else should be OK) you can just about risk parsing the output from ls. If your file names are limited to the portable file name character set ([-a-zA-Z0-9_.]), you should be fine. But be aware that not everyone will be as disciplined as you are with their file names, so your scripts can fail suddenly if someone creates a name with unexpected characters. Note that leading dashes in file names can wreak havoc; a name such as --version will make the average GNU utility behave unexpectedly, for example.
You were warned!
You can use cut -c1-10,50- to avoid typing all those 9's. You can combine your 3 sed commands into one using -e in front of each expression.
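As a sketch of that suggestion, the three sed invocations from the question collapse into one process. The function name spread is made up here; the expressions are copied unchanged from the question and are applied in the same order as in the original pipeline.

```shell
# Hypothetical helper: the question's three sed commands as a single sed.
# Each -e expression runs in order on the result of the previous one.
spread() {
  sed -e 's/./& /g' \
      -e 's/\(.\{7\}\)/& /g' \
      -e 's/\(.\{30\}\)/&/g'
}
```

The whole pipeline would then read ls -l | tail -n +2 | cut -c1-10,50- | spread.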
With a little bit of sed trickery, you can deal with the spacing more easily. On my Mac, the correct column for the start of the file name is column 54, not 50:
$ ls -l |
> sed -e '1d;12q' | # Only 11 file names listed
> cut -c 1-10,54- |
> sed -e 's/\(.\)\(...\)\(...\)\(...\)/\1\2 \3 \4 ##/' \
> -e h -e 's/##.*/##/' -e 's/[^#]/& /g' -e G -e 's/\n//' \
> -e 's/##[^#]*##//'
- r w - r - - r - - 2da.c
- r w - r - - r - - 3dalloc1.c
- r w - r - - r - - 3dalloc2.c
- r w - r - - r - - 3dalloc3.c
- r w - r - - r - - 3dalloc4.c
- r w - r - - r - - 4 ## Testing
- r w - r - - r - - ## Testing
d r w x r - x r - x Floppy
d r w x r - x r - x IQ
d r w x r - x r - x Primes
d r w x r - x r - x SHA-256
$
What does the sed command do?
Replace the permissions fields with a space after each group of 3 permissions bits and add ## at the end of the permissions.
Copy the modified line to the hold space.
Replace the ## and everything after it with just ##.
Put a space after every non-# character.
Append the hold space to the pattern space with a newline.
Remove the newline.
Remove two ## markers and everything between that isn't an #.
This leaves ## markers in the file names untouched — witness the two names with spaces in them that contain ##.
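To watch the steps happen, the same expressions can be run on a single sample line, adding one stage at a time. This is only a sketch: the sample line is what cut -c 1-10,54- would produce for one of the names above, and the function name spaced is hypothetical.

```shell
# Sample input: 10 permission characters immediately followed by the name.
line='-rw-r--r--4 ## Testing'

# Step 1: group the permission bits and append the ## marker.
s1='s/\(.\)\(...\)\(...\)\(...\)/\1\2 \3 \4 ##/'

printf '%s\n' "$line" | sed -e "$s1"                       # after step 1
printf '%s\n' "$line" | sed -e "$s1" -e h -e 's/##.*/##/'  # after steps 2-3
printf '%s\n' "$line" | sed -e "$s1" -e h -e 's/##.*/##/' \
                            -e 's/[^#]/& /g'               # after step 4

# Hypothetical helper: the complete command from the answer.
spaced() {
  sed -e "$s1" -e h -e 's/##.*/##/' -e 's/[^#]/& /g' \
      -e G -e 's/\n//' -e 's/##[^#]*##//'
}
printf '%s\n' "$line" | spaced                             # final result
```

The hold space keeps the file name intact while the pattern space is being spaced out, which is why the ## inside the name survives.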
For the record, the 11 lines passed to cut and then sed were:
-rw-r--r-- 1 jleffler staff 2362 Mar 6 19:48 2da.c
-rw-r--r-- 1 jleffler staff 1638 Mar 6 19:48 3dalloc1.c
-rw-r--r-- 1 jleffler staff 2870 Mar 6 19:48 3dalloc2.c
-rw-r--r-- 1 jleffler staff 2968 Mar 6 19:48 3dalloc3.c
-rw-r--r-- 1 jleffler staff 2096 Mar 6 19:48 3dalloc4.c
-rw-r--r-- 1 jleffler staff 0 Mar 21 16:46 4 ## Testing
-rw-r--r-- 1 jleffler staff 0 Mar 21 16:47 ## Testing
drwxr-xr-x 4 jleffler staff 136 Mar 9 23:03 Floppy
drwxr-xr-x 8 jleffler staff 272 Mar 6 19:48 IQ
drwxr-xr-x 33 jleffler staff 1122 Mar 6 19:48 Primes
drwxr-xr-x 13 jleffler staff 442 Mar 6 19:48 SHA-256