Having trouble getting preg_match_all to work - preg-match

I'm new to regular expressions and really struggling to get it to work.
I'm trying to grab some information from a page that is in between the following html:
<!--webbot bot="Include" U-Include="/inspections/Restaurants_Avalon.html" TAG="BODY" startspan --> EVERYTHING IN BETWEEN<!--webbot bot="Include" i-checksum="41417" endspan -->
I've tried:
$pattern = '/<.*?webbot bot=\"Include\" U-Include=\".*?\".*?startspan.*?(.*?)<.*?webbot bot=\"Include\" i-checksum=\".*?\" endspan.*?/i';
and a few dozen other variations, but my obvious lack of experience with and understanding of regular expressions has just created regular messes rather than expressions.
Can someone have a look and tell me what I'm doing wrong?
Thanks!

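The lazy .*? after startspan can match nothing at all, so the capture group also swallows the " -->" that closes the comment. Just replace this part:
startspan.*?(.*?)<.*?webbot
with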
startspan -->(.*?)<!--webbot
In action:
$str = '<!--webbot bot="Include" U-Include="/inspections/Restaurants_Avalon.html" TAG="BODY" startspan --> EVERYTHING IN BETWEEN<!--webbot bot="Include" i-checksum="41417" endspan -->';
$pat = '/<.*?webbot bot=\"Include\" U-Include=\".*?\".*?startspan -->(.*?)<!--webbot bot=\"Include\" i-checksum=\".*?\" endspan.*?/i';
preg_match($pat, $str, $m);
print_r($m);
output:
Array
(
[0] => <!--webbot bot="Include" U-Include="/inspections/Restaurants_Avalon.html" TAG="BODY" startspan --> EVERYTHING IN BETWEEN<!--webbot bot="Include" i-checksum="41417" endspan
[1] => EVERYTHING IN BETWEEN
)
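For comparison, here is a minimal sketch of the same extraction in Python's re module. The variable names are illustrative, and the pattern assumes the markers always appear exactly as in the sample above (same attribute order, a single space before the closing -->). re.DOTALL is added on the assumption that the real content spans several lines.

```python
import re

# Sample taken from the question above.
html = ('<!--webbot bot="Include" U-Include="/inspections/Restaurants_Avalon.html" '
        'TAG="BODY" startspan --> EVERYTHING IN BETWEEN'
        '<!--webbot bot="Include" i-checksum="41417" endspan -->')

pattern = re.compile(
    r'<!--webbot bot="Include" U-Include="[^"]*"[^>]*startspan -->'  # opening marker
    r'(.*?)'                                                          # lazy capture of the content
    r'<!--webbot bot="Include" i-checksum="[^"]*" endspan -->',       # closing marker
    re.IGNORECASE | re.DOTALL,
)

# findall returns one string per start/end pair because the pattern has one group.
matches = [m.strip() for m in pattern.findall(html)]
print(matches)
```

Anchoring the capture between the literal "-->" and "<!--" is what keeps the comment delimiters out of the captured text.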

Related

XPath problem with multiple OR expressions like (a|b|c) [duplicate]

This question already has an answer here:
Logical OR in XPath? Why isn't | working?
(1 answer)
Closed 1 year ago.
I have simplified html:
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
I want to find only one and two, on the condition that the parent tag is main or support and that a span or div follows it.
I wonder why that code does not work:
import lxml.html as HTML_PARSER
html = """
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
"""
parent = '//main | //support'
child = '/span | /div'
doc = HTML_PARSER.fromstring(html)
print(doc)
xpath = '(%s)(%s)' % (parent, child)
print(xpath)
parsed = doc.xpath(xpath)
print(parsed)
I get an error Invalid expression. Why?
This (//main | //support) and this (/span | /div) xpaths are both correct.
Simple combo like (//main | //support)/span is also correct.
But why more complicated combination (//main | //support)(/span | /div) is not correct? How to resolve it?
In my real case //main, //support, /span and /div are really complicated xpaths, I want some general solution like (xpath1 | xpath2)(xpath3 | xpath4)
This will find it, though I'm not 100% sure it's what you want:
//*[name() = 'main' or name() = 'support']/*[name() = 'span' or name() = 'div']/text()
Your XPath is not valid in XPath version 1 (the one that lxml uses).
Try
xpath = '//div[parent::support]|//span[parent::main]'
or
parent = ['main', 'support']
child = ['span', 'div']
xpath = '//*[self::{0[0]} or self::{0[1]}]/*[self::{1[0]} or self::{1[1]}]'.format(parent, child)
You can use the self:: axis to select the span or div children of either parent:
(//main | //support)/*[self::div or self::span]
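For the general (xpath1 | xpath2)(xpath3 | xpath4) case, one option that stays within XPath 1.0 is to expand the product into a union of every parent/child pair. The sketch below only builds the expression string; combine_xpaths is a hypothetical helper name. It assumes each child path starts with "/" and contains no top-level "|" of its own.

```python
from itertools import product

def combine_xpaths(parents, children):
    """Expand (p1 | p2)(c1 | c2) into '(p1)c1 | (p1)c2 | (p2)c1 | (p2)c2'.

    Each parent is wrapped in parentheses so that a parent expression
    containing its own union still combines correctly with the child step.
    """
    return ' | '.join('(%s)%s' % (p, c) for p, c in product(parents, children))

xpath = combine_xpaths(['//main', '//support'], ['/span', '/div'])
print(xpath)
```

The resulting string can be passed to doc.xpath(...) in place of the invalid expression from the question.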

RegEx code works in theory but not when code is run

I'm trying to use this RegEx search: <div class="ms3">(\n.*?)+< in Ruby. However, as soon as I add the last character, "<", it stops working altogether. I've tested it in Rubular, where the RegEx works perfectly. I'm using RubyMine to write my code, but I also tested it from PowerShell and got the same result, with no error message. When I run <div class="ms3">(\n.*?)+ it prints <div class="ms3">, which is exactly what I'm looking for, but as soon as I add the "<" it prints nothing.
My code:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end
Some of what I'm searching through:
<div class="ms3">
<span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
</div>
<div class="Paragraph">
<span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
</div>
<div class="Stanza_Break"></div>
The full RegEx I need is <div class="ms3">(\n.*?)+<\/div>; it picks up the first section and nothing else.
Your problem starts with File.foreach('input.txt'), which cuts the input into lines. The pattern is then matched against each line separately, so no line can match it (by definition, no line has a \n in its middle).
You should have better luck reading the whole text as a block, and using match on it:
File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span>
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span>
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18);
# => the futility of offering sacrifices unmatched by common justice is once more
# => underlined, and exile seems certain (5.21–27).</span></span>\n </div>" 1:"\n ">
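The difference between per-line and whole-text matching is not specific to Ruby. As a rough illustration in Python (the sample text is a shortened, made-up stand-in for input.txt), the same pattern finds nothing when applied line by line and finds the block when applied to the full string:

```python
import re

# Shortened stand-in for the contents of input.txt (an assumption for this sketch).
text = ('<div class="ms3">\n'
        '<span>Still the tone</span>\n'
        '</div>\n'
        '<div class="Paragraph">\n'
        '<span>Listen</span>\n'
        '</div>\n')

pattern = re.compile(r'<div class="ms3">(?:\n.*?)+</div>')

# Line by line: each chunk ends at its newline, so the pattern can never
# reach the closing </div> on a later line.
line_hits = [line for line in text.splitlines(keepends=True) if pattern.search(line)]

# Whole text: the newlines are inside the string being searched.
block_hit = pattern.search(text)

print(line_hits)
print(block_hit.group(0))
```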

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
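import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

current_dt = ""
current_dd = ""

def flush():
    # Print the pending entry, skipping parts that are empty.
    for part in (current_dt, current_dd):
        if part:
            print(part)

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            flush()  # emit the previous entry before starting a new one
            current_dt = line.strip()
            current_dd = ""
        elif re.match(dd, line):
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith('#')]
            description = ' '.join(w for w in words if not w.startswith('#'))
            # Move the tags into the <A ...> element of the title line.
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join(tags) + '">', current_dt)
            # Drop the <DD> line entirely when it held only tags.
            current_dd = '<DD>' + description if description else ""
        else:
            # Any other line ends the current entry.
            flush()
            current_dt = ""
            current_dd = ""
    flush()  # emit the last entry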
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
    if(d==1) print "<DT>"s   # print the previous title if it had no <DD> line
    s=substr($0,5);tags=""   # copy the line after "<DT>" so the <DD> rule can reuse it
    d=1
}
/^<DD>/{
    d=0
    m=match(s,/>/)           # end of the opening <A ...> tag: first ">"
    for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i}   # concatenate tags
    td=match(tags,/ /)       # a space marks the start of the description
    if(td==0){               # no description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{                    # description exists
        tagdes=substr(tags,td)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) " TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}
END{
    if(d==1) print "<DT>"s   # print a trailing title that had no <DD> line
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
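The core of both answers is splitting the text after <DD> into a tag list and a description. A minimal sketch of just that step follows; split_tags is a hypothetical helper name. It assumes the tags always come first, so a "#" that appears later in the description is left alone, and it collapses runs of whitespace in the description.

```python
def split_tags(dd_text):
    """Split '#a #b some text' into ('a,b', 'some text')."""
    words = dd_text.split()
    n = 0
    # Count the leading run of words that start with '#'.
    while n < len(words) and words[n].startswith('#'):
        n += 1
    tags = ','.join(w[1:] for w in words[:n])
    description = ' '.join(words[n:])
    return tags, description

print(split_tags('#iceland #tour #car #drive #self Self-driving tour of Iceland'))
print(split_tags('#bug #tracking'))
```

The first element of the result is ready to drop into a TAGS="..." attribute, and an empty second element signals that the <DD> line can be omitted.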

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imgurl=" and ending before "&amp" as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document with one, and I'm not finding any solutions.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is you assume too much knowledge about the order of parameters in the URL. If the order changes, or & disappears the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
img_url = doc.search('a').each do |a|
query_params = CGI::parse(URI(a['href']).query)
puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the image URLs you want, do:
require 'nokogiri'
require 'open-uri'
# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML( URI.open(url) ).css('a')
# get regex match or nil on desired img (to_s guards against links without an href)
img_urls = links.map {|a| a['href'].to_s[/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.
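The parse-the-URL approach from the earlier answer carries over to other languages. Here is a rough Python equivalent using only the standard library. The href value is a hypothetical, shortened link assembled from the parameters shown in the hash above, since the real link is not visible in the HTML sample.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# Hypothetical href as it might appear in raw HTML, with "&" written as "&amp;".
href = ('/imgres?imgurl=http://www.trendytree.com/old-world-christmas/images/'
        '20031chapel20031-silent-night-chapel.jpg&amp;h=400&amp;w=400')

# Decode the HTML entities, isolate the query string, then parse it.
query = urlsplit(unescape(href)).query
params = parse_qs(query)

# parse_qs maps each key to a list of values; take the first imgurl.
img_url = params['imgurl'][0]
print(img_url)
```

Because the lookup is by parameter name, the result does not depend on where imgurl sits in the query string.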

SimpleXML Reading node with a hyphenated name

I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<gnm:Workbook xmlns:gnm="http://www.gnumeric.org/v10.dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gnumeric.org/v9.xsd">
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:ooo="http://openoffice.org/2004/office" office:version="1.1">
<office:meta>
<dc:creator>Mark Baker</dc:creator>
<dc:date>2010-09-01T22:49:33Z</dc:date>
<meta:creation-date>2010-09-01T22:48:39Z</meta:creation-date>
<meta:editing-cycles>4</meta:editing-cycles>
<meta:editing-duration>PT00H04M20S</meta:editing-duration>
<meta:generator>OpenOffice.org/3.1$Win32 OpenOffice.org_project/310m11$Build-9399</meta:generator>
</office:meta>
</office:document-meta>
</gnm:Workbook>
And I am trying to read the office:document-meta node to extract the various elements below it (dc:creator, meta:creation-date, etc.)
The following code:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
var_dump($officeXML);
echo '<hr />';
gives me:
object(SimpleXMLElement)[91]
public 'document-meta' =>
object(SimpleXMLElement)[93]
public '@attributes' =>
array
'version' => string '1.1' (length=3)
public 'meta' =>
object(SimpleXMLElement)[94]
but if I try to read the document-meta element using:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
$docMeta = $officeXML->document-meta;
var_dump($docMeta);
echo '<hr />';
I get
Notice: Use of undefined constant meta - assumed 'meta' in /usr/local/apache/htdocsNewDev/PHPExcel/Classes/PHPExcel/Reader/Gnumeric.php on line 273
int 0
I assume that SimpleXML is trying to extract a non-existent node "document" from $officeXML, then subtract the value of (non-existent) constant "meta", resulting in forcing the integer 0 result rather than the document-meta node.
Is there a way to resolve this using SimpleXML, or will I be forced to rewrite using XMLReader? Any help appreciated.
Your assumption is correct. Use
$officeXML->{'document-meta'}
to make it work.
Please note that the above applies to element nodes. Attribute nodes (those within the @attributes property when dumping the SimpleXMLElement) do not require any special syntax when hyphenated. They are accessible as usual via array notation, e.g.
$xml = <<< XML
<root>
<hyphenated-element hyphenated-attribute="bar">foo</hyphenated-element>
</root>
XML;
$root = new SimpleXMLElement($xml);
echo $root->{'hyphenated-element'}; // prints "foo"
echo $root->{'hyphenated-element'}['hyphenated-attribute']; // prints "bar"
See the SimpleXml Basics in the Manual for further examples.
I assume the best way to do it is to cast to array:
Consider the following XML:
<subscribe hello-world="yolo">
<callback-url>example url</callback-url>
</subscribe>
You can access members, including attributes, using a cast:
<?php
$xml = (array) simplexml_load_string($input);
$callback = $xml["callback-url"];
$attribute = $xml['@attributes']['hello-world'];
It makes everything easier. Hope I helped.
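The underlying issue, a hyphen being parsed as subtraction in property syntax, disappears in APIs that take element names as strings. As a point of comparison only, here is a sketch in Python's xml.etree.ElementTree against a trimmed copy of the XML from the question; the ns mapping and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question.
xml_data = """<gnm:Workbook xmlns:gnm="http://www.gnumeric.org/v10.dtd">
  <office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                        xmlns:dc="http://purl.org/dc/elements/1.1/"
                        office:version="1.1">
    <office:meta>
      <dc:creator>Mark Baker</dc:creator>
    </office:meta>
  </office:document-meta>
</gnm:Workbook>"""

# Prefix-to-URI mapping used to resolve the prefixed names in the paths below.
ns = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

root = ET.fromstring(xml_data)
# The hyphenated name is just part of a string, so no special syntax is needed.
doc_meta = root.find('office:document-meta', ns)
creator = doc_meta.find('office:meta/dc:creator', ns).text
# Namespaced attributes are looked up with the {uri}name form.
version = doc_meta.get('{%s}version' % ns['office'])

print(creator, version)
```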
