I'm using a GET request to get some page data but need to strip the break tags from the finished file. Basically, I'm taking the output of the GET request and saving it to a file, but it has hundreds of break tags in it that I need removed. I'm fine with running a batch or VB script after the file is saved to remove the tags, but I'm not sure how to do that either. So far the only solutions I have seen remove entire lines.
EDIT: This will be deployed to multiple Windows servers, so I would like to keep the requirements as minimal as possible, i.e. commands/software that Windows has by default.
If you're au fait with Python, you could use Beautiful Soup to remove <br /> elements in a fairly robust manner. See here for how to remove elements from the tree.
Unless I have misunderstood, you could replace the break tags using the Replace function in VBScript (assumed from the tag). For example:
cleanedText = Replace(rawText, "<br/>", "")
More information on usage can be found here
http://www.w3schools.com/Vbscript/func_replace.asp
It is worth mentioning, though, that the function matches verbatim, so you might have to run it a few times to cover the common variants of the tag:
cleanedText = Replace(rawText, "<br/>", "") ' no space
cleanedText = Replace(cleanedText, "<br />", "") ' a space
cleanedText = Replace(cleanedText, "<br>", "") ' unterminated
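For comparison, the three variants above can be covered by one case-insensitive pattern. This is a minimal sketch in Python rather than VBScript, so it assumes Python is installed, which the "Windows defaults only" constraint in the question may rule out. The names BR_TAG and strip_br are made up for the illustration.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants such as <BR>.
# Assumption: the tags carry no attributes (e.g. <br class="x"> is not matched).
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def strip_br(text):
    """Return text with every break tag removed."""
    return BR_TAG.sub("", text)

print(strip_br("one<br>two<br/>three<br />four<BR>five"))
```

The same pattern idea should carry over to any regex engine; only the tag itself is removed, so the surrounding text and line structure stay intact.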
I am creating a Joomla component and one of the pages contains a form with a text input for an email address.
When a < character is typed in the input field, that character and everything after is not showing up in the input.
I tried $_POST['field'] and JFactory::getApplication()->input->getCmd('field')
I also tried alternatives for getCmd like getVar, getString, etc. but no success.
E.g. John Doe <j.doe@mail.com> returns only John Doe.
When the < is left out, like John Doe j.doe@mail.com> the value is coming in correctly.
What can I do to also have the < character in the posted variable?
BTW, I had to use &lt; in this question to display it as I want it. This form suffers from the same problem!
You actually need to set the filtering that you want when you grab the input. Otherwise, you will get some heavy filtering. (Typically, I will also lose @ symbols.)
Replace this line:
JFactory::getApplication()->input->getCmd('field');
with this line:
JFactory::getApplication()->input->getRaw('field');
The name after the get part of the function is the filter that will be applied. Cmd strips everything but alphanumeric characters and ., -, and _. String runs the value through Joomla's HTML tag cleaning and, depending on your settings, will strip out < and >. (That usually doesn't happen for me, but my settings are generally pretty open, to the point of no filtering for super admins and such.)
getRaw should definitely work, but note that there is no filtering at all, which can open security holes in your application.
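To make the Cmd rule concrete, here is a rough Python approximation of the character class described above. It is only an illustration of the rule as stated in this answer, not Joomla's actual code, and cmd_like_filter is a hypothetical name.

```python
import re

def cmd_like_filter(value):
    """Keep only letters, digits, '.', '-' and '_' (approximation of the Cmd rule)."""
    return re.sub(r"[^A-Za-z0-9._-]", "", value)

# The angle brackets, the @ sign and the spaces are all dropped.
print(cmd_like_filter("John Doe <j.doe@mail.com>"))
```

A filter this strict suits command or task names but not free text such as an email address with a display name, which is why a looser filter has to be requested explicitly.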
The default text filter trims html from the input for your field. You should set the property
filter="raw"
in your form's manifest (xml) file, and then use getRaw() to retrieve the value. getCmd removes the non-alphanumeric characters.
I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will be reading this file and depending on those newlines to properly parse the content. Generally in this situation I'd use a CDATA section to preserve the whitespace, but this isn't an option as the code that parses this data later is not expecting one. It's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that off should solve your problem.
I'm trying to parse a webpage to get posts from a forum.
The start of each message starts with the following format
<div id="post_message_somenumber">
and I only want to get the first one
I tried xpath='//div[starts-with(@id, '"post_message_')]' in YQL without success.
I'm still learning this. Does anyone have suggestions?
I think I have a solution that does not require dealing with namespaces.
Here is one that selects all matching div's:
//div[@id[starts-with(.,"post_message")]]
But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:
(//div[@id[starts-with(.,"post_message")]])[1]
These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.
It works great for me in PowerShell:
# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'
# Run the xpath selection of all matching div's
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')
Result:
id
--
post_message_somenumber
post_message_somenumber2
Or, for just the first match:
# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')
Result:
id
--
post_message_somenumber
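The same selection can be sketched with Python's standard library. ElementTree supports only a limited XPath subset and has no starts-with(), so this sketch does the prefix test in Python instead of in the XPath expression. The sample document is the one from the PowerShell example; post_divs is a made-up helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<root><div id="post_message_somenumber"/>'
    '<div id="not_post_message"/>'
    '<div id="post_message_somenumber2"/></root>'
)

def post_divs(xml_text, prefix="post_message"):
    """Return all div elements whose id attribute starts with prefix."""
    root = ET.fromstring(xml_text)
    return [d for d in root.iter("div") if d.get("id", "").startswith(prefix)]

matches = post_divs(SAMPLE)
first = matches[0] if matches else None  # equivalent of the trailing [1] in the XPath

print([d.get("id") for d in matches])
print(first.get("id"))
```

This assumes well-formed XML without a default namespace; real forum HTML would probably need an HTML parser first.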
I tried xpath='//div[starts-with(@id,
'"post_message_')]' in yql without
success I'm still learning this,
anyone have suggestions
If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.
How to specify the names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or on the internet, you'll find many sources with the correct solution.
Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".
Here is a good answer how to do this in C#.
Read the documentation of your language/xpath-engine how something similar should be done in your specific environment.
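As one illustration of the prefix-binding idea, here is a minimal sketch using Python's standard ElementTree. The XHTML namespace URI and the sample page are assumptions chosen for the example, since the actual document was not shown, and the prefix "x" is arbitrary.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"  # assumed default namespace of the page

PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="post_message_1">hello</div>'
    '</body></html>'
)

root = ET.fromstring(PAGE)

# Without a prefix the path looks for div in no namespace and finds nothing.
unprefixed = root.findall(".//div")

# Binding the prefix "x" to the default namespace makes the same path match.
prefixed = root.findall(".//x:div", {"x": XHTML_NS})

print(len(unprefixed), len(prefixed))
```

Other XPath engines expose the same binding through their own API, so the mapping step is what carries over, not the exact call.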
@FindBy(xpath = "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]")
private WebElementFacade ListOfExpiredUsersDetails;
This one matches the div elements on the page whose id starts with expiredUserDetails and whose text contains Details.
I'm using the Mako template system in my Pylons website and am having some issues with stripping whitespace.
My reasoning for stripping whitespace is the generated HTML file comes out as 12363 lines of code. This, I assume, is why Internet Explorer is hanging when it tries to load it.
I want to be able to have my HTML file look nice and neat so I can make changes to it with ease and have the generated output look as ugly and messy as required to cut down on filesize and memory usage.
The Mako documentation http://www.makotemplates.org/docs/filtering.html says you can use the trim flag but that doesn't seem to work. Example code:
<div id="content">
${next.body() | trim}
</div>
The only way I've been able to strip the newlines is to add a \ (backslash) to the end of each line. This is rather annoying when coding the views and I'd prefer to have a centralized solution.
How do I remove the whitespace/newlines ?
I don't believe IE hanging is due to the file size. You have probably just found a combination of markup that hits an IE bug that causes it to freeze. Try trimming down your document until the freeze no longer happens to isolate the offending markup pattern (and then avoid it), or try changing the markup until this no longer happens.
Another good thing to do would be to run your page through an HTML validator and fix any issues reported.
I'm guessing that the trim filter is treating your html as a single string, and only stripping the leading and trailing whitespace characters. You want it to strip whitespace from every line. I would create a filter and iterate over each line.
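<%!
def better_trim(html):
    # Iterating over a string yields characters, so split it into lines first.
    # Joining without a separator also removes the newlines between lines.
    clean_html = ''
    for line in html.splitlines():
        clean_html += line.strip()
    return clean_html
%>
<div id="content">
    ${next.body() | better_trim}
</div>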
You could create your own simple WSGI Middleware to strip whitespace from your templates after they've been rendered (possibly using the methods described in this answer).
Below is a quick example which might help you get started. Note: I wrote this code without testing or even running it; it may work first time, but it probably won't suit your needs so you'll need to expand on this yourself.
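class HTMLMinifyMiddleware(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        resp_body = self.app(environ, start_response)
        # The response can be any iterable (not necessarily a list), so build
        # a new list instead of assigning by index. Parts are byte strings.
        # Note: a Content-Length header set by the app will no longer match.
        return [b' '.join(part.split()) for part in resp_body]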
I have been looking at regular expressions to try and do this, but the most I can do is find the start of a line with ^, but not replace it.
I can then find the first characters on a line to replace, but can not do it in such a way with keeping it intact.
Unfortunately I don't have access to a tool like cut since I am on a Windows machine, so is there any way to do what I want with just regexp?
Use Notepad++. It offers a way to record a sequence of actions which can then be repeated for all lines in the file.
Did you try replacing the regular expression ^ with the text you want to put at the start of each line? Also you should use the multiline option (also called m in some regex dialects) if you want ^ to match the start of every line in your input rather than just the first.
string s = "test test\ntest2 test2";
s = Regex.Replace(s, "^", "foo", RegexOptions.Multiline);
Console.WriteLine(s);
Result:
footest test
footest2 test2
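The same idea in Python, as a minimal sketch: re.MULTILINE plays the role of RegexOptions.Multiline, and substituting at ^ inserts text without consuming anything. The helper name prefix_lines is made up for the example, and it assumes Python 3.

```python
import re

def prefix_lines(text, prefix):
    """Insert prefix at the start of every line of text."""
    # With re.MULTILINE, ^ matches at the start of each line, not just the string.
    return re.sub(r"^", prefix, text, flags=re.MULTILINE)

print(prefix_lines("test test\ntest2 test2", "foo"))
```

Without the flag only the first line would be prefixed, which matches the behaviour described in the answer above.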
I used to program on the mainframe and got used to SPF panels. I was thrilled to find a Windows version of the same editor at Command Technology. Makes problems like this drop-dead simple. You can use expressions to exclude or include lines, then apply transforms on just the excluded or included lines and do so inside of column boundaries. You can even take the contents of one set of lines and overlay the contents of another set of lines entirely or within column boundaries which makes it very easy to generate mass assignments of values to variables and similar tasks. I use Notepad++ for most stuff but keep a copy of SPFSE around for special-purpose editing like this. It's not cheap but once you figure out how to use it, it pays for itself in time saved.