Ruby XML Reading from one XML and parsing into another - ruby

XPath.each( xmldoc, "//speech/speaking") do |element|
  # puts element.attributes['name']
  # puts element.text
  File.open(file_name + "_" + element.attributes['name'] + "-" + year + ".xml", 'a+') do |f|
    f.write("<speaker>" + element.attributes['name'] + "</speaker>")
    f.write("<speech>" + doc.xpath('//speech/speaking').text + "</speech>" + "\n")
  end
end
Hello Stack Overflow, I am looking for help solving a logic issue I am having with XML files. The above code creates a file named after each speaker, and it should then place what that speaker says into that file.
The problem that I am running into is that it places the speeches of ALL of the speakers into the same file. So I am thinking the problem lies here:
f.write("<speech>" + doc.xpath('//speech/speaking').text + "</speech>" + "\n")
I am hoping that someone has a better way of doing this, but the idea would be to change the above code to:
doc.xpath('//speech/speaking').text WHERE speaker == element.attributes['name']
Ultimately I would like to have each speaker in their own XML file with their own speeches.
<speaking name="Mr. FAZIO">I appreciate my friend yielding.</speaking>
The above is a sample from the XML file.

The XPath you are looking for selects the element by its name attribute:
doc.xpath("//speech/speaking[@name='#{element.attributes['name']}']").text
See "XPath to select Element by attribute value".
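As a rough sketch of the attribute predicate in action, here is the same idea written against REXML (the library whose XPath.each the question appears to use). The wrapper document, the second speaker and the helper name speeches_for are made up for illustration; only the first speaking element comes from the question. It also assumes that names never contain a single quote, which would break the quoting inside the predicate.

```ruby
require 'rexml/document'

# Hypothetical sample: only the first <speaking> element is from the question.
SAMPLE_XML = <<~XML
  <speech>
    <speaking name="Mr. FAZIO">I appreciate my friend yielding.</speaking>
    <speaking name="Ms. EXAMPLE">Hypothetical reply from a second speaker.</speaking>
    <speaking name="Mr. FAZIO">Hypothetical follow-up remark.</speaking>
  </speech>
XML

# Returns the text of every <speaking> element whose name attribute equals +name+.
# Assumes +name+ contains no single quote.
def speeches_for(doc, name)
  REXML::XPath.match(doc, "//speech/speaking[@name='#{name}']").map(&:text)
end

doc = REXML::Document.new(SAMPLE_XML)
puts speeches_for(doc, 'Mr. FAZIO')
```

Inside the original loop, the same predicate would replace the unfiltered '//speech/speaking' query, so each file only receives the text of its own speaker.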

Related

How do I make this regular expression more general?

I'm using Ruby 1.8.7. I have a text file with following content:
"testhost-01.test.de|lan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|test|wato|/" + FOLDER_PATH + "/",
"testhost-02.test.de|lan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-03.test.de|wan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-04.test.de|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|dmz|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-05.test.de|wan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-06.test.de|lan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|prod|wato|/" + FOLDER_PATH + "/",
"testhost-07.test.de|ip-v6|cmk-agent|tcp|site:tir_projects|ip-v6-only|dmz|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-08.test.de|ip-v4|snmp|snmp-only|ip-v4-only|critical|site:tir_projects|dmz|wato|/" + FOLDER_PATH + "/",
I'm trying to extract the hostnames (testhost-01.test.de - testhost-08.test.de) to an Array but only when "puppetagent" is in the same line.
The result should be:
[
"testhost-02.test.de",
"testhost-03.test.de",
"testhost-04.test.de",
"testhost-05.test.de",
"testhost-07.test.de"
]
Code Example:
path = "Textfile"
file = IO.read(path)
nodes = file.scan(/^"(.*)\|lan.*\|puppetagent/).flatten
The example above only matches lines where "lan" follows the first pipe, so it only finds host 02.
If you don't want to restrict output to lines that include |lan, you can't include |lan in the expression. It looks like you want |lan to mark the end of your capture group - instead, you can restrict your capture group to not include | by using the character set [^|]. Then, even if the line doesn't include lan, you'll stop at the first |. After the |, you don't care about content until puppetagent, so we'll consume that with .*.
/^"([^|]*).*puppetagent/
In plain English, that's
^"            Start with "
([^|]*)       Capture anything that's not a |
.*            Accept anything else on the line
puppetagent   Require puppetagent to be present
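To see the expression at work, here is a small sketch that runs it over four of the sample lines from the question. The constant SAMPLE_LINES and the helper name puppet_hosts are only illustrative; in the real script the text would come from IO.read(path).

```ruby
# Four lines copied from the question; hosts 02, 04 and 07 mention puppetagent.
SAMPLE_LINES = <<'EOS'
"testhost-01.test.de|lan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|test|wato|/" + FOLDER_PATH + "/",
"testhost-02.test.de|lan|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-04.test.de|ip-v4|cmk-agent|tcp|ip-v4-only|site:tir_projects|dmz|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
"testhost-07.test.de|ip-v6|cmk-agent|tcp|site:tir_projects|ip-v6-only|dmz|prod|puppetagent|wato|/" + FOLDER_PATH + "/",
EOS

# Returns the hostname (everything between the opening quote and the first |)
# of every line that also contains "puppetagent".
def puppet_hosts(text)
  text.scan(/^"([^|]*).*puppetagent/).flatten
end

puts puppet_hosts(SAMPLE_LINES).inspect
```

Because scan returns one array per match when the pattern has a capture group, flatten is still needed to get a flat list of hostnames.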

Multiprocessing and shared multiprocessing manager lists for parsing large file

I am trying to parse a huge file (approx 23 MB) using the code below, in which I populate a multiprocessing manager list (Manager().list()) with all the lines read from the file. In the target routine (parse_line) of each process, I pop a line and parse it to create a defaultdict object with certain parsed attributes, and finally push each of these objects into another manager list.
import multiprocessing as mp
from collections import defaultdict
from pyparsing import (Word, nums, alphas, hexnums, Suppress, Literal,
                       Combine, OneOrMore, Optional)

class parser(object):
    def __init__(self):
        self.manager = mp.Manager()
        self.in_list = self.manager.list()
        self.out_list = self.manager.list()
        self.dict_list,self.lines, self.pcap_text = [],[],[]
        self.last_timestamp = [[(999999,0)]*32]*2
        self.num = Word(nums)
        self.word = Word(alphas)
        self.open_brace = Suppress(Literal("["))
        self.close_brace = Suppress(Literal("]"))
        self.colon = Literal(":")
        self.stime = Combine(OneOrMore(self.num + self.colon) + self.num + Literal(".") + self.num)
        self.date = OneOrMore(self.word) + self.num + self.stime
        self.is_cavium = self.open_brace + (Suppress(self.word)) + self.close_brace
        self.oct_id = self.open_brace + Suppress(self.word) + Suppress(Literal("=")) \
            + self.num + self.close_brace
        self.core_id = self.open_brace + Suppress(self.word) + Suppress(Literal("#")) \
            + self.num + self.close_brace
        self.ppm_id = self.open_brace + self.num + self.close_brace
        self.oct_ts = self.open_brace + self.num + self.close_brace
        self.dump = Suppress(Word(hexnums) + Literal(":")) + OneOrMore(Word(hexnums))
        self.opening = Suppress(self.date) + Optional(self.is_cavium.setResultsName("cavium")) \
            + self.oct_id.setResultsName("octeon").setParseAction(lambda toks:int(toks[0])) \
            + self.core_id.setResultsName("core").setParseAction(lambda toks:int(toks[0])) \
            + Optional(self.ppm_id.setResultsName("ppm").setParseAction(lambda toks:int(toks[0])) \
            + self.oct_ts.setResultsName("timestamp").setParseAction(lambda toks:int(toks[0]))) \
            + Optional(self.dump.setResultsName("pcap"))

    def parse_file(self, filepath):
        self.filepath = filepath
        with open(self.filepath,'r') as f:
            self.lines = f.readlines()
        for lineno,line in enumerate(self.lines):
            self.in_list.append((lineno,line))
        processes = [mp.Process(target=self.parse_line) for i in range(mp.cpu_count())]
        [process.start() for process in processes]
        [process.join() for process in processes]

    def parse_line(self):
        while self.in_list:
            (lineno, line) = self.in_list.pop()
            print mp.current_process().name, "start"
            dic = defaultdict(int)
            result = self.opening.parseString(line)
            self.pcap_text.append("".join(result.pcap))
            if result.timestamp or result.ppm:
                dic['oct'], dic['core'], dic['ppm'], dic['timestamp'] = result[0:4]
                self.last_timestamp[result.octeon][result.core] = (result.ppm,result.timestamp)
            else:
                dic['oct'], dic['core'] = result[0:2]
                dic['ppm'] = (self.last_timestamp[result.octeon][result.core])[0]
                dic['ts'] = (self.last_timestamp[result.octeon][result.core])[1]
            dic['line'] = lineno
            self.out_list.append(dic)
However, this entire process takes approximately 3 minutes to complete.
My question is whether there is a way to make this faster.
I am using the pyparsing module to parse each line, if it makes any difference.
PS: Made changes in the routine following Paul McGuire's advice.
Not a big performance issue, but learn to iterate over files directly, instead of using readlines(). In place of this code:
self.lines = f.readlines()
for lineno,line in enumerate(self.lines):
self.in_list.append((lineno,line))
You can write:
self.in_list = list(enumerate(f))
A hidden performance killer is consuming the list from the front, as in while self.in_list: (lineno, line) = self.in_list.pop(0). Each call to pop(0) removes the 0'th element from the list. Unfortunately, Python's lists are implemented as arrays, so to remove the 0'th element, elements 1..n-1 have to be moved up one slot in the array. (A bare pop() removes the last element instead, which is cheap, but it hands the lines back in reverse order.) You don't really have to destroy self.in_list as you go, just iterate over it:
for lineno, line in self.in_list:
    <Do something with line and line no. Parse each line and push into out_list>
If you are thinking that consuming self.in_list as you go is a memory-saving measure, then you can avoid the array-shifting inefficiency of Python lists by using a deque instead (from Python's collections module). Deques are implemented internally as linked lists (in CPython, a doubly linked list of fixed-size blocks), so pushing or popping at either end is very fast, but indexed access into the middle is slow. To use a deque, replace the line:
self.in_list = list(enumerate(f))
with:
self.in_list = deque(enumerate(f))
Then replace the call in your code self.in_list.pop() with self.in_list.popleft().
But MUCH more likely to be the performance issue is the pyparsing code you are using to process each line. But since you didn't post the parser code, there is not much help we can provide there.
To get an idea about where the time is going, try leaving all your code, and then comment out the <Do something with line and line no. Parse each line and push into out_list> code (you may have to add a pass statement for the for loop), and then run against your 23MB file. This will give you a rough idea about how much of your 3 minutes is being spent in reading and iterating over the file, and how much is being spent doing the actual parsing. Then post back in another question when you find where the real performance issues lie.

Ruby csv for each - clean up characters?

I have the following code, which reads each line of a CSV and cleans up each row. The rows are all path\file name directories. I am having an issue where the script cannot find a path\file because the file name has a - in it. The - (dash) is read by Ruby as \x96. Does anyone know how to get it to not do that, and to read the - as a dash?
This is what I have, but it is not working:
CSV.foreach("#{batch_File_Dir_sdata}") do |ln|
  line_number += 1
  pathline = ln.to_s
  log_linemsg = "Source #{line_number}= #{pathline}"
  log_line = ["#{$cname}","#{log_linemsg}","","",]
  puts log_linemsg
  insert_logitems(connection, table_namelog, log_line)
  if pathline.include?("\\")
    cleanpath = pathline.gsub!("\\\\","\\")
    #cleanpath = cleanpath.gsub!("[","")
    #cleanpath = cleanpath.gsub!("]","")
    cleanpath.gsub!("\"","")
    #THIS IS THE LINE WHERE I AM TRYING TO FIX THE ISSUE
    cleanpath.gsub!("\\x96","\-")
    cleanpath.slice!(0)
    cleanpath.chop!
    #puts "Clean path - has backslash\n#{cleanpath}"
  else
    cleanpath = pathline
    #puts "#{cleanpath}"
    #puts "Clean path - has NO backslash\n#{cleanpath}"
  end
Any help would be greatly appreciated.
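One possible explanation, offered only as a guess: byte 0x96 is the en dash in the Windows-1252 code page, so the "dash" in the file name may not be an ASCII hyphen at all, and Ruby shows the raw byte as \x96 because it is not valid UTF-8. The sketch below assumes the CSV was saved as Windows-1252; the helper name cp1252_to_utf8 and the sample path are made up for illustration.

```ruby
# Assumption: the input bytes are Windows-1252, where 0x96 is an en dash (U+2013).
def cp1252_to_utf8(raw)
  raw.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Hypothetical path containing a raw 0x96 byte, as it might arrive from the CSV.
raw_path = "C:\\reports\\2012 \x96 summary.pdf".b

utf8_path = cp1252_to_utf8(raw_path)
puts utf8_path                      # the 0x96 byte is now a real en dash
puts utf8_path.tr("\u2013", '-')    # only if the file on disk really uses a plain hyphen
```

If that assumption holds, passing encoding: 'Windows-1252:UTF-8' to CSV.foreach should transcode every row while reading. Whether the en dash should then be kept or swapped for a hyphen depends on what the file name on disk actually contains.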

How to insert line break in a return statement for D3

I have the following d3 code:
tooltip.select("#popupCount").text(function(){
  if (varToGraph == "rough_top_cost"){
    return " " + textValue + ": $" + addCommas(allCountyData[countyName][varToGraph]) + "\n" +
      "Count:"
  }})
I want the word count to appear on a new line. However, the above code results in everything being on one line. How can I get the output to be on two lines?
Thanks,
AH
Untested answer, but FWIW this may get close;
tooltip.select("#popupCount").html(function(){
  if (varToGraph == "rough_top_cost"){
    return " " + textValue + ": $" + addCommas(allCountyData[countyName][varToGraph]) + "<br/>" +
      "Count:"
  }})
This is based on the example on page 80 of D3 Tips and Tricks, which includes tooltips with line breaks.
It uses the selection's html() method instead of text(), which lets markup such as <br/> produce a line break. Check out the documentation for more detail.

test if a PDF file is finished in Ruby (on Solaris/Unix)?

I have a server that generates or copies PDF files to a specific folder.
I wrote a Ruby script (my first ever) that regularly checks for my own PDF files and displays them with Acrobat. So simple, so nice.
But now I have a problem: how do I detect that a PDF is complete?
The generated PDF ends with %%EOF\n
but the copied ones are generated with some Apple magic (Acrobat Writer, I think) that has a %%EOF near the beginning of the file, lots of binary zeros, and another %%EOF near the end followed by a carriage return (or line feed) and a binary zero at the end.
while true
  dir = readpfad
  Dir.foreach(dir) do |f|
    datei = File.join(dir, f)
    if File.file?(datei)
      if File.stat(datei).owned?
        if datei[-9..-1].upcase == "__PDF.PDF"
          if File.stat(datei).size > 5
            test = File.new(datei)
            dummy = test.readlines
            if dummy[-1][0..4] == "%%EOF"
              #move the file, so it will not be shown again
              cmd = "mv " + datei + " " + movepfad
              system(cmd)
              acro = ACROREAD + " " + File.join(movepfad, f) + "&"
              system(acro)
            else
              puts ">>>" + dummy[-1] + "<<<"
            end
          end
        end
      end
    end
  end
  sleep 1
end
Any help or idea?
Thanks
Peter
All the %%EOF rule means is that there should be a %%EOF token within the last 1024 bytes of the physical end of file. The structure of PDF is such that a PDF document may have one or more %%EOF tokens within it (the details are in the spec).
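If a quick sanity check is still wanted, the 1024-byte rule can be sketched roughly as below. The helper name eof_marker_in_tail? and the sample bytes are made up for illustration, and passing this check is at best a necessary condition: a file that is still being copied can already pass it.

```ruby
require 'tempfile'

# True if "%%EOF" occurs somewhere in the last +window+ bytes of the file.
# Heuristic only: it does not prove that the copy has finished.
def eof_marker_in_tail?(path, window = 1024)
  size = File.size(path)
  File.open(path, 'rb') do |f|
    f.seek([size - window, 0].max)   # jump to the start of the tail
    f.read.include?('%%EOF')
  end
end

Tempfile.create('sample') do |f|
  f.binmode
  # Early %%EOF followed by a long run of zero bytes, like the copied files.
  f.write("%PDF-1.4\n%%EOF\n" + "\0" * 2048)
  f.flush
  puts eof_marker_in_tail?(f.path)   # the early marker is outside the tail

  f.write("%%EOF\r\0")               # trailing marker plus CR and a zero byte
  f.flush
  puts eof_marker_in_tail?(f.path)
end
```

Unlike the readlines approach in the question, this only reads the end of the file and does not care whether the marker is followed by a line feed, a carriage return or a zero byte.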
As such, "contains %%EOF" is not equivalent to "completely copied". Really, the correct answer is that the server should signal when it's done, and your code should be a client of that signal. In general, polling, especially IO-bound polling, is the wrong answer to this problem.
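One common way to get such a signal, assuming you can change the producing side, is to write the file under a temporary name and rename it once it is complete; a rename within one filesystem is atomic on POSIX systems, so the final name never refers to a half-written file. This is a swapped-in technique, not something the answer above prescribes, and the helper names and the ".part" suffix are hypothetical.

```ruby
require 'tmpdir'

# Producer side: write under a temporary name, then rename when finished.
def publish_atomically(final_path, data)
  partial = final_path + '.part'            # hypothetical suffix for unfinished files
  File.open(partial, 'wb') { |f| f.write(data) }
  File.rename(partial, final_path)          # atomic within a single filesystem
end

# Consumer side: only names ending in __PDF.PDF count as finished.
def finished_pdfs(dir)
  Dir.entries(dir).select { |name| name.upcase.end_with?('__PDF.PDF') }.sort
end

Dir.mktmpdir do |dir|
  publish_atomically(File.join(dir, 'report__PDF.PDF'), "%PDF-1.4\n%%EOF\n")
  puts finished_pdfs(dir)
end
```

The viewer loop can then keep its existing __PDF.PDF name test and drop the %%EOF inspection entirely, because unfinished files never carry the final name.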
