Parsing XML into CSV using Nokogiri

Parsing XML into CSV using Nokogiri - ruby

I am trying to figure out how to get Make and Model out of XML returned from a URL and put them into a CSV. Here is the XML returned from the URL:
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
Here is the code I have so far:
require 'csv'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
vincarriercsv = 'vincarrier.csv'
vindetails = 'vindetails.csv'
vinurl = 'http://redacted/LookUp_VIN?key=redacted&vin='
CSV.open(vindetails, "wb") do |details|
CSV.foreach(vincarriercsv) do |row|
vinxml = Nokogiri::HTML(vinurl + row[1])
make = vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text
model = vinxml.xpath('//VINResult//Vehicles//Vehicle//Model').text
details << [ row[0], row[1], make, model ]
end
end
For some reason the URL returns the same data twice but I only need the first result. So far my attempts to grab the Make and Model from the XML has failed...any ideas?

Here's how to get at the make and model data. How to convert it to CSV is left to you:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
EOT
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
'make', vehicle.at('Make').content,
'model', vehicle.at('Model').content
]
}
This results in:
vehicle_make_and_models # => [["make", "Freightliner", "model", "FLD12064T"], ["make", "Freightliner", "model", "FLD12064T"]]
If you don't want the field names:
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
vehicle.at('Make').content,
vehicle.at('Model').content
]
}
vehicle_make_and_models # => [["Freightliner", "FLD12064T"], ["Freightliner", "FLD12064T"]]
Note: You have XML, not HTML. Don't assume that Nokogiri treats them the same, or that the difference is insignificant. Nokogiri parses XML strictly, since XML is a strict standard.
I use CSS selectors unless I absolutely have to use XPath. CSS results in a much clearer selector most of the time, which results in easier to read code.
vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text doesn't work, because // means "start at the top of the document". Each time it's encountered Nokogiri starts at the top, searches down, and finds all matching nodes. xpath returns all matching nodes as a NodeSet, not just a particular Node, and text will return the text of all Nodes in the NodeSet, resulting in a concatenated string of the text, which is probably not what you want.
I prefer to use search instead of xpath or css. It returns a NodeSet like the other two, but it also lets us use either CSS or XPath selectors. If your particular selector was ambiguous and could be interpreted as either CSS or XPath, then you can use the explicit form. Likewise, you can use at or xpath_at or css_at to find just the first matching node, which is equivalent to search('foo').first.

You could also do the following which will place all of the vehicles in an Array and all of the vehicle attributes into a Hash
require 'nokogiri'
doc = Nokogiri::XML(open(YOUR_XML_FILE))
vehicles = doc.search("Vehicle").map do |vehicle|
Hash[
vehicle.children.map do |child|
[child.name, child.text] unless child.text.chomp.strip == ""
end.compact
]
end
#=>[{"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}, {"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}]
Then you can access all the attributes for an individual vehicle i.e.
vehicles.first["ID"]
#=> "131497"
vehicles.first["Year"]
#=> "1993"
etc.

Related

How to combine two XML files with Nokogiri

I am trying to combine two separate, but related, files with Nokogiri. I want to combine the "product" and "product pricing" if "ItemNumber" is the same.
I loaded the documents, but I have no idea how to combine the two.
Product File:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
</Products>
Product Pricing Fields:
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
I am looking to generate a file like this:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
</Products>
Here is the code I have so far:
require 'csv'
require 'nokogiri'
xml = File.read('lateApril-product-pricing.xml')
xml2 = File.read('lateApril-master-date')
doc = Nokogiri::XML(xml)
doc2 = Nokogiri::XML(xml2)
pricing_data = []
item_number = []
doc.xpath('//ProductsPricing/ProductPricing').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text
variant_Price = file.xpath('./Price').first.text
pricing_data << [ itemNumber, variant_Price ]
item_number << [ itemNumber ]
end
puts item_number ## This prints all the item number but i have no idea how to loop through them and combine them with Product XML
doc2.xpath('//Products/Product').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text #not sure how to write the conditions here since i don't have pricing fields available in this method
end

Try this on:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
doc1.at('Product').add_next_sibling(doc2.at('ProductPricing'))
Which results in:
puts doc1.to_xml
# >> <?xml version="1.0"?>
# >> <Products>
# >> <Product>
# >> <Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
# >> </Product><ProductPricing>
# >> <ItemNumber>100024</ItemNumber>
# >> </ProductPricing>
# >> </Products>
Please, when you ask, strip the example input and expected resulting output to the absolute, bare, minimum. Anything beyond that wastes space, eye-time and brain CPU.
This is untested code, but is where I'd start if I was going to merge two files containing multiple <ItemNumber> nodes:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ItemNumber>100024</ItemNumber>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
# build a hash containing the item numbers in doc1 for each product
doc1_products_by_item_numbers = doc1.search('Product').map { |product|
item_number = product.at('ItemNumber').value
[
item_number,
product
]
}.to_hash
# build a hash containing the item numbers in doc2 for each product pricing
doc2_products_by_item_numbers = doc2.search('ProductPricing').map { |pricing|
item_number = pricing.at('ItemNumber').value
[
item_number,
pricing
]
}.to_hash
# append doc2 entries to doc1 after each product based on item numbers
doc1_products_by_item_numbers.keys.each { |k|
doc1_products_by_item_numbers[k].add_next_sibling(doc2_products_by_item_numbers[k])
}

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<classix_nouveaux album="Night People"/>
<release_date value="19820501"/>
</classix_nouveaux>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?

Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are terminated. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even sub-string matches, but neither CSS or XPath understand "ranges" of dates, nor do they have an idea of what a Date is, so you'd have to extract all nodes, convert the date value into a Date object or integer in this case, then compare to the range:
date_range = 19800501..19820401
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
Once the XML is in a standard form, then it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[#band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800501..19820401
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

How to pull data from tags based on other tags

I have the following example document:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
</EmployeeInfoGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
Using Nokogiri I want to extract the value between the <PersonFirstNm>, <PersonLastNm> and <irs:SSN> for each <Form1095CUpstreamDetail> based on the <RecordId>.
I tried removing namespaces as well. I posted a small snippet, but I have tried many iterations of working through the XML with no success. This is my first time using XML, so I realize I am likely missing something easy.
When I set my XPath:
require 'nokogiri'
submission_doc = Nokogiri::XML(open('1094C_Request.xml'))
submissions = submission_doc.remove_namespaces
nodes = submission.xpath('//Form1095CUpstreamDetail')
I do not seem to have any association between the RecordId and the tags mentioned above, and I am stuck on where to go next.
The fields are not listed as children for the RecordId, so I can't think of how to approach obtaining their values. I am including the full document as an example to make sure I am not excluding anything.
I have an array of values, and I would like to pull the three tags mentioned above if the RecordId is contained within the array of numbers.

Nokogiri makes it pretty easy to do what you want (assuming the XML is syntactically correct). I'd do something like:
require 'nokogiri'
require 'pp'
doc = Nokogiri::XML(<<EOT)
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonLastNm>Doe</PersonLastNm>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonLastNm>DOE</PersonLastNm>
<irs:SSN>222222222</irs:SSN>
</Form1095CUpstreamDetail>
</Form109495CTransmittalUpstream>
EOT
info = doc.search('Form1095CUpstreamDetail').map{ |form|
{
record_id: form.at('RecordId').text,
person_first_nm: form.at('PersonFirstNm').text,
person_last_nm: form.at('PersonLastNm').text,
ssn: form.at('irs|SSN').text
}
}
pp info
# >> [{:record_id=>"1",
# >> :person_first_nm=>"JOHN",
# >> :person_last_nm=>"Doe",
# >> :ssn=>"123456790"},
# >> {:record_id=>"2",
# >> :person_first_nm=>"JANE",
# >> :person_last_nm=>"DOE",
# >> :ssn=>"222222222"}]
While it's possible to do this with XPath, Nokogiri's implementation of CSS selectors tends to result in more easily read selectors, which translates to easier to maintain, which is a very good thing.
You'll see the use of | in 'irs|SSN' which is Nokogiri's way of defining a namespace for CSS. This is documented in "Namespaces".

First of all the xml validator reports error
The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.
so you must set this default xmlns to "".
You can use this code.
require 'nokogiri'
doc = Nokogiri::XML(open('1094C_Request.xml'))
doc.namespaces['xmlns'] = ''
details = doc.xpath("//:Form1095CUpstreamDetail")
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]
output = details.each_with_object({}) do |element, exp|
exp[element.xpath("./:RecordId").text] = elem_a.each_with_object({}) do |elem_n, exp_h|
exp_h[elem_n] = element.xpath(".//#{elem_n.include?(':') ? elem_n : ":#{elem_n}"}").text
end
end
output
p output
# {
# "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
# "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }
I hope this helps

Can't address XML attribute thought XPath in Ruby (using Nokogiri)

I'm trying to filter xml file to get nodes with certain attribute. I can successfully filter by node (ex. \top_manager), but when I try \\top_manager[#salary='great'] I get nothing.
<?xml version= "1.0"?>
<employee xmlns="http://www.w3schools.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="employee.xsd">
<top_manager>
<ceo salary="great" respect="enormous" type="extra">
<fname>
Vasya
</fname>
<lname>
Pypkin
</lname>
<hire_date>
19
</hire_date>
<descr>
Big boss
</descr>
</ceo>
<cio salary="big" respect="great" type="intro">
<fname>
Petr
</fname>
<lname>
Pypkin
</lname>
<hire_date>
25
</hire_date>
<descr>
Resposible for information security
</descr>
</cio>
</top_manager>
......
How I need to correct this code to get what I need?
require 'nokogiri'
f = File.open("employee.xml")
doc = Nokogiri::XML(f)
doc.xpath("//top_manager[#salary='great']").each do |node|
puts node.text
end
thank you.

That's because salary is not attribute of <top_manager> element, it is the attribute of <top_manager>'s children elements :
//xmlns:top_manager[*[#salary='great']]
Above XPath select <top_manager> element having any of it's child element has attribute salary equals "great". Or if you meant to select the children (the <ceo> element in this case) :
//xmlns:top_manager/*[#salary='great']

Ruby - URL to Markdown

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# ntoes folder.
require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'
$target_folder = "~/Dropbox/messx/urls2md"
def url_to_markdown(url)
res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"),{'u'=>url,'read'=>'1'})
if res.code.to_i == 200
res.body
else
false
end
end
file = ARGV[0]
begin
input = IO.read(file).force_encoding('utf-8')
headers = {}
input.each_line {|line|
key, value = line.split(/: /)
headers[key] = value.strip || ""
}
outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/,'') + ".txt")
date = Time.now.strftime("%Y-%m-%d %H:%M")
date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")
content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"
tags = false
if headers['Tags'].length > 0
tag_arr = header s['Tags'].split(", ")
tag_arr.map! {|tag|
%Q{"#{tag.strip}"}
}
tags = tag_arr.join(" ")
content += "Keywords: #{tags}\n"
end
markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')
if markdown
content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n"+markdown
else
content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n"+headers['Excerpt']
end
File.open(outfile,'w') {|f|
f.puts content
}
if tags && File.exists?("/usr/local/bin/openmeta")
%x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
end
# FileUtils.rm(file)
rescue Exception => e
puts e
end

How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
value = line
if match
key = match[:key].strip
headers[key] = match[:value].strip
else
headers[key] += line
end
end
First, splitting on just ":" is dangerous since that can be in content. Instead, a (simplified from code) regex of /^\w+:.*/ will match "Word: Content". Since the lines after the "Excerpt:" aren't prefixed, you need to hang on to the last seen key, and just append if there's no key for this line. You may need to add a newline in there, depending on what you're doing with that header information, but it seems to work.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Parsing XML into CSV using Nokogiri - ruby

Related

How to combine two XML files with Nokogiri

How to filter XML elements by date range in Ruby

How to pull data from tags based on other tags

Can't address XML attribute thought XPath in Ruby (using Nokogiri)

Ruby - URL to Markdown

Categories

Resources