Ruby - URL to Markdown - ruby

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# notes folder.

require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'

$target_folder = "~/Dropbox/messx/urls2md"

def url_to_markdown(url)
  res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"), {'u' => url, 'read' => '1'})
  if res.code.to_i == 200
    res.body
  else
    false
  end
end

file = ARGV[0]

begin
  input = IO.read(file).force_encoding('utf-8')
  headers = {}
  input.each_line {|line|
    key, value = line.split(/: /)
    headers[key] = value.strip || ""
  }

  outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/,'') + ".txt")
  date = Time.now.strftime("%Y-%m-%d %H:%M")
  date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")

  content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"

  tags = false
  if headers['Tags'].length > 0
    tag_arr = headers['Tags'].split(", ")
    tag_arr.map! {|tag|
      %Q{"#{tag.strip}"}
    }
    tags = tag_arr.join(" ")
    content += "Keywords: #{tags}\n"
  end

  markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')

  if markdown
    content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n"+markdown
  else
    content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n"+headers['Excerpt']
  end

  File.open(outfile,'w') {|f|
    f.puts content
  }

  if tags && File.exists?("/usr/local/bin/openmeta")
    %x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
  end

  # FileUtils.rm(file)
rescue Exception => e
  puts e
end

How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
  match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
  value = line
  if match
    key = match[:key].strip
    headers[key] = match[:value].strip
  else
    headers[key] += line
  end
end
First, splitting on just ": " is dangerous, since a colon can appear in the content. Instead, a regex (simplified from the code above) of /^\w+:.*/ will match "Word: Content". Since the lines after "Excerpt:" aren't prefixed with a key, you need to hang on to the last seen key and just append the line when there's no key on it. You may need to add a newline in there, depending on what you're doing with that header information, but it seems to work.
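For instance, fed a header block with a multi-line Excerpt, the loop keeps appending continuation lines to the last seen key. A minimal standalone sketch (with a newline added on append, as suggested above):

input = <<~EOS
  Title: Talking about my Generation.
  URL: http://blog.greglaurie.com/?p=8881
  Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
  Max Jukes lived in New York.
EOS

headers = {}
key = nil
input.each_line do |line|
  match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
  if match
    key = match[:key].strip
    headers[key] = match[:value].strip
  else
    headers[key] += "\n" + line.strip   # continuation line: append to the last seen key
  end
end

puts headers['Excerpt']
# Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
# Max Jukes lived in New York.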

Related

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress, i.e. work that has been started but is unfinished. Its internals will almost certainly change.
Two of the three points you indicate can be implemented relatively easily:
set the column of each comment to 0 (by recursively going over the loaded data structure, similar to here); if the column is before the position of the end of the value on a line, you'll get one space between the value and the comment
while doing the recursion from the previous point, take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines())
while '\n\n\n' in value:
    value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult; it depends on the data structure etc. Given that these are full line comments, I suggest you do some postprocessing of the YAML document you generate:
find a full line comment, and gather full line comments until the next line is not a full line comment (i.e. some "real" YAML). Since full line comments are in column 0 if the previous step is applied, you don't have to track whether you are in a (multi-line) literal or folded scalar string where one of the lines happens to start with #
determine the number of spaces before the real YAML and apply these to the full line comments.
import sys
import ruamel.yaml

yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
  c: |
    # this is not a comment
    it is the first line of a block style literal scalar
    processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
  d: 42 # and some non-full line comment
  e: # another one
# and some more comments to align
  f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
    def do_one(comment):
        if not comment:
            return
        comment.column = 0
        value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
        while '\n\n\n' in value:
            value = value.replace('\n\n\n', '\n\n')
        comment.value = value

    def do_values(v):
        for x in v:
            for comment in x:
                do_one(comment)

    def do_loc(v):
        if v is None:
            return
        do_one(v[0])
        if not v[1]:
            return
        for comment in v[1]:
            do_one(comment)

    if isinstance(d, dict):
        do_loc(d.ca.comment)
        do_values(d.ca.items.values())
        for val in d.values():
            redo_comments(val)
    elif isinstance(d, list):
        do_values(d.ca.items.values())
        for elem in d:
            redo_comments(elem)

def realign_full_line_comments(s):
    res = []
    buf = []
    for line in s.splitlines(True):
        if not buf:
            if line and line[0] == '#':
                buf.append(line)
            else:
                res.append(line)
        else:
            if line[0] in '#\n':
                buf.append(line)
            else:
                # YAML line, determine indent
                count = 0
                while line[count] == ' ':
                    count += 1
                    if count > len(line):
                        break  # superfluous?
                indent = ' ' * count
                for cline in buf:
                    if cline[0] == '\n':  # empty
                        res.append(cline)
                    else:
                        res.append(indent + cline)
                buf = []
                res.append(line)
    return ''.join(res)

yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
  # collapse multiple empty lines
  c: |
    # this is not a comment
    it is the first line of a block style literal scalar
    processing this gobbles a newline which doesn't go into a comment
  # that is unless you have a (dedented) comment directly following
  d: 42 # and some non-full line comment
  e: # another one
  # and some more comments to align
  f: glitter in the dark near the Tannhäuser gate

Text outputs as multiple separate lines instead of one paragraph with linebreaks

I have a bot that writes my message to a webpage. I want the message to be sent as one paragraph, with the lines separated by line breaks. However, when I actually run the code, the bot types and submits each line separately instead of as one paragraph.
I've tried messing with the line-break formatting and string formatting, but the issue persists.
reply_messages = []
reply_messages.push([
  "FREE BABY AVOCUDDLE - Thank you for your patience!",
  "To redeem your FREE avocuddle, just use the LINK IN OUR BIO and ADD TO CART - just cover shipping, no additional charges!",
  "Discount AUTOMATICALLY APPLIES! Super simple, no code!",
  "If you order another avocuddle in addition, we cover shipping PLUS the free baby avocuddle! :)",
  "Feel free to DM us if you need anything!"
].join("\n")+"\n")
.
.
.
while true
  browser.get 'https://twitter.com/messages/requests'
  sleep 5
  request = wait.until {
    el = browser.find_element(:css, "[data-testid='conversation']")
    el if el.displayed?
  }
  break if request.nil?
  request.click

  not_acceptable_link = true
  na_link_index = 1
  while not_acceptable_link == true
    accept_btn = browser.find_element(:xpath, "//*[contains(text(), 'Accept')]")
    unless accept_btn.displayed?
      request_2 = wait.until {
        el = browser.find_elements(:css, "[data-testid='conversation']")[na_link_index]
        el if el.displayed?
      }
      request_2.click
    else
      not_acceptable_link = false
    end
    na_link_index += 1
  end

  accept_btn = browser.find_element(:xpath, "//*[contains(text(), 'Accept')]")
  accept_btn.click
  sleep 1

  reply_input = wait.until {
    el = browser.find_element(:css, "[data-testid='dmComposerTextInput']")
    el if el.displayed?
  }
  reply_input.click
  reply_input.send_keys(reply_messages)
.
.
.
I would like for the code to output the entire text as one block of text. However, instead, it outputs as separate lines.
Output currently looks like:
FREE BABY AVOCUDDLE - Thank you for your patience!
(enters this into text box)
To redeem your FREE avocuddle, just use the LINK IN OUR BIO and ADD TO CART - just cover shipping, no additional charges!
(enters this into text box)
etc.
Instead I would like for it to send as one message.
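In Twitter's DM composer a bare Enter submits the message, so each "\n" inside the string handed to send_keys is most likely being treated as "send this line now". A hedged sketch of one possible workaround, reusing the question's browser driver and composer selector, and assuming the composer accepts Shift+Enter as a soft line break (that behavior is an assumption, not something confirmed in the thread):

require 'selenium-webdriver'   # same gem the question's code relies on

lines = [
  "FREE BABY AVOCUDDLE - Thank you for your patience!",
  "Feel free to DM us if you need anything!"
]

# `browser` is assumed to be the Selenium::WebDriver instance created earlier
# in the question's script; the CSS selector is the one the question uses.
reply_input = browser.find_element(:css, "[data-testid='dmComposerTextInput']")
reply_input.click
lines.each_with_index do |text, i|
  reply_input.send_keys([:shift, :enter]) unless i.zero?   # line break without submitting
  reply_input.send_keys(text)
end
reply_input.send_keys(:enter)   # submit the whole multi-line message once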

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it onto the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
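To make the requested transformation concrete, here is a minimal Ruby sketch of it (a sketch only, assuming the real export's <DT> lines carry an <A HREF=...> anchor such as <DT><A HREF="http://example.com/">Phabricator</A>, which the simplified example above doesn't show):

# Pull the #tags out of each <DD> line, attach them to the preceding <DT> line
# as TAGS="tag1,tag2", and keep only the plain description.
dt = nil
ARGF.each_line do |line|
  if line.start_with?('<DT>')
    puts dt if dt                       # previous entry had no <DD> line
    dt = line.chomp
  elsif line.start_with?('<DD>') && dt
    words     = line.chomp.sub('<DD>', '').split
    tag_words = words.select { |w| w.start_with?('#') }
    tags      = tag_words.map { |w| w[1..-1] }
    desc      = (words - tag_words).join(' ')
    dt = dt.sub(/(<A[^>]+)>/) { "#{$1} TAGS=\"#{tags.join(',')}\">" }
    puts dt
    puts "<DD>#{desc}" unless desc.empty?
    dt = nil
  end
end
puts dt if dt                           # flush a trailing tag-less entry
# Run with:  ruby kippt2pinboard.rb kippt_export.html > pinboard_import.html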
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
            else:
                print current_dt
                print current_dd
                current_dt = ""
                current_dd = ""

print current_dt
print current_dd
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
    if(d==1) print "<DT>"s          # for printing lines with no tags
    s=substr($0,5); tags=""         # copy the line after "<DT>"; you'll see why
    d=1
}
/^<DD>/{
    d=0
    m=match(s,/>/)                  # find the end of the HREF descriptor (first match of ">")
    for(i=2;i<=NF;i++){sub(/ $/,"",$i); tags=tags","$i}   # concatenate tags
    td=match(tags,/ /)              # parse for tag description (marked by a preceding space)
    if(td==0){                      # no description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{                           # description exists
        tagdes=substr(tags,td)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

How to extract string from large file only if specific string appears previous using Ruby?

I am trying to extract information from a large file and cannot figure out how to extract strings from file lines only when a previous line in the same record within the file has been matched by regex. An example of one record in the file is as follows:
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM)
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization, by a patient or research subject, etc,...
This file contains over 20,000 records like this example. I want to identify a small percent of those records using the "MH" field. In this example, I want to find "Informed Consent", and then use regex to extract the information in the FX, AN, and MS fields only within that record. So far, I have opened the file, accessed the hash that the MH terms are stored in, and been able to extract those terms from the records in the file. I also have a functioning regex that identifies the content in the "FX" field.
File.open('mesh_descriptor.bin').each do |file_line|
  file_line = file_line.chomp
  # read each key of candidate_descriptor_keys
  candidate_descriptor_keys.each do |cand_term|
    if file_line =~ /^MH\s=\s(#{cand_term})$/
      mesh_header = $1
      puts "MH from Mesh Descriptor file is: #{mesh_header}"
      if file_line =~ /^FX\s=\s(.*)$/
        see_also = $1
        puts "  See_Also from Descriptor file is: #{see_also}"
      end
    end
  end
end
The hash contains the following MH (keys):
candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]
I had success extracting "FX" when I put the statement outside of the "if" statement to extract "MH", but all of the "FX" from the whole file were retrieved - not what I need. I thought putting the "if" statement for "FX" within the previous "if" statement would restrict the results to only those found when the first statement is true, but I am getting no results (also no errors) with this strategy. What I would like as a result is:
> Informed Consent
> Disclosure
> Mental Competency
> Therapeutic Misconception
> Treatment Refusal
as well as the strings within the "AN" and "MS" fields for only those records matching "MH". Any suggestions would be helpful!
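A quick illustration of why the nested "if" finds nothing: both regexes are tested against the same file_line, and no single line starts with both "MH = " and "FX = ", so the inner match can never succeed on a line the outer one already matched:

line = "MH = Informed Consent"
p line =~ /^MH\s=\s(.*)$/   # => 0   (matches at the start of the line)
p line =~ /^FX\s=\s(.*)$/   # => nil (can never match the same line)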
I think this may be what you are looking for, but if not, let me know and I will change it. Look especially at the very end to see if that is the sort of output you want (for input having two records, both with an "MH" field). I will also add an "explanation" section at the end once I have understood your question correctly.
I have assumed that each record begins
*NEW RECORD
and you wish to identify all lines beginning "MH" whose field is one of the elements of:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
and for each match, you would like to print the contents of the lines for the same record that begin with "FX", "AN" and "MS".
Code
NEW_RECORD_MARKER = "*NEW RECORD"

def getem(fname, candidate_descriptor_keys)
  line = 0
  found_mh = false
  File.open(fname).each do |file_line|
    file_line = file_line.strip
    case
    when file_line == NEW_RECORD_MARKER
      puts # space between records
      found_mh = false
    when found_mh == false
      candidate_descriptor_keys.each do |cand_term|
        if file_line =~ /^MH\s=\s(#{cand_term})$/
          found_mh = true
          puts "MH from line #{line} of file is: #{cand_term}"
          break
        end
      end
    when found_mh
      ["FX", "AN", "MS"].each do |des|
        if file_line =~ /^#{des}\s=\s(.*)$/
          see_also = $1
          puts "  Line #{line} of file is: #{des}: #{see_also}"
        end
      end
    end
    line += 1
  end
end
Example
Let's begin by creating a file, starting with a "here document" that contains two records:
records =<<_
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization
*NEW RECORD
MH = Obesity
AQ = ES HI LJ PX SN ST
ENTRY = Obesity
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = 1st FX
FX = 2nd FX
AN = Only AN
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Only MS
_
If you puts records you will see it is just a string. (You'll see that I shortened two of them.) Now write it to a file:
File.write('mesh_descriptor', records)
If you wish to confirm the file contents, you could do this:
puts File.read('mesh_descriptor')
We also need to define the array candidate_descriptor_keys:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
We can now execute the method getem:
getem('mesh_descriptor', candidate_descriptor_keys)
MH from line 2 of file is: Informed Consent
Line 7 of file is: FX: Disclosure
Line 8 of file is: FX: Mental Competency
Line 9 of file is: FX: Therapeutic Misconception
Line 10 of file is: FX: Treatment Refusal
Line 13 of file is: AN: competency to consent
Line 16 of file is: MS: Voluntary authorization
MH from line 18 of file is: Obesity
Line 23 of file is: FX: 1st FX
Line 24 of file is: FX: 2nd FX
Line 25 of file is: AN: Only AN
Line 28 of file is: MS: Only MS

Get numbers from a list in a file, output to another file in Ruby?

I have a big text file that contains - among others- lines like these:
"X" : "452345230"
I want to find all lines that contain "X", take just the number (without the quotation marks), and then output the numbers to another file, in this fashion:
452349532
234523452
213412411
219456433
etc.
What I did so far is this:
myfile = File.open("myfile.txt")
x = []
myfile.grep(/"X"/) {|line|
  x << line.match( /"(\d{9})/ ).values_at( 1 )[0]
  puts x
  File.open("output.txt", 'w') {|f| f.write(x) }
}
it works, but the list it produces is of this form:
["23419230", "2349345234" , ... ]
How do I output it like I showed before, with just the numbers and each number on its own line?
Thanks.
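The bracketed output comes from writing the whole array at once: IO#write calls to_s on its argument, and Array#to_s produces the ["...", ...] form, while puts x prints one element per line (which is why the console output looked right). A quick illustration with hypothetical file names:

x = ["452345230", "452345231"]

File.write("as_array.txt", x)                        # file contains: ["452345230", "452345231"]
File.write("one_per_line.txt", x.join("\n") + "\n")
# one_per_line.txt:
# 452345230
# 452345231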
Here's a solution that doesn't leave files open:
File.open("output.txt", 'w') do |output|
File.open("myfile.txt").each do |line|
output.puts line[/\d{9}/] if line[/"X"/]
end
end
I couldn't reproduce what you saw:
$ cat myfile.txt
"X" : "452345230"
"X" : "452345231"
"X" : "452345232"
"X" : "452345233"
$ ./scanner.rb
452345230
452345230
452345231
452345230
452345231
452345232
452345230
452345231
452345232
452345233
$ cat output.txt
452345230452345231452345232452345233$
However, I did notice that your application is incredibly wasteful and probably not doing what you expect: you open output.txt, write some content to it, then close it again. The next time it is opened in the loop, it is overwritten. If your file is 1,000 lines long, this won't be so bad; you're only rewriting the file 1,000 times. If your file is 1,000,000 lines long, this is going to be a pretty horrible performance penalty as you open the file, write into it, and then throw that content away again, one million times. Oops.
I re-wrote your tool a little bit:
$ cat scanner.rb
#!/usr/bin/ruby -w
myfile = File.open("myfile.txt")
output = File.open("output.txt", 'w')
myfile.grep(/"X"/) {|line|
  x = line.match( /"(\d{9})/ ).values_at( 1 )[0]
  puts x
  output.write(x + "\n")
}
This opens each file exactly once, writes each new line one at a time, and then lets them both be closed when the application quits. Depending upon whether this is a small portion of your application or the entire thing, this might be alright. (If this is a small portion of the program, then definitely close the files when you're done with them.)
This might still be wasteful for one million matched lines -- those writes are almost certainly handed straight to the system call write(2), which will involve some overhead.
How many of these will you be running? Millions? Billions? If this needs more refinement feel free to ask...
Solution:
myfile = File.open("myfile.txt")
File.open("output.txt", 'w') do |output|
  content = myfile.lines.map { |line| line.scan(/^"X".*(\d{9})/) }.flatten.join("\n")
  output.write(content)
end
Edited: I updated the code, reducing it a bit. If the example above seems complicated, you can also grab the data you want with the following statement (it may be a little clearer about what's happening):
content = myfile.lines.select { |line| line =~ /"X"/ }.map { |line| line.scan(/\d{9}/) }.join("\n")
