Moving chunks of data in a file with awk - bash

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 by TAGS="tag1,tag2" and move it on the first line within <A>.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland

This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
with open('bookmarks.xml', 'r') as f:
for line in f:
if re.match(dt, line):
current_dt = line.strip()
elif re.match(dd, line):
current_dd = line
tags = [w for w in line[4:].split(' ') if w.startswith('#')]
current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
for t in tags:
current_dd = current_dd.replace(t + ' ', '')
if current_dd.strip() == '<DD>':
current_dd = ""
else:
print current_dt
print current_dd
current_dt = ""
current_dd = ""
print current_dt
print current_dd
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.

script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descritor first match of ">"
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

Related

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress cq work started but unfinished. It's internals will almost certainly change.
Two of the three points you indicate can relatively easy be implemented:
set the column of each comment to 0 ( by recursively going over a loaded data structure similar to here ) if the column is before the position of the end of the value on a line, you'll get one space between the value and the column
at the same time doing the recursion in the previous point. Take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines()
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult, depends on the
data structure etc. Given that these are full line comments, I suggest
you do some postprocessing of the YAML document you generate:
find a full line comment, gather full line comments until next line is
not full line comment (i.e. some "real" YAML). Since full line comments
are in column[0] if the previous stuff is applied, you don't have to
track if you are in a (multi-line) literal or folded scalar string where
one of the lines happens to start with #
determine number of spaces
before real YAML and apply these to the full line comments.
import sys
import ruamel.yaml
yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
def do_one(comment):
if not comment:
return
comment.column = 0
value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
comment.value = value
def do_values(v):
for x in v:
for comment in x:
do_one(comment)
def do_loc(v):
if v is None:
return
do_one(v[0])
if not v[1]:
return
for comment in v[1]:
do_one(comment)
if isinstance(d, dict):
do_loc(d.ca.comment)
do_values(d.ca.items.values())
for val in d.values():
redo_comments(val)
elif isinstance(d, list):
do_values(d.ca.items.values())
for elem in d:
redo_comments(elem)
def realign_full_line_comments(s):
res = []
buf = []
for line in s.splitlines(True):
if not buf:
if line and line[0] == '#':
buf.append(line)
else:
res.append(line)
else:
if line[0] in '#\n':
buf.append(line)
else:
# YAML line, determine indent
count = 0
while line[count] == ' ':
count += 1
if count > len(line):
break # superfluous?
indent = ' ' * count
for cline in buf:
if cline[0] == '\n': # empty
res.append(cline)
else:
res.append(indent + cline)
buf = []
res.append(line)
return ''.join(res)
yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate

Sort two text files with its indented text aligned to it

I would like to compare two of my log files generated before and after an implementation to see if it has impacted anything. However, the order of the logs I get is not the same all the time. Since, the log file also has multiple indented lines, when I tried to sort, everything is sorted. But, I would like to keep the child intact with the parent. Indented lines are spaces and not tab.
Any help would be greatly appreciated. I am fine with any windows solution or Linux one.
Eg of the file:
#This is a sample code
Parent1 to be verified
Child1 to be verified
Child2 to be verified
Child21 to be verified
Child23 to be verified
Child22 to be verified
Child221 to be verified
Child4 to be verified
Child5 to be verified
Child53 to be verified
Child52 to be verified
Child522 to be verified
Child521 to be verified
Child3 to be verified
I am posting another answer here to sort it hierarchically, using python.
The idea is to attach the parents to the children to make sure that the children under the same parent are sorted together.
See the python script below:
"""Attach parent to children in an indentation-structured text"""
from typing import Tuple, List
import sys
# A unique separator to separate the parent and child in each line
SEPARATOR = '#'
# The indentation
INDENT = ' '
def parse_line(line: str) -> Tuple[int, str]:
"""Parse a line into indentation level and its content
with indentation stripped
Args:
line (str): One of the lines from the input file, with newline ending
Returns:
Tuple[int, str]: The indentation level and the content with
indentation stripped.
Raises:
ValueError: If the line is incorrectly indented.
"""
# strip the leading white spaces
lstripped_line = line.lstrip()
# get the indentation
indent = line[:-len(lstripped_line)]
# Let's check if the indentation is correct
# meaning it should be N * INDENT
n = len(indent) // len(INDENT)
if INDENT * n != indent:
raise ValueError(f"Wrong indentation of line: {line}")
return n, lstripped_line.rstrip('\r\n')
def format_text(txtfile: str) -> List[str]:
"""Format the text file by attaching the parent to it children
Args:
txtfile (str): The text file
Returns:
List[str]: A list of formatted lines
"""
formatted = []
par_indent = par_line = None
with open(txtfile) as ftxt:
for line in ftxt:
# get the indentation level and line without indentation
indent, line_noindent = parse_line(line)
# level 1 parents
if indent == 0:
par_indent = indent
par_line = line_noindent
formatted.append(line_noindent)
# children
elif indent > par_indent:
formatted.append(par_line +
SEPARATOR * (indent - par_indent) +
line_noindent)
par_indent = indent
par_line = par_line + SEPARATOR + line_noindent
# siblings or dedentation
else:
# We just need first `indent` parts of parent line as our prefix
prefix = SEPARATOR.join(par_line.split(SEPARATOR)[:indent])
formatted.append(prefix + SEPARATOR + line_noindent)
par_indent = indent
par_line = prefix + SEPARATOR + line_noindent
return formatted
def sort_and_revert(lines: List[str]):
"""Sort the formatted lines and revert the leading parents
into indentations
Args:
lines (List[str]): list of formatted lines
Prints:
The sorted and reverted lines
"""
sorted_lines = sorted(lines)
for line in sorted_lines:
if SEPARATOR not in line:
print(line)
else:
leading, _, orig_line = line.rpartition(SEPARATOR)
print(INDENT * (leading.count(SEPARATOR) + 1) + orig_line)
def main():
"""Main entry"""
if len(sys.argv) < 2:
print(f"Usage: {sys.argv[0]} <file>")
sys.exit(1)
formatted = format_text(sys.argv[1])
sort_and_revert(formatted)
if __name__ == "__main__":
main()
Let's save it as format.py, and we have a test file, say test.txt:
parent2
child2-1
child2-1-1
child2-2
parent1
child1-2
child1-2-2
child1-2-1
child1-1
Let's test it:
$ python format.py test.txt
parent1
child1-1
child1-2
child1-2-1
child1-2-2
parent2
child2-1
child2-1-1
child2-2
If you wonder how the format_text function formats the text, here is the intermediate results, which also explains why we could make file sorted as we wanted:
parent2
parent2#child2-1
parent2#child2-1#child2-1-1
parent2#child2-2
parent1
parent1#child1-2
parent1#child1-2#child1-2-2
parent1#child1-2#child1-2-1
parent1#child1-1
You may see that each child has its parents attached, all the way along to the root. So that the children under the same parent are sorted together.
Short answer (Linux solution):
sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
Test it out:
test.txt
parent2
child2-1
child2-1-1
child2-2
parent1
child1-1
child1-2
child1-2-1
$ sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
parent1
child1-1
child1-2
child1-2-1
parent2
child2-1
child2-1-1
child2-2
Explanation:
The idea is to replace the newline followed by an indentation/space with a non newline character, which has to be unique in your file (here I used # for example, if it is not unique in your file, use other characters or even a string), because we need to turn it back the newline and indentation/space later.
About sed command:
:a create a label 'a'
N append the next line to the pattern space
$! if not the last line, ba branch (go to) label 'a'
s substitute, /\n / regex for newline followed by a space
/#/ a unique character to replace the newline and space
if it is not unique in your file, use other characters or even a string
/g global match (as many times as it can)

Reversing and Splitting in Python

I have a file "names.txt". The contents are
"Smith,RobJones,MikeJane,SallyPetel,Brian"
and I want to read "names.txt" and make a new file "names2.txt" that looks like:
"Rob Smith Mike Jones Sally Jane Brian Petel"
I know I should be using #rstrip(\n) and #.split(',')
So far I have:
namesfile = input('Enter name of file: ') #open names.txt
openfile = open(namesfile, 'r')
This will do exactly that. You might be able to polish this and make it more elegant and I encourage you to do so:
import re
with open('names.txt') as f:
# Split the names
names = re.sub(r'([A-Z])(?![A-Z])',r',\1',f.read()).split(',')
# Filter empty results
names = [n for n in names if n != '']
# Swap pairs with each other
for i in range(len(names)):
if((i+1)%2 == 0):
names[i], names[i-1] = names[i-1], names[i]
print ' '.join(names)

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

Ruby - URL to Markdown

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# ntoes folder.
require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'
$target_folder = "~/Dropbox/messx/urls2md"
def url_to_markdown(url)
res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"),{'u'=>url,'read'=>'1'})
if res.code.to_i == 200
res.body
else
false
end
end
file = ARGV[0]
begin
input = IO.read(file).force_encoding('utf-8')
headers = {}
input.each_line {|line|
key, value = line.split(/: /)
headers[key] = value.strip || ""
}
outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/,'') + ".txt")
date = Time.now.strftime("%Y-%m-%d %H:%M")
date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")
content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"
tags = false
if headers['Tags'].length > 0
tag_arr = header s['Tags'].split(", ")
tag_arr.map! {|tag|
%Q{"#{tag.strip}"}
}
tags = tag_arr.join(" ")
content += "Keywords: #{tags}\n"
end
markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')
if markdown
content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n"+markdown
else
content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n"+headers['Excerpt']
end
File.open(outfile,'w') {|f|
f.puts content
}
if tags && File.exists?("/usr/local/bin/openmeta")
%x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
end
# FileUtils.rm(file)
rescue Exception => e
puts e
end
How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
value = line
if match
key = match[:key].strip
headers[key] = match[:value].strip
else
headers[key] += line
end
end
First, splitting on just ":" is dangerous since that can be in content. Instead, a (simplified from code) regex of /^\w+:.*/ will match "Word: Content". Since the lines after the "Excerpt:" aren't prefixed, you need to hang on to the last seen key, and just append if there's no key for this line. You may need to add a newline in there, depending on what you're doing with that header information, but it seems to work.

Resources