Format text with footnote by regular expression - shell

I want to convert the annotations in a text into footnotes. Here is a minimal example of the text.
Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one.
[1] annotation one of paragraph one
[2] annotation two of paragraph one
Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two.
[1] annotation one of paragraph two
[2] annotation two of paragraph two
At the end of each paragraph there are several annotations, numbered starting from [1]. Each annotation forms a paragraph of its own.
What I want to do is insert those annotations into the text using LaTeX syntax. The desired output for the sample text is:
Paragraph one. This is the first place \footnote{annotation one of paragraph one} of paragraph one. This is the second place \footnote{annotation two of paragraph one} of paragraph one.
Paragraph two. This is the first place \footnote{annotation one of paragraph two} of paragraph two. This is the second place \footnote{annotation two of paragraph two} of paragraph two.
This is not just a simple replacement by matching patterns. It may have to be performed paragraph by paragraph. What do you think is the simplest way to do it?
Edit: I have come up with a possible solution using sed.
remove the newline in front of the annotation,
Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one. [1] annotation one of paragraph one [2] annotation two of paragraph one
Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two. [1] annotation one of paragraph two [2] annotation two of paragraph two
match the pattern
[1] text1 [1] text2 [2]
and replace it with
text2 text1 [2]
Basically, the first [1] is where the annotation should be inserted; the text between the second [1] and [2] is the annotation to be relocated.
These questions are relevant: "Remove new line / line break characters only for specific lines" and "How can I remove a line-feed/newline BEFORE a pattern using sed", but I can't make their code work for me because of my lack of knowledge of regular expressions.

Fundamentally, sed is the wrong tool for this job. You might be able to write a sed script that preprocesses the file and generates a new sed script that processes the file, but you're clutching at straws when there are many much better tools for the task. I'd reach for Perl (but I learned Perl over twenty years ago, and Python only a couple of years ago), but Python is also capable of handling it, and with care you could probably even use awk. Part of the trouble is that you have to save all the text of paragraph one until you reach the start of paragraph two; only then can you start generating the actual text for paragraph one.
I think that the 'sed is the wrong tool' comment remains valid even if the sed script captures the contents of the paragraph in the hold space. Those would be lines not starting with a square bracket. The trouble is, when you come to a line with a square bracket, you need to write a regex that substitutes the tail of the line into the hold space in lieu of the contents of the square brackets. That requires a sort of 'dynamic regex'. Even if you knew there'd never be more than, say, 9 footnotes in a paragraph, so you could consider some sort of hack that wrote out the code 9 times, there are still problems writing the replacement strings in the right places.
Here's a simple script in Perl — well, a not incredibly complex script in Perl — that does the job. The 'whirling loops' (three nested loops) make it a little tricky to understand.
#!/usr/bin/env perl
use strict;
use warnings;
my $para = "";
TEXT:
while (<>)
{
    NOTES:
    while (m/^\s*\[(\d+)]\s+(.*)/)
    {
        my $tag = $1;
        my $note = $2;
        $para =~ s/\[$tag]/\\footnote{$note}/m;
        while (<>)
        {
            last if $_ =~ m/^\s*\[/;
            if ($_ !~ m/^\s*$/)
            {
                print $para;
                $para = "";
                last NOTES;
            }
        }
        last TEXT if eof;
    }
    $para .= $_;
}
print "$para";
Given the input file:
Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one.
[1] annotation one of paragraph one
[2] annotation two of paragraph one
Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two.
[1] annotation one of paragraph two
[2] annotation two of paragraph two
The output of this script from that file is:
Paragraph one. This is the first place \footnote{annotation one of paragraph one} of paragraph one. This is the second place \footnote{annotation two of paragraph one} of paragraph one.
Paragraph two. This is the first place \footnote{annotation one of paragraph two} of paragraph two. This is the second place \footnote{annotation two of paragraph two} of paragraph two.
What does the script do?
The outer loop (labelled TEXT) reads lines into $_ until EOF.
The loop labelled NOTES processes the material after a paragraph up to the start of the next one. It knows that it is a footnote line because it starts with a number in square brackets (possibly indented with spaces, and definitely with a space after the close square bracket). When it finds such a line, the number is saved in $tag and the replacement text (must be a single line — no extended multiline footnotes here) is saved in $note. Then the first occurrence of the tag inside square brackets in the saved paragraph is replaced with the footnote notation and the text of the note (this is the part that is nigh-on impossible in a single run of sed, and given that the footnote numbers repeat across paragraphs, makes even two runs of sed problematic). Having done that substitution (not caring if there is no match to replace), it reads the next line, and this is where the loops (and the head) start whirling. If the newly read line is a note line, then the initial last exits the innermost while and returns to the next iteration of the NOTES loop. If the line does not match a blank line, then we must have just read the first line of the next paragraph, so print the previous paragraph (which now has as many substitutions made as there are substitutions to make), empty the saved paragraph, and exit the NOTES loop. Otherwise, ignore the blank line in the middle of the notes.
After the loop, check whether we got EOF and exit the main loop if we did. Otherwise, add the paragraph line that was just read to the saved paragraph.
At the end, print the last saved paragraph.
This has not been exhaustively tested. I've not generated paragraphs with references to missing notes, or notes without references, or notes out of sequence. I think it would 'handle' those by ignoring the issues; there'd still be a reference to the missing note, and unreferenced notes would simply not show up in the output. If the same note number reference appears twice in a paragraph but there's only one note number after the paragraph, the second and subsequent ones are ignored. If the same note number appears twice ('text[1] more[1]') and the notes after the paragraph repeat the number ('[1] note 1A', '[1] note 1B'), then the first will be replaced with 'note 1A' and the second with 'note 1B'. I've not tested multiline paragraphs (but I don't expect trouble). Multiline qualifiers aren't needed for the replacement regex because the reference to a tag cannot be split over lines and isn't anchored on a line.
Processing multiline footnotes is an exercise for the reader (and is not entirely trivial). All else apart, you can't begin substituting a multiline footnote until you find a blank line, another footnote line, or the start of the next paragraph.
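For comparison, the same line-by-line logic can be sketched in Python, which the answer above mentions as a capable alternative. This is a minimal sketch, not a verified port of the Perl script; the names NOTE_RE and inline_footnotes are made up for illustration. Like the Perl version, it assumes every note fits on a single line and that no ordinary paragraph line starts with a bracketed number.

```python
import re

# A note line: optional indentation, "[n]", whitespace, then the note text.
NOTE_RE = re.compile(r"^\s*\[(\d+)\]\s+(.*)$")


def inline_footnotes(lines):
    """Return a list of paragraphs with [n] markers replaced by \\footnote{...}."""
    paragraphs = []
    para = ""          # text of the paragraph collected so far
    in_notes = False   # True once a note line has been seen for this paragraph
    for raw in lines:
        line = raw.rstrip("\n")
        match = NOTE_RE.match(line)
        if match:
            tag, note = match.groups()
            # Replace only the first occurrence of the marker, if any.
            para = para.replace(f"[{tag}]", "\\footnote{" + note + "}", 1)
            in_notes = True
        elif in_notes and not line.strip():
            # Blank line between notes: ignore it.
            continue
        else:
            if in_notes:
                # First line of the next paragraph: flush the finished one.
                paragraphs.append(para)
                para = ""
                in_notes = False
            para = line if not para else para + "\n" + line
    if para:
        paragraphs.append(para)
    return paragraphs
```

Calling inline_footnotes(open("file")) and printing each returned paragraph would give output along the lines of the Perl script's. As with the Perl version, a marker with no matching note is left in place, and a note with no matching marker is silently dropped.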

A less verbose (and less documented) Perl version:
perl -00 -pe '
    @markers = m{(\[\d+\])}g;
    for $i (0..$#markers) {
        $footnote = <>;
        ($marker, $text) = $footnote =~ m{(\[\d+\])\s+(.*)};
        s{\Q$marker\E}{\\footnote{$text}};
    }
' file
This assumes that if there are 5 footnote markers in a paragraph, 5 footnotes will follow that paragraph.

Related

Bash Script Loop for Substitution and Traversal

So I'm trying to figure out what a question from an old exam means and I'm slightly confused about one or two parts.
#!/bin/bash
awk '{$0 = tolower($0)
gsub(/[,.?;:#!\(\)]/),"",$0)
for(a=1;a<=NF;a++)
b[$a]++}
END print b[a],a}'
sort -sk2
Here is my interpretation:
target the bash script location
scan file with awk
convert string to lower case
sub all occurrences of symbols with nothing (ie. remove) and overwrite string
(here is my issue) for every field increment a by 1?
(again not sure what this is doing) b takes a's number and increments by 1?
end the for loop and print (b, a)
sort by size of the second field
I think the last four lines are my main issue. Also is it just me or is there an extra } in that question?
Thanks in advance.
The for loop is weirdly formatted. Here it is again with proper indentation:
for(a=1; a<=NF; a++)
    b[$a]++
In other words, we loop over the field positions; for each, the count in the associative array b is incremented. So if the current input line is
foo bar poo bar baz
the script will do
b["foo"]++ # a is 1; $a is $1
b["bar"]++
b["poo"]++
b["bar"]++
b["baz"]++
So now b contains a set of tokens as keys, and the number of times each occurred as their respective values. In other words, this collects word counts for each word in the input.
The case folding and removal of punctuation normalizes the input so that
Word word word, word!
will count as four occurrences of "word", rather than one each for the capitalized version, the undecorated normal form, and the ones with punctuation attached at the end. It slightly distorts e.g. words which should properly be capitalized, and conflates into homographs words which are differentiated only by capitalization (such as china porcelain vs China the country.)
The END block is executed only when all input lines have been consumed, and thus b is fully loaded with all input words from all input lines, with their final counts. (Though here, there is no valid END block actually, because the opening brace after END is missing; this is a fatal syntax error. There isn't one closing brace too many, there's one non-optional opening brace missing.)
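To make the counting logic concrete, here is a rough Python equivalent of what the awk body does per line: lower-case it, strip the listed punctuation, split on whitespace, and count each token. It is a sketch only; the names PUNCT_RE and word_counts are invented for illustration, and the final print loop assumes the exam's END block was meant to print each count followed by its word.

```python
import re
from collections import Counter

# The same punctuation set that the gsub() call removes.
PUNCT_RE = re.compile(r"[,.?;:#!()]")


def word_counts(lines):
    """Count whitespace-separated words after case folding and punctuation removal."""
    counts = Counter()
    for line in lines:
        cleaned = PUNCT_RE.sub("", line.lower())
        # str.split() with no argument splits on runs of whitespace,
        # much like awk's default field splitting.
        counts.update(cleaned.split())
    return counts


if __name__ == "__main__":
    counts = word_counts(["Word word word, word!", "foo bar poo bar baz"])
    # Print "count word" pairs ordered by word, roughly what sorting
    # on the second field would produce.
    for word, n in sorted(counts.items()):
        print(n, word)
```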

Finding and Editing Multiple Regex Matches on the Same Line

I want to add markdown to key phrases in a (gollum) wiki page that will link to the relevant wiki page in the form:
This is the key phrase.
Becomes
This is the [[key phrase|Glossary#key phrase]].
I have a list of key phrases such as:
keywords = ["golden retriever", "pomeranian", "cat"]
And a document:
Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.
I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:
File.foreach(target_file) do |line|
  glosses.each do |gloss|
    len = gloss.length
    # Create the regex. Avoid anything that starts with [
    # or (, ends with ] or ), and ignore case.
    re = /(?<![\[\(])#{gloss}(?![\]\)])/i
    # Find every instance of this gloss on this line.
    positions = line.enum_for(:scan, re).map { Regexp.last_match.begin(0) }
    positions.each do |pos|
      line.insert(pos, "[[")
      # +2 because we just inserted 2 ahead.
      line.insert(pos+len+2, "|#{page}\##{gloss}]]")
    end
  end
  puts line
end
However, this will run into a problem if there are two matches for the same key phrase on the same line. Because I insert things into the line, the position I found for each match isn't accurate after the first one. I know I could adjust for the size of my insertions every time but, because my insertions are a different size for each gloss, it seems like the most brute-force, hacky solution.
Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?
After looking at @BryceDrew's online Python version, I realized Ruby probably also has a way to fill in the match. I now have a much more concise and faster solution.
First, I needed to make regexes of my glosses:
glosses.push(/(?<![\[\(])#{gloss}(?![\]\)])/i)
Note: The majority of that regex is look-ahead and look-behind assertions to prevent catching a phrase that's already part of a link.
Then, I needed to make a union of all of them:
re = Regexp.union(glosses)
After that, it's as simple as doing gsub on every line, and filling in my matches:
File.foreach(target_file) do |line|
  line = line.gsub(re) { |match| "[[#{match}|Glossary##{match.downcase}]]" }
  puts line
end
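The same idea, substituting every match in one pass with a callback instead of tracking insert positions, can be sketched in Python with re.sub. This is an illustrative sketch rather than the code from the answer referred to above; link_glossary and its page parameter are hypothetical names. Unlike the Ruby version, it builds a single alternation wrapped once in the look-behind and look-ahead, and escapes each keyword.

```python
import re


def link_glossary(line, keywords, page="Glossary"):
    """Wrap each keyword occurrence in [[match|page#match]] unless it is already linked."""
    # Longest keywords first so a longer phrase wins over a shorter one
    # that starts at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    # Skip matches immediately preceded by [ or ( or followed by ] or ).
    pattern = re.compile(
        r"(?<![\[(])(?:" + alternation + r")(?![\])])", re.IGNORECASE
    )
    return pattern.sub(
        lambda m: f"[[{m.group(0)}|{page}#{m.group(0).lower()}]]", line
    )
```

Because all replacements happen inside one sub call, two matches of the same phrase on one line need no position bookkeeping. Note that the look-arounds only inspect the single character next to the match, so they are a heuristic for "already a link", just as in the Ruby regex.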

Ruby puts<<PARAGRAPH

puts <<PARAGRAPH
There's somthing going on here.
With the PARAGRAPH thing
We'll be able to type as much as we like.
Even 4 lines if we want, or 5, or 6 .
PARAGRAPH
This works (I'm using Notepad++).
But why doesn't this work?
puts <<PARAGRAPH
aaaa Aa
aaa
AA
PARAGRAPH
test.rb:1: syntax error,unexpected tCONSTANT, expecting $end
Thanks!
My guess is that in your second snippet PARAGRAPH is not at the beginning of the line.
Multi-line strings (heredocs) in Ruby are weird that way. The terminator (whatever it may be) must be the first thing on a line to end the string; otherwise you will often see syntax errors.
Ensure that PARAGRAPH (the second instance) is a) spelled the same as your first instance, and b) at the start of the line, or change your code to:
def go
  puts <<-PARAGRAPH # hyphen allows the end marker to be indented
    Hi mom!
  PARAGRAPH
end
For more information, read the intro to Strings and the full description.
The code works for me. One way I broke it was by adding space between << and PARAGRAPH
puts << PARAGRAPH
PARAGRAPH
This is different from the next example.
puts <<PARAGRAPH
PARAGRAPH
Edit: As I continued to play with it, I found that PARAGRAPH is just a placeholder like any other. You can do the following and you will still get a paragraph in a string:
puts <<ANYTHING_YOU_WANT
ANYTHING_YOU_WANT
I thought it was cool that it is not restricted only to the word PARAGRAPH. I didn't know.
I can get either version to error by adding additional spaces after the final PARAGRAPH.
Ensure that the closing PARAGRAPH is truly on a new line (per diedthreetimes' answer) and has no trailing characters (i.e. spaces, tabs, etc.)

Block Indent Regex

I'm having a problem with a regexp.
I'm trying to write a regex that selects just the tab-indented blocks, but I can't find a way to make it work:
Example:
INDENT(1)
	INDENT(2)
		CONTENT(a)
		CONTENT(b)
	INDENT(3)
		CONTENT(c)
So I need blocks like:
	INDENT(2)
		CONTENT(a)
		CONTENT(b)
AND
	INDENT(3)
		CONTENT(c)
How can I do this?
Thanks, that's almost it. Here is my actual need:
table
	tr
		td
			"joao"
			"joao"
		td
			"marcos"
I need separate "td" blocks. Could I adapt your example to that?
It depends on exactly what you are trying to do, but maybe something like this:
^(\t+)(\S.*)\n(?:\1\t.*\n)*
Working example: http://www.rubular.com/r/qj3WSWK9JR
The pattern searches for:
^(\t+)(\S.*)\n - a line that begins with a tab (I've also captured the first line in a group, just to see the effect), followed by
(?:\1\t.*\n)* - lines with more tabs.
Similarly, you can use ^( +)(\S.*)\n(?:\1 .*\n)* for spaces (example). Mixing spaces and tabs may be a little problematic though.
For the updated question, consider using ^(\t{2,})(\S.*)\n(?:\1\t.*\n)*, for at least 2 tabs at the beginning of the line.
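As a quick check of the tab pattern above, here is a small Python sketch that applies it with re.MULTILINE so that ^ anchors at every line start. The names BLOCK_RE and indent_blocks are invented for illustration, and the sample input in the usage note assumes that the top-level line has no leading tab and that every line ends with a newline.

```python
import re

# ^(\t+)(\S.*)\n      a header line indented by one or more tabs
# (?:\1\t.*\n)*       followed by lines indented at least one tab deeper
BLOCK_RE = re.compile(r"^(\t+)(\S.*)\n(?:\1\t.*\n)*", re.MULTILINE)


def indent_blocks(text):
    """Return each tab-indented header line together with its deeper-indented body."""
    return [m.group(0) for m in BLOCK_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "INDENT(1)\n"
        "\tINDENT(2)\n"
        "\t\tCONTENT(a)\n"
        "\t\tCONTENT(b)\n"
        "\tINDENT(3)\n"
        "\t\tCONTENT(c)\n"
    )
    for block in indent_blocks(sample):
        print(repr(block))
```

On that sample the function returns two blocks, one starting at INDENT(2) and one at INDENT(3). If the outermost line were itself tab-indented, the first match would swallow everything beneath it, which is why the \t{2,} variant is suggested for the td case.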
You could use the following regex to get the groups...
[^\s]*.*\r\n(?:\s+.*\r*\n*)*
This requires that the first line of each block not begin with whitespace.

Regular expression Unix shell script

I need to filter all lines with words starting with a letter followed by zero or more letters or numbers, but no special characters (basically names which could be used for c++ variable).
egrep '^[a-zA-Z][a-zA-Z0-9]*'
This works fine for words such as "a" and "ab10", but it also matches words like "b.b". I understand that the * at the end of the expression is the problem. If I replace * with + (one or more), it skips words that contain only one letter, so it doesn't help.
EDIT:
I should be more precise. I want to find lines with any number of possible words as described above. Here is an example:
int = 5;
cout << "hello";
//some comments
In that case it should print all of the lines above, as they all include at least one word that fits the described conditions; the line does not have to begin with a letter.
Your solution will look roughly like this example. In this case, the regex requires that the "word" be preceded by space or start-of-line and then followed by space or end-of-line. You will need to modify the boundary requirements (the parenthesized stuff) as needed.
'(^| )[a-zA-Z][a-zA-Z0-9]*( |$)'
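The boundary idea in the pattern above can also be sketched in Python by splitting each line into whitespace-delimited tokens and requiring one token to match the identifier pattern in full. This is an illustration under that assumption, not a drop-in replacement for the egrep command; IDENT_RE and has_identifier_word are hypothetical names, and splitting on any whitespace is slightly looser than the literal space in the regex.

```python
import re

# A letter followed by zero or more letters or digits, and nothing else.
IDENT_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def has_identifier_word(line):
    """True if any whitespace-delimited token on the line is a plain identifier."""
    return any(IDENT_RE.fullmatch(token) for token in line.split())


if __name__ == "__main__":
    for line in ["a", "ab10", "b.b", "int = 5;", 'cout << "hello";', "//some comments"]:
        print(has_identifier_word(line), line)
```

Under this reading, "b.b" is rejected because its only token contains a dot, while "int = 5;" is accepted because of the token "int".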
Assuming the line ends after the word:
'^[a-zA-Z][a-zA-Z0-9]+|^[a-zA-Z]$'
You have to add something to it. It might be that the rest of the line can be whitespace, or you can just append the end-of-line anchor (as far as I recall, it is $).
Your problem lies in the ^ and $ anchors that match the start and end of the line respectively. You want the line to match if it does contain a word, getting rid of the anchors does what you want:
egrep '[a-zA-Z][a-zA-Z0-9]+'
Note that the + matches words of length 2 and higher; a * in that place would match single characters too.
