Bash script which adds space inside long words in Pages file - bash

I like to convert documents to EPUB format because it is easier for me to read. However, if I do this for, for example, some code documentation, some really long lines of code are not readable in the EPUB because they trail off-screen. I would like to automatically insert spaces into any words in a text file (specifically, a Pages document) that are over a certain length, so they are reduced to, say, 10-character words at most. Then I will convert that Pages document to an EPUB.
How can I write a bash script which goes through a Pages document and inserts spaces into any word longer than, perhaps, 10 characters?

sed is your friend:
$ cat input.txt
a file with a
verylongwordinit to test with.
$ sed 's/[^[:space:]]\{10\}/& /g' input.txt
a file with a
verylongwo rdinit to test with.
For every sequence of 10 non-whitespace characters in each line, this adds a space after it (the & in the replacement text is replaced with the matched text).
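Because of the g flag the substitution is repeated along the line, so a much longer word gets split every 10 characters. A quick check with a made-up 25-character word:
$ echo abcdefghijklmnopqrstuvwxy | sed 's/[^[:space:]]\{10\}/& /g'
abcdefghij klmnopqrst uvwxy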
If you want to change the file inline instead of making a copy, ed comes into play:
ed input.txt <<'EOF'
s/[^[:space:]]\{10\}/& /g
w
EOF
(Or some versions of sed take an -i switch for in-place editing.)
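For example, GNU sed accepts -i on its own, while the BSD/macOS sed wants an explicit (possibly empty) backup suffix; a small sketch assuming one of those two implementations:
$ sed -i 's/[^[:space:]]\{10\}/& /g' input.txt       # GNU sed
$ sed -i '' 's/[^[:space:]]\{10\}/& /g' input.txt    # BSD/macOS sed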

Related

Reformatting headers in Markdown files with sed fails

I tried to reformat headers in a markdown file with sed but somehow that doesn't seem to work.
The problem is that there needs to be one space between the header # sign(s) and the header text, otherwise the header is not displayed correctly.
So I tried several variations of sed commands to add this space after the # signs:
sed -i "s/<expression>/\1 /g" test.md
<expression> being:
^\(\s*#+\)
^\(\[#\]+\)
^\(\[\#\]+\)
-i should replace this inside the file, but when I review the file with cat test.md, the space is still missing. I even added a backslash in front of the space in the substitution, but no luck.
The content of test.md is the following example data:
#Heading 1
Some text
- a list entry
- another one
##Heading 2
text
##Heading 3
The command should result in, e.g., line 1 becoming # Heading 1.
What am I missing?
After upgrading to pandoc version 2, the newly required space in ATX-style headers can be automatically inserted as follows:
$ sed -i 's|\(^##*\)\([^# \.]\)|\1 \2|' test.md
Explanation
-i edits the markdown file 'in place.'
s|…|…| is a single substitution per line.
Each \(…\) denotes a part in the search expression.
\1 and \2 refer to the first and second parts of the search expression, respectively.
^##* means that the line should start ^ with one hash #, followed by zero or more hashes #*.
The second part of the search expression must not start with a hash, space, or period: [^# \.].
Note
The last item in the explanation is what differentiates this answer from a simpler sed -i 's|^##*|& |'. The simpler sed command would still insert a space even when there is already a space after the leading hash sequence.
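As a quick check on the question's example data (running without -i so the result just prints to the terminal instead of being written back to the file), the command should produce:
$ sed 's|\(^##*\)\([^# \.]\)|\1 \2|' test.md
# Heading 1
Some text
- a list entry
- another one
## Heading 2
text
## Heading 3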
You need to escape the plus sign, e.g.:
^\(\s*#\+\)
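Plugged into the command from the question, that would look something like the following (GNU sed, where \s and \+ are supported); note that, unlike the answer above, this also inserts a space where one is already present after the hashes:
$ sed -i 's/^\(\s*#\+\)/\1 /' test.md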

How do I handle newlines in shuf, calc etc?

I have written a bash script:
for f in *.csv; do shuf -n 1000 "$f" > ./1000/"${f%.csv}_1000.csv" ; done
which, for each .csv file in a directory, randomly writes 1000 lines to a new file with the suffix '_1000' in the subdirectory ./1000, i.e.
afolder/cat.csv
afolder/dog.csv
becomes:
afolder/1000/cat_1000.csv
afolder/1000/dog_1000.csv
Each record is a tweet. This works fine except when input files have newline characters. For example one of my tweet records has a text field with newline characters:
Hope Abbo gets his Sen in #bcafc trenches with McCall & Black..
More Warriors The Better
#ShoulderToShoulder
This is handled correctly in LibreOffice Calc: the three lines are kept together in one record (although it may not appear so on screen, because Calc has expanded the field).
When I look at the output, shuf has chosen one of the three text lines instead of keeping them together.
Is there any way of telling shuf to keep them together?

Repeating characters when attempting to concatenate man pages to plain text files

I tried converting some man pages to plain text files. But when I open the file, many of the words have unnecessary repeating characters.
For example doing man awk > awk.txt changes the sections in the awk.txt file from:
NAME to NNAAMMEE
SYNOPSIS to SSYYNNOOPPSSIISS
DESCRIPTION to DDEESSCCRRIIPPTTIIOONN
I thought this would be a simple task. Why does this happen?
Man pages contain formatting information (for instance, to indicate whether some words should be bold). Consequently, some characters may appear repeated when redirecting the output to a file.
You may want to try:
man awk | col -b > awk.txt
What col is doing:
col — filter reverse line feeds from input
SYNOPSIS
col [-bfhpx] [-l num]
DESCRIPTION
The col utility filters out reverse (and half-reverse) line feeds so that the output is in the correct order with only forward and half-forward line feeds, and replaces white-space characters with tabs where possible. This can be useful in processing the output of nroff(1) and tbl(1).
The col utility reads from the standard input and writes to the standard output.
The options are as follows:
-b Do not output any backspaces, printing only the last character written to each column position.
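The doubled letters come from nroff emulating bold text by printing each character, then a backspace, then the same character again. A small illustration of what col -b strips out, using a hand-built sequence (cat -v shows the backspaces as ^H):
$ printf 'N\bNA\bAM\bME\bE\n' | cat -v
N^HNA^HAM^HME^HE
$ printf 'N\bNA\bAM\bME\bE\n' | col -b
NAME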

How do I alter the n-th line in multiple files using SED?

I have a series of text files that I want to convert to markdown. I want to remove any leading spaces and add a hash sign to the first line in every file. If I run this:
sed -i.bak '1s/ *\(.*\)/\#\1/g' *.md
It processes all of the files, but only alters the first line of the first file, leaving the rest of the files unchanged.
What am I missing that will search and replace something on the n-th line of multiple files?
Using bash on OSX 10.7
The problem is that sed by default treats any number of files as a single stream, and thus line-number offsets are relative to the start of the first file.
For GNU sed, you can use the -s (--separate) flag to modify this behavior:
sed -s -i.bak '1s/^ */#/' *.md
...or, with non-GNU sed (including the one on Mac OS X), you can loop over the files and invoke sed once per file:
for f in *.md; do sed -i.bak '1s/^ */#/' "$f"; done
Note that the regex is a bit simplified here -- no need to match parts of the line that you aren't going to change.
xargs will do the trick for you:
http://en.wikipedia.org/wiki/Xargs
Remove the *.md from the end of your sed command, then use xargs to gather your files and feed them to the sed command one at a time. Sorry, I don't have time to work it out for you, but the Wikipedia article should show you what you need to know.
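A sketch of how that might look (the -n 1 runs sed once per file, so the line-1 address applies to every file; the NUL-delimited list keeps file names with spaces intact):
printf '%s\0' *.md | xargs -0 -n 1 sed -i.bak '1s/^ */#/'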
sed -rsi.bak '1s/^/#/;s/^[ \t]+//' *.md
You don't need g(lobal) at the end of the commands, because you want to replace something at the beginning of the line, not multiple times.
Two commands are used: one to modify line 1 (1s...), separated by a semicolon from the second command, which strips leading blanks (and tabs, \t). To remove the blanks in the first line as well, switch the order:
sed -rsi.bak 's/^[ \t]+//;1s/^/#/' *.md
Remove the \t if you don't need it. Then you don't need the bracket expression either:
sed -rsi.bak 's/^ +//;1s/^/#/' *.md
-r is a flag that enables extended regular expressions. With it, you don't need to escape the plus sign.
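For comparison, the same leading-blank removal in both flavours (GNU sed in both cases; file.md is just a placeholder name):
sed 's/^ \+//' file.md      # basic regular expressions: the plus must be escaped
sed -r 's/^ +//' file.md    # extended regular expressions: the plus works unescaped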

MS Word Doc: Automating find/replace using Shell Scripts

I have a number of word documents that I'd like to remove some elements from. What I would like to do is as follows:
Copy and paste the entire contents of the word file (may not be necessary) and move it into a text file OR Convert .doc to .txt
Using regex: replace \[.*\] with "" AND replace \(.*\) with ""
Save the result to a text file with the same name as the original word document.
Thoughts and direction appreciated. As it stands, I don't know how to do any of these things programmatically; I'm doing it all manually.
If it matters, I'm using Ubuntu 11.04
Since you're open to using plain text, some improvements to your algo:
Use antiword to automate the conversion from .doc to .txt (a combined example follows this list)
Use sed to do in-place regex modification: sed -i -e's/bad/good/' file.txt
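A rough sketch of the two steps put together (document.doc here is just a placeholder name; antiword writes the extracted text to standard output):
antiword document.doc > document.txt    # convert the Word document to plain text
sed -i -e 's/bad/good/' document.txt    # then edit it in place with the desired regex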
Update (in response to comment):
The regexes are fine, but I didn't understand the objective completely:
if you want to replace occurrences of [foo] & (foo) with "" use:
sed -i -e 's/\[.*\]/""/g' file.txt; sed -i -e 's/(.*)/""/g' file.txt
if you want to replace occurrences of [foo] & (foo) with "foo" each, use:
sed -i -e's/\[\(.*\)\]/"\1"/g' file.txt; sed -i -e's/(\(.*\))/"\1"/g' file.txt
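For instance, on a made-up sample line (shown here as a pipe, without -i, so the result just prints to the terminal), the second pair of commands should give:
echo 'keep [remove me] and (rename me)' | sed -e 's/\[\(.*\)\]/"\1"/g' -e 's/(\(.*\))/"\1"/g'
keep "remove me" and "rename me"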
