Asciidoctor-pdf no parsing

First, thank you for this great resource. It creates beautiful PDF files.
I have a bunch of text files containing all kinds of text, some of it gibberish. Some lines start with a dot, and so on.
Asciidoctor-pdf barfs on many pages, and rightly so. I've spent days trying to clean the text files with sed, but it's a losing battle.
Is there a way, via Asciidoctor-pdf command options, to simply convert a text document to PDF without parsing it?

You could create a new AsciiDoc file that includes the text files using the include macro. If you want the converter to ignore any AsciiDoc syntax inside them, wrap each include in a passthrough block. If you want to display fileA.txt and fileB.txt inside allfiles.adoc, it could look like this:
allfiles.adoc

= all files

== content of fileA.txt

++++
include::fileA.txt[]
++++

== content of fileB.txt

++++
include::fileB.txt[]
++++
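Assuming Asciidoctor PDF is installed, you would then convert the wrapper file as usual, which produces allfiles.pdf:
asciidoctor-pdf allfiles.adoc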

Related

Exiftool: batch-write metadata to JPEGs from text file

I'd like to use ExifTool to batch-write metadata that have been previously saved in a text file.
Say I have a directory containing the following JPEG files:
001.jpg 002.jpg 003.jpg 004.jpg 005.jpg
I then create the file metadata.txt, which contains the file names followed by a colon, and I hand it out to a coworker, who will fill it with the needed metadata — in this case comma-separated IPTC keywords. The file would look like this after being finished:
001.jpg: Keyword, Keyword, Keyword
002.jpg: Keyword, Keyword, Keyword
003.jpg: Keyword, Keyword, Keyword
004.jpg: Keyword, Keyword, Keyword
005.jpg: Keyword, Keyword, Keyword
How would I go about feeding this file to ExifTool and making sure that the right keywords get saved to the right file? I'm also open to changing the structure of the file if that helps, for example by formatting it as CSV, JSON or YAML.
If you can change the format to a CSV file, then exiftool can directly read it with the -csv option.
You would have to reformat it in this way: the first row must contain the header "SourceFile" above the filenames and "Keywords" above the keywords. If the filenames don't include the path to the files, then the command has to be run from the same directory as the files. Each whole keyword string needs to be enclosed in quotes so the keywords aren't read as separate columns. The result would look like this:
SourceFile,Keywords
001.jpg,"KeywordA, KeywordB, KeywordC"
002.jpg,"KeywordD, KeywordE, KeywordF"
003.jpg,"KeywordG, KeywordH, KeywordI"
004.jpg,"KeywordJ, KeywordK, KeywordL"
005.jpg,"KeywordM, KeywordN, KeywordO"
At that point, your command would be
exiftool -csv=/path/to/file.csv -sep ", " /path/to/files
The -sep option is needed to make sure the keywords are treated as separate keywords rather than a single, long keyword.
This has an advantage over a script that loops over the file contents and runs exiftool once for each line. Exiftool's biggest performance hit is its startup, and running it in a loop will be very slow, especially on a large number of files (see Common Mistake #3).
See ExifTool FAQ #26 for more details on reading from a csv file.
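If you don't want to rebuild the CSV by hand, a small shell snippet can do the conversion. This is only a sketch: it assumes metadata.txt has exactly the layout shown in the question (filename, colon, space, keywords) and writes the result to metadata.csv:
# write the header row, then turn "001.jpg: A, B, C" into 001.jpg,"A, B, C"
printf 'SourceFile,Keywords\n' > metadata.csv
sed 's/^\([^:]*\): *\(.*\)$/\1,"\2"/' metadata.txt >> metadata.csv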
I believe the answer by @StarGeek is superior to mine, but I will leave mine for completeness and as a reference for a more basic, Luddite approach :-)
I think you want this:
#!/bin/bash
while IFS=': ' read file keywords ; do
    exiftool -sep ", " -iptc:Keywords="$keywords" "$file"
done < list.txt
Here is the list.txt:
001.jpg: KeywordA, KeywordB, KeywordC
002.jpg: KeywordD, KeywordE, KeywordF
003.jpg: KeywordG, KeywordH, KeywordI
And here is the result:
exiftool -b -keywords 002.jpg
KeywordD
KeywordE
KeywordF
Many thanks to StarGeek for his corrections and explanations.

Add part of filename as PDF metadata using bash script and exiftool

I have about 600 books in PDF format where the filename is in the format:
AuthorForename AuthorSurname - Title (Date).pdf
For example:
Foo Z. Bar - Writing Scripts for Idiots (2017)
Bar Foo - Fun with PDFs (2016)
The metadata is unfortunately missing for pretty much all of them, so when I import them into Calibre the Author field is blank.
I'm trying to write a script that takes everything before the '-', removes the trailing space, and then adds it as the author in the PDF metadata using exiftool.
So far I have the following:
for i in "*.pdf";
do exiftool -author=$(echo $i | sed 's/-.*//' | sed 's/[ \t]*$//') "$i";
done
When trying to run it, however, the following is returned:
Error: File not found - Z.
Error: File not found - Bar
Error: File not found - *.pdf
0 image files updated
3 files weren't updated due to errors
What about the -author= phrase is breaking here? Please could someone enlighten me?
You don't need to script this. In fact, doing so would be much slower than letting exiftool handle it by itself, because it requires exiftool to start up once for every file.
Try this
exiftool -ext pdf '-author<${filename;s/\s+-.*//}' /path/to/target/directory
Breakdown:
-ext pdf process only PDF files
-author the tag to copy to
< The copy from another tag option. In this case, the filename will be treated as a pseudo-tag
${filename;s/\s+-.*//} Copying from the filename, but first performing a regex on it. In this case, looking for 1 or more spaces, a dash, and the rest of the name and removing it.
Add -r if you want to recurse into subdirectories. Add -overwrite_original to avoid creating backup files with _original appended to the filename.
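Put together, a run with both of those optional flags added would be:
exiftool -r -overwrite_original -ext pdf '-author<${filename;s/\s+-.*//}' /path/to/target/directory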
The errors with your original command were that quoting "*.pdf" stopped the glob from expanding, and that the value produced by the command substitution contained spaces, so without quotes the shell split it into separate words.
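For reference, if you do want to keep the loop from the question (slower, as noted above), a corrected sketch would look something like this:
# unquoted glob so it expands; quoted command substitution so spaces in the value survive
for i in *.pdf; do
    author=$(echo "$i" | sed 's/ *-.*//')   # keep only the part before the dash, without the trailing space
    exiftool -author="$author" "$i"
done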

Grep every word from a file starting a pattern

So I have a file, let's call it "page.html". Within this file there are some links/file paths I want to extract. I've been working in bash trying to get this right but can't seem to do it. The words/links/paths I want to grab all start with "/funny/hello/there/". The goal is to print them all to the terminal so I can use them.
This is kinda what I've tried so far, with no luck:
grep -E '^/funny/hello/there/' page.html
and
grep -Po '/funny/hello/there/.*?' page.html
Any help would be greatly appreciated. Thanks.
Here is sample data from the file:
<td data-title="Blah" class="Blah" >
fdsksldjfah
</td>
My current output gives me all the different lines that look like this:
fdsksldjfah
The "/fkljaskdjfl" parts are all different, though.
What I want the output to look like:
/funny/hello/there/fkljaskdjfl
/funny/hello/there/kfjasdflas
/funny/hello/there/kdfhakjasa
You can use this grep command:
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
However, one should avoid parsing HTML with shell utilities and use a dedicated HTML/DOM parser instead.
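For example, if the paths live in href attributes, something along these lines keeps the extraction inside an HTML-aware tool (a sketch only, assuming xmllint is installed and that the links are ordinary <a href> elements):
xmllint --html --xpath '//a/@href' page.html 2>/dev/null | grep -o '/funny/hello/there/[^"]*'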

Can I set command line arguments using the YAML metadata

Pandoc supports a YAML metadata block in markdown documents. This can set the title and author, etc. It can also manipulate the appearance of the PDF output by changing the font size, margin width and the frame sizes given to figures that are included. Lots of details are given here.
I'd like to use the metadata block to remember the command line arguments that I'm supposed to be using, such as --toc and --number-sections. I tried this, adding the following to the top of my markdown:
---
title: My Title
toc: yes
number-sections: yes
---
Then I used the command line:
pandoc -o guide.pdf articheck_guide.md
This did produce a table of contents, but didn't number the sections. I wondered why this was, and if there is a way I can specify this kind of thing from the document so that I don't need to add it on the command line.
YAML metadata are not passed to pandoc as arguments, but as variables. When you call pandoc on your MWE, it does not produce this:
pandoc -o guide.pdf articheck_guide.md --toc --number-sections
as one might think. Rather, it is equivalent to calling:
pandoc -o guide.pdf articheck_guide.md -V toc:yes -V number-sections:yes
Why, then, does your MWE produce a TOC? Because the default LaTeX template makes use of a toc variable:
~$ pandoc -D latex | grep toc
$if(toc)$
\setcounter{tocdepth}{$toc-depth$}
So setting toc to any value should produce a table of contents, at least in LaTeX output. In this template there is no number-sections variable, so that one doesn't work. However, there is a numbersections variable:
~$ pandoc -D latex | grep number
$if(numbersections)$
Setting numbersections to any value will produce numbering in LaTeX output with the default template:
---
title: My Title
toc: yes
numbersections: yes
---
The trouble with this solution is that it only works with some output formats. I thought I had read somewhere on the pandoc mailing list that we would soon be able to use metadata in YAML blocks as intended (i.e. as arguments rather than variables), but I can't find it anymore, so maybe it won't happen very soon.
Have a look at panzer (GitHub repository).
This was recently announced and released by Mark Sprevak -- a piece of software, that adds the notion of 'styles' to Pandoc.
It's basically a wrapper around Pandoc. It exploits the concept of YAML metadata blocks to the maximum.
The 'styles' provide a way to set all options for a Pandoc document conversion process with one line ("I want this document be an article/CV/notes/letter.").
You can regard this as a more general abstraction than Pandoc templates. Styles are combinations of...
...Pandoc command line options,
...metadata settings,
...templates,
...instructions to run filters, and
...instructions to run pre/postprocessors.
These settings can be customized on a per-output-type as well as a per-document basis. Styles can be...
...combined and
...can bear inheritance relations to each other.
panzer styles simplify Makefiles: they bundle everything concerning the look of a document in one place -- the YAML metadata (a block in the Markdown file, or a separate file).
You just add one line of metadata (style: ...) to your document, and it will be treated as a letter/article/CV/notebook or whatever.
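For example, the metadata block of a document might then start like this (a sketch only: 'Article' is a placeholder, since style names are whatever you define in your own panzer style files):
---
title: My Title
style: Article    # hypothetical style name
---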

Excel saves tab delimited files without newline (UNIX/Mac os X)

This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I save a spreadsheet in Excel (Mac 2011) as a tab-delimited file, it seems to work perfectly fine, until I try to parse the file line by line using Perl. For some reason it slurps the whole document as one line.
My brutish solution is to open the file in a web browser and copy and paste the information into a tab-delimited file in TextEdit (I never use rich text format). I tried introducing a newline at the end of the file before doing this fix, but it does not resolve the issue.
What's going on here? An explanation would be appreciated.
Thanks!
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF), *NIX systems use only a LineFeed (LF), and classic Mac software (including Excel for Mac when it saves delimited text) often uses a bare CarriageReturn (CR), which is why Perl, looking for LF, sees the whole file as one line.
These characters can be represented in regex as \r\n, \n and \r respectively.
Sometimes, to work through a text file, you need to normalize the newline characters. Try this for DOS-to-UNIX in Perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed (the ^M is a literal carriage return, typed as Ctrl-V then Ctrl-M):
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this: copy the data from Excel to the clipboard, paste it into a Google Sheets spreadsheet, and download that spreadsheet as a tab-separated values (.tsv) file. This gets around the problem, and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a 'Windows Formatted Text (.txt)' file type
for a comma-separated file, save the document as a 'Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace (the same as \v) or the CR LF combination, so it's the same as \r\n|\v.
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
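As a quick illustration of that approach from the command line (slurp the whole file, split it on \R, and print each record followed by a plain LF):
perl -0777 -ne 'print "$_\n" for split /\R/' input.file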
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are.
Once you have loaded the module with use PerlIO::eol, you can use it in an open statement:
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles:
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform.
