Find a string in a file, run a script on it, and replace the original string with the new string using PowerShell

Goal
Using PowerShell, find a string in a file, run a simple transformation script on the string, and replace the original string with the new string in the same file.
Details
The file is a Markdown file with one or more HTML blocks inside.
The goal is to make the entire file Markdown with no HTML.
Pandoc is a command-line tool that easily transforms HTML to Markdown.
The transformation script is a Pandoc command.
Pandoc alone cannot transform a file that mixes Markdown and HTML into pure Markdown.
Each HTML block is one long string with no line breaks (see the example below).
The HTML is a little rough and sometimes not valid; despite this, Pandoc handles much of the transformation successfully. This may not be relevant.
I cannot change the fact that the file is originally generated as part Markdown/part HTML, that the HTML is sometimes invalid, or that each HTML block is all on one line.
PowerShell is required because that's the scripting language my team supports.
Example file of mixed Markdown/HTML code; most HTML is invalid
# Heading 1
Text
# Heading 2
<h3>Heading 3</h3><p>I am all on one line</h><span><div>I am not always valid HTML</div></span><br><h4>Heading 4<h4><ul><li>Item<br></li><li>Item</li><ul><span></span><img src="url" style="width:85px;">
# Heading 3
Text
# Heading 4
<h2>Heading 1</h2><div>Text</div><h2>Heading 2</h2><div>Text</div>
# Heading 5
<div><ul><li>Item</li><li>Item</li><li>Item</li></ul></div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code>
Text
Code for transformation script
pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
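To see what that command produces for one of the blocks, you can pipe a single HTML string to it straight from the PowerShell prompt; a quick check using a (cleaned-up) snippet from the example above:
'<h3>Heading 3</h3><p>I am all on one line</p>' | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers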
Attempts
I surrounded each HTML block with <start> and <end> tags, with the goal of extracting the text between those tags with a regex, running the Pandoc command on it, and replacing the original text. My plan was to run a foreach loop to iterate through the blocks one by one.
This attempt transforms the HTML to Markdown, but does not return the original Markdown with it:
$file = 'file.md'
$regex = '<start>.*?<end>'
$a = Get-Content $file -Raw
$a | Select-String $regex -AllMatches | ForEach-Object {$_.Matches.Value} | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
This poor attempt seeks to perform the replace, but only returns the original file with no changes:
$file = 'file.md'
$regex = '<start>.*?<end>'
$content = Get-Content $file -Raw
$a = $content | Select-String $regex -AllMatches
$b = $a | ForEach-Object {$_.Matches } | Foreach-Object {$_.Value} | Select-Object | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
$content | ForEach-Object {
$_ -replace $a,$b }
I am struggling to move beyond these attempts. I am new to PowerShell. If this approach is entirely wrong, I would be grateful to know. Thank you for any advice.

Given the line-oriented nature of your input, you can process your input file line by line and decide for each line whether it needs transformation or not:
$file = 'file.md'
(Get-Content $file | ForEach-Object {
    if ($_ -match '^<') { # Is this an HTML line? - you could make this regex stricter
        $_ | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
    } else { # A non-HTML line, pass through as-is
        $_
    }
}) | Set-Content -Encoding Utf8 $file # be sure to choose the desired encoding
Note the (...) around the pipeline before Set-Content, which ensures that $file is read into memory in full up front; that is what allows writing back to the same file. Do note that this convenient approach bears a slight risk of data loss if the command is interrupted before writing completes, so always create a backup of the input files first.
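If you would rather keep your <start>/<end> marker approach than rely on lines starting with <, the same in-place idea works with [regex]::Replace and a MatchEvaluator delegate, which runs a script block on each match and splices its result back into the string. A sketch, assuming the markers wrap each single-line HTML block:
$file = 'file.md'
$content = Get-Content $file -Raw
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    # Strip the markers, convert just this block, then re-join pandoc's output lines
    $html = $match.Value -replace '^<start>|<end>$'
    ($html | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers) -join "`n"
}
[regex]::Replace($content, '<start>.*?<end>', $evaluator) |
    Set-Content -Encoding Utf8 $file
The same backup caveat applies, since this also rewrites $file in place.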

Related

Bash: decode string with url escaped hex codes [duplicate]

I'm looking for a way to turn this:
hello &lt; world
to this:
hello < world
I could use sed, but how can this be accomplished without using cryptic regex?
Try recode (archived page; GitHub mirror; Debian page):
$ echo '&lt;' | recode html..ascii
<
Install on Linux and similar Unix-y systems:
$ sudo apt-get install recode
Install on Mac OS using:
$ brew install recode
With perl:
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
With php from the command line:
cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
An alternative is to pipe through a web browser -- such as:
echo '&#33;' | w3m -dump -T text/html
This worked great for me in cygwin, where downloading and installing distributions are difficult.
This answer was found here
Using xmlstarlet:
echo 'hello < world' | xmlstarlet unesc
A python 3.2+ version:
cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
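For PowerShell users (the language of several other questions on this page), the same unescaping is available through .NET's WebUtility class; a minimal sketch:
# HtmlDecode handles both named and numeric entities
Get-Content foo.html | ForEach-Object { [System.Net.WebUtility]::HtmlDecode($_) }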
This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a Bash script that web-scrapes SE answers and compares them to local code files, described here: Ask Ubuntu -
Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.
Here's the function:
LineOut="" # Make global
HTMLtoText () {
LineOut=$1 # Parm 1= Input line
# Replace external command: Line=$(sed 's/&/\&/g; s/</\</g;
# s/>/\>/g; s/"/\"/g; s/'/\'"'"'/g; s/“/\"/g;
# s/”/\"/g;' <<< "$Line") -- With faster builtin commands.
LineOut="${LineOut// / }"
LineOut="${LineOut//&/&}"
LineOut="${LineOut//</<}"
LineOut="${LineOut//>/>}"
LineOut="${LineOut//"/'"'}"
LineOut="${LineOut//'/"'"}"
LineOut="${LineOut//“/'"'}" # TODO: ASCII/ISO for opening quote
LineOut="${LineOut//”/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
On macOS, you can use the built-in command textutil (which is a handy utility in general):
echo '&#128075; hello &lt; world &#127760;' | textutil -convert txt -format html -stdin -stdout
outputs:
👋 hello < world 🌐
To support the unescaping of all HTML entities only with sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities.
But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):
#!/bin/sh
htmlEscDec2Hex() {
    file=$1
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"
    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')
    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
        ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
        ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
        ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.
This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:
<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
  <p id="sed-script"></p>
  <script type="text/javascript">
    const referenceURL = 'https://html.spec.whatwg.org/entities.json';
    function writeln(element, text) {
      element.appendChild( document.createTextNode(text) );
      element.appendChild( document.createElement("br") );
    }
    (async function(container) {
      const json = await (await fetch(referenceURL)).json();
      container.innerHTML = "";
      writeln(container, "#!/usr/bin/sed -f");
      const addLast = [];
      for (const name in json) {
        const characters = json[name].characters
          .replace("\\", "\\\\")
          .replace("/", "\\/");
        const command = "s/" + name + "/" + characters + "/g";
        if ( name.endsWith(";") ) {
          writeln(container, command);
        } else {
          addLast.push(command);
        }
      }
      for (const command of addLast) { writeln(container, command); }
    })( document.getElementById("sed-script") );
  </script>
</body></html>
Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.
For example, if for some reason you need to process the data in chunks (as might be the case if printf is not a shell built-in and the data to process is large), you could use it as:
nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
Explanation of the script follows.
There are three types of escape sequences that need to be supported:
&#D; where D is the decimal value of the escaped character’s Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.
The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.
The central piece of this method for supporting the code point escapes is the printf utility, which is able to:
print numbers in hexadecimal format, and
print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
But it produced an unequal number of lines on plain text files, and I don't know Perl well enough to debug it.
I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --
python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
but it creates a list [ ... for l in sys.stdin ] in memory, which is prohibitive for large files.
Here is another easy pythonic way without buffering in memory: using awkg.
$ echo 'hello &lt; &#58; &quot; world' | \
    awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world
awkg is a Python-based awk-like line processor. You can install it using pip (https://pypi.org/project/awkg/):
pip install awkg
-b is the equivalent of awk's BEGIN{} block, which runs once at the beginning; here we just do from html import unescape.
Each line record is in the R0 variable, so for each line we run
print(unescape(R0))
Disclaimer:
I am the maintainer of awkg
I have created a sed script based on the list of entities, so it should handle most of them.
sed -f htmlentities.sed < file.html
My original answer got some comments saying that recode does not work for UTF-8 encoded HTML files. This is correct: recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:
$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML
The default encoding for HTML 4 is Latin-1. This changed in HTML 5: the default encoding for HTML 5 is UTF-8. This is why recode does not work for HTML 5 files.
HTML 5 defines the list of entities here:
https://html.spec.whatwg.org/multipage/named-characters.html
The definition includes a machine readable specification in JSON format:
https://html.spec.whatwg.org/entities.json
The JSON file can be used to perform a simple text replacement. The following example is a self-modifying Perl script, which caches the JSON specification in its DATA chunk.
Note: For some obscure compatibility reasons, the specification allows entities without a terminating semicolon. Because of that, the entities are sorted by length in reverse order, to make sure that longer entities are replaced first and do not get destroyed by their semicolon-less prefixes.
#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);

my $entities;
INIT {
    if (eof DATA) {
        my $data = tell DATA;
        open DATA, '+<', $0;
        seek DATA, $data, 0;
        my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
        print DATA $entities_json;
        truncate DATA, tell DATA;
        close DATA;
        $entities = parse_json ($entities_json);
    } else {
        local $/ = undef;
        $entities = parse_json (<DATA>);
    }
}

local $/ = undef;
my $html = <>;

for my $entity (sort { length $b <=> length $a } keys %$entities) {
    my $characters = $entities->{$entity}->{characters};
    $html =~ s/$entity/$characters/g;
}

print $html;

__DATA__
Example usage:
$ echo '😊 &amp; ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
😊 & ٱلْعَرَبِيَّة
With Xidel:
echo 'hello &lt; &#58; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world

Powershell script to remove the first and the last double quotes from each line of a csv file

I want to remove the first and the last double quote from each line of a CSV input file and save the output to the same input file, with UCS-2 LE BOM encoding, using PowerShell.
The sample csv dataset is (input.csv):
"1","2","3"
The desired output csv (input.csv):
1","2","3
I have used:
$csv = 'input.csv'
(Get-Content $csv) -replace '(?m)"([^,]*?)"(?=,|$)', '$1' |
Set-Content $csv
but it removes the double quotes from the first and the last element.
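One way to get exactly the output shown above is to anchor the pattern to the start and end of each line rather than to the fields. A sketch, assuming "ucs-le bom" means UTF-16 LE with a BOM (which is what Set-Content -Encoding Unicode writes):
$csv = 'input.csv'
(Get-Content $csv) -replace '^"|"$' |
    Set-Content $csv -Encoding Unicode  # 'Unicode' = UTF-16 LE with BOM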

Unwanted space in substring using powershell

I'm fairly new to PS. I'm extracting fields from multiple XML files ($ABB). The $net var is based on a pattern search and returns a non-static substring on line 2. Here's what I have so far:
$ABB = If ($aa -eq $null) { "nothing to see here" } else {
    $count = 0
    $files = @($aa)
    foreach ($f in $files)
    {
        $count += 1
        $mo = (Get-Content -Path $f)[8].Substring(51, 2)
        (Get-Content -Path $f | Select-String -Pattern $lf -Context 0,1) | ForEach-Object {
            $net = $_.Context.PostContext
            $enet = $net -split "<comm:FieldValue>(\d*)</comm:FieldValue>"
            $enet = $enet.Trim()
        }
        Write-Host "$mo-nti-$lf-$enet" "`r`n"
    }
}
The output looks like this: 03-nti-260- 8409.
Note the space prefacing the 8409, which corresponds to the $net variable. I haven't been able to solve this on my own; my approach could be all wrong. I'm open to any and all suggestions. Thanks for your help.
Since the first characters of the first line in $net (after $net = $_.Context.PostContext) match the split pattern, an empty element is output as the first element of the result. Then, when that output is stringified, the elements are joined by single spaces.
You need to select the elements that aren't empty:
$enet = $net -split "<comm:FieldValue>(\d*)</comm:FieldValue>" -ne ''
Explanation:
-split characters not surrounded by () are removed from the output, and the remaining string is split into multiple elements at each of those matched characters. When a matched character starts or ends the string, an empty element is output. Care must be taken to remove those elements if they are not required. Trim() will not work, because Trim() applies to a single string rather than an array, and it will not remove empty strings.
Adding -ne '' to the end of the command removes the empty elements. It is just an inline boolean condition that, when applied to an array, outputs only the elements for which the condition is true.
You can see an example of the empty first element below:
PS> 123 -split 1

23
PS> 123 -split 1 -ne ''
23
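Applied to the code in the question, the filter and the Trim() call can be combined into a single statement; a sketch using the question's variable names:
$enet = ($net -split "<comm:FieldValue>(\d*)</comm:FieldValue>" -ne '').Trim()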
Just use a -replace to get rid of any spaces.
For example:
'03-nti-260- 8409' -replace '\s'
<#
# Results
03-nti-260-8409
#>

Parsing Text file and placing contents into an Array Powershell

I am looking for a way to parse a text file and place the results into an array in PowerShell.
I know Select-String -Path -Pattern will get all strings that match a pattern. But what if I already have a structured text file, perhaps pipe-delimited, with new entries on each line? Like so:
prodServ1a
prodServ1b
prodServ1c
C:\dir\serverFile.txt
How can I place each of those servers into an array in powershell that I can loop through?
You say 'pipe delimited, like so' but your example isn't pipe delimited. I'll imagine it is; then you need to use the Import-Csv cmdlet. E.g., if the data file contains this:
prodServ1a|4|abc
prodServ1b|5|def
prodServ1c|6|ghi
then this code:
$data = Import-Csv -Path test.dat -Header "Product","Cost","SerialNo" -Delimiter "|"
will import and split it, and add headers:
$data
Product    Cost SerialNo
-------    ---- --------
prodServ1a 4    abc
prodServ1b 5    def
prodServ1c 6    ghi
Then you can use
foreach ($item in $data) {
    $item.SerialNo
}
If your data is flat or unstructured but has a pattern, such as being delimited by a space or comma with no carriage returns or line feeds, you can use the Split() method.
PS>$data = "prodServ1a prodServ1b prodServ1c"
PS>$data
prodServ1a prodServ1b prodServ1c
PS>$data.Split(" ")
prodServ1a
prodServ1b
prodServ1c
This works particularly well when someone sends you a list of IP addresses separated by commas.
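For instance, with a hypothetical comma-separated list of addresses, chaining Trim() also cleans up stray spaces around the items:
PS>"10.0.0.1, 10.0.0.2, 10.0.0.3".Split(",").Trim()
10.0.0.1
10.0.0.2
10.0.0.3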

Storing files inside BASH scripts

Is there a way to store binary data inside a BASH script so that it can be piped to a program later in that script?
At the moment (on Mac OS X) I'm doing
play sound.m4a
# do stuff
I'd like to be able to do something like:
SOUND <<< the m4a data, encoded somehow?
END
echo $SOUND | play
#do stuff
Is there a way to do this?
Base64 encode it. For example:
$ openssl base64 < sound.m4a
and then in the script:
S=$(cat <<SOUND
YOURBASE64GOESHERE
SOUND
)
echo "$S" | openssl base64 -d | play
I know this is like beating a dead horse since this post is rather old, but I'd like to improve on Sionide21's answer, as his solution stores the binary data in a variable, which is not necessary.
openssl base64 -d <<SOUND | play
YOURBASE64DATAHERE
SOUND
Note: heredoc syntax requires that you don't indent the last 'SOUND', and base64 decoding sometimes failed on me when I indented the 'YOURBASE64DATAHERE' section. So it's best practice to keep both the Base64 data and the end token unindented.
I found this while looking for a more elegant way to store binary data in shell scripts, but I had already solved it as described here. The only difference is that I'm transporting some tar-bzipped files this way. My platform provides a separate base64 binary, so I don't have to use openssl.
base64 -d <<EOF | tar xj
BASE64ENCODEDTBZ
EOF
There is a Unix format called shar (shell archive) that allows you to store binary data in a shell script. You create a shar file using the shar command.
When I've done this I've used a shell here document piped through atob.
function emit_binary {
cat << 'EOF' | atob
--junk emitted by btoa here
EOF
}
The single quotes around 'EOF' prevent parameter expansion in the body of the here document.
atob and btoa are very old programs, and for some reason they are often absent from modern Unix distributions. A somewhat less efficient but more ubiquitous alternative is to use mimencode -b instead of btoa. mimencode will encode into base64 ASCII. The corresponding decoding command is mimencode -b -u instead of atob. The openssl command will also do base64 encoding.
Here's some code I wrote a long time ago that packs a chosen executable into a bash script. I can't remember exactly how it works, but I suspect you could fairly easily modify it to do what you want.
#!/usr/bin/perl
use strict;

print "Stub Creator 1.0\n";

unless ($#ARGV == 1) {
    print "Invalid argument count, usage: ./makestub.pl InputExecutable OutputCompressedExecutable\n";
    exit;
}
unless (-r $ARGV[0]) {
    die "Unable to read input file $ARGV[0]: $!\n";
}

open(OUTFILE, ">$ARGV[1]") or die "Unable to create $ARGV[1]: $!\n";
print "\nCreating stub script...";
# The stub is two lines long, hence the "tail -n+3" to skip it when extracting
print OUTFILE "#!/bin/bash\n";
print OUTFILE "a=/tmp/\`date +%s%N\`;tail -n+3 \$0 | zcat > \$a;chmod 700 \$a;\$a \${*};rm -f \$a;exit;\n";
close(OUTFILE);

print "done.\nCompressing input executable and appending...";
`gzip $ARGV[0] -n --best -c >> $ARGV[1]`;
`chmod +x $ARGV[1]`;

my $OrigSize = -s $ARGV[0];
my $NewSize  = -s $ARGV[1];
if ($OrigSize == 0) {
    $OrigSize = 1;    # avoid division by zero on an empty input file
}
my $Temp = ($NewSize / $OrigSize) * 100;
$Temp *= 1000;        # round the percentage to three decimal places
$Temp = int($Temp);
$Temp /= 1000;

print "done.\nStub successfully composed!\n\n";
print <<THEEND;
Original size: $OrigSize
New size:      $NewSize
Compression:   $Temp\%
THEEND
If it's a single block of data you want, the trick I've used is to put a "start of data" marker at the end of the file, then use sed in the script to filter out the leading stuff. For example, create the following as "play-sound.bash":
#!/bin/bash
sed '1,/^START OF DATA/d' "$0" | play
exit 0
START OF DATA
Then, you can just append your data to the end of this file:
cat sound.m4a >> play-sound.bash
and now, executing the script should play the sound directly.
Since Python is available on OS X by default, you can do as below:
ENCODED=$(python -m base64 foo.m4a)
Then decode it as below:
echo "$ENCODED" | python -m base64 -d | play
