Related
I have several text files (utf-8) that I want to process in shell script. They aren't excactly the same format, but if I could only break them up into edible chunks I can handle that.
This could be programmed in C or python, but I prefer not.
EDIT: I wrote a solution in C; see my own answer. I think this may be the simplest approach after all. If you think I'm wrong please test your solution against the more complicated example input from my answer below.
-- jcxz100
For clarity (and to be able to debug more easily) I want the chunks to be saved as separate text files in a sub-folder.
All types of input files consist of:
junk lines
lines with junk text followed by start brackets or parentheses - i.e. '[' '{' '<' or '(' - and possibly followed by payload
payload lines
lines with brackets or parentheses nested within the top-level pairs; treated as payload too
payload lines with end brackets or parantheses - i.e. ']' '}' '>' or ')' - possibly followed by something (junk text and/or start of a new payload)
I want to break up the input according to only the matching pairs of top-level brackets/parantheses.
Payload inside these pairs must not be altered (including newlines and whitespace).
Everything outside the toplevel pairs should be discarded as junk.
Any junk or payload inside double-quotes must be considered atomic (handled as raw text, thus any brackets or parentheses inside should also be treated as text).
Here is an example (using only {} pairs):
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
...sorry: Some of the input files really are as messy as that.
The first output file should be:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
... and the second output file:
{
payload that goes in second output file }
Note:
I haven't quite decided whether it's necesary to keep the pair of start/end characters in the output or if they themselves should be discarded as junk.
I think a solution that keeps them in is more general use.
There can be a mix of types of top-level bracket/paranthesis pairs in the same input file.
Beware: There are * and $ characters in the input files, so please avoid confusing bash ;-)
I prefer readability over brevity; but not at an exponential cost of speed.
Nice-to-haves:
There are backslash-escaped double-quotes inside the text; preferably they should be handled
(I have a hack, but it's not pretty).
The script oughtn't break over mismatched pairs of brackets/parentheses in junk and/or payload (note: inside the atomics they must be allowed!)
More-far-out-nice-to-haves:
I haven't seen it yet, but one could speculate that some input might have single-quotes rather than double-quotes to denote atomic content... or even a mix of both.
It would be nice if the script could be easily modified to parse input of similar structure but with different start/end characters or strings.
I can see this is quite a mouthful, but I think it wouldn't give a robust solution if I broke it down into simpler questions.
The main problem is splitting up the input correctly - everything else can be ignored or "solved" with hacks, so
feel free to ignore the nice-to-haves and the more-far-out-nice-to-haves.
Given:
$ cat file
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
This perl file will extract the blocks you describe into files block_1, block_2, etc:
#!/usr/bin/perl
use v5.10;
use warnings;
use strict;
use Text::Balanced qw(extract_multiple extract_bracketed);
my $txt;
while (<>){$txt.=$_;} # slurp the file
my #blocks = extract_multiple(
$txt,
[
# Extract {...}
sub { extract_bracketed($_[0], '{}') },
],
# Return all the fields
undef,
# Throw out anything which does not match
1
);
chdir "/tmp";
my $base="block_";
my $cnt=1;
for my $block (#blocks){ my $fn="$base$cnt";
say "writing $fn";
open (my $fh, '>', $fn) or die "Could not open file '$fn' $!";
print $fh "$block\n";
close $fh;
$cnt++;}
Now the files:
$ cat block_1
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
$ cat block_2
{
payload that goes in second output file }
Using Text::Balanced is robust and likely the best solution.
You can do the blocks with a single Perl regex:
$ perl -0777 -nlE 'while (/(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {if ($1) {$cnt++; say "block $cnt:== start:\n$1\n== end";}}' file
block 1:== start:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
== end
block 2:== start:
{
payload that goes in second output file }
== end
But that is a little more fragile than using a proper parser like Text::Balanced...
I have a solution in C. It would seem there's too much complexity for this to be easily achieved in shell script.
The program isn't overly complicated but nevertheless has more than 200 lines of code, which include error checking, some speed optimization, and other niceties.
Source file split-brackets-to-chunks.c:
#include <stdio.h>
/* Example code by jcxz100 - your problem if you use it! */
#define BUFF_IN_MAX 255
#define BUFF_IN_SIZE (BUFF_IN_MAX+1)
#define OUT_NAME_MAX 31
#define OUT_NAME_SIZE (OUT_NAME_MAX+1)
#define NO_CHAR '\0'
int main()
{
char pcBuff[BUFF_IN_SIZE];
size_t iReadActual;
FILE *pFileIn, *pFileOut;
int iNumberOfOutputFiles;
char pszOutName[OUT_NAME_SIZE];
char cLiteralChar, cAtomicChar, cChunkStartChar, cChunkEndChar;
int iChunkNesting;
char *pcOutputStart;
size_t iOutputLen;
pcBuff[BUFF_IN_MAX] = '\0'; /* ... just to be sure. */
iReadActual = 0;
pFileIn = pFileOut = NULL;
iNumberOfOutputFiles = 0;
pszOutName[OUT_NAME_MAX] = '\0'; /* ... just to be sure. */
cLiteralChar = cAtomicChar = cChunkStartChar = cChunkEndChar = NO_CHAR;
iChunkNesting = 0;
pcOutputStart = (char*)pcBuff;
iOutputLen = 0;
if ((pFileIn = fopen("input-utf-8.txt", "r")) == NULL)
{
printf("What? Where?\n");
return 1;
}
while ((iReadActual = fread(pcBuff, sizeof(char), BUFF_IN_MAX, pFileIn)) > 0)
{
char *pcPivot, *pcStop;
pcBuff[iReadActual] = '\0'; /* ... just to be sure. */
pcPivot = (char*)pcBuff;
pcStop = (char*)pcBuff + iReadActual;
while (pcPivot < pcStop)
{
if (cLiteralChar != NO_CHAR) /* Ignore this char? */
{
/* Yes, ignore this char. */
if (cChunkStartChar != NO_CHAR)
{
/* ... just write it out: */
fprintf(pFileOut, "%c", *pcPivot);
}
pcPivot++;
cLiteralChar = NO_CHAR;
/* End of "Yes, ignore this char." */
}
else if (cAtomicChar != NO_CHAR) /* Are we inside an atomic string? */
{
/* Yup; we are inside an atomic string. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == '\\') /* Treat next char as literal? */
{
cLiteralChar = '\\'; /* Yes. */
bBreakInnerWhile = 1;
}
else if (*pcPivot == cAtomicChar) /* End of atomic? */
{
cAtomicChar = NO_CHAR; /* Yes. */
bBreakInnerWhile = 1;
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
if (cChunkStartChar != NO_CHAR)
{
/* The atomic string is part of a chunk. */
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", iOutputLen, pcOutputStart);
}
/* End of "Yup; we are inside an atomic string." */
}
else if (cChunkStartChar == NO_CHAR) /* Are we inside a chunk? */
{
/* No, we are outside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
while (bBreakInnerWhile == 0)
{
/* Detect start of anything interesting: */
switch (*pcPivot)
{
/* Start of atomic? */
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
/* Start of chunk? */
case '{':
cChunkStartChar = *pcPivot;
cChunkEndChar = '}';
break;
case '[':
cChunkStartChar = *pcPivot;
cChunkEndChar = ']';
break;
case '(':
cChunkStartChar = *pcPivot;
cChunkEndChar = ')';
break;
case '<':
cChunkStartChar = *pcPivot;
cChunkEndChar = '>';
break;
}
if (cChunkStartChar != NO_CHAR)
{
iNumberOfOutputFiles++;
printf("Start '%c' '%c' chunk (file %04d.txt)\n", *pcPivot, cChunkEndChar, iNumberOfOutputFiles);
sprintf((char*)pszOutName, "output/%04d.txt", iNumberOfOutputFiles);
if ((pFileOut = fopen(pszOutName, "w")) == NULL)
{
printf("What? How?\n");
fclose(pFileIn);
return 2;
}
bBreakInnerWhile = 1;
}
else if (++pcPivot == pcStop)
{
bBreakInnerWhile = 1;
}
}
/* End of "No, we are outside a chunk." */
}
else
{
/* Yes, we are inside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == cChunkStartChar)
{
/* Increase level of brackets/parantheses: */
iChunkNesting++;
}
else if (*pcPivot == cChunkEndChar)
{
/* Decrease level of brackets/parantheses: */
iChunkNesting--;
if (iChunkNesting == 0)
{
/* We are now outside chunk. */
bBreakInnerWhile = 1;
}
}
else
{
/* Detect atomic start: */
switch (*pcPivot)
{
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
}
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", iOutputLen, pcOutputStart);
if (iChunkNesting == 0)
{
printf("File done.\n");
cChunkStartChar = cChunkEndChar = NO_CHAR;
fclose(pFileOut);
pFileOut = NULL;
}
/* End of "Yes, we are inside a chunk." */
}
}
}
if (cChunkStartChar != NO_CHAR)
{
printf("Chunk exceeds end-of-file. Exiting gracefully.\n");
fclose(pFileOut);
pFileOut = NULL;
}
if (iNumberOfOutputFiles == 0) printf("Nothing to do...\n");
else printf("All done.\n");
fclose(pFileIn);
return 0;
}
I've solved the nice-to-haves and one of the more-far-out-nice-to-haves.
To show this the input is a little more complex than the example in the question:
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
} trailing junk
intermittent junk
<
payload that goes in second output file } mismatched end bracket should be ignored >
end junk
Resulting file output/0001.txt:
{ here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
}
... and resulting file output/0002.txt:
<
payload that goes in second output file } mismatched end bracket should be ignored >
Thanks #dawg for your help :)
I'm actually doing my own shell.
I have done the following special characters:
int commande(int fin, int fout, char * com, char * param, int * bg){
// execute a command
(ex. ls –l)
int symbole;
char *mot;
pid_t pid;
symbole = parsing();
switch(symbole){
case 0: // NL
case 1: // ;
case 2: // &
case 3: // <
case 4: // >
case 5: // | (Here I have some issues when I try to redirect the output of a command).
(correspond à ctrl+D)
case 10:// Mot
default:
}
return;
}
But I have some issues to do the redirection of an output when it is piped " |", when I have two instructions that follow themselves. Indeed I have tried the following operations which have all worked:
>myShell ps > fich
>myShell ls -l | wc -l
But not this one:
>myShell ls -l | wc -l >file
here are the two cases specifically developped. I think that the issue is in the case 5 and not in the case 4 because the first command I tried worked (which I shew you above).
case 4: // SYMBOLE : >
if(output==0){
output=1;
execute=1;
for (l=0;l<10;l++){
eltsoutput[l]=eltsCommande[l];
}
}
break;
case 5: // SYMBOLE : |
//if(tube==0){
/*for (l=0;l<10;l++){
eltstube[l]=eltsCommande[l];
}*/
p2=fork();
if(p2==0){
if(tube==0){
freopen( "fichtmp", "w", stdout );
execvp(eltsCommande[0], eltsCommande);
}
return(0);
}
else{ if(background==0){ // SANS MOD BG ATTENDRE FIN FILS
waitpid(p2, NULL, 0);
}
tube=1;
execute=1;
}
break;
Can you help me finding a way to execute two commands at the same time with | and that allow their result to go to a file?
In my shell, the case one work in the case of a redirection with an instruction ";":
}else if(output==1){
close(1);
int filew = creat(eltsCommande[0], 0644);
execvp(eltsoutput[0], eltsoutput);
Maybe I should use this code to make it work?
Looking at the NetBSD /bin/sh source code, I see the following pipe implementation:
static int
sh_pipe(int fds[2])
{
int nfd;
if (pipe(fds))
return -1;
if (fds[0] < 3) {
nfd = fcntl(fds[0], F_DUPFD, 3);
if (nfd != -1) {
close(fds[0]);
fds[0] = nfd;
}
}
if (fds[1] < 3) {
nfd = fcntl(fds[1], F_DUPFD, 3);
if (nfd != -1) {
close(fds[1]);
fds[1] = nfd;
}
}
return 0;
}
This function is called by evalpipe with 2 file descriptors:
STATIC void
evalpipe(union node *n)
{
struct job *jp;
struct nodelist *lp;
int pipelen;
int prevfd;
int pip[2];
TRACE(("evalpipe(0x%lx) called\n", (long)n));
pipelen = 0;
for (lp = n->npipe.cmdlist ; lp ; lp = lp->next)
pipelen++;
INTOFF;
jp = makejob(n, pipelen);
prevfd = -1;
for (lp = n->npipe.cmdlist ; lp ; lp = lp->next) {
prehash(lp->n);
pip[1] = -1;
if (lp->next) {
if (sh_pipe(pip) < 0) {
if (prevfd >= 0)
close(prevfd);
error("Pipe call failed");
}
}
if (forkshell(jp, lp->n, n->npipe.backgnd ? FORK_BG : FORK_FG) == 0) {
INTON;
if (prevfd > 0) {
close(0);
copyfd(prevfd, 0, 1);
close(prevfd);
}
if (pip[1] >= 0) {
close(pip[0]);
if (pip[1] != 1) {
close(1);
copyfd(pip[1], 1, 1);
close(pip[1]);
}
}
evaltree(lp->n, EV_EXIT);
}
if (prevfd >= 0)
close(prevfd);
prevfd = pip[0];
close(pip[1]);
}
if (n->npipe.backgnd == 0) {
exitstatus = waitforjob(jp);
TRACE(("evalpipe: job done exit status %d\n", exitstatus));
}
INTON;
}
evalpipe is called in a switch statement in evaltree as follows:
case NPIPE:
evalpipe(n);
do_etest = !(flags & EV_TESTED);
break;
... which is called by the infinite loop in evalloop, and percolates up the tree till it gets to the eval function. I hope this helps.
I wrote a D implementation of the nul2pfb utility from http://www.dwheeler.com/essays/filenames-in-shell.html, as the link to the source code was broken and I wanted to try to learn D. I noticed that it was rather slow (could barely keep up with the find -print0 that was passing it data, when it should be far faster as it need not do anywhere near as many system calls).
The first implementation works correctly (tested with zsh and bash printf built-ins, as well as /usr/bin/printf). The second, though much faster (probably due to far fewer calls to write()), repeates the first part of its output many times, and fails to output the remainder of its output. What is causing this difference? I am a newbie to D and do not understand.
Working code:
import std.stdio;
import std.conv;
void main()
{
foreach (ubyte[] mybuff; chunks(stdin, 4096)) {
encodebuff (mybuff);
}
}
#safe void encodebuff (ubyte[] mybuff) {
foreach (ubyte i; mybuff) {
char b = to!char(i);
switch (i) {
case 'a': .. case 'z':
case 'A': .. case 'Z':
case '0': .. case '9':
case '/':
case '.':
case '_':
case ':': writeChar(b); break;
default: writeOctal(b); break;
case 0: writeChar ('\n'); break;
case '\\': writeString(`\\`); break;
case '\t': writeString(`\t`); break;
case '\n': writeString(`\n`); break;
case '\r': writeString(`\r`); break;
case '\f': writeString(`\f`); break;
case '\v': writeString(`\v`); break;
case '\a': writeString(`\a`); break;
case '\b': writeString(`\b`); break;
}
}
}
#trusted void writeString (string a)
{
write (a);
}
#trusted void writeOctal (int a)
{
writef ("\\%.4o", a); // leading 0 needed for for zsh printf '%b'
}
#trusted void writeChar (char a)
{
write (a);
}
The broken version:
import std.stdio;
import std.conv;
import std.string;
void main()
{
foreach (ubyte[] mybuff; chunks(stdin, 4096)) {
encodebuff (mybuff);
}
}
#safe void encodebuff (ubyte[] mybuff) {
char[] outstring;
foreach (ubyte i; mybuff) {
switch (i) {
case 'a': .. case 'z':
case 'A': .. case 'Z':
case '0': .. case '9':
case '/':
case '.':
case '_':
case ':': outstring ~= to!char(i); break;
case 0: outstring ~= '\n'; break;
default: char[5] mystring;
formatOctal(mystring, i);
outstring ~= mystring;
break;
case '\\': outstring ~= `\\`; break;
case '\t': outstring ~= `\t`; break;
case '\n': outstring ~= `\n`; break;
case '\r': outstring ~= `\r`; break;
case '\f': outstring ~= `\f`; break;
case '\v': outstring ~= `\v`; break;
case '\a': outstring ~= `\a`; break;
case '\b': outstring ~= `\b`; break;
}
writeString (outstring);
}
}
#trusted void writeString (char[] a)
{
write (a);
}
#trusted void formatOctal (char[] b, ubyte a)
{
sformat (b, "\\%.4o", a); // leading 0 needed for zsh printf '%b'
}
Tests: (note that filelist is a NUL-delimited list of files generated by find -print0 on my home directory, and filelist2.txt is generated from filelist by filelist.txt sed -e 's/\x0/\n/g' > filelist2.txt and is thus the corresponding list of newline-delimited filenames).
# the sed script escapes the backslashes so xargs does not clobber them
diff filelist2.txt <(<filelist.txt char2code2 | sed -e 's/\\/\\\\/g' | xargs /usr/bin/printf "%b\n")
# from within zsh
bash -c 'diff filelist2.txt <(for i in "$(<filelist.txt char2code)"; do printf "%b\n" "$i"; done)'
# from within zsh and bash
diff filelist.txt <(for i in $(char2code <filelist.txt); do printf '%b\0' "$i"; done)
# from within zsh, bash, and dash
for i in $(char2code <filelist.txt); do printf '%b\0' "$i"; done | diff - filelist.txt
A script I made as an acid test:
#!/bin/bash
# this creates a completely random list of NUL-delimited strings
a=''
trap 'rm -f "$a"' EXIT
a="$(mktemp)";
</dev/urandom sed -e 's/\x0\x0/\x0/g' | dd count=2048 of="$a"
test -s "$a" || exit 1
printf '\0' >> "$a"
for i in $("$#" < "$a")
do
printf '%b\0' "$i"
done | diff - "$a"
What is the reason for the difference?
EDIT: I have implemented the changes suggested by #yaz and #MichalMinich and am still seeing wrong results. Specifically, find -print0 | char2code2 (the name of the program, which is in my $PATH) from my home directory results in an exit status of 1 and no output. However, it works from a subsidiary directory with far fewer items. My revised source is below:
import std.stdio;
import std.conv;
import std.format;
import std.array;
void main()
{
foreach (ubyte[] mybuff; chunks(stdin, 4096)) {
encodebuff (mybuff);
}
writeln();
}
void encodebuff (ubyte[] mybuff) {
auto buffer = appender!string();
foreach (ubyte i; mybuff) {
switch (i) {
case 'a': .. case 'z':
case 'A': .. case 'Z':
case '0': .. case '9':
case '/':
case '.':
case '_':
case ':': buffer.put(to!char(i)); break;
case 0: buffer.put('\n'); break;
default: formatOctal(buffer, i); break;
case '\\': buffer.put(`\\`); break;
case '\t': buffer.put(`\t`); break;
case '\n': buffer.put(`\n`); break;
case '\r': buffer.put(`\r`); break;
case '\f': buffer.put(`\f`); break;
case '\v': buffer.put(`\v`); break;
case '\a': buffer.put(`\a`); break;
case '\b': buffer.put(`\b`); break;
}
}
writeString (buffer.data);
// writef(stderr, "Wrote a line\n");
}
#trusted void writeString (string a)
{
write (a);
}
#trusted void formatOctal(Writer)(Writer w, ubyte a)
{
formattedWrite(w, "\\%.4o", a); // leading 0 needed for zsh printf '%b'
}
You need to take writeString outside the foreach in encodebuff. Currently you're writing outstring on each loop without clearing it. The issue #Michal Minich pointed is valid too.
One reason could be that you are appending always 5 chars of char[5] mystring. Function sformat in formatOctal returns the string formatted which might have less than 5 chars (probably slice of the buffer), you should use that string to append to outstring.
Performance advice: use Appender instead of ~= for better performance when building string.
The real problem turned out to be with my LDC installation. It used shared libraries, which aren't supported by its version of druntime.
Recompiling LDC to used static libraries fixed the problem.
I have the file given below:
elix554bx.xayybol.42> vi setup.REVISION
# Revision information
setenv RSTATE R24C01
setenv CREVISION X3
exit
My requirement is to read RSTATE from file and then increment last 2 digits of RSTATE in setup.REVISION file and overwrite into same file.
Can you please suggest how to do this?
If you're using vim, then you can use the sequence:
/RSTATE/
$<C-a>:x
The first line is followed by a return and searches for RSTATE. The second line jumps to the end of the line and uses Control-a (shown as <C-a> above, and in the vim documentation) to increment the number. Repeat as often as you want to increment the number. The :x is also followed by a return and saves the file.
The only tricky bit is that the leading 0 on the number makes vim think the number is in octal, not decimal. You can override that by using :set nrformats= followed by return to turn off octal and hex; the default value is nrformats=octal,hex.
You can learn an awful lot about vim from the book Practical Vim: Edit Text at the Speed of Thought by Drew Neil. This information comes from Tip 10 in chapter 2.
Here's an awk one-liner type solution:
awk '{
if ( $0 ~ 'RSTATE' ) {
match($0, "[0-9]+$" );
sub( "[0-9]+$",
sprintf( "%0"RLENGTH"d", substr($0, RSTART, RSTART+RLENGTH)+1 ),
$0 );
print; next;
} else { print };
}' setup.REVISION > tmp$$
mv tmp$$ setup.REVISION
Returns:
setenv RSTATE R24C02
setenv CREVISION X3
exit
This will handle transitions from two to three to more digits appropriately.
I wrote for you a class.
class Reader
{
public string ReadRs(string fileWithPath)
{
string keyword = "RSTATE";
string rs = "";
if(File.Exists(fileWithPath))
{
StreamReader reader = File.OpenText(fileWithPath);
try
{
string line = "";
bool finded = false;
while (reader != null && !finded)
{
line = reader.ReadLine();
if (line.Contains(keyword))
{
finded = true;
}
}
int index = line.IndexOf(keyword);
rs = line.Substring(index + keyword.Length +1, line.Length - 1 - (index + keyword.Length));
}
catch (IOException)
{
//Error
}
finally
{
reader.Close();
}
}
return rs;
}
public int GetLastTwoDigits(string rsState)
{
int digits = -1;
try
{
int length = rsState.Length;
//Get the last two digits of the rsstate
digits = Int32.Parse(rsState.Substring(length - 2, 2));
}
catch (FormatException)
{
//Format Error
digits = -1;
}
return digits;
}
}
You can use this as exists
Reader reader = new Reader();
string rsstate = reader.ReadRs("C://test.txt");
int digits = reader.GetLastTwoDigits(rsstate);
In a Windows program, what is the canonical way to parse the command line obtained from GetCommandLine into multiple arguments, similar to the argv array in Unix? It seems that CommandLineToArgvW does this for a Unicode command line, but I can't find a non-Unicode equivalent. Should I be using Unicode or not? If not, how do I parse the command line?
Here is an implementation of CommandLineToArgvA that delegate the work to CommandLineToArgvW, MultiByteToWideChar and WideCharToMultiByte.
LPSTR* CommandLineToArgvA(LPSTR lpCmdLine, INT *pNumArgs)
{
int retval;
retval = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, lpCmdLine, -1, NULL, 0);
if (!SUCCEEDED(retval))
return NULL;
LPWSTR lpWideCharStr = (LPWSTR)malloc(retval * sizeof(WCHAR));
if (lpWideCharStr == NULL)
return NULL;
retval = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, lpCmdLine, -1, lpWideCharStr, retval);
if (!SUCCEEDED(retval))
{
free(lpWideCharStr);
return NULL;
}
int numArgs;
LPWSTR* args;
args = CommandLineToArgvW(lpWideCharStr, &numArgs);
free(lpWideCharStr);
if (args == NULL)
return NULL;
int storage = numArgs * sizeof(LPSTR);
for (int i = 0; i < numArgs; ++ i)
{
BOOL lpUsedDefaultChar = FALSE;
retval = WideCharToMultiByte(CP_ACP, 0, args[i], -1, NULL, 0, NULL, &lpUsedDefaultChar);
if (!SUCCEEDED(retval))
{
LocalFree(args);
return NULL;
}
storage += retval;
}
LPSTR* result = (LPSTR*)LocalAlloc(LMEM_FIXED, storage);
if (result == NULL)
{
LocalFree(args);
return NULL;
}
int bufLen = storage - numArgs * sizeof(LPSTR);
LPSTR buffer = ((LPSTR)result) + numArgs * sizeof(LPSTR);
for (int i = 0; i < numArgs; ++ i)
{
assert(bufLen > 0);
BOOL lpUsedDefaultChar = FALSE;
retval = WideCharToMultiByte(CP_ACP, 0, args[i], -1, buffer, bufLen, NULL, &lpUsedDefaultChar);
if (!SUCCEEDED(retval))
{
LocalFree(result);
LocalFree(args);
return NULL;
}
result[i] = buffer;
buffer += retval;
bufLen -= retval;
}
LocalFree(args);
*pNumArgs = numArgs;
return result;
}
Apparently you can use __argv outside main() to access the pre-parsed argument vector...
I followed the source for parse_cmd (see "argv_parsing.cpp" in the latest SDK) and modified it to match the paradigm and operation for CommandLineToArgW and developed the following. Note: instead of using LocalAlloc, per Microsoft recommendations (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa366723(v=vs.85).aspx) I've substituted HeapAlloc. Additionally one change in the SAL notation. I deviate slightly be stating _In_opt_ for lpCmdLine - as CommandLineToArgvW does allow this to be NULL, in which case it returns an argument list containing just the program name.
A final caveat, parse_cmd will parse the command line slightly different from CommandLineToArgvW in one aspect only: two double quote characters in a row while the state is 'in quote' mode are interpreted as an escaped double quote character. Both functions consume the first one and output the second one. The difference is that for CommandLineToArgvW, there is a transition out of 'in quote' mode, while parse_cmdline remains in 'in quote' mode. This is properly reflected in the function below.
You would use the below function as follows:
int argc = 0;
LPSTR *argv = CommandLineToArgvA(GetCommandLineA(), &argc);
HeapFree(GetProcessHeap(), NULL, argv);
LPSTR* CommandLineToArgvA(_In_opt_ LPCSTR lpCmdLine, _Out_ int *pNumArgs)
{
if (!pNumArgs)
{
SetLastError(ERROR_INVALID_PARAMETER);
return NULL;
}
*pNumArgs = 0;
/*follow CommandLinetoArgvW and if lpCmdLine is NULL return the path to the executable.
Use 'programname' so that we don't have to allocate MAX_PATH * sizeof(CHAR) for argv
every time. Since this is ANSI the return can't be greater than MAX_PATH (260
characters)*/
CHAR programname[MAX_PATH] = {};
/*pnlength = the length of the string that is copied to the buffer, in characters, not
including the terminating null character*/
DWORD pnlength = GetModuleFileNameA(NULL, programname, MAX_PATH);
if (pnlength == 0) //error getting program name
{
//GetModuleFileNameA will SetLastError
return NULL;
}
if (*lpCmdLine == NULL)
{
/*In keeping with CommandLineToArgvW the caller should make a single call to HeapFree
to release the memory of argv. Allocate a single block of memory with space for two
pointers (representing argv[0] and argv[1]). argv[0] will contain a pointer to argv+2
where the actual program name will be stored. argv[1] will be nullptr per the C++
specifications for argv. Hence space required is the size of a LPSTR (char*) multiplied
by 2 [pointers] + the length of the program name (+1 for null terminating character)
multiplied by the sizeof CHAR. HeapAlloc is called with HEAP_GENERATE_EXCEPTIONS flag,
so if there is a failure on allocating memory an exception will be generated.*/
LPSTR *argv = static_cast<LPSTR*>(HeapAlloc(GetProcessHeap(),
HEAP_ZERO_MEMORY | HEAP_GENERATE_EXCEPTIONS,
(sizeof(LPSTR) * 2) + ((pnlength + 1) * sizeof(CHAR))));
memcpy(argv + 2, programname, pnlength+1); //add 1 for the terminating null character
argv[0] = reinterpret_cast<LPSTR>(argv + 2);
argv[1] = nullptr;
*pNumArgs = 1;
return argv;
}
/*We need to determine the number of arguments and the number of characters so that the
proper amount of memory can be allocated for argv. Our argument count starts at 1 as the
first "argument" is the program name even if there are no other arguments per specs.*/
int argc = 1;
int numchars = 0;
LPCSTR templpcl = lpCmdLine;
bool in_quotes = false; //'in quotes' mode is off (false) or on (true)
/*first scan the program name and copy it. The handling is much simpler than for other
arguments. Basically, whatever lies between the leading double-quote and next one, or a
terminal null character is simply accepted. Fancier handling is not required because the
program name must be a legal NTFS/HPFS file name. Note that the double-quote characters are
not copied.*/
do {
if (*templpcl == '"')
{
//don't add " to character count
in_quotes = !in_quotes;
templpcl++; //move to next character
continue;
}
++numchars; //count character
templpcl++; //move to next character
if (_ismbblead(*templpcl) != 0) //handle MBCS
{
++numchars;
templpcl++; //skip over trail byte
}
} while (*templpcl != '\0' && (in_quotes || (*templpcl != ' ' && *templpcl != '\t')));
//parsed first argument
if (*templpcl == '\0')
{
/*no more arguments, rewind and the next for statement will handle*/
templpcl--;
}
//loop through the remaining arguments
int slashcount = 0; //count of backslashes
bool countorcopychar = true; //count the character or not
for (;;)
{
if (*templpcl)
{
//next argument begins with next non-whitespace character
while (*templpcl == ' ' || *templpcl == '\t')
++templpcl;
}
if (*templpcl == '\0')
break; //end of arguments
++argc; //next argument - increment argument count
//loop through this argument
for (;;)
{
/*Rules:
2N backslashes + " ==> N backslashes and begin/end quote
2N + 1 backslashes + " ==> N backslashes + literal "
N backslashes ==> N backslashes*/
slashcount = 0;
countorcopychar = true;
while (*templpcl == '\\')
{
//count the number of backslashes for use below
++templpcl;
++slashcount;
}
if (*templpcl == '"')
{
//if 2N backslashes before, start/end quote, otherwise count.
if (slashcount % 2 == 0) //even number of backslashes
{
if (in_quotes && *(templpcl +1) == '"')
{
in_quotes = !in_quotes; //NB: parse_cmdline omits this line
templpcl++; //double quote inside quoted string
}
else
{
//skip first quote character and count second
countorcopychar = false;
in_quotes = !in_quotes;
}
}
slashcount /= 2;
}
//count slashes
while (slashcount--)
{
++numchars;
}
if (*templpcl == '\0' || (!in_quotes && (*templpcl == ' ' || *templpcl == '\t')))
{
//at the end of the argument - break
break;
}
if (countorcopychar)
{
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
++templpcl; //skip over trail byte
++numchars;
}
++numchars;
}
++templpcl;
}
//add a count for the null-terminating character
++numchars;
}
/*allocate memory for argv. Allocate a single block of memory with space for argc number of
pointers. argv[0] will contain a pointer to argv+argc where the actual program name will be
stored. argv[argc] will be nullptr per the C++ specifications. Hence space required is the
size of a LPSTR (char*) multiplied by argc + 1 pointers + the number of characters counted
above multiplied by the sizeof CHAR. HeapAlloc is called with HEAP_GENERATE_EXCEPTIONS
flag, so if there is a failure on allocating memory an exception will be generated.*/
LPSTR *argv = static_cast<LPSTR*>(HeapAlloc(GetProcessHeap(),
HEAP_ZERO_MEMORY | HEAP_GENERATE_EXCEPTIONS,
(sizeof(LPSTR) * (argc+1)) + (numchars * sizeof(CHAR))));
//now loop through the commandline again and split out arguments
in_quotes = false;
templpcl = lpCmdLine;
argv[0] = reinterpret_cast<LPSTR>(argv + argc+1);
LPSTR tempargv = reinterpret_cast<LPSTR>(argv + argc+1);
do {
if (*templpcl == '"')
{
in_quotes = !in_quotes;
templpcl++; //move to next character
continue;
}
*tempargv++ = *templpcl;
templpcl++; //move to next character
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
*tempargv++ = *templpcl; //copy second byte
templpcl++; //skip over trail byte
}
} while (*templpcl != '\0' && (in_quotes || (*templpcl != ' ' && *templpcl != '\t')));
//parsed first argument
if (*templpcl == '\0')
{
//no more arguments, rewind and the next for statement will handle
templpcl--;
}
else
{
//end of program name - add null terminator
*tempargv = '\0';
}
int currentarg = 1;
argv[currentarg] = ++tempargv;
//loop through the remaining arguments
slashcount = 0; //count of backslashes
countorcopychar = true; //count the character or not
for (;;)
{
if (*templpcl)
{
//next argument begins with next non-whitespace character
while (*templpcl == ' ' || *templpcl == '\t')
++templpcl;
}
if (*templpcl == '\0')
break; //end of arguments
argv[currentarg] = ++tempargv; //copy address of this argument string
//next argument - loop through it's characters
for (;;)
{
/*Rules:
2N backslashes + " ==> N backslashes and begin/end quote
2N + 1 backslashes + " ==> N backslashes + literal "
N backslashes ==> N backslashes*/
slashcount = 0;
countorcopychar = true;
while (*templpcl == '\\')
{
//count the number of backslashes for use below
++templpcl;
++slashcount;
}
if (*templpcl == '"')
{
//if 2N backslashes before, start/end quote, otherwise copy literally.
if (slashcount % 2 == 0) //even number of backslashes
{
if (in_quotes && *(templpcl+1) == '"')
{
in_quotes = !in_quotes; //NB: parse_cmdline omits this line
templpcl++; //double quote inside quoted string
}
else
{
//skip first quote character and count second
countorcopychar = false;
in_quotes = !in_quotes;
}
}
slashcount /= 2;
}
//copy slashes
while (slashcount--)
{
*tempargv++ = '\\';
}
if (*templpcl == '\0' || (!in_quotes && (*templpcl == ' ' || *templpcl == '\t')))
{
//at the end of the argument - break
break;
}
if (countorcopychar)
{
*tempargv++ = *templpcl;
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
++templpcl; //skip over trail byte
*tempargv++ = *templpcl;
}
}
++templpcl;
}
//null-terminate the argument
*tempargv = '\0';
++currentarg;
}
argv[argc] = nullptr;
*pNumArgs = argc;
return argv;
}
CommandLineToArgvW() is in shell32.dll. I'd guessthat the Shell developers created the function for their own use, and it was made public either because someone decided that 3rd party devs would find it useful or because some court action made them do it.
Since the Shell developers only ever needed a Unicode version that's all they ever wrote. It would be fairly simple to write an ANSI wrapper for the function that converts the ANSI to Unicode, calls the function and converts the Unicode results to ANSI (and if Shell32.dll ever provided an ANSI variant of this API, that's probably exactly what would do).
None of these solved the problem perfectly when don't want parsing UNICODE, so my solution is modified from WINE projects, they contains source code of CommandLineToArgvW of shell32.dll, modified it to below and it's work perfectly for me:
/*************************************************************************
* CommandLineToArgvA [SHELL32.#]
*
* MODIFIED FROM https://www.winehq.org/ project
* We must interpret the quotes in the command line to rebuild the argv
* array correctly:
* - arguments are separated by spaces or tabs
* - quotes serve as optional argument delimiters
* '"a b"' -> 'a b'
* - escaped quotes must be converted back to '"'
* '\"' -> '"'
* - consecutive backslashes preceding a quote see their number halved with
* the remainder escaping the quote:
* 2n backslashes + quote -> n backslashes + quote as an argument delimiter
* 2n+1 backslashes + quote -> n backslashes + literal quote
* - backslashes that are not followed by a quote are copied literally:
* 'a\b' -> 'a\b'
* 'a\\b' -> 'a\\b'
* - in quoted strings, consecutive quotes see their number divided by three
* with the remainder modulo 3 deciding whether to close the string or not.
* Note that the opening quote must be counted in the consecutive quotes,
* that's the (1+) below:
* (1+) 3n quotes -> n quotes
* (1+) 3n+1 quotes -> n quotes plus closes the quoted string
* (1+) 3n+2 quotes -> n+1 quotes plus closes the quoted string
* - in unquoted strings, the first quote opens the quoted string and the
* remaining consecutive quotes follow the above rule.
*/
LPSTR* WINAPI CommandLineToArgvA(LPSTR lpCmdline, int* numargs)
{
DWORD argc;
LPSTR *argv;
LPSTR s;
LPSTR d;
LPSTR cmdline;
int qcount,bcount;
if(!numargs || *lpCmdline==0)
{
SetLastError(ERROR_INVALID_PARAMETER);
return NULL;
}
/* --- First count the arguments */
argc=1;
s=lpCmdline;
/* The first argument, the executable path, follows special rules */
if (*s=='"')
{
/* The executable path ends at the next quote, no matter what */
s++;
while (*s)
if (*s++=='"')
break;
}
else
{
/* The executable path ends at the next space, no matter what */
while (*s && *s!=' ' && *s!='\t')
s++;
}
/* skip to the first argument, if any */
while (*s==' ' || *s=='\t')
s++;
if (*s)
argc++;
/* Analyze the remaining arguments */
qcount=bcount=0;
while (*s)
{
if ((*s==' ' || *s=='\t') && qcount==0)
{
/* skip to the next argument and count it if any */
while (*s==' ' || *s=='\t')
s++;
if (*s)
argc++;
bcount=0;
}
else if (*s=='\\')
{
/* '\', count them */
bcount++;
s++;
}
else if (*s=='"')
{
/* '"' */
if ((bcount & 1)==0)
qcount++; /* unescaped '"' */
s++;
bcount=0;
/* consecutive quotes, see comment in copying code below */
while (*s=='"')
{
qcount++;
s++;
}
qcount=qcount % 3;
if (qcount==2)
qcount=0;
}
else
{
/* a regular character */
bcount=0;
s++;
}
}
/* Allocate in a single lump, the string array, and the strings that go
* with it. This way the caller can make a single LocalFree() call to free
* both, as per MSDN.
*/
argv=LocalAlloc(LMEM_FIXED, (argc+1)*sizeof(LPSTR)+(strlen(lpCmdline)+1)*sizeof(char));
if (!argv)
return NULL;
cmdline=(LPSTR)(argv+argc+1);
strcpy(cmdline, lpCmdline);
/* --- Then split and copy the arguments */
argv[0]=d=cmdline;
argc=1;
/* The first argument, the executable path, follows special rules */
if (*d=='"')
{
/* The executable path ends at the next quote, no matter what */
s=d+1;
while (*s)
{
if (*s=='"')
{
s++;
break;
}
*d++=*s++;
}
}
else
{
/* The executable path ends at the next space, no matter what */
while (*d && *d!=' ' && *d!='\t')
d++;
s=d;
if (*s)
s++;
}
/* close the executable path */
*d++=0;
/* skip to the first argument and initialize it if any */
while (*s==' ' || *s=='\t')
s++;
if (!*s)
{
/* There are no parameters so we are all done */
argv[argc]=NULL;
*numargs=argc;
return argv;
}
/* Split and copy the remaining arguments */
argv[argc++]=d;
qcount=bcount=0;
while (*s)
{
if ((*s==' ' || *s=='\t') && qcount==0)
{
/* close the argument */
*d++=0;
bcount=0;
/* skip to the next one and initialize it if any */
do {
s++;
} while (*s==' ' || *s=='\t');
if (*s)
argv[argc++]=d;
}
else if (*s=='\\')
{
*d++=*s++;
bcount++;
}
else if (*s=='"')
{
if ((bcount & 1)==0)
{
/* Preceded by an even number of '\', this is half that
* number of '\', plus a quote which we erase.
*/
d-=bcount/2;
qcount++;
}
else
{
/* Preceded by an odd number of '\', this is half that
* number of '\' followed by a '"'
*/
d=d-bcount/2-1;
*d++='"';
}
s++;
bcount=0;
/* Now count the number of consecutive quotes. Note that qcount
* already takes into account the opening quote if any, as well as
* the quote that lead us here.
*/
while (*s=='"')
{
if (++qcount==3)
{
*d++='"';
qcount=0;
}
s++;
}
if (qcount==2)
qcount=0;
}
else
{
/* a regular character */
*d++=*s++;
bcount=0;
}
}
*d='\0';
argv[argc]=NULL;
*numargs=argc;
return argv;
}
Be careful when parsing empty string "", it's return NULL instead of executable path, that's the different behavior with the standard CommandLineToArgvW, the recommanded usage is below:
int argc;
LPSTR * argv = CommandLineToArgvA(GetCommandLineA(), &argc);
// AFTER consumed argv
LocalFree(argv);
The following is about the simplest way I can think of to obtain an old-fashioned argc/argv pair at the top of WinMain. Assuming that the command-line really was ANSI text, you don't actually need any conversions fancier than this.
int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd) {
int argc;
LPWSTR *szArglist = CommandLineToArgvW(GetCommandLineW(), &argc);
char **argv = new char*[argc];
for (int i=0; i<argc; i++) {
int lgth = wcslen(szArglist[i]);
argv[i] = new char[lgth+1];
for (int j=0; j<=lgth; j++)
argv[i][j] = char(szArglist[i][j]);
}
LocalFree(szArglist);