Readable output for tracking runtime - bash

I want to have a proper output style using /usr/bin/time and when I try something like
/usr/bin/time -f'time=%E' ls > /dev/null
the output is
time=0:00.05
where the 5 says 5 centiseconds.
If my command/script runs a longer time, the output will be e.g.:
time=1:30:05
where the 5 says 5 seconds.
I expected the output described in man time:
The format string
The format is interpreted in the usual printf-like way. Ordinary characters are directly copied, tab, newline and backslash are escaped using \t, \n and \\, a
percent sign is represented by %%, and otherwise % indicates a conversion. The program time will always add a trailing newline itself. The conversions follow.
All of those used by tcsh(1) are supported.
Time
%E Elapsed real time (in [hours:]minutes:seconds).
So I don't want those confusing centiseconds. The format should be logical and easily readable without additional tools like sed. When I have a log for several commands, the output should look something like:
time=0:00:01
time=3:30:12
time=0:10:01
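A minimal sketch of one way to get that layout without post-processing, assuming bash rather than /usr/bin/time: the shell's SECONDS counter gives whole elapsed seconds, and printf can lay them out as H:MM:SS. The helper name fmt_hms is hypothetical, and the resolution is limited to one second.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a whole number of seconds as time=H:MM:SS.
fmt_hms() {
    local total=$1
    printf 'time=%d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

# Time a command with bash's SECONDS counter (one-second resolution).
start=$SECONDS
ls > /dev/null
fmt_hms $((SECONDS - start))
```

Because the hours field is always printed, short and long runs line up in a log.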

Related

Simplify complex command, put it into a variable

date +'%A %B %d' | sed -e 's/\(^\|[^[:digit:]]\+\)0\+\([[:digit:]]\)/\1\2/g'
I like the output of the above command, which strips leading zeroes off days of the month produced by the date command, in the case of numerals less than 10. It's the only way I've thus far found of producing single digit dates from the date command's output for the day of the month, which otherwise would be 01, 02, 03, etc.
A couple of questions in this regard. Is there a more elegant way of accomplishing the stated goal of stripping off zeroes? I do know about date's %e switch and would like to use it, but with numerals 10 and greater it has the undesirable effect of losing the space between the month name and the date (so, July 2 but July10).
The second question regards the larger intended goal of arriving at such an incantation. I'm putting together a script that will scrape some data from a web page. The best way of locating the target data on the page is by searching on the current date. But the site uses only single digits for the first 9 days of the month, thus the need to strip off leading zeroes. So what's the best way of getting this complex command into a variable so I can call it within my script? Would a variable within a variable be called for here?
RESOLUTION
I'll sort of answer my own question here, though it is really input from Renaud Pacalet (below) that enabled me to resolve the matter. His input revealed to me that I had not understood the man page very well, particularly the part where it says "date pads numeric fields with zeroes," and below that where it is written "- (hyphen) do not pad the field." Had I understood those statements better, I would have realized that there is no need for the complex sed line through which I piped the date output in the title of this posting: had I used %-d there instead of just %d, there would have been no leading zeroes in front of numerals less than 10 and so no need to call sed (or tr, as suggested below by LMC) to strip them off. In light of that, the answer to the second question about putting that incantation into a variable becomes elementary: var=$(date +'%A %B %-d') is all that is needed.
I may go ahead and mark Renaud Pacalet's response as the solution since, even though I did not implement all of his suggestions into the latest incarnation of my script, it proved crucial in clarifying key requirements of the task.
If your date utility supports it (the one from GNU coreutils does) you can use:
date +'%A %B %-d'
The - tells date to not pad the numeric field. Demo:
$ date -d"2021/07/01" +'%A %B %-d'
Thursday July 1
Not sure I understand your second question but if you want to pass this command to a shell script (I do not really understand why you would do that), you can use the eval shell command:
$ cat foo.sh
#!/usr/bin/env bash
foo="$(eval "$1")"
echo "$foo"
$ ./foo.sh 'date -d"2021/07/01" +"%A %B %-d"'
Thursday July 1
Please pay attention to the double (") and simple (') quotes usage. And of course, you will have to add to this example script what is needed to handle errors, avoid misuses...
Note that many string comparison utilities support one form or another of extended regular expressions. So getting rid of these leading zeros or spaces can be as easy as:
grep -E 'Thursday\s+July\s+0*1' foo.txt
This would match any line of foo.txt containing
Thursday<1 or more spaces>July<1 or more spaces><0 or more zeros>1
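As a small sketch of how the unpadded date might be used for the lookup described in the question, assuming GNU date and some made-up page text (the two "Posted ..." lines are hypothetical):

```shell
#!/usr/bin/env bash
# Assumes GNU date: %-d (no padding) and -d (fixed date) are GNU extensions.
# LC_ALL=C keeps the day and month names in English.
today=$(LC_ALL=C date -d '2021/07/01' +'%A %B %-d')

# Hypothetical page content. -F treats the date as a fixed string and
# -w keeps "July 1" from also matching "July 15".
printf '%s\n' 'Posted Thursday July 1' 'Posted Thursday July 15' |
    grep -Fw "$today"
```

In a real script the -d option would be dropped so that date uses the current day.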

Single file contain files name and scores | text processing

I have a folder called files that has 100 files; each one contains a single value, such as 0.974323
This is my code to generate those files and store the single value inside:
DIR="/home/XX/folder"
INPUT_DIR="/home/XX/folder/eval"
OUTPUT_DIR="/home/XX/folder/files"
for i in $INPUT_DIR/*
do
groovy $DIR/calculate.groovy $i > $OUTPUT_DIR/${i##*/}_rates.txt
done
That generates 100 files inside /home/XX/folder/files, but what I want is one single file in which each line has two tab-separated columns: the score and the name of the file (which is i).
the score \t name of the file
So, the output will be:
0.9363728 \t resultFile.txt
0.37229 \t outFile.txt
And so on, any help with that please?
Assuming your Groovy program outputs just the score, try something like
#!/bin/sh
# ^ use a valid shebang
# Don't use uppercase for variables
dir="/home/XX/folder"
input_dir="/home/XX/folder/eval"
output_dir="/home/XX/folder/files"
# Always use double quotes around file names
for i in "$input_dir"/*
do
groovy "$dir/calculate.groovy" "$i" |
sed "s%^%$i\t%"
done >"$output_dir"/tabbed_file.txt
The sed script assumes that the file names do not contain percent signs, and that your sed recognizes \t as a tab (some variants will think it's just a regular t with a gratuitous backslash; replace it with a literal tab, or try ctrl-v tab to enter a literal tab at the prompt in many shells).
A much better fix is probably to change your Groovy program so that it accepts an arbitrary number of files as command-line arguments, and includes the file name in the output (perhaps as an option).
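A variation that avoids sed altogether, as a sketch: capture the score with command substitution and let printf write the score first and the bare file name second, which is the column order asked for in the question. The function names are hypothetical, and score_of uses cat only as a stand-in for the real groovy call.

```shell
#!/bin/sh
# Hypothetical stand-in for: groovy "$dir/calculate.groovy" "$1"
# (here it just prints the file's content so the sketch runs anywhere).
score_of() {
    cat "$1"
}

# Print "score<TAB>file name" for every file given as an argument.
tabulate() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(score_of "$f")" "${f##*/}"
    done
}
```

It would be called as tabulate "$input_dir"/* > "$output_dir/tabbed_file.txt". Since printf produces the tab, nothing depends on how a particular sed treats \t or on which characters appear in the file names.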

Removing diacritical marks from a Greek text in an automatic way

I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has the diacritic removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But is there any way to automate converting each line? I heard about iconv [2] but didn't manage to achieve the desired conversion with it. Ideally I'd like to use a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicode denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
I'm not so familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics)
However I went through the vowels and found out which combined with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list as a file and passed it to this sed
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed. It takes each of the options and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
Regarding transliterating the Greek: the image from your post is intended to help the user type in Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations. e.g. β is most often transliterated as v. ψ is ps. φ is ph, etc.

Parsing the output of Bash's time builtin

I'm running a C program from a Bash script, and running it through a command called time, which outputs some time statistics for the running of the algorithm.
If I were to perform the command
time $ALGORITHM $VALUE $FILENAME
It produces the output:
real 0m0.435s
user 0m0.430s
sys 0m0.003s
The values depend on the run of the algorithm.
However, what I would like to be able to do is to take the 0.435 and assign it to a variable.
I've read into awk a bit, enough to know that if I pipe the above command into awk, I should be able to grab the 0.435 and place it in a variable. But how do I do that?
Many thanks
You must be careful: there's the Bash builtin time and there's the external command time, usually located in /usr/bin/time (run type -a time to list every time available on your system).
If your shell is Bash, when you issue
time stuff
you're calling the builtin time. You can't directly catch the output of time without some minor trickery. This is because time doesn't want to interfere with possible redirections or pipes you'll perform, and that's a good thing.
To get time output on standard out, you need:
{ time stuff; } 2>&1
(grouping and redirection).
Now, about parsing the output: parsing the output of a command is usually a bad idea, especially when it's possible to do without. Fortunately, Bash's time command accepts a format string. From the manual:
TIMEFORMAT
The value of this parameter is used as a format string specifying how the timing information for pipelines prefixed with the time reserved word should be displayed. The % character introduces an escape sequence that is expanded to a time value or other information. The escape sequences and their meanings are as follows; the braces denote optional portions.
%%
A literal `%`.
%[p][l]R
The elapsed time in seconds.
%[p][l]U
The number of CPU seconds spent in user mode.
%[p][l]S
The number of CPU seconds spent in system mode.
%P
The CPU percentage, computed as (%U + %S) / %R.
The optional p is a digit specifying the precision, the number of fractional digits after a decimal point. A value of 0 causes no decimal point or fraction to be output. At most three places after the decimal point may be specified; values of p greater than 3 are changed to 3. If p is not specified, the value 3 is used.
The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
If this variable is not set, Bash acts as if it had the value
$'\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS'
If the value is null, no timing information is displayed. A trailing newline is added when the format string is displayed.
So, to fully achieve what you want:
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME; } 2>&1)
As @glennjackman points out, if your command sends any messages to standard output and standard error, you must take care of that too. For that, some extra plumbing is necessary:
exec 3>&1 4>&2
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME 1>&3 2>&4; } 2>&1)
exec 3>&- 4>&-
Source: BashFAQ032 on the wonderful Greg's wiki.
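A self-contained sketch of that plumbing, assuming bash. The function noisy_job is a hypothetical stand-in for $ALGORITHM $VALUE $FILENAME; it writes to both streams so the effect of the extra descriptors is visible.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real program: writes to stdout and stderr.
noisy_job() {
    echo "result on stdout"
    echo "warning on stderr" >&2
    sleep 0.1
}

exec 3>&1 4>&2   # save the script's real stdout and stderr as fds 3 and 4
elapsed=$(TIMEFORMAT='%R'; { time noisy_job 1>&3 2>&4; } 2>&1)
exec 3>&- 4>&-   # close the saved descriptors again

echo "elapsed seconds: $elapsed"
```

The job's own messages still reach the terminal through descriptors 3 and 4, so the variable should hold only the number printed by time.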
You could try the below awk command which uses split function to split the input based on digit m or last s.
$ foo=$(awk '/^real/{split($2,a,"[0-9]m|s$"); print a[2]}' file)
$ echo "$foo"
0.435
You can use this awk:
var=$(awk '$1=="real"{gsub(/^[0-9]+[hms]|[hms]$/, "", $2); print $2}' file)
echo "$var"
0.435
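Both awk commands above discard the minutes part, which is fine for runs under a minute. For longer runs, a possible sketch is to split on the m and s markers and convert to seconds; the function name to_seconds is hypothetical, and the input is assumed to be the default "real MMmSS.FFFs" layout.

```shell
#!/bin/sh
# Hypothetical helper: read time's default output on stdin and print the
# "real" value as a plain number of seconds.
to_seconds() {
    LC_ALL=C awk '$1 == "real" {
        split($2, t, /[ms]/)      # "1m30.500s" -> t[1]="1", t[2]="30.500"
        print t[1] * 60 + t[2]
    }'
}

printf 'real\t1m30.500s\nuser\t0m0.430s\nsys\t0m0.003s\n' | to_seconds
```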

Why does sed not replace overlapping patterns

I have a database unload file with fields separated by the <TAB> character. I am running this file through sed to replace any occurrences of <TAB><TAB> with <TAB>\N<TAB>. This is so that when the file is loaded into MySQL the \N is interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work or do I need to loop.
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.
Not dissimilar to the perl solution, this works for me using pure sed.
With @Robin A. Meade's improvement:
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
Explanation
:repeat is a label, used for branch commands, similar to batch
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means if the "s" command did any replaces, then goto the label repeat, else it goes onto the next line and starts over again.
So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
Explanation
:repeat is a label, used for branch commands, similar to batch
/\t\t/ means match the pattern of 2 tabs. If the pattern is matched, the command following the second / is executed.
{} - In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch) the label repeat
Short version
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
MacOS
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
Tabs need to be literal in BSD sed
Newlines need to be both literal and escaped at the same time, hence the single backslash (written as \\ inside the $'...' string, which the shell turns into one literal backslash) followed by the \n, which becomes an actual newline
Both label names (:r) and branch commands (b r when not the end of the expression) must end in a newline. Special characters like semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.
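The commands above insert a newline between the tabs, whereas the question asks for a literal \N. A sketch of the same loop adapted to that, with each command in its own -e so labels end cleanly, and with a literal tab built by printf so it does not rely on sed understanding \t (the function name fill_nulls is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: put a literal \N between every pair of adjacent tabs.
fill_nulls() {
    tab=$(printf '\t')   # literal tab character
    # Inside double quotes the shell turns \\\\N into \\N, which sed then
    # reads as a backslash followed by N in the replacement text.
    sed -e ':repeat' \
        -e "s/${tab}${tab}/${tab}\\\\N${tab}/g" \
        -e 't repeat'
}

printf 'a\t\t\tb\n' | fill_nulls
```

On the sample line the first pass yields a<TAB>\N<TAB><TAB>b, and the second pass fills the remaining gap.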
I know you want sed, but sed doesn't like this at all; it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>
As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
Right, even with /g, sed will not rescan the text it has just replaced. It reads <TAB><TAB>, outputs <TAB>\N<TAB>, and then continues with whatever comes next in the input stream. See http://www.grymoire.com/Unix/Sed.html#uh-7
In a regex language that supports lookaheads, you can get around this with a lookahead.
Well, sed simply works as designed. The input line is scanned once, not multiple times. Maybe it helps to look at the consequences if sed used rescanning the input line to deal with overlapping patterns by default: in this case even simple substitutions would work quite differently--some might say counter-intuitively--, e.g.
s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise
There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.

Resources