I need some idea on text processing for SRT subtitles - bash

Title says what I really need ATM.
Basically I've created an OCR toolchain based on Tesseract and ImageMagick, and I've managed to get it to the point where the output text is very consistent. I'm using this to OCR some old hardsubbed videos and turn them into soft-subbed SRT files. To take the screenshots for the image input I'm using a modified version of an old shell script I found and rewrote ages ago. Those get fed into a second script that processes them into a form readable by Tesseract. At this point I could easily do the remainder of the work by hand, but I'd like to automate all but the final proofread pass if possible.
Example Text (From current project)
03:04.418 Their parents have always written letters thanking us. =
03:05.018 Their parents have always written letters thanking us. =
03:05.619 Their parents have always written letters thanking us. =
03:06.219 Their parents have always written letters thanking us. =
03:06.820 Their parents have always written letters thanking us. =
03:07.421 Their parents have always written letters thanking us. =
03:08.021 Their parents have always written letters thanking us. =
03:08.622 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:09.222 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:09.823 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:10.424 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:11.024 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:11.625 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:12.225 In additien te all the previeus requests se far..."
03:12.826 In additien te all the previeus requests se far..."
03:13.427 In additien te all the previeus requests se far..."
03:14.027 In additien te all the previeus requests se far..."
03:14.628 In additien te all the previeus requests se far..."
Basically I want to match the text, pull the timestamps from the first and last lines of each run of identical text, and set them up in SRT format:
1
00:03:04,418 --> 00:03:08,021
Their parents have always written
letters thanking us. =
2
00:03:08,622 --> 00:03:11,625
This seminary was highly reeemmended.
| am relieved te leave her in your care. =
3
00:03:12,225 --> 00:03:14,628
In additien te all the previeus requests se far..."
At this point I'm fine with it being a separate script.
Basically sub.txt in, sub.srt out, then a proofread pass. There is a bit of variability in the detected text, but it's minimal: I is occasionally detected as | or [, and o and e sometimes get mixed up in some odd corner cases.
Edit February 2 2020:
I've made some changes and tweaks, to both my shell script and Ivan's, to get closer to what I wanted. I've eliminated the blank sub lines produced by Ivan's script and by mine.
UPDATED processing and ocr script BTW
#!/bin/bash -x
cd "$1"
mkdir ocr
for f in *.png ;
do
base="$(basename "$f" | cut -d "." -f 1,2)"
echo "$base"
if [[ -z "$2" ]] ;
then
tran="$(convert "$f" -separate -average -crop +0+720 -threshold 11% -fill black -draw 'color 700,10 floodfill' +repage ocr/"$base".png)"
else
tran="$(convert "$f" -separate -average -crop +0+720 -negate -threshold 15% -fill white -draw 'color 700,10 floodfill' +repage ocr/"$base".png)"
fi
$tran
cd ocr
magick mogrify -pointsize 50 -fill blue -draw 'text 1400,310 "L" ' +repage "$base".png
cd ..
done
cd ocr
for i in *.png ;
do base2="$(basename "$i" | cut -d "." -f 1,2 | cut -d ":" -f 2,3)"
tesseract "$i" stdout -c page_separator='' --psm 6 --oem 1 --dpi 300 | { tr '\n' ' '; tr -s [:space:] ' '; echo; } >> text.txt
echo "$base2"" " >> time.txt
done
awk '{printf ("%s", $0); getline < "text.txt"; print $0 }' time.txt >> out.txt
sed -i 's/|/I/g' out.txt
sed -i 's/\[/I/g' out.txt
#sed -i 's/L//g' out.txt
#sed -i 's/=//g' out.txt
sed -i 's/.$//' out.txt
sed -i 's/.$//' out.txt
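# keep only lines that contain at least one letter; the old
# "while read ... sed" loop silently dropped the first line of out.txt
sed '/[[:alpha:]]/!d' out.txt >> sub.txt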
exit
The part that draws the blue L is there to ensure every line has something in it for timestamp matching.
UPDATED IVAN SRT SCRIPT
#!/bin/bash -x
sub="$1" # path to sub file
OLD=$IFS # remember current delimiter
IFS=$'\n' # set delimiter to the new line
raw=( $(cat $sub) ) # load sub into raw array
IFS=$OLD # set default delimiter back
reset () {
unset raw[0] # remove 1-st item from array
raw=( "${raw[@]}" ) # rearrange array
}
output () {
printf "00:$time1 --> 00:$time3\n$text1\n\n"
}
speen () {
time3=$time2
reset
test=( "${raw[@]::2}" ) # get two more items
test2=( ${test[0]} ) # split 2-nd item
time2=${test2[0]} # get 2-nd timing
text2=${test2[@]:1} # get 2-nd text
# if only one item in test than this is the end, return
[[ "${test[1]}" ]] || { printf "00:$time1 --> 00:$time2\n$text1\n\n"; raw=; return; }
# compare, speen more if match, print ang go further if not
[[ "$text1" == "$text2" ]] && speen || output
}
N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
echo $((N++)) # print and inc counter
test1=( $raw ) # get 1-st item
time1=${test1[0]} # get 1-st timing
text1=${test1[@]:1} # get 1-st text
speen
done
I just added a third time variable, time3, to save the old time2 value. Eliminating the blank timestamp lines broke his matching: I realized that time2 was the first non-matching timestamp, so I needed to save the one from the previous loop iteration. Thus time3=$time2 comes first, then time2 is reset, and the old time2 (now time3) is used to print the sub string.

Ended with this
#!/bin/bash
sub=file # path to sub file
OLD=$IFS # remember current delimiter
IFS=$'\n' # set delimiter to the new line
raw=( $(cat $sub) ) # load sub into raw array
IFS=$OLD # set default delimiter back
reset () {
unset raw[0] # remove 1-st item from array
raw=( "${raw[@]}" ) # rearrange array
}
output () {
text1=${text1//|/I} # change | to I in text
text1=${text1//[/I} # change [ to I in text
printf "$time1 --> $time2\n$text1\n\n"
}
speen () {
reset
test=( "${raw[@]::2}" ) # get two more items
test2=( ${test[0]} ) # split 2-nd item
time2=${test2[0]} # get 2-nd timing
text2=${test2[@]:1} # get 2-nd text
# if only one item in test than this is the end, return
[[ "${test[1]}" ]] || { printf "$time1 --> $time2\n$text1\n\n"; raw=; return; }
# compare, speen more if match, print ang go further if not
[[ "$text1" == "$text2" ]] && speen || output
}
N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
echo $((N++)) # print and inc counter
test1=( $raw ) # get 1-st item
time1=${test1[0]} # get 1-st timing
text1=${test1[@]:1} # get 1-st text
speen
done
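For comparison, the same grouping can be sketched with awk instead of bash arrays. This is only a minimal sketch, not a drop-in replacement for the script above: the function name sub2srt is made up for illustration, it assumes every input line is "MM:SS.mmm text" separated by spaces, that the hour is always 00 (as the printf in the earlier script assumes), and that repeated frames carry exactly identical text.

```shell
#!/bin/bash
# sub2srt (hypothetical name): read "MM:SS.mmm text" lines on stdin and
# write numbered SRT blocks on stdout, merging consecutive identical text.
sub2srt() {
    awk '
    function srt_time(t) {              # 03:04.418 -> 00:03:04,418
        sub(/\./, ",", t)
        return "00:" t                  # assumption: hour is always 00
    }
    function flush_block() {            # print the block collected so far
        if (text != "")
            printf "%d\n%s --> %s\n%s\n\n", ++n, srt_time(first), srt_time(last), text
    }
    {
        stamp = $1
        line = $0
        if (NF < 2) line = ""           # timestamp only, no text
        else sub(/^[^ ]+ +/, "", line)  # drop the timestamp, keep the text
        if (line != text) {             # text changed: close the old block
            flush_block()
            first = stamp
            text = line
        }
        last = stamp                    # last frame seen with this text
    }
    END { flush_block() }
    '
}
```

It would be run as sub2srt < sub.txt > sub.srt. Because the comparison is an exact string match, any OCR variation between frames of the same subtitle still splits a block, so the proofread pass remains necessary.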

Related

Evaluate expression using printf in bash [duplicate]

How to do arithmetic with floating point numbers such as 1.503923 in a shell script? The floating point numbers are pulled from a file as a string. The format of the file is as follows:
1.5493482,3.49384,33.284732,23.043852,2.2384...
3.384,3.282342,23.043852,2.23284,8.39283...
.
.
.
Here is some simplified sample code I need to get working. Everything works fine up to the arithmetic. I pull a line from the file, then pull multiple values from that line. I think this would cut down on search processing time as these files are huge.
# set vars, loops etc.
while [ $line_no -gt 0 ]
do
line_string=`sed -n $line_no'p' $file_path` # Pull Line (str) from a file
string1=${line_string:9:6} # Pull value from the Line
string2=${line_string:16:6}
string3=...
.
.
.
calc1= `expr $string2 - $string7` |bc -l # I tried these and various
calc2= ` "$string3" * "$string2" ` |bc -l # other combinations
calc3= `expr $string2 - $string1`
calc4= "$string2 + $string8" |bc
.
.
.
generic_function_call # Use the variables in functions
line_no=`expr $line_no - 1` # Counter--
done
Output I keep getting:
expr: non-numeric argument
command not found
I believe you should use bc.
For example:
echo "scale = 10; 123.456789/345.345345" | bc
(It's the Unix way: each tool specializes in doing one thing well, and they all work together to do great things. Don't emulate a great tool with another; make them work together.)
Output:
.3574879198
Or with a scale of 1 instead of 10:
echo "scale = 1; 123.456789/345.345345" | bc
Output:
.3
Note that this does not perform rounding.
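If rounding is wanted rather than truncation, one option is to let awk's printf do the division and the formatting. The following is a minimal sketch; the helper name round_div is hypothetical and not part of the original answer.

```shell
#!/bin/bash
# round_div A B SCALE (hypothetical helper): print A / B rounded to
# SCALE decimal places, using awk floating-point arithmetic.
round_div() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%." s "f\n", a / b }'
}

round_div 123.456789 345.345345 1   # rounds, where bc with scale=1 truncates
```

Unlike bc, awk works in double precision, so this is only suitable when roughly 15 significant digits are enough.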
I highly recommend switching to awk if you need to do more complex operations, or perl for the most complex ones.
For example, your operations done with awk:
# create the test file:
printf '1.5493482,3.49384,33.284732,23.043852,2.2384,12.1,13.4,...\n' > somefile
printf '3.384,3.282342,23.043852,2.23284,8.39283,14.1,15.2,...\n' >> somefile
# do OP's calculations (and DEBUG print them out!)
awk -F',' '
# put no single quote in here... even in comments! you can instead print a: \047
# the -F tell awk to use "," as a separator. Thus awk will automatically split lines for us using it.
# $1=before first "," $2=between 1st and 2nd "," ... etc.
function some_awk_function_here_if_you_want() { # optional function definition
# some actions here. you can even have arguments to the function, etc.
print "DEBUG: no action defined in some_awk_function_here_if_you_want yet ..."
}
BEGIN { rem="Optional START section. here you can put initialisations, that happens before the FIRST file-s FIRST line is read"
}
(NF>=8) { rem="for each line with at least 8 values separated by commas (and only for lines meeting that condition)"
calc1=($2 - $7)
calc2=($3 * $2)
calc3=($2 - $1)
calc4=($2 + $8)
# uncomment to call this function :(ex1): # some_awk_function_here_if_you_want
# uncomment to call this script:(ex2): # cmd="/path/to/some/script.sh \"" calc1 "\" \"" calc2 "\" ..." ; rem="continued next line"
# uncomment to call this script:(ex2): # system(cmd); close(cmd)
line_no=(FNR-1) # ? why -1? . FNR=line number in the CURRENT file. NR=line number since the beginning (NR>FNR after the first file ...)
print "DEBUG: calc1=" calc1 " , calc2=" calc2 " , calc3=" calc3 " , calc4=" calc4 " , line_no=" line_no
print "DEBUG fancier_examples: see man printf for lots of info on formatting (%...f for floats, %...d for integer, %...s for strings, etc)"
printf("DEBUG: calc1=%d , calc2=%10.2f , calc3=%s , calc4=%d , line_no=%d\n",calc1, calc2, calc3, calc4, line_no)
}
END { rem="Optional END section. here you can put things that need to happen AFTER the LAST file-s LAST line is read"
}
' somefile # end of the awk script, and the list of file(s) to be read by it.
What about this?
calc=$(echo "$string2 + $string8" | bc)
This makes bc add the values of $string2 and $string8 and saves the result in the variable calc.
If you don't have bc, you can just use awk:
calc=$(echo 2.3 4.6 | awk '{ printf "%f", $1 + $2 }')
scale in bc is the precision, so with a scale of 4, bc <<< 'scale=4;22.0/7' gives 3.1428 as an answer. With a scale of 8 you get 3.14285714, which is 8 digits after the decimal point.
So the scale is a precision factor.

Show average file output speed once for loop is complete, for benchmarking purposes?

Sorry for being unclear, my fellow mates.
So to elaborate, and possibly answer my own question: while Distro1Analysis.txt is being written to, calculate the output speed in kb/s, and when the output is done, average the output speed and print it to the screen.
The second part, its own question really, is quite simple. I'm not a computer scientist or advanced programmer, but I am certain there's a relatively easy way to improve the overall execution speed of the script. That means asking what the speed culprit is: how the script was written, the chosen programs, or the mix of programs (i.e., is it faster to use 3 instances of the same program as opposed to one instance of 3 different programs?). For instance, could recursion be used, and how?
I was originally going to ask how to benchmark the time a program takes to run one command, but it seemed simpler to use an overarching (global) benchmark, hence the question. Any help you can provide would be useful.
Rdepends Version
ps -A &>> Distro1Analysis.txt && sudo service --status-all &>> Distro1Analysis.txt && \
for z in $(dpkg -l | awk '/^[hi]i/{print $2}' | grep -v '^lib'); do \
printf "\n$z:" && \
aptitude show $z | grep -E 'Uncompressed Size' && \
result=$(apt-rdepends 2>/dev/null $z | grep -v "Depends")
final=$(apt show 2>/dev/null $result | grep -E "Package|Installed-Size" | sed "/APT/d;s/Installed-Size: //");
if [[ (${#final} -le 700) ]]; then echo $final; else :; fi done &>> Distro1Analysis.txt
Depends Version
ps -A &>> Distro1Analysis.txt && sudo service --status-all &>> Distro1Analysis.txt && \
for z in $(dpkg -l | awk '/^[hi]i/{print $2}' | grep -v '^lib'); do \
printf "\n$z:" && \
aptitude show $z | grep -E 'Uncompressed Size' && \
printf "\n" && \
apt show 2>/dev/null $(aptitude search '!~i?reverse-depends("^'$z'$")' -F "%p" | \
sed 's/:i386$//') | grep -E 'Package|Installed-Size' | sed '/APT/d;s/^.*Package:/\t&/;N;s/\n/ /'; done &>> Distro1Analysis.txt
calculate output speed in kb/s and when output is done then average
output speed and print to screen
Here's an answer that's basically
Starting your script to run in the background.
Checking the size of its output file every two seconds with du -b.
Run the following bash script like so: $ bash scriptoutmon.sh subscript.sh Distro1Analysis.txt 12 10 2
scriptoutmon.sh usage:
$1 : Path to the subscript to run
$2 : Path to output file to monitor
$3 : How long to run scriptoutmon.sh script in seconds.
$4 : How long to run the subscript ($1)
$5 : Tick length for displayed updates in seconds.
scriptoutmon.sh:
#!/bin/bash
# Date: 2020-04-13T23:03Z
# Author: Steven Baltakatei Sandoval
# License: GPLv3+ https://www.gnu.org/licenses/gpl-3.0.en.html
# Description: Runs subscript and measures change in file size of a specified file.
# Usage: scriptoutmon.sh [ path to subscript ] [ path to subscript output file ] [ script TTL (s) ] [ subscript TTL (s) ] [ tick size (s) ]
# References:
# [1]: Adrian Pronk (2013-02-22). "Floating point results in Bash integer division". https://stackoverflow.com/a/15015920
# [2]: chronitis (2012-11-15). "bc: set number of digits after decimal point". https://askubuntu.com/a/217575
# [3]: ypnos (2020-02-12). "Differences of size in du -hs and du -b". https://stackoverflow.com/a/60196741
# == Function Definitions ==
echoerr() { echo "$@" 1>&2; } # display message via stderr
getSize() { echo $(du -b "$1" | awk '{print $1}'); } # output file size in bytes. See [3].
# == Initialize settings ==
SUBSCRIPT_PATH="$1" # path to subscript to run
SUBSCRIPT_OUTPUT_PATH="$2" # path to output file generated by subscript
SCRIPT_TTL="$3" # set script time-to-live in seconds
SUBSCRIPT_TTL="$4" # set subscript time-to-live in seconds
TICK_SIZE="$5" # update tick size (in seconds)
# == Perform work ==
timeout "$SUBSCRIPT_TTL" bash "$SUBSCRIPT_PATH" & # run subscript in the background for SUBSCRIPT_TTL seconds.
# note: SUBSCRIPT_OUTPUT_PATH should be path of output file generated by subscript.sh .
if [ -f "$SUBSCRIPT_OUTPUT_PATH" ]; then SUBSCRIPT_OUTPUT_INITIAL_SIZE=$(getSize "$SUBSCRIPT_OUTPUT_PATH"); else SUBSCRIPT_OUTPUT_INITIAL_SIZE="0"; fi # save initial size if file exists.
echoerr "Running $(basename "$SUBSCRIPT_PATH") and then monitoring rate of file size changes to $(basename "$SUBSCRIPT_OUTPUT_PATH")." # explain displayed output
# Calc and display subscript output file size changes
while [ $SECONDS -lt $SCRIPT_TTL ]; do # loop while script age (in seconds) less than SCRIPT_TTL.
if [ $SECONDS -ge $TICK_SIZE ]; then # if after first tick
OUTPUT_PREVIOUS_SIZE="$OUTPUT_CURRENT_SIZE" ; # save size previous tick
OUTPUT_CURRENT_SIZE=$(getSize "$SUBSCRIPT_OUTPUT_PATH") ; # save size current tick
BYTES_WRITTEN=$(( $OUTPUT_CURRENT_SIZE - $OUTPUT_PREVIOUS_SIZE )) ; # calc size difference between current and previous ticks.
WRITE_SPEED_BYTES_PER_SECOND=$(($BYTES_WRITTEN / $TICK_SIZE)) ; # calc write speed in bytes per second
WRITE_SPEED_KILOBYTES_PER_SECOND=$( echo "scale=3; $WRITE_SPEED_BYTES_PER_SECOND / 1000" | bc -l ) ; # calc write speed in kilobytes per second. See [1], [2].
echo "File size change rate (KB/sec):"$WRITE_SPEED_KILOBYTES_PER_SECOND ;
else # if first tick
OUTPUT_CURRENT_SIZE=$(getSize "$SUBSCRIPT_OUTPUT_PATH") # save size current tick (initial)
fi
sleep "$TICK_SIZE"; # wait a tick
done
SUBSCRIPT_OUTPUT_FINAL_SIZE=$(getSize "$SUBSCRIPT_OUTPUT_PATH") # save final size
# == Display results ==
SUBSCRIPT_OUTPUT_TOTAL_CHANGE_BYTES=$(( $SUBSCRIPT_OUTPUT_FINAL_SIZE - $SUBSCRIPT_OUTPUT_INITIAL_SIZE )) # calc total size change in bytes
SUBSCRIPT_OUTPUT_TOTAL_CHANGE_KILOBYTES=$( echo "scale=3; $SUBSCRIPT_OUTPUT_TOTAL_CHANGE_BYTES / 1000" | bc -l ) # calc total size change in kilobytes. See [1], [2].
echoerr "$SUBSCRIPT_OUTPUT_TOTAL_CHANGE_KILOBYTES kilobytes added to $SUBSCRIPT_OUTPUT_PATH size in $SUBSCRIPT_TTL seconds."
exit 0;
You should get output like this:
baltakatei@debianwork:/tmp$ bash scriptoutmon.sh subscript.sh Distro1Analysis.txt 12 10 2
Running subscript.sh and then monitoring rate of file size changes to Distro1Analysis.txt.
File size change rate (KB/sec):6.302
File size change rate (KB/sec):.351
File size change rate (KB/sec):.376
File size change rate (KB/sec):.345
File size change rate (KB/sec):.335
15.419 kilobytes added to Distro1Analysis.txt size in 10 seconds.
baltakatei@debianwork:/tmp$
Increase $3 and $4 to monitor the script longer (perhaps to let it finish its work).
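The script reports the total size change but not the average rate the question asked for. A minimal sketch of that last step follows; the helper name average_kb_per_s is hypothetical, and it uses 1 KB = 1000 bytes to match the script's own division by 1000.

```shell
#!/bin/bash
# average_kb_per_s BYTES SECONDS (hypothetical helper): print the mean
# write rate in KB/s, with 1 KB = 1000 bytes as in scriptoutmon.sh.
average_kb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "0.000"; exit }   # avoid division by zero
        printf "%.3f\n", bytes / secs / 1000
    }'
}

average_kb_per_s 15419 10   # the 15.419 kilobytes over 10 seconds from the sample run
```

Inside scriptoutmon.sh it could be called after the final size is computed, for example with "$SUBSCRIPT_OUTPUT_TOTAL_CHANGE_BYTES" and "$SUBSCRIPT_TTL" as arguments.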
The second part, its own question really
I'd suggest making it a separate question.

Creating a progress bar for BASH script exporting system log files

Essentially, for a set number of system logs pulled and exported, I need to indicate the script's progress by printing the character "#". This should eventually create a progress bar with a width of 60, something like what's presented below: ############################################# Additionally, I need the characters to build from left to right, indicating the progression of the script.
The Question/Problem that this code was based off of goes as follows: "Use a separate invocation of wevtutil el to get the count of the number of logs and scale this to,say, a width of 60."
SYSNAM=$(hostname)
LOGDIR=${1:-/tmp/${SYSNAM}_logs}
i=0
LOGCOUNT=$( wevtutil el | wc -l )
x=$(( LOGCOUNT/60 ))
wevtutil el | while read ALOG
do
ALOG="${ALOG%$'\r'}"
printf "${ALOG}:\r"
SAFNAM="${ALOG// /_}"
SAFNAM="${SAFNAM//\//-}"
wevtutil epl "$ALOG" "${SYSNAM}_${SAFNAM}.evtx"
done
I've attempted methods such as using echo -ne "#" and printf "#%0.s"; however, the issue I encounter is that the "#" characters get printed with each instance of the name of the log file being retrieved. Also, the pattern is printed vertically rather than horizontally.
LOGCOUNT=$( wevtutil el | wc -l )
x=$(( LOGCOUNT/60 ))
echo -ne "["
for i in {1..60}
do
if [[ $(( x*i )) != $LOGCOUNT ]]
then
echo -ne "#"
#printf '#%0.s'
fi
done
echo "]"
printf "\n"
echo "Transfer Complete."
echo "Total Log Files Transferred: $LOGCOUNT"
I tried previously integrating this code into the first block but no luck. But something tells me that I don't need to establish a whole new loop, I keep thinking that the first block of code only needs a few lines of modification. Anyhow sorry for the lengthy explanation, please let me know if anything additional is needed for assistance--Thank you.
For the sake of this answer I'm going to assume the desired output is a 2-liner that looks something like:
$ statbar
file: /bin/cygdbusmenu-qt5-2.dll
[######## ]
The following may not work for everyone as it comes down to individual terminal attributes and how they can(not) be manipulated by tput (ie, ymmv) ...
For my sample script I'm going to loop through the contents of /bin, printing the name of each file as I process it, while updating the status bar with a new '#' after each 20 files:
there are 719 files under my /bin so there should be 35 #'s in my status bar (I add an extra # at the end once processing has completed)
we'll use a few tput commands to handle cursor/line movement, plus erasing previous output from a line
for printing the status bar I've pre-calculated the number of #'s and then use 2 variables ... $barspace for spaces, $barhash for #'s; for each 20 files I strip a space off $barspace and add a single # to $barhash; by (re)printing these 2x variables every 20x files I get the appearance of a moving status bar
Putting this all together:
$ cat statbar
clear # make sure we have plenty of room to display our status bar;
# if we're at the bottom of the console/window and we cause the
# windows to 'scroll up' then 'tput sc/rc' will not work
tput sc # save pointer/reference to current terminal line
erase=$(tput el) # save control code for 'erase (rest of) line'
# init some variables; get a count of the number of files so we can pre-calculate the total length of our status bar
modcount=20
filecount=$(find /bin -type f | wc -l)
# generate a string of filecount/20+1 spaces (35+1 for my particular /bin)
barspace=
for ((i=1; i<=(filecount/modcount+1); i++))
do
barspace="${barspace} "
done
barhash= # start with no #'s for this variable
filecount=0 # we'll re-use this variable to keep track of # of files processed so need to reset
while read -r filename
do
filecount=$((filecount+1))
tput rc # return cursor to previously saved terminal line (tput sc)
# print filename (1st line of output); if shorter than previous filename we need to erase rest of line
printf "file: ${filename}${erase}\n"
# print our status bar (2nd line of output) on the first and every ${modcount} pass through loop;
if [ ${filecount} -eq 1 ]
then
printf "[${barhash}${barspace}]\n"
elif [[ $((filecount % ${modcount} )) -eq 0 ]]
then
# for every ${modcount}th file we ...
barspace=${barspace:1:100000} # strip a space from barspace
barhash="${barhash}#" # add a '#' to barhash
printf "[${barhash}${barspace}]\n" # print our new status bar
fi
done < <(find /bin -type f | sort -V)
# finish up the status bar (should only be 1 space left to 'convert' to a '#')
tput rc
printf "file: -- DONE --\n"
if [ ${#barspace} -gt 0 ]
then
barspace=${barspace:1:100000}
barhash="${barhash}#"
fi
printf "[${barhash}${barspace}]\n"
NOTE: While testing I had to periodically reset my terminal in order for the tput commands to function properly, eg:
$ reset
$ statbar
I couldn't get the above to work on any of the (internet) fiddle sites (basically having problems getting tput to work with the web-based 'terminals').
Here's a gif displaying the behavior ...
NOTES:
the script does print every filename to stdout but since this script isn't actually doing anything with the files in question a) the printfs occur quite rapidly and b) the video/gif only captures a (relatively) few fleeting images ("Duh, Mark!" ?)
the last printf "file: -- DONE --\n" was added after I created the gif, and I'm being lazy by not generating and uploading a new gif
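The status bar above grows by one # per 20 files, so its width depends on the file count. To scale to a fixed width (60 in the question), the filled portion can be computed as done * width / total. The following is a minimal sketch of that calculation only; render_bar is a hypothetical helper, and it does not use tput.

```shell
#!/bin/bash
# render_bar DONE TOTAL [WIDTH] (hypothetical helper): print a bar such
# as "[#####     ]" whose filled part is DONE/TOTAL of WIDTH columns.
render_bar() {
    local done_count=$1 total=$2 width=${3:-60}
    local filled=0 hashes spaces
    (( total > 0 )) && filled=$(( done_count * width / total ))
    (( filled > width )) && filled=$width
    printf -v hashes '%*s' "$filled" ''              # $filled spaces ...
    printf -v spaces '%*s' "$(( width - filled ))" ''
    printf '[%s%s]' "${hashes// /#}" "$spaces"       # ... turned into #s
}
```

Printing it with a leading carriage return inside the processing loop, as in printf '\r%s' "$(render_bar "$i" "$LOGCOUNT" 60)", redraws the bar on one line, provided nothing else is written to that line in between.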

Shell script: segregate multiple files

I have this in my local directory ~/Report:
Rep_{ReportType}_{Date}_{Seq}.csv
Rep_0001_20150102_0.csv
Rep_0001_20150102_1.csv
Rep_0102_20150102_0.csv
Rep_0503_20150102_0.csv
Rep_0503_20150102_0.csv
Using shell-script,
How do I get multiple files from a local directory with a fixed batch size?
How do I segregate/group the files together by report type (0001 files are grouped together, 0102 grouped together, 0503 grouped together, etc.)
I will generate a sequence file (using forqlift) for EACH group/report type. The output would be Report0001.seq, Report0102.seq, Report0503.seq (3 sequence files). In which I will save to a different directory.
Note: In sequence files, the key is the filename of csv (Rep_0001_20150102.csv), and the value is the content of the file. It is stored as [String, BytesWritable].
This is my code:
1 reportTypes=(0001 0102 8902)
2
3 # collect all files matching expression into an array
4 filesWithDir=(~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-1].csv)
5
6 # take only the first hundred
7 filesWithDir=( "${filesWithDir[@]:0:100}" )
8
9 # files="${filesWithDir[@]##*/}" #### commented out since forqlift cannot create sequence file without the path/to/file
10 # echo ${files[@]}
11
12 shopt -s nullglob
13
14 # Line 21 is commented out since it has a bug. It collects files in
15 # current directory when it should be filtering the "files array" created
16 # in line 7
17
18
19 for i in ${reportTypes[@]}; do
20 val=$i # already zero-padded; printf '%04d' would read 0102 as octal and reject 8902
21 # files=("Rep_${val}_"*.csv)
# solution to BUG: (filter files array)
groupFiles=( $( for j in ${filesWithDir[@]} ; do echo $j ; done | grep ${val} ) )
22
23 # Generate sequence file for EACH Report Type
24 forqlift create --file="Report${val}.seq" "${groupFiles[@]}"
25 done
(Note: The sequence file output should be in current directory, not in ~/Report)
It's easy to take only a subset of an array:
# collect all files matching expression into an array
files=( ~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv )
# take only the first hundred
files=( "${files[@]:0:100}" )
The second part is trickier: Bash has associative arrays ("maps"), but the only legal values which can be stored in arrays are strings -- not other arrays -- so you can't store a list of filenames as a value associated with a single entry (without serializing the array to and from a string -- a moderately tricky thing to do safely, since file paths in UNIX can contain any character other than NUL, newlines included).
It's better, then, to just generate the array as you need it.
shopt -s nullglob # allow a glob to expand to zero arguments
for ((i=1; i<=1000; i++)); do
printf -v val '%04d' "$i" # pad digits: 12 -> 0012
files=( "Rep_${val}_"*.csv ) # collect files that match
## emit NUL-separated list of files, if any were found
#(( ${#files[@]} )) && printf '%s\0' "${files[@]}" >"Reports.$val.txt"
# Create a sequence file with forqlift
forqlift create --file="Reports-${val}.seq" "${files[@]}"
done
If you really don't want to do that, then we can put something together that uses namerefs for indirection:
#!/bin/bash
# This only works with bash 4.3 or newer
re='^Rep_([[:digit:]]{4})_[[:digit:]]{8}\.csv$'
counter=0
for f in *; do
[[ $f =~ $re ]] || continue # skip files not matching regex
if ((++counter > 100)); then break; fi # stop after 100 files
group=${BASH_REMATCH[1]} # retrieve first regex group
declare -g -a "array${group}" # declare an array
declare -n group_arr="array${group}" # redirect group_arr to that array
group_arr+=( "$f" ) # append to the array
done
for varname in "${!array@}"; do
declare -n group_arr="$varname"
## NUL-delimited form
#printf '%s\0' "${group_arr[@]}" \
# >"collection${varname#array}" # write to files named collection0001, etc.
# forqlift sequence file form
forqlift create --file="Reports-${varname#array}.seq" "${group_arr[@]}"
done
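A lighter variant of the "generate the array as you need it" idea is to first discover which report types actually exist, then glob once per type. The sketch below is illustrative only: list_report_types is a hypothetical helper, it assumes bash 4 or newer for associative arrays, and it assumes the Rep_{ReportType}_{Date}_{Seq}.csv naming from the question.

```shell
#!/bin/bash
# list_report_types DIR (hypothetical helper): print each distinct
# 4-digit report type among DIR/Rep_<type>_<date>_<seq>.csv, sorted.
list_report_types() {
    local dir=$1 f name type
    local -A seen=()
    shopt -s nullglob              # note: stays enabled for the caller
    for f in "$dir"/Rep_[0-9][0-9][0-9][0-9]_*.csv; do
        name=${f##*/}              # strip the directory part
        type=${name#Rep_}          # drop the "Rep_" prefix
        type=${type%%_*}           # keep everything before the next "_"
        seen[$type]=1              # keys of the map are the distinct types
    done
    if (( ${#seen[@]} )); then
        printf '%s\n' "${!seen[@]}" | sort
    fi
}
```

Each printed type can then drive a per-type glob such as files=( ~/Report/Rep_"$type"_*.csv ) followed by the forqlift call shown above.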
I would move away from shell scripts and start to look towards perl.
#!/usr/bin/env perl
use strict;
use warnings;
my %groups;
while ( my $filename = glob ( "~/Reports/Rep_*.csv" ) ) {
my ( $group, $id ) = ( $filename =~ m,/Rep_(\d{4})_(\d{8})\.csv$, );
next unless $group; #undefined means it didn't match;
#anything past 100 in a group is discarded:
if ( @{ $groups{$group} // [] } < 100 ) {
push ( @{$groups{$group}}, $filename );
}
}
foreach my $group ( keys %groups ) {
print "$group contains:\n";
print join ("\n", @{$groups{$group}}), "\n";
}
Another alternative is to cobble some bash commands together with a regexp.
See the implementation below.
# Explanation:
# ls -p = list the current directory, appending / to directory names
# grep -v / = ignore subdirectories
# grep -E '^Rep_[0-9]{4}_[0-9]{8}\.csv$' = keep files matching the regexp
# tail -100 = keep the last 100 results
for file in $(ls -p | grep -v / | grep -E '^Rep_[0-9]{4}_[0-9]{8}\.csv$' | tail -100); do
    echo "$file"
    # Use a regexp to extract the desired sequence
    re='^Rep_([[:digit:]]{4})_([[:digit:]]{8})\.csv$'
    if [[ $file =~ $re ]]; then
        sequence=${BASH_REMATCH[1]}
        # Didn't end up using date, but in case you want it
        # date=${BASH_REMATCH[2]}
        # Just in case the sequence file doesn't exist
        if [ ! -f "$sequence" ]; then
            touch "$sequence"
        fi
        # Output/Concat your filename to the sequence file, which you can
        # read in later to do whatever administrative tasks you wish to do
        # to them
        echo "$file" >> "$sequence"
    fi
done

Color escape codes in pretty printed columns

I have a tab-delimited text file which I send to column to "pretty print" a table.
Original file:
1<TAB>blablablabla<TAB>aaaa bbb ccc
2<TAB>blabla<TAB>xxxxxx
34<TAB>okokokok<TAB>zzz yyy
Using column -s$'\t' -t <original file>, I get
1   blablablabla  aaaa bbb ccc
2   blabla        xxxxxx
34  okokokok      zzz yyy
as desired. Now I want to add colors to the columns. I tried to add the escape codes around each tab-delimited field in the original file. column successfully prints in color, but the columns are no longer aligned. Instead, it just prints the TAB separators verbatim.
The question is: how can I get the columns aligned, but also with unique colors?
I've thought of two ways to achieve this:
Adjust the column parameters to make the alignment work with color codes
Redirect the output of column to another file, and do a search+replace on the first two whitespace-delimited fields (the first two columns are guaranteed to not contain spaces; the third column most likely will contain spaces, but no TAB characters)
Problem is, I'm not sure how to do either of those two...
For reference, here is what I'm passing to column:
Note that the fields are indeed separated by TAB characters. I've confirmed this with od.
edit:
There doesn't seem to be an issue with the colorization. I already have the file shown above with the color codes working. The issue is column won't align once I send it input with escape codes. I am thinking of passing the fields without color codes to column, then copying the exact number of spaces column output between each field, and using that in a pretty print scheme.
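The key step in that idea is measuring each field's width without its color codes. A minimal sketch of that measurement follows; visible_length is a hypothetical helper, and it only strips SGR sequences of the form ESC [ ... m, not other escapes a terminal might emit.

```shell
#!/bin/bash
# visible_length STRING (hypothetical helper): print the number of
# characters in STRING once ANSI color sequences (ESC [ ... m) are removed.
visible_length() {
    local esc=$'\033' plain
    plain=$(printf '%s' "$1" | sed "s/${esc}\[[0-9;]*m//g")
    printf '%s\n' "${#plain}"
}

red=$'\033[0;31m'
nc=$'\033[0m'
visible_length "${red}blabla${nc}"   # counts only the letters of "blabla"
```

The padding for a colored field is then the column width minus this visible length, rather than minus the raw string length.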
I wrote a bash version of column (similar to the one from util-linux) which works with color codes:
#!/bin/bash
which sed >> /dev/null || exit 1
version=1.0b
editor="Norman Geist"
last="04 Jul 2016"
# NOTE: Brilliant pipeable tool to format input text into a table by
# NOTE: an configurable seperation string, similar to column
# NOTE: from util-linux, but we are smart enough to ignore
# NOTE: ANSI escape codes in our column width computation
# NOTE: means we handle colors properly ;-)
# BUG : none
addspace=1
seperator=$(echo -e " ")
columnW=()
columnT=()
while getopts "s:hp:v" opt; do
case $opt in
s ) seperator=$OPTARG;;
p ) addspace=$OPTARG;;
v ) echo "Version $version last edited by $editor ($last)"; exit 0;;
h ) echo "column2 [-s seperator] [-p padding] [-v]"; exit 0;;
* ) echo "Unknow comandline switch \"$opt\""; exit 1
esac
done
shift $(($OPTIND-1))
if [ ${#seperator} -lt 1 ]; then
echo "Error) Please enter valid seperation string!"
exit 1
fi
if [ ${#addspace} -lt 1 ]; then
echo "Error) Please enter number of addional padding spaces!"
exit 1
fi
#args: string
function trimANSI()
{
TRIM=$1
TRIM=$(sed 's/\x1b\[[0-9;]*m//g' <<< $TRIM); #trim color codes
TRIM=$(sed 's/\x1b(B//g' <<< $TRIM); #trim sgr0 directive
echo $TRIM
}
#args: len
function pad()
{
for ((i=0; i<$1; i++))
do
echo -n " "
done
}
#read and measure cols
while read ROW
do
while IFS=$seperator read -ra COLS; do
ITEMC=0
for ITEM in "${COLS[@]}"; do
SITEM=$(trimANSI "$ITEM"); #quotes matter O_o
[ ${#columnW[$ITEMC]} -gt 0 ] || columnW[$ITEMC]=0
[ ${columnW[$ITEMC]} -lt ${#SITEM} ] && columnW[$ITEMC]=${#SITEM}
((ITEMC++))
done
columnT[${#columnT[@]}]="$ROW"
done <<< "$ROW"
done
#print formatted output
for ROW in "${columnT[@]}"
do
while IFS=$seperator read -ra COLS; do
ITEMC=0
for ITEM in "${COLS[@]}"; do
WIDTH=$(( ${columnW[$ITEMC]} + $addspace ))
SITEM=$(trimANSI "$ITEM"); #quotes matter O_o
PAD=$(($WIDTH-${#SITEM}))
if [ $ITEMC -ne 0 ]; then
pad $PAD
fi
echo -n "$ITEM"
if [ $ITEMC -eq 0 ]; then
pad $PAD
fi
((ITEMC++))
done
done <<< "$ROW"
echo ""
done
Example usage:
bold=$(tput bold)
normal=$(tput sgr0)
green=$(tput setaf 2)
column2 -s § << END
${bold}First Name§Last Name§City${normal}
${green}John§Wick${normal}§New York
${green}Max§Pattern${normal}§Denver
END
Output example:
I would use awk for the colorization (sed can be used as well):
awk '{printf "\033[1;32m%s\t\033[00m\033[1;33m%s\t\033[00m\033[1;34m%s\033[00m\n", $1, $2, $3;}' a.txt
and pipe it to column for the alignment:
... | column -s$'\t' -t
Output:
A solution using printf to format the output as well:
while IFS=$'\t' read -r c1 c2 c3; do
tput setaf 1; printf '%-10s' "$c1"
tput setaf 2; printf '%-30s' "$c2"
tput setaf 3; printf '%-30s' "$c3"
tput sgr0; echo
done < file
In my case, I wanted to selectively colorize values in a column depending on their value. Let's say I want okokokok to be green and blabla to be red.
I can do it this way (the idea is to colorize the column values after columnization):
GREEN_SED='\\033[0;32m'
RED_SED='\\033[0;31m'
NC_SED='\\033[0m' # No Color
column -s$'\t' -t <original file> | echo -e "$(sed -e "s/okokokok/${GREEN_SED}okokokok${NC_SED}/g" -e "s/blabla/${RED_SED}blabla${NC_SED}/g")"
Alternatively, with a variable:
DATA=$(column -s$'\t' -t <original file>)
GREEN_SED='\\033[0;32m'
RED_SED='\\033[0;31m'
NC_SED='\\033[0m' # No Color
echo -e "$(sed -e "s/okokokok/${GREEN_SED}okokokok${NC_SED}/g" -e "s/blabla/${RED_SED}blabla${NC_SED}/g" <<< "$DATA")"
Note the additional backslash in the color definitions. It is there so that sed does not interpret the original backslash.
This is a result:
2021 Updated BASH Answer
TL;DR
I really liked @NORMAN GEIST's answer, but it was way too slow for what I needed... So I coded my own version of his script, this time written in Perl (stdin looping and formatting) + Bash (only for presentation/help).
You can find the full code here with an explanation on how to use it.
It includes:
A Bash column-like command interface (same parameters like -t, -s, -o)
Exhaustive help with column_ansi --help or column_ansi -h
Option to horizontally center.
The actual "core" code can be boiled down to only the Perl part.
Background and differences
I needed to format a very long awk-generated colored output (more than 300 lines) into a nice table.
I first thought of using column, but I discovered that it doesn't take ANSI characters into consideration, so the output came out misaligned.
After searching a bit on Google I found @NORMAN GEIST's interesting answer on SO, which dynamically calculates the width of every single column in the output after removing the ANSI characters and THEN builds the table.
It was all good, but it was taking way too long to load (as someone pointed out in the comments)...
So I tried to convert @NORMAN GEIST's column2 from bash to perl, and my god, what a change!
After trying out this version in my production script, the time used to display data dropped from 30s to <1s!!
Enjoy!

Resources