Show different context on different grep keyword? - bash

I know -A -B -C could be used to show context around the grep keyword.
My question is, how to show different context on different keyword?
For example, how do I show -A 5 for cat, -B 4 for dog, and -C 1 for monkey:
egrep -A3 "cat|dog|monkey" <file>
// this just shows 3 lines of after-context for each keyword.

I don't think there's any way to do it with a single grep call, but you could run the file through grep once for each keyword and concatenate the output:
var=$(grep -n -A 5 cat file)$'\n'$(grep -n -B 4 dog file)$'\n'$(grep -n -C 1 monkey file)
var=$(sort -un <(echo "$var"))
Now echo "$var" will produce the same output as you would have gotten from your single command, plus line numbers and context indicators (the : prefix indicates a line that matched the pattern exactly, and the - prefix indicates a line included because of the -A, -B and/or -C options).
The reason I included the line numbers is to preserve the order of the results you would have seen had you managed to do this in one statement. If you like them, great; if not, you can use the following line to cut them out:
var=$(cut -d: -f2- <(echo "$var") | cut -d- -f2-)
This passes the output through cut once to strip the prefixes of the exactly matching lines, then again to strip the prefixes of the context lines.
Pretty? No. But it works.
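For convenience, the same idea can be written as a single pipeline (this is just the commands above grouped together, so it inherits the same caveats: grep's -- group separators and hyphens inside matched lines are not handled):
{ grep -n -A 5 cat file; grep -n -B 4 dog file; grep -n -C 1 monkey file; } | sort -un | cut -d: -f2- | cut -d- -f2-
Drop the two cut stages if you want to keep the line numbers and context markers.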

I'm afraid grep won't do that. You'll have to use a different tool. Perhaps write your own program.

Something like this would do it:
awk '
BEGIN{ ARGV[ARGC++] = ARGV[1] }
function prtB(nr) { for (i=FNR-nr; i<FNR; i++) print a[i] }
function prtA(nr) { for (i=FNR+1; i<=FNR+nr; i++) print a[i] }
NR==FNR{ a[NR]=$0; next }
/cat/ { print; prtA(5) }
/dog/ { prtB(4); print }
/monkey/ { prtB(1); print; prtA(1) }
' file
Check the math on the loops in the functions; you didn't say how you'd want to handle lines that contain monkey AND dog, for example.
EDIT: here's an untested solution that prints the maximum context around any match, lets you specify the contexts on the command line, and won't use as much memory as the cheap and cheerful solution above:
awk -v cxts="cat:0:5\ndog:4:0\nmonkey:1:1" '
BEGIN {
    ARGV[ARGC++] = ARGV[1]
    numCxts = split(cxts,cxtsA,RS)
    for (i=1; i<=numCxts; i++) {
        regex = cxtsA[i]
        n = split(regex,rangeA,/:/)
        sub(/:[^:]+:[^:]+$/,"",regex)
        endA[regex] = rangeA[n]
        startA[regex] = rangeA[n-1]
        regexA[regex]
    }
}
NR==FNR {
    for (regex in regexA) {
        if ($0 ~ regex) {
            start = NR - startA[regex]
            end = NR + endA[regex]
            for (i=start; i<=end; i++) {
                prt[i]
            }
        }
    }
    next
}
FNR in prt
' file
Separate the searched-for patterns in the cxts variable with whatever your RS value is, newline by default.


How to do operations depending on the presence of a specific string in bash?

I am working with a csv file, so imagine I have this column:
5;10;>11;20;<14
My desired output would be:
5;10;12;20;13
So I would like to add +1 to those values who have the greater than (>) symbol and to subtract 1 to those values with a lesser than (<) symbol with bash language. I have tried something weird with sed but given that it interprets those changes as strings it didn't work out.
Any suggestions?
With awk (tested with GNU awk):
$ awk -F\; -v OFS=\; '
{
for(i = 1; i <= NF; i++) {
if($i ~ /^<[[:digit:]]+$/) {
sub(/^</,"",$i)
$i--
}
else if($i ~ /^>[[:digit:]]+$/) {
sub(/^>/,"",$i)
$i++
}
}
} 1' <<< "5;10;>11;20;<14"
5;10;12;20;13
Warning: use the following if and only if you trust your input file and you are 100% sure it does not contain malicious fields (see the final note).
With GNU sed (and assuming your shell is bash), a bit shorter but also a bit more difficult to understand (as usual with sed):
$ sed -E '
s/<([[:digit:]]+)/$((\1-1))/g
s/>([[:digit:]]+)/$((\1+1))/g
s/.*/printf "%s\n" "&"/e
' <<< "5;10;>11;20;<14"
5;10;12;20;13
That is (where N is a string of digits), substitute all <N with $((N-1)), all >N with $((N+1)), substitute the resulting string S with printf "%s\n" "S", execute it with bash and replace with the output (this is what the e modifier of the substitute command does). In your example the input string successively becomes:
5;10;>11;20;$((14-1))
5;10;$((11+1));20;$((14-1))
printf "%s\n" "5;10;$((11+1));20;$((14-1))"
5;10;12;20;13
The reason why there is a serious security issue here is that if one of your fields is, for instance, $(rm -rf ~/*), it will simply and recursively delete your entire home directory... So, if you do not control the input, prefer the awk version.
5;10;>11;20;<14
|
{m,g}awk '
BEGIN {
_*=(OFS= "") (__-=_^= FS ="("(\
___="\31\17")"|"(____="\16\24")")+"
} {
gsub(";[<>][0-9]+",____ "&" ___)
gsub(____ ";[<>]", "&" ___)
NF
for(_+=(_^=($_=$_)<"")+_;_<=NF;_++) {
if ($_~"^[0-9]+$") {
$_+=__^($(_+__)~"[<]$")
}
} print $(_=_<_) }'
=
5;10;>12;20;<13

Check if all of multiple strings or regexes exist in a file

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on
In the above example, we could have regexes in place of strings.
For example, the following code checks if any of my strings exists in the file:
if grep -EFq "string1|string2|string3" file; then
# there is at least one match
fi
How to check if all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.
Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or we have a large number of strings to match) or use a tool like awk or python?
Also, is there a solution for strings that can easily be extended for regexes?
Awk is the tool that the guys who invented grep, shell, etc. invented to do general text manipulation jobs like this, so I'm not sure why you'd want to try to avoid it.
In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:
awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file
And here's a bunch of other information and options:
Assuming you're really looking for strings, it'd be:
awk -v strings='string1 string2 string3' '
BEGIN {
numStrings = split(strings,tmp)
for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
for (str in strs) {
if ( index($0,str) ) {
delete strs[str]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file
The above will stop reading the file as soon as all strings have matched.
If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:
awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file
Actually, even if it were strings you could do:
awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file
The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.
I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:
awk '
NR==FNR { strings[$0]; next }
{
for (string in strings)
if ( !index($0,string) )
exit 1
}
' file_of_strings RS='^$' file_to_be_searched
and for regexps it'd be:
awk '
NR==FNR { regexps[$0]; next }
{
for (regexp in regexps)
if ( $0 !~ regexp )
exit 1
}
' file_of_regexps RS='^$' file_to_be_searched
If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending to a variable one line at a time as it's read and then processing that variable in the END section.
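A minimal sketch of that last idea (accumulate the lines in a variable and test in the END block), which should work in any POSIX awk, with the strings hard-coded as in the earlier examples:
awk '{ all = all $0 RS }
END { exit !(index(all,"string1") && index(all,"string2") && index(all,"string3")) }' file
Like the RS='^$' versions, this still holds the whole file in memory; it just doesn't rely on GNU awk's multi-char RS.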
If your file_to_be_searched is too large to fit in memory then it'd be this for strings:
awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
for (string in strings) {
if ( index($0,string) ) {
delete strings[string]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched
and the equivalent for regexps:
awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
for (regexp in regexps) {
if ( $0 ~ regexp ) {
delete regexps[regexp]
numRegexps--
}
}
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
git grep
Here is the syntax using git grep with multiple patterns:
git grep --all-match --no-index -l -e string1 -e string2 -e string3 file
You may also combine patterns with Boolean expressions such as --and, --or and --not.
Check man git-grep for help.
--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.
--no-index Search files in the current directory that is not managed by Git.
-l/--files-with-matches/--name-only Show only the names of files.
-e The next parameter is the pattern. Default is to use basic regexp.
Other params to consider:
--threads Number of grep worker threads to use.
-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.
To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and other.
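Putting the options above together so the exit status drives shell logic might look like this (the strings and the file name are placeholders):
if git grep -q --all-match --no-index -e string1 -e string2 -e string3 -- file; then
    echo "all strings present"
fi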
This gnu-awk script may work:
cat fileSearch.awk
re == "" {
exit
}
{
split($0, null, "\\<(" re "\\>)", b)
for (i=1; i<=length(b); i++)
gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
exit (re != "")
}
Then use it as:
if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
echo "all strings were found"
else
echo "all strings were not found"
fi
Alternatively, you can use this gnu grep solution with PCRE option:
grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file
Using -z we make grep read the complete file into a single string.
We are using multiple lookahead assertions to assert that all the strings are present in the file.
The regex must use (?s), the DOTALL modifier, to make .* match across lines.
As per man grep:
-z, --null-data
Treat input and output data as sequences of lines, each terminated by a
zero byte (the ASCII NUL character) instead of a newline.
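If the strings live in a file, one per line, the lookahead pattern can be generated instead of typed by hand. A rough sketch, assuming the strings contain no regex metacharacters and are stored in strings.txt (a hypothetical file name):
pattern="(?s)$(sed 's/.*/(?=.*\\b&\\b)/' strings.txt | tr -d '\n')"
grep -qzP "$pattern" file && echo "all strings found"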
First, you probably want to use awk. Since you eliminated that option in the question statement, yes, it is possible to do, and this answer provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...
This is based on the following assumptions:
Invoking AWK is unacceptable
Invoking grep multiple times is unacceptable
The use of any other external tools are unacceptable
Invoking grep less than once is acceptable
It must return success if everything is found, failure when not
Using bash instead of external tools is acceptable
bash version is >= 3 for the regular expression version
This might meet all of your requirements (the regex version misses some comments; look at the string version instead):
#!/bin/bash
multimatch() {
    filename="$1"                       # Filename is first parameter
    shift                               # move it out of the way so that "$@" is useful
    strings=( "$@" )                    # search strings into an array
    declare -a matches                  # Array to keep track of which strings already match
    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done
    while IFS= read -r line; do         # Read file linewise
        foundmatch=0                    # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do        # Loop through string indexes
            if [ "${matches[$i]}" -eq 0 ]; then     # If no previous line matched this string yet
                string="${strings[$i]}"             # fetch the string
                if [[ $line = *$string* ]]; then    # check if it matches
                    matches[$i]=1       # mark that we have found this
                    foundmatch=1        # set the flag; we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0             # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1     # Something is still outstanding
                    break               # no need to check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"
    # If we get here, we didn't have everything in the file
    return 1
}
multimatch_regex() {
    filename="$1"                       # Filename is first parameter
    shift                               # move it out of the way so that "$@" is useful
    regexes=( "$@" )                    # Regexes into an array
    declare -a matches                  # Array to keep track of which regexes already match
    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done
    while IFS= read -r line; do         # Read file linewise
        foundmatch=0                    # Flag to indicate whether this line matched anything
        for ((i=0;i<${#regexes[@]};i++)); do        # Loop through regex indexes
            if [ "${matches[$i]}" -eq 0 ]; then     # If no previous line matched this regex yet
                regex="${regexes[$i]}"              # Get regex from array
                if [[ $line =~ $regex ]]; then      # We use the bash regex operator here
                    matches[$i]=1       # mark that we have found this
                    foundmatch=1        # set the flag; we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0             # Flag to see if we still have unmatched regexes
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1     # Something is still outstanding
                    break               # no need to check whether more regexes are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"
    # If we get here, we didn't have everything in the file
    return 1
}
if multimatch "filename" string1 string2 string3; then
echo "file has all strings"
else
echo "file miss one or more strings"
fi
if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
echo "file match all regular expressions"
else
echo "file does not match all regular expressions"
fi
Benchmarks
I did some benchmarking, searching .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added and the code tuned so that all can be called as testname <filename> <searchstring> [...] and so that an if can be used to check the result.)
Results: (measured with time, real time rounded to closest half second)
multimatch: 49s
multimatch_regex: 55s
matchall: 10.5s
fileMatchesAllNames: 4s
awk (first version): 4s
agrep: 4.5s
Perl re (-r): 10.5s
Perl non-re: 9.5s
Perl non-re optimised: 5s (Removed Getopt::Std and regex support for faster startup)
Perl re optimised: 7s (Removed Getopt::Std and non-regex support for faster startup)
git grep: 3.5s
C version (no regex): 1.5s
(Invoking grep multiple times, especially with the recursive method, did better than I expected)
A recursive solution. Iterate over the files one by one. For each file, check if it matches the first pattern and break early (-m1: stop at first match); only if it matched the first pattern, search for the second pattern, and so on:
#!/bin/bash
patterns="$@"
fileMatchesAllNames () {
    file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        pattern=$1
        shift
        grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
    fi
}
for file in *
do
    test -f "$file" && fileMatchesAllNames "$file" $patterns
done
Usage:
./allfilter.sh cat filter java
test.sh
Searches in the current dir for the tokens "cat", "filter" and "java". Found them only in "test.sh".
So grep is invoked often in the worst case scenario (finding the first N-1 patterns in the last line of each file, except for the N-th pattern).
But with an informed ordering (rarely matching patterns first, early matches first) if possible, the solution should be reasonably fast, since many files are abandoned early because they didn't match the first keyword, or accepted early, as they matched a keyword close to the top.
Example: you search a scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position); you would order them:
./allfilter.sh mutable tailrec main println
Performance:
ls *.scala | wc
89 89 2030
In 89 scala files, I have the keywords distribution:
for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done
16
34
41
71
Searching them with a slightly modified version of the script, which allows using a file pattern as the first argument, takes about 0.2 s:
time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala
real 0m0.216s
user 0m0.024s
sys 0m0.028s
in close to 15,000 lines of code:
cat *.scala | wc
14913 81614 610893
Update:
After reading in the comments to the question that we might be talking about thousands of patterns, handing them as arguments doesn't seem to be a clever idea; better to read them from a file and pass the filename as an argument - maybe the list of files to filter too:
#!/bin/bash
filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"
fileMatchesAllNames () {
    file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        pattern=$1
        shift
        grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
    fi
}
echo -e "Filelist: $filelist\tPatterns: $patterns"
for file in $(< $filelist)
do
    test -f "$file" && fileMatchesAllNames "$file" $patterns
done
If the number and length of patterns/files exceeds the limits of argument passing, the list of patterns could be split into many pattern files and processed in a loop (for example, with 20 pattern files):
for i in {1..20}
do
./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done
You can
make use of the -o|--only-matching option of grep (which forces it to output only the matched parts of a matching line, with each such part on a separate output line),
then eliminate duplicate occurrences of matched strings with sort -u,
and finally check that the count of remaining lines equals the count of the input strings.
Demonstration:
$ cat input
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on
$ grep -o -F $'string1\nstring2\nstring3' input|sort -u|wc -l
3
$ grep -o -F $'string1\nstring3' input|sort -u|wc -l
2
$ grep -o -F $'string1\nstring2\nfoo' input|sort -u|wc -l
2
One shortcoming with this solution (failing to meet the partial matches should be OK requirement) is that grep doesn't detect overlapping matches. For example, although the text abcd matches both abc and bcd, grep finds only one of them:
$ grep -o -F $'abc\nbcd' <<< abcd
abc
$ grep -o -F $'bcd\nabc' <<< abcd
abc
Note that this approach/solution works only for fixed strings. It cannot be extended for regexes, because a single regex can match multiple different strings and we cannot track which match corresponds to which regex. The best you can do is store the matches in a temporary file, and then run grep multiple times using one regex at a time.
The solution implemented as a bash script:
matchall:
#!/usr/bin/env bash
if [ $# -lt 2 ]
then
    echo "Usage: $(basename "$0") input_file string1 [string2 ...]"
    exit 1
fi
function find_all_matches()
(
    infile="$1"
    shift
    IFS=$'\n'
    newline_separated_list_of_strings="$*"
    grep -o -F "$newline_separated_list_of_strings" "$infile"
)
string_count=$(($# - 1))
matched_string_count=$(find_all_matches "$@" | sort -u | wc -l)
if [ "$matched_string_count" -eq "$string_count" ]
then
    echo "ALL strings matched"
    exit 0
else
    echo "Some strings DID NOT match"
    exit 1
fi
Demonstration:
$ ./matchall
Usage: matchall input_file string1 [string2 ...]
$ ./matchall input string1 string2 string3
ALL strings matched
$ ./matchall input string1 string2
ALL strings matched
$ ./matchall input string1 string2 foo
Some strings DID NOT match
The easiest way for me to check if the file has all three patterns is to get only matched patterns, output only unique parts and count lines.
Then you will be able to check it with a simple Test condition: test 3 -eq $grep_lines.
grep_lines=$(grep -Eo 'string1|string2|string3' file | uniq | wc -l)
Regarding your second question, I don't think it's possible to stop reading the file as soon as more than one pattern is found. I've read the man page for grep and there are no options that could help with that. You can only stop reading lines after a specific number of matching lines with the option grep -m [number], which happens regardless of which patterns matched.
Pretty sure that a custom function is needed for that purpose.
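For completeness, the full check along those lines could look like this (sort -u is used instead of uniq, since uniq only collapses adjacent duplicates; the three strings are placeholders):
grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)
if test 3 -eq "$grep_lines"; then
    echo "all three strings are present"
fi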
It's an interesting problem, and there's nothing obvious in the grep man page to suggest an easy answer. There might be an insane regex that would do it, but it may be clearer with a straightforward chain of greps, even though that ends up scanning the file n times. At least the -q option makes each one bail at the first match, and the && will short-circuit evaluation if one of the strings is not found.
$ grep -Fq string1 t && grep -Fq string2 t && grep -Fq string3 t
$ echo $?
0
$ grep -Fq string1 t && grep -Fq blah t && grep -Fq string3 t
$ echo $?
1
Perhaps with gnu sed
cat match_word.sh
sed -z '
/\b'"$2"'/!bA
/\b'"$3"'/!bA
/\b'"$4"'/!bA
/\b'"$5"'/!bA
s/.*/0\n/
q
:A
s/.*/1\n/
' "$1"
and you call it like this:
./match_word.sh infile string1 string2 string3
It returns 0 if all matches are found, else 1.
Here you can look for 4 strings; if you want more, you can add lines like
/\b'"$x"'/!bA
Just for solution completeness, you can use a different tool and avoid multiple greps and awk/sed or big (and probably slow) shell loops; such a tool is agrep.
agrep is actually a kind of egrep that also supports an AND operation between patterns, using ; as the pattern separator.
Like egrep and most of the well-known tools, agrep operates on records/lines, and thus we still need a way to treat the whole file as a single record.
Moreover agrep provides a -d option to set your custom record delimiter.
Some tests:
$ cat file6
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
$ agrep -d '$$\n' 'str3;str2;str1;str4' file6;echo $?
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
0
$ agrep -d '$$\n' 'str3;str2;str1;str4;str5' file6;echo $?
1
$ agrep -p 'str3;str2;str1' file6 #-p prints lines containing all three patterns in any position
str1 str2 str3
str3 str1 str2
No tool is perfect, and agrep also has some limitations: you can't use a regex/pattern longer than 32 chars, and some options are not available when used with regexps - all this is explained in the agrep man page.
Ignoring the "Is it possible to do it without ... or use a tool like awk or python?" requirement, you can do it with a Perl script:
(Use an appropriate shebang for your system or something like /bin/env perl)
#!/usr/bin/perl
use Getopt::Std; # option parsing

my %opts;
my $filename;
my @patterns;
getopts('rf:', \%opts); # Allowing -f <filename> and -r to enable regex processing

if ($opts{'f'}) { # if -f is given
    $filename = $opts{'f'};
    @patterns = @ARGV[0 .. $#ARGV]; # Use everything else as patterns
} else { # Otherwise
    $filename = $ARGV[0];           # First parameter is filename
    @patterns = @ARGV[1 .. $#ARGV]; # Rest is patterns
}
my $use_re = $opts{'r'}; # Flag on whether patterns are regex or not

open(INF, '<', $filename) or die("Can't open input file '$filename'");
while (my $line = <INF>) {
    my @removal_list = (); # List of stuff that matched that we don't want to check again
    for (my $i = 0; $i <= $#patterns; $i++) {
        my $pattern = $patterns[$i];
        if (($use_re && $line =~ /$pattern/) ||           # regex match
            (!$use_re && index($line, $pattern) >= 0)) {  # or string search
            push(@removal_list, $i); # Mark to be removed
        }
    }
    # Now remove everything we found this time
    # We need to work backwards to keep us from messing
    # with the list while we're busy
    for (my $i = $#removal_list; $i >= 0; $i--) {
        splice(@patterns, $removal_list[$i], 1);
    }
    if (scalar(@patterns) == 0) { # If we don't need to match anything anymore
        close(INF) or warn("Error closing '$filename'");
        exit(0); # We found everything
    }
}
# End of file
close(INF) or die("Error closing '$filename'");
exit(1); # If we reach this, we haven't matched everything
If saved as matcher.pl, this will search for plain text strings:
./matcher filename string1 string2 string3 'complex string'
This will search for regular expressions:
./matcher -r filename regex1 'regex2' 'regex4'
(The filename can be given with -f instead):
./matcher -f filename -r string1 string2 string3 'complex string'
It is limited to single line matching patterns (due to dealing with the file linewise).
The performance, when called for lots of files from a shell script, is slower than awk (but search patterns can contain spaces, unlike the ones passed space-separated via -v to awk). If converted to a function and called from Perl code (with a file containing a list of files to search), it should be much faster than most awk implementations. (When called on several smallish files, the Perl startup time (parsing, etc. of the script) dominates the timing.)
It can be sped up significantly by hardcoding whether regular expressions are used or not, at the cost of flexibility. (See my benchmarks here to see what effect removing Getopt::Std has)
perl -lne '%m = (%m, map {$_ => 1} m!\b(string1|string2|string3)\b!g); END { print scalar keys %m == 3 ? "Match": "No Match"}' file
In Python, using the fileinput module allows the files to be specified on the command line or the text to be read line by line from stdin. You could hard-code the strings into a Python list.
# Strings to match, must be valid regular expression patterns
# or be escaped when compiled into regex below.
strings = (
r'string1',
r'string2',
r'string3',
)
or read the strings from another file
import re
from fileinput import input, filename, nextfile, isfirstline
for line in input():
    if isfirstline():
        regexs = map(re.compile, strings) # new file, reload all strings
    # keep only strings that have not been seen in this file
    regexs = [rx for rx in regexs if not rx.match(line)]
    if not regexs: # found all strings
        print filename()
        nextfile()
Assuming all your strings to check are in a file strings.txt, and the file you want to check is input.txt, the following one-liner will do (answer updated based on comments):
$ diff <( sort -u strings.txt ) <( grep -o -f strings.txt input.txt | sort -u )
Explanation :
Use grep's -o option to match only the strings you are interested in. This gives all the strings that are present in the file input.txt. Then use diff to get the strings that are not found. If all the strings were found, the result would be nothing. Or, just check the exit code of diff.
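Checking the exit code rather than reading the output could look like this sketch (same file names as above):
if diff <(sort -u strings.txt) <(grep -o -f strings.txt input.txt | sort -u) >/dev/null; then
    echo "all strings found"
else
    echo "some strings missing"
fi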
What it does not do :
Exit as soon as all matches are found.
Extendable to regexes.
Overlapping matches.
What it does do :
Find all matches.
Single call to grep.
Does not use awk or python.
Many of these answers are fine as far as they go.
But if performance is an issue -- certainly possible if the input is large and you have many thousands of patterns -- then you'll get a large speedup using a tool like lex or flex that generates a true deterministic finite automaton as a recognizer rather than calling a regex interpreter once per pattern.
The finite automaton will execute a few machine instructions per input character regardless of the number of patterns.
A no-frills flex solution:
%{
void match(int);
%}
%option noyywrap
%%
"abc" match(0);
"ABC" match(1);
[0-9]+ match(2);
/* Continue adding regex and exact string patterns... */
[ \t\n] /* Do nothing with whitespace. */
. /* Do nothing with unknown characters. */
%%
// Total number of patterns.
#define N_PATTERNS 3
int n_matches = 0;
int counts[10000];
void match(int n) {
if (counts[n]++ == 0 && ++n_matches == N_PATTERNS) {
printf("All matched!\n");
exit(0);
}
}
int main(void) {
yyin = stdin;
yylex();
printf("Only matched %d patterns.\n", n_matches);
return 1;
}
A downside is that you'd have to build this for every given set of patterns. That's not too bad:
flex matcher.y
gcc -O lex.yy.c -o matcher
Now run it:
./matcher < input.txt
The following python script should do the trick. It kind of does call the equivalent of grep (re.search) multiple times for each line -- i.e. it searches every pattern for each line -- but since you are not forking out a process each time, it should be much more efficient. Also, it removes the patterns which have already been found and stops when all of them have been found.
#!/usr/bin/env python
import re
# the file to search
filename = '/path/to/your/file.txt'
# list of patterns -- can be read from a file or command line
# depending on the count
patterns = [r'py.*$', r'\s+open\s+', r'^import\s+']
patterns = map(re.compile, patterns)
with open(filename) as f:
    for line in f:
        # search for pattern matches
        results = map(lambda x: x.search(line), patterns)
        # remove the patterns that did match
        results = zip(results, patterns)
        results = filter(lambda x: x[0] == None, results)
        patterns = map(lambda x: x[1], results)
        # stop if no more patterns are left
        if len(patterns) == 0:
            break
# print the patterns which were not found
for p in patterns:
    print p.pattern
You can add a separate check for plain strings (string in line) if you are dealing with plain (non-regex) strings -- it will be slightly more efficient.
Does that solve your problem?
One more Perl variant - whenever all given strings match, the processing completes and just prints the result, even if the file has only been read halfway through:
> perl -lne ' /\b(string1|string2|string3)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}' all_match.txt
Match
> perl -lne ' /\b(string1|string2|stringx)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}' all_match.txt
No Match
First delete the line separators, and then chain normal grep, once per pattern, as below.
Example: Let the file content be as below
PAT1
PAT2
PAT3
something
somethingelse
cat file | tr -d "\n" | grep "PAT1" | grep "PAT2" | grep -c "PAT3"
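To turn that chain into a yes/no test, make the last grep quiet and check the pipeline's exit status (a sketch; the cat is replaced by input redirection):
if tr -d '\n' < file | grep "PAT1" | grep "PAT2" | grep -q "PAT3"; then
    echo "all three patterns present"
fi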
For plain speed, with no external tool limitations, and no regexes, this (crude) C version does a decent job. (Possibly Linux only, although it should work on all Unix-like systems with mmap)
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
/* https://stackoverflow.com/a/8584708/1837991 */
inline char *sstrstr(char *haystack, char *needle, size_t length)
{
size_t needle_length = strlen(needle);
size_t i;
for (i = 0; i < length; i++) {
if (i + needle_length > length) {
return NULL;
}
if (strncmp(&haystack[i], needle, needle_length) == 0) {
return &haystack[i];
}
}
return NULL;
}
int matcher(char * filename, char ** strings, unsigned int str_count)
{
int fd;
struct stat sb;
char *addr;
unsigned int i = 0; /* Used to keep us from running off the end of strings into SIGSEGV */
fd = open(filename, O_RDONLY);
if (fd == -1) {
fprintf(stderr,"Error '%s' with open on '%s'\n",strerror(errno),filename);
return 2;
}
if (fstat(fd, &sb) == -1) { /* To obtain file size */
fprintf(stderr,"Error '%s' with fstat on '%s'\n",strerror(errno),filename);
close(fd);
return 2;
}
if (sb.st_size <= 0) { /* zero byte file */
close(fd);
return 1; /* 0 byte files don't match anything */
}
/* mmap the file. */
addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (addr == MAP_FAILED) {
fprintf(stderr,"Error '%s' with mmap on '%s'\n",strerror(errno),filename);
close(fd);
return 2;
}
while (i++ < str_count) {
char * found = sstrstr(addr,strings[0],sb.st_size);
if (found == NULL) { /* If we haven't found this string, we can't find all of them */
munmap(addr, sb.st_size);
close(fd);
return 1; /* so give the user an error */
}
strings++;
}
munmap(addr, sb.st_size);
close(fd);
return 0; /* if we get here, we found everything */
}
int main(int argc, char *argv[])
{
char *filename;
char **strings;
unsigned int str_count;
if (argc < 3) { /* Lets count parameters at least... */
fprintf(stderr,"%i is not enough parameters!\n",argc);
return 2;
}
filename = argv[1]; /* First parameter is filename */
strings = argv + 2; /* Search strings start from 3rd parameter */
str_count = argc - 2; /* strings are two ($0 and filename) less than argc */
return matcher(filename,strings,str_count);
}
Compile it with:
gcc matcher.c -o matcher
Run it with:
./matcher filename needle1 needle2 needle3
Credits:
uses sstrstr
File handling mostly stolen from the mmap man page
Notes:
It will scan through the parts of the file preceding the matched strings multiple times - but it will only open the file once.
The entire file might end up loaded into memory, especially if a string doesn't match; the OS needs to decide that.
Regex support can probably be added by using the POSIX regex library (performance would likely be slightly better than grep - it should be based on the same library, and you would gain reduced overhead from opening the file only once while searching for multiple regexes).
Files containing nulls should work; search strings containing them will not, though...
All characters other than null should be searchable (\r, \n, etc.)
I didn't see a simple counter among the answers, so here is a counter-oriented solution using awk that stops as soon as all matches are satisfied:
/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
if (c + a + b == 3) {
print "Found!";
exit;
}
}
A generic script
to expand usage through shell arguments:
#! /bin/sh
awk -v vars="$*" -v argc=$# '
BEGIN { split(vars, args); }
{
for (arg in args) {
if (!temp[arg] && $0 ~ args[arg]) {
inc++;
temp[arg] = 1;
}
}
if (inc == argc) {
print "Found!";
exit;
}
}
END { exit 1; }
' filename
Usage (in which you can pass Regular Expressions):
./script "str1?" "(wo)?men" str3
or to apply a string of patterns:
./script "str1? (wo)?men str3"
$ cat allstringsfile | tr '\n' ' ' | awk -f awkpattern1
Where allstringsfile is your text file, as in the original question.
awkpattern1 contains the string patterns, with && condition:
$ cat awkpattern1
/string1/ && /string2/ && /string3/
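The same idea can report via exit status instead of printed output, so it drops straight into an if (a small sketch with the strings hard-coded):
if tr '\n' ' ' < allstringsfile | awk '/string1/ && /string2/ && /string3/ { found=1 } END { exit !found }'; then
    echo "all strings present"
fi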

Parse out key=value pairs into variables

I have a bunch of different kinds of files I need to look at periodically, and what they have in common is that the lines have a bunch of key=value type strings. So something like:
Version=2 Len=17 Hello Var=Howdy Other
I would like to be able to reference the names directly from awk... so something like:
cat some_file | ... | awk '{print Var, $5}' # prints Howdy Other
How can I go about doing that?
The closest you can get is to parse the variables into an associative array as the first thing on every line. That is to say:
awk '{ delete vars; for(i = 1; i <= NF; ++i) { n = index($i, "="); if(n) { vars[substr($i, 1, n - 1)] = substr($i, n + 1) } } Var = vars["Var"] } { print Var, $5 }'
More readably:
{
    delete vars; # clean up previous variable values
    for(i = 1; i <= NF; ++i) {  # walk through fields
        n = index($i, "=");     # search for =
        if(n) {                 # if there is one:
            # remember value by name. The reason I use
            # substr over split is the possibility of
            # something like Var=foo=bar=baz (that will
            # be parsed into a variable Var with the
            # value "foo=bar=baz" this way).
            vars[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
    }
    # if you know precisely what variable names you expect to get, you can
    # assign to them here:
    Var = vars["Var"]
    Version = vars["Version"]
    Len = vars["Len"]
}
{
    print Var, $5 # then use them in the rest of the code
}
$ cat file | sed -r 's/[[:alnum:]]+=/\n&/g' | awk -F= '$1=="Var"{print $2}'
Howdy Other
Or, avoiding the useless use of cat:
$ sed -r 's/[[:alnum:]]+=/\n&/g' file | awk -F= '$1=="Var"{print $2}'
Howdy Other
How it works
sed -r 's/[[:alnum:]]+=/\n&/g'
This places each key,value pair on its own line.
awk -F= '$1=="Var"{print $2}'
This reads the key-value pairs. Since the field separator is chosen to be =, the key ends up as field 1 and the value as field 2. Thus, we just look for lines whose first field is Var and print the corresponding value.
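The same pipeline generalises to any key by passing the key name in with -v instead of hard-coding it (Len here is just an example key):
sed -r 's/[[:alnum:]]+=/\n&/g' file | awk -F= -v key=Len '$1 == key { print $2 }'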
Since discussion in commentary has made it clear that a pure-bash solution would also be acceptable:
#!/bin/bash
case $BASH_VERSION in
''|[0-3].*) echo "ERROR: Bash 4.0 required" >&2; exit 1;;
esac
while read -r -a words; do              # iterate over lines of input
    declare -A vars=( )                 # refresh variables for each line
    set -- "${words[@]}"                # update positional parameters
    for word; do
        if [[ $word = *"="* ]]; then        # if a word contains an "="...
            vars[${word%%=*}]=${word#*=}    # ...then set it as an associative-array key
        fi
    done
    echo "${vars[Var]} $5"              # Here, we use content read from that line.
done <<<"Version=2 Len=17 Hello Var=Howdy Other"
The <<<"Input Here" could also be <file.txt, in which case lines in the file would be iterated over.
If you wanted to use $Var instead of ${vars[Var]}, then substitute printf -v "${word%%=*}" %s "${word#*=}" in place of vars[${word%%=*}]=${word#*=}, and remove references to vars elsewhere. Note that this doesn't allow for a good way to clean up variables between lines of input, as the associative-array approach does.
I will try to explain a very generic way to do this which you can adapt easily if you want to print out other stuff.
Assume you have a string which has a format like this:
key1=value1 key2=value2 key3=value3
or more generic
key1_fs2_value1_fs1_key2_fs2_value2_fs1_key3_fs2_value3
With fs1 and fs2 two different field separators.
You would like to make a selection or perform some operations with these values. To do this, the easiest approach is to store them in an associative array:
array["key1"] => value1
array["key2"] => value2
array["key3"] => value3
array["key1","full"] => "key1=value1"
array["key2","full"] => "key2=value2"
array["key3","full"] => "key3=value3"
This can be done with the following function in awk:
function str2map(str,fs1,fs2,map, n,tmp) {
    n=split(str,map,fs1)
    for (;n>0;n--) {
        split(map[n],tmp,fs2);
        map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
        delete map[n]
    }
}
So, after processing the string, you have the full flexibility to do operations in any way you like:
awk '
function str2map(str,fs1,fs2,map, n,tmp) {
n=split(str,map,fs1)
for (;n>0;n--) {
split(map[n],tmp,fs2);
map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
delete map[n]
}
}
{ str2map($0," ","=",map) }
{ print map["Var","full"] }
' file
The advantage of this method is that you can easily adapt the code to print any other key you are interested in, or even make selections based on it, for example:
(map["Version"] < 3) { print map["var"]/map["Len"] }
The simplest and easiest way is to use string substitution like this:
property='my.password.is=1234567890=='
name=${property%%=*}
value=${property#*=}
echo "'$name' : '$value'"
The output is:
'my.password.is' : '1234567890=='
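Applied to the question's sample line, the same two expansions inside a loop give one name/value pair per word (a quick sketch; words without an = are skipped):
line='Version=2 Len=17 Hello Var=Howdy Other'
for word in $line; do
    [[ $word == *=* ]] && echo "'${word%%=*}' : '${word#*=}'"
done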
Using bash's set command, we can split the line into positional parameters like awk.
For each word, we'll try to read a name value pair delimited by =.
When we find a value, assign it to the variable named $key using bash's printf -v feature.
#!/usr/bin/env bash
line='Version=2 Len=17 Hello Var=Howdy Other'
set $line
for word in "$@"; do
IFS='=' read -r key val <<< "$word"
test -n "$val" && printf -v "$key" "$val"
done
echo "$Var $5"
output
Howdy Other
SYNOPSIS
an awk-based solution that doesn't require manually checking the fields to locate the desired key pair :
approach being avoid splitting unnecessary fields or arrays - only performing regex match via function call when needed
only returning FIRST occurrence of input key value. Subsequent matches along the row are NOT returned
I just called it S() because it's the closest letter to $
I only included an array (_) of the 3 test values for demo purposes. Those aren't needed. In fact, no state information is being kept at all
caveat being : key-match must be exact - this version of the code isn't for case-insensitive or fuzzy/agile matching
Tested and confirmed working on
- gawk 5.1.1
- mawk 1.3.4
- mawk-2/1.9.9.6
- macos nawk
CODE
# gawk profile, created Fri May 27 02:07:53 2022
{m,n,g}awk '
function S(__,_) {
return \
! match($(_=_<_), "(^|["(_="[:blank:]]")")"(__)"[=][^"(_)"*") \
? "^$" \
: substr(__=substr($-_, RSTART, RLENGTH), index(__,"=")+_^!_)
}
BEGIN { OFS = "\f" # This array is only for testing
_["Version"] _["Len"] _["Var"] # purposes. Feel free to discard at will
} {
for (__ in _) {
print __, S(__) } }'
OUTPUT
Var
Howdy
Len
17
Version
2
So either call the fields in BAU fashion - $5, $0, $NF, etc. - or call S(QUOTED_KEY_VALUE), case-sensitive, like S("Version") to get back 2.
As a safeguard, to prevent mis-interpreting null strings or invalid inputs as $0, a non-match returns ^$ instead of an empty string.
As a bonus, it can safely handle values in multibyte Unicode, both for values and even for keys, regardless of whether your awk is UTF-8-aware or not:
1 ✜
🤡
2 Version
2
3 Var
Howdy
4 Len
17
5 ✜=🤡 Version=2 Len=17 Hello Var=Howdy Other
I know this question is specifically about awk, but I'm mentioning this because many people come here for solutions to break down name=value pairs (with or without awk as such).
I found the approach below simple, straightforward and very effective in managing multiple spaces/commas as well -
Source: http://jayconrod.com/posts/35/parsing-keyvalue-pairs-in-bash
changes="foo=red bar=green baz=blue"
# use below if the variable is in CSV form (instead of space as the delimiter)
changes=`echo $changes | tr ',' ' '`
for change in $changes; do
    set -- `echo $change | tr '=' ' '`
    echo "variable name == $1 and variable value == $2"
    # can assign the value to a variable like below
    eval my_var_$1=$2;
done

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ASCII file "table.txt" which contains lines of blank-space-separated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{ line=$0; for(i in subs)
{ gsub(i,subs[i],line); }
print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Who can explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ASCII-encoded phonetic symbols which I want to convert into LaTeX code. The ASCII encoding of the symbols contains {$,&,-,%,[a-z],[0-9],...}.)
Any comments and suggestions are welcome!
PS:
Of course, in this application, given a substitution table.txt:
aa ab
a 1
an original string "aa" should be converted into "ab" and not "1b". That means a string which was produced by applying a rule must be left untouched.
How do I account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
Also note the caveats etc. with getline listed on http://awk.info/?tip/getline and consider not using that manually but instead using NR==FNR{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays: one mapping input file line number to the patterns to match, and another mapping patterns to replacements. Then looping over the line-number array will get you the pattern, and the pattern can be used in the second array to get the replacement (for gsub).
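A minimal sketch of that two-array idea, reading table.txt as the first file argument instead of using getline (replacements are applied in the table's line order, so longer patterns should be listed first; like the getline version below, it still has the "1b" problem from the PS):
awk 'NR==FNR { pat[NR]=$1; rep[$1]=$2; n=NR; next }
{ for (i=1; i<=n; i++) gsub(pat[i], rep[pat[i]]); print }' table.txt -
The trailing - makes awk read the text to convert from stdin, so echo aaa | awk '...' table.txt - prints 3.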
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {i=0; while (getline < file) { subs[i]=$1; repl[i++]=$2}
n = i}
{ for(i=0;i<n;i++) { gsub(subs[i],repl[i]); }
print tolower($0);
}
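Saved as substitute.awk, this version is invoked exactly as in the question (output shown for the question's table.txt):
echo aaa | awk -v file="table.txt" -f substitute.awk
3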
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;
my %subs;
BEGIN{
open my $f, '<', 'table.txt' or die "table.txt:$!";
while(<$f>) {
my ($k,$v) = split;
$subs{$k}=$v;
}
}
while(<>) {
while(my($k, $v) = each %subs) {
s/\b$k\b/$v/g;
}
print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner which it then compiles and runs using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. Posix-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, Gnu make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without a text extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
@$(<D)/$(<F)
%.exe: %.txt
@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
sed -r \
's/((\\\\)*)(\\?)$$/\1\3\3/; #\
s/((\\\\)*)\\?"/\1\\"/g; #\
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$<"; \
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Bash - listing files neatly

I have a non-determinate list of file names that I would like to output to the user in a script. I don't mind if it's a paragraph or in columns (like the output of ls - how does ls manage it?). In fact, I only have the following requirements:
file names need to stay on the same line (yes, that even means files with a space in their name. If someone is dumb enough to use a newline in a filename, though, they deserve what they get.)
If the output is formatted as a paragraph, I'd like to see it indented on the left and right to separate it from other text. Sort of like the way apt-get upgrade handles the list of packages to install.
I would love not to write my own function for this - at least not a complicated one. There are so many text formatting utilities in linux!
The utility should be available in the default Ubuntu install.
It should handle relatively large input, just in case. Something like 2000 characters or so?
It seems like a simple proposition, but I can't seem to get it to work. The column command is out simply because it can't handle large chunks of data. fmt and fold both don't care about delimiters. printf looks like it would work... if I wrote a script for it.
Is there a more flexible tool I've overlooked, or a simple way to do this?
Here I have a simple formatter that, it seems to me, is good enough:
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
3inarow.py 5s.py a.csv a.not1.pdf a.pdf as de.1 asde.1 asdef.txt asde.py a.sh
a.tex auto a.wk bizarre.py board.py cc2012xyz2_5_5dp.csv cc2012xyz2_5_5dp.html
cc.py col.pdf col.sh col.sh~ col.tex com.py data data.a datazip datidisk
datizip.py dd.py doc1.pdf doc1.tex doc2 doc2.pdf doc2.tex doc3.pdf doc3.tex
e.awk Exit file file1 file2 geomedian.py group_by_1st group_by_1st.2
group_by_1st.mawk integers its.py join.py light.py listluatexfonts mask.py
mat.rix my_data nostop.py numerize.py pepp.py pepp.pyc pi.pdf pippo muore
pippo.py pi.py pi.tex pizza.py plocol.py points.csv points.py puddu puffo
%
I had to simulate input using ls because you didn't care to show how to access your list of files. The window width is arbitrary as well, but it's easy to provide a value via a -v width=... option to awk.
Edit
I added a header line, an unrequested header line, to my awk script because I wanted to test the effectiveness of the (very simple) algorithm.
Addendum
I'd like to stress that the simple formatter above doesn't split "file names" like this across lines, as in the following example:
% ls -d1 b*
bia nconodi
bianconodi.pdf
bianconodi.ppt
bin
b.txt
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
04_Tower.pdf 2plots.py 2.txt a.csv aiuole asdefff a.txt a.txt~ auto
bia nconodi bianconodi.pdf bianconodi.ppt bin Borsa Ferna.jpg b.txt
...
%
As you can see, in the first line there is enough space to print bia but not enough for the real filename bia nconodi, which is hence printed on the second line.
Addendum 2
This is the formatter the OP eventually went with:
local margin=4
local max=10
echo -e "$filenames" | awk -v width=$(tput cols) -v margin=$margin '
NR==1 {
for (i=0; i<margin; i++) {
line = line " "
}
line = line $0;
l = margin + margin + length($0);
next;
}
{
if (l+1+length($0) > width) {
print line;
line = ""
for (i=0; i<margin; i++) line=line " "
line = line $0 ;
l = margin + margin + length($0) ;
next;
}
{
l = l + length($0) + 1;
line = line " " $0;
}
}
END {
print line;
}'
Perhaps you're looking for /usr/bin/fold?
printf '%s ' * | fold -w 77 | sed -e 's/^/ /'
Replace the * with your list, of course; if your files are in an array (they should be; storing filenames in scalar variables is lossy), that'd be "${your_array[@]}".
If you have your filenames in a variable, this will create 3 columns; you can change -3 to whatever number of columns you want:
echo "$var" | pr -3 -t
or if you need to get them from the filesystem:
find . -printf "%f\n" 2>/dev/null | pr -3 -t
From what you stated in the comments, I think this may be what you are looking for. The find command prints the file or directory name followed by a newline, and you can do additional filtering of the filenames by piping through grep or sed prior to pr - the pr command is for print, -3 specifies 3 columns, and -t omits headers and trailers - you can adjust it to your preferences.
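To add the left margin the question mentions, the pr output can simply be piped through sed (a small sketch; the four-space indent is arbitrary):
find . -printf "%f\n" 2>/dev/null | pr -3 -t | sed 's/^/    /'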
