How to remove common lines between two files without sorting? [duplicate]

This question already has answers here:
Compare 2 files and remove any lines in file2 when they match values found in file1
(4 answers)
Closed 8 years ago.
I have two unsorted files which have some lines in common.
file1.txt
Z
B
A
H
L
file2.txt
S
L
W
Q
A
The way I'm using to remove common lines is the following:
sort -u file1.txt > file1_sorted.txt
sort -u file2.txt > file2_sorted.txt
comm -23 file1_sorted.txt file2_sorted.txt > file_final.txt
Output:
B
H
Z
The problem is that I want to keep the order of file1.txt, I mean:
Desired output:
Z
B
H
One solution I thought of is to loop over all the lines of file2.txt and run:
sed -i "/^${line_file2}\$/d" file1.txt
But if files are big the performance may suck.
Do you like my idea?
Do you have any alternative to do it?

You can use just grep (-v for invert, -f for file). Grep lines from input1 that do not match any line in input2:
grep -vf input2 input1
Gives:
Z
B
H
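One caveat with a plain -f: each line of the pattern file is treated as a regular expression and may match anywhere in a line. A minimal sketch, assuming the goal is to drop only lines that are identical to some line of the second file, adds -x (whole-line match) and -F (fixed strings). The temporary directory and the variable names are only there to make the example self-contained:

```shell
# Scratch copies of the sample files from the question (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' Z B A H L > "$tmp/file1.txt"
printf '%s\n' S L W Q A > "$tmp/file2.txt"

# -v invert the match, -x match whole lines only,
# -F treat patterns as fixed strings, -f read patterns from a file.
result=$(grep -vxFf "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```

The order of file1.txt is preserved because grep simply filters its input line by line.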

grep or awk:
awk 'NR==FNR{a[$0]=1;next}!a[$0]' file2 file1
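The one-liner relies on the NR==FNR idiom, which is only true while awk reads the first file named on the command line. A commented sketch of the same idea, using scratch copies of the question's sample files (the paths and the names seen and kept are illustrative):

```shell
# Scratch copies of the sample files from the question (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' Z B A H L > "$tmp/file1.txt"
printf '%s\n' S L W Q A > "$tmp/file2.txt"

kept=$(awk '
    # First file on the command line (file2): NR equals FNR only here,
    # so remember every line and skip to the next record.
    NR == FNR { seen[$0] = 1; next }
    # Second file (file1): print only lines that were not remembered.
    !($0 in seen)
' "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$kept"

rm -r "$tmp"
```

Note that the idiom assumes the first file is not empty; with an empty file2, NR==FNR would also hold for every line of file1.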

I've written a little Perl script that I use for this kind of thing. It can do more than what you ask for but it can also do what you need:
#!/usr/bin/env perl -w
use strict;
use Getopt::Std;
my %opts;
getopts('hvfcmdk:', \%opts);
my $missing=$opts{m}||undef;
my $column=$opts{k}||undef;
my $common=$opts{c}||undef;
my $verbose=$opts{v}||undef;
my $fast=$opts{f}||undef;
my $dupes=$opts{d}||undef;
$missing=1 unless $common || $dupes;;
&usage() unless $ARGV[1];
&usage() if $opts{h};
my (%found,%k,%fields);
if ($column) {
die("The -k option only works in fast (-f) mode\n") unless $fast;
$column--; ## So I don't need to count from 0
}
open(my $F1,"$ARGV[0]")||die("Cannot open $ARGV[0]: $!\n");
while(<$F1>){
chomp;
if ($fast){
my @aa=split(/\s+/,$_);
$k{$aa[0]}++;
$found{$aa[0]}++;
}
else {
$k{$_}++;
$found{$_}++;
}
}
close($F1);
my $n=0;
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");
my $size=0;
if($verbose){
while(<F2>){
$size++;
}
}
close(F2);
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");
while(<F2>){
next if /^\s+$/;
$n++;
chomp;
print STDERR "." if $verbose && $n % 10==0;
print STDERR "[$n of $size lines]\n" if $verbose && $n % 800==0;
if($fast){
my @aa=split(/\s+/,$_);
$k{$aa[0]}++ if defined($k{$aa[0]});
$fields{$aa[0]}=\@aa if $column;
}
else{
my @keys=keys(%k);
foreach my $key(keys(%found)){
if (/\Q$key/){
$k{$key}++ ;
$found{$key}=undef unless $dupes;
}
}
}
}
close(F2);
print STDERR "[$n of $size lines]\n" if $verbose;
if ($column) {
$missing && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" unless $k{$_}>1}keys(%k);
$common && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>1}keys(%k);
$dupes && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>2}keys(%k);
}
else {
$missing && do map{print "$_\n" unless $k{$_}>1}keys(%k);
$common && do map{print "$_\n" if $k{$_}>1}keys(%k);
$dupes && do map{print "$_\n" if $k{$_}>2}keys(%k);
}
sub usage{
print STDERR <<EndOfHelp;
USAGE: compare_lists.pl FILE1 FILE2
This script will compare FILE1 and FILE2, searching for the
contents of FILE1 in FILE2 (and NOT vice versa). FILE one must
be one search pattern per line, the search pattern need only be
contained within one of the lines of FILE2.
OPTIONS:
-c : Print patterns COMMON to both files
-f : Search only the first characters of each line of FILE2
for the search pattern given in FILE1
-d : Print duplicate entries
-m : Print patterns MISSING in FILE2 (default)
-h : Print this help and exit
EndOfHelp
exit(0);
}
In your case, you would run it as
list_compare.pl -cf file1.txt file2.txt
The -f option makes it compare only the first word (defined by whitespace) of file2 and greatly speeds things up. To compare the entire line, remove the -f.

Related

How to print both the grep pattern and the resulting matched line on the same line?

I have two files File01 and File02.
File01, looks like this:
BU24DRAFT_430534
BU24DRAFT_488391
BU24DRAFT_488386
BU24DRAFT_417707
BU24DRAFT_417704
BU24DRAFT_488335
BU24DRAFT_429509
BU24DRAFT_210092
BU24DRAFT_229465
BU24DRAFT_498094
BU24DRAFT_416051
BU24DRAFT_482795
BU24DRAFT_4305
BU24DRAFT_10621
BU24DRAFT_4883
File02, looks like this:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79
Using the string from File01, via grep, I would like to identify the lines in File02 that match and with this information generate a file that would look like this:
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
I tried generating such file using the following code:
while read r;do CMD01=$(echo $r);CMD02=$(grep $r File01); echo "$CMD02 $CMD01";done < File02 | awk '(NR>1) && ($2 > 2 ) '
The problem I run into is that I obtain extra matching lines:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4305
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4883
Where, for example, the string: BU24DRAFT_4305 is wrongly recognizing the string: XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
This result is incorrect. The string in File01 must match a string in File02 that has a complete version of File01's string
Any ideas that could help me will be appreciated.

For the updated sample input and full-matching requirement and assuming you never have any regexp metacharacters in file1 and that the matching strings in file2 are never at the start or end of the line:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ($0 ~ ("[^[:alnum:]]"str"[^[:alnum:]]")) print $0, str}' file1 file2
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_430534
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
Original answer doing partial matching:
The correct approach is 1 call to awk:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if (index($0,str)) print $0, str}' file1 file2
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and https://mywiki.wooledge.org/Quotes for some of the issues with the script in your question.

So, it looks like yours mostly works. A lot of what you are doing here is unnecessary. Here is your script broken into multiple lines for readability:
while read r; do
CMD01=$(echo $r)
CMD02=$(grep $r zztest01)
echo "$CMD02 $CMD01"
done < <(head zztest) | awk '(NR>1) && ($2 > 2 ) '
First, CMD01=$(echo $r): This is really the same (or intended to be) as CMD01="$r" so kind of useless.
Then, < <(head zztest): You are using head to output the contents of the file. This actually works just as well with a simple redirection like this: < zztest.
Last, | awk '(NR>1) && ($2 > 2 ) ': This appears to just be some sort of logic on whether we are going to print anything or not.
Here is a simplified version:
while read r; do
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"
done < zztest
Explanation
CMD02=$(grep $r zztest01) && echo "$CMD02 $r": The main part of this is really two commands separated by &&. This means execute the second command if the first one succeeded. grep will return a "failure" code if it does not find what it is looking for. So, if grep does not find a match, echo will not run.
The output of grep will be stored in the variable $CMD02. Then, you will echo that along with $r for each match.
If you really want to keep this on one line like the original:
while read r; do CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"; done < zztest
Update
If you want to avoid partial matches, as Ed asked, you can change the grep to grep "$r[^0-9]" zztest01. This avoids a match when a digit immediately follows the search string (which is an assumption based on the sample data).
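A slightly stricter variant of the same loop is sketched below. It assumes, as in the sample, that the IDs contain no regular expression metacharacters, and it requires a non-alphanumeric character (or the start or end of the line) on both sides of the ID. The scratch files and the variable names id, hit and out are illustrative:

```shell
# Scratch copies of a few sample lines (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' BU24DRAFT_430534 BU24DRAFT_4305 BU24DRAFT_229465 > "$tmp/File01"
cat > "$tmp/File02" <<'EOF'
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79
EOF

out=$(while IFS= read -r id; do
    # The ID must be delimited by a non-alphanumeric character or a line edge,
    # so BU24DRAFT_4305 no longer matches inside BU24DRAFT_430534.
    hit=$(grep -E "(^|[^[:alnum:]])${id}([^[:alnum:]]|\$)" "$tmp/File02") &&
        printf '%s %s\n' "$hit" "$id"
done < "$tmp/File01")
printf '%s\n' "$out"

rm -r "$tmp"
```

This still runs grep once per ID, so for large inputs the single awk call shown in the other answer is likely the better choice.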

While not explicit in the question, it seems that each pattern should match only a single line in the input file (File02).
Based on this observation, it is possible to improve the performance of Ed Morton's solution:
awk '
NR==FNR{strs[$0]; next}
{ for (str in strs) if (index($0,str)) { print $0, str ; delete strs[str]; next } }
' file1 file2
For large files with many patterns, this can reduce the runtime by a factor of about 4.

Split CSV into two files based on column matching values in an array in bash / posh

I have an input CSV that I would like to split into two CSV files. If the value of column 4 matches any value in WLTarray, the row should go in output file 1; if it doesn't, it should go in output file 2.
WLTarray:
"22532" "79994" "18809" "21032"
input CSV file:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file1:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file2:
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
I've been looking at awk to filter this (python & perl not an option in my environment) but I think there is probably a much smarter way:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?
done

With regular awk you can make use of split and substr (to handle double-quote removal for comparison) and split the csv file as you indicate. For example you can use:
awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
split (s,a," ") # split s into array a
for (i in a) # loop over each index in a
b[a[i]]=1 # use value in a as index for b
}
FNR == 1 { # first record, write header to both output files
print $0 > "output1.csv"
print $0 > "output2.csv"
next
}
substr($4,2,length($4)-2) in b { # 4th field w/o quotes in b?
print $0 > "output1.csv" # write to output1.csv
next
}
{ print $0 > "output2.csv" } # otherwise write to output2.csv
' input.csv
Where:
in the BEGIN {...} rule you set the field separator (FS) to break on comma and split the string containing your desired output1.csv field 4 matches into the array a, then loop over the values in a, using them as the indexes of array b (to allow a simple "x in b" check);
the first rule is applied to the first records in the file (the header line) which is simply written out to both output files;
the next rule removes the double-quotes surrounding field-4 and then checks if the number in field-4 matches an index in array b. If so the record is written to output1.csv otherwise it is written to output2.csv.
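If the values already live in a bash array, as in the question, they can be handed to awk with -v instead of being hard-coded. The following is a sketch under a few assumptions: bash is the shell, no field before column 4 contains a comma, and the variable names list, out1, out2, ids and wanted are made up for the example. It compares field 4 with its quotes left in place rather than stripping them:

```shell
# Scratch copy of the sample input (hypothetical paths).
tmp=$(mktemp -d)
cat > "$tmp/input.csv" <<'EOF'
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
EOF

WLTarray=("22532" "79994" "18809" "21032")

# "${WLTarray[*]}" joins the array elements with spaces (default IFS).
awk -F, -v list="${WLTarray[*]}" -v out1="$tmp/output1.csv" -v out2="$tmp/output2.csv" '
    BEGIN {
        n = split(list, ids, " ")
        for (i = 1; i <= n; i++)
            wanted["\"" ids[i] "\""] = 1      # keys keep the CSV double quotes
    }
    FNR == 1 { print > out1; print > out2; next }   # header goes to both files
    { print > (($4 in wanted) ? out1 : out2) }
' "$tmp/input.csv"

matched=$(cat "$tmp/output1.csv")
unmatched=$(cat "$tmp/output2.csv")
printf '%s\n---\n%s\n' "$matched" "$unmatched"

rm -r "$tmp"
```

A single pass over the input writes both files, so no second filtering step is needed.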
Example Input File
$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
Resulting Output Files
$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"

You can use gawk like this:
test.awk
#!/usr/bin/gawk -f
BEGIN {
split("22532 79994 18809 21032", a)
for(i in a) {
WLTarray[a[i]]
}
FPAT="[^\",]+"
}
NR > 1 {
if ($4 in WLTarray) {
print >> "output1.csv"
} else {
print >> "output2.csv"
}
}
Make it executable and run it like this:
chmod +x test.awk
./test.awk input.csv

Using grep with a filter file as input was the simplest answer.
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}"
do
awkstring="'\$4 == "\"\\\"$WLTvalue\\\"\"" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv

Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge
I have two files; one with scrambled text like this with about 550 entries
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters from the first word from the first file and then look up if the first word from the second file has the same length. If so then sort that too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
var=0
ro=$(echo $p | perl -F -lane 'print sort @F')
len_ro=${#ro}
while IFS="" read -r o || [ -n "$o" ]
do
ro2=$(echo $o | perl -F -lane 'print sort @F')
len_ro2=${#ro2}
let "var+=1"
if [ $len_ro == $len_ro2 ]; then
if [ $ro == $ro2 ]; then
echo $o >> new.txt
echo $var >> whichline.txt
fi
fi
done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and sum each word, but while comparing I realized that the sum of a different char pattern may have the same sum.
[edit]
For the records:
- no anagrams contained in dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to ctf for guy who wanted the files https://challenges.reply.com/tamtamy/user/login.action

You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );
my $dict_qfn = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";
sub key { join "", sort split //, $_[0] }
my %dict;
{
open(my $fh, "<", $dict_qfn)
or die("Can't open \"$dict_qfn\": $!\n");
while (<$fh>) {
chomp;
push @{ $dict{key($_)} }, $_;
}
}
{
open(my $fh, "<", $scrambled_qfn)
or die("Can't open \"$scrambled_qfn\": $!\n");
while (<$fh>) {
chomp;
my $matches = $dict{key($_)};
say "$_ matches @$matches" if $matches;
}
}
I wouldn't be surprised if this takes only one millionth of the time of your solution for the sizes you provided (and it scales much better than yours if you were to increase the sizes).

I would do something like this with gawk
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
print dict[csort()]
}
function csort( chars, sorted, i, n) {
split($0, chars, "")
n = asort(chars)
# iterate by index: "for (i in chars)" does not guarantee sorted order
for (i = 1; i <= n; i++)
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt

Here's perl-free solution I came up with using sort and join:
sort_letters() {
# Splits each letter onto a line, sorts the letters, then joins them
# e.g. "hello" becomes "ehllo"
echo "${1}" | fold -b1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
# Convert each line to [sorted] [original]
# then sort and save the results with a .sorted extension
while read -r original; do
sorted=$(sort_letters "${original}")
echo "${sorted} ${original}"
done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"

I tried something very alike, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort @F'
done>scrambled-words_sorted.txt
exec 3>&-
exec 3<dictionary.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort @F'
done>dictionary_sorted.txt
exec 3>&-
printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
counter="$((++counter))"
grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
printf "\n" >>whichline.txt
done
exec 3>&-
As you can see, I don't create a new.txt file; instead I only create whichline.txt, with a blank line where a word doesn't match. You can easily paste the two together to create new.txt.
The logic behind the script is nearly the same as yours, except that I call perl fewer times and save two support files.
I think (but I am not sure) that creating them and looping over only one file is better than ~5 million calls to perl. This way perl is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and because matching the entire line makes the length check implicit in the regex.
Please note that what @benjamin-w said is still valid: in that case grep will misbehave, and I did not handle it!
I hope this helps [:

Bash - listing files neatly

I have a non-determinate list of file names that I would like to output to the user in a script. I don't mind if it's a paragraph or in columns (like the output of ls. How does ls manage it?). In fact I only have the following requirements:
file names need to stay on the same line (yes, that even means files with a space in their name. If someone is dumb enough to use a newline in a filename, though, they deserve what they get.)
If the output is formatted as a paragraph, I'd like to see it indented on the left and right to separate it from other text. Sort of like the way apt-get upgrade handles the list of packages to install.
I would love not to write my own function for this - at least not a complicated one. There are so many text formatting utilities in linux!
The utility should be available in the default Ubuntu install.
It should handle relatively large input, just in case. Something like 2000 characters or so?
It seems like a simple proposition, but I can't seem to get it to work. The column command is out simply because it can't handle large chunks of data. fmt and fold both don't care about delimiters. printf looks like it would work... if I wrote a script for it.
Is there a more flexible tool I've overlooked, or a simple way to do this?

Here I have a simple formatter that, it seems to me, is good enough
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
3inarow.py 5s.py a.csv a.not1.pdf a.pdf as de.1 asde.1 asdef.txt asde.py a.sh
a.tex auto a.wk bizarre.py board.py cc2012xyz2_5_5dp.csv cc2012xyz2_5_5dp.html
cc.py col.pdf col.sh col.sh~ col.tex com.py data data.a datazip datidisk
datizip.py dd.py doc1.pdf doc1.tex doc2 doc2.pdf doc2.tex doc3.pdf doc3.tex
e.awk Exit file file1 file2 geomedian.py group_by_1st group_by_1st.2
group_by_1st.mawk integers its.py join.py light.py listluatexfonts mask.py
mat.rix my_data nostop.py numerize.py pepp.py pepp.pyc pi.pdf pippo muore
pippo.py pi.py pi.tex pizza.py plocol.py points.csv points.py puddu puffo
%
I had to simulate input using ls because you didn't care to show how to access your list of files. The window width is arbitrary as well, but it's easy to provide a value via awk's -v width=... option.
Edit
I added a header line, an unrequested header line, to my awk script because I wanted to test the effectiveness of the (very simple) algorithm.
Addendum
I'd like to stress that the simple formatter above doesn't split "file names" like this across lines, as in the following example:
% ls -d1 b*
bia nconodi
bianconodi.pdf
bianconodi.ppt
bin
b.txt
% ls | awk '
NR==1 {for(i=1;i<9;i++)printf "----+----%d", i; print ""
line=" " $0;l=2+length($0);next}
{if(l+1+length($0)>80){
print line; line = " " $0 ; l = 2+length($0) ; next}
{l=l+length($0)+1; line=line " " $0}}'
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
04_Tower.pdf 2plots.py 2.txt a.csv aiuole asdefff a.txt a.txt~ auto
bia nconodi bianconodi.pdf bianconodi.ppt bin Borsa Ferna.jpg b.txt
...
%
As you can see, in the first line there is enough space to print bia but not enough for the real filename bia nconodi, that hence is printed on the second line.
Addendum 2
This is the formatter the OP eventually went with:
local margin=4
local max=10
echo -e "$filenames" | awk -v width=$(tput cols) -v margin=$margin '
NR==1 {
for (i=0; i<margin; i++) {
line = line " "
}
line = line $0;
l = margin + margin + length($0);
next;
}
{
if (l+1+length($0) > width) {
print line;
line = ""
for (i=0; i<margin; i++) line=line " "
line = line $0 ;
l = margin + margin + length($0) ;
next;
}
{
l = l + length($0) + 1;
line = line " " $0;
}
}
END {
print line;
}'

Perhaps you're looking for /usr/bin/fold?
printf '%s ' * | fold -w 77 | sed -e 's/^/ /'
Replace the * with your list, of course; if your files are in an array (they should be; storing filenames in scalar variables is lossy), that'd be "${your_array[@]}".

If you have your filenames in a variable this will create 3 columns, you can change -3 to whatever number of columns you want
echo "$var" | pr -3 -t
or if you need to get them from the filesystem:
find . -printf "%f\n" 2>/dev/null | pr -3 -t
From what you stated in the comments, I think this may be what you are looking for. The find command prints each file or directory name followed by a newline; you can add further filtering of the file names by piping through grep or sed before pr. The pr command formats text for printing: -3 requests 3 columns and -t omits headers and trailers. You can adjust both to your preferences.

Show different context on different grep keyword?

I know -A -B -C could be used to show context around the grep keyword.
My question is, how to show different context on different keyword?
For example, how do I show -A 5 for cat, -B 4 for dog, and -C 1 for monkey:
egrep -A3 "cat|dog|monkey" <file>
// this just show 3 after lines for each keyword.

i don't think there's any way to do it with a single grep call, but you could run it through grep once for each variable and concatenate the output:
var=$(grep -n -A 5 cat file)$'\n'$(grep -n -B 4 dog file)$'\n'$(grep -n -C 1 monkey file)
var=$(sort -un <(echo "$var"))
now echo "$var" will produce the same output as you would have gotten from your single command, plus line numbers and context indicators (the : prefix indicates a line that matched the pattern exactly, and the - prefix indicates a line being included because of the -A -B and/or -C options).
the reason i included the line numbers thus far is to preserve the order of the results you would have seen had you managed to do this in one statement. if you like them, great, but if not, you can use the following line to cut them out:
var=$(cut -d: -f2- <(echo "$var") | cut -d- -f2-)
this passes it through once to cut the exact matching lines' prefixes, then again to cut the context matches' prefixes.
pretty? no. but it works.

I'm afraid grep won't do that. You'll have to use a different tool. Perhaps write your own program.

Something like this would do it:
awk '
BEGIN{ ARGV[ARGC++] = ARGV[1] }
function prtB(nr) { for (i=FNR-nr; i<FNR; i++) print a[i] }
function prtA(nr) { for (i=FNR+1; i<=FNR+nr; i++) print a[i] }
NR==FNR{ a[NR] = $0; next }  # first pass: store every line so the context can be printed later
/cat/ { print; prtA(5) }
/dog/ { prtB(4); print }
/monkey/ { prtB(1); print; prtA(1) }
' file
check the math on the loops in the functions. You didn't say how you'd want to handle lines that contain monkey AND dog, for example.
EDIT: here's an untested solution that would print the maximum context around any match and let you specify the contexts on the command line and won't use as much memory as the above cheap and cheerful solution:
awk -v cxts="cat:0:5\ndog:4:0\nmonkey:1:1" '
BEGIN{
ARGV[ARGC++] = ARGV[1]
numCxts = split(cxts,cxtsA,RS)
for (i=1;i<=numCxts;i++) {
regex = cxtsA[i]
n = split(regex,rangeA,/:/)
sub(/:[^:]+:[^:]+$/,"",regex)
endA[regex] = rangeA[n]
startA[regex] = rangeA[n-1]
regexA[regex]
}
}
NR==FNR{
for (regex in regexA) {
if ($0 ~ regex) {
start = NR - startA[regex]
end = NR + endA[regex]
for (i=start; i<=end; i++) {
prt[i]
}
}
}
next
}
FNR in prt
' file
Separate the searched for patterns in the cxts variable with whatever your RS value is, newline by default.
