How to find files containing a string N times or more often using egrep - bash

I have a folder with about 400-500 SQL files and need the names of
only those that contain the string CREATE TABLE 3 times or more.
While the command
$ egrep -rl "(CREATE TABLE)" ./*.sql
of course prints all the file names, the command
$ egrep -rl "(CREATE TABLE.*){3}" ./*.sql
does not print any at all ...
Flags:
-r – recursive
-l – files-with-matches | print only names of FILEs containing matches

Your command
egrep -rl "(CREATE TABLE.*){3}" ./*.sql
looks for 3 occurrences of CREATE TABLE on one line.
When they are on different lines, you need to do something different,
and if you have GNU grep, you are in luck: it has the option -z, which reads the input as NUL-terminated records, so a normal text file is treated as one big "line".
# minimal change of your command
egrep -zrl "(CREATE TABLE.*){3}" ./*.sql
# moving option E to the options as suggested by @anubhava
grep -zErl "(CREATE TABLE.*){3}" ./*.sql
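If you want to sanity-check the -z variant, a hypothetical test file with the statements on separate lines should now be reported (with -z the whole file is read as one NUL-terminated record, so .* can cross newlines):
printf 'CREATE TABLE a (id int);\nCREATE TABLE b (id int);\nCREATE TABLE c (id int);\n' > test.sql
grep -zErl "(CREATE TABLE.*){3}" ./*.sql   # should now list ./test.sql among the matches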

This awk will do the job:
awk 'FNR==1{n=0} /CREATE TABLE/{++n} n>2{print FILENAME; nextfile}' *.sql

Could you please try the following. It also takes care of the number of files opened in the background.
awk 'prev!=FILENAME{n=""}/CREATE TABLE/{++n} n>2{print FILENAME;prev=FILENAME;nextfile}' *.sql

Assuming the possibility of multiple matches per line (only covered by Walter A's answer), here is its awk version (for an awk that supports nextfile):
awk '(FNR==1){n=0}
{n+=split($0,a,/CREATE TABLE/)-1}
(n>2) {print FILENAME; nextfile}' *.sql
If you don't have GNU grep (needed for Walter A's solution) and you don't have an awk with nextfile either, the following solution can be used (POSIX):
awk '(FNR==1){n=0; p=1}
p {n+=split($0,a,/CREATE TABLE/)-1}
(n>2) && p {print FILENAME; p=0}' *.sql
The differences between the two solutions are:
Solution 1 does not process the full file, as it terminates early per file once the condition is met.
Solution 2 cannot terminate early, but it reduces the computation by skipping the split once the condition has been satisfied.
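For reference, a tiny demonstration of why the split trick counts matches: split returns the number of fields, which is one more than the number of separators it found, so subtracting 1 gives the per-line count.
echo 'CREATE TABLE a (id int); CREATE TABLE b (id int);' |
awk '{ n = split($0, a, /CREATE TABLE/) - 1; print n }'   # prints 2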

Try this Perl solution
perl -le ' BEGIN { for(glob("*.sql")) { $x=qx(cat $_); $r++ for($x=~m/CREATE TABLE/g); print $_ if $r > 2 ; $r=0 } } '

Related

How can I generate multiple counts from a file without re-reading it multiple times?

I have large files of HTTP access logs and I'm trying to generate hourly counts for a specific query string. Obviously, the correct solution is to dump everything into splunk or graylog or something, but I can't set all that up at the moment for this one-time deal.
The quick-and-dirty is:
for hour in 0{0..9} {10..23}
do
grep $QUERY $FILE | egrep -c "^\S* $hour:"
# or, alternately
# egrep -c "^\S* $hour:.*$QUERY" $FILE
# not sure which one's better
done
But these files average 15-20M lines, and I really don't want to parse through each file 24 times. It would be far more efficient to parse the file and count each instance of $hour in one go. Is there any way to accomplish this?
You can ask grep to output the matching part of each line with -o and then use uniq -c to count the results:
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c
The sed command is there to keep only the two digit hour and the colon, which you can also remove with another sed expression if you want.
Caveats: this solution works with GNU grep and GNU sed, and will produce no output, rather than "0", for hours with no log entries. Kudos to @EdMorton for pointing these issues out in the comments, and other issues that were fixed in the answer above.
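If you do want an explicit 0 for hours with no entries, one (untested) way is to post-process the uniq -c output with a small awk, indexing on the hour:
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c |
awk '{ cnt[$2+0] += $1 } END { for (h = 0; h <= 23; h++) printf "%02d = %d\n", h, cnt[h] }'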
Assuming the timestamp appears with a space before the 2-digit hour, then a colon after
gawk -v patt="$QUERY" '
$0 ~ patt && match($0, / ([0-9][0-9]):/, m) {
print > (m[1] "." FILENAME)
}
' "$FILE"
This will create 24 files.
Requires GNU awk for the 3-arg form of match()
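Once the 24 per-hour files exist, the counts are just a wc -l away; for example, assuming $FILE is a plain file name in the current directory, so the split-out files are named like 13.$FILE:
wc -l [0-2][0-9]."$FILE"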
This is probably what you really need, using GNU awk for the 3rd arg to match() and making assumptions about what your input might look like, what your QUERY variable might contain, and what the output should look like:
awk -v query="$QUERY" '
match($0, " ([0-9][0-9]):.*"query, a) { cnt[a[1]+0]++ }
END {
for (hr=0; hr<=23; hr++) {
printf "%02d = %d\n", hr, cnt[hr]
}
}
' "$FILE"
Don't really use all upper case for non-exported shell variables btw - see Correct Bash and shell script variable capitalization.

grep files against a list only containing numbers

I have several files (~70000) that have numbers in the name; a couple of examples would be 991000_Metatissue.qsub.file and 828000_Metatissue.qsub.file. Then I have another file (files_failed.txt) with a bunch of numbers that I would use to grep. This list looks like this:
4578000
458000
4582000
527000
5288000
5733000
653000
6548000
6663000
I have tried with: ls -1 *.qsub.file | grep -F -f files_failed.txt - and even doing this:
ls -1 *.qsub.file > files_to_submit.txt
grep -F -f files_failed.txt files_to_submit.txt
But I always got all of the qsub.file files...
grep -f performs poorly here (see GNU bug 16305), so I recommend using awk instead:
find . -name '*_*.qsub.file' | awk -F_ '
  NR == FNR { failed[$1] = 1; next }        # first file: remember the failed numbers
  { key = $1; sub(/^\.\//, "", key) }       # strip the leading ./ that find adds
  key in failed
' files_failed.txt /dev/stdin
This uses find to locate the files in question, piping them into awk. Before awk processes that stream, it reads files_failed.txt and stores its values in an associative array (aka dictionary or hash); that happens while the total record number (NR, records read so far) equals the record number within the current file (FNR), i.e. while the first named file is being read. For each filename coming from find, the first column (the number of the file, since we delimited by _, with find's leading ./ stripped off) is looked up in that array; if it is there, that file was a failure. awk's default action for a pattern with no action block is to print the line, so you will get a list of those failed files.
Note the lack of regular expressions! On a big directory, this is much faster than grep -F -f …, which itself is much faster than grep -f …, even assuming the aforementioned bug is fixed.
You should be using find and you need to modify your "patterns". Here is one way that should work:
# List all files ending in "qsub.file"
find . -name '*.qsub.file' |
# Add ./ and _ to each number to make the match exact
grep -F -f <(sed -e 's:^:./:' -e 's/$/_/' files_failed.txt)
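To see what that process substitution actually feeds to grep, you can run the sed on its own (shown here with the first three numbers from the list in the question):
sed -e 's:^:./:' -e 's/$/_/' files_failed.txt | head -3
./4578000_
./458000_
./4582000_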
70000 files is too many for ls; you should use find instead.
And I prefer to invert the logic: list just what I want instead of listing everything and then filtering.
Something like
while read line; do find . -iname "${line}_Metatissue.qsub.file"; done < files_failed.txt
If you need the output in another file:
while read line; do find . -iname "${line}_Metatissue.qsub.file"; done < files_failed.txt >> files_to_submit.txt
You can use the script below:
ls -1 *.qsub.file > filelist.txt
while read pattern
do
filefound=$(grep "$pattern" filelist.txt)
if [ "$filefound" != "" ]; then
echo "File Found : $filefound"
fi
done < files_failed.txt
Second option:
while read pattern
do
find . -name "$pattern*.qsub.file" >> filefound.txt
done < files_failed.txt
All your files will be stored in file filefound.txt

Prepending part of a filename to a .csv file using bash/sed

I have a couple of files in a directory that are named like this:
1_38OE983729JKHKJV.csv
an integer followed by an ID (the integer and the ID are both unique).
I need to prepend this ID to every line of the file for each file in the folder to prepare the files for import to a database (and discard the integer part of the filename). The contents of the file look something like this:
BW;20015;11,45;0,49;41;174856;4103399
BA;25340;11,41;0,55;40;222161;4599779
BB;800;7,58;0,33;42;10559;239887
HE;6301;9,11;0,39;40;69191;1614302
.
.
.
Total;112613;9,33;0,43;40;1207387;25897426
The end result should look something like this:
38OE983729JKHKJV;BW;20015;11,45;0,49;41;174856;4103399
38OE983729JKHKJV;BA;25340;11,41;0,55;40;222161;4599779
38OE983729JKHKJV;BB;800;7,58;0,33;42;10559;239887
38OE983729JKHKJV;HE;6301;9,11;0,39;40;69191;1614302
.
.
.
38OE983729JKHKJV;Total;112613;9,33;0,43;40;1207387;25897426
Thanks for the help!
EDIT: Spelling and vocabulary for clarity
Loop over the files with for, use parameter expansion to extract the id.
#!/bin/bash
for csv in *.csv ; do
    prefix=${csv%_*}    # the integer part (not actually needed here)
    id=${csv#*_}        # remove everything up to and including the first _
    id=${id%.csv}       # remove the .csv suffix
    sed -i~ "s/^/$id;/" "$csv"   # prepend "ID;" to every line, keeping a ~ backup
done
If the ID can contain underscores, you might need to be more careful with the expansion.
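With the example name from the question, the expansions work out like this (a quick check you can run in an interactive shell):
csv=1_38OE983729JKHKJV.csv
id=${csv#*_}     # 38OE983729JKHKJV.csv  (strip everything up to the first _)
id=${id%.csv}    # 38OE983729JKHKJV      (strip the .csv suffix)
echo "$id"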
With the awk tool:
for f in *.csv; do awk '{ fn=FILENAME; $0=substr(fn,index(fn,"_")+1,length(fn)-index(fn,"_")-4)";"$0 }1' "$f" > tmp && mv tmp "$f"; done
fn=FILENAME – the name of the file currently being read; substr/index then cut out the ID between the first _ and the .csv extension.
Try the following too, in a single awk; it also takes care of the number of files opened during this operation, so we avoid the "too many open files" error.
awk 'FNR==1{close(val);val=FILENAME;split(FILENAME,a,"_");sub(/\..*/,"",a[2])} {print a[2]";"$0}' *.csv
With GNU awk for inplace editing and gensub() all you need is:
awk -i inplace '{print gensub(/.*_(.*)\..*/,"\\1;",1,FILENAME) $0}' *.csv
No shell loops or anything else necessary, just that command.

getting the last opened file

input file:
wtf.txt|/Users/jaro/documents/inc/face/|
lol.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
lol.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
omg.txt|/Users/jaro/documents/inc/linked/|
The input file is the list of opened files (each line represents one file being opened). I want to get the last opened file in a given directory,
e.g.: get the last opened file in the dir /Users/jaro/documents/inc/face/
output:
wtf.txt
This fetches the last line in the file whose second field is the desired folder name, and prints the first field.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { f=$1 }
END { print f }' file
To test whether the most recent file is also an existing file, I would use the shell to reverse the order with tac and perform the logic; skip the files in the wrong path, and the ones which don't exist, then print the first success and quit.
tac file |
while IFS='|' read -r basename path _; do
case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
test -e "$path/$basename" || continue
echo "$basename"
break
done |
grep .
The final grep . is to produce an exit code which reflects whether or not the command was successful -- if it printed a file, it's okay; if none of the extracted files existed, return error.
Below is my original answer, based on a plausible but apparently incorrect interpretation of your question.
Here is a quick attempt at finding the file with the newest modification time from the list. I avoid parsing ls, preferring instead to use properly machine-parseable output from stat. Since your input file is line-oriented, I assume no file names contain newlines, which simplifies things quite a bit.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -f '%m %N' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'
The first sort is to remove any duplicates, to avoid running stat more times than necessary (premature optimization, perhaps), the stat prefixes each line with the file's modification time expressed as seconds since the epoch, which facilitates easy numerical sorting by age, and the final Awk script neatly combines head -n 1 | rev | cut -d / -f1 | rev i.e. extract just the basename from the first line of output, then quit.
If there is any way to use a less wacky input format, that would be an improvement (probably of your life in general as well).
The output format from stat is not properly standardized, but your question is tagged both linux and osx, so I assume the BSD stat found on macOS (hence the -f '%m %N' above). If portability is desired, maybe look at find (which however may be overkill and/or not much better standardized across diverse platforms) or write a small Perl or Python script instead. (Well, Ruby too, I suppose, but personally, I'd go with Perl.)
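With GNU coreutils on Linux, the equivalent stat format would be -c '%Y %n' (modification time in epoch seconds, then the file name); the rest of the pipeline stays the same:
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -c '%Y %n' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'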
perl -F'\|' -lane '{ $t{$F[0]} = (stat($F[1].$F[0]))[10]
  if !defined $t{$F[0]} and $F[1] eq "/Users/jaro/documents/inc/face/" }
END { print ((sort { $t{$a} <=> $t{$b} } keys %t)[-1]) }' file
atime – The atime (access time) is the time when the data of a file was last accessed. Displaying the contents of a file or executing a shell script will update a file’s atime, for example. You can view the atime with the ls -lu command
http://www.techtrunch.com/linux/ctime-mtime-atime-linux-timestamps
So in your case, this will do the trick:
ls -lu /Users/jaro/documents/inc/face/
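Note that ls -lu only displays the atime; to actually sort by it and pick the newest entry, combine -u with -t (with the usual caveats about parsing ls output):
ls -ut /Users/jaro/documents/inc/face/ | head -1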

Copying part of a large file using command line

I've a text file with 2 million lines. Each line has some transaction information.
e.g.
23848923748, sample text, feild2 , 12/12/2008
etc
What I want to do is create a new file from a certain unique transaction number onwards. So I want to split the file at the line where this number exists.
How can I do this from the command line?
I can find the line by doing this:
cat myfile.txt | grep 23423423423
use sed like this
sed '/23423423423/,$!d' myfile.txt
Just confirm that the unique transaction number cannot appear as a pattern in some other part of a line (especially on an earlier line) in your file.
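A toy example of that sed address range, in case the !d reads strangely (delete every line that is not in the range from the first match through the end of the file):
printf '1\n2\n3\n4\n' | sed '/3/,$!d'
3
4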
There is already a Perl answer here, so I'll give one more awk way :-)
awk 'BEGIN{skip=1} /23423423423/ {skip=0} {if (skip!=1) print $0}' myfile.txt
On a random file in my tmp directory, this is how I output everything from the line matching popd onwards in a file named tmp.sh:
tail -n+`grep -n popd tmp.sh | cut -f 1 -d:` tmp.sh
tail -n+X prints from that line number onwards; grep -n prefixes each matching line with lineno:, and cut extracts just the lineno from grep's output.
So for your case it would be:
tail -n+`grep -n 23423423423 myfile.txt | cut -f 1 -d:` myfile.txt
And it should indeed match from the first occurrence onwards.
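If the transaction number could match on more than one line, limiting grep to its first hit keeps the command substitution down to a single line number (a sketch using grep's -m option):
tail -n+"$(grep -n -m1 23423423423 myfile.txt | cut -f 1 -d:)" myfile.txt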
It's not a pretty solution, but how about using the -A parameter of grep?
Like this:
mc@zolty:/tmp$ cat a
1
2
3
4
5
6
7
mc@zolty:/tmp$ cat a | grep 3 -A1000000
3
4
5
6
7
The only problem I see in this solution is the 1000000 magic number. Probably someone will know the answer without using such a trick.
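One way to avoid guessing the magic number is to take it from the file itself, e.g. using the file's own line count as the -A value (a sketch):
grep 3 -A $(wc -l < a) a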
You can probably get the line number using grep and then use tail to print the file from that point into your output file.
Sorry I don't have actual code to show, but hopefully the idea is clear.
I would write a quick Perl script, frankly. It's invaluable for anything like this (relatively simple issues) and as soon as something more complex rears its head (as it will do!) then you'll need the extra power.
Something like:
#!/usr/bin/perl
my $out = 0;
while (<STDIN>) {
    $out = 1 if /23423423423/;
    print $_ if $out;
}
and run it using:
$ perl mysplit.pl < input > output
Not tested, I'm afraid.
