I am running a process that produces an arbitrary number of files in an arbitrary number of sub-folders. I am interested in the number of distinct sub-folders and am currently trying to solve this with bash and find (I do not want to use a scripting language).
So far I have:
find models/quarter/ -name settings.json | wc -l
However, this obviously does not consider the structure of the result from find and just counts all files returned.
Sample of the find return:
models/quarter/1234/1607701623/settings.json
models/quarter/1234/1607701523/settings.json
models/quarter/3456/1607701623/settings.json
models/quarter/3456/1607702623/settings.json
models/quarter/7890/1607703223/settings.json
I am interested in the number of distinct folders in the top folder models/quarter, so the appropriate result for the sample above would be 3 (1234, 3456, 7890). It is a requirement that the folders to be counted contain a sub-folder (which is a Unix timestamp, as you might have recognized) and that the sub-folder contains the file settings.json.
My gut tells me it should be possible, e.g. with awk, but I am certainly no bash pro. Any help is greatly appreciated, thanks.
find models/quarter/ -name settings.json | awk -F\/ '{ if (strftime("%s",$4) == $4) { fil[$3]="" } } END { print length(fil) }'
Using awk: pass the output of find to awk and set / as the field separator. Check that the 4th field is a valid timestamp and, if it is, create an array entry indexed by the 3rd field (the folder name). At the end, print the length of the array fil, i.e. the number of distinct folders. Note that strftime() is a GNU awk extension, so this relies on gawk.
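If the strict timestamp check is not essential, a plain pipeline should give the same count (a sketch; -mindepth/-maxdepth are GNU find options, and this only checks the nesting depth, not that the middle folder is numeric):
find models/quarter/ -mindepth 3 -maxdepth 3 -name settings.json | cut -d/ -f3 | sort -u | wc -l
Here cut -d/ -f3 extracts the folder name directly under models/quarter, and sort -u | wc -l counts the distinct names.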
I'm sorry for the very basic question, but I am frankly extremely new to bash and can't seem to work out the below. Any help would be appreciated.
In my working directory '/test' I have a number of files named:
mm(a 10 digit code)_Pool_1_text.csv
mm(same 10 digit code)_Pool_2_text.csv
mm(same 10 digit code)_Pool_3_text.csv
How can I write a loop that would take the first file and put it in a folder at:
/this/that/Pool_1/
the second file at:
/this/that/Pool_2/
etc.
Thank you :)
Using awk you may not need to create an explicit loop:
awk 'FNR==1 {match(FILENAME,/Pool_[[:digit:]]+/); system("mv " FILENAME " /this/that/" substr(FILENAME, RSTART, RLENGTH) "/")}' mm*_Pool_*_text.csv
the shell glob selects the files (we could use extglob, but I wanted to keep it simple)
awk gets the filenames
we match Pool_ followed by one or more digits in the filename
we move the file using the match to extract the pool name
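If you would rather stay in plain bash, a loop with parameter expansion should also do the job (a sketch, assuming the filenames follow the mm<10 digit code>_Pool_N_text.csv pattern above and that the target folders already exist):
for f in mm*_Pool_*_text.csv; do
  pool=${f#*_}            # strip the leading mm<code>_ part
  pool=${pool%_text.csv}  # leaves Pool_1, Pool_2, ...
  mv -- "$f" "/this/that/$pool/"
done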
I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note: both files are big, so the samples shown above don't have duplicates, but I do expect, and have confirmed, at least 10-12k duplicates to show up in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results work fine, so I'm a little bit confused on why it is failing on the larger execution.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the whole line exactly, and -w matches whole words only (blocking adjacent non-word characters), which in my case handles the trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
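For reference, the fixed command would look along these lines (a sketch; -x forces whole-line matches, and adding -F treats the patterns as plain strings rather than regexes, which is usually also faster with a large pattern file):
grep -Fxf pilot.dat disps.dat > duplicate.dat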
I've inherited a Laravel system with a large single log file that is currently around 17GB in size. I'm now rotating future log files monthly; however, I need to split the existing log by month.
The date is formatted as yyyy-mm-dd hh:mm:ss ("[2018-06-28 13:32:05]"). Does anybody know how I could perform the split using only bash scripting (e.g. through the use of awk, sed, etc.)?
The input file name is laravel.log. I'd like output files to have format such as laravel-2018-06.log.
Help much appreciated.
Since the information you provide is a bit sparse, I will go with the following assumptions:
each log entry is a single line
somewhere there is always one string of the form [yyyy-mm-dd hh:mm:ss]; if there are more, we take the first.
your log file is sorted in time.
The regex which matches your date is:
\\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\\]
or a bit less strict
\\[[-:0-9 ]{19}\\]
So we can use this in combination with match(s,ere) to get the desired string:
awk 'BEGIN{ere="\\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\\]"}
{ match($0,ere); fname="laravel-"substr($0,RSTART+1,7)".log" }
(fname != oname) { close(oname); oname=fname }
{ print > oname }' laravel.log
As you say that your file is a bit on the large side, you might want to test this first on a subset which covers a couple of months.
$ head -10000 laravel.log > laravel.head.log
$ awk '{...}' laravel.head.log
$ md5sum laravel.head.log
$ cat laravel.*-*.log | md5sum
If the md5sum is not matching, you might have a problem.
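As a further sanity check on the one-timestamp-per-line assumption, you could compare the total line count with the number of lines that actually match the pattern (a sketch, using the single-backslash form of the regex for grep -E):
$ wc -l < laravel.head.log
$ grep -cE '\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\]' laravel.head.log
If the two numbers differ, some lines (multi-line stack traces, for instance) carry no timestamp, and the awk script above would need an extra guard for them.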
So I have a directory with ~50 files, and each contain different things. I often find myself not remembering which files contain what. (This is not a problem with the naming -- it is sort of like having a list of programs and not remembering which files contain conditionals).
Anyways, so far, I've been using
cat * | grep "desiredString"
for a string that I know is in there. However, this just gives me the lines which contain the desired string. This is usually enough, but I'd like it to give me the file names instead, if at all possible.
How could I go about doing this?
It sounds like you want grep -l, which will list the files that contain a particular string. You can also just pass the filename arguments directly to grep and skip cat.
grep -l "desiredString" *
In the directory containing the files among which you want to search:
grep -rn "desiredString" .
This searches recursively and lists every line matching "desiredString", prefixed with the file name and line number.
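The two approaches can also be combined: -r searches recursively while -l keeps the output down to the file names only:
grep -rl "desiredString" .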
Still being a newbie in bash programming, I am fighting with another task I got. A specific file called ".dump" (yes, with a dot at the beginning) is located in each folder and always contains three numbers. I need to store the third number in a variable if it is greater than 1000 and then print it together with the folder containing it. So the outcome should look like this:
"/dir1/ 1245"
"/dir1/subdir1/ 3434"
"/dir1/subdir2/ 10003"
"/dir1/subdir2/subsubdir3/ 4123"
"/dir2/ 45440"
(without "" and each of them in a new line (not sure, why it is not shown correctly here))
I was playing around with awk, find and while, but the results are so bad that I would rather not post them here, which I hope is understood. Any code snippet that helps is appreciated.
This could be cleaned up, but should work:
find /dir1 /dir2 -name .dump -exec sh -c '
  k=$(awk "\$3 > 1000 {print \$3; exit 1}" "$0") ||
    echo "${0%.dump}" "$k"
' {} \;
(I'm assuming that all three numbers in your .dump files appear on one line. The awk will need to be modified if the input is in a different format.)
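A more explicit, if slower, alternative would be a find | while read loop (a sketch under the same single-line assumption; it also assumes the paths contain no newlines):
find /dir1 /dir2 -name .dump | while IFS= read -r f; do
  n=$(awk '{print $3; exit}' "$f")   # third number on the first line
  [ "$n" -gt 1000 ] && echo "${f%.dump} $n"
done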