How to extract codes using the grep command? - bash

I have a file with below input lines.
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the above highlighted text using the grep command.
I tried the below command and didn't get the proper result. Getting the extra 2 unwanted characters in the output. Please suggest if there is any other way to achieve this through grep command.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\\....' |
grep -o '...\\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1

You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
. matches any character, [^ ]* matches any sequence of characters until the first space
Output:
123.NNN
456.1

Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\/);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different than that supported by POSIX grep; so the backslashed \+ in grep's BRE dialect is plain + in the dialect called ERE which is [more or less] supported by Awk - and grep -E. If you have grep -P you can use a third variant which has a convenient feature;
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.

With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1

I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/

An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file

It's not possible just using grep. You should use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits on a string, here $7, the 7th field, placing the result in an array ar, and using the string / as delimiter.
Then prints the 3rd field of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1

Single sed can do the filtering.
(The pattern can be further generalized as suggested by others if that is an option. But be careful to not to over simplify so that it can match with unexpected inputs)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command,
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq

Simple substitutions on individual lines is the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

Related

How to grep only matching string from this result?

I am just simply trying to grab the commit ID, but not quite sure what I'm missing:
➜ ~ curl https://github.com/microsoft/vscode/releases -s | grep -oE 'microsoft/vscode/commit/(.*?)/hovercard'
microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard
The only thing I need back from this is ccbaa2d27e38e5afa3e5c21c1c7bef4657064247.
This works just fine on regex101.com and in ruby/python. What am I missing?
If supported, you can use grep -oP
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | grep -oP "microsoft/vscode/commit/\K.*?(?=/hovercard)"
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
Another option is to use sed with a capture group
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | sed -E 's/microsoft\/vscode\/commit\/([^\/]+)\/hovercard/\1/'
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
The point is that grep does not support extracting capturing group submatches. If you install pcregrep you could do that with
curl https://github.com/microsoft/vscode/releases -s | \
pcregrep -o1 'microsoft/vscode/commit/(.*?)/hovercard' | head -1
The | head -1 part is to fetch the first occurrence only.
I would suggest using awk here:
awk 'match($0,/microsoft\/vscode\/commit\/[^\/]*\/hovercard/){print substr($0,RSTART+24,RLENGTH-34);exit}'
The regex will match a line containing
microsoft\/vscode\/commit\/ - microsoft/vscode/commit/ fixed string
[^\/]* - zero or more chars other than /
\/hovercard - a /hovercard string.
The substr($0,RSTART+24,RLENGTH-34) will print the part of the line starting at the RSTART+24 (24 is the length of microsoft/vscode/commit/) index and the RLENGTH is the length of microsoft/vscode/commit/ + the length of the /hovercard.
The exit command will fetch you the first occurrence. Remove it if you need all occurrences.
You can use sed:
curl -s https://github.com/microsoft/vscode/releases |
sed -En 's=.*microsoft/vscode/commit/([^/]+)/hovercard.*=\1=p' |
head -n 1
head -n 1 is to print the first match (there are 10)grep -o will print (only) everything that matches, including microsoft/ etc.
Your task can not be achieved with Mac's grep. grep -o prints all matching text (compared to default behaviour of printing matching lines), including microsoft/ etc. A grep which implemented perl regex (like GNU grep on Linux) could make use of look ahead/behind (grep -Po '(?<=microsoft/vscode/commit/)[^/]+(?=/hovercard)'). But it's just not available on Mac's grep.
On MacOS you don't have gnu utilities available by default. You can just pipe your output to a simple awk like this:
curl https://github.com/microsoft/vscode/releases -s |
grep -oE 'microsoft/vscode/commit/[^/]+/hovercard' |
awk -F/ '{print $(NF-1)}'
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
3a6960b964327f0e3882ce18fcebd07ed191b316
f4af3cbf5a99787542e2a30fe1fd37cd644cc31f
b3318bc0524af3d74034b8bb8a64df0ccf35549a
6cba118ac49a1b88332f312a8f67186f7f3c1643
c13f1abb110fc756f9b3a6f16670df9cd9d4cf63
ee8c7def80afc00dd6e593ef12f37756d8f504ea
7f6ab5485bbc008386c4386d08766667e155244e
83bd43bc519d15e50c4272c6cf5c1479df196a4d
e7d7e9a9348e6a8cc8c03f877d39cb72e5dfb1ff

Merge huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file and this merge file should happen based on ascending order of the file name. I have tried the below command and it works as intended but the only problem is that after the merge the output.txt file contains whole data in a single line because all the input files have only one line of data without any newline.
Is there any way to merge each file data into output.txt as separate line rather than merging every file data into a single line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345
EDIT:
Since my input files won't contain any spaces or special characters like backslash or quotes, I decided to use the below command which is working for me as expected.
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
Tried with file name containing a space and below are the results with 2 different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json -- This File name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfuly completed the merge as expected but with an error at the end.
abcdef
12345
hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used the sort with -zV options to avoid the error occured in the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
Command completed successfully but results are not as expected. Here the file name having space is treated as last file after the sort. The expectation is that the file name with space should be at second position after the sort.
abcdef
hello
12345
I would approach this with a for loop, and use echo to add the newline between each file:
for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
cat $x
echo
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as #oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine, with 12000 files as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while the xargs needs only a handful to awk processes, which is a lot more efficient.
Note: if want to you upvote this, make sure to upvote #oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.
Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
index="${fname##*_}"
index="${index%.json}"
files[index]="$fname"
done && awk 1 "${files[#]}" > output.txt
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt

How can I deduplicate filenames across directories?

I run the following gsutil command:
gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
sort -V |
grep -e "/*.whl"
I get:
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
Since some files in different folders have the same names, how can I retrieve unique filenames ignoring the path?
I would do it like this:
blabla_your_command | rev | sort -t'/' -u -k1,1 | rev
rev reverses lines. Then I unique sort using / as a separator on the first field. After the line is reversed, the first field will be the filename, so sorting -u on it would return only unique filenames. Then the line needs to be reversed back.
The following command:
cat <<EOF |
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
EOF
rev | sort -t'/' -u -k1,1 | rev
outputs:
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
Please check awk option given below, this will print the last occurrence of delimiter '/', it worked for me
example:
gsutil ls gs://mybucket/v1.0.0/folder1/1560930522 | awk -F/ '{print $(NF)}'
print all the file names under '1560930522'
your_command|awk -F/ '!($NF in a){a[$NF]; print}'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
4 different ways of saying the same thing
nawk -F'^.+/' '++_[$NF]<NF'
gawk -F'/' '__[$NF]++<!_'
mawk -F/ '_^__[$NF]++'
mawk2 -F/ '!_[$NF]--'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
Here's a simple, straightforward solution:
$ your_gsutil_command | xargs -L 1 basename | sort -u
The easiest way to remove paths is with basename. Unfortunately it accepts only a single filename, which must be on the command line (not from stdin), so we need to take the following steps:
Create the list of files.
We do this with your_gsutil_command, but you can use any command that generates a list of files.
Send each one to basename to remove its path.
The xargs command does this for us by reading its stdin and invoking basename repeatedly, passing the data as command-line arguments. But xargs efficiently tries to reduce the number of invocations by passing multiple filenames on each command line, and that breaks basename. We prevent that with -L 1, limiting it to only one line (that is, one filename) at a time.
Remove duplicates.
The sort -u command does this.
Using your example data:
$ gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
xargs -L 1 basename | sort -u
file1-cp27-cp27mu-linux_x86_64.whl
file1-cp35-cp35m-linux_x86_64.whl
file1-cp36-cp36m-linux_x86_64.whl
file1-cp37-cp37m-linux_x86_64.whl
Caveat: Spaces break everything. 😡
So far we've assumed the filenames and folders do not contain spaces. Spaces break basename because needs exactly one filename, and it would interpret spaces as separators between multiple filenames. We can get around this in two ways:
ls -Q: If you're deduplicating local filenames, you can use the (non-gsutil) ls command with the -Q flag to put the filenames in quotes, so basename will interpret spaces as part of the filenames rather than separators.
gsutil: The -Q flag is unfortunately not supported, so we'll need to escape the spaces manually:
$ your_gsutil_command | sed 's/ /\\ /g' | xargs -L 1 basename | sort -u
Here we use the sed command to escape each space by inserting a backslash before it. (That is, we replace with \ . Note that we also need to escape the backslash in the sed command, which is why we use \\ and not just \.)

How to sort the results of find (including nested directories) alphabetically in bash

I have a list of directories based on the results of running the "find" command in bash. As an example, the result of find are the files:
test/a/file
test/b/file
test/file
test/z/file
I want to sort the output so it appears as:
test/file
test/a/file
test/b/file
test/z/file
Is there any way to sort the results within the find command, or by piping the results into sort?
If you have the GNU version of find, try this:
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}'
To use these file names in a loop, do
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}' | while read file; do
# use $file
done
The find command prints three things for each file: (1) its directory, (2) its depth in the directory tree, and (3) its full name. By including the depth in the output we can use sort -n to sort test/file above test/a/file. Finally we use awk to strip out the first two columns since they were only used for sorting.
Using \0 as a separator between the three fields allows us to handle file names with spaces and tabs in them (but not newlines, unfortunately).
$ find test -type f
test/b/file
test/a/file
test/file
test/z/file
$ find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F'\0' '{print $3}'
test/file
test/a/file
test/b/file
test/z/file
If you are unable to modify the find command, then try this convoluted replacement:
find test -type f | while read file; do
printf '%s\0%s\0%s\n' "${file%/*}" "$(tr -dc / <<< "$file")" "$file"
done | sort -t '\0' | awk -F'\0' '{print $3}'
It does the same thing, with ${file%/*} being used to get a file's directory name and the tr command being used to count the number of slashes, which is equivalent to a file's "depth".
(I sure hope there's an easier answer out there. What you're asking doesn't seem that hard, but I am blanking on a simple solution.)
find test -type f -printf '%h\0%p\n' | sort | awk -F'\0' '{print $2}'
The result of find is, for example,
test/a'\0'test/a/file
test'\0'test/file
test/z'\0'test/z/file
test/b'\0'test/b/text file.txt
test/b'\0'test/b/file
where '\0' stands for null character.
These compound strings can be properly sorted with a simple sort:
test'\0'test/file
test/a'\0'test/a/file
test/b'\0'test/b/file
test/b'\0'test/b/text file.txt
test/z'\0'test/z/file
And the final result is
test/file
test/a/file
test/b/file
test/b/text file.txt
test/z/file
(Based on the John Kugelman's answer, with "depth" element removed which is absolutely redundant.)
If you want to sort alphabetically, the best way is:
find test -print0 | sort -z
(The example in the original question actually wanted files before directories, which is not the same and requires extra steps)
try this. for reference, it firsts sorts on the second field second char. which only exists on the file, and has a r for reverse meaning it is first, after that it will sort on the first char of the second field. [-t is field deliminator, -k is key]
find test -name file |sort -t'/' -k2.2r -k2.1
do a info sort for more info. there is a ton of different ways to use the -t and -k together to get different results.

grep pipe with sed

This is my bash command
grep -rl "System.out.print" Project1/ |
xargs -I{} grep -H -n "System.out.print" {} |
cut -f-2 -d: |
sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/
What I'm trying to do here is,I'm trying to iterate through sub folders and check what files contains "System.out.print" (using grep)
using 2nd grep trying to get file names and line numbers
using sed command I display those to console.
from here I want to remove "System.out.print" with "XXXXX" how I can pipe sed command to this?
pls help me
thanxx
GNU sed has an option to change files in place:
find Project1/ -type f | xargs sed -i 's/System\.out\.print/XXXXX/g'
Btw, your script could be written as:
grep -rsn 'root' /etc/ |
awk -F: '{ print "filename is", $1, "and line number is", $2 }'
I'm just building on hop's answer, which I found to be more useful than find -exec. I had search_text dispersed all over my computer, in logs, config files and so on, but I didn't want to search (or especially change) anything in /dev, /sys, /proc, and so on. One note, read man xargs; it doesn't like file names with spaces.
grep -HriIl --exclude-dir=dev --exclude-dir=proc --exclude-dir=sys search_text / | xargs sed -i 's/search_text/replace_text/g'

Resources