Multiple grep separator and display file information - bash

I want to grep for several pieces of information (split by multiple separators) in a set of files and also display file information, all with a single command. The files are named like this:
./WBL-FILE-S-1-execution79065.html
./WBL-FILE-S-1-execution79066.html
./WBL-FILE-S-1-execution79067.html
If I do:
find . -type f -name "*WBL-FILE*" | xargs grep "Fichier lu"
I get results like:
./WBL-FILE-S-1-execution79065.html:<td title="Message">Fichier lu /opt/data/in/bl/000334_iwel1C010116730.blc.TRT</td>
./WBL-FILE-S-1-execution79065.html:<td title="Message">Fichier lu /opt/data/in/bl/000312_iwel1C010116727.blc.TRT</td>
./WBL-FILE-S-1-execution74707.html:<td title="Message">Fichier lu /opt/data/in/bl/000420_iwel1C010116284.blc.TRT</td>
The goal is to get the file's date, the filename, the XXXXXX_iwel number, and the CXXXXXXXXX number.
Example:
2021-07-13 13:47 WBL-FILE-S-1-execution79065.html 000334 010116730
2021-07-13 14:48 WBL-FILE-S-1-execution79065.html 000312 010116727
2021-07-14 14:49 WBL-FILE-S-1-execution74707.html 000420 010116284
I almost succeeded in extracting the different parts, but after that I can't get the "ls" (date) information for the original file.
Is there a way to do that with a one-line combination of commands?
Thank you

If you want to add the file's date, grep alone won't cut it anymore. Also, extracting the XXXXXX_iwel and CXXXXXXXXX numbers and printing them on the same line is not possible with grep alone.
Therefore I would switch to perl:
perl -nle 'use POSIX "strftime";
BEGIN { sub mtime { strftime "%Y-%m-%d %H:%M:%S", localtime((stat $ARGV)[9]) } }
/Fichier lu.*?(\d+)_iwel.*?C(\d+)/ && print join " ", mtime, $ARGV, $1, $2'
Since all your files are in the same directory, you can use
perl ... *WBL-FILE*
For a recursive file search, use find -exec instead of find | xargs. This is not only more efficient, but also safer when filenames contain whitespace or special characters like "'\.
find -type f -name '*WBL-FILE*' -exec perl ... {} +

For each file, you can display the information you need with one awk command.
awk 'match($0, /Fichier lu.*[^0-9]([0-9]*)_iwel[^C]*C([0-9]*)/, array) { date_command="date +\"%Y-%m-%d %H:%M:%S\" --date @$(stat -c %Y " FILENAME ")"; date_command | getline formatted_date; close(date_command); print formatted_date, FILENAME, array[1], array[2]}' /path/to/file
It can be rewritten like this for clarity:
awk 'match($0, /Fichier lu.*[^0-9]([0-9]*)_iwel[^C]*C([0-9]*)/, array) {
date_command="date +\"%Y-%m-%d %H:%M:%S\" --date #$(stat -c %Y " FILENAME ")";
date_command | getline formatted_date;
close(date_command);
print formatted_date, FILENAME, array[1], array[2]
}'
Basically it does 3 things:
It matches all lines containing Fichier lu and captures the numbers from XXXXXX_iwel and CXXXXXXXXX into an array
It runs a shell command to get the modification date of the file in the desired format
It prints all the information you want on the same line
You can plug it after find of course.
find . -name "*WBL-FILE*" | xargs awk 'match($0, /Fichier lu.*[^0-9]([0-9]*)_iwel[^C]*C([0-9]*)/, array) { date_command="date +\"%Y-%m-%d %H:%M:%S\" --date #$(stat -c %Y " FILENAME ")"; date_command | getline formatted_date; close(date_command); print formatted_date, FILENAME, array[1], array[2]}'
Result:
2021-07-28 10:45:50 ./WBL-FILE-S-1-execution79065.html 000334 010116730
2021-07-28 10:45:50 ./WBL-FILE-S-1-execution79065.html 000312 010116727
2021-07-28 10:46:41 ./WBL-FILE-S-1-execution74707.html 000420 010116284
Side notes
I used the three-argument form of the match function (with an array for the captures), which is specific to GNU Awk, also known as gawk. If you don't have it, it's still possible, but it requires another way to capture the strings.
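For reference, here is a minimal sketch of the same capture using only POSIX match()/substr() (assuming the line format shown above), in case gawk is not available:
awk '/Fichier lu/ {
    s = $0
    sub(/.*Fichier lu[^0-9]*/, "", s)                                 # keep only the part starting at the XXXXXX number
    match(s, /[0-9]+_iwel/); n1 = substr(s, RSTART, RLENGTH - 5)      # strip the trailing "_iwel"
    match(s, /C[0-9]+/);     n2 = substr(s, RSTART + 1, RLENGTH - 1)  # drop the leading "C"
    print FILENAME, n1, n2
}' ./WBL-FILE-S-1-execution79065.html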
The trickiest part is probably the command for getting the date, because we need to build a string for the command, then call it, then store the result in a variable. It's a bit messy. It also requires a two-step process: get the date in Epoch time (i.e. the number of seconds since 1970-01-01) and then format this value as YYYY-MM-DD HH:MM:SS. On the other hand, you can adapt these steps very easily. For instance you can display the date in another format by changing the +\"%Y-%m-%d %H:%M:%S\" string sent to date. Or you can display the creation date instead of the last modification date by changing the -c %Y option sent to stat.
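For illustration, the two steps look like this when run by hand (a sketch, assuming GNU stat and GNU date; the filename is one from the question):
stat -c %Y ./WBL-FILE-S-1-execution79065.html                                           # modification time in seconds since 1970-01-01 (Epoch)
date --date "@$(stat -c %Y ./WBL-FILE-S-1-execution79065.html)" +"%Y-%m-%d %H:%M:%S"    # the same value, formatted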
The command is not robust against filenames and folders containing whitespace. To fix this, first you may use an ugly syntax, replacing $(stat -c %Y " FILENAME ")" with $(stat -c %Y '"'"'" FILENAME "'"'"')" in the date call. Yikes. This is due to how we build the string on one line. Secondly, you may use one of the following commands to make sure filenames are passed correctly (to simplify, let's say the awk script is stored in the AWKSTRING variable).
find . -name "*WBL-FILE*" -print0 | xargs -0 awk "$AWKSTRING"
find . -name "*WBL-FILE*" -exec awk "$AWKSTRING" {} \;
find . -name "*WBL-FILE*" -exec awk "$AWKSTRING" {} +
The last form is probably slightly more efficient than the others, but not all versions of find support it.

Related

Input folder / output folder for each file in AWK [duplicate]

I am trying to run (several) awk scripts over a list of files and would like each input file to produce its own output file in a different folder. I have already tried several ways but cannot find the solution. The output in the output folder is always a single file called {} which contains the content of all files from the input folder.
Here is my code:
input_folder="/path/to/input"
output_folder="/path/to/output"
find $input_folder -type f -exec awk '! /rrsig/ && ! /dskey/ {print $1,";",$5}' {} >> $output_folder/{} \;
Can you please give me a hint what I am doing wrong?
The code is called in a .sh script.
I'd probably opt for a (slightly) less complicated find | xargs, eg:
find "${input_folder}" -type f | xargs -r \
awk -v outd="${output_folder}" '
FNR==1 { close(outd "/" outf); outf=FILENAME; sub(/.*\//,"",outf) }
! /rrsig/ && ! /dskey/ { print $1,";",$5 > (outd "/" outf) }'
NOTE: the commas in $1,";",$5 will insert spaces between $1, ;, and $5; if the spaces are not desired then use $1 ";" $5 (ie, remove the commas)
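If filenames may contain spaces, a -print0/xargs -0 variant of the same pipeline is a reasonable sketch (assuming GNU find and xargs):
find "${input_folder}" -type f -print0 | xargs -0 -r \
awk -v outd="${output_folder}" '
FNR==1 { close(outd "/" outf); outf=FILENAME; sub(/.*\//,"",outf) }
! /rrsig/ && ! /dskey/ { print $1,";",$5 > (outd "/" outf) }'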

Extracting Number From a File

I'm trying to write a script (with bash) that looks for a word (for example "SOME(X) WORD:") and prints the rest of the line, which is effectively some numbers with "-" in front. To clarify, an example line that I'm looking for in a file is:
SOME(X) WORD: -1.0475392439 ANOTHER.W= -0.0590214433
I want to extract the number after "SOME(X) WORD:", so "-1.0475392439" for this example. I have a similar script to this which extracts the number from the following line (both lines are from the same input file)
A-DESIRED RESULT W( WORD) = -9.68765465413
And the script for this is,
local output="$1"
local ext="log"
local word="W( WORD)"
cd $dir
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0,ptn) {print $NF,FILENAME}' {} +
But when I change the local word variable from "W( WORD)" to "SOME(X) WORD", it captures "-0.0590214433" instead of "-1.0475392439", meaning it takes the last number in the line. How can I find a solution to this? Thanks in advance!
As you have seen, print $NF outputs the last field of the line. Please modify the find line as follows:
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0, ptn) {if (match($0, /-[0-9]+\.[0-9]+/)) print substr($0, RSTART, RLENGTH), FILENAME}' {} +
Then it will output the first number in the line.
Please note it assumes the number always starts with the - sign.
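If the sign may be absent, a sketch with an optional leading minus (assuming the number always contains a decimal point) would be:
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0, ptn) {if (match($0, /-?[0-9]+\.[0-9]+/)) print substr($0, RSTART, RLENGTH), FILENAME}' {} +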

Calling external command in find and using pipe

I am wondering if there is a way to search for all the files from a certain directory including subdirectories using a find command on AIX 6.x, before calling an external command (e.g. hlcat) to display/convert them into a readable format, which can then be piped through a grep command to find a pattern instead of using loops in the shell?
e.g. find . -type f -name "*.hl7" -exec hlcat {} | grep -l "pattern" \;
The above command would not work and I have to use a while loop to display the content and search for the pattern as follows:
find . -type f -name "*.hl7" -print | while read file; do
hlcat $file | grep -l "pattern";
done
At the same time, these HL7 files have been renamed with round brackets, which prevents them from being opened without putting double quotes around the file name.
e.g. hlcat (patient) filename.hl7 will fail to open.
hlcat "(patient) filename.hl7" will work.
In short, I am looking for a clean, concise one-liner built around find to view and search the content of these HL7 files with round brackets in their names.
Many thanks,
George
P.S. HL7 raw data is made up of one continuous line and is not readable unless it is converted into a workable reading format using tools such as hlcat.
Update: The easy way
find . -type f -name '*.hl7' -exec grep -iEl 'Barry|Jolene' {} +
note: You may get some false positives though. See below for a targeted search.
Searching for a first name in a bunch of HL7v2 files:
1. Looking into the HL7v2 file format
Example of HL7v2 PID segment:
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M|||
PID Segment decomposition:
Seq  NAME                HHIC USE            LEN
0    PID keyword         Segment Type          3
3    Patient ID          Medical Record Num  250
5    Patient Name        Last^First^Middle   250
7    Date/Time Of Birth  YYYYMMDD             26
8    Sex                 F, M, or U            1
2. Writing targeted searches
With grep (AIX):
find . -type f -name '*.hl7' -exec grep -iEl '^PID\|([^|]*\|){4}[^^|]*\^(Barry|Jolene)\^' {} +
With awk:
find . -type f -name '*.hl7' -exec awk -v firstname='^(Barry|Jolene)$' '
BEGIN { FS="|" }
FNR == 1 { if( found ) print filename; found = 0; filename = FILENAME }
$1 == "PID" { split($6, name, "^"); if (toupper(name[2]) ~ toupper(firstname)) { found = 1 } }
END { if ( found ) print filename }
' {} +
remark: The good part of this awk solution is that you pass the first-name regexp as an argument. It is also easy to extend, for example to search on the last name, as sketched below.
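For example, a sketch of the same script searching on the last name instead (KLEINSAMPLE is just the sample value from the PID segment above):
find . -type f -name '*.hl7' -exec awk -v lastname='^(KLEINSAMPLE)$' '
BEGIN { FS="|" }
FNR == 1 { if( found ) print filename; found = 0; filename = FILENAME }
$1 == "PID" { split($6, name, "^"); if (toupper(name[1]) ~ toupper(lastname)) { found = 1 } }
END { if ( found ) print filename }
' {} +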

Bash - Search and Replace operation with reporting the files and lines that got changed

I have an input file "test.txt" as below -
hostname=abc.com hostname=xyz.com
db-host=abc.com db-host=xyz.com
In each line, the value before the space is the old value, which needs to be replaced by the new value after the space, recursively in a folder named "test". I am able to do this using the shell script below.
#!/bin/bash
IFS=$'\n'
for f in `cat test.txt`
do
OLD=$(echo $f| cut -d ' ' -f 1)
echo "Old = $OLD"
NEW=$(echo $f| cut -d ' ' -f 2)
echo "New = $NEW"
find test -type f | xargs sed -i.bak "s/$OLD/$NEW/g"
done
"sed" replaces the strings on the fly in 100s of files.
Is there a trick or an alternative way by which I can get a report of the files changed, like the absolute path of the file and the exact lines that got changed?
PS - I understand that sed and other stream editors don't support this functionality out of the box. I don't want to use versioning, as it would be overkill for this task.
Let's start with a simple rewrite of your script, to make it a little bit more robust at handling a wider range of replacement values, but also faster:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
find test -type f -exec sed "s/$(escapeRegex "$old")/$(escapeSubst "$new")/g" -i '{}' \;
done <test.txt
So, we loop over pairs of whitespace-separated fields (old, new) in lines from test.txt and run a standard sed in-place replace on all files found with find.
Pretty similar to your script, but we properly read lines from test.txt (no word splitting, pathname/variable expansion, etc.), we use Bash builtins whenever possible (no need to call external tools like cat, cut, xargs); and we escape sed metacharacters in old/new values for proper use as sed's regexp and replacement expressions.
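To see what the two helpers do, here is a quick illustrative check (the inputs are made up):
escapeRegex "db-host=abc.com"   # prints: [d][b][-][h][o][s][t][=][a][b][c][.][c][o][m]
escapeSubst "new&value/1"       # prints: new\&value\/1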
Now let's add logging from sed:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
find test -type f -printf '\n[%p]\n' -exec sed "/$(escapeRegex "$old")/{
h
s//$(escapeSubst "$new")/g
H
x
s/\n/ --> /
w /dev/stdout
x
}" -i '{}' > >(tee -a change.log) \;
done <test.txt
The sed script above changes each old to new, but it also writes old --> new line to /dev/stdout (Bash-specific), which we in turn append to change.log file. The -printf action in find outputs a "header" line with file name, for each file processed.
With this, your "change log" will look something like:
[file1]
hostname=abc.com --> hostname=xyz.com
[file2]
[file1]
db-host=abc.com --> db-host=xyz.com
[file2]
db-host=abc.com --> db-host=xyz.com
Just for completeness, a quick walk-through the sed script. We act only on lines containing the old value. For each such line, we store it to hold space (h), change it to new, append that new value to the hold space (joined with newline, H) which now holds old\nnew. We swap hold with pattern space (x), so we can run s command that converts it to old --> new. After writing that to the stdout with w, we move the new back from hold to pattern space, so it gets written (in-place) to the file processed.
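For reference, here is the same sed script with each step commented (a sketch using literal old/new strings for readability):
sed -i '/old/{
h                # save the original (old) line in the hold space
s//new/g         # the empty regexp // reuses /old/; do the replacement in the pattern space
H                # append the replaced line to the hold space -> "old\nnew"
x                # swap spaces: the pattern space is now "old\nnew"
s/\n/ --> /      # turn it into the log line "old --> new"
w /dev/stdout    # write the log line to stdout
x                # swap back so the replaced line is what gets written to the file
}' somefile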
From man sed:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
This can be used to create a backup file when replacing. You can then diff each backup against its edited counterpart to see exactly which files changed and what changed in them. Once you're done inspecting the diffs, simply remove the backup files.
If you formulate your replacements as sed statements rather than a custom format you can go one further, and use either a sed shebang line or pass the file to -f/--file to do all the replacements in one operation.
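A sketch of that workflow (assuming GNU sed and GNU find), where every processed file gets a .bak copy you can diff and then delete:
find test -type f -exec sed -i.bak 's/abc\.com/xyz.com/g' {} +
find test -name '*.bak' -exec sh -c 'diff -u "$1" "${1%.bak}"' _ {} \;   # show what changed in each file
find test -name '*.bak' -delete                                          # clean up once inspected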
There's several problems with your script, just replace it all with (using GNU awk instead of GNU sed for inplace editing):
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; print; next }   # print keeps test.txt intact under inplace editing
{ for (old in map) gsub(old,map[old]); print }
' test.txt "${files[@]}"
You'll find that is orders of magnitude faster than what you were doing.
That still has the issues your existing script has: it fails when the "test.txt" strings contain regexp or backreference metacharacters, it can modify previously-modified strings, and it matches partial strings. If that's an issue, let us know, as it's easy to work around with awk (and extremely difficult with sed!).
To get whatever kind of report you want you just tweak the { for ... } line to print them, e.g. to print a record of the changes to stderr:
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; print; next }
{
    orig = $0
    for (old in map) {
        gsub(old,map[old])
    }
    if ($0 != orig) {
        printf "File %s, line %d: \"%s\" became \"%s\"\n", FILENAME, FNR, orig, $0 | "cat>&2"
    }
    print
}
' test.txt "${files[@]}"

bash script reading lines in every file copying specific values to newfile

I want to write a script helping me to do my work.
Problem: I have many files in one directory containing data, and I need specific values from every file copied into a new file.
The data files can look like this:
Name abc $desV0
Start MJD56669 opCMS v2
End MJD56670 opCMS v2
...
valueX 0.0456 RV_gB
...
valueY 12063.23434 RV_gA
...
What the script should do is copy valueX and the value following it, and also valueY and the value following it, into a new file on one line, and then add to that line the name of the source data file. Additionally, the value of valueY should only contain everything before the dot.
The result should look like this:
valueX 0.0456 valueY 12063 name_of_sourcefile
This is what I have so far:
for file in $(find -maxdepth 0 -type f -name *.wt); do
for line in $(cat $file | grep -F vb); do
cp $line >> file_done
done
done
But that doesn't work at all. I also have no idea how to get the data onto ONE line in the new file.
Can anyone help me?
I think you can simplify your script a lot using awk:
awk '/valueX/{x=$2}/valueY/{print "valueX",x,"valueY",$2,FILENAME}' *.wt > file_done
This goes through every file in the current directory. When "valueX" is matched, the value is saved to the variable x. When "valueY" is matched, the line is printed.
This assumes that the line containing "valueX" always comes before the one containing "valueY". If that isn't a valid assumption, the script can easily be changed.
To print only the integer part of "valueY", you can use printf instead of print:
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' *.wt > file_done
%d is the format specifier for an integer.
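If valueX does not always precede valueY, a sketch that buffers both values and prints once a pair has been seen in a file could look like this:
awk '
FNR == 1 { x = y = "" }                 # reset per file
/valueX/ { x = $2 }
/valueY/ { y = $2 }
x != "" && y != "" { printf "valueX %s valueY %d %s\n", x, y, FILENAME; x = y = "" }
' *.wt > file_done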
If your requirements are more complex and you need to use find, you should use -exec rather than looping through the results, to avoid problems with awkward file names:
find -maxdepth 1 -iname "5*.par" ! -iname "*_*" -exec \
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,"{}"}' '{}' \; > file_done
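Since awk already knows the current file name, a variant using FILENAME (and batching with +) is an equivalent sketch that avoids embedding {} inside the program text:
find -maxdepth 1 -iname "5*.par" ! -iname "*_*" -exec \
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' {} +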
Don't fight. I'm really thankful for your help and especially for the fast answers.
This is my final solution I think:
#!/bin/bash
for file in $(find * -maxdepth 1 -iname "5*.par" ! -iname "*_*"); do
awk '/TASC/{x=$2}/START/{printf "TASC %s MJD %d %s",x,$2, FILENAME}' $file > mjd_vs_tasc
done
Many thanks again to you guys.
Try something like this:
egrep "valueX|valueY" *.wt | awk -vRD="\n" -vORS=" " -F':| ' '{if (NR%2==0) {print $2, $3, $1} else {print $2, $3}}' > $file.new.txt
