bash cat exclude multiple files based on grep results

I have the following cat command that I use in a bash script. It looks for $SAMPLE.txt files in subfolders matching 20* and combines them into one output.txt:
cat /$FOLDER/20*/$SAMPLE.txt > /$OUTPUTFOLDER/output.txt
I now want to exclude certain files conditionally.
I found the following here https://unix.stackexchange.com/questions/246048/cat-files-except-one
$ shopt -s extglob
$ cat -- !(DISCARD).txt > catKEPT
I want to do something like this.
Look for $SAMPLE and a pattern '$PAT1' in a $SAMPLEFILE. This $SAMPLEFILE is comma-separated. If there is a match, I want to store the first field of that line and use it to exclude files from cat.
I would use this command to look for $SAMPLE and $PAT1, then cut to keep the first field, and assign the result to a variable EXCLUDE_FOLDER:
EXCLUDE_FOLDER=grep '$SAMPLE' $SAMPLEFILE | grep '$PAT1' | cut -d "," -f 1
And then use it like this
cat /$FOLDER/20*/$SAMPLE.txt -- !($FOLDER/$EXCLUDE_FOLDER/$SAMPLE.txt) > /$OUTPUTFOLDER/output.txt
I'm stuck on putting this into an if statement and on handling situations where grep returns multiple matches, so that multiple files should be excluded.

If SAMPLE and PAT are variables, you presumably want them expanded to their contents, which means you must put them in double quotes, not single quotes. Example:
SAMPLE=3
# Compare single quotes versus double
echo '$SAMPLE' # outputs $SAMPLE
echo "$SAMPLE" # outputs 3
If SAMPLEFILE is the name of a file, you must double-quote it too, or the command will fail if the filename has spaces in it, so use:
grep "$SAMPLE" "$SAMPLEFILE"
So, now you can test if your grep works like this:
grep "$SAMPLE" "$SAMPLEFILE" | grep "$PAT1" | cut -d "," -f 1
So, if that works, the next thing is that you want to capture the output of the command, so you need to use $(...). That means:
EXCLUDE_FOLDER=$(grep "$SAMPLE" "$SAMPLEFILE" | grep "$PAT1" | cut -d "," -f 1)
So, test if that works now:
echo "$EXCLUDE_FOLDER"

Related

Extract specific string from line with standard grep, egrep or awk

I'm trying to extract a specific string from grep output.
uci show minidlna
produces a large list
...
minidlna.config.enabled='1'
minidlna.config.db_dir='/mnt/sda1/usb/db'
minidlna.config.enable_tivo='1'
minidlna.config.wide_links='1'
...
So I tried to narrow down what I wanted by running
uci show minidlna | grep -oE '\bdb_dir=\S+'
this narrows the output to
db_dir='/mnt/sda1/usb/db'
What I want is to output only
/mnt/sda1/usb/db
without the quotes and without the leading "db_dir", so I can run rm /mnt/sda1/usb/db/file.db
I've used the answers found here:
How to extract string following a pattern with grep, regex or perl
and that's as close as I got.
EDIT: after using Ed Morton's awk command I needed to pass the output to the rm command. I used:
| ( read DB; rm "$DB"/files.db )
read DB reads the output into the variable DB.
( ... ) groups the commands.
rm "$DB"/files.db deletes the file files.db.
Is this what you're trying to do?
$ awk -F"'" '/db_dir/{print $2}' file
/mnt/sda1/usb/db
That will work in any awk in any shell on every UNIX box.
If that's not what you want then edit your question to clarify your requirements and post more truly representative sample input/output.
Using sed with some effort to avoid single quotes:
sed -n 's/^minidlna.config.db_dir=\s*\S\(\S*\)\S\s*$/\1/p' input
Well, so you end up having a string like db_dir='/mnt/sda1/usb/db'.
I would first remove the quotes by piping this to
.... | tr -d "'"
Now you end up with a string like db_dir=/mnt/sda1/usb/db.
Say you have this string stored in a variable named confstr, then
${confstr##*=}
gives you just /mnt/sda1/usb/db, since *= matches everything from the start through the equals sign, and ## removes the longest such match from the front.
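Putting those pieces together, a hedged end-to-end sketch of this approach:

confstr=$(uci show minidlna | grep -oE '\bdb_dir=\S+' | tr -d "'")
dbdir=${confstr##*=}       # strips everything up to and including '='
rm "$dbdir/file.db"        # e.g. rm /mnt/sda1/usb/db/file.db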
I would do this:
Once you have extracted your line above into file.txt (or you pipe it into this command), split the fields using the quote character. Use printf to generate the rm command and pass it to bash to execute.
$ awk -F"'" '{printf "rm %s/file.db\n", $2}' file.txt | bash
rm: /mnt/sda1/usb/db/file.db: No such file or directory
With your original command:
$ uci show minidlna | grep -oE '\bdb_dir=\S+' | \
awk -F"'" '{printf "rm %s/file.db\n", $2}' | bash

How can I use grep as cat

I am creating a script that parses some files and greps out the necessary information.
I have set up many different variables in arrays that search for different aspects in the files.
e.g. dates, locations, types.
However, I wanted to make each of these variables optional, which is where I run into an issue.
The syntax of the command would have been simple:
grep ${dates} filename | grep ${locations} | grep ${types}
However, due to variables being optional, the above won't work if a variable is unset.
I was trying to find a way to get grep to match anything (i.e. like egrep .* filename),
so that I could set the variables to a suitable regex and have the command still run.
Unfortunately, when I set a variable to '' it freezes; when I set it to '.' it greps everything from every file in the current directory; and when I leave it blank, grep takes the filename as the pattern and waits for input.
Is there any way I can set a variable so that grep $variable file outputs the same as cat would?
Many thanks in advance!
To get grep to act like cat, use an empty string as the search pattern, i.e. grep "". Therefore, to make some of those variables optional without having the piped greps fail, just quote the variables:
grep "${dates}" filename | grep "${locations}" | grep "${types}"
Demonstration. Search {250,255,260...280} for the digits 5, 2, and 7:
x=5 y=2 z=7 ; seq 250 5 280 | grep "$x" | grep "$y" | grep "$z"
275
Now unset two of the variables, and it still works:
unset x y ; seq 250 5 280 | grep "$x" | grep "$y" | grep "$z"
270
275
If there aren't any $dates in filename then there is nothing to feed the rest of the pipeline.
I think the best way to do it is to grep for each string separately.
If you get a match, then grep for the next string.
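A minimal sketch of that idea (my own, reusing the question's variable names): apply each grep only when its pattern variable is non-empty.

out=$(cat filename)
for pat in "$dates" "$locations" "$types"; do
    # skip unset/empty patterns instead of feeding grep an empty regex
    [ -n "$pat" ] && out=$(printf '%s\n' "$out" | grep -- "$pat")
done
printf '%s\n' "$out"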
Can you post the source file and a target output? From your question it sounds like you just need grep -E together with the | alternation operator.
grep -E "${dates}|${locations}|${types}" fileName
The above line should get you every occurrence of any of the patterns. Note that this matches lines containing any one of the patterns (OR), whereas the piped greps require all of them (AND).
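A quick illustration of that difference, reusing the seq demo from the answer above:

$ seq 250 5 280 | grep -E '5|7'    # OR: lines containing 5 or 7
250
255
265
270
275
$ seq 250 5 280 | grep 5 | grep 7  # AND: lines containing both
275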

How to grep all characters in file

I have a CSV file with these lines:
----------+79975532211,----------+79975532212
4995876655,4995876658
I try to grep these lines in a Bash script:
#!/bin/bash
config='/test/config.conf'
sourcecsv=/test/sourse.csv
cat $sourcecsv | while read line
do
Oldnumber=$(echo $line | cut -d',' -f1)
cat $config | grep "\\$Oldnumber" -B 8
done
But when the script greps for the value 4995876655, I get an error:
grep: Invalid back reference
How can I grep all values in my file?
Instead of:
cat $config | grep "\\$Oldnumber" -B 8
You should do:
grep -B 8 -F -- "$Oldnumber" "$config"
If you really mean to grep for all strings between commas, you can do it all in one go.
tr ',' '\n' </test/sourse.csv |
grep -F -f - -B 8 /test/config.conf
If you need to obtain the matches in sequence (all matches for the first string followed by all matches for the second, etc) then maybe loop over them with a proper while loop:
tr ',' '\n' </test/sourse.csv |
while read -r Oldnumber; do
grep -F -B 8 -e "$Oldnumber" /test/config.conf
done
Keeping the file names in variables does not seem to offer any advantage here.
If you mean to search for the strings preceded by a literal backslash, you can add it back; the -F option I added makes grep treat the strings as literals. If you need metacharacters and take out the -F option, you have to double the backslashes: inside double quotes, a single backslash needs to be written as two; and to get a literal backslash in a regular expression, you need two of them again.
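For completeness, a hedged rewrite of the question's script with these fixes applied (same file names as in the question):

#!/bin/bash
config='/test/config.conf'
sourcecsv='/test/sourse.csv'
# take the first comma-separated field of each line; -F treats it as a literal
while IFS=',' read -r Oldnumber _; do
    grep -B 8 -F -- "$Oldnumber" "$config"
done < "$sourcecsv"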

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
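For instance, with the sample pairs from the question, the first sed emits a sed script that the second sed reads from stdin via -f -:

$ sed -e 's/^/s|/; s/$/|g/' replacement_list
s|old|new|g
s|tobereplaced|replacement|g
s|(stuffiwant).*(too)|\1\2|g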
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not that widely known MySQL command line utility, replace. replace, being optimized for string replacements, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
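A hedged usage sketch (from-string/to-string pairs, then -- and the file, or used as a filter; syntax from the MySQL docs and the chunked-script answer further down):

replace old new tobereplaced replacement -- file
replace old new tobereplaced replacement < in > out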
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA state machine of the strings, and the speed isn't dependent on the count of replace-strings (only on the number of replaces). A line is assumed to end with \n or \0. There are no limits except memory on the length of strings.
More on sed: you can utilize multiple cores with sed by splitting your replacements into #cpus groups and then chaining sed commands in a pipe, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 setup, then it also boosts performance to place LANG=C in front of the commands:
$ LANG=C sed ...
You can cut down unnecessary awk invocations and use BASH to break name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' enables read to populate the name and value in two different shell variables, old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-replace pairs in a Bash array, as shown below;
build the sed argument list based on this array;
run the command.
patterns=(
old new
tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of array entries
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # step by two: pattern, then replacement
search=${patterns[i]}
replace=${patterns[i+1]}
sedArgs+=(-e "s/$search/$replace/g") # append as array elements, not a string
done
sed "${sedArgs[@]}" file
This results in this command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this:
pattern=''
while read -r i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list # redirect instead of cat | while, so $pattern survives the loop
sed -r "${pattern}" -i file
This will run the sed command only once on the file, with all the replacements. You may also want to replace awk with cut; cut may be more optimized than awk, though I am not sure about that.
old=$(echo "$i" | cut -d'|' -f1)
new=$(echo "$i" | cut -d'|' -f2)
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
This one is more for fun, coded in sed; it may be worth timing, because it starts only one sed, applied recursively.
For POSIX sed (so use --posix with GNU sed).
Explanation:
Copy the replacement list in front of the file content, with a delimiter (³ for each list line, -End- for the end of the list), for easier sed handling (it is hard to use \n inside a character class in POSIX sed).
Place all lines in the hold buffer (adding the delimiters for the replacement list and -End- first).
If the top line is -End-³, remove it and go to the final print.
Replace each first pattern (group 1) found in the text by the second pattern (group 2).
If something was replaced, restart (t again).
Remove the first line.
Restart the process (t again). t is needed because b does not reset the test flag, so a following t would otherwise always be true.
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement strings.
I tried various combinations of seds run sequentially or in parallel, but didn't see runtimes shorter than about 20 hours.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do
    # create a script file and make it executable
    echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh
    # for each chunk-file line, strip the line-ends, then with sed
    # turn '{long list}' into 'cat in | replace {long list} > out'
    tr '\n' ' ' < $F | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh
    # and append the commands that switch the in and out files for the next script
    echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 minutes, a lot less than 20 hours!
Looking back, I could have used more pairs per script by working out how many lines would fit within the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
To expand on chthonicdaemon's solution:
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$@"
Prefixing lines with -, +, and / allows empty "plus" values, and protects leading whitespace from buggy text editors.
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/sg;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
This regex format is compatible with sed and perl.
Since miku mentioned MySQL's replace: replacing fixed strings with a regex is non-trivial, since you must escape all regex chars, but you also must handle backslash escapes ...
A naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

Bash variables not acting as expected

I have a bash script which parses a file line by line, extracts the date using a cut command and then makes a folder using that date. However, it seems like my variables are not being populated properly. Do I have a syntax issue? Any help or direction to external resources is very appreciated.
#!/bin/bash
ls | grep .mp3 | cut -d '.' -f 1 > filestobemoved
cat filestobemoved | while read line
do
varYear= $line | cut -d '_' -f 3
varMonth= $line | cut -d '_' -f 4
varDay= $line | cut -d '_' -f 5
echo $varMonth
mkdir $varMonth'_'$varDay'_'$varYear
cp ./$line'.mp3' ./$varMonth'_'$varDay'_'$varYear/$line'.mp3'
done
You have many errors and non-recommended practices in your code. Try the following:
for f in *.mp3; do
f=${f%%.*}
IFS=_ read _ _ varYear varMonth varDay <<< "$f"
echo $varMonth
mkdir -p "${varMonth}_${varDay}_${varYear}"
cp "$f.mp3" "${varMonth}_${varDay}_${varYear}/$f.mp3"
done
The actual error is that you need to use command substitution. For example, instead of
varYear= $line | cut -d '_' -f 3
you need to use
varYear=$(cut -d '_' -f 3 <<< "$line")
A secondary error there is that $foo | some_command on its own line does not mean that the contents of $foo get piped to the next command as input; rather, $foo is executed as a command, and the output of that command is passed to the next one.
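A tiny demonstration of that pitfall (the filename is made up):

line=show_ep1_2019_06_07
varYear= $line | cut -d '_' -f 3    # tries to execute "show_ep1_2019_06_07"
varYear=$(cut -d '_' -f 3 <<< "$line")
echo "$varYear"                     # -> 2019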
Some best practices and tips to take into account:
Use a portable shebang line - #!/usr/bin/env bash (disclaimer: That's my answer).
Don't parse ls output.
Avoid useless uses of cat.
Use More Quotes™
Don't use files for temporary storage if you can use pipes. It is literally orders of magnitude faster, and generally makes for simpler code if you want to do it properly.
If you have to use files for temporary storage, put them in the directory created by mktemp -d. Preferably add a trap to remove the temporary directory cleanly.
There's no need for a var prefix in variables.
grep searches for basic regular expressions by default, so .mp3 matches any single character followed by the literal string mp3. If you want to match a literal dot, either use grep -F to search for literal strings or escape the regular expression as \.mp3 (see the sketch after this list).
You generally want to use read -r (defined by POSIX) to treat backslashes in the input literally.
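A quick illustration of the .mp3 point (file names made up):

printf '%s\n' song.mp3 xmp3.txt | grep .mp3     # matches both: . is any character
printf '%s\n' song.mp3 xmp3.txt | grep '\.mp3'  # matches only song.mp3
printf '%s\n' song.mp3 xmp3.txt | grep -F .mp3  # same, via literal matching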
