I am trying to select the nth file in a folder of which the filename matches a certain pattern:
Ive tried using this with sed: e.g.,
sed -n 3p /path/to/files/pattern.txt
but it appears to return the 3rd line of the first matching file.
Ive also tried
sed -n 3p ls /path/to/files/*pattern*.txt
which doesnt work either.
Thanks!
Why sed, when bash is so much better at it?
Assuming some name n indicates the index you want:
Bash
files=(path/to/files/*pattern*.txt)
echo "${files[n]}"
Posix sh
i=0
for file in path/to/files/*pattern*.txt; do
if [ $i = $n ]; then
break
fi
i=$((i++))
done
echo "$file"
What's wrong with sed is that you would have to jump through many hoops to make it safe for the entire set of possible characters that can occur in a filename, and even if that doesn't matter to you you end up with a double-layer of subshells to get the answer.
file=$(printf '%s\n' path/to/files/*pattern*.txt | sed -n "$n"p)
Please, never parse ls.
ls -1 /path/to/files/*pattern*.txt | sed -n '3p'
or, if patterne is a regex pattern
ls -1 /path/to/files/ | egrep 'pattern' | sed -n '3p'
lot of other possibilities, it depend on performance or simplicity you look at
Related
The command that I'm making wants the first input to be a file and search how many times a certain pattern occurs within the file, using grep and sed.
Ex:
$ cat file1
oneonetwotwotwothreefourfive
Intended output:
$ ./command file1 one two three
one 2
two 3
three 1
The problem is the file does not have any lines and is just a long list of letters. I'm trying to use sed to replace the pattern I'm looking for with "FIND" and move the list to the next line and this continues until the end of file. Then, use $grep FIND to get the line that contains FIND. Finally, use wc -l to find a number of lines. However, I cannot find the option to move the list to the next line
Ex:
$cat file1
oneonetwosixone
Intended output:
FIND
FIND
twosixFIND
Another problem that I've been having is how to use the rest of the input, not including the file.
Failed attempt:
file=$1
for PATTERN in 2 3 4 5 ... N
do
variable=$(sed 's/$PATTERN/find/g' $file | grep FIND $file | wc -l)
echo $PATTERN $variable
exit
Another failed attempt:
file=$1
PATTERN=$($2,$3 ... $N)
for PATTERN in $*
do variable=$(sed 's/$PATTERN/FIND/g' $file | grep FIND $file | wc-1)
echo $PATTERN $variable
exit
Any suggestions and help will be greatly appreciated. Thank you in advance.
Non-portable solution with GNU grep:
file=$1
shift
for pattern in "$#"; do
echo "$pattern" $(grep -o -e "$pattern" <"$file" | wc -l)
done
If you want to use sed and your "patterns" are actually fixed strings (which don't contain characters that have special meaning to sed), you could do something like:
file=$1
shift
for pattern in "$#"; do
echo "$pattern" $(
sed "s/$pattern/\n&\n/g" "$file" |\
grep -e "$pattern" | wc -l
)
done
Your code has several issues:
you should quote use of variables where word splitting may happen
don't use ALLCAPS variable names - they are reserved for use by the shell
if you put a string in single-quotes, variable expansion does not happen
if you give grep a file, it won't read standard input
your for loop has no terminating done
This might work for you (GNU bash,sed and uniq):
f(){ local file=$1;
shift;
local args="$#";
sed -E 's/'${args// /|}'/\n&\n/g
s/(\n\S+)\n\S+/\1/g
s/\n+/\n/g
s/.(.*)/echo "\1"|uniq -c/e
s/ *(\S+) (\S+)/\2 \1/mg' $file; }
Separate arguments into file and remaining arguments.
Apply arguments as alternation within a sed substitution command which splits words into lines separated by a newline either side.
Remove unwanted words and unwanted newlines.
Evaluate the manufactured file within a sed substitution using the uniq command with the -c option.
Rearrange the output and print the result.
The problem is the file does not have any lines
Great! So the problem reduces to putting newlines.
func() {
file=$1
shift
rgx=$(printf "%s\\|" "$#" | sed 's#\\|$##');
# put the newline between words
sed 's/\('"$rgx"'\)/&\n/g' "$file" |
# it's just standard here
sort | uniq -c |
# filter only input - i.e. exclude fourfive
grep -xf <(printf " *[0-9]\+ %s\n" "$#")
};
func <(echo oneonetwotwotwothreefourfive) one two three
outputs:
2 one
1 three
3 two
I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly -formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and an probably not that widely known MySQL command line utility, replace. replace being optimized for string replacements was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.
More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then pipe them through sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has an UTF-8 setup, then it also boosts performance to place a LANG=C in front of the commands:
$ LANG=C sed ...
You can cut down unnecessary awk invocations and use BASH to break name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' will give enable read to populate name-value in 2 different shell variables old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-replace pair in a Bash array like ;
build your sed command based on this array using parameter expansion
run command.
patterns=(
old new
tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of pattern
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
search=${patterns[i]};
replace=${patterns[i+1]}; # … here we got the replacement part
sedArgs+=" -e s/$search/$replace/g"
done
sed ${sedArgs[#]} file
This result in this command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
pattern=${pattern}"s/${old}/${new}/g;"
done
sed -r ${pattern} -i file
This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut. cut may be more optimized then awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
More for fun to code via sed. Try maybe for a time perfomance because this start only 1 sed that is recursif.
for posix sed (so --posix with GNU sed)
explaination
copy replacement list in front of file content with a delimiter (for line with ³ and for list with -End-) for an easier sed handling (hard to use \n in class character in posix sed.
place all line in buffer (add the delimiter of line for replacement list and -End- before)
if this is -End-³, remove the line and go to final print
replace each first pattern (group 1) found in text by second patttern (group 2)
if found, restart (t again)
remove first line
restart process (t again). T is needed because b does not reset the test and next t is always true.
Thanks to #miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of sed's sequentially or parallel, but didn't see throughputs getting shorter than about a 20-hour runtime.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 mins, a lot less than 20 hours !
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script ?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
to expand on chthonicdaemon's solution
live demo
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$#"
prefixing lines with -+/ allows empty "plus" values, and protects leading whitespace from buggy text editors
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/;;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
since miku mentioned mysql replace:
replacing fixed strings with regex is non-trivial,
since you must escape all regex chars,
but you also must handle backslash escapes ...
naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n
I'm trying to trim a few lines from a file. I know exactly how many lines to remove (say, 2 from the top), but not how many total lines are in the file. So I tried this straightforward solution:
$ wc -l $FILENAME
119559 my_filename.txt
$ LINES=$(wc -l $FILENAME | awk '{print $1}')
$ tail -n $(($LINES - 2)) $FILENAME > $OUTPUT_FILE
The output is fine, but what happened to LINES??
$ wc -l $OUTPUT_FILE
119557 my_output_file.txt
$ echo $LINES
107
Hoping someone can help me understand what's going on.
$LINES has a special meaning. It is the number of rows the terminal has, and if you resize your terminal window, it will be re-set. See info "(bash)Bash Variables".
It always helps to decompose where you thing the problem is. Running
wc -l $FILENAME | awk '{print $1}'
should probably show you where the problem is.
Instead, use
LINES=$(wc -l < $FILENAME )
Hm.. Yes, I'm afraid #MichaelHoffman is probably has diagnosed your problem more accurately.
I hope this helps.
You could also just do sed 'X,Yd' < file
Where X,Y is the range of the lines you want to omit (in this case it would be 1,2).
Other alternatives are:
sed 'X,+Yd' omits Y lines starting from line X
sed /regex/,Yd' omits everything between the line where the regex matches and Y
sed '/regex/,+Yd' omits Y lines starting from where the regex matches
sed '/regex/,/regex/d' omits everything between the two regexs
Note: these are GNU sed extensions
I can't remember how to capture the result of an execution into a variable in a bash script.
Basically I have a folder full of backup files of the following format:
backup--my.hostname.com--1309565.tar.gz
I want to loop over a list of all files and pull the numeric part out of the filename and do something with it, so I'm doing this so far:
HOSTNAME=`hostname`
DIR="/backups/"
SUFFIX=".tar.gz"
PREFIX="backup--$HOSTNAME--"
TESTNUMBER=9999999999
#move into the backup dir
cd $DIR
#get a list of all backup files in there
FILES=$PREFIX*$SUFFIX
#Loop over the list
for F in $FILES
do
#rip the number from the filename
NUMBER=$F | sed s/$PREFIX//g | sed s/$SUFFIX//g
#compare the number with another number
if [ $NUMBER -lg $TESTNUMBER ]
#do something
fi
done
I know the "$F | sed s/$PREFIX//g | sed s/$SUFFIX//g" part rips the number correctly (though I appreciate there might be a better way of doing this), but I just can't remember how to get that result into NUMBER so I can reuse it in the if statement below.
Use the $(...) syntax (or ``).
NUMBER=$( echo $F | sed s/$PREFIX//g | sed s/$SUFFIX//g )
or
NUMBER=` echo $F | sed s/$PREFIX//g | sed s/$SUFFIX//g `
(I prefer the first one, since it is easier to see when multiple ones nest.)
Backticks if you want to be portable to older shells (sh):
NUMBER=`$F | sed s/$PREFIX//g | sed s/$SUFFIX//g`.
Otherwise, use NUMBER=$($F | sed s/$PREFIX//g | sed s/$SUFFIX//g). It's better and supports nesting more readily.
I am a newbie in Bash and I am doing some string manipulation.
I have the following file among other files in my directory:
jdk-6u20-solaris-i586.sh
I am doing the following to get jdk-6u20 in my script:
myvar=`ls -la | awk '{print $9}' | egrep "i586" | cut -c1-8`
echo $myvar
but now I want to convert jdk-6u20 to jdk1.6.0_20. I can't seem to figure out how to do it.
It must be as generic as possible. For example if I had jdk-6u25, I should be able to convert it at the same way to jdk1.6.0_25 so on and so forth
Any suggestions?
Depending on exactly how generic you want it, and how standard your inputs will be, you can probably use AWK to do everything. By using FS="regexp" to specify field separators, you can break down the original string by whatever tokens make the most sense, and put them back together in whatever order using printf.
For example, assuming both dashes and the letter 'u' are only used to separate fields:
myvar="jdk-6u20-solaris-i586.sh"
echo $myvar | awk 'BEGIN {FS="[-u]"}; {printf "%s1.%s.0_%s",$1,$2,$3}'
Flavour according to taste.
Using only Bash:
for file in jdk*i586*
do
file="${file%*-solaris*}"
file="${file/-/1.}"
file="${file/u/.0_}"
do_something_with "$file"
done
i think that sed is the command for you
You can try this snippet:
for fname in *; do
newname=`echo "$fname" | sed 's,^jdk-\([0-9]\)u\([0-9][0-9]*\)-.*$,jdk1.\1.0_\2,'`
if [ "$fname" != "$newname" ]; then
echo "old $fname, new $newname"
fi
done
awk 'if(match($9,"i586")){gsub("jdk-6u20","jdk1.6.0_20");print $9;}'
The if(match()) supersedes the egrep bit if you want to use it. You could use substr($9,1,8) instead of cut as well.
garph0 has a good idea with sed; you could do
myvar=`ls jdk*i586.sh | sed 's/jdk-\([0-9]\)u\([0-9]\+\).\+$/jdk1.\1.0_\2/'`
You're needing the awk in there is an artifact of the -l switch on ls. For pattern substitution on lines of text, sed is the long-time champion:
ls | sed -n '/^jdk/s/jdk-\([0-9][0-9]*\)u\([0-9][0-9]*\)$/jdk1.\1.0_\2/p'
This was written in "old-school" sed which should have greater portability across platforms. The expression says:
don't print lines unless they match -n
on lines beginning with 'jdk' do:
on a line that contains only "jdk-IntegerAuIntegerB"
change it to "jdk.1.IntegerA.0_IntegerB"
and print it
Your sample becomes even simpler as:
myvar=`echo *solaris-i586.sh | sed 's/-solaris-i586\.sh//'`