Get everything in a file after a grep'd string - bash

Yesterday a situation came up where someone needed me to separate out the tail end of a file, specified as being everything after a particular string (for sake of argument, "FOO"). I needed to do this immediately so went with the option that I knew would work and disregarded The Right Way or The Best Way, and went with the following:
grep -n FOO FILE.TXT | cut -f1 -d":" | xargs -I{} tail -n +{} FILE.TXT > NEWFILE.TXT
The thing that bugged me about this was the use of xargs for a singleton value. I thought that I could go flex my Google-Fu on this but was interested to see what sort of things people out in SO-land came up with for this situation

sed -n '/re/,$p' file
is what occurs to me right off.

Did you consider just using grep's '--after-context' argument?
Something like, this should do the trick, with a sufficiently large number to tail out the end of the file:
grep --after-context=999999 -n FOO FILE.TXT > NEWFILE.TXT

Similar to geekosaur's answer above, but this option excludes rather than includes the matched line:
sed '1,/regex/d' myfile
Found this one here after trying the option above.

Related

I want to pipe grep output to sed for input

I'm trying to pipe the output of grep to sed so it will only edit specific files. I don't want sed to edit something without changing it. (Changing the modified date.)
I'm searching with grep and writing with sed. That's it
The thing I am trying to change is a dash, not the normal type, a special type. "-" is normal. "–" isn't normal
The code I currently have:
sed -i 's/– foobar/- foobar/g' * ; perl-rename 's/– foobar/- foobar/' *'– foobar'*
Sorry about the trouble, I'm inexperienced.
Are you sure about what you want to achieve? Let me explain you:
grep "string_in_file" <filelist> | sed <sed_script>
This is first showing the "string_in_file", preceeded by the filename.
If you launch a sed on this, then it will just show you the result of that sed-script on screen, but it will not change the files itself. In order to do this, you need the following:
grep -l "string_in_file" <filelist> | sed <sed_script_on_file>
The grep -l shows you some filenames, and the new sed_script_on_file needs to be a script, reading the file, and altering it.
Thank you all for helping, I'm sorry about not being fast in responding
After a bit of fiddling with the command, I got it:
grep -l 'old' * | xargs -d '\n' sed -i 's/old/new/'
This should only touch files that contain old and leave all other files.
This might be what you're trying to do if your file names don't contain newlines:
grep -l -- 'old' * | xargs sed -i 's/old/new/'

sed bash substitution only if variable has a value

I'm trying to find a way using variables and sed to do a specific text substitution using a changing input file, but only if there is a value given to replace the existing string with. No value= do nothing (rather than remove the existing string).
Example
Substitute.csv contains 5 lines=
this-has-text
this-has-text
this-has-text
this-has-text
and file.text has one sentence=
"When trying this I want to be sure that text-this-has is left alone."
If I run the following command in a shell script
Text='text-this-has'
Change=`sed -n '3p' substitute.csv`
grep -rl $Text /home/username/file.txt | xargs sed -i "s|$Text|$Change|"
I end up with
"When trying this I want to be sure that is left alone."
But I'd like it to remain as
"When trying this I want to be sure that text-this-has is left alone."
Any way to tell sed "If I give you nothing new, do nothing"?
I apologize for the overthinking, bad habit. Essentially what I'd like to accomplish is if line 3 of the csv file has a value - replace $Text with $Change inline. If the line is empty, leave $Text as $Text.
Text='text-this-has'
Change=$(sed -n '3p' substitute.csv)
if [[ -n $Change ]]; then
grep -rl $Text /home/username/file.txt | xargs sed -i "s|$Text|$Change|"
fi
Just keep it simple and use awk:
awk -v t="$Text" -v c="$Change" 'c!=""{sub(t,c)} {print}' file
If you need inplace editing just use GNU awk with -i inplace.
Given your clarified requirement, this is probably what you actually want:
awk -v t="$Text" 'NR==FNR{if (NR==3) c=$0; next} c!=""{sub(t,c)} {print}' Substitute.csv file.txt
Testing whether $Change has a value before launching into the grep and sed is undoubtedly the most efficient bash solution, although I'm a bit skeptical about the duplication of grep and sed; it saves a temporary file in the case of files which don't contain the target string, but at the cost of an extra scan up to the match in the case of files which do contain it.
If you're looking for typing efficiency, though, the following might be interesting:
find . -name '*.txt' -exec sed -i "s|$Text|${Change:-&}|" {} \;
Which will recursively find all files whose names end with the extension .txt and execute the sed command on each one. ${Change:-&} means "the value of $Change if it exists and is non-empty, and otherwise an &"; & in the replacement of a sed s command means "the matched text", so s|foo|&| replaces every occurrence of foo with itself. That's an expensive no-op but if your time matters more than your cpu time, it might have been worth it.

Is it more efficient to grep twice or use a regular expression once?

I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.
Say I want to fetch lines that contain "foo" and lines that also contain "bar".
I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.
Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?
grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.
You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:
sed '/foo/!d;/bar/!d' file.log
awk '/foo/ && /bar/' file.log
Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:
Both sed and awk perform their own file reading; no need for pipe overhead
The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines containing 'foo' to the /bar/ regex
However, I haven't tested them. YMMV :)
In theory, the fastest way should be:
grep -E '(foo.*bar|bar.*foo)' file.log
For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
These two operations are fundamentally different. This one:
cat file.log | grep foo | grep bar
looks for foo in file.log, then looks for bar in whatever the last grep output. Whereas cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log. The output should be very different. Use whatever behavior you need.
As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
If you're doing this:
cat file.log | grep foo | grep bar
You're only printing lines that contain both foo and bar in any order. If this is your intention:
grep -e "foo.*bar" -e "bar.*foo" file.log
Will be more efficient since I only have to parse the output once.
Notice I don't need the cat which is more efficient in itself. You rarely ever need cat unless you are concatinating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe, or if you have a command like tr that doesn't allow you to use a file, you can always redirect the input like this:
tr `a-z` `A-Z` < $fileName
But, enough about useless cats. I have two at home.
You can pass multiple regular expressions to a single grep which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:
fgrep "foo" file.log | fgrep "bar"
Unlike grep, fgrep doesn't parse regular expressions which means it can parse lines much, much faster. Try this:
time fgrep "foo" file.log | fgrep "bar"
and
time grep -e "foo.*bar" -e "bar.*foo" file.log
And see which is faster.

Get the newest file based on timestamp

I am new to shell scripting so i need some help need how to go about with this problem.
I have a directory which contains files in the following format. The files are in a diretory called /incoming/external/data
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see the filename of the file includes a timestamp. i.e. [RANGE]_[YYYYMMDD].dat
What i need to do is find out which of these files has the newest date using the timestamp on the filename not the system timestamp and store the filename in a variable and move it to another directory and move the rest to a different directory.
For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of ls command. Using ls, awk, sort, and tail we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the AWK command. it allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscord (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" ")
, followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to separate the file (which is just adding extra complexity, so don't do it sheepish), you can replace the space with an underscore again with sed:
sed 's/ /_/'
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.
This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[#]}" otherdir
It won't work if there are spaces in the filenames although you could modify the IFS variable to affect that.
Try:
$ ls -lr
Hope it helps.
Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)
ls -1 AA* |sort -r|tail -1
Due to the naming convention of the files, alphabetical order is the same as date order. I'm pretty sure that in bash '*' expands out alphabetically (but can not find any evidence in the manual page), ls certainly does, so the file with the newest date, would be the last one alphabetically.
Therefore, in bash
mv $(ls | tail -1) first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat
My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What is actually does is to rely on ls to sort the output, then uses tail to get the last listed file name.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.

bash grep newline

[Editorial insertion: Possible duplicate of the same poster's earlier question?]
Hi, I need to extract from the file:
first
second
third
using the grep command, the following line:
second
third
How should the grep command look like?
Instead of grep, you can use pcregrep which supports multiline patterns
pcregrep -M 'second\nthird' file
-M allows the pattern to match more than one line.
Your question abstract "bash grep newline", implies that you would want to match on the second\nthird sequence of characters - i.e. something containing newline within it.
Since the grep works on "lines" and these two are different lines, you would not be able to match it this way.
So, I'd split it into several tasks:
you match the line that contains "second" and output the line that has matched and the subsequent line:
grep -A 1 "second" testfile
you translate every other newline into the sequence that is guaranteed not to occur in the input. I think the simplest way to do that would be using perl:
perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;'
you do a grep on these lines, this time searching for string ##UnUsedSequence##third:
grep "##UnUsedSequence##third"
you unwrap the unused sequences back into the newlines, sed might be the simplest:
sed -e 's/##UnUsedSequence##/\n'
So the resulting pipe command to do what you want would look like:
grep -A 1 "second" testfile | perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;' | grep "##UnUsedSequence##third" | sed -e 's/##UnUsedSequence##/\n/'
Not the most elegant by far, but should work. I'm curious to know of better approaches, though - there should be some.
I don't think grep is the way to go on this.
If you just want to strip the first line from any file (to generalize your question), I would use sed instead.
sed '1d' INPUT_FILE_NAME
This will send the contents of the file to standard output with the first line deleted.
Then you can redirect the standard output to another file to capture the results.
sed '1d' INPUT_FILE_NAME > OUTPUT_FILE_NAME
That should do it.
If you have to use grep and just don't want to display the line with first on it, then try this:
grep -v first INPUT_FILE_NAME
By passing the -v switch, you are telling grep to show you everything but the expression that you are passing. In effect show me everything but the line(s) with first in them.
However, the downside is that a file with multiple first's in it will not show those other lines either and may not be the behavior that you are expecting.
To shunt the results into a new file, try this:
grep -v first INPUT_FILE_NAME > OUTPUT_FILE_NAME
Hope this helps.
I don't really understand what do you want to match. I would not use grep, but one of the following:
tail -2 file # to get last two lines
head -n +2 file # to get all but first line
sed -e '2,3p;d' file # to get lines from second to third
(not sure how standard it is, it works in GNU tools for sure)
So you just don't want the line containing "first"? -v inverts the grep results.
$ echo -e "first\nsecond\nthird\n" | grep -v first
second
third
Line? Or lines?
Try
grep -E -e '(second|third)' filename
Edit: grep is line oriented. you're going to have to use either Perl, sed or awk to perform the pattern match across lines.
BTW -E tell grep that the regexp is extended RE.
grep -A1 "second" | grep -B1 "third" works nicely, and if you have multiple matches it will even get rid of the original -- match delimiter
grep -E '(second|third)' /path/to/file
egrep -w 'second|third' /path/to/file
you could use
$ grep -1 third filename
this will print a string with match and one string before and after. Since "third" is in the last string you get last two strings.
I like notnoop's answer, but building on AndrewY's answer (which is better for those without pcregrep, but way too complicated), you can just do:
RESULT=`grep -A1 -s -m1 '^\s*second\s*$' file | grep -s -B1 -m1 '^\s*third\s*$'`
grep -v '^first' filename
Where the -v flag inverts the match.

Resources