unix command to get lines from in between first and last occurrence of a word and write to a file - bash

I want a unix command to find the lines between the first & last occurrence of a word.
For example:
let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains the word "stackoverflow".
I want to print lines 10 through 35 and write them to a new file.

You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match.
2) print the range of lines between them.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and shows the output like this: number_of_line:output
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last match of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from line $first through line $last.
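If you prefer a single command, here is a sketch that reads the file twice in one awk invocation (this works on a regular file, not on a pipe):
$ awk 'NR==FNR {if (/stackoverflow/) {if (!f) f=FNR; l=FNR}; next} FNR>=f && FNR<=l' your_file your_file
The first pass records the first and last matching line numbers in f and l; the second pass prints the lines in that range.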
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
By steps:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
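To write the result to a new file, as the question asks, simply redirect the output of the second step:
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a > new_file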

If you know an upper bound on how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow \
  | tail -n +2 | head -n -1) < file

I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
    while read -r ln; do
        [ "$saveline" ] && linea[$((i++))]=$ln
        if [[ $ln =~ $1 ]]; then
            if [ "$saveline" ]; then
                # second or later match: flush the buffer, including this line
                for ((j=0; j<i; j++)); do echo "${linea[$j]}"; done
                i=0
            else
                saveline=1
                linea[$((i++))]=$ln
            fi
        fi
    done
}
midgrep "$1"
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer it in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to output
continue reading the file to the end. If there are no more matches, then the last buffer is simply discarded.
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between each match - if there are many lines between matches, these are all held in memory until we hit the next match.
Also this uses the bash =~ regular expression operator to keep this pure bash. But you could replace this with a grep instead, if you are more comfortable with that.
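For example, if you would rather avoid =~, the condition [[ $ln =~ $1 ]] could be replaced with a grep call (a sketch only; note that this spawns one grep process per input line, which is much slower):
printf '%s\n' "$ln" | grep -q -- "$1"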

Using perl:
perl -00 -lne '
chomp(my @arr = split /stackoverflow/);
print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea behind this is to slurp the whole input file and split it into an array of chunks, using the string "stackoverflow" as the separator. Then we print the chunks from the 2nd up to the next-to-last, joined back together with "stackoverflow".

Related

print lines whose first column is not in the list

I have a list of numbers in a file
cat to_delete.txt
2
3
6
9
11
and many txt files in one folder. Each file has tab-delimited lines (there can be more lines than this).
3 0.55667 0.66778 0.54321 0.12345
6 0.99999 0.44444 0.55555 0.66666
7 0.33333 0.34567 0.56789 0.34543
I want to remove the lines whose first number ($1 for awk) is in to_delete.txt and print only the lines whose first number is not in to_delete.txt. The change should replace the old file.
Expected output
7 0.33333 0.34567 0.56789 0.34543
This is what I got so far, which doesn't remove anything:
for file in *.txt; do awk '$1 != /2|3|6|9|11/' "$file" > "$tmp" && mv "$tmp" "$file"; done
I've looked through so many similar questions here but still cannot make it work. I also tried grep -v -f to_delete.txt and sed -n -i '/$to_delete/!p'
Any help is appreciated. Thanks!
In awk:
$ awk 'NR==FNR{a[$1];next}!($1 in a)' delete file
Output:
7 0.33333 0.34567 0.56789 0.34543
Explained:
$ awk '
NR==FNR { # hash records in delete file to a hash
a[$1]
next
}
!($1 in a) # if $1 not found in record in files after the first, output
' delete files* # mind the file order
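To replace each file in place, as the question asks, you can wrap it in a loop - a sketch using mktemp for the temporary file:
$ for f in *.txt; do
      tmp=$(mktemp)
      awk 'NR==FNR{a[$1];next}!($1 in a)' to_delete.txt "$f" > "$tmp" && mv "$tmp" "$f"
  done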
My first idea was this:
printf "%s\n" *.txt | xargs -n1 sed -i "$(sed 's!.*!/& /d!' to_delete.txt)"
printf "%s\n" *.txt - outputs the *.txt files each on separate lines
| xargs -n1 - executes the following command once for each line, passing the line content as an argument
sed -i - edit file in place
$( ... ) - command substitution
sed 's!.*!/^& /d!' to_delete.txt - for each line in to_delete.txt, prefix the line with /^ and suffix it with /d. That way, from the list of numbers I get a list of regexes to delete, like:
/^2 /d
/^3 /d
/^6 /d
and so on. Which tells sed to delete lines matching the regex - line starting with the number followed by a space.
But I think awk would be simpler. You could do:
awk '$1 != 2 && $1 != 3 && $1 != 6 ... and so on ...'
but that would be longish and unreadable. It's easier to read the numbers from the file into a map and then check if the number is in the array:
awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file"
The FNR==NR is true only for the first file. So when we read it, we set the map[$1] (we "set" it, just so such element exists). Then FNR!=NR is true for the second file, for which we check if the first element is the key in the map. If it is not, the expression is true and the line gets printed out.
all together:
for file in *.txt; do tmp=$(mktemp); awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file" > "$tmp" && mv "$tmp" "$file"; done

For loop in an awk command

I have a file which has rows; now I want to read its values with an awk command in Unix. I am able to read that file, but I have added a for loop to traverse all the data in the file, and my for loop is not ending - it goes into an infinite loop.
Below is the code I am using to read the file and get the data at positions $1, $2 and $3:
file=$1;
nbrClients=`wc -l $file | cut -d' ' -f1`;
echo $nbrClients;
awk '{
for(i=0; i<=$nbrClients; ++i)
{print $1 $2 $3}
}' $file
File which i am reading has below format :
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
So for this, the nbrClients value will be 6 and it should loop 6 times, but it is not doing so. Please suggest what I am doing wrong in this.
Here is the full code which I am trying:
file=$1;
nbrClients=`wc -l $file | cut -d' ' -f1`;
echo $nbrClients;
file=$1;
cat | awk '{
fileName=$1
tnxCount=$2
for i in `seq 1 $tnxCount`
do
echo "Starting thread number $i"
nohup perl /home/user/abc.pl -i $fileName >>/home/user/test_load_${today}.out 2>&1 &
done
}' $file;
I think the problem here is that you're under the impression that the for loop is what will cause awk to step through your input file, whereas it's awk's nature to do that already.
Awk works by taking a set of condition { statement } pairs, and then FOR EACH LINE OF INPUT, evaluating the condition, and if it rings true, executing the statement. Note that conditions can be statements (since functions and other commands have a return value) and statements can include if constructs, so there's a lot of flexibility here.
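In other words, the first snippet in the question needs no loop at all; a minimal sketch:
$ awk '{ print $1, $2, $3 }' "$file"
awk itself steps through the input, so the first three fields of each line are printed exactly once.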
Note that awk can also reduce or simplify stuff you'd do in a shell script. Consider the following:
#!/bin/sh
file="$1"
awk '
NR==FNR {
ClientCount++
next
}
FNR==1 {
printf "%s: %d\n", FILENAME, ClientCount
}
{
print $1, $2, $3
}
' "$file" "$file"
This script reads your input file twice -- once to count the lines (so that the line count can be placed at the top of the output), and once to process the lines, printing the first three fields. The script is composed of three condition { statement } groupings:
The first one is the counter. It only operates on the first instance of the file, and the next command ensures that no other commands will be run on that file.
The second one operates on the first line of the file. But since the first condition captured all of the first file, this statement will only be executed once, when the first line of the second file is in play.
The third one is what prints the bulk of your output. With awk, when no condition is included, the condition is assumed to be "true", so this statement runs for each line of the second file.
The awk script could of course be compressed onto a single line, I've spaced it out for easier reading.
Note also that this method of keeping or showing a line count might be a little heavy handed. If you know that you're just showing a line count, you can use the internal awk variable NR. At the point in your script where the second condition is evaluated, NR-1 is the line count of the previous file, so you could use:
#!/bin/sh
file="$1"
awk '
NR==FNR {
next
}
FNR==1 {
printf "%s: %d\n", FILENAME, NR-1
}
{
print $1, $2, $3
}
' "$file" "$file"
Updating the answer based on the comments and the latest version of the question:
file=$1;
nbrClients=`wc -l $file | cut -d' ' -f1`;
echo $nbrClients;
# today is assumed to be set in the shell, as in the question's snippet
awk -v today="$today" '{
    fileName = $1
    tnxCount = $2
    for (i = 1; i <= tnxCount; i++) {
        print "Starting thread number " i
        system("nohup perl /home/user/abc.pl -i " fileName " >>/home/user/test_load_" today ".out 2>&1 &")
    }
}' "$file"

remove duplicate lines with similar prefix

I need to remove similar lines in a file which has duplicate prefix and keep the unique ones.
From this,
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/
to this
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
Appreciate any suggestions,
Answer in case reordering the output is allowed.
sort -r file | awk 'a!~"^"$0{a=$0;print}'
sort -r file : sorts the lines in reverse; this way longer lines with the same prefix are placed before shorter lines with that prefix
awk 'a!~"^"$0{a=$0;print}' : parses the sorted output, where a holds the previous line and $0 holds the current line
a!~"^"$0 checks, for each line, that the current line is not a prefix at the beginning of the previous line.
if $0 is not such a prefix (ie. not a similar prefix), we print it and save the new string in a (to be compared with the next line)
The first line is always printed, because no value has been assigned to a yet and the empty a cannot start with the non-empty $0
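With the sample input from the question this gives:
$ sort -r file | awk 'a!~"^"$0{a=$0;print}'
xyz/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/
123/456/789/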
A quick and dirty way of doing it is the following:
$ while read elem; do echo -n "$elem "; grep "$elem" file | wc -l; done <file | awk '$2==1{print $1}'
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
where you read the input file, print each element together with the number of times it appears in the file, and then with awk print only the lines where it appears exactly once.
Step 1: This solution is based on the assumption that reordering the output is allowed. If so, it should be faster to reverse-sort the input file before processing: by reverse sorting, we only need to compare 2 consecutive lines in each loop iteration, with no need to search the whole file or all the "known prefixes". I understand that a line should be removed if it is a prefix of any other line. Here is an example of removing prefixes from a file when reordering is allowed:
#!/bin/bash
f=sample.txt # sample data
p='' # previous line = empty
sort -r "$f" | \
while IFS= read -r s || [[ -n "$s" ]]; do # reverse sort, then read string (line)
[[ "$s" = "${p:0:${#s}}" ]] || \
printf "%s\n" "$s" # if s is not prefix of p, then print it
p="$s"
done
Explanation: ${p:0:${#s}} takes the first ${#s} (length of s) characters of string p.
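For example:
$ s='abc/'; p='abc/def/'; echo "${p:0:${#s}}"
abc/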
Test:
$ cat sample.txt
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/
123/456/789/
xyz/
$ ./remove-prefix.sh
xyz/
abc/def/ghi/jkl/two/two
abc/def/ghi/jkl/one/one
123/456/789/
Step 2: If you really need to keep the order, then this script is an example of removing all prefixes when reordering is not allowed:
#!/bin/bash
f=sample.txt
p=''
cat -n "$f" | \
sed 's:\t:|:' | \
sort -r -t'|' -k2 | \
while IFS='|' read -r i s || [[ -n "$s" ]]; do
    [[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
    p="$s"
done | \
sort -n -t'|' -k1 | \
sed 's:^.*|::'
Explanations:
cat -n: numbering all lines
sed 's:\t:|:': use '|' as the delimiter -- you need to change it to another one if needed
sort -r -t'|' -k2: reverse sort with delimiter='|' and use the key 2
while ... done: similar to solution of step 1
sort -n -t'|' -k1: sort back to original order (numbering sort)
sed 's:^.*|::': remove the numbering
Test:
$ ./remove-prefix.sh
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/789/
xyz/
Notes: In both solutions, the most costly operations are the calls to sort. The solution in step 1 calls sort once, and the solution in step 2 calls sort twice. All other operations (cat, sed, while, string comparison, ...) are not at the same level of cost.
In the solution of step 2, cat + sed + while + sed is "equivalent" to scanning the file 4 times (though the stages can theoretically execute in parallel because of the pipes).
The following awk does what is requested; it reads the file twice.
In the first pass it builds up all possible prefixes per line.
In the second pass, it checks if the line is a possible prefix of another line; if not, it prints it.
The code is:
awk -F'/' '(NR==FNR){s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]};next}
{if (! ($0 in a) ) {print $0}}' <file> <file>
You can also do it by reading the file a single time, but then you store it in memory:
awk -F'/' '{s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]}; b[NR]=$0; next}
END {for(i=1;i<=NR;i++){if (! (b[i] in a) ) {print b[i]}}}' <file>
Similar to the solution of Allan, but using grep -c :
while read line; do (( $(grep -c "$line" <file>) == 1 )) && echo "$line"; done < <file>
Take into account that this construct reads the file (N+1) times, where N is the number of lines.

Read each line of a column of a file and execute grep

I have file.txt, shown here as an example:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search for it in a third file, file.txt
if the first command is positive, return the string from column 2 of list.txt
So my output.txt is:
ABC
DEF
This is my grep/echo code with the query/return strings put in manually:
if grep -i -q 'contains abc' file.txt
then
echo ABC >output.txt
else
echo -n
fi
if grep -i -q 'contains def' file.txt
then
echo DEF >>output.txt
else
echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I combine while read line; do [commands]; done <list.txt with the commands for column 1 and column 2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
ABC
DEF
For the lines of the first file (FNR==NR), read the key-value pairs into array a. Then for the lines of the second file, loop through the array, check if the key is found on the line, and if so, print the stored value. index($0, x) tries to find the contents of x in (the current line) $0; $0 ~ x would instead take x as a regex to match against.
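If regex matching is what you want, the same command looks like this (a sketch; keep in mind the keys from list.txt are then interpreted as regexes, so characters like . become special):
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if ($0 ~ x) {print a[x]}} ' list.txt file.txt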
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
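With the sample file.txt and list.txt from the question, this loop prints:
ABC
DEF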
Using awk and grep:
for text in `awk '{print $4}' file.txt `
do
grep "contains $text" list.txt |awk -F $'\t' '{print $2}'
done

Bash Shell: Infinite Loop

The problem is the following: I have a file where each line has this form:
id|lastName|firstName|gender|birthday|joinDate|IP|browser
I want to sort alphabetically all the first names in that file and print them one per line, but each name only once.
I have created the following program, but for some reason it creates an infinite loop:
array1=()
while read LINE
do
if [ ${LINE:0:1} != '#' ]
then
IFS="|"
array=($LINE)
if [[ "${array1[#]}" != "${array[2]}" ]]
then
array1+=("${array[2]}")
fi
fi
done < $3
echo ${array1[@]} | awk 'BEGIN{RS=" ";} {print $1}' | sort
NOTES
if [ ${LINE:0:1} != '#' ] : this check is used because there are comments in the file that I don't want to print
$3 : filename
array1 : is used for all the separate names
Wow, there's a MUCH simpler and cleaner way to achieve this, without having to mess with the IFS variable or using arrays. You can use "for" to do this:
First I created a file with the same structure as yours:
$ cat file
id|lastName|Douglas|gender|birthday|joinDate|IP|browser
id|lastName|Tim|gender|birthday|joinDate|IP|browser
id|lastName|Andrew|gender|birthday|joinDate|IP|browser
id|lastName|Sasha|gender|birthday|joinDate|IP|browser
#id|lastName|Carly|gender|birthday|joinDate|IP|browser
id|lastName|Madson|gender|birthday|joinDate|IP|browser
Here's the script I wrote using "for":
#!/bin/bash
for LINE in `cat file | grep -v "^#" | awk -F'|' '{print$3}' | sort -u`
do
echo $LINE
done
And here's the output of this script:
$ ./script.sh
Andrew
Douglas
Madson
Sasha
Tim
Explanation:
for LINE in `cat file`
Creates a loop that reads each line of "file". The commands between backticks are executed by the shell; for example, if you wanted to store the date in a variable you could use VARDATE=`date`.
grep -v "^#"
The option -v is used to exclude results matching the pattern, in this case the pattern is "^#". The "^" character means "line begins with". So grep -v "^#" means "exclude lines beginning with #".
awk -F'|' '{print$3}'
The -F option switches the column delimiter from the default (whitespace) to whatever you put between the quotes after it, in this case the "|" character.
The '{print$3}' prints the 3rd column.
sort -u
And the "sort -u" command to sort the names alphabetically.
