How can I extract the 11th and 12th characters of a grep match in Bash? - bash

I have a text file called temp.txt that consists of 3 serial numbers.
AB400-251429-0014
AA200-251429-0028
AD200-251430-0046
The 11th and 12th characters in the serial number correspond to the week. I want to extract this number for each unit and do something with it (but for this example just echo it). I have the following code:
while read line; do
week=` grep A[ABD][42]00 $line | cut -c11-12 `
echo $week
done < temp.txt
It looks like it's not working because cut seems to expect a filename named after the serial number in each case. Is there an alternative way to do this?

The problem is not with cut but with grep, which expects a filename but is given the contents of the line instead.
You can process lines in bash without starting a subshell:
while read line ; do
echo 11th and 12th characters are: "${line:10:2}".
done < temp.txt
Your original approach is still possible:
week=$( echo "$line" | grep 'A[ABD][42]00' | cut -c11-12 )
Note that for non-matching lines, $week would be empty.
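If you also want a pattern check without calling grep at all, a small sketch along the same lines using bash's [[ =~ ]] operator:
while read -r line; do
    # only handle lines that look like the serial numbers above
    if [[ $line =~ ^A[ABD][42]00- ]]; then
        week=${line:10:2}
        echo "$week"
    fi
done < temp.txt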

You can also try:
grep -oP '.{10}\K..' filename
for your input prints
29
29
30
The \K means a variable-length look-behind. In other words, grep looks for the pattern before \K but does not include it in the result.
More precise selection of the lines:
grep -oP '[ABD][42]00-.{4}\K..' # or more precise
grep -oP '^\w[ABD][42]00-.{4}\K..' # or even more
grep -oP '^[A-Z][ABD][42]00-.{4}\K..' # or
grep -oP '^[A-Z][ABD][42]00-\d{4}\K..' # or
prints the same output as above, but only for the lines of interest. :)
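If you then want to process each extracted week in a loop, one sketch (assuming GNU grep, as above) is to feed the matches into a while read loop through process substitution:
while read -r week; do
    echo "week: $week"    # replace with whatever processing you need
done < <(grep -oP '^[A-Z][ABD][42]00-\d{4}\K..' temp.txt)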

I would use this simple awk
awk '{print substr($0,11,2)}' text.file
29
29
30
To get it into an array that you can use later:
results=($(awk '{print substr($0,11,2)}' text.file))
echo "${results[#]}"
29 29 30
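For example, you could then loop over the stored values later (just an illustrative sketch):
for week in "${results[@]}"; do
    echo "processing week $week"
done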

TL;DR
Looping with Bash is pretty inefficient, especially when reading a file one line at a time. You can get what you want faster and more effectively by using grep to select only the interesting lines and cut to slice out the characters, or by using awk to avoid calling cut in a separate pipelined process.
GNU Grep and Cut Solution
$ grep '[[:alpha:]][ABD][42]' temp.txt | cut -c11,12
29
29
30
Awk Solutions
# FPAT is a GNU awk feature, so this may not work in every awk.
# If you find an exception, please post a constructive comment!
$ awk -v NF=1 -v FPAT=. '/[[:alpha:]][ABD][42]00/ { print $11 $12 }' temp.txt
29
29
30
# A more elegant solution, as noted by @rici, that works with GNU awk,
# and possibly others.
$ gawk -v FS= '/[[:alpha:]][ABD][42]00/ { print $11 $12 }' temp.txt
29
29
30
Store the Results in a Bash Array
Either way, you can store the results of your match in a Bash array to use later. For example:
$ results=(`grep '[[:alpha:]][ABD][42]00' temp.txt | cut -c11,12`)
$ echo "${results[#]}"
29 29 30
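With bash 4 or newer, mapfile (also known as readarray) is arguably a cleaner way to fill the array, one element per output line; a sketch along the same lines:
mapfile -t results < <(grep '[[:alpha:]][ABD][42]00' temp.txt | cut -c11,12)
echo "${results[@]}"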

Related

Extract lines from text file, using starting line number and amount of lines to extract, in bash?

I have seen How can I extract a predetermined range of lines from a text file on Unix? but I have a slightly different use case: I want to specify a starting line number, and a count/amount/number of lines to extract, from a text file.
So, I tried to generate a text file, and then compose an awk command to extract a count of 10 lines starting from line number 100 - but it does not work:
$ seq 1 500 > test_file.txt
$ awk 'BEGIN{s=100;e=$s+10;} NR>=$s&&NR<=$e' test_file.txt
$
So, what would be an easy approach to extract lines from a text file using a starting line number, and count of lines, in bash? (I'm ok with awk, sed, or any such tool, for instance in coreutils)
This gives you text that is inclusive of both end points
(eleven output lines, here).
$ START=100
$
$ sed -n "${START},$((START + 10))p" < test_file.txt
The -n says "no print by default", and the p says "print this line" for lines within the example range of 100 to 110.
When you want to use awk, use something like
seq 1 500 | awk 'NR>=100 && NR<=110'
The advantage of awk is its flexibility when the requirements change.
When you want to use a variable start and exclude the endpoints, it becomes:
start=100
seq 1 500 | awk -v start="${start}" 'NR > start && NR < start + 10'
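If you want both a variable start and a variable count, with the starting line included and exactly count lines printed, a similar sketch is:
start=100
count=10
seq 1 500 | awk -v start="${start}" -v count="${count}" 'NR >= start && NR < start + count'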
Another alternative with tail and head:
tail -n +$START test_file.txt | head -n $NUMBER
If test_file.txt is very large and $START and $NUMBER are small, the following variant should be the fastest:
head -n $((START + NUMBER - 1)) test_file.txt | tail -n +$START
Anyway, I prefer the sed solution mentioned above for short input files:
sed -n "$START,$((START + NUMBER - 1))p" test_file.txt
sed -n "$Start,$End p" file
is likely a better way to get those lines.
$ seq 1 500 > test_file.txt
$ awk 'BEGIN{s=100;e=$s+10;} NR>=$s&&NR<=$e' test_file.txt
$
$s in GNU AWK means the value of the s-th field, and $e means the value of the e-th field. There are no fields yet in the BEGIN clause, so $s is unset for any s; because you use it in an arithmetic context it is taken as 0, and e is therefore set to 10. The output of seq is a single number per line, so there is no 10th field; GNU AWK treats the missing field as zero when asked to compare it with a number, and since NR is always strictly greater than 0, your condition never holds and the output is empty.
Use a range if you can write one condition that holds solely for the starting line and another that holds solely for the ending line; in this case
awk 'BEGIN{s=100}NR==s,NR==s+10' test_file.txt
gives output
100
101
102
103
104
105
106
107
108
109
110
Keep in mind that this will process the whole file. If you have a huge file and the area of interest is relatively near the beginning, you can reduce processing time by stopping at the end of the area of interest, as follows:
awk 'BEGIN{s=100}NR>=s{print}NR==s+10{exit}' test_file.txt
(tested in GNU Awk 5.0.1)
This command extracts 30 lines starting from line 100
sed -n '100,$p' test_file.txt | head -30
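For large files, sed can also be told to stop reading once the range has been printed; a sketch with GNU sed, reusing START and NUMBER from above:
START=100
NUMBER=10
END=$((START + NUMBER - 1))
sed -n "${START},${END}p;${END}q" test_file.txt    # quit right after the last wanted line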

Searching a file (grep/awk) for 2 carriage return/line-feed characters

I'm trying to write a script that'll simply count the occurrences of \r\n\r\n in a file. (Opening the sample file in vim binary mode shows me the ^M character in the proper places, and the newline is still read as a newline).
Anyway, I know there are tons of solutions, but they don't seem to get me what I want.
e.g. awk -e '/\r/,/\r/!d' or using $'\n' as part of the grep statement.
However, none of these seem to produce what I need. I can't find the \r\n\r\n pattern with grep's "trick", since that just expands one variable. The awk solution is greedy, and so gets me way more lines than I want/need.
Switching grep to binary/Perl/no-newline mode seems to be closer to what I want,
e.g. grep -UPzo '\x0D', but really what I want then is grep -UPzo '\x0D\x00\x0D\x00', which doesn't produce the output I want.
It seems like such a simple task.
By default, awk treats \n as the record separator. That makes it very hard to count \r\n\r\n. If we choose some other record separator, say a letter, then we can easily count the appearance of this combination. Thus:
awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a' file
Here, gsub returns the number of substitutions made. These are summed and, after the end of the file has been reached, we print the total number.
Example
Here, we use bash's $'...' construct to explicitly add carriage returns and newlines:
$ echo -n $'\r\n\r\n\r\n\r\na' | awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a'
2
Alternate solution (GNU awk)
We can tell it to treat \r\n\r\n as the record separator and then return the count (minus 1) of the number of records:
cat file <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
In awk, RS is the record separator and NR is the count of the number of records. Since we are using a multiple-character record separator, this requires GNU awk.
If the file ends with \r\n\r\n, the above would be off by one. To avoid that, the echo 1 statement is used to ensure that there is always at least one character after the last \r\n\r\n in the file.
Examples
Here, we use bash's $'...' construct to explicitly add carriage returns and newlines:
$ echo -n $'abc\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'abc\r\n\r\ndef' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'\r\n\r\n\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
$ echo -n $'1\r\n\r\n2\r\n\r\n3' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
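For comparison, a Perl one-liner that slurps the whole file and counts the non-overlapping occurrences directly (a sketch, assuming perl is available):
perl -0777 -ne 'my $count = () = /\r\n\r\n/g; print "$count\n"' file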

"grep"ing first 12 of last 24 character from a line

I am trying to extract "first 12 of last 24 character" from a line, i.e.,
for a line:
species,subl,cmp= 1 4 1 s1,torque= 0.41207E-09-0.45586E-13
I need to extract "0.41207E-0".
(I have not written the code, so don't curse me for its formatting. )
I have managed to do this via:
var_s=`grep "species,subl,cmp= $3 $4 $5" $tfile |sed -n '$s/.*\(........................\)$/\1/p'|sed -n '$s/\(............\).*$/\1/p'`
but, is there any more readable way of doing this, rather than counting dots?
EDIT
Thanks to both of you;
so, I have sed, awk, grep and bash.
I will run that in a loop, for hundreds of files,
so, can you also suggest which one is the most efficient, with respect to time?
One way with GNU sed (without counting dots):
$ sed -r 's/.*(.{11}).{12}/\1/' file
0.41207E-09
Similarly with GNU grep:
$ grep -Po '.{11}(?=.{12}$)' file
0.41207E-09
Perhaps a python solution may also be helpful:
python -c 'import sys; print("\n".join(a[-24:-13] for a in sys.stdin))' < file
0.41207E-09
I'm not sure your example data and question match up so just change the values in the {n} quantifier accordingly.
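Regarding the EDIT about running this over hundreds of files: a single grep pipeline over all the files at once should beat a per-file shell loop, because it avoids starting new processes for every file. A sketch (the run_*.dat glob and the line filter are placeholders for your real names):
grep -h "species,subl,cmp=" run_*.dat | grep -Po '.{11}(?=.{12}$)'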
Simplest is using pure bash:
echo "${str:(-24):12}"
OR awk can also do that:
awk '{print substr($0, length($0)-23, 12)}' <<< "$str"
OUTPUT:
0.41207E-09
EDIT: To use the bash solution on a file:
while read -r l; do echo "${l:(-24):12}"; done < file
Another one, less efficient, but it has the advantage of making you discover new tools:
echo "$str" | rev | cut -b 1-24 | rev | cut -b 1-12
You can use awk to get the first 12 characters of the last 24 characters of a line:
awk '{last24 = substr($0, length($0)-23); print substr(last24, 1, 12)}' myfile.txt

Can I grep only the first n lines of a file?

I have very long log files, is it possible to ask grep to only search the first 10 lines?
The magic of pipes;
head -10 log.txt | grep <whatever>
For folks who find this on Google, I needed to search the first n lines of multiple files, but to only print the matching filenames. I used
gawk 'FNR>10 {nextfile} /pattern/ { print FILENAME ; nextfile }' filenames
The FNR..nextfile stops processing a file once 10 lines have been seen. The //..{} prints the filename and moves on whenever the first match in a given file shows up. To quote the filenames for the benefit of other programs, use
gawk 'FNR>10 {nextfile} /pattern/ { print "\"" FILENAME "\"" ; nextfile }' filenames
Or use awk for a single process without |:
awk '/your_regexp/ && NR < 11' INPUTFILE
On each line, if your_regexp matches, and the number of records (lines) is less than 11, it executes the default action (which is printing the input line).
Or use sed:
sed -n '/your_regexp/p;10q' INPUTFILE
Checks your regexp and prints the line (-n means don't print the input, which is otherwise the default), and quits right after the 10th line.
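Note that the awk version above still reads the whole file even though it can no longer print anything after line 10; if the log files are really long, a variant that exits early (a small sketch) is:
awk 'NR > 10 { exit } /your_regexp/' INPUTFILE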
You have a few options using programs along with grep. The simplest in my opinion is to use head:
head -n10 filename | grep ...
head will output the first 10 lines (using the -n option), and then you can pipe that output to grep.
grep "pattern" <(head -n 10 filename)
head -10 log.txt | grep -A 2 -B 2 pattern_to_search
-A 2: print two lines of trailing context after each match.
-B 2: print two lines of leading context before each match.
head -10 log.txt # read the first 10 lines of the file.
You can use the following line:
head -n 10 /path/to/file | grep [...]
The output of head -10 file can be piped to grep in order to accomplish this:
head -10 file | grep …
Using Perl:
perl -ne 'last if $. > 10; print if /pattern/' file
An extension to Joachim Isaksson's answer: Quite often I need something from the middle of a long file, e.g. lines 5001 to 5020, in which case you can combine head with tail:
head -5020 file.txt | tail -20 | grep x
This gets the first 5020 lines, then shows only the last 20 of those, then pipes everything to grep.
(Edited: fencepost error in my example numbers, added pipe to grep)
grep -A 10 <Pattern>
This is to grab the pattern and the next 10 lines after the pattern. This would work well only for a known pattern, if you don't have a known pattern use the "head" suggestions.
grep -m6 "string" cov.txt
This stops after the first 6 lines that match string. Note that -m limits the number of matches, not the number of input lines searched, so it is not quite the same as searching only the first 6 lines.

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in each file and write to an output file with part of the corresponding input file's name as part of the output line? So, for example, if we were looking at the file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab-separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids accidentally filtering out files whose names happen to end in " total".
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l * | grep -v " total$"
sends
28 LOG_Yellow
You can reverse it if you want (with awk, provided you don't have spaces in the file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files,
'echo -n' for printing the current file name,
'wc -l' for finding out the line count.
And don't forget to redirect ('>' or '>>') your results to your output file.
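Putting those pieces together, a sketch of such a loop (LOG_* and output.txt are placeholder names; some wc implementations pad the count with leading spaces, which you may want to trim):
for f in LOG_*; do
    # strip the LOG_ prefix, then print name and line count, tab-separated
    printf '%s\t%s\n' "${f#LOG_}" "$(wc -l < "$f")"
done > output.txt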
