Get the context around a line number in Bash from compressed file - macos

I know that it's possible to search a line number (say, line # 139656504) and return the context around it:
grep -n 139656504 -B 5 -A 5 file.txt
But when I used that on a compressed file it returned nothing:
zgrep -n 139656504 -B 5 -A 5 file.txt.gz
I'm on macOS Mojave.
Is there a way to get the context around that line with compressed files?
I made a test file (test.txt) that contains:
test1
test2
test3
test4
test5
test6
test7
test8
test9
testBB
test11
test12
test13
test14
test15
test16
test17
test18
test19
test20
Compress the file:
pigz test.txt
Now if I run:
zgrep -n 10 -C 5 test.txt.gz
It gives me nothing. (I made sure that line 10 would not contain the number 10; otherwise zgrep "searches" for a 10 rather than going to line # 10.)
If line 10 had been test10 instead of testBB, it would have worked, but that is not what I want.

If you want to print the content of a specific line number, you can use awk:
awk -v nr=10 'FNR==nr' file
or with line number prefix
awk -v nr=10 'FNR==nr{ print FNR":"$0 }' file
or with 5 context lines
awk -v nr=10 'FNR>=nr-5 && FNR<=nr+5{ print FNR":"$0 }' file
or
zcat file.gz | awk ...
for gzipped files.
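Putting these together for the original question, a complete pipeline might look like this (just a sketch, reusing the line number from the question; on macOS, gunzip -c or gzcat may be safer than zcat, which can expect .Z files):
gunzip -c file.txt.gz |
  awk -v nr=139656504 'FNR>=nr-5 && FNR<=nr+5{ print FNR":"$0 } FNR>nr+5{ exit }'
The FNR>nr+5{ exit } part stops reading once the context window has been printed, which saves time on a file that large.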

try this:
zgrep -n 139656504 -C 5 file.txt.gz
Notes:
-n means show line numbers in the output
-C 5 means show 5 lines before and 5 lines after each line that satisfies the expression.
139656504 is the pattern you are searching for.
Do man grep or man zgrep -- looks like all the command line switches work for either.
If all you need is to see surrounding lines of a specific line, you can do something like this:
gunzip -c file.txt.gz | sed -n "139656499,139656509p"
This means: gunzip the file to stdout so it can be piped, pipe it into sed (an amazing utility), and tell sed to print lines 139656499 through 139656509, i.e. the 5 lines before and the 5 lines after line 139656504.
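If the file is huge, you can also tell sed to quit right after the last wanted line so the rest of the file is never decompressed; a small sketch, with nr and ctx as illustrative variable names:
nr=139656504; ctx=5
gunzip -c file.txt.gz | sed -n "$((nr-ctx)),$((nr+ctx))p;$((nr+ctx))q"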

How to search for a matching string in a file bottom-to-top without using tac? [closed]

I need to grep through a file, starting at the bottom of the file, until I get to the first date that appears, "2021-04-04", and then return that date. I don't want to start from the top and work my way down to that line, as there are thousands of lines in each file.
Example file contents:
random text on first line
random text on second line
2021-01-01
random text on fourth line
2021-02-03
random text on sixth line
2021-03-03
2021-04-04
Random text on ninth line
tac isn't available on MacOS so I can't use it.
"thousands of lines" are nothing, they'll be processed in the blink of an eye. Once you get into 10s of millions of lines THEN you could start thinking about a performance improvement if it became necessary.
All you need is:
awk '/[0-9]{4}(-[0-9]{2}){2}/{line=$0} END{if (line!="") print line}' file
Here's the 3rd-run timing comparison for finding the last line containing 2 or more consecutive 5s in a 100000 line file generated by seq 100000 > file100k, i.e. where the target string is just 45 lines from the end of the input file, with and without tac:
$ time awk '/5{2}/{line=$0} END{if (line!="") print line}' file100k
99955
real 0m0.056s
user 0m0.031s
sys 0m0.000s
$ time tac file100k | awk '/5{2}/{print; exit}'
99955
real 0m0.056s
user 0m0.015s
sys 0m0.030s
As you can see, both ran in a fraction of a second and using tac did nothing to improve the speed of execution. Switching to tac+grep doesn't make it any faster either, it still just takes 1/20th of a second:
$ time tac file100k | grep -m1 '5\{2\}'
99955
real 0m0.057s
user 0m0.015s
sys 0m0.015s
In case you ever do need it in future, though, here's how to implement an efficient tac if you don't have it:
$ mytac() { cat -n "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
$ seq 5 | mytac
5
4
3
2
1
The above mytac() function just adds line numbers to the input, sorts those in reverse and then removes them again. If your cat doesn't have -n to add line numbers then you can use nl if you have it or awk -v OFS='\t' '{print NR, $0}' will always work.
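If your cat lacks -n, the awk variant just mentioned slots straight into the same pipeline; an equivalent sketch:
mytac() { awk -v OFS='\t' '{print NR, $0}' "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }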
Use tac:
#!/bin/bash
function process_file_backwords(){
    tac "$1" | while IFS= read -r line; do
        # Grep for xxxx-xx-xx number matching
        first_date=$(echo "$line" | grep '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | awk -F '"' '{ print $2 }')
        # Check if the variable is not empty; if so, print it and break the loop
        [ -n "$first_date" ] && echo "$first_date" && break
    done
}
echo $(process_file_backwords "$1")
Note: Make sure you add an empty line at the end of the file so tac will not concatenate the last two lines.
Note: Remove the awk part if the file contains strings without ".
On MacOS
You can use tail -r which will do the same thing as tac but you may have to supply the number of lines you want tail to output from your file. Something like this should work:
tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
-r tells tail to output its last line first
-n takes a numeric argument telling how many lines tail should output
wc -l outputs the line count and filename of a given file
cut -d ' ' splits the above on the space character and -f 1 takes the first "field" which will be our line count
$ cat myfile.txt
foo
this is a date 2021-04-03
bar
this is another date 2021-04-04 for example
$ tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
grep options:
The -m 1 option will quit after the first result.
The -o option will return only the string matching the pattern (i.e. your date).
The -P option uses the Perl regex engine, which is really down to preference, but I personally prefer its regex syntax (it seems to use fewer backslashes).
On Linux
You can use tac (cat in reverse) and pipe that into your grep. e.g.:
$ tac myfile.txt
this is another date 2021-04-04 for example
bar
this is a date 2021-04-03
foo
$ tac myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
You can use perl to reverse the lines and grep for the 1st match too.
perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'
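Assuming inputFile holds the example contents from the question, that prints the last date line:
$ perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'
2021-04-04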

Delete values in line based on column index using shell script

I want to be able to delete the characters to the RIGHT of a given column index in test.txt, starting at that column index, for a given length, N.
Column index refers to the position when you open the file in the VIM editor in LINUX.
If my test.txt contains 1234 5678 and I call my delete_var function with column number 2 to start deleting from and length N = 2, then test.txt should become 14 5678, since the characters from column 2 through column 3 were deleted (the length to delete was 2).
I have the following code as of now but I am unable to understand what I would put in the sed command.
delete_var() {
    sed -i -r 's/not sure what goes here' test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N" # call the method with the column index and length to delete
#sample test.txt (before call to fn)
1234 5678
#sample test.txt (after call to fn)
14 5678
Can someone guide me?
You should avoid using regex for this task. It is easier to get this done in awk with simple substr function calls:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' file
14 5678
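If the result has to land back in test.txt the way the question's delete_var does, one way is to write to a temporary file and move it over the original. A sketch (the temporary file name is only illustrative):
delete_var() {
    # $1 = column index to start deleting at, $2 = number of characters to delete
    awk -v i="$1" -v n="$2" 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' test.txt > test.txt.tmp &&
    mv test.txt.tmp test.txt
}
delete_var 2 2    # 1234 5678 -> 14 5678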
Assuming OP must use sed (otherwise other options could include cut and awk, but those would require some extra file I/O to replace the original file with the modified results) ...
Starting with the sed command to remove the 2 characters starting in column 2:
$ echo '1234 5678' > test.txt
$ sed -i -r "s/(.{1}).{2}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
Where:
(.{1}) - match first character in line and store in buffer #1
.{2} - match next 2 characters but don't store in buffer
(.*$) - match rest of line and store in buffer #2
\1\2 - output contents of buffers #1 and #2
Now, how to get variables for start and length into the sed command?
Assume we have the following variables:
$ s=2 # start
$ n=2 # length
To map these variables into our sed command we can break the sed search-replace pattern into parts, replacing the first 1 and 2 with our variables like so:
replace {1} with {$((s-1))}
replace {2} with {${n}}
Bringing this all together gives us:
$ s=2
$ n=2
$ echo '1234 5678' > test.txt
$ set -x # echo what sed sees to verify the correct mappings:
$ sed -i -r "s/(.{"$((s-1))"}).{${n}}(.*$)/\1\2/g" test.txt
+ sed -i -r 's/(.{1}).{2}(.*$)/\1\2/g' test.txt
$ set +x
$ cat test.txt
14 5678
Alternatively, do the subtraction (s-1) before the sed call and just pass in the new variable, eg:
$ x=$((s-1))
$ sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
One idea using cut, keeping in mind that storing the results back into the original file will require an intermediate file (eg, tmp.txt) ...
Assume our variables:
$ s=2 # start position
$ n=2 # length of string to remove
$ x=$((s-1)) # last column to keep before the deleted characters (1 in this case)
$ y=$((s+n)) # start of first column to keep after the deleted characters (4 in this case)
At this point we can use cut -c to designate the columns to keep:
$ echo '1234 5678' > test.txt
$ set -x # display the cut command with variables expanded
$ cut -c1-${x},${y}- test.txt
+ cut -c1-1,4- test.txt
14 5678
Where:
1-${x} - keep range of characters from position 1 to position ${x} (1-1 in this case)
${y}- - keep range of characters from position ${y} to end of line (4-EOL in this case)
NOTE: You could also use cut's ability to work with the complement (ie, explicitly tell what characters to remove ... as opposed to above which says what characters to keep). See KamilCuk's answer for an example.
Obviously (?) the above does not overwrite test.txt so you'd need an extra step, eg:
$ echo '1234 5678' > test.txt
$ cut -c1-${x},${y}- test.txt > tmp.txt # store result in intermediate file
$ cat tmp.txt > test.txt # copy intermediate file over original file
$ cat test.txt
14 5678
Looks like:
cut --complement -c $1-$(($1 + $2 - 1))
should just work: it deletes $2 columns starting at column $1 (i.e. columns $1 through $1 + $2 - 1).
Please provide code showing how to change test.txt.
cut can't modify in place. So either pipe to a temporary file or use sponge.
tmp=$(mktemp)
cut --complement -c $1-$(($1 + $2 - 1)) test.txt > "$tmp"
mv "$tmp" test.txt
The command below eliminates the 2nd character of each line. Try using it in a loop, as sketched after the command:
sed s/.//2 test.txt
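A loop along those lines could look like this (only a sketch; it relies on the fact that after each deletion the following character shifts into the same column, and it uses GNU sed's -i for in-place editing):
s=2   # column to start deleting at
n=2   # number of characters to delete
for ((i=0; i<n; i++)); do
    sed -i "s/.//$s" test.txt
done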

awk on multiple files and piping output of each of it's run to the wc command separately

I have a bunch of record-wise formatted (.csv) files. The first field is an integer or may be empty; this is true for all the files. I want to count the number of records whose first field is empty in each file, and then plot a graph of those counts over all the files.
File format of filename.csv:
123456,few,other,fields
,few,other,fields
234567,few,other,fields
I want something like
awk -F, '$1==""' `ls` | (for each file separately wc -l) | gnugraph ( y axis as output of wc -l command and x axis as simply 1 to n where n is number of csv files)
The problem I am facing is that wc -l gets executed only once, for all the files together. I want to run wc -l for each file separately, count the number of records having an empty first field, and provide this sequence of counts to the gnugraph command.
Once I get the required count for each file I am almost done, since
seq 10 | gnuplot -p -e "plot '<cat'"
works fine
You could use awk to keep track of the count for each file in an array. Then at the end print the contents of the array:
awk '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
This way you don't have to tangle with wc and just shoot the contents right over to gnuplot
Example in use:
$> cat file1
,test
2,test
3,
$> cat file2
,test
2,test
3,
,test
$> awk -F"," '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
file1 1
file2 2
With gawk you can use BEGINFILE and ENDFILE:
$ awk -F, '$1==""{++i} BEGINFILE{i=0} ENDFILE{print FILENAME, i}' file1 file2
file1 3
file2 1
If you want to run wc -l separately for each file, you'll have to set up a loop.
Something along the lines of-
for i in `ls`
do
awk -F, '$1==""' "$i" | wc -l
done | gnugraph
For the first field, there is an easier way with grep
$ grep -c '^,' file{1..3}
file1:1
file2:2
file3:4
I copied your example into file1 and doubled it in file2 and file3 respectively.
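To feed those counts straight into gnuplot as in the question, strip the filename prefixes first; a sketch built on the plot-from-stdin trick the question already uses:
$ grep -c '^,' file{1..3} | cut -d: -f2 | gnuplot -p -e "plot '<cat'"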

Extract lines from a file in bash

I have a file like this
I would like to extract the lines with the 0s and 1s (all such lines in the file) into a separate file. The sequence does not have to start with a 0; it could also start with a 1. However, that line always comes directly after the line starting with SITE:. Moreover, I would like to extract the SITE: line itself into a separate file. Could somebody tell me how that is doable in bash?
Moreover, I would like to extract the SITE: line itself into a separate file.
That’s the easy part:
grep '^SITE:' infile > outfile.site
Extracting the line after that is slightly harder:
grep --after-context=1 '^SITE:' infile \
| grep '^[01]*$' \
> outfile.nr
--after-context (or -A) specifies how many lines after the matching line to print as well. We then use the second grep to print only that line, and not the actually matching line (nor the delimiter which grep puts between each matching entry when specifying an after-context).
Alternatively, you could use the following to match the numeric lines:
grep '^[01]*$' infile > outfile.nr
That’s much easier, but it will find all lines consisting solely of 0s and 1s, regardless of whether they come after a line which starts with SITE:.
You could try something like:
$ egrep -o "^(0|1)+$" test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
$ grep "^SITE:" test.txt > test3.txt
$ cat test3.txt
SITE: 0 0.000340988542 0.0357651018
SITE: 1 0.000529755514 0.00324293642
SITE: 2 0.000577745511 0.052214098
Another solution, using bash:
$ while read; do [[ $REPLY =~ ^(0|1)+$ ]] && echo "$REPLY"; done < test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
To remove the leading 0 characters at the beginning of the line:
$ egrep "^(0|1)+$" test.txt | sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
1010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000
11010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
UPDATE: New file format provided in comments:
$ egrep "^SITE:" test.txt|egrep -o "(0|1)+$"|sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
100000000000000000000001000001000000000000000000000000000000000000
1010010010000000000111101000010000001001010111111100000000000010010001101010100011101011110011100
10000000000
$ egrep "^SITE:" test.txt|sed "s/[01\ ]\{1,\}$//g" > test3.txt
$ cat test3.txt
SITE: 967 0.189021866 0.0169990123
SITE: 968 0.189149593 0.246619149
SITE: 969 0.189172266 6.84752689e-05
Here's a simple awk solution that matches all lines starting with SITE: and outputs the respective next line:
awk '/^SITE:/ { if (getline) print }' infile > outfile
Simply omit the { ... } block part to extract all lines starting with SITE: themselves to a separate file:
awk '/^SITE:/' infile > outfile
If you wanted to combine both operations:
outfile1 and outfile2 are the names of the 2 output files, passed to awk as variables f1 and f2:
awk -v f1=outfile1 -v f2=outfile2 \
'/^SITE:/ { print > f1; if (getline) print > f2 }' infile

Reorder lines of file by given sequence

I have a document A which contains n lines. I also have a sequence of n integers all of which are unique and <n. My goal is to create a document B which has the same contents as A, but with reordered lines, based on the given sequence.
Example:
A:
Foo
Bar
Bat
sequence: 2,0,1 (meaning: First line 2, then line 0, then line 1)
Output (B):
Bat
Foo
Bar
Thanks in advance for the help
Another solution:
You can create a sequence file by doing (assuming sequence is comma delimited):
echo $sequence | sed s/,/\\n/g > seq.txt
Then, just do:
paste seq.txt A.txt | sort -n | sed "s/^[0-9]*\s//"
Here's a bash function. The order can be delimited by anything.
Usage: schwartzianTransform "A.txt" 2 0 1
function schwartzianTransform {
    local file="$1"
    shift
    local sequence="$*"
    echo -n "$sequence" | sed 's/[^[:digit:]][^[:digit:]]*/\
/g' | paste -d ' ' - "$file" | sort -n | sed 's/^[[:digit:]]* //'
}
Read the file into an array and then use the power of indexing:
echo "Enter the input file name"
read ip
index=0
while read -r line ; do
    NAME[$index]="$line"
    index=$(($index+1))
done < "$ip"
echo "Enter the file having order"
read od
while read -r line ; do
    echo "${NAME[$line]}"
done < "$od"
[aman#aman sh]$ cat test
Foo
Bar
Bat
[aman#aman sh]$ cat od
2
0
1
[aman#aman sh]$ ./order.sh
Enter the input file name
test
Enter the file having order
od
Bat
Foo
Bar
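On bash 4+ (note that macOS ships bash 3.2 by default) the first read loop can be replaced by mapfile; a sketch reusing the same ip and od variables:
mapfile -t NAME < "$ip"    # array indices start at 0, matching the order file
while read -r line ; do
    echo "${NAME[$line]}"
done < "$od"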
An awk one-liner could do the job:
awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
$s is your sequence.
Take a look at this example:
kent$ seq 10 >file #get a 10 lines file
kent$ s=$(seq 0 9 |shuf|tr '\n' ','|sed 's/,$//') # get a random sequence by shuf
kent$ echo $s #check the sequence in var $s
7,9,1,0,5,4,3,8,6,2
kent$ awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
8
10
2
1
6
5
4
9
7
3
One way (not an efficient one for big files, though):
$ seq="2 0 1"
$ for i in $seq
> do
> awk -v l="$i" 'NR==l+1' file
> done
Bat
Foo
Bar
If your file is a big one, you can use this one:
$ seq='2,0,1'
$ x=$(echo $seq | awk '{printf "%dp;", $0+1;print $0+1> "tn.txt"}' RS=,)
$ sed -n "$x" file | awk 'NR==FNR{a[++i]=$0;next}{print a[$0]}' - tn.txt
The 2nd line prepares a sed print instruction, which is then used by the sed command in the 3rd line. This prints only the lines whose numbers appear in the sequence, though not in the order of the sequence. The awk command then reorders the sed output according to the sequence.
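Assuming file holds the three example lines from the question, the result comes out in the requested order:
$ sed -n "$x" file | awk 'NR==FNR{a[++i]=$0;next}{print a[$0]}' - tn.txt
Bat
Foo
Bar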
