How can I manipulate the text file using a shell script?
input
chr2:98602862-98725768
chr11:3100287-3228869
chr10:3588083-3693494
chr2:44976980-45108665
expected output
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Using sed you can write
$ sed 's/chr//; s/[:-]/ /g' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Or maybe you could use awk
awk -F "chr|[-:]" '{print $2,$3, $4}' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
What it does
-F "chr|[-:]" sets the field separators to chr or : or -. Now you could print the different fields or columns.
You can also use a different field separator, -F '[^0-9]+', which makes any run of non-digit characters a separator.
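As a quick check of the [^0-9]+ separator, here is a minimal sketch (strip_chr is a hypothetical helper name, not something from the answer). Because each line starts with non-digits, $1 is empty and the numbers land in $2 to $4. One caveat: a non-numeric chromosome name such as chrX would be swallowed by this separator.

```shell
# strip_chr: hypothetical helper; splits on any run of non-digit characters.
# $1 is the empty string before the leading "chr", so the values are $2..$4.
strip_chr() {
  awk -F '[^0-9]+' '{ print $2, $3, $4 }' "$@"
}

printf 'chr2:98602862-98725768\nchr11:3100287-3228869\n' | strip_chr
# 2 98602862 98725768
# 11 3100287 3228869
```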
If you don't care about a leading blank char:
$ tr -s -c '[0-9\n]' ' ' < file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
I have hundreds of thousands of files with several hundreds of thousands of lines in each of them.
2022-09-19/SALES_1.csv:CUST1,US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20/SALES_2.csv:CUST2,NA,2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
The lines may have different numbers of fields. No matter how many fields are present, I want to extract just the last 7 fields.
I'm trying with cut and awk, but I have only been able to print a range of fields, not the last 'n' fields.
Could I please request guidance?
$ rev file | cut -d, -f1-7 | rev
will give the last 7 fields regardless of varying number of fields in each record.
Using any POSIX awk:
$ awk -F',' 'NF>7{sub("([^,]*,){"NF-7"}","")} 1' file
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
{m,g}awk 'BEGIN { _+=(_+=_^= FS = OFS = ",")+_
___= "^[^"(__= "\5") ("]*")__

} NF<=_ || ($(NF-_) = __$(NF-_))^(sub(___,"")*!_)'
US,
2022-09-19,
43.31,
17.56,
47.1,
154.48,
154. 114
2022-09-20,
12.4,
16.08,
48.08,
18.9,
15.9,
3517
In pure Bash, without any external processes and/or pipes:
(IFS=,; while read -ra line; do printf '%s\n' "${line[*]: -7}"; done;) < file
Prints the last 7 fields:
sed -E 's/.*,((.*,){6}.*)/\1/' file
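The answers above hard-code the count of 7. If the count needs to vary, a plain awk loop is one option. This is only a sketch: last_fields is a hypothetical helper name, and it assumes comma-separated input with no quoted commas.

```shell
# last_fields N [file...]: print the last N comma-separated fields of each line.
# Lines with N or fewer fields are printed unchanged.
last_fields() {
  n=$1; shift
  awk -F, -v n="$n" '{
    start = (NF > n) ? NF - n + 1 : 1   # index of the first field to keep
    out = $start
    for (i = start + 1; i <= NF; i++) out = out FS $i
    print out
  }' "$@"
}

printf 'a,b,c,d\n' | last_fields 2
# c,d
```

Usage for the question's data would be last_fields 7 file.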
EDITS: For reference, "stuff" is a general variable, as is "KEEP".
KEEP could be "Hi, my name is Dave" on line 2 and "I love pie" on line 7. The numbers I've put here are for illustration only and DO NOT show up in the data.
I had a file that needed to be parsed, keeping every 4th line, starting at the 3rd line. In other words, it looked like this:
1 stuff
2 stuff
3 KEEP
4
5 stuff
6 stuff
7 KEEP
8 stuff etc...
Great, sed solved that easily with:
sed -n -e 3~4p myfile
giving me
3 KEEP
7 KEEP
11 KEEP
Now I have a different file format and a different take on the pattern:
1 stuff
2 KEEP
3 KEEP
4
5 stuff
6 KEEP
7 KEEP etc...
and I still want the output of
2 KEEP
3 KEEP
6 KEEP
7 KEEP
10 KEEP
11 KEEP
Here's the problem - this is a multi-pattern "pattern" for sed. It's "every 4th line, spit out 2 lines, but start at line 2".
Do I need to have some sort of DO/FOR loop in my sed, or do I need a different command like awk or grep? Thus far, I have tried formats like:
sed -n -e '3~4p;4~4p' myfile
and
awk 'NR % 3 == 0 || NR % 4 ==0' myfile
and
sed -n -e '3~1p;4~4p' myfile
and
awk 'NR % 1 == 0 || NR % 4 ==0' myfile
source: https://superuser.com/questions/396536/how-to-keep-only-every-nth-line-of-a-file
If your intent is to print lines 2,3 then every fourth line after those two, you can do:
$ seq 20 | awk 'BEGIN{e[2];e[3]} (NR%4) in e'
2
3
6
7
10
11
14
15
18
19
You were pretty close with your sed:
$ printf '%s\n' {1..12} | sed -n '2~4p;3~4p'
2
3
6
7
10
11
This is the idiomatic way to write it in awk:
$ awk 'NR%4==2 || NR%4==3' file
However, this special case can be shortened to:
$ awk 'NR%4>1' file
This might work for you (GNU sed):
sed '2~4,+1p;d' file
Use a range: the first address is the starting line and step (here, starting at line 2 and repeating every 4 lines). The second address is how many lines to include after the start of the range (here, one more). Print these lines and delete all others.
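In the generic case, you want to keep lines p to p+q, p+n to p+q+n, p+2n to p+q+2n, and so on. With p, n and q passed in as awk variables, you can write:
awk -v p=2 -v n=4 -v q=1 'NR >= p && (NR - p) % n <= q' file
The NR >= p guard is needed because (NR - p) % n is negative for lines before p, which would otherwise satisfy the comparison.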
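To try the generic form on the 2,3,6,7,... case from the question, here is a small sketch. keep_runs is a hypothetical wrapper name; p, n and q are passed to awk with -v, and the NR >= p guard is included because awk's % yields a negative result for lines before p.

```shell
# keep_runs P N Q [file...]: starting at line P, keep Q+1 lines out of every N.
keep_runs() {
  p=$1; n=$2; q=$3; shift 3
  awk -v p="$p" -v n="$n" -v q="$q" 'NR >= p && (NR - p) % n <= q' "$@"
}

seq 12 | keep_runs 2 4 1
# prints 2 3 6 7 10 11, one per line
```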
I have a log file with plenty of collected logs. I already have a grep command with a regex that outputs the line numbers of the lines that match it.
This is the grep command I'm using to output the matched lines:
grep -n -E 'START_REGEX|END_REGEX' Example.log | cut -d ':' -f 1 > ranges.txt
The regex has two alternatives: it can match either the beginning of a specific log entry or its end, so the output is something like:
12
45
128
136
...
The idea is to use this as a source of ranges: cut each section of the log file from the first number to the second and save the sections to another file.
The ranges are formed from consecutive pairs of numbers in the output. In the example, the first range is 12,45 and the second is 128,136.
I expect to see in the final file all the text from line 12 to 45 and then from 128 to 136.
The problem I'm facing is that the sed command seems to work with only one range at a time.
sed -E -iTMP "$START_RANGE,$END_RANGE! d;${END_RANGE}q" "$FILE_NAME"
Is there any way (maybe with awk) to do that just in one "cycle"?
Constraint: I can only use supported bash commands.
You can use an awk statement, too
awk '(NR>=12 && NR<=45) || (NR>=128 && NR<=136)' file
where NR is a special awk variable that keeps track of the line number as awk processes the file.
An example:
seq 1 10 > file
cat file
1
2
3
4
5
6
7
8
9
10
awk '(NR>=1 && NR<=3) || (NR>=8 && NR<=10)' file
1
2
3
8
9
10
You can also avoid hard-coding the line numbers by using the -v variable option:
awk -v start1=1 -v end1=3 -v start2=8 -v end2=10 '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2)' file
1
2
3
8
9
10
With sed you can do multiple ranges of lines like so:
sed -n '12,45p;128,136p'
This would output lines 12-45, then 128-136.
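Neither answer reads the ranges from ranges.txt itself. One way to do that in a single pass, sketched here, is to build the sed script from the file: paste joins each pair of lines with a comma and a second sed appends p. extract_ranges is a hypothetical helper name, and the sketch assumes ranges.txt holds an even number of line numbers.

```shell
# extract_ranges RANGES LOG: RANGES holds line numbers, one per line,
# read as start,end pairs. "paste -d, - -" turns them into "12,45" and so on,
# then "p" is appended to form the script "12,45p" / "128,136p" for sed -n.
extract_ranges() {
  sed -n "$(paste -d, - - < "$1" | sed 's/$/p/')" "$2"
}

# Small demonstration with temporary files.
demo_dir=$(mktemp -d)
seq 10 > "$demo_dir/Example.log"
printf '2\n3\n7\n8\n' > "$demo_dir/ranges.txt"
extract_ranges "$demo_dir/ranges.txt" "$demo_dir/Example.log"
# prints 2 3 7 8, one per line
rm -r "$demo_dir"
```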
I've got this:
./awktest -v fields=`cat testfile`
which ought to set fields variable to '1 2 3 4 5' which is all that testfile contains
It returns:
gawk: ./awktest:9: fatal: cannot open file `2' for reading (No such file or directory)
When I do this it works fine.
./awktest -v fields='1 2 3 4 5'
printing fields at the time of error yields:
1
printing fields in the second instance yields:
1 2 3 4 5
When I try it with 12345 instead of 1 2 3 4 5 it works fine for both, so it's a problem with the whitespace. What is this problem, and how do I fix it?
This is most likely not an awk question. Most likely, it is your shell that is the culprit.
For example, if awktest is:
#!/bin/bash
# Print each command-line argument on its own numbered line.
i=1
for arg in "$@"; do
    printf "%d\t%s\n" "$i" "$arg"
    ((i++))
done
Then you get:
$ ./awktest -v fields=`cat testfile`
1 -v
2 fields=1
3 2
4 3
5 4
6 5
You see that the file contents are not being handled as a single word.
Simple solution: use double quotes on the command line:
$ ./awktest -v fields="$(< testfile)"
1 -v
2 fields=1 2 3 4 5
The $(< file) construct is a bash shortcut for `cat file` that does not need to spawn an external process.
Or, read the first line of the file in the awk BEGIN block
awk '
BEGIN {getline fields < "testfile"}
rest of awk program ...
'
./awktest -v fields="`cat testfile`"
#note that:
#./awktest -v fields='`cat testfile`'
#does not work
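The word splitting described above can be reproduced without awk or a test file. In this sketch, count_args is a hypothetical helper that only reports how many arguments it received, and contents stands in for the output of cat testfile. It assumes the default IFS.

```shell
# count_args: print the number of arguments received.
count_args() { echo "$#"; }

contents='1 2 3 4 5'   # stands in for the contents of testfile

count_args -v fields=$contents     # unquoted: split into words, prints 6
count_args -v fields="$contents"   # quoted: stays one word, prints 2
```

In the unquoted case awk receives 2, 3, 4 and 5 as separate arguments and treats them as input file names, which matches the "cannot open file" error in the question.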
I need to find a faster way to number lines in a file in a specific way using tools like awk and sed. I need the first character on each line to be numbered in this fashion: 1,2,3,1,2,3,1,2,3 etc.
For example, if the input was this:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
The output needs to look like this:
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
Here is a chunk of what I have. $lines is the number of lines in the data file divided by 3. So for a file of 21000 lines I process this loop 7000 times.
export i=0
while [ $i -le $lines ]
do
export start=`expr $i \* 3 + 1`
export end=`expr $start + 2`
awk NR==$start,NR==$end $1 | awk '{printf("%d%s\n", NR,$0)}' >> data.out
export i=`expr $i + 1`
done
Basically this grabs 3 lines at a time, numbers them, and adds to an output file. It's slow...and then some! I don't know of another, faster, way to do this...any thoughts?
Try the nl command.
See https://linux.die.net/man/1/nl (or another link to the documentation that comes up when you Google for "man nl" or the text version that comes up when you run man nl at a shell prompt).
The nl utility reads lines from the
named file or the standard input if
the file argument is omitted, applies
a configurable line numbering filter
operation and writes the result to the
standard output.
edit: No, that's wrong, my apologies. The nl command doesn't have an option for restarting the numbering every n lines, it only has an option for restarting the numbering after it finds a pattern. I'll make this answer a community wiki answer because it might help someone to know about nl.
It's slow because you are reading the same lines over and over. Also, you are starting up an awk process only to shut it down and start another one. Better to do the whole thing in one shot:
awk '{print ((NR-1)%3)+1 $0}' $1 > data.out
If you prefer to have a space after the number:
awk '{print ((NR-1)%3)+1, $0}' $1 > data.out
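If the cycle length is not always 3, the same single-pass idea can take it as a parameter. A minimal sketch, where cycle_number is a hypothetical helper name and printf is used instead of print only to make the formatting explicit:

```shell
# cycle_number N [file...]: prefix each line with 1..N, restarting after N.
cycle_number() {
  n=$1; shift
  awk -v n="$n" '{ printf "%d%s\n", (NR - 1) % n + 1, $0 }' "$@"
}

printf 'line 1\nline 2\nline 3\nline 4\n' | cycle_number 3
# 1line 1
# 2line 2
# 3line 3
# 1line 4
```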
Perl comes to mind:
perl -pe '$_ = (($.-1)%3)+1 . $_'
should work. No doubt there is an awk equivalent. Basically, ((line# - 1) MOD 3) + 1.
This might work for you:
sed 's/^/1/;n;s/^/2/;n;s/^/3/' input
Another way is just to use grep and match everything. For example this will enumerate files:
grep -n '.*' <<< `ls -1`
Output will be:
1:file.a
2:file.b
3:file.c
awk '{printf "%d%s\n", ((NR-1) % 3) + 1, $0;}' "$@"
Python
import sys
# Prefix each input line with 1, 2, 3, 1, 2, 3, ...
for count, line in enumerate(sys.stdin):
    sys.stdout.write("%d%s" % (1 + (count % 3), line))
You don't need to leave bash for this:
i=0; while IFS= read -r; do echo "$((i++ % 3 + 1))$REPLY"; done < input
This should solve the problem. $0 is the whole line.
awk '{print ((NR-1)%3+1) $0}' < input
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
# cat input
line 1
line 2
line 3
line 4
line 5
line 6
line 7