grep word from column and remove row - shell

I want to grep for a word in a particular column of a file, then remove those rows and put all the remaining rows into another file.
Could anyone please help me with a shell command to get the following output?
I have a file with this format:
1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1234 1234 087664566 4567 678990000
1223 6586 212134344 8906 123456789
I want to grep for the word "1234" in the second column only, remove those rows, and put the remaining rows into another file. So the output should be in this format:
1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1223 6586 212134344 8906 123456789
The output should contain 3 of the 4 rows: every row except the 3rd.
while read value ;do
grep -v ${value:0:10} /tmp/lakshmi.txt > /tmp/output.txt
cp /tmp/output.txt /tmp/no_post1.txt
done < /tmp/priya.txt
Could you please help me to modify this script?

You can use awk for this, if that's good for you:
awk '$2==1234' <file-name>
$2 refers to the second column, and this will return the line:
1234 1234 087664566 4567 678990000
Then you can use sed, grep -v or even awk for further processing: either delete this line from the current file, or print only the lines that do not match into another file. awk will be much easier and more powerful.
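For instance, a minimal sketch of the inverse match with awk, keeping only the rows whose second column is not 1234 (file names taken from the question):
awk '$2 != 1234' /tmp/lakshmi.txt > /tmp/no_post1.txt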

Try the following regular expression.
egrep -v "^[[:space:]]*[^[:space:]]+[[:space:]]+1234[[:space:]]+.*$"
Not sure what your intention is, but my best guess is that you want to do the following.
while read value ;do
egrep -v "^[[:space:]]*[^[:space:]]+[[:space:]]+${value:0:10}[[:space:]]+.*$" /tmp/lakshmi.txt > /tmp/output.txt
cp /tmp/output.txt /tmp/no_post1.txt
done < /tmp/priya.txt

For columnar data, awk is often the best tool to use.
Superficially, if your input data is in priya.txt and you want the output in lakshmi.txt, then this would do the job:
awk '$2==1234 { next } { print }' priya.txt > lakshmi.txt
The first pattern detects 1234 (and also 01234 and 0001234) in column 2 and executes a next which skips the rest of the script. The rest of the script prints the input data; people often use 1 in place of { print }, which achieves the same effect less verbosely (or less clearly).
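For the record, that terser form would read:
awk '$2==1234 { next } 1' priya.txt > lakshmi.txt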
If you want the line(s) with 1234 in another file (filtered.out, say), then you'd use:
awk '$2==1234 { print > "filtered.out"; next } { print }' priya.txt > lakshmi.txt
If the column must be exactly 1234 rather than just numerically equal to 1234, then you would use a regex match instead:
awk '$2 ~ /^1234$/ { next } { print }' priya.txt > lakshmi.txt
The great thing about awk is that it splits the data into fields automatically, and that usually makes it easy to process columnar data with awk. You can also use Perl or Python or other similar scripting languages to do much the same job.

You did not specify the record layout exactly. If an empty first field is represented by 4 spaces, the clever solutions will fail. Can a field contain a space?
When your fields have fixed offsets, you might want to check the offset:
grep -v "^.\{9\}1234"
When /tmp/priya.txt has more than one line, your while loop gets ugly, because each pass must filter the output of the previous pass:
cp /tmp/lakshmi.txt /tmp/output.txt
while read value ;do
grep -v "^.\{9\}${value}" /tmp/output.txt > /tmp/output2.txt
mv /tmp/output2.txt /tmp/output.txt
done < /tmp/priya.txt
You can also use the -f option of grep:
echo "1234 8976 897561234 1234 678901234
5678 5678 123456789 4567 123456790
1234 1234 087664566 4567 678990000
1223 6586 212134344 8906 123456789" |grep -vf <(sed 's/^/^.\\{9\\}/' /tmp/priya.txt )
or in your case
grep -vf <(sed 's/^/^.\\{9\\}/' /tmp/priya.txt ) /tmp/lakshmi.txt
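To see the patterns grep is being fed, the sed part can be run by itself; for a priya.txt containing only the value 1234 it would produce:
$ sed 's/^/^.\\{9\\}/' /tmp/priya.txt
^.\{9\}1234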

Related

Delete values in line based on column index using shell script

I want to be able to delete the values to the RIGHT of a given column index in test.txt, based on a given length, N.
Column index refers to the character position you see when you open the file in the Vim editor on Linux.
If my test.txt contains 1234 5678, and I call my delete_var function with column number 2 (where deleting starts) and length N = 2 (how many characters to delete), then test.txt should read 14 5678, since the characters at columns 2 and 3 were deleted.
I have the following code as of now but I am unable to understand what I would put in the sed command.
delete_var() {
sed -i -r 's/not sure what goes here' test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N" # call the method with the column index and length to delete
#sample test.txt (before call to fn)
1234 5678
#sample test.txt (after call to fn)
14 5678
Can someone guide me?
You should avoid using regex for this task. It is easier to get this done in awk with simple substr function calls:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' file
14 5678
Assuming the OP must use sed (otherwise other options could include cut and awk, but those would require some extra file I/O to replace the original file with the modified results) ...
Starting with the sed command to remove the 2 characters starting in column 2:
$ echo '1234 5678' > test.txt
$ sed -i -r "s/(.{1}).{2}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
Where:
(.{1}) - match first character in line and store in buffer #1
.{2} - match next 2 characters but don't store in buffer
(.*$) - match rest of line and store in buffer #2
\1\2 - output contents of buffers #1 and #2
Now, how to get variables for start and length into the sed command?
Assume we have the following variables:
$ s=2 # start
$ n=2 # length
To map these variables into our sed command we can break the sed search-replace pattern into parts, replacing the first 1 and 2 with our variables like so:
replace {1} with {$((s-1))}
replace {2} with {${n}}
Bringing this all together gives us:
$ s=2
$ n=2
$ echo '1234 5678' > test.txt
$ set -x # echo what sed sees to verify the correct mappings:
$ sed -i -r "s/(.{"$((s-1))"}).{${n}}(.*$)/\1\2/g" test.txt
+ sed -i -r 's/(.{1}).{2}(.*$)/\1\2/g' test.txt
$ set +x
$ cat test.txt
14 5678
Alternatively, do the subtraction (s-1) before the sed call and just pass in the new variable, eg:
$ x=$((s-1))
$ sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
One idea using cut, keeping in mind that storing the results back into the original file will require an intermediate file (eg, tmp.txt) ...
Assume our variables:
$ s=2 # start position
$ n=2 # length of string to remove
$ x=$((s-1)) # last column to keep before the deleted characters (1 in this case)
$ y=$((s+n)) # start of first column to keep after the deleted characters (4 in this case)
At this point we can use cut -c to designate the columns to keep:
$ echo '1234 5678' > test.txt
$ set -x # display the cut command with variables expanded
$ cut -c1-${x},${y}- test.txt
+ cut -c1-1,4- test.txt
14 5678
Where:
1-${x} - keep range of characters from position 1 to position ${x} (1-1 in this case)
${y}- - keep range of characters from position ${y} to end of line (4-EOL in this case)
NOTE: You could also use cut's ability to work with the complement (i.e., explicitly tell it which characters to remove, as opposed to the above, which says which characters to keep). See KamilCuk's answer for an example.
Obviously (?) the above does not overwrite test.txt so you'd need an extra step, eg:
$ echo '1234 5678' > test.txt
$ cut -c1-${x},${y}- test.txt > tmp.txt # store result in intermediate file
$ cat tmp.txt > test.txt # copy intermediate file over original file
$ cat test.txt
14 5678
Looks like:
cut --complement -c $1-$(($1 + $2 - 1))
Should just work: it deletes $2 characters starting at column $1, i.e., columns $1 through $1 + $2 - 1.
Please provide code for how to change test.txt.
cut can't modify a file in place, so either redirect to a temporary file or use sponge:
tmp=$(mktemp)
cut --complement -c $1-$(($1 + $2 - 1)) test.txt > "$tmp"
mv "$tmp" test.txt
The command below results in the elimination of the 2nd character of each line. Try using it in a loop.
sed s/.//2 test.txt
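A rough sketch of such a loop, assuming the start column s and length n used in the earlier answers (each pass deletes the character currently sitting at column s):
s=2   # start column
n=2   # number of characters to delete
for ((i = 0; i < n; i++)); do
    sed -i "s/.//$s" test.txt
done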

bash for loop only returns the last value, x times for an array of length x

I have a file with IDs such as below:
A
D
E
And I have a second file with the same IDs and extra info that I need:
A 50 G25T1 7.24 298
B 20 G234T2 8.3 80
C 5 G1I1 5.2 909
D 500 G458T3 0.4 79
E 321 G46I2 45.8 901
I want to output the third column of the second file, selecting rows whose first column matches the IDs from the first file:
G25T1
G458T3
G46I2
The issue I have is that while the for loop runs, the output is as follows:
G46I2
G46I2
G46I2
Here is my code:
a=0; IFS=$'\r\n' command eval 'ids=($(awk '{print$1}' shared_single_copies.txt | sed -e 's/[[:space:]]//g'))'; for id in "${ids[#]}"; do a=$(($a+1)); echo $a' '"$id"; awk '{$1=="${id}"} END {print $3}' run_Busco_A1/A1_single_copy_ids.txt >> A1_genes_sc_Buscos.txt; done
Your code is way too complicated. Try one of these solutions: "file1" contains the ids, "file2" contains the extra info:
$ join -o 2.3 file1 file2
G25T1
G458T3
G46I2
$ awk 'NR==FNR {id[$1]; next} $1 in id {print $3}' file1 file2
G25T1
G458T3
G46I2
For more help about join, check the man page.
For more help about awk, start with the awk info page.
glenn jackman's answer was by far the most succinct and elegant imo. If you want to use loops, though, then this can work:
#!/bin/bash
# if output file already exists, clear it so we don't
# inadvertently duplicate data:
> A1_genes_sc_Buscos.txt
while read -r selector
do
while read -r c1 c2 c3 garbage
do
[[ "$c1" = "$selector" ]] && echo "$c3" >> A1_genes_sc_Buscos.txt
done < run_Busco_A1/A1_single_copy_ids.txt
done < shared_single_copies.txt
That should work for your use-case provided the formatting is valid between what you gave as input and your real files.

How to use grep -c to count occurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is output with every result (just the values) given by grep in a column, so I can copy it directly into a spreadsheet. Ex.:
132
407
523
Thanks in advance.
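For what it's worth, the zeros come from the single quotes around '$p', which stop the shell from expanding the variable, so grep searches for the literal string $p. A minimal corrected sketch of the same loop:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
  grep -c "$p" file.csv   # double quotes let $p expand
done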
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the cities (I assume it's structured, since it's a CSV file).
For example, to see which shells are used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accepted the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
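For illustration, another common way to initialize is a BEGIN block:
awk 'BEGIN { c = w = n = 0 }
     /Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
     END { print c; print w; print n }' input-file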

awk on multiple files and piping the output of each of its runs to the wc command separately

I have a bunch of record-wise formatted (.csv) files. The first field is an integer or may be empty; this is true for all the files. I want to count the number of records whose first field is empty in each file, and then plot a graph of the counts over all the files.
File format of filename.csv:
123456,few,other,fields
,few,other,fields
234567,few,other,fields
I want something like
awk -F, '$1==""' `ls` | (for each file separately wc -l) | gnugraph ( y axis as output of wc -l command and x axis as simply 1 to n where n is number of csv files)
The problem I am facing is that wc -l gets executed only once for all the files together. I want to run wc -l for each file, count the number of records having an empty first field, and provide this sequence of counts to the gnugraph command.
Once I get the required count for each file I am almost done, as
seq 10 | gnuplot -p -e "plot '<cat'"
works fine
You could use awk to keep track of the count for each file in an array. Then at the end print the contents of the array:
awk '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
This way you don't have to tangle with wc and can just shoot the contents right over to gnuplot.
Example in use:
$> cat file1
,test
2,test
3,
$> cat file2
,test
2,test
3,
,test
$> awk -F"," '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
file1 1
file2 2
With gawk you can use BEGINFILE and ENDFILE:
$ awk -F, '$1==""{++i} BEGINFILE{i=0} ENDFILE{print FILENAME, i}' file1 file2
file1 3
file2 1
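To chain this straight into the plotting step from the question, one possible sketch (GNU awk only, printing just the bare counts in file order):
awk -F, '$1==""{++i} BEGINFILE{i=0} ENDFILE{print i}' file1 file2 | gnuplot -p -e "plot '<cat'"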
If you want to run wc -l separately for each file, you'll have to set up a loop.
Something along the lines of:
for i in `ls`
do
awk -F, '$1==""' "$i" | wc -l
done | gnugraph
For the first field, there is an easier way with grep:
$ grep -c '^,' file{1..3}
file1:1
file2:2
file3:4
I copied your sample to file1 and doubled the matching lines in file2 and file3, respectively.
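If the goal is the plot from the question, the filename prefixes can be stripped with cut and the bare counts piped on; a hedged sketch:
grep -c '^,' file{1..3} | cut -d: -f2 | gnuplot -p -e "plot '<cat'"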

bash - how do I use 2 numbers on a line to create a sequence

I have this file content:
2450TO3450
3800
4500TO4560
And I would like to obtain something of this sort:
2450
2454
2458
...
3450
3800
4500
4504
4508
..
4560
Basically I need a one-liner in sed/awk that reads the values on both sides of the TO separator and either injects them into a seq command or does the loop on its own, dumping the result into the same file, one value per line, with an arbitrary increment, let's say 4 in the example above.
I know I could use a temp file, the read command, sorts, and so on, but I would like to do it as a one-liner starting with cat filename | etc., as it is already part of a bigger script.
Correctness of the input is guaranteed, so the left side of TO is always smaller than the right side.
Thanks
Like this:
awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}' file
or, if you like starting with cat:
cat file | awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}'
Something like this might work:
awk -F TO '{system("seq " $1 " 4 " ($2 ? $2 : $1))}'
This would tell awk to system (execute) the command seq 10 4 10 for lines just containing 10 (which outputs 10), and something like seq 10 4 40 for lines like 10TO40. The output seems to match your example.
Given:
txt="2450TO3450
3800
4500TO4560"
You can do:
echo "$txt" | awk -F TO '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i++) print i}'
If you want an increment greater than 1:
echo "$txt" | awk -F TO -v p=4 '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i+=p) print i}'
Give this a try:
sed 's/TO/ /' file.txt | while read first second; do if [ ! -z "$second" ] ; then seq $first 4 $second; else printf "%s\n" $first; fi; done
sed is used to replace TO with a space character.
read is used to read the line; if there are 2 numbers, seq is used to generate the sequence. Otherwise, the single number is printed.
This might work for you (GNU sed):
sed -r 's/(.*)TO(.*)/seq \1 4 \2/e' file
This evaluates the RHS of the substitution command as a shell command whenever the LHS matches, i.e., when the line contains TO; lines without TO (like 3800) are not substituted and pass through unchanged.
