grep: keep lines by number in specific column - bash

I know how to do it with awk. For example, to keep lines that contain the number 3 in the second column: $ awk '$2 == 3'
But how to do the same with only grep?
What about the first column?

Grep is not great for this; awk is better. But assuming your columns are separated by spaces, you want
grep -E '^[^ ]+ +3( |$)'
Explanation: match the start of the line, followed by one or more non-space characters (the first column), then one or more spaces (the column separator), then the number 3, then either a space (because there is another column) or the end of the line (if there is no further column).
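For example (my own test data, not from the question):
$ printf '%s\n' 'a 3 x' 'b 30 x' 'c 3' | grep -E '^[^ ]+ +3( |$)'
a 3 x
c 3
The row with 30 in the second column is not matched. For the first column, the same idea reduces to grep -E '^3( |$)'.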

Here is the longer explanation for my mysterious command grep -P '^[^\t]*\t3\t' your_file from the comments:
I assumed that the column delimiter is a tab. Plain grep without -P would need some awkward workarounds to match a literal tab directly; the -P option makes it possible to just write \t without any problems. If, for example, your delimiter is ; then you can replace the \t with ; and you don't need the -P option at all.
Having said that, let's explain the idea behind the regular expression. You said you want to match a 3 in the second column:
^ means: at the beginning of the line
[^\t]* means: zero or more (*) occurrences of something that is not a tab ([^\t] — here the ^ means "not a")
followed by tab
followed by 3
followed by tab
Now we have effectively expressed the idea that we need a 3 as the content of the second column (\t3\t) and we are not interested in the precise content of the first column. The ^[^\t]*\t is only necessary to express the idea "what follows is in the second column".
If you want to match something in the fourth column, you can use this to "skip" the first three columns and match a 4 in the fourth column:
^([^\t]*\t){3}4 (note the parentheses and the {3}).
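A quick way to test this (my own example, assuming tab-separated input):
$ printf 'a\t3\tx\nb\t33\tx\n' | grep -P '^[^\t]*\t3\t'
This prints only the first line, because only there is the second column exactly 3.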
As you can see, there are many details to get right, and awk is much more elegant and easier here.
You can read up on this in the grep documentation, and then you will need to study regular expressions a bit.

Related

How can I replace all occurrences of a value in a text file in just one column using the sed command in a shell script (columns are separated by ;)? [duplicate]

I have a file whose columns are separated by a semicolon (;), and I want to change all occurrences of a word, in one particular column only, to another word. The column number varies and is held in a variable. The word I want to change is stored in a variable, and the word I want to change it to is stored in a variable too.
I tried
sed -i "s/\<$word\>/$wordUpdate/g" $anyFile
I tried this, but it changed all occurrences of the word in the whole file! I only want to change it in one particular column.
The number of the column is stored in a variable called numColumn,
and the columns are separated by a semicolon ;.
It is much simpler to use awk for column edits, e.g. if your input looks like this:
68;61;83;27;60;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
You can replace a value matching m in column c with s using something like this (here c=5, m=60, s=XX):
<infile awk '$c ~ m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX
Output:
68;61;83;27;XX;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
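One caveat (my note, not part of the original answer): ~ performs a regular-expression match, so m=60 would also hit fields such as 160 or 602. If you want an exact, whole-field comparison, use == instead:
<infile awk '$c == m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX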
This might work for you (GNU sed):
word=foo wordUpdate=bar numColumn=3
sed -i 'y/;/\n/
s#.*#echo "&" | sed "'${numColumn}'s/\<'${word}'\>/'${wordUpdate}'/"#e
y/\n/;/' file
Translate the ; separators to newlines, so that each column of the line sits on its own line.
Run an inner sed (via the e flag) that substitutes the updated word on the line whose number matches the column number.
Translate the newlines back to ;.
N.B. The solution relies on the GNU-only e evaluation flag. Also, word and wordUpdate may need to be quoted.
This can be done with a little creativity...
Note that I'm using double-quotes to embed the logic. This takes a little extra care to double your \'s on backreferences.
$: word=baz; c=3; new=XX; lead="^([^;]*;){$((c-1))}"; sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file
1;2;3;4;5;6;7;8;9;0;
foo;bar;XX;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
Explained:
lead="^([^;]*;){$((c-1))}"
^ means at the start of a record
(...) is grouping for the following {...}, which specifies repetition
[^;]* means zero or more non-semicolons
$((c-1)) does the math and returns one less than the desired column; if you want to look at column 3, it returns two.
So, ^([^;]*;){$((c-1))} means: at the start of the record, one-less-than-column occurrences of (non-semicolons followed by a semicolon)
Thus, sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file means: read file, and on records where $word occurs in the requested column, save everything before it, put that stuff back, and replace $word with $new.
Even if you MUST use sed, I recommend a function.
fix(){
local word="$1" col="$2" new="$3" file="$4"
local lead="^([^;]*;){$((col-1))}"
sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" "$file"
}
In use -
$: fix bar 2 HI file
1;2;3;4;5;6;7;8;9;0;
foo;HI;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix 1 1 XX file
XX;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix bar 2 '(^_^)' file
1;2;3;4;5;6;7;8;9;0;
foo;(^_^);baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
No changes if no matches -
$: fix bar 5 HI file
1;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
NOTE -
This logic requires trailing delimiters if you ever want to match the last field -
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;HI;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
delimiters removed:
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;0
foo;bar;baz;qux;foo;bar;baz;qux
a;b;c;d;e;f;g
Otherwise you have to complicate the logic a bit.
But honestly, for field parsing, you'd be so much better served to use awk, or even perl or python, or for that matter a bash loop, though that's going to be relatively slow.
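For comparison, here is a rough awk equivalent of the same function (my sketch; it does an exact whole-field comparison rather than a regex match, and it reaches the last field without needing a trailing delimiter):
fix_awk(){
local word="$1" col="$2" new="$3" file="$4"
awk -v w="$word" -v c="$col" -v n="$new" 'BEGIN{FS=OFS=";"} $c==w{$c=n} 1' "$file"
}
$: fix_awk bar 2 HI file
1;2;3;4;5;6;7;8;9;0;
foo;HI;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;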

Grep command with fields

I have formatted text and I need to find a sequence of two A characters separated by any two characters. The point is that I need to look for them only in the second column of the formatted text. I need to use the grep command. I came up with this:
grep -E A\.\.\A data.txt
which works correctly for all the columns, but I need to search only in the second one. Any suggestions?
Thank you
Using grep and assuming that , is your field separator, you can use something like this:
grep -E "^[^,]*,[^,]*A[^,]{2}A" data.txt
Here we
skip the first column:
start of line ^
everything up to the first comma [^,]*
the first comma ,
now that we are behind the first comma, i.e. in the second column, we match
optional characters [^,]* before the A
two characters that are not a comma: [^,]{2}
followed by the second A
But as others have already said: awk is probably the better tool for this task.
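For reference, the awk equivalent of the same check (my sketch, again assuming , as the separator) would be:
awk -F, '$2 ~ /A..A/' data.txt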

Make changes to a file (sed, awk)

I am trying to clean up the following file:
1. 10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
2. ¼?Amµxðïej£„7‹ìËÏð‡.4 --
3. 10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~£ˆD]†CÀ,£ÑÉ»In&Ry+/jÑ%A¡ã ÷d_#C÷—NÏÕÞ
3. Ü‚úè"åD\’c\ûñ7x°yFæï --
Note that the numbers are not an actual part of the file; they are just line-number references. The size of a line depends on the encoded message (that is why the 3 is repeated: it is basically one line). There are thousands of records, but they follow the same pattern. Each record ends with --.
Basically, what I am trying to achieve is to just get the IPs side by side.
For example:
10.160.120.10 000BBBBBA4FB
My first step would be to delete everything between the first (;) and the fourth (~) since that pattern is the same for each record.
Which leads me to this.
sed 's/;.*~//'
However, this particular command would delete everything until the last (~) and not the fourth.
If it successfully removes everything between the first (;) and the fourth (~), it would get me something like this:
0.165.65.113 0008B9A4F3~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
¼?Amµxðïej£„7‹ìËÏð‡.4 --
And then I guess I could delete everything after the first (~) so I can get the desired output.
Am I following the right procedure? Should I achieve this with sed or awk? Any suggestions are appreciated!
Instead of trying to remove stuff, why don't you just keep the stuff you want?
sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
# IP Address 12 Hex digits
Explanation:
\1 \3 means: output whatever matched the first and the third sets of parentheses of the search term.
^[^0-9]* matches all non-digits from the beginning of the line
([0-9]{1,3}\.){3}[0-9]{1,3} matches an IP address. The whole term is in parentheses because we want to keep it. The inner (...) could be referenced as \2 in the replacement term, but we don't need that.
[0-9A-F]{12} is simply 12 hexadecimal digits (upper case; use [0-9a-fA-F] if you expect lower case as well)
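As a quick sanity check on a simplified line (hypothetical input, with the binary payload shortened to ...):
echo '10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~...' | sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
10.160.120.10 000BBBBBA4FB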
Assuming your data structure is the same:
Use several field separators at once with a character class covering both ; and ~, with optional surrounding blanks. Be careful not to rely on the default whitespace separator alone, as that would leave the values you want in different field positions.
awk -F '[[:blank:]]*[;~][[:blank:]]*' '/--$/ {print $1 " " $7}' YourFile
Assuming the separators are padded only with space characters (no tabs):
awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' YourFile

Change field name and edit a csv file

I have a csv file I am looking at in bash that I am trying to manipulate. There are several things that I am trying to edit. The structure is like so, where the first row contains the column (field) headers:
cat,dog,hippopotamus,zebra
1,,3,2
three species, five species,only one,multiple
at,home, at, home, wild, wild
How can I edit the field (column) names in the csv?
head -1 test.csv
shows what the field (column) names are, but it still includes the commas, and it doesn't let me change a field name at all.
The other part of this is that I want to only edit titles that are longer than 8 characters, in which case I will just take the first 8 characters. I'm guessing I would use some sort of loop based on string length, but since I don't know how to even edit the field name of just one column, I'm not sure how to do this. In the scenario above: changing hippopotamus to hippopot.
How can I replace empty cells in the csv with NA or NULL?
sed -i 's/ /NULL/g'
I thought this would work, but it doesn't.
Some of the cells have commas within them, messing with the , delimiter. I used the code below and it seems to work, but is there a better/safer way to do this?
sed -i "s/, /_/g"
Or, in a similar situation: if multiple columns contain strings that sometimes have spaces within them, but I only want to remove the spaces in one of the columns while leaving the other columns alone, how can I achieve this?
sed -i 's/ //g' test.csv
Sed will allow a line number as a command prefix, to only work on a single line (or a range of numbers, to work on lines in that range). Try something like this.
sed -e '1s/cat/Feline/' test.csv > test2.csv
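The question also asks about truncating header names longer than 8 characters. The answers here don't cover that, but a minimal awk sketch (my addition, assuming plain comma-separated headers without embedded quoted commas) could look like:
awk -F, -v OFS=, 'NR==1{for(i=1;i<=NF;i++) if(length($i)>8) $i=substr($i,1,8)} 1' test.csv
which turns hippopotamus into hippopot while leaving the shorter names and the data rows alone.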
CSV files will store an empty field as either a comma at start of line, a comma at end of line, or a comma followed by another comma:
Field1,Field2,Field3
,"<-- empty field1",field3
field1,,"<-- empty field2"
field1,"empty field3-->",
You can use the following sed commands to fix these:
sed -e 's/^,/NA,/;s/,$/,NA/' -e ':loop' -e 's/,,/,NA,/g;tloop' test.csv
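Applied to the sample above, that should give:
Field1,Field2,Field3
NA,"<-- empty field1",field3
field1,NA,"<-- empty field2"
field1,"empty field3-->",NA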
Your solution appears good. Be aware, however, that CSV should have quotes around any string containing a comma. And that's legit. It's also the point where sed stops being a good tool for manipulating CSV files. ;-) One suggestion would be to replace "interior" commas with "%2C", which is the URL (percent) encoding for a comma. That's pretty distinctive, and at least somewhat standard.
sed numbers groups starting from the left-most paren. If your groups match multiple times, you can only get the last match contents, but if an outer group contains the multi-match, the outer group is still valid. (I assume here that you have already replaced the "interior" commas with something else.)
sed -e ':loop' -e 's/^\(\([^,]*,\)\{3\}\)\([^ ,]*\) /\1\3/;tloop'
This will remove the first space in column 4, then loop. It will stop when it finds the comma that ends the column, or end-of-line.
Note that the first part, called \1, is general. You can replace the 3 with whatever field, minus one, and that will get you to the start of the field. The actual work is in the second part, \3, where you can do what you like. (Note that \2 is included within \1, and not particularly useful.)
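As a quick check (my example, using the row from the CSV question where column 4 has a leading space):
echo 'at,home, at, home, wild, wild' | sed -e ':loop' -e 's/^\(\([^,]*,\)\{3\}\)\([^ ,]*\) /\1\3/;tloop'
at,home, at,home, wild, wild
Only the space in the fourth field is removed; the other columns keep theirs.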

Remove rows with too many delimiters

I have a file with fields separated by the '`' character. But sometimes the actual data also contains this character. How can I remove all the erroneous rows and retain only the good quality data?
A sample row is below. Towards the end, 'fff`ff' is the erroneous column; in such a case the row should be eliminated.
xxx`1000165811`2012`2012_q2`05/09/2012 22:02:00`1343`04/07/2004 00:00:00`05/09/2012 00:00:00````F`1`1.000000`9.620000`1.0000````fff`Not`Free`Free`1.000000`9.620000`0.000000`1.0000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`56565666`255.590000`21`0`0.000000```ddd`dddd`FA May 2012 ddd`0.000000`0.000000`0.000000`0.000000`0.000000`05/30/2012 00:00:00`05/30/2012 00:00:00`1.000000`ddd`ddd`OW`DL`dd dd dd`ddd`dd`dd dd`dd dd`0.000000`0.000000``````````0.000000`````````Non_Mobile`9.620000`1.000000`1`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`9.620000`9.620000`0.000000`0.000000`0.000000`0.000000`28.590000`6.990000`**fff`ff**`````````9.620000`1.000000`1
You need to know what the correct number of delimiters in a line is. You need to count the actual number of delimiters in each line, and reject those lines where the actual count is not the correct number.
Assuming that the correct number of separators is n=5, then you could try:
n=5
grep -E '^[^`]*(`[^`]*){'"$n"'}$' data
The regex uses extended regular expressions (-E). The regex matches the start of the line, zero or more non-back-ticks, then a sequence of n occurrences of a back tick followed by zero or more non-back-ticks, followed by the end of line. Because the back-tick is a shell metacharacter, it is best to enclose most of the regular expression in single quotes. The variable $n could be used without the double quotes around it, but it's generally best to enclose variables in double quotes. Clearly, you can also use this version too:
grep -E '^([^`]*`){'"$n"'}[^`]*$' data
Given a data file data:
AA`BB`CC`DD`EE`FF
AABB`CC`DD`EE`FF
A`A`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
``````
````
Welcome`to`the`land`of`insanity
The output of the command is:
AA`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
Welcome`to`the`land`of`insanity
grep -v "[^`]`[^`]`[^`]`"
you need to have one more times that the correct lines would have
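For the sample data above, where good lines contain exactly five back-ticks, that idea would look something like this (my formulation):
grep -vE '([^`]*`){6}' data
Note that unlike the grep -E approach above, this only rejects lines with too many delimiters; lines with too few would still slip through.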
In the spirit of "Be careful what you ask for", here is a "one-liner" (spread over three lines for readability) that will do what was asked, using only awk and assuming that $FILE is the relevant filename.
awk -F'`' -v file="$FILE" '
BEGIN{ while(getline<file){if (min==""||NF<min){min=NF}}}
NF==min' "$FILE"
This incantation first determines the minimum number of delimiters per line (without sorting the file), and then rejects all lines with more than that many.
(This is similar to Ed Morton's proposal, but without the bug :-)
