Remove rows with too many delimiters - bash

I have a file with fields separated by the '`' character. But sometimes the actual data also contains this character. How can I remove all the erroneous rows and retain only the good quality data.
Sample Row as below . Towards the end 'fff`ff' this is the erroneous column . in such case The row should be eliminated.
xxx`1000165811`2012`2012_q2`05/09/2012 22:02:00`1343`04/07/2004 00:00:00`05/09/2012 00:00:00````F`1`1.000000`9.620000`1.0000````fff`Not`Free`Free`1.000000`9.620000`0.000000`1.0000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`56565666`255.590000`21`0`0.000000```ddd`dddd`FA May 2012 ddd`0.000000`0.000000`0.000000`0.000000`0.000000`05/30/2012 00:00:00`05/30/2012 00:00:00`1.000000`ddd`ddd`OW`DL`dd dd dd`ddd`dd`dd dd`dd dd`0.000000`0.000000``````````0.000000`````````Non_Mobile`9.620000`1.000000`1`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`0.000000`9.620000`9.620000`0.000000`0.000000`0.000000`0.000000`28.590000`6.990000`**fff`ff**`````````9.620000`1.000000`1

You need to know what the correct number of delimiters in a line is. You need to count the actual number of delimiters in each line, and reject those lines where the actual count is not the correct number.
Assuming the the correct number of separators is n=5, then you could try:
n=5
grep -E '^[^`]*(`[^`]*){'"$n"'}$' data
The regex uses extended regular expressions (-E). The regex matches the start of the line, zero or more non-back-ticks, then a sequence of n occurrences of a back tick followed by zero or more non-back-ticks, followed by the end of line. Because the back-tick is a shell metacharacter, it is best to enclose most of the regular expression in single quotes. The variable $n could be used without the double quotes around it, but it's generally best to enclose variables in double quotes. Clearly, you can also use this version too:
grep -E '^([^`]*`){'"$n"'}[^`]*$' data
Given a data file data:
AA`BB`CC`DD`EE`FF
AABB`CC`DD`EE`FF
A`A`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
``````
````
Welcome`to`the`land`of`insanity
The output of the command is:
AA`BB`CC`DD`EE`FF
`BB`CC`DD`EE`FF
`BB`CC`DD`EE`
``CC`DD`EE`
``CC``EE`
````EE`
`BB```EE`
`````
Welcome`to`the`land`of`insanity

grep -v "[^`]`[^`]`[^`]`"
you need to have one more times that the correct lines would have

In the spirit of "Be careful what you ask for", here is a "one-liner" (spread over three lines for readability) that will do what was asked, using only awk and assuming that $FILE is the relevant filename.
awk -F'`' -v file="$FILE" '
BEGIN{ while(getline<file){if (min==""||NF<min){min=NF}}}
NF==min' "$FILE"
This incantation first determines the minimum number of delimiters per line (without sorting the file), and then rejects all lines with more than that many.
(This is similar to Ed Morton's proposal, but without the bug :-)

Related

How can I replace all occurrences of a value in a text file just in one column using the Sed command in shell script (columns are seperated by ;)? [duplicate]

This question already has answers here:
sed: replace values in a single column
(3 answers)
Closed last month.
I have a file that has columns seperated by a semi column(;) and I want to change all occurrences of a word in a particular column only to another word. The column number differentiates based on the variable that holds the column number. The word I want to change is stored in a variable, and the word I want to change to is stored in a variable too.
I tried
sed -i "s/\<$word\>/$wordUpdate/g" $anyFile
I tried this but it changed all occurrences of word in the whole file! I only want in a particular column
the number of column is stored in a variable called numColumn
and the columns are seperated by a semi column ;
It is much simpler to use awk for column edits, e.g. if your input looks like this:
68;61;83;27;60;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
You can replace 60 in column c with s with something like this:
<infile awk '$c ~ m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX
Output:
68;61;83;27;XX;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
This might work for you (GNU sed):
word=foo wordUpdate=bar numColumn=3
sed -i 'y/;/\n/
s#.*#echo "&" | sed "'${numColumn}'s/\<'${word}'\>/'${wordUpdate}'/"#e
y/\n/;/' file
Convert each line into a separate file where the columns are lines.
Substitute the matching line (column number) with the word for the updated word.
Reverse the conversion.
N.B. The solution relies on the GNU only e evaluation flag. Also the word and updateWord may need to be quoted.
This can be done with a little creativity...
Note that I'm using double-quotes to embed the logic. This takes a little extra care to double your \'s on backreferences.
$: word=baz; c=3; new=XX; lead="^([^;]*;){$((c-1))}"; sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file
1;2;3;4;5;6;7;8;9;0;
foo;bar;XX;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
Explained:
lead="^([^;]*;){$((c-1))}"
^ means at the start of a record
(...) is grouping for the following {...} which specified repetition
[^;]* mean zero or more non-semicolons
$((c-1)) does the math and returns one less than the desired column; if you want to look at column 3, it returns two.
SO, ^([^;]*;){$((c-1))} at the start of the record, one-less-than-column occurrences of non-semicolons followed by a semicolon
thus, sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file mean read file and on records where $word occurs in the requested column, save everything before it, and put that stuff back, but replace $word with $new.
Even if you MUST use sed, I recommend a function.
fix(){
local word="$1" col="$2" new="$3" file="$4"
local lead="^([^;]*;){$((col-1))}"
sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" "$file"
}
In use -
$: fix bar 2 HI file
1;2;3;4;5;6;7;8;9;0;
foo;HI;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix 1 1 XX file
XX;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix bar 2 '(^_^)' file
1;2;3;4;5;6;7;8;9;0;
foo;(^_^);baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
No changes if no matches -
$: fix bar 5 HI file
1;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
NOTE -
This logic requires trailing delimiters if you ever want to match the last field -
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;HI;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
delimiters removed:
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;0
foo;bar;baz;qux;foo;bar;baz;qux
a;b;c;d;e;f;g
Otherwise you have to complicate the logic a bit.
But honestly, for field parsing, you'd be so much better served to use awk, or even perl or python, or for that matter a bash loop, though that's going to be relatively slow.

How to format number using shell script

I have a text document with the following content:
[ForwardTimer],__fc_layer_1__,[Span:1ms970us]
[ForwardTimer],__batch_norm_2__,[Span:5ms64us]
[ForwardTimer],__batch_norm_3__,[Span:5ms87us]
I want to convert the time values in ms unit, like
[ForwardTimer],__fc_layer_1__,1.970ms
[ForwardTimer],__batch_norm_2__,5.064ms
[ForwardTimer],__batch_norm_3__,5.087ms
while keeping previous words unchanged.
How can I process the document using shell script, especially with sed or awk command?
awk -F '\\[Span:' '{split($2,array,"ms|us"); printf("%s%s.%03dms\n",$1,array[1],array[2])}' file.txt
Output:
[ForwardTimer],__fc_layer_1__,1.970ms
[ForwardTimer],__batch_norm_2__,5.064ms
[ForwardTimer],__batch_norm_3__,5.087ms
This splits your lines with [Span: as field separator in two parts ($1 and $2). With function split() and ms or us as field separator it splits $2 in three parts (array[1], array[2] and array[3]). array[3] is unused. The formatted output then makes printf().
This might work for you (GNU sed):
sed -E 's/\[Span:([0-9]*)([^0-9]*)([0-9]*)[^]]*[]]/\1.\n\3\2/;:a;/\n[0-9]{3}/!s/\n/&0/;ta;s/\n//' file
Use pattern matching and back references to achieve the desired result.
Not forgetting to zero space the decimal part of the match using a loop and the introduced newline, which on completion is removed.
The first substitution command focuses on a string such as [Span:5ms64us] and if found groups the 5 in back reference 1, ms in back reference 2 and 64 in back reference 3. These are rearranged into \1.\n\3\2 i.e. 5.\n64ms and the remainder of the initial string is removed.
The second part of the sed script zero spaces the decimal part of back reference 3 to be 3 digits long with leading zeroes. Using the \n as a marker, if the numeric digits following the \n is less than 3 in length, a 0 is appended to the \n and the check is repeated. Once the check passes i.e. there are 3 digits, the \n is removed and the processing is complete.

replacing specific characters in a line shell script

I have the following contents in a file
{"Hi","Hello","unix":["five","six"]}
I would like to replace comma within the square brackets only to semi colon. Rest of the comma's in the line should not be changed.
Output should be
{"Hi","Hello","unix":["five";"six"]}
I have tried using sed but it is not working. Below is the command I tried. Kindly help.
sed 's/:\[*\,*\]/;/'
Thanks
If your Input_file is same as sample shown then following may help you in same.
sed 's/\([^[]*\)\([^,]*\),\(.*\)/\1\2;\3/g' Input_file
Output will be as follows.
{"Hi","Hello","unix":["five";"six"]}
EDIT: Adding explanation also for same now, it should be only taken for explanation purposes, one should run above code only for getting the output.
sed 's/\([^[]*\)\([^,]*\),\(.*\)/\1\2;\3/g' Input_file
s ##is for substitution in sed.
\([^[]*\) ##Creating the first memory hold which will have the contents from starting to before first occurrence of [ and will be obtained by 1 later in code.
\([^,]*\) ##creating second memory hold which will have everything from [(till where it stopped yesterday) to first occurrence of ,
, ##Putting , here in the line of Input_file.
\(.*\) ##creating third memory hold which will have everything after ,(comma) to till end of current line.
/\1\2;\3/g ##Now mentioning the memory hold by their number \1\2;\3/g so point to be noted here between \2 and \3 have out ;(semi colon) as per OP's request it needed semi colon in place of comma.
Awk would also be useful here
awk -F'[][]' '{gsub(/,/,";",$2); print $1"["$2"]"$3}' file
by using gsub, you can replace all occurrences of matched symbol inside a specific field
Input File
{"Hi","Hello","unix":["five","six"]}
{"Hi","Hello","unix":["five","six","seven","eight"]}
Output
{"Hi","Hello","unix":["five";"six"]}
{"Hi","Hello","unix":["five";"six";"seven";"eight"]}
You should definitely use RavinderSingh13's answer instead of mine (it's less likely to break or exhibit unexpected behavior given very complex input) but here's a less robust answer that's a little easier to explain than his:
sed -r 's/(:\[.*),(.*\])/\1;\2/g' test
() is a capture group. You can see there are two in the search. In the replace, they are refered to as \1 and \2. This allows you to put chunks of your search back in the replace expression. -r keeps the ( and ) from needing to be escaped with a backslash. [ and ] are special and need to be escaped for literal interpretation. Oh, and you wanted .* not *. The * is a glob and is used in some places in bash and other shells, but not in regexes alone.
edit: and /g allows the replacement to happen multiple times.

grep: keep lines by number in specific column

I know how to do it with awk, for example, keep lines, which contains number 3 in second column: $ awk '"$2" == 3'
But how to do the same with only grep?
What about for first column?
Grep is not great for this, awk is better. But assuming your columns are separated by spaces, then you want
grep -E '^[^ ]+ +3( |$)'
Explanation: find something that has a start of line, followed by one or more non-space characters (first column), then one or more space characters (column separator), then the number 3, then either a space (because there's another column) or end of line (if there's no other column).
(Updated to fix syntax after testing.)
Here is the longer explanation for my mysterious command grep -P '^[^\t]*\t3\t' your_file from the comments:
I assumed that the column delimiter is a tab. grep without -P would require some strange things to use it directly (see e.g. see here ) . The -P makes it possible to just write \t without any problems. If for example your delimiter is ; then you could replace the \t with ; and you dont need the -P option.
Having said that, lets explain the idea behind the regular expression: You said, you want to match a 3 in the second column:
^ means: at the beginning of the line
[^\t]* means: zero or more (*) occurences of something not a tab ([^\t] here the ^ means "not a")
followed by tab
followed by 3
followed by tab
Now we have effectively expressed the idea that we need a 3 as the content of the second column (\t3\t) and we are not interested in the precise content of the first column. The ^[^\t]*\t is only necessary to express the idea "what follows is in the second column".
If you want to match something in the fourth column, you could use this to "skip" the first three column and match a 4 in the fourth column:
^([^\t]*\t){3}4. (Note the parenthesis and the {3}).
As you can see many details and awk is much more elegant and easy.
You can read this up in the documentation of grep and then you will need to study something about regular expression, e.g. start here.

Change field name and edit a csv file

I have a csv file I am looking at in bash that I am trying to manipulate. There are several things that I have/am trying to edit. Structure is like so where the first row are the column(field) headers
cat,dog,hippopotamus,zebra
1,,3,2
three species, five species,only one,multiple
at,home, at, home, wild, wild
How can I edit the field (column) names in the csv?
head -1 test.csv
shows what the field (column) names are, but it still has the commas in it as well and this doesn't allow for field name changing at all.
The other part about this is that I want to only edit titles that are greater than 8 characters in length, in which case I will just take the first 8 characters. I'm guessing I would use some sort of loop based on string length but since I don't know how to even edit the field name of just one column I'm not sure how to do this. In scenario above, changing hippopotamus to hippopot.
How can I replace empty cells in the csv to NA or NULL?
sed -i 's/ /NULL/g'
Thought would work but doesn't.
Some of the cells have commas within them, messing with the , delimiter. I used the code below and it seems to work, but is there a better/safer way to do this?
sed -i "s/, /_/g"
Or in a similar situation, if multiple columns contain strings sometimes with spaces within a string but I only want to remove the space in one of the columns while leaving the other columns alone, how can I achieve this?
sed -i 's/ //g' test.csv
Sed will allow a line number as a command prefix, to only work on a single line (or a range of numbers, to work on lines in that range). Try something like this.
sed -e '1s/cat/Feline/' test.csv > test2.csv
CSV files will store an empty field as either a comma at start of line, a comma at end of line, or a comma followed by another comma:
Field1,Field2,Field3
,"<-- empty field1",field3
field1,,"<-- empty field2"
field1,"empty field3-->",
You can use the following sed commands to fix these:
sed -e 's/^,/NA,/;s/,$/,NA/' -e ':loop' -e 's/,,/,NA,/g;tloop' test.csv
Your solution appears good. Be aware, however, that CSV should have quotes around any string containing a comma. And that's legit. It's also the point where sed stops being a good tool for manipulating CSV files. ;-) One suggestion would be to replace "interior" commas with "%2C", which is the HTML encoding for a comma. That's pretty distinctive, and at least somewhat standard.
sed numbers groups starting from the left-most paren. If your groups match multiple times, you can only get the last match contents, but if an outer group contains the multi-match, the outer group is still valid. (I assume here that you have already replaced the "interior" commas with something else.)
sed -e ':loop' -e '^\(\([^,]*,\)\{3\}\)\([^ ,]*\) /\1\3/;tloop'
This will remove the first space in column 4, then loop. It will stop when it finds the comma that ends the column, or end-of-line.
Note that the first part, called \1, is general. You can replace the 3 with whatever field, minus one, and that will get you to the start of the field. The actual work is in the second part, \3, where you can do what you like. (Note that \2 is included within \1, and not particularly useful.)

Resources