Use sed (or similar) to remove anything between repeating patterns - bash

I'm essentially trying to "tidy" a lot of data in a CSV. I don't need any of the information that's in "quotes".
I tried sed 's/".*"/""/', but it also removes the commas when more than one quoted section appears on the line.
I would like to get from this:
1,2,"a",4,"b","c",5
To this:
1,2,,4,,,5
Is there a sed wizard who can help? :)

You may use
sed 's/"[^"]*"//g' file > newfile
See online sed demo:
s='1,2,"a",4,"b","c",5'
sed 's/"[^"]*"//g' <<< "$s"
# => 1,2,,4,,,5
Details
The "[^"]*" pattern matches a ", then zero or more characters other than ", and then a closing ". The matches are removed because the replacement (RHS) is empty, and the g flag applies the substitution to every occurrence on each line.
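To see why the original greedy attempt misbehaves, compare both substitutions side by side (a minimal sketch using a pipe; the variable name s is arbitrary):

```shell
s='1,2,"a",4,"b","c",5'
# Greedy ".*" spans from the first quote to the last one, eating the commas in between.
echo "$s" | sed 's/".*"/""/'
# => 1,2,"",5
# Negated class "[^"]*" cannot cross a closing quote, so each quoted section matches separately.
echo "$s" | sed 's/"[^"]*"//g'
# => 1,2,,4,,,5
```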

Could you please try the following.
awk -v s1="\"" 'BEGIN{FS=OFS=","} {for(i=1;i<=NF;i++){if($i~s1){$i=""}}} 1' Input_file
A non-one-liner form of the solution:
awk -v s1="\"" '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i~s1){
$i=""
}
}
}
1
' Input_file
Detailed explanation:
awk -v s1="\"" ' ##Starting awk program from here and mentioning variable s1 whose value is "
BEGIN{ ##Starting BEGIN section of this code here.
FS=OFS="," ##Setting field separator and output field separator as comma(,) here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop which traverses all fields of the current line.
if($i~s1){ ##Checking if the current field contains "; if yes, then do the following.
$i="" ##Emptying the current field's value here.
}
}
}
1 ##Mentioning 1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
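One caveat worth noting: unlike the sed solution, this awk approach empties the whole field whenever it contains a quote, which is fine for the samples shown. A quick self-contained run (the /tmp path is just for the demo):

```shell
printf '%s\n' '1,2,"a",4,"b","c",5' > /tmp/Input_file
awk -v s1="\"" 'BEGIN{FS=OFS=","} {for(i=1;i<=NF;i++){if($i~s1){$i=""}}} 1' /tmp/Input_file
# => 1,2,,4,,,5
```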

With Perl:
perl -p -e 's/".*?"//g' file
? forces * to be non-greedy.
Output:
1,2,,4,,,5

Need help with a Unix script to read data from a specific position and use the extracted value in a query

Input file:
ADSDWETTYT017775227ACG
ADSDWETTYT029635225HCG
ADSDWETTYC018525223JCG
ADSDWETTYC987415221ACG
ADSDWETTCC891235219ACG
ADSDWETTTT074565217ACG
ADSDWETTYT567895213ACG
ADSDWETTYH037535215ACG
ADSDWETTYC051595211ACG
ADSDWETTYT052465209ACG
ADSDWETTYT067595207ACG
ADSDWETTYT077515205ACG
I need to check whether the 10th position of each line contains/starts with "T"; if it does, I need to take the 4 characters starting at position 16.
From the above file I am expecting the below output:
'5227','5225','5217','5213','5209','5207','5205'
This result should be assigned to some variable (result below) and used in the query WHERE clause like this:
result=$(awk '
BEGIN{
conf="" };
{ if(substr($0,10,1)=="T"){
conf=substr($0,16,4);
{NT==1?s="'\''"conf:s=s"'\'','\''"conf}
}
}
END {print s"'\''"}
' $INPUT_FILE_PATH
db2 "EXPORT TO ${OUTPUT_FILE} OF DEL select STATUS FROM TRAN where TN_NR in (${result})"
I need some help enhancing the awk command and passing the resulting constant into the query WHERE clause. Kindly help.
With your shown samples and attempts, please try the following awk code.
awk -v s1="'" 'BEGIN{OFS=", "} substr($0,10,1)=="T"{val=(val?val OFS:"") (s1 substr($0,16,4) s1)} END{print val}' Input_file
Adding non-one liner form of above code:
awk -v s1="'" '
BEGIN{ OFS=", " }
substr($0,10,1)=="T"{
val=(val?val OFS:"") (s1 substr($0,16,4) s1)
}
END{
print val
}
' Input_file
To save output of this code into a shell variable try following:
value=$(awk -v s1="'" 'BEGIN{OFS=", "} substr($0,10,1)=="T"{val=(val?val OFS:"") (s1 substr($0,16,4) s1)} END{print val}' Input_file)
Explanation: Adding detailed explanation for above code.
awk -v s1="'" ' ##Starting awk program from here setting s1 to ' here.
BEGIN{ OFS=", " } ##Setting OFS as comma space here.
substr($0,10,1)=="T"{ ##Checking condition if 10th character is T then do following.
val=(val?val OFS:"") (s1 substr($0,16,4) s1) ##Creating val which has values from current line as per OP requirement.
}
END{ ##Starting END block of this program from here.
print val ##Printing val here.
}
' Input_file ##Mentioning Input_file name here.
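A quick self-contained run with a few of the sample lines (file path chosen arbitrarily for the demo; note the answer joins with ", ", comma plus space):

```shell
cat > /tmp/Input_file <<'EOF'
ADSDWETTYT017775227ACG
ADSDWETTYT029635225HCG
ADSDWETTYC018525223JCG
ADSDWETTTT074565217ACG
EOF
awk -v s1="'" 'BEGIN{OFS=", "} substr($0,10,1)=="T"{val=(val?val OFS:"") (s1 substr($0,16,4) s1)} END{print val}' /tmp/Input_file
# => '5227', '5225', '5217'
```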
I'd use sed for this:
sed -En '/^.{9}T/ s/^.{15}(....).*/\1/p' file
And then to get your exact output, pipe that into
... | sed "s/.*/'&'/" | paste -sd,
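Put together on a couple of the sample lines (a sketch; the explicit - tells paste to read standard input, and the /tmp path is just for the demo):

```shell
cat > /tmp/positions.txt <<'EOF'
ADSDWETTYT017775227ACG
ADSDWETTYC018525223JCG
ADSDWETTYT052465209ACG
EOF
# Keep lines whose 10th char is T, print chars 16-19, quote each, join with commas.
sed -En '/^.{9}T/ s/^.{15}(....).*/\1/p' /tmp/positions.txt | sed "s/.*/'&'/" | paste -sd, -
# => '5227','5209'
```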
I'd use perl over awk here for its better arrays (in particular, joining one into a string). Something like:
perl -nE "push @n, substr(\$_, 15, 4) if /^.{9}T/;
END { say join(',', map { \"'\$_'\" } @n) }" "$INPUT_FILE_PATH"

Copy one csv header to another csv with type modification

I want to copy one csv header to another, row-wise, with some modifications.
Input csv
name,"Mobile Number","mobile1,mobile2",email2,Address,email21
test, 123456789,+123456767676,a@test.com,testaddr,a1@test.com
test1,7867778,8799787899898,b@test,com, test2addr,b2@test.com
In the new csv this should look like the following, and the file should also be created. For a string column I will pass the column name, so only that column will be converted to string:
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
As you can see above, each header with its type modification should be inserted on a separate row.
I have tried the below command, but it only copies the first row:
sed '1!d' input.csv > output.csv
You may try this alternative GNU awk command as well:
awk -v FPAT='"[^"]+"|[^,]+' 'NR == 1 {
for (i=1; i<=NF; ++i)
print gensub(/"/, "", "g", $i) "." ($i ~ /,/ ? "string" : "auto") "()"
exit
}' file
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
Or using sed:
sed -i -e '1i 1234567890.string(),My address is test.auto(),abc3#gmail.com.auto(),120000003.auto(),abc-003.auto(),3.com.auto()' -e '1d' test.csv
EDIT: As per OP's comment, to print only the first line (header), please try the following.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
if($i~/^".*,.*"$/){
gsub(/"/,"",$i)
print $i".string()"
}
else{
print $i".auto()"
}
}
exit
}
' Input_file > output_file
Could you please try the following, written and tested with GNU awk against the shown samples.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
if($i~/^".*,.*"$/){
gsub(/"/,"",$i)
print $i".string()"
}
else{
print $i".auto()"
}
}
next
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program and setting FPAT to [^,]*|"[^"]+".
FNR==1{ ##Checking condition if this is first line then do following.
for(i=1;i<=NF;i++){ ##Running for loop from i=1 to till NF value.
if($i~/^".*,.*"$/){ ##Checking condition if current field starts from " and ends with " and having comma in between its value then do following.
gsub(/"/,"",$i) ##Substitute all occurrences of " with NULL in current field.
print $i".string()" ##Printing current field and .string() here.
}
else{ ##else do following.
print $i".auto()" ##Printing current field dot auto() string here.
}
}
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.

Retrieve data from a file using patterns and annotate it with its filename

I have a file called bin.001.fasta looking like this:
>contig_655
GGCGGTTATTTAGTATCTGCCACTCAGCCTCGCTATTATGCGAAATTTGAGGGCAGGAGGAAACCATGAC
AGTAGTCAAGTGCGACAAGC
>contig_866
CCCAGACCTTTCAGTTGTTGGGTGGGGTGGGTGCTGACCGCTGGTGAGGGCTCGACGGCGCCCATCCTGG
CTAGTTGAAC
...
What I want to do is get a new file where the 1st column is the retrieved contig IDs and the 2nd column is the filename without .fasta:
contig_655 bin.001
contig_866 bin.001
Any ideas how to do it?
Could you please try the following.
awk -F'>' '
FNR==1{
split(FILENAME,array,".")
file=array[1]"."array[2]
}
/^>/{
print $2,file
}
' Input_file
OR, more generically, if your Input_file name has more than 2 dots, run the following.
awk -F'>' '
FNR==1{
match(FILENAME,/.*\./)
file=substr(FILENAME,RSTART,RLENGTH-1)
}
/^>/{
print $2,file
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -F'>' ' ##Starting awk program from here and setting field separator as > here for all lines.
FNR==1{ ##Checking condition if this is first line then do following.
split(FILENAME,array,".") ##Splitting filename which is passed to this awk program into an array named array with delimiter .
file=array[1]"."array[2] ##Creating variable file whose value is 1st and 2nd element of array with DOT in between as per OP shown sample.
}
/^>/{ ##Checking condition if a line starts with > then do following.
print $2,file ##Printing 2nd field and variable file value here.
}
' Input_file ##Mentioning Input_file name here.
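A minimal reproduction, writing a small fasta file into /tmp and running from there so FILENAME carries no directory part (paths chosen for the demo only):

```shell
cd /tmp
cat > bin.001.fasta <<'EOF'
>contig_655
GGCGGTTATTTAGTATCTGCC
>contig_866
CCCAGACCTTTCAGTTGTTGG
EOF
awk -F'>' 'FNR==1{split(FILENAME,array,"."); file=array[1]"."array[2]} /^>/{print $2,file}' bin.001.fasta
# => contig_655 bin.001
#    contig_866 bin.001
```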

How to add single quote after specific word using sed?

I am trying to write a script to add single quotes after the "GOOD" word.
For example, I have file1 :
//WER GOOD=ONE
//WER1 GOOD=TWO2
//PR1 GOOD=THR45
...
Desired change is to add single quotes :
//WER GOOD='ONE'
//WER1 GOOD='TWO2'
//PR1 GOOD='THR45'
...
This is the script which I am trying to run:
#!/bin/bash
for item in `grep "GOOD" file1 | cut -f2 -d '='`
do
sed -i 's/$item/`\$item/`\/g' file1
done
Thank you for the help in advance!
Could you please try the following.
sed "s/\(.*=\)\(.*\)/\1'\2'/" Input_file
OR, as per OP's comment, to also remove empty lines use:
sed "s/\(.*=\)\(.*\)/\1'\2'/;/^$/d" Input_file
Explanation: the following is for explanation purposes only.
sed " ##Starting sed command from here.
s/ ##Using s to start substitution process from here.
\(.*=\)\(.*\) ##Using sed buffer capability to store matched regex into memory, saving everything till = in 1st buffer and rest of line in 2nd memory buffer.
/\1'\2' ##Now substituting with the 1st and 2nd memory buffers as \1'\2', adding single quotes around the value after = as per OP's need.
/" Input_file ##Closing block for substitution, mentioning Input_file name here.
Please use -i option in above code in case you want to save output into Input_file itself.
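A quick run of the sed solution on the sample lines (file path chosen for the demo):

```shell
printf '%s\n' '//WER GOOD=ONE' '//WER1 GOOD=TWO2' '//PR1 GOOD=THR45' > /tmp/file1
sed "s/\(.*=\)\(.*\)/\1'\2'/" /tmp/file1
# => //WER GOOD='ONE'
#    //WER1 GOOD='TWO2'
#    //PR1 GOOD='THR45'
```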
2nd solution with awk:
awk 'match($0,/=.*/){$0=substr($0,1,RSTART) "\047" substr($0,RSTART+1,RLENGTH) "\047"} 1' Input_file
Explanation: Adding explanation for above code.
awk '
match($0,/=.*/){ ##Using the match function to match everything from = to the end of the line.
$0=substr($0,1,RSTART) "\047" substr($0,RSTART+1,RLENGTH) "\047" ##Rebuilding $0 as the substring up to and including =, a ', the rest of the line, and a closing ', as per OP's need.
} ##Where RSTART and RLENGTH are variables which will be SET once a TRUE matched regex is found.
1 ##1 will print edited/non-edited line.
' Input_file ##Mentioning Input_file name here.
3rd solution: In case you have only 2 fields in your Input_file, then try this simpler awk:
awk 'BEGIN{FS=OFS="="} {$2="\047" $2 "\047"} 1' Input_file
Explanation of 3rd code: for explanation purposes only; to run, please use the code above.
awk ' ##Starting awk program here.
BEGIN{FS=OFS="="} ##Setting FS and OFS values as = for all line for Input_file here.
{$2="\047" $2 "\047"} ##Setting $2 value with adding a ' $2 and then ' as per OP need.
1 ##Mentioning 1 will print edited/non-edited lines here.
' Input_file ##Mentioning Input_file name here.
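The \047 escape is simply the octal code for a single quote, which sidesteps quoting headaches inside a single-quoted awk program; a quick check (file path chosen for the demo):

```shell
printf '%s\n' '//WER GOOD=ONE' '//PR1 GOOD=THR45' > /tmp/file1
awk 'BEGIN{FS=OFS="="} {$2="\047" $2 "\047"} 1' /tmp/file1
# => //WER GOOD='ONE'
#    //PR1 GOOD='THR45'
```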

How do I join lines using space and comma

I have the file that contains content like:
IP
111
22
25
I want to print the output in the format IP 111,22,25.
I have tried tr ' ' , but it's not working.
Welcome to paste
$ paste -sd, file | sed 's/,/ /'
IP 111,22,25
Normally, paste writes lines consisting of the sequentially corresponding lines of each given file, separated by a <tab> character. The -s option does it differently: it pastes the lines of each file serially onto one output line, and the -d flag lets you give a list of delimiters to be used instead of the <tab> character. Note that the delimiter list is reused circularly, so -sd " ," on these four lines would produce IP 111,22 25 (space, comma, space) rather than a space followed by only commas; joining everything with commas and then turning the first comma into a space gives the requested output.
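The circular reuse of the -d delimiter list is easy to see with seq (a behavior check, not part of the answer itself):

```shell
# Two delimiters cycle: +, comma, +, comma, +
seq 6 | paste -sd '+,' -
# => 1+2,3+4,5+6
# Joining with commas, then swapping the first comma for a space:
printf '%s\n' IP 111 22 25 | paste -sd, - | sed 's/,/ /'
# => IP 111,22,25
```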
In pure Bash:
# Read file into array
mapfile -t lines < infile
# Print to string, comma-separated from second element on
printf -v str '%s %s' "${lines[0]}" "$(IFS=,; echo "${lines[*]:1}")"
# Print
echo "$str"
Output:
IP 111,22,25
I'd go with:
{ read a; read b; read c; read d; } < file
echo "$a $b,$c,$d"
This will also work:
xargs printf "%s %s,%s,%s" < file
Try cat file.txt | tr '\n' ',' | sed 's/IP,/IP /; s/,$//'
tr replaces the newlines with commas, giving IP,111,22,25, and sed then changes IP, into IP followed by a space and drops the trailing comma left by the final newline.
The following awk script will do the requested:
awk 'BEGIN{OFS=","} FNR==1{first=$0;next} {val=val?val OFS $0:$0} END{print first FS val}' Input_file
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here of awk program.
OFS="," ##Setting OFS as comma, output field separator.
} ##Closing BEGIN section of awk here.
FNR==1{ ##Checking if line is first line then do following.
first=$0 ##Creating variable first whose value is current first line.
next ##The next keyword skips all further statements for the current line and moves on to the next one.
} ##Closing FNR==1 BLOCK here.
{ ##This BLOCK will be executed for all lines apart from 1st line.
val=val?val OFS $0:$0 ##Appending the current line to val, separated by OFS (a comma).
}
END{ ##Mentioning awk END block here.
print first FS val ##Printing variable first FS(field separator) and variable val value here.
}' Input_file ##Mentioning Input_file name here which is getting processed by awk.
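A self-contained run of the above (FS in the END block is the default space, so print first FS val joins the header and the list with a space; file path chosen for the demo):

```shell
printf '%s\n' IP 111 22 25 > /tmp/joinfile
awk 'BEGIN{OFS=","} FNR==1{first=$0;next} {val=val?val OFS $0:$0} END{print first FS val}' /tmp/joinfile
# => IP 111,22,25
```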
Using Perl
$ cat captain.txt
IP
111
22
25
$ perl -0777 -ne ' @k=split(/\s+/); print $k[0]," ",join(",",@k[1..$#k]) ' captain.txt
IP 111,22,25
$