Bash: Find and replace all variable characters up to a constant character with a constant string

I've seen many search-and-replace threads that assume (1) you know exactly what string or substring you are looking for, (2) you know its exact position within the string, or (3) both.
In my situation I have one CSV file containing one column and 1M rows, e.g.
1,google.com
2,yahoo.com
3,twitter.com
4,xyz.com
For every row, I want to replace every character (the incrementing integer) up to and including the comma with http colon double forward slash dub-dub-dub, i.e. http://www.
So far I have the following
HTTPSTRING="http://www."
cat X.csv    # <-- will this ensure that the while block is executed on this file?
while IFS=, read line
do
    {$line/(.*?),/HTTPSTRING}    # <-- this is where I am having trouble
done
exit 0
and I would like a text file containing one URL per line, e.g.
http://www.google.com
...
http://www.${999,999_more_urls}
Thank you so much in advance
Lewis

This does a greedy match, which would be problematic if you ever have any commas other than the one that separates the initial integer from the characters you want to retain. But it works on your sample X.csv file, producing a Y.csv file that meets your output specification.
HTTPSTRING="http://www."
while read line
do
echo ${line/*,/$HTTPSTRING}
done < X.csv > Y.csv
exit 0
For what it's worth, if you put this in a script, you can take the input/output redirection out of the code itself and instead apply it when calling the script.
If you're not strictly limited to bash itself, you might want to consider using sed. Either of these should do what you want, differing only in whether you prefer to escape the slashes in your string or use a non-standard delimiter:
sed 's/[0-9]*,/http:\/\/www./' X.csv > Y.csv
sed 's~[0-9]*,~http://www.~' X.csv > Y.csv
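Back in pure bash, if a comma could ever appear in the part you keep, the shortest-match prefix removal ${line#*,} sidesteps the greedy-match caveat noted above. A minimal sketch of the same loop:
HTTPSTRING="http://www."
while read line
do
    # ${line#*,} strips the shortest leading match of "anything then a comma",
    # so only the row number and its comma are removed
    echo "$HTTPSTRING${line#*,}"
done < X.csv > Y.csv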

Your script is close. You could pipe the output of cat directly to the while loop, but it's better to use input redirection (< X.csv). Using IFS=, before read splits the line into comma-separated fields; you are just missing a variable to hold the second field.
HTTPSTRING="http://www."
while IFS=, read number domain
do
    echo "$HTTPSTRING$domain"
done < X.csv

You could use standard commands only; there is no need for an explicit Bash loop:
cut -d',' -f2 < X.csv | sed 's_^_http://www._' > Y.txt
Notice that the usual / after the s in sed is replaced by _ because the replacement string itself contains slashes. ^ matches the start of the line.
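For the record, a single awk command can do the same job (a sketch, using the same assumed file names):
# -F, splits on commas; $2 is the domain field
awk -F, '{print "http://www." $2}' X.csv > Y.txt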

How can I replace all occurrences of a value in a text file in just one column using the sed command in a shell script (columns are separated by ;)?

I have a file with columns separated by a semicolon (;) and I want to change all occurrences of a word in one particular column to another word. The column number varies and is held in a variable. The word I want to change is stored in a variable, and the replacement word is stored in a variable too.
I tried
sed -i "s/\<$word\>/$wordUpdate/g" $anyFile
I tried this, but it changed all occurrences of the word in the whole file! I only want it changed in a particular column.
The column number is stored in a variable called numColumn, and the columns are separated by a semicolon (;).
It is much simpler to use awk for column edits, e.g. if your input looks like this:
68;61;83;27;60;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
You can replace the value m (here 60) in column c (here 5) with s (here XX) like this:
<infile awk '$c ~ m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX
Output:
68;61;83;27;XX;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
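Note that ~ performs a regex match, so m=60 would also hit fields such as 160 or 607. If you want exact equality, a string comparison is safer; a sketch with the same variable names:
<infile awk '$c == m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX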
This might work for you (GNU sed):
word=foo wordUpdate=bar numColumn=3
sed -i 'y/;/\n/
s#.*#echo "&" | sed "'${numColumn}'s/\<'${word}'\>/'${wordUpdate}'/"#e
y/\n/;/' file
Translate each line's semicolons to newlines, so the columns become separate lines within the pattern space.
Substitute the word on the matching line (the column number) with the updated word.
Reverse the conversion.
N.B. The solution relies on the GNU-only e evaluation flag. Also, word and wordUpdate may need to be quoted.
This can be done with a little creativity...
Note that I'm using double quotes to embed the logic, which takes a little extra care: you must double your backslashes on backreferences.
$: word=baz; c=3; new=XX; lead="^([^;]*;){$((c-1))}"; sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file
1;2;3;4;5;6;7;8;9;0;
foo;bar;XX;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
Explained:
lead="^([^;]*;){$((c-1))}"
^ means at the start of a record
(...) is grouping for the following {...}, which specifies repetition
[^;]* means zero or more non-semicolons
$((c-1)) does the math and returns one less than the desired column; if you want to look at column 3, it returns 2.
So, ^([^;]*;){$((c-1))} matches, at the start of the record, one-less-than-column occurrences of non-semicolons each followed by a semicolon.
Thus, sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file means: read file, and on records where $word occurs in the requested column, save everything before it and put that stuff back, but replace $word with $new.
Even if you MUST use sed, I recommend a function.
fix(){
    local word="$1" col="$2" new="$3" file="$4"
    local lead="^([^;]*;){$((col-1))}"
    sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" "$file"
}
In use -
$: fix bar 2 HI file
1;2;3;4;5;6;7;8;9;0;
foo;HI;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix 1 1 XX file
XX;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix bar 2 '(^_^)' file
1;2;3;4;5;6;7;8;9;0;
foo;(^_^);baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
No changes if no matches -
$: fix bar 5 HI file
1;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
NOTE -
This logic requires trailing delimiters if you ever want to match the last field -
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;HI;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
With the trailing delimiters removed, the last field no longer matches:
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;0
foo;bar;baz;qux;foo;bar;baz;qux
a;b;c;d;e;f;g
Otherwise you have to complicate the logic a bit.
But honestly, for field parsing you'd be much better served by awk, or even perl or python, or for that matter a bash loop, though that will be relatively slow.
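For comparison, a sketch of an awk equivalent for this question, passing the shell variables in with -v (this assumes exact whole-field matching is what you want):
awk -v c="$numColumn" -v w="$word" -v u="$wordUpdate" '
    BEGIN { FS = OFS = ";" }   # read and write semicolon-separated fields
    $c == w { $c = u }         # replace only when the target column matches exactly
    1                          # print every record
' "$anyFile"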

Add multiple elements to the text file in a specific way using Bash

I have a text file that contains a list of "word sequences" and I need to wrap each one in "" and append a ","; I'm thinking of using a bash command.
Here is the data:
NTSS
NGTG
NVSQ
NITL
NFTS
...
I need to add "" to each word sequence and separate with ","
Here an expected output:
"NTSS",
"NGTG",
"NVSQ",
"NITL",
...
Any recommendation with BASH to do that?
This can be done in many ways, but sed is perfect for the job.
sed 's/^.*$/"\0",/' < file.txt
This replacement simply matches the whole line and replaces it according to what you need.
The one above is a regular expression replacement, which has the structure:
s/<pattern to match>/<replacement>/
^ matches the beginning of the line
.* matches any character any number of times
$ matches the end of the line
In the replacement part, \0 represents the whole string that matched the pattern (the entire line in this case); & is the more portable equivalent.
Check out some regular expression tutorial for more.
If you prefer a purely bash alternative, you can use:
while read -r line; do echo "\"${line}\","; done < file.txt
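For completeness, printf can do the same without an explicit loop; this sketch assumes the sequences contain no whitespace or glob characters, since it relies on word splitting:
# Each whitespace-separated token from the file becomes one "%s" substitution.
printf '"%s",\n' $(< file.txt)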

Remove multiple sequences from fasta file

I have a text file of character sequences, each consisting of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e. each of these headers plus the following line. I did it using sed as follows,
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long since I am loading the whole file several times with sed, and it is quite big. Any idea on how I could speed up this process?
The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with a greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-fasta file with multi-line sequences.
The following will work always:
awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
     p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines long each record is, while the second one relies on every record starting with >. If you know both, then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that script to the first file:
sed -f commands.sed firstFile.txt
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed $sedcmd first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as #daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both scripts to compare their speed: time while read line; do ... versus time sed .... In my test this finishes in less than half the time of the OP's version.
This can easily be done with bbtools. The seqs2remove.txt file should contain one header per line, exactly as the headers appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

Dynamic delimiter in Unix

Input:
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split out the alphabetic field and the number that follows it.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression rewrites the input into the expected output, except that instead of the newline it inserts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file, or data piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
    …process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
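In bash specifically, one common workaround is process substitution, which keeps the loop in the current shell so its variables survive; a sketch (input.txt is an assumed file name):
count=0
while read alpha number
do
    count=$((count+1))   # still set after the loop ends
done < <(sed 's/^[0-9]*\([A-Z]*\)/\1 /' input.txt)
echo "processed $count lines"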
You said you have many strings like this, so I recommend saving them to a file if possible, such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([0-9]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing; otherwise, for grouping, we would have to escape the parentheses as \( and \)
uses grouping with ( and ), searching for three groups:
first, digits: + specifies one or more of them ([0-9] here, or equivalently the POSIX class [[:digit:]])
next, POSIX alphabetical characters [[:alpha:]], regardless of case; {3} specifies exactly 3 of them
the last group searches for ., meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, with a tab separator
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.
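If the alphabetic field is not always exactly three letters, swapping {3} for + generalizes the same pattern (an assumption about your data, so treat this as a sketch):
sed -E 's/([0-9]+)([[:alpha:]]+)(.+)/\2\t\3/' input.txt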

Scripting username creation from text file?

I'm really new at Bash and scripting in general.
I have to create usernames formed from the first letter of the first name followed by the last name. To do it, I use a provided text file that looks like this:
doe,john
smith,mike
...
I declared the following variables:
fname=$(cut -d, -f2 "file.txt" | cut -c1)
lname=$(cut -d, -f1 "file.txt")
But how do I put the elements together to form the names jdoe and msmith? I tried the methods I know to concatenate strings and variables, but nothing works.
I think I found a method using awk that is supposed to work, but is there any other way to "concatenate" the elements of 2 lists?
Thank you
There's a million ways to do it; this is the simplest:
$ awk -F, '{print substr($2,1,1) $1}' file
jdoe
msmith
Ed Morton's awk-based answer is simplest (and probably fastest), but since you asked for a different solution:
#!/usr/bin/env bash
while IFS=, read -r last first _; do
username=${first:0:1}${last}
echo "username: $username"
done < file.txt
IFS=, read -r last first _ reads the first 2 ,-separated fields from each input line (_ is a dummy variable that receives the rest of the input line, if any; -r prevents interpretation of \ chars. in the input, which is usually what you want).
username=${first:0:1}${last} concatenates the 1st char. of variable $first's value with variable $last's value, simply by placing the two variable references next to each other.
${first:0:1} - extract 1 character from $first at position 0 - is an example of parameter expansion, specifically: substring expansion
< file.txt is an input redirection that sends file.txt's contents via stdin to the while loop.
This looks a bit too much like homework, so I'll just drop some hints.
To read the lastname and firstname into separate variables for each line of the file, see BashFAQ 1. It should not involve cut.
To grab the first character of a variable, see BashFAQ 100.
