How can I add spaces between every character or symbol within a UTF-8 document? E.g. 123hello! becomes 1 2 3 h e l l o !.
I have BASH, OpenOffice.org, and gedit, if any of those can do that.
I don't care if it sometimes leaves extra spaces in places (e.g. 2 or 3 spaces in a single place is no problem).
Shortest sed version
sed 's/./& /g'
Output
$ echo '123hello!' | sed 's/./& /g'
1 2 3 h e l l o !
Obligatory awk version
awk '$1=$1' FS= OFS=" "
Output
$ echo '123hello!' | awk '$1=$1' FS= OFS=" "
1 2 3 h e l l o !
sed(1) can do this:
$ sed -e 's/\(.\)/\1 /g' < /etc/passwd
r o o t : x : 0 : 0 : r o o t : / r o o t : / b i n / b a s h
d a e m o n : x : 1 : 1 : d a e m o n : / u s r / s b i n : / b i n / s h
It works well on e.g. UTF-8 encoded Japanese content:
$ file japanese
japanese: UTF-8 Unicode text
$ sed -e 's/\(.\)/\1 /g' < japanese
E X I F 中 の 画 像 回 転 情 報 対 応 に よ り 、 一 部 画 像 ( 特 に 『
$
sed is ok, but this is pure bash:
string=hello
string_new=""
for ((i=0; i<${#string}; i++)); do
    string_new+="${string:$i:1} "
done
echo "$string_new"    # note: leaves one trailing space
Since you have bash, I will assume that you have access to sed. The following command line will do what you wish:
$ sed -e 's:\(.\):\1 :g' < input.txt > output.txt
I like these solutions because, unlike the rest here, they do not leave a trailing space:
GNU awk:
echo 123hello! | awk NF=NF FS=
GNU awk:
echo 123hello! | awk NF=NF FPAT=.
POSIX awk:
echo 123hello! | awk '{while(a=substr($0,++b,1))printf b-1?FS a:a}'
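You can double-check the absence of a trailing space with sed -n l, which marks the end of each line with $, e.g. for the first variant:
$ echo 123hello! | awk NF=NF FS= | sed -n l
1 2 3 h e l l o !$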
This might work for you:
echo '1 23h ello ! ' | sed 's/\s*/ /g;s/^\s*\(.*\S\)\s*$/\1/;l'
1 2 3 h e l l o !$
1 2 3 h e l l o !
In retrospect a far better solution:
sed 's/\B/ /g' file
\B matches at every position that is not a word boundary, so this inserts a space between every pair of adjacent word characters.
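For example (GNU sed; \B is a GNU extension), a plain word gets no leading or trailing space, because the start and end of the word are word boundaries:
$ echo hello | sed 's/\B/ /g'
h e l l o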
string='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
echo ${string} | sed -r 's/(.{1})/\1 /g'
Pure POSIX Shell version:
addspace() {
    __addspace_str="$1"
    while [ -n "${__addspace_str#?}" ]; do
        printf '%c ' "$__addspace_str"
        __addspace_str="${__addspace_str#?}"
    done
    printf '%c' "$__addspace_str"
}
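For example:
$ addspace '123hello!'; echo
1 2 3 h e l l o !
(the trailing echo only supplies the final newline, which addspace deliberately does not print)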
Or if you need to put it in a variable:
addspace_var() {
    addspace_result=""
    __addspace_str="$1"
    while [ -n "${__addspace_str#?}" ]; do
        addspace_result="$addspace_result${__addspace_str%${__addspace_str#?}} "
        __addspace_str="${__addspace_str#?}"
    done
    addspace_result="$addspace_result$__addspace_str"
}
addspace_var abc
echo "$addspace_result"
Tested with dash, ksh, zsh, bash (+ bash --posix), and busybox ash.
Explanation
${x#?}
This parameter expansion removes the first character of x. ${x#...} in general removes a prefix given by a pattern, and ? matches any single character.
printf '%c ' "$str"
The %c format parameter transforms the string argument into its first character, so the full format string '%c ' prints the first character of the string followed by a space. Note that if the string were empty this would cause issues, but we already checked that it wasn't, so it's fine. To print the first character safely in any situation we can use '%.1s', but I like living dangerously :3
${x%${x#?}}
This is an alternate way to get the first character of the string. We already know that ${x#?} is all but the first character. Well, ${x%...} removes ... from the end of x, so ${x%${x#?}} removes all but the first character from the end of x, leaving only the first one.
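A quick illustration of both expansions:
$ x=abc
$ echo "${x#?}"
bc
$ echo "${x%${x#?}}"
a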
__prefixed_variable_names
POSIX doesn't define local, so to avoid variable conflicts it's safer to create unique names that are unlikely to clobber each other. I am starting to experiment with using M4 to generate unique names without having to mangle my code every time, but it's probably overkill for people who don't use shell as much as I do.
[ -n "${str#?}" ]
Why not just [ -n "$str" ]? It's to avoid the dreaded trailing space; it's also why we have a little extra statement at the bottom there, outside the loop. The loop goes until the string is one character long, then we finish outside of it so we can append this last character without adding a space.
When should I use this?
This is good for small inputs in long-running loops, since it avoids the overhead of calling an external process, but for larger inputs it starts lagging behind fast, especially the var version. (I blame the ${x%${x#?}} trick.)
Benchmark Commands
# addspace
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do addspace \"$input\" >/dev/null; done"
# addspace_var
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do addspace_var \"$input\" >/dev/null; done"
# sed for comparison
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do echo \"$input\" | sed 's/./& /g' >/dev/null; done"
Input Length = 3
addspace addspace_var sed
real 0m0,106s 0m0,106s 0m10,651s
user 0m0,077s 0m0,075s 0m9,349s
sys 0m0,029s 0m0,031s 0m3,030s
Input Length = 200
addspace addspace_var sed
real 0m6,050s 0m47,115s 0m11,049s
user 0m5,557s 0m46,919s 0m9,727s
sys 0m0,488s 0m0,068s 0m3,085s
Input Length = 1000
addspace addspace_var sed
real 0m55,989s TBD 0m11,534s
user 0m53,560s TBD 0m10,214s
sys 0m2,428s TBD 0m2,975s
(Yeah, I was waiting a bit for that last var one.)
In situations like this you can simply check the length of the input and call the appropriate function for maximum performance.
addspace() {
    if [ ${#1} -lt 100 ]; then
        addspace_builtins "$1"
    else
        addspace_process "$1"
    fi
}
Related
I would like to substitute a set of single-byte characters with a set of literal strings in a stream, without any constraint on the line size.
#!/bin/bash
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'
The expected output would be:
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...
I can think of a bash function that would do that, something like:
chars_to_strings() {
    local delim buffer
    while true
    do
        delim=''
        IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'
        if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
        then
            # Do the replacements in "$buffer"
            # ...
            printf "%s%s" "$buffer" "$delim"
        else
            break
        fi
    done
}
But I'm looking for a more efficient way, any thoughts?
Since you seem to be okay with using ANSI C quoting via $'...' strings, then maybe use sed?
sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'
Or, via separate commands:
sed -e $'s/\a/<bell>/g' \
-e $'s/\b/<backspace>/g' \
-e $'s/\t/<horizontal-tab>/g' \
-e $'s/\v/<vertical-tab>/g'
Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):
$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
{
gsub(/\a/, "<bell>")
gsub(/\b/, "<backspace>")
gsub(/\t/, "<horizontal-tab>")
gsub(/\v/, "<vertical-tab>")
print $0
}
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
For a simple one-liner with reasonable portability, try Perl.
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
s/\b/<backspace>/g;s/\t/<horizontal-tab>/g;s/\v/<vertical-tab>/g'
Perl internally does some intelligent optimizations so it's not encumbered by lines which are longer than its input buffer or whatever.
Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).
Assuming the overall objective is to provide the ability to process a stream of data in real time, without having to wait for an EOL/end-of-buffer occurrence to trigger processing ...
A few items:
continue to use the while/read -n loop to read a chunk of data from the incoming stream and store in buffer variable
push the conversion code into something that's better suited to string manipulation (ie, something other than bash); for sake of discussion we'll choose awk
within the while/read -n loop printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; OP can decide if this additional \n must be distinguished from a \n occurring in the original stream of data
awk then parses each line of input as per the replacement logic, making sure to append anything leftover to the front of the next line of input (ie, for when the while/read -n breaks an item in the 'middle')
General idea:
chars_to_strings() {
    while read -r -n 15 buffer   # using '15' for demo purposes; otherwise replace with '4096' or whatever OP wants
    do
        printf "%s\n" "${buffer}"
    done | awk '{print NR,FNR,length($0)}'   # replace 'print ...' with OP's replacement logic
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
A variation on this idea using a named pipe:
mkfifo /tmp/pipeX
sleep infinity > /tmp/pipeX # keep pipe open so awk does not exit
awk '{print NR,FNR,length($0)}' < /tmp/pipeX &
chars_to_strings() {
    while read -r -n 15 buffer
    do
        printf "%s\n" "${buffer}"
    done > /tmp/pipeX
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
# kill background 'awk' and/or 'sleep infinity' when no longer needed
don't waste FS/OFS - use the built-in variables to cover 2 of the 5 needed replacements:
echo $' \t abc xyz \t \a \n\n ' |
mawk 'gsub(/\7/, "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>' FS='\13' ORS='<newline>'
<h-tab> abc xyz <h-tab> <bell> <newline><newline> <newline>
To have NO constraint on the line length you could do something like this with GNU awk:
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(foo,bar)
print
}'
That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.
Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(/\a/,"<bell>")
gsub(/\b/,"<backspace>")
gsub(/\t/,"<horizontal-tab>")
gsub(/\v/,"<vertical-tab>")
print
}'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>
and of course it'd be trivial to pass a list of old and new strings to awk rather than hardcoding them, you'd just have to sanitize any regexp or backreference metachars before calling gsub().
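Here is a minimal sketch of that idea, assuming GNU awk (for RS as a regexp and for split with an empty separator); the olds/news packing and variable names are mine, just one possible interface:
printf '\a,\t\n' |
awk -v RS='.{1,100}' -v ORS= -v olds=$'\a\t' -v news=$'<bell>\n<horizontal-tab>' '
BEGIN {
    n = split(olds, o, "")       # one single-character "old" per element (GNU awk)
    split(news, r, "\n")         # parallel list of replacement strings
    for (i = 1; i <= n; i++)     # escape ERE metachars before use as a dynamic regexp
        gsub(/[][\\^$.*+?(){}|]/, "\\\\&", o[i])
}
{
    $0 = RT
    for (i = 1; i <= n; i++)
        gsub(o[i], r[i])         # a literal & in r[i] would also need escaping
    print
}'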
Let's suppose that you need to generate a NUL-delimited stream of timestamped filenames.
On Linux & Solaris I can do it with:
stat --printf '%.9Y %n\0' -- *
On BSD, I can get the same info, but delimited by newlines, with:
stat -f '%.9Fm %N' -- *
The man page talks about a few escape sequences, but the NUL byte doesn't seem to be supported:
If the % is immediately followed by one of n, t, %, or #, then a newline character, a tab character, a percent character, or the current file number is printed.
Is there a way to work around that (accurately and efficiently)?
Update:
Sorry, the glob * is misleading. The arguments can contain any path.
I have a working solution that forks a stat call for each path. I want to improve it because of the massive number of files to process.
You may try this workaround when running the stat command on files:
stat -nf "%.9Fm %N/" * | tr / '\0'
Here:
-n: To suppress newlines in stat output
Added / as terminator for each entry from stat output
tr / '\0': To convert / into NUL byte
Another work-around is to use a control character in stat and use tr to replace it with \0 like this:
stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0'
This will work with directories also.
Unfortunately, stat out of the box does not offer this option, and so what you ask is not directly achievable.
However, you can easily implement the required functionality in a scripting language like Perl or Python.
#!/usr/bin/env python3
from pathlib import Path
from sys import argv
for arg in argv[1:]:
    print(
        Path(arg).stat().st_mtime,
        arg, end="\0")
Demo: https://ideone.com/vXiSPY
The demo exhibits a small discrepancy in the mtime which does not seem to be a rounding error, but the result could be different on MacOS (the demo platform is Debian Linux, apparently). If you want to force the result to a particular number of decimal places, Python has formatting facilities similar to those of stat and printf.
With any command that can't produce NUL-terminated (or any other character/string terminated) output, you can just wrap it in a function to call the command and then printf it's output with a terminating NUL instead of newline, for example:
nulstat() {
    local fmt=$1 file
    shift
    for file in "$@"; do
        printf '%s\0' "$(stat -f "$fmt" "$file")"
    done
}
nulstat '%.9Fm %N' *
For example:
$ > foo
$ > $'foo\nbar'
$ nulstat '%.9Fm %N' foo* | od -c
0000000 1 6 6 3 1 6 2 5 3 6 . 4 7 7 9 8
0000020 0 1 4 0 f o o \0 1 6 6 3 1 6 2
0000040 5 3 9 . 3 8 8 0 6 9 9 3 0 f o
0000060 o \n b a r \0
0000066
1. What you can do (accurate but slow):
Fork a stat command for each input path:
for p in "$@"
do
    stat -nf '%.9Fm' -- "$p" &&
    printf '\t%s\0' "$p"
done
2. What you can do (accurate but twisted):
In the input paths, replace each occurrence of (possibly overlapping) /././ with a single /./, make stat output /././\n at the end of each record, and use awk to substitute each /././\n by a NUL byte:
#!/bin/bash
shopt -s extglob
stat -nf '%.9Fm%t%N/././%n' -- "${@//\/.\/+(.\/)//./}" |
awk -F '/\\./\\./' '{
    if ( NF == 2 ) {
        printf "%s%c", record $1, 0
        record = ""
    } else
        record = record $1 "\n"
}'
N.B. If you wonder why I chose /././\n as record separator then take a look at Is it "safe" to replace each occurrence of (possibly overlapped) /./ with / in a path?
3. What you should do (accurate & fast):
You can use the following perl one‑liner on almost every UNIX/Linux:
LANG=C perl -MTime::HiRes=stat -e '
    foreach (@ARGV) {
        my @st = stat($_);
        if ( @st > 0 ) {
            printf "%.9f\t%s\0", $st[9], $_;
        } else {
            printf STDERR "stat: %s: %s\n", $_, $!;
        }
    }
' -- "$@"
note: for perl < 5.8.9, remove the -MTime::HiRes=stat from the command line.
ASIDE: There's a bug in BSD's stat:
When %N is at the end of the format string and the filename ends with a newline character, then its trailing newline might get stripped:
For example:
stat -f '%N' -- $'file1\n' file2
file1
file2
For getting the output that one would expect from stat -f '%N' you can use the -n switch and add an explicit %n at the end of the format string:
stat -nf '%N%n' -- $'file1\n' file2
file1
file2
Is there a way to work around that?
If all you need is to replace all newlines with NULs, then the following tr should suffice:
stat -f '%.9Fm %N' * | tr '\n' '\000'
Explanation: 000 is the NUL byte expressed as an octal value.
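You can check the result with od -c:
$ printf 'a\nb\n' | tr '\n' '\000' | od -c
0000000   a  \0   b  \0
0000004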
Background
I have a .xyz file from which I need to remove a specific set of lines, as well as do some text replacements. I have a separate .txt file containing a list of integers, corresponding to the line numbers that need to be removed, and another for the lines which need replacing. The first file, atomremove.txt, looks as follows; the other file is structured similarly.
Just as a preemptive TL;DR: the tabs in my input file that happen to be followed by one extra whitespace (because the columns are justified to a fixed position regardless of the extra whitespace) end up being converted to a single whitespace in the output file.
14
13
11
10
4
The xyz file from which I need to remove lines will look like something like this.
24
Comment block
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
B 17.99018 17.98940 2.24243
C 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
H 20.41328 14.02079 2.31959
H 22.06640 8.65013 2.27145
C 19.33725 17.20040 2.26894
H 13.96336 17.42048 2.19342
H 21.69450 3.68090 2.22196
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
C 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
My Code
I am successful in doing the line removal and the replacements, although the output is not as expected: it appears to replace some of the tabs with a single whitespace, specifically for lines that have a 'y' coordinate with only 5 decimals. I am going to share the resulting output first, and then my code.
Here is the output
19
Comment Block
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
Here is my code.
atomstorefile="./extract_internal/atomremove.txt"
atomchangefile="./extract_internal/atomchange.txt"
temp="temp.txt"
tempp="tempp.txt"
temppp="temppp.txt"
filestoreloc="./"$basefilename"_xyzoutputs/chops"
#get number of files in directory and set a loop for that # of files
numfiles=$( ls "./"$basefilename"_xyzoutputs/splits" | wc -l )
numfiles=$(( numfiles/2 ))
counter=1
while [ $counter -lt $(( numfiles + 1 )) ]; do
    # set a loop for each split half
    splithalf=1
    while [ $splithalf -lt 3 ]; do
        # store the xyz file in a temp file for edits (non-destructive)
        cat ./"$basefilename"_xyzoutputs/splits/split"$splithalf"-geometry$counter.xyz > $temp
        # change specified atoms
        while read line; do
            line=$(( line + 2 ))
            sed -i "${line}s/C/H/" $temp
        done < $atomchangefile
        # remove specified atoms
        while read line; do
            line=$(( line + 2 ))
            sed -i "${line}d" $temp
        done < $atomstorefile
        remainatoms=$( wc -l $temp | awk '{print $1}' )
        remainatoms=$(( remainatoms - 2 ))
        tail -n $remainatoms $temp > $tempp
        echo $remainatoms > "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
        echo Comment Block >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
        cat $tempp >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
        splithalf=$(( splithalf + 1 ))
    done
    counter=$(( counter + 1 ))
done
I am sure the solution is simple. Any insight into what is causing this issue would be very appreciated.
Not sure what you are doing, but your file can be fixed using the column -t < filename command.
Example :
❯ cat test
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
~
❯ column -t < test
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
~
❯
The reason you wreck your whitespace is that you need to quote your strings. But a much superior solution is to refactor all of this monumentally overcomplicated shell script to a simple sed or Awk script.
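To see the quoting effect in isolation:
$ line='C    19.32295'    # several spaces (or a tab) inside
$ echo $line              # unquoted: word-split, then rejoined with single spaces
C 19.32295
$ echo "$line"            # quoted: whitespace preserved
C    19.32295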
Assuming the line numbers all indicate line numbers in the original input file, try this.
tmp=$(mktemp -t atomtmpXXXXXXXXX) || exit
trap 'rm -f "$tmp"' ERR EXIT
( sed 's%$%s/C/H/%' extract_internal/atomchange.txt
  sed 's%$%d%' extract_internal/atomremove.txt ) >"$tmp"
ls -l "$tmp"; nl "$tmp" # debugging
for file in "$basefilename"_xyzoutputs/splits/*; do
    dst="$basefilename"_xyzoutputs/chops/${file#*/splits/}
    sed -f "$tmp" "$file" >"$dst"
done
This combines the two input files into a new sed script (remarkably, by way of sed); the debugging line lets you inspect the result (probably remove it once you understand how this works).
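For instance, with the atomremove.txt shown in the question, the second sed alone generates:
14d
13d
11d
10d
4d
and the atomchange.txt line numbers become Ns/C/H/ commands in the same way.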
Your question doesn't really explain how the input files relate to the output files so I had to guess a bit. One of the important changes is to avoid sed -i when you are not modifying an existing file; but above all, definitely avoid repeatedly overwriting the same file with sed -i.
I'm looking for a way to merge 4 lines of dna probing results into one line.
The problem here is:
I don't want to append the lines, but to associate them.
The 4 lines of dna probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seems to be a kind of general problem, so the language chosen to solve it doesn't matter.
For fun: one bash way:
lines=(
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
)
result=""
for ((i=0;i<${#lines};i++)) ;do
    chr=- c=()
    for ((l=0;l<${#lines[@]};l++)) ;do
        [ "${lines[l]:i:1}" != "-" ] &&
            chr="${lines[l]:i:1}" &&
            c+=($l)
    done
    [ ${#c[@]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
    [ ${#c[@]} -gt 1 ] && c="${c[*]}" && chr="*" &&
        printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
    result+=$chr
done
echo $result
With the provided input there is no conflict and all characters are replaced, so the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: the question mentions 4 different files, so the lines= syntax could be:
lines=($(cat file1 file2 file3 file4))
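Or, more robustly in bash 4+, mapfile keeps each line intact regardless of whitespace:
mapfile -t lines < <(cat file1 file2 file3 file4)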
But with wrong input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
Small perl filter
But if input are not to be verified, this little perl filter could do the job:
(Thanks @jm666 for the }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n processes all lines without printing
-l strips the trailing newline from each input line (and adds it back on output)
y+lhs+rhs+ replaces (translates) the chars in 'lhs' with those in 'rhs'
\0 is the null character, binary 0
$, is a variable (the output field separator), used here as an accumulator
|= bitwise OR, between itself and the current line ($_)
}{ at END, once all lines are processed
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") reads the 1st line into the variable line1
the seq ${#line1} generates a sequence of numbers 1..char_count_of_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" runs, for each of the above numbers, a cut command like cut -c22 filename, which extracts the given column from the file, so you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - will join the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally, the tr -cd '[A-Z\n]' removes everything that isn't an uppercase character or a newline, so you will get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
How can I print a value, either 1, 2 or 3 (at random)? My best guess failed:
#!/bin/bash
1 = "2 million"
2 = "1 million"
3 = "3 million"
print randomint(1,2,3)
To generate random numbers in bash, use the $RANDOM internal Bash variable:
arr[0]="2 million"
arr[1]="1 million"
arr[2]="3 million"
rand=$(( RANDOM % 3 ))
echo ${arr[$rand]}
From bash manual for RANDOM:
Each time this parameter is referenced, a random integer between 0 and 32767 is generated. The sequence of random numbers may be initialized by assigning a value to RANDOM. If RANDOM is unset, it loses its special properties, even if it is subsequently reset.
Coreutils shuf
Present in Coreutils, this command works well if the strings don't contain newlines.
E.g. to pick a letter at random from a, b and c:
printf 'a\nb\nc\n' | shuf -n1
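If the strings may contain newlines, GNU shuf can instead pick from NUL-delimited input with its -z option, e.g.:
printf 'a b\0c\nd\0e\0' | shuf -z -n1 | tr '\0' '\n'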
POSIX eval array emulation + RANDOM
Modifying Marty's eval technique to emulate arrays (which are non-POSIX):
a1=a
a2=b
a3=c
eval echo \$a$(expr $RANDOM % 3 + 1)
This still leaves RANDOM, which is non-POSIX.
awk's rand() is a POSIX way to get around that.
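For example, a minimal sketch of that (srand() with no argument seeds from the time of day, so values repeat if called twice within the same second):
rand=$(awk 'BEGIN { srand(); print int(rand() * 3) + 1 }')
eval echo \$a$rand     # reusing a1/a2/a3 from above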
64 chars alpha numeric string
randomString32() {
index=0
str=""
for i in {a..z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {A..Z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {0..9}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {1..64}; do str="$str${arr[$RANDOM%$index]}"; done
echo $str
}
~.$ set -- "First Expression" Second "and Last"
~.$ eval echo \$$(expr $RANDOM % 3 + 1)
and Last
~.$
I want to corroborate using shuf from coreutils, with the nice -n1 -e approach.
Example usage, for a random pick among the values a, b, c:
CHOICE=$(shuf -n1 -e a b c)
echo "choice: $CHOICE"
I looked at the balance for two sample sizes (1000 and 10000):
$ for lol in $(seq 1000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
350 a
316 b
334 c
$ for lol in $(seq 10000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
3315 a
3377 b
3308 c
Ref: https://www.gnu.org/software/coreutils/manual/html_node/shuf-invocation.html