Kind of transpose needed for a file with an inconsistent number of columns in each row - bash

I have a tab-delimited file (in which the number of columns in each row is not fixed) that looks like this:
chr1 92536437 92537640 NM_024813 NM_053274
I want to produce a file in the following layout (the first three columns are identifiers that I need to keep on every row when splitting):
chr1 92536437 92537640 NM_024813
chr1 92536437 92537640 NM_053274
Suggestions for a shell script.

#!/bin/bash
{
IFS=$'\t'
while read a b c rest
do
for fld in $rest
do
echo -e "$a\t$b\t$c\t$fld"
done
done
}
Note that IFS must be set to a real tab; $'\t' is the bash syntax for one.
I also thought I should do a perl version:
#!/bin/perl -n
($a,$b,$c,@r)=(chomp and split /\t/); print "$a\t$b\t$c\t$_\n" for @r
To do it all from the command line, reading from in.txt and outputting to out.txt:
perl -ne '($a,$b,$c,@r)=(chomp and split /\t/); print "$a\t$b\t$c\t$_\n" for @r' in.txt > out.txt
Of course if you save the perl script (say as script.pl)
perl script.pl in.txt > out.txt
If you also make the script file executable (chmod +x script.pl):
./script.pl in.txt > out.txt
HTH

Not shell, and the other answer is perfectly fine, but I one-lined it in perl:
perl -F'/\s/' -lane '$,="\t"; print @F,$_ for splice @F,3' $FILE
Edit: New (even more unreadable ;) version, inspired by the other answers. Abusing perl's command line parameters and special variables for autosplitting and line ending handling.
Means: For each of the fields after the first three (for splice @F,3), print the first three and it (print @F,$_).
-F sets the field separator to \s (should be \t) for -a autosplitting into @F.
-l turns on line ending handling for -n which runs the -e code for each line of the input.
$, is the output field separator.
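For comparison, the same logic can be sketched in awk: for every field after the third, print the first three fields followed by that field. This is only a minimal sketch assuming tab-delimited input; the function name expand_rows is made up for illustration.

```shell
# expand_rows (hypothetical name): read tab-delimited rows on stdin and
# emit one row per extra field, each prefixed by the first three columns.
expand_rows() {
    awk 'BEGIN { FS = OFS = "\t" }
         { for (i = 4; i <= NF; i++) print $1, $2, $3, $i }'
}
```

It could be used as expand_rows < in.txt > out.txt. Rows with three or fewer fields produce no output, which matches the loops in the other answers.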

[Edited]
So you want to duplicate the first three columns for each remaining item?
$ cat File | while read X
do PRE=$(echo "$X" | cut -f1-3 -d ' ')
for Y in $(echo "$X" | cut -f4- -d ' ')
do echo $PRE $Y >> OutputFilename
done
done
Returns:
chr 786 789 NM
chr 786 789 NR
chr 786 789 NT
chr 123 345 NR
This cuts the first three space delimited columns as a prefix, and then abuses the fact that a for loop will step through a space delimited list to call echo.
Enjoy.

This is just a subset of your data comparison in two files question.
Extracting my slightly hacky solution from there:
for i in 4 5 6 7; do join -e _ -j $i f f -o 1.1,1.2,1.3,0; done | sed '/_$/d'

Related

Delete values in line based on column index using shell script

I want to be able to delete the values to the RIGHT(starting from given column index) from the test.txt at the given column index based on a given length, N.
Column index refers to the position when you open the file in the VIM editor in LINUX.
If my test.txt contains 1234 5678, and I call my delete_var function which takes in the column number as 2 to start deleting from and length N as 2 to delete as input, the test.txt would reflect 14 5678 as it deleted the values from column 2 to column 4 as the length to delete was 2.
I have the following code as of now but I am unable to understand what I would put in the sed command.
delete_var() {
sed -i -r 's/not sure what goes here' test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N" # call the method with the column index and length to delete
#sample test.txt (before call to fn)
1234 5678
#sample test.txt (after call to fn)
14 5678
Can someone guide me?
You should avoid using regex for this task. It is easier to get this done in awk with simple substr function calls:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' file
14 5678
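To write the result back to the file, the awk call can be wrapped in a function that goes through a temporary file. This is a minimal sketch, assuming bash and mktemp are available; the third argument (the file name) is an addition for illustration and is not part of the question's delete_var signature.

```shell
# delete_var START LEN FILE
# START is the 1-based column to start deleting from, LEN the number of
# characters to delete. FILE is a hypothetical extra argument.
delete_var() {
    local start=$1 len=$2 file=$3 tmp
    tmp=$(mktemp) || return 1
    # Keep everything before START, then everything from START+LEN onward.
    if awk -v i="$start" -v n="$len" \
        'i > 0 { $0 = substr($0, 1, i - 1) substr($0, i + n) } 1' "$file" > "$tmp"
    then
        cat "$tmp" > "$file"    # copy back so the file keeps its permissions
    fi
    rm -f "$tmp"
}
```

Called as delete_var 2 2 test.txt, this turns 1234 5678 into 14 5678.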
Assuming OP must use sed (otherwise other options could include cut and awk but would require some extra file IOs to replace the original file with the modified results) ...
Starting with the sed command to remove the 2 characters starting in column 2:
$ echo '1234 5678' > test.txt
$ sed -i -r "s/(.{1}).{2}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
Where:
(.{1}) - match first character in line and store in buffer #1
.{2} - match next 2 characters but don't store in buffer
(.*$) - match rest of line and store in buffer #2
\1\2 - output contents of buffers #1 and #2
Now, how to get variables for start and length into the sed command?
Assume we have the following variables:
$ s=2 # start
$ n=2 # length
To map these variables into our sed command we can break the sed search-replace pattern into parts, replacing the first 1 and 2 with our variables like such:
replace {1} with {$((s-1))}
replace {2} with {${n}}
Bringing this all together gives us:
$ s=2
$ n=2
$ echo '1234 5678' > test.txt
$ set -x # echo what sed sees to verify the correct mappings:
$ sed -i -r "s/(.{"$((s-1))"}).{${n}}(.*$)/\1\2/g" test.txt
+ sed -i -r 's/(.{1}).{2}(.*$)/\1\2/g' test.txt
$ set +x
$ cat test.txt
14 5678
Alternatively, do the subtraction (s-1) before the sed call and just pass in the new variable, eg:
$ x=$((s-1))
$ sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
One idea using cut, keeping in mind that storing the results back into the original file will require an intermediate file (eg, tmp.txt) ...
Assume our variables:
$ s=2 # start position
$ n=2 # length of string to remove
$ x=$((s-1)) # last column to keep before the deleted characters (1 in this case)
$ y=$((s+n)) # start of first column to keep after the deleted characters (4 in this case)
At this point we can use cut -c to designate the columns to keep:
$ echo '1234 5678' > test.txt
$ set -x # display the cut command with variables expanded
$ cut -c1-${x},${y}- test.txt
+ cut -c1-1,4- test.txt
14 5678
Where:
1-${x} - keep range of characters from position 1 to position ${x} (1-1 in this case)
${y}- - keep range of characters from position ${y} to end of line (4-EOL in this case)
NOTE: You could also use cut's ability to work with the complement (ie, explicitly tell what characters to remove ... as opposed to above which says what characters to keep). See KamilCuk's answer for an example.
Obviously (?) the above does not overwrite test.txt so you'd need an extra step, eg:
$ echo '1234 5678' > test.txt
$ cut -c1-${x},${y}- test.txt > tmp.txt # store result in intermediate file
$ cat tmp.txt > test.txt # copy intermediate file over original file
$ cat test.txt
14 5678
Looks like:
cut --complement -c $1-$(($1 + $2 - 1))
This should just work: it deletes $2 characters starting at column $1.
please provide code how to change test.txt
cut can't modify in place. So either pipe to a temporary file or use sponge.
tmp=$(mktemp)
cut --complement -c $1-$(($1 + $2 - 1)) test.txt > "$tmp"
mv "$tmp" test.txt
The command below deletes the 2nd character. Try using it in a loop.
sed s/.//2 test.txt

Multiplying all values in a txt file with another value

My aim is to multiply all values in a text file with a number. In my case it is 1000.
Original text in file:
0.00493293814
0.0438981727
0.149746656
0.443125129
0.882018387
0.975789607
0.995755374
1
I want the output to look like:
(so, changing the contents of the file to...)
4.93293814
43.8981727
149.746656
443.125129
882.018387
975.789607
995.755374
1000
Or even rather:
4.9
43.8
149.7
443.1
882.0
975.7
995.7
1000
I am using bash on macOS in the terminal.
If you have dc :
cat infile | dc -f - -e '1k1000sa[la*Sdz0!=Z]sZzsclZx[Ld1/psblcd1-sc1<Y]sYlYx'
Using Perl
perl -lpe ' $_=$_*1000 '
with inputs and inline replacing
$ cat andy.txt
0.00493293814
0.0438981727
0.149746656
0.443125129
0.882018387
0.975789607
0.995755374
1
$ perl -i -lpe ' $_=$_*1000 ' andy.txt
$ cat andy.txt
4.93293814
43.8981727
149.746656
443.125129
882.018387
975.789607
995.755374
1000
$
One decimal place
perl -lpe ' $_=sprintf("%0.1f",$_*1000 ) '
Zero decimal place and rounding off
perl -lpe ' $_=sprintf("%0.0f",$_*1000 ) '
Zero decimal place and Truncating
perl -lpe ' $_=sprintf("%0.0f",int($_*1000) ) '
awk to the rescue!
$ awk '{printf "%.1f\n", $1*1000}' file > tmp && mv tmp file
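Note that %.1f rounds (43.8981727 becomes 43.9), while the second desired output in the question looks truncated (43.8). Here is a sketch of truncating to one decimal in awk, assuming non-negative input values. The function name scale_truncate is hypothetical, and binary floating point may still surprise on values that sit exactly on a boundary.

```shell
# scale_truncate [FACTOR] (hypothetical name): multiply the first field of
# each stdin line by FACTOR (default 1000) and cut to one decimal place.
scale_truncate() {
    awk -v factor="${1:-1000}" '{
        v = $1 * factor
        # int() drops the fraction, so v*10 -> int -> /10 keeps one decimal
        printf "%.1f\n", int(v * 10) / 10
    }'
}
```

Usage would be along the lines of scale_truncate 1000 < file > tmp && mv tmp file.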
Using num-utils. For answers to 8 decimal places:
numprocess '/*1000/' n.txt
For rounded answers to 1 decimal place:
numprocess '/*1000/' n.txt | numround -n '.1'
Use sed to prefix each line with 1000*, then process the resulting mathematical expressions with bc. To show only the first digit after the decimal point you can use sed again.
sed 's/^/1000*/' yourFile | bc | sed -E 's/(.*\..).*/\1/'
This will print the latter of your expected outputs. Just as you wanted, decimals are cut rather than rounded (1.36 is converted to 1.3).
To remove all decimal digits either replace the last … | sed … with sed -E 's/\..*//' or use the following command
sed 's:^.*$:1000*&/1:' yourFile | bc
With these commands overwriting the file directly is not possible. You have to write to a temporary file (append > tmp && mv tmp yourFile) or use the sponge command from the package moreutils (append | sponge yourFile).
However, if you want to remove all decimal points after the multiplication there is a trick. Instead of actually multiplying by 1000 we can syntactically shift the decimal point. This can be done in one single sed command. sed has the -i option to overwrite input files.
sed -i.bak -E 's/\..*/&000/;s/^[^.]*$/&.000/;s/\.(...).*/\1/;s/^(-?)0*(.)/\1\2/' yourFile
The command changes yourFile's content to
4
43
149
443
882
975
995
1000
A backup yourFile.bak of the original is created.
The single sed command should work with every input number format too (even for things like -.1 → -100).

bash - how do I use 2 numbers on a line to create a sequence

I have this file content:
2450TO3450
3800
4500TO4560
And I would like to obtain something of this sort:
2450
2454
2458
...
3450
3800
4500
4504
4508
..
4560
Basically I would need a one liner in sed/awk that would read the values on both sides of the TO separator and inject those in a seq command or do the loop on its own and dump it in the same file as a value per line with an arbitrary increment, let's say 4 in the example above.
I know I could use a temp file, the read command and some sorting, but I would like to do it in a one-liner starting with cat filename | etc. as it is already part of a bigger script.
Correctness of the input is guaranteed, so the left side of TO is always smaller than the right side.
Thanks
Like this:
awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}' file
or, if you like starting with cat:
cat file | awk -F'TO' -v inc=4 'NF==1{print $1;next}{for(i=$1;i<=$2;i+=inc)print i}'
Something like this might work:
awk -F TO '{system("seq " $1 " 4 " ($2 ? $2 : $1))}'
This would tell awk to system (execute) the command seq 10 4 10 for lines just containing 10 (which outputs 10), and something like seq 10 4 40 for lines like 10TO40. The output seems to match your example.
Given:
txt="2450TO3450
3800
4500TO4560"
You can do:
echo "$txt" | awk -F TO '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i++) print i}'
If you want an increment greater than 1:
echo "$txt" | awk -F TO -v p=4 '{$2<$1 ? t=$1 : t=$2; for(i=$1; i<=t; i+=p) print i}'
Give a try to this:
sed 's/TO/ /' file.txt | while read first second; do if [ ! -z "$second" ] ; then seq $first 4 $second; else printf "%s\n" $first; fi; done
sed is used to replace TO with space char.
read is used to read the line; if there are 2 numbers, seq is used to generate the sequence. Otherwise, the single number is printed.
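The sed step can also be avoided by splitting on TO with parameter expansion. This is a minimal sketch, assuming bash and a seq command; the function name expand_ranges is made up for illustration. A line without TO yields the same value for both ends, so seq prints it once.

```shell
# expand_ranges [INCREMENT] (hypothetical name): read lines such as
# 2450TO3450 or 3800 on stdin and print the expanded sequence.
expand_ranges() {
    local inc=${1:-4} line first last
    while IFS= read -r line; do
        first=${line%%TO*}    # text before TO (whole line if there is no TO)
        last=${line##*TO}     # text after TO (whole line if there is no TO)
        seq "$first" "$inc" "$last"
    done
}
```

For example, cat file | expand_ranges 4 keeps the pipeline shape asked for in the question.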
This might work for you (GNU sed):
sed -r 's/(.*)TO(.*)/seq \1 4 \2/e' file
This evaluates the RHS of the substitution command if the LHS contains TO.

Bash Text file formatting

I have some files with the following format:
555584280113;01-04-2013 00:00:11;0,22;889;30008;1501;sms;/xxx/yyy/zzz
552185022741;01-04-2013 00:00:13;0,22;889;30008;1501;sms;/xxx/yyy/zzz
5511965271852;01-04-2013 00:00:14;0,22;889;30008;1501;sms;/xxx/yyy/zzz
5511980644500;01-04-2013 00:00:22;0,22;889;30008;1501;sms;/xxx/yyy/zzz
553186398559;01-04-2013 00:00:31;0,22;889;30008;1501;sms;/xxx/yyy/zzz
555584280113;01-04-2013 00:00:41;0,22;889;30008;1501;sms;/xxx/yyy/zzz
558487839822;01-04-2013 00:01:09;0,22;889;30008;1501;sms;/xxx/yyy/zzz
I need them to start with a 10-digit sequence number, with the prefix 55 removed from what then becomes the second column (which I have done with a simple sed 's/^55//g'), and with the date reformatted to look like this:
0000000001;555584280113;20130401 00:00:11;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000002;552185022741;20130401 00:00:13;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000003;5511965271852;20130401 00:00:14;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000004;5511980644500;20130401 00:00:22;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000005;553186398559;20130401 00:00:31;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000006;555584280113;01-04-2013 00:00:41;0,22;889;30008;1501;sms;/xxx/yyy/zzz
I have the date part in a separate way:
cat file.txt | cut -d\; -f2 | awk '{print $1}' |awk -v OFS="-" -F"-" '{print $3$2$1}'
And it works, but I don't know how to put all of them together, the sequence + sed for the prefix + change the date format. The sequence part I'm not even sure how to do it.
Thanks for the help.
awk is one of the best tools out there for text parsing and formatting. Here is one way of meeting your requirements:
awk '
BEGIN { FS = OFS = ";" }
{
printf "%010d;", NR
$1 = substr($1,3)
split($2, tmp, /[- ]/)
$2=tmp[3]tmp[2]tmp[1]" "tmp[4]
}1' file
We set the input and output field separator to ;
We use printf to format your first column number requirement
We use substr function to remove the first two characters of column 1
We use split function to format the time
The trailing 1 is an always-true pattern, so awk prints the modified record.
Output:
0000000001;5584280113;20130401 00:00:11;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000002;2185022741;20130401 00:00:13;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000003;11965271852;20130401 00:00:14;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000004;11980644500;20130401 00:00:22;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000005;3186398559;20130401 00:00:31;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000006;5584280113;20130401 00:00:41;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000007;8487839822;20130401 00:01:09;0,22;889;30008;1501;sms;/xxx/yyy/zzz
If the name of the input file is input, then the following command removes the 55, adds a 10-digit line number, and rearranges the date. With GNU sed:
nl -nrz -w10 -s\; input | sed -r 's/55//; s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3\2\1/'
If one is using Mac OSX (or another OS without GNU sed), then a slight change is required:
nl -nrz -w10 -s\; input | sed -E 's/55//; s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3\2\1/'
Sample output:
0000000001;5584280113;20130401 00:00:11;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000002;2185022741;20130401 00:00:13;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000003;11965271852;20130401 00:00:14;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000004;11980644500;20130401 00:00:22;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000005;3186398559;20130401 00:00:31;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000006;5584280113;20130401 00:00:41;0,22;889;30008;1501;sms;/xxx/yyy/zzz
0000000007;8487839822;20130401 00:01:09;0,22;889;30008;1501;sms;/xxx/yyy/zzz
How it works: nl is a handy *nix utility for adding line numbers. -w10 tells nl that we want 10 digit line numbers. -nrz tells nl to pad the line numbers with zeros, and -s\; tells nl to add a semicolon after the line number. (We have to escape the semicolon so that the shell ignores it.)
The remaining changes are handled by sed. The sed command s/55// removes the first occurrence of 55. The rearrangement of the date is handled by s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3\2\1/.
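One caveat: s/55// removes the first 55 anywhere on the line, and since nl has already prepended the line number, a number such as 0000000055 would lose its 55 instead of the phone number. Below is a hedged variant that anchors the substitution right after the 10-digit number and its semicolon. The function name number_and_reformat is made up here, and -E is used on the assumption that your sed accepts it (GNU and BSD sed do).

```shell
# number_and_reformat FILE (hypothetical name): prefix each line with a
# zero-padded 10-digit line number, strip a leading 55 from the next
# field, and turn DD-MM-YYYY into YYYYMMDD.
number_and_reformat() {
    nl -nrz -w10 -s';' "$1" |
        sed -E 's/^([0-9]{10};)55/\1/; s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3\2\1/'
}
```

Lines whose first field does not start with 55 are left alone by the first substitution.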
You could actually use a Bash loop to do this.
i=0
while read f1 f2; do
((++i))
IFS=\; read n d <<< $f1
d=${d:6:4}${d:3:2}${d:0:2}
printf "%010d;%d;%d %s\n" $i $n $d $f2
done < file.txt

Reorder lines of file by given sequence

I have a document A which contains n lines. I also have a sequence of n integers all of which are unique and <n. My goal is to create a document B which has the same contents as A, but with reordered lines, based on the given sequence.
Example:
A:
Foo
Bar
Bat
sequence: 2,0,1 (meaning: First line 2, then line 0, then line 1)
Output (B):
Bat
Foo
Bar
Thanks in advance for the help
Another solution:
You can create a sequence file by doing (assuming sequence is comma delimited):
echo $sequence | sed s/,/\\n/g > seq.txt
Then, just do:
paste seq.txt A.txt | sort -n | sed "s/^[0-9]*\s//"
Here's a bash function. The order can be delimited by anything.
Usage: schwartzianTransform "A.txt" 2 0 1
function schwartzianTransform {
local file="$1"
shift
local sequence="$@"
echo -n "$sequence" | sed 's/[^[:digit:]][^[:digit:]]*/\
/g' | paste -d ' ' - "$file" | sort -n | sed 's/^[[:digit:]]* //'
}
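Be aware that pairing the sequence with the lines via paste and then sorting places line i at position sequence[i], which is the inverse of what the question asks: for 2 0 1 it yields Bar, Bat, Foo rather than Bat, Foo, Bar. Here is a minimal sketch that looks lines up by index instead, assuming 0-based indices as in the question (the function name reorder_lines is hypothetical):

```shell
# reorder_lines FILE INDEX... (hypothetical name): print the lines of FILE
# in the order given by the 0-based indices. The indices may be separate
# arguments or one string delimited by any non-digit characters.
reorder_lines() {
    local file=$1
    shift
    awk -v order="$*" '
        { line[NR - 1] = $0 }                 # store lines under 0-based keys
        END {
            n = split(order, idx, /[^0-9]+/)  # any non-digit run is a delimiter
            for (i = 1; i <= n; i++)
                if (idx[i] != "") print line[idx[i]]
        }' "$file"
}
```

Usage: reorder_lines A.txt 2 0 1 or reorder_lines A.txt 2,0,1. The whole file is held in memory, so this is meant for files of moderate size.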
Read the file into an array and then use the power of indexing :
echo "Enter the input file name"
read ip
index=0
while read line ; do
NAME[$index]="$line"
index=$(($index+1))
done < $ip
echo "Enter the file having order"
read od
while read line ; do
echo "${NAME[$line]}";
done < $od
[aman@aman sh]$ cat test
Foo
Bar
Bat
[aman@aman sh]$ cat od
2
0
1
[aman@aman sh]$ ./order.sh
Enter the input file name
test
Enter the file having order
od
Bat
Foo
Bar
an awk oneliner could do the job:
awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
$s is your sequence.
Take a look at this example:
kent$ seq 10 >file #get a 10 lines file
kent$ s=$(seq 0 9 |shuf|tr '\n' ','|sed 's/,$//') # get a random sequence by shuf
kent$ echo $s #check the sequence in var $s
7,9,1,0,5,4,3,8,6,2
kent$ awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
8
10
2
1
6
5
4
9
7
3
One way (though not an efficient one for big files):
$ seq="2 0 1"
$ for i in $seq
> do
> awk -v l="$i" 'NR==l+1' file
> done
Bat
Foo
Bar
If your file is a big one, you can use this one:
$ seq='2,0,1'
$ x=$(echo $seq | awk '{printf "%dp;", $0+1;print $0+1> "tn.txt"}' RS=,)
$ sed -n "$x" file | awk 'NR==FNR{a[++i]=$0;next}{print a[$0]}' - tn.txt
The 2nd line prepares a sed print instruction, which is then used in the 3rd line with the sed command. This prints only the lines whose numbers are present in the sequence, but in file order rather than in the order of the sequence. The awk command then reorders the sed result according to the sequence.
