Sed creating duplicates - bash

I used the sed command in my shell script to remove everything except numbers from my string. The string contains three 0s among other numbers, but after running
sed 's/[^0-9]*//g'
I now have 0, 01 and 02 instead of three 0s. How can I prevent sed from doing that, so that I keep all three 0s?
Sample of the string:
0 cat
42 dog
24 fish
0 bird
0 tiger
5 fly

Now that we know that digits in filenames in the output of the du utility caused the problem (tip of the hat to Lars Fischer), simply use cut to extract only the first column, which contains the data of interest: each file's/subdirectory's size in blocks:
du -a "$var" | cut -f1
du outputs tab-separated data, and a tab is also cut's default separator, so all that is needed is to ask for the 1st field (-f1).
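For instance, simulating two lines of du's tab-separated output (hypothetical sizes, mirroring the sample above):
$ printf '0\tcat\n42\tdog\n' | cut -f1
0
42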
In hindsight, your problem was unrelated to sed; your sample data simply wasn't representative of your actual data. It's always worth creating an MCVE (Minimal, Complete, and Verifiable Example) when asking a question.

Alternatively, try this:
du -a "$var" | sed -e 's/[[:blank:]].*//' -e 's/[^0-9]*//g'

Related

How can I mask 200 characters of each line in a file with 3000 long lines?

I have a fixed-width text data file. Each line is 3000 characters long. I need to mask (change to 'X') all the characters between positions 1000 and 1200. There are no delimiters in the file; each field is known by its position in the line.
If I only needed to change 10 characters I could use sed:
sed -i -r 's/^(.{999}).{10}(.*)/\1XXXXXXXXXX\2/'
But writing a sed command with 200 X's does not seem like a good idea.
I tried using awk, but it returns different values for some lines because of spaces in the data.
"But writing a sed command with 200 X's does not seem like a good idea."
Let's do it anyway, but script it:
sed -E 's/^(.{999}).{200}/\1'"$(yes X | head -n200 | tr -d '\n')"'/'
Because it just so happens that 1000 % 200 == 0, I think we also could:
sed -E 's/.{200}/'"$(yes X | head -n200 | tr -d '\n')"'/6'
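The occurrence flag on s/// is easy to sanity-check at a smaller scale; here is a hypothetical 16-character line, replacing the 3rd block of 4:
$ echo 'aaaaBBBBccccDDDD' | sed -E 's/.{4}/XXXX/3'
aaaaBBBBXXXXDDDD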
My go-to tools are, in order of increasing ability to get stuff done, sed, awk and python. You may want to consider stepping up :-)
In any case, this can be done in awk with some initial setup, something like:
BEGIN {x="XXXXXXXXXX"; x=x""x""x""x""x; x=x""x""x""x}
which gives you (10, then 50, then) 200 X's.
Then you can just fiddle with $0, which is the whole line regardless of spacing. Depending on what you actually meant by "between positions 1000 and 1200", the numbers below may be slightly different but you should get the idea:
{ print substr($0,1,999)""x""substr($0,1200) }
You can see how this will behave in the following snippet, replacing character positions 3 through 6 on each line:
pax> printf "hello there\ngoodbye\n" | awk '
...> BEGIN {x="X";x=x""x;x=x""x}
...> {print substr($0,1,2)""x""substr($0,7)}'
heXXXXthere
goXXXXe
This might work for you (GNU sed):
sed -E '1{x;:a;/^x{200}/!s/^/x/;ta;x};G;s/^(.{999}).{200}(.*)\n(.*)/\1\3\2/' file
Prime the hold space with a string containing 200 x's. Append the hold space to the current line and using substitution replace the intended string with the mask.
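Scaled down for illustration (GNU sed; four x's masking positions 3 through 6, mirroring the awk demo above):
$ printf 'hello there\ngoodbye\n' | sed -E '1{x;:a;/^x{4}/!s/^/x/;ta;x};G;s/^(.{2}).{4}(.*)\n(.*)/\1\3\2/'
hexxxxthere
goxxxxe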

How to extract specific rows based on row number from a file

I am working on an RNA-Seq data set consisting of around 24000 rows (genes) and 1100 columns (samples), which is tab-separated. For the analysis, I need to choose a specific gene set. It would be very helpful if there were a method to extract rows based on row number; that would be easier for me than using the gene names.
Below is an example of the data (4x4):
gene     Sample1    Sample2    Sample3
A1BG     5658       5897       6064
AURKA    3656       3484       3415
AURKB    9479       10542      9895
From this, say for example, I want rows 1, 3 and 4, without a specific pattern.
I have also asked on biostars.org.
You may use a for loop to build the sed options like below:
var=-n
for i in 1 3,4   # Put your space-separated ranges here
do
    var="${var} -e ${i}p"
done
sed $var filename
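For the ranges above, the loop builds up $var so that the final command is equivalent to the following (note that $var is deliberately left unquoted so it word-splits into separate options):
sed -n -e 1p -e 3,4p filename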
Note: in any case, the requirement mentioned here would still be a pain, as it involves too much typing.
Say you have a file (or a program that generates a list) of the line numbers you want. You could edit that with sed to turn it into a script that prints those lines, and pass it to a second invocation of sed.
In concrete terms, say you have a file called lines that says which lines you want (or it could equally be a program that generates the lines on its stdout):
1
3
4
You can make that into a sed script like this:
sed 's/$/p/' lines
which outputs:
1p
3p
4p
Now you can pass that to another sed as the commands to execute:
sed -n -f <(sed 's/$/p/' lines) FileYouWantLinesFrom
This has the advantage of being independent of the maximum length of the argument list you can pass to a command, because the sed commands are in a pseudo-file, i.e. not passed as arguments.
If you don't like/use bash and process substitution, you can do the same like this:
sed 's/$/p/' lines | sed -n -f /dev/stdin FileYouWantLinesFrom
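For instance, if the 4x4 sample from the question were saved as genes.tsv (a hypothetical name), lines 1, 3 and 4 are the header plus the AURKA and AURKB rows:
$ sed 's/$/p/' lines | sed -n -f /dev/stdin genes.tsv
gene     Sample1    Sample2    Sample3
AURKA    3656       3484       3415
AURKB    9479       10542      9895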

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I desire starts at the character position given in the second column and extends for the next 10 characters. I will give an example below.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (just one very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. So it would always keep a range of 10 characters, but with different start points, and those start points are set by the values in the second column of the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
    p1=$i;
    p2=`expr "$1" + 10`
    cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
    set $line
    p2=`expr "$2" + 10`
    cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress the informational logging.)
while read -r name index _; do
    dd if=file2.txt bs=1 skip="$index" count=10 status=none
    printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 into memory, assuming it's large), and its overhead is bounded: one dd invocation per sequence to extract.
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it into memory and use Bash substrings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
    echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
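As a quick sanity check with the sample file2.txt and the first offset from file1.txt (10):
$ data=$(<file2.txt); echo "${data:10:10}"
GATTCTTTTT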
(Thanks to @CharlesDuffy for the tip to read the data without a useless cat, and for the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
    pos=$(echo "$line" | cut -f2 -d' ')
    x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
    echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly done by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even though the other answers are superior for more experienced people.
The first line is pretty easy: it just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually they are used with lines instead of characters; nevertheless, I print the first pos + 10 characters with head, and the result is piped into tail, which prints the last 10 characters.
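For instance, with the sample file2.txt and pos=10:
$ head -c $(( 10 + 10 )) file2.txt | tail -c 10
GATTCTTTTT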
Thanks to @CharlesDuffy for improvements.

remove everything but first and last 6 digits

I am trying to remove everything but the first number and last 6 digits from every line in a file. So far I have removed everything but the last 6 digits using sed like so:
sed -r 's/.*(.{6})/\1/' test
Would there be a way for me to modify this so that I keep the first number too? This number can be any length but will always be followed by a space. Basically, I would like to get rid of /home/usr/file and only keep 123456789 123456. Any help would be greatly appreciated!
Input line:
123456789 /home/usr/file123456
Desired Output:
123456789 123456
echo 5 /home/usr/file123456 | awk '{print $1,substr($2,length($2)-5,6)}'
Do the same thing you did for the end at the beginning.
sed -r 's/(.).*(.{6})/\1\2/' test
(I have no idea how efficient this is however. It might need to back-track for the length of the final match.)
To grab the first "field" (space-separated) and the last six characters, you can use:
sed -r 's/([^[:space:]]*) .*(.{6})/\1 \2/' test
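For instance, with the sample input:
$ echo '123456789 /home/usr/file123456' | sed -r 's/([^[:space:]]*) .*(.{6})/\1 \2/'
123456789 123456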
Though I think the awk solution is generally a better idea.
$ echo '123456789 /home/usr123/file123456' | sed -r 's/ .*(.{6})/ \1/'
123456789 123456

sed: replace a character only between two positions

Sorry for this apparently simple question, but I have spent too long trying to find the solution everywhere and trying different sed options.
I just need to replace all dots by commas in a text file, but just between two positions.
As an example, from:
1.3.5.7.9
to
1.3,5,7.9
So, replace . by , between positions 3 to 7.
Thanks!
EDITED: sorry, I tried to simplify the problem, but since none of the first 3 answers work due to the lack of detail in my question, let me go a bit deeper. The important point is replacing all dots by commas in an interval of positions, without knowing the rest of the string:
Here some text. I don't want to change. 10.000 usd 234.566 usd Continuation text.
More text. No need to change this part. 345 usd 76.433 usd Text going on. So on.
This is a fixed width text file, in columns, and I need to change the international format of numbers, replacing dots by commas. I just know the initial and final positions where I need to search and eventually replace the dots. Obviously, not all figures have dots (only those over 1000).
Thanks.
Rewriting the answer after the clarification of the question:
This is hard to handle with sed only, but can be simplified with other standard utilities like cut and paste:
$ start=40
$ end=64
$ paste -d' ' <(cut -c -$((start-1)) example.txt) \
> <(cut -c $((start+1))-$((end-1)) example.txt | sed 'y/./,/') \
> <(cut -c $((end+1))- example.txt)
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
(The leading > characters just mean continuation of the previous line; the <( ) are real.) This is of course very inefficient, but conceptually simple.
I used all the +1 and -1 offsets to drop the boundary characters (assumed to be spaces), since paste -d' ' re-inserts a space at each seam. Not sure if you need it.
A pure sed solution (brace yourself):
$ sed "s/\(.\{${start}\}\)\(.\{$((end-start))\}\)/\1\n\2\n/;h;s/.*\n\(.*\)\n.*/\1/;y/./,/;G;s/^\(.*\)\n\(.*\)\n\(.*\)\n\(.*\)$/\2\1\4/" example.txt
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
GNU sed:
$ sed -r "s/(.{${start}})(.{$((end-start))})/\1\n\2\n/;h;s/.*\n(.*)\n.*/\1/;y/./,/;G;s/^(.*)\n(.*)\n(.*)\n(.*)$/\2\1\4/" example.txt
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
I tried to simplify the regex, but it is more permissive.
echo 1.3.5.7.9 | sed -r "s/^(...).(.).(..)/\1,\2,\3/"
1.3,5,7.9
PS: It doesn't work with BSD sed.
$ echo "1.3.5.7.9" |
gawk -v s=3 -v e=7 '{
print substr($0,1,s-1) gensub(/\./,",","g",substr($0,s,e-s+1)) substr($0,e+1)
}'
1.3,5,7.9
This is rather awkward to do in pure sed. If you're not strictly constrained to sed, I suggest using another tool to do this. Ed Morton's gawk-based solution is probably the least-awkward (no pun intended) way to solve this.
Here's an example of using sed to do the grunt work, but wrapped in a bash function for simplicity:
function transform () {
    line=$1
    start=$2
    end=$3
    # Save the beginning and end of the line
    front=$(echo "$line" | sed -e "s/\(^.\{$start\}\).*$/\1/")
    back=$(echo "$line" | sed -e "s/^.\{$end\}//")
    # Translate all dots to commas
    line=$(echo "$line" | sed -e 'y/./,/')
    # Restore the unmodified beginning/end
    echo "$line" | sed -e "s/^.\{$start\}/$front/" -e "s/\(^.\{$end\}\).*$/\1$back/"
}
Call this function like:
$ transform "1.3.5.7.9" 3 7
1.3,5,7.9
Thank you all.
What I found around the web (not my own merit) as simple solutions:
For fixed width files:
awk -F "" 'OFS="";{for (j=2;j<= 5;j++) if ($j==".") $j=","}'1
Will change all dots into commas from the 2nd position to the 5th.
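For example, with the string from the original question (note that FS="" per-character splitting is a gawk extension):
$ echo '1.3.5.7.9' | awk -F '' 'OFS=""; {for (j=2; j<=5; j++) if ($j==".") $j=","} 1'
1,3,5.7.9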
For tab-delimited files:
awk -F'\t' 'OFS="\t" {for (j=2; j<=5; j++) gsub(/\./, ",", $j)} 1'
This changes all dots into commas from the 2nd field to the 5th.
Hope this can help someone: I couldn't imagine it would be so tough in the beginning.
