Trimming a textfile - bash

I want to trim a text file by deleting all lines from line n to the end of the file. I tried to use sed for that. The sed command for n=26 should look like this:
sed -i '26,$d' /path/to/textfile
In my text file I don't know n beforehand, but I know that there is a unique text in that line. So I tried it this way:
myvar=`grep -n 'unique text' /path/to/textfile | awk -F":" '{print $1 }'`
sed -i "${myvar}"',$d' /path/to/textfile
That works and deletes all wanted lines but it throws the error message:
sed: -e expression # 1, character 1: unknown command: »,«
So I tried changing my command to:
myvar=`grep -n 'unique text' /path/to/textfile | awk -F":" '{print $1 }'`
sed -i "${myvar},$d" /path/to/textfile
With that I get the same error message, but it doesn't delete the lines.
I tried some variations with ' and " and different ways of putting the variable in there, but it never works as wanted. Does anyone know what I am doing wrong?
I would also appreciate other methods for trimming the text file, as long as I can use them in a bash script.

You can replace the fixed line number with a regular expression matching the line to start at.
sed -i '/unique text/,$d' /path/to/textfile
You can also use ed to edit the file, rather than rely on a non-standard sed extension.
printf '/unique text/,$d\nwq\n' | ed /path/to/textfile
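If you would rather keep the line-number approach from the question, a minimal sketch could look like the one below. It assumes the variable was empty when sed complained (an expression that starts with `,` suggests that) and guards against it. Inside double quotes the `$` of `$d` has to be escaped, otherwise the shell expands `$d` as an (empty) variable. The sample file and the names `file`, `n` and `trimmed` are made up for illustration, and plain sed writing to stdout is used instead of `-i`.

```shell
#!/bin/sh
# Hypothetical sample file standing in for /path/to/textfile.
file=$(mktemp)
printf '%s\n' 'keep 1' 'keep 2' 'some unique text' 'drop me' > "$file"

# Line number of the first matching line; empty if nothing matches.
n=$(grep -n 'unique text' "$file" | head -n 1 | cut -d: -f1)

if [ -n "$n" ]; then
    # \$ stops the shell from expanding $d inside the double quotes,
    # so sed receives the address range "3,$" followed by the d command.
    trimmed=$(sed "${n},\$d" "$file")
else
    # No match: leave the content untouched.
    trimmed=$(cat "$file")
fi

printf '%s\n' "$trimmed"
rm -f "$file"
```

Checking that `n` is non-empty before calling sed avoids the "unknown command" error when the text is not found.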

Related

How to remove consecutive repeating characters from every line?

I have the below lines in a file
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;;;;
Acanthocephala;;;;;;;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Polymorphus;;
and I want to remove the repeating semicolon characters from all lines so that they look like below (note: there are repeating semicolons in the middle of some of the above lines too)
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;
Acanthocephala;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Polymorphus;
I would appreciate if someone could kindly share a bash one-liner to accomplish this.
You can use tr with "squeeze":
tr -s ';' < infile
With perl:
perl -p -e 's/;+/;/g' myfile # writes output to stdout
or
perl -p -i -e 's/;+/;/g' myfile # does an in-place edit
If you want to edit the file itself:
printf "%s\n" 'g/;;/s/;\{2,\}/;/g' w | ed -s foo.txt
If you want to pipe a modified copy of the file to something else and leave the original unchanged:
sed 's/;\{2,\}/;/g' foo.txt | whatever
These replace runs of 2 or more semicolons with single ones.
This can be solved easily with substitutions.
Here is an awk solution that works by setting the FS/OFS variables:
awk -F';+' -v OFS=';' '$1=$1' file
or, so that lines whose first field is empty or 0 are still printed:
awk -F';+' -v OFS=';' '($1=$1)||1' file
Here's a sed version of alaniwi's answer (note that \+ is a GNU extension to basic regular expressions; ;;* is the portable spelling):
sed 's/;\+/;/g' myfile # Write output to stdout
or
sed -i 's/;\+/;/g' myfile # Edit the file in-place
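As a quick check, the following sketch runs two of the approaches above on a couple of the sample lines from the question and compares the results. The variable names are made up for illustration, and the sed variant uses `;;*` rather than `\+` so that it does not depend on GNU sed.

```shell
#!/bin/sh
# Two of the sample lines from the question.
input='Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Profilicollis;Profilicollis_altmani;
Acanthocephala;;;;;;;'

# tr -s squeezes each run of semicolons into a single one.
squeezed_tr=$(printf '%s\n' "$input" | tr -s ';')

# The same with a portable sed substitution (one or more semicolons -> one).
squeezed_sed=$(printf '%s\n' "$input" | sed 's/;;*/;/g')

printf '%s\n' "$squeezed_tr"
```

Both variables should hold the same two lines, each with single semicolons only.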

sed '$' matching start of line instead of end

I am trying to append '.tsv' to the end of a column of text in a file.
You can do this easily with sed 's|$|.tsv|' myfile.txt
However, this is not working for my file, and I am trying to figure out why and how to fix it so that this works.
The column I want to edit looks like this:
$ cut -f12 chickspress.tsv | sort -u | head
Adipose_proteins
Adrenal_gland
Cerebellum
Cerebrum
Heart
Hypothalamus
Ovary
Sciatic_nerve
Testis
Tissue
But when I try to use sed, the result comes out wrong:
$ cut -f12 chickspress.tsv | sort -u | sed -e 's|$|.tsv|'
.tsvose_proteins
.tsvnal_gland
.tsvbellum
.tsvbrum
.tsvt
.tsvthalamus
.tsvy
.tsvtic_nerve
.tsvis
.tsvue
.tsvey
.tsvr
.tsv
.tsvreas
.tsvoral_muscle
.tsventriculus
The .tsv is supposed to be at the end of the line, not the front.
I thought there might be some whitespace problem, so I tried this (macOS):
$ cut -f12 chickspress.tsv | sort -u | cat -ve
Adipose_proteins^M$
Adrenal_gland^M$
Cerebellum^M$
Cerebrum^M$
Heart^M$
Hypothalamus^M$
Ovary^M$
Sciatic_nerve^M$
Testis^M$
Tissue^M$
kidney^M$
liver^M$
lung^M$
pancreas^M$
pectoral_muscle^M$
proventriculus^M$
This ^M does not look right. It's not present in my other files, but I am not sure what it represents here, how to fix it, or how to get this sed command to work around it.
I produced this file using Python's csv.DictWriter in a script which I've used many times in the past but never noticed this error coming from its output before. Run on macOS in this case.
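The ^M characters are carriage returns (\r), i.e. the file has DOS-style CRLF line endings. The .tsv is in fact appended after the carriage return, and the terminal then prints it over the start of the line. Python's csv writers use '\r\n' as the default line terminator, which is the likely source. Remove the carriage returns as follows and then try your command again.
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
Or, if you have the dos2unix utility on your system, you could use that to remove these characters.
With awk:
awk '{gsub(/\r/,"")} 1' Input_file > temp_file && mv temp_file Input_file
With sed (\r in a sed regex is not specified by POSIX; GNU sed understands it, other implementations may not):
sed -i.bak 's#\r##g' Input_file
EDIT: As per Ed's comment, if you want to remove carriage returns only at the end of lines, the following may help.
awk '{sub(/\r$/,"")} 1' Input_file > temp_file && mv temp_file Input_file
OR
sed -i.bak 's#\r$##' Input_file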
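To see the effect in isolation, here is a small sketch that feeds two CRLF-terminated lines through the sed command from the question, once as they are and once after stripping the carriage returns with tr. The function name `crlf_input` and the variable names are made up for illustration.

```shell
#!/bin/sh
# Emit two lines with DOS-style CRLF line endings.
crlf_input() { printf 'Heart\r\nOvary\r\n'; }

# Without cleanup, .tsv lands after the carriage return, so a terminal
# shows it overwriting the start of each line.
broken=$(crlf_input | sed 's|$|.tsv|')

# Stripping the carriage returns first gives the intended result.
fixed=$(crlf_input | tr -d '\r' | sed 's|$|.tsv|')

printf '%s\n' "$fixed"
```

Piping `broken` through `cat -v` would show `Heart^M.tsv`, which confirms that sed itself did append at the end of the line.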

Extract nth column from a variable

I have a variable Firstline with the value FHEAD,0000000001,STKU,20150927000000,201509270000000000,1153,,0000000801,W, from which I need only the 5th field.
Can anyone help me resolve this?
I have used the below command but it is giving me an error
echo "FHEAD,0000000001,STKU,20150927000000,201509270000000000,1153,,0000000801,W" | awk -f ',' '{print $5}'
awk: fatal: can't open source file `,' for reading (No such file or directory)
Since you tagged this as bash and not awk (which would also be a valid solution), you can do
IFS=, read -r -a a <<< "FHEAD,0000000001,STKU,20150927000000,201509270000000000,1153,,0000000801,W"
echo "${a[4]}"
to obtain the same result without spawning a new process (note that bash arrays are 0-based).
Try -F not -f.
-F is for the field separator
-f is for the filename of the awk program.
You can use sed too
echo "..." | sed -E 's/([^,]*,){4}([^,]*).*/\2/'
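Putting the `-F` fix together, a minimal sketch with the value from the question could look like this. It also shows `cut`, which is not mentioned above but does the same job for a single-character delimiter. The variable names `with_awk` and `with_cut` are made up for illustration.

```shell
#!/bin/sh
Firstline='FHEAD,0000000001,STKU,20150927000000,201509270000000000,1153,,0000000801,W'

# -F sets the field separator (-f would name a file containing the awk program).
with_awk=$(printf '%s\n' "$Firstline" | awk -F',' '{print $5}')

# cut selects field 5 using , as the delimiter.
with_cut=$(printf '%s\n' "$Firstline" | cut -d',' -f5)

printf '%s\n' "$with_awk"   # prints 201509270000000000
```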

sed - unterminated `s' command

I have this piece of code:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}') ;
name=$(awk '{print $2}') ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
file BP.csv has this format:
GO:0008283 cell proliferation
GO:0009405 pathogenesis
GO:0010201 response to continuous far red light stimulus by the high-irradiance response system
GO:0009641 shade avoidance
while GOEA.csv has this format:
4577 GO:0006807 0.994 2014_06_01
4577 GO:0016788 0.989 2014_06_01
4577 GO:0043169 0.977 2014_06_01
4577 GO:0043170 0.963 2014_06_01
sed doesn't work. I want to change GO:0043170 for example, to string "pi", but it gives:
sed: -e expression #1, char 12: unterminated `s' command
Why?
Thanks.
You are running your awk commands without giving them the current line as input. Try this:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}' <<< "$line") ;
name=$(awk '{print $2}' <<< "$line" ) ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
Let's clean up this code a bit:
while read goterm name
do
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g"
done < BP.csv
The problem is that your awk statements are attempting to read from STDIN just like your while loop is doing. They are reading from the same input stream. The first awk consumes all the remaining lines, so $goterm ends up containing newlines, and the newline inside the sed expression is what produces the unterminated `s' command error.
What you want to do is to pull the values out of your line. I'm using read to do this. The read statement uses the characters in $IFS to split the input. These are normally spaces, tabs, and newlines. read assigns one field to each variable you name, and the last variable receives the entire rest of the line.
Thus:
while read line
reads in the entire line while:
while read goterm name
will break the line as
goterm="GO:0008283"
name="cell proliferation"
One more thing. When you use grep and sed together, you probably can get away with just sed:
while read goterm name
do
sed -n "/$goterm/s/$goterm/pi/gp" GOEA.csv
done < BP.csv
The format for the sed command is:
/lines/command/parameters/
So, I'm searching for lines with $goterm in them, then I am replacing $goterm with pi. The -n means don't print the lines as sed processes them, and p means print the lines where the substitution was made.
By the way, csv as a file suffix means comma separated values, but neither file looks comma separated. Are tabs separating the fields? If so, you may want to set $IFS to a tab for the read, so that spaces inside a field are not treated as separators.
I would restructure that whole thing more like this:
while read goterm restofline
do
grep -w "${goterm}" GOEA.csv | sed -e "s/${goterm}/pi/g"
done < BP.csv
No reason for the awk things, as the bash read builtin will do rudimentary field splitting for you if you give it multiple variables. Also, you aren't using name anyway, so it's not needed. cat is unnecessary as well.
Depending on your exact use case, even the grep may be unnecessary, making the inner command simply sed -ne "s/${goterm}/pi/gp" GOEA.csv. Unless your purpose for the grep -w is eliminating lines where ${goterm} is a substring of a word instead of the whole word...
For future reference, inserting a set -x above your loop in your script would show you the exact commands that are being run, so that you can compare them with your expectations.
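The loop can be tried out on throwaway data with the sketch below. The two temporary files are hypothetical stand-ins for BP.csv and GOEA.csv; the second GOEA line was changed to contain a term that also occurs in BP.csv, so that there is something to match. `read -r` is used so that backslashes in the input are not interpreted.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'GO:0008283 cell proliferation' \
              'GO:0009405 pathogenesis' > "$workdir/BP.csv"
printf '%s\n' '4577 GO:0006807 0.994 2014_06_01' \
              '4577 GO:0009405 0.989 2014_06_01' > "$workdir/GOEA.csv"

# read splits each line itself: first word into goterm, the rest into name.
result=$(
    while read -r goterm name; do
        grep -w "$goterm" "$workdir/GOEA.csv" | sed "s/$goterm/pi/g"
    done < "$workdir/BP.csv"
)

printf '%s\n' "$result"   # prints 4577 pi 0.989 2014_06_01
rm -r "$workdir"
```

Because the loop's input comes from the redirection after `done` and nothing inside the loop reads STDIN, each iteration sees exactly one line.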

sed extract substring between two characters from a file and save to variable

I am automatically building a package. The automated script needs to get the version of the package to build.
I need to get the string of the python script main.py. It says in line 15
VERSION="0.2.0.4" #DO NOT MOVE THIS LINE
I need the 0.2.0.4, in future it can easily become 0.10.3.15 or so, so the sed command must not have a fixed length.
I found this on stackoverflow:
sed -n '15s/.*\#\([0-9]*\)\/.*/\1/p'
"This suppresses everything but the second line, then echos the digits between # and /"
This does not work (adjusted). Which is the last "/"? How can I save the output into a variable called "version"?
version = sed -n ...
throws an error
command -n not found
If you just need the version number:
awk -F\" 'NR==15 {print $2}' main.py
This prints everything between the double quotes on line 15, e.g. 0.2.0.4
With awk:
$ awk -F= 'NR==15 {gsub("\"","",$2); print $2}' main.py
0.2.0.4
Explanation
NR==15 performs actions on line number 15.
-F= defines the field separator as =.
{gsub("\"","",$2); print $2} removes the " character on the 2nd field and prints it.
Update
to be more specific the line is version="0.2.0.4" #DO NOT MOVE THIS LINE
$ awk -F'[=#]' 'NR==15 {gsub("\"","",$2); print $2}' main.py
0.2.0.4
The field separator -F'[=#]' is a bracket expression, which means the separator can be either # or = (it is quoted so that the shell cannot treat it as a glob pattern).
To save it into your version variable, use the expression var=$(command) like:
version=$(awk -F'[=#]' 'NR==15 {gsub("\"","",$2); print $2}' main.py)
Try:
sed -n '15s/[^"]*"\(.*\)".*/\1/p' inputfile
In order to assign it to a variable, say:
VARIABLE=$(sed -n '15s/[^"]*"\(.*\)".*/\1/p' inputfile)
In order to remove the dependency that the VERSION would occur only on line 15, you could say:
sed -n '/^VERSION=/ s/[^"]*"\(.*\)".*/\1/p' inputfile
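There must be no spaces around = when assigning a variable:
version=$(your code)
For example (without -i, which would edit main.py in place instead of printing the result):
version=$(sed -n '15s/[^"]*"\([0-9.]*\)".*/\1/p' main.py)
OR, with backticks:
version=`sed -n '15s/[^"]*"\([0-9.]*\)".*/\1/p' main.py`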
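A self-contained way to try the extraction is sketched below. The temporary file is a hypothetical stand-in for main.py, and the line is matched by its `VERSION=` prefix rather than by line number, so the sketch does not depend on the line being number 15. The version string 0.10.3.15 is the longer example value from the question.

```shell
#!/bin/sh
# Hypothetical stand-in for main.py.
script=$(mktemp)
printf '%s\n' '#!/usr/bin/env python' \
              'VERSION="0.10.3.15" #DO NOT MOVE THIS LINE' > "$script"

# Capture whatever sits between the first pair of double quotes on the
# VERSION= line; no spaces around = in the assignment.
version=$(sed -n '/^VERSION=/s/[^"]*"\([^"]*\)".*/\1/p' "$script")

printf 'version=%s\n' "$version"   # prints version=0.10.3.15
rm -f "$script"
```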
