How to delete multiple first columns from multiple files? - bash

I have multiple files like this:
trans_ENSG00000047849.txt.traw
trans_ENSG00000047848.txt.traw
trans_ENSG00000047847.txt.traw
...
and each has around 300 columns. Columns are separated by tabs. I would like to remove the first 7 columns from each of those files.
I know how to do it for each file:
cut -f 7- trans_ENSG00000047849.txt.traw > trans_ENSG00000047849.txt.trawN
Is there is a way to do it at once for all files?
NOTE: there is a tab at the beginning of each line. That is why I used cut -f 7- here rather than cut -f 8- to remove the first 7 columns.

Just use a for loop:
for file in *.txt.traw
do
cut -f 7- "$file" > "$file"N
done
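
If you would rather replace the originals once you have verified the output, a cautious sketch (the .tmp suffix is my own choice; the && keeps the original if cut fails):
for file in *.txt.traw
do
cut -f 7- "$file" > "$file".tmp && mv "$file".tmp "$file"
done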

Back up your files first, and try this (GNU sed):
sed -ri 's/^([^\t]*\t){7}//' trans_*.txt.traw
The -i option makes sed change your files in place. (You can remove the i for testing; the result then goes to stdout instead.)
Eg:
$ cat file
1 2 3 4 5 6 7 8 9 0
a b c d e f g h i j
dfad da
$ sed -ri 's/^([^\t]*\t){7}//' file
$ cat file
8 9 0
h i j
dfad da
However, the command is kept simple, so it won't remove anything from lines that have fewer than 7 columns. (Presumably you won't have lines like that, right?)
If you do want it to strip lines with fewer than 7 columns as well:
sed -r 's/^([^\t]*(\t|$)){,7}//'
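To preview what either command does before letting -i rewrite the files, drop the i and inspect the output for one file, for example:
sed -r 's/^([^\t]*\t){7}//' trans_ENSG00000047849.txt.traw | head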

Related

Extract blocks of lines with sed

How would one go about using sed to extract n lines of a file every m-th line?
Say my textfile looks like this:
myfile.dat:
1
2
3
4
5
6
7
8
9
10
Say that I want to extract blocks of three lines and then skip two lines, repeating throughout the entire file, such that my output looks like this:
output.dat:
1
2
3
6
7
8
Any suggestions on how one could achieve this with sed?
Edit:
For my example I could just have used
sed -n 'p;n;p;n;p;n;n' myfile.dat > output.dat
or with GNU sed (not preferred due to portability)
sed '1~5b;2~5b;3~5b;d' myfile.dat > output.dat
However, I typically want to print blocks of 2450 lines from a file with 49 002 450 lines, such that my output file contains 247 450 lines.
This might work for you (GNU sed):
sed -n '1~5,+2p' file
Starting at line 1, print every line whose number is congruent to 1 modulo 5, together with the following two lines.
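Scaled up to the real case in the question, the same address form works. A sketch, assuming you print 2450 lines and then skip 2450 (the question does not state the skip length, so the 4900-line cycle is my assumption):
sed -n '1~4900,+2449p' myfile.dat > output.dat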
An alternative:
sed -n 'N;N;p;n;n' file
In your case the following would work. It checks whether the remainder of the line number divided by 5 is between 1 and 3:
awk 'NR%5==1, NR%5==3' myfile.dat
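The awk form generalizes the same way; a sketch with the block and skip sizes as variables (b and s are names of my own choosing, and the 2450/2450 split is again an assumption):
awk -v b=2450 -v s=2450 '(NR - 1) % (b + s) < b' myfile.dat > output.dat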

How can I remove lines from a file using the command line?

How can I delete part of a file from the command line?
I have tried using sed the following way:
c:\sed '1,2!d' res.txt > res.txt
but the file became empty.
What I expect to get is
1 a
2 b
3 c
4 d
to become
1 a
2 b
in the same file res.txt
Add the -i (or --in-place) switch to sed to read and write the same file. Note also that the Windows command line uses double quotes. So you should use
sed -i "1,2!d" res.txt
Just try calling c:\sed '1,2!d' res.txt without the redirection. You'll see the correct result:
1 a
2 b
The problem is that > res.txt makes the shell truncate the output file before sed ever reads it, so you can't use the same file for input and output. Use a different file and move/copy it afterwards: c:\sed '1,2!d' res.txt > res.tmp & move /y res.tmp res.txt

Extract a range of rows, with overlap using sed

I have a (dummy) file that looks like this:
header
1
2
3
4
5
6
7
8
9
10
And I need a command that would give me different files made of rows extracted every four lines, with a two-row overlap. So I would have something like this:
1
2
3
4
3
4
5
6
5
6
7
8
7
8
9
10
So here is what I got (it is not much, sorry):
tail -n +2 file | sed -n '1,4p' > window1.txt
But I don't know how to apply this over the whole file, with an overlap.
Thanks in advance.
This might work for you (GNU sed and split):
sed -nr '1{N;N;N};:a;p;$q;s/^.*\n.*\n(.*\n.*)$/\1/;N;N;ba' file | split -dl4
EDIT:
To make this programmable use:
sed -nr ':a;$!{N;s/[^\n]+/&/4;Ta};p;$q;s/.*((\n[^\n]*){2})$/\1/;D' file |
split -dl4 file-name-prefix
Where 4 is the number of lines per file and 2 is the number of overlapping lines.
file-name-prefix is your chosen file name, to which numbers will be appended (see man split).
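The same sliding window is arguably easier to read in awk. A sketch under the question's parameters (4-line windows advancing by 2 lines, header skipped; the window%02d.txt output names are my own choice):
tail -n +2 file | awk -v win=4 -v step=2 '
{ buf[NR] = $0 }                                # keep every line seen so far
NR >= win && (NR - win) % step == 0 {           # a full window ends on this line
    out = sprintf("window%02d.txt", ++n)
    for (i = NR - win + 1; i <= NR; i++) print buf[i] > out
    close(out)
}'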

Removing newlines between tokens

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts of mine to work properly, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between pairs of STARTINGTOKEN and ENDINGTOKEN, where the two tokens are always on different lines (and never get jumbled up together; it's impossible, for instance, to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
Are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
This seems to work:
sed -e '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints the joined line; lines outside a token pair pass through untouched. Then it repeats.
Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0)}
/ENDINGTOKEN/{print l; l=""}
!l && $0 !~ /ENDINGTOKEN/ {print}' input.file
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file

Grabbing every 4th file

I have 16,000 jpgs from a webcam screen grabber that I let run for a year pointing into the back yard. I want to find a way to grab every 4th image so that I can put them into another directory and later turn them into a movie. Is there a simple bash script or other way under Linux to do this?
They are named like so......
frame-44558.jpg
frame-44559.jpg
frame-44560.jpg
frame-44561.jpg
Thanks from a newb needing help.
Seems to have worked.
A couple of errors in my original post: there were actually 280,000 images and the naming was:
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163405.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163505.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163605.jpg
I ran:
cp $(ls | awk '{nr++; if (nr % 10 == 0) print $0}') ../newdirectory/
which appears to have copied the images: 70-900 per day from the looks of it.
Now I'm running
mencoder mf://*.jpg -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
I'll let you know how the movie works out.
UPDATE: The movie did not work.
It only has images from 2007 in it, even though the directory has 2008 as well.
webcam_2008-02-17_101403.jpg webcam_2008-03-27_192205.jpg
webcam_2008-02-17_102403.jpg webcam_2008-03-27_193205.jpg
webcam_2008-02-17_103403.jpg webcam_2008-03-27_194205.jpg
webcam_2008-02-17_104403.jpg webcam_2008-03-27_195205.jpg
How can I modify my mencoder line so that it uses all the images?
One simple way is:
$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ mv $(ls | awk '{nr++; if (nr % 4 == 0) print $0}') destdir
Create a script move.sh which contains this:
#!/bin/sh
mv "$4" ../newdirectory/
Make it executable and then do this in the folder:
ls *.jpg | xargs -n 4 ./move.sh
This takes the list of filenames, passes four at a time into move.sh, which then ignores the first three and moves the fourth into a new folder.
This will work even if the numbers are not exactly in sequence (e.g. if some frame numbers are missing, then using mod 4 arithmetic won't work).
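A hedged variant of the same idea that also survives spaces in filenames (assumes GNU find, sort, and xargs; the inline sh -c script stands in for move.sh):
find . -maxdepth 1 -name '*.jpg' -print0 | sort -z |
xargs -0 -n 4 sh -c 'if [ -n "$4" ]; then mv -- "$4" ../newdirectory/; fi' sh
The if guard skips the final batch when fewer than four names remain.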
As suggested, you should use
seq -f 'frame-%g.jpg' 1 4 number-of-frames
to generate the list of filenames, since expanding *.jpg to 280k names can overflow the shell's command-line length limit. So the final solution would be something like:
for f in `seq -f 'frame-%g.jpg' 1 4 number-of-frames` ; do
mv "$f" destdir/
done
seq -f 'frame-%g.jpg' 1 4 number-of-frames
…will print the names of the files you need.
An easy way in Perl (probably easily adaptable to bash) is to glob the filenames into an array, then extract the sequence number and remove those that are not divisible by 4.
Something like this will print the files you need:
ls -1 /path/to/files/ | perl -e 'while (<STDIN>) {($seq)=/(\d*)\.jpg$/; print $_ if $seq && $seq % 4 ==0}'
You can replace the print with a move...
This will work if the files are numbered in sequence, even if the number of digits is not constant (e.g. file_9.jpg followed by file_10.jpg).
Given masto's caveats about sorting:
ls | sed -n '1~4 p' | xargs -i mv {} ../destdir/
The thing I like about this solution is that everything's doing what it was designed to do, so it feels unixy to me.
Just iterate over a list of files:
files=( frame-*.jpg )
i=0
while [[ $i -lt ${#files[@]} ]] ; do
cur_file=${files[$i]}
mungle_frame "$cur_file"
i=$(( i + 4 ))
done
This is pretty cheesy, but it should get the job done. Assuming you're currently cd'd into the directory containing all of your files:
mkdir ../outdir
ls | sort -n | while read fname; do mv "$fname" ../outdir/; read; read; read; done
The sort -n is there in case your filenames don't all have the same number of digits; without it, ls will sort in lexical order, where frame-123.jpg comes before frame-4.jpg, and I don't think that's what you want.
Please be careful, back up your files before trying my solution, etc. I don't want to be responsible for you losing a year's worth of data.
Note that this solution does handle files with spaces in the name, unlike most of the others. I know that wasn't part of the sample filenames, but it's easy to write shell commands that don't handle spaces safely, so I wanted to do that in this example.
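One caveat: sort -n keys on leading digits, and these names start with frame- rather than a number, so it can still fall back to lexical order. A field-based sort sketch (GNU sort; split on the dash and sort the second field numerically):
ls | sort -t '-' -k 2 -n | while read fname; do mv "$fname" ../outdir/; read; read; read; done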
Brace expansion {m..n..s} is more efficient than seq, and it allows a bit of output formatting:
$ echo {0000..0010..2}
0000 0002 0004 0006 0008 0010
Postscript: in curl, if you only want every fourth (nth) numbered image, you can give curl a step counter too. This example range goes from 0 to 100 with an increment of 4 (n):
curl -O "http://example.com/[0-100:4].png"
