Splitting multiple input files into multiple outputs using split function in linux - bash

I have 8 files I would like to split into 5 chunks per file. I would normally do this individually but would like to run this as a loop. I work within an HPC environment.
I have created a list of the file names and labelled it "variantlist.txt". My code is:
for f in `cat variantlist.txt`; do split ${f} -n 5 -d; done
However, it only splits the final file in variantlist.txt, outputting 5 chunks from that final entry alone.
Even if I list the files individually:
for f in chr001.vcf chr002 ...chr008.vcf ; do split ${f} -n 5 -d; done
It still only splits the final file into 5 chunks.
Not sure where I am going wrong here. The desired output would be 40 chunks, 5 per chromosome. Your help would be greatly appreciated.
Many thanks

The split is creating the same set of files each time and overwriting the previous ones. Here's one way to handle that -
for f in $(<variantlist.txt)   # don't use cat
do  mkdir -p "$f.split"        # make a subdir for the files
    ( cd "$f.split" &&         # change into the subdir only in a subshell
      split "../$f" -n 5 -d    # split from there
    )                          # close the subshell, parent still in base dir
done
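With GNU split's -d option the pieces get two-digit numeric suffixes on the default x prefix, so each subdirectory then looks something like this (the names below assume the chr001.vcf input from the question):
$ ls chr001.vcf.split/
x00  x01  x02  x03  x04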
Or you could just do this -
while read f                   # grab each filename
do  split "$f" -n 5 -d         # split it
    for x in x??               # for each split file
    do  mv "$x" "$f.$x"        # rename it to include the parent file name
    done
done < variantlist.txt         # take names from this file
This is a lot slower, but doesn't use subdirs.
My favorite, though -
xargs -I {} split {} -n 5 -d {} < variantlist.txt
The last arg becomes the PREFIX for split instead of the default of x.
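With the filenames from the question, the pieces therefore come out already grouped by their source; a sketch of the expected names, assuming GNU split's two-digit numeric suffixes:
$ ls chr001.vcf*
chr001.vcf  chr001.vcf00  chr001.vcf01  chr001.vcf02  chr001.vcf03  chr001.vcf04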
EDIT -- with 2 billion lines per file, use this one:
for f in $(<variantlist.txt)
do split "$f" -d -n 5 "$f" & # run all in background at the same time
done

When using split, the -n switch determines the number of output files that the original is split into...
You need -l to set the number of lines per output file, 5 in your case:
split -l 5 ${f}
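As an aside (this is based on GNU coreutils split, not something the answer above uses): if the goal is 5 chunks per file without breaking any line across chunks, -n also accepts the l/N form:
split -n l/5 -d "${f}" "${f}."   # 5 line-aligned chunks, e.g. chr001.vcf.00 … chr001.vcf.04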

Related

how to produce multiple readlength.tsv at once from multiple fastq files?

I have 16 fastq files under different directories and need to produce a readlength.tsv for each one separately. I have a script to produce readlength.tsv; this is the script I use:
zcat ~/proje/project/name/fıle_fastq | paste - - - - | cut -f1,2 | while read readID sequ;
do
len=`echo $sequ | wc -m`
echo -e "$readID\t$len"
done > ~/project/name/fıle1_readlength.tsv
I can produce each readlength file one by one, but it will take a long time. I want to produce them all at once, which is why I created a list of these fastq files, but I couldn't write a loop to produce the readlength.tsv files from the 16 fastq files at once.
I would appreciate it if you could help me.
Assuming a file list.txt contains the 16 file paths such as:
~/proje/project/name/file1_fastq
~/proje/project/name/file2_fastq
..
~/path/to/the/fastq_file16
Then would you please try:
#!/bin/bash
while IFS= read -r f; do            # "f" is assigned each fastq filename in "list.txt"
    mapfile -t ary < <(zcat "$f")   # assign "ary" to the array of lines
    echo -e "${ary[0]}\t${#ary[1]}" # ${ary[0]} is the id and ${#ary[1]} is the length of the sequence
done < list.txt > readlength.tsv
As the fastq file format contains the id on the 1st line and the sequence on the 2nd line, the bash built-in mapfile is well suited to handle them.
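If each fastq file holds more than one read (the fastq format stores four lines per record), a variant of the loop above could step through the array four lines at a time; a sketch under that assumption:
#!/bin/bash
while IFS= read -r f; do
    mapfile -t ary < <(zcat "$f")                # load every line of the fastq
    for ((i = 0; i < ${#ary[@]}; i += 4)); do    # one record = id, sequence, "+", quality
        echo -e "${ary[i]}\t${#ary[i+1]}"        # id and the length of its sequence line
    done
done < list.txt > readlength.tsv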
As a side note, the letter ı in your code looks like a non-ASCII character.

Shell script to copy the files

I have worked very little with scripts, so I don't know much.
I need to create a script (in Ubuntu) that copies only files where a certain user modified more than 20 lines at a given time.
I know that to copy a file elsewhere I use this code:
$ ls dir1/
dir2/
$ cp -r dir1/ dir1.copy
$ ls dir1.copy
dir2/
And to count lines: wc -l file1
But how could I check whether a user has modified more than 20 lines in a file (e.g. a simple txt file) in a given period, for example today?
Thank you in advance !
In the first place, if by "modifying lines in a file" you mean "adding lines to a file", then you can do something about it. If you are literally talking about modifying lines in files, there is nothing you can do to track that activity without setting up some version control first.
So, assuming we are talking about files in which your users will be adding lines, a workaround for that may consist of setting up some scheduled tasks to check the line numbers of those files "at a given time" (as you said) and compare that value to a previous result, and then if there are more than 20 additional lines than from the last value, copy the files elsewhere.
First things first, counting the lines of your files is something you have already mentioned and it is right: I will propose using wc -l too.
Once here you will need two things: one place (typically a file) to periodically save the number of lines of your files at a given time, and one trigger that starts copying the files in case more than 20 lines have been added.
So for example, in this case you can set up a cron job, like this (i.e. to run every hour):
0 */1 * * * cat ${FILE} |wc -l > /tmp/${FILE}_counter
That one will check the number of lines of a given file and send the output to a temporary file that we will be using soon. In case you have multiple files you can easily script that and make a loop, like this:
#!/bin/bash
for FILE in file1 file2 file3; do
    cat ${FILE} | wc -l > /tmp/${FILE}_counter
done
Don't forget to add the path to the script in the cron job if you do that this way. After that, you will have something like this in your /tmp directory:
/tmp/file1_counter
/tmp/file2_counter
/tmp/file3_counter
...
At this point you only need a trigger, which can be another script, to compare the current number of lines of a file at a given time and start copying it elsewhere in case there are more than 20 additional lines than in the previous check. Consider this:
#!/bin/bash
LAST_VALUE=$(cat /tmp/${FILE}_counter)
CURRENT_VALUE=$(cat ${FILE} | wc -l)
if [ ${CURRENT_VALUE} -gt $(expr ${LAST_VALUE} + 20) ]; then
    : # Your cp stuff here
fi
Of course you can add a loop here too in case of handling multiple files:
#!/bin/bash
for FILE in file1 file2 file3; do
    LAST_VALUE=$(cat /tmp/${FILE}_counter)
    CURRENT_VALUE=$(cat ${FILE} | wc -l)
    if [ ${CURRENT_VALUE} -gt $(expr ${LAST_VALUE} + 20) ]; then
        : # Your cp stuff here
    fi
done
Then you only have to add this last script to a cron job too, and you should be done.
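For example (the script path and schedule here are only illustrative):
30 * * * * /path/to/check_lines.sh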
Hope you find this useful.
You can use diff to compare 2 files. With the -u0 option, it will show you the added/deleted/modified lines, prefixed with "+" or "-". You can then count lines starting with "+" or "-" with grep and its -c option.
So for the number of lines added or modified, which begin with "+" :
diff -u0 $file_before $file_after | grep -c '^+'
and this will count the deleted or modified lines, which start with "-" :
diff -u0 $file_before $file_after | grep -c '^-'
Note that there are 2 header lines in this format (the "+++" and "---" file names), which also start with "+" and "-", so you may want to take that into account.
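A sketch of one way to skip those two header lines before counting (the backup destination is hypothetical):
added=$(diff -u0 "$file_before" "$file_after" | tail -n +3 | grep -c '^+')
removed=$(diff -u0 "$file_before" "$file_after" | tail -n +3 | grep -c '^-')
if [ "$added" -gt 20 ] || [ "$removed" -gt 20 ]; then
    cp "$file_after" /path/to/backup/   # hypothetical destination directory
fi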

Split large gzip file while adding header line to each split

I want to automate the process of splitting a large gzip file into smaller gzip files, each split containing 10000000 lines (the last split will hold the leftover lines and will have fewer than 10000000).
Here is how I am doing it at the moment; I repeat the command manually, calculating the number of leftover lines each time:
gunzip -c large_gzip_file.txt.gz | tail -n +10000001 | head -n 10000000 > split1_.txt
gzip split1_.txt
gunzip -c large_gzip_file.txt.gz | tail -n +20000001 | head -n 10000000 > split2_.txt
gzip split2_.txt
I continue this by repeating as shown all the way until the end. Then I open these and manually add the header line. How can this be automated?
I searched online and saw awk and other solutions, but didn't find any for gzip or for a scenario similar to this one.
I would approach it like this:
gunzip the file
use head to get the first line and save it off to another file
use tail to get the rest of the file and pipe it to split to produce files of 10,000,000 lines each
use sed to insert the header into each file, or just cat the header with each file
gzip each file
You'll want to wrap this in a script or a function to make it easier to rerun at a later time. Here's an attempt at a solution, lightly tested:
#!/bin/bash
set -euo pipefail
LINES=10000000
file=$(basename $1 .gz)
gunzip -k ${file}.gz
head -n 1 $file >header.txt
tail -n +2 $file | split -l $LINES - ${file}.part.
rm -f $file
for part in ${file}.part.* ; do
    [[ $part == *.gz ]] && continue   # ignore partial results of previous runs
    gzip -c header.txt $part > ${part}.gz
    rm -f $part
done
rm -f header.txt
To use:
$ ./splitter.sh large_gzip_file.txt.gz
I would further improve this by using a temporary directory (mktemp -d) for the intermediate files and ensuring the script cleans up after itself at exit (with a trap). Ideally it would also sanity check the arguments, possibly accepting a second argument indicating the number of lines per part, and inspect the contents of the current directory to ensure it doesn't clobber any preexisting files.
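A sketch of that cleanup idea (the variable names here are illustrative, not part of the script above):
tmpdir=$(mktemp -d)                  # private scratch space for header.txt and the parts
trap 'rm -rf "$tmpdir"' EXIT         # removed no matter how the script exits
head -n 1 "$file" > "$tmpdir/header.txt"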
I don't think awk is meant for splitting a gzip file into smaller pieces; it's for text processing. Below is my way to solve your issue; hope it helps:
step1:
gunzip -c large_gzip_file.txt.gz | split -l 10000000 - split_file_
The split command can split a file into pieces; you can specify the size of each piece and also provide a prefix for all the pieces.
The large gzip file will be split into multiple files whose names start with the prefix split_file_.
step2:
save header content into file header_file.csv
step3:
for f in split_file*; do
    cat header_file.csv $f > $f.new
    mv $f.new $f
done
Here I assume you are working in the directory containing the split files; if not, replace split_file* with an absolute path, for example /path/to/split_file*. The loop iterates over all files matching split_file* and adds the header content to the beginning of each one.
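If the pieces need to end up gzip-compressed again like the original (an assumption; the steps above leave them as plain text), a final pass could be:
for f in split_file_*; do
    gzip "$f"   # produces split_file_aa.gz, split_file_ab.gz, …
done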

rename images using terminal script in linux

In my /home/myself/Pictures/travels folder on the Fedora 17 linux I have files IMG_2516.JPG, IMG_2519.JPG, IMG_2520.JPG, IMG_2525.JPG, IMG_2528.JPG.
I would like to rename them one by one from left to right such that IMG_2516.JPG becomes 01.JPG, IMG_2519.JPG - 02.JPG, IMG_2520.JPG - 03.JPG, IMG_2525.JPG - 04.JPG, IMG_2528.JPG - 05.JPG.
Notice that neighbouring numbers can be close (as 2519 and 2520) and distant (2516 and 2519), but always increase.
How can I write a terminal script to replace this routine? These numbers are just an example; there are many more files, and at the moment I can only rename them manually (very time-consuming).
If the images all have the same number of digits:
I=1
for F in IMG_*.JPG; do
    mv "$F" IMG_$(printf "%02d" $I).JPG
    I=$(( I + 1 ))
done
Otherwise,
LIST=$(mktemp)
find . -maxdepth 1 -iname "*.jpg" > $LIST
sort -n -o $LIST $LIST
I=1
cat $LIST | while read F; do
    mv "$F" IMG_$(printf "%02d" $I).JPG
    I=$(( I + 1 ))
done
rm "$LIST"
The first one means: for each image, move it to IMG_NN.JPG, where NN is the current value of I padded to two digits, then increase I by 1.
The second one means:
make a temporary file
find all of the JPG files in the directory (and not subdirectories, case-insensitive),
save one per line in the temporary file.
sort them by their numerical ordering (-n) and write back to the temporary file (-o)
send the contents of the file to the following:
-- while there is a next line,
-- -- store it in F
-- -- move the file with that name to I, where I is represented with two digits and padded with leading zeros, prefixed with IMG_ and postfixed with .JPG
-- -- increase I
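If the digit counts do vary, another option (a sketch, assuming GNU find and sort, and that the bare 01.JPG-style names from the question are wanted) is a version sort:
i=1
find . -maxdepth 1 -iname '*.jpg' -printf '%f\n' | sort -V | while read -r f; do
    mv -n "$f" "$(printf '%02d' "$i").JPG"   # -n refuses to overwrite an existing target
    i=$((i + 1))
done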

Grabbing every 4th file

I have 16,000 jpgs from a webcam screen grabber that I let run for a year pointing into the back yard. I want to find a way to grab every 4th image so that I can put them into another directory and later turn them into a movie. Is there a simple bash script or other way under Linux that I can do this?
They are named like so:
frame-44558.jpg
frame-44559.jpg
frame-44560.jpg
frame-44561.jpg
Thanks from a newb needing help.
Seems to have worked.
A couple of errors in my original post: there were actually 280,000 images, and the naming was:
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163405.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163505.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163605.jpg
I ran:
cp $(ls | awk '{nr++; if (nr % 10 == 0) print $0}') ../newdirectory/
Which appears to have copied the images. 70-900 per day from the looks of it.
Now I'm running
mencoder mf://*.jpg -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
I'll let you know how the movie works out.
UPDATE: The movie did not work.
It only has images from 2007 in it, even though the directory has 2008 as well.
webcam_2008-02-17_101403.jpg webcam_2008-03-27_192205.jpg
webcam_2008-02-17_102403.jpg webcam_2008-03-27_193205.jpg
webcam_2008-02-17_103403.jpg webcam_2008-03-27_194205.jpg
webcam_2008-02-17_104403.jpg webcam_2008-03-27_195205.jpg
How can I modify my mencoder line so that it uses all the images?
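One thing that may help, hedged since it depends on the MPlayer build at hand: the mf:// scheme can also take an explicit list of files (mf://@listfile), which avoids any wildcard handling limits and pins down the ordering:
ls | grep -i '\.jpg$' | sort > frames.txt     # explicit, sorted list of every frame
mencoder mf://@frames.txt -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi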
One simple way is:
$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ mv $(ls | awk '{nr++; if (nr % 4 == 0) print $0}') destdir
Create a script move.sh which contains this:
#!/bin/sh
mv $4 ../newdirectory/
Make it executable and then do this in the folder:
ls *.jpg | xargs -n 4 ./move.sh
This takes the list of filenames, passes four at a time into move.sh, which then ignores the first three and moves the fourth into a new folder.
This will work even if the numbers are not exactly in sequence (e.g. if some frame numbers are missing, then using mod 4 arithmetic won't work).
As suggested, you should use
seq -f 'frame-%g.jpg' 1 4 number-of-frames
to generate the list of filenames since 'ls' will fail on 280k files. So the final solution would be something like:
for f in `seq -f 'frame-%g.jpg' 1 4 number-of-frames` ; do
    mv $f destdir/
done
seq -f 'frame-%g.jpg' 1 4 number-of-frames
…will print the names of the files you need.
An easy way in Perl (probably easily adaptable to bash) is to glob the filenames into an array, then get the sequence number and remove those that are not divisible by 4.
Something like this will print the files you need:
ls -1 /path/to/files/ | perl -e 'while (<STDIN>) {($seq)=/(\d*)\.jpg$/; print $_ if $seq && $seq % 4 ==0}'
You can replace the print by a move...
This will work if the files are numbered in sequence, even if the number of digits is not constant (e.g. file_9.jpg followed by file_10.jpg).
Given masto's caveats about sorting:
ls | sed -n '1~4 p' | xargs -i mv {} ../destdir/
The thing I like about this solution is that everything's doing what it was designed to do, so it feels unixy to me.
Just iterate over a list of files:
files=( frame-*.jpg )
i=0
while [[ $i -lt ${#files[@]} ]] ; do
    cur_file=${files[$i]}
    mungle_frame $cur_file
    i=$( expr $i + 4 )
done
This is pretty cheesy, but it should get the job done. Assuming you're currently cd'd into the directory containing all of your files:
mkdir ../outdir
ls | sort -n | while read fname; do mv "$fname" ../outdir/; read; read; read; done
The sort -n is there assuming your filenames don't all have the same number of digits; otherwise ls will sort in lexical order where frame-123.jpg comes before frame-4.jpg and I don't think that's what you want.
Please be careful, back up your files before trying my solution, etc. I don't want to be responsible for you losing a year's worth of data.
Note that this solution does handle files with spaces in the name, unlike most of the others. I know that wasn't part of the sample filenames, but it's easy to write shell commands that don't handle spaces safely, so I wanted to do that in this example.
brace expansion {m..n..s} is more efficient than seq. AND it allows a bit of output formatting:
$ echo {0000..0010..2}
0000 0002 0004 0006 0008 0010
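Applied to this question's frame names, a sketch (assuming bash 4+ for the step syntax; 44558 and 44578 are placeholder first and last frame numbers):
for f in frame-{44558..44578..4}.jpg; do
    [ -e "$f" ] && mv "$f" ../destdir/   # the existence test skips gaps in the numbering
done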
Postscript: curl can do the same stepping itself. If you only want every fourth (nth) numbered image, give curl a step counter too. This example range goes from 0 to 100 with an increment of 4 (n):
curl -O "http://example.com/[0-100:4].png"
