How to shuffle multiple files and save different files? [duplicate] - shell

This question already has answers here:
Shuffle multiple files in same order
(3 answers)
Closed 4 years ago.
I have three files as:
file1 file2 file3
A  B  C
D  E  F
G  H  I
The lines in each file relate to each other.
Thus, I want to generate shuffled files as:
file1.shuf file2.shuf file3.shuf
G     H    I
D     E    F
A     B    C
I often face this kind of problem, and I always write a small script in Ruby or Python, but I thought it could be solved with some simple shell commands.
Could you suggest any simple ways to do this by shell commands or a script?

Here’s a simple script that does what you want. Specify all the input
files on the command line. It assumes all of the files have the same
number of lines.
First it creates a list of numbers and shuffles it. Then it combines
those numbers with each input file, sorts that, and removes the numbers.
Thus, each input file is shuffled in the same order.
#!/bin/bash
# Temp file to hold shuffled order
shuffile=$(mktemp)
# Create shuffled order
lines=$(wc -l < "$1")
digits=$(printf "%d" $lines | wc -c)
fmt=$(printf "%%0%d.0f" $digits)
seq -f "$fmt" "$lines" | shuf > "$shuffile"
# Shuffle each file in same way
for fname in "$@"; do
paste "$shuffile" "$fname" | sort | cut -f 2- > "$fname.shuf"
done
# Clean up
rm "$shuffile"
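As a quick alternative sketch (assuming GNU coreutils and that the lines themselves contain no tab characters; combined.tmp is just an example name), you can paste the files together, shuffle the combined lines once, and split the columns back out:
# Combine the files column-wise (tab-separated), shuffle once,
# then split each column back into its own .shuf file.
paste file1 file2 file3 | shuf > combined.tmp
cut -f1 combined.tmp > file1.shuf
cut -f2 combined.tmp > file2.shuf
cut -f3 combined.tmp > file3.shuf
rm combined.tmp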

Related

How to produce multiple readlength.tsv files at once from multiple fastq files?

I have 16 fastq files under different directories and need to produce a readlength.tsv for each one separately. I have a script to produce a readlength.tsv; this is the script I use for a single file:
zcat ~/proje/project/name/fıle_fastq | paste - - - - | cut -f1,2 | while read readID sequ;
do
len=`echo $sequ | wc -m`
echo -e "$readID\t$len"
done > ~/project/name/fıle1_readlength.tsv
I can produce each readlength file one by one, but it will take a long time. I want to produce them all at once, so I created a list of these fastq files, but I couldn't write a loop to produce the readlength.tsv files from all 16 fastq files at once.
I would appreciate it if you could help me.
Assuming a file list.txt contains the 16 file paths such as:
~/proje/project/name/file1_fastq
~/proje/project/name/file2_fastq
..
~/path/to/the/fastq_file16
Then would you please try:
#!/bin/bash
while IFS= read -r f; do # "f" is assigned to each fastq filename in "list.txt"
mapfile -t ary < <(zcat "$f") # assign "ary" to the array of lines
echo -e "${ary[0]}\t${#ary[1]}" # ${ary[0]} is the id and ${#ary[1]} is the length of sequence
done < list.txt > readlength.tsv
As the fastq file format contains the id on the 1st line and the sequence on the 2nd line, the bash built-in mapfile is well suited to handling them.
As a side note, the letter ı in your code looks like a non-ascii character.
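If you want one line per read (not just the first record) and a separate output file per fastq, here is a minimal sketch reusing the paste - - - - idea from your own script; the output naming via ${f%_fastq}_readlength.tsv is only an assumption, so adjust it to your real file names:
#!/bin/bash
# For each fastq path in list.txt, write "<id>\t<sequence length>" for every
# read (fastq records are 4 lines each) into a per-file readlength.tsv.
while IFS= read -r f; do
out="${f%_fastq}_readlength.tsv" # assumed naming convention
zcat "$f" | paste - - - - | awk -F'\t' '{print $1 "\t" length($2)}' > "$out"
done < list.txt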

Executing a bash loop script from a file

I am trying to do this in Unix. Let's say, for example, that I have five files named after dates, and each of those files contains thousands of numerical values (six- to ten-digit numbers). Now, let's say I also have a bunch of numerical values and I want to know which value belongs to which file. I am doing it the hard way, as below, but how do I put all my values in a file and just run a loop from there?
FILES:
20170101
20170102
20170103
20170104
20170105
Code:
for i in 5555555 67554363 564324323 23454657 666577878 345576867; do
echo $i; grep -l $i 201701*;
done
Or, why loop at all? If you have a file containing all your numbers (say numbers.txt), you can find out in which date file each is contained, and on what line, with a simple
grep -nH -w -f numbers.txt 201701*
Here the -f option simply tells grep to use the values contained in the file numbers.txt to search in each of the files matching 201701*. The -nH options list the line number and filename associated with each match, respectively. And as Ed points out below, the -w option ensures grep only selects lines containing the whole word sought.
You can also do it with a while loop reading from the file, if you create it as @Barmar suggested:
while read -r i; do
...
done < numbers.txt
Put the values in a file numbers.txt and do:
for i in $(cat numbers.txt); do
...
done
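For completeness, a sketch of what the while-loop body above might look like with the grep from the question filled in:
# Read each number from numbers.txt and report which date files contain it
# (grep -l prints only the matching filenames, -w matches the whole number).
while read -r i; do
echo "$i"
grep -l -w "$i" 201701*
done < numbers.txt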

Keep text file rows by line number in bash [duplicate]

This question already has answers here:
Print lines indexed by a second file
(4 answers)
Closed 8 years ago.
I have two files. The first file (called k.txt) looks like this:
lineTTY
lineRTU
lineERT
...further lines like this...
The other file (called w.txt) contains indices of rows which shall be kept. It looks like:
2
9
12
The indices in the latter file are sorted. Is there a way to do that quickly in bash, as my file is large (over 1 million rows)?
Every line is a row of a matrix in a text file, and only the specific rows listed in the other file should remain in the matrix.
I think what you need here is:
cat w.txt | xargs -i{} sed -n '{}p' k.txt
If you also need to sort the index file first, then:
sort -g w.txt | xargs -i{} sed -n '{}p' k.txt
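Since k.txt has over a million rows, re-running sed once per index can get slow; a single-pass awk alternative may be faster (a sketch, assuming the numbers in w.txt are 1-based line numbers; kept.txt is just an example output name):
# First pass (NR==FNR) stores the wanted line numbers from w.txt;
# second pass prints only those line numbers (FNR) from k.txt.
awk 'NR==FNR {keep[$1]; next} FNR in keep' w.txt k.txt > kept.txt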

How to remove the path part from a list of files and copy it into another file?

I need to accomplish the following things with bash scripting in FreeBSD:
Create a directory.
Generate 1000 unique files whose names are taken from other random files in the system.
Each file must contain information about the original file whose name it has taken - name and size without the original contents of the file.
The script must show information about the speed of its execution in ms.
What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't work out how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it, but somehow I can't get it to work, and I don't know how to do the other tasks either...
[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you never can know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and - because perl is pretty portable - make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]
Creating the file listing
You can do a lot with GNU find(1). The following would create a single file with the file names and three tab-separated columns of the data you want (name of file, location, size in kilobytes).
find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'
I'm assuming that you want to be random across all filenames (i.e. no links) so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has ~ 300K files and not much memory, but creating the complete listing still only took a couple minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq, you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example, from the shell prompt:
~/# touch `cat tmp.txt |cut -f1`
~/# for i in `cat tmp.txt|cut -f1`; do cat tmp.txt | grep $i > $i.dat ; done
I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them; you don't have to do that: just leave off the extension ($i > $i).
The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag so it won't be available on OS X or BSD find(1) (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool. Coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing as Florin Stingaciu shows above.
#!/bin/sh
# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum
for file in `cat tmp.txt`
do
name=`basename $file`
size=`wc -c $file |awk '{print $1}'`
# Uncomment the next line to see the values on STDOUT
# printf "Location: $name \nSize: $size \n"
# Uncomment the next line to put data into the respective .dat files
# printf "Location: $file \nSize: $size \n" > $name.dat
done
# vim: ft=sh
If you've been following this far, you'll realize that this will create a lot of files - on my workstation it would create 800k .dat files, which is not what we want! So, how do we randomly select 1000 files from our listing of 800k for processing? There are several ways to go about it.
Randomly selecting from the file listing
We have a listing of all the files on the system (!). Now in order to select 1000 files we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit of the line number to select by generating a random number using the cool od technique you saw above - it's so cool and cross-platform that I have this aliased in my shell ;-) - then performing modulo division (%) on it using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow! While looking for a way to speed up awk and sed I came across an excellent trick using dd from Alex Lines that searches the file by bytes (instead of lines) and translates the result into a line using sed or awk.
See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain) - perhaps because my locale is LC_ALL=en_US.UTF-8 - dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
So after the above caveats and hoping it works on more than two platforms, here's my attempt at solving the problem:
#!/bin/sh
IFS='
'
# We create tmp.txt with
# find / -type f > tmp.txt # tweak as needed.
#
files="tmp.txt"
# Get the size in bytes and the maximum line length (x10) for later
bytesize=`wc -c < $files`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length; y = $0} }END{print x*10}' $files`
# A function to generate a random number modulo the
# number of bytes in the file. We'll use this to find a
# random location in our file where we can grab a line
# using dd and sed.
genrand () {
echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc
}
rm -f randlist.txt
i=1
while [ $i -le 1000 ]
do
# This probably works but is way too slow: sed -n `genrand`p $files
# Instead, use Alex Lines' dd seek method:
dd if=$files skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null |awk 'NR==2 {print;exit}'>> randlist.txt
true $((i=i+1)) # Bourne shell equivalent of $i++ iteration
done
for file in `cat randlist.txt`
do
name=`basename $file`
size=`wc -c <"$file"`
echo -e "Location: $file \n\n Size: $size" > $name.dat
done
# vim: ft=sh
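As a side note, if GNU shuf is available (standard on Linux, installable as part of coreutils on OS X/BSD), the whole random-selection loop above can be replaced by a one-liner; a sketch:
# Randomly pick 1000 lines from the listing in one shot (GNU coreutils shuf)
shuf -n 1000 tmp.txt > randlist.txt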
What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list
I'm going to assume that there is a file that holds the full path to each file, one per line (FULL_PATH_TO_LIST_FILE). Considering there's not much to the statistics part of this process, I omitted that; you can add your own, however.
cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
for file_path in `cat FULL_PATH_TO_LIST_FILE`
do
## This extracts only the file name from the path
file_name=`basename $file_path`
## This grabs the files size in bytes
file_size=`wc -c < $file_path`
## Create the file and place info regarding original file within new file
echo -e "$file_name \nThis file is $file_size bytes "> $file_name
done
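As for the requirement to report the script's execution time in milliseconds, which the snippets above leave out, here is a minimal sketch assuming bash 5+ where $EPOCHREALTIME is available (on older shells you could simply wrap the invocation in the time builtin instead):
#!/usr/bin/env bash
# EPOCHREALTIME looks like "1612345678.123456"; removing the dot yields microseconds.
start=${EPOCHREALTIME/./}
# ... the file-creation loop from above goes here ...
end=${EPOCHREALTIME/./}
# Convert elapsed microseconds to milliseconds
echo "Elapsed: $(( (end - start) / 1000 )) ms"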

Why is sort -k not working all the time?

I now have a script that puts a list of files into two separate arrays:
First, I get a file list from a ZIP file and fill FIRST_Array() with it. Second, I get a file list from a control file within the ZIP file and fill SECOND_Array() with it.
while read length date time filename
do
FIRST_Array+=( "$filename" )
echo "$filename" >> FIRST.report.out
done < <(/usr/bin/unzip -qql AAA.ZIP |sort -g -k12 -t~)
Third, I compare both arrays like so:
diff -q <(printf "%s\n" "${FIRST_Array[@]}") <(printf "%s\n" "${SECOND_Array[@]}") | wc -l
I can tell that diff fails because I output each array to a file: FIRST.report.out and SECOND.report.out are simply not sorted properly.
1) FIRST.report.out (what's inside the ZIP file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
2) SECOND.report.out (what's inside the ZIP's control file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
Using sort -k12 -t~ made sense, since ~ is the delimiter for the file's date field (12th position), but it is not working consistently. Adding -g made no difference.
The sorting gets worse when my script processes bigger ZIP files. Why is sort -k not working all the time? How can I sort both arrays?
You don't really have a usable k12 in your data: your separator is '~' in your spec, but you have ~ and sometimes - in your data.
You can check with:
head -n 1 your.data.file | sed -e "s/~/\n/g"
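If the goal is simply to check that both lists contain the same entries, one option (a sketch based on the diff from the question) is to apply the same default whole-line sort to both sides right before comparing, instead of relying on sort -k12 during extraction:
# Sort both arrays with the same default collation before diffing them
diff -q <(printf "%s\n" "${FIRST_Array[@]}" | sort) <(printf "%s\n" "${SECOND_Array[@]}" | sort) | wc -l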
The business requirements are going to change, so sorting is no longer required in this case. This thread can be closed. Thank you.

Resources