How to sort files in paste command with 500 files csv - shell

My question is similar to How to sort files in paste command?, which has been solved.
I have 500 csv files (daily rainfall data) in a folder with naming convention chirps_yyyymmdd.csv. Each file has only 1 column (rainfall value) with 100,000 rows, and no header. I want to merge all the csv files into a single csv in chronological order.
When I tried the script ls -v file_*.csv | xargs paste -d, with only 100 csv files, it worked. But when I tried it with 500 csv files, I got this error: paste: chirps_19890911.csv: Too many open files
How can I handle the above error?
As a quick workaround I could divide the csv files into two folders and run the above script on each, but the problem is that I have 100 folders, each containing 500 csv files.
Thanks
Sample data and expected result: https://www.dropbox.com/s/ndofxuunc1sm292/data.zip?dl=0

You can do it with gawk like this...
Simply read all the files in, one after the other and save them into an array. The array is indexed by two numbers, firstly the line number in the current file (FNR) and secondly the column, which I increment each time we encounter a new file in the BEGINFILE block.
Then, at the end, print out the entire array:
gawk 'BEGINFILE{ ++col }              # New file, increment column number
      { X[FNR,col]=$0; rows=FNR }     # Save datum into array X, indexed by current record number and column
      END {
          for(r=1;r<=rows;r++){
              comma=","
              for(c=1;c<=col;c++){
                  if(c==col)comma=""
                  printf("%s%s",X[r,c],comma)
              }
              printf("\n")
          }
      }' chirps*
The comma in X[FNR,col] inserts gawk's built-in SUBSEP between the two indices, so the row and column numbers cannot run together and collide (see the quick check at the end of this answer). I am using gawk because BEGINFILE is useful for incrementing the column number.
Save the above in your HOME directory as merge. Then start a Terminal and, just once, make it executable with the command:
chmod +x merge
Now change directory to where your chirps are with a command like:
cd subdirectory/where/chirps/are
Now you can run the script with:
$HOME/merge
The output will rush past on the screen. If you want it in a file, use:
$HOME/merge > merged.csv
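As an aside on the indexing: concatenating the two numbers directly would collide, because row 11/column 1 and row 1/column 11 both become the key "111"; the comma form keeps the keys distinct by inserting SUBSEP between them. A quick check (illustration only):
gawk 'BEGIN { A[11 1]="a"; A[1 11]="b"; print length(A) }'   # prints 1: both keys collapse to "111"
gawk 'BEGIN { B[11,1]="a"; B[1,11]="b"; print length(B) }'   # prints 2: SUBSEP keeps them apart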

First concatenate everything into one file without pasting, and turn that file into a single comma-separated line with tr:
cat */chirps_*.csv | tr "\n" "," > long.csv
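Note what this produces on tiny inputs (illustration only): all values end up on one long comma-joined line, with a trailing comma, rather than in side-by-side columns:
printf '1\n2\n' > a.csv; printf '3\n4\n' > b.csv
cat a.csv b.csv | tr "\n" ","     # -> 1,2,3,4,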

If the goal is a file with 100,000 lines and 500 columns then something like this should work:
paste -d, chirps_*.csv > chirps_500_merge.csv
Additional code can be used to sort the chirps_... input files into any desired order before pasting.
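For example, because the yyyymmdd suffix sorts lexicographically in date order, an explicit sort is mostly a safeguard; a minimal sketch (this still assumes the file count stays below the open-file limit discussed in the next answer):
printf '%s\n' chirps_*.csv |   # the shell already expands the glob in sorted order
    sort |                     # explicit sort kept as a safeguard
    xargs paste -d, > chirps_500_merge.csv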

The error comes from ulimit, from man ulimit:
-n or --file-descriptor-count The maximum number of open file descriptors
On my system ulimit -n returns 1024.
Happily we can paste the paste output, so we can chain it.
find . -type f -name 'file_*.csv' |
    sort |
    xargs -n "$(( $(ulimit -n) - 4 ))" sh -c '
        tmp=$(mktemp)
        paste -d, "$@" >"$tmp"
        echo "$tmp"
    ' -- |
    xargs sh -c '
        paste -d, "$@"
        rm "$@"
    ' --
Don't parse ls output.
Once we move from parsing ls output to find, we can find all the files and sort them reliably.
The first xargs takes a little under ulimit -n files at a time (leaving a few descriptors free for stdin, stdout and stderr), pastes them into a temporary file, and prints that temporary file's name.
The second xargs does the same with the temporary files, and also removes them afterwards.
As the total count of files is 100*500 = 50,000, which is smaller than 1024*1024, we can get away with a single chaining pass.
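A rough way to sanity-check this on your own machine (illustration only, based on the limit reported by ulimit -n):
limit=$(ulimit -n)
total=$((100 * 500))                                  # 50,000 files overall
echo "temporary files after the first pass: $(( (total + limit - 1) / limit ))"
echo "fits in one chaining pass (1 = yes): $(( total <= limit * limit ))"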
Tested against test data generated with:
seq 1 2000 |
xargs -P0 -n1 -t sh -c '
seq 1 1000 |
sed "s/^/ $RANDOM/" \
>"file_$(date --date="-${1}days" +%Y%m%d).csv"
' --
The problem seems to be much like foldl with maximum size of chunk to fold in one pass. Basically we want paste -d, <(paste -d, <(paste -d, <1024 files>) <1023 files>) <rest of files> that runs kind-of-recursively. With a little fun I came up with the following:
func() {
    paste -d, "$@"
}
files=()
tmpfilecreated=0
# read filenames
while IFS= read -r line; do
    files+=("$line")
    # if the limit of 1024 files is reached
    if ((${#files[@]} == 1024)); then
        tmp=$(mktemp)
        func "${files[@]}" >"$tmp"
        # remove the last tmp file
        if ((tmpfilecreated)); then
            rm "${files[0]}"
        fi
        tmpfilecreated=1
        # start with fresh files list
        # with only the tmp file
        files=("$tmp")
    fi
done
func "${files[@]}"
# remember to clear tmp file!
if ((tmpfilecreated)); then
    rm "${files[0]}"
fi
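For context, the loop reads the filenames from standard input; assuming it is saved as a script (hypothetically named merge_chunks.sh), it could be driven like this:
find . -type f -name 'chirps_*.csv' | sort | bash merge_chunks.sh > merged.csv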
I guess readarray/mapfile could be faster, and result in a bit clearer code:
func() {
    paste -d, "$@"
}
tmp=()
tmpfilecreated=0
while readarray -t -n1023 files && ((${#files[@]})); do
    newtmp=$(mktemp)
    func "${tmp[@]}" "${files[@]}" >"$newtmp"
    # remove the previous tmp file, keep the new one
    if ((tmpfilecreated)); then
        rm "${tmp[0]}"
    fi
    tmp=("$newtmp")
    tmpfilecreated=1
done
func "${tmp[@]}" "${files[@]}"
if ((tmpfilecreated)); then
    rm "${tmp[0]}"
fi
PS. "I want to merge all the csv files into a single csv in chronological order." Wouldn't that be just cat? Right now each column represents one day.
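If a single-column, row-wise chronological merge really is what is wanted, plain cat over the sorted names would be enough (sketch, using the chirps_yyyymmdd.csv naming from the question):
find . -type f -name 'chirps_*.csv' | sort | xargs cat > merged_rows.csv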

You can try this Perl one-liner. It will work for any number of files matching *.csv in the current directory.
$ ls -1 *csv
file_1.csv
file_2.csv
file_3.csv
$ cat file_1.csv
1
2
3
$ cat file_2.csv
4
5
6
$ cat file_3.csv
7
8
9
$ perl -e ' BEGIN { while($f=glob("*.csv")) { $i=0; open($FH,"<$f"); while(<$FH>){ chomp; @t=@{$kv{$i}}; push(@t,$_); $kv{$i++}=[@t]; } } print join(",",@{$kv{$_}})."\n" for(0..$i-1) } '
1,4,7
2,5,8
3,6,9
$

Related

looping with grep over several files

I have multiple files /text-1.txt, /text-2.txt ... /text-20.txt
and what I want to do is to grep for two patterns and stitch them into one file.
For example:
I have
grep "Int_dogs" /text-1.txt > /text-1-dogs.txt
grep "Int_cats" /text-1.txt> /text-1-cats.txt
cat /text-1-dogs.txt /text-1-cats.txt > /text-1-output.txt
I want to repeat this for all 20 files above. Is there an efficient way in bash/awk, etc. to do this?
#!/bin/bash
count=1

next () {
    [[ "${count}" -lt 21 ]] && main
    [[ "${count}" -eq 21 ]] && exit 0
}

main () {
    file="text-${count}"
    grep "Int_dogs" "${file}.txt" > "${file}-dogs.txt"
    grep "Int_cats" "${file}.txt" > "${file}-cats.txt"
    cat "${file}-dogs.txt" "${file}-cats.txt" > "${file}-output.txt"
    count=$((count+1))
    next
}

next
grep has some features you seem not to be aware of:
grep can be launched on lists of files, but the output will be different:
For a single file, the output will only contain the filtered line, like in this example:
cat text-1.txt
I have a cat.
I have a dog.
I have a canary.
grep "cat" text-1.txt
I have a cat.
For multiple files, the filename is also shown in the output. Let's add another text file:
cat text-2.txt
I don't have a dog.
I don't have a cat.
I don't have a canary.
grep "cat" text-*.txt
text-1.txt:I have a cat.
text-2.txt:I don't have a cat.
grep can be extended to search for multiple patterns in files, using the -E switch. The patterns need to be separated using a pipe symbol:
grep -E "cat|dog" text-1.txt
I have a cat.
I have a dog.
(Summarising the previous two points, and noting that grep -E is equivalent to egrep:)
egrep "cat|dog" text-*.txt
text-1.txt:I have a cat.
text-1.txt:I have a dog.
text-2.txt:I don't have a dog.
text-2.txt:I don't have a cat.
So, in order to redirect this to an output file, you can simply say:
egrep "cat|dog" text-*.txt >text-1-output.txt
Assuming you're using bash, try this:
for i in $(seq 1 20) ;do rm -f text-${i}-output.txt ; grep -E "Int_dogs|Int_cats" text-${i}.txt >> text-${i}-output.txt ;done
Details
This one-line script does the following:
It assumes the original files follow this naming syntax:
text-<INTEGER_NUMBER>.txt - Example: text-1.txt, text-2.txt, ... text-100.txt.
It loops from 1 to <N>, where <N> is the number of files you want to process.
Note: the rm -f text-${i}-output.txt command runs first and removes any existing output file, to ensure that only a freshly built output file is left at the end of the process.
grep -E "Int_dogs|Int_cats" text-${i}.txt matches either string in the original file, and >> text-${i}-output.txt appends every matched line to an output file carrying the same number as the original file. Example: if the integer number in the original file is 5 (text-5.txt), then text-5-output.txt will be created and will contain the matched lines (if any). A glob-based variant is sketched below.
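An equivalent sketch that avoids hard-coding the count by looping over the glob directly (assuming the text-N.txt names from the question; the case guard skips the intermediate and previously generated files):
for f in text-*.txt; do
    case $f in *-dogs.txt|*-cats.txt|*-output.txt) continue ;; esac
    grep -E "Int_dogs|Int_cats" "$f" > "${f%.txt}-output.txt"
done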

BASH: File sorting according to file name

I need to sort 12000 files into 1000 groups according to their names and create, for each group, a new folder containing the files of that group. The name of each file is given in a multi-column format (with _ as separator), where the second column varies from 1 to 12 (the number of the part) and the last column ranges from 1 to 1000 (the number of the system), indicating that 1000 different systems (last column) were initially split into 12 separate parts (second column).
Here is an example of a small subset based on 3 systems divided into 12 parts, 36 files in total.
7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg
7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg
7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg
I need to group these files, collapsing over the second column of their names (01, 02, 03 .. 12), thus creating 1000 folders, each containing the 12 files of one system, in the following manner:
Folder1, name: 7000_lig_cne_1, contains 12 files: 7000_{01 to 12}_lig_cne_1.dlg
Folder2, name: 7000_lig_cne_2, contains 12 files: 7000_{01 to 12}_lig_cne_2.dlg
...
Folder1000, name: 7000_lig_cne_1000, contains 12 files: 7000_{01 to 12}_lig_cne_1000.dlg
Assuming that all *.dlg files are present within the same dir, I propose the following bash loop workflow, which only lacks a sorting function (sed, awk?):
#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"
#name of the output
output=${home}/sub_folders_to_analyse
#now here all magic comes
rm -r ${output}
mkdir ${output}
# sed solution
for i in ${FILES}/*.dlg   # define this better to suit your needs
do
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # move the file to proper dir
    mkdir -p ${output}/"${experiment}_lig$n"
    cp "$i" ${output}/"${experiment}_lig$n"
done
Note: here I set the beginning of each folder name to ${experiment} and append the number from the final column, $n, at the end. Would it instead be possible to set the name of each new folder automatically, based on the names of the copied files? Manually it could be achieved by skipping the second column in the name of the folder:
cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987
Iterate over files. Extract the destination directory name from the filename. Move the file.
for i in *.dlg; do
    # extract last number with your favorite tool
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # move the file to proper dir
    echo mkdir -p "folder$n"
    echo mv "$i" "folder$n"
done
Notes:
Do not use upper case variables in your scripts. Use lower case variables.
Remember to quote variable expansions.
Check your scripts with http://shellcheck.net
Tested on repl
Update: for the OP's folder-naming convention:
for i in *.dlg; do
    f=${i%.dlg}                                   # strip the extension first
    foldername="$HOME/output/${f%%_*}_${f#*_*_}"  # e.g. 7000_lig_cne_2
    echo mkdir -p "$foldername"
    echo mv "$i" "$foldername"
done
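To make the expansions concrete, here is what they yield for a sample name (illustration only):
i=7000_03_lig_cne_2.dlg
f=${i%.dlg}          # 7000_03_lig_cne_2
echo "${f%%_*}"      # 7000
echo "${f#*_*_}"     # lig_cne_2  ->  folder 7000_lig_cne_2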
This might work for you (GNU parallel):
ls *.dlg |
parallel --dry-run 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d'
Pipe the output of ls command listing files ending in .dlg to parallel, which creates directories and moves the files to them.
Run the solution as is, and when satisfied the output of the dry run is ok, remove the option --dry-run.
The solution could be one instruction:
parallel 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d' ::: *.dlg
Using POSIX shell's built-in grammar only and sort:
#!/usr/bin/env sh
curdir=
# Create list of files with newline
# Safe since we know there is no special
# characters in name
printf -- %s\\n *.dlg |
# Sort the list by 5th key with _ as field delimiter
sort -t_ -k5 |
# Iterate reading the _ delimited fields of the sorted list
while IFS=_ read -r _ _ c d e; do
# Compose the new directory name
newdir="${c}_${d}_${e%.dlg}"
# If we enter a new group / directory
if [ "$curdir" != "$newdir" ]; then
# Make the new directory current
curdir="$newdir"
# Create the new directory
echo mkdir -p "$curdir"
# Move all its files into it
echo mv -- *_"$curdir.dlg" "$curdir/"
fi
done
Optionally as a sort and xargs arguments stream:
printf -- %s\\n * |
    sort -u -t_ -k5 |
    xargs -n1 sh -c '
        d="lig_cne_${0##*_}"
        d="${d%.dlg}"
        echo mkdir -p "$d"
        echo mv -- *"_$d.dlg" "$d/"
    '
Here is a very simple awk script that does the trick in a single sweep.
script.awk
BEGIN {FS="[_.]"}                     # make the field separator "_" or "."
{                                     # for each filename
    dirName = $1"_"$3"_"$4"_"$5;                            # compute the target dir name from the fields
    sysCmd  = "mkdir -p " dirName "; cp " $0 " " dirName;   # prepare the shell command
    system(sysCmd);                                         # run the shell command
}
running script.awk
ls -1 *.dlg | awk -f script.awk
oneliner awk script
ls -1 *.dlg | awk 'BEGIN{FS="[_.]"}{d=$1"_"$3"_"$4"_"$5;system("mkdir -p "d"; cp "$0 " "d);}'

How to iterate over several files in bash

I have several files to compare. The con and ref files contain lists of paths to .txt files that should be compared, and the output should contain the variable name of con_vs_ref_1.txt.
con:
/home/POP_xpclr/A.txt
/home/POP_xpclr/B.txt
ref:
/home/POP_xpclr/C.txt
/home/POP_xpclr/D.txt
#!/usr/bin/env bash
XPCLR="/home/Tools/XPCLR/bin/XPCLR"
CON="/home/POP_xpclr/con"
REF="/home/POP_xpclr/ref"
MAPS="/home/POP_xpclr/1"
OUTDIR="/home/POP_xpclr/Results"
$XPCLR -xpclr $CON $REF $MAPS $OUTDIR -w1 0.5 200 1000000 $MAPS -p1 0.95
Comments in code.
# create an MCVE, ie. input files:
cat <<EOF >con
/home/POP_xpclr/A.txt
/home/POP_xpclr/B.txt
EOF
cat <<EOF >ref
/home/POP_xpclr/C.txt
/home/POP_xpclr/D.txt
EOF
# join streams
paste <(
    # repeat the ref file as many times as the con file has lines
    seq $(<con wc -l) |
        xargs -i cat ref
) <(
    # repeat each line from the con file as many times as the ref file has lines
    # from https://askubuntu.com/questions/594554/repeat-each-line-in-a-text-n-times
    awk -v max=$(<ref wc -l) '{for (i = 0; i < max; i++) print $0}' con
) |
# ok, we have all combinations of lines
# now read them field by field and do whatever we want
while read -r file1 file2; do
    # run the compare function
    cmp "$file1" "$file2"
    # probably you want something along the lines of:
    "$XPCLR" -xpclr "$file1" "$file2" "$MAPS" "$OUTDIR" -w1 0.5 200 1000000 "$MAPS" -p1 0.95
done
Looping over the file paths in your con and ref files is pretty easy in bash.
As for "the output should contain the variable name of con_vs_ref_1.txt", you haven't explained what you want very well, but I'll guess that you want the file created to be named according to that formula and inside the output directory. Something like /home/POP_xpclr/Results/A_vs_C_1.txt.
#!/usr/bin/env bash
XPCLR="/home/Tools/XPCLR/bin/XPCLR"
CON="/home/POP_xpclr/con"
REF="/home/POP_xpclr/ref"
MAPS="/home/POP_xpclr/1"
OUTDIR="/home/POP_xpclr/Results"
for FILE1 in $(cat $CON)
do
    for FILE2 in $(cat $REF)
    do
        OUTFILE="$OUTDIR/$(basename ${FILE1%.txt})_vs_$(basename ${FILE2%.txt})_1.txt"
        $XPCLR -xpclr $FILE1 $FILE2 $MAPS $OUTFILE -w1 0.5 200 1000000 $MAPS -p1 0.95
    done
done
What's this doing...
$(cat $CON) creates a subshell and runs cat to read your CON file, inserting the output (i.e. all the file paths) into the script at that point
for FILE1 in $(cat $CON) creates a loop where all the values read from your CON file are iterated across and assigned to the FILE1 variable one at a time.
for FILE2 in $(cat $REF) as above but with the REF file.
${FILE1%.txt} inserts the value of FILE1 variable, with ".txt" extension removed from the end. This is called parameter expansion.
$(basename ${FILE1%.txt}) makes a subshell as before; basename strips all the leading directories and returns just the filename, which has already had its ".txt" extension removed by the parameter expansion.
OUTFILE="$OUTDIR/$(basename ${FILE1%.txt})_vs_$(basename ${FILE2%.txt})_1.txt" combines the two points above to create your new file path based on your formula (a short illustration follows after this list).
do and done are parts of the for loop construct that I hope are pretty self explanatory.
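A quick illustration of those two expansions on a hypothetical path:
FILE1=/home/POP_xpclr/A.txt
echo "${FILE1%.txt}"                 # /home/POP_xpclr/A
echo "$(basename "${FILE1%.txt}")"   # A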

Finding the file name in a directory with a pattern

I need to find the latest file - filename_YYYYMMDD in the directory DIR.
The below is not working, as the field position shifts each time because of the varying spacing (occurring mostly in the file-size field, which differs every time).
Please suggest if there is another way.
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | cut -d " " -f9`
You can use awk to take the last field, like below:
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | awk '{print $NF}'`
Cut may not be an option here
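Since the YYYYMMDD suffix sorts lexicographically in date order, a glob plus sort can also avoid parsing ls output entirely; a minimal sketch (it assumes at least one matching file exists):
report=$(printf '%s\n' "$DIR"/filename_* | sort | tail -n 1)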
If I understand correctly, you want to loop through each file in the directory and find the largest 'YYYYMMDD' value along with the filename associated with that value. You can use simple POSIX parameter expansion with substring removal to isolate the 'YYYYMMDD', compare it against a value initialized to zero, and update a latest variable to hold the largest 'YYYYMMDD' as you loop over all files in the directory, storing the name of the file each time you find a larger 'YYYYMMDD'.
For example, you could do something like:
#!/bin/sh

name=
latest=0
for i in *; do
    test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }
done
printf "%s\n" "$name"
Example Directory
$ ls -1rt
filename_20120615
filename_20120612
filename_20120115
filename_20120112
filename_20110615
filename_20110612
filename_20110115
filename_20110112
filename_20100615
filename_20100612
filename_20100115
filename_20100112
Example Use/Output
$ name=; latest=0; \
> for i in *; do \
> test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }; \
> done; \
> printf "%s\n" "$name"
filename_20120615
Where the script selects filename_20120615 as the file with the greatest 'YYYYMMDD' of all files in the directory.
Since the script uses only tools provided by the shell itself, it doesn't need to spawn subshells or external utilities for the work it does.
Give it a test and let me know if that is what you intended, let me know if your intent was different, or if you have any further questions.

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '@' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
(
#find duplicated read names and write to file $f.txt
sort $f | uniq -d | grep ^@ > "$f".txt
#loop over each duplicated readname
while read in; do
rname=$in
i=1
#while this readname still exists in the file increment and replace
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
rm "$f".txt
rm "$f".bu
done
echo "done" >> progress.txt
)&
background=( $(jobs -p) )
if (( ${#background[@]} == 40 )); then
wait -n
fi
done
The problem with it is that it's impractically slow. I ran it on a 48 core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^@ > "$f".txt
while read in; do
rname=$in
i=1
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
To give you an idea of what the files look like below are a few lines from one of them. The thing is that even though it works for the one file, it's slow. Like multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when they were run in parallel?
Sample input:
@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
@D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
Again, it is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
#dbg if (dbg) print "#dbg:prev=" prev
if (/^@/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
if (/^@/ && (prev==$1 || NR==1) ) {
prev=$1
n=split($1,tmpArr,":") ; n++
#dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
fixNum++;tmpArr[6]=fixNum;
# magic to rebuild $1 here
for (i=1;i<n;i++) {
tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
}
$1=tmpFix ; $0=$0
print $0
}
else { tmpFix=""; print $0 }
}' file > fixedFile
output
@D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file and then try to write, so you can lose all the data of a file trying to use the same name.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^@ > dups.txt
while read in; do
    if [ $((i % 4)) -eq 0 ] && grep -q "$in" dups.txt; then
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> $1"fixed.txt"
    else
        echo "$in" >> $1"fixed.txt"
    fi
    let "i+=1"
done < $1
