Combining many files columnwise, use first column only once - bash

I have to combine a lot of similar csv files into one file. They are stored in many different subdirectories, but the individual csv files all have the same name.
I need to append them columnwise, but I need the first "name" column only once. So I want to keep the first column of the first csv file and remove it from all the following ones. Referring to this question, I tried the following command, iterating through all the subdirectories while the final file sits in the main directory (and starts out as a copy of one of the many csv files, so that it already contains the "name" column):
for i in */; do paste final_table.csv <(cut -f 2- "$i"single_table.csv) > final_table.csv ; done
However, it seems that paste does not work when one of the input files is also the output file.
How would I solve this correctly?

Don't overwrite the file you're reading input from with output. (The shell truncates the file named after > before paste ever reads it, which is why your version loses data.) Instead, mv/rename it to an intermediate name, let your script read from that file, and write output to a file with the original name. Remove the input file when complete.
Alternatively, choose an intermediate name for the output file, write all output to it, and only after all input has been processed, mv/rename the output file to the final name.
As an intermediate name, appending a temporary filename ending ("extension") can be useful.
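For example, a minimal sketch of the first approach applied to the loop from the question (the .tmp suffix is just an illustrative choice):
for i in */; do
    mv final_table.csv final_table.csv.tmp    # rename the input out of the way
    paste final_table.csv.tmp <(cut -f 2- "$i"single_table.csv) > final_table.csv
    rm final_table.csv.tmp                    # remove the input when complete
done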

The sponge utility from the moreutils package is what I always use for this kind of situation:
for i in */; do
    paste final_table.csv <(cut -f 2- "$i"single_table.csv) | sponge final_table.csv
done
sponge quite simply "soaks up" standard input and then writes it to the filename you give it. It is written specifically for situations like this, avoiding the need to create (and then remember to delete) a temporary file.

Related

how to merge multiple text files using bash and preserving column order

I'm new to bash. I have a folder in which there are many text files, and among them there's a group named namefile-0, namefile-1, ... namefile-100. I need to merge these files into one new file. The format of each of these files is: a header and 3 columns of data.
It is very important that the format of the new file is:
3 * 100 columns of data respecting the order of the columns (123123123...).
I don't mind if the header is also repeated or not.
I'm also willing, if necessary, to place all these files in a folder in which no other files are present.
I've tried to do something like this:
for i in {1..100}
do
    paste `echo "namefile$i"` >> `echo "b"`
done
which prints only the first file into b.
I've also tried to do this:
STR=""
for i in {1..100}
do
STR=$STR"namefile"$i" "
done
paste $STR > b
which prints everything but does not preserve the order of the columns.
You need to mention what delimiter separates the columns in your files.
Assuming the columns are separated by a single space:
paste -d' ' namefile-* > newfile
Other conditions, like the existence of other similarly named files or directories in the working directory, stripping of headers, etc. can also be tackled, but more information needs to be provided in the question.
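For instance, a rough sketch of the header-stripping variant, assuming each file starts with a one-line header (the stripped-* names are hypothetical scratch files):
for i in {0..100}; do
    tail -n +2 "namefile-$i" > "stripped-$i"    # drop the one-line header
done
paste -d' ' stripped-{0..100} > newfile
rm stripped-{0..100}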
paste namefile-{0..100} > combined
paste namefile* > new_file_name
(Note that the namefile* glob expands in lexicographic order, namefile-1, namefile-10, namefile-100, namefile-11, and so on, so unlike the brace expansion above it does not preserve numeric order.)

Running a process on every combination between files in two folders

I have two folders, where the 1st has 19 .fa files and the 2nd has 37096 .fa files.
Files in the 1st folder are named BF_genomea[a-s].fa, and files in the 2nd are named [1-37096]ZF_genome.fa.
I have to run this process, lastz filein1stfolder filein2ndfolder [arguments] > outputfile.axt, so that every file in the 1st folder is run against every file in the 2nd folder.
Any sort of output file naming would serve, as long as it allows identifying which particular combination of parent files each output came from, and the files have the extension .axt.
This is what I have done so far
for file in /tibet/madzays/finch_data/BF_genome_split/*.fa; do for otherfile in /tibet/madzays/finch_data/ZF_genome_split/*.fa; name="${file##*/}"; othername="${otherfile##*/}"; lastz $file $otherfile --step=19 --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores=/tibet/madzays/finch_data/BFvsZFLASTZ/HoxD55.q > /home/madzays/qsub/test/"$name""$othername".axt; done; done
As I said in a comment, the inner loop is missing a do keyword (for otherfile in pattern; do <-- right there). Is this in the form of a script file? If so, you should add a shebang as the first line to tell the OS how to run the script. Also break it into multiple lines and indent the contents of the loops, to make it easier to read (and easier to spot problems like the missing do).
Off the top of my head, I see one other thing I'd change: the output filenames are going to be pretty ugly, just the two input files mashed together with ".axt" on the end (along the lines of "BF_genomeac.fa14ZF_genome.fa.axt"). I'd parse the IDs out of the input filenames and then use them to build a more reasonable output filename convention. Something like this:
#!/bin/bash
for file in /tibet/madzays/finch_data/BF_genome_split/*.fa; do
    for otherfile in /tibet/madzays/finch_data/ZF_genome_split/*.fa; do
        name="${file##*/}"
        tmp="${name#BF_genomea}"            # remove filename prefix
        id="${tmp%.*}"                      # remove extension to get the ID
        othername="${otherfile##*/}"
        otherid="${othername%ZF_genome.fa}" # just have to remove a suffix here
        lastz "$file" "$otherfile" --step=19 --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores=/tibet/madzays/finch_data/BFvsZFLASTZ/HoxD55.q > "/home/madzays/qsub/test/BF${id}_${otherid}ZF.axt"
    done
done
The code can be translated nearly directly from your requirements:
base=/tibet/madzays/finch_data
for b in {a..s}
do
    for z in {1..37096}
    do
        lastz "$base/BF_genome_split/BF_genomea${b}.fa" "$base/ZF_genome_split/${z}ZF_genome.fa" --hspthresh=2200 --gappedthresh=10000 --ydrop=3400 --inner=2000 --seed=12of19 --format=axt --scores="$base/BFvsZFLASTZ/HoxD55.q" > "/home/madzays/qsub/test/${b}-${z}.axt"
    done
done
Note that one-liners easily lead to errors, like a missing do, which are then hard to find from the error message (which just reports an error in line 1).

Bash: Copy a file x times and rename the file automatically from a list in txt

I have a .txt file containing a list of over 450 lines, e.g.
name_1
name_2
name_3
etc
I'd like to copy a file named file_to_copy.txt x times (~450) and automatically rename each copy to name_1, name_2 and so on, until I have created those ~450 files, each named after a line in the previously mentioned .txt file.
How can I do that?
Perhaps you want to say:
while IFS= read -r name; do    # -r keeps backslashes literal; IFS= preserves whitespace
    cp file_to_copy.txt "${name}"
done < my_text_file_with_filenames.txt
try this line:
sed 's/^/cp file_to_copy.txt /' foo.txt | sh
Remove the | sh to see the generated cp commands; add it back to execute them.
An alternative to @devnull's answer, which is probably more canonical than this one, is to use your favorite text editor to insert cp file_to_copy.txt in front of every line in your file and then source it. It is quick and dirty, but it gets the job done quickly if you are not familiar with Bash loops or GNU tools in general.
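For the record, the end state of that approach looks like this (file names taken from the question and the answer above):
# each line of the edited list file now reads, for example:
cp file_to_copy.txt name_1
# executing the file in the current shell then creates all the copies:
source my_text_file_with_filenames.txt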

Grep -f and only return the first match

I'm working with a large CSV that goes through a basic process:
1. Back up the working original.
2. Generate a skeleton CSV.
3. Read from another CSV, format the contents, and then append them to the skeleton.
4. Append the data from the backup to the new one.
The issue I'm running into is that when I read in the contents from the backup, I'm using grep -Ev -f with a file containing regexes to keep undesired data from the backup out of the next revision. This currently presents a problem because grep appears to evaluate each regex in the file against every line from STDIN, which causes duplicates. The simple solution would be to pipe it through sort | uniq and call it a day, but that would mess with the formatting of the csv currently in use. I can elaborate if needed, but the short of it is that I run a script to bulk-process IP addresses, while the file is also edited manually by other people; with the current form of the script, the final output is all of the automated content with the manual entries at the bottom of the file.
So, is there any way, without some ugly looping of grep, to tell it to stop evaluating a line after a pattern is matched? Using -m 1 stops grep after the first match in the whole stream, whereas I need it to stop after the first match on each line and move on to the next line.
For the task you want to accomplish, it would be best in my opinion to use awk. You can find an excellent tutorial for awk at: http://www.grymoire.com/Unix/Awk.html. You basically need to change the input field separator for awk with:
awk -F',' -f foo.awk bar.dat
As far as the problem with sorting is concerned, follow this: http://www.linuxquestions.org/questions/linux-general-1/how-to-use-awk-to-sort-243177/
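That said, the per-line, first-match behaviour asked about is straightforward in awk. A minimal sketch, with patterns.txt and backup.csv as hypothetical file names (for grep -v-style exclusion, make a match skip the line instead of printing it):
awk 'NR==FNR { pats[++n] = $0; next }          # first file: collect the regexes
     {
         for (i = 1; i <= n; i++)
             if ($0 ~ pats[i]) { print; next } # first match wins; stop checking
     }' patterns.txt backup.csv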

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where the files get updated, and the other is the Original folder, where the original files exist). Every time a file updates in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: the file names in the Delta folder and the Original folder match, but the content in the files may be different. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so, my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
1. Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
2. Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
3. Get the list of files that appear in both folders and loop over those.
4. For each file, read each line from the property file (1.properties).
5. Read each line (delta-line="account.org.com.email=New-Email") from the Delta folder's property file and split the line on the delimiter "=" into two string variables (delta-line-string1=account.org.com.email; delta-line-string2=New-Email;).
6. Read each line (orig-line="account.org.com.email=Old-Email") from the Original folder's property file and split it the same way (orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;).
7. If delta-line-string1 == orig-line-string1, then replace $orig-line with $delta-line, i.e. if account.org.com.email == account.org.com.email, then replace account.org.com.email=Old-Email in Original_Folder/1.properties with account.org.com.email=New-Email.
Once the loop finishes all lines in a file, it moves on to the next file, and it continues until all the common files are done.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it's working fine, but it takes a long time (4 minutes) to finish each file, because it goes through three loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I'm wondering if there is any way to reduce the loops so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Since paste puts the matching line from File 1 (when there is one) after the line from File 2, printing the last field, $NF, yields the updated value where one exists and the original value elsewhere.
Or with a single awk command, if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
(The delta file is read last, so its values overwrite the originals in the array.)
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape &, /, and \ as mentioned in this answer.
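A rough sketch of that idea, assuming GNU sed for -i and the file layout from the question (update.sed is a hypothetical scratch file):
# turn each delta line into one sed substitution
while IFS='=' read -r key value; do
    # escape the characters that are special in a sed replacement: \ / &
    esc=$(printf '%s' "$value" | sed 's/[\/&]/\\&/g')
    printf 's/^%s=.*$/%s=%s/\n' "$key" "$key" "$esc"
done < Delta_Folder/1.properties > update.sed
# apply every substitution to the original file in a single pass
sed -i -f update.sed Original_Folder/1.properties
(The dots in the keys are regex metacharacters, but for property names like these a stray match is unlikely enough for a sketch.)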
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
