Merge multiple files into a single row file with a delimeter - bash

UPDATED QS:
I have been working on a bash script that will merge multiple text files with numerical values into one a single row text file using delimiter for each file values while merging
Example:
File1.txt has the followling contents:
168321099
File2.txt has:
151304
151555
File3.txt has:
16980925
File4.txt has:
154292
149092
Now i want a output.txt file like below:
, 168321099 151304 151555 16980925 , 154292 149092
Basically each file delimited by space and in a single row. with comma as first and 6 field of the outputrow
tried:
cat * > out.txt but its not coming as expected

I am not very sure If I understood your question correctly, but I interpreted it as following :
The set of files file1,...,filen contain a set of words which you want to have printed in one single line.
Each word is space separated
In addition to the string of words, you want the first character to be a , and between word 4 and 5 you want to have a ,.
The cat+tr+awk solution:
$ cat <file1> ... <filen> | tr '\n' ' ' | awk '{$1=", "$1; $4=$4" ,"; print}'
The awk solution:
$ awk 'NR==1||NR==4{printf s",";s=" "}{printf " "$1}' <file1> ... <filen>

If tr is available on your system you can do the following cat * | tr "\n" " " > out.txt
tr "\n" " " translates all line breaks to spaces

If the number of lines per file is constant, then the easiest way is tr as #Littlefinix suggested, with a couple of anonymous files to supply the commas, and an echo at the end to add an explicit newline to the output line:
cat <(echo ",") File1.txt File2.txt File3.txt <(echo ",") File4.txt | tr "\n" " " > out.txt; echo >> out.txt
out.txt is exactly what you specified:
, 168321099 151304 151555 16980925 , 154292 149092
If the number of lines per input file might vary (e.g., File2.txt has 3 or 4 lines, etc.), then placing the commas always in the 1st and 6th field will be more involved, and you'd probably need a script and not a one-liner.

Following single awk could help you on same.
awk 'FNR==1{count++;} {printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0)} END{print ""}' *.txt
Adding a non-one liner form of solution too now.
awk '
FNR==1 { count++ }
{ printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0) }
END { print "" }
' *.txt

Related

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are missing then:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
trying to replace the " " with ","
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here and input filename is called data.txt
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Unix row to column format with string prefix and post fix

I have the requirement to convert row string data to column format and pre/postfix specific strings. The data string in file has 4 major fixed columns (separated by ";") and each column is further divided in two sections (separated by ":").
E.g.
Source data file:
A100:T100;B100:T200;A200:T300;B200:T400
Output from file should be:
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
Meanwhile I am trying with following code:
String="A100:T100;B100:T200;A200:T300;B200:T400"
> File.txt
for deploy in $(echo $String | tr ";" "\n")
do
echo $deploy >> File.txt
done
cat File.txt | awk 'BEGIN { FS=":"; OFS=":" } NR==1{ print "TABa:BatchID="$1,$2 } NR==2{ print "TABb:BatchID="$1,$2 }'
printf handles this:
$ awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';' File.txt
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
How it works
-F:
This sets the field separator to a colon: :.
sub(/\n/,"")
This removes newline characters.
printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR
This does all the work. It makes use of the record number, NR, and the first and second fields and prints the output that you want.
RS=';'
This tells awk to use a semicolon, ;, as the record separator.

sed - removing last comma from listed value after doing a replace

I'm using sed to replace my file of new lines \n with ',' which works fine however, in my last item, I don't want the ,.
How can I remove this?
Example:
sed 's/\n/,/g' myfile.out > myfile.csv
Output:
1,2,3,4,5,6,
Well you can use labels:
$ cat file
1
2
3
4
5
6
$ sed ':a;N;s/\n/,/;ba' file
1,2,3,4,5,6
You can also use paste command:
$ paste -sd, file
1,2,3,4,5,6
Consider jaypal singh's paste solution, which is the most efficient and elegant.
An awk alternative, which doesn't require reading the entire file into memory first:
awk '{ printf "%s%s", sep, $0; sep = "," }' myfile.out > myfile.csv
If the output should have a trailing newline (thanks, Ed Morton):
awk '{ printf "%s%s", sep, $0; sep = "," } END { printf "\n" }' myfile.out > myfile.csv
For the first input line, sep, due to being an uninitialized variable, defaults to the empty string, effectively printing just $0, the input line.
Setting sep to "," after the first print ensures that all remaining lines have a , prepended.
END { printf "\n" } prints a trailing newline after all input lines have been processed. (print "" would work too, given that print appends the output record separator (ORS), which defaults to a newline).
The net effect is that , is only placed between input lines, so the output won't have a trailing comma.
You could add a second s command after the first: sed -z 's/\n/,/g ; s/,$//. This removes a comma at the end. (The option -z is from gnu sed and I needed it to get the first s command working.)

converting nexus to FASTA format

I have many .nexus files that I want to convert to FASTA style format and combine into one .fasta file. Here is an example code:
for i in *.nexus;
do
awk 'NR >5' /path/to/nexus_files/$i | tr -d "'" | tr " " "\n" | sed 's/locus/>locus/g' > /path/to/fasta/${i}.fasta
done
This works for the first nexus file, but the #NEXUS header remains in subsequent conversions.
Input:
#NEXUS
begin data;
dimensions ntax=1 nchar=300;
format datatype=dna missing=? gap=-;
matrix
'locus1_individual-1'
???????????????????????????????TAGATTTTTTAGTCCTTAC
;
end;
Desired output:
>locus1_individual-1
???????????????????????????????TAGATTTTTTAGTCCTTAC
To speed it some up, you may reduce the number of commands needed:
for i in *.nexus;
do
awk 'NR>5 {gsub(f,"");gsub(/ /,"\n");gsub(/uce/,">&");print}' f="'" /path/to/nexus_files/$i > /path/to/fasta/${i}.fasta
done
An idea from anishsane. (all in one awk)
awk 'FNR>5 {sub(/\.nexus$/,"",FILENAME);sub(/.*\//,"/path/to/fasta/",FILENAME);gsub(f,"");gsub(/ /,"\n");gsub(/uce/,">&");print >FILENAME".fasta"}' f="'" /path/to/nexus_files/*
First sub removes the nexus extention from the filename.
Second change the path to /path/to/fasta/
Now its important to use FNR, since you read many files within one awk
Try:
for i in *.nexus;
do
awk 'FNR >5' /path/to/nexus_files/$i | tr -d "'" | tr " " "\n" | sed 's/uce/>uce/g' > /path/to/fasta/${i}.fasta
done
NR is the total number of records across files, FNR is the record count for each file, reset to zero for a new file.

split a file into segments?

I have a file containing text data which are separated by semicolon ";". I want to separate the data , in other words split where ; occurs and write the data to an output file. Is there any way to do with bash script?
You most likely want awk with the FS (field separator variable) set to ';'.
Awk is the tool of choice for column-based data (some prefer Perl, but not me).
echo '1;2;3;4;5
6;7;8;9;10' | awk -F\; '{print $3" "$5}'
outputs:
3 5
8 10
If you just want to turn semicolons into newlines:
echo '1;2;3;4;5
6;7;8;9;10' | sed 's/;/\n/g'
outputs the numbers 1 through 10 on separate lines.
Obviously those commands are just using my test data. If you want to use them on your own file, use something like:
sed 's/;/\n/g' <input_file >output_file
#!/bin/bash
while read -d ';' ITEM; do
echo "$ITEM"
done
Try:
cat original_file.txt | cut -d";" -f1 > new_file.txt
This will split each line in fields delimited by ";" and select the first field (-f1).
You can access other fields with -f1, -f2, ... or multiple fields with -f1-2, -f2-.
You can translate a character to another character by the 'tr' command.
cat input.txt | tr ';' '\n' > output.txt
Where \n is new line and if you want a tab only you should replace it with \t

Resources