Convert many txt files to xls files with bash script - bash

I'm trying to convert many text files to xls files. The style of the txt files is as follows:
"Name";"Login";"Role"
"Max Muster";"Bla102";"user"
"Heidi Held";"Held100";"admin"
I tried to work with this bash script:
for file in *.txt; do
tr ";" "," < "$file" | paste -d, <(seq 1 $(wc < "$file")) - > "${file%.*}.xls"
soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done
With this, I lose the header row and I also get a column with many Chinese characters, but the rest looks okay:
攀挀琀 | Max Muster | Bla102 | user
氀愀猀 | Heidi Held | Held100 | admin
How can I get rid of these Chinese characters and keep the header row?

The question unfortunately does not provide enough details to be sure exactly what the issues are, but we have identified at least the following in the comments.
Apparently, the input file contains DOS carriage returns.
Apparently, soffice attempted to read the file as UTF-16, which is what produced the essentially random Chinese characters. (The characters could be anything; it's just more probable that a random Unicode BMP character will be in a Chinese/Japanese block.)
With those observations and a refactoring of the existing script, try
for file in *.txt; do
    awk -F ';' 'BEGIN { OFS="," }
        FNR==1 {
            # Add a UTF-8 BOM so soffice does not misdetect the encoding
            printf "\357\273\277"
            # Generate a bogus header line for soffice to discard
            # (one extra column to account for the line-number column)
            for (i=0; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
        }
        # Strip any DOS carriage return, force the record to be rebuilt
        # with commas ($1 = $1), and prefix the line number
        { sub(/\015/, ""); $1 = $1; print FNR, $0 }' "$file" > "${file%.*}.csv"
    soffice --headless --convert-to xls:"MS Excel 95" "${file%.*}.csv"
done
In so many words, the Awk script splits each input line on semicolons (-F ';') and sets the output field separator OFS to a comma. On the first line, we emit a BOM plus a synthetic "bogus" header line for soffice to discard, so that the file's real header row survives as a regular data line in the output. The sub removes any DOS carriage return, the assignment $1 = $1 forces Awk to rebuild the record with commas instead of semicolons, and FNR, prefixed to each line, is the current input line's line number.
I'm not sure whether the BOM or the bogus header line is strictly necessary, or whether you instead need to pass in some additional options to make soffice treat the input as proper UTF-8. Perhaps you also need to include LC_ALL=C somewhere in the pipeline.
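If you want to sanity-check one of the intermediate files before handing it to soffice, standard tools can confirm that the BOM is in place and the carriage returns are gone (data.csv below is just a placeholder for one of your generated files):
file data.csv                 # should report something like "UTF-8 Unicode (with BOM) text"
head -c 16 data.csv | od -c   # the first three bytes should be 357 273 277 (the BOM)
grep -c $'\r' data.csv        # should print 0 if the carriage returns were stripped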

Related

How to add an empty line at the end of these commands?

I am in a situation where I have many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output file is correctly a FASTA file.
However, I get a problem in the next step, when I merge the files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line, I found that a file's header had been appended to the end of the previous file's last sequence, like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta, so that when I merge them they stack correctly instead of being appended to the last sequence of the previous file?
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
I would use GNU sed and replace
cat $1 >> final.fasta
with
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n); this action is taken before the default action of printing the line. If you prefer GNU AWK, you can get the same behavior this way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you do not provide enough information to do so. I assume the line above is somewhere inside a loop and $1 is always the name of a file that exists in the current working directory.
If the only thing you need is an extra blank line, and the input files are within 1.5 GB in size, then just directly do:
awk NF=NF RS='^$' FS='\n' OFS='\n'
This should work for mawk 1/2, gawk, and nawk, and maybe others as well. The reason it works, despite appearing not to do anything special, is that the extra \n comes from ORS.
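If you go that route, a minimal sketch of the merge step might look like this (assuming the converted FASTA files are the *.fasta files in the current directory):
for f in *.fasta; do
    # emit each converted FASTA followed by one extra blank line
    awk NF=NF RS='^$' FS='\n' OFS='\n' < "$f"
done > final.fasta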

Batch create files with name and content based on input file

I am a mac OS user trying to batch create a bunch of files. I have a text file with column of several hundred terms/subjects, eg:
hydrogen
oxygen
nitrogen
carbon
etcetera
I want to programmatically fill a directory with text files generated from this subject list. For example, "hydrogen.txt" and "oxygen.txt" and so on, with each file created by iterating through the lines of my list_of_names.txt file. Some lines are one word, but other lines are two or three words (eg: "carbon monoxide"). This I have figured out how to do:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
Additionally I need to create two lines of content within each of these files, and the content is both static and dynamic...
# filename
#elements/filename
...where in the example above the pound sign ("#") and "elements/" would be the same in all of the files created, but "filename" would be variable (eg: "hydrogen" for "hydrogen.txt" and "oxygen" for "oxygen.txt" etc). One further wrinkle is that if any spaces appear at all on the second line of content, there needs to be a trailing pound sign. For example:
# filename
#elements/carbon monoxide#
...although this last part is not a dealbreaker and I can use grep to modify list_of_names.txt such that phrases like "carbon monoxide" become "carbon_monoxide" and just deal with the repercussions of this later. (But if it is easy to preserve the spaces, I would prefer that.)
After a couple of hours of searching and attempts to use sed, awk, and so on, I am stuck with a directory full of files with the correct filename.txt format, but I can't get further than this. Mostly I think my efforts are failing because the solutions I can find for doing something like this use commands I am not familiar with, and they are structured for GNU tools and don't execute correctly in Terminal on Mac OS.
I am amenable to processing this in multiple steps (i.e. make all of the .txt files first, then run a second step to populate the content of the files), or as a single command that makes the files and all of their content simultaneously ('simultaneously' on a human timescale).
My horrible pseudocode (IN CAPS) for how this would look as 2 steps:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
awk 'NF>0' list_of_names.txt | while read line; OPEN "${line}.txt" AND PRINT "# ${line}\n#elements/${line}"; IF ${line} CONTAINS CHARACTER " " PRINT "#"; done
You could use a simple Bash loop and create the files in one shot:
#!/bin/bash
while read -r name; do                          # loop through input file content
    [[ $name ]] || continue                     # skip empty lines
    output=("# $name")                          # initialize the array with the first line
    trailing=
    [[ $name = *" "* ]] && trailing="#"         # name has spaces in it
    output+=("#elements/$name$trailing")        # second line, with trailing # if needed
    printf '%s\n' "${output[@]}" > "$name.txt"  # write array content to the output file
done < list_of_names.txt
Doing it in awk:
awk '
NF {
trailing = (/ / ? "#" : "")
out=$0".txt"
printf("# %s\n#elements/%s%s\n", $0, $0, trailing) > out
close(out)
}
' list_of_names.txt
Doing the whole job in awk will yield better performance than in bash, which isn't really suited to processing text like this.
It seems to me that this should cover the requirements you've specified:
awk '
{
out=$0 ".txt"
printf "# %s\n#elements/%s%s\n", $0, $0, (/ / ? "#" : "") >> out
close(out)
}
' list_of_subjects.txt
Though you could shrink it to a one-liner:
awk '{printf "# %s\n#elements/%s%s\n",$0,$0,(/ /?"#":"")>($0".txt");close($0".txt")}' list_of_subjects.txt
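For instance, with the "carbon monoxide" example from the question, any of these versions should produce a file whose contents look like this (reconstructed from the stated requirements, not an actual run):
$ cat "carbon monoxide.txt"
# carbon monoxide
#elements/carbon monoxide#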

How to add a header to text file in bash?

I have a text file and want to convert it to a csv file. Before converting it, I want to add a header to the text file so that the csv file has the same header. I have one thousand columns in the text file and want one thousand column names. As a side note, the content of the text file is just rows of numbers separated by commas (","). Is there any way to add the header line in bash?
I tried the approach below and it didn't work. I first ran this in Python:
for i in range(1001):
    print "col" + "_" + "i"
I saved its output to a text file with python header.py >> header.txt and then added it to the top of my original text file like below:
cat header.txt filename.txt > newfilename.txt
Then I converted the txt file to a csv file with mv newfilename.txt newfilename.csv.
But unfortunately this doesn't work: the header line ends up with double the number of columns of the other rows for some reason. I would appreciate any help in solving this problem.
Based on the description, your file is already comma separated, so it is already a csv file. You just want to add a column-number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1' file
will add as many column headers as there are fields in the first row of the file,
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a comma-separated header.txt file to match your comma-separated data. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..1000}; do
    [ "$i" -gt 1 ] && echo -n ","
    echo -n "col_$i"
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
    [ "$i" -gt 1 ] && echo -n ","
    echo -n "col_$i"
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header.csv /tmp/data.csv > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.
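To eyeball what the printf and sed stages produce, try a smaller count first (note that, as discussed, the output may be missing its final newline):
$ printf "col_%s," {1..5} | sed 's/,$//'
col_1,col_2,col_3,col_4,col_5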

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files so that certain information is written into every row for all of a file's entries. For instance, this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (�C)",20.04
"Low temperature limit (�C)",-0.02
"Date - Time","Temperature (�C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
is converted into this format:
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(°C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(°C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(°C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed commands fix some problems with the header, but I want the header gone entirely. Any lines that say "Serial No." and "Location" are important and their information is required; the other header lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file.
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove anything with fewer than 3 comma-separated columns from your final file:
awk -F, 'NF>=3' file
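Unlike sed -i, plain awk does not edit in place, so a sketch of applying that cleanup to the generated file would be:
awk -F, 'NF>=3' formatted_log.csv > formatted_log.tmp && mv formatted_log.tmp formatted_log.csv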

performance issues in shell script

I have a 200 MB tab-separated text file with millions of rows. In this file, I have a column with multiple locations like US, UK, AU, etc.
Now I want to split this file on the basis of that column. Though this code works for me, it has a performance problem: it takes more than 1 hour to split the file into multiple files by location. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we read the entire file only once and, depending on the location, create the output files and append the data to them.
Please suggest improvements to the code to enhance its performance.
Following is some sample data; it is separated by colons instead of tabs. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes are slow
This is a job for a text processing tool, such as awk.
I would suggest that you use something like this:
# save first line
NR == 1 {
    header = $0
    next
}

{
    filename = "file_" $col ".txt"
    # if country code has changed
    if (filename != prev) {
        # close the previous file
        close(prev)
        # if we haven't seen this file yet
        if (!(filename in seen)) {
            print header > filename
        }
        seen[filename]
    }
    # print whole line to file
    print >> filename
    prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(prev).
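In that case the script collapses to something like this untested sketch (same col variable, same assumptions as above):
# save first line as the header
NR == 1 {
    header = $0
    next
}

{
    filename = "file_" $col ".txt"
    # the first time we see this file, write the header once
    if (!(filename in seen)) {
        print header > filename
        seen[filename] = 1
    }
    # ">" truncates only on the first open; later prints keep appending
    print > filename
}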
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
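For the sample data, the result should look roughly like this (reconstructed from the sample and the script's logic, not pasted from a real run):
$ awk -F: -v col=4 -f script.awk file
$ cat file_CA.txt
ID1:ID2:ID3:ID4:ID5
500:abcd:TEST5:CA:CCD
1300:abcd:TEST4153:CA:CCD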
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
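Building on that cmd example, a hedged, untested sketch that streams each country's rows straight into a per-country gzip pipe might look like:
awk -F: 'NR>1 {
    # one pipe per distinct country code; awk keeps them all open
    cmd = "gzip -9 > file_" $4 ".txt.gz"
    print | cmd
}' inp.txt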
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk will permit you to open different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk using something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P
