Convert text file to csv using shell script

Convert text file to csv using shell script - bash

I have a file like this
InputFile.txt
JOB JOB_A
Source C://files/InputFile
Resource 0 AC
User Guest
ExitCode 0 Success
EndJob
JOB JOB_B
Source C://files/
Resource 1 AD
User Current
ExitCode 1 Fail
EndJob
JOB JOB_C
Source C://files/Input/
Resource 3 AE
User Guest2
ExitCode 0 Success
EndJob
I have to convert the above file to a csv file as below
How to convert it using shell scripting?

I used awk.
The separator is a tabulator because it's more common than a comma in the CSV format.
If you want a coma, you can simply change the \t -> ,.
cat InputFile.txt | \
awk '
BEGIN{print "Source\tResource\tUser\tExitCode"}
/^JOB/{i=0}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[i]=substr($0,RLENGTH+RPOS)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[i];print ""}'
The first line BEGIN writes header.
The second line matches /JOB/ and only sets an iterator i as zero.
The third line matches the blank on the start of a line and fills array a with values (it count on strict count and order of rows).
The fourth part of the awk script matches EndJob and prints stored values.
Output:
Source
Resource
User
ExitCode
C://files/InputFile
0 AC
Guest
0 Success
C://files/
1 AD
Current
1 Fail
C://files/Input/
3 AE
Guest2
0 Success
Script using associative array:
You can change the script so that uses strict Source, Resource, User, and ExitCode values from $1 (first record) of lines, but it would be a little longer, and this input file doesn't need it.
cat InputFile.txt | \
awk '
BEGIN{
h[1]="Source";
h[2]="Resource";
h[3]="User";
h[4]="ExitCode";
for(i=1;i<5;i++) printf "%s\t",h[i];print ""}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[$1]=substr($0,RLENGTH+RPOS)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[h[i]];print ""}'

with sed ... dont know if the order in the InputFile.txt is always the same
as Source, Resource, User, ExitCode, but if it is
declare delimiter=";"
sed -Ez "s/[^\n]*(Source|Resource|User) ([^\n]*)\n/\2${delimiter}/g;s/[ \t]*ExitCode //g;s/[^\n]*JOB[^\n]*\n//gi;s/^/Source${delimiter}Resource${delimiter}User${delimiter}ExitCode\n/" < InputFile.txt > output.csv

Related

Processing of the data from a big number of input files

My AWK script processes each log file from the folder "${results}, from which it looks for a pattern (a number occurred on the first line of ranking table) and then print it in one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested for up to 50k logs), but does not work for the case of big number of the input logs (e.g. with 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able processing any number of input logs?

If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively

Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'

When using GNU AWK you might alter ARGC and ARGV to command GNU AWK to read additional files, consider following simple example, let filelist.txt content be
file1.txt
file2.txt
file3.txt
and content of these files to be respectively uno, dos, tres then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: when reading first file i.e. where number of row in file (FNR) is equal number of row globally (NR) I add to ARGV line as value under key being number of row plus one, as ARGV[1] is already filelist.txt and I increase ARGC by 1, I instruct GNU AWK to then go to next line so no other action is undertaken. For other files I print filename followed by whole line.
(tested in GNU Awk 5.0.1)

bash + awk: extract specific information from ensemble of filles

I am using bash script to extract some information from log files located within the directory and save the summary in the separate file.
In the bottom of each log file, there is a table like:
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -6.961 0 0
2 -6.797 2.908 4.673
3 -6.639 27.93 30.19
4 -6.204 2.949 6.422
5 -6.111 24.92 28.55
6 -6.058 2.836 7.608
7 -5.986 6.448 10.53
8 -5.95 19.32 23.99
9 -5.927 27.63 30.04
10 -5.916 27.17 31.29
11 -5.895 25.88 30.23
12 -5.835 26.24 30.36
from this I need to take only the value from the second column of the first line (-6.961) and add it together with the name of the log as one string in new ranking_${output}.log
log_name -6.961
so for 5 processed logs it should be something like:
# ranking_${output}.log
log_name1 -X.XXX
log_name2 -X.XXX
log_name3 -X.XXX
log_name4 -X.XXX
log_name5 -X.XXX
Here is a simple bash workflow, which takes ALL THE LINES from ranking table and saves it together with the name of the LOG file:
#!/bin/bash
home="$PWD"
#folder contained all *.log files
results="${home}"/results
# loop each log file and take its name + all the ranking table
for log in ${results}/*.log; do
log_name=$(basename "$log" .log)
echo "$log_name" >> ${results}/ranking_${output}.log
cat $log | tail -n 12 >> ${results}/ranking_${output}.log
done
Could you suggest me an AWK routine which would select only the top value located on the first line of each table?
This is an AWK example that I had used for another format, which does not work there:
awk -F', *' 'FNR==2 {f=FILENAME;
sub(/.*\//,"",f);
sub(/_.*/ ,"",f);
printf("%s: %s\n", f, $5) }' ${results}/*.log >> ${results}/ranking_${output}.log

With awk. If first column contains 1 print filename and second column to file output:
awk '$1=="1"{print FILENAME, $2}' *.log > output
Update to remove path and suffix (.log):
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); print FILENAME, $2}'

Store variables from lines in a text file using awk and cut in a for loop

I have a tab separated text file, call it input.txt
cat input.txt
Begin Annotation Diff End Begin,End
6436687 >ENST00000422706.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-205|APOL1|2901|protein_coding| 50 6436736 6436687,6436736
6436737 >ENST00000426053.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-206|APOL1|2808|protein_coding| 48 6436784 6436737,6436784
6436785 >ENST00000319136.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000075315.5|APOL1-201|APOL1|3000|protein_coding| 51 6436835 6436785,6436835
6436836 >ENST00000422471.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319151.1|APOL1-204|APOL1|561|nonsense_mediated_decay| 11 6436846 6436836,6436846
6436847 >ENST00000475519.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319153.1|APOL1-212|APOL1|600|retained_intron| 11 6436857 6436847,6436857
6436858 >ENST00000438034.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319152.2|APOL1-210|APOL1|566|protein_coding| 11 6436868 6436858,6436868
6436869 >ENST00000439680.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319252.1|APOL1-211|APOL1|531|nonsense_mediated_decay| 10 6436878 6436869,6436878
6436879 >ENST00000427990.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319154.2|APOL1-207|APOL1|624|protein_coding| 12 6436890 6436879,6436890
6436891 >ENST00000397278.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319100.4|APOL1-202|APOL1|2795|protein_coding| 48 6436938 6436891,6436938
6436939 >ENST00000397279.8|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-203|APOL1|1564|protein_coding| 28 6436966 6436939,6436966
6436967 >ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding| 11 6436977 6436967,6436977
6436978 >ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay| 11 6436988 6436978,6436988
Using the information in input.txt I want to obtain information from a file called Other_File.fa. This file is an annotation file filled with ENST#'s (transcript IDs) and sequences of A's,T's,C's,and G's. I want to store the sequence in a file called Output.log (see example below) and I want to store the command used to retrieve the text in a file called Input.log (see example below).
I have tried to do this using awk and cut so far using a for loop. This is the code I have tried.
for line in `awk -F "\\t" 'NR != 1 {print substr($2,2,17)"#"$5}' input.txt`
do
transcript=`cut -d "#" -f 1 $line`
range=`cut -d "#" -f 2 $line` #Range is the string location in Other_File.fa
echo "Our transcript is ${transcript} and our range is ${range}" >> Input.log
sed -n '${range}' Other_File.fa >> Output.log
done
Here is an example of the 11 lines between ENST00000433768.5 and ENST00000431184.1 in Other_File.fa.
grep -A 11 ENST00000433768.5 Other_File.fa
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
>ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay|
The range value in input.txt for this transcript is 6436967,6436977. In my file Input.log for this transcript I hope to get
Our transcript is ENST00000433768.5 and our range is 6436967,6436977
And in Output.log for this transcript I hope to get
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
But I am getting the following error, and I am unsure as to why or how to fix it.
cut: ENST00000433768.5#6436967,6436977: No such file or directory
cut: ENST00000433768.5#6436967,6436977: No such file or directory
Our transcript is and our range is
My thought was each line from the awk would be read as a string then cut could split the string along the "#" symbol I have added, but it is reading each line as a file and throwing an error when it can't locate the file in my directory.
Thanks.

EDIT2: This is a generic solution which will compare 2 files(input and other_file.fa) and on whichever line whichever range is found it will print them. Eg--> Range numbers are found on 300 line number but range shows you should print from 1 to 20 it will work in that case also. Also note this calls system command which further calls sed command(like you were using range within sed), there are other ways too, like to load whole Input_file into an array or so and then print, but I am going with this one here, fair warning this is not tested with huge size files.
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum,",")
print arr[$2]
start=lineNum[1]
end=lineNum[2]
print "sed -n \047" start","end"p \047 " FILENAME
system("sed -n \047" start","end"p\047 " FILENAME)
start=end=0
}
' file1 FS="[>|]" other_file.fa
EDIT: With OP's edited samples, please try following to print lines based on other file. assumes that the line you find range values, those values will be always after the line on which they found(eg--> 3rd line range values found and range is 4 to 10).
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum," ")
start=lineNum[1]
end=lineNum[2]
}
FNR>=start && FNR<=end{
print
if(FNR==end){
start=end=0
}
}
' file1 FS="[>|]" other_file.fa
You need not to do this with a for loop and then call awk program each time for each line. This could be done in single awk, considering that you have to only print them. Written and tested with your shown samples.
awk -F'[>| ]' 'FNR>1{print "Our transcript is:"$3" and our range is:"$NF}' Input_file
NOTE: This will print for each line of your Input_file values of transcript and range, in case you want to further perform some operation with their values then please do mention.

Unable to parse the log file using Shell and python

I am trying to parse the log file using shell or python script. I used awk and sed but no luck. Can some one help me to resolve this. Below is the input and expecting output.
Input:
customer1:123
SRE:1
clientID:1
Error=1
customer1:124
SRE:1
clientID:1
Error=2
customer1:125
SRE:1
clientID:1
Error=3
customer1:126
SRE:1
clientID:1
Error=4
Output:
Customer | Error
123 1
124 2
125 3
126 4

It's usual to show some of your work, or what you've tried so far, but here's a rough guess at what you're looking for.
tmp$ awk -F: '/^customer1:/ {CUST=$2} ; /^Error/ {split($0,a,"=") ; print CUST, a[2]} ' t
Or breaking down by line:
tmp$ awk -F: '\
> /^customer1:/ {CUST=$2} ; \
> /^Error/ {split($0,a,"=") ; print CUST, a[2]} \
> ' t
123 1
124 2
125 3
126 4
The first line
/^customer1:/ {CUST=$2} ;
Does two things - matches lines that start (^ means start) with customer1, and those lines are automatically split on : because we said -F: at the start of our command.
/^Error/ {split($0,a,"=") ; print CUST, a[2]} ;
Matches lines that starts with Error, splits those lines into array a, on the delimiter "=", and then prints out the last value of CUST, as well as the second field on the error line.
Hopefully that all makes sense. It's worth reading an awk tutorial like https://www.grymoire.com/Unix/Awk.html

Getting specific lines of a file

I have this file with 25 million rows. I want to get specific 10 million lines from this file
I have the indices of these lines in another file. How can I do it efficiently?

Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
while ((current < wanted))
do
if read -u 3 line
then ((current++))
else break 2
fi
done
echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file

Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the lines wanted. I am not sure if this is efficient enough. It seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) under my environment though.
Some explanations:
The 1st command preprocess the indices file so that the numbers are right adjusted with leading zeroes and width 8 (since number of rows in input.txt is known to be 25M)
The 2nd command will print the rows and line numbers with exactly the same format as in the preprocessed index file, then join them to get the wanted rows (cut to remove the line numbers).

Since you said the file with lines you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.

Like your index file is index.txt and datafile is data.txt then you can do it using sed like as follows
#!/bin/bash
while read line_no
do
sed ''$line_no'q;d' data.txt
done < input.txt

You could run a loop that reads from the 25 million lined file and when the loop counter reaches a line number that you want tell it to write that line. EX:
String line = "";
int count = 0;
while((line = br.readLine())!=null)
{
if(count == indice)
{
System.out.println(line) //or file write
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Convert text file to csv using shell script - bash

Related

Processing of the data from a big number of input files

bash + awk: extract specific information from ensemble of filles

Store variables from lines in a text file using awk and cut in a for loop

Unable to parse the log file using Shell and python

Getting specific lines of a file

Categories

Resources