Need help splitting a file with blank lines (bash)

I have a file containing several thousand lines. The file format is similar to this:
1
H
H 13.1641870 7.1039560 -5.9652740

3
O2H2
H 15.5567440 5.6184980 -4.5255100
H 15.8907030 4.2338600 -5.4917990
O 15.5020000 6.4310000 -7.0960000
O 13.7940000 5.5570000 -8.1620000

2
CH
H 13.0960830 7.7155820 -3.5224750
C 11.0480000 7.4400000 -5.5080000
.
.
.
.
What I want is to split the full file into several files, putting into each file all the information between blank lines. The problem is that the blocks do not follow a regular pattern: some have 1 line and others have 10.
Could someone tell me how to split the file using the blank lines as separators?

Using awk and the data in a file called mainfile
awk 'BEGIN { RS="\n\n+" } { print $0 > ("file" NR ".txt") }' mainfile
Set the record separator to one or more blank lines (a run of two or more line feeds) and then print each record to a file named after the record number, i.e. file1.txt etc.

Would you please try the following:
awk -v RS="" '{print > ("file" ++i ".txt"); close("file" i ".txt")}' input.txt
If the awk variable RS is set to the null string, then records are separated by blank lines.
It is recommended to close each file to avoid the "too many open files" error.
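As an aside (not one of the original answers), GNU csplit can do a similar job; a sketch assuming GNU coreutils, where -z drops empty output files and --suppress-matched omits the blank separator lines themselves (the pieces are written as xx00, xx01, ...):
csplit -z --suppress-matched input.txt '/^$/' '{*}'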


Pass number of for loop elements to external command

I'm using a for loop to iterate through .txt files in a directory and grab specified rows from each file. Afterwards the output is passed to the pr command in order to print it as a table. Everything works fine; however, I'm manually specifying the number of columns that the table should contain, which is cumbersome when the number of files is not constant.
The command I'm using:
for f in *txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' $f; done | pr -ts --column 4
How should I modify the command so that '4' is replaced by the number of elements?
Edit:
The fundamental question was whether one can pass the number of matching files to a command outside the loop. Judging by the solutions, I guess it is not possible to work around the problem. Until reaching this conclusion the structure of the files was not really relevant.
However, taking the above into account, I'm providing the file structure below.
Sample file.txt:
Irrelevant1 text
Placebo 1222327
Irrelevant1 text
Irrelevant2 text
Irrelevant3 text
Treatment1 105956
Irrelevant1 text
Irrelevant2 text
Treatment2 49271
Irrelevant1 text
Irrelevant2 text
The for loop generates the following from 4 *txt files:
1222327
105956
49271
969136
169119
9672
1297357
237210
11581
1189529
232095
13891
Expected pr output using a dynamically generated --column 4:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
Assumptions:
all input files generate the same number of output lines (otherwise we can add some code to keep track of the max number of lines and generate blank columns as needed)
Setup (columns are tab-delimited):
$ grep -n xxx f[1-4].txt
f1.txt:6:xxx 1222327
f1.txt:9:xxx 105956
f1.txt:24:xxx 49271
f2.txt:6:xxx 969136
f2.txt:9:xxx 169119
f2.txt:24:xxx 9672
f3.txt:6:xxx 1297357
f3.txt:9:xxx 237210
f3.txt:24:xxx 11581
f4.txt:6:xxx 1189529
f4.txt:9:xxx 232095
f4.txt:24:xxx 13891
One idea using awk to dynamically build the 'table' (replaces OP's current for loop):
awk -F'\t' '
FNR==1 { c=0 }
FNR ~ /^(6|9|24)$/ { ++c ; arr[c]=arr[c] (FNR==NR ? "" : " ") $2 }
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt | column -t -o ' '
NOTE: we'll go ahead and let column take care of pretty-printing the table with a single space separating the columns; otherwise we could add some more code to awk to right-pad the columns with spaces.
This generates:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
You could just run ls and pipe the output to wc -l. Then once you've got that number you can assign it to a variable and place that variable in your command.
num=$(ls *.txt | wc -l)
I forget how to place bash variables in AWK, but I think you can do that. If not, respond back and I'll try to find a different answer.
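For example, a sketch of how that count could be dropped into the original command (the shell simply expands the variable, so it can go straight into pr's --column option; if the value were needed inside awk it could be passed with -v instead):
num=$(ls *.txt | wc -l)
for f in *.txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' "$f"; done | pr -ts --column "$num"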

How can I use sort or another bash command to get one line from all the lines whose 1st, 2nd and 3rd fields are the same

I have a file named file.txt
$ cat file.txt
1./abc/cde/go/ftg133333.jpg
2./abc/cde/go/ftg24555.jpg
3./abc/cde/go/ftg133333.gif
4./abt/cte/come/ftg24555.jpg
5./abc/cde/go/ftg133333.jpg
6./abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from the lines whose first, second and third PATH components are the same and which have the same file EXTENSION.
Note each PATH component is separated by a forward slash "/". E.g. in the first line of the list, the first PATH is abc, the second PATH is cde and the third PATH is go.
The file EXTENSION is .jpg, .gif, .pdf ... always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting with "-u" should remove all but one line for each unique combination of the first, second and third field/PATH. But obviously I didn't take the EXTENSION (jpg, pdf, gif) into account with this.
MY QUESTION
I need a way to keep only one of the lines if the first, second and third fields are the same and have the same EXTENSION, using "/" as the delimiter to divide each line into fields. I want to output the result to another file, say file2.txt.
In file2.txt, how do I add a word, say "KALI", before the extension in each line, so it will look something like /abc/cde/go/ftg133333KALI.jpg, using line 1 of file.txt above as an example.
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Lines 1, 2 & 5 have the same 1st, 2nd and 3rd fields, with the same file extension ".jpg", so only line 1 should be in the output.
Line 3 is in the output even though it has the same 1st, 2nd and 3rd fields as 1, 2 and 5, because the extension is different (".gif").
Line 4 has different 1st, 2nd and 3rd fields, hence it is in the output.
Line 6 is in the output even though it has the same 1st, 2nd and 3rd fields as 1, 2 and 5, because the extension is different (".pdf").
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
Missed the KALI part:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
another awk
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
assumes . doesn't exist in directory names (not necessarily true in general).
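A variant of the same idea (a sketch, not from the original answer) that takes the extension only from the last path component, so dots in directory names would not matter:
awk -F/ '{ext=$NF; sub(/.*\./, "", ext)} !a[$2,$3,$4,ext]++' file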

Grep list (file) from another file

I'm new to bash and am trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
The code I tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
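For instance (a sketch, not part of the original answer), grep's -w option limits matches to whole words, which would avoid the 123DEABCXYZ case:
grep -w -f File1.txt base.csv
An awk script along the lines described above: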
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
    if($0 ~ a[j]) {
      print FILENAME ":" FNR ":" $0 >> ("output." a[j])
      next }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt                      # cycle through all files containing pattern lists
do
    while read -r q                 # read the pattern file line by line
    do
        echo "$q" >>"output.${i}"
        grep "$q" base.csv >>"output.${i}"   # $q is now a single pattern, not a file of patterns
        echo                        # blank separator line
    done < "${i}"
done
Here is one that splits the words out of file2 (comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores against each word the names of the records (line 1 etc.) in which it appears, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehaviour (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
Since the basic goal was to cycle through pattern list files, search for the patterns in a huge file and write out specific columns from the lines found, I simply wrote:
for i in *.txt                                    # cycle through files with pattern lists
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1"    # grep all patterns from the current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2"     # cut the columns of interest and write them out to another file
done
I'm aware that this code could be improved using some fancy pipeline features, but it works perfectly as is; I hope it helps somebody in a similar situation. You can easily add some echo calls to write out the pattern list names as I initially requested.
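For reference, a sketch of the piped variant (the same two commands, just skipping the intermediate .out1 file):
for i in *.txt
do
    grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 >> "${i}.out2"
done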

Fill lines with a certain character until each contains a given number of them

I have various flat files containing roughly 30k records that I need to reformat with a script to ensure they all have the same number of ~'s.
For example, below are two sample records from the flat file. The first record contains 8 ~'s and the second contains 10 ~'s.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~
I need both records to contain 12 ~'s, so I need code that will loop through the file and pad out each line to contain the correct number of ~'s. The desired result would be as follows.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
I have the following bit of code which will display the number of ~'s in each line, but I'm not sure how to proceed from here.
sed 's/[^~]//g' inputfile.text| awk '{ print length }'
Set the field separator to ~ and keep adding ~s until you have enough:
$ awk -F"~" -v cols=12 'NF<=cols{for (i=NF;i<=cols;i++) $0=$0 FS}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
awk -v FS='~' -v count=12 '{line = $1; for (i = 2; i <= count + 1; i++) line = line "~" $i; print line}' tildes.txt
Here's a bash builtin-only solution:
a='~~~~~~~~~~~~'            # 12 tildes
while read -r line; do
    n=${line//[!~]}
    echo "$line${a/$n}"
done < file
Explanation
n=${line//[!~]} makes a string of all the ~ characters in $line.
${a/$n} takes the string of 12 ~ characters and deletes the first match of $n, so when we append it to $line, there will be exactly 12 tildes in the echoed line.
Warning: If there are more than 12 tildes in one line in file, the corresponding outputted line will have more than 24 tildes in it, as there are no checks to make sure ${#n} is less than ${#a}.
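A guarded variant (a sketch along the same lines) that only pads lines which have fewer than 12 tildes:
a='~~~~~~~~~~~~'
while read -r line; do
    n=${line//[!~]}                        # the tildes already present
    (( ${#n} < ${#a} )) && line+=${a/$n}   # pad only when below the target count
    echo "$line"
done < file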
$ awk -F"~" -v c=12 'BEGIN{OFS=FS;c++}{$c=$c}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~

Use awk to separate text file into multiple files

I've read a couple of other questions about this, but none of them seem to be working. I'm currently trying to split something like file A.txt using the delimiter "STOPHERE".
This is the code:
#!/bin/bash
awk 'BEGIN{
RS = "STOPHERE"
file = 0}
{
file++
print $0 > ("sepf" file)
}' A.txt
File A:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa lwdjnuqqfqaaaaaaaaaa qlknfqek fkgnl efekfnwegelflfne
ldnwefne f STOPHEREsdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn STOPHERE
fknjke bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbowqff STOPHERE i
asfjfenf STOPHERE
Into these:
sepf1:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa lwdjnuqqfqaaaaaaaaaa qlknfqek fkgnl efekfnwegelflfne
ldnwefne f
sepf2:
sdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn
sepf3:
#line starts here
fknjke bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbowqff
sepf4:
i
asfjfenf
So basically, the formatting has to stay exactly the same between the STOPHERE delimiters.
But for some reason, this is the kind of output I'm getting in some of the files:
Eg: sepf2
TOPHEREsdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn
Any ideas as to why the "TOPHERE" remains??
In a traditional POSIX awk only the first character of RS is used, so your file is being split on the letter S and the rest of the word (TOPHERE) is left at the start of the next record. GNU awk allows RS to be a regex, so you can provide multiple characters as a record separator. Your code can also be simplified, as awk gives uninitialized variables a default value of 0.
So this will generate a separate file for each record.
awk -v RS="STOPHERE" '{print $0 > ("sepf" ++file)}' A.txt
