Including empty lines using pattern - bash

My problem is the following: I have a text file with no empty lines, and I would like to insert empty lines into it according to a pattern file, where 1 means print the next line of the text file and 0 means insert an empty line. My text file is:
apple
banana
orange
milk
bread
The pattern file is:
1
1
0
1
0
1
1
The desired output is, correspondingly:
apple
banana

orange

milk
bread
What I tried is:
for i in $(cat pattern_file); do
awk -v var="$i" '{ if (var == 1) print $0; else printf "\n" }' file
done
But it prints all the lines first, and only after that it changes $i
Thanks for any hints.

Read the pattern file into an array, then use that array when processing the text file.
awk 'NR==FNR { newlines[NR] = $0; next}
{ print $0 (newlines[FNR] ? "" : "\n") }' patternfile textfile

This version allows multiple 0s between 1s.
Self-documented code:
awk '# for file 1 only
NR==FNR {
# load an array with 0 and 1 (inverted, because a non-existing element defaults to 0)
n[NR]=!$1
# skip to the next line (do not go further in the script for this line)
next
}
# for each line (of file 2, thanks to the next in the previous block)
{
# loop while the (next, due to a++) element of the array is 1
for(a++;n[a]==1;a++){
# print an empty line
printf( "\n")
}
# print the original line
print
}' pattern YourFile
The values are inverted to avoid an endless stream of newlines on the last line when the pattern file has fewer entries than the data file has lines.
Multiple 0s need a loop plus a test.
Desynchronization between the line numbers of the pattern file and the data file is a problem when using a direct array (unless the array keeps track of how many newlines to insert, which would be another way of doing it).
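For instance, to see the multiple-0 handling, a quick test could look like this (insert_blanks.awk is a name chosen here for illustration and would hold the awk program above; the expected result is shown as comments):
# insert_blanks.awk = the awk program above, saved to a file (hypothetical name)
printf '1\n0\n0\n1\n' > pattern
printf 'apple\nbanana\n' > YourFile
awk -f insert_blanks.awk pattern YourFile
# apple
# (empty line)
# (empty line)
# banana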

This is a bit of a hack, but I present it as an alternative to your traditionally awk-y solutions:
paste -d, file.txt <(cat pattern | tr '\n' ' ' | sed 's,1 0,10,g' | tr ' ' '\n' | tr -d '1') | tr '0' '\n' | tr -d ','
The output looks like this:
apple
banana

orange

milk
bread

Inverse of Barmar's, read the text into an array and then print as you process the pattern:
$ awk 'NR==FNR {fruit[NR]=$0; next} {print $0?fruit[++i]:""}' fruit.txt pattern.txt
apple
banana

orange

milk
bread

For an answer using only bash:
i=0; mapfile file < file
for p in $(<pattern); do
((p)) && printf "%s" "${file[i++]}" || echo
done
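A quick way to try it with the question's sample data (the file names file and pattern match the snippet above):
printf '%s\n' apple banana orange milk bread > file
printf '%s\n' 1 1 0 1 0 1 1 > pattern
i=0; mapfile file < file
for p in $(<pattern); do
((p)) && printf "%s" "${file[i++]}" || echo
done
# prints apple and banana, an empty line, orange, an empty line, then milk and bread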

Related

Selectively reformatting a file with spaces and \n

I have multiple files in the following format. This one has 3 sequences (the number of sequences varies between files, but their names always end in "."), each with 40 positions, as indicated by the numbers in the first line. At the beginning of the lines (except the first one) are the names of the sequences:
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC

GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
I need to convert it to this format, where the sequences are continuous, with no spaces or \n's, and on a new line after their names. The only spaces that should remain are between the two numbers in the first line.
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
I tried sed to delete the spaces and \n's, but I don't know how to apply it only after the first line, nor how to avoid ending up with one huge line.
Thanks
Here's a shell script that may provide what you need:
head -1 input
awk '
NR == 1 { sequences = $1 ; positions = $2 ; next }
{
if ( $1 ~ /^[0-9]/ ) {
sid = $1 ; $1 = "" ; sequence_name[ NR - 1 ] = sid
sequence[ NR - 1 ] = $0
} else {
sequence[ ( NR - 1 ) % ( sequences + 1 ) ] = sequence[ (NR-1) % ( sequences + 1 ) ] " " $0
}
}
END {
for ( x = 1 ; x <= length( sequence_name ) ; x++ )
{
print sequence_name[x]
print sequence[x]
}
}' input | tr -d ' '
I added head -1 at the top of the script just to get the first line out of your file. I couldn't output the first line from within the awk script because of the pipe to tr -d ' '.
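As a variation (just a sketch, checked only against the sample above), the space removal can be done inside awk with gsub, so the header no longer needs a separate head -1 and there is no trailing pipe to tr:
awk '
NR == 1 { print; next }     # keep the "3 40" header line as-is
NF == 0 { next }            # skip the blank separator line
$1 ~ /^[0-9]/ {             # a line that starts a sequence: name plus first chunks
    name[++i] = $1
    sub(/^[^ ]+ +/, "")     # drop the name, keep the chunks
    seq[i] = $0
    next
}
{                           # continuation chunks: append to the sequences in order
    j = (j % i) + 1
    seq[j] = seq[j] $0
}
END {
    for (k = 1; k <= i; k++) {
        print name[k]
        s = seq[k]
        gsub(/ /, "", s)    # remove the remaining spaces inside awk
        print s
    }
}' input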
I think this should work, but my output is longer: when I actually concatenate all of the trailing "orphan" sequences I end up with a much longer line.
cat input.txt | awk '/^[0-9]+ [0-9]+$/{printf("%s\n",$0); next} /[0-9]+[.]/{ printf("\n%s\n",$1);for(i=2; i<=NF;i++){printf("%s",$i)}; next} /^ */{ for(i=1; i<=NF;i++){printf("%s",$i)}; next;}'
3 40
Please try and let me know.
Remember the position of the empty line and merge the lines before the empty line with those after it:
awk '
NR==1{print;next}
NR!=1 && !empty{arr[NR]=$1 "\n" $2 $3}
/^$/{empty=NR-1;next}
NR!=1 && empty{printf "%s%s%s\n", arr[NR-empty], $1, $2}
' file
My second solution, without awk: merge the file with itself, using the empty line as a separator.
cat >file <<EOF
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC

GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
EOF
head -n1 file
paste <(sed -n '1!{ /^$/q;p; }' file) <(sed -n '1!{ /^$/,//{/^$/!p}; }' file) |
sed 's/[[:space:]]//g; s/\./.\n/'
Will output:
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
Explanation:
head -n1 file outputs the first line.
sed -n '1!{ /^$/q;p; }' file
1! - don't output the first line
/^$/q - quit at the empty line
p - print everything else
sed -n '1!{ /^$/,//{/^$/!p}; }' file
1! - ignore the first line
/^$/,// - from the empty line until the end
/^$/!p - output if not an empty line
paste <(..) <(...) - merge the two seds with a tab
sed 's/[[:space:]]//g; s/\./.\n/'
s/[[:space:]]//g - remove all spaces
s/\./.\n/ - replace a dot with a dot and a newline.

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors, using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with ",":
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
Here n=3 and the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
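Put together, the BSD/OSX variant would look something like this (a sketch that simply combines the first command above with the tr preprocessing just described; the process substitution assumes bash):
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' <(tr -d '\n' < file.csv)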
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

How to find the difference between the values of two fields from two files and print only if there is a difference >10 using shell

Let's say I have two files, a.txt and b.txt. The content of a.txt and b.txt is as follows:
a.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
So these files have various fields separated by "|" and can have any number of lines. Also, assume that both files are sorted, so corresponding lines can be matched between the two files. Now, I want to take the difference between fields 8 and 9 of each pair of corresponding rows, and if either difference is greater than 10, print the lines; otherwise remove the lines from the files.
i.e., in the given example, I subtract the field-8 values, |10-11| (from a.txt and b.txt respectively), which is 1 in absolute value, and similarly for field 9, |0-0|, which is 0; both differences are <10, so we delete this line from the files.
For the second line, the field-8 difference |11-22| is 11, which is greater than 10, so we print this line (there is no need to check |19-18|, since if either of the field differences (8, 9) exceeds 10 we print the line).
So the output is
a.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
You can do this with awk:
awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt
Explanation
The -F sets the field separator to the pipe symbol. The stuff in curly braces after FNR==NR applies only to the processing of a.txt. It says to save the whole line in array x[] indexed by line number (FNR) and also to save the eighth field in array eight[] also indexed by line number. Likewise field 9 is saved in array nine[].
The second set of curly braces applies to processing file b. It calculates the differences d1 and d2. If either exceeds 10, the line is printed to each of the files newa and newb.
You can write a bash shell script that does it:
while true; do
read -r lineA <&3 || break
read -r lineB <&4 || break
vara_8=$(echo "$lineA" | cut -f8 -d "|")
varb_8=$(echo "$lineB" | cut -f8 -d "|")
vara_9=$(echo "$lineA" | cut -f9 -d "|")
varb_9=$(echo "$lineB" | cut -f9 -d "|")
if (( vara_8-varb_8 > 10 || vara_8-varb_8 < -10
|| vara_9-varb_9 > 10 || vara_9-varb_9 < -10 )); then
echo "$lineA" >> newA.txt
echo "$lineB" >> newB.txt
fi
done 3<a.txt 4<b.txt
For short files
Use the method provided by Mark Setchell. Seen below in an expanded and slightly modified version:
parse.awk
# fields are pipe-separated
BEGIN { FS = "|" }
# awk has no built-in abs(), so define one
function abs(x) { return x < 0 ? -x : x }
FNR==NR {
x[FNR] = $0
m[FNR] = $8
n[FNR] = $9
next
}
{
if(abs(m[FNR] - $8) > 10 || abs(n[FNR] - $9) > 10) {
print x[FNR] >> "newa"
print $0 >> "newb"
}
}
Run it like this:
awk -f parse.awk a.txt b.txt
For huge files
The method above reads a.txt into memory. If the file is very large, this becomes unfeasible and streamed parsing is called for.
It can be done in a single pass, but that requires careful handling of the multiplexed lines from a.txt and b.txt. A less error prone approach is to identify relevant line numbers, and then extract those into new files. An example of the last approach is shown below.
First you need to identify the matching lines:
# Extract fields 8 and 9 from a.txt and b.txt
paste <(awk -F'|' '{print $8, $9}' OFS='\t' a.txt) \
<(awk -F'|' '{print $8, $9}' OFS='\t' b.txt) |
# Check whether the fields match the criteria and print the line number
awk '$1 - $3 > n || $3 - $1 > n || $2 - $4 > n || $4 - $2 > n { print NR }' n=10 > linesfile
Now we are ready to extract the lines from a.txt and b.txt, and as the numbers are sorted, we can use the extract.awk script proposed here (repeated for convenience below):
extract.awk
BEGIN {
getline n < linesfile
if(length(ERRNO)) {
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
NR == n {
print
if(!(getline n < linesfile)) {
if(length(ERRNO))
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
Extract the lines (can be run in parallel):
awk -v linesfile=linesfile -f extract.awk a.txt > newa
awk -v linesfile=linesfile -f extract.awk b.txt > newb
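For example, running the two extractions in parallel from the shell could be as simple as backgrounding both jobs (a small sketch, not part of the original answer):
awk -v linesfile=linesfile -f extract.awk a.txt > newa &
awk -v linesfile=linesfile -f extract.awk b.txt > newb &
wait    # wait for both extractions to finish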

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this:
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
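If you want the literal tab separator and the case-insensitive match implied by the sample output (alex in file1 vs Alex in file2), a small variation could be (an assumption on my part, not part of the original answer):
# variant: tab-separated output and case-insensitive matching via tolower()
awk 'NR==FNR{ seen[tolower($0)]=1; next } { printf "%s\t%d\n", $0, seen[tolower($0)] + 0 }' file2 file1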
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
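If the case difference in the sample data matters here too, one sketch is to swap the file order so that file1's lines are the ones tagged, and add -i for a case-insensitive match (this deviates from the snippet above, which tags file2's lines):
{
grep -ixF -f file2 file1 | sed $'s/$/\t1/'    # file1 lines present in file2 (ignoring case)
grep -vixF -f file2 file1 | sed $'s/$/\t0/'   # file1 lines absent from file2
} > somefile.txt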
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t1\n" "${REPLY#?}"
else
printf "%s\t0\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed. If you're familiar with Python and are interested in this solution, you only need a bit of formatting.
#!/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c

get Nth line in file after parsing another file

I have one of my large file as
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
Here the 1st column (foo, bar, qux, etc.) holds file names, and the 2nd column (43, 51, 34, etc.) holds line numbers. I want to print the Nth line (specified by the 2nd column) of each file (specified in the 1st column).
How can I automate the above in the Unix shell?
Actually, the above file is generated while compiling, and I want to print the warning lines in the code.
-Thanks,
while IFS=: read name line rest
do
head -n $line $name | tail -1
done < input.txt
while IFS=: read file line message; do
echo "$file:$line - $message:"
sed -n "${line}p" "$file"
done <yourfilehere
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above will print the 4th line of your file. You can change the number as per your requirement.
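To drive this from the listing file described in the question instead of hard-coding the line number, one sketch (assuming the listing is saved as input.txt, as in the first answer) is:
while IFS=: read -r name line rest; do
awk -v n="$line" 'NR == n { print; exit }' "$name"   # print only the requested line, then stop
done < input.txt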
Just in awk, but probably worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk.
gawk -F: '
BEGIN {OFS=FS}
{
files[$1] = 1
lines[$1] = lines[$1] " " $2
msgs[$1, $2] = $3
}
END {
for (file in files) {
split(lines[file], l, " ")
n = asort(l)
count = 0
for (i=1; i<=n; i++) {
while (count < l[i]) {
count++
getline line < file
}
print file, l[i], msgs[file, l[i]]
print line
}
close(file)
}
}
'
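Note that this program reads the listing on standard input, so a usage sketch (with input.txt as in the earlier answers, and report.awk a hypothetical file holding the program body between the single quotes) would be:
gawk -F: -f report.awk < input.txt   # report.awk and input.txt are illustrative names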
This might work for you:
sed 's/^\([^:]*\):\([^:]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n no output by default,
-r extended regular expressions (simplifies using the parens)
on line 3 do {...;p} (print at the end)
s ubstitute "foo:bar:baz" with "foo bar"
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"
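To handle every line of the listing rather than just line 3, the same pieces can be wrapped in a loop (a sketch along the lines of the earlier read-loop answers):
while read -r fn ln; do
sed -n "${ln}p" "$fn"
done < <(sed -nr 's/^([^:]*):([^:]*):.*$/\1 \2/p' namesNnumbers.txt)   # same substitution as above, applied to every line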
