Shell scripting to find the delimiter - bash

I have a file with three columns, which has pipe as a delimiter. Now some lines in the file can have a "," instead of "|", due to some error. I want to output all such erroneous rows.

You can also use grep, it is more complicated:
egrep "\|.*\|.*\|" input
echo No pipe
egrep "^[^\|]*$" input
echo One pipe
egrep "^[^\|]*\|[^\|\]*$" input
echo 3+ pipe
egrep "\|[^\|]*\|[^\|\]*\|" input
Before combining the greps, first introduce new variables
p (pipe) and n (no pipe)
p="\|"
n="[^\|]*"
echo "p=$p, n=$n"
echo No pipe
egrep "^$n$" input
echo One pipe
egrep "^$n$p$n$" input
echo 3+ pipe
egrep "$p$n$p$n$p" input
Now bring all together
egrep "^$n$|^$n$p$n$|$p$n$p$n$p" input
Edit: The comments and variable names were about "slashes", but they are pipes (with backslashes). That was a bit confusing.

To count the number of columns with awk you can use the NF variable:
$ cat file
ABC|12345|EAR
PQRST|123|TWOEYES
ssdf|fdas,sdfsf
$ awk -F\| 'NF!=3' file
ssdf|fdas,sdfsf
However, this does not seem to cover all the possible ways the data could be corrupted based on the various revisions of the question and the comments.
A better approach would be to define the exact format that the data must follow. For example, assuming that a line is "correct" if it is three columns, with the first and third letters only, and the second numeric, you could write the following script to match all non conforming lines:
awk -F\| '!(NF==3 && $1$3 ~ /^[a-zA-Z]+$/ && $2+0==$2)' file
Test (notice that only the second line (which is conforming) does not get printed):
$ cat file
A,BC|12345|EAR
PQRST|123|TWOEYES
ssdf|fdas,sdfsf
ABC|3983|MAKE,
sf dl lfsdklf |kldsamfklmadkfmask |mfkmadskfmdslafmka
ABC|abs|EWE
sdf|123|123
$ awk -F\| '!(NF==3&&$1$3~/^[a-zA-Z]+$/&&$2+0==$2)' file
A,BC|12345|EAR
ssdf|fdas,sdfsf
ABC|3983|MAKE,
sf dl lfsdklf |kldsamfklmadkfmask |mfkmadskfmdslafmka
ABC|abs|EWE
sdf|123|12
You can adapt the above command to your specific needs, based on what you think is a valid input. For example, if you wanted to also restrict the length of each line to 50 characters, you could do
awk -F\| '!(NF==3 && $1$3 ~ /^[a-zA-Z]+$/ && $2+0==$2 && length($0)<50)' file

Related

How to split a text file content by a string?

Suppose I've got a text file that consists of two parts separated by delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest write this in bash ?
You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This would read the file twice, but handling text files in the shell, i.e. storing their content in variables should generally be limited to small files. Given that your file is small, this shouldn't be a problem.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables, and read the file twice.
A solution using sed:
foo=$(sed '/^---$/q;p' -n file.txt)
bar=$(sed '1,/^---$/b;p' -n file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;
Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If version of coreutils >= 8.22, --suppress-matched option can be used and sed processing is not required, like
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}".

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable includes all lines of your list.txt file. Including the newline character \n which is invisible in many UIs and programs, but it is of course visible when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string stored from the pos.txt files including newlines. You cannot pass not more than one integer value for the -c flag.
Your attempts can be fixed quite easily with removing calls to cat and using a temporary variable to hold the file content
while IFS= read -r lineNo; do
tail -c "$lineNo" ref.txt | head -c1
done < pos.txt
But then if your intentions is print the desired output in a new-line every time, head does not output that way. It just forms a string atga for your given input in a single line and not across multiple lines with one character at each line.
As Gordon mentions in one of the comments, for much more efficient FASTA files processing, you could just use one invocation of awk though (skipping multiple forks to head/tail). Your provided input does not involve any headers to skip which would be straightforward as
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
both yield:
a
g
t
a

How to find integer values and compare them then transfer the main files?

I have some output files (5000 files) of .log which are the results of QM computations. Inside each file there are two special lines indicate the number of electrons and orbitals, like this below as an example (with exact spaces as in output files):
Number of electrons = 9
Number of orbitals = 13
I thought about a script (bash or Fortran), as a solution to this problem, which grep these two lines (at same time) and get the corresponding integer values (9 and 13, for instance), compare them and finds the difference between two values, and finally, list them in a new text file with the corresponding filenames.
I would really appreciate any help given.
Am posting an attempt in GNU Awk, and have tested it in that only.
#!/bin/bash
for file in *.log
do
awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' "$file" | awk 'function abs(v) {return v < 0 ? -v : v} {print abs($1-$2)}' >> output_"$file"
done
The reason I split the AWK logic to two was to reduce the complexity in doing it in single huge command. The first part is for extracting the numbers from your log file in a columnar format and second for getting their absolute value.
I will break-down the AWK logic:-
-F'=[[:blank:]]*' is a mult0 character delimiter logic including = and one or more instances of [[:blank:]] whitespace characters.
'/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' searches for lines starting with Number of and prints it in a columnar fashion, i.e. as 9 13 from your sample file.
The second part is self-explanatory. I have written a function to get the absolute value from the two returned values and print it.
Each output is saved in a file named output_, for you to process it further.
Run the script from your command line as bash script.sh, where script.sh is the name containing the above lines.
Update:-
In case if you are interested in negative values too i.e. without the absolute function, change the awk statement to
awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' "$file" | awk '{print ($1-$2)}' >> output_"$file"
Bad way to do it (but it will work)-
while read file
do
first=$(awk -F= '/^Number/ {print $2}' "$file" | head -1)
second=$(awk -F= '/^Number/ {print $2}' "$file" | tail -1)
if [ "$first" -gt "$second" ]
then
echo $(("$first" - "$second"))
else
echo $(("$second" - "$first"))
fi > "$file"_answer ;
done < list_of_files
This method picks up the values (in the awk one liner and compares them.
It then subtracts them to give you one value which it saves in the file called "$file"_answer. i.e. the initial file name with '_answer' as a suffix to the name.
You may need to tweak this code to fit your purposes exactly.

Printing a sequence from a fasta file

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple, you have a line with the sequence name preceded by a '>' and then all the lines following until the next '>' are the sequence itself. For example:
>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG
The way I'm currently getting the sequence I need is to use grep with -A, so I'll do
grep -A 10 sequence_name filename.fa
and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.
It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?
Using the > as the record separator:
awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
Like this maybe:
awk '/>sequence1/{p++;print;next} /^>/{p=0} p' file
So, if the line starts with >sequence1, set a flag (p) to start printing, print this line and move to next. On subsequent lines, if the line starts with >, change p flag to stop printing. In general, print if the flag p is set.
Or, improving a little on your grep solution, use this to cut off the -A (after) context:
grep -A 999999 "sequence1" file | awk 'NR>1 && /^>/{exit} 1'
So, that prints up to 999999 lines after sequence1 and pipes them into awk. Awk then looks for a > at the start of any line after line 1, and exits if it finds one. Until then, the 1 causes awk to do its standard thing, which is print the current line.
Using sed only:
sed -n '/>sequence3/,/>/ p' | sed '${/>/d}'
$ perl -0076 -lane 'print join("\n",#F) if $F[0]=~/sequence2/' file
This question has excellent answers already. However, if you are dealing with FASTA records often, I would highly recommend Python's Biopython Module. It has many options and make life easier if you want to manipulate FASTA records. Here is how you can read and print the records:
from Bio import SeqIO
import textwrap
for seq_record in SeqIO.parse("input.fasta", "fasta"):
print(f'>{seq_record.id}\n{seq_record.seq}')
#If you want to wrap the record into multiline FASTA format
#You can use textwrap module
for seq_record in SeqIO.parse("input.fasta", "fasta"):
dna_sequence = str(seq_record.seq)
wrapped_dna_sequence = textwrap.fill(dna_sequence, width=8)
print(f'>{seq_record.id}\n{wrapped_dna_sequence}')

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here is the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough can get away with simple comma-splitting fine, shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offer robust CSV support instead of relying on using comma as a field separator. I know that scripting languages such as Python has such support. However, I am comfortable with the Tcl scripting language so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and store the line in the variable $line, then it split each line into a list of columns (variable $columns). Next, it picks out the specified column and assigned it to the $columnValue variable. If there is a match, print out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that support quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
reader = csv.reader(f)
for row in reader:
for col in row:
print col+'\t',
print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"`
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which being csv processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}'
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print $0 }, aka prints the full line as requested.

Resources