I have a file with some IDs listed like this:
id1
id2
id3
etc
I want to use those IDs to extract data from a set of files (each ID occurs in every file) and save the output for each ID to a separate file; the IDs are protein family names, and I want to get every protein from a specific family. Then, once I have the name of each protein, I want to use that name to fetch the proteins themselves (in .fasta format), so that they stay grouped by family.
So I've tried to do it like this (I knew that it would dump all the IDs into one file):
#! /bin/bash
for file in *out
do grep -n -E 'id1|id2|id3' /directory/$file >> output; done
I would appreciate any help and I will gladly specify if not everything is clear to you.
EDIT: I will try to clarify; sorry for the inconvenience.
There's a file called "pfamacc" with the following content:
PF12312
PF43555
PF34923
and so on. Those are the IDs I need in order to access other files, whose names look like "something_something.faa.out" and whose structure is:
<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
I need those accession numbers so I can then get protein sequences from files which look like this:
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
Assuming there is a file ids_file.txt in the same directory with the following content:
id1
id2
id3
id4
And in the same directory there is also a file called id1 with the following content:
Bla bla bla
id1
and id2
is
here id4
Then this script could help:
#!/bin/sh
IDS=$(cat ids_file.txt)
# Join the IDs into one alternation pattern: id1|id2|id3|id4
IDS_IN_ONE=$(tr '\n' '|' < ids_file.txt | sed 's/||*$//')
echo "$IDS_IN_ONE"
for file in $IDS; do
    grep -n -E "$IDS_IN_ONE" "./$file" >> output
done
The file output then has the following content:
2:id1
3:and id2
5:here id4
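If each ID is instead meant to end up in its own output file (as the original post asks), one grep per ID does it. A minimal sketch, assuming the pfamacc ID list and the *.faa.out files from the question's edit; the .hits naming is my own choice:
#!/bin/sh
# One output file per Pfam ID: collect the matching lines
# from all *.faa.out files into <ID>.hits.
while read -r id; do
    grep -h "$id" ./*.faa.out > "$id.hits"
done < pfamacc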
As I read it: a list needs to be cross-referenced to get a second list, which then needs to be used to gather the FASTA sequences.
Starting with the following 3 files...
starting_values.txt
PF12312
PF43555
PF34923
cross_reference.txt
<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
find_from_file.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF
for i in $(cat starting_values.txt); do awk -v var="$i" 'var==$4 {print $1}' cross_reference.txt; done > needed_accessions.txt
If the FASTA is multiline, change it to single-line first (see https://www.biostars.org/p/9262/):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' find_from_file.fasta > find_from_file.temp
for i in $(cat needed_accessions.txt); do grep -A 1 "$i" find_from_file.temp; done > found_sequences.fasta
Final Output...
found_sequences.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINFSADHHASDASDCJHWINF
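The lookup stage could also be collapsed into a single awk pass instead of one awk invocation per ID. A sketch, reusing the file names from the example above:
# Load the wanted Pfam IDs into an array, then print column 1 of
# every cross-reference line whose 4th column is one of them.
awk 'NR==FNR {want[$1]; next} $4 in want {print $1}' \
    starting_values.txt cross_reference.txt > needed_accessions.txt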
I'm working on a long Bash script. I want to read cells from a CSV file into Bash variables. I can parse lines and the first column, but not any other column. Here's my code so far:
cat myfile.csv|while read line
do
read -d, col1 col2 < <(echo $line)
echo "I got:$col1|$col2"
done
It's only printing the first column. As an additional test, I tried the following:
read -d, x y < <(echo a,b,)
And $y is empty. So I tried:
read x y < <(echo a b)
And $y is b. Why?
You need to use IFS instead of -d:
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done < myfile.csv
To skip a given number of header lines:
skip_headers=3
while IFS=, read -r col1 col2
do
if ((skip_headers))
then
((skip_headers--))
else
echo "I got:$col1|$col2"
fi
done < myfile.csv
Note that for general-purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are csvtool and csvkit.
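For instance, with csvkit installed, the first two columns can be extracted safely, embedded commas and all. A quick sketch:
# csvcut (from csvkit) understands CSV quoting rules:
csvcut -c 1,2 myfile.csv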
Coming late to this question: since Bash offers new features over time, since this question is specifically about Bash, and since none of the already posted answers show this powerful and standards-compliant way of doing precisely this:
Parsing CSV files under bash, using a loadable module
Conforming to RFC 4180, a string like this sample CSV row:
12,22.45,"Hello, ""man"".","A, b.",42
should be split as
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Bash loadable modules compiled from C
Under bash, you can create, edit, and use loadable modules compiled from C. Once loaded, they work like any other builtin! (You may find more information in the source tree. ;)
The current source tree (Oct 15 2021, bash V5.1-rc3) contains a bunch of samples:
accept listen for and accept a remote network connection on a given port
asort Sort arrays in-place
basename Return non-directory portion of pathname.
cat cat(1) replacement with no options - the way cat was intended.
csv process one line of csv data and populate an indexed array.
dirname Return directory portion of pathname.
fdflags Change the flag associated with one of bash's open file descriptors.
finfo Print file info.
head Copy first part of files.
hello Obligatory "Hello World" / sample loadable.
...
tee Duplicate standard input.
template Example template for loadable builtin.
truefalse True and false builtins.
tty Return terminal name.
uname Print system information.
unlink Remove a directory entry.
whoami Print out username of current user.
There is a full working CSV parser ready to use in the examples/loadables directory: csv.c!
On Debian GNU/Linux based systems, you may have to install the bash-builtins package:
apt install bash-builtins
Using loadable bash-builtins:
Then:
enable -f /usr/lib/bash/csv csv
From there, you could use csv as a bash builtin.
With my sample: 12,22.45,"Hello, ""man"".","A, b.",42
csv -a myArray '12,22.45,"Hello, ""man"".","A, b.",42'
printf "%s\n" "${myArray[#]}" | cat -n
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Then, in a loop, processing a file:
while IFS= read -r line;do
csv -a aVar "$line"
printf "First two columns are: [ '%s' - '%s' ]\n" "${aVar[0]}" "${aVar[1]}"
done <myfile.csv
This way is clearly the quickest and most robust, compared with any other combination of Bash builtins or forking out to any binary.
Unfortunately, depending on your system implementation, if your version of bash was compiled without loadable-builtin support, this may not work...
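A defensive script can test for this up front. A small sketch, assuming the Debian path used above:
# Bail out early if the csv loadable builtin cannot be enabled.
if ! enable -f /usr/lib/bash/csv csv 2>/dev/null; then
    echo "csv loadable builtin not available" >&2
    exit 1
fi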
Complete sample with multiline CSV fields.
Conforming to RFC 4180, a string like this single CSV row:
12,22.45,"Hello ""man"",
This is a good day, today!","A, b.",42
should be split as
1 12
2 22.45
3 Hello "man",
This is a good day, today!
4 A, b.
5 42
Full sample script for parsing CSV containing multiline fields
Here is a small sample file with 1 header line, 4 columns and 3 rows. Because two fields contain newlines, the file is 6 lines long.
Id,Name,Desc,Value
1234,Cpt1023,"Energy counter",34213
2343,Sns2123,"Temperatur sensor
to trigg for alarm",48.4
42,Eye1412,"Solar sensor ""Day /
Night""",12199.21
And a small script able to parse this file correctly:
#!/bin/bash
enable -f /usr/lib/bash/csv csv

file="sample.csv"
exec {FD}<"$file"

# Read the header line, then build a printf format and the
# expected column count from it.
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\n' "${headline[@]}"
numcols=${#headline[@]}

while read -ru $FD line; do
    # Keep appending physical lines until the row parses into
    # all columns (fields may contain embedded newlines).
    while csv -a row "$line"; (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\n" "${row[@]}"
done
This may render: (I've used printf "%q" to represent non-printable characters like newlines as $'\n')
Id      : "1234"
Name    : "Cpt1023"
Desc    : "Energy\ counter"
Value   : "34213"
Id      : "2343"
Name    : "Sns2123"
Desc    : "$'Temperatur sensor\nto trigg for alarm'"
Value   : "48.4"
Id      : "42"
Name    : "Eye1412"
Desc    : "$'Solar sensor "Day /\nNight"'"
Value   : "12199.21"
You can find a full working sample here: csvsample.sh.txt or csvsample.sh.
Note:
In this sample, I use the header line to determine the row width (number of columns). If your header line could hold newlines, or if your CSV uses more than one header line, you will have to pass the number of columns as an argument to your script (and the number of header lines).
Warning:
Of course, parsing CSV this way is not perfect! It works for many simple CSV files, but do care about encoding and security! For example, this module won't be able to handle binary fields!
Read the csv.c source code comments and RFC 4180 carefully!
From the man page:
-d delim
The first character of delim is used to terminate the input line,
rather than newline.
You are using -d, which will terminate the input line on the comma. It will not read the rest of the line. That's why $y is empty.
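You can see this by consuming one comma-terminated token per read call. A small demonstration:
# Each 'read -d,' stops at the first comma, so two calls are
# needed to collect two fields:
{ read -d, x; read -d, y; } < <(echo a,b,)
echo "$x $y"    # prints: a b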
We can parse CSV files with quoted strings, delimited by say |, with the following code:
while read -r line
do
field1=$(echo "$line" | awk -F'|' '{printf "%s", $1}' | tr -d '"')
field2=$(echo "$line" | awk -F'|' '{printf "%s", $2}' | tr -d '"')
echo "$field1 $field2"
done < "$csvFile"
awk parses the string fields into variables, and tr removes the quotes. This is slightly slower, as awk is executed for each field.
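If the quoted fields never contain the delimiter itself, the same result is possible without spawning awk at all. A sketch using IFS splitting and parameter expansion:
while IFS='|' read -r field1 field2 _; do
    field1=${field1//\"/}    # strip the double quotes
    field2=${field2//\"/}
    echo "$field1 $field2"
done < "$csvFile"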
In addition to the answer from @Dennis Williamson, it may be helpful to skip the first line when it contains the header of the CSV:
{
read
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done
} < myfile.csv
If you want to read a CSV file while skipping its header line, this is a solution:
i=1
while IFS=, read -ra line
do
    # Skip the first (header) line
    test $i -eq 1 && ((i=i+1)) && continue
    for col_val in "${line[@]}"
    do
        echo -n "$col_val|"
    done
    echo
done < "$csvFile"
I would like to extract values from the second column in my csv file and store the extracted values in a new column.
sample of my dataset:
page_name post_id page_id
A 86680728811_272953252761568 86680728811
A 86680728811_273859942672742 86680728811
B 86680728033_281125741936891 86680728033
B 86680728033_10150500662053812 86680728033
I would like to extract the numbers that come after the underscore and store them in a new column. Sample output:
page_name post_id page_id
A 272953252761568 86680728811
A 273859942672742 86680728811
B 281125741936891 86680728033
B 10150500662053812 86680728033
I tried using this code:
cat FB_Dataset.csv | sed -Ee 's/(.*)post_id/\1post_id/' -e 's/,[_ ]/,/' -e 's/_/,/'
but I don't get the desired output.
Any help is appreciated. Thank you.
sed 's/[0-9][0-9]*_//' < a.csv
where a.csv is the file with your original data
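If the change should be confined to the second column (so a stray match elsewhere in the line is left alone), an awk variant works too. A sketch, assuming whitespace-separated columns as in the sample:
# Strip the leading "<digits>_" prefix from column 2 only.
awk '{sub(/^[0-9]+_/, "", $2); print}' a.csv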
I have a .csv file I am working with, and I need to output another CSV file that contains a de-duplicated list of columns 2 and 6 from the first CSV, with some caveats.
This is a bit difficult to explain in words but here is an example of what my input is:
"customer_name","cid”,”boolean_status”,”type”,”number”
“conotoso, inc.”,”123456”,”TRUE”,”Inline”,”210”
"conotoso, inc.","123456”,”FALSE”,”Inline”,”411"
“afakename”,”654321”,”TRUE","Inline”,”253”
“bfakename”,”909090”,”FALSE”,”Inline”,”321”
“cfakename”,”121212”,”TRUE","Inline","145”
What I need is to create a new .csv file containing only the "customer_name" column and the "boolean_status" column.
I also need there to be only one line per "customer_name", showing "TRUE" if ANY row for that customer_name has a "TRUE" value in the boolean column.
The output from the above input should be this:
"customer_name",”boolean_status”
“conotoso, inc.”,”TRUE”
“afakename”,”TRUE"
“cfakename”,”TRUE"
So far I tried
awk -F "\"*\",\"*\"" '{print $1","$6}' data1.csv >data1out.csv
to give me the output file, but then I attempted cat data1out.csv | grep 'TRUE' with no luck.
Can someone help me out with how to manipulate this properly?
I'm also running into issues with awk printing leading commas.
All I really need at the end is the answer to: how many unique 'customer_names' have at least one 'TRUE' in the boolean column?
You will get your de-duplicated file by using:
sort -u -t, -k2,2 -k6,6 filename > sortedfile
After this, you can write a script to extract the required columns:
while read -r line
do
    # Keep only rows that contain TRUE, then cut out fields 1-3
    if echo "$line" | grep -q "TRUE"
    then
        a=$(echo "$line" | cut -d',' -f1-3)
        echo "$a" >> outputfile
    fi
done < sortedfile
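Since the end goal is just a count of unique customer names with at least one TRUE, a single pipeline can produce the number directly. A sketch, assuming the quotes in the file have been normalized to plain " characters:
# Split on "," so the embedded comma in a quoted name survives,
# keep rows whose 3rd field is TRUE, then count distinct names.
awk -F'","' '$3 == "TRUE" {print $1}' data1.csv | sort -u | wc -l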
The dataset is one big file with three columns: An ID of a section, something irrelevant and a line of text. An example could look like the following:
A01 001 This is a simple test.
A01 002 Just for exemplary purpose.
A01 003
A02 001 This is another text
I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything in that line after the second column.
The example above should result in two files, one with the name A01 and content:
This is a simple test.
Just for exemplary purpose.
and another one A02 with content:
This is another text
My questions are:
Is AWK the appropriate program for this task? Or are there perhaps more convenient ways of doing this?
How would this task be done?
awk is perfect for this kind of task. If you do not mind having some leading spaces, you can use:
awk '{f=$1; $1=$2=""; print > f}' file
This will empty the first and second fields and then print the whole line into the file f, whose name was previously stored from the first field.
And in case those spaces are bothersome, you can delete them with sub():
awk '{f=$1; $1=$2=""; sub(/^ +/, ""); print > f}' file
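One caveat: some awk implementations limit the number of simultaneously open output files, so with many distinct IDs it can help to close each file once its lines are done. A sketch, assuming lines with the same ID are grouped together as in the example:
awk '{
    if ($1 != prev) { if (prev != "") close(prev); prev = $1 }
    f = $1; $1 = $2 = ""; sub(/^ +/, "")
    print >> f    # ">>" so reopening after close() appends
}' file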
Bash will work too. It is probably slower than awk, if that's a concern:
while read -r id num line; do
    [[ $line ]] && echo "$line" >> "$id"
done < file