Using the first field in AWK as file name - bash

The dataset is one big file with three columns: An ID of a section, something irrelevant and a line of text. An example could look like the following:
A01 001 This is a simple test.
A01 002 Just for exemplary purpose.
A01 003
A02 001 This is another text
I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything in that line after the second column.
The example above should result in two files, one named A01 with the content:
This is a simple test.
Just for exemplary purpose.
and another one A02 with content:
This is another text
My questions are:
Is AWK the appropriate program for this task? Or are there more convenient ways of doing this?
How would this task be done?

awk is perfect for this kind of task. If you do not mind some leading spaces, you can use:
awk '{f=$1; $1=$2=""; print > f}' file
This empties the first and second fields and then prints the rest of the line into the file f, whose name was saved from the first field beforehand.
If the leading spaces bother you, delete them with sub(/^ +/, ""):
awk '{f=$1; $1=$2=""; sub(/^ +/, ""); print > f}' file
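For comparison, the same split can be sketched in Python. This is only a minimal sketch and not part of the answer above: the names group_by_first_field and write_groups are made up for illustration, and it assumes whitespace-separated columns. Unlike the awk one-liner, it skips lines that have no text after the second column (such as A01 003).

```python
import os


def group_by_first_field(lines):
    """Map each first-column ID to the text found after the second column."""
    groups = {}
    for line in lines:
        # maxsplit=2 leaves the rest of the line (with its inner spaces) intact
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # assumption: lines without any text are skipped
        name, _, text = parts
        groups.setdefault(name, []).append(text.rstrip("\n"))
    return groups


def write_groups(groups, out_dir="."):
    """Write one file per ID, named after the ID (hypothetical helper)."""
    for name, texts in groups.items():
        with open(os.path.join(out_dir, name), "w") as out:
            out.write("".join(text + "\n" for text in texts))
```

Calling write_groups(group_by_first_field(open("file"))) would then create the files A01 and A02 in the current directory.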

Bash will work too. Probably slower than awk if that's a concern
while read -r id num line; do
    [[ $line ]] && echo "$line" >> "$id"
done < file

Related

Script to find and print common IDs between two files working but it's not optimal

I have code that works and does what I want, but it is extremely slow: it takes 1 or 2 days depending on the size of the input files. I know there are alternatives that can be almost instant, and that my code is slow because it runs grep once per ID. I wrote another script in Python that works as intended and is almost instant, but it does not print everything I need.
What I need is the common IDs between two files, and I want the whole line printed. My Python script does not do that, while the bash one does but is far too slow.
This is my code in bash:
awk '{print $2}' file1.bim > sites.txt
for snp in `cat sites.txt`
do
grep -w $snp file2.bim >> file1_2_shared.txt
done
This is my code in python:
#!/usr/bin/env python3
import sys
argv1=sys.argv[1] #argv1 is the first .bim file
argv2=sys.argv[2] #argv2 is the second .bim file
argv3=sys.argv[3] #argv3 is the output .txt file name
def printcommonSNPs(inputbim1,inputbim2,outputtxt):
    bim1 = open(inputbim1, "r")
    bim2 = open(inputbim2, "r")
    output = open(outputtxt,"w")
    snps1 = []
    line1 = bim1.readline()
    line1 = line1.split()
    snps1.append(line1[1])
    for line1 in bim1:
        line1 = line1.split()
        snps1.append(line1[1])
    bim1.close()
    snps2 = []
    line2 = bim2.readline()
    line2 = line2.split()
    snps2.append(line2[1])
    for line2 in bim2:
        line2 = line2.split()
        snps2.append(line2[1])
    bim2.close()
    common=[]
    common = list(set(snps1).intersection(snps2))
    for SNP in common:
        print(SNP, file=output)
printcommonSNPs(argv1,argv2,argv3)
My .bim input files are made this way:
1 1:891021 0 891021 G A
1 1:903426 0 903426 T C
1 1:949654 0 949654 A G
I would appreciate suggestions on how to make this quick in bash (I suspect an awk script could do it, but I tried awk 'FNR==NR {map[$2]=$2; next} {print $2, map[$2]}' file1.bim file2.bim > Roma_sets_shared_sites.txt and it simply prints every line, so it does not do what I need), or on how to print the whole line in python3.
It looks as if the problem can be solved like this:
grep -w -f <(awk '{ print $2 }' file1.bim) file2.bim
The identifiers (field $2) from file1.bim are to be treated as patterns to grep for in file2.bim. GNU grep takes a -f file argument which gives a list of patterns, one per line. We use <() process substitution in place of a file. It looks as if the -w option individually applies to the -f patterns.
This won't produce the same output as your shell script if there are duplicate IDs in file1.bim: a pattern that occurs more than once counts as a single instance. The order is also different. Grepping the entire second file for one identifier, then the next, and so on produces the matches in a different order. If that order has to be reproduced, it will take extra work.
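On the Python side of the question, printing whole lines mostly comes down to keeping the line instead of only the ID. The following is a rough sketch under stated assumptions, not the asker's script: the function name lines_with_shared_ids is invented here, the ID is taken to be the second whitespace-separated column, and the comparison is an exact match on that column, which is stricter than grep -w.

```python
def lines_with_shared_ids(bim1_lines, bim2_lines):
    """Return the lines of the second file whose ID (column 2) also occurs in the first."""
    ids = set()
    for line in bim1_lines:
        fields = line.split()
        if len(fields) > 1:
            ids.add(fields[1])

    shared = []
    for line in bim2_lines:
        fields = line.split()
        # Set membership is a constant-time lookup, so the second file is scanned once
        if len(fields) > 1 and fields[1] in ids:
            shared.append(line.rstrip("\n"))
    return shared
```

The result keeps the order of the second file, like the grep -f command, and duplicate IDs in the first file have no effect.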

sed/awk between two patterns in a file: pattern 1 set by a variable from lines of a second file; pattern 2 designated by a specified character

I have two files. One file contains a pattern that I want to match in a second file. I want to use that pattern to print between that pattern (included) up to a specified character (not included) and then concatenate into a single output file.
For instance,
File_1:
a
c
d
and File_2:
>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH
I have been using variations of this loop:
while read $id;
do
sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1
hoping to obtain something like the following output:
>a
MEEL
>c
MEHL
>d
MLWL
But I have had no such luck. I have played around with grep/fgrep, awk and sed, and between the three I cannot seem to get the right output (or any output). Would someone kindly point me in the right direction?
Try:
$ awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=$2 in a} f' file1 file2
>a
MEEL
>c
MEHL
>d
MLWL
How it works
-F'>'
This sets the field separator to >.
FNR==NR{a[$1]; next}
While reading the first file, this creates a key in array a for every line of that file.
NF==2{f=$2 in a}
For every line in file 2 that has two fields, this sets variable f to true if the second field is a key in a or false if it is not.
f
If f is true, print the line.
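The flag logic described above can also be written out in Python, which may make the control flow easier to follow. This is an illustrative sketch only: select_records is a hypothetical name, and it treats any line starting with > as a header, which is slightly looser than the NF==2 test in the awk command.

```python
def select_records(ids, fasta_lines):
    """Keep each header line whose name is in ids, plus the lines that follow it."""
    wanted = {i.strip() for i in ids if i.strip()}
    keep = False  # plays the role of the awk variable f
    selected = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The flag only changes on header lines; body lines inherit it
            keep = line[1:] in wanted
        if keep:
            selected.append(line)
    return selected
```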
A plain (GNU) sed solution. Each file is read only once. It is assumed that the lines of File_1 contain no characters that need escaping in a sed expression.
pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2
Explanation:
The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid repeatedly reading the entire File_2 for each line of File_1. It just "slurps" File_1 and replaces the newline characters with | characters, so the sample File_1 becomes the string a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d in this example) matches (this is a GNU sed extension).
The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:
begin:
  read next line (from File_2) or quit on end-of-file
label_a:
  if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
    print the line
    read next line (from File_2) or quit on end-of-file
    if line begins with `>` goto label_a else goto label_b
  else goto begin
Let me try to explain why your approach does not work well:
You need to say while read id instead of while read $id.
The sed command />$id/,/>/{//!p;} will exclude the lines which start with >.
Then you might want to say something like:
while read id; do
sed -n "/^>$id/{N;p}" File_2
done < File_1
Output:
>a
MEEL
>c
MEHL
>d
MLWL
But the code above is inefficient because it reads File_2 as many times as the count of the id's in File_1.
Please try the elegant solution by John1024 instead.
If ed is available, and since the shell is already involved:
#!/usr/bin/env bash
mapfile -t to_match < file1.txt
ed -s file2.txt <<-EOF
g/\(^>[${to_match[*]}]\)/;/^>/-1p
q
EOF
This runs ed only once, rather than once for every pattern in file1. For example, if file1 contains a to z, ed will not run 26 times.
Requires bash4+ because of mapfile.
How it works
mapfile -t to_match < file1.txt
Saves the entry/value from file1 in an array named to_match
ed -s file2.txt points ed at file2; the -s flag suppresses the byte count that ed otherwise prints when it loads a file, the same number you get from wc -c file
<<-EOF A here document, shell syntax.
g/\(^>[${to_match[*]}]\)/;/^>/-1p
g means search the whole file aka global.
( ) capture group, it needs escaping because ed only supports BRE, basic regular expression.
^> If line starts with a > the ^ is an anchor which means the start.
[ ] is a bracket expression match whatever is inside of it, in this case the value of the array "${to_match[*]}"
; Include the next address/pattern
/^>/ Match a leading >
-1 go back one line after the pattern match.
p print whatever was matched by the pattern.
q quit ed

Grep list (file) from another file

I'm new to bash and trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
The code I tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep -e "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
    if($0 ~ a[j]) {
        # Parenthesize the output file name so the concatenation is unambiguous
        print FILENAME ":" FNR ":" $0 >> ("output." a[j])
        next }
}' File1.txt base.csv
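To illustrate the grouping the question asks for (each pattern followed by the lines that contain it), here is a small Python sketch. It is an assumption-laden illustration rather than a translation of the awk script: group_matches and format_groups are invented names, patterns are matched as plain substrings instead of regular expressions, and a line that contains several patterns is listed under each of them.

```python
def group_matches(patterns, lines):
    """Map each pattern to the lines that contain it as a plain substring."""
    groups = {pat: [] for pat in patterns}
    for line in lines:
        for pat in patterns:
            if pat in line:
                groups[pat].append(line)
    return groups


def format_groups(groups):
    """Render each pattern followed by its matching lines, skipping empty groups."""
    out = []
    for pat, matched in groups.items():
        if matched:
            out.append(pat)
            out.extend(matched)
    return "\n".join(out)
```

Both inputs are read only once, so the cost grows with the number of lines times the number of patterns rather than with repeated passes over the big file.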
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
    while read -r q # read the file line by line
    do
        echo "$q" >>"output.${i}"
        # -e takes the pattern itself; -f would treat $q as a file name
        grep -e "$q" base.csv >>"output.${i}"
        echo >>"output.${i}" # blank line between pattern blocks
    done < "${i}"
done
Here is one that splits the words of file2 (comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores the record names (line 1 etc.) in it, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it gets the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
Since the basic goal was to cycle through the pattern list files, search for the patterns in a huge file and write out specific columns from the lines found, I simply wrote:
for i in *txt # cycle through files w/ patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1" # greps all patterns from current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cuts columns of interest and writes them out to another file
done
I'm aware that this code could be improved with some fancy pipeline features, but it works perfectly as is; I hope it'll help somebody in a similar situation. You can easily add some echo commands to write out the pattern list names, as I initially requested.

How to extract one column of a csv file

If I have a csv file, is there a quick bash way to print out the contents of only any single column? It is safe to assume that each row has the same number of columns, but each column's content would have different length.
You could use awk for this. Change '$2' to the nth column you want.
awk -F "\"*,\"*" '{print $2}' textfile.csv
Yes. cat mycsv.csv | cut -d ',' -f3 will print the 3rd column.
The simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
Landed here looking to extract from a tab separated file. Thought I would add.
cat textfile.tsv | cut -f2 -s
Here -f2 extracts the second column (cut numbers fields starting from 1), and -s skips lines that contain no delimiter.
Here is a csv file example with 2 columns
myTooth.csv
Date,Tooth
2017-01-25,wisdom
2017-02-19,canine
2017-02-24,canine
2017-02-28,wisdom
To get the first column, use:
cut -d, -f1 myTooth.csv
f stands for Field and d stands for delimiter
Running the above command will produce the following output.
Output
Date
2017-01-25
2017-02-19
2017-02-24
2017-02-28
To get the 2nd column only:
cut -d, -f2 myTooth.csv
And here is the output
Output
Tooth
wisdom
canine
canine
wisdom
Another use case:
Your csv input file contains 10 columns and you want columns 2 through 5 and column 8, using comma as the separator.
cut uses -f (meaning "fields") to specify columns and -d (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
cut -f 2-5,8 -d , myvalues.csv
cut is a command-line utility; here is its synopsis:
SYNOPSIS
cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-d delim] [-s] [file ...]
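Since cut splits on every delimiter, including those inside quoted fields, a quote-aware equivalent may be useful. The sketch below uses Python's standard csv module; the function name cut_columns and its parameters are assumptions made for this example, and columns are numbered from 1 to mirror cut -f.

```python
import csv
import io


def cut_columns(csv_text, columns, delimiter=","):
    """Return the given 1-based columns of every row, like cut -f, but quote-aware."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    rows = []
    for row in reader:
        # Columns beyond the end of a short row are silently skipped
        rows.append([row[c - 1] for c in columns if c <= len(row)])
    return rows
```

For the ten-column case above, cut_columns(text, [2, 3, 4, 5, 8]) corresponds to cut -f 2-5,8 -d , on the same data.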
I think the easiest is using csvkit:
Gets the 2nd column:
csvcut -c 2 file.csv
However, there's also csvtool, and probably a number of other CSV command-line tools out there:
sudo apt-get install csvtool (for Debian-based systems)
This would return the column whose first row (the header) is 'ID':
csvtool namedcol ID csv_file.csv
This would return the fourth column:
csvtool col 4 csv_file.csv
If you want to drop the header row:
csvtool col 4 csv_file.csv | sed '1d'
First we'll create a basic CSV
[dumb@one pts]$ cat > file
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
Then we get the 1st column
[dumb@one pts]$ awk -F , '{print $1}' file
a
1
a
1
Many answers to this question are great, and some have even looked into the corner cases.
I would like to add a simple answer for daily use, where you mostly do not run into those corner cases (such as escaped commas or commas inside quotes).
FS (Field Separator) is the variable whose value defaults to a space, so by default awk splits every line on whitespace.
Using BEGIN (executed before any input is read) we can set this variable to anything we want:
awk 'BEGIN {FS = ","}; {print $3}'
The above code will print the 3rd column in a csv file.
The other answers work well, but since you asked for a solution using just the bash shell, you can do this:
AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
And then you can pull out columns (the first in this example) like so:
AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1
So there's a couple of things going on here:
while IFS=, - this is saying to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So saying IFS=, is like saying "a,b" is the same as "a b" would be if the IFS=" " (which is what it is by default.)
read -a csv_line; - this reads each line, one at a time, and splits it into an array named "csv_line", which is then available in the "do" section of our while loop
do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we're saying echo the 0th element of the array "csv_line". This action is repeated on every line of the file. The < file part is just telling the while loop where to read from. NOTE: remember, in bash, arrays are 0 indexed, so the first column is the 0th element.
So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.
You could use GNU Awk, see this article of the user guide.
As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this can not handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
if (substr($c, 1, 1) == "\"") {
$c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
gsub("\"\"", "\"", $c) # Normalize double quotes
}
print $c
}
' c=3 < <(dos2unix <textfile.csv)
Note the use of dos2unix to convert possible DOS style line breaks (CRLF i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as line break, see Wikipedia.
If the input may contain multiline fields, you can use the following script. Note the use of special string for separating records in output (since the default separator newline could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
RS="\0" # Read the whole input file as one record;
# assume there is no null character in input.
FS="" # Suppose this setting eases internal splitting work.
ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
field=0;
for (i=1; i<=nof; i++){
field++
if (field==c) {
if (substr(a[i], 1, 1) == "\"") {
a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within
# the two quotes.
gsub(/""/, "\"", a[i]) # Normalize double quotes.
}
print a[i]
}
if (seps[i]!=",") field=0
}
}
' c=3 < <(dos2unix <textfile.csv)
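For readers who are not tied to gawk, the same cases (doubled quotes, CRLF line breaks, multiline fields) are handled by Python's standard csv module. The following is a hedged sketch, not a port of the scripts above: column_values is a made-up name, c is the 1-based column number as in the gawk examples, and the input is assumed to be already decoded text.

```python
import csv
import io


def column_values(csv_text, c):
    """Return column c (1-based) of every record, allowing quoted multiline fields."""
    # newline="" leaves line endings untouched so the csv module can deal with
    # CRLF and with newlines embedded in quoted fields itself
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [row[c - 1] for row in reader if len(row) >= c]
```

When reading from a file, the csv documentation likewise recommends opening it with newline="" for the same reason.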
There is another approach to the problem. csvquote can output contents of a CSV file modified so that special characters within field are transformed so that usual Unix text processing tools can be used to select certain column. For example the following code outputs the third column:
csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u
csvquote can be used to process arbitrarily large files.
I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do:
echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby
I wonder why none of the answers so far have mentioned csvkit.
csvkit is a suite of command-line tools for converting to and working with CSV
csvkit documentation
I use it exclusively for csv data management and so far I have not found a problem that I could not solve using csvkit.
To extract one or more columns from a csv file you can use the csvcut utility that is part of the toolbox. To extract the second column use this command:
csvcut -c 2 filename_in.csv > filename_out.csv
csvcut reference page
If the strings in the csv are quoted, specify the quote character with the -q option:
csvcut -q '"' -c 2 filename_in.csv > filename_out.csv
Install with pip install csvkit or sudo apt install csvkit.
Simple solution using awk. Replace "colNum" with the number of the column you need to print.
cat fileName.csv | awk -F ";" '{ print $colNum }'
csvtool col 2 file.csv
where 2 is the column you are interested in
you can also do
csvtool col 1,2 file.csv
to do multiple columns
You can't do it without a full CSV parser.
If you know your data will not be quoted, then any solution that splits on , will work well (I tend to reach for cut -d, -f1 | sed 1d), as will any of the CSV manipulation tools.
If you want to produce another CSV file, then xsv, csvkit, csvtool, or other CSV manipulation tools are appropriate.
If you want to extract the contents of one single column of a CSV file, unquoting them so that they can be processed by subsequent commands, this Python 1-liner does the trick for CSV files with headers:
python -c 'import csv,sys'$'\n''for row in csv.DictReader(sys.stdin): print(row["message"])'
The "message" inside of the print function selects the column.
If the CSV file doesn't have headers:
python -c 'import csv,sys'$'\n''for row in csv.reader(sys.stdin): print(row[1])'
Python's CSV library supports all kinds of CSV dialects, so if your CSV file uses different conventions, it's possible to support them with relatively little change to the code.
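As an illustration of how little has to change for a different dialect, here is a sketch that selects a column by header name from semicolon-separated input. The function name named_column is hypothetical, and the delimiter is simply forwarded to csv.DictReader; other dialect options such as quotechar can be passed the same way.

```python
import csv
import io


def named_column(csv_text, name, delimiter=","):
    """Return the values of the column whose header is name."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    return [row[name] for row in reader]
```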
Been using this code for a while, it is not "quick" unless you count "cutting and pasting from stackoverflow".
It uses the ${var#pattern} and ${var%%pattern} operators in a loop instead of IFS. It calls 'err' and 'die', and supports only comma, dash, and pipe as SEP chars (that's all I needed).
err() { echo "${0##*/}: Error:" "$@" >&2; }
die() { err "$@"; exit 1; }
# Return Nth field in a csv string, fields numbered starting with 1
csv_fldN() { fldN , "$1" "$2"; }
# Return Nth field in string of fields separated
# by SEP, fields numbered starting with 1
fldN() {
    local me="fldN"
    local sep="$1"
    local fldnum="$2"
    local vals="$3"
    case "$sep" in
        -|,|\|) ;;
        *) die "$me: arg1 sep: unsupported separator '$sep'" ;;
    esac
    case "$fldnum" in
        [0-9]*) [ "$fldnum" -gt 0 ] || { err "$me: arg2 fldnum=$fldnum must be a number greater than 0."; return 1; } ;;
        *) { err "$me: arg2 fldnum=$fldnum must be a number"; return 1;} ;;
    esac
    [ -z "$vals" ] && err "$me: missing arg3 vals: list of '$sep' separated values" && return 1
    fldnum=$(($fldnum - 1))
    while [ $fldnum -gt 0 ] ; do
        vals="${vals#*$sep}"
        fldnum=$(($fldnum - 1))
    done
    echo "${vals%%$sep*}"
}
Example:
$ CSVLINE="example,fields with whitespace,field3"
$ for fno in $(seq 3); do echo field$fno: $(csv_fldN $fno "$CSVLINE"); done
field1: example
field2: fields with whitespace
field3: field3
You can also use a while loop:
IFS=,
while read name val; do
echo "............................"
echo Name: "$name"
done<itemlst.csv

comparing csv files

I want to write a shell script to compare two .csv files. The first one contains filename,path; the second contains filename,path,target. I want to compare the two .csv files and output the target name wherever a file from the first .csv exists in the second .csv file.
Ex.
a.csv
build.xml,/home/build/NUOP/project1
eesX.java,/home/build/adm/acl
b.csv
build.xml,/home/build/NUOP/project1,M1
eesX.java,/home/build/adm/acl,M2
ddexse3.htm,/home/class/adm/33eFg
I want the output to be something like this.
M1 and M2
Please help
Thanks,
If you don't necessarily need a shell script, you can easily do it in Python like this:
import csv
seen = set()
for row in csv.reader(open('a.csv')):
    seen.add(tuple(row))
for row in csv.reader(open('b.csv')):
    if tuple(row[:2]) in seen:
        print(row[2])
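The same idea can be wrapped in a function that returns the targets, which makes it easy to produce the "M1 and M2" form the question asks for. This is a sketch under assumptions: matching_targets is an invented name, the inputs are passed as text rather than file names, and rows of the second file that have no third column are ignored.

```python
import csv
import io


def matching_targets(a_text, b_text):
    """Return the target (column 3) of each b row whose first two columns appear in a."""
    seen = {tuple(row) for row in csv.reader(io.StringIO(a_text)) if row}
    return [
        row[2]
        for row in csv.reader(io.StringIO(b_text))
        if len(row) > 2 and tuple(row[:2]) in seen
    ]
```

Joining the result with " and ".join(...) gives a single line in the requested format.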
If M1 and M2 are always in fields 3 and 5, you can try this:
awk -F"," 'FNR==NR{
split($3,b," ")
split($5,c," ")
a[$1]=b[1]" "c[1]
next
}
($1 in a){
print "found: " $1" "a[$1]
}' file2.txt file1.txt
output
# cat file2.txt
build.xml,/home/build/NUOP/project1,M1 eesX.java,/home/build/adm/acl,M2 ddexse3.htm,/home/class/adm/33eFg
filename, blah,M1 blah, blah, M2 blah , end
$ cat file1.txt
build.xml,/home/build/NUOP/project1 eesX.java,/home/build/adm/acl
$ ./shell.sh
found: build.xml M1 M2
try http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the possibility to select the separator. Differences will be shown like: "Column XYZ in record 999" is different. After this, the actual and the expected result for this column will be shown.
