I'm given an alphabet.csv file with columns A B C D. In column B, the data entered is either 1 or 2.
I'm tasked with writing a shell script that reads the CSV file and splits it into two files named B_1.csv and B_2.csv. Each of these files must only contain the data rows matching the respective value in column B.
i.e:
B_1.csv should contain all the columns A B C D, but in column B, only 1 shows up.
B_2.csv should contain all the columns A B C D, but in column B, only 2 shows up.
Here is what I have so far:
file=alphabet.csv
if grep -c '1', file
then cat >> B_1.csv
else cat >> B_2.csv
fi
exit
However, this gives me the following error:
command not found
I'm a bit lost. I'm looking through guides and don't quite understand what they mean, and I'm not sure how I could do this with "sed" or "awk" or other tools.
awk is well suited to this task:
awk '$2 == 1 {print > "B_1.csv"} $2 == 2 {print > "B_2.csv"}' FS=, alphabet.csv
This can be easily generalized to more possible values in column 2:
awk '{print > ("B_" $2 ".csv")}' FS=, alphabet.csv
cat without a file name argument will write standard input to standard output. You probably don't have anything meaningful on standard input (unless you are manually typing in the file one line at a time). Similarly, your grep command would count the number of occurrences of 1, on all lines of the file (and basically always succeed; so not a very good command to put in a condition).
Probably you mean you want to write the current line to a different file, depending on what it contains. Awk makes this really easy:
awk -F ',' 'BEGIN { OFS = FS } { print > ($2 == "1" ? "B_1.csv" : "B_2.csv") }' alphabet.csv
In brief, this splits the current line on commas, examines the second field, and writes to one file or the other depending on what it contains. (Awk in general reads one line at a time, and applies the current script to each line in turn.) The compact but slightly obscure notation a ? b : c checks the truth value of a, and returns b if it is true, otherwise c (this "ternary operator" exists in many languages, including C).
The assignment OFS=FS makes sure the output is comma-separated, too (the default is to read and print whitespace-separated fields).
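As an aside, OFS only takes effect when awk rebuilds the record, which the one-liner above never does because each line is printed untouched; here is a purely illustrative sketch that forces the rebuild:
# Assigning a field to itself makes awk reassemble $0 using OFS,
# so the commas survive even when fields are modified.
awk -F ',' 'BEGIN { OFS = FS } { $1 = $1; print > ($2 == "1" ? "B_1.csv" : "B_2.csv") }' alphabet.csv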
If you wanted to do this in pure shell script, it would look something like
while IFS=, read -r one two three four; do
    case $two in
        1) echo "$one,$two,$three,$four" >>B_1.csv;;
        2) echo "$one,$two,$three,$four" >>B_2.csv;;
    esac
done <alphabet.csv
But really, use Awk. An important part of learning shell scripting is also learning when the shell is not the most adequate tool for the job; you generally pick up sed and Awk (or, these days, a modern scripting language like Python) as you go.
Both of these are slightly brittle with real-world CSV files, where a comma may not always be a column delimiter. (Commonly you have double-quoted fields which may contain literal commas which are not acting as delimiters, but CSV is not properly standardized.)
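If the quoting issue really matters and GNU Awk 5.3 or newer is available, its --csv mode understands quoted fields; a hedged sketch of the same split under that assumption:
# --csv tells gawk to split fields by CSV quoting rules, so a comma inside "..." is not a delimiter.
gawk --csv '{ print > ("B_" $2 ".csv") }' alphabet.csv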
I have a task where I'm given an input of the format:
4
A CS 22 M
B ECE 23 M
C CS 23 F
D CS 22 F
as user input from the command line. From this, we have to perform tasks like determining the number of male and female students, determining which department has the most students, etc. I have done this using awk with the input as a file. Is there any way to do this with user input instead of a file?
Example of a command I used for a file (where the content in the file is in the same format):
numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l) #list number of males
Not Reproducible
It works fine for me; I can't reproduce your problem with either GNU or BSD awk under Bash 5.0.18(1). With your posted code and file sample:
$ numberofmales=$(awk -F ' ' '{print $4}' file.txt | grep M | wc -l)
$ echo $numberofmales
2
Check to make sure you don't have problems in your input file, or elsewhere in your code.
Also, note that if you call awk without a file argument or input from a pipe, it tries to collect data from standard input. It may not actually be hanging; it's probably just waiting on end-of-file, which you can trigger with CTRL+D.
Recommended Improvements
Even if your code works, it can be improved. Consider the following, which skips the unnecessary field-separator definition and performs all the actions of your pipeline within awk.
males=$(
    awk 'tolower($4)=="m" {count++}; END {print count}' file.txt
)
echo "$males"
Fewer moving parts are often easier to debug, and can often be more performant on large datasets. However, your mileage may vary.
User Input
If you want to use user input rather than a file, you can use standard input to collect your data, and then pass it as a quoted argument to a function. For example:
count_males () {
    awk 'tolower($4)=="m" {count++}; END {print count}' <<< "$*"
}
echo "Enter data (CTRL-D when done):"
data=$(cat -)
# If at command prompt, wait until EOF above before
# pasting this line. Won't matter in scripts.
males=$(count_males "$data")
The result is now stored in males, and you can echo "$males" or make use of the variable in whatever other way you like.
Bash indeed does not care whether a file handle is connected to standard input or to a file, and neither does Awk.
However, if you want to pass the same input to multiple Awk instances, it really does make sense to store it in a temporary file.
A better overall solution is to write a better Awk script so you only need to read the input once.
awk 'NF > 1 { ++a[$4] } END { for (g in a) print g, a[g] }'
Demo: https://ideone.com/0ML7Xk
The NF > 1 condition is there to skip the count-only first line. Probably don't put that information there in the first place and let Awk figure out how many lines there are; it's probably better at counting than you are anyway.
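For reference, you can try it on the question's sample data with a here-document instead of a file (the order of the two output groups is not guaranteed):
awk 'NF > 1 { ++a[$4] } END { for (g in a) print g, a[g] }' <<'EOF'
4
A CS 22 M
B ECE 23 M
C CS 23 F
D CS 22 F
EOF
# prints "M 2" and "F 2", one per line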
grep "false" $1 | cut -d ',' -f2,7
This is as far as I got. With this I can get all the "false" errors and their response times, but I am having a hard time finding the average of all the response times combined.
It's not fully clear what you're trying to do, but if you're looking for the arithmetic mean of all second comma-delimited fields ("columns") where the seventh field is false then here's an answer using awk:
awk -F ',' '$7 == "false" { f++; sum += $2 } END { print sum / f }' "$@"
This sets the field separator to be , and then parses only lines whose seventh (comma-delimited) field is exactly false (also consider tolower($7) == "false"), incrementing a counter (f) and adding the second column to a sum variable. After running through all lines of all input files, the script prints the arithmetic mean by dividing the sum by the number of rows it keyed on. The trailing "$@" will send each argument to your shell script as a file for this awk command.
A note on fields: awk's field numbering starts at 1, and $0 has the special meaning of the whole line; $1 is the first field, and so on. awk is pretty flexible, so you can also do things like $i to refer to the field whose number is held in a variable i, including things like $(NF-1) to refer to the field before the last field of the line.
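A tiny throwaway illustration of those field references (the letters are arbitrary sample data):
# $2 is the second field, $i the field whose number is stored in i, $(NF-1) the next-to-last field.
echo 'a,b,c,d,e' | awk -F ',' -v i=3 '{ print $2, $i, $(NF-1) }'
# prints: b c d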
Non-delimiting commas:
If your data might have quoted values with commas in them, or escaped commas, the field calculation in awk (or in cut) won't work. A proper CSV parser (requiring a more complete language than bash plus extras like awk, sed, or cut) would be preferable to making your own. Alternatively, if you control the format, you can consider a different delimiter such as Tab or the dedicated ASCII Record Separator character (RS, a.k.a. U+001E, Information Separator Two, which you can enter in bash as $'\x1e' and in awk (and most other languages) as "\x1e").
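A short sketch of writing and reading data with that Record Separator delimiter, assuming you control the format (the file name and values are made up):
# Write two fields separated by the ASCII Record Separator byte (bash's $'\x1e') ...
printf '%s%s%s\n' false $'\x1e' 42 > sample.dat
# ... then read it back, letting bash expand the same byte for awk's field separator.
awk -F $'\x1e' '{ print $2 }' sample.dat
# prints: 42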
Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line) followed by the names of all the text files in which a match was found, together with their entire matching lines (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving on to the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in the pattern file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
A few lines from the pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
A few lines from one of the searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
In reality, both are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
if anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
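If you also need the matching file's name in front of each hit, as in the desired output shown in the question, the same idea extends with awk's FILENAME variable; a hedged sketch over already-uncompressed text files (the directory name is taken from the question):
# Remember the first two columns of every pattern line, then print
# "filename:line" for each line in the other files whose first two columns were seen.
awk 'NR==FNR { a[$1,$2]; next } ($1,$2) in a { print FILENAME ":" $0 }' patternfile many_files/*txt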
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w -l "^$column1\s*$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse a line into multiple variables passed as parameters, the last of which receives the rest of the line. It separates fields around the characters of the $IFS Internal Field Separator (by default tabs, spaces and newlines; it can be overridden for the read command alone by using while IFS='...' read ...).
Using -r avoids unwanted escape processing and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while loop is redirected, I also put the redirection on the while rather than on each individual command.
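A tiny sketch of that splitting behaviour with a one-off IFS (the sample line is made up):
# The last variable (rest) receives everything left over after the first two fields.
IFS=':' read -r first second rest <<< 'one:two:three:four'
echo "$first | $second | $rest"
# prints: one | two | three:four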
I want to combine the data of several (say 3) files that have the same columns and data types into a single file, which I can then use for further processing.
Currently I have to process the files one after the other. So, I am looking for a solution which I can write in a script to combine all the files into one single file.
For ex:
File 1:
mike,sweden,2015
tom,USA,1522
raj,india,455
File 2:
a,xyz,155
b,pqr,3215
c,lmn,3252
Expected combined file 3:
mike,sweden,2015
tom,USA,1522
raj,india,455
a,xyz,155
b,pqr,3215
c,lmn,3252
Kindly help me with this.
Answer to the original form of the question:
As #Lars states in a comment on the question, it looks like a simple concatenation of the input files is desired, which is precisely what cat is for (and even named for):
cat file1 file2 > file3
To fulfill the requirements you added later:
#!/bin/sh
# Concatenate the input files and sort them with duplicates removed
# and save to output file.
cat "$1" "$2" | sort -u > "$3"
Note, however, that you can combine the concatenation and sorting into a single step, as demonstrated by Jean-Baptiste Yunès's answer:
# Sort the input files directly with duplicates removed and save to output file.
sort -u "$1" "$2" > "$3"
Note that using sort is the simplest way to eliminate duplicates.
If you don't want sorting, you'll have to use a different, more complex approach, e.g. with awk:
#!/bin/sh
# Process the combined input and only
# output the first occurrence in a set of duplicates to the output file.
awk '!seen[$0]++' "$1" "$2" > "$3"
!seen[$0]++ is a common awk idiom to only print the first in a set of duplicates:
seen is an associative array that is filled with each input line ($0) as the key (index), with each element created on demand.
This implies that all lines from a set of duplicates (even if not adjacent) refer to the same array element.
In a numerical context, awk's variable values and array elements are implicitly 0, so when a given input line is seen for the first time and the post-increment (++) is applied, the resulting value of the element is 1.
Whenever a duplicate of that line is later encountered, the value of the array element is incremented.
The net effect is that for any given input line, !seen[$0]++ returns true if the line is seen for the first time, and false for each of its duplicates, if any. Note that because ++ is a post-increment, the expression uses the value of seen[$0] from before the increment.
! negates the value of seen[$0], causing a value of 0 (which is false in a Boolean context) to return true, and any nonzero value (encountered for duplicates) to return false.
!seen[$0]++ is an instance of a so-called pattern in awk: a condition evaluated against the input line that determines whether the associated action (a block of code) should be executed. Here there is no action, in which case awk implicitly simply prints the input line whenever !seen[$0]++ evaluates to true.
The overall effect is: Lines are printed in input order, but for lines with duplicates only the first instance is printed, effectively eliminating duplicates.
Note that this approach can be problematic with large input files with few duplicates, because most of the data must then be held in memory.
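A quick demonstration of the idiom on throwaway input:
# Non-adjacent duplicates are dropped; the first occurrence of each line survives, in input order.
printf '%s\n' apple banana apple cherry banana | awk '!seen[$0]++'
# prints: apple, banana, cherry (one per line)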
A script like:
#!/bin/sh
sort "$1" "$2" | uniq > "$3"
should do the trick. sort will sort the concatenation of the two files (the first two args of the script) and pass the result to uniq, which will remove adjacent identical lines and push the result into the third file (the third arg of the script).
If your files follow the same naming convention (say file1, file2, file3 ... fileN), then you can use this to combine them all:
cat file* > combined_file
Edit: Script to do the same assuming you are passing file names as parameter
#!/bin/sh
cat "$1" "$2" "$3" | uniq > combined_file
Now you can display combined_file if you want. Or access it directly.
I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, with special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this YouTube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on using the comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue == $expectedValue} {
        puts $line
    }
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then it splits each line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch (the variable is called idx here because index is a built-in awk function name and cannot be used as a variable):
awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
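To illustrate, a small sketch using the quoted-comma row from earlier in the thread (assumes csvquote is on your PATH):
# The embedded comma in "Woo, John" is hidden from awk and restored afterwards.
printf '%s\n' '"Woo, John",425-555-1212' | csvquote | awk -F, '{ print $2 }' | csvquote -u
# prints: 425-555-1212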
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
    reader = csv.reader(f)
    for row in reader:
        for col in row:
            print col+'\t',
        print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
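For instance, a hedged sketch of that line-by-line parsing, reusing the earlier pipeline (the column number is just the Uplinks example from above):
esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py |
while IFS= read -r line; do
    # Pull field 12 (Uplinks) out of each tab-separated line.
    uplinks=$(printf '%s\n' "$line" | cut -f12)
    echo "uplinks: $uplinks"
done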
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which is CSV processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it runs { print $0 }, i.e. prints the full line as requested.