Include header in grep of specific csv columns - bash

I am trying to extract relevant information from a large csv file for further processing, so I would like to have the column names (header) saved in my output mini-csv files.
I have:
grep "Example" $fixed_file | cut -d ',' -f 4,6 > $outputpath"Example.csv"
which works fine in generating a csv file with two columns, but I would like the header information to also be included in the output file.

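Use command grouping and add head -n 1 to the mix:
{ head -n 1 "$fixed_file" && grep "Example" "$fixed_file" | cut -d ',' -f 4,6 ;} \
>"$outputpath"Example.csv
Note that this writes the complete header line, because only the grep output passes through cut. To trim the header to columns 4 and 6 as well, pipe the whole group through cut:
{ head -n 1 "$fixed_file"; grep "Example" "$fixed_file"; } | cut -d ',' -f 4,6 \
>"$outputpath"Example.csv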

My suggestion would be to replace your multiple-command pipeline with a single awk script.
awk '
BEGIN {
OFS=FS=","
}
NR==1;
/Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
If you want your header to contain only fields 4 and 6, you could change this to:
awk '
BEGIN {
OFS=FS=","
}
NR==1 || /Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
Awk scripts consist of pairs of condition { action }. A condition with no action prints the line (which is why NR==1; prints the header).
And of course, you could compact this into a one-liner:
awk -F, 'NR==1||/Example/{print $4 FS $6}' "$fixed_file" > "$outputpath/Example.csv"
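To see the header-plus-filter behaviour on a small input, here is a minimal sketch. The sample file, its column names, and the temporary directory are invented for illustration; only the awk program comes from the answer above.

```shell
# Hypothetical sample data; the column names are made up for this demo.
tmpdir=$(mktemp -d)
fixed_file="$tmpdir/input.csv"
outputpath="$tmpdir"

cat > "$fixed_file" <<'EOF'
id,name,city,type,year,score
1,a,Oslo,Example,2020,10
2,b,Rome,Other,2021,20
3,c,Lima,Example,2022,30
EOF

# Print fields 4 and 6 of the header (NR==1) and of every line matching "Example".
awk -F, 'NR==1 || /Example/ { print $4 FS $6 }' "$fixed_file" > "$outputpath/Example.csv"

cat "$outputpath/Example.csv"
```

With this input the output file holds the trimmed header type,score followed by Example,10 and Example,30. Note that /Example/ matches anywhere on the line; use $4 == "Example" if the match should be limited to one column.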

Related

grep few columns from a file to another file in shell

The following file is present in file1.txt:
mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate
100|~|Balance|~|Abc|~|21-09-2020|~|22-09-2020
101|~|Clone|~|Bcd|~|11-07-2020|~|12-07-2020
102|~|Ledger|~|Def|~|12-06-2019|~|13-06-2019
How to grep only the columns mudId, mudType and mudDate with all the rows into another file?
The columns are separated by |~|
To meet your criteria of specifying the field names from the heading row, you can use awk utilizing a Regular Expression as the Field-Separator variable (e.g. "[|][~][|]"). For the first record (line), read the field names as array indexes and set the value to the current field index. For your second rule, simply output the field value captured in your array that corresponds to the strings "mudId", "mudType" and "mudDate".
For example you can do:
awk '
BEGIN { FS="[|][~][|]"; OFS="|~|" }
FNR==1 { for(i=1;i<=NF;i++) arr[$i]=i; next }
{ print $arr["mudId"], $arr["mudType"], $arr["mudDate"] }
' file
(note: the above intentionally generalizes to meet your criteria where you want to specify the string names of the fields to output)
If you simply want to write fields 1, 2, & 4 to a new file, you would do:
awk -v FS="[|][~][|]" -v OFS="|~|" 'FNR>1 {print $1,$2,$4}' file
Example Use/Output
Simply copy/middle-mouse paste the above into an xterm where file is in the current directory, e.g.
$ awk '
> BEGIN { FS="[|][~][|]"; OFS="|~|" }
> FNR==1 { for(i=1;i<=NF;i++) arr[$i]=i; next }
> { print $arr["mudId"], $arr["mudType"], $arr["mudDate"] }
> ' file
100|~|Balance|~|21-09-2020
101|~|Clone|~|11-07-2020
102|~|Ledger|~|12-06-2019
(note: if you want the new file space-delimited, just remove OFS="|~|")
or
$ awk -v FS="[|][~][|]" -v OFS="|~|" 'FNR>1 {print $1,$2,$4}' file
100|~|Balance|~|21-09-2020
101|~|Clone|~|11-07-2020
102|~|Ledger|~|12-06-2019
To write the output to a new file, just redirect it to the new filename (e.g. end the last command above with file > newfile instead of file).
Look things over and let me know if you have further questions.
If the column order is fixed as mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate, try this:
sed 's/|~|/\t/g' file1.txt | awk -F'\t' '{print $1"|~|"$2"|~|"$4}'
This relies on GNU sed, which treats \t in the replacement as a tab. If file1.txt may already contain tabs, replace \t with another character that does not occur in the file, and pass that same character to awk with -F.
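Both awk answers above drop the heading row. If the heading should be carried over too (the question asks for "all the rows"), one possible variation is to leave out the next in the first rule, so the header line falls through to the print rule. This is a sketch under that assumption; the temporary directory and the output name newfile.txt are placeholders.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
mudId|~|mudType|~|mudNAme|~|mudDate|~|mudEndDate
100|~|Balance|~|Abc|~|21-09-2020|~|22-09-2020
101|~|Clone|~|Bcd|~|11-07-2020|~|12-07-2020
102|~|Ledger|~|Def|~|12-06-2019|~|13-06-2019
EOF

awk '
BEGIN { FS = "[|][~][|]"; OFS = "|~|" }
# Map each header name to its field index. There is no "next" here,
# so the header line itself is also printed by the rule below.
FNR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }
{ print $col["mudId"], $col["mudType"], $col["mudDate"] }
' "$tmpdir/file1.txt" > "$tmpdir/newfile.txt"

cat "$tmpdir/newfile.txt"
```

The first output line is then mudId|~|mudType|~|mudDate, followed by the three data rows.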

shell : Number of string occurrences in a column using grep

My data file is:
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
I want to find number of times "Marie" occurred in the column 'favourite_person' (column 3). My code right now is grep -R "Marie" file | wc -l but this checks for the word "Marie" in the entire file. I only want it to check among the favourite_person column. What should I add in this case?
You can use awk as follows:
awk 'BEGIN { FS = "," } {if ($3 ~ "Marie") { count++ }} END { print count+0 }' file
BEGIN { FS = "," } sets , as the field separator,
the { if ... } part reads like "if the third field matches "Marie", then increment the variable count",
END { print count+0 } prints count at the end; the +0 makes it print 0 rather than an empty line when nothing matches.
You can use cut as well as grep:
cut -d "," -f3 file | grep Marie | wc -l
-d sets the delimiter, and -f3 takes the third column only
grep Marie checks if Marie is in the third column, and wc -l counts the occurrences
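The difference between matching the whole line and matching one column only shows up when the string also occurs elsewhere. The sketch below adds one hypothetical row (Marie,21,Alan Turing) to the question's data for that purpose; the variable names are illustrative.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/people.csv" <<'EOF'
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
Marie,21,Alan Turing
EOF

# Whole-line match: also counts the added row whose *name* is Marie.
any_column=$(grep -c "Marie" "$tmpdir/people.csv")

# Column-restricted match: skip the header (NR > 1) and test field 3 only.
third_column=$(awk -F, 'NR > 1 && $3 ~ /Marie/ { n++ } END { print n + 0 }' "$tmpdir/people.csv")

# n + 0 forces a numeric 0 when nothing matches.
no_match=$(awk -F, 'NR > 1 && $3 ~ /Zelda/ { n++ } END { print n + 0 }' "$tmpdir/people.csv")

echo "any column: $any_column, third column: $third_column, no match: $no_match"
```

This prints any column: 3, third column: 2, no match: 0.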

Split a large gz file into smaller ones filtering and distributing content

I have a gzip file of size 81G which I unzip and size of uncompressed file is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values range between 1-10. I want to split the files into 10 subfiles where by all rows where value in first column is 1 is put into 1 file. All the rows where the value is 2 in the first column is put into a second file and so on. While I do that I don't want to put column 3 and column 5 in the new subfiles. Also the file is tab separated. For example:
col_1 col_2. col_3. col_4. col_5. col_6
1. 7464 sam. NY. 0.738. 28.9
1. 81932. Dave. NW. 0.163. 91.9
2. 162. Peter. SD. 0.7293. 673.1
3. 7193. Ooni GH. 0.746. 6391
3. 6139. Jess. GHD. 0.8364. 81937
3. 7291. Yeldish HD. 0.173. 1973
File above will result in three different gzipped files such that col_3 and col_5 are removed from each of the new subfiles. What I did was
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also I don't know how to remove col_3 and col_5 from the new subfiles.
Like I said gzip file is 81G and therefore, I am looking for an efficient solution. Insights will be appreciated.
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 "csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -F'\t' -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 "csv.gz" }'
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 "csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk adds a number at the front to ensure the header line sorts before the rest during the sort phase, and adds the line number so that lines with the same original first field value retain their original input order. Then we sort by the first field, and then cut away the 2 fields added in the first step, then use awk to robustly and portably create the separate output files, ensuring that each output file starts with a copy of the header. We close each output file as we go so that the script will work for any number of output files using any awk and will work efficiently even for a large number of output files with GNU awk. It also ensures that each output file name is properly quoted to avoid globbing, word splitting, and filename expansion.
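The approach can be tried end to end on a tiny file before running it on the real data. The following is a simplified sketch, not the exact script above: it assumes the input is tab-separated and already sorted by the first column, and the sample values, the temporary directory, and the dir and cmd variable names are made up.

```shell
tmpdir=$(mktemp -d)

# Tiny tab-separated stand-in for the real file (values are illustrative).
{
  printf 'col_1\tcol_2\tcol_3\tcol_4\tcol_5\tcol_6\n'
  printf '1\t7464\tsam\tNY\t0.738\t28.9\n'
  printf '1\t81932\tDave\tNW\t0.163\t91.9\n'
  printf '2\t162\tPeter\tSD\t0.7293\t673.1\n'
  printf '3\t7193\tOoni\tGH\t0.746\t6391\n'
} | gzip > "$tmpdir/infile.gz"

gunzip -c "$tmpdir/infile.gz" |
awk -v dir="$tmpdir" '
BEGIN { FS = OFS = "\t" }
NR == 1 { hdr = $1 OFS $2 OFS $4 OFS $6; next }   # remember the trimmed header
$1 != prev {                                       # first column changed: switch output
    if (cmd != "") close(cmd)                      # finish the previous gzip process
    cmd = "gzip > \047" dir "/" $1 ".csv.gz\047"   # \047 is a single quote
    print hdr | cmd                                # each output starts with the header
    prev = $1
}
{ print $1, $2, $4, $6 | cmd }                     # columns 3 and 5 are dropped
'

gunzip -c "$tmpdir/1.csv.gz"
```

Only one gzip process is open at a time, which is why the input has to be grouped by the first column; on unsorted input a reopened file would be overwritten.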

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are missing then:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
trying to replace the " " with ","
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
Here n=3 and the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input
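For the bonus part of the question (using the number of descriptors as n), a small runnable sketch along the lines of the awk answers could look like this. The shortened ids and coordinates are placeholders, the file names are invented, and the input is assumed to have no trailing newline and no commas inside fields.

```shell
tmpdir=$(mktemp -d)

# One-line CSV without a trailing newline (placeholder values).
printf '%s' '"a1","7.1","45.1","b2","7.2","45.2"' > "$tmpdir/oneline.csv"

header='"id","lon","lat"'
{
  printf '%s\n' "$header"
  awk -v RS=, -v header="$header" '
    BEGIN { n = split(header, parts, ",") }          # n = number of descriptors
    { ORS = (NR % n == 0) ? "\n" : ","; print }      # newline after every n-th element
  ' "$tmpdir/oneline.csv"
} > "$tmpdir/table.csv"

cat "$tmpdir/table.csv"
```

The result is the header line followed by "a1","7.1","45.1" and "b2","7.2","45.2" on separate lines. Changing the header string changes n automatically.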

How to print variable value always as last column in CSV file

I have a list of CSV files, and I have to append the value of a variable (it is dynamic and will change) as the last column of each CSV file.
Here is the code:
addProgramtypeID () {
for csv in $1
do
file_name="$csv"
echo $file_name
f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
echo $f
k=`grep -i $f Program_type.csv | cut -d ',' -f3`
echo $k
awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now the value of the variable k is printed in the 1st column of the CSV file, and it also removes the first 2 characters of the first column. My requirement is that the variable value should always come as the last column in the CSV file.
input :
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
if suppose $k=2
output:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
awk 'BEGIN{FS=","}; {print($(NF))}' "$csv_file"
done
Or even just
awk 'BEGIN{FS=","}; {print($(NF))}' $ALL_MY_FILES
Both of these will print the last column of all the CSV files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side. Awk is completely unaware of things like quoted strings or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything) and then start tweaking.
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
csv="$1"
awk -v csv="$csv" '
BEGIN{ FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
NR==FNR { if ($0 ~ f) { k = $3 }; next }
{ print $0, k }
' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f), I just copied what your original grep f <file> logic would do.
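A value that lands at the start of the line and overwrites its first characters is the typical symptom of Windows (CRLF) line endings, though the question does not confirm that this is the cause here. The sketch below is a variation on the function above under that assumption: it strips a trailing carriage return and uses the exact match $1 == key. The lookup file argument, the key variable, and the T_H_EDG.csv file name are hypothetical.

```shell
tmpdir=$(mktemp -d)

cat > "$tmpdir/Program_type.csv" <<'EOF'
type,desc,id
EDC,Alb,1
EDG,Gsc,2
EOF

# Data file written with CRLF line endings on purpose (assumed cause of the symptom).
printf 'TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID\r\n123,3,334,234,3\r\n545,2,444,456,5\r\n' \
  > "$tmpdir/T_H_EDG.csv"

addProgramtypeID() {
    csv=$1
    lookup=$2
    # Key = third "_"-separated part of the file name, without the .csv suffix.
    key=$(basename "$csv" .csv | cut -d '_' -f 3)
    awk -v key="$key" '
        BEGIN { FS = OFS = "," }
        { sub(/\r$/, "") }                          # strip a trailing carriage return, if any
        NR == FNR { if ($1 == key) id = $3; next }  # first file: look up the id
        { print $0, id }                            # second file: append id as last column
    ' "$lookup" "$csv" > "$csv.tmp" && mv "$csv.tmp" "$csv"
}

addProgramtypeID "$tmpdir/T_H_EDG.csv" "$tmpdir/Program_type.csv"
cat "$tmpdir/T_H_EDG.csv"
```

With the EDG key this appends 2 to every line, matching the desired output in the question.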
