How to create a pivot table using a shell script

I have data in a CSV file as below...
Emailid Storeid
a@gmail.com 2000
b@gmail.com 2001
c@gmail.com 2000
d@gmail.com 2000
e@gmail.com 2001
I am expecting the output below, basically finding out how many email ids there are for each store.
StoreID Emailcount
2000 3
2001 2
So far I have tried the following to solve my issue:
IFS=","
while read f1 f2
do
awk -F, '{ A[$1]+=$2 } END { OFS=","; for (x in A) print x,A[x]; }' > /home/ec2-user/storewiseemials.csv
done < temp4.csv
With the above shell script I am not getting the desired output. Can you please help me?

Using Miller (https://github.com/johnkerl/miller) and starting from this input (I have used a comma-separated CSV, because I do not know whether you use a tab or a space as the separator)
Emailid,Storeid
a@gmail.com,2000
b@gmail.com,2001
c@gmail.com,2000
d@gmail.com,2000
e@gmail.com,2001
and running
mlr --csv count-distinct -f Storeid -o Emailcount input >output
you will have
+---------+------------+
| Storeid | Emailcount |
+---------+------------+
| 2000    | 3          |
| 2001    | 2          |
+---------+------------+
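If Miller is not available, a plain awk sketch (assuming the same comma-separated file named input, with the header row shown above) would give the same counts:
awk -F, 'NR > 1 { count[$2]++ }              # count email rows per Storeid, skipping the header
         END {
             OFS = ","
             print "StoreID", "Emailcount"
             for (s in count) print s, count[s]
         }' input
Note that the order of the for (s in count) loop is not guaranteed; pipe the result through sort if the stores must appear in order.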

Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.
I've written a Bash script to try and do this, but am running into a few issues.
Here's the code:
#!/bin/bash
total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
  cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
  if [ $count -eq 1 ]; then
    cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
  fi
  count=0
  i=$i+1
done
Firstly I find the number of columns and store it in total. Then, while the program has not reached the last column, I loop through the columns individually. The nested while loop checks whether each row in the column is empty, and if only the header row is non-empty (i.e. the column contains no data), it writes all the other columns to another file.
I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
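As a quick standalone illustration of that subshell behaviour (not part of the script above), a counter incremented inside a pipeline's while loop is lost once the loop ends:
count=0
printf 'a\nb\nc\n' | while read line; do
  count=$((count+1))      # this runs in a subshell created by the pipeline
done
echo "$count"             # prints 0 in the parent shell, not 3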
So my question then is: how can I fix this? I initially wanted it to write to a new file column by column, based on count, but I couldn't figure out how to get that done either.
Edit: People have asked for a sample input and output.
Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a@a.com, Yes, , 3
b, 20182817, b@b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2
Sample output:
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.
I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7
This requires bash with extglob enabled. Otherwise, you can use an external call to sed:
$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
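For reference, enabling extglob and running the same expansion looks like this (it stays enabled for the rest of the shell session or script):
$ shopt -s extglob
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7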
There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF) and the number of lines (as NR), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.
$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
{ printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>
This uses printf to show the number of records (NR, the current line number), the number of fields (NF) and the value of the sixth field ($6, which can also be given via a variable, e.g. $NF is the value of the final field since awk is one-indexed).
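A minimal sketch of the total+=NF idea mentioned above (data.csv is just a placeholder file name):
awk -F ',+' '{ total += NF }                                   # add this line's field count to the running total
             END { printf "%d lines, %d fields in total\n", NR, total }' data.csv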
It is actually the job of a CSV parser, but you may use this awk script to get the job done:
cat removeEmptyCellsCsv.awk
BEGIN {
    FS = OFS = ", "
}
NR == 1 {
    for (i=1; i<=NF; i++)
        e[i] = 1 # initially all cols are marked empty
    next
}
FNR == NR {
    # first pass: a column stays marked empty only while every one of its cells is empty
    for (i=1; i<=NF; i++)
        e[i] = e[i] && ($i == "")
    next
}
{
    # second pass: rebuild each record, dropping the columns still marked empty
    s = ""
    for (i=1; i<=NF; i++)
        s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
    print s
}
Then run it as follows (the file.csv{,} brace expansion simply passes the same file twice, once for each pass of the script):
awk -f removeEmptyCellsCsv.awk file.csv{,}
Using the sample data provided in the question, it will produce the following output:
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
Note that the Posts column has been removed because it is empty in every record.
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    # first pass: remember which columns contain at least one non-blank value
    if ( NR > 1 ) {
        for (i=1; i<=NF; i++) {
            if ( $i ~ /[^[:space:]]/ ) {
                gotValues[i]
            }
        }
    }
    next
}
{
    # second pass: print only the columns recorded above
    c=0
    for (i=1; i<=NF; i++) {
        if (i in gotValues) {
            printf "%s%s", (c++ ? OFS : ""), $i
        }
    }
    print ""
}
$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.
You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.
Starting from
+------+----------+---------+---------------+-------+----------+
| User | Date     | Email   | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a    | 20201719 | a@a.com | Yes           | -     | 3        |
| b    | 20182817 | b@b.com | No            | -     | 4        |
| c    | 20191618 | -       | No            | -     | 4        |
| d    | 20190126 | -       | No            | -     | 2        |
+------+----------+---------+---------------+-------+----------+
and running
mlr --csv remove-empty-columns input.csv >output.csv
you will have
+------+----------+---------+---------------+----------+
| User | Date     | Email   | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a    | 20201719 | a@a.com | Yes           | 3        |
| b    | 20182817 | b@b.com | No            | 4        |
| c    | 20191618 | -       | No            | 4        |
| d    | 20190126 | -       | No            | 2        |
+------+----------+---------+---------------+----------+

CSV - How to add columns based on an existing column?

What is the best way to do this and how?
I gather things called sed, AWK and bash may be relevant.
I have used AWK once for one command, the others never.
I have searched and other apparently similar questions do not have an answer I need.
I have columns (which I have called fields) in a CSV file:
_________________________
field1 | field2 | field3|
-------------------------
1990AB | 123456 | 123456|
-------------------------
I want to add fields based on these three original fields to appear as follows:
_______________________________________________________
field1 | field2 | field3 | field1a | field2a | field3a |
-------------------------------------------------------
1990AB | 123456 | 123456| 1990 | 12345 | 12345 |
-------------------------------------------------------
where:
field1a (1990):  the first 4 characters of field1 (column 1 is always 4 digits followed by letters)
field2a (12345): the first 5 digits of field2 (column 2 is always 6 digits)
field3a (12345): the first 5 digits of field3 (column 3 is always 6 digits)
These are one-time-per-file actions, prior to database import.
The file (on macOS) has about 6 million records. This is my second attempt at this question, as my first was apparently not good. In this area I am a 100% novice.
awk to the rescue!
This should be easy to read even if you have no prior experience with awk:
$ awk -F, -v OFS=, 'NR==1 {for(i=1;i<=3;i++) $(++NF)=$i"a"}
                    NR>1  {$(++NF)=substr($1,1,4);
                           $(++NF)=substr($2,1,5);
                           $(++NF)=substr($3,1,5)}1' file
NR is the line number (with special treatment for the header) and NF is the number of fields, incremented here for each additional column; $i is the field value at position i. The trailing 1 is shorthand for printing the line. The initial options set the input field separator (-F) and the output field separator (OFS) to a comma.
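Assuming the CSV actually has a header line field1,field2,field3 followed by the sample row, the command above should print something like:
field1,field2,field3,field1a,field2a,field3a
1990AB,123456,123456,1990,12345,12345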

convert a text file to csv using shell script

I am new to shell scripting; can anyone give me a shell script for the condition below?
My Input:
id | name | values
----+------+--------
1 | abc | 2
1 | abc | 3
1 | abc | 4
1 | abc | 5
1 | abc | 6
1 | abc | 7
Expected Output:
1,abc,2
"
"
1 million records
You can use awk for this. The field separator is a pipe with any surrounding blanks; NR>1 skips the header line, NF==3 skips the ----+------+-------- separator line (it contains no pipes), and sub() strips the leading blanks from the first field:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, 'NF==3 && NR>1{sub(/^[[:blank:]]*/, "", $1); print}' file
1,abc,2
1,abc,3
1,abc,4
1,abc,5
1,abc,6
1,abc,7
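If you prefer sed, a rough equivalent sketch (assuming the header and the ----+------ separator are always the first two lines) would be:
sed -E '1,2d; s/[[:blank:]]*[|][[:blank:]]*/,/g; s/^[[:blank:]]+//' file
which should produce the same comma-separated output.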

retrieve and add two numbers of files

In my file I have the following structure:
A | 12 | 10
B | 90 | 112
C | 54 | 34
What I have to do is add column 2 and column 3 and print the result with column 1.
Output:
A | 22
B | 202
C | 88
I can retrieve the two columns but don't know how to add them.
What I did is:
cut -d ' | ' -f3,5 myfile.txt
How do I add those columns and display the result?
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk:
awk '{print $1," | ",($3+$5)}' myfile.txt
will probably work, since with the default whitespace field separator the pipes themselves are counted as fields, so the two numbers end up in $3 and $5.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS and OFS differ here only so the output spacing matches the desired format
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.

Bash and awk: converting a field from 12 hour to 24 hour clock time

I have a large space-delimited txt file which I split into 18 smaller files (each with its own number of columns). This split is based on a delimiter, i.e. whenever the timestamp hits midnight. So effectively, I'll end up with 18 files in the form of (note: ignore the dashes and pipes, I've used them to improve readability):
file1
time ----------- valueA - valueB
12:00:00 AM | 54.13 | 239.12
12:00:01 AM | 51.83 | 119.93
..
file18
time ---------- valueA - valueB - valueC - valueD
12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12
12:00:01 AM | 23.92 | 121.92 | 201.23 | 892.12
..
Once I split the file, I then perform some processing on each of the files using awk, so in short there are two stages: the 'split stage' and the 'processing stage'.
Unfortunately, the timestamp contained in the large txt file is in one of two formats: either the desirable 24-hour format of "00:00:01" or the undesirable 12-hour format of "12:00:01 AM".
As a result, I'm trying to convert all timestamps to the 24-hour format and I'm not sure how to do this. I'm also not sure whether to attempt this at the split stage using bash or at the processing stage using awk. I know that the following command converts 12-hour to 24-hour time:
date --date="12:00:01 AM" +%T
However, I'm not sure how to incorporate this into my shell script where I'm using 'while read line' at the 'split stage', or whether I should do the time conversion in awk (if possible?) at the 'processing stage'.
See the test below; is it helpful for you?
kent$ echo "12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12 "\
|awk -F'|' 'BEGIN{OFS="|"}{("date --date=\""$1"\" +%T") |getline $1;print }'
output
00:00:00| 54.92 | 239.12 | 231.23 | 882.12
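A sketch of the same idea applied to a whole file at the processing stage (file1.txt is a placeholder name; the close() call makes sure the date command is re-run cleanly for every line):
awk -F'|' 'BEGIN { OFS = "|" }
{
    cmd = "date --date=\"" $1 "\" +%T"   # GNU date converts the 12-hour timestamp in field 1
    cmd | getline $1                     # overwrite field 1 with the 24-hour time
    close(cmd)                           # close the pipe so the next line gets a fresh conversion
    print
}' file1.txt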
