Convert a text file to CSV using a shell script

I am new to shell scripting. Can anyone give me a shell script for the condition below?
My Input:
id | name | values
----+------+--------
1 | abc | 2
1 | abc | 3
1 | abc | 4
1 | abc | 5
1 | abc | 6
1 | abc | 7
Expected Output:
1,abc,2
"
"
1 million records

You can use awk for this:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, 'NF==3 && NR>1{sub(/^[[:blank:]]*/, "", $1); print}' file
1,abc,2
1,abc,3
1,abc,4
1,abc,5
1,abc,6
1,abc,7
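The same logic as a commented awk script, for readability (a sketch; the file name pipe2csv.awk is just an example, run it with awk -f pipe2csv.awk file):
# pipe2csv.awk
BEGIN {
    FS = "[[:blank:]]*\\|[[:blank:]]*"   # field separator: a pipe with any surrounding blanks
    OFS = ","                            # join output fields with commas
}
NF == 3 && NR > 1 {                      # skip the header (NR==1) and the ----+---- rule (NF==1)
    sub(/^[[:blank:]]*/, "", $1)         # strip the blanks still attached to the first field
    print                                # modifying $1 rebuilds the record with OFS
}
Since awk streams the file in a single pass, a million records is not a problem.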

Related

awk command to print multiple columns using for loop

I have a single file in which the 1st and 2nd columns contain an item code and name, and the 3rd to 12th columns contain its consumption quantity for 10 consecutive days.
Now I need to split that into 10 different files. In each, the 1st and 2nd columns should be the same item code and item name, and the 3rd column should contain the consumption quantity for one day.
input file:
Code | Name | Day1 | Day2 | Day3 |...
10001 | abcd | 5 | 1 | 9 |...
10002 | degg | 3 | 9 | 6 |...
10003 | gxyz | 4 | 8 | 7 |...
I need the Output in different file as
file 1:
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
file 2:
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
file 3:
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
and so on....
I wrote code like this:
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$3}' FILE_NAME > file1;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$4}' FILE_NAME > file2;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$5}' FILE_NAME > file3;
and so on...
Now I need to write it within a 'for' or 'while' loop, which would be faster...
I don't know the exact code; maybe something like this:
for (( i=3; i<=NF; i++)) ; do awk 'BEGIN { FS = "\t" } ; {print $1,$2,$i}' input.tsv > $i.tsv; done
Kindly help me get the output as I explained.
If you absolutely need to use a loop in Bash, then your loop can be fixed like this:
for ((i = 3; i <= 10; i++)); do awk -v field=$i 'BEGIN { FS = "\t" } { print $1, $2, $field }' input.tsv > file$i.tsv; done
But it would be really better to solve this using pure awk, without shell at all:
awk -v FS='\t' '
NR == 1 {
for (i = 3; i <= NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i > fn;
print "" >> fn;
}
}
NR > 2 {
for (i = 3; i <= NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i >> fn;
}
}' inputfile
That is, when you're on the first record, create the output files by writing the header line and a blank line (as specified in your question). For the 3rd and later records, append to the files.
Note that the code in your question suggests that the fields in the file are separated by tabs, but the example files seem to use | padded with a variable number of spaces. It's not clear which one is your actual case. If it's really tab-separated, then the above code will work. If in fact it's as the example inputs, then change the first line to this:
awk -v OFS=' | ' -v FS='[ |]+' '
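To check how that field separator splits the space-padded sample, a quick sketch:
$ echo '10001 | abcd | 5 | 1 | 9' | awk -v FS='[ |]+' '{ print NF, $3 }'
5 5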
bash + cut solution:
input.tsv test content:
Code | Name | Day1 | Day2 | Day3
10001 | abcd | 5 | 1 | 9
10002 | degg | 3 | 9 | 6
10003 | gxyz | 4 | 8 | 7
day_splitter.sh script:
#!/bin/bash
n=$(head -1 "$1" | awk -F'|' '{print NF}')   # total number of fields
for ((i=3; i<=n; i++))
do
    fn="Day$((i-2))"                         # file name containing the Day number
    cut -d'|' -f1,2,"$i" "$1" > "$fn.txt"
done
Usage:
bash day_splitter.sh input.tsv
Results:
$ cat Day1.txt
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
$ cat Day2.txt
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
$ cat Day3.txt
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
In pure awk:
$ awk 'BEGIN{FS=OFS="|"}{for(i=3;i<=NF;i++) {f="file" (i-2); print $1,$2,$i >> f; close(f)}}' file
Explained:
$ awk '
BEGIN {
FS=OFS="|" } # set delimiters
{
for(i=3;i<=NF;i++) { # loop the consumption fields
f="file" (i-2) # create the filename
print $1,$2,$i >> f # append to target file
close(f) } # close the target file
}' file
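The close(f) inside the loop matters when there are many day columns: without it every target file stays open for the whole run, and some awk implementations have a low limit on simultaneously open files. Note also that the header row goes through the same loop, so each fileN automatically starts with its Code | Name | DayN header. On gawk with a modest column count, a variant without the explicit close also works (a sketch):
$ awk 'BEGIN{FS=OFS="|"}{for(i=3;i<=NF;i++) print $1,$2,$i >> ("file" (i-2))}' file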

Replace string in Nth row

I have a .txt file with rows that look like this:
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 0
I want to add counts, so I need to replace the last digit, but only on one line at a time.
I am getting the needed new value with this command:
$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4 }' filename)
Then I want to replace it with a new value, which will be bigger by 1.
I'm doing that right here:
counts=$(( $(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4 }' filename) + 1 ))
where idOpen is the id of the row where I need to replace the string.
So I have tried to replace the whole row with this:
counter="$(awk -v i=$idOpen 'BEGIN{FNqR == i}{$7+=1} END{ print $0}' bookmarks)"
N=$idOpen
sed -i "{N}s/.*/${counter}" bookmarks
But it doesn't work!
So is there a way to replace only the last field with the value I got earlier?
As a result I need to get:
id | String1 | String2 | Counts
1 | Abc | Abb | 1 # if idOpen was 1 once
2 | Cde | Cdf | 2 # if idOpen was 2 twice
The last number should be increased by 1 every time I run these commands.
awk solution:
Setting the idOpen variable (e.g. 2):
idOpen=2
awk -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file > tmp && mv tmp file
The output (after executing the above command twice):
cat file
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 2
NR>1 - skipping the header line
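Wrapped into a small script so the counter can be bumped repeatedly (a sketch; the script name bump_count.sh and the argument order are assumptions):
#!/bin/bash
# bump_count.sh <idOpen> <file> - increment the Counts column of one row
idOpen=$1
file=$2
awk -F'|' -v i="$idOpen" '
    NR > 1 && $1 == i { $4 = " " $4 + 1 }   # match the id column, bump Counts
    1                                       # print every line, changed or not
' OFS='|' "$file" > "$file.tmp" && mv "$file.tmp" "$file"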

Can't iterate over array in Bash

I need to add a new column with an (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
The incoming table has more than 500,000 rows and 7 columns, e.g. https://www.dropbox.com/s/g2u68fxrkttv4gq/incoming_data.csv?dl=0
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
-----------------
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
--------------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
First column is ID, so I've tried to group all lines with the same ID and iterate over them. Script (I don't know bash scripting, to be honest):
FILE=$PWD/$1
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')
for id in $IDS_ARRAY; do
# Group $FILE by $id from $IDS_ARRAY.
cat $FILE | grep $id >> temp_mail_group.csv
ROW_GROUP=$PWD/temp_mail_group.csv
# Add a number after each row.
# NF+1 — add a column after last existing.
awk -F'|' '{$(NF+1)=++i;}1' OFS="|", $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
rm -f $PWD/temp_mail_group.csv
done
Right now this script works almost like I want it to, except that it thinks that (for example) ID 2834 and 772834 are the same.
UPD: Although I marked one answer as accepted, it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}
'
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
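To see the counter trick in isolation (a minimal sketch):
$ printf '%s\n' a a b a b | awk '{ print $1, ++seen[$1] }'
a 1
a 2
b 1
a 3
b 2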
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
$ awk '
BEGIN{ w = 8 }
{
if (NR==1) {
val = sprintf("%*s|",w,"")
}
else if (NR==2) {
val = sprintf("%*s",w+1,"")
gsub(/ /,"-",val)
}
else {
val = sprintf(" %-*s|",w-1,++cnt[$2])
}
print $0 val
}
' file
| id | Name | |
----------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
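The %*s format takes the field width from an argument (here w), which is what keeps the appended column aligned with the header; a minimal sketch of the idea with gawk:
$ gawk 'BEGIN { printf "[%*s]\n", 5, "x" }'
[    x]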
An awk way
Without extending the dashed separator line:
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
output
| id | Name |
-------------
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
inputfile=$1
prev_id=
while IFS= read -r line ; do
    printf '%s' "$line"
    IFS=$'| \t\n' read t1 id name t2 <<<"$line"
    if [[ $line == -* ]] ; then
        printf '%s\n' '---------'
    elif [[ $id == 'id' ]] ; then
        printf ' Number |\n'
    else
        if [[ $id != "$prev_id" ]] ; then
            id_count=0
            prev_id=$id
        fi
        printf '%2d |\n' "$(( ++id_count ))"
    fi
done <"$inputfile"

replace string in pipe-delimited file using nawk

I need to implement an if condition in the below nawk command to process the input file when the third column has more than three digits. Please help with the command; what am I doing wrong, as it is not working?
inputfile.txt
123 | abc | 321456 | tre
213 | fbc | 342 | poi
outputfile.txt
123 | abc | 321### | tre
213 | fbc | 342 | poi
cat inputfile.txt | nawk 'BEGIN {FS="|"; OFS="|"} {if($3 > 3) $3=substr($3, 1, 3)"###" print}'
Try:
awk 'length($3) > 3 { $3=substr($3, 1, 3)"###" } 1 ' FS=\| OFS=\| test1.txt
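Note that with FS=\| each field keeps its padding spaces, so length($3) counts those too; on the space-padded sample above even 342 (stored as " 342 ") is longer than three characters. The gawk variant below strips the padding as part of the field separator.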
This works with gawk:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt
It won't preserve the whitespace, so you might want to pipe the result through column -t.

shell - grep - how to get only lines that have a certain number of a character

Good morning.
I have the following lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
I want to get only the lines with 7 "|" and the same first field.
So the output for these two lines will be nothing, but for these two lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
The output will be "error".
I'm getting the inputs from a file using the following command:
grep '.*|.*|.*|.*|.*|.*|.*|.*' < $1 | sort -nbsk1 | cut -d "|" -f1 | uniq -d |
while read line2; do
echo error
done
But this implementation would still print "error" even if I have more than 7 "|".
Any suggestions?
P.S. - I can assume that there is a \n at the end of each line.
For printing lines containing exactly 7 |, try:
awk -F'|' 'NF == 8' filename
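With -F'|', a line containing exactly seven pipes splits into eight fields, hence NF == 8.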
If you want to use bash to count the number of | in a given line, try:
line="1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123";
count=${line//[^|]/};
echo ${#count};
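To combine both of the question's conditions (exactly 7 "|" and a repeated first field) in a single pass, a sketch along the same lines (the file name is an assumption):
awk -F'|' 'NF == 8 && seen[$1]++ { print "error"; exit }' inputfile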
With grep
grep '^\([^|]*|[^|]*\)\{7\}$'
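Each \([^|]*|[^|]*\) group matches exactly one literal | (in basic regular expressions an unescaped | is an ordinary character), so seven repetitions anchored by ^ and $ match lines with exactly seven pipes.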
Assuming zz.txt is:
$ cat zz.txt
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
$ cut -d\| -f1-8 zz.txt
The above cut command will give you the output you need.
I would suggest that you use awk for this job.
BEGIN { FS = "|" }
NF == 8 && $1+0 == 1 { print $0 }
would do the job (note that awk string constants use double quotes, and since FS = "|" leaves the padding spaces in each field, comparing the numeric value $1+0 is more reliable than a string match).
