How to convert a CSV file in bash?

I have a file, each line of which is a list of comma-separated values. For example,
1, a, b, c, d, e
2, x, y, z
Now I would like to convert it in bash as follows:
1 a
1 b
1 c
1 d
1 e
2 x
2 y
2 z
How can I do this with a shell (bash) script?

awk -F, '{for(i=2;i<=NF;i++)print $1,$i}' temp
tested below:
> cat temp
1, a, b, c, d, e
2, x, y, z
> awk -F, '{for(i=2;i<=NF;i++)print $1,$i}' temp
1 a
1 b
1 c
1 d
1 e
2 x
2 y
2 z
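Note that with a bare -F, each field after the first still carries the leading space from the input (" a" rather than "a"). If you want that stripped, a regex field separator handles it; a small variant of the same command, assuming the temp file above:
awk -F', *' '{for(i=2;i<=NF;i++)print $1,$i}' temp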

You can split the line into tokens and put them in an array. The first element of the array will hold the number, in your case 1 or 2 and so on. Something like this:
while read -r line
do
    arrIN=(${line//,/ })   # replace commas with spaces; word splitting builds the array
    ## make a loop and echo them
    ## ${arrIN[0]} will hold the initial number
done < "$file"
# $file is the input file you are reading
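Fleshing out that loop, a minimal sketch of the pure-bash approach, assuming the same temp input file as above:
#!/bin/bash
# Pair the first token of each line with every later token.
while IFS= read -r line; do
    arrIN=(${line//,/ })   # split on commas via word splitting (values must not contain globs)
    for ((i=1; i<${#arrIN[@]}; i++)); do
        echo "${arrIN[0]} ${arrIN[i]}"
    done
done < temp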

I understand you need a shell script, but does it need to be bash? I would normally use a higher-level scripting language and a CSV library, e.g. Perl and Text::CSV.

Related

How do I change an array of numbers into corresponding letters of the alphabet

I have an array called variable that contains the numbers 1-26. I am trying to use a for loop in bash to go through each number of the array and associate it with a letter of the alphabet, since tr only lets me translate the first few letters. An example of my code is below.
Note: I am using bash
#!/bin/bash
for p1 in "${variable[@]}"; do
    if (( p1 == 1 )); then
        newvar+='a'
    elif (( p1 == 2 )); then
        newvar+='b'
    # ...... and so on down to z
    fi
done
I am trying to create the string newvar, which contains these translated letters. However, when I try to run this it only shows me a, which is the very first number translated. Why doesn't this work?
for p1 in "${variable[@]}"; do
    chars+=( $((p1 + 96)) )   # shift 1..26 to ASCII codes 97..122 ('a'..'z')
done
printf '%b' "$(printf '\\%03o' "${chars[@]}")"   # each code becomes an octal escape, then %b expands it
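For example, with a hypothetical input array variable=(8 5 12 12 15), the snippet prints hello:
variable=(8 5 12 12 15)   # hypothetical input: 'hello' as alphabet positions
chars=()
for p1 in "${variable[@]}"; do
    chars+=( $((p1 + 96)) )
done
printf '%b\n' "$(printf '\\%03o' "${chars[@]}")"
# hello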
Maybe:
# alphabet=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
alphabet=({a..z})
letters=(8 5 12 12 15 23 15 18 12 4)
phrase=''
for i in "${letters[@]}"; do
    phrase+="${alphabet[i-1]}"
done
echo "$phrase"
helloworld

Processing data swapped over files BASH

First, I would like to apologize for my extremely basic knowledge of coding. I hope that I can express my issue correctly; do not hesitate to ask for further clarification or anything else.
I'm having trouble postprocessing data.
My goal is to recombine data which were swapped.
EDIT: here is a .rar archive containing my test example which works, and the one that I am trying to make work (don't be alarmed by how long it takes to process the data):
https://drive.google.com/file/d/1AEPUc8haT5_Z3LR3jnZZlpyfxhdDwwo6/view?usp=sharing
EDIT 2: Here is what I expect on paper (it's the TestReorder3OK folder in my .rar archive).
EDIT 3: MINIMAL COMPLETE EXAMPLE
Script:
#!/bin/bash
# Set the number of replicas
NP=3
NP1=$((NP-1))
rm torder*
for repl in $(seq 0 $NP1)
do
    echo $repl
    # paste column 2 of the .lammps file into rep_0, then on the next pass column 3 into rep_1, etc.
    awk -v rep=$repl '{r2=rep+2; print $r2}' < log.lammps > rep_$repl
    i=0
    j=0
    # inner loop over the values of the current column
    for a in $(cat rep_$repl)
    do
        i=$((i+1))
        j=$((j+3))
        head -$i screen.$repl.temp | tail -1 >> torder.$a
        head -$j ccccd2_H_${repl}_col.bak2 | tail -3 >> ccccd2_H_${a}_temp_col.bak2
    done
done
log.lammps file:
1 0 1 2
2 1 0 2
3 1 2 0
Starting at column 2, this file contains the numbers associated with the inputs below. Here is an expanded explanation:
Column 2 has three values: 0, 1 and 1; the 0 is associated with the first three lines of the file ccccd2_H_0_col.bak2, the next three lines are associated with the first 1, and the last three lines with the second 1.
Column 3 also has three values: 1, 0 and 2; the 1 is associated with the first three lines of the file ccccd2_H_1_col.bak2, the next three are associated with the 0, and the last three with the 2.
Same story for column 4.
Now what I want is for every set of three lines associated with the value 0 to go into one single file, every set of three lines associated with the value 1 into another single file, and the sets of three lines associated with the value 2 into a last file.
Inputs:
ccccd2_H_0_col.bak2
blank line
N a b c
C d e f
N g h i
C j k l
N m n o
C p q r
ccccd2_H_1_col.bak2
blank line
N s t u
C v w x
N y z a
C b c d
N e f g
C h i j
ccccd2_H_2_col.bak2
blank line
N k l m
C n o p
N q r s
C t u v
N w x y
C z a b
Outputs: these are the desired outputs, and also what I actually get for these simple test files.
ccccd2_H_0_temp_col
blank line
N a b c
C d e f
N y z a
C b c d
N w x y
C z a b
ccccd2_H_1_temp_col
blank line
N g h i
C j k l
N m n o
C p q r
N s t u
C v w x
ccccd2_H_2_temp_col
blank line
N e f g
C h i j
N k l m
C n o p
N q r s
C t u v
This works fine on small test files (as shown here), but not on my real system. There, the log.lammps file contains 14 columns and 10,001 lines, and my input files contain 121,121 lines (blocks of 121 lines). The script creates files 10 times larger than they should be, with more data than expected.
Can you enlighten me about my issue? I think it is linked to the difference in line count between the files containing a single column and the files containing Cartesian coordinates, but I really don't understand the link, nor how to solve it...
Thank you in advance...
I think I understand what you're trying to do now, and this GNU awk script (for ARGIND, ENDFILE and built-in open-file management) will do it:
$ cat ../tst.awk
ARGIND == 1 {                          # first file on the command line is log.lammps
    for (inFileNr=2; inFileNr<=NF; inFileNr++) {
        outFileNrs[inFileNr,NR] = $inFileNr   # map (column = data file, row = block) -> output file number
    }
    next
}
ENDFILE { RS = "" }                    # at end of each file, switch to paragraph mode so the data files are read block by block
{ print ORS $0 > ("ccccd2_H_" outFileNrs[ARGIND,FNR] "_temp_col") }
Look:
INPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
$ cat log.lammps
1 0 1 2
2 1 0 2
3 1 2 0
$ paste ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 | sed 's/\t/\t\t/g'
N a b c N s t u N k l m
C d e f C v w x C n o p
N g h i N y z a N q r s
C j k l C b c d C t u v
N m n o N e f g N w x y
C p q r C h i j C z a b
SCRIPT EXECUTION:
$ awk -f ../tst.awk log.lammps ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2
OUTPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col
$ paste ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col | sed 's/\t/\t\t/g'
N a b c N g h i N e f g
C d e f C j k l C h i j
N y z a N m n o N k l m
C b c d C p q r C n o p
N w x y N s t u N q r s
C z a b C v w x C t u v

gsub many columns simultaneously based on different gsub conditions?

I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If a cell in any row from row 2 onward has the same letter as the corresponding cell in row 1, it should be changed to 1. Basically, I'm trying to find out how similar each of the rows is to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ; # Iterate over the rows (in this case just row 1)
do
    for position in {1..6} ; # Iterate over the columns
    do
        # Define the letter in the first row with which I'm comparing the rest of the rows
        aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
        # If it matches, gsub it to 1
        awk -v var=$aa -v pos=$position '{gsub(var, "1", $pos)} 1' f > temp
        # Save this intermediate file and now act on this
        mv temp f
    done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job. It completes much faster because it avoids the nested loops in the shell and the repeated awk invocations inside them:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below, here is what you can do to also get the per-row sum of the matches:
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Desired output
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads the first record of the current file, do the things inside the braces.
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split the current record or row ($0) into pieces by fieldsep (default space, as
we have not supplied the 3rd argument) and store the pieces in array a.
So array a contains the data from the first row:
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of each record, for every record until the end of the file.
if (a[i]==$i) $i=1
If the first row's value at the current index (i) is equal to the
current row's value in that column, set the current column value to 1 (i.e. modify the field).
Now that the column values have been modified, just print the modified row.
}1
1 always evaluates to true, so it performs the default action {print $0}.
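As a quick illustration of that idiom: a bare 1 is a complete awk program (an always-true pattern with the default print action), so the following simply copies the file unchanged:
awk 1 f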
Regarding the update requested in a comment:
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3

How to separate lines depending on the value in column 1

I have a text file that contains the following (a, b, c, d, etc. stand for some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate the lines with some characters depending on the content of the first string, knowing that those numbers will always be increasing but may vary? The separation would occur when the first string increments, going from 1 to 2, then 2 to 6, etc.
The output would be like this (here I would like to use ---------- as the separator):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
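A possible invocation, assuming the input above lives in a file named data.txt (a hypothetical name; adjust to your file):
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }' data.txt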

Unix / Shell Add a range of columns to file

So I've been trying the same problem for the last few days, and I'm at a formatting roadblock.
I have a program that will only run if it is working on an equal number of columns. I know the total column count and the number of columns to add with a filler value of 0, but I am not sure how to do this. Is there some type of range option with awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to the data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to.
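If the target width isn't known in advance, a two-pass variant can discover it first. This is only a sketch, assuming the file is small enough to read twice:
awk 'NR==FNR { if (NF>n) n=NF; next }    # pass 1: find the widest row
     { for (i=NF+1; i<=n; i++) $i=0 } 1  # pass 2: pad shorter rows with zeros
' file file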
