Concatenate string in .csv after x commas using shell/bash

I have several .csv files containing data. The data vendor created the files with the years indicated only once in the first line (with empty fields in between), variable names in the second line, and data from the third line onwards.
"year 1", , , "year 2", , ,"year 2", , ,
"Var1", "Var2", "Var3", "Var1", "Var2", "Var3", "Var1", "Var2", "Var3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
I am new to shell programming, but it shouldn't be too complicated to write a script that outputs the following:
"Var1_year1", "Var2_year1", "Var3_year1", "Var1_year2", "Var2_year2", "Var3_year2", "Var1_year3", "Var2_year3", "Var3_year3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
Something like:
#!/bin/bash
FILES=/Users/pathTo.csvfiles/*.csv
for f in $FILES
do
echo "Processing $f file..."
# 1. Replace the second line with 'Varname_YearX' where YearX comes from the first line
cat ????
# 2. Delete first line
sed -i '' 1d $f
done
echo "Processing complete."
Update: The .csv files vary in their number of lines. Only the first two lines need to be edited; the following lines are data.

If you want to merge the first and the second line of each CSV, try this.
# No point in using a variable for the wildcard
for f in /Users/pathTo.csvfiles/*.csv
do
awk -F , -v OFS=, 'NR==1 { # Collect first line
    # Squash quotes
    gsub(/"/, "")
    for (i = 1; i <= NF; ++i)
        y[i] = ($i ~ /[^ ]/) ? $i : y[i-1]
    next # Do not fall through to print
}
NR==2 { # Combine collected with current
    gsub(/"/, "")
    for (i = 1; i <= NF; ++i)
        $i = y[i] "_" $i
}
# Print everything (except first)
1' "$f" > "$f.tmp"
mv "$f.tmp" "$f"
done
The NR==1 loop simply carries the previous y value forward into y[i] whenever the i:th field is empty (or contains only blanks), so every column ends up labelled with the most recent year seen.
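As a quick check of just that carry-forward step, you can feed the question's first header line straight into awk (a minimal sketch, not a full solution):
printf '"year 1", , , "year 2", , ,"year 2", , ,\n' |
awk -F , '{ gsub(/"/, "")
    for (i = 1; i <= NF; ++i) {
        y[i] = ($i ~ /[^ ]/) ? $i : y[i-1]
        printf "y[%d] = %s\n", i, y[i]
    } }'
Every blank column should report the most recent non-blank year to its left.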

Ugly code using csvtool, various standard tools, and bash:
i=file.csv
paste -d_ <(head -2 $i | tail -1 | csvtool transpose -) \
<(head -1 $i | csvtool transpose - |
sed '$d;s/ //;/^$/{g;b};h') |
csvtool transpose - | sed 's/[^,]*/"&"/g' | cat - <(tail +3 $i)
Output:
"Var1_year1","Var2_year1","Var3_year1","Var1_year2","Var2_year2","Var3_year2","Var1_year2","Var2_year2","Var3_year2"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"

Related

How to make variable $f define how many "Freq" values will be printed from column number 3?

I need help with my bash script. I have a problem with this code:
for v in $(seq 1 $f); do echo $(grep "Freq" freq.log | awk '{print $3}'); done
because this command prints the whole of column number 3 $f times, when it should print $f values of "Freq" from column number 3.
I don't know how to make variable $f define how many "Freq" values will be printed from column number 3. The file contains plenty of "Freq" occurrences, but I need just $f of them.
To show everything, here is the full content of the script:
#!/bin/bash
e=$(grep "atomic number" freq.log | tail -1 | awk '{print$2}')
echo "Liczba atomow znajdujacyh sie w podanej czasteczce wynosi: $e"
f=$(bc <<< "($e*3-6)/3")
echo "Liczba wartosci Freq, ktore wczyta skrypt to $f"
for v in $(seq 1 $f); do
echo "$(grep "Freq" freq.log | awk '{print$3}')"
done
Sample input data file; geometry optimization calculations in GAUSSIAN
A A A
Frequencies -- 182.1477 202.8948 227.7144
Red. masses -- 6.6528 8.2622 6.3837
Frc consts -- 0.1300 0.2004 0.1950
IR Inten -- 0.8602 0.4870 1.2090
NAtoms= 35 NActive= 35 NUniq= 35 SFac= 1.00D+00 NAtFMM= 60 NAOKFM=F Big=F
Here is your bash script converted to a single awk script:
awk script (script.awk):
/atomic number/ {      # for each line matching regEx pattern "atomic number"
    e = $2;            # store current 2nd field in variable e
}
/Freq/ {               # for each line matching regEx pattern "Freq"
    freqArr[fr++] = $3;   # add 3rd field to array freqArr, increment array counter fr
}
END {                  # after the whole input file has been scanned
    print "Liczba atomow znajdujacyh sie w podanej czasteczce wynosi: " e;
    f = ((e * 3) - 6) / 3;   # calculate variable f
    print "Liczba wartosci Freq, ktore wczyta skrypt to " f;
    for (i = 0; i < f && i < fr; i++)   # print only the first f collected Freq values
        print freqArr[i];
}
Run it with:
awk -f script.awk freq.log
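If you only need the values themselves, the original pipeline can also be limited to the first $f matches with head instead of a loop (a small sketch, assuming $f has already been computed as in the bash script above):
grep "Freq" freq.log | awk '{ print $3 }' | head -n "$f"
head stops after $f lines, so exactly $f Freq values are printed.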

Is there a Bash function that allows me to separate/delete/isolate lines from a file when they have the same first word

I have a text file like this:
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
id ; lorem ipsum fgdg df gdg
If two ids are the same, I want to separate the lines whose id appears more than once from the lines whose id is unique.
uniquefile contains the lines with a unique id.
notuniquefile contains the lines whose id is not unique.
I already found a way to almost do it, but only with the first word: basically it just isolates the id and deletes the rest of the line.
Command 1: isolating unique id (but missing the line):
awk -F ";" '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i }' originfile >> uniquefile
Command 2: isolating the not unique id (but missing the line and losing the "lorem ipsum" content that can be different depending on the line):
awk -F ":" '{!seen[$1]++;!ligne$0};END{for(i in seen) if(seen[i]>1)print i }' originfile >> notuniquefile
So in a perfect world I would like you to help me obtain this type of result:
originfile:
1 ; toto
2 ; toto
3 ; toto
3 ; titi
4 ; titi
uniquefile:
1 ; toto
2 ; toto
4 ; titi
notuniquefile:
3 ; toto
3 ; titi
Have a good day.
Yet another method, with just two Unix commands, that works if your id fields always have the same length (let's assume they are one character long, like in my test data, but of course it also works for longer fields):
# feed testfile.txt sorted to uniq
# -w means: only compare the first 1 character of each line
# -D means: output only the duplicate lines (all of them, not just one per group)
sort testfile.txt | uniq -w 1 -D > duplicates.txt
# then filter all duplicate lines out of the text file
# so that only the unique lines slip through
# -v means: negate the pattern
# -F means: use fixed strings instead of regex
# -f means: load the patterns from a file
grep -v -F -f duplicates.txt testfile.txt > unique.txt
And the output is (for the same input lines as used in my other post):
$ sort testfile.txt | uniq -w 1 -D
2;line B
2;line C
3;line D
3;line E
3;line F
and:
$ grep -v -F -f duplicates.txt testfile.txt
1;line A
4;line G
Btw. in case you want to avoid the grep, you can also store the output of the sort (let's say in sorted_file.txt) and replace the second command with
uniq -w 1 -u sorted_file.txt > unique.txt
where the number behind -w again is the length of your id field in characters.
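Putting that grep-free variant together, the whole thing could look like this (a sketch; file names follow the examples above, and the id is again assumed to be one character long):
sort testfile.txt > sorted_file.txt
uniq -w 1 -D sorted_file.txt > duplicates.txt   # every line whose id occurs more than once
uniq -w 1 -u sorted_file.txt > unique.txt       # only lines whose id occurs exactly once
rm sorted_file.txt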
untested: process the file twice: first to count the ids, second to decide where to print the record:
awk -F';' '
NR == FNR {count[$1]++; next}
count[$1] == 1 {print > "uniquefile"}
count[$1] > 1 {print > "nonuniquefile"}
' file file
Here is a small Python script which does this:
#!/usr/bin/env python3
import sys

unique_markers = []
unique_lines = []
nonunique_markers = set()
for line in sys.stdin:
    marker = line.split(' ')[0]
    if marker in nonunique_markers:
        # found a line which is not unique
        print(line, end='', file=sys.stderr)
    elif marker in unique_markers:
        # found a double
        index = unique_markers.index(marker)
        print(unique_lines[index], end='', file=sys.stderr)
        print(line, end='', file=sys.stderr)
        del unique_markers[index]
        del unique_lines[index]
        nonunique_markers.add(marker)
    else:
        # marker not known yet
        unique_markers.append(marker)
        unique_lines.append(line)
for line in unique_lines:
    print(line, end='', file=sys.stdout)
It is not a pure shell solution (which would be cumbersome and hard to maintain IMHO), but maybe it helps you.
Call it like this:
separate_uniq.py < original.txt > uniq.txt 2> nonuniq.txt
With a pure bash script, you could do it like this:
duplicate_file="duplicates.txt"
unique_file="unique.txt"
file="${unique_file}"
rm -f "$duplicate_file" "$unique_file"
last_id=""
sort testfile.txt | (
    while IFS=";" read -r id line ; do
        if [[ "${last_id}" != "" ]] ; then
            if [[ "${last_id}" != "${id}" ]] ; then
                echo "${last_id};${last_line}" >> "${file}"
                file="${unique_file}"
            else
                file="${duplicate_file}"
                echo "${last_id};${last_line}" >> "${file}"
            fi
        fi
        last_line="${line}"
        last_id="${id}"
    done
    echo "${last_id};${last_line}" >> "${file}"
)
With an inputfile as:
1;line A
2;line B
2;line C
3;line D
3;line E
3;line F
4;line G
It outputs:
$ cat duplicates.txt
2;line B
2;line C
3;line D
3;line E
3;line F
$ cat unique.txt
1;line A
4;line G

append data from one file to another file from the right

I have two files with below formats:
file1:
Sub_amount , date/time
12 , 2018040412
78 , 2018040413
26 , 2018040414
file2:
Unsub_amount , date/time
76 , 2018040412
98 , 2018040413
56 , 2018040414
What I need is to append file2 to file1 from the right. What I mean is:
Sub_amount, Unsub_amount , date/time
12 , 76 , 2018040412
78 , 98 , 2018040413
26 , 56 , 2018040414
In the end, what needs to be shown is:
date/time , Unsub_amount , Sub_amount
2018040412, 76 , 12
2018040413, 98 , 78
2018040414, 56 , 26
I would appreciate it if anyone can help :)
Thanks.
I would use awk for this:
awk -F'[[:blank:]]*,[[:blank:]]*' -v OFS="," '
# remove leading and trailing blanks from the line
{ gsub(/^[[:blank:]]+|[[:blank:]]+$/, "") }
# skip empty lines
/^$/ { next }
# store the Sub values from file1
NR == FNR { sub_amt[$2] = $1; next }
# print the data from file2, matching the cached value from file1
{ print $2, $1, sub_amt[$2] }
' file1 file2
date/time,Unsub_amount,Sub_amount
2018040412,76,12
2018040413,98,78
2018040414,56,26
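If the two files are guaranteed to list the same dates in the same order, a plain paste followed by a small awk reordering also works (a rough sketch, only checked against the sample data above):
paste -d, file1 file2 | awk -F' *, *' -v OFS=, '{ print $4, $3, $1 }'
Unlike the awk solution above, this does not match rows by date; it simply assumes both files are aligned line by line.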
Edit: Not as elegant or as robust as the awk solution, but an alternative. It trims the columns (using sed), extracts them (using cut), then joins them (using paste). Assumes that the rows match up. Happy coding!
#!/usr/bin/env sh
# copy this code
# pbpaste > csv-col-merge.sh # paste/create the file
# chmod u+x csv-col-merge.sh # make it executable
# ./csv-col-merge.sh --gen-example
# ./csv-col-merge.sh demo/subs.csv demo/unsubs.csv demo/combined.csv
#
# learn more:
# - "paste": https://askubuntu.com/questions/616166/how-can-i-merge-files-on-a-line-by-line-basis
# - "<<- EOF": https://stackoverflow.com/questions/2953081/how-can-i-write-a-heredoc-to-a-file-in-bash-script
# - "$_": https://unix.stackexchange.com/questions/271659/vs-last-argument-of-the-preceding-command-and-output-redirection
# - "cat -": https://stackoverflow.com/questions/14004756/read-stdin-in-function-in-bash-script
# - "cut -d ',' -f 1": split lines with ',' and take the first column, see "man cut"
# - "sed -E 's/find/replace/g'"; -E for extended regex, for ? (optional) support, see "man sed"
# -
# --gen-example
if [ "$1" = '--gen-example' ]; then
mkdir -p demo && cd $_
cat <<- EOF > subs.csv
sub_amount, date/time
12, 2018040412
78, 2018040413
26, 2018040414
EOF
cat <<- EOF > unsubs.csv
unsub_amount, date/time
76, 2018040412
98, 2018040413
56, 2018040414
EOF
exit
fi
# load
trim () { cat - | sed -E 's/ ?, ?/,/g'; }
subs="$(cat "$1" | trim)"
unsubs="$(cat "$2" | trim)"
combined=$3
# intermediate
getcol () { cut -d ',' -f $1; }
col_subs="$(echo "$subs" | getcol 1)"
col_unsubs="$(echo "$unsubs" | getcol 1)"
col_subs_date="$(echo "$subs" | getcol 2)"
col_unsubs_date="$(echo "$unsubs" | getcol 2)"
if [ ! "$col_subs_date" = "$col_unsubs_date" ]; then echo 'Make sure date col match up'; exit; fi
# process
mkdir tmp
echo "$col_subs_date" > tmp/a
echo "$col_unsubs" > tmp/b
echo "$col_subs" > tmp/c
paste -d ',' tmp/a tmp/b tmp/c > "$combined"
rm -rf tmp

bash to identify and verify file headers

Using the tab-delimited file below, I am trying to validate the header in line 1 and then store that field count in a variable $header to use in a couple of if statements. If $header equals 10, the file has the expected number of fields; if $header is less than 10, the file is missing a header field, and the missing field names are printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first line fields comparison on its own. If you maintain an array of expected field names, you can then easily split the first line into fields, compare against the expected number of fields, and output the identity of the missing field if you read less than the expected number of fields from any given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
    "Chr"
    "Start"
    "End"
    "Ref"
    "Alt"
    "Freq"
    "Qual"
    "Score"
    "Input" )
for i in "$@"; do ## for each file given as argument
    read -r line < "$i" ## read first line from file into 'line'
    oldIFS="$IFS" ## save current Internal Field Separator (IFS)
    IFS=$'\t' ## set IFS to word-split on '\t'
    fldarray=( $line ) ## fill 'fldarray' with fields in line
    IFS="$oldIFS" ## restore original IFS
    nfields=${#fldarray[@]} ## get number of fields in 'line'
    if (( nfields < header )) ## test against header
    then
        printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
        for j in "${fields[@]}" ## for each expected field
        do ## check against those in line, if not present print
            [[ $line =~ $j ]] || printf " %s" "$j"
        done
        printf "\n\n" ## tidy up with newlines
    fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot, bash on its own is more than capable of parsing text.
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {                 # first file: remember its header fields
    for (n = 1; n <= NF; n++)
        a[$n]
    nextfile
}
NF == (n - 1) {           # header has the expected number of fields
    print FILENAME " file has expected number of fields"
    nextfile
}
{                         # otherwise report which expected fields are missing
    for (i = 1; i <= NF; i++)
        b[$i]
    print FILENAME " is missing header for: "
    for (i in a)
        if (!(i in b))
            print i
    nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (collected in array a) that the following files should have; each later file's header fields are collected in b and compared against them.
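If you would rather hard-code the expected header names than pass file1 twice, a hedged variation of the same idea (still GNU awk, untested beyond the question's file.txt and file2.txt) could be:
awk '
BEGIN {
    n = split("Index Chr Start End Ref Alt Freq Qual Score Input", hdr)
    for (i = 1; i <= n; i++)
        a[hdr[i]]
}
FNR == 1 {
    if (NF == n) {
        print FILENAME " file has expected number of fields"
    } else {
        print FILENAME " is missing header for: "
        for (i = 1; i <= NF; i++)
            b[$i]
        for (i in a)
            if (!(i in b))
                print i
        delete b    # reset for the next file (gawk extension)
    }
    nextfile
}' file.txt file2.txt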
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
    if [[ $( head -1 "$f" | awk '{ print NF }' ) -eq 10 ]]; then
        echo "File $f has all the fields on its header"
    else
        echo "File $f is missing " $( echo "Index Chr Start End Ref Alt Freq Qual Score Input $( head -1 "$f" )" |
            tr -s ' \t' '\n' | sort | uniq -c | awk '$1 == 1 { print $2 }' )
    fi
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header

Bash: Merging 4 lines in 4 files into one single file

I'm looking for a way to merge 4 lines of dna probing results into one line.
The problem here is:
I don't want to append the lines, but to combine them position by position.
The 4 lines of dna probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seems to be a general kind of problem, so the language chosen to solve it doesn't matter.
For fun: one bash way:
lines=(
    A----A----------A----A-A--AAAA-
    -CC----CCCC-C-----CCC-C-------C
    ------G----G--G--G------G------
    ---TT--------T-T---------T-----
)
result=""
for ((i=0; i<${#lines}; i++)); do
    chr=- c=()
    for ((l=0; l<${#lines[@]}; l++)); do
        [ "${lines[l]:i:1}" != "-" ] &&
            chr="${lines[l]:i:1}" &&
            c+=($l)
    done
    [ ${#c[@]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
    [ ${#c[@]} -gt 1 ] && c="${c[*]}" && chr="*" &&
        printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
    result+=$chr
done
echo $result
With the provided input there is no conflict and every character is replaced, so the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: the question speaks of 4 different files, so the lines= assignment could be:
lines=($(cat file1 file2 file3 file4))
But with a wrong input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
Small perl filter
But if the input does not need to be verified, this little perl filter could do the job:
(Thanks @jm666 for the }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n process all input lines without automatic printing
-l strip the trailing newline from each input line (and add it back on output)
y+lhs+rhs+ replace (translate) chars from 'lhs' to 'rhs'
\0 is the *null* character, binary 0.
$, is a variable (normally the output field separator), used here as an accumulator
|= bitwise OR between itself and the current line ($_)
}{ closes the implicit -n loop, so the final block runs once all lines are processed
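To try it without creating the four files first, the sample lines can be fed in through process substitution, just like the cat call above (a quick sketch):
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(printf '%s\n' \
    'A----A----------A----A-A--AAAA-' \
    '-CC----CCCC-C-----CCC-C-------C' \
    '------G----G--G--G------G------' \
    '---TT--------T-T---------T-----')
which should print the same ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC line as above.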
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") reads the 1st line into the variable line1
the seq ${#line1} generates a sequence of numbers 1..char_count_in_the_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" runs the cut command once for each of the numbers above, e.g. cut -c22 filename, which extracts the given column from the file, so you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - joins the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally the tr -cd '[A-Z\n]' removes everything that isn't an uppercase character or a newline, so you get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
