Bash - Transpose a single field, keeping the rest the same, and repeat it across lines

I have a file with pipe separated fields.
e.g.
1,2,3|xyz|abc
I need the output in below format:
1|xyz|abc
2|xyz|abc
3|xyz|abc
I have a working code in bash:
while read -r i
do
  f1=$(echo "$i" | cut -d'|' -f1)
  f2=$(echo "$i" | cut -d'|' -f2-)
  echo "$f1" | tr ',' '\n' | sed "s:$:|$f2:" >> output.txt
done < pipe_delimited_file.txt
Can anyone suggest a way to achieve this without using a loop?
The file contains a large number of records.

Uses a loop, but it's inside awk, so very fast:
awk -F\| 'BEGIN{OFS="|"}{n = split($1, a, ","); $1=""; for(i=1; i<=n; i++) {print a[i] $0}}' pipe_delimited_file.txt
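For example, on the sample line from the question (a quick sketch, with printf standing in for the real file):
printf '1,2,3|xyz|abc\n' | awk -F\| 'BEGIN{OFS="|"}{n = split($1, a, ","); $1=""; for(i=1; i<=n; i++) {print a[i] $0}}'
1|xyz|abc
2|xyz|abc
3|xyz|abc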

Perl may be a bit faster than awk:
perl -F'[|]' -ane 'for $n (split /,/, $F[0]) {$F[0] = $n; print join "|", @F}' file
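The same sample data through the Perl version (a quick sketch):
printf '1,2,3|xyz|abc\n' | perl -F'[|]' -ane 'for $n (split /,/, $F[0]) {$F[0] = $n; print join "|", @F}'
1|xyz|abc
2|xyz|abc
3|xyz|abc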
bash is very slow, but here's a quicker way to use it. This uses plain bash without calling any external programs:
( # in a subshell:
  IFS=,     # use comma as the field separator
  set -f    # turn off filename generation (globbing)
  while IFS='|' read -r f1 rest; do   # temporarily use pipe as the separator to
                                      # read the first field and the rest of the line
    for word in $f1; do               # iterate over the comma-separated words
      echo "$word|$rest"
    done
  done
) < file
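The work here is done by word splitting on the custom IFS, which only applies to the result of an unquoted expansion; a minimal sketch of just that mechanism:
( s='1,2,3'; IFS=,; set -f; for word in $s; do echo "$word"; done )
1
2
3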

Related

How to change words with the same words but with a number at the back (bash)

I have a file for example with the name file.csv and content
adult,REZ
man,BRB
women,SYO
animal,HIJ
and a command line that is neither a directory nor a file:
file.csv BRB1 REZ3 SYO2
What I want to do is match each uppercase word from that line against the second column of the file, and then take, from the first-column word in the matching row, the letter at the position given by the number at the end of the uppercase word.
and the output should then be
umo
I know that I can get over the line with
for i in "${#:2}"
do
words+=$(echo "$i ")
done
and then the output is
REZ3 BRB1 SYO2
Using awk:
Pass the string of values as an awk variable and then split them into an array a. For each record in file.csv, iterate this array and if the second field of current record matches the first three characters of the current array value, then strip the target character from the first field of the current record and append it to a variable. Print the value of the aggregated variable.
awk -v arr="BRB1 REZ3 SYO2" -F, 'BEGIN{split(arr,a," ")} {for (v in a) { if ($2 == substr(a[v],1,3)) {n=substr(a[v],length(a[v]),1); w=w substr($1,n,1) }}} END{print w}' file.csv
umo
You can also put this into a script:
#!/bin/bash
words="${2}"
src_file="${1}"
awk -v arr="$words" -F, '
BEGIN { split(arr, a, " ") }
{
  for (v in a) {
    if ($2 == substr(a[v], 1, 3)) {
      n = substr(a[v], length(a[v]), 1)
      w = w substr($1, n, 1)
    }
  }
}
END { print w }' "$src_file"
Script execution:
./script file.csv "BRB1 REZ3 SYO2"
umo
This is a way using sed.
Create a pattern string from command arguments and convert lines with sed.
#!/bin/bash
file="$1"
pat='s/^/ /;Te;'
for i in "${@:2}"; do
  pat+=$(echo "$i" | sed 's#^\([^0-9]*\)\([0-9]*\)$#s/.\\{\2\\}\\(.\\).*,\1$/\\1/;#')
done
pat+='Te;H;:e;${x;s/\n//g;p}'
eval "sed -n '$pat' $file"
Try this code:
#!/bin/bash
declare -A idx_dic
filename="$1"
pattern_string=""
for i in "${#:2}";
do
pattern_words=$(echo "$i" | grep -oE '[A-Z]+')
index=$(echo "$i" | grep -oE '[0-9]+')
pattern_string+=$(echo "$pattern_words|")
idx_dic["$pattern_words"]="$index"
done
pattern_string=${pattern_string%|*}
while IFS= read -r line
do
line_pattern=$(echo "$line" | grep -oE "$pattern_string")
[[ -n $line_pattern ]] && line_index="${idx_dic[$line_pattern]}" && echo "$line" | awk -v i="$line_index" '{split($0, chars, ""); printf("%s", chars[i]);}'
done < "$filename"
First, extract the uppercase word from each argument and record the index that follows it.
Then construct the whole pattern string, joining the words with |.
Finally, iterate over every line, match it against the pattern string, and pick the letter at the recorded index.
Execute this script.sh like:
bash script.sh file.csv BRB1 REZ3 SYO2

Removing newlines in a txt file

I have a txt file in a format like this:
test1
test2
test3
How can I bring it into a format like this using bash?
test1,test2,test3
Assuming that “using Bash” means “without any external processes”:
if IFS= read -r line; then
printf '%s' "$line"
while IFS= read -r line; do
printf ',%s' "$line"
done
echo
fi
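To run it against a file, redirect the file into a compound command around the same code (a sketch; input.txt is a placeholder name):
{
  if IFS= read -r line; then
    printf '%s' "$line"
    while IFS= read -r line; do
      printf ',%s' "$line"
    done
    echo
  fi
} < input.txt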
TL;DR:
cat "export.txt" | paste -sd ","
Another pure bash implementation that avoids explicit loops:
#!/usr/bin/env bash
file2csv() {
local -a lines
readarray -t lines <"$1"
local IFS=,
printf "%s\n" "${lines[*]}"
}
file2csv input.txt
You can use awk. If the file name is test.txt then
awk '{print $1}' ORS=',' test.txt | awk '{print substr($1, 1, length($1)-1)}'
The first awk command joins the three lines with commas (test1,test2,test3,).
The second awk command just deletes the last comma from the string.
Use tr (translate) to replace the newlines with commas, and sed to remove the trailing comma:
tr '\n' , < "$source_file" | sed 's/,$//'
If you want to save the output into a variable:
var="$( tr '\n' , < "$source_file" | sed 's/,$//' )"
Using sed:
$ sed ':a;N;$!ba;s/\n/,/g' file
Output:
test1,test2,test3
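With GNU sed, the -z option (NUL-separated records, so the whole file lands in one pattern space) allows a shorter spelling of the same idea; a GNU-only sketch:
sed -z 's/\n/,/g; s/,$/\n/' file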
If you don't want a terminating newline:
$ awk '{printf "%s%s", sep, $0; sep=","}' file
test1,test2,test3
or if you do:
awk '{printf "%s%s", sep, $0; sep=","} END{print ""}' file
test1,test2,test3
Another loopless pure Bash solution:
contents=$(< input.txt)
printf '%s\n' "${contents//$'\n'/,}"
contents=$(< input.txt) is equivalent to contents=$(cat input.txt). It puts the contents of the input.txt file (with trailing newlines automatically removed) into the variable contents.
"${contents//$'\n'/,}" replaces all occurrences of the newline character ($'\n') in contents with the comma character. See Parameter expansion [Bash Hackers Wiki].
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why printf '%s\n' is used instead of echo.

Convert text file into a comma delimited string

I can't seem to locate an SO question that matches this exact problem.
I have a text file that has one text token per line, without any commas, tabs, or quotes. I want to create a comma delimited string based on the file content.
Input:
one
two
three
Output:
one,two,three
I am using this command:
csv_string=$(tr '\n' ',' < file | sed 's/,$//')
Is there a more efficient way to do this?
The usual command to do this is paste
csv_string=$(paste -sd, file.txt)
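For example, feeding the sample input on stdin (a quick sketch):
printf 'one\ntwo\nthree\n' | paste -sd, -
one,two,three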
You can do it entirely with bash parameter expansion operators instead of using tr and sed.
csv_string=$(<file) # read file into variable
csv_string=${csv_string//$'\n'/,} # replace \n with ,
csv_string=${csv_string%,} # remove trailing comma
One way with Awk would be to reset the RS and treat the records as separated by blank lines. This would handle words with spaces and format them in CSV format as expected.
awk '{$1=$1}1' FS='\n' OFS=',' RS= file
The {$1=$1} is a way to reconstruct the fields in each line ($0) of the file based on modifications to the field (FS/OFS) and/or record separators (RS/ORS). The trailing 1 is to print every line with the modifications done inside {..}.
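To illustrate the point about embedded spaces, a small sketch (the two-word lines are made-up input):
printf 'New York\nLos Angeles\nSan Diego\n' | awk '{$1=$1}1' FS='\n' OFS=',' RS=
New York,Los Angeles,San Diego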
With Perl one-liner:
$ cat csv_2_text
one
two
three
$ perl -ne '{ chomp; push(@lines,$_) } END { $x=join(",",@lines); print "$x" }' csv_2_text
one,two,three
$ perl -ne ' { chomp; $_="$_," if not eof ;printf("%s",$_) } ' csv_2_text
one,two,three
$
From @codeforester
$ perl -ne 'BEGIN { my $delim = "" } { chomp; printf("%s%s", $delim, $_); $delim="," } END { printf("\n") }' csv_2_text
one,two,three
$
Tested the four approaches on a Linux box - Bash only, paste, awk, Perl, as well as the tr | sed approach shown in the question:
#!/bin/bash
# generate test data
seq 1 10000 > test.file
times=${1:-50}
printf '%s\n' "Testing paste solution"
time {
for ((i=0; i < times; i++)); do
csv_string=$(paste -sd, test.file)
done
}
printf -- '----\n%s\n' "Testing pure Bash solution"
time {
for ((i=0; i < times; i++)); do
csv_string=$(<test.file) # read file into variable
csv_string=${csv_string//$'\n'/,} # replace \n with ,
csv_string=${csv_string%,} # remove trailing comma
done
}
printf -- '----\n%s\n' "Testing Awk solution"
time {
for ((i=0; i < times; i++)); do
csv_string=$(awk '{$1=$1}1' FS='\n' OFS=',' RS= test.file)
done
}
printf -- '----\n%s\n' "Testing Perl solution"
time {
for ((i=0; i < times; i++)); do
csv_string=$(perl -ne '{ chomp; $_="$_," if not eof; printf("%s",$_) }' test.file)
done
}
printf -- '----\n%s\n' "Testing tr | sed solution"
time {
for ((i=0; i < times; i++)); do
csv_string=$(tr '\n' ',' < test.file | sed 's/,$//')
done
}
Surprisingly, the Bash-only solution performs quite poorly. paste comes out on top, followed by tr | sed, Awk, and Perl:
Testing paste solution
real 0m0.109s
user 0m0.052s
sys 0m0.075s
----
Testing pure Bash solution
real 1m57.777s
user 1m57.113s
sys 0m0.341s
----
Testing Awk solution
real 0m0.221s
user 0m0.152s
sys 0m0.077s
----
Testing Perl solution
real 0m0.424s
user 0m0.388s
sys 0m0.080s
----
Testing tr | sed solution
real 0m0.162s
user 0m0.092s
sys 0m0.141s
For some reason, csv_string=${csv_string//$'\n'/,} hangs on macOS Mojave running Bash 4.4.23.
Related posts:
How to join multiple lines of file names into one with custom delimiter?
Concise and portable “join” on the Unix command-line
Turning multi-line string into single comma-separated

How to process large csv files efficiently using shell script, to get better performance than that for following script?

I have a large csv file input_file with 5 columns. I want to do two things to the second column:
(1) Remove last character
(2) Append leading and trailing single quote
Following are the sample rows from input_file.dat
420374,2014-04-06T18:44:58.314Z,214537888,12462,1
420374,2014-04-06T18:44:58.325Z,214537850,10471,1
281626,2014-04-06T09:40:13.032Z,214535653,1883,1
Sample output would look like :
420374,'2014-04-06T18:44:58.314',214537888,12462,1
420374,'2014-04-06T18:44:58.325',214537850,10471,1
281626,'2014-04-06T09:40:13.032',214535653,1883,1
I have written the following code to do the same.
#!/bin/bash
inputfilename=input_file.dat
outputfilename=output_file.dat
count=1
while read line
do
echo $count
count=$((count + 1))
v1=$(echo $line | cut -d ',' -f1)
v2=$(echo $line | cut -d ',' -f2)
v3=$(echo $line | cut -d ',' -f3)
v4=$(echo $line | cut -d ',' -f4)
v5=$(echo $line | cut -d ',' -f5)
v2len=${#v2}
v2len=$((v2len -1))
newv2=${v2:0:$v2len}
newv2="'$newv2'"
row=$v1,$newv2,$v3,$v4,$v5
echo $row >> $outputfilename
done < $inputfilename
But it's taking a lot of time.
Is there any efficient way to achieve this?
You can do this with awk:
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat
How it works:
BEGIN{FS=OFS=","} : set input and output field separator (FS, OFS) to ,.
-v q="'" : assign a literal single quote to the variable q (to avoid complex escaping in the awk expression)
{$2=q substr($2,1,length($2)-1) q} : Replace the second field ($2) with a single quote (q) followed by the value of the 2nd field without the last character (substr(string, start, length)) and appending a literal single quote (q) at the end.
1 : Just invoke the default action, which is to print the current (edited) line.
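To keep the result, redirect it to the output file (file names as in the question):
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat > output_file.dat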

Using awk within a loop to replace a field

I have written a script that finds the hash value of each word in a dictionary and outputs it in the form "word:md5sum". I then have a file of names, and I would like to pair each name with every hash value, i.e.
tom:word1hash
tom:word2hash
.
.
bob:word1hash
and so on. Everything works fine, but I cannot figure out the substitution. Here is my script.
#!/bin/bash
#/etc/dictionaries-common/words
cat words.txt | while read line; do echo -n "$line:" >> dbHashFile.txt
echo "$line" | md5sum | sed 's/[ ]-//g' >> dbHashFile.txt; done
cat users.txt | while read name
do
cat dbHashFile.txt >> nameHash.txt;
awk '{$1="$name"}' nameHash.txt;
cat nameHash.txt >> dbHash.txt;
done
the line
awk '{$1="$name"}' nameHash.txt;
is where I attempt to do the substitution.
Thank you for your help.
Try replacing the entire contents of the last loop (both cats and the awk) with:
awk -v name="$name" -F ':' '{ print name ":" $2 }' dbHashFile.txt >>dbHash.txt
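Putting it together, the whole second half of the script collapses to one loop (a sketch using the file names from the question):
while IFS= read -r name; do
  awk -v name="$name" -F ':' '{ print name ":" $2 }' dbHashFile.txt
done < users.txt > dbHash.txt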
