Shell command for inserting a newline every nth element of a huge line of comma separated strings - bash

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are missing then:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
trying to replace the " " with ","
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution

I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10

cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here and input filename is called data.txt

Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.

Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count:1 columns so 1 and i wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
this seems very repetitive and not very efficient, any ideas how to simplify this ?
You can use awk once, set the field separator to |. Then loop all the fields and split on :
If the field starts with count then print the second part of the splitted value.
This way the count: part can occur anywhere in the string and can possibly print this multiple times.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[#]}"
do
echo "$i"
done
Output
1
5
If you want to combine the both split values, you can use [|:] as a character class and print field number 8 for a precise match as mentioned in the comments.
Note that it does not check if it starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
With gnu awk you can use a capture group to get a bit more precise match where on the left and right can be either the start/end of string or a pipe char. The 2nd group matches 1 or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated with ; symbol.
First command 's/.*count://' remove all characters till 'count:' including it.
Second command 's/|.*//' remove all characters starting from '|' including it.
Command order is important here.

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have to input.txt file which needs to be formatted by shell script with following condition
remove first two lines and
last two lines
remove all spaces in each
lines(each line have two spaces at
beginning and one space at end)
Each line should be within single
quotes(' ')
At last replace newline($) with
commas.
(original)
input.txt
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using following commands
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But i get blank output.txt
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs lines after the 3rd line, inclusive.
head -n -2 outputs lines excluding the last 2 lines.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespaces and prepends
a single quote.
Similarly the sed command "s/[[:blank:]]*$/'/" removes trailing
whitespaces and appends a single quote.
The syntax <(command ..) is a process substitution and the
output of the commands within the parentheses is fed to the mapfile
via the redirect.
mapfile -t ary reads lines from the standard input into the array
variable named ary.
echo "${ary[*]}" expands to a single string with the contents of
the array ary separated by the value of IFS, which is just assigned
to a comma.
The assignment of IFS and the array expansion are enclosed with
parentheses to be executed in the subshell. This prevents the IFS
to be modified in the current process.
With your shown samples, please try following awk program. Written and tested in GNU awk, should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if its 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 3))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations, to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side if the ---- pattern is at a different line number, it still works. Down side is it breaks, if that pattern (a line containing only dashes) occurs an odd number of times (ie. not in pairs, that wrap the lines you want).
Then sub line start and end (including white space) with single quotes.
Finally pipe to paste to sub the new lines with commas, excluding a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate to a single, comma delimited strings.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^ /'/p" input.txt | paste -sd ',' > output.txt
-n suppress printing lines unless told to
s/\s*$/'/ replace trailing whitespace with single quotes
s/^ /'/p replace two leading spaces and print lines that match
paste to concat
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one liner but works
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$//' ---> Loads entire file and merge all lines and remove last comma

Unix row to column format with string prefix and post fix

I have the requirement to convert row string data to column format and pre/postfix specific strings. The data string in file has 4 major fixed columns (separated by ";") and each column is further divided in two sections (separated by ":").
E.g.
Source data file:
A100:T100;B100:T200;A200:T300;B200:T400
Output from file should be:
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
Meanwhile I am trying with following code:
String="A100:T100;B100:T200;A200:T300;B200:T400"
> File.txt
for deploy in $(echo $String | tr ";" "\n")
do
echo $deploy >> File.txt
done
cat File.txt | awk 'BEGIN { FS=":"; OFS=":" } NR==1{ print "TABa:BatchID="$1,$2 } NR==2{ print "TABb:BatchID="$1,$2 }'
printf handles this:
$ awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';' File.txt
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
How it works
-F:
This sets the field separator to a colon: :.
sub(/\n/,"")
This removes newline characters.
printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR
This does all the work. It makes use of the record number, NR, and the first and second fields and prints the output that you want.
RS=';'
This tells awk to use a semicolon, ;, as the record separator.

sed - removing last comma from listed value after doing a replace

I'm using sed to replace my file of new lines \n with ',' which works fine however, in my last item, I don't want the ,.
How can I remove this?
Example:
sed 's/\n/,/g' myfile.out > myfile.csv
Output:
1,2,3,4,5,6,
Well you can use labels:
$ cat file
1
2
3
4
5
6
$ sed ':a;N;s/\n/,/;ba' file
1,2,3,4,5,6
You can also use paste command:
$ paste -sd, file
1,2,3,4,5,6
Consider jaypal singh's paste solution, which is the most efficient and elegant.
An awk alternative, which doesn't require reading the entire file into memory first:
awk '{ printf "%s%s", sep, $0; sep = "," }' myfile.out > myfile.csv
If the output should have a trailing newline (thanks, Ed Morton):
awk '{ printf "%s%s", sep, $0; sep = "," } END { printf "\n" }' myfile.out > myfile.csv
For the first input line, sep, due to being an uninitialized variable, defaults to the empty string, effectively printing just $0, the input line.
Setting sep to "," after the first print ensures that all remaining lines have a , prepended.
END { printf "\n" } prints a trailing newline after all input lines have been processed. (print "" would work too, given that print appends the output record separator (ORS), which defaults to a newline).
The net effect is that , is only placed between input lines, so the output won't have a trailing comma.
You could add a second s command after the first: sed -z 's/\n/,/g ; s/,$//. This removes a comma at the end. (The option -z is from gnu sed and I needed it to get the first s command working.)

Using seq to cut every other column from a large file

I wanted to print only every other column in a file (with over 1 million columns) so first I tried using seq on a subset of data:
cut -d ' ' -f 1,2,3,4,5,6,7,8,9,10,11,12 filename > filename.test
cut -d ' ' -f$(seq -s, 1 2 12) filename.test > filename.testout
That works--I end up with half as many columns as the first file. However when I try the same approach on the full dataset like so:
cut -d ' ' -f$(seq -s, 1 2 1211418) file > fileout
I get this error:
-bash: /bin/cut: Argument list too long
So what do I do if I need only every other column from a file with this many columns? Thanks!
I referenced using cut command to remove multiple columns but I'm stuck on how to make it work for my huge file...
Use a different tool.
perl -F'/\x20/' -ane 'print $F[0]; for ( $i=2 ; $i<=$#F ; $i+=2 ) { print " $F[$i]" } print "\n"'
-F tells Perl to split on a space (\x20). -a tells Perl to split into the #F array. -n tells Perl to process the input line by line without printing each line. -e introduces the expression. $#F is the index of the last element in the #F array.
So what do I do if I need only every other column from a file with
this many columns?
Instead of attempting to use a tool that requires you to pass the fields explicitly on the command line that cause the argument list too long error, use something that doesn't require you to do that. Using awk:
awk '{for(i=1;i<=NF;i=i+2){printf "%s ", $i}{printf "%s", RS}}' filename
Or, using perl:
perl -lane 'print join" ",#F[map {$_*2} 0..int($#F/2)]' inputfile
Using awk
awk '{for(i=1;i<=NF;i+=2){if(i<NF-1){printf "%s ", $i} else{printf "%s", $i}}{print "";}}' input.txt
You might try this awk solution:
awk 'BEGIN {FS=" " ; OFS=FS}
{ for (i=1;i<=NF;i+=2) {printf("%s%s",$i,OFS)}
printf("\n","")
} INPUTFILE
You can also use sed to extract every senond column:
sed -r 's/ [^ ]*( |$)/\1/g' file > fileout
here assuming your delimiter is space.

Resources