Converting file with single field into multiple comma separated fields - shell

I have a .dat file in which there is no delimiter between fields.
Eg: 2014HELLO2500
I have to convert the file into a comma separated file with commas at specific positions
i.e 2014,HELLO,2500
I could convert the file using a for loop, but can it be done with a single command?
I tried the --output-delimiter option of the cut command, but it does not work.
I am using AIX OS.
Thanks

Assuming your field widths are known, you can use GNU awk (gawk), whose FIELDWIDTHS variable splits each record at fixed widths:
awk -v FIELDWIDTHS="4 5 4 ..." -v OFS=, '{print $1,$2,$3,$4,$5...}' file

Using awk
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ awk -v s='4 5' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' file
2014,HELLO,2500
As an example of the exact same code but applied with many fields, consider this test file:
$ cat alphabet
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Let's divide it up:
$ awk -v s='1 2 3 2 1 2 3 2 1 2 3 2' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' alphabet
A,BC,DEF,GH,I,JK,LMN,OP,Q,RS,TUV,WX,YZ
How it works:
-v s='1 2 3 2 1 2 3 2 1 2 3 2'
This creates a variable s which defines the lengths of all but the last field. (There is no need to specify a length of the last field.)
BEGIN{n=split(s,a)}
This converts the string variable s to an array with each number as an element of the array.
pos=1
At the beginning of each line, we initialize the position variable, pos, to the value 1.
for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}
For each element in array a, we print the required number of characters starting at position pos followed by a comma. After each print, we increment position pos so that the next print will start with the next character.
print substr($0,pos)
We print the last field on the line using however many characters are left from position pos onward.
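If you need this in more than one place, the same loop can be wrapped in a small shell function. This is only a sketch: the name split_fixed is made up for illustration, and it assumes the widths are passed as one space-separated string, exactly like the s variable above. It uses nothing beyond POSIX awk, so it should not depend on gawk.

```shell
# split_fixed WIDTHS  (hypothetical helper, not a standard command)
# WIDTHS: space-separated widths of every field except the last one.
# Reads lines on stdin and writes comma-separated fields to stdout.
split_fixed() {
    awk -v s="$1" '
        BEGIN { n = split(s, a, " ") }      # widths -> array a[1..n]
        {
            pos = 1
            out = ""
            for (i = 1; i <= n; i++) {
                out = out substr($0, pos, a[i]) ","   # next fixed-width field
                pos += a[i]
            }
            print out substr($0, pos)       # whatever is left is the last field
        }'
}

printf '2014HELLO2500\n' | split_fixed '4 5'
# prints: 2014,HELLO,2500
```

Building the line in out and printing it once keeps the function body the same as the one-liner, just easier to read.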
Using sed
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ sed -E 's/(.{4})(.{5})/\1,\2,/' file
2014,HELLO,2500
This approach can be used for up to nine fields at a time, because sed only provides the back-references \1 through \9. To get 15 fields, two passes would be needed.

Assuming you always want a delimiter between letters and digits, you can use this:
$ sed -r -e 's/([A-Za-z])([0-9])/\1,\2/g' -e 's/([0-9])([A-Za-z])/\1,\2/g' <<< "2014HELLO2500"
2014,HELLO,2500
$

When numbers and strings alternate, you can use
echo "2014HELLO2500other_string121312Other_word10" |
sed 's/\([A-Za-z]\)\([0-9]\)/\1,\2/g; s/\([0-9]\)\([A-Za-z]\)/\1,\2/g'
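For reference, here is the same pair of substitutions wrapped in a function, with the output I would expect for the sample string. The function name sep_alnum is made up for this sketch; it assumes plain ASCII letters and digits, and anything else (such as the underscore) is left untouched.

```shell
# sep_alnum (hypothetical name): insert a comma wherever a letter meets a digit.
sep_alnum() {
    # 1st expression: letter followed by digit; 2nd: digit followed by letter.
    sed 's/\([A-Za-z]\)\([0-9]\)/\1,\2/g; s/\([0-9]\)\([A-Za-z]\)/\1,\2/g'
}

echo "2014HELLO2500other_string121312Other_word10" | sep_alnum
# prints: 2014,HELLO,2500,other_string,121312,Other_word,10
```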

echo TEP_CHECK.20180627023645.txt | cut -d'.' -f2 | awk 'BEGIN{OFS="_"} {print substr($1,1,4),substr($1,5,2),substr($1,7,2),substr($1,9,2),substr($1,11,2),substr($1,13,2)}'
Output:
2018_06_27_02_36_45

Related

Searching for a string between two characters

I need to find two numbers from lines which look like this
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for the colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second one and print the result on the screen for each line
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For this example i showed above my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk. Separate the row into multiple columns using the : and - separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899
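To see how the lines without a colon are skipped, here is a small sketch with mixed input. The function name span_lengths and the extra sample lines are made up for illustration; it assumes the part before the colon contains neither ":" nor "-".

```shell
# span_lengths (hypothetical name): for each line containing ":",
# print END - START from a NAME:START-END header.
span_lengths() {
    # -F'[:-]' splits on either ":" or "-", so $2 is START and $3 is END.
    awk -F'[:-]' '/:/ { print $3 - $2 }' "$@"
}

printf '>Chr14:453901-458800\nACGTACGT\n>Chr2:100-250\n' | span_lengths
# prints:
# 4899
# 150
```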

How to count number of values in a row and store total count to array

I have a scenario where i want to get count of all values row by row and store it to dynamic array
Data in file :
"A","B","C","B"
"P","W","R","S"
"E","U","C","S"
"Y","F","C"
first row as : 4 -> values
second row as : 4 -> values
third row as : 4 -> values
fourth row as : 3 -> values
Expected Output :
store to array : array_list=(4,4,4,3)
written a script but not working
array_list=()
while read -r line
do
var_comma_count=`echo "$line" | tr -cd , | wc -c`
array_list=+($( var_comma_count))
done < demo.txt
when i print array it should give me all values : echo "${array_list[@]}"
Note :
The file might contain empty lines at last which should not be read
when i count file it gave me count : 5 , it should have ignored last line which is empty
where as when i use awk it give me proper count : awk '{print NF}' demo.txt -> 4
I know processing file using while loop is not a best practise , but any better solution will be appreciated
Perhaps this might be easier using awk, set the FS to a comma and check if the number of fields is larger than 0:
#!/bin/bash
array_list=($(awk -v FS=, 'NF>0 {print NF}' demo.txt))
echo "${array_list[@]}"
Output
4 4 4 3
The awk command explained:
awk -v FS=, ' # Start awk, set the Field Separator (FS) to a comma
NF>0 {print NF} # If the Number of Fields (NF) is greater than 0, print the NF
' demo.txt # Close awk and set demo.txt as the input file
Another option could be first matching the format of the whole line. If it matches, there is at least a single occurrence.
Then split the line on a comma.
array_list=($(awk '/^"[A-Z]"(,"[A-Z]")*$/{print(split($0,a,","));}' demo.txt))
echo "${array_list[@]}"
Output
4 4 4 3
The awk command explained:
awk '/^"[A-Z]"(,"[A-Z]")*$/{ # Regex pattern for the whole line, match a single char A-Z between " and optionally repeat preceded by a comma
print(split($0,a,",")); # Split the whole line `$0` on a comma and print the number of parts
}
' demo.txt
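If you would rather keep the while loop from the question, a corrected version might look like the sketch below. It assumes bash (arrays and process substitution), that fields never contain embedded commas, and that only completely empty lines should be skipped. The function name count_fields is made up, and the sample data is fed through printf instead of demo.txt so the snippet is self-contained.

```shell
#!/bin/bash
# count_fields (hypothetical name): print one field count per non-empty line.
count_fields() {
    local line commas
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue            # ignore empty lines
        commas=${line//[^,]/}                 # strip everything except commas
        printf '%s\n' $(( ${#commas} + 1 ))   # fields = commas + 1
    done
}

array_list=()
while IFS= read -r n; do
    array_list+=("$n")                        # note +=( ), not =+( )
done < <(printf '"A","B","C","B"\n"P","W","R","S"\n"E","U","C","S"\n"Y","F","C"\n\n' | count_fields)

echo "${array_list[@]}"
# prints: 4 4 4 3
```

The process substitution keeps the array assignment in the current shell; piping into the while loop would fill the array in a subshell and lose it.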

Delete lines from a text file except the first and every nth

I have a long text file comprised of numbers, such as:
1
2
9.252
9.252
9.272
1
1
6.11
6.11
6.129
I would like to keep the first line, delete the subsequent three and then keep the next one. I would like to do this process for the whole file. Following that logic, considered the input above, I would like to have the following output:
1
9.272
1
6.129
Using GNU sed (needed for the ~ extension):
sed -n '1~5p;5~5p' file
Saving your numbers in a "textfile.txt" I can use the following with sed:
sed -n 'p;n;n;n;n;p;' textfile.txt
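With -n, sed prints only on p: it prints the first line, uses n four times to read the next four lines without printing them, and then prints the last of them (the 5th line).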
Or the following using while read in bash:
while read -r firstline && read -r nextone1 && read -r nextone2 && read -r nextone3 && read -r lastone; do
printf "%s\n" "$firstline" "$lastone";
done < textfile.txt
This just reads 5 lines at a time and prints only the first and 5th lines.
You can simply say:
awk 'NR%5<2' input.txt
Explanation: Considering the entire pattern repeats every five lines, let's start with applying modulo operation to the line number NR by five. Then we'll see the 1st line of the five-line block yields "1" and the 5th line of the block yields "0". Now they can be separated from other lines by comparing it to two.
To print the 1st and 5th line of every block of 5 lines (remember that 5%5 = 0):
$ awk '(NR%5) ~ /[10]/' file
1
9.272
1
6.129
If you want to print the 2nd, 3rd, and 4th line of every block of 5 lines instead of the 1st and 5th:
$ awk '(NR%5) ~ /[234]/' file
2
9.252
9.252
1
6.11
6.11
If you wanted to print the 27th and 53rd line of every block of 100:
awk '(NR%100) ~ /^(27|53)$/' file
We couldn't use a bracket expression there as we're now beyond single char numbers.
This might work for you (GNU sed):
sed '2~5,+2d' file
Starting at line 2 and at every 5th line after it, delete that line and the two lines that follow.
An alternative:
sed -n '1p;5~5,+1p' file
Considering your groups are packed as 5 lines, you could use awk with a mod 5 operation.
awk '{i=(NR-1)%5;if(i==0||i==4)print $0}' input.txt
With indentation it looks like this:
{
i=(NR-1)%5;
if (i==0||i==4)
print $0;
}
i=(NR-1)%5 takes the line number modulo 5, but since line numbers start at 1 (instead of 0), you need to subtract 1 from it before computing the modulo.
This leaves you with an integer i that ranges from 0 to 4. You want to print the first line (index 0), skip the next three lines (indexes 1-3) and print the last line (index 4), which is exactly what if (i==0||i==4) print $0 does.
Alternately you can do the same thing with a shorter (and probably slightly more optimized version):
awk '((NR-1)%5==0||(NR-1)%5==4)' input.txt
This tells awk to do something for every 1st out of 5 lines and every 5th out of 5 lines. Since the "something" is not defined, by default it outputs the current line. If it helps, this is strictly equivalent to:
awk '((NR-1)%5==0||(NR-1)%5==4){print $0}' input.txt
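If the block size might change, the modulo test can take it as a parameter. This is a sketch only: keep_first_last is a made-up name, and it assumes the file length is a multiple of the block size (a short final block would still print its first line, but never a "last" one).

```shell
# keep_first_last N (hypothetical name): from every block of N lines,
# keep the first and the last line.
keep_first_last() {
    # NR % n == 1 is the first line of a block, NR % n == 0 the last.
    awk -v n="$1" 'NR % n == 1 || NR % n == 0'
}

printf '%s\n' 1 2 9.252 9.252 9.272 1 1 6.11 6.11 6.129 | keep_first_last 5
# prints:
# 1
# 9.272
# 1
# 6.129
```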

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
or here's a solution using awk
echo "xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
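awk uses -F (or the FS variable) for the input field separator. Because the script modifies $3, awk rebuilds the line, so you must also set OFS, the output field separator, to keep the commas.
sub(/srchtarget/, "replacementstring", stringToFix) uses a regular expression to look for four 0s at the front (^) of the third field ($3) and replaces them with nothing.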
The 1 is a shorthand for the print statement. A longhand version of the script would be
echo "xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
It's all related to awk's /pattern/{action} idiom.
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data |sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/'
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin:
    print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awks substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just takes 6 characters starting at position 5 of field 3 and assigns them back to field 3.
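All of the answers above hard-code "drop four characters", which only gives 6 characters when the field is exactly 10 long. If the real goal is "keep the last 6 characters of field 3, whatever its length" (an assumption on my part, the question does not say so), a sketch could look like this. The function name last6_field3 is made up.

```shell
# last6_field3 (hypothetical name): trim field 3 to its last 6 characters.
last6_field3() {
    awk -F, -v OFS=, '{
        if (length($3) > 6)
            $3 = substr($3, length($3) - 5)   # start 6 characters from the end
    } 1'
}

printf 'xxx,aaa,00000000cc\nrrr,ttt,0000000yhh\n' | last6_field3
# prints:
# xxx,aaa,0000cc
# rrr,ttt,000yhh
```

Note that this trims whatever is at the front of the field, zeros or not; fields of 6 characters or fewer pass through unchanged.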

Split a big txt file to do grep - unix

I work (Unix, shell scripts) with txt files that contain millions of fields separated by pipes, with no \n or \r between records.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to search a field (e.g. field2), a command like grep is no help, because everything is on the same line.
I think a good solution would be a script that inserts a "\n" after every 6 fields, and then runs grep on the result. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily set the number of fields per line by changing the 6 in both loops.
Hope this helps !
you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated zero or more times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
You can use sed to insert a newline after every 6th |.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
Consider this shorter example, which uses \{2\} instead of \{6\} to break after every 2nd |:
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of one or more non-| characters.
[^|]\+| is a sequence of one or more non-| characters followed by a |.
\([^|]\+|\) is a sequence of one or more non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf $0 (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust the 7 in NR%7 to the number of fields per line that suits you (the sample above has seven per line, counting the [...] placeholder).
What about printing the lines on blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} set input and output field separator as |.
{for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of six and prints each block. Since print ends its output with a newline, every block lands on its own line.
If you want to treat every field as its own line, translate each | to \n. For example, to get the 2nd field, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex
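To tie the pieces together, here is a sketch that first rebuilds one record per line and then searches only the second field. The function name split_records and the sample data are made up; it assumes every record has the same number of fields (6 by default) and silently drops an incomplete trailing group. For a single line with millions of fields your awk may hit memory or record-length limits, so treat this as a starting point rather than a tested solution.

```shell
# split_records [N] (hypothetical name): turn one long |-separated line
# into records of N fields each (default 6), one record per output line.
split_records() {
    awk -F'|' -v n="${1:-6}" '{
        for (i = 1; i + n - 1 <= NF; i += n) {   # only complete groups of n
            rec = $i
            for (j = 1; j < n; j++)
                rec = rec "|" $(i + j)
            print rec
        }
    }'
}

# Made-up sample: name|surname|mobile|email|office|skype, twice.
printf 'ann|lee|1|a@x|2|al|bob|ray|3|b@x|4|br|\n' |
    split_records 6 |
    awk -F'|' '$2 ~ /ray/'      # match the regex against field 2 only
# prints: bob|ray|3|b@x|4|br
```

Filtering with awk -F'|' '$2 ~ /.../' instead of plain grep avoids false hits when the same text appears in another column.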

Resources