Creating a CSV file from a text file - shell

I have the following text file in this format:
2015-04-21
00:21:00
5637
5694
12
2015-04-21
00:23:00
5637
5694
12
I want to create a CSV file like the one below:
2015-04-21,00:21:00,5637,5694,12
2015-04-21,00:23:00,5637,5694,12
I used tr and sed like this:
cat file | tr '\n' ',' | sed 's/,$//'
It produces the following result:
2015-04-21,00:21:00,5637,5694,12,2015-04-21,00:23:00,5637,5694,12
but it doesn't have a newline after the fifth field.
Please suggest a solution.

Use awk like so:
awk 'ORS=NR%5 ? "," : "\n"'
$ cat test.txt
2015-04-21
00:21:00
5637
5694
12
2015-04-21
00:23:00
5637
5694
12
$ awk 'ORS=NR%5 ? "," : "\n"' test.txt
2015-04-21,00:21:00,5637,5694,12
2015-04-21,00:23:00,5637,5694,12
Explanation:
ORS stands for output record separator
NR is the number of records read so far (here, the current line number)
NR % 5 - % is the modulo operator. If it is zero (every 5th record), use a line feed; otherwise, use a comma
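If each record is always exactly five lines, paste can do the same grouping; a minimal sketch, assuming that fixed layout:
$ paste -d, - - - - - < test.txt
2015-04-21,00:21:00,5637,5694,12
2015-04-21,00:23:00,5637,5694,12
Each - makes paste read one line from standard input, so five dashes consume five lines per output row and -d, joins them with commas.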

A simple solution in Python:
fin = open('file', 'r')
fout = open('outputfile', 'w')
a = []   # fields collected so far for the current record
i = 0    # number of lines read for the current record
for line in fin:
    a.append(line.rstrip())
    i += 1
    if i == 5:                          # every 5 lines, write one CSV row
        fout.write(','.join(a) + '\n')
        a = []
        i = 0
fin.close()
fout.close()

Related

Convert txt to columnated file

I need to convert test.txt file to a columnated file.
I know how to convert it with awk if the number of lines after each keyword is the same, but the counts differ in this example.
awk 'NR % 5 {printf "%s ", $0; next}1' test.txt
That awk command works when the number of lines per keyword is the same, but it will not work with this input file.
Is there any way to convert this? Please advise.
test.txt
"abc"
4
21
22
25
"standard"
1
"test"
4
5
10
11
12
Expected Output:
"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12
$ awk '{printf "%s%s", (/^"/ ? ors : OFS), $0; ors=ORS} END{print ""}' file
"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12
A bit magic, but works in this case:
sed -z 's/\n"/\n\x01"/g' file |
tr '\n' ' ' |
tr $'\x01' '\n'
Each "header" starts is a string between " ... ". So:
Using sed I put some delimter (I chose 0x01 in hex) between a newline and a ", everywhere in the file. Note that -z is a gnu extension.
Then I substitute all newlines for a space.
Then I substitute all 0x01 bytes for newlines.
This method is a little tricky, but it is simple and works in cases where each header starts with a certain character at the beginning of the line.
One can manage with sed without the GNU extension by using, for example:
sed '2,$s/^"/\x01"/'
i.e. from line 2 onward, if the line starts with a ", add the 0x01 byte at the beginning of the line.
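Put together with the rest of the pipeline, that variant looks roughly like this (a sketch; the \x01 escape in the replacement and the $'\x01' quoting still assume GNU sed and a bash-like shell, as above):
sed '2,$s/^"/\x01"/' file |
tr '\n' ' ' |
tr $'\x01' '\n'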
With GNU awk:
$ awk -v RS='\n"' '{$1=$1; printf "%s", rt $0; rt=RT}' file
"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12
POSIX awk:
$ awk '/^"/{if (s) print s; s=$0; next} {s=s OFS $0} END{print s}' file
"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12
Or with perl:
$ perl -0777 -lnE 'for (/^"[^"]+"\R(?:[\s\S]+?)(?=^"|\z)/mg) {tr /\n/ /; say} ' file
If your fields do not have spaces in them, you can use a simple tr and sed pipe:
$ cat file | tr '\n' ' ' | sed -E 's/ ("[^"]*")/\
\1/g'
Or GNU sed:
$ cat file | tr '\n' ' ' | sed -E 's/ ("[^"]*")/\n\1/g'
While an awk or sed solution is advisable, since the question is also tagged bash, you can do all that is needed with a simple read loop and a flag variable to control the newline output for the first iteration. Essentially, you read each line and use the substring form of parameter expansion to test whether the first character is a non-digit. On the first iteration you simply output the string; on every later iteration you output the string preceded by a '\n'. If the line begins with a digit, you output it preceded by a space.
For example:
#!/bin/bash
declare -i n=0                           ## simple flag to omit '\n' on first string output
while read -r line; do                   ## read each line
    [[ ${line:0:1} =~ [^0-9] ]] && {     ## begins with non-digit
        ## 1st iteration, just output $line, rest output '\n$line'
        ((n == 0)) && printf "%s" "$line" || printf "\n%s" "$line"
    } || printf " %s" "$line"            ## begins with digit - output " $line"
    n=1                                  ## set flag
done < "$1"
echo ""                                  ## tidy up with newline
Example Use/Output
$ bash fmtlines test.txt
"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12
While awk and sed will generally be faster, here, with nothing more than a while read loop and a few conditionals and parameter expansions, the native bash solution is not bad by comparison.
Look things over and let me know if you have questions.

How to extract rows present only once by column via the command line

I have a space separated file as shown below:
D2ABMACXX:5:1101:10000:93632_1:N:0 c111 12462 6
D2ABMACXX:5:1101:10004:54586_1:N:0 c6753 3473 1
D2ABMACXX:5:1101:10004:54586_2:N:0 c7000 5726 1
D2ABMACXX:5:1101:10006:56411_1:N:0 c4282 877 42
D2ABMACXX:5:1101:10006:56411_2:N:0 c5703 240 6
D2ABMACXX:5:1101:10013:29259_2:N:0 c6008 384 11
I would need to extract rows that are present only once based on the text before "_" in column 1. The sample output should look like below:
##required output format###
D2ABMACXX:5:1101:10000:93632_1:N:0 c111 12462 6
D2ABMACXX:5:1101:10013:29259_2:N:0 c6008 384 11
I have a complicated way of doing this, but it loses the original information:
cat file.txt | awk '{print $2,$3,$4,$1}' | sed 's/_1//g; s/_2//g' | uniq -f 3 -u
Could anyone suggest an optimal way of doing this on a huge text file (~10 GB), producing output in the same format as the input, as shown in the required output format above?
You can try doing it all with awk, for example:
awk -F'_' '{ uniqs[$1] = $0; count[$1]++ } END { for (uniq in uniqs) if ( count[uniq] == 1 ) print uniqs[uniq] }' file.txt
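For a file of ~10 GB, a two-pass variant keeps only the keys and counts in memory and preserves the original line order; a sketch that reads the file twice:
# pass 1 (NR==FNR): count occurrences of the key before the first "_"
# pass 2: print only lines whose key occurred exactly once
awk -F'_' 'NR == FNR { count[$1]++; next } count[$1] == 1' file.txt file.txt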

Add line numbers for duplicate lines in a file

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how I can go through numbering the lines based on repeated entries.
All items in list are duplicated at least once. There are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the newline from each input line and adds one to the output
++$c{$_} increments the value assigned to the contents of the current line $_ in the hash table %c.
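Applied to the sample file above, this produces the count-only form of the output:
$ perl -lne 'print ++$c{$_}' file
1
2
1
2
3
1
2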
Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one-line CSV containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with "," gives an error:
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
Here n=3 and the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
  BEGIN {
    print header
    colCount = 1 + gsub(",", ",", header)
  }
  {
    ORS = NR % colCount == 0 ? "\n" : ","
    print
  }
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
  BEGIN {
    print header
    colCount = 1 + gsub(",", ",", header)
  }
  {
    ORS = NR % colCount == 0 ? "\n" : ","
    print
  }
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
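Spelled out, that workaround might look like this (a sketch, assuming a shell with process substitution, such as bash):
awk -v RS=, -v header='"id","lon","lat"' '
  BEGIN { print header; colCount = 1 + gsub(",", ",", header) }
  { ORS = NR % colCount == 0 ? "\n" : ","; print }
' <(tr -d '\n' < file.csv)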
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Removing characters from each line in a shell script

I have a text file like the one below:
IMPALA COUNT :941 MONGO COUNT : 980
IMPALA COUNT :78 MONGO COUNT : 78
IMPALA COUNT :252 MONGO COUNT : 258
IMPALA COUNT :3008 MONGO COUNT : 3064
I want to remove everything else and keep only the numbers, like below:
941 980
78 78
252 258
3008 3064
Can anybody suggest a shell script for this?
One way:
cut -d':' -f2,3 file.txt | cut -d' ' -f1,5
Another:
awk '{print substr($3, 2) " " $7}' file.txt
A sed solution extracting the two digits:
sed -r 's/[^0-9]*([0-9]+)[^0-9]*([0-9]+).*/\1 \2/g' file
Here's a few options:
grep -Eo '[0-9]+' file | paste -d " " - -
awk -F'[ :]+' '{print $4, $7}' file
awk -F: '{print $2+0, $3+0}' file
perl -lne '@matches = /(?<=:) *(\S+)/g; print join " ", @matches' file
sed -e 's/[^:]*: *\([0-9]*\) */\1 /g;s/ $//' file
That is: replace any sequence of non-colons [^:]*, followed by a colon and possibly spaces : *, followed by a sequence of digits and possibly spaces \([0-9]*\) *, with the digit sequence \1 plus one space; afterwards delete the final space on the line.
sed -r 's/[^0-9]+://g' file
It matches every run of characters that are not digits, [^0-9]+, followed by a :, and removes them.
Example
$ cat file
IMPALA COUNT :941 MONGO COUNT : 980
IMPALA COUNT :78 MONGO COUNT : 78
IMPALA COUNT :252 MONGO COUNT : 258
IMPALA COUNT :3008 MONGO COUNT : 3064
$ sed -r 's/[^0-9]+://g' file
941 980
78 78
252 258
3008 3064
