converting space separated file into csv format in linux - shell

I have a file that contains data in this format:
2 Mar 1 1234 141.98.80.59
1 Mar 1 1234 171.239.249.233
5 Mar 1 admin 116.110.119.156
4 Mar 1 admin1 177.154.8.15
2 Mar 1 admin 141.98.80.63
2 Mar 1 Admin 141.98.80.63
I tried this command to convert it into CSV format, but it gives me output with an extra comma (,) at the front:
cat data.sql | tr -s '[:blank:]' ',' > data1.csv
,2,Mar,1,1234,141.98.80.59
,1,Mar,1,1234,171.239.249.233
,5,Mar,1,admin,116.110.119.156
,4,Mar,1,admin1,177.154.8.15
,2,Mar,1,admin,141.98.80.63
,2,Mar,1,Admin,141.98.80.63
In my file there are six spaces in front of every record.
How can I remove the extra comma from the front?

How to remove the extra comma from the front using awk:
$ awk -v OFS=, '{$1=$1}1' file
The assignment $1=$1 forces awk to rebuild the record with OFS (here a comma) between the fields, which also drops the leading blanks.
Output:
2,Mar,1,1234,141.98.80.59
1,Mar,1,1234,171.239.249.233
5,Mar,1,admin,116.110.119.156
...
Output with @EdMorton's version proposed in the comments:
2,Mar,1,1234,141.98.80.59
1,Mar,1,1234,171.239.249.233
5,Mar,1,admin,116.110.119.156
...

An improved version of your current method is:
cat data.sql | sed -E -e 's/^[[:blank:]]+//g' -e 's/[[:blank:]]+/,/g' > data1.csv
But do be aware that replacing blanks with commas isn't a real way of changing this format into a CSV. If any commas or spaces were present in the actual data, this approach would fail.
The fact that your example source file has the .sql extension suggests that perhaps you got this file by exporting a database, and have already stripped parts of the file away with other tr statements? If that is the case, a better approach would be to export to CSV (or another format) directly.
Edit: made the sed statement more portable, as recommended by Quasímodo in the comments.
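To see why that caveat matters, consider a hypothetical record whose user field itself contains a comma (not in your sample data); after the conversion, the embedded comma is indistinguishable from a separator:
printf '  5 Mar 1 admin,root 116.110.119.156\n' | tr -s '[:blank:]' ','
,5,Mar,1,admin,root,116.110.119.156
A CSV-aware tool would have to quote such a field instead of just swapping characters.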

Using Miller (mlr), where --n2c reads whitespace-separated (NIDX) input and writes CSV, -N keeps header lines out of the input and output, and remove-empty-columns drops the empty column produced by the leading blanks:
mlr --n2c -N remove-empty-columns ./input.txt >./output.txt
The output will be
2,Mar,1,1234,141.98.80.59
1,Mar,1,1234,171.239.249.233
5,Mar,1,admin,116.110.119.156
4,Mar,1,admin1,177.154.8.15
2,Mar,1,admin,141.98.80.63
2,Mar,1,Admin,141.98.80.63

Related

Concatenating sed commands

I have a .txt file and there are three sed commands that I am using to manipulate it. First I convert it to a .csv by replacing tabs with commas (A), then I remove lines 1-8 (B), and then I remove a '# ' that is at the beginning of line 9 (C).
(A) sed 's/\t/,/g' individuals/$message/$message.txt > individuals/$message/$message.csv
(B) sed -i 1,8d individuals/$message/$message.csv
(C) sed -i 's/.\{2\}//' individuals/$message/$message.csv
Is there a better way to do it, maybe integrating these three commands into a single one? It doesn't need to be done using sed but it does need to be done via bash commands.
Here is a sample of my data:
# This data file generated by PLINK at: Mon Jul 11 16:18:56 2022
#
# Below is a text version of your data. Fields are TAB-separated.
# Each line corresponds to a single SNP. For each SNP, we provide its
# identifier, its location on a reference human genome, and the genotype call.
# For further information (e.g. which reference build was used), consult the
# original source of your data.
#
# rsid chromosome position genotype
22:16050607G-A 22 16050607 GG
I deeply appreciate the help!
PS: Line 9 is the "# rsid chromosome..." one, and it should be kept in the file, just without the "# ".
Use multiple -e options to execute multiple sed commands in one call.
sed -e '1,8d' -e '9s/^# //' -e '9,$s/\t/,/g' "individuals/$message/$message.txt" > "individuals/$message/$message.csv"
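If you prefer a single expression, the same three steps can be written as one sed script with the commands separated by semicolons (a sketch equivalent to the -e form above; as there, the \t escape assumes GNU sed):
sed '1,8d; 9s/^# //; 9,$s/\t/,/g' "individuals/$message/$message.txt" > "individuals/$message/$message.csv"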

How to extract multiple fields with specific character lengths in Bash?

I have a file (test.csv) with a few fields, and what I want is the Title and Path, with 10 characters for the title and a few levels removed from the path. What I have done is use the awk command to pick the two fields:
$ awk -F "," '{print substr($4, 1, 10)","$6}' test.csv [1]
The three levels in the path that need to be removed are not always the same. It can be /article/17/1/ or /open-organization/17/1, so I can't use substr for field $6.
Here is the result I have:
Title,Path
Be the ope,/article/17/1/be-open-source-supply-chain
Developing,/open-organization/17/1/developing-open-leaders
Wanted result would be:
Title,Path
Be the ope,be-open-source-supply-chain
Developing,developing-open-leaders
The title is OK with 10 characters, but I still need to remove three levels from the path.
I could use the cut command:
cut -d'/' -f5- to remove the "/.../17/1/"
But I am not sure how this can be piped into [1].
I tried to use a for loop to get the title and the path one by one, but I have difficulty getting the awk command to run one line at a time.
I have spent hours on this with no luck. Any help would be appreciated.
Dummy Data for testing:
test.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count
31 Jan 2017,Article,Scott Nesbitt,Book review: Ours to Hack and to Own,0,/article/17/1/review-book-ours-to-hack-and-own,Books,660
31 Jan 2017,Article,Jason Baker,5 new guides for working with OpenStack,2,/article/17/1/openstack-tutorials,"OpenStack, How-tos and tutorials",419
You can replace the string by using a regex.
stringZ="Be the ope,/article/17/1/be-open-source-supply-chain"
sed -E "s/((\\/\\w+){3}\\/)//" <<< $stringZ
note that you need to use -i if you are going to give file as input to sed
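If you would rather do both steps in one awk pass over test.csv instead of piping to cut or sed, a rough sketch (assuming the part to strip is always exactly three /-separated levels, as in your samples) could be:
awk -F, '{ path = $6; sub(/^\/[^\/]*\/[^\/]*\/[^\/]*\//, "", path); print substr($4, 1, 10) "," path }' test.csv
Here substr() truncates the title to 10 characters and sub() removes the first three path components; the header row simply passes through as Title,Path.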

How to add a header to text file in bash?

I have a text file and want to convert it to a CSV file. Before converting it, I want to add a header to the text file so that the CSV file has the same header. I have one thousand columns in the text file and want one thousand column names. As a side note, the content of the text file is just rows of numbers separated by commas (","). Is there any way to add the header line in bash?
I tried the approach below and it didn't work. First I ran this in Python:
for i in range(1001):
    print "col" + "_" + str(i)
I saved the output of this to a text file (python header.py >> header.txt) and added it in front of the original text file like below:
cat header.txt filename.txt > newfilename.txt
Then I converted the .txt file to a .csv file with "mv newfilename.txt newfilename.csv".
But unfortunately this doesn't work, as the header line ends up with double the number of entries of the other rows for some reason. I would appreciate any help in solving this problem.
Based on the description your file is already comma-separated, so it is already a CSV file. You just want to add a column-number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1' file
will add as many column headers as there are fields in the first row of the file,
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..1000}; do
echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header.csv /tmp/data.csv > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.
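If the portability of the r trick is a concern, a plain command group avoids it altogether (a sketch reusing the seq form from an earlier answer rather than this answer's printf):
{ seq -f "col_%g" -s, 1 1000; cat filename.txt; } > newfilename.csv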

What is the Exact Use and Meaning of "IFS=!"

I was trying to understand the usage of IFS but there is something I couldn't find any information about.
My example code:
#!/bin/sh
# (C) 2016 Ergin Bilgin
IFS=!
for LINE in $(last -a | sed '$ d')
do
echo $LINE | awk '{print $1}'
done
unset IFS
I use this code to print last's users line by line. I totally understand the usage of IFS, and in this example when I use the default IFS, it reads word by word inside my loop, while when I use IFS=! it reads line by line as I wish. The problem is that I couldn't find any information about that "!" anywhere, and I don't remember where I learned it. When I google how to achieve the same kind of behaviour, I see other values, which are usually strings.
So, what is the meaning of that "!", and how does it give me the result I want?
Thanks
IFS=! merely sets IFS to a value that does not occur in the input, so that you can iterate over the input line by line. Having said that, using a for loop here is not recommended; it is better to use read in a while loop like this to print the first column, i.e. the username:
last | sed '$ d' | while read -r u _; do
echo "$u"
done
As you are aware, if the output of last had a !, the script would split the input lines on that character.
The output format of last is not standardized (not in POSIX for instance), but you are unlikely to find a system where the first column contains anything but the name of whatever initiated an action. For instance, I see this:
tom pts/8 Wed Apr 27 04:25 still logged in michener.jexium-island.net
tom pts/0 Wed Apr 27 04:15 still logged in michener.jexium-island.net
reboot system boot Wed Apr 27 04:02 - 04:35 (00:33) 3.2.0-4-amd64
tom pts/0 Tue Apr 26 16:23 - down (04:56) michener.jexium-island.net
continuing to
reboot system boot Fri Apr 1 15:54 - 19:03 (03:09) 3.2.0-4-amd64
tom pts/0 Fri Apr 1 04:34 - down (00:54) michener.jexium-island.net
wtmp begins Fri Apr 1 04:34:26 2016
with Linux, and different date-formats, origination, etc., on other machines.
By setting IFS=!, the script sets the field-separator to a value which is unlikely to occur in the output of last, so each line is read into LINE without splitting it. Normally, lines are split on spaces.
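A minimal illustration of the difference, using a made-up sample line rather than real last output:
#!/bin/sh
line='tom pts/8 Wed Apr 27 04:25'
for w in $line; do echo "word: $w"; done   # default IFS: the line is split on whitespace
IFS=!
for w in $line; do echo "line: $w"; done   # '!' never occurs, so the whole line stays in one piece
unset IFS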
However, as you see, the output of last normally uses spaces for separating columns, and it is fed into awk which splits the line anyway — with spaces. The script could be simplified in various ways, e.g.,:
#!/bin/sh
for LINE in $(last -a | sed -e '$ d' -e 's/ .*//')
do
echo $LINE
done
which is (starting from the example in the question) adequate if the number of logins is not large enough to exceed your command-line. While checking for variations in last output, I noticed one machine with about 9800 lines from several years. (The other usual motivations given for not using for-loops are implausible in this instance). As a pipe:
#!/bin/sh
last -a | sed -e 's/ .*//' -e '/^$/d' | while IFS= read LINE
do
echo $LINE
done
I changed the sed expression (which OP likely copied from some place such as Bash - remove the last line from a file) because it does not work.
Finally, using the -a option of last is unnecessary, since all of the additional information it provides is discarded.

sed & awk, second column modifications

I've got a file that I need to make some simple modifications to. Normally, I wouldn't have an issue, however the columns are nearly identical which throws me off.
Some examples:
net_192.168.0.64_26 192.168.0.64_26
net_192.168.0.128-26 192.168.0.128-26
etc
Now, normally in a stream I'd just modify the second column, however I need to write this to a file which confuses me.
The following does what I need it to do, but then I lose visibility of the first column and can't pipe it somewhere useful:
cat file.txt | awk '{print $2}' | sed 's/1_//g;s/2_//g;s/1-//g;s/2-//g;s/_/\ /g;s/-/\ /g' | egrep '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}'
Output needs to look like (subnet becomes the 3rd column):
net_192.168.0.64_26 192.168.0.64 26
net_192.168.0.128-26 192.168.0.128 26
How do I do what the above line does, while keeping both columns visible so I can pipe them to a new file, modify the old one, etc.?
Thanks!
Try this, if it is OK for you:
awk '{gsub(/[_-]/," ",$2)}1' file
test with your example text:
kent$ echo "net_192.168.0.64_26 192.168.0.64_26
net_192.168.0.128-26 192.168.0.128-26"|awk '{gsub(/[_-]/," ",$2)}1'
net_192.168.0.64_26 192.168.0.64 26
net_192.168.0.128-26 192.168.0.128 26
If you just want to replace the characters _ and - with a single space in the second field, then:
$ awk '{gsub(/[-_]/," ",$2)}1' file
net_192.168.0.64_26 192.168.0.64 26
net_192.168.0.128-26 192.168.0.128 26
And a sed version:
sed 's/\(.*\)[-_]/\1 /' file
Because .* is greedy, this replaces only the last _ or - on the line, i.e. the one inside the second column.
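To get the result into a file, as asked, just redirect the output of either command; or, if you have GNU awk 4.1 or newer, you can edit the file in place (a sketch, not part of the original answers):
awk '{gsub(/[-_]/," ",$2)}1' file.txt > newfile.txt
gawk -i inplace '{gsub(/[-_]/," ",$2)}1' file.txt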
