split a file into segments? - bash

I have a file containing text data separated by semicolons (";"). I want to split the data wherever a ; occurs and write the pieces to an output file. Is there a way to do this with a bash script?

You most likely want awk with the FS (field separator variable) set to ';'.
Awk is the tool of choice for column-based data (some prefer Perl, but not me).
echo '1;2;3;4;5
6;7;8;9;10' | awk -F\; '{print $3" "$5}'
3 5
8 10
If you just want to turn semicolons into newlines:
echo '1;2;3;4;5
6;7;8;9;10' | sed 's/;/\n/g'
outputs the numbers 1 through 10 on separate lines.
Obviously those commands are just using my test data. If you want to use them on your own file, use something like:
sed 's/;/\n/g' <input_file >output_file
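One caveat: treating \n in the replacement as a newline is a GNU sed feature, and some other sed implementations (BSD/macOS, for example) may not do the same. A minimal portable sketch using awk instead; the function name split_fields is made up for illustration:

```shell
#!/usr/bin/env bash
# split_fields (hypothetical helper): turn every semicolon on stdin into a newline.
split_fields() {
    awk '{ gsub(/;/, "\n"); print }'
}

# Demo on the same test data as above.
printf '1;2;3;4;5\n6;7;8;9;10\n' | split_fields
```

To use it on a real file, redirect as before: split_fields <input_file >output_file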

while read -r -d ';' ITEM; do
    echo "$ITEM"
done < input_file
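A read -d ';' loop has one trap: if the data does not end with a semicolon, read hits end-of-file, returns non-zero, and the last item is silently skipped. A sketch of one way around that, assuming the items contain no embedded newlines you care about (split_items is a made-up name):

```shell
#!/usr/bin/env bash
# split_items (hypothetical helper): print each ;-separated item on its own line.
split_items() {
    local item
    # At EOF read fails but still leaves any trailing text in "item",
    # so the extra test lets the loop body run one last time.
    while IFS= read -r -d ';' item || [[ -n $item ]]; do
        printf '%s\n' "$item"
    done
}

printf 'a;b;c' | split_items
```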

cat original_file.txt | cut -d";" -f1 > new_file.txt
This will split each line in fields delimited by ";" and select the first field (-f1).
You can access other fields with -f1, -f2, ... or multiple fields with -f1-2, -f2-.

You can translate a character to another character by the 'tr' command.
cat input.txt | tr ';' '\n' > output.txt
Here \n is a newline; if you want the items separated by tabs instead, replace it with \t.


Getting last X fields from a specific line in a CSV file using bash

I'm trying to get the list of users in my CSV file into a bash variable. The problem is that the number of users varies and can be anywhere from 1 to 5.
Example CSV file:
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
How can I remove all the "," from the end of my result? Sometimes there is just one, but sometimes the result looks like user1,,,,
Is there a better way to do this? The users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
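If the trailing commas from the empty user columns are also unwanted, a second sub() can drop them. This is a sketch along the same lines, not a tested replacement for the command above; the function name users_for and the sample rows are made up:

```shell
#!/usr/bin/env bash
# users_for KEY (hypothetical helper): read CSV on stdin and print the users
# (field 4 onward) of every row whose 2nd field equals KEY.
users_for() {
    awk -F, -v key="$1" '
        { gsub(/"/, "") }                      # strip quotes; awk re-splits $0
        $2 == key {
            sub(/^[^,]*,[^,]*,[^,]*,/, "")     # drop the first three fields
            sub(/,+$/, "")                     # drop trailing empty fields
            print
        }'
}

# Demo with made-up rows.
printf '%s\n' '"a","record1_data2","x","user0",,,,' \
              '"b","record2_data2","y","user1","user2",,,' | users_for record2_data2
```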
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,    # "${listofusers[*]}" joins the elements with the first character of IFS
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
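The same idea wrapped in a function, as a rough sketch. Two things here are additions rather than part of the answer above: the sed '/^$/d' step, which drops the empty entries produced by rows with fewer than five users, and local IFS=, which keeps the separator change from leaking into the rest of the script. The name users_csv and the sample rows are made up; readarray needs bash 4 or newer.

```shell
#!/usr/bin/env bash
# users_csv FILE (hypothetical helper): print the unique users from field 4
# onward of FILE as one comma-separated line.
users_csv() {
    local -a users
    readarray -t users < <(cut -d, -f4- "$1" | tr -d '"' | tr ',' '\n' | sed '/^$/d' | sort -u)
    local IFS=,                       # join with commas, only inside this function
    printf '%s\n' "${users[*]}"
}

# Demo on a throwaway file with made-up rows.
tmp=$(mktemp)
printf '%s\n' '"r1","d1","x","bob","alice",,,' '"r2","d2","y","carol","bob",,,' > "$tmp"
users_csv "$tmp"
rm -f "$tmp"
```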
GNU sed solution, let file.csv content be
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
Explanation: -n turns off automatic printing. The expressions mean the following: the 1st substitutes every " with the empty string, i.e. deletes them; the 2nd, for lines containing record2_data, substitutes (s) everything up to and including the 3rd , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/ {
    o = ""
    for (i = 4; i <= NF; i++) o = sprintf("%s%s,", o, $i)
    gsub(/"|,+$/, "", o)    # strip quotes and trailing commas
    print o
}' file.csv
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.

How to extract two pieces of data from a string

I am trying to extract two pieces of data from a string and I am having a bit of trouble. The string is formatted like this:
11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
What I am trying to achieve is to print the first column (11111111-2222:3333:4444:555555555555) and the third section of the colon string (cccccccc), on the same line with a space between the two, as the first column is an identifier. Ideally in a way that can just be run as one-line from the terminal.
I have tried using cut and awk but I have yet to find a good way to make this work.
How about a sed expression like this?
echo "11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd" |
sed -e "s/\(.*\) .*:.*:\(.*\):.*/\1 \2/"
11111111-2222:3333:4444:555555555555 cccccccc
The following awk script does the job without relying on the format of the first column.
awk -F: 'BEGIN {RS=ORS=" "} NR==1; NR==2 {print $3}'
Use it in a pipe or pass the string as a file (simply append the filename as an argument) or as a here-string (append <<< "your string").
Instead of lines, this awk script splits the input into space-separated records (RS=ORS=" "). Each record is subdivided into :-separated fields (-F:). The first record is printed as is (NR==1;, which is the same as NR==1 {print $0}). In the second record, only the 3rd field is printed (NR==2 {print $3}); for the record aaa:bbb:ccc:ddd the 3rd field is ccc.
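Another way to get the same result, sketched here with awk's split() function so that records stay line-based and the output ends with a normal newline. The function name id_and_third is made up:

```shell
#!/usr/bin/env bash
# id_and_third (hypothetical helper): for each input line, print the first
# whitespace-separated column and the 3rd :-separated part of the second column.
id_and_third() {
    awk '{ split($2, parts, ":"); print $1, parts[3] }'
}

echo '11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd' | id_and_third
```

Like the script above, this does not depend on the format of the first column.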
I think the answer from user803422 is better but here's another option. Maybe it'll help you use cut in the future.
str='11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
first=$(echo "$str" | cut -d ' ' -f1)
second=$(echo "$str" | cut -d ':' -f6)
echo "$first $second"
With pure Bash Regex:
str='11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
[[ $str =~ (.*\ ).*:.*:([^:]*): ]] && echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
[[ $str =~ (.*\ ).*:.*:([^:]*): ]]: match $str against the POSIX extended regex (.*\ ).*:.*:([^:]*):, which contains two capture groups. Group 1, (.*\ ), is zero or more of any character followed by a space; group 2, ([^:]*), is any run of characters that are not :. The trailing : pins group 2 to the third colon-separated section rather than the last one.
The match has to run in the current shell, not inside a $(...) command substitution: BASH_REMATCH set in a sub-shell is not visible to the parent shell.
${BASH_REMATCH[1]}${BASH_REMATCH[2]}: expand the captured groups, which Bash keeps in the dedicated BASH_REMATCH array. Group 1 already ends with the space, so no extra separator is needed.
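A common idiom with =~ is to keep the pattern in a variable, which sidesteps the quoting of spaces and brackets inside [[ ]], and to check the return status before touching BASH_REMATCH. A sketch under those assumptions; the function name and the stricter pattern are illustrative, not taken from the answer above:

```shell
#!/usr/bin/env bash
# extract_regex STRING (hypothetical helper): print "<first column> <3rd colon part>".
extract_regex() {
    # Group 1: everything before the first space.
    # Group 2: the 3rd :-separated part of what follows.
    local re='^([^ ]+) [^:]*:[^:]*:([^:]*)'
    if [[ $1 =~ $re ]]; then
        printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1    # no match, BASH_REMATCH is not meaningful
    fi
}

extract_regex '11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
```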

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:]' to remove punctuation
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
paste + grep approach:
grep -Eivx "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
I would try something more direct:
for A in $(tr -d '[:punct:]' < file1); do grep -wq -- "$A" file2 || echo "$A"; done
flags used for grep: q for quiet (don't need output), w for word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
    a[tolower($0)] # remember each word, lowercased
    next
}
{ # process the text file
    for(i=1;i<=NF;i++) # loop all fields
        if($i!="" && !(tolower($i) in a)) # if word was not in the word list
            print $i # print it. duplicates are printed also.
}' another_file txt_file
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
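For completeness, a comm-based sketch closer to what the question was attempting. comm needs both inputs sorted the same way, so both sides are lowercased and sorted under LC_ALL=C here, and the word list is assumed to hold one word per line. The function name words_not_in and the demo file contents are made up:

```shell
#!/usr/bin/env bash
# words_not_in TEXT_FILE WORD_FILE (hypothetical helper): print the lowercased
# words of TEXT_FILE that do not appear in WORD_FILE, sorted and de-duplicated.
words_not_in() {
    LC_ALL=C comm -23 \
        <(tr -cs '[:alpha:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | LC_ALL=C sort -u) \
        <(tr '[:upper:]' '[:lower:]' < "$2" | LC_ALL=C sort -u)
}

# Demo on throwaway files (the word list is invented for illustration).
tmp=$(mktemp -d)
printf 'This is a\nfile containing several\nlines of text.\n' > "$tmp/text"
printf 'a\nis\ntext\n' > "$tmp/words"
words_not_in "$tmp/text" "$tmp/words"
rm -r "$tmp"
```

Lowercasing both sides replaces the -i flag, which not every comm implementation has.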

Bash + sed/awk/cut to delete nth character

I am trying to delete the 6th, 7th and 8th characters of each line.
Below is the file containing text format.
Actual output..
#cat test
Expecting below, after formatting.
#cat test
I also tried the following, with no luck:
#awk -F ":" '{print $1":"$2","$3}' test
#sed 's/^\(.\{7\}\).\(.*\)/\1\2/' test { Here I can remove only one character }
cut failed as well:
#cut -d ":" -f1,2,3 test
I need to delete the 6th, 7th and 8th characters of each line.
Any suggestions, please?
With GNU cut you can use the --complement switch to remove characters 6 to 8:
cut --complement -c6-8 file
Otherwise, you can just select the rest of the characters yourself:
cut -c1-5,9- file
i.e. characters 1 to 5, then 9 to the end of each line.
With awk you could use substrings:
awk '{ print substr($0, 1, 5) substr($0, 9) }' file
Or you could write a regular expression, but the result will be more complex.
For example, to remove the last three characters from the first comma-separated field:
awk -F, -v OFS=, '{ sub(/...$/, "", $1) } 1' file
Or, using sed with a capture group:
sed -E 's/(.{5}).{3}/\1/' file
Capture the first 5 characters and use them in the replacement, dropping the next 3.
it's a structured text, why count the chars if you can describe them?
$ awk '{sub(":..,",",")}1' file
remove the seconds.
The solutions below are generic and assume no knowledge of any format. They just delete character 6,7 and 8 of any line.
sed 's/.//8;s/.//7;s/.//6' <file> # from high to low
sed 's/.//6;s/.//6;s/.//6' <file> # from low to high (subtract 1)
sed 's/\(.....\).../\1/' <file>
sed 's/\(.\{5\}\).../\1/' <file>
s/BRE/replacement/n :: substitute nth occurrence of BRE with replacement
awk 'BEGIN{OFS=FS=""}{$6=$7=$8="";print $0}' <file>
awk -F "" '{OFS=$6=$7=$8="";print}' <file>
awk -F "" '{OFS=$6=$7=$8=""}1' <file>
These three commands are equivalent. With an empty field separator FS, awk treats every character as its own field. We empty fields 6, 7 and 8 and reprint the line with an empty output field separator OFS.
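Note that POSIX leaves the behavior of an empty FS unspecified, so the empty-FS variants depend on the awk in use (gawk and mawk split into characters). A sketch that avoids this by using substr(), with the position and count passed in as parameters; the function name is made up:

```shell
#!/usr/bin/env bash
# delete_chars_awk START COUNT (hypothetical helper): delete COUNT characters
# starting at 1-based position START from every line on stdin.
delete_chars_awk() {
    awk -v s="$1" -v n="$2" '{ print substr($0, 1, s - 1) substr($0, s + n) }'
}

echo '18:40:12,,UP' | delete_chars_awk 6 3
```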
cut -c -5,9- <file>
cut --complement -c 6-8 <file>
Just for fun, perl, where you can assign to a substring
perl -pe 'substr($_,5,3)=""' file
With awk :
echo "18:40:12,,UP" | awk '{ $0 = ( substr($0,1,5) substr($0,9) ) ; print $0}'
If you are running on bash, you can use the string manipulation functionality of it instead of having to call awk, sed, cut or whatever binary:
while read -r STRING; do
    echo "${STRING:0:5}${STRING:8}"
done < myfile.txt
${STRING:0:5} is the first five characters of the string; ${STRING:8} is everything from the 9th character to the end of the line (offsets are zero-based). This way you cut out characters 6, 7 and 8.
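Since substring offsets in bash are zero-based while "the 6th character" is one-based, it is easy to be off by one here. A small sketch that does the arithmetic in one place; the function name delete_range is made up for illustration:

```shell
#!/usr/bin/env bash
# delete_range START COUNT (hypothetical helper): delete COUNT characters
# starting at 1-based position START from every line on stdin, in pure bash.
delete_range() {
    local start=$1 count=$2 line
    while IFS= read -r line || [[ -n $line ]]; do
        # Offset and length are arithmetic expressions, so the 1-based START
        # is converted to a 0-based offset right in the expansion.
        printf '%s\n' "${line:0:start-1}${line:start-1+count}"
    done
}

echo '18:40:12,,UP' | delete_range 6 3
```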

Count number of Special Character in Unix Shell

I have a delimited file that is separated by octal \036 or Hexadecimal value 1e.
I need to count the number of delimiters on each line using a bash shell script.
I was trying to use awk, not sure if this is the best way.
Sample Input (| is a representation of \036)
Expected output:
awk -F'|' '{print NF-1}' file
Change | to whatever separator you like. If your file can have empty lines then you need to tweak it to:
awk -F'|' '{print (NF ? NF-1 : 0)}' file
You can try
awk '{print gsub(/\|/,"")}'
Simply try
awk -F"|" '{print substr($3,length($3))}' OFS="|" Input_file
Explanation: -F"|" sets the field separator to |. substr($3,length($3)) then prints the last character of the 3rd field. OFS (the output field separator) is set to |, and Input_file is the name of the input file.
This will work as far as I know
echo "Example|Running|123|" | tr -cd '|' | wc -c
This should work for you:
awk -F '\036' '{print NF-1}' file
-F '\036' sets input field delimiter as octal value 036
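Combining this with the empty-line guard shown earlier in the thread gives the sketch below. It passes the delimiter as a literal byte through bash's $'...' quoting, so it does not rely on awk interpreting the escape itself; the function name count_delims is made up:

```shell
#!/usr/bin/env bash
# count_delims (hypothetical helper): print the number of \036 delimiters on
# each line of stdin (or of the files given as arguments).
count_delims() {
    awk -F $'\036' '{ print (NF ? NF - 1 : 0) }' "$@"
}

# Demo: three lines with 2, 0 and 1 delimiters.
printf 'a\036b\036c\n\nx\036y\n' | count_delims
```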
Awk may not be the best tool for this. GNU grep has a handy -o option that prints each match on a separate line. You can then count how many matching lines are generated for each input line, and that's the count of your delimiters. E.g. (where ^^ in the file is actually hex 1e)
$ cat -v i
$ grep -n -o $'\x1e' i | uniq -c
2 1:
3 2:
if you remove the uniq -c you can see how it's working. You'll get "1" printed twice because there are two matching patterns on the first line. Or try it with some regular ascii characters and it becomes clearer what the -o and -n options are doing.
If you want to print the line number followed by the field count for that line, I'd do something like:
$ grep -n -o $'\x1e' i | tr -d ':' | uniq -c | awk '{print $2 " " $1}'
1 2
2 3
This assumes that every line in the file contains at least one delimiter. If that's not the case, here's another approach that's probably faster too:
$ tr -d -c $'\x1e\n' < i | awk '{print length}'
This uses tr to delete (-d) all characters that are not (-c) 1e or \n. It then pipes that stream of data to awk which just counts how many characters are left on each line. If you want the line number, add " | cat -n" to the end.
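The same "delete everything except the delimiter, then measure what is left" idea can also be written in pure bash, which may be handy when no external tools should be called. This is a sketch that assumes a single-byte delimiter that is not special inside a bracket expression; the function name is made up:

```shell
#!/usr/bin/env bash
# count_delims_bash (hypothetical helper): for each line on stdin, print how
# many \x1e delimiters it contains.
count_delims_bash() {
    local delim=$'\x1e' line only
    while IFS= read -r line || [[ -n $line ]]; do
        only=${line//[^$delim]/}      # remove every character that is not the delimiter
        printf '%s\n' "${#only}"      # the remaining length is the delimiter count
    done
}

printf 'a\036b\036c\n\nx\036y\n' | count_delims_bash
```

For large files the tr and awk pipeline above is likely to be much faster than a shell loop.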
