Split a big txt file to do grep - unix - bash

I work (Unix, shell scripts) with txt files that contain millions of fields separated by pipes, with no \n or \r line breaks.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), commands like grep don't help, because every match is on the same single line.
I think a good solution could be a script that inserts a "\n" after every 6th field and then runs grep. Am I right? Thank you very much!

With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf $(i+j)"|"; printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily adjust how many fields go on each output line.
Hope this helps !
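To see the end-to-end idea from the question (split, then grep), here is a sketch with made-up sample data; the field values are hypothetical:

```shell
# Hypothetical sample: 12 fields on one line, forming two 6-field records
printf 'ann|lee|111|a@x|222|ann.s|bob|rae|333|b@x|444|bob.s|' |
    awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf $(i+j)"|"; printf "\n"}}' |
    grep 'rae'
# prints: bob|rae|333|b@x|444|bob.s|
```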

you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated zero or more times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
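A quick way to verify the command (this relies on GNU sed interpreting \n in the replacement; BSD sed would need a literal newline), using made-up single-letter fields:

```shell
# 12 made-up fields on one line, 6 per record
printf 'a|b|c|d|e|f|g|h|i|j|k|l|' |
    sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g'
# prints:
# a|b|c|d|e|f|
# g|h|i|j|k|l|
```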

you can use sed to convert every 6th | to a newline.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
consider this:
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of one or more non-| characters.
[^|]\+| is a sequence of one or more non-| characters followed by a |.
\([^|]\+|\) is a sequence of one or more non-| characters followed by a |, grouped together.
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.

Here is how I would do it with awk
awk -v RS="|" '{printf $0 (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust NR%7 to match your number of fields per record (NR%7 for 7 fields, NR%6 for 6, and so on).
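For example, with 6 fields per record you would use NR%6; the sample data below is made up. Note that this variant drops the | after the last field of each record:

```shell
printf 'a|b|c|d|e|f|g|h|i|j|k|l|' |
    awk -v RS="|" '{printf $0 (NR%6?RS:"\n")}'
# prints:
# a|b|c|d|e|f
# g|h|i|j|k|l
```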

What about printing the lines on blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} set input and output field separator as |.
{for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6, printing six of them each time. As print ends its output with a newline, you are done.

If you want to treat the files as being in multiple lines, then make \n the field separator. For example, to get the 2nd column, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex
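For instance, with a made-up three-field line:

```shell
# Turn each | into a newline, then pick the 2nd "line" (i.e. the 2nd field)
printf 'name|surname|phone|' | tr '|' '\n' | sed -n 2p
# prints: surname
```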

Related

How to replace a specific character in a file, but only on lines that contain a specific count of that character?

I would like to double the 4th comma on the lines containing exactly 7 commas, in all the csv's of a folder.
In this command line, I double the 4th comma:
sed  's/,/,,/4' Person_7.csv > new.csv
In this command line, I can find and count all the commas in a line:
sed 's/[^,]//g' dat | awk '{ print length }'
In this command line, I can count and create a new file with lines containing 7 commas:
awk -F , 'NF == 7' <Person_test.csv >Person_7.csv
But I don't know how to combine these to do the specific job...
You need something to select only the lines that contain exactly 7 commas and then operate on just these lines. You can do that with sed:
sed '/^\([^,]*,\)\{7\}[^,]*$/s/,/&&/4'
where ^\([^,]*,\)\{7\}[^,]*$ defines a line that contains exactly 7 commas.
It's a bit easier with awk, though:
awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1'
This sets input and output field separators to ,, and then for lines with 8 fields (7 commas) appends a , to the end of the 4th field, doubling the comma. The final 1 makes sure every line gets printed.
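As a sanity check with made-up single-letter fields (first line has exactly 7 commas, second line is left alone):

```shell
printf 'a,b,c,d,e,f,g,h\nx,y,z\n' |
    awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1'
# prints:
# a,b,c,d,,e,f,g,h
# x,y,z
```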

Is it possible to use grep for the space character

I have a text file (example.txt) like this:
100 this is a string
50 word
10
(Note that there are trailing space characters on the last line.)
When I do the following in my shell script:
cat example.txt | sed '1!d' | awk '{for (i=2; i < NF; i++) printf $i " "; print $NF}' - returns this is a string
cat example.txt | sed '2!d' | awk '{for (i=2; i < NF; i++) printf $i " "; print $NF}' - returns word
cat example.txt | sed '3!d' | awk '{for (i=2; i < NF; i++) printf $i " "; print $NF}' - returns 10 (incorrect, should be a space character instead)
Is there any method to use grep in bash to return the result I am looking for?
Is there any method to use grep in bash to return the result I am looking for?
Well, grep can match space characters. You have to quote them to avoid the shell interpreting them as delimiters. But grep will output either the whole line or the part of it that matches, depending on the options given to it, and I don't think that will satisfy your output requirement.
It looks like your input format may employ fixed field widths, or at least a fixed-width first field, and that you're trying to remove that first field. In that case, why not use sed? For example,
cat example.txt | sed 's/^....//'
will remove the first four characters from each line. You can also spell that
sed 's/^....//' example.txt
. If you want instead to cut a variable-length head of the line consisting of decimal digits up to the first space then that would be
sed 's/^[0-9]* //' example.txt
Note that although that's what you said in comments you want, it will produce different output than your awk example in the case of your second input line -- it will output a leading space:
word
Note also that your awk-based approach will replace multiple adjacent whitespace in the retained part of your lines with single spaces. That behavior could be obtained from sed, too, but I'm inclined to think that it's not actually wanted.
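Using the question's first two sample lines, the digit-stripping variant behaves like this:

```shell
printf '100 this is a string\n50 word\n' | sed 's/^[0-9]* //'
# prints:
# this is a string
# word
```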

Grep and returning only column of match

If I want to search in a file with a varying number of columns like this:
ppl:apple age:5 F add:blabla love:dog
ppl:tom M add:blablaa love:cat
ppl:jay age:3 M love:apple
ppl:jenny acc:jen age:8 F add:blabla
...
the file is tab separated, and the output i want is:
age:5
age:3
age:8
...
using grep age: will return the entire row, while
using cut -f2 will return some unwanted column:
age:5
M
age:3
acc:jen
and neither cut -f2 | grep age: nor grep age: | cut -f2 works
My data may range from 11 to 23 columns.
Is there a simpler way to handle this using grep, sed or awk?
Many thanks
grep itself can do this, with no additional tools, by using the -o/--only-matching switch. You should be able to just do:
grep -o '\<age:[0-9]\+'
To explain the less common parts of the regex:
\< is a zero-width assertion that you're at the beginning of a word (that is, age is preceded by a non-word character or occurs at the beginning of the line, but it's not actually matching that non-word character); this prevents you from matching, say image:123. It doesn't technically require whitespace, so it would match :age: or the like; if that's a problem, match \t itself and use cut or tr to remove it later.
\+ means "match 1 or more occurrences of the preceding character class" (which is [0-9], so it matches one or more digits). \+ is equivalent to repeating the class twice, with the second copy followed by *, e.g. [0-9][0-9]*, except it's shorter, and some regex engines can optimize \+ better.
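Against the question's sample rows (GNU grep, since \< and \+ are GNU extensions to basic regexes):

```shell
printf 'ppl:apple age:5 F add:blabla love:dog\nppl:jay age:3 M love:apple\n' |
    grep -o '\<age:[0-9]\+'
# prints:
# age:5
# age:3
```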
You can use the script below:
awk '/age:/ {for (i=1; i<=NF; i++) if ($i ~ /^age:/) print $i}' file
You can also use sed
sed -nr 's/^.*(age:.).*$/\1/p' input_pattern.txt
Where input_pattern.txt contains your data.
ShadowRanger's simple grep-based answer is probably the best choice.
A solution that works with both GNU sed and BSD/OSX sed:
sed -nE 's/^.*[[:blank:]](age:[0-9]+).*$/\1/p' file
With GNU sed you can simplify to:
sed -nr 's/^.*\t(age:[0-9]+).*$/\1/p' file
Both commands match the entire input line, if it contains an age: field of interest, replace it with that captured field (\1), and print the result; other lines are ignored.
Original answer, before the requirements were clarified:
Assuming that on lines where age: is present, it is always the 2nd tab-separated field, awk is the best solution:
awk '$2 ~ /^age:/ { print $2 }' file
$2 ~ /^age:/ only matches lines whose 2nd whitespace-separated field starts with the literal age:
{ print $2 } simply prints that field.
Limit search for regexp to columns 11 to 23:
awk '{ for(i = 11; i <= 23; i++) { if ($i ~ /^age:/) print $i } }' file
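If you don't know which columns may hold the field, a variant (my generalization, not in the original answers) can simply scan every field:

```shell
printf 'ppl:tom M add:blablaa love:cat\nppl:jay age:3 M love:apple\n' |
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^age:/) print $i }'
# prints: age:3
```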

Single-quote part of a line using sed or awk

Convert input text as follows, using sed or awk:
Input file:
113259740 QA Test in progress
219919630 UAT Test in progress
Expected output:
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
Using GNU sed or BSD (OSX) sed:
sed -E "s/^( *)([^ ]+)( +)(.*)$/\1\2\3'\4'/" file
^( *) captures all leading spaces, if any
([^ ]+) captures the 1st field (a run of non-space characters of at least length 1)
( +) captures the space(s) after the first field
(.*)$ matches the rest of the line, whatever it may be
\1\2\3'\4' replaces each (matching) input line with the captured leading spaces, followed by the 1st field, followed by the captured first inter-field space(s), followed by the single-quoted remainder of the input line. To discard the leading spaces, simply omit \1.
Note:
Matching the 1st field is more permissive than strictly required in that it matches any non-space sequence of characters, not just digits (as in the sample input data).
A generalized solution supporting other forms of whitespace (such as tabs), including after the 1st field, would look like this:
sed -E "s/^([[:space:]]*)([^[:space:]]+)([[:space:]]+)(.*)$/\1\2\3'\4'/" file
If your sed version doesn't support -E (or -r) to enable support for extended regexes, try the following, POSIX-compliant variant that uses a basic regex:
sed "s/^\( *\)\([^ ]\{1,\}\)\( \{1,\}\)\(.*\)$/\1\2\3'\4'/" file
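A quick check of the extended-regex variant against the question's input:

```shell
printf '113259740 QA Test in progress\n219919630 UAT Test in progress\n' |
    sed -E "s/^( *)([^ ]+)( +)(.*)$/\1\2\3'\4'/"
# prints:
# 113259740 'QA Test in progress'
# 219919630 'UAT Test in progress'
```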
You could try this GNU sed command also,
sed -r "s/^( *)([0-9]+) (.*)$/\1\2 '\3'/" file
^( *) captures any leading spaces (possibly none) and stores them in group(1).
([0-9]+) matches the run of digits that follows and stores it in group(2).
(.*)$ - after the single space separating the number from the text, captures everything up to the last character and stores it in group(3).
The captured groups are then rearranged in the replacement part to produce the desired output.
Example:
$ cat ccc
113259740 QA Test in progress
219919630 UAT Test in progress
$ sed -r "s/^( *)([0-9]+) (.*)$/\1\2 '\3'/" ccc
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
And in awk:
awk '{ printf "%s '"'"'", $1; for (i=2; i<NF; ++i) printf "%s ", $i; print $NF "'"'"'" }' file
Explanation:
printf "%s '"'"'", $1; Print the first field, followed by a space and a quote (')
for (i=2; i<NF; ++i) printf "%s ", $i; Print all of the following fields save the last one, each followed by a space.
print $NF "'"'"'" Print the last field followed by a quote(')
Note that '"'"'" is used to print just a single quote ('). An alternative is to specify the quote character on the command line as a variable:
awk -v qt="'" '{ printf "%s %s", $1, qt; for (i=2; i<NF; ++i) printf "%s ", $i; print $NF qt }' file
An awk solution:
awk -v q="'" '{ f1=$1; $1=""; print f1, q substr($0,2) q }' file
Lets awk split each input line into fields by whitespace (the default behavior).
-v q="'" defines awk variable q containing a single quote so as to make it easier to use a single quote inside the awk program, which is single-quoted as a whole.
f1=$1 saves the 1st field for later use.
$1="" effectively removes the first field from the input line, leaving $0, which originally referred to the whole input line, to contain a space followed by the rest of the line (strictly speaking, the fields are re-concatenated using the output field separator OFS, which defaults to a space; since the 1st field is now empty, the resulting $0 starts with a single space followed by all remaining fields separated by a space each).
print f1, q substr($0,2) q then prints the saved 1st field, followed by a space (OFS) due to ,, followed by the remainder of the line (with the initial space stripped with substr()) enclosed in single quotes (q).
Note that this solution normalizes whitespace:
leading and trailing whitespace is removed
interior whitespace of length greater than 1 is compressed to a single space each.
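For example, on one of the question's lines:

```shell
printf '113259740 QA Test in progress\n' |
    awk -v q="'" '{ f1=$1; $1=""; print f1, q substr($0,2) q }'
# prints: 113259740 'QA Test in progress'
```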
Since the post is tagged with bash, here is an all Bash solution that preserves leading white space.
q="'"
while IFS= read -r line; do
    read -r f1 f2 <<<"$line"
    echo "${line/$f1 $f2/$f1 $q$f2$q}"
done < file
Output:
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
Here is a simple way to do it with awk
awk '{sub($2,v"&");sub($NF,"&"v)}1' v=\' file
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
It does not change the formatting of the file.
You can perform this by taking advantage of the word splitting that read performs in shells like bash. To avoid ending up with extra single quotes in the final result, you can just remove any existing ones with sed first. This will also trim any extra spaces before i, between i and j, and after j.
cat file.txt | sed "s/'//g" | while read i j; do echo "$i '$j'"; done
Here, read puts the first word into variable i and the rest of the line into j.

Trim leading and trailing spaces from a string in awk

I'm trying to remove leading and trailing space in 2nd column of the below input.txt:
Name, Order  
Trim, working
cat,cat1
I have used the below awk to remove leading and trailing space in 2nd column but it is not working. What am I missing?
awk -F, '{$2=$2};1' input.txt
This gives the output as:
Name, Order  
Trim, working
cat,cat1
Leading and trailing spaces are not removed.
If you want to trim all spaces, only in lines that have a comma, and use awk, then the following will work for you:
awk -F, '/,/{gsub(/ /, "", $0); print} ' input.txt
If you only want to remove spaces in the second column, change the expression to
awk -F, '/,/{gsub(/ /, "", $2); print$1","$2} ' input.txt
Note that gsub substitutes the character in // with the second expression, in the variable that is the third parameter - and does so in-place - in other words, when it's done, the $0 (or $2) has been modified.
Full explanation:
-F, use comma as field separator
(so the thing before the first comma is $1, etc)
/,/ operate only on lines with a comma
(this means empty lines are skipped)
gsub(a,b,c) match the regular expression a, replace it with b,
and do all this with the contents of c
print$1","$2 print the contents of field 1, a comma, then field 2
input.txt use input.txt as the source of lines to process
EDIT I want to point out that #BMW's solution is better, as it actually trims only leading and trailing spaces, with two successive gsub commands. While giving credit, here is an explanation of how it works.
gsub(/^[ \t]+/,"",$2); - starting at the beginning (^) replace all (+ = one or more, greedy)
consecutive tabs and spaces with an empty string
gsub(/[ \t]+$/,"",$2)} - do the same, but now for all whitespace up to the end of the string ($)
1 - ="true". Shorthand for "use the default action", which is print $0
- that is, print the entire (modified) line
remove leading and trailing white space in 2nd column
awk 'BEGIN{FS=OFS=","}{gsub(/^[ \t]+/,"",$2);gsub(/[ \t]+$/,"",$2)}1' input.txt
another way by one gsub:
awk 'BEGIN{FS=OFS=","} {gsub(/^[ \t]+|[ \t]+$/, "", $2)}1' infile
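A quick demonstration on the question's input (note only column 2 is trimmed; column 1 is left as-is):

```shell
printf 'Name,  Order  \nTrim, working\ncat,cat1\n' |
    awk 'BEGIN{FS=OFS=","} {gsub(/^[ \t]+|[ \t]+$/, "", $2)}1'
# prints:
# Name,Order
# Trim,working
# cat,cat1
```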
Warning by #Geoff: see my note below, only one of the suggestions in this answer works (though on both columns).
I would use sed:
sed 's/, /,/' input.txt
This will remove one leading space after the , .
Output:
Name,Order
Trim,working
cat,cat1
More general might be the following; it will remove possibly multiple spaces and/or tabs after the ,:
sed 's/,[ \t]*/,/g' input.txt
It will also work with more than two columns because of the global modifier /g
#Floris asked in the discussion for a solution that removes leading and trailing whitespace in each column (even the first and last), while not removing whitespace in the middle of a column:
sed 's/[ \t]\?,[ \t]\?/,/g; s/^[ \t]\+//g; s/[ \t]\+$//g' input.txt
*EDIT by #Geoff: I've appended the input file name to this one, and now it removes all leading & trailing spaces (though from both columns). The other suggestions within this answer don't work. Try them on: " Multiple spaces , and 2 spaces before here " *
IMO sed is the optimal tool for this job. However, here comes a solution with awk because you've asked for that:
awk -F', ' '{printf "%s,%s\n", $1, $2}' input.txt
Another simple solution that comes in mind to remove all whitespaces is tr -d:
tr -d ' ' < input.txt
I just came across this. The correct answer is:
awk 'BEGIN{FS=OFS=","} {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2)} 1'
just use a regex as a separator:
', *' - strips the spaces after a comma (the leading spaces of a column)
' *,' - strips the spaces before a comma (the trailing spaces of a column)
for both leading and trailing:
awk -F' *,? *' '{print $1","$2}' input.txt
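For example (tested with GNU awk; the separator regex eats the comma together with any surrounding spaces):

```shell
printf 'Trim ,  working\n' | awk -F' *,? *' '{print $1","$2}'
# prints: Trim,working
```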
Simplest solution is probably to use tr
$ cat -A input
^I Name, ^IOrder $
Trim, working $
cat,cat1^I
$ tr -d '[:blank:]' < input | cat -A
Name,Order$
Trim,working$
cat,cat1
The following seems to work:
awk -F',[[:blank:]]*' '{$2=$2}1' OFS="," input.txt
If it is safe to assume only one set of spaces in column two (which is the original example):
awk '{print $1$2}' /tmp/input.txt
Adding another field, e.g. awk '{print $1$2$3}' /tmp/input.txt will catch two sets of spaces (up to three words in column two), and won't break if there are fewer.
If you have an indeterminate (large) number of space-delimited words, I'd use one of the previous suggestions; otherwise this solution is the easiest you'll find using awk.
