Counting characters in a huge one-line file - shell

I'm working with a huge file (4 GB). This file doesn't have a conventional line delimiter: instead of \n or the like, the line delimiter is a run of characters (#####).
In this annoying file, I would like to count the number of times the "#####" string occurs, to get the number of lines.
I tried:
grep -o "#####" MyHugeFile.txt | wc -l
awk -F'#####' '{print NF-1}' MyHugeFile.txt
awk '{print gsub(/#####/,"&")}' MyHugeFile.txt
But none of them worked: the process seems to die before it produces the number of occurrences.
Is there another way?
Thank you very much !
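A minimal sketch of one workaround, assuming GNU awk (which accepts a multi-character record separator): use "#####" itself as the record separator, so awk streams one logical line at a time instead of choking on a single 4 GB line.
# assumes GNU awk and no trailing "#####" at end of file;
# if the file ends with the delimiter, drop the "- 1"
gawk -v RS='#####' 'END { print NR - 1 }' MyHugeFile.txt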

Related

Why is wc -l counting lines incorrectly? [duplicate]

I have a text file which is over 60 MB in size. It has entries on 5105043 lines, but when I run wc -l it reports only 5105042, one less than the actual count. Does anyone have any idea why this happens?
Is it a common thing when the file size is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file, which wc then counts. There are probably better solutions, but this works.
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try: cat -A file.txt | tail does your last line contain a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
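A small guarded sketch of the same fix (an assumption here: a tail that supports -c, and a non-empty file). Since command substitution strips a trailing newline, the test succeeds only when the last byte is not already a newline:
# append a newline only if the file doesn't already end with one
[ -n "$(tail -c 1 file.txt)" ] && printf '\n' >> file.txt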
60 MB seems a bit big for this, but for small files one option could be:
cat -n file.txt
OR
cat -n file.txt | cut -f1 | tail -1
cat -n numbers every line, including a final line that lacks a trailing newline, so the last number printed is the actual line count.
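Along the same lines, a sketch using awk, which also counts a final unterminated line as a record and so reports the count the asker expects:
awk 'END { print NR }' file.txt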

How many times does the letter "N", or a run of it (e.g. "NNNNN"), appear in a text file?

I am given a file.txt (text file) with a string of data. Example contents:
abcabccabbabNababbababaaaNNcacbba
abacabababaaNNNbacabaaccabbacacab
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
aaababababababacacacacccbababNbNa
abababbacababaaacccc
I want to find the number of distinct runs of "N" (repeated one or more times) present in the file, using Unix commands.
I am unsure which commands to use, even after trying a range of them. For example:
$ grep -E -c "(N)+" file.txt
The output should be 6.
One way:
$ sed 's/[^N]\{1,\}/\n/g' file.txt | grep -c N
6
How it works:
Replace all sequences of one or more non-N characters in the input with a newline.
This turns strings like abcabccabbabNababbababaaaNNcacbba into
N
NN
Count the number of lines with at least one N (ignoring the empty lines).
Regular-expression free alternative:
$ tr -sc N ' ' < file.txt | wc -w
6
Uses tr to replace all runs of non-N characters with a single space, then counts the remaining words (which are the N runs). You might not even need the -s option.
Using GNU awk (well, just tested with gawk, mawk, busybox awk and awk version 20121220 and it seemed to work with all of them):
$ gawk -v RS="^$" -F"N+" '{print NF-1}' file
6
It reads the whole file in as a single record, uses the regex N+ as the field separator, and outputs the field count minus one. For other awks:
$ awk -v RS="" -F"N+" '{c+=NF-1}END{print c}' file
It reads empty-line-separated blocks of records, then counts and sums the fields.
Here is an awk that should work on most systems.
awk -F'N+' '{a+=NF-1} END {print a}' file
6
It splits each line by one or more Ns, then counts the number of fields minus one per line.
If you have a text file and you want to count the number of times a sequence of Ns appears, you can do:
awk '{a+=gsub(/N+/,"")}END{print a}' file
This, however, will count a sequence that is split over multiple lines as two separate sequences. Example:
abcNNN
NNefg
If you want this to be counted as a single sequence, you should do:
awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' file
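For comparison, a minimal sketch using grep -o, which prints each matched run on its own line so the runs can simply be counted (like the per-line awk answers, it counts a run split across a line break as two):
$ grep -oE 'N+' file.txt | wc -l
6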

Making bash output a certain word from a .txt file

I have a question on Bash:
Like the title says, I require bash to output a certain word, depending on where it is in the file. In my explicit example I have a simple .txt file.
I already found out that you can count the number of words within a file with the command:
wc -w < myFile.txt
An output example would be:
78501
There certainly is also a way to make "cat" show only word number x. Something like:
cat myFile.txt | wordno. 3125
desired-word
Notice, that I will welcome any command, that gets this done, not only cat.
Alternatively or in addition, I would be happy to know how you can make certain characters in a file show, based on their place in it. Something like:
cat myFile.txt | characterno. 2342
desired-character
I already know how you can achieve this with a variable:
a="hello, how are you"
echo ${a:9:1}
w
The only problem is that a variable can only be so long; if it were as long as a whole .txt file, this wouldn't work.
I look forward to your answers!
You could use awk for this job: it splits the string at spaces and prints the field at position wordnumber; tr is used first to remove newlines:
cat myFile.txt | tr -d '\n' | awk -v wordnumber=5 '{ print $wordnumber }'
And if you want, for example, the 5th character, you could do it like so:
head -c 5 myFile.txt | tail -c 1
Since you haven't shown a sample of the input file or the expected output, I couldn't test this, but you could simply do it with awk as follows:
awk 'FNR==1{print substr($0,2342,1);next}' Input_file
Here FNR==1 tells awk to look only at the first line, and substr($0,2342,1) takes 1 character starting at position 2342; you could increase these values or adjust them to your needs.
With gawk:
awk 'BEGIN{RS="[[:space:]]+"} NR==12345' file
or
gawk 'NR==12345' RS="[[:space:]]+" file
I'm setting the record separator to a sequence of whitespace characters, which includes newlines, and then printing the 12345th record.
To improve the average performance you can exit the script once the match is found:
gawk 'BEGIN{RS="[[:space:]]+"}NR==12345{print;exit}' file
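A sketch of yet another way to grab a single character by position (an assumption: the file is plain single-byte text, since dd counts bytes, and skip is zero-based, so it is one less than the desired position):
# print character number 2342 of the file
dd if=myFile.txt bs=1 skip=2341 count=1 2>/dev/null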

Printing a line of a file given line number

Is it possible, in UNIX, to print a particular line of a file? For example, I would like to print line 10 of file example.c. I tried cat, ls, and awk, but apparently either these don't have the feature or I'm not able to read the man pages properly :-).
Using awk:
awk 'NR==10' file
Using sed:
sed '10!d' file
sed -n '10{p;q;}' example.c
will print the tenth line of example.c for you.
Try head and tail; you can specify the number of lines and where to start.
To get the third line:
head -n 3 yourfile.c | tail -n 1
head -n 10 /tmp/asdf | tail -n 1
Unfortunately, all the other solutions which use head/tail will not work correctly if the line number provided is larger than the total number of lines in the file (they print the last line instead of nothing).
This will print line number N, or nothing if N is beyond the total number of lines (note the -n, which prefixes each line with its number):
grep -n "" file | grep "^20:"
If you want to cut the line number from the output, pipe it through sed:
grep -n "" file | grep "^20:" | sed 's/^20://'
Try this:
cat -n <yourfile> | grep "^[[:space:]]*<NUMBER>[[:space:]].*$"
cat -n numbers the file
the grep regex then matches the line with that number ;-)
The original regex mismatched, as mentioned in the comments.
The current one looks for an exact match - i.e. in this particular case we need a line starting with an arbitrary amount (*) of spaces, then <NUMBER> followed by a space, followed by whatever (.*).
In case anyone stumbles over this regex and doesn't get it at all - here is a good tutorial to get you started: http://regex.learncodethehardway.org/book/ (it uses Python regexes as examples, though).
This might work for you:
sed '10q;d' file
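If this comes up often, a tiny shell function (printline is a hypothetical name) can wrap the sed approach above:
# usage: printline NUMBER FILE
printline() { sed "${1}q;d" "$2"; }
printline 10 example.c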

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format for the program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string='line of input from CSV file'
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
  echo "$(( ${#array[@]} - 1 ))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
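A quick sanity check of the second loop, using the test.txt contents from the xargs answer above as the assumed input:
$ printf 'foo,bar,baz\nbaz,foo,foobar,bar\n' > inputfile
$ while read -r line; do count=${line//[^,]}; echo "${#count}"; done < inputfile
2
3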
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python is installed on most modern systems) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas in each line, then makes a set from the counts (so if all your lines have the same number of commas, you'll get a set containing just that number).
Just remove all of the carriage returns:
tr -d '\r' < old_file > new_file
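Tying this back to the original goal, a sketch of a guard you could drop into an existing script: fail when any line's field count differs from the expected one (expect=4 fields, i.e. 3 commas, and inputfile.csv are placeholders; note that quoted fields containing commas would still fool this check):
# exits non-zero if any line has the wrong number of fields
awk -F',' -v expect=4 'NF != expect { print "line " NR ": " (NF-1) " commas"; bad=1 } END { exit bad }' inputfile.csv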
