Counting lines in file ambiguity - python-2.6

I have this code to count each line in a file:
n = sum(1 for line in open('myfile.txt'))
'n' being the number of lines. But it's not giving the correct number of lines; the count is off by hundreds. I've also tried different approaches found on Google, but nothing seems to do the trick.
Any idea why this is happening? Or are there scenarios that stop this from giving the correct count?
--UPDATE--
Tried re-writing the file to another file:
i = 0
with open(file2, 'w') as outFile:
    with open(file1) as inFile:
        for line in inFile:
            outFile.write(line)
            i += 1
The output file file2 is exactly the same as file1 when viewed in a file viewer, in terms of number of lines. However, the value of 'i' still doesn't give the correct number of lines.

A better way to do this would be to open the file and then count the length of the list returned by its readlines() method.
f = open('yourfile.txt', 'r')
print len(f.readlines())
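One scenario worth ruling out (this is a guess, not something stated in the question) is unusual line endings: a quick diagnostic in Python 2.6 could compare the text-mode line count with the raw newline counts.
# Hypothetical diagnostic for 'myfile.txt' (the file name from the question):
# a big gap between these numbers can point to '\r'-only line endings or
# other line-ending oddities.
with open('myfile.txt', 'rb') as f:
    data = f.read()
print 'LF (\\n) bytes:', data.count('\n')
print 'CR (\\r) bytes:', data.count('\r')
with open('myfile.txt', 'rU') as f:   # universal-newline mode
    print 'text-mode lines:', sum(1 for line in f)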

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
  [[ ! -f ${FILENAME}.${i}.xyz ]] && break
  cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
  mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
  let "n = 2 + (${j} * ${LINES_PER_CONF})"
  let "m = ${j} + 1"
  ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
  sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Lines 3 to (number of atoms + 2): molecular coordinates
Line (number of atoms + 3): same as line 1
Line (number of atoms + 4): Title 2
... and so on (lines 1 through (number of atoms + 2) belong to conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line for that conformer in the combined file (the energy must become the title of the specific conformer it belongs to).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
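To make that concrete, here is a rough sketch of the single-pass idea, written in Python rather than the Awk program the previous paragraph describes; the file names and the LINES_PER_CONF value are placeholders standing in for the question's variables.
# Sketch only: replace every conformer's title line with its energy in one pass.
LINES_PER_CONF = 5                    # atoms per conformer + 2 (assumed value)

# Energies keyed by conformer number: column 2 of each line of the ACT file.
energies = {}
with open('actfile.dat') as act:
    for num, line in enumerate(act, start=1):
        parts = line.split()
        if len(parts) >= 2:
            energies[num] = parts[1]

# Title lines sit at 2, 2 + LINES_PER_CONF, 2 + 2*LINES_PER_CONF, ... exactly
# as in the question's "n = 2 + j * LINES_PER_CONF".
with open('combined.xyz') as src, open('combined.fixed.xyz', 'w') as dst:
    for lineno, line in enumerate(src, start=1):
        conf, offset = divmod(lineno - 2, LINES_PER_CONF)
        if lineno >= 2 and offset == 0 and (conf + 1) in energies:
            dst.write(energies[conf + 1] + '\n')
        else:
            dst.write(line)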
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number within the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but with more integer digits in its default mode before it switches to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926
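For what it's worth, here is roughly what that pipeline produces, sketched in Python with the same prefix, start (the 17th name) and step (91) assumed:
# Materialize myFILENAME.000 ... myFILENAME.999 and take every 91st name
# starting from the 17th, mirroring the jot | mawk pipeline above.
names = ['myFILENAME.%03d' % i for i in range(1000)]
for name in names[16::91]:            # 0-based index 16 is the 17th name
    print(name)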

Using cloc (Count Lines of Code) result

I am writing a script for my research, and I want to get the total number of lines in a source file. I came around cloc and I think I am going to use it in my script.
However, cloc's output contains too much information (unfortunately, since I am a new member, I cannot upload a photo). It gives the number of files, the number of lines, the number of blank lines, the number of comment lines, and other graphical representation stuff.
I am only interested in the number of lines, to use it in my calculations. Is there a way to get that number easily (maybe via some command-line options, although I went through the available options and didn't find anything useful for my case)?
I thought about running a regular expression on the result to get the number; however, this is my first time using cloc and there might be a better/more professional way of doing it.
Any thoughts?
Regards,
Arwa
I am not sure about cloc, but it is worth using the default shell commands.
Please have a look at this question.
To get the number of lines of code per file:
find . -name '*.*' | xargs wc -l
To get the total number of lines of code in a directory:
(find ./ -name '*.*' -print0 | xargs -0 cat) | wc -l
Please note that if you need the number of lines from files with a specific extension, you can use *.ext (for example, *.rb if it is Ruby).
For something very quick and simple you could just use:
Dir.glob('your_directory/**/*.rb').map do |file|
  File.foreach(file).count
end.reduce(:+)
This will count all the lines of .rb files in your_directory and its subdirectories. Although I would recommend adding some handling for blank lines as well as comment lines. For more details, see Dir::glob.
#BinaryMee and #engineersmnky, thanks for your responses.
I tried two different solutions: one using "readlines" (I got the answer from #gicappa under "Count the length (number of lines) of a CSV file?"), and the other using cloc. For the cloc solution, I ran the command
%x{perl #{ClocPath} #{path-to-file} > result.txt}
and saved the result in result.txt
cloc returns its result as a formatted table (I cannot upload an image); it also reports the number of blank lines, comment lines, and code lines. As I said, I am interested in the code lines, so I opened the file and used a regular expression to get the number I needed.
content = File.read("#{path}/result.txt")
line = content.scan(/(\s+\d+\s+\d+\s+\d+\s+\d+)/)
total = line[0][0].split(' ').last
content here will hold the contents of the file, and line will then match this line from it:
C# 1 3 3 17
C# is the language of the file, 1 is the number of files, 3 is the number of blank lines, 3 is the number of comment lines, and 17 is the number of code lines. I worked out the format from the cloc script itself. total will then hold the number 17.
This solution helps if you are reading a specific file only; you will need to extend it if you are reading the lines of more than one file.
Hopefully this will help whoever needs it.
Regards,
Arwa

AWK - replace with constant character in a specified number of random lines

I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.
The file I do this in looks like this (genotype.dat):
M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537
and to mask it, I simply change M to S2.
Yet, I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505 and then manually changing the corresponding lines' M to S2) took almost an hour... (I know, not terribly sophisticated).
I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character (the M) on each of those line numbers with S2, but I could not find any adaptable example of how to do this.
Anyway, any suggestions on how to tackle this will be deeply appreciated.
Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):
sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
genotype.dat > genotype.masked
A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
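If you would rather compute the 2% from the file itself and stay out of sed entirely, here is a rough sketch of the same idea in Python (genotype.dat, the M-to-S2 convention, and the 2% figure come from the question; the output file name and everything else are assumptions):
import random

infile, outfile = 'genotype.dat', 'genotype.masked'
with open(infile) as f:
    lines = f.readlines()

# Pick 2% of the line numbers at random, without repetition (110 of 5505).
n_mask = int(round(len(lines) * 0.02))
to_mask = set(random.sample(range(len(lines)), n_mask))

with open(outfile, 'w') as out:
    for i, line in enumerate(lines):
        if i in to_mask and line.strip():
            # Replace the first field (M) with S2, keep the rest of the line.
            fields = line.split(None, 1)
            line = 'S2 ' + (fields[1] if len(fields) > 1 else '\n')
        out.write(line)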
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat
How it works
In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one number per line, and a for that number is set to one. We then read genotype.dat. If a for that line's number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.
In detail:
NR==FNR{a[$1]=1;next;}
In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line number of lines in genotype.dat that are to be masked. For each of these line numbers, we set a to 1. We then skip the rest of the commands and jump to the next line.
a[FNR]{$1="S2"}
If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.
1
This is awk's cryptic shorthand to print the current line.

Comparing two text files and counting number of occurrences

I'm trying to write a blog post about the dangers of having a common access point name.
So I did some wardriving to get a list of access point names, and I downloaded a list of the 1000 most common access point names (for which rainbow tables exist) from Renderlab.
But how can I compare those two text files, to see how many of my collected access point names are open to attacks from rainbow tables?
The text files are built like this:
collected.txt:
linksys
internet
hotspot
The list of the most common access point names is called
SSID.txt:
default
NETGEAR
Wireless
WLAN
Belkin54g
So the script should sort the lines, compare them, and show how many times the lines from collected.txt are found in SSID.txt.
Does that make any sense? Any help would be greatly appreciated :)
If you don't mind using a Python script:
file1 = open('collected.txt', 'r')            # open file 1 for reading
with open('SSID.txt', 'r') as content_file:   # read file 2 in one go
    SSID = content_file.read()
found = {}                                    # summary of found names
for line in file1:
    line = line.strip()                       # drop the trailing newline
    if line and line in SSID:
        if line not in found:
            found[line] = 1
        else:
            found[line] += 1
for i in found:
    print found[i], i                         # print the names and their counts
...it can be run in the directory containing these files (collected.txt and SSID.txt), and it will return a list looking like this:
5 NETGEAR
3 default
(...)
The script reads file 1 line by line and compares each line to the whole of file 2. It can easily be modified to take the file names from the command line.
First, take a look at a simple tutorial on the sdiff command, such as "How do I Compare two files under Linux or UNIX". Notepad++ also supports this.
To find the number of times each line in file A appears in file B, you can do:
awk 'FNR==NR{a[$0]=1; next} $0 in a { count[$0]++ }
END { for( i in a ) print i, count[i] }' A B
If you want the output sorted, pipe the output to sort, but there's no need to sort just to find the counts. Note that the $0 in a clause can be omitted at the cost of consuming more memory, which may be a problem if file B is very large.
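For comparison, roughly the same counting can be written in Python with collections.Counter; this sketch takes SSID.txt as the names to look for and counts their occurrences in collected.txt, and it assumes one access point name per line in both files:
from collections import Counter

# Names to look for: one SSID per line of SSID.txt (file A).
with open('SSID.txt') as f:
    common = set(line.strip() for line in f)

# Count how often each of those names shows up in collected.txt (file B).
with open('collected.txt') as f:
    counts = Counter(line.strip() for line in f if line.strip() in common)

for name, n in counts.most_common():
    print(n, name)           # prints e.g. 5 NETGEAR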

Is there a better split function for terminal?

I'm trying to split a very big CSV file into smaller more manageable ones. I've tried split but it seems that it tops out at 676 files.
The CSV file I have is in excess of 80 MB, and I'd like to split it into 50-line files.
Note that by "better" I mean one that uses a numbering structure instead of split's a-z suffix sequencing.
split is the right tool; the problem is that the default suffix length is only 2, which gives 26^2 = 676 possible output files. If you make it longer, you should be fine:
split -a LEN file
Use 'cat' to number each line and pipe the output to 'grep' with params to only print n lines
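If the numbered naming matters more than using split itself, a short Python sketch can do the numbered 50-line split directly (the input name and the part_ prefix are placeholders, not names from the question):
# Split big.csv into numbered 50-line chunks: part_0000.csv, part_0001.csv, ...
CHUNK = 50
out = None
with open('big.csv') as src:
    for i, line in enumerate(src):
        if i % CHUNK == 0:          # start a new chunk every 50 lines
            if out:
                out.close()
            out = open('part_%04d.csv' % (i // CHUNK), 'w')
        out.write(line)
if out:
    out.close()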
