Using cloc (Count Lines of Code) result - ruby

I am writing a script for my research, and I want to get the total number of lines in a source file. I came across cloc and I think I am going to use it in my script.
However, cloc outputs too much information (unfortunately, since I am a new member, I cannot upload a screenshot). It reports the number of files, lines, blank lines, and comment lines, plus some graphical table formatting.
I am only interested in the number of lines, to use in my calculations. Is there an easy way to get that number, perhaps through a command-line option? I went through the available options and didn't find anything useful for my case.
I thought of running a regular expression over the output to get the number; however, this is my first time using cloc, and there might be a better, more professional way of doing it.
Any thoughts?
Regards,
Arwa

I am not sure about cloc, but it is worth trying the standard shell commands.
Please have a look at this question.
To get the number of lines for each file individually:
find . -type f -name '*.*' -print0 | xargs -0 wc -l
To get the total number of lines in a directory:
(find ./ -type f -name '*.*' -print0 | xargs -0 cat) | wc -l
Note that if you only need the line counts for files with a specific extension, you can use '*.ext' as the pattern, for example '*.rb' for Ruby.

For something very quick and simple you could just use:
Dir.glob('your_directory/**/*.rb').map do |file|
  File.foreach(file).count
end.reduce(:+)
This will count all the lines of the .rb files in your_directory and its subdirectories, although I would recommend adding some handling for blank lines as well as comment lines. For more, see the documentation for Dir::glob.
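As a rough illustration of the blank-line and comment-line handling suggested above, here is a minimal Python sketch. The function name count_code_lines is made up for this example, and it assumes that a comment is any line whose first non-space character is '#'; Ruby's =begin/=end block comments and trailing comments after code are not handled.

```python
from pathlib import Path


def count_code_lines(directory, pattern="*.rb"):
    """Count lines that are neither blank nor whole-line '#' comments.

    Hypothetical helper: only whole-line '#' comments are skipped.
    """
    total = 0
    # rglob walks the directory and all of its subdirectories.
    for path in Path(directory).rglob(pattern):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                stripped = line.strip()
                if stripped and not stripped.startswith("#"):
                    total += 1
    return total
```

Calling count_code_lines('your_directory') would then give a single total that can be used directly in a calculation.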

@BinaryMee and @engineersmnky, thanks for your responses.
I tried two different solutions. The first uses "readlines", based on the answer from @gicappa to
Count the length (number of lines) of a CSV file?
The other solution uses cloc. I ran the command
%x{perl #{ClocPath} #{path_to_file} > result.txt}
and saved the result in result.txt.
cloc prints its result as a formatted table (I cannot upload an image); it also reports the number of blank lines, comment lines, and code lines. As I said, I am interested in the code lines, so I opened the file and used a regular expression to get the number I needed.
content = File.read("#{path}/result.txt")
line = content.scan(/(\s+\d+\s+\d+\s+\d+\s+\d+)/)
total = line[0][0].split(' ').last
Here content holds the contents of the file, and line captures this row from it:
C# 1 3 3 17
C# is the language of the file, 1 is the number of files, 3 is the number of blank lines, 3 is the number of comment lines, and 17 is the number of code lines. I worked out the format from the cloc script itself. total then holds the number 17.
This solution works if you are reading one specific file only; it needs extending if you are counting the lines of more than one file.
Hopefully this will help whoever needs it.
Regards,
Arwa
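For readers who want the same extraction without the intermediate regex-and-split steps, here is a minimal Python sketch of parsing a result row. It assumes that each result row has the shape described above (language, files, blank, comment, code); the sample text is modeled on the single row quoted in this post, and the names ROW and code_lines_by_language are invented for the example. cloc also documents --csv and --json output options, which may be easier to parse than the table.

```python
import re

# One result row: a language name followed by four integer columns
# (files, blank, comment, code), as described in the post above.
ROW = re.compile(
    r"^(?P<language>\S.*?)\s+(?P<files>\d+)\s+(?P<blank>\d+)"
    r"\s+(?P<comment>\d+)\s+(?P<code>\d+)\s*$"
)


def code_lines_by_language(report):
    """Map each language in a cloc-style table to its code-line count."""
    counts = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:  # separator and header lines do not match
            counts[match.group("language")] = int(match.group("code"))
    return counts


# Assumed sample, built from the row shown in the post.
sample = """\
-------------------------------------------------------------
C#                     1          3          3         17
-------------------------------------------------------------
"""
print(code_lines_by_language(sample))  # {'C#': 17}
```

Because every matching row is collected, the same function would also cover a report with several languages, which the single-row regex above does not.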

Related

grep listing false duplicates

I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note: both files are big, so the samples shown above don't contain duplicates, but I do expect, and have confirmed, at least 10-12k duplicates in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), formatted the same way, and the results were fine, so I'm a little confused about why it fails on the larger run.
I also tried feeding the patterns in as fixed strings with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which runs very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches only when the pattern matches the whole line, while -w requires the match to be bounded by non-word characters (or the start or end of the line), which works in my case to handle trailing digits.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
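To make the difference concrete, here is a small Python sketch that contrasts substring matching (how the patterns behaved in the original command) with whole-line matching (the effect of -x). The function names and the sample records are illustrative only; the real files are assumed to hold one record per line.

```python
def substring_matches(pattern_lines, candidate_lines):
    """Keep candidates that contain any pattern anywhere in the line."""
    patterns = [p.strip() for p in pattern_lines if p.strip()]
    return [c.strip() for c in candidate_lines
            if any(p in c for p in patterns)]


def exact_duplicates(pattern_lines, candidate_lines):
    """Keep candidates that are identical to some pattern line."""
    wanted = {p.strip() for p in pattern_lines if p.strip()}
    return [c.strip() for c in candidate_lines if c.strip() in wanted]


pilot = ["AnalogPoint,1", "AnalogPoint,2584"]
disps = ["AnalogPoint,10", "AnalogPoint,123", "AnalogPoint,100200",
         "AnalogPoint,2584", "StatusPoint,280264"]

print(substring_matches(pilot, disps))
# ['AnalogPoint,10', 'AnalogPoint,123', 'AnalogPoint,100200', 'AnalogPoint,2584']
print(exact_duplicates(pilot, disps))
# ['AnalogPoint,2584']
```

The set lookup in exact_duplicates also stays fast on large files, since each candidate line is checked once instead of being compared against every pattern.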

Finding a newline in the csv file

I know there are a lot of questions about this (latest one here.), but almost all of them are about how to join those broken lines of a csv file into one, or how to remove them. I don't want to remove them; I just want to display/find that line (or perhaps the line number).
Example data:
22224,across,some,text,0,,,4 etc
33448,more,text,1,,3,,,4 etc
abcde,text,number,444444,0,1,,,, etc
358890,more
,text,here,44,,,, etc
abcdefg,textds3,numberss,413,0,,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
After more searching, I know I shouldn't use bash to accomplish this and should rather use Perl. I tried a few snippets from various websites (I don't know Perl), but apparently I don't have the Text::CSV package, and I don't have permission to install it.
As I said, I have no idea how to even start looking for this, so I don't have any script. This is not a Windows file; it is very much a Unix file, so we can ignore the CR problem.
Desired output:
358890,more
,text,here,44,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
or
Line 4: 358890,more
,text,here,44,,,, etc
Line 7: 985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
Much appreciated.
You can use Perl to count the number of fields (commas) and keep appending the next line until the record reaches the correct number:
perl -ne 'if(tr/,/,/<28){$line=$.;while(tr/,/,/<28){$_.=<>}print "Line $line: $_\n"}' file
I do love Perl, but I don't think it is the best tool for this job.
If you want a report of all lines that DO NOT have exactly the correct number of commas/delimiters, you could use the Unix tool awk.
For example, this command:
/usr/bin/awk -F , 'NF != 8' < csv_file.txt
will print all lines that DO NOT have exactly 7 commas. The comma is specified as the field separator with -F, and NF holds the number of fields on the current line.
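If neither Perl modules nor awk are convenient, the same idea (count commas, and keep joining physical lines until a record is complete) can be sketched in Python. This is only an illustration under stated assumptions: expected_fields is something you supply yourself, commas are counted naively (quoted fields containing commas are not handled), and find_broken_records is a made-up name.

```python
def find_broken_records(lines, expected_fields):
    """Return (starting line number, joined text) for every record that
    needed more than one physical line to reach expected_fields fields."""
    broken = []
    start, parts = None, []
    for number, line in enumerate(lines, start=1):
        if start is None:
            start = number  # first physical line of the current record
        parts.append(line.rstrip("\n"))
        record = "\n".join(parts)
        # N fields means N - 1 commas (naive count, no quoting rules).
        if record.count(",") >= expected_fields - 1:
            if len(parts) > 1:
                broken.append((start, record))
            start, parts = None, []
    if parts:  # input ended in the middle of a record
        broken.append((start, "\n".join(parts)))
    return broken


rows = ["a,b,c,d", "e,f", ",g,h", "i,j,k,l"]
for line_number, text in find_broken_records(rows, expected_fields=4):
    print(f"Line {line_number}: {text}")
```

For a real file, passing the open file object (for example open('file') inside a with block) as lines should work, since iterating over it yields one physical line at a time.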

Extracting lines with specific character count

I have a Python script that pulls URLs from pastebin.com/archive, which has links to pastes (these have 8 random characters after pastebin.com in the URL). My current output is a .txt file with the data below in it. I only want to keep the links to pastes (example: http://pastebin.com///Y5JhyKQT) and not links to other pages such as pastebin.com/tools. This is so I can set wget to pull each individual paste.
The only way I can think of doing this is to write a bash script that counts the number of characters in each line and keeps only the lines with exactly 30 characters (the length of the URLs linking to pastes).
I have no idea how I'd go about implementing something like this using grep or awk; perhaps with a while-do loop? Any help would be appreciated!
http://pastebin.com///tools
http://pastebin.com//top.location.href
http://pastebin.com///trends
http://pastebin.com///Y5JhyKQT <<< I want to keep this
http://pastebin.com//=
http://pastebin.com///>
From the sample you posted it looks like all you need is:
grep -E '/[[:alnum:]]{8}$' file
or maybe:
grep -E '^.{30}$' file
If that doesn't work for you, explain why and provide a better sample.
This is the algorithm
Find all characters between new line characters or read one line at a time.
Count them or store them in variable and get its count. This is the length of your line.
Only process those lines that are exactly same count as you want.
Python has built-in support for both steps: reading a file one line at a time and getting the character count of a string.
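#!/usr/bin/env zsh
# Read stdin line by line and keep only the lines that are exactly 30 characters long.
while IFS= read -r aline
do
  if [[ ${#aline} == 30 ]]; then
    print -r -- "$aline"
  fi
done
This is documented in the bash man pages under the "Parameter Expansion" section.
EDIT: this solution is zsh-only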
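Since the URLs already come out of a Python script, the filter could also live there. Below is a minimal sketch of the two approaches discussed above (an 8-character alphanumeric ID after the last slash, or an exact length of 30). The names PASTE_URL, keep_paste_links and keep_by_length are invented for this example. Note that a site page whose name happens to be exactly 8 letters long would also pass both filters.

```python
import re

# A slash followed by exactly 8 letters or digits at the end of the line.
PASTE_URL = re.compile(r"/[A-Za-z0-9]{8}$")


def keep_paste_links(lines):
    """Keep lines that end in an 8-character alphanumeric paste ID."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if PASTE_URL.search(s)]


def keep_by_length(lines, length=30):
    """Keep lines that are exactly `length` characters long."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if len(s) == length]


urls = [
    "http://pastebin.com///tools",
    "http://pastebin.com//top.location.href",
    "http://pastebin.com///trends",
    "http://pastebin.com///Y5JhyKQT",
    "http://pastebin.com//=",
    "http://pastebin.com///>",
]
print(keep_paste_links(urls))  # ['http://pastebin.com///Y5JhyKQT']
print(keep_by_length(urls))    # ['http://pastebin.com///Y5JhyKQT']
```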

Split text file into multiple files

I have a large text file containing 1000 abstracts, with an empty line between each abstract. I want to split this file into 1000 text files.
My file looks like
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set "NUMBER lines per output file" to 2. Each file would have one text line and one empty line.
split -l 2 file
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files, each named after its abstract number. The awk code writes each record to a file whose name is taken from the 1st field ($1). This is done only if the number of fields (NF) is greater than 0, so the empty lines are skipped.
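You could always use the csplit command. This is a file splitter, but one based on a regex.
Something along the lines of:
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/' '{*}'
The '{*}' argument (supported by GNU csplit) repeats the pattern as many times as possible; without it the file is split only at the first empty line. It is untested and may need a little tweaking though.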
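For comparison, here is a minimal Python sketch of the same split. It assumes that abstracts are separated by one or more blank lines and that the first whitespace-separated token of each abstract is its ID, which is safe to use as a file name; split_abstracts and the '.txt' suffix are choices made for this example only.

```python
import re
from pathlib import Path


def split_abstracts(source, out_dir):
    """Write each blank-line-separated abstract to its own file.

    Returns the list of paths written, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(source).read_text(encoding="utf-8")
    written = []
    # One or more empty (or whitespace-only) lines separate the abstracts.
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if not block:
            continue
        abstract_id = block.split(None, 1)[0]  # assumed to be the ID
        target = out / f"{abstract_id}.txt"
        target.write_text(block + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Unlike the split -l 2 approach, this does not depend on every abstract being exactly one line long.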

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5-7 digits in length). I have to run some queries on our databases based on those numbers, and I don't want to go through the hundreds of entries weeding out the numbers one by one. What BASH commands can be used to parse out the number from each line (it's the only number in each line) and consolidate everything down to one line with all the numbers, comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around. When saved as CSV, certain rows appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
There are several other entries that look similar to any of the above rows. Each COMMENT stands for a single string of upper-case characters, no longer than 3 characters.
Assuming the variable is always on its own, and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here anything after DocumentId or fDocumentId will be captured. Works for the data you've presented so far, at least.
More simple than this :)
cat file.csv | cut -d "=" -f 2 | xargs
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
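For completeness, the same extraction can be sketched in Python, anchoring on the variable name rather than on the '=' sign so that trailing comments, dashes, and quotes are ignored. The function name document_ids and the sample rows (taken from the question) are for illustration only; the IDs are kept as strings so that a leading zero such as 098765 survives.

```python
import re

# Matches both "DocumentId=" and "fDocumentId=" and captures the digits.
DOC_ID = re.compile(r"DocumentId=(\d+)")


def document_ids(lines):
    """Return all document IDs found in the lines, comma separated."""
    ids = []
    for line in lines:
        match = DOC_ID.search(line)
        if match:
            ids.append(match.group(1))
    return ",".join(ids)


rows = [
    "http://www.domain.com/view.php?fDocumentId=123456",
    "DocumentId=523456",
    '"DocumentId=098765 COMMENT, COMMENT"',
    "DocumentId=898765 COMMENT",
    "DocumentId=798765- COMMENT",
    '"DocumentId=698765- COMMENT, COMMENT"',
]
print(document_ids(rows))
# 123456,523456,098765,898765,798765,698765
```

Using join also avoids the trailing comma left by the Perl one-liner above.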
