Extracting lines with specific character count - bash

I have a python script that pulls URLs from pastebin.com/archive, which has links to pastes (with 8 random alphanumeric characters after pastebin.com in the URL). My current output is a .txt file with the data below in it. I only want the links to pastes (example: http://pastebin.com///Y5JhyKQT), not links to other pages such as pastebin.com/tools. This is so I can point wget at each individual paste.
The only way I can think of doing this is writing a bash script that counts the number of characters in each line and keeps only lines with exactly 30 characters (the length of the URLs linking to pastes).
I have no idea how I'd go about implementing something like this using grep or awk, perhaps with a while-do loop? Any help would be appreciated!
http://pastebin.com///tools
http://pastebin.com//top.location.href
http://pastebin.com///trends
http://pastebin.com///Y5JhyKQT <<< I want to keep this
http://pastebin.com//=
http://pastebin.com///>

From the sample you posted it looks like all you need is:
grep -E '/[[:alnum:]]{8}$' file
or maybe:
grep -E '^.{30}$' file
If that doesn't work for you, explain why and provide a better sample.
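If you would rather use the exact-length approach you described, awk can express the character count directly; this keeps the same lines as the second grep above:
awk 'length($0) == 30' file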

This is the algorithm:
Read one line at a time, i.e. everything between newline characters.
Store the line in a variable and take its length. This is the length of your line.
Only process lines whose length is exactly the count you want.
Python has built-in support for both reading a line and getting the length of a string.

#!/usr/bin/env zsh
while IFS= read -r aline
do
    if [[ ${#aline} == 30 ]]; then
        printf '%s\n' "$aline"   # do something with the matching line
    fi
done
This is documented in the bash man pages under the "Parameter Expansion" section.
EDIT: this solution was written for zsh (note the shebang), though the same constructs work in bash.
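To tie this back to the wget goal from the question, a hedged sketch (urls.txt stands in for your script's output file): the anchored pattern keeps only URLs ending in slashes followed by exactly eight alphanumeric characters, and wget -i - reads the surviving URLs from standard input.
grep -E '^http://pastebin\.com/+[[:alnum:]]{8}$' urls.txt | wget -i -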


Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, the second time with A00000_00_A_002, etc. The output needs to be written back to the same file. Each file contains only one code, but it appears multiple times.
After some digging I have found:
perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml
but the issue is that the counter does not reset for each file.
That is, the output of the first file is, say, Q00390_01_A_1 to Q00390_01_A_7, while the second file gets Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_2 in the second.
Does anyone have any idea how to edit the above code to make it do that? I'm a total newbie, so ideally an edit to what I have would be brilliant. Thanks!
cd /users/documents/
for f in *.xml; do
    perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$&._.sprintf("%04d",++$A)/ge' "$f"
done
This matches the string facs= and any character, then "Q" or "M" followed by four or five digits, then an underscore, two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A, zero-padded to four digits. Because perl is invoked once per file, the $A counter starts over for every file, which is what makes the numbering reset.

Finding a newline in the csv file

I know there are a lot of questions about this (latest one here), but almost all of them are about how to join those broken lines back into one in a csv file, or how to remove them. I don't want to remove them; I just want to display/find the broken line (or, ideally, its line number).
Example data:
22224,across,some,text,0,,,4 etc
33448,more,text,1,,3,,,4 etc
abcde,text,number,444444,0,1,,,, etc
358890,more
,text,here,44,,,, etc
abcdefg,textds3,numberss,413,0,,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
After more searching on this, I know I shouldn't use bash to accomplish this, but rather perl. I tried examples from various websites (I don't know perl), but apparently I don't have the Text::CSV module, and I don't have permission to install it.
As I said, I have no idea how to even start looking for this, so I don't have any script. This is not a Windows file; it is very much a unix file, so we can ignore the CR problem.
Desired output:
358890,more
,text,here,44,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
or
Line 4: 358890,more
,text,here,44,,,, etc
Line 7: 985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
Much appreciated.
You can use perl to count the number of fields (commas) and append the next line until the count reaches the correct number:
perl -ne 'if(tr/,/,/<28){$line=$.;while(tr/,/,/<28){$_.=<>}print "Line $line: $_\n"}' file
Here tr/,/,/ counts the commas in the current line without changing it, $. is the current line number, and 28 stands for the number of commas in a complete record (adjust it to your data).
I do love Perl, but I don't think it is the best tool for this job.
If you want a report of all lines that do NOT have exactly the correct number of commas/delimiters, you can use awk.
For example, this command:
/usr/bin/awk -F , 'NF != 8' < csv_file.txt
will print all lines that do NOT have exactly 8 fields (i.e. 7 commas). The field separator is specified with -F, and NF is the number of fields on the current line.
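If you also want the Line N: prefix from your desired output, here is a sketch in the same spirit, assuming a complete record contains exactly 7 commas (adjust -v commas= to your data):
awk -v commas=7 '
{
    c = gsub(/,/, ",")             # count the commas on this line
    if (c < commas) {              # too few: this record is split across lines
        if (buf == "") start = NR  # remember where the broken record began
        buf = buf $0 "\n"
        total += c
        if (total >= commas) {     # record is complete again: report it
            printf "Line %d: %s", start, buf
            buf = ""; total = 0
        }
    }
}' csv_file.txt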

Create a new sequence of files from an existing sequence, along with numbering

I know this question has been asked, but I can't find more than one solution, and it does not work for me. Essentially, I'm looking for a bash script that will take a file list that looks like this:
image1.jpg
image2.jpg
image3.jpg
And then make a copy of each one, but number it sequentially backwards. So, the sequence would have three new files created, being:
image4.jpg
image5.jpg
image6.jpg
And yet, image4.jpg would be an untouched copy of image3.jpg, image5.jpg an untouched copy of image2.jpg, and so on. I have already tried the solution outlined in this stackoverflow question with no luck. I am admittedly not very far down the bash scripting path, and if I take the chunk of code in the first listed answer and make a script of it, I always get "2: Syntax error: "(" unexpected" over and over. I've tried shuffling the syntax around the ( a bit, but never with success. So either I am doing something wrong or there's a better script around.
Sorry for not posting this earlier, but the code I'm using is:
image=( image*.jpg )
MAX=${#image[*]}
for i in ${image[*]}
do
num=${i:5:3} # grab the digits
compliment=$(printf '%03d' $(echo $MAX-$num | bc))
ln $i copy_of_image$compliment.jpg
done
And I'm taking this code and pasting it into a file with nano, adding #!/bin/bash as the first line, then chmod +x script, and executing it via sh script. Of course, in my test runs I'm using files appropriately titled image1.jpg - but I was also wondering about a way to apply this script to a directory of jpegs not necessarily titled image(integer).jpg - in my file keeping structure, most of these are a single word, followed by a number, then .jpg, and it would be nice not to have to rewrite the script for each use.
Perhaps something like this. It will work well for an invocation like script image*.jpg, where the wildcard matches a set of files following a regular pattern with monotonically increasing numbers of the same width, and less well with a less regular subset of the files in the current directory. It simply assumes that the new numbers should run from the last file's number plus one up through that number plus the count of file names, assigned in reverse.
#!/bin/sh
# Extract the number from the final file name
eval lastidx=\$$#                          # last command-line argument
tmp=${lastidx#*[!0-9][0-9]}                # tail after the first digit
lastidx=${lastidx#${lastidx%[0-9]$tmp}}    # strip the non-numeric prefix
tmp=${lastidx%[0-9][!0-9]*}                # head before the last digit
lastidx=${lastidx%${lastidx#$tmp[0-9]}}    # strip the suffix, leaving the number
num=$(expr $lastidx + $#)                  # highest new number: last number + file count
width=${#lastidx}                          # zero-pad to the original width
for f; do
    pref=${f%%[0-9]*}
    suff=${f##*[0-9]}
    # Maybe show a warning if pref, suff, or width changed since the previous file
    printf "cp '$f' '$pref%0${width}i$suff'\\n" $num
    num=$(expr $num - 1)
done |
sh
This is sh-compatible; the expr arithmetic and the substring extraction up front are ugly, but Bourne-compatible. If you are fine with the built-in arithmetic and string manipulation constructs of bash, converting it to that form should be trivial.
(To be explicit, ${var%foo} returns the value of $var with foo trimmed off the end, and ${var#foo} does similar trimming from the beginning of the value. Regular shell wildcard matching operators are available in the expression for what to trim. ${#var} returns the length of the value of $var.)
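For example, with an illustrative file name:
f=image042.jpg
echo "${f%%[0-9]*}"   # image  -- the longest suffix matching [0-9]* is trimmed
echo "${f##*[0-9]}"   # .jpg   -- the longest prefix matching *[0-9] is trimmed
echo "${#f}"          # 12     -- the length of the value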
Maybe your real test data runs from 001 to 300, but here you have image1, 2, 3, and therefore you extract one digit, not three, from the filename: num=${i:5:1}
Integer arithmetic can be done in bash without calling bc.
${#image[@]} is more robust than ${#image[*]}, but it shouldn't make a difference here.
I didn't consult a dictionary, but isn't a compliment something for your girlfriend? The word you want is complement, isn't it? :)
The other command made links - to make copies, call cp.
Code:
#!/bin/bash
image=( image*.jpg )
MAX=${#image[@]}
for i in "${image[@]}"
do
    num=${i:5:1}                      # one digit: image1.jpg .. image3.jpg
    complement=$((2*MAX - num + 1))   # mirror the numbering: 1 -> 6, 2 -> 5, 3 -> 4
    cp "$i" "image$complement.jpg"
done
Most important: if it is bash, call it with bash. Best: keep the shebang (as you did), make it executable with chmod +x name, and call it as ./name. Calling it with sh name forces the wrong interpreter. If you don't make it executable, call it with bash name.
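Combining the two answers' ideas, here is a hedged bash sketch that avoids hard-coding the image prefix and the single-digit assumption. It assumes the files share one prefix and are numbered 1..N without leading zeros:
#!/bin/bash
shopt -s nullglob                # expand to nothing if no .jpg files exist
files=( *.jpg )
max=${#files[@]}
for f in "${files[@]}"; do
    pref=${f%%[0-9]*}            # prefix before the first digit, e.g. "image"
    num=${f#"$pref"}             # strip the prefix...
    num=${num%.jpg}              # ...and the extension, leaving the number
    comp=$(( 2*max - num + 1 ))  # mirror the numbering: 1 -> 2N, ..., N -> N+1
    cp -- "$f" "${pref}${comp}.jpg"
done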

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5-7 digits in length). I have to run some queries on our databases based on those numbers, and I don't want to go through the hundreds of entries weeding the numbers out one by one. What BASH commands can be used to parse the number out of each line (it's the only number in each line) and consolidate them down to one line, comma separated?
A sample (shortened) listing of the CSV spreadsheet:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the get variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around; there are entries that, when saved as CSV, appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
Along with several other entries that look similar to any of the above rows. Each COMMENT is a single string of upper-case characters no longer than 3 characters.
Assuming the variable is always on its own and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here anything after DocumentId or fDocumentId will be captured. Works for the data you've presented so far, at least.
Simpler than that :)
cat file.csv | cut -d "=" -f 2 | xargs
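Note that xargs joins the values with spaces rather than commas, and (like the original) this keeps the trailing COMMENT text from the later rows. For strictly comma-separated output of the simple rows, you could append a translation step:
cat file.csv | cut -d "=" -f 2 | xargs | tr ' ' ','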
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
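Since the number is the only one on each line, another option is to extract it directly and let paste do the joining; a sketch assuming grep with -o support and the 5-7 digit length from the question:
grep -Eo '[0-9]{5,7}' file.csv | paste -sd, -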

How to rename files keeping a variable part of the original file name

I'm trying to make a script that will go into a directory and run my own application with each file matching a regular expression, specifically Test[0-9]*.txt.
My input filenames look like this: TestXX.txt. Now, I could just use cut and chop off the Test and .txt, but how would I do this if XX weren't predefined to be two digits? What would I do if I had Test1.txt, ..., Test10.txt? In other words, how would I get the [0-9]* part?
Just so you know, I want to be able to make a OutputXX.txt :)
EDIT:
I have files with filename Test[0-9]*.txt and I want to manipulate the string into Output[0-9]*.txt
Would something like this help?
#!/bin/bash
for f in Test*.txt
do
    process < "$f" > "${f/Test/Output}"
done
Bash Shell Parameter Expansion
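For instance, ${f/Test/Output} substitutes the first occurrence of the pattern:
f=Test10.txt
echo "${f/Test/Output}"   # Output10.txt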
A good tutorial on regexes in bash is here. Summarizing, you need something like:
if [[ $filenamein =~ ^Test([0-9]*)\.txt$ ]]; then
    filenameout="Output${BASH_REMATCH[1]}.txt"
fi
and so on. The key is that, when you perform the =~ regex match, the "sub-matches" to parentheses-enclosed groups in the RE are set in the entries of the array BASH_REMATCH: the [0] entry is the whole match, [1] the first parentheses-enclosed group, etc. (Note that the regex must be unquoted in bash 3.2 and later, or it is treated as a literal string.)
You need to use round brackets around the part you want to keep,
i.e. Test([0-9]*)\.txt
The syntax for referring to these bracketed groups in the replacement varies between programs, but you'll probably find you can use \1, something like this (in sed's basic-regex syntax the brackets themselves are escaped):
s/Test\([0-9]*\)\.txt/Output\1.txt/
If you're using a unix shell, then 'sed' might be your best bet for performing the transformation.
http://www.grymoire.com/Unix/Sed.html#uh-4
Hope that helps
for file in Test[0-9]*.txt
do
    num=${file//[^0-9]/}      # delete everything that isn't a digit
    process "$file" > "Output${num}.txt"
done
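To see what that substitution does with one of the sample names:
file=Test10.txt
echo "${file//[^0-9]/}"   # 10 -- the // form deletes every non-digit character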
