Extract Tweet ID from text - bash

I have a large, 4.5M+ row CSV (commas are the separators) containing tweets. The CSV comes from some time ago, and has all manner of line breaks inside column data, characters, etc. It is likely malformed in some ways but it is difficult for me to discern exactly where and how with a file of this size.
I want to move through this CSV file as a large body of text, pull out all the Tweet IDs, and put each pulled ID into a line in a new file.
Doing this via bash, Perl, or Python will work fine. Can anyone help here? I can't seem to even find info on the parameters for a tweet ID, though the ones in this corpus all seem to be 17 digits long.

Since the only evidence in your question for a Tweet ID is that it's an integer of length 17, that is the only rule I am going to use.
Plus, I am going to use it as a hard-and-fast rule: anything that is an integer of length 17 is a Tweet ID, nothing else.
After that it's a normal regular expression search.
import re
string = '''
12345678912345678, abcd, efgh
45645645645645645, ijkl, mnop
78944556677889900, qrst, uvwx
0, y, z
'''
m = re.findall('[0-9]{17}', string)
print(m)
re.findall searches for the regular expression (first argument) in the string (second argument) and returns all matches as a list.
(a):- [0-9] means any single digit from 0 to 9
(b):- {m} means the expression that precedes it must repeat exactly m times
(a)+(b):- [0-9]{17} matches a run of 17 digits, i.e. a number of length 17. Note that it also matches the first 17 digits of any longer run of digits.
Find out more in the documentation for Python's re module.
This is as much as I can help without knowing anything more about the input file and tweet format.
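For a file with millions of rows, it may help to read one line at a time instead of holding the text in one string. Below is a minimal sketch of that idea in Ruby, the language most of this page uses. The helper name extract_ids and the sample text are made up for illustration, and the sketch assumes the input is valid in its encoding. The lookarounds reject 17-digit runs that are part of a longer number.

```ruby
require 'stringio'

# Hypothetical helper: collects every standalone 17-digit run from an IO,
# reading one line at a time so the whole file never sits in one string.
def extract_ids(io)
  ids = []
  io.each_line do |line|
    # (?<!\d) and (?!\d) reject runs that belong to a longer number
    ids.concat(line.scan(/(?<!\d)\d{17}(?!\d)/))
  end
  ids
end

# Made-up sample: one valid ID, one 18-digit number, one short number.
sample = StringIO.new("12345678912345678, abcd\n123456789123456789, too long\n0, y, z\n")
puts extract_ids(sample)
```

With a real file, the same helper could be called inside File.open, and the returned array written out one ID per line.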

Related

Line count in csv doesn't match

I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.
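To see why the two counts can differ, here is a small sketch that compares physical lines with CSV records, following the suggestion above to let the CSV class read the file. The helper names and the sample data are hypothetical, and it assumes the wrapped field is properly double-quoted.

```ruby
require 'csv'
require 'tempfile'

# Counts physical lines, which is what readlines or wc -l see.
def physical_line_count(path)
  count = 0
  File.foreach(path) { count += 1 }
  count
end

# Counts CSV records; a quoted field may span several physical lines.
def csv_row_count(path)
  count = 0
  CSV.foreach(path) { count += 1 }
  count
end

# Made-up sample: the second record has a newline inside a quoted field.
sample = "id,comment\n1,\"first half\nsecond half\"\n2,plain\n"
Tempfile.create(['sample', '.csv']) do |f|
  f.write(sample)
  f.flush
  puts "physical lines: #{physical_line_count(f.path)}"
  puts "csv records:    #{csv_row_count(f.path)}"
end
```

The header line is counted as a record here; subtract one if only data rows are wanted.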

Jmeter : Removing Spaces using RegEx

JMeter:
I have a JSON response from which I have to fetch the value of "ci".
I am using the following RegEx: ci:\s*(.*?)\" and get the following result in the RegEx tester:
Match count: 1
Match1[0]=ci: 434547"
Match1=434547
The issue is that Match1[0] contains spaces, because of which the load test reports
: Server Error - Could not convert JSON to Object
I need help correcting this RegEx.
Basically, your RegEx is fine. This is the way I would look for it too: the first group (Match[1]) would give you 434547, which is the value you are looking for. As I don't know the piece of software you are using, I have no idea why using just that match doesn't work.
Here is an idea to work around that: if the value will always be the only numeric value in the string, you could simplify the RegEx to:
\d+
This will give you a numeric value that is at least 1 digit long. If there are other numeric values in the string though, but these have different lengths, try this:
\d{m,n} --> between m and n digits long
\d{n,} --> at least n digits long
\d{0,n} --> not more than n digits long
This is not as secure / reliable as the original RegEx (since it assumes some certain conditions), but it might work in your case, because you don't have to look for groups but just use the whole matched text. Tell me if it helped!
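The difference between the whole match and the first group can be shown outside JMeter. The sketch below uses Ruby only to illustrate the regex behaviour; the helper name and the sample text are assumptions, since the real response was not posted. In JMeter's Regular Expression Extractor, a template of $1$ should select the group rather than the whole match.

```ruby
# Hypothetical helper: returns [whole match, first capture group] or nil.
def whole_and_group(text)
  m = text.match(/ci:\s*(.*?)"/)
  return nil unless m
  [m[0], m[1]]
end

# Made-up sample text shaped like the tester output in the question.
whole, group = whole_and_group('{"data":"ci: 434547"}')
puts whole.inspect  # the whole match still contains "ci: " and the quote
puts group.inspect  # the group holds only the value
```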

Ruby (on Rails) Regex: removing thousands comma from numbers

This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes a comma only if exactly three digits follow it.
(?=...) is a lookahead assertion: the pattern inside it must match at the current position, but what it matches does not become part of the matched text (and so is not replaced).
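As a quick check, the lookahead version can be run over the inputs listed in the question. This is only a sketch; the method name strip_thousands is made up for illustration.

```ruby
# Removes a comma only when exactly three digits follow it.
def strip_thousands(str)
  str.gsub(/,(?=\d{3}\b)/, '')
end

inputs = ['123', '123.45', '123,45', '1,234', '1,234.56', '12,345.67', '12,345,67']
inputs.each { |s| puts "#{s} -> #{strip_thousands(s)}" }
```

The decimal commas in 123,45 and 12,345,67 are left alone because only two digits follow them.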
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
In your examples, you can tell from the number of digits after the last separator (either , or .) that it is a decimal point, since only 2 digits follow it. In most cases, if the last group of digits does not have exactly 3 digits, you can assume that the separator in front of it is a decimal point. Another sign is a separator that appears several times in a big number, which helps tell thousands separators apart from the decimal point.
However, I can give you a string such as 123,456 or 123.456 without any context, and it is impossible to tell whether it means "123 thousand 456" or "123 point 456".
You need to scan the document for clues as to whether , is used as the thousands separator or the decimal point, and vice versa for the period. With that context, you can safely apply the same method to remove the thousands separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing about them and deciding not to support them is better than assuming things will work.
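The digit-count heuristic can be sketched as follows. It assumes what the question states (prices below one million, at most two decimal places) and additionally assumes that the period is never used as a thousands separator. The method name normalize_price is hypothetical.

```ruby
# Sketch under the assumptions above: a comma followed by exactly three
# digits is a thousands separator; a trailing comma followed by one or two
# digits is a decimal comma and becomes a period.
def normalize_price(str)
  without_thousands = str.gsub(/,(?=\d{3}\b)/, '')
  without_thousands.sub(/,(\d{1,2})\z/, '.\1')
end

['123,45', '1,234', '12,345,67', '12,345.67'].each do |s|
  puts "#{s} -> #{normalize_price(s)}"
end
```

An input such as 123,456 stays ambiguous, as described above; this sketch simply reads it as a thousands separator.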

Finding date in file, getting data after it

Help me brainstorm how I would solve this problem.
I have a file of dates with corresponding data, the format looks like this:
Date,data,data,data,data,data
Date,data,data,data,data,data
It's a plain csv file, only commas being used.
I need to be able to select a beginning date. And then get the data for the next 20 days beginning with the date selected.
Date format:
2007.05.21 (y,m,d)
So I think it would be best to search for the date. Either loading the entire file first into memory or read line by line. The file is only 1 megabyte, however I might want to do this with a 100 megabyte file as well. Is that still little?
Also I will want to do this very many times. I think I may want to keep the file in memory for the entire run of the program. So I can repeatedly access it.
After finding the date, I need to be able to get column 2 of day 1, column 4 of day 4, etc. However, there is always the same number of columns for each day, so I guess that if this is loaded into some kind of array I can always know at which array index each following day starts.
Any help would be greatly appreciated. Also any code examples provided would really help. This is not a homework problem or anything like that and I'm really new to programming.
You can use the csv library to parse your file line by line, like this:
require 'csv'
require 'date'
date_to_search = Date.new(2009, 10, 10)
# CSV.foreach yields one parsed row at a time (CSV.read ignores a block)
CSV.foreach('yourfilename.txt', col_sep: ',') do |row|
  # row is an array of strings which you can parse
  cur_date = Date.strptime(row[0], '%Y.%m.%d')
  if cur_date == date_to_search
    # you are set to read the next 19 lines
    # you can keep a counter and increment it after parsing each line (row here)
  end
  # compare and check if you need this line (and the next 19)
  # other calculations
end
As your data is sorted, Binary Search is what you want to use.
Simply put, you look up an element near the middle of your CSV, compare its date to the one you're looking for, and continue recursively in the matching half of the file (See the Wikipedia link for details).
Binary search has a runtime complexity of O(log n), which means that the number of read operations on a file containing 1,000,000 lines (Reasonable estimation for 100 MB) will never (under normal circumstances, that is, lines of different length are equally distributed) exceed 20.
Therefore, there is no need to keep the file in memory, quite the contrary. The operating system's disk cache will do the task of accelerating consecutive operations for you without running into memory shortage.
To read and process a line, you first need to find its first character, which is either the first character after a newline character (\n) or the beginning of the file. Reading multiple lines can be achieved similarly.
To parse a line, I suggest you split the line at the separation characters and/or the date's dots. This is, of course, only appropriate if the CSV comes from a trustworthy source and never changes its layout.
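As a simpler variant of the idea above, the rows can be loaded once and searched in memory with Array#bsearch_index, instead of seeking within the file on disk. This is only a sketch: the helper name rows_from and the sample data are made up, and it assumes the rows are sorted by date and that dates are zero-padded YYYY.MM.DD strings, so plain string comparison matches chronological order.

```ruby
require 'csv'

# Returns up to `days` rows starting at the first row whose date is
# greater than or equal to start_date (binary search, O(log n) comparisons).
def rows_from(rows, start_date, days = 20)
  i = rows.bsearch_index { |row| row[0] >= start_date }
  return [] if i.nil?
  rows[i, days]
end

# Made-up sample data in the format described in the question.
data = CSV.parse(<<~CSV_TEXT)
  2007.05.18,1,2,3
  2007.05.21,4,5,6
  2007.05.22,7,8,9
CSV_TEXT

window = rows_from(data, '2007.05.21', 2)
puts window.map { |r| r[0] }.inspect
```

Keeping the parsed rows in an array also suits the wish to query the same data many times during one run.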

How to get a Ruby substring of a Unicode string?

I have a field in my Rails model that has max length 255.
I'm importing data into it, and some times the imported data has a length > 255. I'm willing to simply chop it off so that I end up with the largest possible valid string that fits.
I originally tried to do field[0,255] in order to get this, but this will actually chop trailing Unicode right through a character. When I then go to save this into the database, it throws an error telling me I have an invalid character due to the character that's been halved or quartered.
What's the recommended way to chop off Unicode characters to get them to fit in my space, without chopping up individual characters?
Uh. Seems like truncate and friends like to play with chars, but not their little cousins bytes. Here's a quick answer for your problem, but I don't know if there's a more straightforward and elegant question, I mean answer:
def truncate_bytes(string, size)
  count = 0
  # keep whole characters while the running byte total still fits in size
  string.chars.take_while { |c| (count += c.bytesize) <= size }.join
end
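Another way to get the same effect is to cut at the byte level and then drop whatever partial character is left at the end. This is a sketch, not the answer's method: it uses String#byteslice and String#scrub, assumes a reasonably recent Ruby and valid UTF-8 input, and the method name truncate_to_bytes is made up.

```ruby
# Cuts the string to at most `size` bytes, then removes any incomplete
# trailing character. Note that scrub('') would also drop invalid bytes
# elsewhere in the string, hence the valid-input assumption.
def truncate_to_bytes(string, size)
  string.byteslice(0, size).scrub('')
end

text = "h\u00e9llo"                        # the second character takes 2 bytes in UTF-8
puts truncate_to_bytes(text, 2).inspect    # the cut falls inside that character
puts truncate_to_bytes(text, 3).inspect    # the whole character fits
```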
Take a look at the Chars class of ActiveSupport.
Use the multibyte proxy method (mb_chars) before manipulating the string:
str.mb_chars[0,255]
See http://api.rubyonrails.org/classes/String.html#method-i-mb_chars.
Note that until Rails 2.1 the method was "chars".
