Is there a way to read fixed-length files using csv.reader() in Python 2.x?

I have a fixed length file like:
0001ABC,DEF1234
The file definition is:
id[1:4]
name[5:11]
phone[12:15]
I need to load this data into a table. I tried to use the csv module and sliced each field by its fixed length. It works fine except for the name field.
For the name field, only the value up to ABC is loaded. The reason is:
Because I am using the csv module, it splits on the comma, treats 0001ABC as one value, and parses only up to that point.
I tried escapechar=',' while reading the file, but it removes the ',' from the data. I also tried quoting=csv.QUOTE_ALL, but that didn't work either.
import csv

with open("xyz.csv") as csvfile:
    readCSV = csv.reader(csvfile)
    writeCSV = open("sample_csv", 'w')
    output = csv.writer(writeCSV, dialect='excel', lineterminator="\n")
    for row in readCSV:
        print(row) # to debug #
        data = str(row[0])
        print(data) # to debug #
        id = data[0:4]
        name = data[5:11]
        phone = data[12:15]
        output.writerow([id, name, phone])
    writeCSV.close()
Output of the print commands:
row: ['0001ABC','DEF1234']
data: 0001ABC
Ideally, I expect to see the entire set 0001ABC,DEF1234 in the variable: data.
I can then use the parsing as mentioned in the code to break it into different fields.
Can you please let me know where I am going wrong?
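One way to sidestep the problem, sketched here under some assumptions: skip csv.reader for input, read each line as a plain string, and slice it by position. csv.writer can still be used for the output, where it quotes the embedded comma. The names FIELDS, parse_fixed_width and convert are illustrative and not part of the original code. The offsets assume the positions in the file definition are 1-based and inclusive, which would make name data[4:11] rather than data[5:11].

```python
import csv
import sys

# Layout from the question, converted from 1-based inclusive positions
# to Python's 0-based, end-exclusive slices (an assumption about the spec).
FIELDS = [("id", 0, 4), ("name", 4, 11), ("phone", 11, 15)]


def parse_fixed_width(line):
    """Slice one fixed-width record into a list of field values."""
    record = line.rstrip("\r\n")  # drop the line ending, keep inner spaces
    return [record[start:end] for _, start, end in FIELDS]


def convert(lines, out_file):
    """Write each non-blank fixed-width line to out_file as a CSV row."""
    writer = csv.writer(out_file, dialect="excel", lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(parse_fixed_width(line))


if __name__ == "__main__":
    sample = ["0001ABC,DEF1234\n"]
    convert(sample, sys.stdout)  # prints: 0001,"ABC,DEF",1234
```

With a real file, the lines could come from iterating directly over the object returned by open("xyz.csv"), since a file object yields one line per iteration without splitting on commas.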

Related

Count and print images from URL

This is my first time using Spark/Scala and I am lost.
I am supposed to write a program that takes in a URL and outputs the number of images and the name of each image file.
So far I was able to get the image count. I am doing all of this in the command prompt, which makes it quite difficult to go back and edit my def without retyping the whole thing. Is there a better alternative? It took me quite a while just to get Spark/Scala working (I would have liked to use PySpark but was unable to get them to communicate).
scala> def URLcount(url : String) : String = {
     |   var html = scala.io.Source.fromURL(url).mkString
     |   var list = html.split("\n").filter(_ != "")
     |   val rdds = sc.parallelize(list)
     |   val count = rdds.filter(_.contains("img")).count()
     |   return("There are " + count + " images at the " + url + " site.")
     | }
URLcount: (url: String)String
scala> URLcount("https://www.yahoo.com/")
res14: String = There are 9 images at the https://www.yahoo.com/ site.
So I'm assuming that after I parallelize the list I should be able to apply a filter and create a list of all the strings that contain "img src".
How would I create such a list and then print it line by line to display the image URLs?
I'm not sure Spark is a great solution for parsing HTML. Spark was created for big data (although it is general purpose). I did not find any easy way to parse HTML with Spark (though I easily found ways for both XML and JSON). This means that in this case you will print very long strings, because HTML pages are often compressed. Anyway, for this page your program will print lines like this:
<p>So I'm assuming after I parallelize the list I should be about to apply a filter and create a list of all the strings that contain "img src"
I would advise you to use Jsoup:
import org.jsoup.Jsoup
val yahoo = Jsoup.connect("https://www.yahoo.com").get
val images = yahoo.select("img[src]")
images.forEach(println)
You can use Spark for other purposes.
PS: I found 39 image tags with a src attribute on https://www.yahoo.com. It is very easy to get wrong results if you don't use a good HTML parser.
Another way: prepare your data first and then use Spark.
Sorry for my English.

json files saved with trailing "?"

I'm saving a bunch of data as a hash and saving it as a json file. Code for the saving:
def write_to_file(id, data)
  Dir.chdir(File.dirname(__FILE__)+"/specs")
  filename = "./"+id+".json"
  File.open("#{filename}", 'w') do |f2|
    f2.write(data.to_json)
  end
end
I want to save it as id.json, but the file is getting saved with a "?" at the end. For example, 199015806906670?.json where the original value of "id" is 199015806806670.
If I search for 19901580606670 I am also unable to use TAB to autocomplete.
Does anyone know why this is happening?
EDIT:
Sample from file containing the ids:
104184946332304
131321736945390
693134284084652
146974018804301
288608807960773
Code to get these:
url = File.open("newapps/curlist.txt","r")
url.each_line do |line|
  func1(line) #func1 calls write_to_file, no changes to line in func1
end
As mentioned in the comments, you likely have a newline as the last character of your id: each_line yields every line with its trailing newline still attached. That character cannot be displayed as part of a file name, so it shows up as a question mark.
Use this to remove the newline...
filename = "./"+id.chomp+".json"

Ruby String to access an object attribute

I have a text file (objects.txt) which contains objects and their attributes.
The content of the file is something like:
Object.attribute = "data"
In a different file, I am loading the objects.txt file, and if I type:
puts object.attribute it prints out data
The issue comes when I am trying to access the object and/or the attribute with a string. What I am doing is:
var = "object" + "." + "access"
puts var
It prints out object.access and not the content of it "data".
I have already tried instance_variable_get and it works, but then I have to modify objects.txt and prepend an @ to the name to make it an instance variable, and I cannot do this because I am not the owner of the objects.txt file.
As a workaround I can parse the objects.txt file and get the data that I need, but I don't want to do this, as I want to take advantage of what is already there.
Any suggestions?
Yes, puts is correctly spitting out "object.access" because you are creating that string exactly.
In order to evaluate a string as if it were Ruby code, you need to use eval().
e.g.:
var = "object" + "." + "access"
puts eval(var)
=> "data"
Be aware that doing this is quite dangerous if you are evaluating anything that potentially comes from another user.

using MultiStorage to store records in separate files

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output / 2342514224232
output / 2342514224234
The value of idstr should be the file name and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!
Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');
In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was the culprit: it did not specify the inputs/outputs. Forgoing the DEFINE statement and writing the following directly fixed my problem.
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');

Replacing part of the content in one file with data from another file

I have an html file myfile.html, which includes a script with a line like this:
var json = '[{"name":"Hydrogen","number":"1","symbol":"H","weight":"1.00794"},{"name":"Helium","number":2,"symbol":"He","weight":4.002602},{"name":"Lithium","number":3,"symbol":"Li","weight":6.941},{"name":"Beryllium","number":4,"symbol":"Be","weight":9.012182},{"name":"Boron","number":5,"symbol":"B","weight":10.811},{"name":"Carbon","number":6,"symbol":"C","weight":12.0107}]';
The string within single quotes that is assigned to variable json will actually vary. I would like to replace this string with the entire contents of another file myjson.json.
I tried with the code here:
Find and replace in a file in Ruby
and here:
search and replace with ruby regex
doing this:
replace = File.read("myjson.json")
changefile = File.read("myfile.html")
changefile.sub( %r{var json = '[^<]+';}, replace )
but it's not working. I'm not sure if it's the regex that I'm doing incorrectly, or if it's something more.
UPDATE
After reading the reply below, my first attempt was:
replace = File.read("myjson.json")
changefile = File.read("myfile.html")
changefile.sub!(%r{var json = '.+'}, replace)
puts changefile
This did the find correctly, but removed all of the var json = '' and replaced it with the contents of myjson.json. I want to keep var json = and only replace the contents between the two single quotes after it. So then I tried:
replace = File.read("myjson.json")
changefile = File.read("myfile.html")
changefile.sub!(%r{var json = '.+'}, "var json = 'replace'")
puts changefile
But that just replaced it with var json = 'replace'
I want to use the original var json = to find the location, but I don't want it to be removed.
So I did something I know is dumb and wrong, but it worked:
replace = File.read("myjson.json")
changefile = File.read("myfile.html")
changefile.sub!(%r{var json = '.+'}, "var json = 'thanksforthehelptinman'")
changefile.sub!(%r{thanksforthehelptinman}, replace)
puts changefile
Thanks for the help!
The regex isn't right because [ and ] are reserved in regex. You need to escape them:
%r{var json = '\[.+\]'}
I can't be more exact because I don't know what's in your JSON file, but that should get you into the ballpark.
Also, unless you assign the result of changefile.sub to something, the substitution will be thrown away. You can do one of these two things (here json stands for the replacement string):
changefile = changefile.sub(%r{var json = '\[.+\]'}, json)
or mutate changefile:
changefile.sub!(%r{var json = '\[.+\]'}, json)
