Processing a huge text file in Laravel

I have a huge file, 500 MB or even larger, and I need to parse the text and insert data from it into a DB. I know how to accomplish this with plain PHP (reading line by line, or in small chunks). The question is: is it possible to accomplish this using Laravel's filesystem abstraction (Laravel uses the PHP League's Flysystem library)? Thanks!

It's in the PHP League's documentation: you can use the chunk method:
https://csv.thephpleague.com/9.0/connections/output/#outputting-the-document-into-chunks
Instead of outputting, you could insert the data into the database.
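For instance, a minimal sketch that streams the CSV with league/csv and inserts rows in batches (the file path, table name, and column mapping are illustrative assumptions):

use Illuminate\Support\Facades\DB;
use League\Csv\Reader;

// stream the CSV instead of loading it all into memory
$reader = Reader::createFromPath(storage_path('app/huge.csv'), 'r');

$batch = [];
foreach ($reader->getRecords() as $record) {
    $batch[] = ['value' => $record[0]]; // map CSV columns to table columns
    if (count($batch) >= 1000) {
        DB::table('lines')->insert($batch); // one bulk insert per 1000 rows
        $batch = [];
    }
}
if ($batch !== []) {
    DB::table('lines')->insert($batch); // flush the final partial batch
}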

To read files line by line in Laravel you can use File::lines('file_test.txt'), which returns a LazyCollection, or use LazyCollection directly.
Example:
use Illuminate\Support\Facades\File;
foreach (File::lines('file_test.txt') as $line) {
    // modify the current line here
}
References:
https://laravel.com/api/9.x/Illuminate/Support/Facades/File.html#method_lines
https://laravel.com/docs/9.x/collections#lazy-collections
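Since File::lines returns a LazyCollection, you can also batch the database inserts directly; a minimal sketch (the batch size, table name, and column name are illustrative assumptions):

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\File;

// stream the file and insert 500 lines per query
File::lines(storage_path('app/file_test.txt'))
    ->chunk(500)
    ->each(function ($lines) {
        DB::table('lines')->insert(
            $lines->map(fn ($line) => ['value' => $line])->values()->all()
        );
    });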

Related

Reading from multiple TFRecord files

I am using multiple TFRecord files and want to read from them to create datasets. I am trying to build a dataset of paths with from_tensor_slices and use that dataset to read the TFRecords.
(Advantages of multiple tfRecords : https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards)
I want to know if there is an easier and proven method to do this.
file_names_dataset = tf.data.Dataset.from_tensor_slices(filenames_full)

def read(inp):
    return tf.data.TFRecordDataset(inp)

file_content = file_names_dataset.map(read)
My next step would be to parse the dataset using tf.io.parse_single_example for example.
The tf.data.TFRecordDataset constructor already accepts a list or a tensor of filenames, so you can call it directly:
file_content = tf.data.TFRecordDataset(filenames_full)
From the tf.io.parse_single_example documentation:
One might see performance advantages by batching Example protos with parse_example instead of using this function directly.
Hence, I would recommend batching your dataset before mapping tf.io.parse_example over it:
tf.data.TFRecordDataset(
    filenames_full
).batch(
    my_batch_size
).map(
    lambda batch: tf.io.parse_example(batch, my_features)
)
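Putting it together, a minimal runnable sketch (the feature spec my_features and the batch size are illustrative assumptions):

import tensorflow as tf

filenames_full = ["part-0.tfrecord", "part-1.tfrecord"]  # your shard paths

# illustrative feature spec; replace with the features your records contain
my_features = {
    "foo": tf.io.FixedLenFeature([], tf.int64),
    "bar": tf.io.FixedLenFeature([], tf.float32),
}

dataset = (
    tf.data.TFRecordDataset(filenames_full)  # a list of paths works directly
    .batch(32)                               # illustrative batch size
    .map(lambda batch: tf.io.parse_example(batch, my_features))
    .prefetch(tf.data.AUTOTUNE)
)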
If you want a complete example, in this post I share my input pipeline (reading from many TFRecord files).
Kind regards, Alexis.

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of one of their fields. (We can obviously transform this into tuples with only one field holding the file path, or into a single tuple with one field holding a delimiter-separated (say comma-separated) string.)
Now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried the following:
data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity lets assume, I have single tuple with one field having file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby, ...) to read the file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script for use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your Pig call would look like this (quote the parameter so the shell does not brace-expand it):
pig -f script.pig -param "input={hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}"
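A minimal bash sketch of that glue step (the HDFS location of the list file is an illustrative assumption):

#!/usr/bin/env bash
# read the file that holds the HDFS paths, join them with commas,
# and wrap the result in braces to form a Pig glob
FILE_LIST=/user/kailashgupta/file_list.txt   # illustrative path
GLOB="{$(hdfs dfs -cat "$FILE_LIST" | paste -sd, -)}"
pig -f script.pig -param "input=$GLOB"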
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file name from the Configuration's mapred.input.dir property), then update the same Configuration's mapred.input.dir with the retrieved file names, and finally return the result from the wrapped InputFormat (in my case, TextInputFormat).
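A rough Java sketch of that getSplits override (class names are from the description above; the exact Hadoop API details are assumptions, so treat this as a shape rather than a drop-in implementation):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// wrapper InputFormat: reads the real input paths from some_temporary_file,
// then delegates the actual splitting to TextInputFormat
public class MyInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        // at this point mapred.input.dir points at some_temporary_file
        Path listDir = new Path(conf.get("mapred.input.dir"));
        FileSystem fs = listDir.getFileSystem(conf);
        List<String> realInputs = new ArrayList<String>();
        for (FileStatus part : fs.listStatus(listDir)) {
            if (part.getPath().getName().startsWith("_")) {
                continue; // skip _SUCCESS and similar marker files
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(part.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                realInputs.add(line.trim()); // one HDFS path per line
            }
            reader.close();
        }
        // repoint the job at the actual data files, then delegate
        conf.set("mapred.input.dir", String.join(",", realInputs));
        return super.getSplits(job);
    }
}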
Note:
1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. A doubt may arise: what if the LOAD statement executes before the STORE? This cannot happen, because if STORE and LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail you may read the section "Store-load sequences" on the Pig wiki.

How to load a file with a JSON array per line in Pig Latin

An existing script creates text files with an array of JSON objects per line, e.g.,
[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…
I would like to load this data in Pig, exploding the arrays and processing each individual object.
I have looked at using the JsonLoader in Twitter’s Elephant Bird to no avail. It doesn’t complain about the JSON, but I get “Successfully read 0 records” when running the following:
register '/tmp/elephant-bird/core/target/elephant-bird-core-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/usr/local/lib/json-simple-1.1.1.jar';
a = load '/path/to/file.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
dump a;
I have also tried loading the file as normal, treating each line as a single-column chararray, and then trying to parse it as JSON, but I can't find a pre-existing UDF which seems to do the trick.
Any ideas?
As Donald said, you should use a UDF here. At Xplenty we wrote JsonStringToBag to complement Elephant Bird's JsonStringToMap.
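A sketch of how that could look in the script (the jar name and the UDF's package path are assumptions; check the pig-json project for the actual class name):

register 'pig-json.jar';  -- assumed jar name
lines = load '/path/to/file.json' as (line:chararray);
-- explode each line's JSON array into one map per object
objects = foreach lines generate
    flatten(com.xplenty.json.JsonStringToBag(line)) as object:map[];
dump objects;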

PIG Latin : While loading how to discard the first line in any file?

I've been using Pig for some time and wanted to know how to skip the first line while loading a file. I have a file which has headers, so I should ignore the first line and go on from the next line to do the processing on the date columns and all. How do I go about this?
Thanks
If you have Pig version 0.11, you could try this:
input_file = LOAD 'input' USING PigStorage(',') AS (row1:chararray, row2:chararray);
ranked = RANK input_file;
NoHeader = FILTER ranked BY (rank_input_file > 1);
New_input_file = FOREACH NoHeader GENERATE row1, row2;
New_input_file should now have your data without the header. Note that the rank operator is new in Pig 0.11, so this will not work with earlier versions.
EDIT: note this solution only works with a single file; if you are loading a directory instead, try something else.
The given solution works well if you load just one file. However, if you load all the files in a directory (which is also possible by simply making the input a directory path), it will only cut off the top of the first file.
For removing the header in each file, you will probably want to use CSVExcelStorage from Piggybank (register the piggybank jar first):
register '/path/to/piggybank.jar';
my_input = load 'inputfileordir' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'default', 'NOCHANGE', 'SKIP_INPUT_HEADER');

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass this imported dataset into another function? I tried doing diz... but it couldn't pick up diz; it doesn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
    {'*.txt', 'Text (*.txt)'; ...
     '*.xls', 'Excel (*.xls)'; ...
     '*.*', 'All Files (*.*)'}, ...
    'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = fread(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails, RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So you use either uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and the user.
With DLMREAD you can only read numerical data from the file. You can also skip a number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in an Excel-like style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string; don't pass a file handle or any data. If it's not a string, you will get "Filename must be a string". Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.
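To tie it together, a minimal MATLAB sketch of the dlmread route (myProcess is a placeholder for the function the data should be passed to):

% pick a file, read the numeric data, pass the matrix on
[file_input, pathname] = uigetfile({'*.txt', 'Text (*.txt)'}, 'Select file');
file_input = fullfile(pathname, file_input);  % full path, as suggested above
M = dlmread(file_input);   % numeric data only
X = myProcess(M);          % pass the matrix, not the file name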
