Magento dataflow takes too long to load CSV file - performance

I have a large CSV file containing inventory data to update (more than 35,000 rows). I created a method that extends Mage_Catalog_Model_Convert_Adapter_Product to do the inventory update. Then I used an Advanced Profile, which calls that method, to run the update.
It works very well when I run the profile manually. The problem is that when I use an extension that runs the profile from a cronjob, the system takes far too long to load and parse the CSV file. I set the cronjob to run every day at 6:15 am, but the first row of the file isn't processed until 1:20 pm the same day; it takes about 7 hours just to load the file.
That makes the process stop somewhere in the middle, with less than a third of the records processed. I've been getting frustrated trying to figure out why and to solve the problem, but no luck.
Any ideas would be appreciated.

Varien_File_Csv is the class that parses your CSV file, and it takes too much memory.
Here is a function to log the amount of memory used and the peak memory usage:
public function log($msg, $level = null)
{
    if (is_null($level)) {
        $level = Zend_Log::INFO;
    }

    $units = array('b', 'Kb', 'Mb', 'Gb', 'Tb', 'Pb');

    // Current memory usage
    $m   = memory_get_usage();
    $i   = (int) floor(log($m, 1024));
    $mem = round($m / pow(1024, $i), 2);

    // Peak memory usage
    $mp   = memory_get_peak_usage();
    $ip   = (int) floor(log($mp, 1024));
    $memp = round($mp / pow(1024, $ip), 2);

    $msg = sprintf('(mem %4.2f %s, %4.2f %s) ', $mem, $units[$i], $memp, $units[$ip]) . $msg;
    Mage::log($msg, $level, 'my_log.log', 1);
}
$MyClass->log('With every message I log the memory is closer to the sky');
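If the log shows memory climbing while the file is parsed, one workaround outside the dataflow adapter (just a minimal sketch, not part of the original answer) is to stream the CSV row by row with PHP's fgetcsv(), so only one row is ever held in memory. The file path and column layout below are assumptions:

$path = 'var/import/inventory.csv'; // assumed location of the inventory CSV

$handle = fopen($path, 'r');
if ($handle === false) {
    Mage::throwException('Cannot open ' . $path);
}

$header = fgetcsv($handle); // e.g. array('sku', 'qty')

while (($row = fgetcsv($handle)) !== false) {
    $data = array_combine($header, $row);
    // Update the stock for $data['sku'] here, one row at a time.
    // Only the current row is held in memory, so usage stays flat.
}

fclose($handle);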
You could also split your CSV (keeping the same filename) and call the job multiple times. You'll need to make sure a previous call doesn't run at the same time as a newer one; a sketch of that idea follows.
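For illustration only (the chunk size and the numbered chunk file names are assumptions; the answer itself suggests reusing the same filename), a lock file via flock() keeps overlapping cron runs from colliding while the big CSV is split into smaller pieces:

$lock = fopen(sys_get_temp_dir() . '/inventory_import.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit("A previous import is still running\n"); // don't let cron runs overlap
}

// Split the source file into chunks of 5,000 data rows each.
$source = fopen('var/import/inventory.csv', 'r');
$header = fgets($source);
$part   = 0;
$count  = 0;
$out    = null;

while (($line = fgets($source)) !== false) {
    if ($count % 5000 === 0) {
        if ($out) {
            fclose($out);
        }
        $out = fopen(sprintf('var/import/inventory_part_%03d.csv', ++$part), 'w');
        fwrite($out, $header);
    }
    fwrite($out, $line);
    $count++;
}

if ($out) {
    fclose($out);
}
fclose($source);

// Each chunk file can now be fed to the profile in its own run.
flock($lock, LOCK_UN);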
Thanks

Related

Laravel tagging overhead leaving behind significantly large reference sets using redis

I am using Laravel 9 with the Redis cache driver. However, I have an issue where the internal standard_ref and forever_ref maps that Laravel uses to manage tagged cache exceed 10MB.
These maps consist of numerous keys, 95% of which have already expired/decayed and no longer exist; they keep growing in size and have a TTL of -1 (never expire).
Other than "not using tags", has anyone else encountered and overcome this? I found this in the slow log of Redis Enterprise, which is what led me to realize it was happening.
I checked the keys via SCAN and can confirm it's a massive set of cache misses. It seems highly inefficient and expensive to constantly transmit 10MB back and forth just to find one key within the map.
The script below quickly and efficiently removes expired keys from the SET data type that Laravel uses to manage tagged cache.
use Illuminate\Support\Facades\Cache;

function flushExpiredKeysFromSet(string $referenceKey): void
{
    /** @var \Illuminate\Cache\RedisStore $store */
    $store = Cache::store()->getStore();

    $lua = <<<LUA
local keys = redis.call('SMEMBERS', '%s')
local expired = {}
for i, key in ipairs(keys) do
    local ttl = redis.call('ttl', key)
    if ttl == -2 or ttl == -1 then
        table.insert(expired, key)
    end
end
if #expired > 0 then
    redis.call('SREM', '%s', unpack(expired))
end
LUA;

    $store->connection()->eval(sprintf($lua, $referenceKey, $referenceKey), 1);
}
To show the calls that this Lua script generates, from the sample above:
10:32:19.392 [0 lua] "SMEMBERS" "63c0176959499233797039:standard_ref{0}"
10:32:19.392 [0 lua] "ttl" "i-dont-expire-for-an-hour"
10:32:19.392 [0 lua] "ttl" "aa9465100adaf4d7d0a1d12c8e4a5b255364442d:i-have-expired{1}"
10:32:19.392 [0 lua] "SREM" "63c0176959499233797039:standard_ref{0}" "aa9465100adaf4d7d0a1d12c8e4a5b255364442d:i-have-expired{1}"
I use a custom cache driver that wraps the RedisTaggedCache class; when cache is added to a tag, I dispatch a job that runs the PHP script above, but only once within a given period, by utilizing a 24-hour cache lock.
Here is how I obtain the reference key that is later passed into the cleanup script.
public function dispatchTidyEvent(mixed $ttl)
{
    $referenceKeyType = $ttl === null ? self::REFERENCE_KEY_FOREVER : self::REFERENCE_KEY_STANDARD;
    $lock = Cache::lock('tidy:'.$referenceKeyType, 60 * 60 * 24);

    // if we were able to get a lock, then dispatch the event
    if ($lock->get()) {
        foreach (explode('|', $this->tags->getNamespace()) as $segment) {
            dispatch(new \App\Events\CacheTidyEvent($this->referenceKey($segment, $referenceKeyType)));
        }
    }

    // otherwise, we'll just let the lock live out its life to prevent repeating this numerous times per day
    return true;
}
Remember that a "cache lock" is simply a SET/GET, and Laravel is already responsible for many of those on every request to manage its tags, so adding a lock to achieve this "once per day" behaviour only adds negligible overhead.

My Roblox Studio script sometimes has errors and I don't know why

So... I've been coding a GUI that shows a player's currency. The DataStore API works perfectly, but the LocalScript doesn't (it's local because otherwise it would update for everyone each time any player's currency changes, which would be a mess and the opposite of what I want).
Sometimes it loads the currency into the GUI, but other times it just stays on the original text "Label" instead of my current currency (4600).
Here are screenshots as proof: what normally happens (and should always happen), and what sometimes happens (and shouldn't).
Here's the script. I've tried putting waits at the start, but the original code is inside the while true do loop:
wait(game.Players.LocalPlayer:WaitForChild("Data")
wait(game.Players.LocalPlayer.Data:WaitForChild("Bells"))
while true do
    script.Parent.TextLabel.Text = game.Players.LocalPlayer:WaitForChild("Data"):WaitForChild("Bells").Value
    wait() --wait is for not making the loop break and stop the whole script
end
Well, if you want to check that the data really is in the player, here's the script; it requires an API (DataStore2):
--[Animal Crossing Roblox Edition Data Store]--
--Bryan99354--
--Module not mine--
--Made with a AlvinBlox tutorial--

--·.·.*[Get Data Store, do not erase]*.·.·--
local DataStore2 = require(1936396537)

--[Default Values]--
local DefaultValue_Bells = 300
local DefaultValue_CustomClothes = 0

--[Data Store Functions]--
game.Players.PlayerAdded:Connect(function(player)
    --[Data stores]--
    local BellsDataStore = DataStore2("Bells", player)
    local Data = Instance.new("Folder", player)
    Data.Name = "Data"
    Bells = Instance.new("IntValue", Data)
    Bells.Name = "Bells"
    local CustomClothesDataStore = DataStore2("CustomClothes", player)
    local CustomClothes = Instance.new("IntValue", Data)
    CustomClothes.Name = "CustomClothes"

    local function CustomClothesUpdate(UpdatedValue)
        CustomClothes.Value = CustomClothesDataStore:Get(UpdatedValue)
    end
    local function BellsUpdate(UpdatedValue)
        Bells.Value = BellsDataStore:Get(UpdatedValue)
    end

    BellsUpdate(DefaultValue_Bells)
    CustomClothesUpdate(DefaultValue_CustomClothes)
    BellsDataStore:OnUpdate(BellsUpdate)
    CustomClothesDataStore:OnUpdate(CustomClothesUpdate)
end)

--[test and reference functions]--
workspace.TestDevPointGiver.ClickDetector.MouseClick:Connect(function(player)
    local BellsDataStore = DataStore2("Bells", player)
    BellsDataStore:Increment(50, DefaultValue_Bells)
end)
workspace.TestDevCustomClothesGiver.ClickDetector.MouseClick:Connect(function(player)
    local CustomClothesDataStore = DataStore2("CustomClothes", player)
    CustomClothesDataStore:Increment(50, DefaultValue_CustomClothes)
end)
The code that creates "Data" and "Bells" is located under the "Data stores" comment.
The only script that has the issue is the short one, for no apparent reason :<
I hope that you can help me :3
@Night94 I tried your script, but it also failed sometimes.
The syntax in your LocalScript is a little off with the waits. With that fixed, it works every time. Also, I would use an event handler instead of updating the value with a loop:
game.Players.LocalPlayer:WaitForChild("Data"):WaitForChild("Bells").Changed:Connect(function(value)
    script.Parent.TextLabel.Text = value
end)

Laravel multiple tasks simultaneously

I need to process several image files from a directory (an S3 directory). The process reads the id and type that are encoded in each filename (e.g. 001_4856_0-P-0-A_.jpg); the files are already stored in the directory at the moment the process is invoked (I'm using cron and the scheduler, and that part works great), and the objective of the process is to store the info in a database.
The process itself works great, but my problem is the number of files in the directory, because every second many more files are added. The process spends about 0.19 sec per file, but roughly 15,000 files are added per minute, so I think running the same original process multiple times simultaneously (about 10 - 40 instances) could do the job.
I need some advice or ideas.
First, how do I launch multiple instances of the original process at the same time?
Second, how do I get only the filenames that haven't been selected yet? Because the process currently takes the filenames with:
$recibidos = Storage::disk('s3recibidos');
$files = $recibidos->files();

if (count($files) <= 0) {
    $lognofile = ['Archivos' => 'No hay archivos para procesar'];
    $orderLog->info('ImagesLog', $lognofile);
} else {
    if (Image::count() == 0) {
        $last_record = 1;
    } else {
        $last_record = Image::latest('id')->pluck('id')->first() + 1;
    }

    $i = $last_record;
    $fotos_sin_info = 0;

    foreach ($files as $file) {
        $datos = explode('_', $file);
        $tipos = str_replace('-', '', $datos[2]);

        Image::create([
            'client_id' => $datos[0],
            'tipo' => $tipos,
        ]);

        $recibidos->move($file, '/procesar/'.$i.'.jpg');
        $i++;
    }
}
But I haven't figured out how to retrieve only the files that haven't been selected yet.
Thanks for your comments.
Using multi-threaded programming in PHP is possible and has been discussed on SO: How can one use multi threading in PHP applications.
However, this is generally not the most obvious choice for standard applications. The right solution for your situation will depend on the exact use case.
Did you consider a solution using queues?
https://laravel.com/docs/5.6/queues
Or the scheduler?
https://laravel.com/docs/5.6/scheduling
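As a rough sketch of the queue approach (not tested against your setup; the job class name, the App\Image model namespace, and naming the moved file after the model id are assumptions based on the code in the question), you could dispatch one queued job per file and run as many queue workers as you need in parallel:

namespace App\Jobs;

use App\Image;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class ProcessImageFile implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $file;

    public function __construct(string $file)
    {
        $this->file = $file;
    }

    public function handle()
    {
        // Same parsing as in the question: id and type are encoded in the filename.
        $datos = explode('_', $this->file);
        $tipos = str_replace('-', '', $datos[2]);

        $image = Image::create([
            'client_id' => $datos[0],
            'tipo'      => $tipos,
        ]);

        // Moving the file right away also marks it as "taken",
        // so other workers won't pick it up again.
        Storage::disk('s3recibidos')->move($this->file, '/procesar/'.$image->id.'.jpg');
    }
}

The scheduled command would then only list the files and call ProcessImageFile::dispatch($file) for each one; php artisan queue:work can be started 10 - 40 times (or managed with something like Supervisor) so jobs are processed simultaneously, which addresses the first question as well.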

TensorFlow: Reading images in queue without shuffling

I have a training set of 614 images which have already been shuffled. I want to read the images in order in batches of 5. Because my labels are arranged in the same order, any shuffling of the images when being read into the batch will result in incorrect labelling.
These are my functions to read and add the images to the batch:
# To add files from queue to a batch:
def add_to_batch(image):
    print('Adding to batch')
    image_batch = tf.train.batch([image], batch_size=5, num_threads=1, capacity=614)
    # Add to summary
    tf.image_summary('images', image_batch, max_images=30)
    return image_batch

# To read files in queue and process:
def get_batch():
    # Create filename queue of images to read
    filenames = [('/media/jessica/Jessica/TensorFlow/StreetView/training/original/train_%d.png' % i) for i in range(1, 614)]
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False, capacity=614)
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)

    # Read and process image
    # Image is 500 x 275:
    my_image = tf.image.decode_png(value)
    my_image_float = tf.cast(my_image, tf.float32)
    my_image_float = tf.reshape(my_image_float, [275, 500, 4])
    return add_to_batch(my_image_float)
This is my function to perform the prediction:
def inference(x):
    < Perform convolution, pooling etc. >
    return y_conv
This is my function to calculate loss and perform optimisation:
def train_step(y_label, y_conv):
    """ Calculate loss """
    # Cross-entropy
    loss = -tf.reduce_sum(y_label * tf.log(y_conv + 1e-9))

    # Add to summary
    tf.scalar_summary('loss', loss)

    """ Optimisation """
    opt = tf.train.AdamOptimizer().minimize(loss)

    return loss
This is my main function:
def main():
    # Training
    images = get_batch()
    y_conv = inference(images)
    loss = train_step(y_label, y_conv)

    # To write and merge summaries
    writer = tf.train.SummaryWriter('/media/jessica/Jessica/TensorFlow/StreetView/SummaryLogs/log_5', graph_def=sess.graph_def)
    merged = tf.merge_all_summaries()

    """ Run session """
    sess.run(tf.initialize_all_variables())
    tf.train.start_queue_runners(sess=sess)
    print "Running..."

    for step in range(5):
        # y_1 = <get the correct labels here>

        # Train
        loss_value = sess.run(train_step, feed_dict={y_label: y_1})
        print "Step %d, Loss %g" % (step, loss_value)

        # Save summary
        summary_str = sess.run(merged, feed_dict={y_label: y_1})
        writer.add_summary(summary_str, step)

    print('Finished')

if __name__ == '__main__':
    main()
When I check my image_summary the images do not seem to be in sequence. Or rather, what is happening is:
Images 1-5: discarded, Images 6-10: read, Images 11-15: discarded, Images 16-20: read etc.
So it looks like I am getting my batches twice, throwing away the first one and using the second one? I have tried a few remedies, but nothing seems to work. I feel like I am misunderstanding something fundamental about calling images = get_batch() and sess.run().
Your batch operation is a FIFOQueue, so every time you use its output, it advances the state.
Your first session.run call uses images 1-5 in the computation of train_step; your second session.run asks for the computation of image_summary, which pulls images 6-10 and uses them in the visualization.
If you want to visualize things without affecting the state of the input, it helps to cache queue values in variables and define your summaries with those variables as inputs rather than depending on the live queue.
(image_batch_live,) = tf.train.batch([image], batch_size=5, num_threads=1, capacity=614)

image_batch = tf.Variable(
    tf.zeros((batch_size, image_size, image_size, color_channels)),
    trainable=False,
    name="input_values_cached")

advance_batch = tf.assign(image_batch, image_batch_live)
So now your image_batch is a static value which you can use both for computing the loss and for visualization. Between steps you would call sess.run(advance_batch) to advance the queue.
A minor wrinkle with this approach: the default saver will save your image_batch variable to the checkpoint. If you ever change your batch size, then your checkpoint restore will fail with a dimension mismatch. To work around this you would need to specify the list of variables to restore manually, and run initializers for the rest.

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)

    stream = streamingContext.textFileStream("file:///test/input")

    stream.foreachRDD(bulk_load)

    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
    #???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well and successfully bulk loads data (using Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum and parent node and table name
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    # Split the input into individual lines, then convert each CSV line to key-value pairs
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)

    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # Split on commas.
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n>=1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {

    public static void main(String[] args)
    {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
# Needs: from py4j.java_gateway import JavaGateway
def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}  # "org.apache.hadoop.hbase.client.Put"

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)

    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)

        # The file has now been written, but HBase doesn't know about it

        # Get a link to Py4J
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    keys = conf.keys()
    vals = conf.values()
    for i in range(len(keys)):
        config.set(keys[i], vals[i])
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it, and once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # Sort in ascending order
