Thursday, September 23, 2010

Data Scaling with Producer Consumer



Most applications that work with data eventually need to process more data than their initial solution was built for, so they start thinking about scaling. Here I will try to explain what not to do and what I would recommend.

First we have to identify the components:


Usually, at the start, an application is built to process data synchronously: the same process gets the data, processes it, and stores the result.


Due to data growth or some other reason, we reach the point where we can't process data as fast as new requests arrive, so we start falling behind. The first approach is usually to separate the producer and the consumer, since they are logically different things and one is usually faster than the other.

Now we have asynchronous processing: the Producer brings data in and the Consumer processes it.

The problem is that the most frequent solution is for the Producer to store data into a table and for the Consumer to go look for data in that table and process it. After the work is completed, the record is marked as processed. As we get more and more data, it becomes harder and harder to find the rows that still have to be processed.
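A minimal sketch of that pattern (table, column and sequence names here are illustrative, not from a real system):

    -- Producer inserts work with a status flag
    INSERT INTO work_items (id, payload, processed)
    VALUES (work_seq.NEXTVAL, :payload, 'N');

    -- Consumer scans for unprocessed rows; this scan wades
    -- through more and more processed rows as the table grows
    SELECT id, payload FROM work_items WHERE processed = 'N';

    -- The row is flagged after processing, but never removed
    UPDATE work_items SET processed = 'Y' WHERE id = :id;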


The solution is to put a queue table between the producer and the consumer. The Producer stores data into the queue; the Consumer picks it up, does the processing, and stores the result into the storage table. The queue record is removed after processing.
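A sketch of the queue variant, with the same illustrative names:

    -- Producer enqueues
    INSERT INTO work_queue (id, payload)
    VALUES (work_seq.NEXTVAL, :payload);

    -- Consumer stores the result and removes the queue row
    -- in one transaction, so the queue table stays small
    INSERT INTO results (id, result) VALUES (:id, :result);
    DELETE FROM work_queue WHERE id = :id;
    COMMIT;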


To scale Producers, all we have to do is bring up more instances of the same process.


The queue table needs a unique identifier, information about which Consumer owns the record, the time when the record is ready for processing, and a release version number. To scale the queue, all we have to do is create a new one - scale horizontally.
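One possible layout for such a queue table (assumed, not taken from a real system):

    CREATE TABLE work_queue (
      id          NUMBER PRIMARY KEY,            -- unique identifier
      consumer_id VARCHAR2(30),                  -- which Consumer owns the record
      ready_time  DATE DEFAULT SYSDATE NOT NULL, -- when the record is ready for processing
      release_no  NUMBER NOT NULL,               -- release version number
      payload     VARCHAR2(4000)
    );

Scaling horizontally then just means creating work_queue_2, work_queue_3, ... and pointing additional producers and consumers at them.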


The real problem is how to scale Consumers.



The first solution is usually to create more instances of the same process. But Oracle reads are read consistent, so most likely all consumers will get the same records.
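To see why, note that every consumer runs the same query against the same consistent snapshot:

    -- All N consumers run this and all see the same candidate rows
    SELECT id, payload
      FROM work_queue
     WHERE ready_time <= SYSDATE
       AND ROWNUM <= 10;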

To avoid different processes updating the same rows, developers implement optimistic locking: before updating, a consumer checks that a column value is still the same as when it picked the row up.
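A sketch of such a check, assuming a row_version column that is not part of the queue layout above:

    -- Claim the row only if nobody changed it since we read it
    UPDATE work_queue
       SET consumer_id = :me,
           row_version = row_version + 1
     WHERE id = :id
       AND row_version = :version_we_read;
    -- SQL%ROWCOUNT = 0 means another consumer got there first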

This prevents multiple updates, but the losing consumers' work is still wasted, and it will not scale.


The next solution, which prevents that first problem, is to lock records so another consumer doesn't repeat the work. One Consumer picks up the data and locks the records. The remaining N-1 Consumers come looking for the same records and try to lock them. This causes the consumers to pile up waiting on each other's locks, which is easy to identify by lock waits or high system CPU.
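The blocking version looks like this:

    -- Consumer 1 locks the rows; Consumers 2..N all block here
    -- until Consumer 1 commits or rolls back
    SELECT id, payload
      FROM work_queue
     WHERE ROWNUM <= 10
       FOR UPDATE;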


To avoid this, "select for update nowait" is used. Now we have prevented the blocking, but we still didn't scale: because of Oracle's consistent reads, we have to get our consumers to look at different records.
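For example:

    -- Fails immediately with ORA-00054 instead of waiting,
    -- so the consumer can back off or try other rows
    SELECT id, payload
      FROM work_queue
     WHERE ROWNUM <= 10
       FOR UPDATE NOWAIT;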


We have to iterate through the data. Consumer I will get rows 1-M. Consumers II-N will come, see rows 1-M locked, and next time go for rows (M+1)-(2M), and so on. One Consumer will get those rows and the next will have to increment again. This will scale, but it is not efficient: Consumer N will have to read rows 1-M, (M+1)-(2M), ..., ((N-1)M+1)-(NM) - lots of data.
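A sketch of that iteration, with :m as the batch size and :low starting at the lowest unprocessed id (both names assumed):

    -- Each attempt tries the next block of M ids; on ORA-00054
    -- the consumer moves :low forward by :m and tries again
    SELECT id, payload
      FROM work_queue
     WHERE id BETWEEN :low AND :low + :m - 1
       FOR UPDATE NOWAIT;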


To avoid this we have to use "select for update skip locked". Now we can process much more data just by increasing the number of Consumers.
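For example:

    -- Locked rows are silently skipped, so each consumer
    -- gets its own set of rows without blocking or rescanning
    SELECT id, payload
      FROM work_queue
     WHERE ready_time <= SYSDATE
       FOR UPDATE SKIP LOCKED;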


In the next blog post I will try to explain how to iterate through the data without needing to use "select for update".
