IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 18, NO. 1, FEBRUARY 2003

Application of Actor-Critic Learning Algorithm for Optimal Bidding Problem of a Genco

G. R. Gajjar, S. A. Khaparde, Senior Member, IEEE, P. Nagaraju, and S. A. Soman

Manuscript received February 12, 2001; revised April 9, 2002. This work was supported by the Department of Science and Technology, Government of India, under Grant III.5(135)/2001-SERC-Engg. The authors are with the Department of Electrical Engineering, Indian Institute of Technology, Bombay 400076, India (e-mail: [email protected]). Digital Object Identifier 10.1109/TPWRS.2002.807041

Abstract—Optimal bidding for a Genco in a deregulated power market is an involved task. The problem is formulated in the framework of a Markov decision process (MDP), a discrete stochastic optimization method. When the time span considered is 24 h, the temporal difference method becomes attractive for application. The cumulative profit over the span is the objective function to be optimized. The temporal difference technique and the actor-critic learning algorithm are employed, and an optimal strategy is devised to maximize the profit. The market-clearing system is included in the formulation. Simulation cases of three, seven, and ten participants are considered and the obtained results are discussed.

Index Terms—Actor-critic learning algorithm, bidding strategies, energy auction, Markov decision process.

I. INTRODUCTION

DEREGULATION of the power industry has become an established practice in many parts of the world. The energy auction plays an important role in the operation of the deregulated power system. The power-exchange bidding mechanism requires each participant to submit bids for all 24 h of a day as a block. Formulating guidelines for making an optimal bid is an involved and challenging task. The bid is in the form of points of a piecewise linear curve, with energy level in megawatt hours on one axis and price in dollars per megawatt hour on the other. An important objective is to find the hourly generation schedule, coupled with the competitive bid price, such that the total profit earned by a particular supplier is maximized and the risk involved is minimized.

A basic price-based auction mechanism is proposed in [1]. An auctioneer matches the buyers' and sellers' bids to find the market clearing price (MCP) [2]. Game theory is used for optimal bidding in the hourly auction in [3] and [4]. While game theory is a good tool for decision making under uncertainties, it is difficult to model the 24-h bidding mechanism as a game theory problem. A modified method using game theory that takes into account the varying load demand is discussed in [5]. It is shown in [5] that a judicious choice of a larger number of points on the cost curve leads to a significant increase in payoff values. Dynamic programming is used in [2] for revenue-adequate bidding. A genetic-algorithm (GA)-based method is described in [6]; this approach is effective only if the market is not volatile. Reference [7] proposes a method in which a set of differential equations that specifies the necessary condition of optimality is solved. Optimization-based bidding strategies are proposed in [8]. The optimal bidding is divided into two optimization problems, one for a participant and one for the independent system operator (ISO). The ISO subproblem is deterministic and the participants' subproblem is stochastic.
In [9], the problem of optimal bidding is modeled as a Markov decision process (MDP) in which the load is considered on a weekly basis with only peak and offpeak levels. Since only two load levels in 24 h are considered, the computations are significantly reduced, but at the cost of accuracy. A three-supplier system is considered in [9]. The value-equation iteration, based on backward dynamic programming, is a function of the transition probabilities, and the evolution of the transition probabilities is left to the decision maker. Here, we propose to consider the load for all 24 h. Reinforcement learning allows the transition probabilities to evolve as a function of the temporal difference error, as described in Section II.

The bidding in the energy auction is a daily process. Every day, each participant has to submit its bids for 24 h. Under these conditions, the bidding has to be viewed as a discrete time-step, multiple-stage process, and the objective of each participant is the long-term gain. This can be naturally formulated as an MDP. An MDP allows a systematic solution to a multiple-stage probabilistic decision-making problem. Game theory, if applied for 24 h, separates the problem into 24 independent subproblems and, hence, is not able to handle a multistage time-dependent optimization problem. As pointed out in [9], a state with seven elements and a reasonable ten allowable values for each element would result in 10^7 states. Even though all of the formulations require large computations, the MDP formulation is the most suitable for a multistage time-dependent optimization problem when accurate results are needed. It is therefore worthwhile to seek the solution of the optimal bidding problem in the learning-method domain. Methods such as reinforcement learning are known to give satisfactory results in many difficult-to-learn problems with a large number of states.

II. MARKOV DECISION PROCESS

A discrete-time MDP is a multistate process in which the states have the Markov property and the decision-making agent controls the actions, which change the state of the system. A control is applied to each state while changing from one state to another. The state transition is stochastic, with probabilities called transition probabilities. Every state transition is associated with some reward. The transition probabilities and the rewards completely specify the most important aspects of the dynamics of a finite-state MDP.

Fig. 1. General reinforcement learning.

The sum of all the rewards that follow from a state is the state value of that state. Due to the stochastic nature of the state transition, the state value of each state is the expected value of all the rewards following that state. The control that is applied in each state during a transition depends on the policy (π) that is being followed. By applying control, the transition probabilities can be changed. Thus, the expected sum of rewards depends on the policy being applied.
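Before the formal definitions, the following minimal sketch shows one way the ingredients just described, states, actions, transition probabilities, rewards, and a discount parameter, could be represented in code. The two-state system, its numbers, and the equiprobable policy are illustrative assumptions and are not taken from the paper; the Monte Carlo loop simply estimates the expected discounted return from a state, which is the state value discussed above.

import random

# Transition probabilities P[s][a] = [(next_state, prob), ...] and
# rewards R[(s, a, s_next)]; gamma is the discount parameter.
states = [0, 1]
actions = [0, 1]
P = {0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.4), (1, 0.6)], 1: [(0, 1.0)]}}
R = {(0, 0, 0): 1.0, (0, 0, 1): 0.0, (0, 1, 1): 2.0,
     (1, 0, 0): 0.5, (1, 0, 1): 1.0, (1, 1, 0): 0.0}
gamma = 0.9

def step(s, a):
    """Sample one stochastic transition from state s under action a."""
    u, acc = random.random(), 0.0
    for s_next, prob in P[s][a]:
        acc += prob
        if u <= acc:
            return s_next, R[(s, a, s_next)]
    s_next = P[s][a][-1][0]
    return s_next, R[(s, a, s_next)]

def estimate_value(s0, policy, episodes=2000, horizon=60):
    """Monte Carlo estimate of the expected discounted return from s0 under a
    stochastic policy given as policy[s][a] = probability of action a in state s."""
    total = 0.0
    for _ in range(episodes):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = random.choices(actions, weights=[policy[s][b] for b in actions])[0]
            s, r = step(s, a)
            ret += disc * r
            disc *= gamma
        total += ret
    return total / episodes

equiprobable = {s: {a: 0.5 for a in actions} for s in states}
print(estimate_value(0, equiprobable))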
The state value function is therefore defined for a particular policy π and is denoted V^π(s). A policy π is a map from each state s ∈ S and action a ∈ A(s) to the probability π(s, a) of taking action a when in state s, where S is the state set and A(s) is the action set (i.e., the set of all the actions available in state s)

\pi(s, a) = \Pr\{ a_t = a \mid s_t = s \}   (1)

V^{\pi}(s) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\}   (2)

where Pr denotes probability; E_π represents the expected value given that the agent follows policy π; k indexes the iterations over which the expected value is calculated; r_{t+k+1} is the reward obtained in the transition from state s_{t+k} to state s_{t+k+1}; and γ is a discount parameter (0 ≤ γ ≤ 1). The discount rate determines the present value of future rewards. It is particularly used in infinite-horizon MDPs to keep the state values bounded.

The solution of the MDP is a policy π* which achieves the maximum reward over a period of time. The value function of a state for the optimal policy is given by the Bellman optimality equation [10]

V^{*}(s) = \max_{a \in A(s)} \sum_{s'} P_{ss'}^{a} \big[ R_{ss'}^{a} + \gamma V^{*}(s') \big]   (3)

where P_{ss'}^{a} is the probability of moving from state s to state s' under action a, and R_{ss'}^{a} is the corresponding expected reward. The Bellman optimality equation represents a finite set of equations for a finite MDP, one equation for every state. If there are N states, then there are N equations in N unknowns. If the dynamics of the environment, that is, the transition probabilities and rewards, are known, then one can solve this system of equations for V^{*} using methods such as dynamic programming. But in circumstances where the complete dynamics of the system are not known, simulation methods such as Monte Carlo estimation or temporal difference learning are used [10].
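When the transition probabilities and rewards are known, the system of equations (3) can be solved by dynamic programming, as noted above. The routine below is an illustrative value-iteration sketch, not the authors' code; it assumes the same dictionary layout for the dynamics as the toy MDP sketched earlier.

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve the Bellman optimality equation (3) by repeated sweeps, given known
    dynamics: P[s][a] = [(next_state, prob), ...] and R[(s, a, s_next)] = reward."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(prob * (R[(s, a, sn)] + gamma * V[sn])
                           for sn, prob in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Example (with the P and R of the earlier toy MDP): V_opt = value_iteration(P, R)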
The following sections present the reinforcement learning method in general and the actor-critic learning algorithm in particular.

A. Reinforcement Learning

Reinforcement learning simulates an intelligent agent that can learn how to make good decisions by observing its own behavior. It uses built-in mechanisms for improving the actions through a reinforcement mechanism. It essentially maps situations to actions in order to maximize a numerical reward [10]. Temporal difference (TD) learning is a novel method in the group of reinforcement learning methods for solving large-state MDPs [10]. It learns from experience to solve the prediction problem. The general schematic diagram of an agent employing reinforcement learning is presented in Fig. 1. The agent selects some action based on the current state and obtains a reward from the environment. The environment makes a transition into a new state due to this action. The agent updates the state values depending upon the immediate reward and the next state that results.

B. Actor-Critic Learning Algorithm

Fig. 2. Schematic diagram of actor-critic method.

The actor-critic learning algorithm belongs to the TD family of learning methods. The actor-critic method used in [11] employs two learning agents that work in two loops. The outer loop consists of a reinforcement learning agent, which selects the action in accordance with the current policy and receives the reinforcement feedback; the policy structure is called the actor. The inner loop constructs a more informative evaluation function: the state value of each state is estimated from the reinforcement feedback. The estimated value function is called the critic because it criticizes the action made by the actor. The actor-critic method can be represented schematically as shown in Fig. 2.

The actor-critic method works by selecting an action from the existing policy. The reward obtained from the transition is used to update the estimate of the state value of the current state and the preference for selecting that action next time. Three tables, of policy, state values, and preferences, are maintained. After selecting each action, the TD error is calculated as

\delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)   (4)

where V is the current value function implemented by the critic. It is called the TD error because it estimates the difference between the current estimated state value and the actual state value for the present policy. The TD error is used to evaluate the action a_t just selected (i.e., the action taken in state s_t). If the TD error is positive, then the tendency to select a_t in the future should be strengthened, and if it is negative, then the tendency to select a_t in the future should be weakened. This is implemented by updating the preference as follows:

p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta   (5)

where β (0 < β ≤ 1) is a step-size parameter. The policy is derived from the preferences using the softmax method

\pi(s, a) = \frac{e^{p(s, a)/\tau}}{\sum_{b=1}^{n} e^{p(s, b)/\tau}}   (6)

where n is the total number of actions available in state s. The parameter τ is called the temperature. It is used to control the relative probabilities of selection of the actions. The selection of τ should be made judiciously: a large value of τ makes all actions nearly equally probable, while a small value of τ increases the probability of marginally better actions disproportionately, leading the agent in the wrong direction. A simple but effective choice of τ is to make it a function of the mean of the preferences in the state

\tau = c\,\bar{p}(s), \qquad \bar{p}(s) = \frac{1}{n}\sum_{b=1}^{n} p(s, b)   (7)

where c is a factor that is adjusted as the iterations progress. The state values are updated according to

V(s_t) \leftarrow V(s_t) + \alpha \delta   (8)

where α (0 < α ≤ 1) is another step-size parameter. The policy thus obtained after a sufficient number of iterations is a suboptimal but vastly improved policy. In the following section, we present the formulation of the bidding problem of a Genco as an MDP and its solution based upon the actor-critic learning method.
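The update cycle of (4)-(8) can be condensed into a short sketch. The Python code below is an illustrative rendering, not the authors' MATLAB implementation; the table layouts, the parameter names alpha, beta, and c, and the temperature floor tau_min are assumptions.

import math

def softmax_policy(prefs, tau):
    """Softmax of (6): prefs maps each available action to its preference p(s, a)."""
    exps = {a: math.exp(p / tau) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_update(V, pref, s, a, r, s_next,
                        alpha=0.1, beta=0.01, gamma=0.9, c=1.0, tau_min=1e-3):
    """One update cycle after observing the transition (s, a, r, s_next)."""
    delta = r + gamma * V[s_next] - V[s]      # TD error, eq. (4)
    pref[s][a] += beta * delta                # actor: preference update, eq. (5)
    V[s] += alpha * delta                     # critic: state-value update, eq. (8)
    # Temperature tied to the mean preference of the state, in the spirit of (7);
    # the absolute value and the floor tau_min are assumptions to keep tau positive.
    mean_pref = sum(pref[s].values()) / len(pref[s])
    tau = max(c * abs(mean_pref), tau_min)
    return delta, softmax_policy(pref[s], tau)   # updated policy for state s, eq. (6)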
III. BIDDING PROBLEM AS MDP

The problem is formulated in such a way that the transition probabilities change due to a change in the policies. The dynamics of the MDP are not known a priori but evolve with the solution. Hence, learning methods such as TD are suitable to solve the problem.

A. Energy Auction

The power exchange (PX) conducts an energy auction for the day-ahead market. The PX day-ahead forward market is a wholesale market and operates each day of the week from 7:00 A.M. until about 1:00 P.M. During this window, electricity is traded for delivery that will start at midnight. The auction mechanism proposed in [2] is used here. The bids are in the form of points of piecewise linear curves on energy and price coordinates. The seller bids the amount of energy that he or she is willing to sell at a given price or above, and the buyer bids the amount of energy that he or she is willing to buy at a given price or a price lower than it. The unconstrained MCP is determined by the point of intersection of the aggregate demand and supply bid curves.

B. Participants

The formulation is done from the point of view of one of the participants, say participant 1. It is presumed that the participant knows its own cost curve, which is assumed to be quadratic in nature. The total generation capacity of each participant is also known to participant 1, and it is assumed that the participant knows the startup cost of each of its units. The load forecast for the next 24 h, for which the bids are to be placed, is also known. The participant assumes the cost curves of the other participants from past experience. The cost curve is of the form C_i(P) = a_i P^2 + b_i P + c_i, and the startup cost for an hour is SU_i. In case the cost-curve data of the other participants are not known deterministically, we can consider different cost curves with some probabilities assigned for each participant, as in [4].

The participant first decides in how many parts it wants to bid, say n parts. It divides its maximum generation capacity into n parts and, using the cost curve, finds the marginal generation cost at the higher end of generation for each part. This forms the middle price element of the bid set for that part; the higher and lower price elements are obtained by perturbing the middle element upward and downward, respectively. The bid set consists of the Cartesian product of the three bid elements for each part. For a two-part bid with three price levels for each part, the bid set consists of nine (3^2) bids

bid-set_1 = \{ ((P_1, \rho_1), (P_2, \rho_2)) : \rho_i \in \{\rho_{i,l}, \rho_{i,m}, \rho_{i,h}\}, \; i = 1, 2 \}   (9)

where P_i represents the power coordinate of the bid curve for part i, and ρ_i is the price coordinate of the bid curve, which can be any of the lower, middle, or higher price elements derived from the cost curve.

The corresponding bid sets for all of the other participants are formed in a similar way, by participant 1 itself. Since participant 1 cannot predict in how many parts each of the other participants is going to bid, it takes a single part for each of them. The bid set of every other participant therefore consists of three elements only, with the marginal cost at the maximum generation capacity taken as the middle bid of that participant.

The state of the system is defined by the bids placed by each participant in a particular hour. The bids placed by the participants are selected from their respective bid sets, so that a state s = (x_1, x_2, ..., x_m), with x_i ∈ bid-set_i, represents one state and the x_i are its elements. The total number of possible states in each hour is the size of the Cartesian product of the bid sets of all the participants. With nine elements in participant 1's bid set and two other participants with three elements in each of their bid sets, there are 81 states in each hour. The state which is selected in a particular hour is decided by the bids selected by each participant. The reward for participant 1, when a transition is made to a particular state, is defined as the profit it can earn when all of the participants place their bids according to the elements of that state. The transition probability into a state is the probability that all of the bids forming its elements are selected in that hour by the respective participants. The bids for every hour are stochastically selected by each participant, with each bid in the bid set having a finite probability of being selected. The set of probabilities of each bid for every hour is the policy that is being followed by the participant.
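The bid-set construction and the hourly state space described above can be sketched as follows. The cost-curve coefficients, the capacities, and the +/-10% price spread around the marginal cost are illustrative assumptions, not the paper's data; the point of the sketch is the counting: 9 bids for participant 1, 3 for each competitor, and 81 candidate states per hour in the three-participant case.

from itertools import product

def marginal_cost(a, b, P):
    """Marginal cost dC/dP of a quadratic cost curve C(P) = a*P**2 + b*P + c."""
    return 2.0 * a * P + b

def bid_set_participant1(a, b, Pmax, spread=0.1):
    """Two-part bid set: for each part, (power, price) with low/mid/high prices.
    The +/- spread around the marginal cost is an illustrative assumption."""
    parts = [0.5 * Pmax, Pmax]                  # end points of the two parts
    levels = []
    for P in parts:
        mid = marginal_cost(a, b, P)
        levels.append([(P, mid * (1.0 - spread)), (P, mid), (P, mid * (1.0 + spread))])
    # Cartesian product of the three price levels of each part: 3 x 3 = 9 bids.
    return list(product(*levels))

def bid_set_competitor(a, b, Pmax, spread=0.1):
    """Single-part bid set with three price levels at maximum capacity."""
    mid = marginal_cost(a, b, Pmax)
    return [(Pmax, mid * (1.0 - spread)), (Pmax, mid), (Pmax, mid * (1.0 + spread))]

bids1 = bid_set_participant1(0.01, 10.0, 50.0)   # 9 bids for participant 1
bids2 = bid_set_competitor(0.012, 12.0, 50.0)    # 3 bids for participant 2
bids3 = bid_set_competitor(0.015, 11.0, 50.0)    # 3 bids for participant 3
hourly_states = list(product(bids1, bids2, bids3))
print(len(hourly_states))                        # 9 * 3 * 3 = 81 states per hour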
C. Application of the Actor-Critic Learning Algorithm

In this section, we consider how the actor-critic method can be applied to solve the problem that was formulated as an MDP in the previous section. The load forecast data of the 24 h for which the bids are to be decided are obtained. Here, an assumption is made that the day-to-day variation of the load is neglected. This assumption is necessary for considering the bidding problem as an infinite-horizon MDP. If the load forecast for more than one day is available, then it can be incorporated by having correspondingly more stages: there are 24 stages for the forecasted load of one day and 48 stages for the forecasted load of two days.

The initial state value of every state is taken as zero, and the initial probabilities of selection are equal for all of the bids in a bid set. The initial state of the system can be randomly selected. The simulations are carried out by the agent of participant 1, which assumes that similar agents operate for the other participants as well. The agent selects one of the bids from its bid set according to the probabilities. Similarly, the other agents select the bids for their respective participants. All of the selected bids together decide the next state for the transition. The MCP (let its value be λ) that would result from the given bidding by the participants and the allotted generation P_1 for each participant are calculated. The profit for participant 1 is obtained as

\text{profit}_1 = \lambda P_1 - C_1(P_1)   (10)

i.e., the revenue at the MCP less the cost of producing the allotted generation.

In the first iteration, the agent for each participant starts the process by randomly selecting bids from its bid set. The set of these selected bids defines the next transition state. Equation (10) can be used to calculate the profit for participant 1. The profit so obtained is taken as the reward for participant 1, and the negative of the profit earned by participant 1 is taken as the reward for the other participants. The agent of participant 1 assumes that all of the other agents are going to bid in such a manner as to reduce its profit. This assumption is similar to the maxmin method applied in game theory. It results in pessimistic values of expected profits but in a policy that has low risk. The rewards are defined in such a manner that participant 1's agent learns to apply a policy which maximizes the profit earned by participant 1, while the other agents learn to apply policies which minimize participant 1's profit.

The TD error is calculated by each agent after every transition using (4). In the first iteration, the TD error for participant 1 is simply equal to its immediate reward, since the initial state values are zero. The policy (i.e., the probabilities with which the agent selects a bid) is updated to reflect the outcome of the first selection. The agents form the preference for the state-action pair using the most recent reward through (5). The preference reflects the cumulative reward that the agent obtains from a particular state-action pair. The probability of that state-action pair being selected is calculated using (6). This completes the outer loop of Fig. 2. The state value of the initial state is then updated to a better estimate using (8), which constitutes the inner loop of Fig. 2.

The procedure mentioned before is repeated in every iteration as the system makes a transition from one state to the next, each state depicting the bids selected by the participants in a particular hour. The transition from the 24th hour is to the 1st hour, so the iterations continue in the form of an infinite-horizon MDP. The process is stopped typically around 10 000 iterations. The result yields the probabilities of selection of each bid during each hour. The expected profit can be found from the converged values of the bids.
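A compressed sketch of one such iteration is given below. The single-block market clearing and the commented glue at the end are simplifying assumptions (the paper clears two-part piecewise-linear bids through the mechanism of [2]); the sketch only illustrates how the MCP, the allotted generation, and the reward of (10) feed the actor-critic update of the previous section.

def clear_market(offers, demand):
    """Toy single-block clearing: offers is a list of (capacity, price) per participant;
    the cheapest offers are dispatched until demand is met, and the MCP is the price
    of the last offer taken."""
    order = sorted(range(len(offers)), key=lambda i: offers[i][1])
    alloc = [0.0] * len(offers)
    remaining, mcp = float(demand), 0.0
    for i in order:
        take = min(offers[i][0], remaining)
        if take > 0.0:
            alloc[i] = take
            remaining -= take
            mcp = offers[i][1]
        if remaining <= 0.0:
            break
    return mcp, alloc

def profit(mcp, P, a, b, c):
    """Eq. (10): revenue at the MCP less the quadratic generation cost a*P^2 + b*P + c."""
    return mcp * P - (a * P * P + b * P + c)

# One iteration for a given hour (with the bid sets, policy tables, V, and pref of the
# earlier sketches): each agent samples a bid from its current policy, the market is
# cleared with clear_market(...), r1 = profit(...) is the reward of participant 1 and
# -r1 is the reward of the other agents, and actor_critic_update(...) applies (4)-(8)
# before the system moves to the state of the next hour.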
Fig. 3. State transition diagram.

The minimum up/down time constraints on the generating units can be incorporated in the formulation. As shown in Fig. 3, the constraint violation has to be identified at the end of every hour. A generating unit may have to be down either because of an outage or because a minimum generation limit makes it uneconomical to run the unit. Similarly, a generating unit may come onto the grid at a later stage. Once a unit is down, that participant does not participate in the auction for the constrained hours. Hence, the number of states to be considered in those hours is reduced. Once the unit is up after the constrained hours are over, the number of states increases to the original value. This has been applied to sample system I.

IV. IMPLEMENTATION

We apply the method mentioned before to three sample systems involving three, seven, and ten Gencos.

A. Sample System I

The sample system used in [4] is considered here. The three Gencos are the three participants who submit bids and are allocated portions of the total demand. The cost curves and the maximum generation capacity of each participant are those of the sample system in [4]. A startup cost is associated with the unit of participant 1 that supplies the upper 25 MW of its cost curve. The 24-h load curve is shown in Fig. 4.

Fig. 4. Load curve for 24 h.

Participant 1 forms the bid set for itself and for the competitors as mentioned in (9). In this example, we have taken a two-part bid for participant 1: fifty percent of the maximum generating capacity is the end point of the first part of the bid, and the maximum capacity is the end point of the second part of the bid.

1) Results: The market simulation was carried out for the above example. The discount factor γ is 0.9. The step-size parameter α is a function of the iteration number. The step-size parameter β for the preference update is 0.01. We have observed that β should be low; higher values of β show apparent early convergence, but the solution may not be a feasible one. The temperature parameter τ for the softmax selection (6) is varied as the iterations progress, the temperature being reduced as the iterations increase: τ is twice the mean of the preferences during the first 2500 iterations, equal to the mean during the next 5000 iterations, and half of the mean during the last 2500 iterations. A total of 10 000 iterations were carried out for the above problem.
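The annealing of the temperature described above can be written as a small helper. The floor tau_min and the breakpoints expressed as quarters of the run are assumptions consistent with the 2500/5000/2500 split reported for 10 000 iterations.

def temperature(iteration, mean_pref, total_iters=10000, tau_min=1e-3):
    """Multiplier 2.0 for the first 2500 iterations, 1.0 for the next 5000,
    and 0.5 for the last 2500 (with total_iters = 10000)."""
    if iteration < total_iters // 4:
        c = 2.0
    elif iteration < 3 * total_iters // 4:
        c = 1.0
    else:
        c = 0.5
    return max(c * abs(mean_pref), tau_min)   # the floor keeps tau strictly positive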
Fig. 5. Probabilities over the bid set for 24 h.

The probabilities over the bid set of participant 1 are obtained as shown in Fig. 5. From the results, it is seen that during the offpeak period (hours 22 to 8), the higher payoff is obtained by bidding at marginal cost. During peak demand hours, a better payoff is obtained by bidding at the higher end of the cost curve.

Fig. 6. State values of the 13th hour for nine selected important states.
Fig. 7. Probabilities over the bid set for hour number 12.

Fig. 6 presents an average of the state values as the iterations progress. Figs. 7 and 6 show the relationship between the state values and the probabilities of the states being selected. Participants 2 and 3 assign higher probabilities to their respective bid number 3 in hour number 12. Hence, the agent of participant 1 learns to allot a higher probability to its bid number 4, which leads to state 36, the state having a larger state value in the next hour. The immediate profit due to the state transition also affects the probabilities. Similarly, the states having the largest probabilities of being selected by the participants over the 24 h are shown in Fig. 8; the x-axis limits of Fig. 8 are hour number 2 and hour number 25 (the first hour of the next day). Column two of Table II (SS-I) presents the values of the immediate rewards given by (10), averaged over all iterations. Maximizing profit is the same as maximizing the state values of the most probable states.

Fig. 8. Most probable states in the 24-h period for sample system I.

TABLE II. Average of immediate rewards for 24 h. (SS-I: sample system I; SS-II: sample system II; SS-III: sample system III; RL: reinforcement learning; GT: game theory.)

2) Results With Up/Down Time Constraints: Here, the up/down time constraints are included. It is observed that the minimum generation constraint of Genco 2 is violated during the first few hours of the day, causing a shutdown of the generating unit of Genco 2. The return of Genco 2 is now governed by the down-time constraint, which is taken as 3 h. The reward value for Genco 2 in those hours becomes zero. Table I shows the profits of all three Gencos. Here, we can observe that in the first few hours the profit of Genco 2 is zero, as its constraint is violated and it is not participating in the auction. The consideration of minimum up/down time results in a variation of the number of states to be evaluated and increases the benefits of participants 1 and 3 in those constrained hours. The state transitions in those hours are limited to a few states only.

TABLE I. Profits of all three Gencos considering up/down time for SS-I. (G: Genco.)

B. Sample System II

Here, we take a system with seven Gencos; the cost curves and the maximum generation capacities of each Genco are similar to those of sample system I. The daily load curve is taken as earlier, with the total generating capacity of the seven Gencos just enough to meet the peak system demand. The shape of the load curve is the same as in the earlier system. This system was tested with reinforcement learning (RL) as well as game theory.

1) Results: Columns three and four of Table II present the comparison of the expected profit earned by participant 1, calculated through RL and game theory, respectively. Game theory models the hourly auctions separately and, hence, loses all of the temporal information. The effect of this can be seen in the hourly profits of participant 1. There are 6561 states in each hour when the system is modeled as an RL problem.

C. Sample System III

RL was applied to a sample system with ten Gencos. The particulars of the system are similar to the two cases mentioned before. The purpose of performing the study was to test the stability, convergence, and speed characteristics of the RL method under realistic situations. There are 177 147 states in each hour and, hence, a total of 4 251 528 states for the complete problem of 24 h.

1) Results: The RL method converged to stable transition probabilities within 10 000 iterations. The time taken for a complete study was approximately 59 min with MATLAB programming on a Pentium IV. The profit earned by participant 1 during each hour is given in column five of Table II. Table III provides the most probable bid of participant 1 and its value for each hour.

TABLE III. Probabilities of bids of participant 1 for 24 h for SS-III.
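The state counts quoted for the three sample systems follow directly from the bid-set sizes: participant 1 contributes 9 bids and each of the other m - 1 participants contributes 3, giving 9 x 3^(m-1) states per hour. The short check below reproduces the figures of 81, 6561, and 177 147 states per hour.

def states_per_hour(num_participants):
    """Participant 1 contributes 9 bids (two parts, three price levels each) and every
    other participant contributes 3 bids, giving 9 * 3**(m - 1) states per hour."""
    return 9 * 3 ** (num_participants - 1)

for m in (3, 7, 10):
    n = states_per_hour(m)
    print(m, n, 24 * n)
# 3 Gencos: 81 per hour; 7 Gencos: 6561 per hour; 10 Gencos: 177 147 per hour,
# i.e., 4 251 528 states for the complete 24-h problem.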
D. Comparison of Results

TABLE IV. Comparison of results.

Table IV gives the computation time required for the different sample systems. Sample systems SS-I, SS-II, and SS-III correspond to the three-, seven-, and ten-Genco systems. As reported in Table IV, they require 18-, 39-, and 59-min computation times, respectively. Hence, we can say that the increase in computation time is almost linear for the same number of iterations. However, for larger systems, if better accuracy is desired, the number of iterations may be increased. Also, it can be observed that the total profit for the game theory application is nearly 28% less than that of RL for sample system II. The computation time depends on how many variables are stored. For observing the convergence pattern, the storage of all state values would be required, which would cause a large computation time and a large memory requirement. A reduction in computation time as well as in memory requirement can be achieved by storing only intermediate values, which is implemented for all sample systems (Table IV).

V. CONCLUSIONS

The spot market bidding problem is modeled here as an MDP. Fig. 6 shows the convergence of the state values, indicating that no further improvement is possible, so the method progresses toward convergence to a near-optimal policy. All of the states show improvement in state values. The convergence is a function of the α, β, and τ values; the selection of the value of τ is commented on in Section II-B. Up/down time constraints of the generating units are also included in the problem formulation. The state transitions are constrained to only a few states in those hours in which constraints are violated.

The method is especially suited to cases where a good model of the process is not available and the risk involved is stochastic in nature. In realistic situations, the transition probabilities are not defined; this implies that dynamic programming cannot be applied to this problem. A cumulative long-term profit maximization problem as a multistage time process cannot be formulated in the framework of game theory. Our formulation of bidding as an MDP can account for such situations effectively. In RL, the initial transition probabilities are assumed to be equal for all combinations. The combinations with larger profit are reinforced in the outer loop with larger preference values. The inner loop updates the state values as a function of the TD error. Compared to game theory, RL is able to produce a larger overall profit for sample system II. This could be due to the fact that the game theory formulation divides the problem into 24 separate subproblems. All of the methods require a large computation time. The computation time can be reduced by storing only intermediate state values. Sample system III with ten Gencos required nearly 59 min, without storing all of the state values for each iteration, with MATLAB on a Pentium IV. Once the learning is over, a better guess of the initial start is possible for small changes in the load pattern.

ACKNOWLEDGMENT

The authors would like to thank the Department of Science and Technology, Government of India, for their support in carrying out this research work.

REFERENCES

[1] G. B. Sheblé, "Priced based operation in an auction market structure," IEEE Trans. Power Syst., vol. 11, pp. 1770-1777, Nov. 1996.
[2] C. Li, A. J. Svoboda, X. Guan, and H. Singh, "Revenue adequate bidding strategies in competitive electricity market," IEEE Trans. Power Syst., vol. 14, pp. 492-497, May 1999.
[3] R. W. Ferrero, S. M. Shahidehpour, and V. C. Ramesh, "Transaction analysis in deregulated power system using game theory," IEEE Trans. Power Syst., vol. 12, pp. 1340-1347, Aug. 1997.
[4] V. Krishna and V. C. Ramesh, "Intelligent agent for negotiations in market games, Part 2: Application," IEEE Trans. Power Syst., vol. 13, pp. 1109-1114, Aug. 1998.
[5] G. R. Gajjar, S. A. Khaparde, and S. A. Soman, "Modified model for negotiations in market games under deregulated environment," in Proc. 11th Nat. Power Syst. Conf., Bangalore, India, Dec. 2000.
[6] C. W. Richter, Jr., G. B. Sheblé, and D. Ashlock, "Comprehensive bidding strategies with genetic programming/finite state automata," IEEE Trans. Power Syst., vol. 14, pp. 1207-1212, Nov. 1999.
[7] S. Hao, "A study of basic bidding strategy in clearing pricing auction," IEEE Trans. Power Syst., vol. 15, pp. 975-980, Aug. 2000.
[8] D. Zhang, Y. Wang, and P. B. Luh, "Optimization based bidding strategies in the deregulated market," IEEE Trans. Power Syst., vol. 15, pp. 981-986, Aug. 2000.
[9] H. Song, C.-C. Lin, J. Lawarrée, and R. W. Dahlgren, "Optimal electricity supply bidding by Markov decision process," IEEE Trans. Power Syst., vol. 15, pp. 618-624, May 2000.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[11] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834-846, Sept./Oct. 1983.

G. R. Gajjar received the M.Tech. degree from the Indian Institute of Technology, Bombay, India, and the B.E.E. degree from Maharaja Sayajirao University of Baroda, Baroda, India. Currently, he is an Engineer at ABB India Ltd., Baroda, India. His research interests include power system analysis, operation, and planning.

S. A. Khaparde (M'88-SM'91) is Professor at the Department of Electrical Engineering, Indian Institute of Technology, Bombay, India. His research interests include power system computations, analysis, and deregulation in the power industry.

P. Nagaraju received the B.E.E. degree from Osmania University College of Engineering, Hyderabad, India. He is currently pursuing the M.Tech. degree at the Indian Institute of Technology, Bombay, India. His research interests include power system analysis and deregulation.

S. A. Soman is Associate Professor in the Department of Electrical Engineering, Indian Institute of Technology, Bombay, India. His research interests include sparse matrix computations, power system analysis, object-oriented programming (OOP), and power system automation.