
IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 18, NO. 1, FEBRUARY 2003
Application of Actor-Critic Learning Algorithm for
Optimal Bidding Problem of a Genco
G. R. Gajjar, S. A. Khaparde, Senior Member, IEEE, P. Nagaraju, and S. A. Soman
Abstract—The optimal bidding problem of a Genco in a deregulated power market is an involved task. The problem is formulated in the framework of a Markov decision process (MDP), a discrete stochastic optimization method. When the time span considered is 24 h, the temporal difference method becomes attractive for application. The cumulative profit over the span is the objective function to be optimized. The temporal difference technique and the actor-critic learning algorithm are employed, and an optimal strategy is devised to maximize the profit. The market-clearing system is included in the formulation. Simulation cases with three, seven, and ten participants are considered, and the results obtained are discussed.
Index Terms—Actor-critic learning algorithm, bidding strategies, energy auction, Markov decision process.

Manuscript received February 12, 2001; revised April 9, 2002. This work was supported by the Department of Science and Technology, Government of India, under Grant III.5(135)/2001-SERC-Engg.

The authors are with the Department of Electrical Engineering, Indian Institute of Technology, Bombay 400076, India (e-mail: [email protected]).

Digital Object Identifier 10.1109/TPWRS.2002.807041
I. INTRODUCTION
DEREGULATION of the power industry has become an established practice in many parts of the world. The energy auction plays an important role in the operation of the deregulated power system. The power-exchange bidding mechanism requires each participant to submit bids for all 24 h of a day as a block. Devising an optimal bid is an involved and challenging task. The bid takes the form of points of a piecewise linear curve, with energy in megawatthours on one axis and price in dollars per megawatthour on the other. An important objective is to find the hourly generation schedule, coupled with the competitive bid price, such that the total profit earned by a particular supplier is maximized and the risk involved is minimized.
A basic price-based auction mechanism is proposed in [1]. An auctioneer matches the buyers' and sellers' bids to find the market clearing price (MCP) [2]. Game theory is used for optimal bidding in the hourly auction in [3] and [4]. While game theory is a good tool for decision making under uncertainty, it is difficult to model the 24-h bidding mechanism as a game theory problem. A modified method using game theory that takes into account the varying load demand is discussed in [5]. It is shown in [5] that a judicious choice of a larger number of points on the cost curve leads to a significant increase in payoff values. Dynamic programming is used in [2] for revenue-adequate bidding. A genetic-algorithm (GA)-based method is described in [6]. This approach is effective only if the market is not volatile. Reference [7] proposes a method in which a set of differential equations that specifies the necessary conditions of optimality is solved. Optimization-based bidding strategies are proposed in
[8]. The optimal bidding is divided into two optimization problems: one for a participant and one for the independent system operator (ISO). The ISO subproblem is deterministic, and the participants' subproblem is stochastic. In [9], the problem of optimal bidding is modeled as a Markov decision process (MDP) in which the load is considered on a weekly basis with peak and offpeak levels. Since only two load levels are considered over 24 h, computation is significantly reduced, but at the cost of accuracy.
A three-supplier system is considered in [9]. The value equation
iteration based on backward dynamic programming is a function of transitional probability, and the evolution of transitional
probability is left to the decision maker. Here, we propose to
consider the load for all 24 h. Reinforcement learning allows
the transitional probabilities to evolve as a function of temporal
difference error, as described in Section II.
The bidding in the energy auction is a daily process. Every day, each participant has to submit its bids for 24 h. Under these conditions, the bidding has to be viewed as a discrete time-step, multiple-stage process, and the objective of each participant is the long-term gain. This can be naturally formulated as an MDP. An MDP allows a systematic solution to a multiple-stage probabilistic decision-making problem. Game theory, if applied for 24 h, separates the problem into 24 independent subproblems and hence cannot handle a multistage time-dependent optimization problem. As pointed out in [9], a state with seven elements and a reasonable ten allowable values for each element would result in $10^7$ states. Even though all of the formulations require large computations, the MDP formulation is the most suitable for a multistage time-dependent optimization problem when accurate results are desired.
It is therefore worthwhile to seek the solution of the optimal bidding problem in the domain of learning methods. Methods such as reinforcement learning are known to give satisfactory results in many difficult-to-learn problems with a large number of states.
II. MARKOV DECISION PROCESS
A discrete-time MDP is a multistate process in which the states have the Markov property and the decision-making agent controls the actions, which change the state of the system.
A control is applied to each state while changing from one state to another. The state transition is stochastic, with the probabilities called transition probabilities. Every state transition is
associated with some reward. The transition probabilities and
the rewards completely specify the most important aspects of
the dynamics of a finite-state MDP. The sum of all the rewards
Fig. 1. General reinforcement learning.
that follow from a state is the state value of that state. Due to the stochastic nature of the state transition, the state value of each state is the expected value of all the rewards following that state. The control that is applied in each state during a transition depends on the policy ($\pi$) that is applied. By applying control, the transition probabilities can be changed. Thus, the expected sum of rewards depends on the policy being applied. So $V^{\pi}(s)$ is the state value function for the policy $\pi$.
A policy $\pi$ is a map from each state $s \in S$ and action $a \in A(s)$ to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. $S$ is the state set, and $A(s)$ is the action set (i.e., the set of all actions available in state $s$)

$$\pi(s,a) = \Pr\{a_t = a \mid s_t = s\} \quad (1)$$

$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right\} \quad (2)$$

where $\Pr\{\cdot\}$ is probability; $E_{\pi}\{\cdot\}$ represents the expected value given that the agent follows policy $\pi$; $k$ represents the iterations over which the expected value is calculated; $r_{t+k+1}$ is the reward obtained in the transition from state $s_{t+k}$ to $s_{t+k+1}$ in iteration $k$; and $\gamma$ is a discount parameter ($0 \le \gamma \le 1$). The discount rate determines the present value of future rewards. It is particularly used in infinite-horizon MDPs to make the state values bounded.
The solution of the MDP is a policy $\pi^{*}$ which achieves a maximum reward over a period of time. The value function of a state for the optimal policy is given by the Bellman optimality equation [10]

$$V^{*}(s) = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}\left[R^{a}_{ss'} + \gamma V^{*}(s')\right] \quad (3)$$

where $P^{a}_{ss'}$ is the probability of transition from state $s$ to state $s'$ under action $a$, and $R^{a}_{ss'}$ is the corresponding reward. The Bellman optimality equation represents a finite set of equations for a finite MDP, one equation for every state. If there are $N$ states, then there are $N$ equations and $N$ unknowns. If the dynamics of the environment, that is, the transition probabilities and rewards, are known, then one can solve this system of equations for $V^{*}$ using methods such as dynamic programming. But in circumstances where the complete dynamics of the system are not known, simulation methods like Monte Carlo estimation or temporal difference learning are used [10]. The following sections present the reinforcement learning method in general and the actor-critic learning algorithm in particular.
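To make the distinction concrete, the following Python sketch (a toy illustration, not code from the paper; the three-state MDP and its numbers are hypothetical) evaluates one fixed, equiprobable policy in both ways: by iterative dynamic-programming backups that use the known transition probabilities, and by a tabular TD(0) update that sees only sampled transitions.

```python
import random

states = [0, 1, 2]
actions = [0, 1]
gamma = 0.9

# Known dynamics of a toy MDP: P[s][a] = list of (next_state, probability, reward)
P = {
    0: {0: [(1, 1.0, 1.0)], 1: [(2, 1.0, 0.0)]},
    1: {0: [(2, 0.5, 2.0), (0, 0.5, 0.0)], 1: [(0, 1.0, 1.0)]},
    2: {0: [(0, 1.0, 0.5)], 1: [(1, 1.0, 1.5)]},
}

def expected_backup(V, s):
    """One DP backup of V(s) for the equiprobable policy, using the known P."""
    return sum(sum(prob * (r + gamma * V[s2]) for s2, prob, r in P[s][a])
               for a in actions) / len(actions)

# Iterative policy evaluation: possible only because the dynamics P are known
V_dp = {s: 0.0 for s in states}
for _ in range(500):
    V_dp = {s: expected_backup(V_dp, s) for s in states}

# Tabular TD(0): estimates the same values from sampled transitions only
V_td = {s: 0.0 for s in states}
alpha, s = 0.05, 0
for _ in range(50000):
    a = random.choice(actions)                               # equiprobable policy
    s2, _, r = random.choices(P[s][a], weights=[p for _, p, _ in P[s][a]])[0]
    V_td[s] += alpha * (r + gamma * V_td[s2] - V_td[s])      # TD(0) update
    s = s2

print(V_dp)   # values computed from the known model
print(V_td)   # sample-based estimates; close to V_dp after enough transitions
```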
Fig. 2. Schematic diagram of the actor-critic method.
A. Reinforcement Learning
Reinforcement learning simulates an intelligent agent that
can learn how to make good decisions by observing its own behavior. It uses built-in mechanisms for improving the actions
through a reinforcement mechanism. It essentially maps situations to actions in order to maximize a numerical reward [10].
Temporal difference (TD) learning is a method in the group of reinforcement learning methods for solving MDPs with a large number of states [10]. It learns from experience to solve the prediction problem.
problem. The general schematic diagram of an agent employing
reinforcement learning is presented in Fig. 1. The agent selects
some action based on the current state and obtains a reward from
the environment. The environment makes a transition into a new
state due to this action. The agent updates the state values depending upon the immediate reward and the next state which
results.
B. Actor-Critic Learning Algorithm

The actor-critic learning algorithm is a part of the TD family of learning methods. The actor-critic method used in [11] proposes
two learning agents that work in two loops. The outer loop
consists of a reinforcement learning agent, which selects the
action in accordance with the current policy and receives the
reinforcement feedback. The policy structure is called the
actor. The inner loop constructs a more informative evaluation
function. The state value function of each state is estimated
from the reinforcement feedback. The estimated value function
of each state is called the critic because it criticizes the action
made by the actor. The actor-critic method can be represented
schematically as shown in Fig. 2.
The actor-critic method works by selecting an action from
the existing policy. The reward obtained from the transition is
used to update estimates of the state value of the current state
and the preference of selection of the action next time. Three
tables of policy, state values, and preferences are maintained.
After selecting each action, the TD error is calculated as

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \quad (4)$$

where $V$ is the current value function implemented by the critic.
It is called TD error because it estimates the difference between
the current estimated state value and the actual state value for the present policy. The TD error is used to evaluate the action just selected (i.e., the action $a_t$ taken in state $s_t$). If the TD error is positive, then the tendency to select action $a_t$ in the future should be encouraged, and if it is negative, then the tendency to select action $a_t$ in the future should be discouraged. This is implemented by updating the preference as follows:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \alpha \delta_t \quad (5)$$

where $\alpha$ ($0 < \alpha < 1$) is a step-size parameter. The policy is derived from the preferences using the softmax method

$$\pi(s,a) = \frac{e^{p(s,a)/\tau}}{\sum_{b=1}^{n} e^{p(s,b)/\tau}} \quad (6)$$

where $n$ is the total number of actions available in state $s$. The parameter $\tau$ is called the temperature. It is used to control the relative probabilities of selection of the actions. The selection of $\tau$ should be made judiciously. A large value of $\tau$ makes all actions nearly equally probable, and a small value of $\tau$ increases the probability of marginally better actions disproportionately, leading the agent in the wrong direction. A simple but effective choice of $\tau$ is to make it a function of the mean of the preferences

$$\tau = c \cdot \frac{1}{n} \sum_{b=1}^{n} p(s,b) \quad (7)$$

where $c$ is a multiplier that is reduced as the iterations progress (see Section IV).
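As a minimal sketch (not the authors' code), the softmax selection of (6) with the mean-preference temperature of (7) could be written as follows; the preference values and the multiplier c are hypothetical placeholders.

```python
import math
import random

def softmax_policy(prefs, c=1.0):
    """Selection probabilities over actions from preference values, eqs. (6)-(7)."""
    tau = max(abs(c * sum(prefs) / len(prefs)), 1e-6)    # temperature, eq. (7)
    shifted = [(p - max(prefs)) / tau for p in prefs]    # shift for numerical safety
    z = [math.exp(v) for v in shifted]
    total = sum(z)
    return [v / total for v in z]                        # eq. (6)

prefs = [0.4, 1.1, 0.7]                    # hypothetical preferences of three bids
print(softmax_policy(prefs, c=2.0))        # large multiplier: nearly uniform
print(softmax_policy(prefs, c=0.5))        # small multiplier: sharper preference
action = random.choices(range(len(prefs)), weights=softmax_policy(prefs))[0]
print(action)
```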
The state values are updated according to

$$V(s_t) \leftarrow V(s_t) + \beta \delta_t \quad (8)$$

where $\beta$ ($0 < \beta < 1$) is another step-size parameter.

The policy thus obtained after sufficient iterations is a suboptimal but vastly improved policy.

In the following section, we present the formulation of the bidding problem of a Genco as an MDP and its solution based upon the actor-critic learning method.

III. BIDDING PROBLEM AS MDP

The problem is formulated in such a way that the transition probabilities change due to a change in the policies. The dynamics of the MDP are not known a priori; they evolve with the solution. Hence, learning methods like TD are suitable for solving the problem.

A. Energy Auction

The power exchange (PX) conducts an energy auction for the day-ahead market. The PX day-ahead forward market is a wholesale market and operates each day of the week from 7:00 A.M. until about 1:00 P.M. During this window, electricity is traded for delivery that starts at midnight.

The auction mechanism proposed in [2] is used here. The bids are in the form of points of piecewise linear curves on energy and price coordinates. The seller bids the amount of energy that he or she is willing to sell at a given price or above, and the buyer bids the amount of energy he or she is willing to buy at a given price or a price lower than it. The unconstrained MCP is determined by the point of intersection of the aggregate demand and supply bid curves.

B. Participants

The formulation is done from the point of view of one of the participants, say participant 1. It is presumed that the participant knows its own cost curve. The cost curve is assumed to be quadratic in nature. The total generation capacity of each participant is also known to participant 1. It is also assumed that the participant knows the startup cost of each of its units. The load forecast for the next 24 h for which the bids are to be placed is also known. The participant assumes the cost curves of the other participants from past experience. The cost curve is of the form $C(P) = a + bP + cP^2$, and the startup cost for an hour is a known constant. In case the cost curve data of other participants are not known deterministically, different cost curves with some probabilities assigned to each can be considered for every participant, as in [4].

The participant first decides in how many parts it wants to bid, say $n$ parts. It divides its maximum generation capacity into $n$ parts and, using the cost curve, finds the marginal generation cost at the higher end of generation for each part. This forms the middle element ($b_m$) of its bid set. The higher ($b_h$) and lower ($b_l$) elements of the bid set are obtained from the middle element.
The bid set consists of the Cartesian product of the three bid elements for each part. For a two-part bid ($n = 2$) and three levels ($b_l$, $b_m$, $b_h$) for each part, the bid set consists of nine ($3^2$) bids

$$\text{bid-set}_1 = \{(P_k, p_k),\; k = 1, \ldots, n\} \quad (9)$$

where $P_k$ represents the power coordinate of the bid curve; $b$ is the linear coefficient of the cost curve; and $p_k$ is the price coordinate of the bid curve, which can be any of $b_l$, $b_m$, or $b_h$.
The corresponding bid sets for all of the other participants are
formed in a similar way, by the participant 1 itself. Since participant 1 cannot predict in how many parts each one of the other
participants is going to bid, it takes a single part for each participant. The bid set of every other participant consists of three
elements only. The marginal cost at the maximum generation capacity is taken as the middle bid ($b_m$) of the participant.
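A minimal Python sketch of the bid-set construction described above is given below. The quadratic cost coefficients and the plus/minus 10% spread used to form $b_l$ and $b_h$ are illustrative assumptions; the paper's exact expressions for the lower and higher levels are not reproduced here.

```python
from itertools import product

def bid_set(b, c, p_max, n_parts=2, spread=0.10):
    """All (power, price) bid combinations for one participant."""
    per_part = []
    for k in range(1, n_parts + 1):
        P_k = p_max * k / n_parts                 # end point of part k
        b_m = b + 2.0 * c * P_k                   # marginal cost at that end point
        levels = (b_m * (1.0 - spread), b_m, b_m * (1.0 + spread))   # b_l, b_m, b_h
        per_part.append([(P_k, price) for price in levels])
    return list(product(*per_part))               # 3**n_parts combined bids

bids = bid_set(b=10.0, c=0.05, p_max=100.0, n_parts=2)
print(len(bids), bids[0])    # 9 two-part bids; each is ((P_1, p_1), (P_2, p_2))
```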
The state of the system is defined by the bids placed by each participant in a particular hour. The bids placed by the participants are selected from their respective bid sets

$$x = (B_1, B_2, \ldots, B_N), \quad B_i \in \text{bid-set}_i$$

where $x$ represents one state and the bids $B_i$ are its elements. The total number of possible states in each hour is the product of the number of elements in the bid set of every participant. With nine elements in participant 1's bid set and two other participants with three elements in each of their bid sets, there are 81
states in each hour. The state which is selected in a particular
hour is decided by the bids selected by each participant.
The reward for participant 1, when a transition is made to a
particular state, is defined as the profit it can earn when all of
the participants place their bids according to the elements of
that state. The transition probability of a state is the probability that all of the bids that form its elements are selected in that hour by the participants.
The bids for every hour are stochastically selected by each
participant, with each bid in the bid set having a finite probability of being selected. The set of probabilities of each bid for
every hour is the policy that is being followed by the participant.
C. Application of the Actor-Critic Learning Algorithm

Fig. 3. State transition diagram.
In this section, we consider how the actor-critic method can
be applied to solve the problem that was formulated as an MDP
in the previous section.
The load forecast data for the 24 h for which the bids are to be decided are obtained. Here, it is assumed that the day-to-day load variation can be neglected. This assumption is necessary for considering the bidding problem as an infinite-horizon MDP. If the load forecast for more than one day is available, it can be incorporated by adding correspondingly more stages: there are 24 stages for the forecasted load of one day and 48 stages for the forecasted load of two days. The initial state value of every state is taken as zero, and the initial probabilities of the selection of a bid are equal for all of the bids in a bid set.
The initial state of the system can be randomly selected.
The simulations are carried out by the agent of participant 1.
It assumes that similar agents operate for the other participants
also. The agent selects one of the bids from its bid set according
to the probabilities. Similarly, the other agents select the bids
for their respective participants. All of the selected bids together
decide the next state for transition. The MCP (let its value be denoted $p_{\mathrm{mcp}}$) that would result from the given bidding by the participants and the allotted generation ($P_i$) for each participant are calculated. The profit for participant 1 is obtained as

$$\text{profit}_1 = p_{\mathrm{mcp}}\, P_1 - C_1(P_1). \quad (10)$$
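The following sketch (not the authors' code) clears one hour and evaluates (10) under simplifying assumptions: an inelastic hourly demand and merit-order clearing of fixed-size bid blocks, rather than the full piecewise-linear curve intersection of Section III-A. All bid and cost data are hypothetical.

```python
def clear_market(offers, demand):
    """offers: list of (participant, quantity_mw, price_per_mwh) blocks.
    Returns (mcp, allocation dict participant -> MW) by merit order."""
    alloc = {p: 0.0 for p, _, _ in offers}
    remaining, mcp = float(demand), 0.0
    for p, qty, price in sorted(offers, key=lambda o: o[2]):   # cheapest first
        if remaining <= 0.0:
            break
        take = min(qty, remaining)
        alloc[p] += take
        remaining -= take
        mcp = price                     # price of the marginal accepted block
    return mcp, alloc

def profit(p_alloc, mcp, a, b, c):
    """Eq. (10): revenue at the MCP minus the quadratic generation cost."""
    return mcp * p_alloc - (a + b * p_alloc + c * p_alloc ** 2)

# Hypothetical data: a two-block bid from participant 1, single blocks from rivals
offers = [(1, 50.0, 11.0), (1, 50.0, 13.5), (2, 80.0, 12.0), (3, 80.0, 14.0)]
mcp, alloc = clear_market(offers, demand=150.0)
print(mcp, alloc, profit(alloc[1], mcp, a=0.0, b=10.0, c=0.01))
```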
In the first iteration, the agent for each participant starts the
process by randomly selecting bids from their bid sets. The set
of these selected bids defines the next transition state. Equation (10) can be used to calculate the profit for participant 1.
The profit so obtained is taken as the reward for participant 1, and the negative of the profit earned by participant 1 is taken as the reward for the other participants. The agent of participant 1 assumes that all of the other agents are going to bid in such a manner as to reduce its profit. This assumption is similar to the maxmin method applied in game theory. It results
in pessimistic values of expected profits but a policy that has low risks. The rewards are defined in such a manner that participant 1's agent learns to apply a policy that maximizes the profit earned by participant 1, while the other agents learn to apply policies that minimize participant 1's profit.
The TD error is calculated by each agent after every transition using (4). In the first iteration, the TD error for participant 1 is simply equal to the immediate reward, since all of the initial state values are zero. The policy (i.e., the probabilities with which the agent selects a bid) is updated to reflect the outcome of the first selection. The agents form a preference for the state-action pair using the most recent reward through (5). The preference reflects the cumulative reward that the agent obtains from a particular state-action pair. The probability of that state-action pair
being selected is calculated using (6). This completes the outer
loop of Fig. 2. The state value for the initial state is updated to a
better estimate using (8), hence going through the inner loop of
Fig. 2. The procedure mentioned before is repeated in every iteration as the system makes a transition from one state to the next, each state depicting the bids selected by the participants
in a particular hour. The transition from the 24th h is to the 1st
h, so the iterations continue in the form of an infinite horizon
MDP. The process is typically stopped after around 10 000 iterations. The result yields the probabilities of selection of each bid during each hour. The expected profit can be found from the converged values.
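Putting the pieces together, the following self-contained Python sketch mimics the loop just described: 24 hourly stages with wrap-around, softmax bid selection per (6) and (7), the TD error of (4), the preference update of (5), and the state-value update of (8). It is only an illustration under simplifying assumptions: the state is reduced to the hour index, the rival Gencos bid randomly instead of running their own learning agents with the negative of participant 1's profit as reward, and all cost, capacity, and demand figures are hypothetical.

```python
import math
import random

HOURS = 24
GAMMA = 0.9      # discount factor
ALPHA = 0.01     # preference step size, eq. (5)
BETA = 0.01      # state-value step size, eq. (8)

# Hypothetical quadratic costs C(P) = b*P + c*P^2 and identical capacities
COST = {1: (10.0, 0.02), 2: (11.0, 0.02), 3: (12.0, 0.02)}
PMAX = {i: 100.0 for i in COST}
DEMAND = [180.0 + 60.0 * math.sin(math.pi * h / 24.0) for h in range(HOURS)]

def price_levels(i):
    """Three price levels (b_l, b_m, b_h) around the marginal cost at capacity."""
    b, c = COST[i]
    m = b + 2.0 * c * PMAX[i]
    return [0.9 * m, m, 1.1 * m]

BIDS = {i: price_levels(i) for i in COST}

def clear(prices, demand):
    """Merit-order clearing of full-capacity blocks: returns (MCP, allocation)."""
    alloc, rem, mcp = {i: 0.0 for i in prices}, demand, 0.0
    for i in sorted(prices, key=prices.get):
        take = min(PMAX[i], max(rem, 0.0))
        if take > 0.0:
            alloc[i], mcp = take, prices[i]
        rem -= take
    return mcp, alloc

def softmax(prefs, c_mult):
    tau = max(abs(c_mult * sum(prefs) / len(prefs)), 1e-6)      # eq. (7)
    z = [math.exp((p - max(prefs)) / tau) for p in prefs]       # stabilized eq. (6)
    total = sum(z)
    return [v / total for v in z]

pref = [[0.0, 0.0, 0.0] for _ in range(HOURS)]   # actor: preferences per hour
V = [0.0] * HOURS                                # critic: state values per hour

for it in range(10000):
    c_mult = 2.0 if it < 2500 else (1.0 if it < 7500 else 0.5)  # annealing (Sec. IV)
    for h in range(HOURS):
        a1 = random.choices(range(3), weights=softmax(pref[h], c_mult))[0]
        prices = {1: BIDS[1][a1],
                  2: random.choice(BIDS[2]),     # rivals simplified to random bids
                  3: random.choice(BIDS[3])}
        mcp, alloc = clear(prices, DEMAND[h])
        b, c = COST[1]
        reward = mcp * alloc[1] - (b * alloc[1] + c * alloc[1] ** 2)   # eq. (10)
        nxt = (h + 1) % HOURS                    # the 24th hour wraps to the 1st
        delta = reward + GAMMA * V[nxt] - V[h]   # TD error, eq. (4)
        pref[h][a1] += ALPHA * delta             # actor update, eq. (5)
        V[h] += BETA * delta                     # critic update, eq. (8)

# Most preferred bid level (0 = b_l, 1 = b_m, 2 = b_h) for each hour
print([max(range(3), key=lambda a: pref[h][a]) for h in range(HOURS)])
```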
The minimum up/down time constraints on the generating
units can be incorporated in the formulation. As shown in Fig. 3,
the constraint violation has to be identified at the end of every
hour. A generating unit may have to be down either because
of an outage or for satisfying a minimum generation limit as it
becomes uneconomical to run the unit. Similarly, a generating unit may come onto the grid at a later stage. Once a unit is down, that participant does not participate in the auction for the constrained hours. Hence, the number of states to be considered in those hours is reduced. Once the unit is up after
the constraint hours are over, the number of states increases to
the original value. This has been applied to sample system I.
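A small hypothetical sketch of how the constrained hours shrink the per-hour state space is given below: the participant whose unit is down simply drops out of the Cartesian product of bids for those hours.

```python
from itertools import product

bid_sets = {1: list(range(9)), 2: list(range(3)), 3: list(range(3))}  # sizes 9, 3, 3
down = {2: {0, 1, 2}}   # hypothetical: Genco 2 is down for hours 0-2

def states_for_hour(h):
    """States of one hour, excluding participants whose units are down."""
    active = [p for p in bid_sets if h not in down.get(p, set())]
    return list(product(*(bid_sets[p] for p in active)))

print(len(states_for_hour(0)), len(states_for_hour(5)))  # 27 vs. 81
```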
Fig. 4. Load curve for 24 h.
IV. IMPLEMENTATION
Fig. 5. Probabilities over the bid set for 24 h.
We apply the method mentioned before to three sample systems involving three, seven, and ten Gencos.
A. Sample System I
The sample system used in [4] is considered here. The three Gencos are the three participants who submit bids and are allocated portions of the total demand. The cost curves and the maximum generation capacity of each participant are as given in [4]. A startup cost is specified for participant 1 for the unit supplying the upper 25 MW of its cost curve. The 24-h load curve is shown in Fig. 4.
Participant 1 forms the bid set for itself and for the competitors as mentioned in (9). In this example, we have taken a two-part bid for participant 1 (i.e., $n = 2$). Fifty percent of the maximum generating capacity is the end point of the first part of the bid (i.e., $P_1$), and the maximum capacity is the end point of the second part of the bid (i.e., $P_2$).
1) Results: The market simulation was carried out for the above example. The discount factor is 0.9. One of the step-size parameters is taken as a function of the iteration number; the step-size parameter for the preference update is 0.01. We have observed that this step size should be low: higher values show apparent early convergence, but the solution may not be a feasible one. The temperature parameter for the softmax selection (6) is varied with progress in iterations, the temperature being reduced as the iterations increase: $\tau$ is twice the mean of the preferences during the first 2500 iterations, equal to the mean during the next 5000 iterations, and half of the mean during the last 2500 iterations. A total of 10 000 iterations were carried out for the
above problem. The probabilities over the bid set of participant
1 are obtained as shown in Fig. 5. From the results, it is seen that
during the offpeak period (22 to 8 h), the higher payoff is obtained by bidding at marginal cost. During peak demand hours, a better payoff is obtained by bidding at the higher end of the cost curve.

Fig. 6. State values of the 13th hour for nine selected important states.

Fig. 7. Probabilities over the bid set for hour number 12.
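The temperature schedule just described can be captured by a small helper (hypothetical names), which would be combined with the softmax selection of (6).

```python
def temperature_multiplier(iteration, total=10000):
    """Multiplier on the mean preference: 2 for the first quarter of the run,
    1 for the middle half, and 0.5 for the final quarter, as described above."""
    if iteration < 0.25 * total:
        return 2.0
    if iteration < 0.75 * total:
        return 1.0
    return 0.5

def temperature(prefs, iteration, total=10000):
    return temperature_multiplier(iteration, total) * sum(prefs) / len(prefs)

print(temperature([0.4, 1.1, 0.7], 100), temperature([0.4, 1.1, 0.7], 9000))
```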
Fig. 6 presents an average of the state value as the iterations
progress. Figs. 7 and 6 show the relationship between the state
values and the probabilities of the states being selected. Participants 2 and 3 assign higher probabilities to their respective bid
number 3s in hour number 12. Hence, the agent of participant 1 learns to allot higher probabilities to bid number 4, leading to state 36, which has a larger state value in the next hour. The immediate profit due to the state transition also affects the probabilities. Similarly, the states having the largest probabilities of being selected by the participants over the 24 h are shown in Fig. 8. The x-axis limits of Fig. 8 are hour number 2 and hour number 25 (the first hour of the next day).

Fig. 8. Most probable states in the 24-h period for sample system I.

Column two of Table II (SS-I) presents the averaged values of the immediate rewards given by (10), taken over all iterations. Maximizing profit is the same as maximizing the state values of the most probable states.

TABLE II
AVERAGE OF IMMEDIATE REWARDS FOR 24 H
(SS-I is sample system I; SS-II is sample system II; SS-III is sample system III; RL is reinforcement learning; GT is game theory.)

2) Results With Up/Down Time Constraints: Here, up/down time constraints are included. It is observed that the minimum generation constraint of Genco 2 is violated during the first few hours of the day, causing a shutdown of the generating unit of Genco 2. The return of Genco 2 is now governed by the down time constraint, which is taken as 3 h. The reward value for Genco 2 in those hours becomes zero. Table I shows the profits of all three Gencos. Here, we can observe that in the first few hours the profit of Genco 2 is zero, as its constraint is violated and it does not participate in the auction. The consideration of minimum up/down time results in a variation of the number of states to be evaluated and increases the benefits of participants 1 and 3 in those constrained hours. The state transitions in those hours are limited to a few states only.

TABLE I
PROFITS OF ALL THREE GENCOS CONSIDERING UP/DOWN TIME FOR SS-I
(G is Genco.)

B. Sample System II

Here, we take a system with seven Gencos; the cost curves and the maximum generation capacities of each Genco are similar to those of sample system I. The daily load curve is taken as before, with the total generating capacity of the seven Gencos just enough to meet the peak system demand. The shape of the load curve is the same as in the earlier system. This system was
tested with reinforcement learning (RL) as well as game theory.
1) Results: Columns three and four of Table II present a comparison of the expected profit earned by participant 1, calculated through RL and game theory, respectively. The game theory approach models each hourly auction separately and hence loses all of the temporal information. The effect of this can be seen in the hourly profits of participant 1. There are 6561 states in each hour when the system is modeled as an RL problem.
C. Sample System III
RL was applied for a sample system with ten Gencos. The
particulars of the system are similar to the two cases mentioned
before. The purpose of performing the study was to test the stability, convergence, and speed characteristics of the RL method under realistic situations. There are 177 147 states in each hour and, hence, a total of 4 251 528 states for the complete 24-h problem.

TABLE III
PROBABILITIES OF BIDS OF PARTICIPANT 1 FOR 24 H FOR SS-III

TABLE IV
COMPARISON OF RESULTS
1) Results: The RL method converged to stable transition probabilities within 10 000 iterations. The time taken for a complete study was approximately 59 min with MATLAB programming on a Pentium IV. The profit earned by participant 1 during each hour is given in column five of Table II.
Table III provides the most probable bid of participant 1 and its
value for each hour.
D. Comparison of Results
Table IV gives the computation time required for the different sample systems. Sample systems SS-I, SS-II, and SS-III correspond to the three-, seven-, and ten-Genco systems. As reported in Table IV, they require 18-, 39-, and 59-min computation times, respectively. Hence, we can say that the increase in computation time is almost linear for the same number of iterations. However, for larger systems, if better accuracy is desired, the number of iterations may be increased. Also, it can be observed that the total
profit for game theory application is nearly 28% less than that
of RL for sample system II.
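As a quick check of the state counts quoted for the three sample systems, the per-hour figure follows 9 x 3^(N-1) for N participants (nine bids for participant 1 and three for each of the others); the helper below is a hypothetical illustration.

```python
def states_per_hour(n_participants):
    """Per-hour state count: 9 bids for participant 1, 3 for each other Genco."""
    return 9 * 3 ** (n_participants - 1)

for n in (3, 7, 10):
    print(n, states_per_hour(n), 24 * states_per_hour(n))
# 3 -> 81 per hour; 7 -> 6561; 10 -> 177147 per hour and 4 251 528 over 24 h
```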
The computation time depends on how many variables are stored. Observing the convergence pattern requires storing the state values, which leads to a large computation time and a large memory requirement. A reduction in both computation time and memory requirement can be achieved by storing only intermediate values, which is implemented for all sample systems (Table IV).
V. CONCLUSIONS
The spot market bidding problem is modeled here as an MDP.
Fig. 6 shows the convergence of state values, indicating that no further improvement is possible, so the method progresses toward convergence to a near-optimal policy. All of the states show improvement in state values. The convergence is a function of the $\alpha$, $\beta$, and $\tau$ values. The selection of the value of $\tau$ is commented on in Section II-B.
Up/down time constraints of generating units are also included in the problem formulation. The state transitions are constrained to only a few states in those hours in which constraints
are violated.
The method is especially suited to situations in which a good model of the process is not available and the risk involved is stochastic in nature. In realistic situations, the transition probabilities are not defined; this implies that dynamic programming cannot be applied to the problem. A cumulative long-term profit maximization problem, viewed as a multistage process in time, cannot be formulated in the framework of game theory. Our formulation of bidding as an MDP can account for such situations effectively.
In RL, the initial transition probabilities are assumed to be equal for all combinations. The combinations with larger profit are reinforced in the outer loop with larger preference values. The inner loop updates the state values as a function of the TD error.
Compared to game theory, RL is able to produce a larger overall profit for sample system II. This could be due to the fact that the game theory formulation divides the problem into 24 separate subproblems. All of the methods require a large computation time. The computation time can be reduced by storing only intermediate state values. Sample system III, with ten Gencos, required nearly 59 min without storing all of the state values for each iteration, with MATLAB on a Pentium IV. Once the learning is over, a better initial guess is available for small changes in the load pattern.
ACKNOWLEDGMENT
The authors would like to thank the Department of Science
and Technology, Government of India, for their support in carrying out this research work.
REFERENCES
[1] G. B. Sheblé, “Priced based operation in an auction market structure,”
IEEE Trans. Power Syst., vol. 11, pp. 1770–1777, Nov. 1996.
[2] C. Li, A. J. Svoboda, X. Guan, and H. Singh, “Revenue adequate bidding
strategies in competitive electricity market,” IEEE Trans. Power Syst.,
vol. 14, pp. 492–497, May 1999.
[3] R. W. Ferrero, S. M. Shahidehpour, and V. C. Ramesh, “Transaction
analysis in deregulated power system using game theory,” IEEE Trans.
Power Syst., vol. 12, pp. 1340–1347, Aug. 1997.
[4] V. Krishna and V. C. Ramesh, “Intelligent agent for negotiations in
market games, Part 2: Application,” IEEE Trans. Power Syst., vol. 13,
pp. 1109–1114, Aug. 1998.
[5] G. R. Gajjar, S. A. Khaparde, and S. A. Soman, “Modified model for
negotiations in market games under deregulated environment,” in Proc.
11th Nat. Power Syst. Conf., Bangalore, India, Dec. 2000.
[6] C. W. Richter, Jr., G. B. Sheblé, and D. Ashlock, “Comprehensive bidding
strategies with genetic programming/finite state automata,” IEEE Trans.
Power Syst., vol. 14, pp. 1207–1212, Nov. 1999.
[7] S. Hao, “A study of basic bidding strategy in clearing pricing auction,”
IEEE Trans. Power Syst., vol. 15, pp. 975–980, Aug. 2000.
[8] D. Zhang, Y. Wang, and P. B. Luh, “Optimization based bidding strategies in the deregulated market,” IEEE Trans. Power Syst., vol. 15, pp.
981–986, Aug. 2000.
[9] H. Song, C.-C. Lin, J. Lawarrée, and R. W. Dahlgren, “Optimal electricity supply bidding by Markov decision process,” IEEE Trans. Power
Syst., vol. 15, pp. 618–624, May 2000.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[11] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive
elements that can solve difficult learning control problems,” IEEE Trans.
Syst., Man, Cybern., vol. SMC-13, pp. 834–846, Sept./Oct. 1983.
G. R. Gajjar received the M.Tech. degree from the Indian Institute of Technology, Bombay, India, and the B.E.E. degree from Maharaja Sayajirao University of Baroda, Baroda, India.
Currently, he is an Engineer at ABB, India Ltd., Baroda, India. His research
interests include power system analysis, operation, and planning.
S. A. Khaparde (M’88–SM’91) is a Professor in the Department of Electrical Engineering, Indian Institute of Technology, Bombay, India.
His research interests include power system computations, analysis, and
deregulation in the power industry.
P. Nagaraju received the B.E.E. degree from Osmania University, College of
Engineering, Hyderabad, India. He is currently pursuing the M.Tech. degree at
the Indian Institute of Technology, Bombay, India.
His research interests include power system analysis and deregulation.
S. A. Soman is an Associate Professor in the Department of Electrical Engineering,
Indian Institute of Technology, Bombay, India.
His research interests include sparse matrix computations, power system analysis, object-oriented programming (OOP), and power system automation.