function NewQas=UpdateQ(JStartEnd,Qas,MoveVec)
% This program updates the value table (as it depends on possible plays)
%
% PlayNim also includes definitions of JStartEnd and Qas
% Input:
%    MoveVec:   the sequence of board positions that comprised the game.
%               Each element is an integer index which yields the state
%               of the game and can be found in NimScript.m
%               as NimColumn{CurrentBoardPosition}.
%    JStartEnd: rows of two elements listing all the possible
%               transitions in the game. A typical row might be [1 4],
%               indicating that one can move from board position 1 to
%               board position 4. It represents the topology of the game.
%
%    Qas:       rows of two elements. The first element of each row is
%               the value of the corresponding row of JStartEnd.
%               The second element is the number of times that the
%               choice given by that row of JStartEnd has been taken.
%
% Output:
%
%    NewQas is the updated version of the values, after compensating for
%    the last round of rewards
%
%% Algorithm is of the TD type
%
%    Q_{k+1} = Q_k + alpha*(r - Q_k)
%
% r is the reward that is eventually delivered
% at present: there is no discounting
%
% I have made the reward so that the last move is a loser (-1),
% next to last is a winner (+1), next to next to last
% is loser, .... etc
%
% This is created in the array ToAdd
%
% alpha is the learning rate;

%% Take MoveVec and write it as pairs ==> MoveStartEnd
%  Create Reward vector ==> ToAdd
alpha=.1;   % only used by the commented-out update at the end of this file;
            % the active update below uses the step size 1/NumAlreadyp1
sMoveVec= length(MoveVec);
MoveStartEnd = zeros(sMoveVec-1,2);   % one row per move played
ToAdd        = zeros(sMoveVec-1,1);   % one reward per move played
for j=2:sMoveVec
    MoveStartEnd(j-1,1:2)= [MoveVec(j-1), MoveVec(j)];  % This is a table of the moves
    ToAdd(j-1) = (-1)^(sMoveVec-j+1);                   % This is the undiscounted reward
end

%%
% Add a weighted version of ToAdd (average it in) to the old state values
% to get the new state values
%
% the more the value TDError below is non-zero, the more that learning is
% taking place
%
% Count rows, not length(): length() of a 1-by-2 matrix is 2, which would
% index past the end of MoveStartEnd when the game has a single move.
sMoveStartEnd= size(MoveStartEnd,1);
NewQas= Qas;   % The new values
for j=1:sMoveStartEnd
    Start = MoveStartEnd(j,1);
    End   = MoveStartEnd(j,2);
    % Look for these in Qas
    MoveInd = find( (JStartEnd(:,1)== Start) & (JStartEnd(:,2)== End));
    NumAlreadyp1 = Qas(MoveInd,2)+1;
    TDError = ToAdd(j)-Qas(MoveInd,1);   % CHANGE ME
    % Moves available from the position just reached (the opponent's options)
    MoveNxtInd = find( JStartEnd(:,1)== End);
    if ~isempty(MoveNxtInd)
        % Non-terminal move: target is minus the opponent's best value
        TDError= -max(Qas(MoveNxtInd,1)) -Qas(MoveInd,1);   % CHANGE ME
    end
    NewQas(MoveInd,1) = NewQas(MoveInd,1) + (1/NumAlreadyp1)*TDError;   % CHANGE ME
    NewQas(MoveInd,2)= NumAlreadyp1;   % update the number of times this move was made
end
% NewQas= Qas + alpha*(ToAdd-Qas);
%%
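%
% Example (a minimal sketch on a made-up three-position game; the
% transition and value tables below are illustrative assumptions, not the
% Nim tables built by PlayNim):
%
%    JStartEnd = [1 2; 1 3; 2 3];   % hypothetical transitions
%    Qas       = zeros(3,2);        % all values and visit counts start at 0
%    MoveVec   = [1 2 3];           % game played: 1 -> 2 -> 3
%    NewQas    = UpdateQ(JStartEnd,Qas,MoveVec)
%
% The final move 2 -> 3 has no successor rows in JStartEnd, so it is
% updated from its reward of -1 with step size 1/1. The move 1 -> 2 is
% updated from -max(Qas(:,1)) over the moves leaving position 2, which is
% still 0 on this first pass. The move 1 -> 3 was not played and is left
% untouched:
%
%    NewQas = [0 1; 0 0; -1 1]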