{Week 1 & 2} Reinforcement Learning NPTEL Assignment Answers 2025

NPTEL Reinforcement Learning Week 1 Assignment Answers 2025

1. Which of the following is not a useful way to approach a standard multi-armed bandit problem with n arms? Assume bandits are stationary.

  • “How can I ensure the best action is the one which is mostly selected as time tends to infinity?”
  • “How can I ensure the total regret as time tends to infinity is minimal?”
  • “How can I ensure an arm which has an expected reward within a certain threshold of the optimal arm is chosen with a probability above a certain threshold?”
  • “How can I ensure that when given any 2 arms, I can select the arm with a higher expected return with a probability above a certain threshold?”
Answer :- d

2. What is the decay rate of the weightage given to past rewards in the computation of the Q function in the stationary and non-stationary updates in the multi-armed bandit problem?

  • hyperbolic, linear
  • linear, hyperbolic
  • hyperbolic, exponential
  • exponential, linear
Answer :- c
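
As a quick reference, here is a sketch of the two update rules usually compared here (the standard sample-average and constant-step-size forms; the notation may differ slightly from the lectures):

```latex
\begin{align*}
  \text{stationary (sample average):}\quad
    Q_{n+1} &= Q_n + \tfrac{1}{n}\,\bigl(R_n - Q_n\bigr)\\
  \text{non-stationary (constant step size } \alpha\text{):}\quad
    Q_{n+1} &= Q_n + \alpha\,\bigl(R_n - Q_n\bigr)
             = (1-\alpha)^{n} Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{\,n-i} R_i
\end{align*}
% The sample average gives every past reward a weight of 1/n (hyperbolic decay),
% while the constant-\alpha update weights R_i by \alpha(1-\alpha)^{n-i} (exponential decay).
```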

3.

Answer :-

4. Assertion: Taking exploratory actions is important for RL agents.
Reason: If the rewards obtained for actions are stochastic, an action which gave a high reward once might give a lower reward the next time.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion
  • Assertion is true and Reason is false
  • Both Assertion and Reason are false
Answer :- 

5. We are trying different algorithms to find the optimal arm of a multi-armed bandit. We plot an expected payoff vs. time graph for each algorithm, where the expected payoff follows some function of time (starting from 0). Which among the following functions will have the least regret? (We know that the optimal expected payoff is 1.) (Hint: Plot the functions.)

  • tanh(t)
  • 1 − 2^(−t)
  • t/20 if t < 20, and 1 after that
  • Same regret for all the above functions.
Answer :- 
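
One way to compare the curves, assuming regret is read as the area between the optimal payoff 1 and each expected-payoff curve:

```latex
\begin{align*}
  \text{regret of } \tanh(t):\quad & \int_{0}^{\infty} \bigl(1 - \tanh t\bigr)\,dt = \ln 2 \approx 0.69\\
  \text{regret of } 1 - 2^{-t}:\quad & \int_{0}^{\infty} 2^{-t}\,dt = \frac{1}{\ln 2} \approx 1.44\\
  \text{regret of the piecewise-linear curve}:\quad & \int_{0}^{20} \Bigl(1 - \frac{t}{20}\Bigr)\,dt = 10
\end{align*}
```

Under this reading, tanh(t) accumulates the smallest gap from the optimal payoff.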

6. Consider the following statements for the ϵ-greedy approach in a non-stationary environment:

i. Keeping a small constant ϵ is a good approach if the environment is non-stationary.
ii. Large values of ϵ will lead to unnecessary exploration in the long run.
iii. For a stationary environment, decaying the ϵ value to zero is a good approach, as after reaching optimality we would like to reduce exploration.

Which of the above statements is/are correct?

  • ii, iii
  • only iii
  • only ii
  • i, ii
Answer :-
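
For intuition, here is a minimal ϵ-greedy selector in Python (the helper names are illustrative, not from the course material); the trade-offs in statements i–iii come down to how ϵ behaves over time:

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng=np.random.default_rng()):
    """Pick a random arm with probability epsilon, else the greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore
    return int(np.argmax(q_estimates))               # exploit

def decayed_epsilon(eps_0, decay, t):
    """A simple 1/t-style decay schedule: shrinks exploration over time."""
    return eps_0 / (1.0 + decay * t)
```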

7.

Answer :- 

8. Which of the following best describes the PAC-optimality approach to bandit problems?
ϵ – the difference between the reward of the chosen arm and the true optimal reward
δ – the probability that the chosen arm is not optimal
N – the number of steps needed to reach PAC-optimality

  • Given δ and ϵ, minimize the number of steps to reach PAC-optimality (i.e., N)
  • Given δ and N, minimize ϵ.
  • Given ϵ and N, maximize the probability of choosing the optimal arm (i.e., minimize δ)
  • None of the above is true about PAC-optimality
Answer :- 
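
For reference, the usual (ϵ, δ)-PAC formulation, stated in standard notation (constants and exact phrasing may differ in the lectures):

```latex
% (\epsilon,\delta)-PAC: with probability at least 1-\delta, the returned arm a is
% \epsilon-close to optimal:
\Pr\!\left[ q^{*}(a) \ge \max_{a'} q^{*}(a') - \epsilon \right] \ge 1 - \delta
% N, the total number of pulls needed to guarantee this, is the sample complexity,
% and it is the quantity one typically tries to minimize for given \epsilon and \delta.
```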

9. Suppose we have a 10-armed bandit problem where the reward for each of the 10 arms is deterministic and in the range (0, 10). Which among the following methods will allow us to accumulate the maximum reward in the long term?

  • ϵ-greedy with ϵ=0.1.
  • ϵ-greedy with ϵ=0.01.
  • greedy with initial reward estimates set to 0.
  • greedy with initial reward estimates set to 10.
Answer :- 
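
A small simulation sketch of this deterministic setting (arm rewards drawn once and then fixed; the values are hypothetical), illustrating why optimistic initialization matters for a purely greedy learner:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = rng.uniform(0, 10, size=10)   # deterministic reward of each arm

def run_greedy(initial_estimate, steps=1000):
    q = np.full(10, float(initial_estimate))
    counts = np.zeros(10)
    total = 0.0
    for _ in range(steps):
        a = int(np.argmax(q))                # purely greedy choice
        r = true_rewards[a]                  # deterministic payoff
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]       # sample-average update
        total += r
    return total

# Pessimistic initialization (0) locks onto the first arm tried;
# optimistic initialization (10) forces every arm to be tried once,
# after which the greedy policy sticks to the true best arm.
print(run_greedy(0), run_greedy(10))
```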

10. Which of the following is/are correct and valid reasons to consider sampling actions from a softmax distribution instead of using an ϵ-greedy approach?

i. Softmax exploration makes the probability of picking an action proportional to the action-value estimates. By doing so, it avoids wasting time exploring obviously ’bad’ actions.
ii. We do not need to worry about decaying exploration slowly like we do in the ϵ-greedy case. Softmax exploration gives us asymptotic correctness even for a sharp decrease in temperature.
iii. It helps us differentiate between actions whose action-value estimates (Q values) are very close to that of the action with the maximum Q value.

Which of the above statements is/are correct?

  • i, ii, iii
  • only iii
  • only i
  • i, ii
  • i, iii
Answer :-
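
A minimal softmax (Boltzmann) action-selection sketch to make statement i concrete; `tau` is the temperature and the function name is illustrative:

```python
import numpy as np

def softmax_action(q_estimates, tau, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    prefs = np.asarray(q_estimates, dtype=float) / tau
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Actions with clearly low Q-values receive exponentially small probability,
# so exploration concentrates on near-optimal actions rather than being
# spread uniformly as in epsilon-greedy.
```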

NPTEL Reinforcement Learning Week 2 Assignment Answers 2025

1. Which of the following is true of the UCB algorithm?

  • The action with the highest Q value is chosen at every iteration
  • After a very large number of iterations, the confidence intervals of unselected actions will not change much
  • The true expected value of an action always lies within its estimated confidence interval.
  • With a small probability ϵ, we select a random action to ensure adequate exploration of the action space.
Answer :-
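
For reference, the standard UCB1 selection rule, written in common notation (the exploration constant used in the lectures may differ):

```latex
A_t \;=\; \arg\max_{a}\left[\, Q_t(a) \;+\; c\,\sqrt{\frac{\ln t}{n_t(a)}} \,\right]
% Q_t(a): current value estimate of arm a, n_t(a): number of times a has been pulled,
% c: exploration constant (often c = \sqrt{2}).
```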

2.

  • Sub-optimal arms would be chosen more frequently
  • Sub-optimal arms would be chosen less frequently
  • Makes no change to the frequency of picking sub-optimal arms.
  • Sub-optimal arms could be chosen less or more frequently, depending on the samples.
Answer :- 

3. In a 4-armed bandit problem, after executing 100 iterations of the UCB algorithm, the Q-value estimates are Q100(1) = 1.73, Q100(2) = 1.83, Q100(3) = 1.89, Q100(4) = 1.55, and the number of times each arm has been sampled is n1 = 25, n2 = 20, n3 = 30, n4 = 15. Which arm will be sampled in the next trial?

  • Arm 1
  • Arm 2
  • Arm 3
  • Arm 4
Answer :- 
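
A quick computation sketch for this question, assuming the UCB1 bonus sqrt(2 ln t / n_j) with t = 100:

```python
import numpy as np

q = np.array([1.73, 1.83, 1.89, 1.55])   # Q_100 estimates from the question
n = np.array([25, 20, 30, 15])           # pull counts
t = 100

ucb = q + np.sqrt(2 * np.log(t) / n)     # assumed UCB1 bonus form
print(np.round(ucb, 3))                  # -> [2.337 2.509 2.444 2.334]
print(int(np.argmax(ucb)) + 1)           # 1-based index of the arm with the largest UCB
```

Under this form of the bonus, arm 2 has the largest upper confidence bound; dropping the factor of 2 inside the square root does not change the ordering.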

4. We need 8 rounds of median elimination to get an (ϵ, δ)-PAC arm. Approximately how many samples would have been required using the naive (ϵ, δ)-PAC algorithm, given (ϵ, δ) = (1/2, 1/e)? (Choose the value closest to the correct answer.)

  • 15000
  • 10000
  • 500
  • 20000
Answer :- 
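
A hedged sketch of the reasoning: 8 halving rounds of median elimination suggest roughly 2^8 = 256 arms, and the naive (ϵ, δ)-PAC algorithm pulls every arm the same fixed number of times before returning the empirically best one. One common form of the bound is from Even-Dar et al.; the constant in front may differ in the lectures:

```latex
% Number of arms implied by 8 halving rounds:
n = 2^{8} = 256
% Naive algorithm: pull each arm \ell times, then pick the empirically best arm, with
\ell = \frac{4}{\epsilon^{2}} \ln\!\left(\frac{2n}{\delta}\right),
\qquad
N_{\text{naive}} = n \cdot \ell
% With \epsilon = 1/2 and \delta = 1/e this works out to a total on the order of 10^{4}
% pulls; the exact figure depends on the constant used in the concentration bound.
```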

5.

Which of these equalities/inequalities are correct?

  • i and iii
  • ii and iv
  • i, ii, iii
  • i, ii, iii, iv
Answer :- 

6.

Answer :-

7. In the median elimination method for (ϵ, δ)-PAC bounds, we claim that for every phase l, Pr[A ≤ B + ϵ_l] > 1 − δ_l. (S_l is the set of arms remaining in the l-th phase.)

Consider the following statements:

(i) A is the maximum of the rewards of the true best arm in S_l, i.e. in the l-th phase
(ii) B is the maximum of the rewards of the true best arm in S_(l+1), i.e. in the (l+1)-th phase
(iii) B is the minimum of the rewards of the true best arm in S_(l+1), i.e. in the (l+1)-th phase
(iv) A is the minimum of the rewards of the true best arm in S_l, i.e. in the l-th phase
(v) A is the maximum of the rewards of the true best arm in S_(l+1), i.e. in the (l+1)-th phase
(vi) B is the maximum of the rewards of the true best arm in S_l, i.e. in the l-th phase

Which of the statements above are correct?

  • i and ii
  • iii and iv
  • v and vi
  • i and iii
Answer :- 

8. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?

  • After each sample is drawn, the q∗ distribution for that sampled arm is updated to be closer to the true distribution.
  • Thompson sampling has been shown to generally give better regret bounds than UCB.
  • In Thompson sampling, we do not need to eliminate arms each round to get good sample complexity
  • The algorithm requires that we use Gaussian priors to represent distributions over q∗ values for each arm
Answer :- 
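
A minimal Thompson sampling sketch for Bernoulli arms with Beta priors (the arm means are made up for illustration); it also shows that the posterior family need not be Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.2, 0.5, 0.7])    # hypothetical Bernoulli arm means

alpha = np.ones(3)                        # Beta(1, 1) prior over each arm's mean
beta = np.ones(3)

for _ in range(1000):
    samples = rng.beta(alpha, beta)       # one draw from each arm's posterior
    a = int(np.argmax(samples))           # play the arm whose sample is largest
    reward = rng.random() < true_probs[a] # Bernoulli reward
    alpha[a] += reward                    # conjugate posterior update
    beta[a] += 1 - reward

print(alpha / (alpha + beta))             # posterior means concentrate near true_probs
```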

9. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations.
Reason: The n_j term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion
  • Assertion is true and Reason is false
  • Both Assertion and Reason are false
Answer :- 
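
To reason about this assertion, it helps to look at the standard UCB1 bonus term (assuming this is the form the course uses):

```latex
c_{t}(j) \;=\; \sqrt{\frac{2\ln t}{n_{j}}}
% For an arm that is not selected, n_j stays fixed while \ln t keeps growing,
% so its confidence bound widens over time; only pulling the arm (increasing n_j)
% shrinks the bound.
```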

10.

Answer :-
