Reinforcement Learning Week 3 NPTEL Assignment Answers 2025

Need help with this week’s assignment? Get detailed, trusted solutions for the Reinforcement Learning Week 3 NPTEL assignment. Our expert-curated answers help you complete your assignments faster while deepening your conceptual clarity.

✅ Subject: Reinforcement Learning
📅 Week: 3
🎯 Session: NPTEL 2025 July-October
🔍 Reliability: Verified and expert-reviewed answers
📌 Trusted By: 5000+ Students

For complete and in-depth solutions to all weekly assignments, check out 👉 NPTEL Reinforcement Learning Week 3 Assignment Answers

🚀 Stay ahead in your NPTEL journey with fresh, updated solutions every week!

NPTEL Reinforcement Learning Week 3 Assignment Answers 2025

1. The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)?

  • r_{n−1}
  • r_n
  • Action taken (a_n)
  • None of the above
Answer : See Answers
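
The key point behind this question is that an action-independent baseline leaves the REINFORCE gradient estimate unbiased. Below is a minimal, illustrative sketch (not the course’s reference code) of REINFORCE on a two-armed bandit with a softmax policy, where the baseline is a running reward average computed before the current action and reward are seen; the arm means and step sizes are made-up values.

```python
import numpy as np

# Illustrative sketch: REINFORCE on a 2-armed bandit with a softmax
# policy and a baseline. The baseline is the running average of past
# rewards, computed BEFORE seeing the current action a_n and reward r_n,
# so it cannot depend on either of them.

rng = np.random.default_rng(0)
theta = np.zeros(2)                # softmax preferences, one per arm
arm_means = np.array([0.2, 0.8])   # hypothetical true mean rewards
baseline = 0.0
alpha, beta = 0.1, 0.05            # step sizes for theta and baseline

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for n in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)              # a_n
    r = arm_means[a] + rng.normal(0, 0.1)   # r_n
    # Score function for softmax: d log pi(a) / d theta = onehot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # Because the baseline is independent of a_n and r_n,
    # E[(r - baseline) * grad_log_pi] is still the true policy gradient.
    theta += alpha * (r - baseline) * grad_log_pi
    baseline += beta * (r - baseline)       # updated only after use

print("learned action probabilities:", softmax(theta))
```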

2. Which of the following statements is true about the RL problem?

  • Our main aim is to maximize the cumulative reward.
  • The agent always performs the actions in a deterministic fashion.
  • We assume that the agent determines the next state based on the current state and action.
  • It is impossible to have zero rewards.
Answer :

3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE, and a_t denotes the action taken at step t.

[The four candidate updates (i)–(iv) appear as images in the original question.]

Which of the above updates are correct?

  • (i), (iii)
  • (i), (iv)
  • (ii), (iv)
  • (ii), (iii)
Answer :
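
Since the candidate updates (i)–(iv) are images, here is the standard calculus they are built on: the score-function derivatives of a Gaussian policy, ∂ log π/∂µ = (a − µ)/σ² and ∂ log π/∂σ = ((a − µ)² − σ²)/σ³. The sketch below verifies them numerically; all point values are arbitrary.

```python
import numpy as np

# Sketch: score-function terms REINFORCE needs for a Gaussian policy
# pi(a) = N(mu, sigma^2). These are standard results, not the specific
# updates (i)-(iv) shown in the question's images.

def log_pi(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def score(a, mu, sigma):
    d_mu = (a - mu) / sigma**2                          # d log pi / d mu
    d_sigma = (a - mu) ** 2 / sigma**3 - 1.0 / sigma    # d log pi / d sigma
    return d_mu, d_sigma

# Check against central finite differences at an arbitrary point.
a, mu, sigma, eps = 1.3, 0.5, 0.8, 1e-6
d_mu, d_sigma = score(a, mu, sigma)
num_mu = (log_pi(a, mu + eps, sigma) - log_pi(a, mu - eps, sigma)) / (2 * eps)
num_sigma = (log_pi(a, mu, sigma + eps) - log_pi(a, mu, sigma - eps)) / (2 * eps)
print(d_mu, num_mu)        # should match closely
print(d_sigma, num_sigma)  # should match closely
```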

4. [This question appears as an image in the original assignment.]

Answer :

5. Consider the following policy-search algorithm for a multi-armed binary bandit:

[The update rule appears as an image in the original question.]

where 𝟙(a = a_t) is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?

  • It is the L_{R−I} algorithm.
  • It is the L_{R−ϵP} algorithm.
  • It would work well if the best arm had a probability of 0.9 of resulting in +1 reward and the next-best arm had a probability of 0.5 of resulting in +1 reward.
  • It would work well if the best arm had a probability of 0.3 of resulting in +1 reward and the worst arm had a probability of 0.25 of resulting in +1 reward.
Answer : See Answers
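
For reference, here is a sketch of the textbook L_{R−I} (linear reward-inaction) scheme for a two-armed binary bandit, which is one plausible reading of the update in the question’s image: probabilities move toward the chosen arm on a +1 reward and are left unchanged otherwise. The success probabilities and step size are illustrative.

```python
import numpy as np

# Sketch of the classic L_{R-I} (linear reward-inaction) scheme for a
# two-armed binary bandit. This is the textbook algorithm; it may differ
# in detail from the update shown in the question's image.

rng = np.random.default_rng(1)
p = np.array([0.5, 0.5])        # action probabilities
success = np.array([0.9, 0.5])  # hypothetical P(reward = +1) per arm
alpha = 0.05

for t in range(5000):
    a = rng.choice(2, p=p)
    r = 1 if rng.random() < success[a] else 0
    if r == 1:
        # Reward: move probability mass toward the chosen arm, using the
        # indicator (1 for the chosen arm, 0 otherwise).
        onehot = np.zeros(2)
        onehot[a] = 1.0
        p += alpha * (onehot - p)
    # Penalty (r == 0): "inaction" -- probabilities are left unchanged.

print("final action probabilities:", p)
```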

6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode, and giving a reward according to the corresponding bandit and arm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion
  • Assertion is true and Reason is false
  • Both Assertion and Reason are false
Answer :
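
Below is a toy sketch of the construction described in the Reason, with made-up numbers: each context is a state, each arm an action, and every action immediately terminates the episode with a bandit-style reward.

```python
import numpy as np

# Sketch of the Reason's construction: a contextual bandit with n
# contexts becomes an episodic MDP with n states, where every action
# ends the episode immediately. All numbers are illustrative.

rng = np.random.default_rng(2)
n_states, n_arms = 3, 2
# mean_reward[s, a]: hypothetical mean reward of arm a in context s
mean_reward = rng.uniform(0, 1, size=(n_states, n_arms))

def episode(policy):
    s = rng.integers(n_states)              # a context (state) is drawn
    a = policy(s)                           # one action is taken ...
    r = rng.normal(mean_reward[s, a], 0.1)
    return r                                # ... then the episode terminates

# Example: average reward of a uniformly random policy.
avg = np.mean([episode(lambda s: rng.integers(n_arms)) for _ in range(1000)])
print("avg reward of random policy:", avg)
```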

7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. A few time steps later, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?

  • π is definitely a Stationary policy
  • π is definitely a Non-Stationary policy
  • π can be Stationary or Non-Stationary
Answer :
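
The subtlety here is that a stationary policy can still be stochastic: its action distribution for each state is fixed in time, yet two visits to the same state can sample different actions. A tiny illustrative sketch:

```python
import numpy as np

# Sketch: a *stationary* stochastic policy. The action distribution for
# state s never changes, yet two visits to s can yield different sampled
# actions -- so observing a_1 != a_2 in state s does not by itself prove
# the policy is non-stationary.

rng = np.random.default_rng(3)
pi = {"s": [0.5, 0.5]}  # fixed (time-independent) distribution over two actions

first_visit = rng.choice(2, p=pi["s"])   # action at time t
second_visit = rng.choice(2, p=pi["s"])  # action at time t'
print(first_visit, second_visit)         # can differ, e.g. 0 and 1
```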

8. The stochastic gradient ascent/descent update occurs in the right direction at every step.

  • True
  • False
Answer :
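
One way to see the issue: a single stochastic gradient is an unbiased estimate of the true gradient, but an individual sample can point the wrong way. The sketch below uses an assumed toy objective f(w) = E[(w − x)²] with x ~ N(1, variance 4):

```python
import numpy as np

# Sketch: sample gradients are unbiased estimates of the true gradient,
# but any single sample can point in the wrong direction.
# Assumed toy objective: f(w) = E_x[(w - x)^2] with x ~ N(1, 4).

rng = np.random.default_rng(4)
w = 0.0
true_grad = 2 * (w - 1.0)                                  # gradient of E[(w - x)^2]
sample_grads = 2 * (w - rng.normal(1.0, 2.0, size=1000))   # one per sampled x

print("true gradient:", true_grad)
print("mean of sample gradients:", sample_grads.mean())    # close to true gradient
frac_wrong = np.mean(np.sign(sample_grads) != np.sign(true_grad))
print("fraction of samples pointing the wrong way:", frac_wrong)
```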

9. Which of the following is true for an MDP?

[The answer options appear as images in the original assignment.]
Answer :

10. Recall that for discounted returns,

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?

  • Nothing; γ > 1 is common for many RL problems.
  • Theoretically nothing can go wrong, but this case does not represent any real-world problems.
  • The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
  • None of the above is true.
Answer : See Answers
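
To see what goes wrong with γ > 1, compare partial sums of G_t for a constant reward of 1 per step: with γ = 0.9 the return converges, while with γ = 5 later rewards are weighted ever more heavily and the sum explodes, so delayed rewards dominate everything the agent does now. A small illustrative sketch:

```python
import numpy as np

# Sketch: partial sums of G_t = sum_k gamma^k * r_{t+k} with a constant
# reward of 1 per step. With gamma = 0.9 the sum converges (toward 10);
# with gamma = 5 each extra term is weighted 5x more, so the "return"
# blows up the further into the future we look.

for gamma in (0.9, 5.0):
    partial = np.cumsum(gamma ** np.arange(20))  # first 20 terms, r = 1
    print(f"gamma={gamma}: G after 5 terms={partial[4]:.1f}, "
          f"after 20 terms={partial[19]:.3g}")
```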