Need help with this week’s assignment? Get detailed, trusted solutions for the NPTEL Reinforcement Learning Week 3 assignment. Our expert-curated answers help you complete your assignments faster while deepening your conceptual clarity.
✅ Subject: Reinforcement Learning
📅 Week: 3
🎯 Session: NPTEL 2025 July-October
🔗 Course Link: Click Here
🔍 Reliability: Verified and expert-reviewed answers
📌 Trusted By: 5000+ Students
For complete and in-depth solutions to all weekly assignments, check out 👉 NPTEL Reinforcement Learning Week 3 Assignment Answers
🚀 Stay ahead in your NPTEL journey with fresh, updated solutions every week!
NPTEL Reinforcement Learning Week 3 Assignment Answers 2025
1. The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)?
- r_{n−1}
- r_n
- Action taken (a_n)
- None of the above
Answer : See Answers
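To see what this question is getting at: in the REINFORCE update, the baseline may depend on the state or on quantities observed before the action, but if it depends on the action a_n itself, the expectation E[baseline · ∇log π] is no longer zero and the unbiasedness step in the proof breaks. Below is a minimal illustrative sketch (function and variable names are our own, not from the lecture code):

```python
import numpy as np

def reinforce_gradient(log_prob_grad, sampled_return, baseline):
    """Single-sample REINFORCE gradient estimate with a baseline.

    log_prob_grad  : gradient of log pi(a_n | s_n; theta) w.r.t. theta
    sampled_return : return observed after taking action a_n
    baseline       : may depend on the state or on past rewards, but must NOT
                     depend on the action a_n; otherwise
                     E[baseline * grad log pi] != 0 and the estimator is biased.
    """
    return (sampled_return - baseline) * log_prob_grad

# Hypothetical usage with a constant (action-independent) baseline.
grad = reinforce_gradient(np.array([0.2, -0.5]), sampled_return=1.0, baseline=0.4)
print(grad)
```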
2. Which of the following statements is true about the RL problem?
- Our main aim is to maximize the cumulative reward.
- The agent always performs the actions in a deterministic fashion.
- We assume that the agent determines the next state based on the current state and action.
- It is impossible to have zero rewards.
Answer :
3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE, and a_t denotes the action taken at step t.

Which of the above updates are correct?
- (i), (iii)
- (i), (iv)
- (ii), (iv)
- (ii), (iii)
Answer :
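For reference, a Gaussian policy π(a) = N(a; µ, σ²) has score-function derivatives ∂log π/∂µ = (a_t − µ)/σ² and ∂log π/∂σ = ((a_t − µ)² − σ²)/σ³, which are the building blocks of the candidate updates. A small sketch of one REINFORCE step using them (the learning rate, return value, and names below are illustrative assumptions, not the assignment’s notation):

```python
import numpy as np

def gaussian_score(a_t, mu, sigma):
    """Score-function terms for a Gaussian policy N(mu, sigma^2):
    d/d_mu    log pi(a_t) = (a_t - mu) / sigma**2
    d/d_sigma log pi(a_t) = ((a_t - mu)**2 - sigma**2) / sigma**3
    """
    d_mu = (a_t - mu) / sigma**2
    d_sigma = ((a_t - mu)**2 - sigma**2) / sigma**3
    return d_mu, d_sigma

# One illustrative REINFORCE step: scale each score term by the return G_t.
mu, sigma, alpha, G_t = 0.0, 1.0, 0.1, 2.0
a_t = np.random.normal(mu, sigma)
d_mu, d_sigma = gaussian_score(a_t, mu, sigma)
mu += alpha * G_t * d_mu
sigma += alpha * G_t * d_sigma
print(mu, sigma)
```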
4.

Answer :
5. Consider the following policy-search algorithm for a multi-armed binary bandit:

where 1{a = a_t} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?
- It is the L_{R−I} algorithm.
- It is the L_{R−ϵP} algorithm.
- It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward and the next best arm had a probability of 0.5 of resulting in a +1 reward.
- It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward and the worst arm had a probability of 0.25 of resulting in a +1 reward.
Answer : See Answers
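As background for this question: the L_{R−I} (linear reward-inaction) scheme updates the arm probabilities only when the reward is 1 and leaves them unchanged on a 0 reward. A rough sketch, assuming the standard reward-inaction update (the arm success probabilities below are made up purely for illustration):

```python
import numpy as np

def lr_i_update(probs, a_t, reward, alpha=0.1):
    """Linear reward-inaction (L_R-I) update for a multi-armed binary bandit.

    On reward = 1: p[a] += alpha * (1{a == a_t} - p[a]) for every arm a,
    which moves probability mass toward the chosen arm.
    On reward = 0: no update (the 'inaction' part).
    """
    if reward == 1:
        indicator = np.zeros_like(probs)
        indicator[a_t] = 1.0
        probs = probs + alpha * (indicator - probs)
    return probs

# Hypothetical run: two arms with well-separated success probabilities.
rng = np.random.default_rng(0)
arm_success = [0.9, 0.5]            # illustrative values only
probs = np.array([0.5, 0.5])
for _ in range(2000):
    a_t = rng.choice(2, p=probs)
    reward = int(rng.random() < arm_success[a_t])
    probs = lr_i_update(probs, a_t, reward)
print(probs)                         # tends to concentrate on arm 0
```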
6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states, where n is the number of bandits. The number of actions from each state corresponds to the arms of that bandit, with every action leading to termination of the episode and giving a reward according to the corresponding bandit and arm.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion
- Assertion is true and Reason is false
- Both Assertion and Reason are false
Answer :
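The Reason describes a simple construction: one MDP state per context/bandit, one action per arm, and every episode ends after a single step with that arm’s reward. A tiny sketch of such a one-step MDP, with hypothetical reward probabilities:

```python
import numpy as np

# Hypothetical contextual bandit rewritten as a one-step episodic MDP:
# each state corresponds to one bandit (context), each action to one arm,
# and every action terminates the episode with that arm's Bernoulli reward.
reward_prob = {
    0: [0.2, 0.7],   # bandit/state 0: two arms
    1: [0.5, 0.1],   # bandit/state 1: two arms
}

def play_episode(state, action, rng=np.random.default_rng()):
    reward = int(rng.random() < reward_prob[state][action])
    done = True      # episodes are one step long by construction
    return reward, done

print(play_episode(state=0, action=1))
```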
7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a1. After a few time steps, at time t′, the same state s was reached, where we performed an action a2 (≠ a1). Which of the following statements is true?
- π is definitely a Stationary policy
- π is definitely a Non-Stationary policy
- π can be Stationary or Non-Stationary
Answer :
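One point worth recalling here: a stationary policy fixes the distribution π(·|s) for each state, but if that distribution is stochastic, repeated visits to the same state can still produce different sampled actions. A toy sketch (the state and probabilities are invented for illustration):

```python
import numpy as np

# A stationary *stochastic* policy: the distribution pi(.|s) never changes,
# yet repeated visits to the same state can yield different sampled actions.
rng = np.random.default_rng(42)
pi = {"s": [0.5, 0.5]}                    # hypothetical state with two actions

first_visit = rng.choice(2, p=pi["s"])    # action at time t
second_visit = rng.choice(2, p=pi["s"])   # action at time t'
print(first_visit, second_visit)          # may differ even though pi is fixed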
8. A stochastic gradient ascent/descent update occurs in the right direction at every step.
- True
- False
Answer :
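Background for this one: a stochastic gradient step uses a noisy sample of the gradient, so it is only correct in expectation; an individual step can point away from the true ascent/descent direction. A toy check, using an assumed quadratic objective and made-up data:

```python
import numpy as np

# Toy check: minimise f(w) = E[(w - x)^2] / 2 with samples x ~ N(1, 4).
# The true gradient at w = 0 is (w - E[x]) = -1, but an individual
# stochastic gradient (w - x_i) can be positive, i.e. point the wrong way.
rng = np.random.default_rng(0)
w = 0.0
samples = rng.normal(1.0, 2.0, size=10)
stochastic_grads = w - samples
print("true gradient:", w - 1.0)
print("sample gradients:", np.round(stochastic_grads, 2))
print("wrong direction on", np.sum(stochastic_grads > 0), "of 10 samples")
```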
9. Which of the following is true for an MDP?

Answer :
10. Remember that for discounted returns,
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...
where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?
- Nothing, γ>1 is common for many RL problems
- Theoretically nothing can go wrong, but this case does not represent any real world problems
- The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
- None of the above is true.
Answer : See Answers
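A quick numerical check of what γ > 1 does to the return (the constant reward of 1 is used purely for illustration):

```python
# Partial sums of G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# with a constant reward of 1, comparing gamma = 0.9 and gamma = 5.
def partial_return(gamma, n_terms, reward=1.0):
    return sum(reward * gamma**k for k in range(n_terms))

for gamma in (0.9, 5.0):
    print(gamma, [round(partial_return(gamma, n), 1) for n in (1, 5, 10)])
# gamma = 0.9 converges toward 10; gamma = 5 blows up (roughly 2.4 million
# after only 10 terms), with later rewards dominating earlier ones.
```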


