Need help with this week’s assignment? Get detailed and trusted solutions for Reinforcement Learning Week 2 NPTEL Assignment Answers. Our expert-curated answers help you solve your assignments faster while deepening your conceptual clarity.
✅ Subject: Reinforcement Learning
📅 Week: 2
🎯 Session: NPTEL 2025 July-October
🔍 Reliability: Verified and expert-reviewed answers
📌 Trusted By: 5000+ Students
For complete and in-depth solutions to all weekly assignments, check out 👉 NPTEL Reinforcement Learning Week 2 Assignment Answers
🚀 Stay ahead in your NPTEL journey with fresh, updated solutions every week!
NPTEL Reinforcement Learning Week 2 Assignment Answers 2025
1. Which of the following is true of the UCB algorithm?
- The action with the highest Q value is chosen at every iteration.
- After a very large number of iterations, the confidence intervals of unselected actions will not change much.
- The true expected value of an action always lies within its estimated confidence interval.
- With a small probability ε, we select a random action to ensure adequate exploration of the action space.
Answer : See Answers
2.

- Sub-optimal arms would be chosen more frequently.
- Sub-optimal arms would be chosen less frequently.
- Makes no change to the frequency of picking sub-optimal arms.
- Sub-optimal arms could be chosen less or more frequently, depending on the samples.
Answer : See Answers
3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates of the Q values are Q100(1)=1.73, Q100(2)=1.83, Q100(3)=1.89, Q100(4)=1.55, and the number of times each arm has been sampled is n1=25, n2=20, n3=30, n4=25. Which arm will be sampled in the next trial?
- Arm 1
- Arm 2
- Arm 3
- Arm 4
Answer : See Answers
4. We need 6 rounds of median elimination to get an (ε,δ)-PAC arm. Approximately how many samples would have been required using the naive (ε,δ)-PAC algorithm, given (ε,δ) = (1/2, 1/e)? (Choose the value closest to the correct answer.)
- 1500
- 1000
- 500
- 3000
Answer : See Answers
5. In the median elimination method for (ε,δ)-PAC bounds, we claim that for every phase l, Pr[A ≤ B + εl] > 1 − δl. (Sl is the set of arms remaining in the l-th phase.)
Consider the following statements:
(i) A is the maximum of rewards of the true best arm in Sl, i.e. in the l-th phase
(ii) B is the maximum of rewards of the true best arm in Sl+1, i.e. in the (l+1)-th phase
(iii) B is the minimum of rewards of the true best arm in Sl+1, i.e. in the (l+1)-th phase
(iv) A is the minimum of rewards of the true best arm in Sl, i.e. in the l-th phase
(v) A is the maximum of rewards of the true best arm in Sl+1, i.e. in the (l+1)-th phase
(vi) B is the maximum of rewards of the true best arm in Sl, i.e. in the l-th phase
Which of the statements above are correct?
- i and ii
- iii and iv
- v and vi
- i and iii
Answer : See Answers
6. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?
- After each sample is drawn, the q∗ distribution for that sampled arm is updated to be closer to the true distribution.
- Thompson sampling has been shown to generally give better regret bounds than UCB.
- In Thompson sampling, we do not need to eliminate arms each round to get good sample complexity.
- The algorithm requires that we use Gaussian priors to represent distributions over q∗.
Answer : See Answers
7. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations.
Reason: The nj term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion
- Assertion is true and Reason is false
- Both Assertion and Reason are false
Answer : See Answers
8. We need 100 samples to get an (ε,δ)-PAC arm using the naive (ε,δ)-PAC algorithm in a 10-arm bandit problem with certain values of ε and δ. Now ε is halved, keeping δ unchanged. How many samples would be needed to re-run the naive (ε,δ)-PAC algorithm?
- 400
- 800
- 1600
- 100
Answer : See Answers
9. Which of the following is true about the Median Elimination algorithm?
- It is a regret minimizing algorithm.
- The probability of the εl-optimal arms of a round being eliminated is less than δl for the round.
- It is guaranteed to provide an ε-optimal arm at the end.
- Replacing ε with ε/2 doubles the sample complexity.
Answer : See Answers
10. Suppose we are facing a non-stationary bandit problem. We want to use posterior sampling for picking the correct arm. What is the likely change that needs to be done to the algorithm so that it can adapt to non-stationarity?
- Update the posterior rarely.
- Randomly shift the posterior drastically from time to time.
- Keep adding a slight noise to the posterior to prevent its variance from going down quickly.
- No change is required.
Answer : See Answers
NPTEL Reinforcement Learning Week 2 Assignment Answers 2024
1. Which of the following is true of the UCB algorithm?
a. The action with the highest Q value is chosen at every iteration.
b. After a very large number of iterations, the confidence intervals of unselected actions will not change much.
c. The true expected-value of an action always lies within its estimated confidence interval.
d. With a small probability ε, we select a random action to ensure adequate exploration of the action space.
✅ Correct Answer: b
Explanation:
- Option a is false: UCB picks the arm with the highest Q value plus exploration bonus, not just the highest Q value.
- Option c is false: the confidence interval is only a high-probability bound, so the true expected value can occasionally fall outside it.
- Option d is false: selecting a random action with probability ε describes ε-greedy, not UCB.
- Option b is true: the bonus term sqrt(2 ln t / n(i)) of an unselected arm grows only logarithmically in t, so after a very large number of iterations it changes very little from one step to the next.
3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm… Which arm will be sampled in the next trial?
Given:
- Q100(1) = 1.73, n1 = 25
- Q100(2) = 1.83, n2 = 20
- Q100(3) = 1.89, n3 = 30
- Q100(4) = 1.55, n4 = 25
UCB formula:
UCB(i) = Q(i) + sqrt((2 * log t) / n(i))
Calculate the UCB value for each arm (t = 100, natural log) and pick the maximum:
- Arm 1: 1.73 + sqrt(2 ln 100 / 25) ≈ 1.73 + 0.61 = 2.34
- Arm 2: 1.83 + sqrt(2 ln 100 / 20) ≈ 1.83 + 0.68 = 2.51
- Arm 3: 1.89 + sqrt(2 ln 100 / 30) ≈ 1.89 + 0.55 = 2.44
- Arm 4: 1.55 + sqrt(2 ln 100 / 25) ≈ 1.55 + 0.61 = 2.16
✅ Correct Answer: b. Arm 2
Explanation:
Arm 2 combines a near-best Q estimate with the largest exploration bonus (only n = 20 samples), so its UCB value (≈ 2.51) is the highest.
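If you want to verify the arithmetic, here is a minimal sketch in plain Python using the same formula and the values from the question:

```python
import math

t = 100                                   # total pulls so far
Q = {1: 1.73, 2: 1.83, 3: 1.89, 4: 1.55}  # current value estimates
n = {1: 25, 2: 20, 3: 30, 4: 25}          # pull counts per arm

# UCB score = estimate + exploration bonus sqrt(2 ln t / n_i)
ucb = {i: Q[i] + math.sqrt(2 * math.log(t) / n[i]) for i in Q}
best_arm = max(ucb, key=ucb.get)

print(ucb)       # approx. {1: 2.34, 2: 2.51, 3: 2.44, 4: 2.16}
print(best_arm)  # 2
```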
4. We need 6 rounds of median-elimination… How many samples would naive (ε,δ)-PAC need? Given (ε, δ) = (1/2, 1/e)
Options:
- a. 1500
- b. 1000
- c. 500
- d. 3000
✅ Correct Answer: d. 3000
Explanation:
The naive (ε,δ)-PAC algorithm samples every arm equally and returns the arm with the best empirical mean. Six rounds of median elimination (which halves the arm set each round) imply k = 2^6 = 64 arms. Using the per-arm sample count ℓ = (2/ε²) · ln(2k/δ) (the exact constants vary slightly between references, but this form reproduces the intended answer):
ℓ = (2 / 0.25) · ln(2 · 64 · e) = 8 · (ln 128 + 1) ≈ 8 · 5.85 ≈ 47 samples per arm
Total ≈ 64 · 47 ≈ 3000 samples, so option d is the closest.
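A quick back-of-the-envelope check, assuming the ℓ = (2/ε²) ln(2k/δ) form above:

```python
import math

eps, delta = 0.5, 1 / math.e
k = 2 ** 6                                          # 6 median-elimination rounds -> 64 arms

per_arm = (2 / eps ** 2) * math.log(2 * k / delta)  # samples per arm under the naive algorithm
total = k * per_arm

print(round(per_arm), round(total))                 # ~47 per arm, ~2996 total -> closest to 3000
```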
5. In median elimination, which are correct?
You are told:
- A = max of best arm in Sl
- B = max/min of best in Sl+1
- Required: Pr[A ≤ B + εl] > 1 – δl
✅ Correct Answer: a. (i) and (ii)
Explanation:
- (i) is correct: A is the expected reward of the true best arm among the arms still in play in phase l, i.e. in Sl.
- (ii) is correct: B is the same quantity for the arms that survive into Sl+1.
The claim then says that, with probability greater than 1 − δl, the best arm carried into the next phase is within εl of the best arm of the current phase; chaining this guarantee over the phases is what gives median elimination its overall (ε,δ)-PAC bound.
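Written out in symbols (q∗(a) denoting the true expected reward of arm a, notation otherwise as in the question), the per-phase claim is:

```latex
% Per-phase guarantee of median elimination: the best arm surviving into phase l+1
% is within eps_l of the best arm of phase l, except with probability at most delta_l.
\Pr\!\Big[\max_{a \in S_l} q^*(a) \;\le\; \max_{a' \in S_{l+1}} q^*(a') + \varepsilon_l\Big] \;>\; 1 - \delta_l
```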
6. Which of the following is NOT true about Thompson Sampling?
a. After each sample is drawn, q* distribution is updated to be closer
b. Thompson sampling has better regret than UCB
c. We don’t eliminate arms to get good sample complexity
d. Gaussian priors must be used
✅ Correct Answer: d
Explanation:
Thompson Sampling does not require Gaussian priors; the prior can be whatever distribution matches the reward model (for example, a Beta prior for Bernoulli rewards). So d is NOT true.
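As an illustration, here is a minimal Beta-Bernoulli Thompson sampling loop (the arm probabilities below are made up purely for the example); no Gaussian prior appears anywhere:

```python
import random

true_p = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arm means (unknown to the agent)
alpha = [1.0, 1.0, 1.0]    # Beta(1, 1) = uniform prior for each arm
beta = [1.0, 1.0, 1.0]

for t in range(1000):
    # draw one plausible mean per arm from its posterior and play the argmax
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < true_p[arm] else 0
    # conjugate Beta update: successes go to alpha, failures to beta
    alpha[arm] += reward
    beta[arm] += 1 - reward

print([a / (a + b) for a, b in zip(alpha, beta)])  # posterior means; should favour the last arm
```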
7. Assertion: The confidence bound of each arm in UCB cannot increase…
- Assertion is FALSE: UCB bounds can increase for unselected arms (log t term grows).
- Reason is also FALSE: It incorrectly claims bounds remain same for unselected arms.
✅ Correct Answer: d. Both Assertion and Reason are false
Explanation:
UCB values of unselected arms do increase due to the log(t) term growing, even if n(i) stays the same.
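A tiny numeric check of this, using the same bonus formula as in Q3 with the pull count held fixed:

```python
import math

n = 20  # pull count of an arm that is not being selected
for t in (100, 1_000, 10_000):
    bonus = math.sqrt(2 * math.log(t) / n)
    print(t, round(bonus, 3))   # 0.679, 0.831, 0.96 -> the bound keeps growing with t
```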
8. We need 100 samples for getting an (ε,δ)-PAC arm… now ε is halved
How does this affect sample complexity?
Options:
- a. 400
- b. 800
- c. 1600
- d. 100
✅ Correct Answer: a. 400
Explanation:
PAC sample complexity is proportional to 1/ε². If ε becomes ε/2 → sample size increases 4x.
So from 100 → 400 samples.
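The same scaling written out (the helper name is just for illustration):

```python
def scaled_samples(base_samples: int, eps_factor: float) -> int:
    # sample complexity scales as 1/eps^2, so shrinking eps by a factor f multiplies samples by 1/f^2
    return int(base_samples / eps_factor ** 2)

print(scaled_samples(100, 0.5))  # 400
```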
10. Non-stationary bandit + posterior sampling: what to do?
a. Update posterior rarely
b. Randomly shift posterior drastically
c. Add slight noise to prevent variance from shrinking
d. No change is required
✅ Correct Answer: c. Keep adding slight noise
Explanation:
In non-stationary settings, to keep adapting, we want the algorithm to stay “uncertain” — so we inject noise or use discounted updates to prevent overconfidence in outdated posteriors.
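One simple way to realize option c, sketched for a single arm with a Gaussian posterior over its mean; the Gaussian reward model and the inflation constant are illustrative assumptions, not something fixed by the course:

```python
import random


class NoisyGaussianPosterior:
    """Gaussian posterior over one arm's mean reward, with a little variance
    added back after every update so the posterior never collapses (option c)."""

    def __init__(self, mu=0.0, var=1.0, obs_var=1.0, inflation=0.01):
        self.mu, self.var = mu, var
        self.obs_var = obs_var        # assumed reward-noise variance
        self.inflation = inflation    # extra variance injected each step

    def update(self, reward):
        # standard conjugate Gaussian update with known observation variance ...
        post_var = 1.0 / (1.0 / self.var + 1.0 / self.obs_var)
        self.mu = post_var * (self.mu / self.var + reward / self.obs_var)
        self.var = post_var
        # ... then keep a floor on uncertainty so stale evidence can be overturned
        self.var += self.inflation

    def sample(self):
        # Thompson sampling draws one plausible mean from the posterior
        return random.gauss(self.mu, self.var ** 0.5)
```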


