NPTEL Data Analytics with Python Week 11 Assignment Answers 2024
Q1. Which library is used for calculating distance measures in clustering using Python?
a. distance_matrix
b. scipy.spatial
c. scipy_spatial
d. distance.matrix
Answer: b
Explanation: scipy.spatial is a module in SciPy that provides functions for spatial computations, including distance metrics such as Euclidean and Minkowski, which are commonly used in clustering.
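As a minimal sketch of what the module offers, the snippet below computes a few distances with scipy.spatial. The points and variable names are made up for illustration and are not part of the assignment.

```python
import numpy as np
from scipy.spatial import distance, distance_matrix

# Two hypothetical 2-D points (illustrative values only)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclid = distance.euclidean(a, b)            # straight-line distance: 5.0
manhattan = distance.cityblock(a, b)         # sum of absolute differences: 7.0
minkowski3 = distance.minkowski(a, b, p=3)   # Minkowski distance of order 3

# Pairwise Euclidean distances between all rows of a small dataset
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
pairwise = distance_matrix(points, points)   # 3 x 3 symmetric matrix

print(euclid, manhattan, minkowski3)
print(pairwise)
```

Note that distance_matrix, which appears as option a, is a function inside scipy.spatial rather than a library of its own.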
Q2. The formula for computing dissimilarity between two objects described by categorical variables is:
Here p is the total number of categorical variables and m denotes the number of matches.
a. D(i, j) = (p – m) / p
b. D(i, j) = (p – m) / m
c. D(i, j) = (m – p) / p
d. D(i, j) = (m – p) / m
Answer: a
Explanation: The correct formula for dissimilarity in categorical variables is D(i, j) = (p – m) / p, where p is the total number of variables and m is the number of matching variables between objects i and j.
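The formula can be checked with a short sketch. The function name categorical_dissimilarity and the two example objects are hypothetical and chosen only to illustrate D(i, j) = (p - m) / p.

```python
def categorical_dissimilarity(obj_i, obj_j):
    """Return (p - m) / p for two equal-length sequences of categorical values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must be described by the same variables")
    p = len(obj_i)                                       # total number of variables
    m = sum(1 for x, y in zip(obj_i, obj_j) if x == y)   # number of matches
    return (p - m) / p


# Hypothetical objects described by four categorical variables
obj_i = ["red", "small", "round", "yes"]
obj_j = ["red", "large", "round", "no"]

d_ij = categorical_dissimilarity(obj_i, obj_j)   # p = 4, m = 2, so D = 0.5
print(d_ij)
```

Identical objects give a dissimilarity of 0, and objects that match on no variable give 1.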
Q3. For a dataset with 7 objects and an interval-scaled variable f = (1, 2, 3, 4, 5, 8, 50), containing an outlier, which is true?
a. Std deviation (std_f) and mean absolute deviation (s_f) are equally affected
b. Mean absolute deviation (s_f) is more affected by the outlier
c. Std deviation (std_f) is less affected by the outlier
d. Std deviation (std_f) is more affected by the outlier
Answer: d
Explanation: Standard deviation squares the deviations, so it is more sensitive to outliers compared to mean absolute deviation which uses absolute values.
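One way to see this numerically is to compute both measures with and without the outlier and compare how much each grows. This is a sketch that assumes the population form of the standard deviation (dividing by n); the helper names are illustrative.

```python
import numpy as np


def std_dev(x):
    """Population standard deviation: root of the mean squared deviation."""
    return float(np.sqrt(np.mean((x - x.mean()) ** 2)))


def mean_abs_dev(x):
    """Mean absolute deviation around the mean."""
    return float(np.mean(np.abs(x - x.mean())))


f = np.array([1, 2, 3, 4, 5, 8, 50], dtype=float)   # data from the question
f_no_outlier = f[:-1]                               # same data without the value 50

# Factor by which each measure grows once the outlier is included
std_ratio = std_dev(f) / std_dev(f_no_outlier)
mad_ratio = mean_abs_dev(f) / mean_abs_dev(f_no_outlier)

print(f"std_f grows by a factor of {std_ratio:.2f}")
print(f"s_f grows by a factor of {mad_ratio:.2f}")
```

For this data the standard deviation grows by a factor of roughly 7.2 and the mean absolute deviation by roughly 6.2, which is consistent with option d.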
Q4. Select the correct statement about the standardization in clustering.
a. Standardizing the data always gives inefficient result while making clusters
b. Standardizing the data is always beneficial during clustering analysis
c. The variables having an absolute value may not be efficient after standardization during clustering
d. Outliers cannot be detected by standardized data
Answer: c
Explanation: Standardization might distort variables with meaningful absolute values, such as binary or categorical encodings, making clustering less effective.
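A small z-score sketch shows the effect on a binary variable. The data (an income-like column and a 0/1 flag) is invented for illustration, and the population standard deviation is assumed.

```python
import numpy as np

# Hypothetical data: column 0 is an income-like amount, column 1 is a 0/1 flag
X = np.array([
    [20000.0, 0.0],
    [35000.0, 1.0],
    [50000.0, 0.0],
    [120000.0, 1.0],
])

# Z-score standardization: subtract the column mean, divide by the column std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z)
```

After standardization both columns have mean 0 and unit standard deviation, and the flag no longer takes the values 0 and 1, so its original absolute meaning is lost in the transformed space.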
Q5. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations.
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
a. 1, 3 and 4
b. 1, 2, 3 and 4
c. 2 and 3
d. None of these
Answer: b
Explanation: All four are valid stopping criteria in K-Means clustering: a set number of iterations, no change in assignments, stable centroids, or RSS falling below a defined threshold.
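The four conditions can be sketched in a small NumPy implementation. This is an illustrative version, not a reference implementation: the function name kmeans, its parameters, the random initialization, and the toy data are all assumptions.

```python
import numpy as np


def kmeans(X, k, max_iter=100, rss_threshold=None, seed=0):
    """Naive K-Means that reports which termination condition fired."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    rss = float("inf")

    for _ in range(max_iter):                      # condition 1: iteration limit
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Recompute centroids; an empty cluster keeps its previous centroid
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[new_labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Residual sum of squares for the current assignment
        rss = float(((X - new_centroids[new_labels]) ** 2).sum())

        same_labels = labels is not None and np.array_equal(new_labels, labels)
        same_centroids = np.allclose(new_centroids, centroids)
        labels, centroids = new_labels, new_centroids

        if same_labels:                            # condition 2
            return labels, centroids, rss, "assignments unchanged"
        if same_centroids:                         # condition 3
            return labels, centroids, rss, "centroids unchanged"
        if rss_threshold is not None and rss < rss_threshold:   # condition 4
            return labels, centroids, rss, "RSS below threshold"

    return labels, centroids, rss, "iteration limit reached"


# Hypothetical data: two well-separated groups of three points
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

labels, centroids, rss, reason = kmeans(X, k=2)
print(labels, reason, round(rss, 3))
```

In practice these conditions are usually combined, with the iteration limit acting as a safety net in case the others never trigger.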
Q6. In the figure, if you draw a horizontal line at y = 2, what will be the number of clusters formed?
a. 1
b. 2
c. 3
d. 4
Answer: b
Explanation: Cutting a dendrogram with a horizontal line at a given height produces one cluster for each vertical line the cut crosses. At y = 2 the line crosses two vertical lines, so 2 clusters are formed.
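The same cut can be made programmatically with SciPy's fcluster. The sketch below uses made-up one-dimensional points, not the data behind the figure in the question, and assumes single linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical points (NOT the data from the figure): two tight pairs far apart
points = np.array([[0.0], [1.0], [10.0], [11.0]])

# Build the hierarchy; each row of Z records one merge and its height
Z = linkage(points, method="single")

# Cut the dendrogram at height 2: merges above this height are undone
labels_at_2 = fcluster(Z, t=2, criterion="distance")
n_clusters = len(np.unique(labels_at_2))

print(labels_at_2, n_clusters)
```

For this toy data the pairs merge at height 1 and the two pairs join at height 9, so a cut at 2 leaves two clusters.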
Q7. Which type of clustering uses a merging approach?
a. Partitional
b. Naive Bayes
c. Hierarchical
d. None of the above
Answer: c
Explanation: Hierarchical clustering can follow a bottom-up (agglomerative) merging approach, where each data point starts in its own cluster and merges iteratively.
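The bottom-up merging can be sketched in plain Python. This naive version assumes one-dimensional values and single-link distance; the function name agglomerate and the input values are illustrative.

```python
def agglomerate(values):
    """Naive single-link agglomerative clustering on 1-D values.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[v] for v in values]     # every point starts in its own cluster
    history = []

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)

        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))

        # Replace the two clusters with their union
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)

    return history


history = agglomerate([0.0, 1.0, 10.0, 11.0])
for left, right, dist in history:
    print(left, "+", right, "at distance", dist)
```

With n starting points there are always n - 1 merges before a single cluster remains.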
Q8. True or False: Hierarchical clustering should primarily be used for exploration.
- True
- False
Answer: True
Explanation: Hierarchical clustering helps understand the nested structure of data and is often used for exploratory data analysis, especially using dendrograms.
Q9. True or False: For finding dissimilarity between clusters in hierarchical clustering, average-link is the only metric used.
- True
- False
Answer: False
Explanation: There are multiple linkage criteria such as single-link, complete-link, and average-link, used for measuring dissimilarity in hierarchical clustering.
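To illustrate how the linkage criteria differ, the sketch below runs SciPy's linkage with three methods on the same made-up points and compares the height of the final merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D points: two tight pairs, {0, 1} and {10, 11}
points = np.array([[0.0], [1.0], [10.0], [11.0]])

# Height at which the last two clusters merge under each criterion
final_merge_height = {
    method: float(linkage(points, method=method)[-1, 2])
    for method in ("single", "complete", "average")
}

print(final_merge_height)
```

For the two pairs, single-link uses the closest members (distance 9), complete-link the farthest (11), and average-link the mean over all cross-pair distances (10).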
Q10. Two variables V1 and V2 are used for clustering with k = 3. Which of the following statements are true?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line.
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line.
a. 1 only
b. 2 only
c. 1 and 2
d. None of the above
Answer: a
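Explanation: When V1 and V2 are perfectly correlated (correlation = 1), every data point lies on a single straight line, and since each centroid is the mean of points on that line, the centroids lie on it too. When the correlation is 0, the points spread across the plane, so the three centroids generally do not fall on one line.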
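The perfectly correlated case can be checked with a short sketch. The data (V2 = 2 * V1 + 1), the initial centroids, and the use of scipy.cluster.vq.kmeans2 are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical perfectly correlated variables: V2 is a linear function of V1
v1 = np.linspace(0.0, 10.0, 30)
v2 = 2.0 * v1 + 1.0
X = np.column_stack([v1, v2])

corr = np.corrcoef(v1, v2)[0, 1]    # correlation of V1 and V2

# Use three data points as initial centroids so the run is deterministic
init = X[[0, 15, 29]]
centroids, labels = kmeans2(X, init, minit="matrix")

# Each centroid is a mean of points on the line, so it must satisfy the
# same linear relation as the data
collinear = np.allclose(centroids[:, 1], 2.0 * centroids[:, 0] + 1.0)

print(round(corr, 6), collinear)
print(centroids)
```

Because averaging preserves a linear relation between coordinates, this holds for any assignment of points to clusters, not only the one K-Means converges to.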