Finding problems using unsupervised image categorization
Analysis
When an extended set of parameters is encountered, as in the example presented in this article, the most obvious means of reducing this feature space is to perform PCA (see the "PCA" sidebar). Listing 5 separates the data of each detector type to perform PCA on each group independently.
Listing 5
Independent DataFrames for Detector Types
# Map filename extension to detector type
dectorType = {
    "Specular": "000",
    "Phase": "001",
    "Scatter": "002"
}

# Load BevelImage DataFrames
import pandas as pd
imgFeatures = pd.read_pickle('image_features.pkl')
wafStepIMGList_df = pd.read_pickle('waf_Step_IMG_List.pkl')
wafStepList_df = pd.read_pickle('waf_Step_List.pkl')

# Select the features for a single detector type
# Specular detector
imgFeaturesSpecular = imgFeatures.filter(
    items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == dectorType["Specular"]]['IMAGE_ID'],
    axis=0)

# Phase detector
imgFeaturesPhase = imgFeatures.filter(
    items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == dectorType["Phase"]]['IMAGE_ID'],
    axis=0)

# Scatter detector
imgFeaturesScatter = imgFeatures.filter(
    items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == dectorType["Scatter"]]['IMAGE_ID'],
    axis=0)
Standardizing the values within each feature column prevents individual features from biasing the result simply because they contain larger values (see Listing 6). The StandardScaler class that performs this step is part of scikit-learn and is imported from the sklearn package [9].
Listing 6
Standardizing Values
from sklearn.preprocessing import StandardScaler

stdimgFeaturesSpecular = pd.DataFrame(
    StandardScaler().fit_transform(imgFeaturesSpecular),
    index=imgFeaturesSpecular.index,
    columns=imgFeaturesSpecular.columns)

stdimgFeaturesPhase = pd.DataFrame(
    StandardScaler().fit_transform(imgFeaturesPhase),
    index=imgFeaturesPhase.index,
    columns=imgFeaturesPhase.columns)

stdimgFeaturesScatter = pd.DataFrame(
    StandardScaler().fit_transform(imgFeaturesScatter),
    index=imgFeaturesScatter.index,
    columns=imgFeaturesScatter.columns)
Once the data is in a fit state to perform PCA, the transformation can be performed via another sklearn sub-library [10] (see Listing 7). The decomposition into two components allows you to plot the results conveniently.
Listing 7
Extracting PCA Components into Two Axes
# Import PCA from the sklearn library
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

# Transform the given data set using PCA
principalDfSpecular = pd.DataFrame(
    data=pca.fit_transform(stdimgFeaturesSpecular),
    index=stdimgFeaturesSpecular.index,
    columns=['PC1', 'PC2'])

principalDfPhase = pd.DataFrame(
    data=pca.fit_transform(stdimgFeaturesPhase),
    index=stdimgFeaturesPhase.index,
    columns=['PC1', 'PC2'])

principalDfScatter = pd.DataFrame(
    data=pca.fit_transform(stdimgFeaturesScatter),
    index=stdimgFeaturesScatter.index,
    columns=['PC1', 'PC2'])
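Two components are convenient for plotting, but it is worth checking how much of the total variance they retain. The following is a minimal sketch on synthetic data: the random feature table is a stand-in for one of the standardized feature DataFrames and is not the article's data. The explained_variance_ratio_ attribute is a standard attribute of a fitted scikit-learn PCA object.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table (assumption: 200 images x 6 features)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
# Six correlated features built from two latent factors plus small noise
features = pd.DataFrame(
    latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6)),
    columns=[f"f{i}" for i in range(6)],
)

std_features = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
components = pca.fit_transform(std_features)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print("Explained variance per component:", ratios)
print("Total retained:", ratios.sum())
```

If the total retained is low, structure visible in the two-component plot may represent only a small part of the variation in the features.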
In order to simplify subsequent data handling, a label column is added to each DataFrame to identify the detector type (specular, phase, or scatter), and all DataFrames are subsequently concatenated together (see Listing 8).
Listing 8
Adding Column and Combining DataFrames
principalDfSpecular['Detector'] = 'Specular'
principalDfPhase['Detector'] = 'Phase'
principalDfScatter['Detector'] = 'Scatter'

principalDf = pd.concat([principalDfSpecular, principalDfPhase, principalDfScatter])
The jointplot function available from the Seaborn library provides a succinct way to visualize any structure that may have been revealed by PCA (see Listing 9).
Listing 9
PCA Data for Each Detector Type
import seaborn as sns
from matplotlib import pyplot as plt

plt.figure(figsize=(8,5))
sns.jointplot(data=principalDf[principalDf['Detector'] == 'Scatter'],
              x='PC1', y='PC2', kind="hex",
              joint_kws=dict(bins='log'))
The PCA density plots in Figure 2 show these results: Not only has the structure been revealed, but the structure is also distinctly different for each detector type.
To quantify this structure, some form of clustering needs to be applied to the PCA-transformed data. A natural choice is K-means clustering. However, in this case, the results are generally unsatisfactory at separating meaningful structures. For example, in Figure 3, while the grey cluster and the red random points are reasonably well separated, the other group of data points is crudely segmented, without regard for its internal structure. This is not always the case, and other examples exist where K-means clustering is an appropriate option [11].
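For comparison, a K-means run on two-component data can be sketched as follows. The blob data and the choice of three clusters are assumptions made for illustration only. K-means needs the number of clusters up front and assigns every point to a cluster, so it has no way of flagging isolated points as noise.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for PC1/PC2 values: three well-separated blobs (assumption)
X_demo = np.vstack([
    rng.normal(loc=(-3, 0), scale=0.4, size=(100, 2)),
    rng.normal(loc=(3, 0), scale=0.4, size=(100, 2)),
    rng.normal(loc=(0, 4), scale=0.4, size=(100, 2)),
])

# The number of clusters must be chosen in advance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
km_labels = kmeans.labels_

# Every point receives a label in 0..n_clusters-1; there is no noise label
print("Points per cluster:", np.bincount(km_labels))
```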
Applying a DBSCAN clustering algorithm [12] produced much more promising results. Listing 10 applies this clustering algorithm to the two PCA components generated in Listing 7, using a neighborhood radius (eps) of 0.35 and a minimum number of samples (22 in this example) that must fall within that radius for a point to seed a cluster. The results of the DBSCAN clustering algorithm can be visualized with the help of the Matplotlib library, as shown in Listing 11. Figure 4 shows the results after the PCA data has been clustered using DBSCAN.
Listing 10
Applying the DBSCAN Algorithm
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# df is assumed to hold the PCA results for a single detector type,
# for example principalDf[principalDf['Detector'] == 'Scatter']
# Extract PCA values into an array to perform clustering
X = df.iloc[:, [0, 1]].values

# Compute DBSCAN
db = DBSCAN(eps=0.35, min_samples=22).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Number of clusters: %d" % n_clusters_)
print("Number of random points: %d" % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
Listing 11
Plotting DBSCAN Clustering
import numpy as np
import matplotlib.pyplot as plt

unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black is used for unclustered (noise) points
        col = [0, 0, 0, 1]
    class_member_mask = labels == k

    # Core samples: large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], "o", markerfacecolor=tuple(col),
             markeredgecolor="k", markersize=6)

    # Non-core samples: small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], "o", markerfacecolor=tuple(col),
             markeredgecolor="k", markersize=3)

plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.show()
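The eps value of 0.35 in Listing 10 is specific to this dataset. A common heuristic for picking eps (not necessarily how the value above was chosen) is to compute each point's distance to its k-th nearest neighbor, with k equal to min_samples, sort those distances, and look for the knee in the resulting curve. The sketch below uses synthetic data as a stand-in for the PCA array X.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
# Synthetic stand-in for X: one dense blob plus sparse background points
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(300, 2)),
    rng.uniform(low=-4, high=4, size=(30, 2)),
])

k = 22  # same value as min_samples in Listing 10

# When the training data is queried, each point is returned as its own
# nearest neighbor (distance 0), which matches DBSCAN counting the point itself
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the k-th neighbor; the knee of this curve suggests eps
k_dist = np.sort(distances[:, -1])
print("Median, 90th, 99th percentile:", np.percentile(k_dist, [50, 90, 99]))
```

Plotting k_dist (for example with plt.plot) usually shows a long flat section for points inside dense regions followed by a sharp rise for isolated points; an eps near the start of the rise is a reasonable first guess.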
To visualize the differences between bevel images that were assigned different cluster IDs, the IDs must be merged with the DataFrame containing the images' and wafers' metadata from which the IDs were obtained (see Listing 12).
Listing 12
Merging Cluster IDs
# Combine the PCA values and the cluster labels (one row per image)
Xlabels = pd.DataFrame(np.append(X, labels.reshape(-1, 1), axis=1))
Xlabels.columns = ['PC1', 'PC2', 'Cluster']

# Copy IMAGE_ID from the index to a column
df['IND'] = df.index

# Merge cluster IDs with the IMAGE_IDs, using the PCA values
df = pd.merge(df, Xlabels, on=["PC1", "PC2"])

# Reset index
df.set_index('IND', drop=True, inplace=True)
df.index.name = None

# Merge the clustered PCA DataFrame with the image and wafer metadata
df['IMAGE_ID'] = df.index
dfPLUS = df.merge(wafStepIMGList_df, on='IMAGE_ID')
dfPLUS = dfPLUS.merge(wafStepList_df, on=['WAF_ID', 'STEP_ID'])
# regex=True is needed in pandas 2.0 and later, where str.replace
# treats the pattern as a literal string by default
dfPLUS['LOT_GRP'] = dfPLUS['WAF'].str.replace('[^A-Z]+', '', regex=True)

# Pickle individual DataFrame with clustered data to the filesystem
#dfPLUS.to_pickle('clusteredSpecularImgPLUS.pkl')
#dfPLUS.to_pickle('clusteredPhaseImgPLUS.pkl')
dfPLUS.to_pickle('clusteredScatterImgPLUS.pkl')
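Merging on the floating-point PC1 and PC2 values, as in Listing 12, relies on those values being unique. Because DBSCAN returns its labels in the same row order as the input array, the labels can also be attached by position. The following is a minimal sketch under that assumption; the small df_demo table and its IMAGE_ID values are made up and stand in for df.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Made-up stand-in for the PCA DataFrame, indexed by IMAGE_ID (assumption)
df_demo = pd.DataFrame(
    {"PC1": [0.0, 0.1, 0.05, 5.0, 5.1, 5.05, 20.0],
     "PC2": [0.0, 0.1, 0.05, 5.0, 5.1, 5.05, 20.0]},
    index=[101, 102, 103, 104, 105, 106, 107],
)

X_demo = df_demo[["PC1", "PC2"]].to_numpy()
labels_demo = DBSCAN(eps=0.5, min_samples=3).fit(X_demo).labels_

# labels_ holds one entry per row of X_demo, in the same order,
# so it can be assigned directly without a merge
df_demo["Cluster"] = labels_demo
print(df_demo)
```

With this approach the index is untouched and the Cluster column stays an integer, whereas the np.append route in Listing 12 produces floating-point IDs such as -1.0.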
I suspected much of the PCA chart structure was driven by wafer product differences. Listing 13 uses the Seaborn and Matplotlib libraries to generate a faceted chart to illustrate these differences (see Figure 5). Figure 5 shows a clear product type dependence, which can be seen in the overlap in the parameter space between the product types identified by A, B, and C, as well as those identified by X, Y, and Z.
Listing 13
Faceted Plot of Cluster IDs by Product Type
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks", color_codes=True)
g = sns.FacetGrid(dfPLUS, col="PROD", col_wrap=4, hue="Cluster")
g = g.map(plt.scatter, "PC1", "PC2", edgecolor="w")
plt.show()
Image Display
Because I did not start with a classification, I have nothing to compare the cluster IDs with, so I cannot construct a confusion matrix or calculate the precision or recall for this method. I can, however, visually compare the images within different cluster IDs for specific product types. To achieve this, you must identify the images within each category (the distinct clusters and the unclustered images) with the greatest distance from the center of the clusters. This first requires that the centroids be identified for each product type and detector type. Then the distance is calculated from the appropriate centroid to each individual datapoint (see Listing 14).
Listing 14
Product-Detector Cluster Centers
import numpy as np

dfPLUS['PROD_Detect'] = dfPLUS['PROD'] + '_' + dfPLUS['Detector']

# Centroid of the clustered points (noise excluded) for each product-detector pair
clusterCenters = dfPLUS[dfPLUS['Cluster'] > -1][['PROD_Detect', 'Cluster', 'PC1', 'PC2']].groupby(['PROD_Detect']).mean()
clusterCenters.index.name = None

# Offset of each data point from its centroid along each axis
# (np.round replaces np.round_, which was removed in NumPy 2.0)
dfPLUS['Delta1'] = \
    dfPLUS.apply(lambda x: np.round(x.PC1 - clusterCenters.loc[x.PROD_Detect].PC1, 6), axis=1)
dfPLUS['Delta2'] = \
    dfPLUS.apply(lambda x: np.round(x.PC2 - clusterCenters.loc[x.PROD_Detect].PC2, 6), axis=1)

# Euclidean distance from the centroid
dfPLUS['Delta'] = np.sqrt(dfPLUS['Delta1']**2 + dfPLUS['Delta2']**2)
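Row-wise apply calls such as those in Listing 14 can become slow on large tables. One possible vectorized alternative, sketched here on made-up data with the same column names, looks up each row's centroid with Series.map and computes the distance in a single step. The rounding to six decimals is omitted in this sketch.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for dfPLUS (assumption: same column names as the article)
demo = pd.DataFrame({
    "PROD_Detect": ["A_Scatter"] * 4 + ["B_Scatter"] * 3,
    "Cluster": [0, 0, -1, 1, 0, 0, -1],
    "PC1": [0.0, 2.0, 10.0, 4.0, 1.0, 3.0, -5.0],
    "PC2": [0.0, 2.0, 10.0, 4.0, 1.0, 1.0, 1.0],
})

# Centroids from clustered points only (noise points with ID -1 excluded)
centers = demo[demo["Cluster"] > -1].groupby("PROD_Detect")[["PC1", "PC2"]].mean()

# Look up the centroid coordinates for every row, including noise rows
cx = demo["PROD_Detect"].map(centers["PC1"])
cy = demo["PROD_Detect"].map(centers["PC2"])

# Euclidean distance from the row's product-detector centroid
demo["Delta"] = np.sqrt((demo["PC1"] - cx) ** 2 + (demo["PC2"] - cy) ** 2)
print(demo)
```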
Once the distances from the centers are known, the most distant images can be identified for each product and detector type using the findExtremeImgs function in Listing 15. This function returns two DataFrames: one for the abnormal or unclustered images and one for the normal or clustered images. The number of images within each category is specified by the mcols value, which is set to a value of 5 in my example.
Listing 15
DataFrames for Normal and Abnormal Images
def findExtremeImgs(detectorType, prodName):
    mcols = 5

    # Unclustered images (cluster ID below 0), most distant first
    abnormalImg = dfPLUS[(dfPLUS['Detector'] == detectorType) &
                         (dfPLUS['PROD'] == prodName) &
                         (dfPLUS['Cluster'] < 0)]\
        [['PROD_Detect', 'Cluster', 'PC1', 'PC2', 'Delta', 'FILENAME']].\
        sort_values(['Delta'], ascending=False).\
        groupby(['PROD_Detect', 'Cluster']).\
        head(mcols)

    # Clustered images (cluster ID 0 or above), most distant first
    normalImg = dfPLUS[(dfPLUS['Detector'] == detectorType) &
                       (dfPLUS['PROD'] == prodName) &
                       (dfPLUS['Cluster'] >= 0)]\
        [['PROD_Detect', 'Cluster', 'PC1', 'PC2', 'Delta', 'FILENAME']].\
        sort_values(['Delta'], ascending=False).\
        groupby(['PROD_Detect', 'Cluster']).\
        head(mcols)

    return abnormalImg, normalImg
The displayExtremeImgs function in Listing 16 will accept the lists of extreme images and display them in a grid format.
Listing 16
Grid Display of Normal and Abnormal Images
# web_config and get_URL_img are assumed to be defined elsewhere
def displayExtremeImgs(detectorType, thisProd, abnormalImg, normalImg):
    thisCols = abnormalImg['Cluster'].size
    clusterIDlist = pd.DataFrame(np.concatenate((np.unique(abnormalImg['Cluster']),
                                                 np.unique(normalImg['Cluster']))))
    thisRows = clusterIDlist.size
    row_labels = ['CLUSTER_ID {}'.format(row) for row in iter(clusterIDlist.loc[:,0])]

    figure, axes = plt.subplots(nrows=thisRows, ncols=thisCols,
                                figsize=(15, (thisRows*1.75)))

    # Label each row with its cluster ID above the third column
    for ax, row in zip(axes[:,2], row_labels):
        ax.set_title(row, fontweight="bold")
        ax.set_xticks([])
        ax.set_yticks([])
    figure.set_facecolor('w')

    # First row: unclustered (abnormal) images
    col = 0
    for i, ind in enumerate(abnormalImg.index):
        imgURI = web_config.imgServerRoot + abnormalImg.loc[ind]['FILENAME']
        img = get_URL_img(imgURI)
        axes[0, col].imshow(img)
        col = col+1

    # Remaining rows: clustered (normal) images
    for i, ind in enumerate(normalImg.index):
        imgURI = web_config.imgServerRoot + normalImg.loc[ind]['FILENAME']
        img = get_URL_img(imgURI)
        # np.intp replaces the np.int0 alias, which was removed in NumPy 2.0
        thisCluVal = np.intp(normalImg.loc[ind]['Cluster'])
        thisClu = np.intp(np.where(clusterIDlist == thisCluVal))
        axes[1+(i // thisCols), (i % thisCols)].imshow(img)

    plt.setp(plt.gcf().get_axes(), xticks=[], yticks=[])
    plt.show()
Combining these two functions allows the extreme images within each category to be viewed simultaneously. In all cases, the first row of images displays those images that were not clustered (cluster ID -1.0) and can be considered abnormal. Subsequent rows show the most extreme images for all additional clusters that were observed in the dataset.
In examining the image galleries in Figures 6-8, you can see that it is indeed possible to separate abnormal bevel images from those without any distinguishing features. Perhaps unusually, the abnormal scatter images show a thinner ridge at the wafer edge (see Figure 7). In the phase images shown in Figure 8, the unclustered or abnormal images appear to be a sub-category of the cluster ID 0.0 images, but ones where the defect feature intersects with the image boundary. In this case, cluster ID 0.0 and cluster ID -1.0 should both be considered abnormal.
Although you can clearly see differences between the image clusters, it is possible that every silicon wafer has the same distribution of cluster IDs, in which case the cluster IDs would provide no ability to differentiate between wafers. To investigate this, stacked bar charts were plotted for each product-detector combination, where counts were grouped by wafer, using the command in Listing 17 (see Figure 9).
Listing 17
Cluster ID Stacked Bar Chart
# The sort_columns argument was removed from DataFrame.plot in pandas 2.0;
# groupby already orders the bars by WAF and INSP_TIME
dfPLUS[(dfPLUS['Detector'] == thisDetector) &
       (dfPLUS['PROD'] == thisProd)][['Cluster', 'WAF', 'INSP_TIME']].\
    groupby(['WAF', 'INSP_TIME', 'Cluster']).\
    size().unstack().\
    plot(kind='bar', stacked=True)
The specular images in Figure 9 show little variation between wafers. The phase images show more interesting behavior with a third cluster ID appearing on a selection of wafers. And perhaps unsurprisingly, the scatter images show an elevated level of scatter.
Differences between products (wafers) can be observed; what remains to be seen is whether these differences are useful indicators of yield or reliability. Additionally, the significance of the differences could be determined by using Fisher's exact test [13]. The possibility of individual cluster IDs approaching zero observations makes the application of a Chi-square test [14] inappropriate in this case.
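As an illustration of how such a test could be run, the sketch below applies SciPy's fisher_exact function to a 2x2 contingency table. The counts are hypothetical and are not taken from the article's data; the table compares two wafers by the number of images in cluster 0 versus all other cluster IDs.

```python
from scipy.stats import fisher_exact

# Hypothetical counts (not from the article):
# rows are two wafers, columns are images in cluster 0 vs. any other cluster ID
table = [[48, 2],
         [35, 15]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest that the two wafers differ in their cluster ID distribution by more than chance alone would explain. Comparing more than two wafers or more than two cluster categories at once requires a test that handles larger tables.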
Conclusion
It is certainly possible to segregate abnormal bevel images from those that can be considered normal by means of unsupervised machine learning. This approach is not necessarily appropriate for all image classification applications, but this does not mean that you should overlook it. As always, you should aim to employ the simplest solution where possible. In certain cases, the extraction of texture and form from images can provide a simple solution in conjunction with PCA and DBSCAN.
Unsupervised machine learning bypasses the train, validate, and test cycle. However, the end result in this unsupervised machine learning approach is not a model that can be applied to new images to generate a cluster ID. Instead, the feature extraction, standardization, and PCA generate a set of coordinates that can be compared with the parameter space obtained from the historical data by means of the k-nearest neighbors algorithm [15]. This process provides a cluster ID for each new image.
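The k-nearest neighbors step can be sketched with scikit-learn's KNeighborsClassifier. The historical coordinates, their cluster IDs, the coordinates of the new images, and the choice of five neighbors are all synthetic assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Synthetic stand-in for historical PC1/PC2 coordinates and their cluster IDs
hist_xy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])
hist_cluster = np.array([0] * 50 + [1] * 50)

# Fit on the historical parameter space
knn = KNeighborsClassifier(n_neighbors=5).fit(hist_xy, hist_cluster)

# Coordinates of new images after feature extraction, standardization, and PCA
new_xy = np.array([[0.1, -0.2], [4.8, 5.3]])
new_cluster = knn.predict(new_xy)
print(new_cluster)
```

For this to work, new images would need to pass through the scaler and PCA objects that were fitted on the historical data, using transform rather than fit_transform. Listings 6 and 7 as written do not keep the fitted scalers, so those objects would have to be stored first.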
Infos
[1] Smyth, Padhraic, et al. "Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth." In: Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press, Menlo Park, CA, 1995), pp. 109-120
[2] cx_Oracle: https://cx-oracle.readthedocs.io
[3] pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
[4] PIL: https://python-pillow.org/
[5] cv2: https://pypi.org/project/opencv-python/
[6] mahotas: https://pypi.org/project/mahotas/
[7] NumPy: https://numpy.org/doc/stable/user/quickstart.html
[8] Merging methods: https://pandas.pydata.org/docs/user_guide/merging.html
[9] sklearn StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
[10] sklearn PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[11] "Indoor Navigation with Machine Learning" by Roland Pleger, Linux Magazine, issue 255, February 2022, https://www.linux-magazine.com/index.php/Issues/2022/255/Machine-Learning
[12] sklearn DBSCAN: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
[13] Fisher's exact test: https://mathworld.wolfram.com/FishersExactTest.html
[14] Chi-square test: https://www.jmp.com/en_be/statistics-knowledge-portal/chi-square-test.html
[15] k-nearest neighbor algorithm: https://www.ibm.com/topics/knn