Finding problems using unsupervised image categorization

Needle in a Haystack

Author(s): Garry Tuohy

The most tedious part of supervised machine learning is providing sufficient supervision. However, if the samples come from a restricted sample space, unsupervised learning might be fine for the task.

In any classification project, it is certainly possible to get someone to review a certain number of images and build a classification list. However, when entering a new domain, it can be difficult to identify domain knowledge experts or to develop a ground truth for classification upon which all experts can agree. This is true regardless of whether you are looking at the backside of a silicon wafer for the first time, or if you are trying to identify the presence of volcanoes in radar images from the surface of Venus [1].

Alternatively, you can bypass all these problems and kick-start a classification project with unsupervised machine learning. Unsupervised machine learning is particularly applicable to environments where the typical images are largely identical, much like the pieces of hay in the haystack that you need to ignore when looking for needles.

In this article, I examine the potential for using unsupervised machine learning in Python (version 3.8.3 64-bit) to identify image categories for a restricted image space without resorting to training neural networks. This technique follows from the long tradition within engineering of finding the simplest solution to a problem. In this particular case, the solution relies upon the ability of the functions within the OpenCV and mahotas computer vision libraries to generate parameters for the texture and form within an image.

As an example, I'll look at the images obtained during the semiconductor manufacturing process from the bevel of silicon wafers. As part of the quality control procedure for semiconductor manufacturing, a series of photo images are taken for each wafer. Ideally, the wafers are normal, so the images are identical, but occasionally, a dissimilar photo can reveal a potential manufacturing problem that can generate defects on the affected wafer. Of course, you could train a human to wade through all these photos and look for problems, which would certainly be thorough, but it would take a lot of time and would introduce the possibility of human errors, especially as tedium develops. You could also train a neural network to look for dissimilar images, but neural networks need large amounts of compute resources, not to mention the expertise required for programming and training, as well as a sufficiently broad library of examples for each classification.

A simpler solution is to check the images using unsupervised data analysis techniques. The first step is to derive digital parameters for each of the photos for easier comparison. In this case, I used the Hu Moments and Haralick texture features, which are available through the cv2 and mahotas computer vision libraries. Hu Moments is an image descriptor that characterizes the shape of an object within an image. The Haralick texture features unsurprisingly describe the texture.

I then use principal component analysis (PCA) to reduce the data's dimensionality while still preserving as much of the variance as possible. The points are then grouped using a density-based clustering algorithm to identify the main categories of images as well as abnormal images. I relied upon the Seaborn and Matplotlib libraries to generate the visualizations.

PCA

PCA, a machine learning technique used for dimensionality reduction, is sometimes referred to as feature extraction. PCA is most appropriate for datasets that have an unwieldy number of parameters but no classification, which allows PCA to be used in an unsupervised context. PCA's goal is to preserve the salient information within a dataset while generating a more manageable number of virtual parameters, each constructed as a weighted combination of the original parameters. The number of these virtual parameters is configurable, depending on the amount of variance that must be preserved, but typically two are generated because they are convenient to visualize.

The first principal component usually explains 75-80 percent of the total variance, and the second can be expected to represent a further 12-20 percent. PCA can also be described as finding new projection lines within the parameter space that maximize the variance of the data projected onto them; these lines are the principal components.

Once the clusters have been defined, it is then possible to display sample images from the boundary region of each cluster as examples of the range of images that are typical within each cluster. Defining the boundaries allows you to make a convenient comparison of the typical images in each cluster and to clearly identify how effective the method is for unsupervised image categorization.

Background

The bevel inspection tools generate three distinct image types: specular, phase, and scatter. A single image is generated for the entire wafer bevel. However, to simplify the viewing, the images are segmented into 36 equal parts, each representing 10 degrees of the wafer bevel. Figure 1 shows five examples of each of these image types.

Figure 1: Examples of specular, phase, and scatter images.

Goal

These images may contain localized defects (see the third row in Figure 1). However, these defects are not the primary focus of this work. Instead, the main concern is the general structure and texture within the images and how these may change on and between individual wafers. Artifacts of interest can clearly be seen in the images shown in the third and fifth columns in Figure 1. The importance of these artifacts requires input from a domain knowledge expert; however, you can attempt to segregate these from the less noteworthy images.

Data Collection

The identification and collection of the relevant data from a primary source often receives too little attention, even though it could be considered the most essential part of data wrangling. The code snippet in Listing 1 demonstrates a succinct way to connect to an Oracle database [2] and use a cursor to extract your username. This simple query (i.e., select user from dual) needs to be replaced by an appropriate query. (I did not include the actual queries that were used here as they are not generally applicable.) The login credentials and database identification strings were all stored separately in a config module. (You could use a password manager such as PyKeePass for storing login information.)

Listing 1

Querying Username from the Database

import cx_Oracle
import local_config  # local module holding credentials and connection strings
try:
  with cx_Oracle.connect(
      local_config.username,
      local_config.password,
      local_config.dsn,
      encoding=local_config.encoding) as conn:
    with conn.cursor() as cursor:
      # Now execute the sqlquery
      cursor.execute("select user from dual")
      print(cursor.fetchmany(20))
except cx_Oracle.DatabaseError as e:
  print("There was a problem with the YMS query ", e)

The data returned by a query can be conveniently imported into a pandas DataFrame with the following command:

my_df = pd.DataFrame(cursor.fetchall())
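
If you also want the column names for the DataFrame, the standard Python DB-API cursor.description attribute supplies them, as in this small sketch:

my_df = pd.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])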

The essential pandas library [3] provides access to many data structures and methods that greatly simplify manipulation. The DataFrame and its methods are primary among these and should be familiar to users of the R programming language.

To avoid storing all images on the local machine, a pseudo data pipeline was created to pull down each image via a URL, extract the relevant features, and then discard the image. The get_URL_img function in Listing 2 uses the Python Imaging Library (PIL) [4] to load each image into memory via its URL.

Listing 2

Loading an Image into Memory from a URL

import requests
from PIL import Image
from io import BytesIO
import local_config  # local module holding the web credentials
def get_URL_img(URL):
  # Create a Session to hold your basic auth credentials and persist your cookies
  authed_session = requests.Session()
  authed_session.auth = (local_config.WEBUSERNAME, local_config.WEBPASSWORD)
  USER_AGENT = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
  authed_session.headers.update({'User-Agent': USER_AGENT})
  # Fetch the actual data
  fetched_data = authed_session.get(URL)
  # Convert into an image
  return Image.open(BytesIO(fetched_data.content))

The functions shown in Listing 3 extract the Hu Moments (extract_hu_moments()) and Haralick texture features (extract_texture()) from a specified image. These functions can be found in the cv2 [5] (which is the import name for opencv-python) and mahotas [6] libraries, both of which require NumPy [7].

Listing 3

Feature Extraction Functions

import numpy as np
import cv2
import mahotas as mt
# Function to extract Hu Moments
def extract_hu_moments(image):
  feature = cv2.HuMoments(cv2.moments(np.array(image))).flatten()
  return feature
# Function to extract Haralick texture features
def extract_texture(image):
  # Calculate Haralick texture features for the 4 types of adjacency
  textures = mt.features.haralick(np.array(image))
  # Average over the 4 adjacency directions
  ht_mean = textures.mean(axis=0)
  return ht_mean

Listing 4 then applies the extract functions on each image in turn using various pandas methods [8] to combine all results into a single DataFrame (i.e., imgFeatures).

Listing 4

Extracting Features from All Images

import pandas as pd
imgHaralickFeatures = pd.DataFrame()
imgHuMoments = pd.DataFrame()
for idx, wafstepIMG in wafStepIMGList_df.iterrows():
  imgURI = local_config.imgServerRoot + wafstepIMG['FILENAME']
  print("This URI:", imgURI, wafstepIMG['STEP_ID'])
  img = get_URL_img(imgURI)
  if img.size == (1820, 1002):
    print("Correct image size: ",img.size)
    # Extract and store Haralick features from bevel image
    har = extract_texture(img)
    if imgHaralickFeatures.empty:
      imgHaralickFeatures = pd.DataFrame(har.reshape(-1, len(har)), [wafstepIMG['IMAGE_ID']] ).add_prefix('haralick_')
    else:
      imgHaralickFeatures = imgHaralickFeatures.append(pd.DataFrame(har.reshape(-1, len(har)), [wafstepIMG['IMAGE_ID']] ).add_prefix('haralick_'))
    # Extract and store Hu Moments from bevel image
    hum = extract_hu_moments(img)
    if imgHuMoments.empty:
      imgHuMoments = pd.DataFrame(hum.reshape(-1, len(hum)), [wafstepIMG['IMAGE_ID']] ).add_prefix('hu_')
    else:
      imgHuMoments = imgHuMoments.append(pd.DataFrame(hum.reshape(-1, len(hum)), [wafstepIMG['IMAGE_ID']] ).add_prefix('hu_'))
  else:
    print("Wrong image size: ",img.size)
imgFeatures = imgHaralickFeatures.join(imgHuMoments, how='outer')

The resulting DataFrames can be pickled (serialized) for later analysis using the following command:

my_df.to_pickle('fileName.pkl')

Analysis

When an extended set of parameters is encountered, as in the example presented in this article, the most obvious means of reducing this feature space is to perform PCA (see the "PCA" sidebar). Listing 5 separates the data of each detector type to perform PCA on each group independently.

Listing 5

Independent DataFrames for Detector Types

# Map filename extension to detector type
detectorType = {
  "Specular": "000",
  "Phase": "001",
  "Scatter": "002"
}
# Load BevelImage DataFrames
import pandas as pd
imgFeatures = pd.read_pickle('image_features.pkl')
wafStepIMGList_df = pd.read_pickle('waf_Step_IMG_List.pkl')
wafStepList_df = pd.read_pickle('waf_Step_List.pkl')
# Select the features for a single detector type
# Specular detector
imgFeaturesSpecular = imgFeatures.filter(items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == detectorType["Specular"]]['IMAGE_ID'], axis=0)
# Phase detector
imgFeaturesPhase = imgFeatures.filter(items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == detectorType["Phase"]]['IMAGE_ID'], axis=0)
# Scatter detector
imgFeaturesScatter = imgFeatures.filter(items=wafStepIMGList_df[wafStepIMGList_df['EXT'] == detectorType["Scatter"]]['IMAGE_ID'], axis=0)

Standardizing the values of each feature prevents individual features from biasing the analysis simply because they contain larger values (see Listing 6). The StandardScaler used for this step is provided by scikit-learn and imported via sklearn [9].

Listing 6

Standardizing Values

from sklearn.preprocessing import StandardScaler
stdimgFeaturesSpecular = pd.DataFrame(StandardScaler().fit_transform(imgFeaturesSpecular), index=imgFeaturesSpecular.index, columns=imgFeaturesSpecular.columns)
stdimgFeaturesPhase = pd.DataFrame(StandardScaler().fit_transform(imgFeaturesPhase), index=imgFeaturesPhase.index, columns=imgFeaturesPhase.columns)
stdimgFeaturesScatter = pd.DataFrame(StandardScaler().fit_transform(imgFeaturesScatter), index=imgFeaturesScatter.index, columns=imgFeaturesScatter.columns)

Once the data is in a fit state to perform PCA, the transformation can be performed via another sklearn sub-library [10] (see Listing 7). The decomposition into two components allows you to plot the results conveniently.

Listing 7

Extracting PCA Components into Two Axes

# import PCA from sklearn library
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# Transform each standardized feature set using PCA
principalDfSpecular = pd.DataFrame(data =pca.fit_transform(stdimgFeaturesSpecular), index=stdimgFeaturesSpecular.index, columns = ['PC1', 'PC2'])
principalDfPhase = pd.DataFrame(data =pca.fit_transform(stdimgFeaturesPhase), index=stdimgFeaturesPhase.index,  columns = ['PC1', 'PC2'])
principalDfScatter = pd.DataFrame(data =pca.fit_transform(stdimgFeaturesScatter), index=stdimgFeaturesScatter.index, columns = ['PC1', 'PC2'])
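Because the pca object in Listing 7 is refitted for each detector type, its explained_variance_ratio_ attribute reports the variance captured by the most recent fit (the scatter features in this case). A quick check of how much information the two components retain might look like this:

# Fraction of the total variance captured by PC1 and PC2 for the last fit (scatter)
print(pca.explained_variance_ratio_)
print("Total variance retained: %.1f%%" % (100 * pca.explained_variance_ratio_.sum()))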

In order to simplify subsequent data handling, a label column is added to each DataFrame to identify the detector type (specular, phase, or scatter), and all DataFrames are subsequently concatenated together (see Listing 8).

Listing 8

Adding Column and Combining DataFrames

principalDfSpecular['Detector'] = 'Specular'
principalDfPhase['Detector'] = 'Phase'
principalDfScatter['Detector'] = 'Scatter'
principalDf = pd.concat([principalDfSpecular, principalDfPhase, principalDfScatter])

The jointplot function available from the Seaborn library provides a succinct way to visualize any structure that may have been revealed by PCA (see Listing 9).

Listing 9

PCA Data for Each Detector Type

import seaborn as sns
from matplotlib import pyplot as plt
plt.figure(figsize=(8,5))
sns.jointplot(data=principalDf[principalDf['Detector'] =='Scatter'], x='PC1', y='PC2', kind="hex", joint_kws=dict(bins='log'))

The PCA density plots in Figure 2 show these results: Not only has the structure been revealed, but the structure is also distinctly different for each detector type.

Figure 2: PCA density plots are shown here for the specular, phase, and scatter images.

To quantify this structure, some form of clustering needs to be applied to the PCA-transformed data. A natural choice is K-means clustering. However, in this case, the results are generally unsatisfactory at separating meaningful structures. For example, in Figure 3, while the grey cluster and red random points are reasonably well separated, the other groups of data points are crudely segmented, without regard for their internal structure. This is not necessarily always the case, and other examples exist where K-means clustering is an appropriate option [11].

Figure 3: K-means clustering has been applied to the PCA feature space.
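
For comparison, a K-means run on the same PCA coordinates can be sketched as follows; the cluster count of four is purely an assumption for illustration, and scikit-learn's KMeans class is used:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Cluster the two PCA components for one detector type (scatter shown here)
X = principalDfScatter[['PC1', 'PC2']].values
# n_clusters=4 is an illustrative choice, not the value used for Figure 3
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='Spectral', s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='black', marker='x')
plt.show()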

Applying a DBSCAN clustering algorithm [12] produced much more promising results. Listing 10 applies this clustering algorithm with a minimum number of samples (22 in this example) per cluster to the two PCA components generated in Listing 7. The results of the DBSCAN clustering algorithm can be visualized with the help of the Matplotlib library, as shown in Listing 11. Figure 4 shows the results after the PCA data has been clustered using DBSCAN.

Listing 10

Applying the DBSCAN Algorithm

from sklearn.cluster import DBSCAN
from sklearn import metrics
import numpy as np
# Extract the PCA values into an array to perform clustering
# (df is assumed to hold the PCA data for a single detector type, e.g., the scatter rows of principalDf)
X = df.iloc[:, [0, 1]].values
# Compute DBSCAN
db = DBSCAN(eps=0.35, min_samples=22).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Number of clusters: %d" % n_clusters_)
print("Number of random points: %d" % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

Listing 11

Plotting DBSCAN Clustering

import matplotlib.pyplot as plt
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
  if k == -1:
    col = [0, 0, 0, 1]
  class_member_mask = labels == k
  xy = X[class_member_mask & core_samples_mask]
  plt.plot(xy[:, 0],xy[:, 1],"o",markerfacecolor=tuple(col),markeredgecolor="k",markersize=6 )
  xy = X[class_member_mask & ~core_samples_mask]
  plt.plot(xy[:, 0],xy[:, 1],"o",markerfacecolor=tuple(col),markeredgecolor="k",markersize=3)
plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.show()
Figure 4: The PCA parameter data is clustered using DBSCAN for specular, phase, and scatter images.

To visualize the differences between bevel images that were assigned different cluster IDs, the IDs must be merged with the DataFrame containing the images' and wafers' metadata from which the IDs were obtained (see Listing 12).

Listing 12

Merging Cluster IDs

# Attach the cluster labels to the PCA coordinates (reshape(-1, 1) avoids hard-coding the sample count)
Xlabels = pd.DataFrame(np.append(X, labels.reshape((-1, 1)), axis=1))
Xlabels.columns = ['PC1', 'PC2', 'Cluster']
# Copy IMAGE_ID from the index to a column
df['IND'] = df.index
# Merge Cluster IDs with the IMAGE_IDs, using the PCA values
df = pd.merge(df, Xlabels, on=["PC1", "PC2"])
# Reset Index
df.set_index('IND', drop=True, inplace=True)
df.index.name = None
# Merge the Clustered PCA dataframe with the Image and Wafer metadata
df['IMAGE_ID'] = df.index
dfPLUS = df.merge(wafStepIMGList_df, on='IMAGE_ID')
dfPLUS = dfPLUS.merge(wafStepList_df, on=('WAF_ID','STEP_ID'))
# Derive the lot group by stripping non-uppercase-letter characters from the wafer name
dfPLUS['LOT_GRP'] = dfPLUS['WAF'].str.replace('[^A-Z]+', '', regex=True)
# Pickle individual DataFrame with Clustered data to the filesystem
#dfPLUS.to_pickle('clusteredSpecularImgPLUS.pkl')
#dfPLUS.to_pickle('clusteredPhaseImgPLUS.pkl')
dfPLUS.to_pickle('clusteredScatterImgPLUS.pkl')

I suspected that much of the structure in the PCA charts was driven by differences between wafer products. Listing 13 uses the Seaborn and Matplotlib libraries to generate a faceted chart that illustrates these differences (see Figure 5). The chart shows a clear product type dependence, which can be seen in the overlap in the parameter space between the product types identified by A, B, and C, as well as those identified by X, Y, and Z.

Listing 13

Faceted Plot of Cluster IDs by Product Type

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
g = sns.FacetGrid(dfPLUS, col="PROD", col_wrap=4,  hue="Cluster")
g = g.map(plt.scatter, "PC1", "PC2", edgecolor="w")
plt.show()
Figure 5: The faceted chart generated from the PCA feature space shows the product type dependence.

Image Display

Because I did not start with a classification, I have nothing to compare the cluster IDs with, so I cannot construct a confusion matrix or calculate the precision or recall for this method. I can, however, visually compare the images within different cluster IDs for specific product types. To achieve this, you must identify the images within each category (the distinct clusters and the unclustered images) with the greatest distance from the center of the clusters. This first requires that the centroids be identified for each product type and detector type. Then the distance is calculated from the appropriate centroid to each individual datapoint (see Listing 14).

Listing 14

Product-Detector Cluster Centers

import numpy as np
dfPLUS['PROD_Detect'] = dfPLUS['PROD'] + '_' + dfPLUS['Detector']
clusterCenters = dfPLUS[dfPLUS['Cluster'] > -1][['PROD_Detect', 'Cluster','PC1','PC2']].groupby(['PROD_Detect']).mean()
clusterCenters.index.name = None
dfPLUS['Delta1'] = \
dfPLUS.apply(lambda x: np.round_(x.PC1-clusterCenters.loc[x.PROD_Detect].PC1, 6), axis=1)
dfPLUS['Delta2'] = \
dfPLUS.apply(lambda x: np.round_(x.PC2-clusterCenters.loc[x.PROD_Detect].PC2, 6), axis=1)
dfPLUS['Delta'] = np.sqrt(dfPLUS['Delta1']**2 + dfPLUS['Delta2']**2)

Once the distances from the centers are known, the most distant images can be identified for each product and detector type using the findExtremeImgs function in Listing 15. This function returns two DataFrames: One for the abnormal or unclustered images and one for the normal or clustered images. The number of images within each category is specified by the mcols value, which is set to a value of 5 in my example.

Listing 15

DataFrames for Normal and Abnormal Images

def findExtremeImgs (detectorType, prodName):
  mcols = 5
  abnormalImg = dfPLUS[(dfPLUS['Detector'] == detectorType) & (dfPLUS['PROD'] == prodName) & (dfPLUS['Cluster'] < 0)]\
    [['PROD_Detect', 'Cluster','PC1','PC2','Delta','FILENAME']].\
    sort_values(['Delta'], ascending=False).\
    groupby(['PROD_Detect','Cluster']).\
      head(mcols)
  normalImg = dfPLUS[(dfPLUS['Detector'] == detectorType) & (dfPLUS['PROD'] == prodName) & (dfPLUS['Cluster'] >= 0)]\
    [['PROD_Detect', 'Cluster','PC1','PC2','Delta','FILENAME']].\
    sort_values(['Delta'], ascending=False).\
    groupby(['PROD_Detect','Cluster']).\
      head(mcols)
  return abnormalImg, normalImg

The displayExtremeImgs function in Listing 16 will accept the lists of extreme images and display them in a grid format.

Listing 16

Grid Display of Normal and Abnormal Images

def displayExtremeImgs (detectorType, thisProd, abnormalImg, normalImg):
  thisCols = abnormalImg['Cluster'].size
  clusterIDlist = pd.DataFrame(np.concatenate((np.unique(abnormalImg['Cluster']), np.unique(normalImg['Cluster']))))
  thisRows = clusterIDlist.size
  row_labels = ['CLUSTER_ID {}'.format(row) for row in iter(clusterIDlist.loc[:,0])]
  figure, axes = plt.subplots(nrows=thisRows, ncols=thisCols, figsize=(15,(thisRows*1.75)))
  for ax, row in zip(axes[:,2], row_labels):
    ax.set_title(row,fontweight="bold")
    ax.set_xticks([])
    ax.set_yticks([])
  figure.set_facecolor('w')
  col=0
  for i, ind in enumerate (abnormalImg.index):
    imgURI = web_config.imgServerRoot + abnormalImg.loc[ind]['FILENAME']
    img = get_URL_img(imgURI)
    axes[0, col].imshow(img)
    col = col+1
  for i, ind in enumerate (normalImg.index):
    imgURI = web_config.imgServerRoot + normalImg.loc[ind]['FILENAME']
    img = get_URL_img(imgURI)
    thisCluVal = np.int0(normalImg.loc[ind]['Cluster'])
    thisClu = np.int0(np.where(clusterIDlist == thisCluVal))
    axes[1+(i // thisCols),(i % thisCols)].imshow(img)
  plt.setp(plt.gcf().get_axes(), xticks=[], yticks=[]);
  plt.show()

Combining these two functions allows the extreme images within each category to be viewed simultaneously. In all cases, the first row of images displays those images that were not clustered (cluster ID -1.0) and can be considered abnormal. Subsequent rows show the most extreme images for all additional clusters that were observed in the dataset.
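
For example, the galleries for the scatter images of product X (the anonymized product name used in the figures) can be produced with a call along these lines:

# Illustrative call: gather and display the most extreme scatter images for product "X"
abnormalImg, normalImg = findExtremeImgs("Scatter", "X")
displayExtremeImgs("Scatter", "X", abnormalImg, normalImg)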

In examining the image galleries in Figures 6-8, you can see that it is indeed possible to separate abnormal bevel images from those without any distinguishing features. Perhaps unusually, the abnormal scatter images show a thinner ridge at the wafer edge (see Figure 7). In the phase images shown in Figure 8, the unclustered or abnormal images appear to be a sub-category of the cluster ID 0.0 images, but ones where the defect feature intersects with the image boundary. In this case, cluster ID 0.0 and cluster ID -1.0 should both be considered abnormal.

Figure 6: Specular images (abnormal and normal bevels) for product X.
Figure 7: Scatter images (abnormal and normal bevels) for product X.
Figure 8: Phase images (abnormal and normal bevels) for product X.

Although you can clearly see differences between the image clusters, it is possible that every silicon wafer has the same distribution of cluster IDs, which would provide no ability to differentiate between wafers. To investigate this, stacked bar charts were plotted for each product-detector combination, with counts grouped by wafer, using the command in Listing 17 (see Figure 9).

Listing 17

Cluster ID Stacked Bar Chart

dfPLUS[(dfPLUS['Detector'] == thisDetector) & (dfPLUS['PROD'] == thisProd)][['Cluster','WAF','INSP_TIME']].\
  groupby(['WAF','INSP_TIME','Cluster']).\
      size().unstack().\
  plot(sort_columns='INSP_TIME',kind='bar', stacked=True)
Figure 9: Stacked bar charts are plotted by product and image type for product X.

The specular images in Figure 9 show little variation between wafers. The phase images show more interesting behavior with a third cluster ID appearing on a selection of wafers. And perhaps unsurprisingly, the scatter images show an elevated level of scatter.

Differences between wafers can be observed; what remains to be seen is whether these differences are useful indicators of yield or reliability. Additionally, the significance of the differences could be determined with Fisher's exact test [13]. Because individual cluster IDs can approach zero observations, applying a Chi-square test [14] would be inappropriate in this case.
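
As a sketch of how such a test could be applied, a 2x2 contingency table comparing the cluster ID counts of two wafers can be passed to SciPy's fisher_exact function (the counts below are purely illustrative and not taken from this dataset):

from scipy.stats import fisher_exact
# Illustrative counts: images in cluster 0 vs. cluster 1 for wafer A and wafer B
table = [[30, 6],
         [28, 1]]
odds_ratio, p_value = fisher_exact(table)
print("Fisher's exact test p-value: %.3f" % p_value)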

Conclusion

It is certainly possible to segregate abnormal bevel images from those that can be considered normal by means of unsupervised machine learning. This approach is not necessarily appropriate for all image classification applications, but this does not mean that you should overlook it. As always, you should aim to employ the simplest solution where possible. In certain cases, the extraction of texture and form from images can provide a simple solution in conjunction with PCA and DBSCAN.

Unsupervised machine learning bypasses the train, validate, and test cycle. However, the end result in this unsupervised machine learning approach is not a model that can be applied to new images to generate a cluster ID. Instead, the feature extraction, standardization, and PCA generate a set of coordinates that can be compared with the parameter space obtained from the historical data by means of the k-nearest neighbors algorithm [15]. This process provides a cluster ID for each new image.
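
A minimal sketch of that final step, assuming the fitted StandardScaler and PCA objects (named scaler and pca here) have been retained so that a new image can be projected into the same coordinate space, might look like this:

from sklearn.neighbors import KNeighborsClassifier
# Train a k-NN classifier on the historical PCA coordinates and their DBSCAN cluster IDs
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xlabels[['PC1', 'PC2']], Xlabels['Cluster'])
# Extract features from a new image (new_img), keeping the Haralick-then-Hu column order
new_features = np.concatenate((extract_texture(new_img), extract_hu_moments(new_img)))
# Standardize and project into the existing PCA space, then assign a cluster ID
new_pc = pca.transform(scaler.transform(new_features.reshape(1, -1)))
print("Assigned cluster ID:", knn.predict(new_pc)[0])

The number of neighbors is a tunable choice; the historical coordinates and cluster IDs come from the Xlabels DataFrame built in Listing 12.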

Infos

  1. Smyth, Padhraic, et al. "Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth." In: Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press, Menlo Park, CA, 1995), pp. 109-120
  2. cx_Oracle: https://cx-oracle.readthedocs.io
  3. pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
  4. PIL: https://python-pillow.org/
  5. cv2: https://pypi.org/project/opencv-python/
  6. mahotas: https://pypi.org/project/mahotas/
  7. NumPy: https://numpy.org/doc/stable/user/quickstart.html
  8. Merging methods: https://pandas.pydata.org/docs/user_guide/merging.html
  9. sklearn StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  10. sklearn PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
  11. "Indoor Navigation with Machine Learning" by Roland Pleger, Linux Magazine, issue 255, February 2022, https://www.linux-magazine.com/index.php/Issues/2022/255/Machine-Learning
  12. sklearn DBSCAN: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
  13. Fisher's exact test: https://mathworld.wolfram.com/FishersExactTest.html
  14. Chi-square test: https://www.jmp.com/en_be/statistics-knowledge-portal/chi-square-test.html
  15. k-nearest neighbor algorithm: https://www.ibm.com/topics/knn

The Author

Garry Tuohy has been surviving in the semiconductor industry for longer than a radiation-hardened processor. He also enjoys astronomy and dreaming of fanciful uses for microcontrollers.