Clustering helps to automatically recognize signal patterns


Among other things, we use clustering to identify microcracks in automated manufacturing processes by analyzing high-frequency structure-borne noise emitted in the event of crack formation. 
Distinguishing typical machine emissions from spontaneous crack formation across millions of data sets requires machine learning approaches.

Figure: various sporadically occurring structure-borne noise emissions, shown as 3D spectrograms


Clustering concept:


Data preparation:
For each process, we automatically cut out a "significant" snippet using a window-slicing technique combined with a threshold check against a specific value.
We determine the threshold values for this significance analysis from the amplitude/energy histogram, guided by changes in the gradient curve.
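As an illustration, here is a minimal sketch of such a gradient-based threshold estimate; the function name, bin count, and flatness criterion are assumptions for illustration, not our production code:

```python
import numpy as np

def estimate_threshold(amplitudes, bins=256, flatness=0.05):
    """Derive a significance threshold from the amplitude histogram by
    finding where the gradient of the (log-scaled) histogram flattens
    out, i.e. where ordinary machine noise ends and rare events begin."""
    hist, edges = np.histogram(np.abs(amplitudes), bins=bins)
    grad = np.gradient(np.log1p(hist))           # gradient of the log histogram
    flat = np.where(np.abs(grad) < flatness)[0]  # bins with a near-zero gradient
    idx = flat[0] if flat.size else bins // 2    # fall back to the median bin
    return edges[idx]
```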
After the snippet is created, mesh and interpolation algorithms eliminate some noise points while keeping the most important ones, since raw signals usually contain random components that we do not want to include in the pattern matching. We then save each snippet as an HDF5 file per process.
HDF5 is an easy-to-use file format: its structure resembles a dictionary, and it can also store many attributes of the data set.
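A simplified sketch of the slicing and saving steps, using the h5py library; the RMS energy criterion and the window and hop sizes are assumptions for illustration:

```python
import numpy as np
import h5py

def extract_snippet(signal, threshold, window=1024, hop=512):
    """Slide a window over the raw signal and return the first region
    whose RMS energy exceeds the significance threshold."""
    for start in range(0, len(signal) - window + 1, hop):
        frame = signal[start:start + window]
        if np.sqrt(np.mean(frame ** 2)) > threshold:
            return frame, start
    return None, None

def save_snippet(path, snippet, sampling_rate, process_id):
    """Store the snippet in an HDF5 file; attributes carry the metadata,
    much like entries in a dictionary."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("snippet", data=snippet)
        dset.attrs["sampling_rate_hz"] = sampling_rate
        dset.attrs["process_id"] = process_id
```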

 


The snippets can already be extracted and saved on the measuring device installed on the machine; analysis operators and Python modules are available for this purpose.


Machine learning:


Since acoustic data has very particular characteristics, classic approaches from image processing are not readily transferable, so we have developed our own methods for similarity and distance evaluation.
The pairwise similarity measures between the individual snippets then form the similarity matrix.
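Our actual similarity measure is not shown here; the sketch below uses a plain Euclidean distance as a stand-in, simply to illustrate how the matrix is assembled:

```python
import numpy as np

def distance(a, b):
    """Stand-in for the in-house similarity/distance measure; here simply
    the Euclidean distance between two equal-length feature vectors."""
    return float(np.linalg.norm(a - b))

def distance_matrix(snippets):
    """Evaluate the measure for every pair of snippets to build the
    symmetric distance matrix."""
    n = len(snippets)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = distance(snippets[i], snippets[j])
    return d
```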

We then feed the similarity matrix into the Python SciPy library, where hierarchical clustering is implemented.
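A minimal example of this step with SciPy, assuming the distance matrix d from the sketch above; the average linkage method is an assumption:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# squareform turns the symmetric distance matrix d into the condensed
# vector form that scipy's linkage function expects.
condensed = squareform(d, checks=False)
Z = linkage(condensed, method="average")  # the linkage method is an assumption

dendrogram(Z)
plt.show()
```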

Below we see the dendrogram of the result. 
To determine how many clusters there should be, a distance threshold is calculated; it is derived from a percentile of the pairwise distances.
As the diagram shows, this project yields 7 clusters with a distance threshold of 3911.
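Sketched with SciPy's fcluster, assuming the linkage matrix Z and condensed distances from above; the exact percentile is project-specific, so the 90th is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a distance threshold derived from a percentile
# of the pairwise distances.
threshold = np.percentile(condensed, 90)
labels = fcluster(Z, t=threshold, criterion="distance")
n_clusters = labels.max()  # e.g. 7 clusters at a threshold of 3911
```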


Below we show one result per cluster, which we refer to as the representative snippet: for each cluster, we determine the snippet that best represents it.

The idea is similar to k-means, where the central point of each cluster is determined: the representative snippet is the one whose total distance to all other snippets in the same cluster is smallest.
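A small sketch of this selection (a medoid computation), assuming the distance matrix d and the cluster labels from the steps above:

```python
import numpy as np

def representative_indices(d, labels):
    """For each cluster, return the index of the snippet whose total
    distance to all other members is smallest (the medoid)."""
    reps = {}
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        sub = d[np.ix_(members, members)]  # within-cluster distances
        reps[c] = members[np.argmin(sub.sum(axis=1))]
    return reps
```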


You can read more about machine learning and compression concepts here.