Machine Learning and Correlation
The key concepts of Machine Learning and Data Science have to do with modeling and applying statistics to your data sets. In this article, we are taking a different approach. If you've never worked with the subject of Machine Learning, which is allied to Digital Signal Processing, this article provides a gentle introduction to a simple concept that will get you started on future studies. That concept is the correlation of two sequences of numbers to produce a third sequence, which is the output of the correlator. The correlation sum is not Machine Learning, but mastering this concept is a good start.
The correlation sum is calculated for two input sequences, x and y: for each lag, one sequence is shifted relative to the other by that lag, the overlapping elements are multiplied pairwise, and the products are summed. The input is a function of time, while the output is a function of the "lag" between the two sequences. When the lag is zero, the output of the correlator is the dot product of the two inputs.
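As a rough sketch of the idea, the correlation sum can be written as r[k] = sum over n of x[n] * y[n + k], where k is the lag. The sign convention for the lag and the choice to treat out-of-range terms as zero are assumptions made for this example, and the function name `correlation_sum` is just an illustrative label.

```python
def correlation_sum(x, y):
    """Return a dict mapping each lag k to the sum over n of x[n] * y[n + k].

    Terms whose index falls outside y are treated as zero (an assumption
    made for this sketch).
    """
    result = {}
    for lag in range(-(len(x) - 1), len(y)):
        total = 0
        for n, x_n in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                total += x_n * y[m]
        result[lag] = total
    return result


x = [1, 2, 3]
y = [4, 5, 6]
r = correlation_sum(x, y)
print(r[0])   # 32, the dot product 1*4 + 2*5 + 3*6
print(r[1])   # 17, from 1*5 + 2*6
print(r[-1])  # 23, from 2*4 + 3*5
```

At a lag of zero every element lines up with its counterpart, which is why that one output equals the dot product.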
The first example shows the Barker code, which is a sequence of numbers chosen for its ability to correlate in an optimal way. As a loose analogy, think of squaring a number. Multiplying a value by itself never gives a negative result, so when a sequence is lined up exactly with a copy of itself, every product adds to the total and nothing cancels.
For a sequence correlated with itself (autocorrelation), the correlation sum outputs its largest value when the lag is zero.
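To make this concrete, here is a small sketch that correlates a Barker code with itself. The article does not say which length it plots, so the length-7 code used here is my choice for illustration, and `autocorrelation` is a hypothetical helper name.

```python
def autocorrelation(seq):
    """Correlate seq with itself; return the values for lags -(N-1) .. N-1."""
    n = len(seq)
    values = []
    for lag in range(-(n - 1), n):
        total = 0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += seq[i] * seq[j]
        values.append(total)
    return values


barker7 = [1, 1, 1, -1, -1, 1, -1]
acf = autocorrelation(barker7)
print(acf)  # [-1, 0, -1, 0, -1, 0, 7, 0, -1, 0, -1, 0, -1]
```

The middle entry is the zero-lag value, equal to the length of the code. Every other lag gives 0 or -1, which is the sharp single peak that makes these codes attractive.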
Barker codes belong to a class of sequences called pseudo-random codes. One use case for pseudo-random codes is in noisy communication channels, where the signal-to-noise ratio is poor. Coding your signal using pseudo-random codes enables you to detect the presence of the transmitted signal by effectively integrating (summing) the energy in an optimal way. This has nothing to do with machine learning, at least in the sense that I am aiming for. I mention it only to introduce the idea of correlation.
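The detection idea can be sketched by hiding a code in noise and sliding a copy of the code along the received samples. Everything specific here is an assumption for illustration: the length-13 Barker code, the Gaussian noise level, the offset of 20, and the helper name `find_code`.

```python
import random

BARKER13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def find_code(received, code):
    """Slide code along received; return (best_offset, scores)."""
    scores = []
    for offset in range(len(received) - len(code) + 1):
        window = received[offset:offset + len(code)]
        scores.append(sum(r * c for r, c in zip(window, code)))
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


random.seed(0)  # fixed seed so the run is repeatable
true_offset = 20
received = [random.gauss(0.0, 0.5) for _ in range(60)]  # noise only
for i, chip in enumerate(BARKER13):
    received[true_offset + i] += chip  # bury the code in the noise

best, scores = find_code(received, BARKER13)
print("code inserted at", true_offset, "- strongest correlation at", best)
```

Summing thirteen products lets the signal contributions add up in step while the noise contributions tend to cancel, so the peak stands out even though no single sample does.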
The correlation sum is a very simple step in the overall process of decomposing data into a sum of basic components. This is partially illustrated by the next few examples in the graphs. The graphs show the autocorrelation of sinusoids. You'll notice that the autocorrelation of a sinusoid is strongest at a lag of zero and falls to zero at certain lags. The cross-correlation of two sinusoids of different frequencies is weak.
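The behavior of sinusoids can be checked numerically. This sketch uses a circular (wrap-around) correlation over whole cycles, which is an assumption chosen to keep the numbers clean and may differ from how the original graphs were produced. The sample count and frequencies are also arbitrary choices.

```python
import math

N = 64  # number of samples; every sinusoid below fits a whole number of cycles


def sinusoid(cycles):
    """N samples of a sine wave completing the given number of cycles."""
    return [math.sin(2 * math.pi * cycles * n / N) for n in range(N)]


def circular_correlation(x, y, lag):
    """Sum of x[n] * y[(n + lag) mod N], wrapping around at the ends."""
    return sum(x[n] * y[(n + lag) % N] for n in range(N))


s4 = sinusoid(4)  # period of 16 samples
s7 = sinusoid(7)

print(circular_correlation(s4, s4, 0))  # close to N / 2 = 32, the peak
print(circular_correlation(s4, s4, 4))  # close to 0 at a quarter-period lag
print(circular_correlation(s4, s7, 0))  # close to 0 for different frequencies
```

With zero-padding instead of wrap-around, or with a non-integer number of cycles, the small values would be small rather than essentially zero.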
What I'd like you to notice is that when the lag is zero, the cross-correlation of a sine and a cosine of the same frequency is zero. This isn't immediately intuitive, because the signals are so similar except for being time-shifted from each other. The lesson to take away is that decomposition requires more than one operation in this case. Correlation with both a sine and a time-shifted sine (which ends up being the same as a cosine) is needed to cover all cases. If you understand that periodic waves can be represented as a weighted sum of component sinusoids of various frequencies, then you can understand the use of correlation for finding the weights. This is the concept of the Fourier Transform. Another way of saying this is that you can use the Fourier Transform to transform, or map, data from the time domain into the frequency domain.
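Here is a minimal sketch of finding the weights by correlation. The test signal, its weights of 3.0 and 1.5, and the helper name `weights_at` are made up for illustration. The 2 / N scaling assumes the samples cover a whole number of cycles.

```python
import math

N = 64  # samples covering a whole number of cycles of every component


def weights_at(signal, cycles):
    """Zero-lag correlation of signal with a sine and a cosine at one frequency.

    Returns (sine_weight, cosine_weight), scaled by 2 / N.
    """
    sin_sum = sum(s * math.sin(2 * math.pi * cycles * n / N)
                  for n, s in enumerate(signal))
    cos_sum = sum(s * math.cos(2 * math.pi * cycles * n / N)
                  for n, s in enumerate(signal))
    return 2 * sin_sum / N, 2 * cos_sum / N


# A made-up signal: 3.0 of a 2-cycle sine plus 1.5 of a 5-cycle cosine.
signal = [3.0 * math.sin(2 * math.pi * 2 * n / N)
          + 1.5 * math.cos(2 * math.pi * 5 * n / N)
          for n in range(N)]

for cycles in (2, 3, 5):
    w_sin, w_cos = weights_at(signal, cycles)
    print(cycles, round(w_sin, 6), round(w_cos, 6))
```

The sine weight comes out near 3.0 at 2 cycles and the cosine weight near 1.5 at 5 cycles, with everything else near zero. Correlating with the sine alone would have missed the cosine component entirely, which is why both are needed.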
If you're still following along, I'll cover how this concept helps with starting to understand machine learning. (It's relevant, I promise!) I'm going to take the concept of correlation of time-domain signals and apply it to data that is not a function of time.
Multiplying each element in a sequence by the corresponding element in another sequence, and then summing the products, can also be represented as the dot product of two vectors. In the case of vectors of up to three elements, each element represents a distance with respect to a dimension. A three-dimensional vector can be represented as a 3-tuple or an array of three elements. In the two-dimensional or three-dimensional case, there is a geometric significance to the result of the dot product operation. It's a measure of similarity.
Dividing the dot product of two vectors by the product of their lengths gives the "normalized" dot product, which is the cosine of the angle between them. Normalized in this sense means that each vector represents direction only and carries no information about magnitude. The more similar the two vectors are to each other, the smaller the angle and the closer the value will be to one.
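A short sketch of the normalized dot product, often called cosine similarity: the dot product of a and b divided by the product of their lengths. The function names are illustrative, and the sketch assumes neither vector is all zeros (that case would divide by zero).

```python
import math


def dot(a, b):
    """Sum of the products of corresponding elements."""
    return sum(p * q for p, q in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


print(cosine_similarity([1, 0], [2, 0]))  # 1.0: same direction, different length
print(cosine_similarity([1, 0], [0, 3]))  # 0.0: perpendicular
print(cosine_similarity([1, 1], [1, 0]))  # about 0.7071, a 45-degree angle
```

The first line shows the effect of normalizing: doubling the length of a vector does not change the result.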
The dimensions that you can use this technique with are not just those that can be mapped to a plane or space. Imagine that you assign a numerical value to each word that appears in a search query. The value for the first word can represent the distance along the x-axis. The value for the second word can represent a distance along the y-axis. The value for the third word can represent a distance along the z-axis. And so on. The resulting vector describes the search query, and its similarity to documents can be measured by applying the same normalized dot product to their vectors. Finding the documents with the smallest angle between their vector representation and the search query's vector representation is a way of finding the most "relevant" or the most "similar" documents.
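The search idea can be sketched with a toy example. Using raw word counts as the values is an assumption made only to keep the example short, since choosing good weights is a separate problem. The documents, the query, and the helper names are all invented for illustration.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """One axis per vocabulary word; the value is how often the word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Normalized dot product; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / norm if norm else 0.0


documents = {
    "doc_a": "barker codes improve detection in noisy channels",
    "doc_b": "the dot product measures similarity between vectors",
    "doc_c": "cooking pasta requires boiling water",
}
query = "similarity of vectors"

# Every distinct word in the documents and the query gets its own axis.
vocabulary = sorted(set(" ".join(documents.values()).split()) | set(query.split()))
query_vec = to_vector(query, vocabulary)

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vec, to_vector(documents[name], vocabulary)),
    reverse=True,
)
print(ranked[0])  # doc_b, the only document sharing words with the query
```

With only three axes you could draw these vectors. With a few dozen you cannot, but the arithmetic is identical.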
Choosing the weights used for the values of the words in the documents that you are searching through is a complex task, and I'm not covering it in this blog. This is one way that search and classification can be done, as opposed to using neural nets. I'm not suggesting that one method supersedes the other, but rather that understanding both approaches will help you understand the overall nature of the data in your projects.