Machine Learning - Overview

Machine Learning techniques

Machine learning encompasses a wide range of techniques used to analyze data, identify patterns, and make predictions.

Predictions can capture a direct relationship (e.g., weight -> [magic prediction box] -> height), but also a temporal one that learns the relationship between some point in time and a future point in time (price now -> [magic prediction box] -> future price).

These techniques can be broadly categorized into different learning paradigms based on how they process and learn from data.

Some of these paradigms and naming conventions are well established, but ML research is constantly evolving, leading to many other learning techniques and combinations. It's simply not realistic to keep up with everything, but understanding the basics will cover most of the problems you set out to solve.

Here are some of the most important ones:

Supervised Learning

Supervised learning assumes that you have both the input data and the corresponding output (the value you want to predict). The model learns by repeatedly comparing its predicted outcome with the actual outcome, using the error to iteratively tweak its parameters until it reaches a good fit.
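
A minimal sketch of that error-correction loop, using plain gradient descent to fit a single weight. The data here is synthetic, and the learning rate and step count are illustrative choices:

```python
import numpy as np

# A minimal "learning by error correction" loop: fit y = w * x with
# gradient descent, nudging w to reduce the squared prediction error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true weight is 3.0

w, lr = 0.0, 0.1
for _ in range(100):
    pred = w * x
    error = pred - y               # compare prediction with actual
    grad = 2 * np.mean(error * x)  # gradient of the mean squared error
    w -= lr * grad                 # tweak the parameter

print("learned weight:", w)  # should end up close to 3.0
```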

For example, if we have a time series of stock prices, we know the actual price movements at each point in time. This allows us to label data points, such as marking a date as a "buy" or "sell" signal based on future trends. Alternatively, we can train a model to predict a specific outcome, such as the expected stock price X days into the future.

Supervised learning is commonly divided into two main types:

In supervised classification learning, we assign labels (e.g., buy/hold/sell signals) to a set of input data. By doing this, we "classify" the data and train the model to infer patterns that allow it to classify similar datasets in the future. When testing on new data, we want the model to correctly recognize (classify) similar patterns based on its previous training. For example, we could manually (or, preferably, programmatically) label all stock dips that have a future recovery in our dataset. Given enough examples, the model might be able to recognize these dip-buying situations and create a "dip buyer model." Now wouldn't that be cool?!
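
Here is a minimal sketch of that dip-buyer idea using scikit-learn. The price series is a synthetic random walk, and the dip and recovery thresholds are illustrative assumptions, not a tested trading rule:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: a random-walk price series standing in for real closes.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 1000).cumsum())

# Programmatic labeling: a "dip" is a 5-day drop of more than 2%;
# it gets label 1 ("buy") if the price recovers within the next 10 days.
past_ret = close.pct_change(5)
future_ret = close.shift(-10) / close - 1
is_dip = past_ret < -0.02
label = (is_dip & (future_ret > 0)).astype(int)

# Simple features: recent returns over a few horizons.
features = pd.DataFrame({
    "ret_1": close.pct_change(1),
    "ret_5": close.pct_change(5),
    "ret_20": close.pct_change(20),
}).dropna()
label = label.loc[features.index]

X_train, X_test, y_train, y_test = train_test_split(
    features, label, shuffle=False  # keep time order for a time series
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```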

Regression is used for predicting continuous values. Instead of assigning labels, the model outputs a numeric prediction. For instance, a regression model could be trained to predict future stock prices based on historical price movements, trading volume, and other relevant features. The goal is to find a mathematical relationship between input variables and the output so that the model can make accurate predictions on unseen data.
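
A similar sketch for regression, again on a synthetic price series, where the target is the closing price a few days ahead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data: a random-walk price series standing in for real closes.
rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())

horizon = 5  # predict the close 5 days ahead
X = pd.DataFrame({
    "ret_1": close.pct_change(1),
    "ret_5": close.pct_change(5),
    "close": close,
})
y = close.shift(-horizon)  # target: the future price

data = pd.concat([X, y.rename("target")], axis=1).dropna()
split = int(len(data) * 0.8)  # time-ordered split, no shuffling
train, test = data.iloc[:split], data.iloc[split:]

model = LinearRegression()
model.fit(train[X.columns], train["target"])
print("R^2 on held-out data:", model.score(test[X.columns], test["target"]))
```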

Unsupervised Learning

Unlike supervised learning, unsupervised learning works with unlabeled data, meaning there are no predefined outputs for the model to learn from. Instead, the algorithm tries to identify hidden patterns, structures, or groupings within the data.

Unsupervised learning is commonly divided into three main types:

Clustering algorithms group similar data points together based on their characteristics. This is useful in stock market analysis for identifying similar trading patterns or grouping stocks with similar behavior.

For example, a clustering algorithm could analyze a large number of stocks and group them into categories based on volatility, trading volume, or performance trends, helping traders find potential investment opportunities.
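
As a sketch, here is k-means clustering stocks on a few summary features. The per-stock features are randomly generated stand-ins for values you would compute from real price and volume history:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative per-stock summary features; in practice these would be
# computed from each stock's actual price and volume history.
rng = np.random.default_rng(0)
stocks = pd.DataFrame({
    "volatility": rng.lognormal(mean=-3, sigma=0.5, size=200),
    "avg_volume": rng.lognormal(mean=12, sigma=1.0, size=200),
    "momentum": rng.normal(0, 0.05, size=200),
})

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(stocks)

# Group the stocks into 4 behavioral clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
stocks["cluster"] = kmeans.fit_predict(X)
print(stocks.groupby("cluster").mean())
```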

Dimensionality reduction techniques are used to simplify large datasets by reducing the number of features while preserving the most important information. This is useful for visualizing complex data or improving computational efficiency in machine learning models.

For example, in financial markets, we might have hundreds of features affecting stock prices, but dimensionality reduction can help distill the most relevant ones, allowing for more effective model training.
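
A minimal sketch using PCA (principal component analysis), one common dimensionality reduction technique. The feature matrix is synthetic, built so that a few hidden drivers explain most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative feature matrix: 300 observations x 100 correlated features,
# standing in for the "hundreds of features" case.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))  # 5 hidden drivers
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 100))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("reduced from", X.shape[1], "to", X_reduced.shape[1], "features")
print("variance explained:", pca.explained_variance_ratio_.sum())
```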

Density estimation is a type of unsupervised learning focused on learning the probability distribution of data. Instead of predicting labels or values, the goal is to model how data points are distributed in a given space. This is useful for anomaly detection, risk modeling, and understanding underlying data structures.

For example, in finance, density estimation can help identify unusual price movements or detect outliers in stock returns. By estimating the probability distribution of normal price changes, any deviation from this distribution can be flagged as a potential anomaly, such as a sudden crash or an unexpected spike in volatility.

Common techniques for density estimation include histograms, kernel density estimation (KDE), and Gaussian mixture models (GMMs).
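
As a sketch of the anomaly detection idea, here is kernel density estimation on synthetic daily returns; the injected crash values and the 1% flagging threshold are illustrative choices:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Illustrative daily returns: mostly normal moves plus a few injected shocks.
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, 1000)
returns[::250] = -0.08  # a handful of crash-like outliers

# Fit a kernel density estimate of "normal" return behavior.
kde = KernelDensity(kernel="gaussian", bandwidth=0.005)
kde.fit(returns.reshape(-1, 1))

# Score every return; a very low log-density means the point is unlikely
# under the fitted distribution.
log_density = kde.score_samples(returns.reshape(-1, 1))
threshold = np.quantile(log_density, 0.01)
anomalies = returns[log_density < threshold]
print("flagged", len(anomalies), "anomalous returns, e.g.:", anomalies[:3])
```
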
Semi-Supervised Learning

Learning from a mix of labeled and unlabeled data (e.g., self-training for image classification).
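
A minimal sketch using scikit-learn's SelfTrainingClassifier, which pseudo-labels confident predictions on the unlabeled portion and refits. The data and the 10% labeled fraction are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Pretend we only have labels for 10% of the data; scikit-learn marks
# unlabeled samples with -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.1
y_partial[unlabeled] = -1

# Self-training: fit on the labeled subset, pseudo-label confident
# unlabeled points, refit, and repeat.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print("accuracy against all true labels:", model.score(X, y))
```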

Reinforcement Learning

Reinforcement learning (RL) is a different approach where an agent learns by interacting with an environment and receiving rewards or penalties based on its actions. Instead of learning from a fixed dataset, the model continuously updates its strategy based on trial and error.

A classic example in finance is an RL model that learns optimal trading strategies by executing trades in a simulated market environment, adjusting its decisions based on profit or loss. Over time, the model refines its strategy to maximize returns.
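
A toy sketch of tabular Q-learning, a classic RL algorithm, in a deliberately simplified two-state "trend" environment. Everything about the environment (states, persistence probability, rewards) is an illustrative assumption, not a real market model:

```python
import numpy as np

# Toy environment: the market is in an "up" trend (state 0) or a "down"
# trend (state 1); actions are 0 = stay flat, 1 = go long. The reward is
# the position times the next price move.
rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

state = 0
for step in range(10_000):
    # Epsilon-greedy action selection: mostly exploit, sometimes explore.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))

    # The trend persists with probability 0.8; the price move follows it.
    next_state = state if rng.random() < 0.8 else 1 - state
    move = 1.0 if next_state == 0 else -1.0
    reward = action * move  # long in an up-move earns +1, in a down-move -1

    # Standard Q-learning update.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                 - Q[state, action])
    state = next_state

print(Q)  # the agent should learn to go long only in the "up" state
```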

Self-Supervised Learning

Generating pseudo-labels from raw data (e.g., contrastive learning in NLP and vision).
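
Contrastive learning itself is fairly involved, so here is a simpler sketch of the core idea: deriving pseudo-labels from the raw data itself (a pretext task) with no human labeling. The pretext task here, predicting the sign of the next return, is an illustrative choice:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative raw data: an unlabeled price series (synthetic random walk).
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 1000).cumsum())

# Pretext task: derive pseudo-labels from the data itself -- here, whether
# the next day's return is positive. No human labeling is involved.
features = pd.DataFrame({
    "ret_1": close.pct_change(1),
    "ret_5": close.pct_change(5),
})
next_ret = close.pct_change(1).shift(-1)

data = pd.concat([features, next_ret.rename("next_ret")], axis=1).dropna()
y = (data["next_ret"] > 0).astype(int)

model = LogisticRegression()
model.fit(data[features.columns], y)
# The fitted model (or the representation it learned) can then be reused
# or fine-tuned on a downstream task that does have real labels.
```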

Generative Learning

Creating new data samples (e.g., GANs, VAEs, diffusion models).
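
Deep generative models like GANs or VAEs are too large to sketch here, so here is the same idea in miniature with a Gaussian mixture model: fit a distribution to the data, then sample brand-new, synthetic data points from it. The two-regime returns data is made up for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: daily returns with two regimes (calm and volatile).
rng = np.random.default_rng(0)
calm = rng.normal(0.0005, 0.005, 800)
volatile = rng.normal(-0.001, 0.03, 200)
returns = np.concatenate([calm, volatile]).reshape(-1, 1)

# Fit a two-component mixture, i.e. learn the data's distribution...
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(returns)

# ...then generate new samples from the learned distribution.
synthetic, _ = gmm.sample(500)
print("mean/std of synthetic returns:", synthetic.mean(), synthetic.std())
```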