Week 1: Machine Learning Basics

Introduction

Before we can dive deep into large language models, it is essential to get familiar with fundamental concepts in machine learning. Some key topics that should be revised include:

  • Machine Learning Problems

    • Supervised vs Unsupervised
    • Classification vs Regression
  • Neural Networks

    • Building Deep Neural Networks: MLPs & CNNs
    • Activation Functions: Sigmoid, ReLu
    • Loss Functions: Cross-Entropy Loss
    • Optimization and Gradient Descent
  • Natural Language Processing

    • Word Embeddings
    • Tokenization
    • RNNs

Additionally, working knowledge of libraries such as NumPy, Pandas, Scikit-Learn, PyTorch, Tensorflow etc is going to helpful.

Goals for the Week

  • The following guide from Maxime Labonne covers the essentials discussed above, and hence have been provided as-is for reference. It introduces essential knowledge about mathematics, Python, and neural networks. Your goal this week is to revise the fundamentals by going through this guide as needed.

Reference: https://github.com/mlabonne/llm-course?tab=readme-ov-file#-llm-fundamentals

Learning Guide

1. Mathematics for Machine Learning

Before mastering machine learning, it is important to understand the fundamental mathematical concepts that power these algorithms.

  • Linear Algebra: This is crucial for understanding many algorithms, especially those used in deep learning. Key concepts include vectors, matrices, determinants, eigenvalues and eigenvectors, vector spaces, and linear transformations.
  • Calculus: Many machine learning algorithms involve the optimization of continuous functions, which requires an understanding of derivatives, integrals, limits, and series. Multivariable calculus and the concept of gradients are also important.
  • Probability and Statistics: These are crucial for understanding how models learn from data and make predictions. Key concepts include probability theory, random variables, probability distributions, expectations, variance, covariance, correlation, hypothesis testing, confidence intervals, maximum likelihood estimation, and Bayesian inference.

๐Ÿ“š Resources:


2. Python for Machine Learning

Python is a powerful and flexible programming language that’s particularly good for machine learning, thanks to its readability, consistency, and robust ecosystem of data science libraries.

  • Python Basics: Python programming requires a good understanding of the basic syntax, data types, error handling, and object-oriented programming.
  • Data Science Libraries: It includes familiarity with NumPy for numerical operations, Pandas for data manipulation and analysis, Matplotlib and Seaborn for data visualization.
  • Data Preprocessing: This involves feature scaling and normalization, handling missing data, outlier detection, categorical data encoding, and splitting data into training, validation, and test sets.
  • Machine Learning Libraries: Proficiency with Scikit-learn, a library providing a wide selection of supervised and unsupervised learning algorithms, is vital. Understanding how to implement algorithms like linear regression, logistic regression, decision trees, random forests, k-nearest neighbors (K-NN), and K-means clustering is important. Dimensionality reduction techniques like PCA and t-SNE are also helpful for visualizing high-dimensional data.

๐Ÿ“š Resources:


3. Neural Networks

Neural networks are a fundamental part of many machine learning models, particularly in the realm of deep learning. To utilize them effectively, a comprehensive understanding of their design and mechanics is essential.

  • Fundamentals: This includes understanding the structure of a neural network, such as layers, weights, biases, and activation functions (sigmoid, tanh, ReLU, etc.)
  • Training and Optimization: Familiarize yourself with backpropagation and different types of loss functions, like Mean Squared Error (MSE) and Cross-Entropy. Understand various optimization algorithms like Gradient Descent, Stochastic Gradient Descent, RMSprop, and Adam.
  • Overfitting: Understand the concept of overfitting (where a model performs well on training data but poorly on unseen data) and learn various regularization techniques (dropout, L1/L2 regularization, early stopping, data augmentation) to prevent it.
  • Implement a Multilayer Perceptron (MLP): Build an MLP, also known as a fully connected network, using PyTorch.

๐Ÿ“š Resources:


4. Natural Language Processing (NLP)

NLP is a fascinating branch of artificial intelligence that bridges the gap between human language and machine understanding. From simple text processing to understanding linguistic nuances, NLP plays a crucial role in many applications like translation, sentiment analysis, chatbots, and much more.

  • Text Preprocessing: Learn various text preprocessing steps like tokenization (splitting text into words or sentences), stemming (reducing words to their root form), lemmatization (similar to stemming but considers the context), stop word removal, etc.
  • Feature Extraction Techniques: Become familiar with techniques to convert text data into a format that can be understood by machine learning algorithms. Key methods include Bag-of-words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and n-grams.
  • Word Embeddings: Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. Key methods include Word2Vec, GloVe, and FastText.
  • Recurrent Neural Networks (RNNs): Understand the working of RNNs, a type of neural network designed to work with sequence data. Explore LSTMs and GRUs, two RNN variants that are capable of learning long-term dependencies.

๐Ÿ“š Resources: