🧠 Neural Network Architecture

Understanding How Deep Learning Models Are Built

1. Basic Feed-Forward Architecture

[Diagram] A basic feed-forward network: input layer (4 neurons) → hidden layer 1 (5 neurons) → hidden layer 2 (3 neurons) → output layer (2 neurons), with the forward pass flowing left to right.

📥 Input Layer

  • Purpose: Receives raw data
  • Size: Matches number of features
  • Example: 4 features (sepal length, sepal width, petal length, petal width)
  • No activation function
  • Just passes values forward

🔧 Hidden Layers

  • Purpose: Extract patterns and features
  • Size: Flexible; typically tens to hundreds of neurons
  • Depth: More layers = "deeper" network
  • Use activation functions (ReLU, sigmoid)
  • Learn increasingly abstract representations

📤 Output Layer

  • Purpose: Produces final predictions
  • Size: Matches number of classes/outputs
  • Classification: Softmax activation (see the sketch after this list)
  • Regression: Linear activation
  • Each neuron = one class probability
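
For classification, softmax turns the output layer's raw scores into probabilities that sum to 1. A minimal numeric sketch (the logit values are made up purely for illustration):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])              # raw output scores for 3 classes
probs = np.exp(logits) / np.exp(logits).sum()   # softmax
print(probs.round(3))                           # [0.659 0.242 0.099], sums to 1.0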

2. Training Process: Forward and Backward Pass

[Diagram] The training loop: input → hidden → output → loss. 1. The forward pass computes predictions; 2. the loss is calculated; 3. the backward pass computes gradients; 4. weights are updated as w = w - η × ∂L/∂w; then repeat.
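
The update rule in step 4 can be seen in isolation on a toy one-weight problem. A minimal sketch, with a made-up loss L(w) = (w - 3)² chosen only so that the gradient is easy to compute by hand:

# Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3
w = 0.0      # initial weight
eta = 0.1    # learning rate
for step in range(25):
    grad = 2 * (w - 3)       # gradient of the loss at the current w
    w = w - eta * grad       # update: w = w - eta * dL/dw
print(round(w, 3))           # approaches 3.0, the minimum of the loss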

3. Common Architecture Patterns

📊 Shallow Network

  • 1-2 hidden layers
  • Good for simple problems
  • Fast training
  • Example: Iris classification
  • Architecture: 4→10→3

🏗️ Deep Network

  • 3+ hidden layers
  • Learns complex patterns
  • Requires more data
  • Example: Image recognition
  • Architecture: 784→128→64→32→10

🎯 Wide Network

  • Few layers, many neurons
  • Captures many features at once
  • Memory intensive
  • Example: Tabular data
  • Architecture: 50→500→10

💻 Building a Network in Python

import numpy as np

# Simple activation functions
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

# Define network architecture
input_size = 4    # Features in dataset
hidden_size = 10  # Hidden layer neurons
output_size = 3   # Number of classes

# Initialize weights with small random values, biases at zero
W1 = np.random.randn(input_size, hidden_size) * 0.01
b1 = np.zeros((1, hidden_size))

W2 = np.random.randn(hidden_size, output_size) * 0.01
b2 = np.zeros((1, output_size))

# Example input batch: 5 samples with 4 features each
X = np.random.randn(5, input_size)

# Forward propagation
z1 = np.dot(X, W1) + b1   # Linear combination
a1 = relu(z1)             # Activation

z2 = np.dot(a1, W2) + b2  # Linear combination
a2 = softmax(z2)          # Output probabilities (one row per sample)
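
Continuing from the variables above, a hedged sketch of the matching backward pass and gradient-descent update; the one-hot labels y and the learning rate lr are illustrative additions, not part of the original snippet:

# Predicted class = index of the largest output probability
pred = np.argmax(a2, axis=1)

# Illustrative one-hot labels and learning rate
y = np.eye(output_size)[np.random.randint(0, output_size, size=X.shape[0])]
lr = 0.1

# Backward pass (softmax output + cross-entropy loss)
dz2 = (a2 - y) / X.shape[0]         # gradient at the output layer
dW2 = np.dot(a1.T, dz2)
db2 = dz2.sum(axis=0, keepdims=True)

dz1 = np.dot(dz2, W2.T) * (z1 > 0)  # backprop through ReLU
dW1 = np.dot(X.T, dz1)
db1 = dz1.sum(axis=0, keepdims=True)

# Update each weight: w = w - lr * dL/dw
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2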

🔑 Key Architectural Decisions

  • Number of Layers: More layers = more abstraction, but harder to train
  • Neurons per Layer: Balance between capacity and overfitting
  • Activation Functions: ReLU for hidden layers, Softmax for output
  • Learning Rate: Controls the step size of each weight update (typically 0.001 to 0.1)
  • Batch Size: Number of samples processed before each weight update
  • Epochs: Number of complete passes through the entire dataset (see the training-loop sketch after this list)
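
A minimal sketch of how learning rate, batch size, and epochs fit together in a mini-batch training loop; the data arrays are random placeholders and the forward/backward steps are left as a comment:

import numpy as np

# Placeholder data: 150 samples, 4 features, 3 classes (Iris-sized)
X_train = np.random.randn(150, 4)
y_train = np.random.randint(0, 3, size=150)

learning_rate = 0.01   # step size for each weight update
batch_size = 32        # samples processed before each weight update
epochs = 10            # full passes through the dataset

for epoch in range(epochs):
    order = np.random.permutation(len(X_train))      # reshuffle every epoch
    for start in range(0, len(X_train), batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = X_train[idx], y_train[idx]
        # forward pass, loss, backward pass, and weight update go here,
        # e.g. the steps from the "Building a Network in Python" section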

4. Architecture Design Guidelines

✅ Start Simple

  • Begin with 1-2 hidden layers
  • Use 10-50 neurons per layer
  • Train and evaluate
  • Add complexity if needed

⚖️ Balance Capacity

  • Too few neurons: Underfitting
  • Too many neurons: Overfitting
  • Monitor train vs test accuracy
  • Use regularization if needed (see the L2 sketch after this list)
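
One common form of regularization is an L2 penalty (weight decay), which adds the squared weight magnitudes to the loss so that large weights are discouraged. A minimal sketch, with an assumed penalty strength lam and a placeholder data-loss value:

import numpy as np

lam = 0.01                        # regularization strength (assumed value)
W = np.random.randn(10, 3)        # some weight matrix

data_loss = 0.42                  # placeholder for the usual loss on the data
reg_loss = lam * np.sum(W ** 2)   # penalty grows with the size of the weights
total_loss = data_loss + reg_loss

# The weight gradient then gains an extra term: dL/dW includes + 2 * lam * W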

🎨 Common Patterns

  • Pyramid: layers get smaller (128→64→32)
  • Hourglass: narrow middle (100→50→100)
  • Uniform: same size (64→64→64); see the sketch after this list
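
These patterns can be written as a plain list of layer sizes and turned into weight matrices in a few lines. A minimal sketch; the init_layers helper is illustrative, not a standard API:

import numpy as np

# Pyramid: [784, 128, 64, 32, 10]; hourglass: [784, 100, 50, 100, 10]; uniform: [784, 64, 64, 64, 10]
def init_layers(sizes):
    # One (W, b) pair per connection between consecutive layers
    return [(np.random.randn(n_in, n_out) * 0.01, np.zeros((1, n_out)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

layers = init_layers([784, 128, 64, 32, 10])   # pyramid-shaped deep network
print([W.shape for W, _ in layers])            # [(784, 128), (128, 64), (64, 32), (32, 10)]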

5. Real-World Architecture Examples

Application         | Architecture                            | Key Features
Iris Classification | 4→10→3                                  | Simple, fast, ~97% accuracy
MNIST Digits        | 784→128→64→10                           | Deep network, ~98% accuracy
ImageNet (AlexNet)  | Conv layers + 4096→4096→1000            | First major CNN, revolutionized computer vision
GPT-3               | 96 transformer layers, 175B parameters  | Language model, massive scale
Simple Chatbot      | Vocab→256→128→64→Vocab                  | Sequence-to-sequence, bidirectional