1. Basic Feed-Forward Architecture
📥 Input Layer
- Purpose: Receives raw data
- Size: Matches number of features
- Example: 4 features (sepal length, width, petal length, width)
- No activation function
- Just passes values forward
🔧 Hidden Layers
- Purpose: Extract patterns and features
- Size: Flexible, typically tens to hundreds of neurons per layer
- Depth: More layers = "deeper" network
- Use activation functions (ReLU, sigmoid)
- Learn increasingly abstract representations
📤 Output Layer
- Purpose: Produces final predictions
- Size: Matches number of classes/outputs
- Classification: Softmax activation
- Regression: Linear activation
- Each neuron = one class probability
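To see how these three layer types map onto a standard framework, here is a minimal sketch of the same 4→10→3 structure, assuming TensorFlow/Keras is available (Keras is not used elsewhere in these notes; the NumPy version is built by hand below).

# Minimal Keras sketch (assumption: TensorFlow/Keras installed; illustrative only)
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(4,)),               # Input layer: 4 features, no activation
    layers.Dense(10, activation="relu"),    # Hidden layer: 10 neurons, ReLU
    layers.Dense(3, activation="softmax"),  # Output layer: 3 class probabilities
])
model.summary()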
2. Training Process: Forward and Backward Pass
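Training alternates a forward pass (compute predictions and a loss) with a backward pass (backpropagation: compute gradients and update the weights). Below is a minimal NumPy sketch of a single training step for a 4→10→3 network, assuming one-hot labels, cross-entropy loss, and a tiny synthetic batch; these choices are illustrative, not taken from the original example.

# One forward + backward pass (minimal sketch; synthetic data for illustration)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # Tiny illustrative batch: 6 samples, 4 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]      # One-hot labels for 3 classes

W1 = rng.normal(size=(4, 10)) * 0.01; b1 = np.zeros((1, 10))
W2 = rng.normal(size=(10, 3)) * 0.01; b2 = np.zeros((1, 3))
lr = 0.1                                       # Learning rate (step size)

# Forward pass: compute predictions
z1 = X @ W1 + b1
a1 = np.maximum(0, z1)                                    # ReLU
z2 = a1 @ W2 + b2
e = np.exp(z2 - z2.max(axis=1, keepdims=True))
a2 = e / e.sum(axis=1, keepdims=True)                     # Softmax probabilities
loss = -np.mean(np.sum(Y * np.log(a2 + 1e-12), axis=1))   # Cross-entropy loss

# Backward pass: propagate the error back through the network
n = X.shape[0]
dz2 = (a2 - Y) / n                  # Softmax + cross-entropy gradient w.r.t. z2
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0, keepdims=True)
dz1 = (dz2 @ W2.T) * (z1 > 0)       # Chain rule through W2, then through ReLU
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0, keepdims=True)

# Gradient descent update: nudge each parameter against its gradient
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(f"cross-entropy loss for this batch: {loss:.4f}")

Repeating this step over many batches and epochs is what gradually reduces the loss.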
3. Common Architecture Patterns
📊 Shallow Network
- 1-2 hidden layers
- Good for simple problems
- Fast training
- Example: Iris classification
- Architecture: 4→10→3
🏗️ Deep Network
- 3+ hidden layers
- Learns complex patterns
- Requires more data
- Example: Image recognition
- Architecture: 784→128→64→32→10
🎯 Wide Network
- Few layers, many neurons
- Captures many features at once
- Memory intensive
- Example: Tabular data
- Architecture: 50→500→10
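Each pattern above is just a different list of layer sizes, so a single helper can initialize any of them. A minimal sketch, assuming plain NumPy; init_network is a hypothetical helper name used only for illustration:

import numpy as np

def init_network(layer_sizes, scale=0.01):
    """Create (weights, biases) for each consecutive pair of layer sizes."""
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = np.random.randn(fan_in, fan_out) * scale   # Small random weights
        b = np.zeros((1, fan_out))                     # Zero biases
        params.append((W, b))
    return params

shallow = init_network([4, 10, 3])               # Iris-style shallow network
deep    = init_network([784, 128, 64, 32, 10])   # MNIST-style deep network
wide    = init_network([50, 500, 10])            # Wide network for tabular data
print([W.shape for W, _ in deep])                # [(784, 128), (128, 64), (64, 32), (32, 10)]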
💻 Building a Network in Python
import numpy as np

# Define network architecture
input_size = 4    # Features in dataset (e.g., the four Iris measurements)
hidden_size = 10  # Hidden layer neurons
output_size = 3   # Number of classes

def relu(z):
    return np.maximum(0, z)  # Element-wise ReLU: max(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # Shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)       # Rows sum to 1 (class probabilities)

# Initialize weights randomly (small values keep early activations well-scaled)
W1 = np.random.randn(input_size, hidden_size) * 0.01
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * 0.01
b2 = np.zeros((1, output_size))

# Forward propagation (X is the (n_samples, input_size) feature matrix; random placeholder here)
X = np.random.randn(5, input_size)
z1 = np.dot(X, W1) + b1   # Linear combination
a1 = relu(z1)             # Activation
z2 = np.dot(a1, W2) + b2  # Linear combination
a2 = softmax(z2)          # Output probabilities
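As a quick sanity check on the forward pass above, each row of a2 should be a valid probability distribution over the three classes:

print(a2.shape)            # (5, 3): one probability vector per sample
print(a2.sum(axis=1))      # Each row sums to (approximately) 1
print(a2.argmax(axis=1))   # Predicted class index for each sample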
🔑 Key Architectural Decisions
- Number of Layers: More layers = more abstraction, but harder to train
- Neurons per Layer: Balance between capacity and overfitting
- Activation Functions: ReLU for hidden layers, Softmax for output
- Learning Rate: Controls the step size during training (typically 0.001 to 0.1)
- Batch Size: Number of samples processed before each weight update
- Epochs: Number of complete passes through the entire dataset (see the training-loop sketch below)
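The skeleton below shows where the last three hyperparameters fit in a mini-batch training loop. It is an illustrative sketch only: train_step is a placeholder standing in for the forward/backward/update step sketched in Section 2, and the data is synthetic.

# Hypothetical training-loop skeleton (illustrative; train_step is a placeholder)
import numpy as np

learning_rate = 0.01      # Step size for each weight update
batch_size = 32           # Samples per gradient update
epochs = 50               # Full passes over the training set

def train_step(params, X_batch, Y_batch, lr):
    # Placeholder: run the forward pass, backpropagate, apply W -= lr * dW
    # (see the single-step sketch under "Training Process" above).
    return params

X_train = np.random.randn(150, 4)                   # Illustrative stand-in for real data
Y_train = np.eye(3)[np.random.randint(0, 3, 150)]   # One-hot labels
params = None                                       # Stand-in for (W1, b1, W2, b2)

for epoch in range(epochs):
    order = np.random.permutation(len(X_train))     # Shuffle each epoch
    for start in range(0, len(X_train), batch_size):
        idx = order[start:start + batch_size]
        params = train_step(params, X_train[idx], Y_train[idx], learning_rate)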
4. Architecture Design Guidelines
✅ Start Simple
- Begin with 1-2 hidden layers
- Use 10-50 neurons per layer
- Train and evaluate
- Add complexity if needed
⚖️ Balance Capacity
- Too few neurons: Underfitting
- Too many neurons: Overfitting
- Monitor train vs test accuracy
- Use regularization if needed
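One common form of the regularization mentioned above is an L2 (weight decay) penalty. A minimal sketch of how it changes the loss and the weight update; lambda_reg and the random arrays are illustrative stand-ins, not values from the original notes:

import numpy as np

lambda_reg = 1e-3                    # Regularization strength (illustrative value)
lr = 0.01                            # Learning rate
W = np.random.randn(4, 10) * 0.01    # Stand-in for any weight matrix (biases usually excluded)
dW = np.random.randn(4, 10)          # Stand-in for the gradient from backpropagation

l2_penalty = 0.5 * lambda_reg * np.sum(W ** 2)   # Added to the data loss
dW_total = dW + lambda_reg * W                   # Extra gradient term from the penalty
W -= lr * dW_total                               # Update shrinks weights toward zero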
🎨 Common Patterns
- Pyramid: layers get smaller (128→64→32)
- Hourglass: narrow middle (100→50→100)
- Uniform: same size (64→64→64)
5. Real-World Architecture Examples
| Application | Architecture | Key Features |
|---|---|---|
| Iris Classification | 4→10→3 | Simple, fast, ~97% accuracy |
| MNIST Digits | 784→128→64→10 | Deep network, ~98% accuracy |
| ImageNet (AlexNet) | Conv layers + 4096→4096→1000 | Breakthrough 2012 CNN that revolutionized computer vision |
| GPT-3 | 96 transformer layers, 175B parameters | Language model, massive scale |
| Simple Chatbot | Vocab→256→128→64→Vocab | Sequence-to-sequence, bidirectional |