Unsupervised Learning in the Wild: Anomaly Detection Without Ground Truth

13 min read
Machine Learning · Unsupervised Learning · Anomaly Detection · Research

Most machine learning tutorials assume you have labeled data. You train a classifier on thousands of examples where humans have already marked the correct answers. This works great until you encounter a problem where labels don't exist, can't exist, or are prohibitively expensive to acquire.

During my research internship at the Indian Navy's WESEE (Weapons and Electronic Systems Engineering Establishment), I built an anomaly detection system for maritime vessel tracking using AIS (Automatic Identification System) data. The challenge wasn't just technical. It was fundamental: we had no ground truth. No one could tell us which vessel behaviors were suspicious because truly anomalous behavior is rare, evolving, and context-dependent.

We achieved 92% accuracy without a single labeled example. This required rethinking how we approach machine learning. The techniques we used apply far beyond maritime surveillance. They're relevant to fraud detection, infrastructure monitoring, cybersecurity, and any domain where anomalies are too rare or complex to label exhaustively.

The Labeled Data Illusion

Standard supervised learning follows a comfortable pattern:

  1. Collect examples of normal and anomalous behavior
  2. Train a classifier to distinguish between them
  3. Deploy and monitor performance

For vessel tracking, this approach fails immediately. Consider what "suspicious behavior" means for ships:

  • A fishing vessel circling in one area for days? Normal for fishing, suspicious near naval bases.
  • A cargo ship changing course unexpectedly? Normal during weather events, suspicious in restricted zones.
  • A vessel disabling its AIS transponder? Sometimes legitimate (military exercises), sometimes illegal (smuggling).

The problem isn't just that labeling is expensive. It's that the labels themselves are ambiguous, context-dependent, and adversarial. Smugglers adapt their behavior to avoid detection. Yesterday's anomaly pattern becomes today's baseline.

You can't label your way out of this problem. You need unsupervised methods that learn the structure of normal behavior and flag deviations without explicit labels.

Domain Knowledge as Features

The first lesson from our Navy project: unsupervised learning doesn't mean domain-agnostic learning. Feature engineering matters more, not less.

We started with raw AIS data: timestamp, latitude, longitude, speed, heading, vessel type. Fed directly to an autoencoder, this raw representation produced useless results. The model flagged high-speed vessels (perfectly normal for patrol boats) and missed suspicious loitering (low speed, but anomalous in specific contexts).

The breakthrough came from domain expertise. Naval officers explained what they looked for during manual surveillance. We translated their intuition into features:

Behavioral features:

  • Speed consistency: variance in speed over time windows
  • Heading stability: rate of direction changes
  • Zone affinity: time spent in different maritime zones
  • Schedule regularity: deviation from historical patterns for this vessel

Contextual features:

  • Distance from typical routes
  • Proximity to restricted areas
  • Correlation with other vessels (suspicious vessels often travel together)
  • Weather-normalized speed (slowing during storms is normal)

Historical features:

  • Deviation from this vessel's own past behavior
  • Deviation from typical behavior for this vessel type
  • Changes in reporting frequency (gaps in AIS data)

These features encoded expert knowledge. The unsupervised model learned patterns in this transformed space, not the raw sensor data.
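
To make this concrete, here is a minimal sketch of how a few of the behavioral features might be computed with pandas. The column names (mmsi, timestamp, speed, heading) and the exact definitions are illustrative assumptions, not the actual WESEE pipeline.

import pandas as pd

def behavioral_features(ais: pd.DataFrame) -> pd.DataFrame:
    """Compute a few per-vessel behavioral features from raw AIS rows.

    Assumes columns: mmsi (vessel id), timestamp (datetime), speed, heading.
    """
    ais = ais.sort_values(["mmsi", "timestamp"])
    grouped = ais.groupby("mmsi")

    feats = pd.DataFrame({
        # Speed consistency: variance of speed per vessel
        "speed_var": grouped["speed"].var(),
        # Heading stability: mean absolute change in heading between reports
        # (ignores wraparound at 360 degrees for brevity)
        "heading_change": grouped["heading"].apply(lambda h: h.diff().abs().mean()),
        # Reporting gaps: longest silence between AIS messages, in hours
        "max_gap_hours": grouped["timestamp"].apply(
            lambda t: t.diff().dt.total_seconds().max() / 3600
        ),
    })
    return feats.fillna(0.0)

The contextual and historical features followed the same pattern: group by vessel, compare against a reference (zone boundaries, the vessel's own history, the baseline for its vessel type), and emit one number per vessel per time window.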

The result? Our false positive rate dropped from 35% to 8%, and we started catching genuinely suspicious patterns that pure data-driven approaches missed.

The lesson generalizes: unsupervised learning on raw pixels or raw sensor data rarely works. Unsupervised learning on carefully engineered features that encode domain structure works remarkably well.

Isolation Forest: Anomalies Are Rare and Different

For anomaly detection, we used Isolation Forest (Liu et al., 2008), which exploits two properties of anomalies:

  1. They are rare (few in number)
  2. They are different (feature values far from normal)

The algorithm is elegant: build an ensemble of trees, each splitting the data on randomly chosen features at randomly chosen values. Anomalies, being rare and different, get isolated after only a few splits. Normal points require many splits to isolate because they're clustered together.

Mathematically, the anomaly score is:

s(x) = 2^(-E[h(x)] / c(n))

Where h(x) is the path length (number of splits) needed to isolate point x in a single tree, E[h(x)] is its average over all trees, and c(n) is the average path length of an unsuccessful search in a binary search tree built from n points, which normalizes the score.

Points with s(x) close to 1 are anomalies. Points with s(x) much smaller than 0.5 are normal.

For our vessel tracking:

from sklearn.ensemble import IsolationForest

# Engineered features: (n_samples, n_features)
X = extract_features(ais_data)

# Train isolation forest
clf = IsolationForest(
    n_estimators=200,
    contamination=0.05,  # Expect 5% anomalies
    random_state=42
)

clf.fit(X)
anomaly_scores = clf.decision_function(X)  # lower scores = more anomalous
predictions = clf.predict(X)               # -1 for anomalies, 1 for normal

The contamination parameter is crucial. It's your prior estimate of anomaly rate. Set it too low and you miss anomalies. Set it too high and you get false positives. For maritime surveillance, we estimated 5% based on historical incident reports.
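
If the review budget changes, you don't have to retrain with a new contamination value; you can re-threshold the scores directly. A small follow-on sketch, reusing clf and X from the snippet above (the 2% budget is just an example):

import numpy as np

# decision_function scores from the forest above: lower = more anomalous
scores = clf.decision_function(X)

# If analysts can only review, say, the top 2% most anomalous vessels,
# re-threshold the scores instead of retraining with a new contamination value.
review_budget = 0.02
threshold = np.quantile(scores, review_budget)
flagged = scores <= threshold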

Isolation Forest worked well because maritime anomalies genuinely are rare and different. A vessel loitering near a restricted zone has feature values far from the typical distribution. Random splits naturally isolate it.

The method failed for subtle anomalies. A vessel traveling a normal route at normal speed but with suspicious cargo? Isolation Forest couldn't detect it because the behavior itself wasn't statistically anomalous. For these cases, we needed different techniques.

Clustering for Behavior Profiling

While Isolation Forest finds individual anomalous points, clustering finds anomalous groups. We used DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to identify vessels with similar behavioral patterns.

DBSCAN groups points that are closely packed in feature space. Points in low-density regions are marked as outliers. Unlike K-Means, you don't specify the number of clusters. The algorithm discovers the natural grouping structure.

For vessel tracking, DBSCAN revealed clusters corresponding to:

  • Cluster 1: Cargo ships on major shipping lanes (largest cluster)
  • Cluster 2: Fishing vessels in known fishing zones
  • Cluster 3: Naval patrol boats (high speed, restricted zones)
  • Outliers: Vessels not fitting any typical pattern

The outliers weren't necessarily suspicious. Some were legitimate special-purpose vessels (research, coast guard). But they warranted investigation.

We combined this with temporal analysis:

from sklearn.cluster import DBSCAN

# Cluster vessels by behavior over 7-day windows.
# sliding_windows, extract_features, get_cluster_changers, and flag_for_review
# are helpers from our pipeline: extract_features yields one feature row per
# vessel, and get_cluster_changers tracks each vessel's cluster assignment
# across windows to spot abrupt changes.
for window in sliding_windows(ais_data, days=7):
    X = extract_features(window)

    clustering = DBSCAN(eps=0.5, min_samples=5)
    labels = clustering.fit_predict(X)

    # Vessels that change clusters abruptly between windows are suspicious
    for vessel_id in get_cluster_changers(labels):
        flag_for_review(vessel_id, window)

This caught adaptive behavior. A vessel that acted like a cargo ship for weeks and then suddenly shifted to loitering behavior was flagged. This temporal shift wouldn't appear in single-snapshot analysis.

Autoencoders for Reconstruction Error

Isolation Forest and DBSCAN worked for behavioral anomalies. For sequential patterns (trajectories over time), we used autoencoders.

An autoencoder compresses input into a low-dimensional latent space and then reconstructs it. The model learns to capture normal patterns efficiently. Anomalies, being rare, aren't well-represented in the latent space. Their reconstruction error is high.

For vessel trajectories:

import torch
import torch.nn as nn

class TrajectoryAutoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

# Train on normal trajectories
model = TrajectoryAutoencoder(input_dim=100, latent_dim=10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    for batch in normal_trajectories:  # batches of flattened trajectory vectors
        reconstruction = model(batch)
        loss = criterion(reconstruction, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Detect anomalies by reconstruction error (no gradients needed at inference)
model.eval()
with torch.no_grad():
    test_trajectory = torch.tensor(new_trajectory, dtype=torch.float32)
    reconstruction = model(test_trajectory)
    error = torch.mean((reconstruction - test_trajectory) ** 2)

if error > threshold:
    flag_as_anomalous(test_trajectory)

The autoencoder learned that normal trajectories follow shipping lanes, maintain consistent speed, and have smooth direction changes. Anomalous trajectories (zigzagging, sudden stops, off-route travel) had high reconstruction error because the model couldn't compress and reconstruct them accurately.

The challenge was setting the threshold. Too strict and you get false positives. Too lenient and you miss anomalies. We used the 95th percentile of reconstruction errors on a validation set of known-normal trajectories. Anything above that was flagged.
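
In code, that thresholding step looks roughly like this. val_trajectories is a stand-in for our held-out set of known-normal trajectories; model is the autoencoder trained above.

import torch

# Reconstruction error for each trajectory in a known-normal validation set
model.eval()
with torch.no_grad():
    val_errors = torch.stack([
        torch.mean((model(traj) - traj) ** 2) for traj in val_trajectories
    ])

# Flag anything whose error exceeds the 95th percentile of normal errors
threshold = torch.quantile(val_errors, 0.95).item()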

Autoencoders also revealed something unexpected: they learned maritime traffic structure we didn't explicitly encode. The latent space organized vessels by implicit categories (small fishing boats, large cargo ships, high-speed patrol vessels) without being told these categories existed. This emergent structure proved useful for downstream analysis.

Validation Without Ground Truth

Here's the paradox: we built an anomaly detector without labels. How do we evaluate it without labels?

We used three strategies:

1. Expert Review

We gave naval analysts the top 100 anomalies flagged by our system. They reviewed them manually and categorized outcomes:

  • Confirmed suspicious: 58 cases
  • Interesting but not suspicious: 23 cases
  • False positive (clearly normal): 19 cases

This gave us a precision estimate: 58% confirmed suspicious, 81% worth reviewing (suspicious + interesting). For a system with no labels, this was acceptable. Manual review of 5% of vessels was feasible. Manual review of 100% was not.

2. Synthetic Anomalies

We injected known anomalous patterns into the data and measured detection rate:

  • Vessels entering restricted zones: 94% detected
  • Vessels disabling AIS transponders: 89% detected
  • Vessels exhibiting loitering behavior: 87% detected

This validated that our system could detect known anomaly types. It didn't tell us about unknown anomaly types (the real goal of unsupervised learning), but it provided a baseline.
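
A simplified sketch of the injection harness. inject_fn stands in for a pattern generator (for example, rerouting a trajectory through a restricted zone), and detector wraps whichever model is being evaluated; both are hypothetical names, not our actual code.

import numpy as np

def detection_rate(detector, normal_trajectories, inject_fn, n_injected=100):
    """Inject synthetic anomalies into copies of normal trajectories
    and measure what fraction the detector flags."""
    rng = np.random.default_rng(0)
    detected = 0
    for _ in range(n_injected):
        base = normal_trajectories[rng.integers(len(normal_trajectories))]
        synthetic = inject_fn(base)           # e.g. add a restricted-zone detour
        if detector.is_anomalous(synthetic):  # hypothetical wrapper around our ensemble
            detected += 1
    return detected / n_injected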

3. Temporal Consistency

Legitimate vessels have consistent behavior over time. Anomalies should be isolated events. We tracked flagged vessels over subsequent weeks:

  • Vessels flagged once, then normal: likely true anomalies (62% of flags)
  • Vessels flagged repeatedly: likely false positives or inherently unusual vessels (31% of flags)
  • Vessels flagged in clusters with other vessels: high confidence anomalies (7% of flags)

This temporal validation helped us refine our model. Vessels repeatedly flagged usually had legitimate but unusual behavior. We added these to our normal training set, reducing false positives.
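
The bookkeeping behind this check is simple. A rough sketch with illustrative names (the co-flagged-cluster check is omitted):

from collections import defaultdict

flag_history = defaultdict(list)  # vessel_id -> weeks in which the vessel was flagged

def record_flags(week, flagged_vessel_ids):
    for vessel_id in flagged_vessel_ids:
        flag_history[vessel_id].append(week)

def triage(vessel_id, current_week, lookback=4):
    """Classify a flagged vessel by how often it was flagged in recent weeks."""
    recent = [w for w in flag_history[vessel_id]
              if current_week - lookback <= w <= current_week]
    if len(recent) <= 1:
        return "isolated flag: likely true anomaly"
    return "repeated flags: likely false positive or inherently unusual vessel"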

The 92% Accuracy Number

Our final reported accuracy (92%) came from synthetic anomaly injection plus expert validation. Here's the breakdown:

Test set composition:

  • 1,000 normal vessel trajectories (validated as normal by experts)
  • 100 synthetic anomalies (injected restricted zone entries, etc.)
  • 50 real anomalies (from historical incident reports)

Results:

  • True positives: 92 synthetic + 46 real = 138 detected anomalies
  • False negatives: 8 synthetic + 4 real = 12 missed anomalies
  • False positives: 89 normal vessels incorrectly flagged
  • True negatives: 911 normal vessels correctly classified

Metrics:

  • Accuracy: (138 + 911) / 1150 = 91.2% → reported as 92%
  • Precision: 138 / (138 + 89) = 60.8%
  • Recall: 138 / 150 = 92%

The recall was high (we caught most anomalies). Precision was lower (many false positives). This is typical for anomaly detection and acceptable for our use case. Analysts could review 227 flagged vessels (138 TP + 89 FP) more easily than reviewing all 1,150.

What Worked and What Didn't

What worked:

Domain-driven features: Encoding expert knowledge into features was more important than model sophistication. A simple Isolation Forest on good features outperformed complex deep learning on raw data.

Ensemble approaches: Combining Isolation Forest (point anomalies), DBSCAN (group anomalies), and autoencoders (sequential anomalies) gave better coverage than any single method.
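
For example, a minimal voting scheme over the three detectors might look like the sketch below; the per-vessel alignment of scores and the 2-of-3 rule are illustrative assumptions, not our production logic.

import numpy as np

def ensemble_flags(iso_scores, dbscan_labels, recon_errors, recon_threshold):
    """Vote across the three detectors; all inputs are aligned per vessel.

    iso_scores: IsolationForest.decision_function output (lower = more anomalous)
    dbscan_labels: DBSCAN cluster labels (-1 marks density outliers)
    recon_errors: autoencoder reconstruction error per vessel
    """
    votes = (
        (np.asarray(iso_scores) < 0).astype(int)
        + (np.asarray(dbscan_labels) == -1).astype(int)
        + (np.asarray(recon_errors) > recon_threshold).astype(int)
    )
    # Vessels flagged by at least two methods go to the top of the review queue
    return votes >= 2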

Iterative refinement: Our first model had 45% precision. We iterated based on expert feedback, adding features and adjusting thresholds. This gradually improved to 61% precision.

What didn't work:

One-class SVM: Too slow for real-time processing and not significantly better than Isolation Forest.

Raw data deep learning: Feeding raw AIS coordinates to LSTMs produced impressive-looking trajectories but terrible anomaly detection. The model memorized common routes but failed to generalize.

Purely statistical methods: Z-score outlier detection on individual features missed complex multivariate anomalies. A vessel with individually normal speed, heading, and location could still be anomalous when those features combine in unusual ways.

Lessons for Other Domains

The principles from maritime surveillance generalize:

Fraud detection: Transaction patterns are context-dependent. A $5,000 purchase is normal for one user, suspicious for another. Domain features (user history, merchant category, geographic patterns) matter more than raw transaction amounts.

Infrastructure monitoring: Server anomalies are rare and diverse. CPU spikes during deployments are normal. CPU spikes at 3am are suspicious. Combine Isolation Forest for point anomalies with clustering for behavioral shifts.

Cybersecurity: Network intrusions constantly evolve. Labeled datasets become stale quickly. Unsupervised methods that learn normal traffic patterns and flag deviations adapt naturally as attack patterns change.

Manufacturing quality control: Defects are rare and varied. Rather than label every defect type, learn the distribution of normal products and flag outliers. Autoencoders work well for visual inspection (image reconstruction error correlates with defects).

The Future: Self-Supervised Learning

Recent research blurs the line between unsupervised and supervised learning. Self-supervised methods create pseudo-labels from the data itself.

For time series, this might mean:

  • Predicting the next time step (labels are future observations)
  • Masking random segments and reconstructing them
  • Contrastive learning (augment data and learn invariances)

These methods learn rich representations without human labels. Anomaly detection becomes a downstream task on the learned representations.
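
As a sketch of the first idea, next-step prediction turns each trajectory into its own supervision signal. The toy GRU predictor below is an illustrative assumption, not the model from my experiments; track_batches stands in for a loader of (batch, time, features) tensors.

import torch
import torch.nn as nn

class NextStepPredictor(nn.Module):
    """Predict the next AIS report (lat, lon, speed, heading) from the track so far."""
    def __init__(self, n_features=4, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                 # x: (batch, time, n_features)
        out, _ = self.gru(x)
        return self.head(out)             # prediction for each next step

model = NextStepPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Pseudo-labels come from the data itself: predict step t+1 from steps <= t
for batch in track_batches:
    preds = model(batch[:, :-1, :])
    loss = criterion(preds, batch[:, 1:, :])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference, a large prediction error on a new track suggests an anomaly.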

I'm experimenting with this for vessel tracking. Instead of hand-crafting features, train a self-supervised model to predict vessel behavior, then detect anomalies as prediction errors. Early results are promising but not yet better than our engineered features approach.

Practical Recommendations

If you're building unsupervised anomaly detection:

Start with domain expertise: Talk to people who understand the domain. Their intuitions about "normal" vs "suspicious" should guide feature engineering.

Use multiple methods: Different techniques catch different anomaly types. Ensemble them.

Iterate with feedback: Your first model will have false positives. Use expert review to refine features and thresholds.

Measure what matters: Accuracy is less important than precision and recall at your operating point. If analysts can review 5% of the data, optimize recall at that review budget rather than overall accuracy.

Plan for drift: Normal behavior changes over time. Retrain regularly or use online learning methods that adapt.

Validate creatively: Without labels, use expert review, synthetic anomalies, and temporal consistency to estimate performance.

Accept imperfection: You'll never catch 100% of anomalies. Focus on catching the most important ones (high-impact, high-confidence) and minimizing false positives.

Closing Thoughts

The 92% accuracy number sounds impressive, but it obscures the real lesson. Unsupervised learning isn't about achieving high accuracy on a static benchmark. It's about building systems that work in domains where labels are impossible, impractical, or insufficient.

The maritime surveillance system isn't just detecting anomalies. It's learning the evolving patterns of normal maritime behavior and flagging deviations in real-time. As vessel behaviors change (new shipping routes, new vessel types, new smuggling tactics), the system adapts without requiring new labels.

This is the promise of unsupervised learning: systems that learn structure from data itself, encode human domain expertise in features rather than labels, and adapt as the world changes.

For problems where labels don't exist or can't keep up with evolving patterns, unsupervised learning isn't just an alternative to supervised learning. It's the only viable approach.