Leaderboard

We evaluate Time Series Foundation Models (TSFMs) from two perspectives: 1) embedding quality and 2) light-curve classification. We compare commonly used TSFMs with astronomy-specific methods such as Astromer-1 and Astromer-2, as well as hand-crafted features. The main goal of this project is to explore the potential of TSFMs for astronomical data analysis. Click on Clustering (K-Means), Clustering (Ward), Classification (MLP), Classification (Logistic), Classification (RF), or Classification (k-NN) to expand detailed results.

Clustering Results

[Interactive table — columns: Name, Size, Date; NMI, ARI, and F1 under each of K-Means and Ward.]

Clustering results using unsupervised heads (K-Means, Ward) with metrics NMI, ARI, and F1. The best result in each column is bold; the second best is underlined.
Click on each model to see its details and a visualization of its embeddings!
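
For concreteness, here is a minimal sketch of how such a clustering evaluation can be run with scikit-learn. This is our illustration, not the leaderboard's actual code: the inputs `embeddings` (an N×D array of frozen model embeddings) and integer-encoded `labels` are assumptions, and since the F1 definition for clusterings is not spelled out here, the sketch uses the common choice of Hungarian matching between clusters and classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import (adjusted_rand_score, confusion_matrix,
                             f1_score, normalized_mutual_info_score)


def clustering_f1(labels, pred):
    """F1 for a clustering: map clusters to classes via Hungarian
    matching on the confusion matrix, then score as classification.
    Assumes integer-encoded labels 0..K-1 and K clusters."""
    cm = confusion_matrix(labels, pred)
    rows, cols = linear_sum_assignment(-cm)  # maximize matched counts
    cluster_to_class = {c: r for r, c in zip(rows, cols)}
    mapped = np.array([cluster_to_class[p] for p in pred])
    return f1_score(labels, mapped, average="macro")


def clustering_scores(embeddings, labels, seed=0):
    """Cluster frozen light-curve embeddings with the two unsupervised
    heads and score the assignments against the true classes."""
    k = len(np.unique(labels))
    heads = {
        "K-Means": KMeans(n_clusters=k, n_init=10, random_state=seed),
        "Ward": AgglomerativeClustering(n_clusters=k, linkage="ward"),
    }
    scores = {}
    for name, head in heads.items():
        pred = head.fit_predict(embeddings)
        scores[name] = {
            "NMI": normalized_mutual_info_score(labels, pred),
            "ARI": adjusted_rand_score(labels, pred),
            "F1": clustering_f1(labels, pred),
        }
    return scores
```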

Classification Results

[Interactive table — columns: Name, Size, Date; Acc, Rec, Prec, and F1 under each of MLP, k-NN, Logistic Regression, and Random Forest.]

Classification results using supervised heads (MLP, k-NN, Logistic Regression, Random Forest) with metrics Accuracy, Recall, Precision, and F1. The best result in each column is bold; the second best is underlined.
Click on each model to see its details and a visualization of its embeddings!
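
A corresponding sketch for the supervised heads, again purely illustrative: the 80/20 stratified split, the head hyperparameters, and macro averaging of the per-class metrics are our assumptions, not the leaderboard's documented settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


def classification_scores(embeddings, labels, seed=0):
    """Train the four supervised heads on frozen embeddings and report
    Accuracy, Recall, Precision, and F1 on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, stratify=labels, random_state=seed)
    heads = {
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
        "k-NN": KNeighborsClassifier(),
        "Logistic": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(random_state=seed),
    }
    scores = {}
    for name, head in heads.items():
        pred = head.fit(X_tr, y_tr).predict(X_te)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_te, pred, average="macro", zero_division=0)
        scores[name] = {"Acc": accuracy_score(y_te, pred),
                        "Rec": rec, "Prec": prec, "F1": f1}
    return scores
```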


Out-of-Distribution Detection

Astronomers are especially interested in spotting stars that behave differently from the ones already known and labeled. To test whether a model's embeddings can help find such unusual stars, we focus on rare types of variable stars that were not part of the training set. We then fit an isolation forest, a type of outlier detector, on the common types and score how different each star looks in comparison. A star with a high out-of-distribution (OOD) score is more likely to be one of these unusual cases. We measure success with purity: the fraction of stars flagged as unusual that are truly of a rare type.
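
The sketch below illustrates this recipe with scikit-learn's IsolationForest. It is our rendering of the description above, not necessarily the exact pipeline; the variable names (`common_emb`, `test_emb`, `is_rare`) and the scoring convention are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def ood_purity(common_emb, test_emb, is_rare, percentiles=(1, 5, 10), seed=0):
    """Fit an isolation forest on embeddings of the common classes,
    then report purity (fraction of truly rare sources) among the
    top-p% most out-of-distribution test sources."""
    forest = IsolationForest(random_state=seed).fit(common_emb)
    # score_samples is higher for inliers; negate so larger = more OOD.
    ood_score = -forest.score_samples(test_emb)
    order = np.argsort(ood_score)[::-1]  # most anomalous first
    results = {}
    for p in percentiles:
        k = max(1, len(order) * p // 100)
        results[f"top_{p}_percentile"] = float(is_rare[order[:k]].mean())
    return results
```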

Results for out-of-distribution source detection. The best results are highlighted in bold, and the second-best results are underlined. Chronos-Bolt-tiny performs very well on this task, ranking first across all metrics, with hand-crafted features a distant second.

                        Purity
Method                  Top 1 percentile   Top 5 percentile   Top 10 percentile
Astromer-1              0.014 (0.020)      0.087 (0.018)      0.121 (0.001)
Astromer-2              0.139 (0.019)      0.124 (0.001)      0.119 (0.001)
Moirai-small            0.172 (0.023)      0.143 (0.003)      0.152 (0.002)
Chronos-tiny            0.165 (0.005)      0.129 (0.017)      0.158 (0.021)
Chronos-Bolt-tiny       0.569 (0.037)      0.536 (0.053)      0.528 (0.050)
Random Embeddings       0.116              0.116              0.116
Hand-crafted features   0.213 (0.009)      0.280 (0.006)      0.260 (0.002)

Key Findings

The results reveal several important insights:

  • Chronos-Bolt Excels: Chronos-Bolt-tiny achieves the best performance across all metrics by a wide margin (e.g., 0.569 purity at the top 1 percentile versus 0.213 for the next-best method).
  • Hand-crafted Features Remain Competitive: Traditional astronomical features still perform well, ranking second.
  • Foundation Models Show Promise: The strongest TSFM embeddings identify unusual stellar behavior better than hand-crafted features, although not every TSFM clears that baseline.
  • Random Baselines Perform Poorly: Random embeddings set a floor that most learned representations clear, underscoring the significance of learned representations.