Radio transient searches produce candidates, not discoveries. A dedispersion pipeline can find a pulse-like event and a matched filter can measure its S/N, but terrestrial interference can trigger both. The next stage has to decide which candidates deserve attention.
I started with a small supervised RFI classifier built from interpretable features. It was deliberately simpler than the beta-VAE in Killing RFI with a beta-VAE, because I wanted to inspect the complete classification path before moving to learned representations.
I generated three kinds of candidate waterfalls:
- dispersed astrophysical-like pulses
- broadband impulsive RFI
- persistent narrowband RFI
I extracted features, trained a Random Forest, inspected its errors and tested whether it had learned useful structure or a shortcut from my simulator.
The distinction I tried to encode
A dispersed pulse follows approximately
\[\Delta t \propto \mathrm{DM}\,\nu^{-2}.\]Broadband impulsive RFI appears at nearly the same time in every channel. Persistent narrowband RFI occupies a small frequency range for much of the observation.
Those differences suggest features:
| Feature | What it measures |
|---|---|
| best-DM gain | Does nonzero dedispersion concentrate the signal? |
| temporal peak | Is there a pulse after dedispersion? |
| spectral occupancy | How much of the band participates? |
| temporal occupancy | How long does bright power persist? |
| modulation index | Is power smooth or concentrated across frequency? |
| zero-DM peak | Is the event strongest with no dispersion? |
These features were not perfect, but I could inspect why the model made a decision.
The synthetic candidates I generated
import numpy as np
rng = np.random.default_rng(1904)
N_FREQ = 96
N_TIME = 512
DT_MS = 1.0
FREQ_MHZ = np.linspace(1200.0, 1500.0, N_FREQ)
K_DM_MS = 4.148808e6
Cold-plasma delay:
def delays_in_samples(dm):
reference = FREQ_MHZ.max()
delay_ms = (
K_DM_MS
* dm
* (FREQ_MHZ ** -2 - reference ** -2)
)
return np.rint(delay_ms / DT_MS).astype(int)
Noise background:
def background():
channel_gain = rng.normal(1.0, 0.05, size=(N_FREQ, 1))
noise = rng.normal(0.0, 1.0, size=(N_FREQ, N_TIME))
return channel_gain * noise
Astrophysical-like dispersed pulse:
def dispersed_candidate():
data = background()
dm = rng.uniform(80.0, 500.0)
width = rng.uniform(1.5, 6.0)
amplitude = rng.uniform(0.8, 2.0)
arrival = rng.integers(80, 180)
time = np.arange(N_TIME)
for channel, delay in enumerate(delays_in_samples(dm)):
center = arrival + delay
data[channel] += amplitude * np.exp(
-0.5 * ((time - center) / width) ** 2
)
return data
Broadband impulsive RFI:
def broadband_rfi():
data = background()
center = rng.integers(80, N_TIME - 80)
width = rng.uniform(1.0, 8.0)
amplitude = rng.uniform(0.8, 2.0)
time = np.arange(N_TIME)
pulse = amplitude * np.exp(-0.5 * ((time - center) / width) ** 2)
spectral_shape = rng.uniform(0.7, 1.3, size=(N_FREQ, 1))
data += spectral_shape * pulse
return data
Persistent narrowband RFI:
def narrowband_rfi():
data = background()
first = rng.integers(0, N_FREQ - 8)
width = rng.integers(1, 8)
amplitude = rng.uniform(1.0, 3.0)
data[first:first + width] += amplitude
if rng.random() < 0.5:
modulation = 1.0 + 0.4 * np.sin(
np.linspace(0, rng.uniform(5, 20), N_TIME)
)
data[first:first + width] *= modulation
return data
Synthetic data gives us exact labels and unlimited samples. It also creates a dangerous illusion: the simulator defines the world, so a model can perform perfectly while learning artifacts of the simulator.
Dedispersion helper
def dedisperse(data, dm):
shifts = delays_in_samples(dm)
aligned = np.zeros_like(data)
for channel, shift in enumerate(shifts):
if shift == 0:
aligned[channel] = data[channel]
elif shift < N_TIME:
aligned[channel, :-shift] = data[channel, shift:]
return aligned
Robust standardization:
def robust_scale(data):
median = np.median(data)
mad = np.median(np.abs(data - median))
sigma = max(1.4826 * mad, 1e-6)
return (data - median) / sigma
Extract interpretable features
TRIAL_DMS = np.arange(0.0, 551.0, 25.0)
def series_peak_snr(series):
median = np.median(series)
mad = np.median(np.abs(series - median))
sigma = max(1.4826 * mad, 1e-6)
return float(np.max((series - median) / sigma))
def extract_features(data):
z = robust_scale(data)
dm_scores = []
for dm in TRIAL_DMS:
series = dedisperse(z, dm).mean(axis=0)
dm_scores.append(series_peak_snr(series))
dm_scores = np.asarray(dm_scores)
best_index = int(np.argmax(dm_scores))
best_dm = TRIAL_DMS[best_index]
best_dm_snr = dm_scores[best_index]
zero_dm_snr = dm_scores[0]
bright = z > 3.0
channel_occupancy = np.mean(np.any(bright, axis=1))
time_occupancy = np.mean(np.any(bright, axis=0))
spectrum = np.maximum(z, 0.0).mean(axis=1)
modulation_index = float(
np.std(spectrum) / (np.mean(spectrum) + 1e-6)
)
total_series = z.mean(axis=0)
temporal_peak = series_peak_snr(total_series)
return np.array([
best_dm,
best_dm_snr,
best_dm_snr - zero_dm_snr,
zero_dm_snr,
channel_occupancy,
time_occupancy,
modulation_index,
temporal_peak,
], dtype=np.float32)
The best-DM gain was the feature closest to the propagation model. My dispersed events improved when aligned at a nonzero DM, while a broadband impulse was already aligned at DM=0.
The dataset I built
GENERATORS = {
"pulse": dispersed_candidate,
"broadband_rfi": broadband_rfi,
"narrowband_rfi": narrowband_rfi,
}
FEATURE_NAMES = [
"best_dm",
"best_dm_snr",
"dm_gain",
"zero_dm_snr",
"channel_occupancy",
"time_occupancy",
"modulation_index",
"temporal_peak",
]
def make_dataset(samples_per_class):
features = []
labels = []
for label, generator in GENERATORS.items():
for _ in range(samples_per_class):
features.append(extract_features(generator()))
labels.append(label)
return np.asarray(features), np.asarray(labels)
X, y = make_dataset(samples_per_class=500)
print(X.shape, y.shape)
The classes are balanced on purpose. That lets us focus on feature behavior first. Real RFI datasets are severely imbalanced and require different metrics and sampling policies.
Split before training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.25,
random_state=42,
stratify=y,
)
Because every synthetic candidate is independently generated, a random split is acceptable here. With telescope data, split by observing session, beam or time block. Adjacent candidates often share the same RFI event, and random row splitting leaks that event into both sets.
Train a Random Forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=300,
max_depth=10,
min_samples_leaf=3,
class_weight="balanced",
random_state=42,
n_jobs=-1,
)
model.fit(X_train, y_train)
I used a Random Forest because it handled nonlinear boundaries without feature scaling and exposed feature importances. I treated it as a baseline, not as an automatic production choice.
Evaluate by class
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
prediction = model.predict(X_test)
print(classification_report(y_test, prediction, digits=3))
ConfusionMatrixDisplay.from_predictions(
y_test,
prediction,
normalize="true",
cmap="Blues",
)
plt.tight_layout()
plt.savefig("rfi-confusion-matrix.png", dpi=150)
Overall accuracy is not enough. Ask:
- How many real pulses are rejected as RFI?
- Which RFI type becomes a pulse candidate?
- Does performance collapse for weak signals?
- Are errors concentrated at low DM?
For this discovery-style pipeline, I treated false negatives as more expensive than false positives. I tuned thresholds for that objective instead of a round accuracy number.
Probability and triage thresholds
probability = model.predict_proba(X_test)
pulse_column = list(model.classes_).index("pulse")
pulse_probability = probability[:, pulse_column]
for threshold in [0.5, 0.7, 0.9]:
selected = pulse_probability >= threshold
recovered = np.mean(selected[y_test == "pulse"])
contamination = np.mean(y_test[selected] != "pulse") if selected.any() else 0.0
print(
f"threshold={threshold:.1f} "
f"pulse_recall={recovered:.3f} "
f"selected_contamination={contamination:.3f}"
)
I used the classifier to triage candidates, not to declare discoveries. A high-recall threshold produced a smaller candidate set for later validation stages.
What the model appeared to learn
order = np.argsort(model.feature_importances_)[::-1]
for index in order:
print(
f"{FEATURE_NAMES[index]:20s} "
f"{model.feature_importances_[index]:.3f}"
)
When best_dm dominated, I checked whether the simulator had accidentally made every pulse high-DM and every RFI exactly zero-DM. Real RFI can mimic dispersion, and low-DM astrophysical sources exist. I used feature importance as a prompt for another experiment, not as an explanation by itself.
Shortcut test: change the simulator
I created a harder RFI class with a curved sweep:
def swept_rfi():
data = background()
start = rng.integers(80, 180)
slope = rng.uniform(-0.8, 0.8)
width = rng.uniform(1.0, 4.0)
amplitude = rng.uniform(0.8, 2.0)
time = np.arange(N_TIME)
for channel in range(N_FREQ):
center = start + slope * channel
data[channel] += amplitude * np.exp(
-0.5 * ((time - center) / width) ** 2
)
return data
Before retraining, I passed 200 swept examples through the existing model:
X_shift = np.asarray([
extract_features(swept_rfi())
for _ in range(200)
])
shift_prediction = model.predict(X_shift)
print(np.unique(shift_prediction, return_counts=True))
This is a distribution-shift test. A model that looked excellent on the original test set may call swept RFI astrophysical because that morphology did not exist in training.
Moving from simulation to telescope data
For a telescope dataset, I would keep:
- candidates from different observing days
- multiple pointings and frequency bands
- human-reviewed labels with an uncertainty field
- injected synthetic pulses inside real backgrounds
- metadata about beam, antenna, backend and observing conditions
- groups that keep related candidates in one split
I used injection recovery as the sensitivity measurement: pulses covered a range of DM, width, spectral index and S/N, and I checked which ones survived the complete pipeline. Bright, clean examples alone gave an unrealistically easy test.
What I recorded for each run
dataset version
observation IDs
label policy
feature definitions
grouped split key
random seed
model parameters
class counts
confusion matrix
threshold and pulse recall
performance by DM, width and S/N
Without these, a good result is difficult to reproduce and nearly impossible to diagnose later.
The model is the easy part. The hard part is defining examples that represent the radio environment and designing a split that prevents the same interference from appearing on both sides. If that part is wrong, a beautiful confusion matrix only shows that the leakage was easy to learn.