Astronomy catalogs rarely contain everything I need in one place. TESS tells me which stars were observed for transits. Gaia adds positions, proper motions, parallaxes, colors and luminosity information. To study a TESS target as a physical star, I had to join the two catalogs.
That join is called a cross-match. It looks like a database join, but the key is not a string or integer. The key is a position on the sky, measured at a particular epoch and with an uncertainty.
I tested the process with a small field around an arbitrary sky position. I queried TESS and Gaia, matched their coordinates, inspected ambiguous results and saved the resulting table.
What makes a sky match different
Two example catalog rows looked like this:
catalog A: RA=120.123456, Dec=-18.123456
catalog B: RA=120.123520, Dec=-18.123410
They are not equal, but they may describe the same star. Coordinates differ because of measurement uncertainty, catalog epoch, proper motion and the way each survey determines its source centroid.
The right question is not “are the coordinates equal?” It is:
Is the angular separation small enough, given the resolution, epoch and source density?
For two nearby positions, Astropy handles the spherical geometry for us with SkyCoord.
The environment I used
I used a small virtual environment with the astronomy libraries I needed:
python -m venv .venv
source .venv/bin/activate
# scipy is required by Astropy's match_to_catalog_sky
pip install numpy scipy pandas astropy astroquery matplotlib
The example uses:
- astroquery.mast.Catalogs for the TESS Input Catalog (TIC)
- astroquery.gaia.Gaia for Gaia DR3
- astropy.coordinates.SkyCoord for spherical matching
- astropy.units so arcseconds do not become accidental degrees
The field I used
I kept the center coordinate as an input instead of hard-coding a famous exoplanet. That made it easier to repeat the same test on fields with different source densities.
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord
center = SkyCoord(
    ra=120.0 * u.deg,
    dec=-18.0 * u.deg,
    frame="icrs",
)
query_radius = 2.0 * u.arcmin
match_radius = 1.0 * u.arcsec
The query radius controlled the size of the field downloaded from each archive. The match radius controlled which pairs I accepted as possible counterparts. I kept them as separate variables because they solve different problems.
My TESS Input Catalog query
MAST exposes the TIC through astroquery:
from astroquery.mast import Catalogs
tic = Catalogs.query_region(
    center,
    radius=query_radius,
    catalog="TIC",
)
tic = tic[["ID", "ra", "dec", "Tmag"]]
coordinate_mask = (
    np.ma.getmaskarray(tic["ra"])
    | np.ma.getmaskarray(tic["dec"])
)
tic = tic[~coordinate_mask]
print(f"TIC sources: {len(tic)}")
The returned object is an Astropy Table, not a pandas DataFrame. That is useful because Astropy tables preserve units and masks.
I did not assume every row had coordinates. Some catalog columns were masked, and converting them directly into floats would have quietly created bad values.
My Gaia DR3 query
Gaia provides a cone-search helper:
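from astroquery.gaia import Gaia
# The default ROW_LIMIT of 50 silently truncates crowded fields; -1 removes it.
Gaia.ROW_LIMIT = -1
job = Gaia.cone_search_async(
    center,
    radius=query_radius,
    table_name="gaiadr3.gaia_source",
)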
gaia = job.get_results()
gaia = gaia[[
    "source_id",
    "ra",
    "dec",
    "parallax",
    "pmra",
    "pmdec",
    "phot_g_mean_mag",
    "bp_rp",
    "ruwe",
]]
print(f"Gaia sources: {len(gaia)}")
ruwe is included because a close positional match is not automatically a good astrometric source. A large RUWE can indicate that Gaia’s single-star astrometric model fits poorly. It is a quality clue, not a universal delete button.
Convert both catalogs to coordinates
tic_coord = SkyCoord(
    ra=tic["ra"],
    dec=tic["dec"],
    unit="deg",
    frame="icrs",
)
gaia_coord = SkyCoord(
    ra=gaia["ra"],
    dec=gaia["dec"],
    unit="deg",
    frame="icrs",
)
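At this point both catalogs lived in the same coordinate frame, though not necessarily at the same epoch, a distinction that matters once proper motion enters. With a shared frame, Astropy could calculate great-circle separations correctly, including near the poles and across RA=0.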
Nearest-neighbor matching
For every TIC source, find the nearest Gaia source:
idx, separation, _ = tic_coord.match_to_catalog_sky(gaia_coord)
accepted = separation <= match_radius
print(f"Accepted matches: {accepted.sum()} / {len(tic)}")
print(f"Median accepted separation: "
      f"{np.median(separation[accepted].to_value(u.arcsec)):.3f} arcsec")
idx[i] is the Gaia row nearest to TIC row i. This does not mean the match is valid. Nearest-neighbor algorithms always return something when the comparison catalog is non-empty. The radius cut is what turns a nearest neighbor into a candidate counterpart.
I assembled the accepted pairs into an output table:
from astropy.table import Table
matched = Table()
matched["tic_id"] = tic["ID"][accepted]
matched["gaia_source_id"] = gaia["source_id"][idx[accepted]]
matched["separation_arcsec"] = separation[accepted].to_value(u.arcsec)
matched["tmag"] = tic["Tmag"][accepted]
matched["gmag"] = gaia["phot_g_mean_mag"][idx[accepted]]
matched["parallax_mas"] = gaia["parallax"][idx[accepted]]
matched["bp_rp"] = gaia["bp_rp"][idx[accepted]]
matched["ruwe"] = gaia["ruwe"][idx[accepted]]
matched.sort("separation_arcsec")
matched.write("tic_gaia_matches.ecsv", overwrite=True)
ECSV is a good default for intermediate astronomy products because it preserves metadata and units better than plain CSV.
The first hidden problem: duplicate assignments
Nearest-neighbor matching is directional. Two TIC rows can choose the same Gaia source, especially in a crowded field.
gaia_ids = np.asarray(matched["gaia_source_id"])
unique_ids, counts = np.unique(gaia_ids, return_counts=True)
duplicates = unique_ids[counts > 1]
print(f"Gaia sources assigned more than once: {len(duplicates)}")
When duplicates appeared, I inspected them and considered several responses:
- keep only the smallest-separation pair
- require mutual nearest neighbors
- solve a global one-to-one assignment problem
- use magnitude, color or catalog identifiers as additional evidence
There is no universal choice. The scientific question determines the matching policy.
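The first option in the list, keeping only the smallest-separation pair per Gaia source, can be sketched with NumPy alone. The arrays below are toy values and the variable names are illustrative; this is one possible implementation, not necessarily the one used for the final table.

```python
import numpy as np

# Toy accepted pairs: TIC rows 0 and 1 both chose Gaia row 7.
tic_rows = np.array([0, 1, 2, 3])
gaia_rows = np.array([7, 7, 4, 9])
sep_arcsec = np.array([0.40, 0.15, 0.22, 0.80])

# Sort by separation so the closest pair for each Gaia row comes first,
# then keep the first occurrence of every Gaia row.
order = np.argsort(sep_arcsec, kind="stable")
_, first = np.unique(gaia_rows[order], return_index=True)
best = np.sort(order[first])

print(tic_rows[best], gaia_rows[best], sep_arcsec[best])
```

Here the pair with the 0.40 arcsec separation is dropped in favor of the 0.15 arcsec pair for the same Gaia row. The rejected TIC row is left without a counterpart, which is itself a decision worth recording.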
Mutual nearest neighbors
A simple stricter rule is to require the catalogs to choose each other:
tic_to_gaia, sep_tg, _ = tic_coord.match_to_catalog_sky(gaia_coord)
gaia_to_tic, _, _ = gaia_coord.match_to_catalog_sky(tic_coord)
rows = np.arange(len(tic))
mutual = gaia_to_tic[tic_to_gaia] == rows
close = sep_tg <= match_radius
keep = mutual & close
print(f"Mutual matches: {keep.sum()}")
This rejects many ambiguous pairs, but it can also reject legitimate sources in blended or incomplete catalogs. Stricter is not automatically more correct.
The second hidden problem: epoch and proper motion
Gaia DR3 positions are referenced to epoch J2016.0. A nearby fast-moving star can shift by more than an arcsecond over a few years. If the other catalog position belongs to a different epoch, propagate Gaia coordinates before matching.
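from astropy.time import Time
# u.Quantity converts columns that already carry units and assigns units to
# those that do not; multiplying a unit-bearing column by a unit would square it.
gaia_with_motion = SkyCoord(
    ra=u.Quantity(gaia["ra"], u.deg),
    dec=u.Quantity(gaia["dec"], u.deg),
    pm_ra_cosdec=u.Quantity(gaia["pmra"], u.mas / u.yr),
    pm_dec=u.Quantity(gaia["pmdec"], u.mas / u.yr),
    obstime=Time("J2016.0"),
    frame="icrs",
)
gaia_at_2020 = gaia_with_motion.apply_space_motion(
    new_obstime=Time("J2020.0")
)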
This example omits radial velocity and distance, so it is an approximation. For ordinary distant stars and a short epoch baseline, it is often enough. For nearby high-proper-motion stars, use every available phase-space component and document the target epoch.
The separation distribution I got
Rather than accepting 1 arcsec only because it sounded reasonable, I plotted the separation distribution:
import matplotlib.pyplot as plt
sep_arcsec = separation.to_value(u.arcsec)
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(sep_arcsec, bins=np.linspace(0, 5, 51))
ax.axvline(match_radius.to_value(u.arcsec), color="tab:red",
           label="adopted threshold")
ax.set_xlabel("nearest Gaia separation (arcsec)")
ax.set_ylabel("TIC sources")
ax.legend()
fig.tight_layout()
fig.savefig("tic_gaia_separations.png", dpi=150)
A real counterpart population often forms a narrow peak near zero, while chance alignments create a wider tail. Crowded Galactic-plane fields need more care than sparse high-latitude fields.
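The dependence on source density can be estimated before looking at any data. The sketch below assumes unrelated sources are scattered uniformly (a Poisson model), which is a simplification. The function name chance_match_probability and the source counts are hypothetical.

```python
import numpy as np

def chance_match_probability(n_sources, field_radius_arcsec, match_radius_arcsec):
    """Probability that a random position has an unrelated source within
    match_radius, assuming uniformly scattered (Poisson) sources."""
    field_area = np.pi * field_radius_arcsec**2
    density = n_sources / field_area  # sources per square arcsec
    expected = density * np.pi * match_radius_arcsec**2
    return 1.0 - np.exp(-expected)

# Hypothetical counts for a 2 arcmin field and a 1 arcsec match radius.
for n in (50, 500, 5000):
    p = chance_match_probability(n, 120.0, 1.0)
    print(f"{n:5d} sources -> chance-match probability {p:.4f}")
```

Real fields are clustered and magnitude-dependent, so this is only a first-order guide to how quickly chance alignments grow with crowding.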
Estimate false matches with a shifted catalog
For an empirical control, I shifted one catalog by an amount larger than the match radius and repeated the match. I treated those pairs as accidental.
shifted_tic = SkyCoord(
    ra=tic_coord.ra + 1.0 * u.arcmin,
    dec=tic_coord.dec,
    frame="icrs",
)
_, shifted_sep, _ = shifted_tic.match_to_catalog_sky(gaia_coord)
false_matches = np.count_nonzero(shifted_sep <= match_radius)
print(f"Shifted-catalog matches inside threshold: {false_matches}")
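This was not a complete contamination model, but it showed whether my threshold was producing many random associations. One caveat: a 1 arcmin shift is large compared with the 2 arcmin query radius, so shifted sources near the field edge land outside the Gaia footprint and cannot match, which biases the count low.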
A compact reusable function
def nearest_crossmatch(left, right, max_sep):
    idx, sep, _ = left.match_to_catalog_sky(right)
    keep = sep <= max_sep
    return np.flatnonzero(keep), idx[keep], sep[keep]

left_rows, right_rows, sep = nearest_crossmatch(
    tic_coord,
    gaia_coord,
    1.0 * u.arcsec,
)
I kept the function small and left quality cuts, epoch propagation and one-to-one logic as explicit steps around it. That kept the assumptions visible in the analysis.
What I recorded with the result
I recorded the following details with the cross-match:
- catalog releases and table names
- query date or archive job identifier
- coordinate frame and epochs
- whether proper motion was applied
- matching algorithm
- maximum separation
- one-to-one or mutual-match policy
- quality filters
- number of sources before and after every cut
- an estimate of accidental-match contamination
The final table is not merely downloaded data. It is the result of a chain of scientific decisions.
I started this as a data-preparation step, but the test made it clear that cross-matching is part of the inference. A wrong counterpart gives me a perfectly valid parallax, color and proper motion for the wrong star, which is much harder to notice than a missing value.