Finding magnetic discontinuities

Finding magnetic discontinuities can be divided into two parts:

  1. Finding the discontinuities, see this notebook
    • This corresponds to limited feature extraction / anomaly detection
  2. Calculating the properties of the discontinuities, see this notebook
    • Higher time resolution data can be used for this step

Processing the whole dataset

Note that the candidates only require a small portion of the data, so we can compress the data to speed up processing.



compress_data_by_events

 compress_data_by_events (data:polars.dataframe.frame.DataFrame,
                          events:polars.dataframe.frame.DataFrame)

Compress the data for parallel processing
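
The idea behind this compression can be sketched without any dataframe library: keep only the samples whose timestamps fall inside at least one event interval. The sketch below is an illustration under assumptions, not the project's implementation; the function names `merge_intervals` and `keep_times_in_events`, and the representation of events as `(start, stop)` pairs, are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def merge_intervals(intervals):
    """Merge overlapping (start, stop) intervals into a sorted, disjoint list."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def keep_times_in_events(times, events):
    """Return the timestamps that lie inside any event interval (inclusive)."""
    merged = merge_intervals(events)
    starts = [start for start, _ in merged]
    kept = []
    for t in times:
        i = bisect_right(starts, t) - 1  # last interval starting at or before t
        if i >= 0 and t <= merged[i][1]:
            kept.append(t)
    return kept


# Hypothetical data: one sample per second for a minute, three events.
t0 = datetime(2020, 1, 1)
times = [t0 + timedelta(seconds=s) for s in range(60)]
events = [
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=15)),
    (t0 + timedelta(seconds=12), t0 + timedelta(seconds=20)),
    (t0 + timedelta(seconds=40), t0 + timedelta(seconds=42)),
]
compressed = keep_times_in_events(times, events)
```

Merging the intervals first means each sample is tested against a single interval found by binary search, so the cost stays low even when many events overlap.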



ids_finder

 ids_finder (detection_df:polars.lazyframe.frame.LazyFrame,
             tau:datetime.timedelta, ts:datetime.timedelta, bcols=None,
             extract_df:polars.lazyframe.frame.LazyFrame=None, **kwargs)
Parameter     Type       Default  Details
detection_df  LazyFrame           data used for anomaly detection (typically low cadence data)
tau           timedelta
ts            timedelta
bcols         NoneType   None
extract_df    LazyFrame  None     data used for feature extraction (typically high cadence data)
kwargs

wrapper function for partitioned input used in Kedro



extract_features

 extract_features (partitioned_input:dict[str,typing.Callable[...,
                   polars.lazyframe.frame.LazyFrame]], tau:float,
                   ts:float, **kwargs)

wrapper function for partitioned input

Parameter          Type       Details
partitioned_input  dict
tau                float      in seconds, yaml input
ts                 float      in seconds, yaml input
kwargs
Returns            DataFrame
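
The signatures above suggest two bridging steps: `tau` and `ts` arrive as floats in seconds (from YAML) while `ids_finder` expects `timedelta` objects, and the partitioned input is a dictionary of load callables that must be invoked before use. The following is a minimal sketch of those two steps only, assuming plain lists in place of LazyFrames; `seconds_to_timedelta` and `load_partitions` are hypothetical helper names, not functions from this project.

```python
from datetime import timedelta
from typing import Callable


def seconds_to_timedelta(seconds: float) -> timedelta:
    """Convert a YAML-style float (seconds) to a timedelta."""
    return timedelta(seconds=seconds)


def load_partitions(partitioned_input: dict[str, Callable[[], list]]) -> list:
    """Call each partition's load function and concatenate the results.

    Partitions are visited in sorted key order so the output is deterministic.
    """
    rows = []
    for name in sorted(partitioned_input):
        rows.extend(partitioned_input[name]())
    return rows


# Hypothetical partitions: each value is a zero-argument loader.
partitions = {
    "2020-02": lambda: [3, 4],
    "2020-01": lambda: [1, 2],
}
tau = seconds_to_timedelta(60.0)
ts = seconds_to_timedelta(1.0)
rows = load_partitions(partitions)
```

Keeping the loaders lazy until this point means no partition is read before the wrapper actually needs it.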

Conventions

As we are dealing with multiple spacecraft, we need to be careful about naming conventions. Here are the conventions we use in this project.

  • sat_id: name of the spacecraft. We also use abbreviations, for example
    • sta for STEREO-A
    • thb for ARTEMIS-B
  • sat_state: state data of the spacecraft
  • b_vl: maximum variance vector of the magnetic field (major eigenvector)
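
To make the "major eigenvector" wording concrete, the sketch below computes the maximum variance direction of a set of three-component field samples as the dominant eigenvector of their covariance matrix. It is a dependency-free illustration using power iteration, not the project's implementation (in practice a routine such as `numpy.linalg.eigh` would be typical); the function names and the synthetic samples are assumptions.

```python
import math


def covariance_matrix(samples):
    """3x3 covariance matrix of a list of (bx, by, bz) samples."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for s in samples:
        d = [s[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def major_eigenvector(matrix, iterations=200):
    """Dominant eigenvector of a symmetric 3x3 matrix via power iteration."""
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: no preferred direction
        v = [x / norm for x in w]
    return v


# Synthetic field: a large reversal in x, a small ripple in y, constant z.
samples = [
    (10.0 * math.tanh((k - 50) / 5.0), 1.0 + 0.1 * math.sin(k), 2.0)
    for k in range(101)
]
b_vl = major_eigenvector(covariance_matrix(samples))
```

For these samples the result points almost exactly along x, the component that carries the reversal.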

Data Level

  • l0: unprocessed

  • l1: cleaned data (null values filled, useful columns added)

  • l2: time-averaged data
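
As an illustration of the step from l1 to l2, the sketch below averages samples in fixed-width time bins. It is only a sketch under assumptions: the bin width, the `time_average` name, and the use of plain lists instead of dataframes are all hypothetical, and the project may average differently.

```python
from datetime import datetime, timedelta


def time_average(times, values, every):
    """Average values in consecutive bins of width `every`, starting at times[0]."""
    start = times[0]
    sums = {}
    for t, v in zip(times, values):
        k = (t - start) // every  # integer index of the bin containing t
        total, count = sums.get(k, (0.0, 0))
        sums[k] = (total + v, count + 1)
    return [
        (start + k * every, total / count)
        for k, (total, count) in sorted(sums.items())
    ]


# Hypothetical l1 data: one sample per second, averaged to a 2-second cadence.
t0 = datetime(2020, 1, 1)
times = [t0 + timedelta(seconds=s) for s in range(6)]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
l2 = time_average(times, values, timedelta(seconds=2))
```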

Columns naming conventions

  • radial_distance: radial distance of the spacecraft, in units of \(AU\)

  • plasma_speed: solar wind plasma speed, in units of \(km/s\)

  • sw_elevation: solar wind elevation angle, in units of \(^\circ\)

  • sw_azimuth: solar wind azimuth angle, in units of \(^\circ\)

  • v_{x,y,z} or sw_vel_{X,Y,Z}: solar wind plasma velocity components in ANY coordinate system, in units of \(km/s\)

    • sw_vel_{r,t,n}: solar wind plasma velocity components in the RTN coordinate system, in units of \(km/s\)
    • sw_vel_gse_{x,y,z}: solar wind plasma velocity components in the GSE coordinate system, in units of \(km/s\)
    • sw_vel_lmn_{x,y,z}: solar wind plasma velocity components in the LMN coordinate system, in units of \(km/s\)
      • v_l or sw_vel_l: abbreviation for sw_vel_lmn_1
      • v_mn or sw_vel_mn (deprecated)
  • plasma_density: plasma density, in units of \(1/cm^{3}\)

  • plasma_temperature: plasma temperature, in units of \(K\)

  • B_{x,y,z}: magnetic field in ANY coordinate system

    • b_rtn_{x,y,z} or b_{r,t,n}: magnetic field in the RTN coordinate system
    • b_gse_{x,y,z}: magnetic field in the GSE coordinate system
  • B_mag: magnetic field magnitude

  • Vl_{x,y,z} or b_vecL_{X,Y,Z}: maximum variance vector of the magnetic field in ANY coordinate system

    • b_vecL_{r,t,n}: maximum variance vector of the magnetic field in the RTN coordinate system
  • model_b_{r,t,n}: modelled magnetic field in the RTN coordinate system

  • state : 1 for solar wind, 0 for non-solar wind

  • L_mn{_norm}: thickness of the current sheet in MN direction, in units of \(km\)

  • j0{_norm}: current density, in units of \(nA/m^2\)

Note: we recommend using unique names for each variable (for example, plasma_speed instead of speed), because this makes it easier to search for and replace variable names in the code whenever necessary.
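
The brace notation used in the list above, such as sw_vel_{r,t,n}, stands for one column per listed component. A small helper can expand it into concrete column names; this is a sketch for illustration, and `expand_columns` is a hypothetical name rather than part of the project. It does not handle the optional-suffix notation used in L_mn{_norm}.

```python
import re


def expand_columns(pattern: str) -> list[str]:
    """Expand a single {a,b,c} group in a column pattern into column names."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # no brace group: the pattern is already a column name
    head, tail = pattern[: match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


rtn_velocity_columns = expand_columns("sw_vel_{r,t,n}")
gse_field_columns = expand_columns("b_gse_{x,y,z}")
magnitude_column = expand_columns("B_mag")
```

Because every expanded name is unique across the project, a plain text search for, say, sw_vel_r finds every use of that quantity.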

For units, by default we use

  • length : \(km\)
  • time : \(s\)
  • magnetic field : \(nT\)
  • current density : \(nA/m^2\)
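
Mixing these default units requires care with powers of ten. As a worked example, the sketch below converts a field change in \(nT\) across a thickness in \(km\) into a current density in \(nA/m^2\), assuming the simple Ampère's-law estimate \(j = \Delta B / (\mu_0 L)\). That formula and the function name are assumptions made for illustration; the project's definition of j0 may differ.

```python
import math

MU0 = 4e-7 * math.pi  # vacuum permeability in T*m/A


def current_density_nA_m2(delta_b_nT: float, thickness_km: float) -> float:
    """Estimate j = dB / (mu0 * L), with inputs in nT and km, output in nA/m^2."""
    delta_b_T = delta_b_nT * 1e-9   # nT -> T
    thickness_m = thickness_km * 1e3  # km -> m
    j_A_m2 = delta_b_T / (MU0 * thickness_m)
    return j_A_m2 * 1e9  # A/m^2 -> nA/m^2


# A 10 nT change across a 1000 km thick sheet gives roughly 8 nA/m^2.
j0 = current_density_nA_m2(10.0, 1000.0)
```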

Test

Test feature engineering

# from tsflex.features import MultipleFeatureDescriptors, FeatureCollection

# from tsflex.features.integrations import catch22_wrapper
# from pycatch22 import catch22_all
# tau_pd = pd.Timedelta(tau)

# catch22_feats = MultipleFeatureDescriptors(
#     functions=catch22_wrapper(catch22_all),
#     series_names=bcols,  # list of signal names
#     windows = tau_pd, strides=tau_pd/2,
# )

# fc = FeatureCollection(catch22_feats)
# features = fc.calculate(data, return_df=True)  # calculate the features on your data
# features_pl = pl.DataFrame(features.reset_index()).sort('time')
# df = candidates_pl.join_asof(features_pl, on='time').to_pandas()
# profile = ProfileReport(df, title="JUNO Candidates Report")
# profile.to_file("jno.html")
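
The commented-out configuration above uses windows of length tau with a stride of tau/2. The windowing itself can be sketched in plain Python as follows; this only illustrates the overlap pattern, `sliding_windows` is a hypothetical helper, and tsflex's own handling of window boundaries and partial windows may differ.

```python
from datetime import datetime, timedelta


def sliding_windows(start, stop, window, stride):
    """Return (begin, end) pairs of full windows that fit inside [start, stop]."""
    windows = []
    t = start
    while t + window <= stop:
        windows.append((t, t + window))
        t += stride
    return windows


# Hypothetical values: a 4-minute window over 10 minutes of data.
tau = timedelta(minutes=4)
t0 = datetime(2020, 1, 1)
windows = sliding_windows(t0, t0 + timedelta(minutes=10), tau, tau / 2)
```

With a stride of half the window length, every sample away from the edges is covered by two windows, so a discontinuity near one window's boundary sits near the centre of the neighbouring one.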

Benchmark

Notes

TODOs

  1. Feature engineering
  2. Feature selection

Obsolete code