Temporal Graph Data in TGM
This tutorial shows the core graph API in TGM. By the end, you should understand how to:
- Construct and preprocess graph data (`DGData`)
- Split and discretize temporal datasets (`SplitStrategy`)
- Work with immutable graph views (`DGraph`)
- Train with batches (`DGBatch`)
We also highlight some important errors, caching behaviour, and best practices.
1. The Core Objects
TGM's graph API revolves around four main objects:
| Object | Description | Mutable | Device Semantics | Typical Usage |
|---|---|---|---|---|
| `DGData` | Mutable bulk dataset storage (IO, splits, transforms) | Yes | No | Ingesting datasets from disk, TGB, preprocessing |
| `DGraph` | Immutable graph view backed by storage engine | No | Yes | Main user-facing graph object |
| `DGBatch` | Materialized batches of tensors from a temporal slice of data | Yes | Yes | What dataloaders yield, input to models |
| `DGStorage` | Internal backend for graph data (non-user-facing) | No | Yes | Powers graph querying, caching, slice ops |
Note: Users typically interact only with the first three. `DGStorage` is internal and abstracted away. We plan to build more efficient storage backends for various workloads in the future.
2. Starting with DGData
DGData is your main entry point for working with temporal graph datasets. It's a dataclass that holds bulk storage of events, timestamps, features, and metadata.
Because it's mutable, you can freely transform and prepare it before moving to the immutable graph representation (DGraph).
Features of DGData
- Holds raw edge data (`edge_index`, `edge_time`)
- Holds static node features, dynamic node features, and edge features (on CPU)
- Provides IO constructors (CSV, Pandas, TGB, PyTorch)
- Supports temporal splitting and discretization
- Ensures data is sorted chronologically and validates node IDs, tensor shapes, etc.
See below for a summary of the data class attributes of DGData:
@dataclass
class DGData:
    """Container for dynamic graph data to be ingested by `DGStorage`.

    Stores edge and node events, their timestamps, features, and optional split strategy.
    Provides methods to split, discretize, and clone the data.

    Attributes:
        time_delta (TimeDeltaDG | str): Time granularity of the graph.
        time (Tensor): 1D tensor of all event timestamps [num_edge_events + num_node_events].
        edge_mask (Tensor): Mask of edge events within `time`.
        edge_index (Tensor): Edge connections [num_edge_events, 2].
        edge_x (Tensor | None): Optional edge features [num_edge_events, D_edge].
        node_x_mask (Tensor | None): Mask of dynamic node features within `time`.
        node_x_nids (Tensor | None): Node IDs corresponding to dynamic node features [num_node_events].
        node_x (Tensor | None): Dynamic Node features over time [num_node_events, D_node_dynamic].
        node_y_mask (Tensor | None): Mask of node labels within `time`.
        node_y_nids (Tensor | None): Node IDs corresponding to node labels [num_node_labels].
        node_y (Tensor | None): Node labels over time [num_node_labels, D_node_dynamic].
        static_node_x (Tensor | None): Node features invariant over time [num_nodes, D_node_static].
        edge_type (Tensor | None): Type of relation of each edge event in edge_index [num_edge_events].
        node_type (Tensor | None): Type of each node [num_nodes].

    Raises:
        InvalidNodeIDError: If an edge or node ID match `PADDED_NODE_ID`.
        InvalidNodeIDError: If node labels exists with node IDs outside the graph's node ID range.
        ValueError: If any data attributes have non-well defined tensor shapes.
        EmptyGraphError: If attempting to initialize an empty graph.

    Notes:
        - Timestamps must be non-negative and sorted; DGData will sort automatically if necessary.
        - Cloning creates a deep copy of tensors to prevent in-place modifications.
        - Edge type is only applicable for Heterogeneous & Knowledge graph.
        - Node type is only applicable for Knowledge graph.
    """
See tgm.data.dg_data.DGData for full reference.
3. Constructing DGData
You can build datasets in multiple ways. Let's look at each.
3.1 From TGB
This is most likely all you need. The Temporal Graph Benchmark (TGB) provides a suite of temporal graph datasets with diverse scales and properties. TGM natively supports direct construction from all `tgbl-` and `tgbn-` datasets.
Note: Temporal heterogeneous graphs (THG) are supported in TGM. Check out the THG tutorial.
Note: Temporal knowledge graphs (TKG) are under construction and not yet supported in TGM.
Note: To load a TGB dataset, you must have the `py-tgb` package installed in your Python environment.
from tgm.data import DGData
# Load the Wikipedia dataset from TGB
data = DGData.from_tgb('tgbl-wiki')
print(data.time_delta) # TimeDelta('s', value=1)
print(data.edge_index.shape) # torch.Size([157474, 2])
print(data.node_x) # None, no dynamic node features in tgbl-wiki
print(data.static_node_x) # None, no static node features in tgbl-wiki
TIP: You can `print(data)` to see which features and events exist within the dataset.
3.2 Custom Datasets
If you have your own dataset, you can create a `DGData` object with `from_csv`, `from_pandas`, or directly from tensors. A brief overview of each is given below; consult the API reference for more details.
From CSV
Please consult our documentation for a full description of the API. The table below summarizes the main pieces of data expected during construction. Note that analogous attributes are expected in the other IO constructors (e.g. `from_pandas`, `from_raw`).
| Attribute | Description | Type | Required | Note |
|---|---|---|---|---|
| `edge_file_path` | Path to CSV file containing edge data | `str \| pathlib.Path` | Yes | `edge_df` if using `from_pandas` |
| `edge_src_col` | Column name in edge file for src nodes | `str` | Yes | Cannot have ids matching `tgm.constants.PADDED_NODE_ID` |
| `edge_dst_col` | Column name in edge file for dst nodes | `str` | Yes | Cannot have ids matching `tgm.constants.PADDED_NODE_ID` |
| `edge_time_col` | Column name in edge file for edge times | `str` | Yes | Time must be non-negative |
| `node_x_file_path` | Path to CSV file containing dynamic node data | `str \| pathlib.Path` | No | `node_x_df` if using `from_pandas` |
| `node_x_nids_col` | Column name in node file for node event node ids | `str` | No, unless `node_x_file_path` is specified | Cannot have ids matching `tgm.constants.PADDED_NODE_ID` |
| `node_x_time_col` | Column name in node file for node event times | `str` | No, unless `node_x_file_path` is specified | Time must be non-negative |
| `node_x_col` | Column name in node file for dynamic node features | `str` | No | |
| `node_y_file_path` | Path to CSV file containing dynamic node labels | `str \| pathlib.Path` | No | `node_y_df` if using `from_pandas` |
| `node_y_nids_col` | Column name in node file for node label node ids | `str` | No, unless `node_y_file_path` is specified | Cannot have ids matching `tgm.constants.PADDED_NODE_ID` |
| `node_y_time_col` | Column name in node file for node label times | `str` | No, unless `node_y_file_path` is specified | Time must be non-negative |
| `node_y_col` | Column name in node file for dynamic node labels | `str` | No | |
| `static_node_x_file_path` | Path to CSV file containing static node features | `str \| pathlib.Path` | No | `static_node_x_df` if using `from_pandas` |
| `static_node_x_col` | Column name in static node feats file for static node features | `str` | No, unless `static_node_x_file_path` is specified | |
| `time_delta` | Time granularity of the graph data | `TimeDeltaDG \| str` | Yes | Defaults to event-ordered granularity `'r'` |
A few key things to know:
- `time_delta`: defines how timestamps are interpreted on your custom dataset.
  - The default is `'r'`, which entails event-ordered semantics. This means there is no real-world time unit assigned to your timestamps, which prevents operations such as discretizing your data or iterating by temporal snapshots.
  - More often than not, your timestamps have some semantic meaning (e.g. seconds, days, etc.). In this case, you should specify the appropriate `time_delta` value. See our time management tutorial for more details.
- edge data:
  - We expect an `edge_file_path`, a CSV file with `edge_src_col`, `edge_dst_col`, and `edge_time_col` as a minimum.
  - Your edge CSV file may also contain `edge_x_col`, which holds the edge features in your data.
- dynamic node data (optional):
  - If included, we expect a `node_x_file_path`, a CSV file with `node_x_nids_col` and `node_x_time_col` as a minimum. These are your dynamic node events.
  - If included, we expect a `node_y_file_path`, a CSV file with `node_y_nids_col` and `node_y_time_col` as a minimum. These are your dynamic node labels.
  - Your dynamic node data CSV file may also include `node_x_col`, which holds the dynamic node features in your data.
  - Your dynamic node labels CSV file may also include `node_y_col`, which holds the dynamic node labels in your data.
- static node data (optional):
  - If included, we expect a `static_node_x_file_path`, a CSV file with `static_node_x_col`, the static node features for your dataset.
Internally, we perform various checks on tensor shapes, node ranges, and timestamp values. If your data is well structured, everything should work. If you get an error message that is not intuitive, please let us know.
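As a minimal sketch of the edge file described above, the snippet below writes a three-row CSV with the standard library and then, only if `tgm` is importable, passes it to `DGData.from_csv` using the parameter names from the table. The column names (`src`, `dst`, `t`) and the file name are illustrative assumptions, not requirements.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative edge list: (source node, destination node, timestamp in seconds).
rows = [(2, 2, 1), (2, 4, 5), (1, 8, 10)]

# Hypothetical file name inside a fresh temporary directory.
edge_file_path = Path(tempfile.mkdtemp()) / 'edges.csv'
with edge_file_path.open('w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['src', 'dst', 't'])  # header: the column names passed below
    writer.writerows(rows)

try:
    from tgm.data import DGData
except ImportError:
    DGData = None  # tgm is not installed; the CSV above is still valid

if DGData is not None:
    data = DGData.from_csv(
        edge_file_path=edge_file_path,
        edge_src_col='src',
        edge_dst_col='dst',
        edge_time_col='t',
        time_delta='s',  # second-wise granularity
    )
    print(data)
```

Only the required edge columns are shown here; feature columns and node files follow the same pattern with the additional arguments listed in the table.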
From Pandas
The API is largely the same as above, except that we expect `edge_df`, `node_x_df`, and `static_node_x_df` dataframes for the edge, dynamic node, and static node data respectively, instead of CSV files.
import pandas as pd
import torch

from tgm.data import DGData

# Define Edge Data
edge_df = pd.DataFrame({
    'src': [2, 2, 1],
    'dst': [2, 4, 8],
    't': [1, 5, 10],
    'edge_feat': [torch.rand(5).tolist() for _ in range(3)],  # Optional
})
# Define Dynamic Node Data (Optional)
dynamic_node_df = pd.DataFrame({
    'node': [2, 4, 6],
    't': [1, 2, 3],
    'dynamic_node_feat': [torch.rand(5).tolist() for _ in range(3)],
})
# Define Static Node Features (Optional)
static_node_df = pd.DataFrame({
    'static_node_feat': [torch.rand(11).tolist() for _ in range(9)]
})
# Build the DGData object from the dataframes
data = DGData.from_pandas(
    edge_df=edge_df,
    edge_src_col='src',
    edge_dst_col='dst',
    edge_time_col='t',
    edge_x_col='edge_feat',
    node_x_df=dynamic_node_df,
    node_x_nids_col='node',
    node_x_time_col='t',
    node_x_col='dynamic_node_feat',
    static_node_x_df=static_node_df,
    static_node_x_col='static_node_feat',
    time_delta='s',  # second-wise granularity
)
From Tensors
If all your data is already in memory as `torch.Tensor` objects, you can directly instantiate `DGData` using the class method `DGData.from_raw`:
import torch
# Define Edge Data
edge_index = torch.LongTensor([[2, 2], [2, 4], [1, 8]])
edge_time = torch.LongTensor([1, 5, 10])
edge_feats = torch.rand(3, 5) # optional edge features
# Define Dynamic Node Data (Optional)
node_x_time = torch.LongTensor([1, 2, 3])
node_x_nids = torch.LongTensor([2, 4, 6])
node_x = torch.rand([3, 5])
# Define Static Node Features (Optional)
static_node_x = torch.rand(9, 11)
data = DGData.from_raw(
edge_time=edge_time,
edge_index=edge_index,
edge_x=edge_feats,
node_x_time=node_x_time,
node_x_nids=node_x_nids,
node_x=node_x,
static_node_x=static_node_x,
time_delta='s', # second-wise granularity
)
3.3 Errors to know
- `tgm.exceptions.EmptyGraphError`: Raised when you try to construct a `DGData` object from empty data. This is probably not what you intended to do, since the downstream `DGraph` is immutable.
- `tgm.exceptions.InvalidNodeIDError`: Raised when your dataset contains `-1` as a node ID (reserved for padding).
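The checks behind these errors can be pictured with a small stand-alone sketch. It is not TGM's implementation: `validate_edges` and the two exception classes are hypothetical stand-ins that mirror the documented behaviour (reject empty input, reject the reserved padding ID `-1`, reject negative timestamps, and sort chronologically).

```python
PADDED_NODE_ID = -1  # assumption: mirrors the reserved padding ID described above


class EmptyGraphError(ValueError):
    """Hypothetical stand-in for tgm.exceptions.EmptyGraphError."""


class InvalidNodeIDError(ValueError):
    """Hypothetical stand-in for tgm.exceptions.InvalidNodeIDError."""


def validate_edges(edges):
    """Validate (src, dst, time) triples and return them sorted by time."""
    if not edges:
        raise EmptyGraphError('cannot build a graph from empty data')
    for src, dst, t in edges:
        if PADDED_NODE_ID in (src, dst):
            raise InvalidNodeIDError(f'node ID {PADDED_NODE_ID} is reserved for padding')
        if t < 0:
            raise ValueError('timestamps must be non-negative')
    # sorted() is stable, so events sharing a timestamp keep their input order.
    return sorted(edges, key=lambda e: e[2])


print(validate_edges([(1, 8, 10), (2, 2, 1), (2, 4, 5)]))
# [(2, 2, 1), (2, 4, 5), (1, 8, 10)]
```

Catching these problems at construction time matters because nothing can be repaired once the data has been handed to an immutable view.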
4. Splitting DGData
After loading your data, you'll probably want to split your dataset into train, validation, and test splits. TGM provides a strategy pattern interface for different split strategies:
- `TemporalSplit`: Split by fixed timestamp boundaries
- `TemporalRatioSplit`: Split by ratio of both edge and node events
- `TGBSplit`: Pre-defined TGB data splits
Important: The TGB data splits use pre-defined event masks to match the splits used by the TGB leaderboard. If you try to override them, you'll get a `ValueError`.
The split method is defined on DGData:
def split(self, strategy: SplitStrategy | None = None) -> Tuple[DGData, ...]:
    """Split the dataset according to a strategy.

    Args:
        strategy (SplitStrategy | None): Optional strategy to override the
            default. If None, uses `_split_strategy` or defaults to `TemporalRatioSplit`.

    Returns:
        Tuple[DGData, ...]: Split datasets (train/val/test).

    Raises:
        ValueError: If attempting to override the split strategy for TGB datasets.

    Notes:
        - Splits preserve the underlying storage; only indices are filtered.
    """
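To build intuition for a ratio-based split, here is a stand-alone sketch that cuts a chronologically sorted event list into train/val/test chunks by event count. It is an illustration only: `ratio_split` is a hypothetical helper, and `TemporalRatioSplit` may place its boundaries differently (for example, by timestamp rather than by event count).

```python
def ratio_split(events, train=0.8, val=0.1, test=0.1):
    """Split events into three contiguous chronological chunks."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError('split ratios must sum to 1')
    events = sorted(events, key=lambda e: e[-1])  # last field is the timestamp
    n = len(events)
    train_end = int(n * train)
    val_end = train_end + int(n * val)
    return events[:train_end], events[train_end:val_end], events[val_end:]


# Ten toy edge events (src, dst, time) with times 0..9.
events = [(i % 3, (i + 1) % 3, i) for i in range(10)]
train_events, val_events, test_events = ratio_split(events)
print(len(train_events), len(val_events), len(test_events))  # 8 1 1
```

Because the chunks are contiguous in time, every training event precedes every validation event, which avoids leaking future information into training.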
Splitting TGB Datasets
from tgm.data import DGData, TemporalRatioSplit
# Load the Wikipedia dataset from TGB
data = DGData.from_tgb('tgbl-wiki')
# Split using native TGB masks
train_data, val_data, test_data = data.split()
# Overriding the split strategy on a TGB dataset raises an error
split_strategy = TemporalRatioSplit(train=0.8, val=0.1, test=0.1)
_ = data.split(strategy=split_strategy)  # Raises ValueError
5. Discretizing DGData
TGM does not enforce a strict distinction between continuous-time dynamic graphs (CTDG) and discrete-time dynamic graphs (DTDG). Instead, as you have seen, graphs are defined by their time granularity, so you can convert between event-based and snapshot-based views of the underlying data. You can learn more about this in the UTG paper.
In TGM, we provide a method on DGData called discretize which allows you to coarsen your graph into different time granularities. The API looks like:
def discretize(
    self, time_delta: TimeDeltaDG | str | None, reduce_op: str = 'first'
) -> DGData:
    """Return a copy of the dataset discretized to a coarser time granularity.

    Args:
        time_delta (TimeDeltaDG | str | None): Target time granularity.
        reduce_op (str): Aggregation method for multiple events per bucket. Default 'first'.

    Returns:
        DGData: New dataset with discretized timestamps and features.

    Raises:
        EventOrderedConversionError: If discretization is incompatible with event-ordered granularity
        InvalidDiscretizationError: If the target granularity is finer than the current granularity.
    """
Note: This is only well defined if the `DGData` time delta is time-ordered. If you try discretizing an event-ordered dataset, you will get a `tgm.exceptions.EventOrderedConversionError`.
Note: Discretization goes from finer time units (e.g. seconds) to coarser time units (e.g. hours). If you attempt to discretize in the other direction, you'll get a `tgm.exceptions.InvalidDiscretizationError`.
See our time management tutorial for more details on discretization and how it relates to TimeDeltaDG.
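The idea of coarsening with a `'first'` reduction can be sketched in plain Python. This is an illustration under stated assumptions, not TGM's implementation: `discretize_first` is a hypothetical helper, timestamps are assumed to be in seconds, and duplicates are assumed to mean the same `(src, dst)` pair falling into the same bucket.

```python
SECONDS_PER_HOUR = 3600


def discretize_first(edges, bucket_size=SECONDS_PER_HOUR):
    """Coarsen (src, dst, time) edges, keeping the first edge per (src, dst, bucket)."""
    seen = set()
    coarse = []
    for src, dst, t in sorted(edges, key=lambda e: e[2]):
        bucket = t // bucket_size  # new timestamp, expressed in the coarser unit
        key = (src, dst, bucket)
        if key not in seen:  # 'first' reduction: later duplicates are dropped
            seen.add(key)
            coarse.append((src, dst, bucket))
    return coarse


edges_in_seconds = [(0, 1, 10), (0, 1, 1800), (1, 2, 3599), (0, 1, 3600), (1, 2, 7300)]
print(discretize_first(edges_in_seconds))
# [(0, 1, 0), (1, 2, 0), (0, 1, 1), (1, 2, 2)]
```

The second `(0, 1)` edge is dropped because it lands in the same hourly bucket as the first, which also shows why the operation cannot be reversed to recover a finer granularity.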
6. From DGData to DGraph
Once your dataset is ready to go, you can cast it to DGraph:
from tgm import DGraph
from tgm.data import DGData
data = DGData.from_tgb(...)
dg = DGraph(data, device=...)
Some things to note:
- `DGraph` is an immutable view over a temporal window of graph data.
- It is backed by `DGStorage` (internal engine). When you first create a `DGraph` as we did above, a new storage is created, and the view encapsulates the entire dataset.
- `DGraph` supports device semantics: you can choose which device your graph is on.
DGraph Properties
Let's use our toy DGData we had above, cast to DGraph and inspect some of the properties of the entire dataset.
data = DGData.from_raw(...) # As we had above
dg = DGraph(data) # Default to CPU
print(f'Start time : {dg.start_time}') # 1
print(f'End time : {dg.end_time}') # 10
print(f'Number of nodes : {dg.num_nodes}') # 9
print(f'Number of edge events : {dg.num_edge_events}') # 3
print(f'Number of timestamps : {dg.num_timestamps}') # or len(dg); 5
print(f'Total events (edge+node) : {dg.num_events}') # 6
print(f'Edge feature dimension : {dg.edge_x_dim}') # 5
print(f'Static node feature dim : {dg.static_node_x_dim}') # 11
print(f'Dynamic node feature dim : {dg.node_x_dim}') # 5
print(f'TimeDelta : {dg.time_delta}') # TimeDelta('s', value=1)
print(f'Device : {dg.device}') # torch.device(cpu)
# We can move the graph to GPU
dg = dg.to('cuda')
print(f'Device : {dg.device}') # torch.device(cuda:0)
Note: The number of nodes is computed as the largest node ID in the graph plus one (here, `8 + 1 = 9`).
Note: If the `DGraph` is empty, `start_time` and `end_time` are `None`.
Note: `len()` returns the number of timestamps (not the number of events) in the graph.
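The values printed above can be checked by hand from the toy data. The sketch below recomputes them in plain Python; it assumes the edge times `[1, 5, 10]` from the pandas example and that the node count is the largest node ID plus one, which matches the printed value of 9.

```python
# Toy data from the construction examples, as plain Python lists.
edge_index = [(2, 2), (2, 4), (1, 8)]
edge_time = [1, 5, 10]
node_x_nids = [2, 4, 6]
node_x_time = [1, 2, 3]

# Node IDs appear both in edges and in dynamic node events.
all_node_ids = [n for edge in edge_index for n in edge] + node_x_nids
num_nodes = max(all_node_ids) + 1
num_edge_events = len(edge_time)
num_events = len(edge_time) + len(node_x_time)
# Timestamps shared by an edge event and a node event are counted once.
num_timestamps = len(set(edge_time) | set(node_x_time))

all_times = edge_time + node_x_time
print(min(all_times), max(all_times))                          # 1 10
print(num_nodes, num_edge_events, num_events, num_timestamps)  # 9 3 6 5
```

Note that the timestamp `1` is shared by an edge event and a node event, which is why there are 6 events but only 5 distinct timestamps.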
Slicing: Creating new views
You can create a new DGraph view by slicing the underlying data. Currently, we support slicing by time, or by event index. Both operations are lightweight, as the storage is shared between DGraph instances. This makes it very fast to select subsets of your data.
To slice by time, use `slice_time()`, which returns a new `DGraph` containing only the events within the specified time range.
Note: Both slicing operations are end-exclusive.
Following from our previous code snippet:
sliced_dg = dg.slice_time(start_time=5, end_time=10)
print(sliced_dg.start_time) # 5
print(sliced_dg.end_time) # 9, end time exclusive
print(sliced_dg.num_edge_events) # 1
print(sliced_dg.device) # still on gpu
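Why slicing is cheap can be illustrated with a stand-alone sketch: a view only stores an index range into shared, time-sorted storage, and slicing narrows that range with a binary search. `TimeView` is a hypothetical class written for this tutorial, not part of TGM, and it ignores features and devices entirely.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeView:
    """Hypothetical read-only view: a [lo, hi) index range over shared storage."""

    times: tuple  # shared, time-sorted storage; never copied by slicing
    lo: int
    hi: int

    def slice_time(self, start_time, end_time):
        # bisect_left returns the first index whose time is >= the bound,
        # so events at exactly end_time fall outside the new view.
        lo = max(self.lo, bisect_left(self.times, start_time))
        hi = min(self.hi, bisect_left(self.times, end_time))
        return TimeView(self.times, lo, max(lo, hi))

    def materialize(self):
        """Copy out the events covered by this view."""
        return list(self.times[self.lo:self.hi])


storage = (1, 5, 10)
full = TimeView(storage, 0, len(storage))
sliced = full.slice_time(start_time=5, end_time=10)
print(sliced.materialize())     # [5]
print(sliced.times is storage)  # True
```

Data is only copied when `materialize` is called; creating the sliced view itself allocates nothing beyond two integers.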
7. Materialization, Iteration and DGBatch
In practice, the typical workflow will require you to feed data into your model for training. For this purpose, we need to materialize the view.
The method on DGraph looks like:
def materialize(self, materialize_features: bool = True) -> DGBatch:
    """Materialize the current DGraph slice into a dense `DGBatch`.

    Args:
        materialize_features (bool, optional): If True, includes dynamic node
            features, node IDs/times, and edge features. Defaults to True.

    Returns:
        DGBatch: A batch containing edge_src, edge_dst, edge_time, and optionally
            features from the current slice.
    """
As described above, the output is a DGBatch object, which is nothing but a container of tensors corresponding to the materialized data of the DGraph, on device. By default, the DGBatch contains the following attributes:
@dataclass
class DGBatch:
    """Container for a batch of events/materialized data from a DGraph.

    Each `DGBatch` holds edge and node information for a slice of a dynamic graph,
    including optional dynamic node features and edge features. Hooks read and write
    additional attributes to the container transparently during dataloading.

    Args:
        edge_src (Tensor): Source node indices for edges in the batch. Shape `(E,)`.
        edge_dst (Tensor): Destination node indices for edges in the batch. Shape `(E,)`.
        edge_time (Tensor): Timestamps of each edge event. Shape `(E,)`.
        edge_x (Tensor | None, optional): Edge features for the batch. Tensor of shape `(T x V x V x d_edge)`.
        edge_type (Tensor | None, optional): Type of each edge. Shape `(E,)`
        node_x (Tensor | None, optional): Dynamic node features for nodes in the batch. Tensor of shape `(T x V x d_node_dynamic)`.
        node_x_time (Tensor | None, optional): Timestamps corresponding to dynamic node features.
        node_x_nids (Tensor | None, optional): Node IDs corresponding to dynamic node features.
        node_y (Tensor | None, optional): Dynamic node labels for nodes in the batch. Tensor of shape `(T x V x d_node_labels)`.
        node_y_time (Tensor | None, optional): Timestamps corresponding to dynamic node labels.
        node_y_nids (Tensor | None, optional): Node IDs corresponding to dynamic node labels.
    """
For example:
# Our full graph view
dg_batch = dg.materialize(materialize_features=False) # Skip features
print(dg_batch.edge_src) # torch.tensor([2, 2, 1], dtype=torch.long, device='cuda:0')
print(dg_batch.edge_x) # None, because we skipped materializing features
# Our sliced graph view (from start_time=5, end_time=10)
sliced_dg_batch = sliced_dg.materialize()
print(sliced_dg_batch.edge_src) # torch.tensor([2], dtype=torch.long, device='cuda:0')
print(sliced_dg_batch.edge_x is None) # False, we materialized our slice of edge features
Note: Materializing a full graph view with features could be expensive, especially on large graphs.
Note: The device of the `DGraph` determines the device on which the `DGBatch` tensors are allocated.
DGDataLoader
Internally, the `DGDataLoader` is responsible for materializing slices of graph data, using exactly the mechanics described above. In particular, when you do something like:
from tgm import DGraph
from tgm.data import DGDataLoader
dg = DGraph(...)
loader = DGDataLoader(dg, ...)
for batch in loader:
    ...
the data loader computes offsets into the storage, performs slicing operations, materializes the sliced views, and then applies hooks on the materialized data. See our hook management tutorial for more details.
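The loop described above can be pictured with a stand-alone sketch that batches by a fixed number of events. It is not the `DGDataLoader` implementation: `iterate_batches` and `add_num_edges` are hypothetical names, batches are plain dictionaries rather than `DGBatch` objects, and hooks are modelled as simple callables.

```python
def iterate_batches(edges, batch_size, hooks=()):
    """Yield materialized batches of time-sorted (src, dst, time) edges."""
    for offset in range(0, len(edges), batch_size):  # compute offsets
        chunk = edges[offset:offset + batch_size]    # take the next slice of events
        batch = {                                    # materialize the slice
            'edge_src': [src for src, _, _ in chunk],
            'edge_dst': [dst for _, dst, _ in chunk],
            'edge_time': [t for _, _, t in chunk],
        }
        for hook in hooks:                           # hooks may add attributes
            batch = hook(batch)
        yield batch


def add_num_edges(batch):
    """Hypothetical hook that records how many edges the batch holds."""
    batch['num_edges'] = len(batch['edge_src'])
    return batch


edges = [(2, 2, 1), (2, 4, 5), (1, 8, 10)]
for batch in iterate_batches(edges, batch_size=2, hooks=[add_num_edges]):
    print(batch['edge_time'], batch['num_edges'])
# [1, 5] 2
# [10] 1
```

Running hooks after materialization keeps the slicing step cheap and lets each hook see a complete batch.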
Summary
We learned how `DGData` is used for loading and preprocessing data. We discussed how to create data splits and discretize your dataset to coarser time granularities. Once your data is loaded, you cast it to `DGraph`, which is an immutable view of a slice of data. We showed how to query various attributes of a `DGraph`, and how to slice the `DGraph` into temporal snapshots. Finally, we showed how to materialize the data into a `DGBatch` for training, and how the `DGDataLoader` does this internally during iteration.
With this foundation, you're ready to explore hook management and get started with our examples. Please feel free to reach out to us if anything is unclear or unintuitive. We are happy to discuss and improve your experience with TGM.