15-418/618 Spring 2026 Final Project
Andrew Koulogeorge
Home | Midterm | Final Report | GitHub
I am going to implement the Barnes-Hut N-body simulation algorithm from scratch in CUDA and do detailed profiling and analysis to understand its behavior and optimize its performance through several iterative implementations. I will evaluate performance across different problem sizes, initial body configurations, and approximation thresholds.
The Barnes-Hut algorithm estimates pairwise gravitational forces between a collection of N bodies by approximating groups of bodies sufficiently far away as a single body with a corresponding center of mass. This reduces the complexity of the algorithm from O(N²) to O(N log N).
The input data to the algorithm is a collection of points in 3D space. The main computational
loop includes: (1) constructing an octree where each node represents a region of space,
(2) computing the center of mass for each node in the tree, and (3) traversing the tree to
compute the gravitational force on each particle.
Gain experience in performance engineering and analysis in a setting outside a problem set with standard performance benchmarks. This project is very similar in flavor to one of the 3 programming labs during the semester. With a longer timeline and less pressure to hit a fixed performance threshold, I am excited to gain experience with profiling tools and obtain deep insight into various aspects of this parallel program.
The structure of the tree is data-dependent and changes across time steps, creating several parallelism challenges (non exhaustive list):
I will use the NVIDIA GPUs available on the GHC cluster machines. I am starting the implementation from scratch. I plan to reference various online resources and papers related to Barnes-Hut implementations in CUDA, but not simply reimplement them. I want to explore my own ideas for optimization based on my own measurments. Barnes-Hut on Wikipedia Barnes-Hut on GPU
Plan to Achieve: Implement a sequential Barnes-Hut algorithm in C++. Implement a naive CUDA version and at least two optimized CUDA version driven by profiling data. Evaluate how each implementation scales across: number of bodies, initial spatial configuration, and the theta threshold for force approximation. Compare my best implementation against published state-of-the-art GPU Barnes-Hut implementations.
Hope to Achieve: Analyze performance of CUDA code on a different GPU generations hypothesize about causes of performance differences based on architectural details.
Poster Deliverable: Present a comparison of my implementations against a sequential baseline with analysis of the measurements and profiling data that motivated each optimization. Included detailed analysis of my best implementation. Include comparison against existing state-of-the-art implementations.
Coming soon.
Coming soon.