<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Self-Gravity on Keqi's blog</title><link>https://yekq.top/en/tags/self-gravity/</link><description>Recent content in Self-Gravity on Keqi's blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>plloningye@gmail.com (Keqi Ye)</managingEditor><webMaster>plloningye@gmail.com (Keqi Ye)</webMaster><copyright>Keqi Ye</copyright><lastBuildDate>Tue, 07 Apr 2026 14:00:00 +0800</lastBuildDate><atom:link href="https://yekq.top/en/tags/self-gravity/index.xml" rel="self" type="application/rss+xml"/><item><title>Self-Gravity Tree Code Efficiency and Optimization Analysis in GASPHiA</title><link>https://yekq.top/en/posts/gasphia/treecode/</link><pubDate>Tue, 07 Apr 2026 14:00:00 +0800</pubDate><author>plloningye@gmail.com (Keqi Ye)</author><guid>https://yekq.top/en/posts/gasphia/treecode/</guid><description>&lt;iframe src="//player.bilibili.com/player.html?isOutside=true&amp;aid=116482834958240&amp;bvid=BV1u69yBGEwJ&amp;cid=37919327464&amp;p=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"
style="width: 100%; aspect-ratio: 16/9; border-radius: 8px; margin-bottom: 20px;">
&lt;/iframe>
&lt;p align="center" style="font-size: 0.9em; color: gray; margin-top: -10px; margin-bottom: 20px;">
&lt;em> 500-Million Particle Pure Gravitational N-Body Simulation: Tidal Tail Structures from a Galaxy Collision
This simulation uses the pure N-body module of GASPHiA. The initial model was built with the open-source tool DICE and includes both a full dark matter halo and a stellar disk component. It highlights how particles in the galactic disk are drawn into characteristic tidal tails by gravitational perturbations during the interaction.&lt;/em>
&lt;/p>
&lt;h2 id="why-consider-gravity">Why Consider Gravity?
&lt;/h2>&lt;p>In astrophysical SPH simulations, beyond the impact process itself, the self-gravity of the colliding bodies plays a crucial role in the final outcome. This is also why we often omit material strength models when simulating solid planet impacts and instead treat the bodies as pure fluids: at these scales, self-gravity dominates over material strength. For some giant impact simulations, gravity not only must be computed in real time during the simulation, but the Poisson equation must also be solved in the pre-processing stage to construct an initial body in gravitational equilibrium before the impact calculation.&lt;/p>
&lt;p>The gravity discussed here, namely self-gravity, refers to the mutual attractive force between SPH particles due to their mass distribution. This is entirely different from the gravity model used in hydrodynamic simulations (such as dam breaks) — the latter only needs to apply a uniform constant acceleration pointing toward the ground for each particle.&lt;/p>
&lt;h2 id="how-to-compute-self-gravity-and-why-do-we-need-a-tree-code">How to Compute Self-Gravity, and Why Do We Need a Tree Code?
&lt;/h2>&lt;p>In SPH simulations, computing self-gravity essentially means solving the gravitational force on each particle. According to Newton&amp;rsquo;s law of universal gravitation, any two particles exert a gravitational force on each other. For a system with $N$ particles, computing all pairwise interactions directly requires $O(N^2)$ operations. When the number of particles in astrophysical simulations reaches millions or even tens of millions, this direct summation becomes prohibitively expensive.&lt;/p>
&lt;p>This is the core motivation for introducing the &lt;strong>Tree Code&lt;/strong>. The fundamental idea stems from a simple physical intuition: when a distant cluster of particles is far enough away, we don&amp;rsquo;t need to compute the contribution of each particle individually. Instead, we can approximate the entire cluster as a single equivalent mass point located at its center of mass. Through this approximation, the computational complexity can be dramatically reduced from $O(N^2)$ to $O(N\log N)$.&lt;/p>
&lt;p>&lt;img src="https://yekq.top/posts/gasphia/treecode/concept.png"
width="297"
height="320"
srcset="https://yekq.top/posts/gasphia/treecode/concept_huc14661fad8e86edd86be0c107bea60bf_14013_480x0_resize_box_3.png 480w, https://yekq.top/posts/gasphia/treecode/concept_huc14661fad8e86edd86be0c107bea60bf_14013_1024x0_resize_box_3.png 1024w"
loading="lazy"
alt="Conceptual diagram of tree-based gravity computation, from reference [2]"
class="gallery-image"
data-flex-grow="92"
data-flex-basis="222px"
>&lt;/p>
&lt;p>The Tree Code was first proposed by Barnes and Hut [1], so the tree structure based on their work is typically called the &lt;strong>Barnes-Hut Tree&lt;/strong>. When implementing this data structure on the CUDA architecture, GASPHiA draws on the parallel implementation approach of reference [2].&lt;/p>
&lt;p>However, the method in reference [2] was designed for pure N-body simulations, where the data structure only needs to serve self-gravity computation. The SPH method, by contrast, must not only compute self-gravity but also rely on the same tree for efficient neighbor searching. This difference in requirements means GASPHiA&amp;rsquo;s final tree code structure departs significantly from that of reference [2], mainly in two respects: first, the management strategy for the number of child nodes per tree node; second, the warp voting mechanism during parallel tree traversal. Reference [2] only needs to test the condition for a gravitational interaction, while GASPHiA, as an SPH code, must additionally fold the neighbor-search logic into the same traversal.&lt;/p>
&lt;p>Despite these differences in implementation details, both approaches maintain a high degree of consistency in their core concept of spatial recursive partitioning. Therefore, readers can still use reference [2] as an important reference for understanding the underlying spatial partitioning logic.&lt;/p>
&lt;h2 id="efficiency-comparison-and-optimization">Efficiency Comparison and Optimization
&lt;/h2>&lt;h3 id="background">Background
&lt;/h3>&lt;p>Currently, GASPHiA implements self-gravity computation based on the Barnes-Hut Tree. Unlike reference [2], however, it does not yet spatially sort particles before tree traversal, so its efficiency is necessarily lower than that of a sorted implementation. To demonstrate the power of the Barnes-Hut Tree intuitively, we implemented brute-force pairwise self-gravity as a reference, and used the unsorted efficiency as a baseline to explore the improvement gained by adding a sorting step.&lt;/p>
&lt;h3 id="computation-kernels">Computation Kernels
&lt;/h3>&lt;p>The tree-based self-gravity computation consists of the following kernels:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Reset Tree Structure&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="kt">void&lt;/span> &lt;span class="n">SPHOctree&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="n">resetOctree&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">resetOctreeKernel&lt;/span>&lt;span class="o">&amp;lt;&amp;lt;&amp;lt;&lt;/span>&lt;span class="n">numBlocks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ThreadsPerBlock&lt;/span>&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_child&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_count&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_start&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_sorted&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_com_mass&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_hmax&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_mutex&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_index&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">num_particles&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">max_nodes&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaGetLastError&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compute Particle Bounding Box&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;span class="lnt">2
&lt;/span>&lt;span class="lnt">3
&lt;/span>&lt;span class="lnt">4
&lt;/span>&lt;span class="lnt">5
&lt;/span>&lt;span class="lnt">6
&lt;/span>&lt;span class="lnt">7
&lt;/span>&lt;span class="lnt">8
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="kt">void&lt;/span> &lt;span class="n">SPHOctree&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="n">computeBoundingBox&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">RealType4&lt;/span> &lt;span class="o">*&lt;/span>&lt;span class="n">d_particles&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">computeMin&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">d_particles&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">d_reduceTmp&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_particles&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">d_bounding_box_min&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">computeMax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">d_particles&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">d_reduceTmp&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_particles&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">d_bounding_box_max&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaDeviceSynchronize&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaGetLastError&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Tree Top-Down&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="kt">void&lt;/span> &lt;span class="n">SPHOctree&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="n">buildTree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">RealType4&lt;/span> &lt;span class="o">*&lt;/span>&lt;span class="n">d_positions&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">buildTreeKernel&lt;/span>&lt;span class="o">&amp;lt;&amp;lt;&amp;lt;&lt;/span>&lt;span class="n">numBlocks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ThreadsPerBlock&lt;/span>&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">d_positions&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_com_mass&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_count&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_start&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_child&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_index&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_bounding_box_min&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_bounding_box_max&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">num_particles&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">max_nodes&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaGetLastError&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaDeviceSynchronize&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compute Center of Mass&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="kt">void&lt;/span> &lt;span class="n">SPHOctree&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="n">computeCenterOfMass&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">computeCenterOfMassKernel&lt;/span>&lt;span class="o">&amp;lt;&amp;lt;&amp;lt;&lt;/span>&lt;span class="n">numBlocks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ThreadsPerBlock&lt;/span>&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_com_mass&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_index&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">num_particles&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaGetLastError&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">CUDA_CHECK&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cudaDeviceSynchronize&lt;/span>&lt;span class="p">());&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Compute Self-Gravity&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="n">computeGravityKernel&lt;/span>&lt;span class="o">&amp;lt;&amp;lt;&amp;lt;&lt;/span>&lt;span class="n">numBlocks&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ThreadsPerBlock&lt;/span>&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">d_positions&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_node_com_mass&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_child&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">d_accelerations&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_bounding_box_min&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">d_bounding_box_max&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">num_particles&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">theta&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">theta&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">this&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="n">constG&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;/li>
&lt;/ol>
&lt;h3 id="performance-bottleneck-analysis">Performance Bottleneck Analysis
&lt;/h3>&lt;p>According to reference [2], the most time-consuming part of the entire tree code workflow is the final step: traversing the tree structure to compute self-gravity. The root cause is a mismatch between thread access patterns and the spatial distribution of the data: if particles are not explicitly sorted, particles handled by the same Warp may be far apart in physical space. This spatial scatter directly leads to severe divergence among the threads of a Warp when making pruning decisions:&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;span class="lnt">2
&lt;/span>&lt;span class="lnt">3
&lt;/span>&lt;span class="lnt">4
&lt;/span>&lt;span class="lnt">5
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="kt">bool&lt;/span> &lt;span class="n">mac_satisfied&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">child&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">||&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="o">!&lt;/span>&lt;span class="n">is_active&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">||&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">w_sq&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">theta_sq&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="n">r_sq&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">__all_sync&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mh">0xffffffff&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">mac_satisfied&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// Prune
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>When the particles in a Warp each evaluate their current target node, some threads may satisfy the pruning condition (&lt;code>mac_satisfied&lt;/code> is true) while others still need to traverse deeper (false). Since CUDA&amp;rsquo;s Warp execution follows the single-instruction, multiple-thread (SIMT) model, all threads must execute the same instruction path. Therefore, as long as any one thread in the Warp fails the pruning condition, the entire Warp must descend into the child nodes, even if the majority could have terminated early. This divergence in execution paths, caused by data-dependent decisions differing between threads, greatly weakens the pruning mechanism and wastes substantial compute on unnecessary visits to deep nodes.&lt;/p>
&lt;h3 id="performance-test-and-comparison">Performance Test and Comparison
&lt;/h3>&lt;p>To verify the above analysis, I first conducted a test: discretizing a cube to obtain regularly arranged particles, using single-precision computation with 1 million particles. The test results show that traversing the tree to compute gravity is indeed the most time-consuming step:&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Regular arrangement (equivalent to roughly sorted)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Calculating Gravity on GPU &lt;span class="o">(&lt;/span>Barnes-Hut, &lt;span class="nv">theta&lt;/span>&lt;span class="o">=&lt;/span>0.5&lt;span class="o">)&lt;/span>...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--- Gravity Computation Profiling ---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> resetOctree: 11.6206 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> computeBoundingBox: 0.4268 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> buildTree: 45.4690 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> computeCoM: 0.1025 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    computeGravity:          202.4100 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    -------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Shuffled (no sorting)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--- Gravity Computation Profiling ---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    resetOctree:             10.7203 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    computeBoundingBox:      0.4361 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    buildTree:               44.1606 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    computeCoM:              0.1303 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    computeGravity:          2955.5117 ms
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> -------------------------------------
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>&lt;strong>Key finding&lt;/strong>: The runtimes differ by &lt;strong>15x&lt;/strong>, and the only change between the two runs is whether the particles are shuffled. Without shuffling, the regular initial arrangement is effectively already sorted; with shuffling, the runtime skyrockets.&lt;/p>
&lt;p>Detailed Profile data is as follows:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Metric Name&lt;/th>
&lt;th style="text-align:left">Meaning&lt;/th>
&lt;th style="text-align:center">Regular&lt;/th>
&lt;th style="text-align:center">Shuffled&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>Elapsed Cycles&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Total GPU clock cycles consumed by Kernel&lt;/td>
&lt;td style="text-align:center">~0.29 Billion&lt;/td>
&lt;td style="text-align:center">~2.58 Billion&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>Duration&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Actual kernel runtime&lt;/td>
&lt;td style="text-align:center">~0.19 s&lt;/td>
&lt;td style="text-align:center">~3.10 s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>SM Frequency&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Average streaming multiprocessor frequency&lt;/td>
&lt;td style="text-align:center">~1.55 GHz&lt;/td>
&lt;td style="text-align:center">~830 MHz&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>Compute (SM) Throughput&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Compute unit busyness&lt;/td>
&lt;td style="text-align:center">~80.4%&lt;/td>
&lt;td style="text-align:center">~79.2%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>Memory Throughput&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Overall memory subsystem busyness&lt;/td>
&lt;td style="text-align:center">~61.5%&lt;/td>
&lt;td style="text-align:center">~59.0%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;span style="color: red;">&lt;strong>L2 Cache Throughput&lt;/strong>&lt;/span>&lt;/td>
&lt;td style="text-align:left">&lt;span style="color: red;">L2 cache access throughput&lt;/span>&lt;/td>
&lt;td style="text-align:center">&lt;span style="color: red;">~23.9%&lt;/span>&lt;/td>
&lt;td style="text-align:center">&lt;span style="color: red;">~14.1%&lt;/span>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;span style="color: red;">&lt;strong>DRAM Throughput&lt;/strong>&lt;/span>&lt;/td>
&lt;td style="text-align:left">&lt;span style="color: red;">VRAM direct access throughput&lt;/span>&lt;/td>
&lt;td style="text-align:center">&lt;span style="color: red;">~0.35%&lt;/span>&lt;/td>
&lt;td style="text-align:center">&lt;span style="color: red;">~15.7%&lt;/span>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;strong>Achieved Occupancy&lt;/strong>&lt;/td>
&lt;td style="text-align:left">Actual active Warp ratio&lt;/td>
&lt;td style="text-align:center">~98.6%&lt;/td>
&lt;td style="text-align:center">~95.2%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>From the table, it is clear that after shuffling, &lt;strong>L2 Cache Throughput&lt;/strong> drops from 23.9% to 14.1%, while &lt;strong>DRAM Throughput&lt;/strong> increases from 0.35% to 15.7%. This demonstrates that particle sorting is crucial for improving data locality, fully utilizing L2 cache, and reducing direct VRAM access — directly causing the 15x performance difference.&lt;/p>
&lt;h3 id="sorting-algorithm">Sorting Algorithm
&lt;/h3>&lt;p>The shuffle experiment in the previous section showed that &lt;strong>spatial locality&lt;/strong> is decisive when implementing tree algorithms on CUDA. Better locality means the 32 threads of a Warp are more likely to visit the same tree nodes while traversing the octree, which greatly reduces Warp divergence, maximizes L1/L2 cache hit rates, and avoids massive direct reads and writes to slow DRAM.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: Although I use the phrase &amp;ldquo;reduce Warp divergence&amp;rdquo;, the threads do not actually branch onto different code paths; the warp simply opens too many nodes, because the &lt;code>mac_satisfied&lt;/code> votes within the warp strongly disagree. In that loose sense, I still borrow the term &amp;ldquo;Warp divergence.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>However, in actual SPH or N-body simulations, as time progresses, particles violently collide and mix in space, and their indices in the memory array become completely decoupled from their real geometric positions. Therefore, we need to perform efficient spatial sorting after each frame&amp;rsquo;s tree construction.&lt;/p>
&lt;h4 id="1-core-idea-utilizing-natural-tree-topology-sorting">1. Core Idea: Utilizing Natural Tree Topology Sorting
&lt;/h4>&lt;p>Since the octree is itself a hierarchical partitioning of three-dimensional space, reading its leaf nodes in tree-traversal order (e.g., depth-first or breadth-first) yields a particle sequence that is clustered in three-dimensional space.&lt;/p>
&lt;h4 id="2-gather-addressing-strategy-sort-indices-only-dont-move-data">2. Gather Addressing Strategy: Sort Indices Only, Don&amp;rsquo;t Move Data
&lt;/h4>&lt;p>In a system with millions of particles, copying tens of megabytes of particle coordinates (float4 data) back and forth in global memory each time would not only be extremely time-consuming but also consume significant additional memory. We adopt the &lt;strong>Gather addressing strategy&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Do not move the actual particle positions in memory&lt;/li>
&lt;li>Generate a one-dimensional mapping array &lt;code>sort&lt;/code>&lt;/li>
&lt;li>&lt;code>sort[i]&lt;/code> stores &amp;ldquo;the actual particle number that thread i should process&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>During subsequent gravity computation, adjacent threads within the same Warp read adjacent values from the &lt;code>sort&lt;/code> array. This way, particles processed by the same Warp are spatially close, and their tree traversal paths are also likely to be consistent.&lt;/p>
&lt;h4 id="3-top-down-sort">3. Top-Down Sort
&lt;/h4>&lt;p>To complete the sort at extreme speed on the GPU, we reuse the &lt;code>count&lt;/code> values (subtree particle counts) already computed during the tree-building phase and adopt a top-down parallel allocation scheme: each parent node carves out an exclusive, non-overlapping interval of the &lt;code>sort&lt;/code> array for each of its children, sized by that child&amp;rsquo;s &lt;code>count&lt;/code>.&lt;/p>
&lt;h4 id="4-profile-validation">4. Profile Validation
&lt;/h4>&lt;p>In theory, after tree sorting the traversal paths diverge even less, because particles handled by the same Warp almost always share the same ancestor nodes. To quantify this improvement, I reduced the particle count to 1 million and profiled the following three computation modes:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Computation Mode&lt;/th>
&lt;th style="text-align:left">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">1. Fully Random&lt;/td>
&lt;td style="text-align:left">No tree sorting, particles randomly distributed in space&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">2. Initial Regular&lt;/td>
&lt;td style="text-align:left">No tree sorting, particles regularly arranged in space&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">3. Tree Sorting&lt;/td>
&lt;td style="text-align:left">Fully random but using tree sorting&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Profile command:&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt">1
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">ncu --set full -f -o profile/no_shuffle_no_sort_profile -k computeGravityKernel -s &lt;span class="m">2&lt;/span> -c &lt;span class="m">2&lt;/span> ./sph_simulator
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Command parameter descriptions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Parameter&lt;/th>
&lt;th style="text-align:left">Meaning&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">&lt;code>--set full&lt;/code>&lt;/td>
&lt;td style="text-align:left">Collect all available performance metrics&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;code>-f&lt;/code>&lt;/td>
&lt;td style="text-align:left">Force overwrite existing output files&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;code>-o profile/xxx&lt;/code>&lt;/td>
&lt;td style="text-align:left">Specify output file path and name&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;code>-k computeGravityKernel&lt;/code>&lt;/td>
&lt;td style="text-align:left">Specify the Kernel name to Profile&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;code>-s 2&lt;/code>&lt;/td>
&lt;td style="text-align:left">Skip the first 2 iterations after program startup before starting Profile (avoid cold start effects)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;code>-c 2&lt;/code>&lt;/td>
&lt;td style="text-align:left">Execute 2 repetitions and average (reduce measurement error)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Comparison results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Computation Mode&lt;/th>
&lt;th style="text-align:center">Kernel Time&lt;/th>
&lt;th style="text-align:center">Executed Instructions&lt;/th>
&lt;th style="text-align:center">L1/TEX Hit Rate&lt;/th>
&lt;th style="text-align:center">Memory Throughput&lt;/th>
&lt;th style="text-align:center">Executed IPC&lt;/th>
&lt;th style="text-align:center">Warp Avg Instruction Cycles (CPI)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">1. Fully Random&lt;/td>
&lt;td style="text-align:center">33.13 ms&lt;/td>
&lt;td style="text-align:center">246.2 Billion&lt;/td>
&lt;td style="text-align:center">48.23%&lt;/td>
&lt;td style="text-align:center">2.37 GB/s&lt;/td>
&lt;td style="text-align:center">2.97&lt;/td>
&lt;td style="text-align:center">13.75 cycles&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">2. Initial Regular&lt;/td>
&lt;td style="text-align:center">5.13 ms&lt;/td>
&lt;td style="text-align:center">41.9 Billion&lt;/td>
&lt;td style="text-align:center">57.52%&lt;/td>
&lt;td style="text-align:center">8.14 GB/s&lt;/td>
&lt;td style="text-align:center">2.99&lt;/td>
&lt;td style="text-align:center">13.44 cycles&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">3. Tree Sorting&lt;/td>
&lt;td style="text-align:center">3.11 ms&lt;/td>
&lt;td style="text-align:center">22.1 Billion&lt;/td>
&lt;td style="text-align:center">58.66%&lt;/td>
&lt;td style="text-align:center">12.25 GB/s&lt;/td>
&lt;td style="text-align:center">2.74&lt;/td>
&lt;td style="text-align:center">13.28 cycles&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>From the results, &lt;strong>tree sorting mode (3.11 ms) is roughly 10x faster than fully random mode (33.13 ms)&lt;/strong> and also beats the initial regular mode (5.13 ms), confirming that spatial sorting based on the tree topology significantly improves self-gravity computation efficiency.&lt;/p>
&lt;p>GASPHiA now runs with the tree-sorting optimization enabled.&lt;/p>
&lt;h3 id="performance-curves">Performance Curves
&lt;/h3>&lt;p>The following two figures show the performance comparison of different computation modes across varying particle counts (linear and logarithmic scales):&lt;/p>
&lt;p>&lt;img src="https://yekq.top/posts/gasphia/treecode/performance_plot_linear.png"
width="1000"
height="700"
srcset="https://yekq.top/posts/gasphia/treecode/performance_plot_linear_huafae19ad656c60aef7f71be576aca324_80463_480x0_resize_box_3.png 480w, https://yekq.top/posts/gasphia/treecode/performance_plot_linear_huafae19ad656c60aef7f71be576aca324_80463_1024x0_resize_box_3.png 1024w"
loading="lazy"
alt="Performance validation linear scale"
class="gallery-image"
data-flex-grow="142"
data-flex-basis="342px"
>&lt;/p>
&lt;p>&lt;img src="https://yekq.top/posts/gasphia/treecode/performance_plot.png"
width="1000"
height="700"
srcset="https://yekq.top/posts/gasphia/treecode/performance_plot_hu64e62595c1a297bc5695bbbe5a2bbf85_98245_480x0_resize_box_3.png 480w, https://yekq.top/posts/gasphia/treecode/performance_plot_hu64e62595c1a297bc5695bbbe5a2bbf85_98245_1024x0_resize_box_3.png 1024w"
loading="lazy"
alt="Performance validation logarithmic scale"
class="gallery-image"
data-flex-grow="142"
data-flex-basis="342px"
>&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: In the figures above, &amp;ldquo;Build Tree&amp;rdquo; includes all steps such as bounding-box computation and sorting. Compared to the tree traversals for gravity or neighbor search, the tree-building time is negligible.&lt;/p>
&lt;/blockquote>
&lt;h3 id="accuracy-and-speedup-validation">Accuracy and Speedup Validation
&lt;/h3>&lt;p>I also tested the speedup and error of the Barnes-Hut algorithm relative to brute-force computation with 1 million particles. Note that the errors here are relative errors.&lt;/p>
&lt;p>&lt;strong>Performance and accuracy comparison across different $\theta$ parameters&lt;/strong>:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">$\theta$&lt;/th>
&lt;th style="text-align:center">Computation Time (ms)&lt;/th>
&lt;th style="text-align:center">Max Relative Error&lt;/th>
&lt;th style="text-align:center">Avg Relative Error&lt;/th>
&lt;th style="text-align:center">Speedup&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">0.1&lt;/td>
&lt;td style="text-align:center">181.877&lt;/td>
&lt;td style="text-align:center">0.00219153&lt;/td>
&lt;td style="text-align:center">3.27768e-05&lt;/td>
&lt;td style="text-align:center">7.42231x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.2&lt;/td>
&lt;td style="text-align:center">39.5222&lt;/td>
&lt;td style="text-align:center">0.00502609&lt;/td>
&lt;td style="text-align:center">6.86399e-05&lt;/td>
&lt;td style="text-align:center">34.1566x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.3&lt;/td>
&lt;td style="text-align:center">14.9356&lt;/td>
&lt;td style="text-align:center">0.0214612&lt;/td>
&lt;td style="text-align:center">0.00015996&lt;/td>
&lt;td style="text-align:center">90.3847x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.4&lt;/td>
&lt;td style="text-align:center">8.75622&lt;/td>
&lt;td style="text-align:center">0.0335255&lt;/td>
&lt;td style="text-align:center">0.00028779&lt;/td>
&lt;td style="text-align:center">154.17x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.5&lt;/td>
&lt;td style="text-align:center">5.88032&lt;/td>
&lt;td style="text-align:center">0.0711218&lt;/td>
&lt;td style="text-align:center">0.00053247&lt;/td>
&lt;td style="text-align:center">229.57x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.6&lt;/td>
&lt;td style="text-align:center">4.20147&lt;/td>
&lt;td style="text-align:center">0.0565245&lt;/td>
&lt;td style="text-align:center">0.000783245&lt;/td>
&lt;td style="text-align:center">321.303x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.7&lt;/td>
&lt;td style="text-align:center">3.52717&lt;/td>
&lt;td style="text-align:center">0.106591&lt;/td>
&lt;td style="text-align:center">0.00153542&lt;/td>
&lt;td style="text-align:center">382.728x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">0.8&lt;/td>
&lt;td style="text-align:center">3.21018&lt;/td>
&lt;td style="text-align:center">0.170889&lt;/td>
&lt;td style="text-align:center">0.002965&lt;/td>
&lt;td style="text-align:center">420.521x&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Accuracy validation figure:&lt;/p>
&lt;p>&lt;img src="https://yekq.top/posts/gasphia/treecode/Accuracy.png"
width="1000"
height="600"
srcset="https://yekq.top/posts/gasphia/treecode/Accuracy_hu2514c82add25ddfeced948d86ddae09b_66968_480x0_resize_box_3.png 480w, https://yekq.top/posts/gasphia/treecode/Accuracy_hu2514c82add25ddfeced948d86ddae09b_66968_1024x0_resize_box_3.png 1024w"
loading="lazy"
alt="Accuracy validation"
class="gallery-image"
data-flex-grow="166"
data-flex-basis="400px"
>&lt;/p>
&lt;p>&lt;strong>Conclusion&lt;/strong>: As $\theta$ increases, accuracy decreases (error grows) while speed improves dramatically. At $\theta = 0.5$ we obtain a roughly 230x speedup while keeping the maximum relative error near 7%, which makes it a reasonable balance point.&lt;/p>
&lt;h2 id="references">References
&lt;/h2>&lt;p>[1] Barnes J, Hut P. A hierarchical O(N log N) force-calculation algorithm. Nature. 1986 Dec 4;324(6096):446-9.&lt;/p>
&lt;p>[2] Burtscher M, Pingali K. An efficient CUDA implementation of the tree-based Barnes-Hut n-body algorithm. In: GPU Computing Gems Emerald Edition. Morgan Kaufmann; 2011. pp. 75-92.&lt;/p>