articles

low poly chains via raymarching

11 Jun 2025

I’ve long had the dream of creating high resolution chains on characters with raymarching. The problem is that Unity’s object transform is based on the character’s hip bone, so making raymarched geometry “stick” to characters is impossible. I believe that I’ve finally solved this.

One draw call, many raymarched objects.

TLDR:

Main ideas and HLSL

The core idea is to make it possible for each fragment of a material to learn an origin point’s location and orientation. If you can recover an origin point and a rotation, then you can raymarch inside that coordinate system, then translate back to object coordinates at the end.
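
Concretely, if a fragment can recover the submesh origin $o$ and a rotation $R$ (stored as a quaternion), the transform into and out of the raymarching frame is just the following (a sketch of the idea; $R$ is applied as a quaternion rotation in the shader):

\begin{align*}
p_{\text{local}} &= R^{-1} (p_{\text{obj}} - o) \\
p_{\text{obj}} &= R \, p_{\text{local}} + o
\end{align*}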

For each submesh* in a mesh, I bake an origin point and an orientation.

* A submesh is just a set of vertices connected by edges. A mesh might contain many unconnected submeshes. For example, in Blender, you can combine two objects with Ctrl+J. I call those two combined but unconnected things submeshes.

The orientation of the submesh is derived from the face normals. I sort the faces in the submesh by their area. The normal of the largest face is used as the first basis vector of our rotated coordinate system. Then I take the next face whose normal is sufficiently orthogonal to the first basis vector (absolute value of the dot product is below some epsilon). I orthogonalize those two basis vectors with Gram-Schmidt, then generate the third with a cross product. I ensure right-handedness by checking that the determinant is positive, then convert the basis to a quaternion. I then store that quaternion in 2 UV channels.
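
In math terms, the basis construction is ordinary Gram-Schmidt plus a cross product. With $n_1$ and $n_2$ the two selected face normals (a sketch of the math, not the exact plugin code):

\begin{align*}
b_1 &= \frac{n_1}{\lVert n_1 \rVert} \\
b_2 &= \frac{n_2 - (n_2 \cdot b_1)\, b_1}{\lVert n_2 - (n_2 \cdot b_1)\, b_1 \rVert} \\
b_3 &= b_1 \times b_2
\end{align*}

The matrix $[b_1\; b_2\; b_3]$ is then converted to the quaternion that gets baked.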

The rotation quaternion is recovered on the GPU as follows:

float4 GetRotation(v2f i, float2 uv_channels) {
  float4 quat;
  quat.xy = get_uv_by_channel(i, uv_channels.x);
  quat.zw = get_uv_by_channel(i, uv_channels.y);
  return quat;
}
...
RayMarcherOutput MyRayMarcher(v2f i) {
...
  float2 uv_channels = float2(1, 2);
  float4 quat = GetRotation(i, uv_channels);
  float4 iquat = float4(-quat.xyz, quat.w);
}
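
get_uv_by_channel isn't shown above. A minimal sketch of what it could look like is below, assuming the v2f struct carries its UV sets as float2 fields named uv0 through uv3 (those field names are assumptions, not the exact struct from my shader):

float2 get_uv_by_channel(v2f i, float channel) {
  // Select one of the interpolated UV sets by index.
  // Assumes v2f has members uv0..uv3; rename to match your struct.
  switch ((uint) channel) {
    case 0: return i.uv0;
    case 1: return i.uv1;
    case 2: return i.uv2;
    default: return i.uv3;
  }
}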

It’s worth lingering here for a second. Each submesh is conceptualized as a rotated bounding box, and we just deduced an orthonormal basis for that rotated coordinate system. That means the artist can move and rotate their bounding boxes however they want in Blender, and the plugin will automatically work out how to orient things. It Just Works.

The origin point is simply the average of all the vertex locations. I encode it as a vector from each vertex to that location, and stuff it into vertex colors. Since vertex colors can only encode numbers in the range [0, 1], I use the alpha channel to store a per-vertex scale factor that restores the vector’s original length.

I made two non-obvious decisions in the way I bake the vertex offsets:

  1. The offsets are encoded in terms of the rotated coordinate system. This saves one quaternion rotation in the shader.

  2. The offsets are scaled according to the L-infinity norm (the max norm, i.e. Chebyshev distance) rather than the standard L2 norm (Euclidean distance). This lets the artist think in terms of the bounding box dimensions rather than the square root of the sum of squares of the box’s dimensions. If your box is 1x0.6x0.2, then you can just raymarch a primitive with those dimensions and your SDF will line up with the box.

The origin point is recovered on the GPU as follows:

float3 GetFragToOrigin(v2f i) {
  return (i.color.rgb * 2.0f - 1.0f) / i.color.a;
}
RayMarcherOutput MyRayMarcher(v2f i) {
...
  float3 frag_to_origin = GetFragToOrigin(i);
}
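
Reading that decode backwards pins down what the baker has to store. With $d$ the frag-to-origin offset in the rotated frame and $a$ the vertex alpha, the baked colors must satisfy the following (a sketch of the constraint implied by the code above, not the exact baking code):

\begin{align*}
c_{rgb} = \frac{a \cdot d + 1}{2}
\quad\Longrightarrow\quad
\frac{2 \cdot c_{rgb} - 1}{a} = d
\end{align*}

where $a$ is chosen so that every component of $a \cdot d$ lands in $[-1, 1]$.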

With those pieces in place, the raymarcher is pretty standard, but some care has to be taken when getting into and out of the coordinate system. Here’s a complete example in HLSL:

RayMarcherOutput MyRayMarcher(v2f i) {
  float3 obj_space_camera_pos = mul(unity_WorldToObject,
      float4(_WorldSpaceCameraPos, 1.0)).xyz;
  float3 frag_to_origin = GetFragToOrigin(i);

  float2 uv_channels = float2(1, 2);
  float4 quat = GetRotation(i, uv_channels);
  float4 iquat = float4(-quat.xyz, quat.w);

  // ro is already expressed in terms of rotated basis vectors, so we don't
  // have to rotate it again.
  float3 ro = -frag_to_origin;
  float3 rd = normalize(i.objPos - obj_space_camera_pos);
  rd = rotate_vector(rd, iquat);

  float d;
  float d_acc = 0;
  const float epsilon = 1e-3f;
  const float max_d = 1;

  [loop]
  for (uint ii = 0; ii < CUSTOM30_MAX_STEPS; ++ii) {
    float3 p = ro + rd * d_acc;
    d = map(p);
    d_acc += d;
    if (d < epsilon) break;
    if (d_acc > max_d) break;
  }
  clip(epsilon - d);

  float3 localHit = ro + rd * d_acc;
  float3 objHit = rotate_vector(localHit, quat);
  float3 objCenterOffset = rotate_vector(frag_to_origin, quat);

  RayMarcherOutput o;
  o.objPos = objHit + (i.objPos + objCenterOffset);
  float4 clipPos = UnityObjectToClipPos(o.objPos);
  o.depth = clipPos.z / clipPos.w;

  // Calculate normal in rotated space using standard raymarcher gradient
  // technique
  float3 sdfNormal = calc_normal(localHit);
  float3 objNormal = rotate_vector(sdfNormal, quat);
  o.normal = UnityObjectToWorldNormal(objNormal);

  return o;
}
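
The example leans on two helpers that aren't shown: rotate_vector and calc_normal. Minimal sketches of both are below, assuming quaternions are stored as (x, y, z, w) with unit norm (as the conjugate trick above implies); map() is whatever SDF you're marching:

float3 rotate_vector(float3 v, float4 q) {
  // Standard unit-quaternion rotation: v' = v + 2*cross(q.xyz, cross(q.xyz, v) + q.w*v).
  return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
}

float3 calc_normal(float3 p) {
  // Central-difference gradient of the SDF.
  const float2 e = float2(1e-4, 0.0);
  return normalize(float3(
      map(p + e.xyy) - map(p - e.xyy),
      map(p + e.yxy) - map(p - e.yxy),
      map(p + e.yyx) - map(p - e.yyx)));
}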

Scalability and limitations

  1. This technique is extremely scalable. I have a world with 16,000 bounding boxes that runs at ~800 microseconds/frame without volumetrics.

  2. You can have overlapping raymarched geometry without paying the usual 8x slowdown of domain repetition.

You still pay the price of overdraw, and unlike domain repetition, there’s no built-in compute budgeting: with domain repetition you’d hit your iteration cap and stop, but with this approach you won’t.

  3. The workflow is artist friendly. You can move, scale, and rotate your geometry freely. Re-bake once you’re done and everything just works.

  4. Shearing works, but doesn’t permit re-baking.

Test setup, no shearing.
Shear in Blender but don’t re-bake.
Shear in Blender and re-bake.
Shear in Unity.

Blender and Unity tooling

I’ve written a Blender plugin that lets me bake the vectors and quaternions described above.

Blender overview.

The plugin supports baking vectors and quaternions on extremely large meshes primarily through caching. If your mesh contains many submeshes that are simply translated in space, then baking should take less than a second. If those submeshes are scaled, skewed, or rotated, then they won’t cache and baking will take longer.

The baker lets you rotate the baked quaternion around the basis vectors. I had to fuck with this a fair bit, and eventually found that 180 degrees worked. If you run into trouble, try going through every combination of 90-degree rotations (4 per axis, 64 total). Use Quick Exporter to speed up the process. You can visualize the vectors with my Unity script, which is described below.

Baker options.

It also supports a bunch of other workflows, mostly designed for the voxel world creation workflow:

  1. Select all linked submeshes. This just does Ctrl+L for each submesh with at least one vert, edge, or face selected. Blender’s built-in Ctrl+L seems to be inconsistent in its behavior.

  2. Select linked across boundaries. This basically does Ctrl+L, but lets the meshes be disconnected as long as they have a vert that’s within some epsilon of a selected vert. That epsilon is configurable. It’s scalable up to thousands of submeshes.

  3. Deduplicate submeshes. This just looks for submeshes where all of their verts are close to the verts of another submesh. The closeness parameter (epsilon) is configurable. It works via spatial hashing, so it’s extremely scalable.

  4. Merge by distance per submesh. This just iterates over all submeshes and does a merge by distance on each. When working with large collections of submeshes, it’s easy to accidentally duplicate a face/edge/vert along the way, and these duplications can stack up. This lets you recover.

  5. Pack UV island by submesh Z. This lets you pack UV islands for large collections of submeshes and sort them by their Blender z axis height. Buggy as shit rn, sorry!

This is less relevant, but I wanted some way to instance axis-aligned geometry along a curve and sort each instance’s UVs by Z height. These nodes do that. Put them on a curve and select your instance. Then use the “Pack UV island by submesh Z” plugin tool to actually pack them.

Instance axis-aligned geometry and sort UVs.

Finally, I have a Unity script which lets you visualize the raw baked vectors, and the “corrected” baked vectors, i.e. those rotated with the baked quaternion. Simply attach “Decode vertex vectors” to your gameobject. The light blue vectors are raw vectors, and the orange ones are the corrected ones. The orange ones should converge at the center of each submesh. (It’s okay if they overshoot/undershoot, you can correct for that in your SDF.)

Visualize baked data in Unity.

how much CO2 do American cars produce?

23 May 2025

TLDR: About $1.520 \cdot 10^{12}$ kg/year. This increases the CO$_2$ in the atmosphere by about 0.048% per year.

Let’s gather some facts:

Assume that the weighted average car is getting 20 mpg. This includes passenger and freight. Passenger cars are higher and freight vehicles are lower.

Then:

\begin{align*}
& (265{,}653{,}749 \text{ Americans}) \\
& \cdot (13{,}476 \text{ miles} / (\text{year} \cdot \text{American})) \\
& \cdot (18.73 \text{ pounds of CO$_2$} / \text{gallon of gas}) \\
& \div (20.0 \text{ miles} / \text{gallon}) \\
&= 3.352 \cdot 10^{12} \text{ pounds/year} \\
&= 1.520 \cdot 10^{12} \text{ kg/year}
\end{align*}

Quick unit analysis to sanity check that equation:

\begin{align*}
(\text{people}) \cdot (\text{miles}/(\text{people} \cdot \text{year})) &\rightarrow \text{miles/year} \\
(\text{miles/year}) / (\text{miles/gallon}) &\rightarrow \text{gallons/year} \\
(\text{gallons/year}) \cdot (\text{pounds/gallon}) &\rightarrow \text{pounds/year}
\end{align*}

Checks out.

The atmosphere weighs about $5.15 \cdot 10^{18}$ kg (Lide, David R. Handbook of Chemistry and Physics. Boca Raton, FL: CRC, 1996: 14–17).

By mole fraction, the atmosphere is about 78.08% N$_2$, 20.95% O$_2$, 0.93% Ar, and 0.04% CO$_2$ (Wikipedia).

Using the periodic table, one mole of each molecule weighs:

\begin{align*}
N_2 &= 14.007 \cdot 2 = 28.014 \text{ g} \\
O_2 &= 15.999 \cdot 2 = 31.998 \text{ g} \\
Ar &= 39.95 \text{ g} \\
CO_2 &= 12.011 + 15.999 \cdot 2 = 44.009 \text{ g}
\end{align*}

The weight of one mole of atmosphere is then:

\begin{align*}
& 0.7808 \cdot 28.014 \text{ g} \\
+\; & 0.2095 \cdot 31.998 \text{ g} \\
+\; & 0.0093 \cdot 39.95 \text{ g} \\
+\; & 0.0004 \cdot 44.009 \text{ g} \\
=\; & 28.966 \text{ g}
\end{align*}

Since the atmosphere is 0.04% CO$_2$ by mole fraction, the fraction of the atmosphere’s mass that is CO$_2$ is $44.009 \text{ g} \cdot 0.0004 / 28.966 \text{ g} = 0.0006077$. We established above that the atmosphere weighs $5.15 \cdot 10^{18}$ kg, so the weight of all the CO$_2$ in the atmosphere is therefore $3.129 \cdot 10^{15}$ kg.

We know that American cars emit $1.520 \cdot 10^{12}$ kg/year of CO$_2$. We know that the CO$_2$ in the atmosphere weighs $3.129 \cdot 10^{15}$ kg. Therefore, every year, American drivers increase the CO$_2$ in the atmosphere by a fraction of:

$(1.520 \cdot 10^{12}) / (3.129 \cdot 10^{15}) = 0.00048$

or 0.048%.

\blacksquare

This guy used CO$_2$ ppm readings + the known mass of the atmosphere to arrive at a figure of 3,208 Gt, matching my 3,129 Gt figure very closely. Wikipedia cites a figure of 3,341 Gt using the same ppm + total mass technique. So we’re all within a pretty tight range of each other.

That Wikipedia article also claims that we’ve only increased the CO$_2$ in the atmosphere by ~50% since the beginning of the Industrial Revolution. If so, that kinda tracks with our figures. If we assume that Americans have been emitting at the current rate (fewer but shittier cars in the past) for about 50 years, that works out to a total contribution of 2.5% just from our cars.
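
A quick back-of-the-envelope check of that figure, using the per-year fraction derived above:

\begin{align*}
0.00048 \text{ per year} \cdot 50 \text{ years} = 0.024 \approx 2.5\%
\end{align*}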

We know that cars are not the dominant form of CO$_2$ emissions. British Petroleum publishes an amazing, annual statistical review of global energy trends. Let’s pore over the 2022 document (link). In 2022, Americans emitted 4.701 Gt of CO$_2$ (page 12). Thus cars contributed 32.33% of our total CO$_2$ budget. In the same year, China emitted about 10.523 Gt of CO$_2$ (page 12). Much of that can be seen as Americans offloading their emissions to China in the form of manufacturing. Finally, we see that the entire world’s emissions amount to about 33.884 Gt of CO$_2$ per year. American drivers are therefore responsible for about 4.485% of that budget.

If we synthesize our “2.5% of the CO$_2$ in the air is from American drivers” number with the figure above that American drivers account for about 5% of global emissions, we get a cumulative increase of about 50% from global emissions as a whole. That also matches what Wikipedia claims: that CO$_2$ in the atmosphere has increased by about 50% since the start of the Industrial Revolution.

So through basic analysis of public data and a couple reasonable inferences, we have arrived at the same conclusion as the “entrenched academics”: that the change in CO$_2$ in the atmosphere over the last 200 years is due to human activity.

“big llms are memory bound”

22 May 2025

There is wisdom oft repeated that “big neural nets are limited by memory bandwidth.” This is utter horseshit and I will show why.

LLMs are typically implemented as autoregressive feed-forward neural nets. This means that to generate a sentence, you provide a prompt, which the neural net uses to generate the next token. That prompt + token is fed back into the neural net repeatedly until it produces an end-of-sequence token, marking the end of generation.

We want to derive an equation predicting token rate $T$. Let’s define some variables:

$T$: token rate (tokens / second)

$M$: memory bandwidth (bytes / second)

$P$: model size (parameters)

$C$: compute throughput (parameters / second)

$Q$: model quantization (bytes / parameter)

Since each token requires accessing the entire model’s parameters, then on an infinitely powerful computer:

$T = \frac{M}{P \cdot Q}$

As the model size $P$ grows, token rate $T$ drops; as memory bandwidth $M$ grows, token rate $T$ increases. Likewise, quantizing the model eases memory pressure, so reducing bytes/param $Q$ increases token rate $T$. This is all expected.

However, most of our computers do not have infinite compute throughput. We must then adjust our equation:

$T = \frac{\min(\frac{M}{Q},\, C)}{P}$

Token rate $T$ increases until we saturate compute $C$ or memory bandwidth $\frac{M}{Q}$, then it stops. Totally reasonable.

Notably, token rate drops uniformly as parameter count increases, no matter which term wins the $\min$. Whether you’re memory bound or compute bound depends only on how $\frac{M}{Q}$ compares to $C$; the parameter count $P$ just scales the denominator and has nothing to do with it. The common wisdom that “big models are memory bound lol” is complete horseshit.

This equation helps you balance your compute against your memory bandwidth. You can calculate your system’s memory bandwidth as follows, assuming you have DDR5:

$M_c$: memory channels

$M_s$: memory speed (GT/s)

$M = M_s \cdot 8 \cdot M_c$ (each DDR5 channel moves 8 bytes per transfer)

(Source: Wikipedia)

So if you have 12 channels of DDR5 @ 6000 MT/s, that works out to $12 \cdot 8 \cdot 6 = 576$ GB/s.

Consider a model like DeepSeek-V3-0324 in a 2.42-bit quant. This bad boy is a mixture of experts (MoE) with 37B activated parameters per token. So at 2.42 bits / parameter, that works out to ~11.19 GB / token. Assuming infinite compute, the upper bound on token generation rate is 576 / 11.19 = 51.46 tokens / second.
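
Spelled out against the equation above:

\begin{align*}
P \cdot Q &= 37 \cdot 10^9 \text{ params} \cdot \frac{2.42 \text{ bits/param}}{8 \text{ bits/byte}} \approx 11.19 \text{ GB/token} \\
T &\le \frac{M}{P \cdot Q} = \frac{576 \text{ GB/s}}{11.19 \text{ GB/token}} \approx 51.5 \text{ tokens/s}
\end{align*}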

I hate to be the bearer of bad news: you will not see this token rate. On my shitass server with an EPYC 9115 CPU and 12 channels of ECC DDR5 @ 6000 MT/s, I only see 4.6 tok/s. That implies my CPU has less than a tenth of the compute needed to saturate the memory subsystem. I’m using a recent build of llama-cli for this test, and a relatively small context window (8k max).

In conclusion:

  1. The theory behind token rate is very simple once you grok that LLMs are just autoregressors, and that they need to stream every active parameter out of memory once per token to operate.
  2. You can extrapolate expected performance from smaller models, since memory bandwidth and compute dictate throughput in inverse proportion to model size.
  3. People on the internet (especially redditors) are fucking stupid.

meow meow meow meow

14 Apr 2025

meow meow meow meow meow meow meow meow. meow meow meow meow, meow meow meow meow meow meow meow.

meow meow meow meow meow. meow meow meow meow meow meow meow, meow meow meow. meow meow meow. meow meow meow meow meow meow meow meow meow. meow meow meow; meow, meow meow meow meow meow meow meow.

meow meow meow meow meow. meow meow meow. meow meow.

riding crop

7 Apr 2025

Image of a 3D model of a riding crop.

Click here to download my riding crop from gumroad. See the gumroad page for setup instructions.

Gumroad suspended my account over this product. Yes, over a fucking riding crop. That’s why it’s hosted here. Enjoy the 100% discount <3

a panoply of frameworks

3 Apr 2025

I want to use electron. I know that raw CSS sucks dick so let’s use a framework. Bootstrap sucks so let’s use tailwind. Oh wait tailwind has a build step? Okay let’s use the CLI. Wait, I’m going to need to be able to plumb runtime data eventually. I think that’s what react is for right? Uhhh if I’m using react is the tailwind CLI going to be good enough? It seems like vite is what people are using for tailwind+react. Okay let’s just commit to that. Hmm this is a lot of setup, should I use a template? Oh wait the main template people are using advertises “full access to node.js apis from the renderer process.” That seems like a terrible fucking idea. Good thing I actually read the electron docs.

I want to die.

electron first impressions

1 Apr 2025

Occasionally I want to build some throwaway app for use by other people. CLIs are nice and all, but they’re hard to launch from VR, and most people have never interacted with a terminal. So I need some way to write a GUI. Enter electron.

Electron is a cross-platform UI framework. It bundles an entire Chromium install (gross), but in return you can basically just use standard web dev practices.

It exposes a two-process model: one main process, and one renderer process. The main process has basically unfettered access to the OS, and the renderer process has unfettered access to the DOM (document object model - the runtime structure of an HTML webpage). The two processes talk to each other through channels.

Generating a distributable is easy with forge-cli. My main nitpick here is that I think the default maker should be the zip maker, not the installer. Installers give me the headache of having to remember to uninstall the thing once it most likely fails to work. Isolated environments with no hidden side effects are simply better. Switching to zip is a simple matter of editing the default forge.config.js and moving ‘win32’ to the maker-zip block.

The generated .zip works basically as expected: it contains a bunch of dependencies and an .exe. Put the .zip in a directory, extract it, double click the .exe, and your app opens. (One more nit: the zip should contain a subdirectory so you can extract without manually creating a directory for it.)

The hello world package is heavy but not as bad as I expected: 10.6MB disk (compressed), 282MB disk (uncompressed), 0.0% CPU, 65MB memory. Memory is basically in line with what I was getting with wxWidgets - I think that was around 30 MB with my entire STT app built in. Worse, but IMO within the realm of reasonability. Time to first draw is pretty good - under a second according to the eyeball test.

hello world :3

20 Mar 2025