articles

low poly chains via raymarching

11 Jun 2025

I’ve long had the dream of creating high resolution chains on characters with raymarching. The problem is that Unity’s object transform is based on the character’s hip bone, so making raymarched geometry “stick” to characters is impossible. I believe that I’ve finally solved this.

One draw call, many raymarched objects.

TLDR:

Main ideas and HLSL

The core idea is to make it possible for each fragment of a material to learn an origin point’s location and orientation. If you can recover an origin point and a rotation, then you can raymarch inside that coordinate system, then translate back to object coordinates at the end.
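
Concretely, if a fragment can recover the submesh origin $o$ and a rotation $R$ (stored as a quaternion), the transform into and out of the raymarching frame is just the following (a sketch of the idea; $R$ is applied as a quaternion rotation in the shader):

\begin{align*}
p_{\text{local}} &= R^{-1} (p_{\text{obj}} - o) \\
p_{\text{obj}} &= R \, p_{\text{local}} + o
\end{align*}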

For each submesh* in a mesh, I bake an origin point and an orientation.

* A submesh is just a set of vertices connected by edges. A mesh might contain many unconnected submeshes. For example, in Blender, you can combine two objects with Ctrl+J. I call those two combined but unconnected things submeshes.

The orientation of the submesh is derived from the face normals. I sort the faces in the submesh by their area. The normal of the largest face is used as the first basis vector of our rotated coordinate system. Then I take the next face whose normal is sufficiently orthogonal to the first basis vector (absolute value of the dot product is below some epsilon). I orthogonalize those two basis vectors with Gram-Schmidt, then generate the third with a cross product. I ensure right-handedness by checking that the determinant is positive, then convert the basis to a quaternion. I then store that quaternion in 2 UV channels.
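
In math terms, the basis construction is ordinary Gram-Schmidt plus a cross product. With $n_1$ and $n_2$ the two selected face normals (a sketch of the math, not the exact plugin code):

\begin{align*}
b_1 &= \frac{n_1}{\lVert n_1 \rVert} \\
b_2 &= \frac{n_2 - (n_2 \cdot b_1)\, b_1}{\lVert n_2 - (n_2 \cdot b_1)\, b_1 \rVert} \\
b_3 &= b_1 \times b_2
\end{align*}

The matrix $[b_1\; b_2\; b_3]$ is then converted to the quaternion that gets baked.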

The rotation quaternion is recovered on the GPU as follows:

float4 GetRotation(v2f i, float2 uv_channels) {
  float4 quat;
  quat.xy = get_uv_by_channel(i, uv_channels.x);
  quat.zw = get_uv_by_channel(i, uv_channels.y);
  return quat;
}
...
RayMarcherOutput MyRayMarcher(v2f i) {
...
  float2 uv_channels = float2(1, 2);
  float4 quat = GetRotation(i, uv_channels);
  float4 iquat = float4(-quat.xyz, quat.w);
}
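
get_uv_by_channel isn't shown above. A minimal sketch of what it could look like is below, assuming the v2f struct carries its UV sets as float2 fields named uv0 through uv3 (those field names are assumptions, not the exact struct from my shader):

float2 get_uv_by_channel(v2f i, float channel) {
  // Select one of the interpolated UV sets by index.
  // Assumes v2f has members uv0..uv3; rename to match your struct.
  switch ((uint) channel) {
    case 0: return i.uv0;
    case 1: return i.uv1;
    case 2: return i.uv2;
    default: return i.uv3;
  }
}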

It’s worth lingering here for a second. Each submesh is conceptualized as a rotated bounding box, and we just deduced an orthonormal basis for that rotated coordinate system. That means the artist can move and rotate their bounding boxes however they want in Blender, and the plugin will automatically work out how to orient things. It Just Works.

The origin point is simply the average of all the vertex locations. I encode it as a vector from each vertex to that location, and stuff it into vertex colors. Since vertex colors can only encode numbers in the range [0, 1], I use the alpha channel to store a per-vertex scale factor that restores the vector’s original length.

I made two non-obvious decisions in the way I bake the vertex offsets:

  1. The offsets are encoded in terms of the rotated coordinate system. This saves one quaternion rotation in the shader.

  2. The offsets are scaled according to the L-infinity norm (the max norm, i.e. Chebyshev distance) rather than the standard L2 norm (Euclidean distance). This lets the artist think in terms of the bounding box dimensions rather than the square root of the sum of squares of the box’s dimensions. If your box is 1x0.6x0.2, then you can just raymarch a primitive with those dimensions and your SDF will line up with the box.

The origin point is recovered on the GPU as follows:

float3 GetFragToOrigin(v2f i) {
  return (i.color.rgb * 2.0f - 1.0f) / i.color.a;
}
RayMarcherOutput MyRayMarcher(v2f i) {
...
  float3 frag_to_origin = GetFragToOrigin(i);
}
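
Reading that decode backwards pins down what the baker has to store. With $d$ the frag-to-origin offset in the rotated frame and $a$ the vertex alpha, the baked colors must satisfy the following (a sketch of the constraint implied by the code above, not the exact baking code):

\begin{align*}
c_{rgb} = \frac{a \cdot d + 1}{2}
\quad\Longrightarrow\quad
\frac{2 \cdot c_{rgb} - 1}{a} = d
\end{align*}

where $a$ is chosen so that every component of $a \cdot d$ lands in $[-1, 1]$.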

With those pieces in place, the raymarcher is pretty standard, but some care has to be taken when getting into and out of the coordinate system. Here’s a complete example in HLSL:

RayMarcherOutput MyRayMarcher(v2f i) {
  float3 obj_space_camera_pos = mul(unity_WorldToObject,
      float4(_WorldSpaceCameraPos, 1.0)).xyz;
  float3 frag_to_origin = GetFragToOrigin(i);

  float2 uv_channels = float2(1, 2);
  float4 quat = GetRotation(i, uv_channels);
  float4 iquat = float4(-quat.xyz, quat.w);

  // ro is already expressed in terms of rotated basis vectors, so we don't
  // have to rotate it again.
  float3 ro = -frag_to_origin;
  float3 rd = normalize(i.objPos - obj_space_camera_pos);
  rd = rotate_vector(rd, iquat);

  float d;
  float d_acc = 0;
  const float epsilon = 1e-3f;
  const float max_d = 1;

  [loop]
  for (uint ii = 0; ii < CUSTOM30_MAX_STEPS; ++ii) {
    float3 p = ro + rd * d_acc;
    d = map(p);
    d_acc += d;
    if (d < epsilon) break;
    if (d_acc > max_d) break;
  }
  clip(epsilon - d);

  float3 localHit = ro + rd * d_acc;
  float3 objHit = rotate_vector(localHit, quat);
  float3 objCenterOffset = rotate_vector(frag_to_origin, quat);

  RayMarcherOutput o;
  o.objPos = objHit + (i.objPos + objCenterOffset);
  float4 clipPos = UnityObjectToClipPos(o.objPos);
  o.depth = clipPos.z / clipPos.w;

  // Calculate normal in rotated space using standard raymarcher gradient
  // technique
  float3 sdfNormal = calc_normal(localHit);
  float3 objNormal = rotate_vector(sdfNormal, quat);
  o.normal = UnityObjectToWorldNormal(objNormal);

  return o;
}
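
The example leans on two helpers that aren't shown: rotate_vector and calc_normal. Minimal sketches of both are below, assuming quaternions are stored as (x, y, z, w) with unit norm (as the conjugate trick above implies); map() is whatever SDF you're marching:

float3 rotate_vector(float3 v, float4 q) {
  // Standard unit-quaternion rotation: v' = v + 2*cross(q.xyz, cross(q.xyz, v) + q.w*v).
  return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
}

float3 calc_normal(float3 p) {
  // Central-difference gradient of the SDF.
  const float2 e = float2(1e-4, 0.0);
  return normalize(float3(
      map(p + e.xyy) - map(p - e.xyy),
      map(p + e.yxy) - map(p - e.yxy),
      map(p + e.yyx) - map(p - e.yyx)));
}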

Scalability and limitations

  1. This technique is extremely scalable. I have a world with 16,000 bounding boxes that runs at ~800 microseconds/frame without volumetrics.

  2. You can have overlapping raymarched geometry without paying the usual 8x slowdown of domain repetition.

You still pay the price of overdraw, and unlike domain repetition, there’s no built-in compute budgeting: with domain repetition you’d hit your iteration cap and stop, but with this approach you won’t.

  3. The workflow is artist friendly. You can move, scale, and rotate your geometry freely. Re-bake once you’re done and everything just works.

  4. Shearing works, but doesn’t permit re-baking.

Test setup, no shearing.
Shear in Blender but don’t re-bake.
Shear in Blender and re-bake.
Shear in Unity.

Blender and Unity tooling

I’ve written a Blender plugin that lets me bake the vectors and quaternions described above.

Blender overview.

The plugin supports baking vectors and quaternions on extremely large meshes primarily through caching. If your mesh contains many submeshes that are simply translated in space, then baking should take less than a second. If those submeshes are scaled, skewed, or rotated, then they won’t cache and baking will take longer.

The baker lets you rotate the baked quaternion around the basis vectors. I had to fuck with this a fair bit, and eventually found that 180 degrees worked. If you run into trouble, try going through every combination of 90-degree rotations (4 per axis, 64 total). Use Quick Exporter to speed up the process. You can visualize the vectors with my Unity script, which is described below.

Baker options.

It also supports a bunch of other workflows, mostly designed for the voxel world creation workflow:

  1. Select all linked submeshes. This just does Ctrl+L for each submesh with at least one vert, edge, or face selected. Blender’s built-in Ctrl+L seems to be inconsistent in its behavior.

  2. Select linked across boundaries. This basically does Ctrl+L, but lets the meshes be disconnected as long as they have a vert that’s within some epsilon of a selected vert. That epsilon is configurable. It’s scalable up to thousands of submeshes.

  3. Deduplicate submeshes. This just looks for submeshes where all of their verts are close to the verts of another submesh. The closeness parameter (epsilon) is configurable. It works via spatial hashing, so it’s extremely scalable.

  4. Merge by distance per submesh. This just iterates over all submeshes and does a merge by distance on each. When working with large collections of submeshes, it’s easy to accidentally duplicate a face/edge/vert along the way, and these duplications can stack up. This lets you recover.

  5. Pack UV island by submesh Z. This lets you pack UV islands for large collections of submeshes and sort them by their Blender z axis height. Buggy as shit rn, sorry!

This is less relevant, but I wanted some way to instance axis-aligned geometry along a curve and sort each instance’s UVs by Z height. These nodes do that. Put them on a curve and select your instance. Then use the “Pack UV island by submesh Z” plugin tool to actually pack them.

Instance axis-aligned geometry and sort UVs.

Finally, I have a Unity script which lets you visualize the raw baked vectors, and the “corrected” baked vectors, i.e. those rotated with the baked quaternion. Simply attach “Decode vertex vectors” to your gameobject. The light blue vectors are raw vectors, and the orange ones are the corrected ones. The orange ones should converge at the center of each submesh. (It’s okay if they overshoot/undershoot, you can correct for that in your SDF.)

Visualize baked data in Unity.

how much CO2 do American cars produce?

23 May 2025

TLDR: About $1.520 \cdot 10^{12}$ kg/year. This increases the CO$_2$ in the atmosphere by about 0.048% per year.

Let’s gather some facts:

Assume that the weighted average car is getting 20 mpg. This includes passenger and freight. Passenger cars are higher and freight vehicles are lower.

Then:

\begin{align*}
& (265{,}653{,}749 \text{ Americans}) \\
& \cdot (13{,}476 \text{ miles} / (\text{year} \cdot \text{American})) \\
& \cdot (18.73 \text{ pounds of CO$_2$} / \text{gallon of gas}) \\
& \div (20.0 \text{ miles} / \text{gallon}) \\
&= 3.352 \cdot 10^{12} \text{ pounds/year} \\
&= 1.520 \cdot 10^{12} \text{ kg/year}
\end{align*}

Quick unit analysis to sanity check that equation:

\begin{align*}
(\text{people}) \cdot (\text{miles}/(\text{people} \cdot \text{year})) &\rightarrow \text{miles/year} \\
(\text{miles/year}) / (\text{miles/gallon}) &\rightarrow \text{gallons/year} \\
(\text{gallons/year}) \cdot (\text{pounds/gallon}) &\rightarrow \text{pounds/year}
\end{align*}

Checks out.

The atmosphere weighs about $5.15 \cdot 10^{18}$ kg (Lide, David R. Handbook of Chemistry and Physics. Boca Raton, FL: CRC, 1996: 14–17).

By mole fraction, the atmosphere is about 78.08% N$_2$, 20.95% O$_2$, 0.93% Ar, and 0.04% CO$_2$ (Wikipedia).

Using the periodic table, one mole of each molecule weighs:

\begin{align*}
N_2 &= 14.007 \cdot 2 = 28.014 \text{ g} \\
O_2 &= 15.999 \cdot 2 = 31.998 \text{ g} \\
Ar &= 39.95 \text{ g} \\
CO_2 &= 12.011 + 15.999 \cdot 2 = 44.009 \text{ g}
\end{align*}

The weight of one mole of atmosphere is then:

\begin{align*}
& 0.7808 \cdot 28.014 \text{ g} \\
+\; & 0.2095 \cdot 31.998 \text{ g} \\
+\; & 0.0093 \cdot 39.95 \text{ g} \\
+\; & 0.0004 \cdot 44.009 \text{ g} \\
=\; & 28.966 \text{ g}
\end{align*}

Since the atmosphere is 0.04% CO$_2$ by mole fraction, the fraction of the atmosphere’s mass that is CO$_2$ is $44.009 \text{ g} \cdot 0.0004 / 28.966 \text{ g} = 0.0006077$. We established above that the atmosphere weighs $5.15 \cdot 10^{18}$ kg, so the weight of all the CO$_2$ in the atmosphere is therefore $3.129 \cdot 10^{15}$ kg.

We know that American cars emit $1.520 \cdot 10^{12}$ kg/year of CO$_2$. We know that the CO$_2$ in the atmosphere weighs $3.129 \cdot 10^{15}$ kg. Therefore, every year, American drivers increase the CO$_2$ in the atmosphere by a fraction of:

$(1.520 \cdot 10^{12}) / (3.129 \cdot 10^{15}) = 0.00048$

or 0.048%.

\blacksquare

This guy used CO$_2$ ppm readings + the known mass of the atmosphere to arrive at a figure of 3,208 Gt, matching my 3,129 Gt figure very closely. Wikipedia cites a figure of 3,341 Gt using the same ppm + total mass technique. So we’re all within a pretty tight range of each other.

That Wikipedia article also claims that we’ve only increased the CO$_2$ in the atmosphere by ~50% since the beginning of the Industrial Revolution. If so, that kinda tracks with our figures. If we assume that Americans have been emitting at the current rate (fewer but shittier cars in the past) for about 50 years, that works out to a total contribution of 2.5% just from our cars.
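
A quick back-of-the-envelope check of that figure, using the per-year fraction derived above:

\begin{align*}
0.00048 \text{ per year} \cdot 50 \text{ years} = 0.024 \approx 2.5\%
\end{align*}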

We know that cars are not the dominant form of CO$_2$ emissions. British Petroleum publishes an amazing, annual statistical review of global energy trends. Let’s pore over the 2022 document (link). In 2022, Americans emitted 4.701 Gt of CO$_2$ (page 12). Thus cars contributed 32.33% of our total CO$_2$ budget. In the same year, China emitted about 10.523 Gt of CO$_2$ (page 12). Much of that can be seen as Americans offloading their emissions to China in the form of manufacturing. Finally, we see that the entire world’s emissions amount to about 33.884 Gt of CO$_2$ per year. American drivers are therefore responsible for about 4.485% of that budget.

If we synthesize our “2.5% of the CO$_2$ in the air is from American drivers” number with the figure above that American drivers account for about 5% of global emissions, we get a cumulative increase of about 50% from global emissions as a whole. That also matches what Wikipedia claims: that CO$_2$ in the atmosphere has increased by about 50% since the start of the Industrial Revolution.

So through basic analysis of public data and a couple reasonable inferences, we have arrived at the same conclusion as the “entrenched academics”: that the change in CO$_2$ in the atmosphere over the last 200 years is due to human activity.

“big llms are memory bound”

22 May 2025

There is wisdom oft repeated that “big neural nets are limited by memory bandwidth.” This is utter horseshit and I will show why.

LLMs are typically implemented as autoregressive feed-forward neural nets. This means that to generate a sentence, you provide a prompt, which the neural net uses to generate the next token. That prompt + token is fed back into the neural net repeatedly until it produces an end-of-sequence token, marking the end of generation.

We want to derive an equation predicting token rate $T$. Let’s define some variables:

$T$: token rate (tokens / second)

$M$: memory bandwidth (bytes / second)

$P$: model size (parameters)

$C$: compute throughput (parameters / second)

$Q$: model quantization (bytes / parameter)

Since each token requires accessing the entire model’s parameters, then on an infinitely powerful computer:

$T = \frac{M}{P \cdot Q}$

As the model size $P$ grows, token rate $T$ drops; as memory bandwidth $M$ grows, token rate $T$ increases. Likewise, quantizing the model eases memory pressure, so reducing bytes/param $Q$ increases token rate $T$. This is all expected.

However, most of our computers do not have infinite compute throughput. We must then adjust our equation:

$T = \frac{\min(\frac{M}{Q},\, C)}{P}$

Token rate $T$ increases until we saturate compute $C$ or memory bandwidth $\frac{M}{Q}$, then it stops. Totally reasonable.

Notably, token rate drops uniformly as parameter count increases, no matter which term wins the $\min$. Whether you’re memory bound or compute bound depends only on how $\frac{M}{Q}$ compares to $C$; the parameter count $P$ just scales the denominator and has nothing to do with it. The common wisdom that “big models are memory bound lol” is complete horseshit.

This equation helps you balance your compute against your memory bandwidth. You can calculate your system’s memory bandwidth as follows, assuming you have DDR5:

$M_c$: memory channels

$M_s$: memory speed (GT/s)

$M = M_s \cdot 8 \cdot M_c$ (each DDR5 channel moves 8 bytes per transfer)

(Source: Wikipedia)

So if you have 12 channels of DDR5 @ 6000 MT/s, that works out to $12 \cdot 8 \cdot 6 = 576$ GB/s.

Consider a model like DeepSeek-V3-0324 in a 2.42-bit quant. This bad boy is a mixture of experts (MoE) with 37B activated parameters per token. So at 2.42 bits / parameter, that works out to ~11.19 GB / token. Assuming infinite compute, the upper bound on token generation rate is 576 / 11.19 = 51.46 tokens / second.
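
Spelled out against the equation above:

\begin{align*}
P \cdot Q &= 37 \cdot 10^9 \text{ params} \cdot \frac{2.42 \text{ bits/param}}{8 \text{ bits/byte}} \approx 11.19 \text{ GB/token} \\
T &\le \frac{M}{P \cdot Q} = \frac{576 \text{ GB/s}}{11.19 \text{ GB/token}} \approx 51.5 \text{ tokens/s}
\end{align*}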

I hate to be the bearer of bad news: you will not see this token rate. On my shitass server with an EPYC 9115 CPU and 12 channels of ECC DDR5 @ 6000 MT/s, I only see 4.6 tok/s. That implies my CPU has less than a tenth of the compute needed to saturate the memory subsystem. I’m using a recent build of llama-cli for this test, and a relatively small context window (8k max).

In conclusion:

  1. The theory behind token rate is very simple once you grok that LLMs are just autoregressors, and that they need to stream every active parameter out of memory once per token to operate.
  2. You can extrapolate expected performance from smaller models, since memory bandwidth and compute dictate throughput in inverse proportion to model size.
  3. People on the internet (especially redditors) are fucking stupid.

meow meow meow meow

14 Apr 2025

meow meow meow meow meow meow meow meow. meow meow meow meow, meow meow meow meow meow meow meow.

meow meow meow meow meow. meow meow meow meow meow meow meow, meow meow meow. meow meow meow. meow meow meow meow meow meow meow meow meow. meow meow meow; meow, meow meow meow meow meow meow meow.

meow meow meow meow meow. meow meow meow. meow meow.

riding crop

7 Apr 2025

Image of a 3D model of a riding crop.

Click here to download my riding crop from gumroad. See the gumroad page for setup instructions.

Gumroad suspended my account over this product. Yes, over a fucking riding crop. That’s why it’s hosted here. Enjoy the 100% discount <3

a panoply of frameworks

3 Apr 2025

I want to use electron. I know that raw CSS sucks dick so let’s use a framework. Bootstrap sucks so let’s use tailwind. Oh wait tailwind has a build step? Okay let’s use the CLI. Wait, I’m going to need to be able to plumb runtime data eventually. I think that’s what react is for right? Uhhh if I’m using react is the tailwind CLI going to be good enough? It seems like vite is what people are using for tailwind+react. Okay let’s just commit to that. Hmm this is a lot of setup, should I use a template? Oh wait the main template people are using advertises “full access to node.js apis from the renderer process.” That seems like a terrible fucking idea. Good thing I actually read the electron docs.

I want to die.

electron first impressions

1 Apr 2025

Occasionally I want to build some throwaway app for use by other people. CLIs are nice and all, but they’re hard to launch from VR, and most people have never interacted with a terminal. So I need some way to write a GUI. Enter electron.

Electron is a cross-platform UI framework. It bundles an entire Chromium install (gross), but in return you can basically just use standard web dev practices.

It exposes a two-process model: one main process, and one renderer process. The main process has basically unfettered access to the OS, and the renderer process has unfettered access to the DOM (document object model - the runtime structure of an HTML webpage). The two processes talk to each other through channels.

Generating a distributable is easy with forge-cli. My main nitpick here is that I think the default maker should be the zip maker, not the installer. Installers give me the headache of having to remember to uninstall the thing once it most likely fails to work. Isolated environments with no hidden side effects are simply better. Switching to zip is a simple matter of editing the default forge.config.js and moving ‘win32’ to the maker-zip block.

The generated .zip works basically as expected: it contains a bunch of dependencies and an .exe. Put the .zip in a directory, extract it, double click the .exe, and your app opens. (One more nit: the zip should contain a subdirectory so you can extract without manually creating a directory for it.)

The hello world package is heavy but not as bad as I expected: 10.6MB disk (compressed), 282MB disk (uncompressed), 0.0% CPU, 65MB memory. Memory is basically in line with what I was getting with wxWidgets - I think that was around 30 MB with my entire STT app built in. Worse, but IMO within the realm of reasonability. Time to first draw is pretty good - under a second according to the eyeball test.

hello world :3

20 Mar 2025