Let's experiment with computation using the browser, and nothing else. It's 2020, so we already can run almost low-level assembly instructions code in WebAssembly. We have WebGL 1.0 consistently deployed in every device fully supporting GPU rendering API called OpenGL ES 2.0. And we have classes and destructuring in ES6, which is the best version of JavaScript web came up with. Let's ignore cool kids' toolchains and frameworks: these cause trouble impeding simple explanations of concepts. We're going to build everything literally from scratch.

Control: from a relay switch to a reality simulator

Let's explore the concept of control.

Control is the smallest computation step, it's a basic building block of computation. Control is a trivial operation. A basic building block of control is a switch, or a valve. Another word for "control" is "regulation". The mysterious combination of the controlled outcomes ("results", "outputs") is often called "computation".

WebGL theory: controlling browser in space and time

Terminology

GPU shader does this:

reads texture bits
intertwines texture bits
writes texture bits

There's literally no black magic beyond that.

To avoid confusion between pixels, texels, and fragments, let's define these:

Pixel: Visible on the screen picture element.
Texel: Texture element when shader reads it.
Fragment: Pixel, texel, or an argument of resulting one when shader writes result of computation. Called "fragment" rather than "pixel" because one resulting pixel can be composed of more than one fragment in many runs, plus things like stencil and depth buffers influence what the result will be. It also can be discard'ed: no write will happen.
Texture image unit: An array of texels available for one shader run. All GPUs support 8 simultaneous texture image units. Some systems can have 16, or 32.
Color buffer: An array of pixels into which shader writes. Most mobile platforms have only one color buffer, and it's typically 32 bits per pixel wide.

The multiplicity of all texture image units, all texture sizes, and the number of bits per texel define full addressable range of shader inputs.

Maximum texture size is 4096 by 4096 texels.

The number of bits per texel is 32 on mobile platforms, and 128 on laptops. 32 bits per texel comes from four R, G, B, and A channels 8 bits (0..255) each. 128 bits per texel comes from four 32-bit floating point numbers. 128 bits per texel is only available with OES_texture_float extension enabled. It's available on the most but very low-end platforms.

Texture image units comprise primary high-bandwidth input to shader.

Color buffers comprise the main output. On laptops there are 8 color buffers.

Color buffer size varies from 4096x4096 on phones, to 16384x16384 on laptops. Due to texture flipping technique, using those color buffers for GPGPU doesn't give a performance advantage. The only use of the color buffer resolutions above 4096 is for actual graphical rendering of visible images, which are not intended to be used as textures.

Our main goal is to do GPGPU rather than rendering graphics, so let's assume that the maximum color buffer size is 4096 by 4096, and only bump it up for special purposes (real-time graphics and high-resolution export).

Total addressable shader input

Low-end mobile:

8 texture units
*
4096 texels wide
*
4096 texels tall
*
4 RGBA channels
*
8 bits channel
=
4,294,967,296 bits
=
536,870,912 bytes = 512 MB.

Typical mobile:

8 texture units
*
4096 texels wide
*
4096 texels tall
*
4 RGBA channels
*
32 bits channel (float)
=
17,179,869,184 bits
=
2,147,483,648 bytes = 2 GB.

A simple laptop:

32 texture units
*
4096 texels wide
*
4096 texels tall
*
4 RGBA channels
*
32 bits channel (float)
=
68,719,476,736 bits
=
8,589,934,592 bytes = 8 GB.

High-end AMD and Nvidia GPUs allow addressing even larger texture memory.

Total addressable shader output

If texture unit memory is a potential access, color buffer memory is often full write. As optimization, we can render triangles and rectangles to only run shader over specified block of output memory, when we know for sure for which areas recomputation is needed.

Mobile color buffer
```
1 color buffer
*
4096 texels wide
*
4096 texels tall
*
4 RGBA channels
*
8 bits channel
=
536,870,912 bits
=
67,108,864 bytes = 64 MB.
```
Not much per a shader run. Note that visible mobile framebuffer often has less than 8 bits per channel, but it's irrelevant for GPGPU computation.

Laptop color buffer in GPGPU mode

8 color buffers
*
4096 texels wide
*
4096 texels tall
*
4 RGBA channels
*
32 bits channel (float)
=
17,179,869,184 bits
=
2,147,483,648 bytes = 2 GB.

Laptop color buffer in graphics mode makes sense to output just one color buffer, despite the support of 8, rendering an RGB image 768 MB large (16384x16384 size).

On most systems, memory bandwidth is the main limiting factor of how many bits can be computed per second. A laptop with DDR4-2400, for example, has 18.75 GB/s. iPhone XS Max' LPDDR4X has 8.33 GB/s.

Single shader run output per fragment

On mobile it's 32 bits. On desktop it's 32 times more, 1024 bits.

Discretizing to the minimum workload in fragments, it's a 1x1 fragment on a desktop system equalling to 8x4 fragments on a smartphone.

8 vec4 can be read matching precisely these:

8 fetches - from the same 8 output color buffers on desktop;
or 8 fetches from the same texture on mobile (doesn't have to be that way, but can).

Is there a performance difference between 8 different textures vs. the same texture accessed with 8 different coordinates in a GLSL shader?

On low-end mobile without floating point textures (should we even support these in 2020?) only 8 bits per channel RGBA textures can be read.

Making API GLSL decisions

In an attempt to establish a cohesive GLSL API (which can utilize resources in an efficient enough manner, and still preserve a great deal of compatibility, to maintain scalability of demonstrational and educational purposes, and just for the simplicity of experimentation), we need to come up with simple I/O interface, abstract enough to scale to other architectures (FPGA and ASIC included), representing interace to digital computation in the most general way: through a set of binary bits I/O surfaces; with internal state bits (memory).

All practical computing systems have these classes of bit surfaces:

INPUTS from external world

Happen asynchronously to internal computational processes. Can require buffering or external REQ signal synchronizer when implemented in FPGA. We split external inputs into two categories:

Synchronized REQ input signals. Assumed as already passed through two flip-flops clocked by any given current "computation wave" clock (there might be many local clocks).
Asynchronous input signals. Volatile data, coordinated by one of the REQ signals. In our computing code, to be consistent with FPGA and ASIC implementations, we're not allowed to read from these bits unless a REQ signal allows us. A simple way to think about this is to keep an assumption, that bits in this surface can change while a GPU shader is reading these.

None of these are automatically buffered. Special actions of copying or managing write/read pointers must taken. (More detail on the subject can be found in Daniel Chapiro's PhD thesis work on GALS - globally asynchronous-locally synchronous systems.)

OUTPUTS visible to external world

The resulting bits of computation. Each bit in this surface is naturally backed by a memory flip-flop, so external world consumer can read it at any moment. We don't have to keep it non-volatile while they are reading these bits, but we must electrically back their 0 or 1 values. This statement is equivalent to saying that we don't support the Mealey-style state machines.

HIDDEN STATE, fully controlled internally and invisible to external world

Has many subareas:

Constant surfaces which never change (think π, for example).
Fixed source code. ROM-style never update procedures. On Earth, bugs are always around, so these things are unlikely to be read-only forever.
Dynamic source code. Think about these as deployed multiple-data processing depending on load. Or a dynamic code buit and edited in an IDE. Or optimized-JIT-compiled stuff to run math more efficiently than using higher level interpreted code. All source code can live in this area to be upgradable.
Data caches. The most important stuff. State machine state. Intermediate state. OOP field state. Momentum, velocity, pressure etc.

A subcategory belonging both to inputs and to outputs can be an Interrupt Vector Table triggering signal. Computational processes can run independently from inputs, polling values when computational process decides to do that. In a comletely reactive system, this polling is not necessary to start computation. A hardware interrupt mechanism wakes up processing when specified bits of data change. In a reactive system, there is a concept of modified bits. The hardware circuits of IVT simply set these modified bits so the rest of spiking reactive processing can take place.

The IVT concept can be implemented restricting which bits should cause shader re-run, when code run in simulation mode. Instead of launching GPGPU shader on any writable by external world surface change, it can define a small restricted area. Switching between computing and sleeping modes is non-trivial idea, when attempting the compatibility between the classic CPU IVT model, FPGA reactive implmentation, and a GPGPU emulation.

IVTs are kind of internal-code driven subscriptions. There should be an area visible to external world, an interrupt handling vector can be installed in a way saying, "if these bits change externally, write those other bits here and there - marking those as modified", in the emulation mode re-running computation, or waking the CPU/GPU after chaning the bits.

While running dynamically, even in the FPGA world, the three classes of bit surfaces become five:

IN: externally provided state.
IN: current hidden state.
IN: current visible state.
OUT: next hidden state.
OUT: next visible state.

Considering the subcategories listed above, our textures and color buffers must concern the following ten IO types:

IN_EXT_REQ: two flip-flops synchronized external state.
IN_EXT_DATA: coordinated by IN_REQ signals external state.
IN_CONST: never changing hidden data.
IN_CODE_ROM: never changing code.
IN_CODE_RAM: current code.
IN_DATA: current hidden state.
IN_VISIBLE: current visible state (provided so easier to change).
OUT_CODE_RAM: potentially self-modified code.
OUT_DATA: next hidden data.
OUT_VISIBLE: next visible state.

The analysis we made is important, because it clearly shows that a significant part of textures is not required to be written in every shader run as a color buffer target. We never write into textures which do represent external input, which back constant values for our computation, or have fixed algorithms which we might prefer to keep intact.

Based on the analysis, let's draft how a GLSL code can look like. The difference between functional IO and shader runs is crucial here to understand. Because each shader run describes just a few bits of the OUT_ surfaces.

WebGL code

We can run many switching operations simultaneously using parallelism provided by GPU. Let's enable WebGL API so we can issue descriptions of what to compute to the GPU:

const gl_canvas =
	document.createElement('canvas');
gl_canvas.setAttribute('width', '1px');
gl_canvas.setAttribute('height', '1px');

window.gl = gl_canvas.getContext('webgl', {
	depth: false,
	stencil: false,
	alpha: false,
	antialias: false,
	preserveDrawingBuffer: false,
	failIfMajorPerformanceCaveat: false,
	powerPreference: 'high-performance',
	desynchronized: true,
});

// this.insertAdjacentElement('afterend',
// gl_canvas);

return gl;

The sequence of calls to get our GLSL shader compiled is rather verbose. It follows C API, which doesn't make too much sense in JavaScript world. Let's ignore the fact, and pack it inside a function, so we can call it each time we need to compile a new shader:

window.compile_shader = (gl, vertSrc, fragSrc) => {

	const vertex_shader =
		gl.createShader(gl.VERTEX_SHADER);

	gl.shaderSource(vertex_shader, vertSrc);
	gl.compileShader(vertex_shader);

	const frag_shader =
		gl.createShader(gl.FRAGMENT_SHADER);

	gl.shaderSource(frag_shader, fragSrc);
	gl.compileShader(frag_shader);

	const program = gl.createProgram();
	gl.attachShader(program, vertex_shader);
	gl.attachShader(program, frag_shader);
	gl.linkProgram(program);

	if (!gl.getProgramParameter(program,
			gl.LINK_STATUS)) {
		let error_text =
			`Could not initialise shaders ${
				gl.getProgramInfoLog(program) }\n`;

		if (!gl.getShaderParameter(vertex_shader,
				gl.COMPILE_STATUS)) {
			error_text += `failed to compile vertex ${
				gl.getShaderInfoLog(vertex_shader) }\n`;
		}
		
		if (!gl.getShaderParameter(frag_shader,
				gl.COMPILE_STATUS)) {
			error_text += `failed to compile vertex ${
				gl.getShaderInfoLog(frag_shader) }\n`;
		}
		throw Error(error_text);
	}

	const compile_log = '' +
		gl.getShaderInfoLog(vertex_shader) +
		gl.getShaderInfoLog(frag_shader) +
		gl.getProgramInfoLog(program);

	if (compile_log) {
		throw Error(compile_log);
	}

	return program;
};

Now we can call the compile_shader() function. A vertex and a fragment shader source code strings are passed as arguments. And we also pass a WebGL API reference associated with the canvas. What happens next is OpenGL ES 2 driver compiles these two GLSL shader source codes into a corresponding GPU instruction set.

const vertex_shader = `
#version 100

precision highp float;
attribute vec2 attr_vertex_pos;
void main() {
  gl_Position = vec4(attr_vertex_pos.xy,
  	0.0, 1.0);
}
`;

const fragment_shader = `
#version 100

// Only when WEBGL_draw_buffers
// is there (mostly on laptops only):
#extension GL_EXT_draw_buffers : require

# ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;
precision highp int;
precision highp sampler2D;
# else
precision mediump float;
precision lowp int;
precision lowp sampler2D;
# endif

uniform sampler2D texture_0;

void main() {
	// Currently rendered position:
	gl_FragCoord.xy;
	vec4 input_data = texture2D(
		texture_0, vec2(0.0, 0.0));

	// Just return the pixel
	gl_FragData[0] = input_data;
}
`;

// Enable extensions before compiling the shader:
// To write multiple output buffers
window.multiple_color_buffers =
	gl.getExtension('WEBGL_draw_buffers');
multiple_color_buffers.drawBuffersWEBGL([
  multiple_color_buffers.COLOR_ATTACHMENT0_WEBGL
]);

// To read float textures
gl.getExtension('OES_texture_float');

// To write float fragments
gl.getExtension('WEBGL_color_buffer_float');

window.shader = compile_shader(gl,
	vertex_shader, fragment_shader);

window.attr_vertex_pos = gl.getAttribLocation(
	shader, 'attr_vertex_pos');

gl.enableVertexAttribArray(attr_vertex_pos);


window.uniform_texture_0 =
	gl.getUniformLocation(
		shader, 'texture_0');

return [shader,
	attr_vertex_pos,
	uniform_texture_0];

Let's check out if we can write shaders which produce floating point values rather than 4 bytes of RGBA in the range from 0 to 255.

return gl.getExtension('WEBGL_color_buffer_float');

If it shows OK, then the shader can only output 4 byte values. No float rendering on this platform.

Perhaps, we can render simultaneously to multiple textures?

const ext_draw_buffers =
	gl.getExtension('WEBGL_draw_buffers');
return ext_draw_buffers &&
	ext_draw_buffers.MAX_COLOR_ATTACHMENTS_WEBGL
	// ext_draw_buffers.MAX_DRAW_BUFFERS_WEBGL

Typically, mobile platforms don't support rendering to floating point textures, and don't support rendering to multiple textures simultaneously.

When running a shader, can we at least fetch 4-vectors of floating point values?

return gl.getExtension('OES_texture_float');

Let's abstract out the difference between high-performance GPU in a laptop, and a battery-efficient GPU in a smartphone, and write a GLSL wrapper to call our shaders.

In the shader source code, we'll make global parameters available, so our code can adjust correspondingly. The code might look like this:

`

void main () {
  // Location of current output pixel.
  gl_FragCoord.xy;

  vec2 screenPos = gl_FragCoord.xy - 0.5;
  float x = screenPos.x;
  float y = screenPos.y;

  float oddRow = mod(y, 2.0) > 0.0 ? 0.0 : 1.0;

  // Writing output into a plane
  // a) Smartphone mode - 32 bits per pixel:
  gl_FragData[0] = vec4(0, 255, 127, 0) / 255.0;
  // b) Laptop mode - 


`

Let's leave high-performance rendering for special use cases, and use the most compatible configuration (fetch from 8 floating point vec4, but render to just 4 bytes - 32 bits of result in a shader thread). Luckily, fetching vec4 from 4 bytes doesn't have memory tranfer penalty, so we can move the same buffer back-and-forth while recomputing the next state.

Let's run our first GLSL shader. At first, we need to understand how to invoke a shader code. Since GPU is a graphics processing unit, the traditional way is to pass a set of triangles to toggle rendering.

How do we render triangles? We need a render target. We omitted attaching our canvas to the page, so we don't have to render to the framebuffer, but we're going to render into a texture instead.

Note that the range of output data to be recomputed is directly determined by the triangles we're about to render. We can enable reactivity and lazy recomputation of only areas which do require to get updated. In general case though, when each data output depends on each data input, it's impossible to determine what outputs will become changed: that's the very purpose of any computation.

Let's try a 1x1 pixel texture first:

window.gpgpu_texture_width = 1;
window.gpgpu_texture_height = 1;


const gpgpu_texture_0 = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, gpgpu_texture_0);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
const pixel = new Float32Array([ 0.0, 0.0, 0.0, 0.0 ]);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 1, 1, 0, gl.RGBA, gl.FLOAT, pixel);

const gpgpu_texture_1 = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, gpgpu_texture_1);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
const pixel2 = new Float32Array([ 1.0, 2.0, 3.0, 4.0 ]);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 1, 1, 0, gl.RGBA, gl.FLOAT, pixel2);

// Associate a rendering framebuffer with the texture:
const color_buffer_0 = gl.createFramebuffer();
gl.bindFramebuffer(gl.FRAMEBUFFER, color_buffer_0);
gl.framebufferTexture2D(gl.FRAMEBUFFER,
	multiple_color_buffers.COLOR_ATTACHMENT0_WEBGL,
	// gl.COLOR_ATTACHMENT0,
	gl.TEXTURE_2D,
	gpgpu_texture_0, 0);

gl.useProgram(shader);

gl.viewport(0, 0,
	gpgpu_texture_width, gpgpu_texture_height);

// Enables multiple shader texture uniforms
gl.activeTexture(gl.TEXTURE0);
gl.bindTexture(gl.TEXTURE_2D, gpgpu_texture_1);
// gl.bindTexture(gl.TEXTURE_2D, gpgpu_texture_0);
gl.uniform1i(uniform_texture_0, 0);



// One-time setup: attach rendered triangles

// BYTE: signed 8-bit integer, with values in [-128..127] (when normalized mapped to [-1..1] with step 0.0078125)
// SHORT: signed 16-bit integer, with values in [-32768..32767] (when normalized mapped to [-1..1] with step 0.000030517)
// UNSIGNED_BYTE: unsigned 8-bit integer, with src values in [0..255] (when normalized mapped to [0..1] with step 0.00390625)
// UNSIGNED_SHORT: unsigned 16-bit integer, with values in [0..65535] (when normalized mapped to [0..1] with step 0.000015258)
// FLOAT: 32-bit IEEE floating point number, never normalized
const two_triangles = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, two_triangles);
const vertices_of_two_triangles = new Float32Array([
  1.0,  1.0,
 -1.0,  1.0,
  1.0, -1.0,
 -1.0, -1.0,
]);

gl.bufferData(gl.ARRAY_BUFFER,
  vertices_of_two_triangles,
  gl.DYNAMIC_DRAW);

gl.vertexAttribPointer(attr_vertex_pos, 2, gl.FLOAT, false, 0, 0);


// gl.bindFramebuffer(gl.FRAMEBUFFER, color_buffer_0);

gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

const shader_results = new Float32Array(
	gpgpu_texture_width
	* gpgpu_texture_height
	* 4);
gl.readPixels(0, 0,
	gpgpu_texture_width,
	gpgpu_texture_height,
	gl.RGBA,
	gl.FLOAT, shader_results);

return shader_results;

gl.useProgram(shader);