
Thread 16784057

Anonymous No.16784057 >>16784059 >>16784129 >>16784149 >>16784323
>It's just fancy autocomplete

What do people mean when they say this? Why couldn’t autoregressive next-token prediction have more complex behavior as an emergent property?

You can be reductionist and claim human reasoning is "nothing more than the probabilistic unfolding of neural signals with a training objective of eating and fucking", yet that reduction doesn’t invalidate the claim that humans are capable of reasoning. In the same way that complex behaviors emerged from that model, why couldn't they emerge from autoregressive next-token prediction?
Anonymous No.16784059 >>16784323
>>16784057 (OP)
it IS reductionist, but it has no reason to have any particular shape outside of the statistical symbolic relationships, whereas human beings have a brain developed by evolutionary pressures. even if you want to argue that it's "something more", it isn't gonna be anything close to a human even though a human is the closest thing to it there is. you feelin me? LLMs are like a shadow of an extremely reduced average of a section of human brains smeared over time
Anonymous No.16784129 >>16784325
>>16784057 (OP)
>More complex emergent behavior
Because there's no recursion. Recursion is the method by which we perform and understand even the simplest of computations, such as 5+5. Without a breakthrough in intelligent recursion, inductive reasoning is impossible.

I already know people are going to try to lie and bs me and say I'm wrong. If you disagree with me, post a link to a notebook where you successfully train a neural network to globally fit z=x*y without using logarithms. (Protip: you can't)
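To spell out what I mean by recursion being behind even 5+5, here's the textbook Peano-style definition, where addition is just repeated application of the successor function (a toy Python sketch, obviously not how a brain or a chip literally computes it):

[code]
# Peano-style addition: a + 0 = a, a + S(b) = S(a + b)
def successor(n):
    return n + 1

def add(a, b):
    # recurse on b down to the base case, applying the successor once per step
    if b == 0:
        return a
    return successor(add(a, b - 1))

print(add(5, 5))  # 10
[/code]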
Anonymous No.16784149 >>16784293
>>16784057 (OP)
>Why couldn’t autoregressive next-token prediction
you said it yourself - next token
imagine a guy who only thinks of one word at a time
now imagine he has no eyes or hands, no conception of a world outside words
what do you call such a guy?
Anonymous No.16784154
It has no initiative. How can it build itself if it's waiting to take orders? There's progress there, though.
Anonymous No.16784293 >>16784331
>>16784149
Makes sense. *If* by next-token prediction we meant they don't plan further ahead than the next word, and *if* that's all LLMs are doing, that would indeed be a significant limitation. Except that doesn't describe LLMs, so I guess the "it just predicts the next token" description was just false to begin with.
https://www.youtube.com/watch?v=Bj9BD2D3DzA
Though it really should be obvious already from the things LLMs are capable of, even without any fancy interpretability research.
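To be concrete about what that framing even refers to, here's a bare-bones greedy decoding loop (the dummy_model below is a made-up stand-in, not any real LLM's API): the loop emits one token at a time, but nothing in it constrains what the model computes internally on each step.

[code]
import torch

vocab_size = 100

def dummy_model(tokens):
    # Stand-in for a trained LLM: returns logits over the vocabulary for the next token.
    # Random here; in a real model these logits come out of the full transformer stack.
    torch.manual_seed(int(tokens.sum()))  # deterministic per context, just for illustration
    return torch.randn(vocab_size)

def greedy_decode(prompt, steps=5):
    tokens = list(prompt)
    for _ in range(steps):
        logits = dummy_model(torch.tensor(tokens))
        tokens.append(int(logits.argmax()))  # the "one token at a time" part
    return tokens

print(greedy_decode([1, 2, 3]))
[/code]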
Anonymous No.16784323
>>16784057 (OP)
>What do people mean when they say this? Why couldn’t autoregressive next-token prediction have more complex behavior as an emergent property?
Because predicting the next token IS the emergent behavior of the entire mathematical machinery of the transformer.

>>16784059
>it IS reductionist
Completely false. It's the high-level functional definition of the system. It's what's DOWNSTREAM from all those mathematical operations.
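To put the "downstream" point in code, here's a toy decoder-style stack in PyTorch (random weights, made-up sizes, nothing to do with any real model): the next-token distribution is literally the last thing the whole pile of machinery produces.

[code]
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 8

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
stack = nn.TransformerEncoder(layer, num_layers=2)
unembed = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))              # a dummy prompt
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

h = stack(embed(tokens), mask=causal_mask)                       # all the "machinery"
logits = unembed(h[:, -1, :])                                    # last position only
next_token_probs = torch.softmax(logits, dim=-1)                 # this is the "prediction"
print(next_token_probs.shape)                                    # torch.Size([1, 100])
[/code]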
Anonymous No.16784325 >>16784327
>>16784129
>Recursion is the the method by which we perform and understand even the simplest of computations, such as 5+5. Without a breakthrough in intelligent recursion, inductive reasoning is impossible.
Isn't that a bit overstated? Transformers aren't purely linear; the self-attention mechanism allows for dynamic, context-dependent processing that can simulate recursive patterns across layers and sequences. We've seen LLMs exhibit emergent behaviors like in-context learning, where they generalize from examples in a way that feels inductive—almost like bootstrapping recursion from patterns. Why dismiss that as insufficient for complex computation?
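For concreteness, the core of self-attention is just a few lines (a bare-bones sketch: single head, no learned projections, no masking). Every position's output is a content-dependent mixture of every other position, recomputed at every layer, which is where the iterated, quasi-recursive flavor comes from.

[code]
import torch

def self_attention(x):
    # x: (seq_len, d). Each row re-weights every other row based on content.
    scores = x @ x.T / x.shape[-1] ** 0.5    # pairwise similarities
    weights = torch.softmax(scores, dim=-1)  # context-dependent mixing weights
    return weights @ x                       # each output is a mixture of all inputs

x = torch.randn(8, 16)
print(self_attention(x).shape)  # torch.Size([8, 16])
[/code]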

>If you disagree with my, post a link to a notebook where you successfully train a neural network to globally fit z=x*y without using logarithms. (Protip: you can't)
That's an interesting challenge, but it's not as insurmountable as you make it sound. Neural networks, whether plain MLPs or transformers with appropriate architectures, can indeed learn to approximate (and, with sufficient capacity, closely fit) the function z = x * y over any bounded domain without any logarithmic tricks; it's a classic bilinear form, and the universal approximation theorem guarantees that continuous functions like this can be approximated arbitrarily well on compact domains by feedforward nets.

To prove the point, I don't need a full notebook right now (though I could whip one up in Colab if you're game), but here's a dead-simple PyTorch example of a two-layer MLP that learns multiplication on inputs x, y in [-10, 10]. It converges to a close fit over that whole square after a short full-batch training run on sampled data:
Anonymous No.16784327 >>16784329
>>16784325
[code]
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Simple two-layer MLP for z = x * y
class MultiplierNet(nn.Module):
    def __init__(self, hidden=128):  # a reasonably wide hidden layer helps the fit over the full square
        super().__init__()
        self.fc1 = nn.Linear(2, hidden)   # input: [x, y]
        self.fc2 = nn.Linear(hidden, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Generate training data: (x, y) sampled uniformly from [-10, 10]^2
np.random.seed(42)
n_samples = 10000
x = np.random.uniform(-10, 10, n_samples)
y = np.random.uniform(-10, 10, n_samples)
z_true = x * y
train_data = torch.tensor(np.column_stack([x, y]), dtype=torch.float32)
train_labels = torch.tensor(z_true, dtype=torch.float32).unsqueeze(1)

# Train with full-batch Adam on the MSE loss
model = MultiplierNet()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
for epoch in range(2000):
    optimizer.zero_grad()
    outputs = model(train_data)
    loss = criterion(outputs, train_labels)
    loss.backward()
    optimizer.step()
    if epoch % 200 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.6f}')

# Evaluate on a dense grid over the same domain
x_test = np.linspace(-10, 10, 100)
y_test = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x_test, y_test)
test_data = torch.tensor(np.column_stack([X.ravel(), Y.ravel()]), dtype=torch.float32)
preds = model(test_data).detach().numpy().reshape(100, 100)
true_z = X * Y

# Plot true vs. predicted surfaces (in a real notebook these should look nearly identical)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.contourf(X, Y, true_z)
plt.title('True z = x*y')
plt.colorbar()
plt.subplot(1, 2, 2)
plt.contourf(X, Y, preds)
plt.title('Predicted z')
plt.colorbar()
plt.show()

print(f'Mean absolute error: {np.mean(np.abs(preds - true_z)):.4f}')
[/code]
Anonymous No.16784329
>>16784327
I forgot I'm not on /g/.
Either way, this trains down to a small MAE relative to the output scale (z ranges over roughly ±100), which is essentially a global fit over that square for practical purposes, no logs involved. Scale up the hidden units or layers and it gets tighter still. If you want an exact symbolic representation, that's more a question of picking the right basis functions (e.g., a couple of units with quadratic activations), but approximation is what NNs excel at, and it's emergent from the layered structure, not from recursion per se.

This ties back to our chat: emergent behaviors like fitting complex functions (multiplication is inductive in a sense—generalizing patterns) arise from attention and depth in transformers, without needing explicit recursion. Why insist on recursion when depth + attention bootstraps similar capabilities?
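And to spell out the quadratic-activation aside: the polarization identity x*y = ((x+y)^2 - (x-y)^2)/4 gives an exact representation of multiplication with two square units and a fixed linear readout, no training and no logs (a toy construction, not something you'd find in a production net).

[code]
import torch

# Exact multiplication from two quadratic units: x*y = ((x+y)^2 - (x-y)^2) / 4
W = torch.tensor([[1.0,  1.0],    # first unit sees  x + y
                  [1.0, -1.0]])   # second unit sees x - y
readout = torch.tensor([0.25, -0.25])

def multiply(x, y):
    pre = W @ torch.tensor([x, y])  # fixed linear layer
    return readout @ pre ** 2       # square activation + fixed linear readout

print(multiply(7.0, -3.0))  # tensor(-21.)
[/code]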
Anonymous No.16784331
>>16784293
>*If* by next token prediciton we meant they don't plan further ahead
They do not and cannot "plan ahead" because they're not trying to get anywhere. There is no end-goal. You could at most compare it to strategizing to reduce the odds of working yourself into a dead end as you traverse a landscape, familiar to you only in terms of its general properties, but you can't see more than one step in front of you, let alone see your destination and plot an actual course.