
Thread 106384114

24 posts 2 images /g/
Anonymous No.106384114 >>106384395 >>106384742 >>106384824 >>106384853 >>106384858 >>106385307 >>106385341 >>106385669 >>106386565 >>106386832 >>106387592
Which function is faster?

#define SIZE 10000000

float Foo(const float* a, const float* b) {
    float sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

float Bar(const float* a, const float* b) {
    float* c = new float[SIZE];
    float sum = 0;

    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] * b[i];
    }
    for (int i = 0; i < SIZE; i++) {
        sum += c[i];
    }
    delete[] c;
    return sum;
}
Anonymous No.106384124
idk compile and run them
Anonymous No.106384395
>>106384114 (OP)
fewer operations and iterations on Foo, so Foo is the clear winner here
Anonymous No.106384742
>>106384114 (OP)
Computer Science is neither the study of Computers nor an actual Science.
It is something more akin to a wizard crafting magical spells.
Anonymous No.106384776
compiler will optimize your shitty code anyway

run both with a lot of operations and -O3
Anonymous No.106384811
second one makes mustard gas
Anonymous No.106384824
>>106384114 (OP)
Bar, because that has more lines of code
Anonymous No.106384853 >>106385259 >>106388633
>>106384114 (OP)
Function Foo is generally faster than function Bar.

Here's why:

Memory Allocation and Deallocation:
Bar dynamically allocates memory for c using new float[SIZE] and deallocates it using delete[] c. This memory management overhead is absent in Foo, which operates directly on the input arrays. For a SIZE of 10,000,000, this overhead can be significant.

Cache Locality and Memory Access Patterns:
Foo performs a single pass over the data, reading a[i] and b[i] and immediately accumulating the sum. This typically results in better cache utilization. Bar, on the other hand, first performs a pass to compute a[i] * b[i] and store it in c[i], and then a second pass to sum the elements of c. This two-pass approach can lead to less optimal cache usage, especially for large SIZE, as the data in c might be evicted from the cache between the two loops.

Redundant Operations:
Bar introduces an intermediate array c and an extra loop to store and then sum the products, which is an unnecessary step compared to Foo's direct summation.

In essence, Foo performs the same computation as Bar but with fewer operations and less memory overhead, leading to better performance.
Anonymous No.106384858
>>106384114 (OP)
Bar obviously.
Anonymous No.106385222
bit diddlers should go to jail
Anonymous No.106385259
>>106384853
this shit should be a bannable offence
Anonymous No.106385307 >>106385320
>>106384114 (OP)
Why would Bar be faster? It does what Foo does but with a few extra steps
Anonymous No.106385320
>>106385307
Easier to vectorize.
Anonymous No.106385341
>>106384114 (OP)
bar uses borrowed references so that is somehow faster, I don't know why.
Anonymous No.106385669
>>106384114 (OP)
idk, i only use raw C and asm, not this cpp bs but Foo looks more sane to me
Anonymous No.106386069
Considering you are allocating something really large on the heap in Bar, it has to be the first one. I'm pretty sure modern compilers can remove one of those loops, but they can't remove the `new` call.
Anonymous No.106386123 >>106386386
#include <iostream>
#include <random>
#include <chrono>

#define SIZE 10000000

float Foo(const float* a, const float* b) {
    float sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

float Bar(const float* a, const float* b) {
    float* c = new float[SIZE];
    float sum = 0;

    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] * b[i];
    }
    for (int i = 0; i < SIZE; i++) {
        sum += c[i];
    }
    delete[] c;
    return sum;
}

int main() {
    std::mt19937 random(12345);

    float* a = new float[SIZE];
    float* b = new float[SIZE];
    for (int i = 0; i < SIZE; i++) {
        a[i] = random();
    }
    for (int i = 0; i < SIZE; i++) {
        b[i] = random();
    }

    float result;
    std::chrono::high_resolution_clock clock;
    std::chrono::high_resolution_clock::time_point start;
    std::chrono::duration<double, std::micro> duration;

    start = clock.now();
    result = Foo(a, b);
    duration = clock.now() - start;
    std::cout << "foo: " << result << " time: " << duration.count() << std::endl;

    start = clock.now();
    result = Bar(a, b);
    duration = clock.now() - start;
    std::cout << "bar: " << result << " time: " << duration.count() << std::endl;

    delete[] a;
    delete[] b;
}


I'm using -O3 on Clang on Linux Mint with an i5-13600K
foo: 4.55982e+25 time: 4067.53
bar: 4.55982e+25 time: 13620.3
Anonymous No.106386272
Saved some time with SSE intrinsics but suffered some floating-point drift.
Foo: -1383.275269 time: 103454 microseconds
Bar: -1383.275269 time: 301907 microseconds
Baz: -1383.314087 time: 55903 microseconds
Anonymous No.106386386
>>106386123
foo: 4.60901e+25 time: 3952.43
bar: 4.60901e+25 time: 9442.64

With GCC -Ofast and i5-8600K
Anonymous No.106386565
>>106384114 (OP)
>Gerald Sussyman
Anonymous No.106386832
>>106384114 (OP)
>Which function is faster?
dunno ask compiler
Anonymous No.106387592
>>106384114 (OP)
Should be negligible on modern hardware
Anonymous No.106388317
Can't beat foo on the GPU.

...

float Gpu(const float* a, const float* b, sycl::queue& queue)
{
    float sum = 0;
    sycl::buffer<float> buffer_sum{&sum, 1};

    try {
        queue.submit([&](sycl::handler& cgh) {
            auto rd = sycl::reduction(buffer_sum, cgh, sycl::plus<float>());
            cgh.parallel_for(sycl::nd_range<1>{SIZE, 1000}, rd,
                [=](sycl::nd_item<1> item, auto& sum) {
                    int idx = item.get_global_id();
                    sum += a[idx] * b[idx];
                });
        });
    } catch (const sycl::exception& e) {
        std::cout << e.what() << std::endl;
    }

    return buffer_sum.get_host_access()[0];
}

int main() {
    std::mt19937 random(12345);

    sycl::device device{sycl::gpu_selector_v};
    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>() << "\n";
    auto queue = sycl::queue(device, sycl::property_list{});

    float* a = sycl::malloc_shared<float>(SIZE, queue);
    float* b = sycl::malloc_shared<float>(SIZE, queue);
    for (int i = 0; i < SIZE; i++) {
        a[i] = random();
    }
    for (int i = 0; i < SIZE; i++) {
        b[i] = random();
    }

    ...

    float result;
    std::chrono::high_resolution_clock clock;
    std::chrono::high_resolution_clock::time_point start;
    std::chrono::duration<double, std::micro> duration;

    start = clock.now();
    result = Gpu(a, b, queue);
    duration = clock.now() - start;
    std::cout << "gpu: " << result << " time: " << duration.count() << std::endl;

    sycl::free((void*)a, queue);
    sycl::free((void*)b, queue);
}


Running on device: Intel(R) Arc(TM) A770 Graphics
foo: 4.60901e+25 time: 2427.54
bar: 4.61224e+25 time: 11519.4
gpu: 4.61354e+25 time: 8799.44
Anonymous No.106388633
>>106384853
ok, thanks chatgpt