
Thread 106384114

24 posts 2 images /g/
Anonymous No.106384114 >>106384395 >>106384742 >>106384824 >>106384853 >>106384858 >>106385307 >>106385341 >>106385669 >>106386565 >>106386832 >>106387592
Which function is faster?

#define SIZE 10000000

float Foo(const float* a, const float* b) {
    float sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

float Bar(const float* a, const float* b) {
    float* c = new float[SIZE];
    float sum = 0;

    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] * b[i];
    }
    for (int i = 0; i < SIZE; i++) {
        sum += c[i];
    }
    delete[] c;
    return sum;
}
Anonymous No.106384124
idk compile and run them
Anonymous No.106384395
>>106384114 (OP)
fewer operations and iterations on Foo, so Foo is the clear winner here
Anonymous No.106384742
>>106384114 (OP)
Computer Science is neither the study of Computers nor an actual Science.
It is something more akin to a wizard crafting magical spells.
Anonymous No.106384776
compiler will optimize your shitty code anyway

run both with a lot of operations and -O3
Anonymous No.106384811
second one makes mustard gas
Anonymous No.106384824
>>106384114 (OP)
Bar, because that has more lines of code
Anonymous No.106384853 >>106385259 >>106388633
>>106384114 (OP)
Function Foo is generally faster than function Bar.

Here's why:

Memory Allocation and Deallocation:
Bar dynamically allocates memory for c using new float[SIZE] and deallocates it using delete[] c. This memory management overhead is absent in Foo, which operates directly on the input arrays. For a SIZE of 10,000,000, this overhead can be significant.

Cache Locality and Memory Access Patterns:
Foo performs a single pass over the data, reading a[i] and b[i] and immediately accumulating the sum. This typically results in better cache utilization. Bar, on the other hand, first performs a pass to compute a[i] * b[i] and store it in c[i], and then a second pass to sum the elements of c. This two-pass approach can lead to less optimal cache usage, especially for large SIZE, as the data in c might be evicted from the cache between the two loops.

Redundant Operations:
Bar introduces an intermediate array c and an extra loop to store and then sum the products, which is an unnecessary step compared to Foo's direct summation.

In essence, Foo performs the same computation as Bar but with fewer operations and less memory overhead, leading to better performance.
Anonymous No.106384858
>>106384114 (OP)
Bar obviously.
Anonymous No.106385222
bit diddlers should go to jail
Anonymous No.106385259
>>106384853
this shit should be a bannable offence
Anonymous No.106385307 >>106385320
>>106384114 (OP)
Why would Bar be faster? It does what Foo does but with a few extra steps
Anonymous No.106385320
>>106385307
Easier to vectorize.
Anonymous No.106385341
>>106384114 (OP)
bar uses borrowed references so that is somehow faster, I don't know why.
Anonymous No.106385669
>>106384114 (OP)
idk, i only use raw C and asm, not this cpp bs but Foo looks more sane to me
Anonymous No.106386069
Considering you are allocating something really large on the heap in Bar, it has to be the first one. I'm pretty sure modern compilers can remove one of those loops, but they can't remove the `new` call.
Anonymous No.106386123 >>106386386
#include <iostream>
#include <random>
#include <chrono>

#define SIZE 10000000

float Foo(const float* a, const float* b) {
    float sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

float Bar(const float* a, const float* b) {
    float* c = new float[SIZE];
    float sum = 0;

    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] * b[i];
    }
    for (int i = 0; i < SIZE; i++) {
        sum += c[i];
    }
    delete[] c;
    return sum;
}

int main() {
    std::mt19937 random(12345);

    float* a = new float[SIZE];
    float* b = new float[SIZE];
    for (int i = 0; i < SIZE; i++) {
        a[i] = random();
    }
    for (int i = 0; i < SIZE; i++) {
        b[i] = random();
    }

    float result;
    std::chrono::high_resolution_clock clock;
    std::chrono::high_resolution_clock::time_point start;
    std::chrono::duration<double, std::micro> duration;

    start = clock.now();
    result = Foo(a, b);
    duration = clock.now() - start;
    std::cout << "foo: " << result << " time: " << duration.count() << std::endl;

    start = clock.now();
    result = Bar(a, b);
    duration = clock.now() - start;
    std::cout << "bar: " << result << " time: " << duration.count() << std::endl;

    delete[] a;
    delete[] b;
}


I'm using -O3 on Clang on Linux Mint with an i5-13600K
foo: 4.55982e+25 time: 4067.53
bar: 4.55982e+25 time: 13620.3
Anonymous No.106386272
Saved some time with SSE intrinsics but suffered some floating-point drift.
Foo: -1383.275269 time: 103454 microseconds
Bar: -1383.275269 time: 301907 microseconds
Baz: -1383.314087 time: 55903 microseconds
Anonymous No.106386386
>>106386123
foo: 4.60901e+25 time: 3952.43
bar: 4.60901e+25 time: 9442.64

With GCC -Ofast and i5-8600K
Anonymous No.106386565
>>106384114 (OP)
>Gerald Sussyman
Anonymous No.106386832
>>106384114 (OP)
>Which function is faster?
dunno ask compiler
Anonymous No.106387592
>>106384114 (OP)
Should be negligible on modern hardware
Anonymous No.106388317
Can't beat foo on the GPU.

...

float Gpu(const float* a, const float* b, sycl::queue& queue)
{
    float sum = 0;
    sycl::buffer<float> buffer_sum{&sum, 1};

    try {
        queue.submit([&](sycl::handler& cgh) {
            auto rd = sycl::reduction(buffer_sum, cgh, sycl::plus<float>());
            cgh.parallel_for(sycl::nd_range<1>{SIZE, 1000}, rd,
                [=](sycl::nd_item<1> item, auto& sum) {
                    int idx = item.get_global_id();
                    sum += a[idx] * b[idx];
                });
        });
    } catch (const sycl::exception& e) {
        std::cout << e.what() << std::endl;
    }

    return buffer_sum.get_host_access()[0];
}

int main() {
    std::mt19937 random(12345);

    sycl::device device{sycl::gpu_selector_v};
    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>() << "\n";
    auto queue = sycl::queue(device, sycl::property_list{});

    float* a = sycl::malloc_shared<float>(SIZE, queue);
    float* b = sycl::malloc_shared<float>(SIZE, queue);
    for (int i = 0; i < SIZE; i++) {
        a[i] = random();
    }
    for (int i = 0; i < SIZE; i++) {
        b[i] = random();
    }

    ...

    float result;
    std::chrono::high_resolution_clock clock;
    std::chrono::high_resolution_clock::time_point start;
    std::chrono::duration<double, std::micro> duration;

    start = clock.now();
    result = Gpu(a, b, queue);
    duration = clock.now() - start;
    std::cout << "gpu: " << result << " time: " << duration.count() << std::endl;

    sycl::free((void*)a, queue);
    sycl::free((void*)b, queue);
}


Running on device: Intel(R) Arc(TM) A770 Graphics
foo: 4.60901e+25 time: 2427.54
bar: 4.61224e+25 time: 11519.4
gpu: 4.61354e+25 time: 8799.44
Anonymous No.106388633
>>106384853
ok, thanks chatgpt