View Full Version : Numerical programming in C++...
Stimpson J. Cat
14th September 2003, 06:42 AM
Those of you who have been around for a while may remember that about a year ago (maybe a little more), I posted a link to some vector and matrix classes which I have written for doing numerical programming in C++.
If anybody is interested, this software has had some pretty significant updates made to it recently. You can find it, along with online documentation and a list of what's new, here.
VecMat Software (http://neurodyn.umsl.edu/~kevin/vecmat/VM_Software.htm)
Dr. Stupid
Peskanov
14th September 2003, 11:29 AM
Nice work, Stimpson;
I read you are interested in getting best performance possible, so let me make some suggestions:
- In dividing a series of numbers by the same scalar, it's common to use the reciprocal aproach, as multiply is much faster than division (usually 3 cycles against ~18)
Calculate reciprocal once: rec=1.0/scalar
And multiply by all the numbers: rec*v0, rec*v1,...
- Tiling is an advanced technique that *sometimes* improve speed drastically. I have seen speedups up to 7x! It's used with big matrix, let's say 200x200 or bigger.
When you read the data in the matrix vertically, internal cpu data caches get flowed with unused data from closer horizontal cells...That's the way all modern caches work!
When you reach the bottom of the matrix, the cache has already lost the data at the top, and that spells disaster in performace terms.
Tiling is the technique of avoiding sequential vertical readings by operating in small tiles (or matrix), for example 8x8 or 16x16. It's use is common in graphics programming due to size and performance problems.
Sadly, all really interesting techniques makes good code design an impossible task...
arcticpenguin
14th September 2003, 11:58 AM
Originally posted by Stimpson J. Cat
Those of you who have been around for a while may remember...
I've been around such a long while that I still program in FORTRAN.
Stimpson J. Cat
14th September 2003, 04:05 PM
Peskanov,
Thanks for your comments. :)
I read you are interested in getting best performance possible, so let me make some suggestions:
- In dividing a series of numbers by the same scalar, it's common to use the reciprocal aproach, as multiply is much faster than division (usually 3 cycles against ~18)
Calculate reciprocal once: rec=1.0/scalar
And multiply by all the numbers: rec*v0, rec*v1,...
I thought about that, unfortunately it would break the division operation for vectors of integers. I would need to write separate specializations for float and double, as well as their complex counterparts. I figure this way is simpler. After all, the existence of the division operator does not imply that the user has to use it. You can always calculate the reciprocal yourself, and then do the multiplication. In fact, that is exactly what I usually do. The only time I ever actually use the division operator is when it is completely insignificant to performance (for example, normalizing the result of a Fourier transform).
- Tiling is an advanced technique that *sometimes* improve speed drastically. I have seen speedups up to 7x! It's used with big matrix, let's say 200x200 or bigger.
When you read the data in the matrix vertically, internal cpu data caches get flowed with unused data from closer horizontal cells...That's the way all modern caches work!
When you reach the bottom of the matrix, the cache has already lost the data at the top, and that spells disaster in performance terms.
Tiling is the technique of avoiding sequential vertical readings by operating in small tiles (or matrix), for example 8x8 or 16x16. It's use is common in graphics programming due to size and performance problems.
The matrix multiplication routines are one area where I am still not very satisfied. I have been reading about the tiling methods you refer to, but I do not yet understand them well enough to try to implement them myself. It is also my understanding that the tile size that will give optimum performance depends strongly on the type of CPU.
I have been toying around with the idea of integrating BLAS into my stuff as well, but I am not really sure how to do that, either.
Arcticpenguin,
I've been around such a long while that I still program in FORTRAN.
:eek: Oh the humanity!! :eek:
I have a colleague who does all his programming in fortran. I can't stand the language myself. I took a course on it about 8 years ago, and at the time I could use it, but it has been so long, I can barely follow the source code anymore, much less write it.
Dr. Stupid
a_unique_person
14th September 2003, 10:36 PM
This is all, like, so last century. The main game is symbolic manipulation now.
Although, according to this http://developers.slashdot.org/developers/03/09/14/1821207.shtml?tid=126&tid=156 C++ has actually been used to win this years ICFP programming contest.
Peskanov
15th September 2003, 03:06 PM
Stimp,
I thought about that, unfortunately it would break the division operation for vectors of integers. I would need to write separate specializations for float and double, as well as their complex counterparts. I figure this way is simpler. After all, the existence of the division operator does not imply that the user has to use it. You can always calculate the reciprocal yourself, and then do the multiplication. In fact, that is exactly what I usually do. The only time I ever actually use the division operator is when it is completely insignificant to performance (for example, normalizing the result of a Fourier transform).
That's Ok, because anyway many programmers will divide using the reciprocal. It's a common use.
The matrix multiplication routines are one area where I am still not very satisfied. I have been reading about the tiling methods you refer to, but I do not yet understand them well enough to try to implement them myself. It is also my understanding that the tile size that will give optimum performance depends strongly on the type of CPU.
I have been toying around with the idea of integrating BLAS into my stuff as well, but I am not really sure how to do that, either.
Well, tiling is used only when you operate the memory in a non-sequential manner, like in matrix multiply, transpose or diagonalization. The size of the tile is not so important, it's common to use a lenght equal to the cache line. For example, 32 bytes, that gives you a tile of 8 rows of 8 float number each.
Optimal performance is basically unpredictable; that is, you can execute 2 times the same program and see variations on the speed of 5%, for example. Some times this is due to the physical memory start, the memory "alignment". If your matrix starts on address 0x0...nn0 you get speed A, and if your matrix start en address 0x0...nn4 you get speed A*1.05
Don't care about these complications. Tiling is used to acomplish HUGE speed improvements in bad cache situations. Getting 3% or 7% more speed in a particular CPU adjusting the tile size has no use. But just for using tiling (of any small size) you can get huge improvements.
An easy way to speed up your vertical matrix operations is just to implement one tiled operation: transpose.
Prepare a fast function to transpose, using 8x8 tiles, it's not hard, as transposition operates in local regions correctly.
When you have to operate with some big matrix in vertical, for example in matrix multiply, just create a transposed version of the vertical-operated one, so you can operate in horizontal.
About BLAS, I have no idea as I am not too much into maths.
Anyway, I also recommed you to look at vectorial unit libraries. You can get big speedups from Altivec or SSE. And you can also do vectorial code directly. I can testify that at least Altivec is quite easy to handle...
Stimpson J. Cat
16th September 2003, 04:03 AM
Peskanov,
An easy way to speed up your vertical matrix operations is just to implement one tiled operation: transpose.
Prepare a fast function to transpose, using 8x8 tiles, it's not hard, as transposition operates in local regions correctly.
When you have to operate with some big matrix in vertical, for example in matrix multiply, just create a transposed version of the vertical-operated one, so you can operate in horizontal.
Interesting. I can certainly see how that would work when the matrices are in contiguous memory. My software allows for a more general case, where both the row-stride and column-stride can be just about anything. I think it is clear that if I am going to implement any real improvement in the matrix operations, it will have to be done by having a library of functions optimized for different situations, and then having the main functions intelligently decide which optimized functions to call.
For example, given the fact that matrix multiplication is not of linear complexity, for very large matrices which are not contiguous, it would make sense to operate on contiguous copies. But then you are dealing with trade-offs between speed and memory usage.
Anyway, I also recommed you to look at vectorial unit libraries. You can get big speedups from Altivec or SSE. And you can also do vectorial code directly. I can testify that at least Altivec is quite easy to handle...
That's also something I am interested in trying. The optimal code for a pipelined floating point unit is not the same as that for a vector processing unit. I am not sure how to properly optimize for things like SSE, 3DNow, or Altivec.
Dr. Stupid
Peskanov
18th September 2003, 02:27 AM
Stimp,
Interesting. I can certainly see how that would work when the matrices are in contiguous memory. My software allows for a more general case, where both the row-stride and column-stride can be just about anything. I think it is clear that if I am going to implement any real improvement in the matrix operations, it will have to be done by having a library of functions optimized for different situations, and then having the main functions intelligently decide which optimized functions to call.
That's a common dilema, clean vs fast.
In some extreme cases I have seen (and I have coded) another philosophy which is both clean and fast: dinamically build the final code using general descriptions.
I used this aproach for a sound mixer, and I have seen an friend use it for graphic rasterizers.
In a language he specified, the user of the library explains which kind of operations wants for every pixel: texturing, bump, shaders, etc... Then the library build the machine code and uses it.
This method looks quite nice for math applications, you could specify an equation to be applied for every member of a matrix or a vector and still get top speed.
In the downside, it sacrifies portability, and requires time and skill.
For example, given the fact that matrix multiplication is not of linear complexity, for very large matrices which are not contiguous, it would make sense to operate on contiguous copies. But then you are dealing with trade-offs between speed and memory usage.
Do you think memory is so important today? I know some math. intensive programs use huge matrix but...
There still is the posibility of transposing the matrix 2 times, one to operate and other to restore the matrix. But I am not sure it's a good idea, after all the user can pass you the same matrix in both arg. of a multiplication.
That's also something I am interested in trying. The optimal code for a pipelined floating point unit is not the same as that for a vector processing unit. I am not sure how to properly optimize for things like SSE, 3DNow, or Altivec.
What do you mean, exactly?
The main difference between common alg. and vectorial ones is:
1- You have to operate with 4 individual elements at a time.
2- Branching has to be avoided at all cost.
3- Forget about double precission, at least for now
4- Portability sucks.
For a library like yours, it can be used for optimizing special case;
Difference 1 is a problem in matrix transpose and matrix multiply; but public code for fast matrix transpose using vectorial units is avalaible; and, using matrix transpose, matrix multiply is more sequential (vectorizable).
Difference 2 is not a problem for you, as you have very few conditionals in you loops.
If you need some example of how to vectorize a loop, just post it and I can translate it to Altive code.
Stimpson J. Cat
18th September 2003, 08:27 AM
Peskanov,
Do you think memory is so important today? I know some math. intensive programs use huge matrix but...
Unfortunately, it often is an issue for me. It is not uncommon for me to have to operate on data sets which are a few hundred MB. Of course, such data sets are always in contiguous order, so the idea of having two sets of functions, one optimized for flexibility, and one optimized for storage and speed, makes sense.
I am really beginning to think this is the way to go. I can simply have my matrix multiplication routines check to see if the data is formatted properly for the more highly optimized routines (which it almost always will be), and only use the slower, but completely generic code that I have already written, when it has to.
This approach would also have the advantage of leaving the trade-off between memory usage and speed, up to the user. If he wants speed, and doesn't mind using more memory, he can simply make deep-copies of the non-conforming matrices, and then call the Matrix multiplication functions with them.
That's also something I am interested in trying. The optimal code for a pipelined floating point unit is not the same as that for a vector processing unit. I am not sure how to properly optimize for things like SSE, 3DNow, or Altivec.
--------------------------------------------------------------------------------
What do you mean, exactly?
It has been my understanding that optimum code for a pipelined FPU should attempt to group together independent operations, so that the pipeline does not have to wait for one to finish before it can start the next, whereas the code for a vector FPU should group together multiple operations operating on the same data. Certainly these things are not mutually exclusive, but they are different.
The main difference between common alg. and vectorial ones is:
1- You have to operate with 4 individual elements at a time.
2- Branching has to be avoided at all cost.
3- Forget about double precission, at least for now
4- Portability sucks.
Those are actually some big problems for me. Double precision is pretty much necessary for scientific programming, and as far as I know, the P4 with its SSE2 is the only one that supports vector processing on doubles. Portability is also an important issue for me. Every program I write needs to be able to run on at least four different platforms, namely Sun workstations (with gcc), Linux PCs (also with gcc), Windows PCs (with gcc under Cygwin), and Windows PCs (with MSVC++). In addition, the PCs range from P3s to P4s to Athlons.
I guess vector processing just isn't in the cards for me. :p
Dr. Stupid
AmateurScientist
18th September 2003, 08:34 AM
Originally posted by arcticpenguin
I've been around such a long while that I still program in FORTRAN.
I don't still program in FORTRAN. I don't still program at all. Like you, I did learn first BASIC, then FORTRAN. I also learned assembly language programming, which was hard as hell. I hated having to keep track of which register held what value at the moment.
Do you remember printing out your programs on the green and white tractor paper and then trying to debug them by hand with a pen or pencil?
I remember going to bed after being in the computer lab most of the night and dreaming in code.
If there is hell on earth, it is dreaming in code and being so immersed in it that you try to figure out how to debug your alarm clock when it goes off in the morning.
AS
Underemployed
18th September 2003, 11:43 AM
I find it particularly soothing to read all this, despite not understanding a single word. A bit like watching a snooker game on TV. Ignorance is indeed bliss.
Stimpson J. Cat
18th September 2003, 12:26 PM
AS,
If there is hell on earth, it is dreaming in code and being so immersed in it that you try to figure out how to debug your alarm clock when it goes off in the morning.
Been there. Done that. I feel your pain, man. :rub:
Dr. Stupid
Yahweh
19th September 2003, 10:47 PM
I dont really have much to add, as I dont understand any of it anyway...
A few weeks ago, I picked up a C++ book. I've been trying to teach myself how to program, and I've come to the realization that C++ was intentionally designed to be hard to learn.
One day at a time, one crappy DOS window at a time...
jj
20th September 2003, 02:56 AM
Originally posted by arcticpenguin
I've been around such a long while that I still program in FORTRAN.
Lessee.
Add a to b, giving c.
Oh, wait, no, that's not Fortran :)
if (i) 10, 10, 20
THAT is fortran
Personally, I preferred the language that started out
Begin block;
...
Stimpson J. Cat
20th September 2003, 06:32 AM
Yahweh,
I dont really have much to add, as I dont understand any of it anyway...
A few weeks ago, I picked up a C++ book. I've been trying to teach myself how to program, and I've come to the realization that C++ was intentionally designed to be hard to learn.
I guess it depends on how you approach it. I have found that the best way to learn a new programming language, is by porting programs from languages you already understand. If you just read a book on C++, and then sit down and try to write a program, you are probably no going to get anywhere.
In fact, this software was started essentially as a way to teach myself C++. Up until about two years ago, I basically just used C. I used a C++ compiler, but the only C++ features I used were the memory allocation and iostream functions.
Dr. Stupid
rockoon
24th September 2003, 12:50 AM
Memory optimisations are indeed critical. The more modern the processor the more critical it is. A 386 didnt need any sort of cache optimisation strategy (it didnt have a memory cache) and a 486 had such small cache lines that trying to optimise for it was almost silly.
But now with AMD XP's and Pentium 4's.. optimising for cache line size is very important. The reason for this is two-fold:
A) the BUS speed between system ram and the CPU cache's is many multiples slower than the CPU itself. It takes longer to move the memory from system ram to the caches than it does to do a multiplication or even a series of multiplications.
B) the cache line sizes are now 64 bytes
On AMD XP's there is a simple way to prime the caches before you need to use the data that is fully compatible with all earlier processors: Simply read from memory before you need it. You only need to read one byte from memory per cache line. Just keep 4 or so cache lines being filled before you use the data.
---
As far as SSE - The idea of SIMD is a good one for sure but there are some practical problems when trying to make full use of these extensions.
One is that you really need to utilize assembly language.. the SSE extensions to the C++ compilers will not know if your data is 16-byte aligned (SSE loves 16-byte aligned data) when producing your executable so cant take advantage of the special aligned move operations which are much faster than the general purpose move operations.
Another pressing problem is that SSE does not have horizontal operations. When an SSE register is loaded with 4 values, you cant perform operations that use these values exclusively. You cant multiply the 1st value by the 2nd value. The SIMD nature of SSE is only vertical. You can multiply one SSE register by another which multiples the 1st value of the 1st register by the 1st value of the second register, the 2nd value of the first register by the 2nd value of the second register.. and so forth.
Intels SSE2 attempts to add horizontal operations but it falls a little short and is not compatible with any AMD processor that I am aware of.
And the main drawback of using SSE (which is foced upon you by the previous limitations i've listed) is that you are forced to use a memory organisation that isnt exactly friendly:
structure of arrays (SoA)
rather than the familiar
array of structures. (AoS)
a_unique_person
24th September 2003, 01:13 AM
Originally posted by Yahweh
I dont really have much to add, as I dont understand any of it anyway...
A few weeks ago, I picked up a C++ book. I've been trying to teach myself how to program, and I've come to the realization that C++ was intentionally designed to be hard to learn.
One day at a time, one crappy DOS window at a time...
Don't try to learn programming with C++. It is the worlds most difficult programming language.
Personally, I have a soft spot for Delphi. However, C#, which is based on the best of Delphi and C++ may be worth looking at, or Java. Both are one iteration further on along the path to a good programming language.
rockoon
24th September 2003, 05:09 AM
Originally posted by a_unique_person
Don't try to learn programming with C++. It is the worlds most difficult programming language.
Personally, I have a soft spot for Delphi. However, C#, which is based on the best of Delphi and C++ may be worth looking at, or Java. Both are one iteration further on along the path to a good programming language.
What makes you think java is a good programming language? ;-)
a_unique_person
24th September 2003, 05:31 AM
C++ is more of an experimental language that is still being developed. Compiler writers have a lot of trouble conforming to the specifications. Just reading the queries people have on inheritance/initialisation/templates etc is enough to make your brain hurt.
Java was designed to be easier to understand while still being powerful.
Peskanov
24th September 2003, 02:25 PM
C++ is more of an experimental language that is still being developed. Compiler writers have a lot of trouble conforming to the specifications. Just reading the queries people have on inheritance/initialisation/templates etc is enough to make your brain hurt.
Java was designed to be easier to understand while still being powerful.
If your aim is to learn a OO language to get job, Java is a good option, an C# also probably is (although I hate to admit it).
If you are interested to learn an OO language because of the alleged benefits (OO as the silver bullet, etc...) try Smalltalk. Also Ruby or Python are good starts. Those two also have the benefit of being widely used in professional circles, specially Python.
BTW, I am trying to learn Caml right now, a extremely powerful language. It's most used variant, OCaml, also includes OO features.
Peskanov
25th September 2003, 01:23 AM
Hello rockoon,
Memory optimisations are indeed critical. The more modern the processor the more critical it is. A 386 didnt need any sort of cache optimisation strategy (it didnt have a memory cache) and a 486 had such small cache lines that trying to optimise for it was almost silly.
But now with AMD XP's and Pentium 4's.. optimising for cache line size is very important. The reason for this is two-fold:
What do you mean with "optimising for cache line size"?
A) the BUS speed between system ram and the CPU cache's is many multiples slower than the CPU itself. It takes longer to move the memory from system ram to the caches than it does to do a multiplication or even a series of multiplications.
B) the cache line sizes are now 64 bytes
On AMD XP's there is a simple way to prime the caches before you need to use the data that is fully compatible with all earlier processors: Simply read from memory before you need it. You only need to read one byte from memory per cache line. Just keep 4 or so cache lines being filled before you use the data.
I disagree. Prefetching has few benefits in purely sequential loops like Stimpson's ones.
In vertical access it can help, but vertical access IS the real problem, this is the reason I talk about tiling & transpose.
In random accessing through big tables, prefetch can also help using large software pipelining, but this is not the case. And usually you can avoid this case using tiling or ordering.
Anyway, please note that the tendency in modern processor design is to add mechanisms to calculate automatically the prefetch needs and bring the data that will be probably used. You have a good example in the G5 architecture.
One is that you really need to utilize assembly language.. the SSE extensions to the C++ compilers will not know if your data is 16-byte aligned (SSE loves 16-byte aligned data) when producing your executable so cant take advantage of the special aligned move operations which are much faster than the general purpose move operations.
In PowerPC, all the Altivec datatypes are declared by default aligned using some GCC pragma.
You can use the macros and types and introduce Altivec code in C/C++ directly.
Are you sure it's not the same in the Intel world?
Another pressing problem is that SSE does not have horizontal operations. When an SSE register is loaded with 4 values, you cant perform operations that use these values exclusively. You cant multiply the 1st value by the 2nd value. The SIMD nature of SSE is only vertical. You can multiply one SSE register by another which multiples the 1st value of the 1st register by the 1st value of the second register, the 2nd value of the first register by the 2nd value of the second register.. and so forth.
Intels SSE2 attempts to add horizontal operations but it falls a little short and is not compatible with any AMD processor that I am aware of.
But this just normal...Some of the fisrt vectorial processors had vectors with dozens of values! Operating in parallel (vertical) is the nature of vectors processors.
Frankly, if you can't parallelize your algorithm, why bother using vectors? Fortunately near all algorithms are vectorizable.
rockoon
25th September 2003, 02:59 AM
What do you mean with "optimising for cache line size"?
As an example the AMD XP's L1 cache line size is 64 bytes. If you access a single byte of memory (not using an instruction which specifically avoids cache pollution) 64 contiguous bytes will be loaded. Its not just any old 64-bytes, its a specific 64-bytes that can be predicted. The first line begins at absolute memory address 0, the second at absolute memory address 64, and so forth.
If you allign your data on these 64-byte boundaries (called lines) you can gain performance by avoiding cache pollution. Using a single cache line rather than 2, to handle your data structure. If your data structure happens to be 64 bytes in size and aligned on these 64 byte cache lines then accessing a single element of the structure will load the entire structure into the caches.
The first access of the structure would then have a wait state penalty while the rest of the accesses wont have any wait state penalty as long as the data remains in the cache.
--
Going further, the AMD XP will not wait if it doesnt -need- the data yet. Out of order execution. When it sees an instruction such as "mov eax, [esi]" it wont wait as long as you dont use the eax register as a source operand before the cache line is finally loaded. The CPU is happy to keep on executing other non-related instructions. If the cache line is loaded when you do finally need the value in the eax register then no wait state occurs.
This leads to the idea of prefetching which has real gains. Doubling and even trippling the performance of tight loops if one of the instructions within the loop is only there to prefetch a cache line several iterations ahead of time. THis works because you dont actualy have to use the data you are prefetching. As long as you dont use it then it will never cause a wait state but serves the purpose of preloading a cache line before you need to access it.
For more details, refer to the AMD Optimisation guide which is available at their website.
Guide (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf)
Anyway, please note that the tendency in modern processor design is to add mechanisms to calculate automatically the prefetch needs and bring the data that will be probably used.
And therein is the problem. On the one hand you can specifically prefetch the data you are sure to need.. on the other you can hope the processor guesses right. If branch prediction is any indicator of the processors guessing abilities, then i'd suggest being explicit and forcing the prefetch of the data you want. I've seen first-hand a 100% performance increase on sequential data.
But this just normal...Some of the fisrt vectorial processors had vectors with dozens of values!
Indeed. And one of the first SIMD extensions to PC compatibles was 3DNow which has horizontal operations. The reason its a pain without horizontal operations is memory organization.
Consider calculating the cross product of two vectors using SSE SIMD without horizontal operations. One vector would get loaded into a SSE register just as it appears in memory using a single instruction but the other cannot. The other must be spread along (in the case with 3D vectors) 3 other registers.. and expanded to fill those registers. This is 6 times more instructions than was needed for loading the first vector. This also cant be pipelined well because the result of one instruction is immediately needed by another before you can finally perform your goal of multiplying the two vectors which needs the result of all of the 7 "load and reorder" instructions. Its so hard on the pipline that the performance gains are rather disappointing.
This of course can be alleviated by using Structure of Arrays memory organization rather than Array of Structures organization. But the latter memory organization has other drawbacks (cache pollution being one of them).
Peskanov
26th September 2003, 03:24 AM
rockoon,
If you allign your data on these 64-byte boundaries (called lines) you can gain performance by avoiding cache pollution. Using a single cache line rather than 2, to handle your data structure. If your data structure happens to be 64 bytes in size and aligned on these 64 byte cache lines then accessing a single element of the structure will load the entire structure into the caches.
You miss the point that most of alg. read/write the data in a sequential manner, specially the vector/matrix ones in Stimpson library.
Reading the rest of the cache line is in fact a form of prefetch.
For example, a simple vertex struct:
typedef struct {
float fX;
float fY;
float fZ;
}
If you read fX, you will get 64/(3*4) = 5,3 vertex on cache.
If you pad the struct
typedef struct {
float fX;
float fY;
float fZ;
long PAD;
}
you will get 64/(4*4) = 4 vertex on cache
In sequential reading/writting, the first case is alway preferable. In random accesing, it will always depend on the size of the struct and the size and philosophy of the caches.
Well, unless you are manually purging the caches, a boring task that rarely pay.
An exception to all this are really simple alg. like memory copy, where you can use cache line inst. There, cache line size alignment can produce big speedups.
The first access of the structure would then have a wait state penalty while the rest of the accesses wont have any wait state penalty as long as the data remains in the cache.
It does not matter, in the next iteration it could mean there are no waits at all. It auto-balances itself.
I have found cases where un-aligning the data increased speed (to my despair).
Going further, the AMD XP will not wait if it doesnt -need- the data yet. Out of order execution. When it sees an instruction such as "mov eax, [esi]" it wont wait as long as you dont use the eax register as a source operand before the cache line is finally loaded. The CPU is happy to keep on executing other non-related instructions. If the cache line is loaded when you do finally need the value in the eax register then no wait state occurs.
This leads to the idea of prefetching which has real gains. Doubling and even trippling the performance of tight loops if one of the instructions within the loop is only there to prefetch a cache line several iterations ahead of time. THis works because you dont actualy have to use the data you are prefetching. As long as you dont use it then it will never cause a wait state but serves the purpose of preloading a cache line before you need to access it.
For more details, refer to the AMD Optimisation guide which is available at their website.
rockoon, this manual, when talking about alignment, refers to natural data size alignment, not cache line size alignment.
Unaligned accesses are always expensive. In some CPUs (like hitachi sh series) it even produces an exception.
BTW, this book does not even mention tiling at all. Keeping locality is one of the most powerful optimizing techniques used today.
It's too AMD-centric, you should search alternative sources of info about optimzation.
And therein is the problem. On the one hand you can specifically prefetch the data you are sure to need.. on the other you can hope the processor guesses right. If branch prediction is any indicator of the processors guessing abilities, then i'd suggest being explicit and forcing the prefetch of the data you want. I've seen first-hand a 100% performance increase on sequential data.
I have never seen such big improvements done by prefetching. ITOH powerpc cpu rarely get speeds too far from the bus speed.
I would love to see one example of this to see which improvement can I gain in my G4; prefetching never produced big improvements in my code.
And therein is the problem. On the one hand you can specifically prefetch the data you are sure to need.. on the other you can hope the processor guesses right. If branch prediction is any indicator of the processors guessing abilities, then i'd suggest being explicit and forcing the prefetch of the data you want. I've seen first-hand a 100% performance increase on sequential data.
The problem is that manually adjusting the prefetch can sometimes be bad in other CPUs. as you are working for the size of the L1 which changes in every CPU.
In my work this is not a problem, as I work in the console maket, and the specs of every machine are closed.
But working for PC, I see few sense in using this kind of finetuning opts.
This of course can be alleviated by using Structure of Arrays memory organization rather than Array of Structures organization. But the latter memory organization has other drawbacks (cache pollution being one of them).
You did not get my point.
What I meant: when using vector units, you always have to structure your data to fit the vector format.
I mean, expecting sequential instructions in a vector unit is a bit naive. You have two options: you can parallelize your data and use the vector unit, or, if dificult, you can forget about using the vector unit.
rockoon
26th September 2003, 05:39 AM
Originally posted by Peskanov
rockoon,
You miss the point that most of alg. read/write the data in a sequential manner, specially the vector/matrix ones in Stimpson library.
Reading the rest of the cache line is in fact a form of prefetch.
You missed the point. Reading the rest of the line is certainly a form of prefetch but the problem is you are asking for the prefetch precisely when you have gotten to handling the data. So there the CPU is, waiting for the data to be copied into the caches. Hardly a prefetch at all.
In your vector structure, you are using 12 bytes (assuming 32-bit floats). On AMD XP's and P4's every 5 vectors you will hit a wait state unless you prefetch the data ahead of time. Padding to 16 bytes will make that aspect even worse. Optimisation is not a simple process anymore. You can't point to a single "rule" and know what the fastest form will be. Infact on both Intel and AMD cpu's they have a hard time telling you how long an instruction will take because of all these complex factors. The best tool is to write the code and measure it.
Well, unless you are manually purging the caches, a boring task that rarely pay.
An exception to all this are really simple alg. like memory copy, where you can use cache line inst. There, cache line size alignment can produce big speedups.
Well I never mentioned cache purging. There is of course no benefit to cache purging. I don't believe its even possible on AMD or Intel CPU's because its a silly operation.
It does not matter, in the next iteration it could mean there are no waits at all. It auto-balances itself.
I have found cases where un-aligning the data increased speed (to my despair).
the point is that you can avoid *all* the waits. Auto-balancing is just an average of those iterations with waits and those without. The point is you dont need *any* iterations with waits (cept at the very beginning).
rockoon, this manual, when talking about alignment, refers to natural data size alignment, not cache line size alignment.
Unaligned accesses are always expensive. In some CPUs (like hitachi sh series) it even produces an exception.
Read it again. Alligning on 64-byte boundaries on the latest AMD and intel's has everything to do with organising your prefetching scheme. The section you were reading was just general tips.
It's too AMD-centric, you should search alternative sources of info about optimzation.
Actualy its not. Prefetching works for intels too.
I have never seen such big improvements done by prefetching. ITOH powerpc cpu rarely get speeds too far from the bus speed.
And that is why you havent seen significant gains. On the latest AMD's and Intels the ratio between memory bus speed and the chip speed is often 8:1 or higher. Operating on data not in the cache will then be 8+ times slower than operating on data in the on-chip caches with most instructions. That could turn a 17 cycle iteration ( a vector * matrix operation ) into a 24+ cycle iteration. Doesnt that seem like a big difference to you?
In the future this will become even more important as the CPU speeds will continue to grow faster than the BUS speeds.
I would love to see one example of this to see which improvement can I gain in my G4; prefetching never produced big improvements in my code.
If only the G4 was a good CPU ;-) In all seriousness the G4 isnt even close to the same architecture as the common desktops on the market. I can't even pin down weather or not the caches are even on-chip. One site seems to indicate that at least the L2 cache is not on-chip. Which leads me to believe we are talking 486-level technology as far as cache's are concerned.
The problem is that manually adjusting the prefetch can sometimes be bad in other CPUs. as you are working for the size of the L1 which changes in every CPU.
The point being that NOT prefetching is almost always worse than an over-saturation or under-saturation of prefetches. Asking twice for the same cache line often has *zero* overhead. Asking for every other cache line still has benefits every other time.
Can the G4 execute multiple instructions per clock? These prefetch instructions are often completely hidden by another instruction on AMD/Intel. They dont add any additional overhead when placed correctly. This has been true since the first pentium and K6.
As I mentioned before. a vector * matrix operation can be done in as few as 17 clocks average on AMD XP cpu's and 18 clocks average on Pentium4's. But only when the data is in the caches! Thats over 100 million per second on a 2ghz machine which is almost considered low-end these days.
A vector is typically 16 bytes, ie the required memory bandwidth to perform this is 16 * 100mil = 1.6 billion bytes per second but the typical bus speed of a 2ghz machine is 266 mhz - which through the miracles of DDR has a throughput of just about an optimum 2.1 billion bytes per second but a latency of 1 bus clock or 6 cpu clocks per request. IE, you will need to prefetch in order to achieve these high rates while still operating on the data.
In terms of this level of efficiency, a wait state is a disaster.
Peskanov
26th September 2003, 03:27 PM
rockoon,
Well I never mentioned cache purging. There is of course no benefit to cache purging. I don't believe its even possible on AMD or Intel CPU's because its a silly operation.
With purging, I meant zero-pad the data line you are going to write fully, so it doesn't have to be read, and, once written, throwing them out the cache. In PPC the instructions are dcbz (data cach block zero) and dcbts (datacache block touch for store).
Actualy its not. Prefetching works for intels too.
Prefetching works since the times of 68040/Pentium/PPC601, that is, nearly all commercial processors since the 1989-92 arch. transition.
If only the G4 was a good CPU ;-) In all seriousness the G4 isnt even close to the same architecture as the common desktops on the market. I can't even pin down weather or not the caches are even on-chip. One site seems to indicate that at least the L2 cache is not on-chip. Which leads me to believe we are talking 486-level technology as far as cache's are concerned.
You are incredibly ignorant, however that doesn't seems to stop you from trying to look an expert.
Come on, take a look in Ars technica; try there because you will probably not be able to properly read the manuals motorola have online.
Can the G4 execute multiple instructions per clock?
Okay, I finally get it. You are trolling.
Look, don't bother replying. I am seeing your profile; choose another target for your silly games. You are on ignore.
Stimpson, sorry for derailing your thread. I leave the thread now.
rockoon
26th September 2003, 08:42 PM
oh i'm trolling because I don't know what the specifics of the G4 are. Is that sand your head is in?
To make sure people leave with the right idea:
SSE Prefetching (http://www.gamasutra.com/features/19990730/sse_prefetch_01.htm)
Evaluating the Impact of Memory System Performance (http://www.ece.umd.edu/~aneesh/paper-final.pdf)
Optimizing Computer Runtime Using Code Optimization and Data Prefetching (http://www.cis.rit.edu/class/simg707/Web_Pages/alain.htm)
And here (http://www.cts.com.au/cfort.html) you will find:
Data prefetching is an effective technique to hide memory access latency. Data prefetching inserts prefetch instructions for selected data references at specific points in the program, so that referenced data items are moved as close to the processor as possible (put in cache memory) before the data items are actually used. For applications that are more compute-intensive, this can provide significant performance improvements.
its even important in multi-processor configurations: HERE (http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15740-f98/public/lectures/prefetch_slides.pdf)
A search for "software prefetching" on google provides over 44,000 documents.
FFed
27th September 2003, 12:13 AM
Originally posted by Stimpson J. Cat
If you just read a book on C++, and then sit down and try to write a program, you are probably no going to get anywhere.
Dr. Stupid
This is exactly what I have started to do. I know nothing about programming and I decided to learn something about it. I don't know anyone who programs so I basically just went off of stuff that I read online. I picked C++ because it looked to be pretty common. I am not sure exactly what kind of program I would like to make, but I would like to learn something.
Stimpson J. Cat
27th September 2003, 03:01 AM
FFed,
This is exactly what I have started to do. I know nothing about programming and I decided to learn something about it. I don't know anyone who programs so I basically just went off of stuff that I read online. I picked C++ because it looked to be pretty common. I am not sure exactly what kind of program I would like to make, but I would like to learn something.
Don't get me wrong. I am not saying that C++ is a bad language to learn programming with. Just that if you already know how to program in one language, the best way to learn a new one (C++ or otherwise), is by porting programs from the language you know.
In your case, my advice would be to find some simple C++ programs, and try to understand what they are doing. You can then test your understanding by trying to modify those programs in small ways. Once you are more comfortable with the language, you can use bits and pieces of those programs as templates for your own. Eventually you will be able to write programs from scratch.
Basically that is how I tought myself to program back when I was a kid, only that was with Basic.
What types of programming are you interested in doing?
Dr. Stupid
FFed
27th September 2003, 10:36 AM
Originally posted by Stimpson J. Cat
FFed,
What types of programming are you interested in doing?
Dr. Stupid
I am not really sure. I don't have any specific need for any type of program but I am interested in learning some language. I was thinking of just learning C++ and then I would figure out what to make from there. I know it sounds a little backwards, but I figured I would give it a try and see what happens.
Is there a better first language to learn?
Skeptoid
27th September 2003, 12:47 PM
FFed,
Give Python (http://www.python.org/) some thought. I've seen it recommended for beginning programmers many times. It's a free open source programming language used by novices and professionals alike. There's a lot of documentation, support, and tutorials available through their website.
FFed
30th September 2003, 07:50 PM
Originally posted by Skeptoid
FFed,
Give Python (http://www.python.org/) some thought. I've seen it recommended for beginning programmers many times. It's a free open source programming language used by novices and professionals alike. There's a lot of documentation, support, and tutorials available through their website.
Thanks, I will take a look at it.
Diamond
3rd October 2003, 01:53 PM
In the future, the only sound will be geek speaking code unto geek.
Underemployed
3rd October 2003, 02:58 PM
And that would of course all be in geek code. (http://www.geekcode.com/geek.html)
© 2001-2008, James Randi Educational Foundation. All Rights Reserved.
vBulletin® v3.7.3, Copyright ©2000-2008, Jelsoft Enterprises Ltd.