Don't quote me on this, but I think the point is to work on multiple instructions at once instead of doing one instruction, reading the next, doing that one, reading the next, and so on.
Bang on, I'll elaborate though
In the CPU, complex instructions can take multiple clock cycles to complete. This was a lot more prevalent back when registers were only 8 or 16 bits wide and the operations were large relative to that.
These instructions would need to be broken down into multiple smaller steps and done separately. What designers started to realize was that even though the CPU had registers and L* cache holding far more than it could execute at once, most of that capacity was going unused: the CPU would pull in one instruction at a time, work on it, store the result, load up the next one, and on and on. What they had it do instead was load as many of the instructions, memory pointers, and output pointers into the cache as possible at once and have the CPU work through them in one continuous execution run.
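Not how real silicon works in detail, obviously, but here's a rough Python sketch of just the timing arithmetic, assuming an idealized 4-stage pipeline where every stage takes exactly one cycle and there are no stalls or hazards (the stage names and counts are illustrative, not tied to any particular CPU):

    # Rough timing sketch of pipelining vs. strictly sequential execution.
    # Idealized: 4 stages, one cycle each, no stalls or hazards.

    STAGES = ["fetch", "decode", "execute", "writeback"]

    def cycles_sequential(num_instructions: int) -> int:
        """Finish every stage of one instruction before starting the next."""
        return num_instructions * len(STAGES)

    def cycles_pipelined(num_instructions: int) -> int:
        """Start a new instruction each cycle while earlier ones are still in flight."""
        return len(STAGES) + (num_instructions - 1)

    if __name__ == "__main__":
        for n in (1, 4, 100):
            print(f"{n:>3} instructions: "
                  f"sequential={cycles_sequential(n):>4} cycles, "
                  f"pipelined={cycles_pipelined(n):>4} cycles")

For 100 instructions that's 400 cycles vs. 103, so the win grows with the number of back-to-back instructions; a single instruction on its own gains nothing.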
A 32-bit CPU usually has far more than 32 bits of cache, so there is usually ample space for holding these pipelined instructions.
Basically, you could think of it a bit like buffering a video. You're preloading things just in time for when you need them, instead of fetching each one at the moment you need it. This doesn't work in all situations, of course, and it presents some problems like added latency, and in some cases it can even take longer than doing it the original way, but in operations like blits on large bitmaps or heavy mathematics it can seriously speed up an application.
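To make the buffering analogy concrete in software terms (this is just an analogy, not what the CPU hardware literally does): here's a tiny Python sketch that prefetches the next chunk of data on a background thread while the current chunk is being processed. All the names, chunk sizes, and timings are made up for illustration:

    # Toy software analogy for "buffer ahead": fetch the next chunk while
    # the current one is being processed. Purely illustrative numbers.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch_chunk(i: int) -> list[int]:
        time.sleep(0.05)            # pretend this is a slow read
        return list(range(i * 10, (i + 1) * 10))

    def process_chunk(chunk: list[int]) -> int:
        time.sleep(0.05)            # pretend this is real work
        return sum(chunk)

    def run_prefetched(num_chunks: int) -> int:
        total = 0
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fetch_chunk, 0)              # start fetching chunk 0
            for i in range(num_chunks):
                chunk = pending.result()                       # wait for the prefetch
                if i + 1 < num_chunks:
                    pending = pool.submit(fetch_chunk, i + 1)  # overlap the next fetch
                total += process_chunk(chunk)                  # ...with the current work
        return total

    if __name__ == "__main__":
        start = time.time()
        print(run_prefetched(10), f"({time.time() - start:.2f}s)")

With both steps taking about 50 ms each, overlapping them roughly halves the wall-clock time compared to strictly fetching then processing each chunk in turn, which is the same shape of win the pipeline gets, just at a much coarser grain.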