Hi all,
Last week I wrote a preprocessor for .fx effect files that helps me replicate shader code easily. Why replication? Well, there are some good reasons for it. Come on, follow me... and excuse my English!!! :)
Take a look at this (pixel shader 3.0) code:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
for (int i = 0; i < numLights; ++i)
result += ComputeLight(In, i);
return result;
}
It basically computes the light contribution from a variable number of lights (a typical scenario; the details of the ComputeLight function aren't important here).
Now, consider a simple scenario with a fixed number of lights: 16 (i.e. numLights = 16).
The HLSL compiler generates a rep/endrep code block for the loop. On my machine (GeForce 6800 Ultra, 512 MB), I get about 39 fps! Why such poor performance? Because in the ComputeLight function I read the light attributes (position, color, etc.) from a constant array that is dynamically indexed (by the parameter i)! Something like this:
float NL = dot(In.Normal, normalize(lightPos[i] - In.WorldPos));
where the lightPos array (set by the C++ code) is declared as follows:
float3 lightPos[MAX_NUM_OF_LIGHTS];
Now, because dynamic indexing of constants isn't supported by PS 3.0, the compiler probably translates it into code like the following:
pos = lightPos[i];
becomes:
if (i == 0) pos = lightPos[0];
else if (i == 1) pos = lightPos[1];
....
else if (i == 31) pos = lightPos[31];
which is quite inefficient!!! So never use dynamically indexed arrays! (Or at least, use them only for very small arrays.)
A solution to this problem is to pass the light parameters in a texture, so I coded another version of ComputeLight that uses texture lookups (ComputeLightUseTex).
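The post doesn't show how the light texture is laid out, so the following is only a minimal Python sketch of one possible host-side packing. The layout (two RGBA float texels per light: position, then color) and the names pack_lights and texel_u are assumptions for illustration, not the layout ComputeLightUseTex actually uses.

```python
from typing import List, Sequence, Tuple

# Assumed layout: texel 0 of each light = position, texel 1 = color.
TEXELS_PER_LIGHT = 2

Light = Tuple[Sequence[float], Sequence[float]]  # (position xyz, color rgb)


def pack_lights(lights: List[Light], max_lights: int) -> List[float]:
    """Flatten lights into RGBA floats for a (max_lights * 2) x 1 texture."""
    if len(lights) > max_lights:
        raise ValueError("more lights than the texture can hold")
    texels: List[float] = []
    for position, color in lights:
        texels.extend([*position, 1.0])  # w component is unused padding
        texels.extend([*color, 1.0])     # alpha component is unused padding
    # Unused slots are filled with zeros so the texture size stays fixed.
    unused = max_lights - len(lights)
    texels.extend([0.0] * (4 * TEXELS_PER_LIGHT * unused))
    return texels


def texel_u(light_index: int, attribute: int, max_lights: int) -> float:
    """Normalized u coordinate of the texel center for one light attribute."""
    width = max_lights * TEXELS_PER_LIGHT
    return (light_index * TEXELS_PER_LIGHT + attribute + 0.5) / width


if __name__ == "__main__":
    data = pack_lights([((1.0, 2.0, 3.0), (0.5, 0.5, 0.5))], max_lights=32)
    print(len(data), texel_u(0, 0, 32))
```

With a layout like this, the shader side would turn the light index into a texture coordinate (as texel_u does) and fetch the attribute with a texture lookup instead of indexing a constant array.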
Now the code:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
for (int i = 0; i < numLights; ++i)
result += ComputeLightUseTex(In, i);
return result;
}
gives me 177 fps! A great improvement!
OK, let's try to improve it a bit:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
for (int i = 0; i < MAX_NUM_OF_LIGHTS; ++i)
if (i < numLights)
result += ComputeLightUseTex(In, i);
return result;
}
This gives me half the performance of the previous version (85 fps for ComputeLightUseTex and 19 fps for ComputeLight), because all MAX_NUM_OF_LIGHTS iterations are executed (so 32 instead of 16). This happens because the compiler flattens the if condition, so the guarded code is executed in every case. So, my next try was to force branching:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
for (int i = 0; i < MAX_NUM_OF_LIGHTS; ++i)
[branch]if (i < numLights)
result += ComputeLightUseTex(In, i);
return result;
}
But I get the error: "error X3528: can't force branch with gradients". This is because ComputeLightUseTex uses texture lookups, which need gradients.
So I tried the version that uses the constant array (ComputeLight), and I get about 32 fps because of the dynamic indexing problem. A solution to that problem could be to force loop unrolling:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
[unroll]for (int i = 0; i < MAX_NUM_OF_LIGHTS; ++i)
[branch]if (i < numLights)
result += ComputeLight(In, i);
return result;
}
This should be faster, because the i parameter is now a constant. But the compiler fails with the error:
"error X4550: maximum boolean register index exceeded - Try reducing number of constant branches, take bools out of structs/arrays or move them to the start of the struct"
This happens because I use a static branch inside the function (an if to activate/deactivate the specular component).
So, finally, the winning code (177 fps) was the second one:
#define MAX_NUM_OF_LIGHTS 32
uniform int numLights;
float4 PS(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
for (int i = 0; i < numLights; ++i)
result += ComputeLightUseTex(In, i);
return result;
}
but this requires that I use a texture to store my light attributes.
At this point, I tried to unroll the loop manually (remember that numLights was fixed to 16 in all my previous tests):
float4 PS16Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
...
result += ComputeLightUseTex(In, 15);
return result;
}
This gives me about 336 fps! The reason for this speed boost isn't clear; doubling the performance was quite unexpected. Probably the preshader can do a better job with the unrolled version, or perhaps rep/endrep carries a severe performance penalty. I also tried this method with ComputeLight (i.e. with the constant array) and I still get 336 fps. So now I can use either version without problems.
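To show how mechanical the hand-unrolling is, here is a small Python sketch that emits the unrolled pixel shader for a given light count. The function name unrolled_shader and the four-space indentation are made up for illustration; this is not the preprocessor described in this post, just the simplest possible generator.

```python
def unrolled_shader(num_lights: int, light_fn: str = "ComputeLightUseTex") -> str:
    """Return HLSL source for a pixel shader with the light loop unrolled."""
    lines = [
        f"float4 PS{num_lights}Lights(VS_OUTPUT In) : COLOR0",
        "{",
        "    float4 result = 0;",
    ]
    # One call per light, each with a literal (compile-time constant) index.
    lines += [f"    result += {light_fn}(In, {i});" for i in range(num_lights)]
    lines += ["    return result;", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(unrolled_shader(16))
```

A one-off script like this covers a single pattern only; every new replication pattern would need another script, which is why a general tag syntax inside the .fx file is more convenient.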
So, the conclusion is that, for now, we can't get rid of manual unrolling and/or shader replication, but this is quite a time-consuming task when done by hand. To solve that problem I wrote my own preprocessor. Take a look at it:
<replicate @1: 1 to 16>
float4 PS<=@1>Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
<replicate @2: 0 to <=@1>-1>
result += ComputeLightUseTex(In, <=@2>);
</replicate>
return result;
}
</replicate>
This produces:
float4 PS1Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
return result;
}
float4 PS2Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
return result;
}
float4 PS3Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
return result;
}
float4 PS4Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
return result;
}
float4 PS5Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
return result;
}
float4 PS6Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
return result;
}
float4 PS7Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
return result;
}
float4 PS8Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
return result;
}
float4 PS9Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
return result;
}
float4 PS10Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
return result;
}
float4 PS11Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
return result;
}
float4 PS12Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
result += ComputeLightUseTex(In, 11);
return result;
}
float4 PS13Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
result += ComputeLightUseTex(In, 11);
result += ComputeLightUseTex(In, 12);
return result;
}
float4 PS14Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
result += ComputeLightUseTex(In, 11);
result += ComputeLightUseTex(In, 12);
result += ComputeLightUseTex(In, 13);
return result;
}
float4 PS15Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
result += ComputeLightUseTex(In, 11);
result += ComputeLightUseTex(In, 12);
result += ComputeLightUseTex(In, 13);
result += ComputeLightUseTex(In, 14);
return result;
}
float4 PS16Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
result += ComputeLightUseTex(In, 0);
result += ComputeLightUseTex(In, 1);
result += ComputeLightUseTex(In, 2);
result += ComputeLightUseTex(In, 3);
result += ComputeLightUseTex(In, 4);
result += ComputeLightUseTex(In, 5);
result += ComputeLightUseTex(In, 6);
result += ComputeLightUseTex(In, 7);
result += ComputeLightUseTex(In, 8);
result += ComputeLightUseTex(In, 9);
result += ComputeLightUseTex(In, 10);
result += ComputeLightUseTex(In, 11);
result += ComputeLightUseTex(In, 12);
result += ComputeLightUseTex(In, 13);
result += ComputeLightUseTex(In, 14);
result += ComputeLightUseTex(In, 15);
return result;
}
Nice, isn't it?
Let me explain it a bit:
<replicate @1: 1 to 16>
float4 PS<=@1>Lights(VS_OUTPUT In) : COLOR0
{
float4 result = 0;
<replicate @2: 0 to <=@1>-1>
result += ComputeLightUseTex(In, <=@2>);
</replicate>
return result;
}
</replicate>
I have introduced the tags <replicate></replicate>. Here is the syntax:
<replicate @n: start_range_expr to end_range_expr>
where:
- n is an integer;
- start_range_expr and end_range_expr are integer expressions that give the iteration limits.
The <replicate></replicate> tags make a copy of their contained text a specified number of times. An associated variable (@1 and @2 in the example) acts as an index. In the body of the replicate tag, you can use the value of that index with the syntax: <=@1>.
The replication tags can be nested without limit.
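The tag semantics described above can be sketched in a few lines of Python. This is not the actual preprocessor, only a minimal model under stated assumptions: both range bounds are inclusive, range expressions use only digits, + and -, and every tag sits on its own unindented line. The names expand and eval_int are hypothetical.

```python
import re

# Opening tag whose range no longer contains unexpanded <=@n> references.
OPEN = re.compile(r"<replicate\s+@(\d+)\s*:\s*([^<>]+?)\s+to\s+([^<>]+?)\s*>")
TOKEN = re.compile(r"<replicate\b|</replicate>")


def eval_int(expr: str) -> int:
    """Evaluate a range bound; only digits, spaces, + and - are accepted."""
    if not re.fullmatch(r"[\d\s+\-]+", expr):
        raise ValueError(f"unsupported range expression: {expr!r}")
    return int(eval(expr, {"__builtins__": {}}))


def expand(text: str) -> str:
    """Expand all <replicate> tags in text, outermost first."""
    out = []
    pos = 0
    while True:
        m = OPEN.search(text, pos)
        if m is None:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:m.start()])
        # Walk forward to the closing tag that matches this opening tag.
        depth, scan = 1, m.end()
        while depth:
            t = TOKEN.search(text, scan)
            if t is None:
                raise ValueError("unclosed <replicate> tag")
            depth += -1 if t.group() == "</replicate>" else 1
            scan = t.end()
        body = text[m.end():t.start()]
        if body.startswith("\n"):
            body = body[1:]  # drop the line break that follows the tag
        var = m.group(1)
        first, last = eval_int(m.group(2)), eval_int(m.group(3))
        # An empty range (last < first) emits nothing at all.
        for value in range(first, last + 1):
            # Substitute the index, then expand tags nested in the copy.
            out.append(expand(body.replace(f"<=@{var}>", str(value))))
        pos = t.end()
        if text.startswith("\n", pos):
            pos += 1  # drop the line break that follows the closing tag


SOURCE = """<replicate @1: 1 to 2>
row <=@1>:
<replicate @2: 1 to <=@1>>
  cell <=@1>.<=@2>
</replicate>
</replicate>
"""

if __name__ == "__main__":
    print(expand(SOURCE), end="")
```

Substituting the outer index before recursing is what lets an inner range refer to an outer variable: by the time the inner tag is parsed, its bounds are plain integer expressions.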
Note that if the end of the range is smaller than the start, the replicate produces an empty string. This way we can simulate a static branch with the following code:
<replicate @1: 0 to 1>
<replicate @3: 0 to 1>
float4 ComputeLight<=@1><=@3>(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
<replicate @2: 1 to <=@1>>
result += computeSpecularContribution(In, lightIndex);
</replicate>
<replicate @4: 1 to <=@3>>
result *= computeShadowFactor(In, lightIndex);
</replicate>
return result;
}
</replicate>
</replicate>
will produce:
float4 ComputeLight00(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
return result;
}
float4 ComputeLight01(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result *= computeShadowFactor(In, lightIndex);
return result;
}
float4 ComputeLight10(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result += computeSpecularContribution(In, lightIndex);
return result;
}
float4 ComputeLight11(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result += computeSpecularContribution(In, lightIndex);
result *= computeShadowFactor(In, lightIndex);
return result;
}
The name of the function (ComputeLight) was extended with two binary digits: the first represents the specular contribution and the second the shadow factor.
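As a quick illustration of this naming scheme, the Python sketch below lists the function names produced for a given number of on/off features. The helper name variant_names is invented for this example; the ordering follows the nesting, with the outermost replicate as the first digit.

```python
from itertools import product
from typing import List


def variant_names(base: str, num_flags: int) -> List[str]:
    """Names of all variants: one binary digit per on/off feature."""
    return [
        base + "".join(str(bit) for bit in bits)
        for bits in product((0, 1), repeat=num_flags)
    ]


if __name__ == "__main__":
    print(variant_names("ComputeLight", 2))
```

Each extra feature doubles the number of generated functions, so n flags give 2^n variants.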
However, this is not very mnemonic, so we can use an alternative syntax:
<replicate @1: "", "WithSpecular">
<replicate @3: "", "WithShadow">
float4 ComputeLight<=@1><=@3>(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
<replicate @2: 1 to <=#@1>>
result += computeSpecularContribution(In, lightIndex);
</replicate>
<replicate @4: 1 to <=#@3>>
result *= computeShadowFactor(In, lightIndex);
</replicate>
return result;
}
</replicate>
</replicate>
this produces:
float4 ComputeLight(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
return result;
}
float4 ComputeLightWithShadow(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result *= computeShadowFactor(In, lightIndex);
return result;
}
float4 ComputeLightWithSpecular(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result += computeSpecularContribution(In, lightIndex);
return result;
}
float4 ComputeLightWithSpecularWithShadow(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
result += computeSpecularContribution(In, lightIndex);
result *= computeShadowFactor(In, lightIndex);
return result;
}
which is far more readable.
Analyzing the code:
<replicate @1: "", "WithSpecular">
<replicate @3: "", "WithShadow">
float4 ComputeLight<=@1><=@3>(VS_OUTPUT In, int lightIndex)
{
float4 result = 0;
result += computeDiffuseContribution(In, lightIndex);
<replicate @2: 1 to <=#@1>>
result += computeSpecularContribution(In, lightIndex);
</replicate>
<replicate @4: 1 to <=#@3>>
result *= computeShadowFactor(In, lightIndex);
</replicate>
return result;
}
</replicate>
</replicate>
We have a new alternative syntax:
<replicate @n: "string1", "string2", ...>
This way the variable @n iterates over strings instead of integers. The preprocessor still associates a progressive integer with every string (starting from 0). So "string1" has a value of 0, "string2" has a value of 1, etc.
To access that number, use the syntax: <=#@n> (with '#' as a prefix).
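The two substitutions can be modeled with a short Python sketch. It handles a single, non-nested string-list replicate only, and the function name expand_string_replicate is an assumption for illustration, not part of the real tool.

```python
import re


def expand_string_replicate(var: int, string_list: str, body: str) -> str:
    """Copy body once per quoted string in string_list.

    <=@var> is replaced by the string itself and <=#@var> by its
    zero-based position in the list.
    """
    strings = re.findall(r'"([^"]*)"', string_list)
    copies = []
    for ordinal, text in enumerate(strings):
        copy = body.replace(f"<=@{var}>", text)
        copy = copy.replace(f"<=#@{var}>", str(ordinal))
        copies.append(copy)
    return "".join(copies)


if __name__ == "__main__":
    print(expand_string_replicate(
        1, '"", "WithSpecular"', "ComputeLight<=@1> has index <=#@1>\n"
    ), end="")
```

The empty string is what keeps the plain ComputeLight name for the variant with index 0.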
The inner replicates act as two "if" statements. Something like:
if (haveSpecular)
{
result += computeSpecularContribution(In, lightIndex);
}
if (haveShadow)
{
result *= computeShadowFactor(In, lightIndex);
}
Because the start and end of the range are expressions, you can also get an else statement this way:
<replicate @2: 1 to <=#@1>>
// Code for the TRUE part
</replicate>
<replicate @2: 1 to 1-<=#@1>>
// Code for the FALSE part
</replicate>
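The arithmetic behind this if/else trick is easy to check. The Python sketch below assumes inclusive range bounds and uses a hypothetical helper named copies to count how many times each block is emitted for a flag value of 0 or 1.

```python
def copies(start: int, end: int) -> int:
    """Number of copies an inclusive range 'start to end' produces."""
    return max(0, end - start + 1)


if __name__ == "__main__":
    for flag in (0, 1):
        true_part = copies(1, flag)       # range "1 to flag"
        false_part = copies(1, 1 - flag)  # range "1 to 1-flag"
        print(f"flag={flag}: TRUE part x{true_part}, FALSE part x{false_part}")
```

For either flag value exactly one of the two blocks is emitted once and the other is dropped, which is the behavior of an if/else.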
I believe that when the number of replications is high, this kind of preprocessing can be very useful (maybe Microsoft will include a similar feature in its compiler, sooner or later...). I plan to release my preprocessor for free very soon, hoping that it will be helpful. Stay tuned.
OK, any comments/suggestions are really appreciated.
Thanks for your attention and for your patience!
Best Regards,
- AGPX