If you do it in '2D', you just move the player up and down the y axis to simulate moving in and out along the z axis. Then you draw everything sorted by y position, back to front (the objects higher up the screen first), so that objects lower on the screen appear in front of the ones behind them. Jumping is a special case, because instead of using the y axis to simulate z, you actually want to move the player along the y axis temporarily. You would keep track of their state and, while they are jumping, keep rendering them on their 'real' z plane, meaning the depth they jumped from, rather than the one their current y position would suggest.
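A rough sketch of that idea in Python, in case it helps. None of these names come from a real engine: `Entity`, `ground_y`, `jump_height`, `draw_order` and `screen_position` are all made up for illustration, and I am assuming screen coordinates where y grows downward, so a larger `ground_y` means closer to the camera. The point is just that the sort key is the ground position, while the jump only changes where the sprite is drawn.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    x: float
    ground_y: float           # y position on the ground; stands in for z (depth)
    jump_height: float = 0.0  # how far above the ground the entity is; 0 when not jumping


def draw_order(entities):
    """Return entities back to front: smaller ground_y (farther away) first."""
    return sorted(entities, key=lambda e: e.ground_y)


def screen_position(e):
    """Where to draw the sprite. Jumping lifts it on screen but leaves its depth alone."""
    return (e.x, e.ground_y - e.jump_height)


# Hypothetical scene: the player is mid-jump between a crate and a thug.
player = Entity("player", x=50, ground_y=120, jump_height=40)
crate = Entity("crate", x=60, ground_y=100)
thug = Entity("thug", x=70, ground_y=140)

order = [e.name for e in draw_order([player, crate, thug])]
for e in draw_order([player, crate, thug]):
    print(e.name, screen_position(e))
```

Here the player is drawn at y = 80, which is higher on screen than the crate, but still sorts after the crate because the sort uses `ground_y` rather than the drawn position.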
For collision detection with this method, I would assume your z plane would be chunked up a bit. Imagine highway lanes: anything in a lane can collide with the other things in that lane.
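Something like the following is what I have in mind for the lanes, again only as a sketch. `Body`, `LANE_HEIGHT`, `lane_of` and `collides` are hypothetical names, the lane size of 16 is an arbitrary pick, and I am assuming simple boxes that only need an overlap test along x once two things share a lane.

```python
from dataclasses import dataclass

LANE_HEIGHT = 16  # assumed depth of one lane, in the same units as ground_y


@dataclass
class Body:
    name: str
    x: float         # left edge
    width: float
    ground_y: float  # y position on the ground; stands in for z (depth)


def lane_of(body):
    """Map a ground position to a lane index."""
    return int(body.ground_y // LANE_HEIGHT)


def collides(a, b):
    """Two bodies collide only if they share a lane and overlap along x."""
    if lane_of(a) != lane_of(b):
        return False
    return a.x < b.x + b.width and b.x < a.x + a.width


player = Body("player", x=10, width=20, ground_y=100)
thug = Body("thug", x=25, width=20, ground_y=104)    # same lane as the player
crate = Body("crate", x=25, width=20, ground_y=130)  # overlaps in x, different lane

print(collides(player, thug))   # True
print(collides(player, crate))  # False
```

One trade-off with fixed lanes is that two things sitting just either side of a lane boundary will not collide even though they are close in depth. If that matters, comparing the difference in `ground_y` against a threshold is an alternative.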