Robust 3D Human Avatar Reconstruction From Monocular Videos Using Depth Optimization and Camera Pose Estimation