There have been significant advances in designing deep networks for complex computer vision tasks. One that is of considerable importance is image understanding through pixel-wise classification, i.e. semantic segmentation. Despite the advances, self-supervised algorithms have many limitations and challenges, with perhaps the most significant being generalization. This thesis introduces a method based on generative models as a practical approach for addressing these shortcomings. First, we analyze several semantic segmentation methods to gain insight into their limitations. We investigate the effectiveness of one of the state-of-the-art methods on two different problem settings. The latter part of the thesis introduces an alternative approach using generative adversarial networks and autoencoders for image-to-image translation. The main idea is encoding an image into two latent codes to represent structure and style. We propose a new approach to enforce structure-consistency without requiring semantic labels to disentangle the two latent codes. We further show how this would result in a more detailed style transfer and image manipulation. Finally, we present results on multiple datasets and discuss how our approach can be practical in real-world applications. Our experiments demonstrate that our approach performs better than the baselines -or, in the worst-case, gives comparable results- while solving some of the shortcomings in tasks requiring a semantic mask.