Learning Structured Appearance Models from Captioned Images of