Feature pyramid attention network for audio‐visual scene classification