Introduction: In noisy situations, speech may be masked by competing sounds, including environmental background noise and other talkers. The challenge of listening to one stream of sounds while tuning out background noise is referred to as the “cocktail party problem” (Cherry 1953), but its physiological basis remains poorly understood. In this study, we used electroencephalography (EEG) to measure neural responses to a continuous, controlled speech stimulus without noise versus speech presented in naturalistic noise. The aims of this project were to: 1) characterize neural responses to phonological features in naturalistic noisy environments versus noise-free environments, 2) characterize neural responses to the acoustic envelope of both speech-in-noise and speech-alone stimuli, and 3) investigate which other speech features may provide additional insight into neural entrainment and auditory separation for speech in noise.
Methods: We recorded 64-channel scalp EEG (BrainVision actiCHamp) from 16 native English speakers (8M/8F, ages 20-35) while they watched and listened to movie trailers and listened to sentences from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) acoustic-phonetic corpus. Word and phoneme boundaries were annotated in both sets of stimuli so that neural responses to specific speech content could be analyzed. The TIMIT sentences always occurred in clear contexts with no overlapping speakers, whereas the movie trailers contained multiple instances of overlapping talkers, background music, or other simultaneous sounds alongside speech. Briefly, EEG signals were referenced to the average of the mastoid channels, then eye movement and blink artifacts were removed using independent component analysis (ICA). Data were then bandpass-filtered between 1 and 15 Hz. We analyzed these data using linear receptive field models (Theunissen et al. 2001), which predicted neural activity in response to sounds from the acoustic or linguistic properties of both stimuli over time. We asked whether the neural responses to TIMIT could be predicted from a model fit on a separate subset of those sentences. In addition, we asked whether neural responses to some of the movie trailers could be predicted from a model fit on responses to different movie trailers. Finally, we tested whether a model fit on responses to TIMIT could predict responses to the movie trailer stimuli, and vice versa. The purpose of this final analysis was to determine how similar the responses to acoustic and phonetic content were for a highly controlled stimulus versus a much noisier stimulus in which speech occurred in the presence of varying background noise. In all analyses, model performance was assessed by the correlation between recorded EEG data not used to fit the models and the EEG predicted from either feature set.
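The fit-then-predict logic described above can be illustrated with a minimal sketch of a linear receptive field model. This is not the study's code: the ridge penalty, the number of time lags, the feature and channel counts, and the simulated data are all assumptions chosen for illustration, and the helper names (lag_matrix, fit_receptive_field, held_out_correlation) are hypothetical.

```python
import numpy as np

def lag_matrix(stim, n_lags):
    """Stack delayed copies of stim (time x features) into (time x features*n_lags)."""
    n_times, n_feats = stim.shape
    lagged = np.zeros((n_times, n_feats * n_lags))
    for lag in range(n_lags):
        # Column block `lag` holds the stimulus delayed by `lag` samples.
        lagged[lag:, lag * n_feats:(lag + 1) * n_feats] = stim[:n_times - lag]
    return lagged

def fit_receptive_field(stim, eeg, n_lags, alpha):
    """Ridge regression from lagged stimulus features to each EEG channel.

    The ridge penalty `alpha` is an assumption; the abstract does not
    state how the models were regularized.
    """
    X = lag_matrix(stim, n_lags)
    XtX = X.T @ X
    return np.linalg.solve(XtX + alpha * np.eye(XtX.shape[0]), X.T @ eeg)

def held_out_correlation(weights, stim, eeg, n_lags):
    """Pearson r per channel between predicted and recorded held-out EEG."""
    pred = lag_matrix(stim, n_lags) @ weights
    pred_c = pred - pred.mean(axis=0)
    eeg_c = eeg - eeg.mean(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (eeg_c ** 2).sum(axis=0))
    return (pred_c * eeg_c).sum(axis=0) / denom

# Simulated data only: 3 stimulus features, 4 channels, 10 lags (all hypothetical).
rng = np.random.default_rng(0)
n_lags, n_feats, n_chans = 10, 3, 4
true_w = rng.normal(size=(n_feats * n_lags, n_chans))

def simulate(n_times):
    stim = rng.normal(size=(n_times, n_feats))
    eeg = lag_matrix(stim, n_lags) @ true_w + rng.normal(size=(n_times, n_chans))
    return stim, eeg

stim_train, eeg_train = simulate(2000)  # data used to fit the model
stim_test, eeg_test = simulate(500)     # data held out for evaluation

weights = fit_receptive_field(stim_train, eeg_train, n_lags, alpha=1.0)
r_test = held_out_correlation(weights, stim_test, eeg_test, n_lags)
```

In this sketch, the cross-stimulus analysis would amount to fitting the weights on one stimulus class (for example, TIMIT) and calling held_out_correlation with the stimulus features and EEG from the other class (the movie trailers).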
Results: We were able to predict broadband EEG responses to both controlled and more naturalistic stimuli from acoustic and phonological features with high accuracy (up to r = 0.5). We could also predict neural responses to phonological features in TIMIT from models trained on both TIMIT and movie trailers. However, predictions were more accurate for the within-stimulus-class comparison than for the cross-stimulus comparison. It was more difficult to predict neural responses to the movie trailers, regardless of training stimulus. These results suggest that neural responses entrained better to the phonological features of speech than to the acoustic envelope in noisy conditions, whereas differences in entrainment between the two features (phonological or acoustic envelope) were smaller in the clean speech condition. Our ability to predict neural activity in response to speech sounds was higher when those sounds occurred without background noise.
Conclusion: These results have implications for identifying which features of speech could be used to build a brain-machine interface for communication, or a cognitive hearing aid to identify and separate speech from noise.
Cherry, E. Colin. 1953. “Some Experiments on the Recognition of Speech, with One and with Two Ears.” The Journal of the Acoustical Society of America 25: 975–79.
Theunissen, F. E., S. V. David, N. C. Singh, A. Hsu, W. E. Vinje, and J. L. Gallant. 2001. “Estimating Spatio-Temporal Receptive Fields of Auditory and Visual Neurons from Their Responses to Natural Stimuli.” Network 12 (3): 289–316.