Complex Plane based Realistic Sound Generation for Free Movement in Virtual Reality

Binaural rendering is a technology that generates realistic sound for a user wearing stereo headphones, so it is essential for headphone-based virtual reality (VR) services. However, conventional binaural rendering cannot reflect the user's free movement in VR. When the user moves freely in the VR space, the VR sound no longer matches the visual scene, and the performance of the VR service may be degraded. To reduce this mismatch, a complex plane based stereo realistic sound generation method is proposed that allows free user movement, which changes the distance and azimuth between the user and each speaker. To calculate the distance and azimuth after a change in the user's position, the 5.1 multichannel speaker playback system and the user are placed in the complex plane. The distance and azimuth between the user and a speaker can then be calculated simply as the distance and angle between two points in the complex plane. The 5.1 multichannel audio signals are scaled by the five estimated distances according to the inverse square law, and the scaled signals are mapped onto a newly generated virtual 5.1 multichannel speaker layout using the five measured azimuths and the azimuth of the head movement. Finally, the stereo realistic sound reflecting the user's position change and head movement is obtained through binaural rendering using the scaled and mapped 5.1 multichannel audio signals and the HRTF coefficients. Experimental results show that the proposed method can generate realistic sound reflecting the user's position and azimuth changes in VR with an error rate of less than about 5 %.

Keywords: Virtual reality; realistic sound; binaural rendering; constant power panning; head related transfer function


I. INTRODUCTION
In general, users need their own multi-channel audio playback environment to enjoy the realistic sound carried by multi-channel audio signals. However, most users only have stereo headphones, so they cannot enjoy realistic audio from multi-channel audio signals. Therefore, head related transfer function (HRTF) [1] based binaural rendering has been proposed to overcome this limitation [2][3][4][5][6]. In particular, binaural rendering is essential for delivering more realistic audio to users of systems such as virtual reality (VR) services based on stereo headphones [8][9][10]. In binaural rendering, the stereo realistic sound is generated from the multi-channel audio signals and the HRTF coefficients. Binaural rendering can efficiently supply realistic sound to the user of a VR service, but it has a critical limitation: the existing stereo realistic sound generation does not reflect changes in the user's position. Because binaural rendering with fixed HRTFs cannot reflect the user's position change, a gap arises between the visual scene and the sound, which degrades the performance of the VR service. To solve this fixed sound scene problem, sound scene control of the stereo realistic sound was introduced to reflect changes in the user's head azimuth [11]. In [11], the HRTF coefficients are replaced by new HRTF coefficients corresponding to the user's azimuth change, and the realistic sound with the controlled sound scene is computed from the multi-channel audio signals and the replaced HRTF coefficients.
Although replacing the HRTF coefficients can successfully generate the stereo realistic sound with a sound scene controlled according to the user's head movement, it requires a very large amount of stored HRTF coefficients covering all azimuth directions. The stored HRTF coefficients amount to 23.6 Mbytes, which is 32 times the size of the HRTF coefficients for the 5.1 multi-channel speaker layout. Therefore, sound scene control based on the substitution of the HRTF coefficients is not suitable for embedded systems with little memory. Accordingly, constant power panning (CPP) based sound scene control of the realistic sound was introduced [12][13][14][15]. The CPP based scheme uses only the HRTF coefficients of the 5.1 multi-channel speaker layout, so the amount of HRTF data is exactly the same as in the original binaural rendering. Instead, the CPP based method maps the original multi-channel audio signals onto a new 5.1 multi-channel speaker layout rearranged according to the user's head movement. The CPP based method can be applied to embedded systems with little memory because it generates the realistic sound reflecting the user's head movement without increasing the number of HRTF coefficients.
Meanwhile, a VR service allows the user to move freely in the VR space, so it should consider not only the user's head movement but also the user's position change. That is, the VR service should generate realistic sound that reflects the user's free movement. However, since the sound scene control methods based on HRTF substitution and on CPP only modify the stereo realistic sound scene according to the user's azimuth change, their stereo realistic sound cannot convey a change in the user's distance.
Therefore, there is still a mismatch between the VR scene and the VR sound when the user moves freely in the VR space, and the overall performance of the VR service may be very poor. To improve the performance of the VR service by reflecting the user's free movement in the VR sound, a realistic sound generation method based on a complex plane for tracking the user's movement is proposed. The user's free movement (position change) in the VR space changes both the distance and the azimuth between the user and each speaker, while the user's head movement affects only the azimuth. Therefore, the proposed method handles the user's position change and head movement separately, and it calculates the distance and the azimuth between the user and each speaker resulting from the user's free movement. The proposed method then generates the realistic sound by scaling the audio signal using the measured distance and by adjusting the sound scene using the final azimuth, which is formed by adding the azimuth measured for the position change and the azimuth change due to the head movement. As a result, the proposed method can improve the overall performance of the VR service by generating realistic sound that reflects the user's free movement, including the head movement. The rest of this paper is organized as follows. Section II describes stereo realistic sound generation through binaural rendering and the sound scene control of the realistic sound. Section III proposes the realistic sound generation for the user's free movement in the VR. Sections IV and V give the experimental results and the conclusion, respectively.

This work was funded by the research fund of Korea Nazarene University in 2021. Also, this research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A3B03034951).
www.ijacsa.thesai.org

II. STEREO REALISTIC SOUND GENERATION BASED ON BINAURAL RENDERING FOR VR
A. Binaural Rendering for VR
Because a VR system delivers the VR sound through stereo headphones, it needs a stereo realistic sound generation method that preserves the immersive effect of multi-channel audio signals. VR systems have adopted conventional binaural rendering to generate the stereo realistic sound [1][2][3][4][5][6][7]. Binaural rendering is a technology that generates stereo realistic sound with a multi-channel audio effect for a stereo headphone environment, using HRTF coefficients that characterize all signal paths from the speakers to the human ears [1]. As shown in Fig. 1, binaural rendering is computed from the input multi-channel signal and the HRTF coefficients. To generate the output stereo realistic sound, the input multi-channel audio signals are convolved with the HRTF coefficients as in (1) [2][3][4]. Here, N is the number of channels of the multi-channel audio signal, and the operator in (1) denotes linear convolution. Since time-domain linear convolution between the input signals and the HRTF coefficients has very high computational complexity, binaural rendering is instead calculated as the multiplication of the input signals and the HRTF coefficients in the frequency domain as in (2) [5], and Fig. 1 is updated as Fig. 2. Meanwhile, (2) can be rewritten in matrix form for 5.1 multi-channel audio signals as in (3) [11].
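As a minimal sketch of the rendering step described above, the following Python code sums, for each ear, the convolution of every channel signal with an impulse response for that channel and ear, and then repeats the computation by multiplying zero-padded spectra. The function names, the two-channel toy signals, and the two-tap impulse responses are illustrative assumptions; they are not the paper's data, and the code is not claimed to reproduce equations (1) to (3) term by term.

```python
import cmath


def convolve(x, h):
    """Direct linear convolution of two real sequences (time domain)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, kept simple for clarity."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def convolve_freq(x, h):
    """Linear convolution computed as a product of zero-padded spectra."""
    n = len(x) + len(h) - 1
    spec_x = dft(list(x) + [0.0] * (n - len(x)))
    spec_h = dft(list(h) + [0.0] * (n - len(h)))
    y = dft([a * b for a, b in zip(spec_x, spec_h)], inverse=True)
    return [v.real for v in y]


def binaural_render(channels, hrirs_left, hrirs_right, conv=convolve):
    """Sum the per-channel convolutions into one left and one right signal.

    Assumes all channel signals share one length and all impulse
    responses share one length.
    """
    n = len(channels[0]) + len(hrirs_left[0]) - 1
    left = [0.0] * n
    right = [0.0] * n
    for x, h_l, h_r in zip(channels, hrirs_left, hrirs_right):
        for i, v in enumerate(conv(x, h_l)):
            left[i] += v
        for i, v in enumerate(conv(x, h_r)):
            right[i] += v
    return left, right


# Toy data (hypothetical): two channels, two-tap responses per ear.
toy_channels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_left = [[1.0, 0.5], [0.2, 0.1]]
toy_right = [[0.3, 0.1], [1.0, 0.5]]

time_out = binaural_render(toy_channels, toy_left, toy_right)
freq_out = binaural_render(toy_channels, toy_left, toy_right,
                           conv=convolve_freq)
```

Both calls should agree up to floating-point rounding, which is the property that lets the frequency-domain form replace the costlier time-domain convolution. A practical implementation would use a fast Fourier transform rather than the naive transform shown here.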

B. Sound Scene Control of Stereo Realistic Sound Reflecting Azimuth Change in VR
Although conventional binaural rendering is useful for VR systems, it cannot reflect the user's azimuth change. Therefore, sound scene control of the stereo realistic sound was proposed [11,12]. When the user's azimuth angle changes in the 5.1 channel reproduction environment, the direction from which the 5.1 channel signal reaches the user, that is, the azimuth angle of the 5.1 channel reproduction environment, also changes. The existing HRTF coefficients should therefore be replaced by new HRTF coefficients corresponding to the azimuth angles of the new 5.1 channel reproduction environment. Binaural rendering with sound scene control generates the realistic sound from the substituted HRTF coefficients and the 5.1 channel audio signal according to the user's azimuth change as in (4). Here, if the angle of any channel X minus the head-movement angle $\theta_{hm}$ is negative, the final azimuth of that channel is the calculated angle plus 360 degrees. Fig. 3 shows an example of the user's azimuth change in the 5.1 multi-channel speaker layout. Since the user's azimuth changes by 90 degrees, the angles of the existing 5.1 multi-channel speaker layout are rearranged as shown in Fig. 3, and the HRTF coefficients are substituted to reflect the rearranged layout. Binaural rendering then generates the stereo realistic sound with the controlled sound scene using the 5.1 multi-channel audio signals and the substituted HRTF coefficients as in (5). Fig. 4 shows the overall procedure of the sound scene control of the realistic sound based on the substitution of the HRTF coefficients.
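The angle rule stated above (subtract the head-movement angle and add 360 degrees when the result is negative) can be sketched as follows. The speaker azimuths below are assumed nominal 5.1 angles measured clockwise from the user's front, the ".1" channel is left out, and the names `SPEAKER_AZIMUTHS` and `rearrange_layout` are hypothetical; none of these values are taken from the paper's figures.

```python
# Assumed nominal azimuths in degrees, clockwise from the front, in [0, 360).
SPEAKER_AZIMUTHS = {"C": 0.0, "R": 30.0, "RS": 110.0, "LS": 250.0, "L": 330.0}


def rearrange_layout(azimuths, head_azimuth):
    """Return each speaker's azimuth relative to the user's new front.

    Assumes 0 <= head_azimuth < 360, so that one correction of 360
    degrees is enough to bring a negative angle back into [0, 360).
    """
    rearranged = {}
    for name, angle in azimuths.items():
        new_angle = angle - head_azimuth
        if new_angle < 0:
            new_angle += 360.0  # wrap negative angles, as described in the text
        rearranged[name] = new_angle
    return rearranged


# Example: the user turns the head by 90 degrees.
rotated = rearrange_layout(SPEAKER_AZIMUTHS, 90.0)
```

In the HRTF-substitution scheme, each rearranged angle would then select the stored HRTF pair for that direction, which is why coefficients for all azimuths must be kept in memory.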
Although the sound scene control scheme explained above can successfully generate the stereo realistic sound with the controlled sound scene, it needs a very large amount of stored HRTF coefficients, 23.6 Mbytes. Therefore, embedded systems with little memory cannot implement sound scene control based on the substitution of the HRTF coefficients. Accordingly, CPP based sound scene control of the realistic sound was introduced [11][12][13][14][15]. The CPP based scheme keeps the HRTF coefficients of the 5.1 multi-channel speaker layout fixed and maps the existing multi-channel audio signals onto a new 5.1 multi-channel speaker layout rearranged according to the user's head movement. Fig. 5 shows an example of mapping the multi-channel audio signals to the newly formed 5.1 multi-channel speaker layout according to the user's head movement. The 5.1 multi-channel speaker layout is newly created around the user's new front, and the existing 5.1 multi-channel audio signals are mapped onto the new speaker layout using the CPP technique. Binaural rendering is then performed as in (6), using the mapped 5.1 multi-channel signals and the HRTF coefficients of the 5.1 multi-channel speaker layout, to generate the stereo realistic sound with the sound scene controlled according to the user's head movement.
In (6), the mapped signal term denotes the newly generated signal of any channel X, obtained by mapping the 5.1 multi-channel audio signals onto the newly formed 5.1 multi-channel speaker layout. To explain the signal mapping using the CPP method [14,15], assume that there are two channel speakers (C1 and C2) and that a channel (C3) lies between them after the user's head movement, as shown in Fig. 6. The signal of channel C3 is then mapped onto channels C1 and C2 using (7) and (8).
Here, $\theta_{norm}$ is the normalized azimuth angle of C3 lying between C1 and C2, and aperture is the reference angle between C1 and C2. Fig. 7 shows the overall procedure of the sound scene control of the realistic sound based on the CPP method.
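To illustrate the mapping of C3 onto C1 and C2, here is a minimal sketch of one common form of constant power panning, in which the angle of C3 inside the aperture is normalized to the range 0 to 90 degrees and the two gains are the cosine and sine of that normalized angle. This common form is an assumption: equations (7) and (8) are not reproduced here and may differ in detail. The names `cpp_gains` and `map_to_pair` are hypothetical.

```python
import math


def cpp_gains(angle_from_c1, aperture):
    """Gains for C1 and C2 under a cosine/sine constant power law.

    angle_from_c1: angle of C3 measured from C1 toward C2, in degrees.
    aperture: angle between C1 and C2, in degrees.
    """
    if aperture <= 0 or not 0 <= angle_from_c1 <= aperture:
        raise ValueError("C3 must lie between C1 and C2")
    # Normalize the position inside the aperture to [0, pi/2] radians.
    theta_norm = (angle_from_c1 / aperture) * (math.pi / 2)
    return math.cos(theta_norm), math.sin(theta_norm)


def map_to_pair(signal, angle_from_c1, aperture):
    """Split one channel signal into its contributions to C1 and C2."""
    g1, g2 = cpp_gains(angle_from_c1, aperture)
    return [g1 * s for s in signal], [g2 * s for s in signal]


# Example: C3 sits halfway between two speakers that are 30 degrees apart.
to_c1, to_c2 = map_to_pair([1.0, -1.0], 15.0, 30.0)
```

With this law the squared gains always sum to one, so the total power of the mapped signal stays constant wherever C3 falls between the two speakers.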

III. PROPOSED STEREO REALISTIC SOUND GENERATION FOR FREE MOVEMENT IN VR
In a VR service, the user moves freely in the VR space, so the VR sound must be adjusted to match the VR scene. That is, the VR service allows both head movement and position change in the VR space, and the VR sound should reflect this free movement. However, since the previously explained sound scene control methods based on HRTF substitution and on CPP only modify the stereo realistic sound scene according to the user's azimuth change, the resulting realistic sound cannot convey a change in the user's distance. Therefore, there is still a mismatch between the VR scene and the VR sound when the user moves freely in the VR space, and the overall performance of the VR service can be severely degraded. To allow the user's free movement in the VR space and reduce this performance degradation, a realistic sound generation method based on a complex plane for tracking the user's movement is proposed. The user's position change affects both the distance and the azimuth between the user and each speaker, while the user's head movement affects only the azimuth. Therefore, the proposed method handles the user's position change and head movement separately. First, the distance and the azimuth between the user and each speaker of the layout after the position change are measured. Then, the final azimuth between the user and the speaker is determined from the two azimuths caused by the position change and the head movement. The signal level is modified using the calculated distance between the user and the speaker, while the sound scene of the realistic sound is controlled using the measured azimuth. The details of the realistic sound generation using the calculated distance and azimuth are given below.
To calculate the distance and the azimuth between the user and each speaker after the user's position change, it is assumed that the 5.1 multi-channel speaker playback system is located in the complex plane as shown in Fig. 8 and that the user moves freely on the complex plane. Under this assumption, the distance and the azimuth between the user and a speaker can be estimated because the user and the speaker are treated as two points in the complex plane. Meanwhile, because the azimuth measurement can vary with the user's location on the complex plane, the distance and azimuth between the user and the speaker are measured with the user's location divided into four areas around the speaker, as shown in Fig. 9 and 10. The four areas around the speaker in the complex plane are summarized in Table I. After setting the four areas for all speakers in the 5.1 multi-channel speaker layout as shown in Fig. 9, the distance and the azimuth between the user and the speaker are calculated in each area as shown in Fig. 10. In Fig. 10, $a + jb$ and $c + jd$ are the positions in the complex plane of any speaker X in the 5.1 multi-channel speaker layout and of the user, respectively. Table II summarizes the calculation of the distance and azimuth between the user and the speaker for the four areas around the speaker according to the user's position change. To generate the realistic sound reflecting the user's position change and head movement, the 5.1 multi-channel audio signals are first scaled using the five measured distances between the moved user and the 5.1 multi-channel speakers.
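The distance and azimuth calculation between two points of the complex plane, described above, can be sketched with Python's built-in complex numbers. This sketch does not reproduce the four-area case analysis of Tables I and II; it assumes instead that a single two-argument arctangent (via `cmath.phase`) is an acceptable stand-in, and it measures the angle counterclockwise from the positive real axis, which may differ from the reference direction used in the paper. The function name `distance_and_azimuth` is hypothetical.

```python
import cmath
import math


def distance_and_azimuth(speaker, user):
    """Distance and angle (degrees, in [0, 360)) from the user to a speaker.

    speaker: complex position a + jb of the speaker.
    user:    complex position c + jd of the user.
    """
    offset = speaker - user            # vector from the user to the speaker
    distance = abs(offset)             # sqrt((a - c)^2 + (b - d)^2)
    azimuth = math.degrees(cmath.phase(offset)) % 360.0
    return distance, azimuth


# Example: a speaker at 0 + 1j seen from a user who moved to 0.5 + 0.5j.
moved_distance, moved_azimuth = distance_and_azimuth(0 + 1j, 0.5 + 0.5j)
```

Because `cmath.phase` already resolves the quadrant from the signs of the real and imaginary parts, the four areas around a speaker collapse into one expression in this sketch.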
Then, the scaled 5.1 multi-channel audio signals are mapped onto the new multi-channel speaker layout using not only the five estimated azimuths between the moved user and the 5.1 multi-channel speakers but also the azimuth of the user's head movement. Based on the inverse square law, which states that sound intensity is inversely proportional to the square of the distance from the source [16,17], the scaled 5.1 multi-channel audio signals are calculated using the five estimated distances. Moreover, because all the distances between the user and the speakers of the 5.1 multi-channel layout are equal to one in the original layout, the scaled 5.1 multi-channel audio signals are calculated as in (9). In (11), $S_X^{s,m}(k)$ is the generated signal of any channel X of the virtual 5.1 multi-channel speaker layout formed by the user's position and head movement, obtained through the signal scaling as in (9) and the signal mapping using the final azimuth as in (10). Fig. 11 shows the overall procedure of the proposed realistic sound generation for the user's free movement in the VR.
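As a hedged illustration of the distance scaling step, the sketch below applies a gain to each channel relative to a reference distance of one. It assumes that intensity is proportional to the squared signal amplitude, so an intensity falling as the inverse square of the distance corresponds to an amplitude gain of 1/d. Whether equation (9) applies its factor to amplitude or to power cannot be confirmed from the text, so this choice, along with the names `distance_gain` and `scale_channels`, is an assumption.

```python
def distance_gain(distance, reference=1.0):
    """Amplitude gain for a source at `distance`, relative to `reference`.

    Intensity ~ 1 / distance^2 and intensity ~ amplitude^2 (assumed),
    hence amplitude ~ 1 / distance.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return reference / distance


def scale_channels(channels, distances):
    """Scale each named channel signal by the gain for its own distance."""
    return {
        name: [distance_gain(distances[name]) * s for s in samples]
        for name, samples in channels.items()
    }


# Example: the user moved away from L (distance 2) and toward R (distance 0.5).
scaled = scale_channels(
    {"L": [1.0, -0.5], "R": [1.0, -0.5]},
    {"L": 2.0, "R": 0.5},
)
```

A channel whose speaker is still at the reference distance keeps a gain of one, so the scaled signals reduce to the original signals when the user has not moved.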

IV. RESULTS AND DISCUSSION
To validate the performance of the proposed realistic sound generation method for the user's free movement in the VR, a subjective listening test was performed. The three audio contents used for the test are listed in Table III. To keep the test simple and clear, realistic audio sound using only the left and the right channel signals was generated separately for each user position change, as shown in Fig. 12.
Here, it was assumed that the user's head did not move. Five listeners participated in the test, and they evaluated the azimuth and distance of the realistic audio sound generated at the changed user position against those of the realistic audio sound at the original position. Meanwhile, the theoretical azimuths and distances between the user and the speakers were calculated for each position, as in Table IV. The test results also show that the performance of the proposed method may be rather poor. This is because the proposed method uses only the HRTF coefficients of the 5.1 multi-channel speaker layout; that is, it does not have sufficient HRTF resolution to generate the realistic audio sound according to the user's free movement in the VR. Therefore, the proposed method needs to be improved to generate realistic sound using the HRTF coefficients of a playback environment with 10.1 or more channels.

V. CONCLUSION
Realistic audio is essential for enjoying realistic services such as VR, but it normally requires a multi-channel audio playback environment. Although binaural rendering removes this limitation by providing realistic sound in a stereo headphone playback environment, binaural rendering alone cannot reflect the user's free movement in the VR. As a result, the visual scene and the audio sound in the VR do not match, and the performance of the VR is degraded. In this paper, a complex plane based stereo realistic sound generation method was proposed to allow the user's free movement, including position change and head azimuth change, in the VR. In the proposed method, the variations of the azimuth and distance between the user and each speaker caused by the user's movement are reflected in the stereo realistic sound generated by binaural rendering. The subjective listening test results showed that the proposed method can generate realistic audio sound that successfully reflects the user's free movement, with an error rate of less than 5 % in the azimuth and distance evaluation. Despite this good performance, improving the proposed method by increasing the resolution of the HRTF coefficients remains future work, because the proposed method uses only the HRTF coefficients of the 5.1 multi-channel speaker layout, which causes the errors in the azimuth and distance evaluation.