An Approach for External Preference Mapping Improvement by Denoising Consumer Rating Data

This study advocates denoising consumer rating data in the field of sensory analysis before proceeding to External Preference Mapping (EPM). EPM is a data visualization technique used to understand consumers’ sensory profiles by relating their preferences for products to external information about the sensory characteristics of those products. The output is a perceptual map that visualizes the optimal products and the aspects that maximize consumers’ liking. Hence, EPM is considered a decision tool to support the development or improvement of products in response to market requirements. However, the stability of the map is affected by the high variability of judgments, which makes consumer rating data very noisy. This may lead to a mismatch between product features and consumers’ preferences, and in turn to distorted results and wrong decisions. To remove this noise, the use of filtering methods is proposed. Regularized Principal Component Analysis (RPCA) and Stein’s Unbiased Risk Estimate (SURE), based respectively on hard and soft thresholding rules, were applied to consumer rating data to separate the signal from the noise and retain only the useful information about the given liking scores. To compare the EPM obtained from each strategy, a sampling process was conducted to randomly select samples from the noisy and the cleaned data, and then to build the corresponding EPM. The stability of the obtained maps was evaluated using an indicator that computes and compares distances between them, both before and after denoising. The effectiveness of this methodology was evaluated by a simulation study, and a potential application was shown on a real dataset. Results show that the distances recorded after denoising are lower than those before in almost all cases for both RPCA and SURE, with RPCA outperforming SURE. The corresponding map is more stable: level lines are smoother and products are better located in the liking zones. Hence, noise removal reduces variability in the data and brings preferences closer together, which improves the quality of the visualized map.

Keywords: Data denoising; Regularized Principal Component Analysis; Stein’s Unbiased Risk Estimate; sensory analysis; external preference mapping stability


I. INTRODUCTION
In marketing research, listening to the voice of consumers has become a fundamental strategy for making good decisions about the development or improvement of products. Sensory analysis techniques are often used as a set of multivariate statistical methods to quantify and explain consumers' sensory perceptions of products (i.e., taste, sight, hearing, smell and touch).
The method consists in conducting a survey on a sample of consumers, asking them to evaluate products by rating their liking. These data are known as hedonic data or consumer rating data. The consumers are asked to give a liking score, on a defined scale, as an overall assessment of the product. The 9-point hedonic scale defined by David Peryam and colleagues [1] is often used: consumers rate products with a score ranging from 1 to 9, where 1 indicates that the consumer extremely dislikes the product and 9 indicates that they extremely like it. This hedonic scale has been used for rating various products such as household products, personal care products, cosmetics, etc. However, it has mainly been adopted by the food industry to rate food products according to consumers' tastes, which is the case study of this investigation. Many industrial companies have chosen to seek the opinion of consumers through a score out of 10 or 11. In their study [2], researchers show that longer scales are also good discriminators and may be even more effective than shorter ones. In addition, a second type of data, known as sensory data, is collected. Generally, a panel of trained assessors is asked to rate precisely a set of sensory attributes of the same set of products during different experimental sessions. These data are qualified as instrumental since they give objective descriptions that can be considered as measurements of product properties. Sensory data are represented as a matrix crossing panelists, products, sessions and the measured sensory attributes. Generally, the table of averages by product is used. In the case of food products, descriptive data can also be collected from a set of physico-chemical measurements through successive analyses in chemometrics laboratories.
A statistical analysis is then performed to connect consumer data to sensory data in order to understand consumers' tendencies and identify the sensory attributes that drive their liking. External Preference Mapping (EPM) [3] is one such method, which assesses this relationship visually. The output is a perceptual map that shows the optimal products maximizing consumers' liking and their acceptability with respect to related aspects. Hence, EPM is considered a decision tool to support the development of a new product or the improvement of existing products in order to respond to market requirements and avoid product failure. Applications cover a wide range of fields, such as the automobile sector to evaluate preferences for car headlights [4], the mobile sector to characterize mobile phones and watches [5], the cosmetic sector to rate anti-aging creams [6], etc. It is mainly used in food science to evaluate consumers' liking of food products such as beer [7], olive oil [8] and cookies [9], the latter being the case of this study.
The obtained map is assumed to be unstable. It suffers from a considerable lack of stability generated by the noise present in consumer rating data. These data are expected to be noisy due to the high variability of human judgments, which are influenced by psychological or physiological factors [10]. Psychological factors lead consumers to score products incorrectly, which induces errors in the data. For example, the labeling of samples (1, 2, 3 or A, B, C...) can push consumers to rate products accordingly. Codes should therefore be random combinations of letters or numbers. Also, presenting a low quality product just before a higher quality one creates a risk of over-rating the second sample. This is known as the contrast effect. Hence, a randomized and balanced order of sample presentation may minimize this type of error. Additionally, the error of central tendency is very common, since consumers tend to score samples using the central part of the scale and avoid the extreme ends for fear of making mistakes. Many other psychological errors can occur, such as the stimulus error, when subjects rate samples according to other perceptions, and the expectation error, caused by previous knowledge or indications that identify products. Physiological factors also have a large influence on consumers' ratings. These errors are mainly caused by fatigue, habituation, simultaneous interaction of stimuli and dulling of the senses as a result of continued exposure [10]. To reduce these errors, several measures must be considered when collecting data. Randomization and calibration of the order of samples, separation of intense attributes and the use of subjects who are familiar with the tested samples are advised.

But even if the necessary measures are taken to avoid errors, consumers are not instruments and they remain prone to bias. This noise is unavoidable for this type of data and presents a serious problem when proceeding to data visualization. It affects the stability of the obtained map, which may lead to a mismatch between product features and consumers' preferences, and in turn to distorted results and wrong decisions. This may induce product failure in the market. The aim is to obtain a better visualization of the mapping between consumers' preferences and the sensory attributes of the perceived products, in order to correctly select a set of product prototypes that maximize consumer liking and thus increase consumer appeal for the designed products.
The idea is to denoise consumer rating data. Filtering methods are proposed and tested to extract only the useful information and remove distorting noise. The use of the Regularized Principal Component Analysis (RPCA) [11] and Stein's Unbiased Risk Estimate (SURE) [12] denoising techniques is proposed. They were chosen for their efficiency in denoising a data matrix whose structure corresponds to a low-rank matrix, considered as the signal, corrupted by noise. Both RPCA and SURE help to recover the low-rank signal using shrinkage terms. RPCA is based on the association of a non-linear transformation of the singular values and a hard thresholding rule. SURE, in contrast, applies a soft thresholding rule to the singular values of the noisy observations, shrinking them all by the same amount. RPCA was used for noise removal from transcriptomic data and the improvement of the corresponding graphical representations [11]. SURE was used for denoising clinical cardiac magnetic resonance image series [12]. In this study, their use is advocated in the field of sensory analysis for noise removal from consumer rating data and the improvement of preference maps.
An indicator of map stability was then defined to allow maps to be compared. A sampling approach was used to randomly choose samples from the consumer rating data. Then, an average distance between predicted scores is computed between the corresponding maps. Here, stability is understood as sensitivity to the sampling of consumer data. The map constructed from denoised consumer data is compared to the original one both visually and using the stability indicator. The different techniques as well as the proposed approach for comparing EPM are detailed in Section II. Results obtained on simulated examples and real data are shown in Section III. All results were obtained using the R statistical software (https://www.r-project.org/). EPM was performed using the SensMap R package [13] developed by our research team.

A. RPCA and SURE for Denoising Consumer Rating Data
Let $Y$ denote the $P \times C$ hedonic matrix, where $P$ is the number of products and $C$ is the number of consumers, and let $X$ denote the $P \times A$ sensory matrix, where $A$ is the number of sensory attributes. Under the fixed effect model of Principal Component Analysis (PCA) [14], $Y$ is generated as a fixed structure of low rank, which corresponds to the signal, corrupted by noise. The matrix can then be written as in (1):

$$Y = \tilde{Y} + \varepsilon \tag{1}$$

where $\varepsilon = (\varepsilon_{pc})$ is a $P \times C$ matrix such that $\varepsilon_{pc} \sim \mathcal{N}(0, \sigma^2)$, $\forall p \in \{1, \dots, P\}$ and $\forall c \in \{1, \dots, C\}$. The coefficients of $Y$ and $\tilde{Y}$ can be written as in (2):

$$y_{pc} = \tilde{y}_{pc} + \varepsilon_{pc} \tag{2}$$

In this study, only the signal $\tilde{Y}$ is considered for further analysis. It is obtained by minimizing $\|Y - \tilde{Y}\|^2$. Let the SVD of $Y$ be:

$$Y = U D^{1/2} V^{T}$$

where $D = \mathrm{diag}(\lambda_1, \lambda_2, \dots)$, so that the singular values of $Y$ are $\sqrt{\lambda_r}$. The principal components are given by $F = U D^{1/2} \in \mathbb{R}^{P \times C}$, and the matrix of the first $k$ principal components can be written as follows:

$$F_{1:k} = U_{1:k} D_{1:k}^{1/2}$$

where $U_{1:k}$ is the matrix with only the first $k$ columns, hence $F_{1:k} \in \mathbb{R}^{P \times k}$. Hence, if $k \le C$ is fixed, the hedonic data matrix $Y$ can be approximated by:

$$\hat{Y} = F_{1:k} V_{1:k}^{T}$$

Therefore the approximation of $Y$ from PCA, which corresponds to the underlying signal recovered from the principal components, can be written as:

$$\hat{y}_{pc}^{PCA} = \sum_{r=1}^{k} \sqrt{\lambda_r}\, u_{pr} v_{cr}$$

PCA is based on a hard thresholding rule that selects only a certain number of dimensions and linearly shrinks the singular values. Interest is given to the major factors, which represent the signal to be interpreted, while eliminating particular trends that are not the object of interest and may disrupt the analysis.
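The rank-k PCA approximation described above can be illustrated with a truncated SVD. The following is a minimal sketch in Python with NumPy, not the implementation used in this study (which relies on R). The function name `pca_low_rank`, the toy dimensions and the noise level are assumptions chosen for illustration, and no centering or scaling is applied.

```python
import numpy as np

def pca_low_rank(Y, k):
    """Rank-k approximation of Y obtained by truncating its SVD."""
    # The singular values s play the role of sqrt(lambda_r) in the text.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep the first k dimensions only; the remaining ones are set to zero.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy data (hypothetical): a rank-2 signal corrupted by Gaussian noise.
rng = np.random.default_rng(0)
P, C, k = 8, 40, 2
signal = rng.normal(size=(P, k)) @ rng.normal(size=(k, C))
Y = signal + 0.5 * rng.normal(size=(P, C))

Y_pca = pca_low_rank(Y, k)
```

By construction, `Y_pca` is the closest matrix of rank at most k to `Y` in the least-squares sense, which is the minimization of $\|Y - \tilde{Y}\|^2$ mentioned above.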

1) Regularized Principal Component Analysis (RPCA):
In [11], the authors argue that, under the fixed effect model, PCA does not provide the best recovery of the underlying signal and that the visualization produced by PCA may display very noisy patterns. A regularized version of PCA, denoted here by RPCA, was proposed in order to obtain an estimate as close as possible to Ỹ. The method selects a certain number of dimensions and shrinks the corresponding singular values, with a different amount of shrinkage for each singular value. A non-linear transformation of the singular values is applied in association with a hard thresholding rule.
The estimation by RPCA corresponds to finding a matrix of low rank $k$ by regularizing the maximum likelihood estimator such that:

$$\hat{y}_{pc}^{RPCA} = \sum_{r=1}^{k} \Phi_r \sqrt{\lambda_r}\, u_{pr} v_{cr}$$

where $\Phi_r$ is the shrinkage term obtained by minimizing the Mean Squared Error. It can be interpreted as the ratio of the signal variance to the total variance of the associated dimension [11]. It is given by:

$$\Phi_r = \frac{\lambda_r - \hat{\sigma}^2}{\lambda_r}, \qquad \hat{\sigma}^2 = \frac{RSS}{ddl}$$

such that RSS is the Residual Sum of Squares and ddl is the number of observations minus the number of independent parameters.
In fact, each singular value is multiplied by the shrinkage term Φ r or, in other words, thresholded with this constant. The first dimensions are thus treated as more stable and trustworthy than the last ones. To perform RPCA, a tuning parameter corresponding to the number of underlying dimensions is needed. It is selected in this study using cross-validation [16].
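The shrinkage idea behind RPCA, multiplying each retained singular value by a ratio of signal variance to total variance and discarding the remaining dimensions, can be sketched as follows. This is a simplified illustration in Python with NumPy and not the estimator of [11]: the function name `rpca_sketch` is hypothetical, the squared singular values are used as the variance of each dimension, and the noise variance is replaced by a crude stand-in (the mean of the discarded squared singular values) rather than the RSS/ddl estimate.

```python
import numpy as np

def rpca_sketch(Y, k, noise_var=None):
    """Shrink the first k singular values by phi_r and drop the others."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    lam = s ** 2  # variance carried by each dimension (assumed scale)
    if noise_var is None:
        if k >= len(s):
            raise ValueError("k must be smaller than min(P, C) to estimate the noise")
        # Assumption: crude noise estimate, NOT the RSS/ddl estimator of [11].
        noise_var = lam[k:].mean()
    # phi_r = signal variance / total variance, kept within [0, 1].
    phi = np.clip((lam[:k] - noise_var) / lam[:k], 0.0, 1.0)
    Y_hat = (U[:, :k] * (phi * s[:k])) @ Vt[:k, :]
    return Y_hat, phi

# Toy data (hypothetical): a rank-2 signal corrupted by Gaussian noise.
rng = np.random.default_rng(0)
P, C, k = 8, 40, 2
signal = rng.normal(size=(P, k)) @ rng.normal(size=(k, C))
Y = signal + 0.5 * rng.normal(size=(P, C))

Y_rpca, phi = rpca_sketch(Y, k)
```

Because the singular values are sorted in decreasing order, the shrinkage factors in `phi` are non-increasing: the leading dimensions are shrunk the least, in line with the remark that they are the most trustworthy.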
2) Stein's Unbiased Risk Estimate (SURE): The SURE method has been commonly used in image denoising [15]. It applies a soft thresholding rule to the singular values of the noisy observations, shrinking them all by the same amount. In [12], the authors suggested determining the threshold level $\lambda_o$ by minimizing Stein's unbiased risk estimate. The method recovers an approximately low-rank data matrix such that:

$$\hat{y}_{pc}^{SURE} = \sum_{r=1}^{\min(P, C)} \left(\sqrt{\lambda_r} - \lambda_o\right)_{+} u_{pr} v_{cr}$$

where $(\sqrt{\lambda_r} - \lambda_o)_{+} = \max(\sqrt{\lambda_r} - \lambda_o, 0)$ for a real $(\sqrt{\lambda_r} - \lambda_o)$ term. SURE requires a tuning parameter that corresponds to an estimate of the noise variance $\sigma^2$ in order to determine $\lambda_o$, unlike RPCA, which requires the number of underlying dimensions $k$ of the signal.

B. External Preference Mapping
EPM is performed to explain consumers' preferences in the Y matrix as a function of the sensory characteristics of products in the X data, in order to understand how product attributes drive consumers' liking. The method first performs a PCA [17] on the sensory data X in order to reduce the dimension of the product space on the basis of the sensory attributes. The first two PCA components, denoted in this paper by $F_1$ and $F_2$, are extracted. They contain the maximum amount of information from the sensory descriptive data.
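As an illustration of this first step, the product coordinates on the first two principal components can be obtained from the SVD of the column-centered sensory table. This is a minimal sketch in Python with NumPy. The function name `sensory_scores` and the random toy table are assumptions, and whether attributes should also be scaled to unit variance is left open, since the text does not specify it.

```python
import numpy as np

def sensory_scores(X, n_components=2):
    """Product coordinates on the first principal components of X (P x A)."""
    Xc = X - X.mean(axis=0)  # center each sensory attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    # Share of total variance carried by each retained component.
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, explained

# Toy sensory table (hypothetical values): 8 products x 23 descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 23))
F, explained = sensory_scores(X)
```

The two columns of `F` correspond to $F_1$ and $F_2$; they are centered and mutually orthogonal, which is what makes them convenient regressors in the next step.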
The second step consists in regressing and predicting consumers' scores based on the product coordinates in the sensory space spanned by $F_1$ and $F_2$. The liking score of each consumer is expressed here using the complete quadratic regression model [3], where linear, quadratic and two-way interaction terms between dimensions are considered. This implies that consumer liking increases with intensity until reaching a maximum of preference, and then decreases as intensity increases further. The model of each consumer is given by, $\forall c = 1, \dots, C$, $\forall p = 1, \dots, P$:

$$y_{pc} = a + b\, F_1(p) + c\, F_2(p) + d\, F_1(p)^2 + e\, F_2(p)^2 + f\, F_1(p) F_2(p) + \varepsilon_{pc} \tag{6}$$

where $y_c = (y_{1c}, \dots, y_{Pc})^T \in \mathbb{R}^{P \times 1}$ is the response vector corresponding to the preferences of the consumer, $(a, b, c, d, e, f)$ are the parameters to be estimated for that consumer (the coefficient $c$ is not to be confused with the consumer index) and $\varepsilon_c = (\varepsilon_{1c}, \dots, \varepsilon_{Pc})^T \in \mathbb{R}^{P \times 1} \sim \mathcal{N}_P(0, \sigma_c^2 I_P)$ is the vector of random Gaussian errors. Each consumer model builds a surface of predictions spanned by $F_1$ and $F_2$. The plane is discretized and a set of points of the space is considered:

$$G = \{F(g) = (F_1(g), F_2(g)),\ g = 1, \dots, N\}$$

where $N$ is the number of points in the grid and $(F_1(g), F_2(g))$ are the coordinates of each point.
Hence, for each point $F(g) \in G$ of the grid, a prediction of the consumer score is computed using the estimated model defined in (6); it is denoted by $\hat{y}_c(g)$.
The principle of EPM is to compare the predicted score obtained at each point with the mean of the scores given by the consumer, which corresponds to the average of the corresponding column of $Y$ and is denoted here by $\bar{y}_c$.
If $\hat{y}_c(g) \ge \bar{y}_c$, the point is retained for the next step; otherwise it is not taken into account. The prediction surfaces of all consumers in the sample are superimposed to construct the multidimensional prediction map [3]. At each point of this space, the number of consumers whose predicted scores are higher than their average scores in the data is counted. The percentage obtained at each point defines the preference level lines. Hence, the way predicted scores are obtained is very important, since they are the basis for comparing and delimiting the preference level lines on the map. In this investigation, using noisy data is expected to lead to inaccurate predicted scores and therefore to an unstable map.
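The regression, grid prediction and counting steps described above can be sketched as follows in Python with NumPy. This is an illustrative sketch under stated assumptions, not the SensMap implementation: the function names `design` and `preference_map`, the grid size and the random toy data are hypothetical, and the grid is simply taken over the range of the product coordinates.

```python
import numpy as np

def design(F1, F2):
    """Columns of the complete quadratic model: 1, F1, F2, F1^2, F2^2, F1*F2."""
    return np.column_stack([np.ones_like(F1), F1, F2, F1**2, F2**2, F1 * F2])

def preference_map(F, Y, grid_size=25):
    """F: (P, 2) product coordinates; Y: (P, C) liking scores.

    Returns the grid points (N, 2), the predicted scores (N, C) and, for each
    grid point, the percentage of consumers whose predicted score is at least
    their own mean score.
    """
    # One least-squares fit per consumer (one column of Y per consumer).
    coef, *_ = np.linalg.lstsq(design(F[:, 0], F[:, 1]), Y, rcond=None)
    g1 = np.linspace(F[:, 0].min(), F[:, 0].max(), grid_size)
    g2 = np.linspace(F[:, 1].min(), F[:, 1].max(), grid_size)
    G1, G2 = np.meshgrid(g1, g2)
    grid = np.column_stack([G1.ravel(), G2.ravel()])
    pred = design(grid[:, 0], grid[:, 1]) @ coef
    above = pred >= Y.mean(axis=0)  # compare with each consumer's mean score
    return grid, pred, 100.0 * above.mean(axis=1)

# Toy data (hypothetical): 8 products, 30 consumers, scores from 1 to 9.
rng = np.random.default_rng(2)
F = rng.normal(size=(8, 2))
Y = rng.integers(1, 10, size=(8, 30)).astype(float)
grid, pred, percent = preference_map(F, Y)
```

Contour lines of `percent` over `grid` would then play the role of the preference level lines. With only 8 products and 6 coefficients per consumer, each fit has few residual degrees of freedom, which is one reason noisy scores can move the predicted surfaces substantially.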

C. Comparison of Maps Stability
The question is how to compare the obtained maps and, more precisely, how to evaluate their stability and quality. A sampling process is proposed to randomly select samples of equal size from the Y data, perform the EPM for each sample and compute distances between the resulting sub-maps. Since the spaces are sets of predictions, predicted scores are recorded at each point of the two grids, and an average squared distance between them is computed after a defined number of sampling repetitions. It is denoted in this paper by ASDP. The process was carried out using either RPCA or SURE, and the ASDP values were recorded in the same way. Results on simulated examples and real data are shown in the next part. Using ASDP as an indicator of map stability makes it easier to compare maps before and after denoising, and thus to evaluate the efficiency of the proposed approach. Its use can be generalized to compare maps obtained from different strategies.
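One possible reading of the ASDP indicator is sketched below in Python with NumPy. The exact definition is not fully specified in the text, so several choices here are assumptions: each sub-map is summarized by the predicted score averaged over the consumers of the sample at each grid point, the two samples are disjoint halves of the consumer panel, and the names `mean_surface` and `asdp` are hypothetical.

```python
import numpy as np

def mean_surface(F, Y, grid):
    """Predicted liking, averaged over consumers, at each grid point."""
    def design(a, b):
        return np.column_stack([np.ones_like(a), a, b, a**2, b**2, a * b])
    coef, *_ = np.linalg.lstsq(design(F[:, 0], F[:, 1]), Y, rcond=None)
    return (design(grid[:, 0], grid[:, 1]) @ coef).mean(axis=1)

def asdp(F, Y, grid, n_draws=100, seed=0):
    """Average squared distance of predictions between two random sub-maps."""
    rng = np.random.default_rng(seed)
    C = Y.shape[1]
    half = C // 2
    distances = []
    for _ in range(n_draws):
        perm = rng.permutation(C)
        a, b = perm[:half], perm[half:2 * half]  # two disjoint samples
        diff = mean_surface(F, Y[:, a], grid) - mean_surface(F, Y[:, b], grid)
        distances.append(np.mean(diff ** 2))
    return float(np.mean(distances))

# Toy data (hypothetical): identical consumers versus noisy consumers.
rng = np.random.default_rng(3)
F = rng.normal(size=(8, 2))
g = np.linspace(-1.0, 1.0, 15)
G1, G2 = np.meshgrid(g, g)
grid = np.column_stack([G1.ravel(), G2.ravel()])
Y_same = np.tile(rng.uniform(1, 9, size=(8, 1)), (1, 40))
Y_noisy = Y_same + rng.normal(0.0, 1.0, size=Y_same.shape)

asdp_same = asdp(F, Y_same, grid, n_draws=20)
asdp_noisy = asdp(F, Y_noisy, grid, n_draws=20)
```

When all consumers give identical scores, the two sub-maps coincide and the indicator is essentially zero; adding noise makes the sub-maps diverge and the indicator grow, which is the behavior the comparison before and after denoising relies on.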

A. Simulation Study
The generation of consumer rating data was inspired by the data structure given in model (1). Simulations are performed following these steps:
1) Build the fixed structure that corresponds to the signal, for which the parameters were obtained from a consumer model fitted on real data.
2) Add white Gaussian noise ε ∼ N(0, σ²) to build variability between consumers. The Y matrix is then generated and the structure in model (1) is restored. The built matrix can now be seen as a structure of true signal corrupted by error.
3) Visualize the EPM from the obtained consumer dataset.
4) Remove noise from the simulated consumer data using RPCA or SURE, visualize the EPM again and compare.
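Steps 1 and 2 above can be sketched as follows in Python with NumPy. This is an illustration under assumptions rather than the simulation code of the study: the function name `simulate_ratings` is hypothetical, the coefficient values are invented for the example (the study takes them from a consumer model fitted on real data), and the same signal is replicated for every consumer before noise is added.

```python
import numpy as np

def simulate_ratings(F, coef, n_consumers, sigma, seed=0):
    """Quadratic-model signal replicated per consumer, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    F1, F2 = F[:, 0], F[:, 1]
    D = np.column_stack([np.ones_like(F1), F1, F2, F1**2, F2**2, F1 * F2])
    # Step 1: fixed structure (signal), identical for all consumers.
    signal = np.tile((D @ coef)[:, None], (1, n_consumers))
    # Step 2: white Gaussian noise creates variability between consumers.
    noise = rng.normal(0.0, sigma, size=signal.shape)
    return signal + noise, signal

# Hypothetical product coordinates and model coefficients (a, b, c, d, e, f).
rng = np.random.default_rng(4)
F = rng.normal(size=(8, 2))
coef = np.array([6.0, 0.8, -0.5, -0.3, -0.2, 0.1])

Y_sim, signal = simulate_ratings(F, coef, n_consumers=500, sigma=1.5)
Y_clean, _ = simulate_ratings(F, coef, n_consumers=500, sigma=0.0)
```

Varying `n_consumers` and `sigma` reproduces the kind of configurations compared in the simulation study, and keeping `signal` alongside the noisy matrix makes it possible to check how close a denoised matrix is to the true structure.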
The following settings were considered:
• The number of consumers C in (40, 200, 500), since values found in the literature vary across studies from 40 to 480 subjects.
• The number of underlying dimensions k of RPCA, varying in (2, 4), using the GCV parameter.
• The noise variance σ², estimated when SURE is used.
• Several configurations and real parameters, tested according to model (6).

The ASDP between maps were calculated from $Y$, $Y_{RPCA}$ and $Y_{SURE}$; results are gathered in Table I and represented in Fig. 1. Results show that ASDP($Y_{RPCA}$) and ASDP($Y_{SURE}$) are lower than ASDP($Y$) in almost all situations when varying the number of consumers C, the noise variance σ² and the number of RPCA dimensions k. Firstly, the distances between sub-maps obtained for a low number of consumers (C = 40) are higher than those computed for a large C (200, 500). This is observed for the different noise variances and even without denoising, as shown by the bars in the first, fourth and seventh columns of Fig. 1, corresponding to rows 1 to 6 in Table I. Consequently, denoising is advised when the sample of consumers is very small, given the high variability of judgments. For a high number of consumers, in contrast, ASDP($Y$), ASDP($Y_{RPCA}$) and ASDP($Y_{SURE}$) take lower values, as shown in the remaining bars of Fig. 1 and in rows 13 to 18 of Table I. The distances between sub-maps become closer for all values of σ and k and tend to zero, especially for low noise variance. The larger the number of consumers C, the lower the error in the data and hence the smaller the differences between maps.
On the other hand, the effect of denoising is especially visible for high noise variance, as shown in the last three columns of Fig. 1, where the bars are longer. The distances recorded for low σ² are smaller than those computed from data simulated with larger σ² values, as expected. The distances increase with the noise variance regardless of the size of the consumer sample C or the number of dimensions k. Hence, denoising is more efficient when the noise variance in the data is higher. The number of underlying dimensions k of RPCA was chosen empirically. Using a larger or smaller k did not have a great influence on the behavior of the ASDP. Its choice depends on the case study. In all situations, SURE is clearly outperformed by RPCA. The latter brings the distances closer together and reduces large variations. It gives particularly good results when the data are very noisy.
Moreover, the Mean Square Error was computed between the fitted matrix obtained from each method and the corresponding raw data in order to assess the fitting performance of the denoising methods. Results are shown in Fig. 2.

The consumers were asked to rate products by giving an overall liking score on a scale ranging from 0 (I do not like) to 10 (I like very much). A second dataset corresponds to the descriptive sensory data X = 8p × 23d, where the same set of products was evaluated by trained panelists according to 23 descriptors measuring their sensory perceptions, namely smell, vision, touch and taste. The consumer dataset exhibits significant differences in consumers' preferences and a significant heterogeneity of the sample, since the consumers are of different nationalities.
2) Denoising consumer data: In this part, results are compared visually and using the ASDP stability indicator. Fig. 3 shows the representations of consumer data obtained from $Y$, $Y_{RPCA}$ and $Y_{SURE}$. The first two dimensions represent the between-class variability, whereas the other dimensions represent the within-class variability, which is of less interest in this case study. The first dimension of all representations orders products from the lowest rated to the highest rated. To better highlight the effect of denoising, individuals are grouped into clusters according to the categorical variable origin. The origin of the cookies has two categories: P corresponds to Pakistan and F to France. Confidence ellipses are then drawn in order to examine the stability of the positioning of products. Their use was introduced in sensory analysis to compare the sensory profiles of trained panelists [18].
The representation obtained from the raw data $Y$ shows that the ellipses associated with French and Pakistani products overlap slightly. This suggests a link between the Pakistani and French groups of products, and hence some similarity between the consumers' sensory profiles. However, in the figures corresponding to $Y_{RPCA}$ and $Y_{SURE}$, there is no overlap between the ellipses. This means that consumers clearly distinguished products by origin. In addition, the Sprits, Sooper and Petit brun products are brought closer to the origin in both the RPCA and SURE representations. The filtering methods brought product ratings closer together by driving them towards the center. In other words, the outputs are approximately those that would be obtained from Ỹ. Both methods are clearly efficient; however, the RPCA representation outperforms SURE by bringing products of the same origin closer to each other. The corresponding confidence ellipses are considerably smaller than those obtained from the SURE representation. Indeed, the difference in the areas of the ellipses reflects the variability of consumers' ratings of products; hence, RPCA is very promising for reducing the variability of consumers' judgments.
3) Impact of denoising on External Preference Mapping: As a further step, maps from noisy and clean data were visualized to explore the impact of denoising (Fig. 5). Differences in liking are also shown in Table II. Results show that the perceptual maps obtained after denoising (second and third plots) represent a cleaner space where optimal products are clearly visualized and preference zones are well distinguished. First, let us see how to read a given map. The axes correspond respectively to the first two principal components $F_1$ and $F_2$. Two main regions are distinguished. The green zone indicates the least ideal location for a given product, where only a few products are appreciated while the remaining ones are disliked. Conversely, orange zones are considered preference zones containing the ideal products that maximize consumers' liking. The preference percentages are given by the contour level lines on the sensory space.
The most relevant result is given by RPCA. The effect of denoising is very clear on the related map, Fig. 5 (second plot). It represents a very stable sensory space where preference level lines are more regular and straight. They are moved away from each other, forming clear preference zones and making it easier to read and interpret the location of products. The user can quickly and easily extract patterns from the consumer data. Moreover, the map is also enhanced when the data are cleaned using the SURE method. The space shown in Fig. 5 (third plot) also appears stable, with less erratic and more regular level lines. Products are also more spread out compared to the original map, which makes it easier to read. The three representations show that Palet Breton (made by Petit Déli in France) and Gala (made by Group Danone in Pakistan) are located in the ideal zones (i.e., light orange) and are considered optimal products, which indicates that both Group Danone in Pakistan and Petit Déli in France are leaders according to this sensory evaluation. These products were appreciated by 75% of consumers on the map denoised with RPCA, against 95% on the initial map and 95% on the map filtered with SURE (Table II). The map obtained from RPCA provides manufacturers with a more accurate liking scale for ideal products compared to the others, allowing them to make adequate decisions about these products. Combining the liking information with that obtained in Fig. 4 (attributes are in French wording) shows that these products were appreciated for their crumbliness (Tgranusab), thickness (Vépaisseur), melting texture (Tfondant), friability (Tfriabilité) and floury texture (Tfarineux). Conversely, Sooper was located in the least liked zone according to the noisy map and rated by only 30-35% of consumers, whereas it is located in the preference zones and rated by 60-65% according to the maps cleaned using both RPCA and SURE.

Fig. 5. External preference mapping performed from $Y$, $Y_{RPCA}$ and $Y_{SURE}$ data.
Results were compared with those obtained in [9] following ANOVA and MFA analyses. It was shown there that Sooper was liked by a great number of consumers, especially in the Pakistani sample. The description of this product gives the criteria that determine liking (Fig. 4), which consist of saltiness (Gsalé), egg smell (Ooeuf), egg taste (Goeuf), butter taste (Gbeurre) and butter smell (Obeurre). On the other hand, Petit brun was highly rated according to the noisy map, by approximately 65% of consumers, against only 40% on the denoised maps. This agrees with the results obtained in [9], where this product was appreciated by approximately half of the population and totally rejected by the Pakistani panel. According to the sensory descriptions (Fig. 4), Petit brun is characterized by lemon smell and taste. The other products are located in the least liked zone in all three representations. Improvements were mainly observed for optimal products located in the ideal zones.
Hence, the results obtained from noisy data may lead to wrong decisions about certain products. It is advised to remove the existing noise in order to obtain a smoothed map that shows reliable and trustworthy results to manufacturers and helps them make the right decisions about improving products to meet consumers' preferences as closely as possible. To assess the efficiency of using clean data and its impact on map stability, the visualized maps were compared using the ASDP indicator. The consumer dataset was divided into 2 separate samples of 147 randomly selected consumers each. EPM was performed on each sample before and after denoising the consumer data. 100 random selections of samples were considered and the average deviation over all selections was recorded. The results are shown in Fig. 6. As expected, the average deviations of predictions after denoising by RPCA and SURE are consistently lower than before for all selections of samples. Moreover, RPCA clearly outperforms SURE by recovering the differences between consumers' judgments. MSE values were also computed from $Y$, $Y_{RPCA}$ and $Y_{SURE}$ and are equal, respectively, to 934.569, 496.4266 and 563.8959. Both RPCA and SURE improve the fitting performance; however, RPCA gives promising results and is very suitable in the case of sensory evaluations. It offers the best compromise by reconstructing only the true signal, reducing the errors of over-rating or mis-rating products and providing smoother maps.

IV. CONCLUSION
In this investigation, denoising consumer data is advocated in the field of sensory analysis before proceeding to External Preference Mapping. The existing noise affects the stability of the obtained map, which leads to imperfect results, and users may arrive at incorrect decisions about consumers' tendencies and the characterization of products. The use of Regularized Principal Component Analysis and Stein's Unbiased Risk Estimate was advocated. These denoising methods seek to give a more precise estimate of the underlying structure, which allows a better reconstruction and visualization of the maps. Both thresholding methods give promising results, but RPCA is better suited and has largely improved map stability. The results obtained from simulated examples and real data show that the distances computed between maps after denoising are closer, especially in the case of very noisy data and smaller samples of consumers. This means that noise removal reduces variability between consumers' judgments and thus helps to stabilize maps. The obtained sensory space is more stable and its preference level lines are smoother. Denoising helps to extract only the useful information about consumers' liking and to remove the irrelevant error. The goal was to provide researchers and practitioners with a tool with better performance. In future work, the idea is to address the over-smoothing issue and to propose other denoising techniques to be compared with RPCA and SURE, going one step further into parametrization details.

Fig. 1. Representation of ASDP calculations from Table I, for different numbers of consumers (C), noise variances (σ²) and numbers of underlying dimensions of RPCA (k).

The second and third boxplots, corresponding respectively to MSE($Y_{RPCA}$, $Y_{RPCA}$) and MSE($Y_{SURE}$, $Y_{SURE}$), are much smaller than MSE($Y$, $Y$). As expected, RPCA leads to the lowest MSE in all cases. For reconstructing and visualizing sensory data, RPCA is recommended. This behavior can be explained by the non-linear transformation of the singular values associated with hard thresholding, in contrast to the soft thresholding rule used by SURE.

Fig. 2. Mean Square Error computed between the fitted matrix from each method and the corresponding raw data over the 500 simulations from Table I.

Fig. 3. Representation of individuals from $Y$, $Y_{RPCA}$ and $Y_{SURE}$.

Fig. 4. Projection of products and their sensory attributes on the first PCA factor map.

Fig. 6. Results of ASDP calculations from 100 samplings of the real consumer data.

TABLE I.
RESULTS OF PREDICTION DISTANCES BETWEEN MAPS COMPUTED BEFORE AND AFTER DENOISING USING RPCA AND SURE, OBTAINED OVER 500 SIMULATIONS FOR DIFFERENT NUMBERS OF CONSUMERS (C), NOISE VARIANCES (σ²) AND NUMBERS OF UNDERLYING DIMENSIONS OF RPCA (K)

TABLE II.
COMPARISON OF PERCENTAGES OF ACCEPTANCE FROM EPM PERFORMED ON Y, Y_RPCA AND Y_SURE