Data Balancing and Aggregation Strategy to Predict Yield in Hard Disk Drive Manufacturing

Abstract: Hard disk drive manufacturing is complicated and involves several steps of assembly and testing. Poor yield in one step can result in failure of the whole lot. Accurate yield prediction is thus important to product monitoring and management. This paper presents a novel approach to data preparation and modeling for predicting yield in hard disk drive production. A data balancing technique based on clustering and re-sampling is introduced to make the proportions of pass and fail products comparable. We then propose a strategy to aggregate manufacturing data into groups of a reasonable size that are efficient for the subsequent step of building the yield prediction model. Experimental results reveal that grouping data into a constant size of 10,000 records leads to more accurate yield prediction than the intuitive idea of weekly grouping.


I. INTRODUCTION
Data reliability and cost efficiency are two important factors that make hard disks widely used as the storage device for big data in most organizations [1]. In the production process of a hard disk drive (HDD), many small parts are assembled and tested several times along the assembly line. The quality control process may take as long as three months per production lot [2,3]. The HDDs that pass all testing steps are called pass units. Those that fail in any of the testing stages are called fail units. The proportion of pass units among all units is called yield [4,5].
HDD manufacturers naturally want the yield of the production process to be as high as possible. Accurate yield estimation is important for process engineers and product managers for proper planning in logistics and marketing. Yield estimation is traditionally performed by process engineers, who rely on their own experience in calculating yields at each step of HDD manufacturing. This manual estimation is time consuming. We thus propose in this research work a data-driven approach based on machine learning to predict yield automatically and assist engineers in the HDD manufacturing industry.
The difficult part of machine learning-based yield prediction is the excessive number of data records and data attributes. The number of records can be higher than a million and the number of attributes can exceed one hundred. It is almost impossible to use such high-dimensional data directly in the modeling step. Therefore, data pre-processing is an essential step to be applied prior to the deployment of a machine learning technique [6][7][8][9][10][11].
We thus introduce a heuristic method to pre-process HDD manufacturing data. We first propose a novel idea based on cluster analysis to re-balance the data. HDD data records contain two classes of products: pass units and fail units. Normally, the number of pass units is much higher than the number of fail units. A high imbalance between the two classes can significantly decrease the performance of the prediction model. Improving the data by making the proportions of the two classes equal is therefore essential. Reducing the number of data attributes is the next essential step of data pre-processing. Before applying a machine learning technique to create a model to predict yield, we also introduce a novel idea of data aggregation that groups data records in order to reduce the number of data instances. Details of these techniques are explained in the next section.

A. Data
The data used in the modeling and experimentation are real data collected from HDD production over a three-month period. The number of data records is 10,000,000 and the number of attributes (or features) is 125. Some important attributes are summarized in Table I.

B. Research Framework and Yield Prediction Steps
The four main steps of data-driven modeling to predict yield in the HDD manufacturing process are shown in Fig. 1.
Data Balancing. The original dataset is highly imbalanced between the pass and fail units (the imbalance ratio is 28:1). Therefore, a data re-balancing method is introduced. The re-balancing strategy starts by grouping the data into five main groups using the k-means algorithm. After that, a different data handling method is applied to each data group, as illustrated in Table II.
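The cluster-then-resample idea can be sketched as follows. This is a minimal illustration, not the authors' procedure: the per-cluster handling rules of Table II are not reproduced here, so the sketch assumes one simple rule (randomly undersample the larger class inside each cluster). The function names `kmeans` and `balance_by_cluster`, the two-feature toy data, and the seeds are all hypothetical.

```python
import random
from collections import defaultdict

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def balance_by_cluster(points, status, k=5, seed=0):
    """Return indices of a balanced subset (assumed rule: per-cluster undersampling)."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    by_cluster = defaultdict(lambda: {"pass": [], "fail": []})
    for i, (c, s) in enumerate(zip(labels, status)):
        by_cluster[c][s].append(i)
    keep = []
    for groups in by_cluster.values():
        n = min(len(groups["pass"]), len(groups["fail"]))
        # A cluster that contains only one class contributes nothing here.
        keep += rng.sample(groups["pass"], n) + rng.sample(groups["fail"], n)
    return sorted(keep)

# Toy data with the 28:1 pass-to-fail ratio mentioned in the text.
data_rng = random.Random(42)
points = [(data_rng.random(), data_rng.random()) for _ in range(290)]
status = ["pass"] * 280 + ["fail"] * 10

kept = balance_by_cluster(points, status)
n_pass = sum(status[i] == "pass" for i in kept)
n_fail = sum(status[i] == "fail" for i in kept)
print(n_pass, n_fail)
```

Balancing inside each cluster, rather than over the whole dataset, keeps the retained pass units close to the fail units in feature space. Other per-cluster rules, such as oversampling the fail units, would fit the same skeleton.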
Feature Selection. This step reduces the number of data attributes. We experiment with several feature selection algorithms, including decision tree (C5), classification and regression tree (CART), support vector machine (SVM), stepwise regression (SR), genetic algorithm (GA), chi-square (Chi²), and information gain (IG). After experimentation, the best method is applied to extract the important attributes to be used in the next step.
Attribute descriptions from Table I:
- "Pass" status indicates that the HDD passed the test process and can be input to the next operation step or is ready to ship to the customer.
- "Fail" status means that the HDD was rejected in the test process and must go to the "rework", "retest", "recycle" or "scrap" process according to the debug diagnostic failure symptom.
HSA_PR: Head stack assembly status (prime/rework).
- "Prime" means the HSA is a freshly built component that has never been installed in any other HDD.
- "Rework" means the HSA is a component that had been installed in another HDD, but that HDD was rejected in the test process with the HSA labeled as rework. Thus, the HSA is recycled by being rebuilt into this HDD.
Data Aggregation. This step is another contribution of this work. To decrease the number of data records and to improve the performance of yield prediction, we propose data aggregation using two main strategies: constant aggregation and weekly aggregation. Constant aggregation groups data records into groups of a constant size, such as 500 records per group, whereas weekly aggregation groups records by week. An example of grouping data at a constant interval of 10 records per group is shown in Fig. 2. Suppose the data contain records from three weeks with five selected attributes (Fig. 3); the weekly aggregation step is then illustrated in Fig. 4. It can be noticed from Figs. 3 and 4 that a new attribute named yield has been created. The value of yield is computed as the number of pass units in a data group divided by the number of all units in that group, multiplied by 100 to give the yield percentage for each new aggregated data record. Four new data attributes (HSA_PR = Prime, HSA_PR = RCY, Media_PR = Prime, Media_PR = RCY) are also created to be used later in the modeling step.
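The two aggregation strategies and the yield computation described above can be sketched as follows. This is an illustration under stated assumptions: the record layout, the helper names (`summarize`, `constant_aggregation`, `weekly_aggregation`), and the toy data are hypothetical, and the four new attributes are assumed here to be per-group counts, which the text does not specify.

```python
from itertools import groupby

def summarize(group):
    """Collapse one group of unit records into a single aggregated record."""
    n = len(group)
    n_pass = sum(r["status"] == "Pass" for r in group)
    out = {"units": n, "yield": 100.0 * n_pass / n}
    # Assumption: the four new attributes are counts within the group.
    for attr in ("HSA_PR", "Media_PR"):
        for value in ("Prime", "RCY"):
            out[f"{attr}={value}"] = sum(r[attr] == value for r in group)
    return out

def constant_aggregation(records, size):
    """Group consecutive records into chunks of `size` (the last may be shorter)."""
    return [summarize(records[i:i + size]) for i in range(0, len(records), size)]

def weekly_aggregation(records):
    """Group records by their week number."""
    ordered = sorted(records, key=lambda r: r["week"])
    return [summarize(list(g)) for _, g in groupby(ordered, key=lambda r: r["week"])]

# Toy data: 30 units over three weeks of unequal volume (5, 10 and 15 units).
fails = {3, 7, 8, 21}
records = [
    {
        "week": 1 if i < 5 else 2 if i < 15 else 3,
        "status": "Fail" if i in fails else "Pass",
        "HSA_PR": "RCY" if i % 3 == 0 else "Prime",
        "Media_PR": "RCY" if i % 4 == 0 else "Prime",
    }
    for i in range(30)
]

constant = constant_aggregation(records, 10)
weekly = weekly_aggregation(records)
print([g["yield"] for g in constant])
print([(g["units"], round(g["yield"], 2)) for g in weekly])
```

With constant aggregation every aggregated record summarizes the same number of units, whereas weekly groups vary in size with production volume, so their yield values rest on unequal sample sizes.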

Model Creation & Evaluation. The last step of this research uses the re-balanced and aggregated data to create a model for predicting yield in HDD manufacturing. Two learning algorithms are applied: multiple linear regression (MLR) and artificial neural network (ANN).
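As an illustration of the MLR half of this step, the sketch below fits an ordinary least squares model to aggregated records by solving the normal equations. It is a minimal example rather than the authors' implementation: the function names `fit_mlr` and `predict`, the choice of two predictors, and the toy numbers are hypothetical, and the toy target is constructed to be exactly linear so that the fit recovers the generating coefficients. The ANN is not covered here.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, b1, b2, ...]. Raises ZeroDivisionError if the
    predictors are perfectly collinear.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coef, x):
    """Predicted yield for one aggregated record with predictor values x."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

# Toy aggregated records: two hypothetical predictors per group.
X = [(9, 8), (7, 9), (10, 10), (6, 7), (8, 6)]
# Toy target generated as 50 + 3*x1 + 2*x2, so the relation is exactly linear.
y = [50 + 3 * x1 + 2 * x2 for x1, x2 in X]

coef = fit_mlr(X, y)
print([round(c, 6) for c in coef])
print(round(predict(coef, (8, 8)), 6))
```

On real data the target is noisy, so the fitted model would be judged by its prediction error on held-out aggregated records rather than by recovering known coefficients.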

III. EXPERIMENTATION AND RESULTS
In the first step, data balancing, the data are clustered into five groups; the resulting imbalance ratio between the pass and fail units is shown in Table III. The proposed method resolves this imbalance, resulting in the equal proportions shown in Table IV.
After data balancing, seven feature selection methods are applied and then tested with the two learning algorithms (MLR and ANN). The results of feature selection are presented in Table V. The performance of the feature selection practiced by engineers is also presented as a baseline for comparison. The results show that selecting features with the genetic algorithm and building the model with multiple linear regression is the best technique for yield prediction.
We then applied GA and MLR to test the two data aggregation methods: constant aggregation and weekly aggregation. For constant aggregation, seven aggregation sizes were tested. The results are shown in Table VI. Constant-size aggregation clearly performs better than weekly aggregation, and the errors are the same for group sizes from 10K up to 40K. The errors likewise stop decreasing with the other two feature selection methods, as shown in Fig. 5.
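Of the seven feature selection methods compared in this section, the chi-square score is the simplest to show in isolation. The sketch below ranks categorical attributes by the Pearson chi-square statistic of their contingency table against the pass/fail status. It is an illustration only: the function name `chi_square`, the attribute name `line`, and the eight toy units are hypothetical, and the paper does not state how its chi-square selection was configured.

```python
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    f_tot = Counter(feature)
    t_tot = Counter(target)
    stat = 0.0
    for f in f_tot:
        for t in t_tot:
            expected = f_tot[f] * t_tot[t] / n
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat

# Toy balanced data: four pass units followed by four fail units.
status = ["Pass"] * 4 + ["Fail"] * 4
attributes = {
    # Perfectly associated with the status in this toy sample.
    "HSA_PR": ["Prime"] * 4 + ["RCY"] * 4,
    # Hypothetical attribute that is independent of the status.
    "line": ["X", "Y"] * 4,
}

scores = {name: chi_square(values, status) for name, values in attributes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A higher score indicates a stronger association with the status, so attributes would be kept from the top of the ranking down to some cut-off.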

IV. CONCLUSION
This research presents a methodology to prepare data for modeling with machine learning techniques in order to predict yield in the hard disk drive (HDD) manufacturing process. The data preparation steps used in this work are data balancing, feature selection, and data aggregation. The prepared data are then modeled with two algorithms: multiple linear regression and artificial neural network.
The focus of this research is the data preparation techniques. We propose a technique to re-balance the data so that they contain equal amounts of the two data classes: pass and fail HDD units. The proposed data balancing is based on data clustering. We also introduce the idea of data aggregation, based either on a weekly time frame or on a constant group size. Experimental results reveal that data aggregation at a constant size of 10,000 records, combined with the genetic algorithm for feature selection and multiple linear regression for modeling, yields the best predictive model for the specific task of HDD yield prediction.

Fig. 5. The trend of prediction errors of MLR models trained with varied sizes of constant data aggregation, with feature selection performed by C5, GA, and Chi².

TABLE I: SOME ATTRIBUTES FROM THE HDD MANUFACTURING

TABLE IV: PASS AND FAIL UNITS AFTER APPLYING THE DATA BALANCING