There are two major assignments for this course:
Homework 1
Problem Set 1
Problem 1: The Charles Book Club Case
Read the case and answer all the questions at the end of the case.
Readings:
Bhandari, Vinni, and Dr. Nitin Patel. “The Charles Book Club Case.”
Levin, Nissan, and Jacob Zahav. “A Case Study in Database Marketing.” Tel Aviv University. Direct Marketing Educational Foundation, Inc. March 1995.
Association of American Publishers. Industry Statistics, 2002.
Art History of Florence
A new title, "The Art History of Florence", is ready for release. CBC has sent a test mailing to a random sample of 4,000 customers from its customer base. The customer responses have been collated with past purchase data. The data have been randomly partitioned into three parts: Training Data (1,800 customers), the initial data used to fit response models; Validation Data (1,400 customers), hold-out data used to compare the performance of different response models; and Test Data (800 customers), data to be used only after a final model has been selected, to estimate the likely accuracy of the model when it is deployed. The sample data are in a separate spreadsheet, CBC_4000.xls (XLS). Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 1 below:
Table 1: List of Variables in CBC_4000.xls
| VARIABLE NAMES | DESCRIPTION |
| --- | --- |
| Seq# | Sequence number in the partition |
| ID# | Identification number in the full (unpartitioned) market test data set |
| Gender | 0 = Male, 1 = Female |
| M | Monetary: total money spent on books |
| R | Recency: months since last purchase |
| F | Frequency: total number of purchases |
| FirstPurch | Months since first purchase |
| ChildBks | Number of purchases from the category: Child books |
| YouthBks | Number of purchases from the category: Youth books |
| CookBks | Number of purchases from the category: Cookbooks |
| DoItYBks | Number of purchases from the category: Do It Yourself books |
| RefBks | Number of purchases from the category: Reference books (Atlases, Encyclopedias, Dictionaries) |
| ArtBks | Number of purchases from the category: Art books |
| GeoBks | Number of purchases from the category: Geography books |
| ItalCook | Number of purchases of book title: "Secrets of Italian Cooking" |
| ItalAtlas | Number of purchases of book title: "Historical Atlas of Italy" |
| ItalArt | Number of purchases of book title: "Italian Art" |
| Florence | = 1 if "The Art History of Florence" was bought, = 0 if not |
| Related purchase | Number of related books purchased |
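The random three-way partition described above can be sketched as follows. This is only an illustration under stated assumptions: the customer IDs 1 to 4000, the function name `partition_ids`, and the fixed seed are hypothetical stand-ins, and the actual CBC_4000.xls file already arrives partitioned.

```python
import random

def partition_ids(ids, sizes, seed=0):
    """Shuffle the ids and cut them into consecutive blocks of the given sizes."""
    ids = list(ids)
    if sum(sizes) != len(ids):
        raise ValueError("partition sizes must add up to the number of ids")
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    parts, start = [], 0
    for size in sizes:
        parts.append(ids[start:start + size])
        start += size
    return parts

# Hypothetical customer IDs 1..4000 standing in for the ID# column.
customer_ids = range(1, 4001)
training, validation, test = partition_ids(customer_ids, [1800, 1400, 800])
print(len(training), len(validation), len(test))
```

Because every customer lands in exactly one block, a model fitted on the training block never sees the validation or test customers.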
Problem 2: The German Credit Case (PDF) (DOC)
Read the case and answer all the questions at the end of the case.
German Credit Case Data (XLS)
Homework 2
Problem Set 2
Problem 1:
A common application of Discriminant Analysis is the classification of bonds into various bond rating classes. These ratings are intended to reflect the risk of the bond and influence the cost of borrowing for companies that issue bonds. Various financial ratios culled from annual reports are often used to help determine a company’s bond rating.
The Excel spreadsheet BondRatingProb1.xls (XLS) contains two sheets named Training data and Validation data. These are data from a sample of 95 companies selected from COMPUSTAT financial data tapes. The company bonds have been classified by Moody’s Bond Ratings (1980) into seven classes of risk ranging from AAA, the safest, to C, the most risky. The data include ten financial variables for each company. These are:
LOPMAR: Logarithm of the operating margin,
LFIXMAR: Logarithm of the pretax fixed charge coverage,
LTDCAP: Long-term debt to capitalization,
LGERRAT: Logarithm of total long-term debt to total equity,
LLEVER: Logarithm of the leverage,
LCASHLTD: Logarithm of the cash flow to long-term debt,
LACIDRAT: Logarithm of the acid test ratio,
LCURRAT: Logarithm of the current assets to current liabilities,
LRECTURN: Logarithm of the receivable turnover,
LASSLTD: Logarithm of the net tangible assets to long-term debt.
The data are divided into 81 observations in the Training data sheet and 14 observations in the Validation data sheet. The bond ratings have been coded into numbers in the column titled CODERTG, with AAA coded as 1, AA as 2, etc. Use XLMiner to develop Discriminant Analysis and Neural Network models to classify the bonds in the Validation data sheet. You will need to use the score new data option. What is the performance of the best classifier you have been able to find? Notice that the classes are ordered (i.e., AAA is better than AA, which is better than A, ...). Would certain misclassification errors be worse than others? If so, how would you suggest measuring this?
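To make the point about ordered classes concrete, the sketch below contrasts the plain misclassification rate with one possible order-aware measure, the mean absolute difference between the actual and predicted CODERTG codes. The rating codes are made up for illustration, and the distance measure is just one candidate, not necessarily the one the problem has in mind.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def mean_rating_distance(actual, predicted):
    """Average number of rating classes by which the prediction is off."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up CODERTG codes (1 = AAA, ..., 7 = C); not taken from BondRatingProb1.xls.
actual      = [1, 2, 3, 4, 5, 6, 7]
near_misses = [2, 2, 4, 4, 5, 6, 7]  # two errors, each off by one class
far_misses  = [7, 2, 3, 4, 5, 6, 1]  # two errors, each off by six classes

for name, predicted in [("near", near_misses), ("far", far_misses)]:
    print(name,
          misclassification_rate(actual, predicted),
          mean_rating_distance(actual, predicted))
```

Both hypothetical classifiers make the same number of errors, so the plain rate cannot tell them apart, while the distance-based measure penalizes the one that confuses AAA with C.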
Problem 2:
Give true/false answers to the following questions, with one sentence to justify each answer.
- The adjusted R² value for a set of independent variables in multiple linear regression is always less than the value of R².
- The most promising subsets of variables to include in a multiple linear regression model are those that have few variables and a high value of Mallows' Cp.
- An Artificial Neural Network with no hidden layers is used to predict a continuous variable y using p input variables, x1, x2, …, xp. The network is trained on a training dataset, and the sum of squared errors on a validation dataset is found to be SSN. A multiple linear regression model with independent variables x1, x2, …, xp and dependent variable y is fitted to the same validation data. The sum of squared residuals for the regression model is SSR. SSR cannot be greater than SSN.
- The backprop algorithm, when used in training an Artificial Neural Network, will always terminate at a global or local minimum of the error function.
- The number of variables used in training an Artificial Neural Network is equal to the total number of nodes in the network.
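For reference, the two model-selection quantities named in the statements above are commonly defined as shown below. The notation is an assumption, since the problem set does not fix one: n is the number of observations, p the number of independent variables in the candidate model (the constant term is counted separately), SSE_p the residual sum of squares of that model, and \hat{\sigma}^2 the error-variance estimate from the model containing all candidate variables. Textbooks differ slightly in how they count parameters.

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2\,(p + 1)
```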
Problem 3:
The Excel spreadsheet RegressionProb3.xls (XLS) contains two sheets named Training Data and Validation Data. We will use XLMiner to build two models with the training data and then use the validation data to compare their performance as prediction models.
- Fit a multiple regression model, Model1, to the training data using all the variables X1 through X9 (and the constant term). Call the coefficient vector for this model β1.
- Use the subset selection options in XLMiner to choose a model using only the training data. Call the coefficient vector for this model β2.
- Use the Validation Data to compute the mean and the standard deviation of errors for Model1 by copying β1 into cells B5 through K5. Do the same for Model2 by copying β2.
- Compare the models in terms of (i) bias in the predictions and (ii) mean square error of predictions.
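The validation-set comparison in the steps above can be sketched outside the spreadsheet as follows. Everything here is illustrative: the data rows and coefficient vector are made up, the function names are hypothetical, and the error is taken as actual minus predicted, which is an assumption about sign convention.

```python
from statistics import mean, pstdev

def predict(beta, row):
    """beta[0] is the constant term; beta[1:] pair up with the predictor values."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

def error_summary(beta, rows, targets):
    """Return (bias, standard deviation of errors, mean square error)."""
    errors = [y - predict(beta, row) for row, y in zip(rows, targets)]
    bias = mean(errors)                # average prediction error
    spread = pstdev(errors)            # population standard deviation of errors
    mse = mean(e * e for e in errors)  # mean square error of predictions
    return bias, spread, mse

# Made-up validation rows with two predictors; not from RegressionProb3.xls.
rows = [(1.0, 2.0), (2.0, 0.0), (3.0, 1.0), (4.0, 3.0)]
targets = [5.5, 4.0, 7.5, 12.0]
beta = [1.0, 1.0, 2.0]                 # hypothetical coefficient vector

bias, spread, mse = error_summary(beta, rows, targets)
print(bias, spread, mse)
```

With the population standard deviation, the three numbers are tied together by MSE = bias² + (standard deviation)², which is why the problem asks for the mean and standard deviation of the errors before comparing the models.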
Problem 4:
The Excel spreadsheet NormalsProb4.xls (XLS) contains 1000 observations with two groups (Group 0 and Group 1) and two variables (x and y).
- Plot all the data points in a two-dimensional scatter plot. Mark Group 1 points and Group 0 points differently (e.g., one with an 'x' and the other with an 'o') so that you can visualize the distribution of the points in each group.
- Partition the data into training and classification sets with 600 and 400 observations, respectively.
- Compare the performance of logistic regression, discriminant analysis, neural networks, and k-nearest neighbors. Remember that logistic regression and discriminant analysis are linear classifiers, i.e., they separate points of different classes with a linear boundary (a line in this two-variable problem). In contrast, neural networks and k-nearest neighbors allow non-linear classifiers (do you have an intuitive idea of the geometry of how the latter two classify points?).
- For each method, plot a scatter plot for the best classifier. On each plot, display the following series:
  - Group 0 points that are classified correctly,
  - Group 0 points that are misclassified,
  - Group 1 points that are classified correctly,
  - Group 1 points that are misclassified.
- The data were simulated. The (x, y) values for each class follow a bivariate normal distribution. The Bayes Rule for minimum misclassification has an error rate of 18.5%. How close is the best classifier of each type that you have developed to this error rate? Give an intuitive explanation of why certain types of classifiers seem to be better for this data.
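As a rough illustration of the non-linear side of the comparison and of the four plot series, here is a minimal k-nearest-neighbors sketch in plain Python. The toy points, the choice k = 3, and the function names are assumptions made for the example; the assignment itself uses XLMiner on NormalsProb4.xls.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), group). Majority vote among the k nearest points."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(group for _, group in nearest)
    return votes.most_common(1)[0][0]

def split_into_series(points, predict):
    """Return the four plot series keyed by (actual group, classified correctly?)."""
    series = {(g, ok): [] for g in (0, 1) for ok in (True, False)}
    for xy, group in points:
        series[(group, predict(xy) == group)].append(xy)
    return series

# Toy data, not taken from NormalsProb4.xls.
train = [((0.0, 0.0), 0), ((0.5, 0.2), 0), ((0.2, 0.6), 0),
         ((2.0, 2.0), 1), ((2.5, 1.8), 1), ((1.8, 2.4), 1)]
holdout = [((0.3, 0.3), 0), ((2.2, 2.1), 1), ((1.9, 1.9), 0)]

series = split_into_series(holdout, lambda xy: knn_predict(train, xy, k=3))
errors = len(series[(0, False)]) + len(series[(1, False)])
print("hold-out error rate:", errors / len(holdout))
```

Each of the four lists in `series` can be handed to a plotting tool as its own marker style, and the printed rate is the figure to set against the 18.5% Bayes error once the real 400-observation hold-out set is used.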