Midterm Exam Questions and Requirements for International Students
1. Discuss the difference between a hunch-driven and a data-driven decision. (Article 1) (5 points)
2. What is the challenge of implementing the long tail strategy, and what is the right way to implement it? (Article 4) (5 points)
3. Why does Target want to know if a woman is pregnant? And how did Target do it? (Article 2) (6 points)
4. Why do many companies’ NBO strategies fail? (Article 6) (5 points)
5. What are the advantages and disadvantages of using social media for product/service promotion? (Article 7) (7 points)
6. Why are “large, diverse crowds of independent thinking people better at predicting the future or solving a problem than the brightest experts among them?” (Article 8) (7 points)
7. Given the following table, what is the probability that a large company that has previously been charged with illegal accounting activity is fraudulent? (Lecture 6) (5 points)
8. What is the difference between classification and prediction in the context of data mining? Why do we need to partition data for supervised learning? (Lecture 3) (6 points)
9. Discuss the difference between supervised and unsupervised learning. Give one example of a business application of supervised learning and one of unsupervised learning. (6 points)
10. Consider the following series of business transactions: (Lecture 8)
Transaction 1: involves items A and D
Transaction 2: involves item A
Transaction 3: involves items A, C and D
Transaction 4: involves items B and D
Q1. List all item combinations and their support (in percent). (6 points)
Q2. List all possible rules (in the form {X} -> {Y} meaning if set {X} is purchased then set {Y} is also purchased) and their confidence. Note that {X} -> {Y} and {Y} -> {X} are two different rules. (6 points)
Q3. What is the lift ratio for the rule {B} -> {D}? Briefly interpret it. (6 points)
Q4. What is the lift ratio for the rule {A} -> {D}? Briefly interpret it. (6 points)
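The three measures used in Q1 to Q4 can be sketched in a few lines of Python. This is a minimal illustration under the usual definitions (support as the share of transactions containing an item set, confidence as support of X and Y together divided by support of X, lift as confidence divided by support of Y); the function names are chosen here for illustration and are not part of the course material.

```python
from itertools import combinations

# Transactions exactly as listed in question 10.
transactions = [
    {"A", "D"},
    {"A"},
    {"A", "C", "D"},
    {"B", "D"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X and Y together) / support(X) for the rule X -> Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """confidence(X -> Y) divided by the benchmark confidence, support(Y)."""
    return confidence(antecedent, consequent) / support(consequent)

# Enumerate every item combination that occurs in at least one transaction.
items = sorted(set().union(*transactions))
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(combo)
        if s > 0:
            print(combo, f"support = {s:.0%}")
```

A lift above 1 suggests the antecedent and consequent occur together more often than independence would predict, and a lift below 1 suggests the opposite; calling `lift({"B"}, {"D"})` or `lift({"A"}, {"D"})` gives the ratios to interpret.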
11. The German Credit data set (available on Blackboard) contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). New applicants for credit can also be evaluated on these 30 "predictor" variables. We want to develop a credit scoring model that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on values for one or more of the predictor variables. All the variables are explained in Table 1.1. (The data has been organized in the spreadsheet GermanCredit.xls.)
Table 1.1: Variables for the German Credit data
Var. # | Variable Name | Description | Variable Type | Code Description
1 | OBS# | Observation No. | Categorical | Sequence Number in data set
2 | CHK_ACCT | Checking account status | Categorical | 0: < 0 DM; 1: 0 <= ... < 200 DM; 2: >= 200 DM; 3: no checking account
3 | DURATION | Duration of credit in months | Numerical |
4 | HISTORY | Credit history | Categorical | 0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account
5 | NEW_CAR | Purpose of credit | Binary | car (new) 0: No, 1: Yes
6 | USED_CAR | Purpose of credit | Binary | car (used) 0: No, 1: Yes
7 | FURNITURE | Purpose of credit | Binary | furniture/equipment 0: No, 1: Yes
8 | RADIO/TV | Purpose of credit | Binary | radio/television 0: No, 1: Yes
9 | EDUCATION | Purpose of credit | Binary | education 0: No, 1: Yes
10 | RETRAINING | Purpose of credit | Binary | retraining 0: No, 1: Yes
11 | AMOUNT | Credit amount | Numerical |
12 | SAV_ACCT | Average balance in savings account | Categorical | 0: < 100 DM; 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown/no savings account
13 | EMPLOYMENT | Present employment since | Categorical | 0: unemployed; 1: < 1 year
14 | INSTALL_RATE | Installment rate as % of disposable income | Numerical |
15 | MALE_DIV | Applicant is male and divorced | Binary | 0: No, 1: Yes
16 | MALE_SINGLE | Applicant is male and single | Binary | 0: No, 1: Yes
17 | MALE_MAR_WID | Applicant is male and married or a widower | Binary | 0: No, 1: Yes
18 | CO-APPLICANT | Application has a co-applicant | Binary | 0: No, 1: Yes
19 | GUARANTOR | Applicant has a guarantor | Binary | 0: No, 1: Yes
20 | PRESENT_RESIDENT | Present resident since - years | Categorical | 0: <= 1 year; 1 < … <= 2 years; 2 < … <= 3 years; 3: > 4 years
21 | REAL_ESTATE | Applicant owns real estate | Binary | 0: No, 1: Yes
22 | PROP_UNKN_NONE | Applicant owns no property (or unknown) | Binary | 0: No, 1: Yes
23 | AGE | Age in years | Numerical |
24 | OTHER_INSTALL | Applicant has other installment plan credit | Binary | 0: No, 1: Yes
25 | RENT | Applicant rents | Binary | 0: No, 1: Yes
26 | OWN_RES | Applicant owns residence | Binary | 0: No, 1: Yes
27 | NUM_CREDITS | Number of existing credits at this bank | Numerical |
28 | JOB | Nature of job | Categorical | 0: unemployed/unskilled - non-resident; 1: unskilled - resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer
29 | NUM_DEPENDENTS | Number of people for whom liable to provide maintenance | Numerical |
30 | TELEPHONE | Applicant has phone in his or her name | Binary | 0: No, 1: Yes
31 | FOREIGN | Foreign worker | Binary | 0: No, 1: Yes
32 | RESPONSE | Credit rating is good | Binary | 0: No, 1: Yes
Table 1.2, below, shows the values of these variables for the first several records in the case.
Table 1.2 The data (first several rows)
The consequences of misclassification have been assessed as follows: the cost of a false positive (incorrectly saying an applicant is a good credit risk) is 500 DM, while the cost of a false negative (incorrectly saying an applicant is a bad credit risk) is 100 DM. This can be summarized in the following table.
Table 1.3 Opportunity Cost Table (in Deutsche Marks)
Actual | Predicted (Decision): Good (Accept) | Predicted (Decision): Bad (Reject)
Good | 0 | 100 DM
Bad | 500 DM | 0
The opportunity cost table was derived from the average net profit per loan as shown below:
Table 1.4 Average Net Profit
Actual | Predicted (Decision): Good (Accept) | Predicted (Decision): Bad (Reject)
Good | 100 DM | 0
Bad | -500 DM | 0
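As a rough sketch of how the two tables apply to a classification matrix, the Python below multiplies each cell count by its per-case value from Table 1.3 or Table 1.4. The counts in `example_matrix` are placeholders invented for illustration, not results from GermanCredit.xls, and the helper name `total` is an assumption made here.

```python
# Per-case opportunity cost in DM, keyed by (actual, predicted), from Table 1.3.
OPPORTUNITY_COST = {
    ("good", "good"): 0,
    ("good", "bad"): 100,
    ("bad", "good"): 500,
    ("bad", "bad"): 0,
}

# Per-case net profit in DM, keyed by (actual, predicted), from Table 1.4.
NET_PROFIT = {
    ("good", "good"): 100,
    ("good", "bad"): 0,
    ("bad", "good"): -500,
    ("bad", "bad"): 0,
}

def total(matrix, per_case):
    """Sum count * per-case value over the four cells of a classification matrix."""
    return sum(count * per_case[cell] for cell, count in matrix.items())

# Placeholder counts for illustration only; replace with your model's matrix.
example_matrix = {
    ("good", "good"): 60,
    ("good", "bad"): 10,
    ("bad", "good"): 8,
    ("bad", "bad"): 22,
}

print("Total opportunity cost (DM):", total(example_matrix, OPPORTUNITY_COST))
print("Total net profit (DM):", total(example_matrix, NET_PROFIT))
```

The two totals are linked: the net profit equals 100 DM times the number of actually good applicants, minus the total opportunity cost.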
Tasks
1. Use the ‘GermanCredit.xls’ file and all variables to develop a logistic regression classification model. Create a classification matrix for this model. (6 points)
2. Use the ‘GermanCredit.xls’ file and select ten variables to develop a logistic regression classification model. Create a classification matrix for this model. (6 points)
3. The classification matrix provides three types of accuracy that measure the performance of the model. Based on the opportunity costs given in Tables 1.3 and 1.4, please indicate which accuracy measure is the most important one in this context. Offer your comments on these models, indicating the outputs and measurements you would use to judge the performance of your models. (6 points)
4. If you want to select 275 customers from the validation data set, which model would you adopt for credit rating? Why? (6 points)
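One possible way to compare models for a fixed-size selection is to rank the validation records by each model's predicted probability of good credit, keep the top N, and total the net profit from Table 1.4 over the selected records. The sketch below assumes each record is a pair of (predicted probability of good credit, actual outcome); the scores in `model_a` and `model_b` are made up for illustration and the function name is hypothetical.

```python
def top_n_net_profit(scored, n, profit_good=100, loss_bad=-500):
    """Net profit (DM) from accepting the n records with the highest scores.

    `scored` is a list of (predicted_probability_of_good, actual_is_good) pairs.
    The default profit and loss come from Table 1.4.
    """
    ranked = sorted(scored, key=lambda record: record[0], reverse=True)
    chosen = ranked[:n]
    return sum(profit_good if is_good else loss_bad for _, is_good in chosen)

# Made-up scores for two hypothetical models on the same five records.
model_a = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
model_b = [(0.92, True), (0.85, True), (0.70, True), (0.65, False), (0.20, False)]

for name, scored in [("model A", model_a), ("model B", model_b)]:
    print(name, "net profit for top 3 (DM):", top_n_net_profit(scored, 3))
```

With real validation output, `n` would be 275 and each list would hold one pair per validation record.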