
Data Scientist in a big retail store, say Woolworths, and your task is to optimise various retail processes such as inventory management, product placement, and customised offers.

SESSION 2 FORMAL EXAMINATIONS – NOVEMBER 2020
EXAMINATION DETAILS:
Unit Code: COMP2200/COMP6200
Unit Name: Data Science
Duration of exam: 3 hours in a 6 hour window
Total number of questions: 8
Total number of pages: 5 (incl. this cover sheet)
Total number of marks: 100

INSTRUCTIONS:
Answer ALL questions in a single word processor file and upload your answers to the provided Turnitin submission page by the due time. You can upload a Word or PDF file.
Collaboration with others in completing this exam is not allowed. The work you submit should be your own. Any evidence of copying or collusion will be referred to the Faculty Discipline Committee. Note that your submissions will be passed through Turnitin to identify copying from the Internet or from other students.
1. (10 marks) You are working as a Data Scientist in a big retail store, say Woolworths, and your task is to optimise various retail processes such as inventory management, product placement, and customised offers. Using the CRISP-DM model, can you explain what you will do in each stage of the data science project life cycle, what your input will be, and what you will deliver at each stage? (Write no more than 500 words in total)
2. The following graph shows the relationship between the US spending on science and the number of suicides (by hanging, strangulation, and suffocation). Based on this graph, answer the following questions.

(a) (5 marks) What does the correlation mean in this context? What does the R² value mean? (Write no more than 200 words in total)
(b) (5 marks) One of your friends, Mr. Citizen, thinks that this correlation is due to the increasing pressure on researchers to continuously produce output. How would you evaluate this explanation? Looking at the numbers in the data displayed, can you determine whether this explanation could account for the effect shown? (Write no more than 200 words in total)
3. (a) (5 marks) For the following data scenarios, which chart should you use to visualise? Justify your answers. (Write no more than 200 words in total)
(1) Bureau of Meteorology data having average monthly rainfall in Sydney from 2016 to 2020.
(2) Hospital data having systolic pressure and weight of 2000 patients.
(3) Australian Bureau of Statistics data having yearly household expenses (grocery, transport, education, rent/mortgage, and entertainment) for the Australian population.
(4) Australian Bureau of Statistics providing Census data showing population density for each suburb across New South Wales.
(5) Bureau of Meteorology weather data having multiple weather conditions in Sydney with features including date, precipitation, max temperature, min temperature, wind speed, and weather (drizzle, rain, sunny, snow, and fog).
(b) (5 marks) You are working on a project that analyses the census data provided by the Australian Bureau of Statistics. Table 1 shows a sample dataset. What data cleaning and normalisation techniques should you apply to this data so that you can apply unsupervised learning methods? (Write no more than 200 words in total)
Table 1: Sample Census dataset from Australian Bureau of Statistics
Census Code  Suburb          State            Area (sq km)
CED101       Berowra         NSW              78644.32
CED101       wentworthville  New South Wales  89232.53645
CED101       north sydney    nsw              10324.45
CED101       mt. druitt                       10583.12
CED105       st. Kilda       Vic.             8524.96762
CED105       South melb.     vic              45321.87
CED105       gelong          Victoria         24534.2534
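As an illustration of the kind of cleaning Table 1 calls for, here is a minimal sketch using only the Python standard library. The alias map `STATE_ALIASES`, the choice to represent a missing state as `None`, and the use of min-max scaling for the area column are all assumptions made for this example; they are not prescribed by the exam.

```python
# Rows copied from Table 1: (census code, suburb, state, area).
ROWS = [
    ("CED101", "Berowra", "NSW", 78644.32),
    ("CED101", "wentworthville", "New South Wales", 89232.53645),
    ("CED101", "north sydney", "nsw", 10324.45),
    ("CED101", "mt. druitt", "", 10583.12),
    ("CED105", "st. Kilda", "Vic.", 8524.96762),
    ("CED105", "South melb.", "vic", 45321.87),
    ("CED105", "gelong", "Victoria", 24534.2534),
]

# Hypothetical alias map: every spelling seen in the sample -> one label.
STATE_ALIASES = {
    "nsw": "NSW",
    "new south wales": "NSW",
    "vic": "VIC",
    "vic.": "VIC",
    "victoria": "VIC",
}


def clean_state(raw):
    """Map a raw state string to a canonical label, or None if missing/unknown."""
    return STATE_ALIASES.get(raw.strip().lower())


def min_max(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest is 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


states = [clean_state(row[2]) for row in ROWS]
suburbs = [row[1].strip().title() for row in ROWS]  # consistent casing only
areas = min_max([row[3] for row in ROWS])

for suburb, state, area in zip(suburbs, states, areas):
    print(suburb, state, round(area, 3))
```

The sketch deliberately leaves misspellings and abbreviations in suburb names (for example "gelong" and "South melb.") untouched, and it does not impute the missing state; both would need a reference list of suburbs that the sample does not provide.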
4. (a) (5 marks) I have data on different laptops from different brands with features for weight (grams), size (cm), RAM (GB), Hard Drive (GB), Processor (Intel core i5, Intel core i7, Intel core i3, AMD Ryzen, AMD Athlon, etc), and price (Australian Dollars). I want to cluster similar laptops based on their specifications. Discuss your approach to applying a clustering algorithm on this data. What transformations would be needed before you could work with this data and why? (Write no more than 200 words in total)
(b) (5 marks) You built a regression model to predict baby length based on mother’s height and mother’s age. After fitting the model on the training data, the coefficients for mother’s height and mother’s age are [0.2539, -0.0075] and the intercept is 4.7623. How do you interpret these coefficient and intercept values? Can you determine how a change in each variable affects the baby’s length? (Write no more than 200 words in total)
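The linear model in part (b) can be written out directly, which makes the role of each coefficient easier to see. The sketch below uses the coefficients and intercept given in the question; the function name and the example inputs (160 and 30) are hypothetical, and the exam does not state the measurement units.

```python
# Values stated in the question.
COEF_HEIGHT = 0.2539
COEF_AGE = -0.0075
INTERCEPT = 4.7623


def predict_baby_length(mother_height, mother_age):
    """Linear prediction: intercept + coefficient * feature, summed over features."""
    return INTERCEPT + COEF_HEIGHT * mother_height + COEF_AGE * mother_age


# Hypothetical inputs, chosen only to show the effect of a one-unit change.
base = predict_baby_length(160.0, 30.0)
taller = predict_baby_length(161.0, 30.0)  # height + 1, age unchanged
older = predict_baby_length(160.0, 31.0)   # age + 1, height unchanged

print("effect of +1 height:", taller - base)
print("effect of +1 age:", older - base)
```

Holding the other feature fixed, the difference between two predictions equals the coefficient of the feature that changed, up to floating-point rounding.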
5. You plan to build a machine learning model to predict whether a patient in a hospital is “healthy” or “not healthy” based on the patient’s medical measurements. The dataset is highly imbalanced: “not healthy” individuals far outnumber “healthy” individuals.
(a) (5 marks) To evaluate the performance of a trained model, you can create a confusion matrix comparing the predicted results with the class labels of the testing data. From the confusion matrix, you calculated the accuracy score. Explain why reporting the accuracy score on such a dataset is not indicative of the model’s true performance. What measures should you take to mitigate any inflated results? What other metrics can you derive from the confusion matrix that are truly indicative of the model’s robust performance? (Write no more than 200 words in total)
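To make the accuracy problem concrete, the sketch below computes several metrics from the four cells of a confusion matrix. The counts are invented for illustration (950 "not healthy" and 50 "healthy" test cases, with a model that always predicts "not healthy"); treating "healthy" as the positive class is also an assumption of this example.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, fn, tn):
    """Derive common metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts: the model labels every case "not healthy" (negative).
result = metrics(tp=0, fp=0, fn=50, tn=950)
for name, value in result.items():
    print(name, round(value, 3))
```

With these invented counts the accuracy is high even though no "healthy" patient is ever identified, whereas recall, F1 and balanced accuracy expose the failure.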
(b) (5 marks) If the training dataset is very large (e.g., 1 billion data instances) and the testing dataset has 1000 instances, which model would you prefer to use, a KNN (k-Nearest Neighbors) classifier or a Naïve Bayes classifier? Justify your answer. (Write no more than 200 words in total)
6. There is a robot in an animal shelter which needs to learn to discriminate between Dogs and Cats based on the fur and colour features. You are required to train the robot with classification models on the following dataset (Table 2) and make a prediction on a testing data instance. The feature Fur takes one of two possible values (Coarse and Fine), and Colour also takes one of two possible values (Brown and Black). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target during the inference.
Table 2: Animal Data
Index  Fur     Colour  Class
#1     Coarse  Brown   Dog
#2     Fine    Black   Cat
#3     Coarse  Black   Cat
#4     Coarse  Black   Dog
#5     Fine    Brown   Cat
(a) (5 marks) You are required to build a KNN (k-Nearest Neighbors) classification model and predict the class label for the following data instance (#6 in Table 3). You can randomly choose k from its possible value range to consider the k-nearest neighbors. The distance between two data instances is calculated as the number of features having different values. For example, the distance between the 1st and the 2nd data instances is 2 because they differ from each other on both features ‘Fur’ and ‘Colour’. Specify the value of k you will use, and show the details of learning and prediction.
Table 3: Testing Dataset
Index  Fur   Colour  Class
#6     Fine  Brown
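The distance defined in part (a) counts mismatching features, so it can be sketched in a few lines. The code below is one possible illustration, not a model answer: the function names are hypothetical, and breaking distance ties by table order is an assumption the question does not specify.

```python
from collections import Counter

# Training rows from Table 2: (Fur, Colour, class).
TRAIN = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]
QUERY = ("Fine", "Brown")  # instance #6 from Table 3


def distance(a, b):
    """Number of features on which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_predict(query, k):
    """Majority vote among the k training rows closest to the query.

    sorted() is stable, so rows at equal distance keep their table order
    (an assumption made for this sketch).
    """
    ranked = sorted(TRAIN, key=lambda row: distance(row[:2], query))
    votes = Counter(row[2] for row in ranked[:k])
    return votes.most_common(1)[0][0]


distances = [distance(row[:2], QUERY) for row in TRAIN]
print("distances from #6 to #1..#5:", distances)
for k in (1, 3, 5):
    print("k =", k, "->", knn_predict(QUERY, k))
```

Odd values of k are used in the loop because, with two classes, they rule out a tied vote.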
(b) (10 marks) You are required to build a Naïve Bayes classifier from the dataset and predict the class label for the data instance #6, using the Laplacian correction technique if the zero-probability issue occurs. Show the details of learning and prediction.
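The Naïve Bayes computation in part (b) can be sketched with exact fractions. This example makes two assumptions that the question leaves open: the Laplacian correction (add 1 to each count and add the number of feature values, 2, to the denominator) is applied to every conditional probability, and the class priors are left unsmoothed. Other conventions exist and would give different numbers.

```python
from fractions import Fraction

# Training rows from Table 2: (Fur, Colour, class).
TRAIN = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]
QUERY = ("Fine", "Brown")  # instance #6
N_VALUES = 2               # each feature takes one of two values


def class_score(label):
    """Prior times the Laplace-smoothed likelihood of each query feature."""
    rows = [r for r in TRAIN if r[2] == label]
    score = Fraction(len(rows), len(TRAIN))  # unsmoothed prior (assumption)
    for i, value in enumerate(QUERY):
        count = sum(1 for r in rows if r[i] == value)
        score *= Fraction(count + 1, len(rows) + N_VALUES)
    return score


scores = {label: class_score(label) for label in ("Cat", "Dog")}
prediction = max(scores, key=scores.get)

for label, score in scores.items():
    print(label, score, float(score))
print("prediction:", prediction)
```

Without smoothing, the count of Fine-furred Dogs in Table 2 is zero, which would force the Dog score to zero regardless of the other terms; the correction avoids that.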
7. (a) (5 marks) The linear regression model can be regarded as a simple type of artificial neural network. From the perspective of artificial neural networks, what activation function corresponds to the linear regression model? Specify the mathematical form of the activation function. Is it a good idea to build multi-layer neural network models with this activation function? Justify your answer. (Write no more than 200 words in total)
(b) (10 marks) As the gradient descent method can be used to learn model parameters in neural network models, you can use it to estimate the parameters in a linear regression model. You are required to perform the initial steps of gradient descent on the following dataset (Table 4) to estimate the parameters w0 and w1 for the linear regression model y = w0 + w1x. The sum of squared errors is used for the loss function. Concretely, you need to formulate the loss function L(w0, w1) and derive its gradient ∇L(w0, w1) = (∂L/∂w0, ∂L/∂w1). Then, pick a pair of values randomly to initialize w0 and w1, and evaluate the gradient with the w0 and w1 values. Show the key steps of inference and calculation.
Table 4: 2-Dimensional Data
Index  X  Y
#1     1  1
#2     2  3
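For the sum-of-squared-errors loss on Table 4, the gradient can be evaluated numerically once the partial derivatives are written down. The sketch below assumes the loss is L(w0, w1) = Σ (y_i - w0 - w1·x_i)² with no 1/2 or 1/n factor, and uses (0, 0) as a purely hypothetical starting point, since the question asks for a random one.

```python
# Data from Table 4: (x, y) pairs.
DATA = [(1, 1), (2, 3)]


def loss(w0, w1):
    """Sum of squared errors: sum over i of (y_i - w0 - w1 * x_i) ** 2."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in DATA)


def gradient(w0, w1):
    """Partial derivatives of the loss above.

    dL/dw0 = -2 * sum(y_i - w0 - w1 * x_i)
    dL/dw1 = -2 * sum(x_i * (y_i - w0 - w1 * x_i))
    """
    residuals = [(x, y - w0 - w1 * x) for x, y in DATA]
    d_w0 = -2 * sum(r for _, r in residuals)
    d_w1 = -2 * sum(x * r for x, r in residuals)
    return d_w0, d_w1


# Hypothetical initial values; any other pair could be used instead.
w0_init, w1_init = 0, 0
print("loss:", loss(w0_init, w1_init))
print("gradient:", gradient(w0_init, w1_init))
```

If the loss were defined with a 1/2 factor, as some texts do, every gradient component would be halved.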
(c) (5 marks) Based on the gradient obtained in the above step, update the estimate for w0 and w1. Assume that the learning rate is 0.5. Show the key steps of inference and calculation.
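A single gradient descent update moves each parameter against its partial derivative, scaled by the learning rate. The sketch below repeats the sum-of-squared-errors loss and gradient for the Table 4 data so that it runs on its own; the starting point (0, 0) is a hypothetical choice, while the learning rate of 0.5 comes from the question.

```python
# Data from Table 4: (x, y) pairs.
DATA = [(1, 1), (2, 3)]
LEARNING_RATE = 0.5  # given in the question


def loss(w0, w1):
    """Sum of squared errors for y = w0 + w1 * x."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in DATA)


def gradient(w0, w1):
    """Gradient of the sum-of-squared-errors loss (no 1/2 or 1/n factor)."""
    d_w0 = -2 * sum(y - w0 - w1 * x for x, y in DATA)
    d_w1 = -2 * sum(x * (y - w0 - w1 * x) for x, y in DATA)
    return d_w0, d_w1


def step(w0, w1):
    """One update: w <- w - learning_rate * gradient."""
    d_w0, d_w1 = gradient(w0, w1)
    return w0 - LEARNING_RATE * d_w0, w1 - LEARNING_RATE * d_w1


# Hypothetical starting point.
w0, w1 = 0.0, 0.0
new_w0, new_w1 = step(w0, w1)
print("before:", (w0, w1), "loss", loss(w0, w1))
print("after: ", (new_w0, new_w1), "loss", loss(new_w0, new_w1))
```

From this particular starting point the loss goes up after the update rather than down, which suggests that a step of 0.5 overshoots the minimum for this loss; a different starting point would give different numbers.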
8. The following dataset (Table 5) describes COVID-19 testing records for 5 people. You want to build a decision tree classification model from the dataset to predict whether a person suffers from COVID-19 according to the two symptoms Cough and Fever. The features Cough and Fever each take one of two possible values: yes (having the symptom) and no (not having the symptom). The target attribute COVID-19 also takes one of two possible values: yes (infected) and no (normal). For notational convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the prediction target.
Table 5: COVID-19 Data
Index  Cough  Fever  COVID-19
#1     no     no     no
#2     yes    yes    yes
#3     no     yes    yes
#4     no     yes    no
#5     yes    no     no
(a) (10 marks) You are required to build a decision tree with the Gini impurity heuristic. Show the key steps of inference and calculation.
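The Gini impurity heuristic compares candidate splits by the weighted impurity of the child nodes they produce. The sketch below evaluates only the first (root) split on the Table 5 data using exact fractions; the helper names are hypothetical, and building the full tree would repeat the same comparison on each impure child node.

```python
from fractions import Fraction

# Rows from Table 5.
ROWS = [
    {"Cough": "no", "Fever": "no", "COVID": "no"},
    {"Cough": "yes", "Fever": "yes", "COVID": "yes"},
    {"Cough": "no", "Fever": "yes", "COVID": "yes"},
    {"Cough": "no", "Fever": "yes", "COVID": "no"},
    {"Cough": "yes", "Fever": "no", "COVID": "no"},
]


def gini(rows):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(rows)
    if n == 0:
        return Fraction(0)
    counts = {}
    for row in rows:
        counts[row["COVID"]] = counts.get(row["COVID"], 0) + 1
    return 1 - sum(Fraction(c, n) ** 2 for c in counts.values())


def weighted_gini(feature):
    """Impurity after splitting on a feature, weighted by child node size."""
    total = Fraction(0)
    for value in ("yes", "no"):
        subset = [row for row in ROWS if row[feature] == value]
        total += Fraction(len(subset), len(ROWS)) * gini(subset)
    return total


print("root impurity:", gini(ROWS))
for feature in ("Cough", "Fever"):
    print("split on", feature, "->", weighted_gini(feature))
best = min(("Cough", "Fever"), key=weighted_gini)
print("lowest weighted impurity:", best)
```

The feature with the lowest weighted impurity gives the largest impurity reduction relative to the root and would be chosen for the first split under this heuristic.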
(b) (5 marks) Which issue might the decision tree model built above suffer from, overfitting or underfitting? Propose two different strategies to mitigate the possible issue with justification. (Write no more than 200 words in total)

