Group Project Overview

During the semester, students are expected to work on a capstone project within a group. The capstone project is meant to showcase the ability to program with data in a statistical computing environment. That is to say, we would like each group project to show evidence of the following stages:

[Figure: R4DS Chapter 1, Data Science Project Overview]

Timeline

There are five assignments for the project. Their due dates are:

  • Group Choice - Tuesday, March 4th, 6:00 PM
  • Project Proposal - Friday, March 15th, 11:59 PM
  • Project Demo Video - Thursday, May 9, 11:00 AM
  • Final Report - Thursday, May 9, 11:00 AM (in place of the final exam)
  • Peer Evaluation - Thursday, May 9, 11:00 AM

Topics

Students are free to pursue multiple outlets in demonstrating their newfound statistical programming skills. Sample topics include:

  • creating functions to import data from a web API, database, or data file and model it;
  • designing and implementing new visualization techniques;
  • converting pre-existing R scripts into an R package;
  • re-implementing existing R code to improve speed or clarity of use; and
  • translating features in a different language to R.

All groups must create a Shiny application that lowers the barrier of entry for using their implementation.

Data Selection

To motivate your topic, you may wish to use or construct a dataset.

You may choose any data set as long as it does not violate these two conditions:

  1. data has a minimum of 500 observations and 10 variables; and
  2. data is not:
    • from the UC Irvine Machine Learning Repository, Kaggle, SNAP, or FiveThirtyEight;
    • used as a data set in the course;
    • found in textbooks; or
    • “blogged” about.
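
The size requirement in the first condition can be checked programmatically before committing to a data set. The snippet below is a minimal sketch in base R; the helper name `meets_size_requirement()` is hypothetical and not part of any course material, and `mtcars` is used only as a convenient built-in example.

```r
# Hypothetical helper (not part of the course materials): checks whether a
# data frame satisfies the minimum size stated in condition 1.
meets_size_requirement <- function(data, min_obs = 500, min_vars = 10) {
  stopifnot(is.data.frame(data))
  nrow(data) >= min_obs && ncol(data) >= min_vars
}

# mtcars has 32 observations and 11 variables, so it has too few observations.
small_ok <- meets_size_requirement(mtcars)
print(small_ok)

# A simulated data frame with 500 rows and 10 columns meets the minimum.
simulated <- as.data.frame(matrix(rnorm(500 * 10), nrow = 500, ncol = 10))
large_ok <- meets_size_requirement(simulated)
print(large_ok)
```

A check like this only covers the first condition; whether the data comes from an excluded source still has to be judged by hand.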

The dataset may be relevant to research outside of this course, another field, or some other interest of the group. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from either a research project or your current job, be sure to gain permission from the data controller.

Unlike in other courses, the second condition exists to promote working on “fresh” data. We want you to explore new topics that do not already have a pre-defined solution.

Final Products

The final products of the group project are:

  1. a written report that details the construction;
  2. code that can be run on a member of the course staff’s computer; and
  3. a video that explains and demonstrates the project.

Group Project Checklist

The most successful projects tend to exhibit the following characteristics:

  • they describe the problem or need succinctly;
  • they explain the benefits of the project;
  • they are completed before the course ends;
  • they can approach the problem or need in multiple ways; and
  • they address a problem that is fun or interesting to work on.

Try to make sure these characteristics can be found in your project.

Task Specifics

Group Choice

For this project, you must work in groups of at least three students and at most four students. A portion of your grade will come from your ability to work in a group setting.

By Tuesday, March 4th, 6:00 PM, please either:

You may not be a group of one. Groups of one exist only if a student has been fired from their original group and has received a letter grade reduction for the project. Moreover, if a group member is fired, they must complete the group project themselves. To trigger this measure, the student’s group must:

  • Build a case against the team member that documents their inability to contribute to the project.
  • Schedule a meeting with the instructor to try to resolve this.

If the issues have not been addressed after one week, the team member will be discharged from the group and will have to complete the group project by themselves.

Project Proposal

The project proposal is due on Friday, March 15th, 11:59 PM.

It should be submitted via the group’s GitHub repository in: stat385-sp2019/project-repo-team<id>-<team-name>.

After review of the proposal, it will be evaluated in one of two ways:

  • Approved: Your group may proceed with your plans for the data and project.
  • Pending: We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.

Students who receive high approval are able to reuse about 70-80% of their proposal in the Final Report. Therefore, there is a benefit to spending time on the project proposal stage. Consider attending office hours or talking in person with the instructor.

For the group project proposal, students must have the following sections:

  • Introduction
  • Related Work
  • Method
  • Feasibility
  • Conclusion
  • References
  • Appendix

Within these sections, please write in paragraph form. That is, please write at least 3-4 sentences per block of text and avoid answering using lists. Moreover, any inline figures, tables, or supporting material should be placed within an Appendix and referenced from within the proposal. Examples for the latter can be found within the templated document.

All reports should be written using RMarkdown, not Word, via the default template provided. Color is permitted for section headers. However, color should be used sparingly within the body of the document. Label each section of the proposal clearly, e.g. “Introduction”, “Related Work”, and so on.

Expectations regarding the contents of each section are outlined next. Make sure to answer each question fully within the given section.

Note: The proposal style is loosely based on the IMRAD style. The IMRAD structure is widely used for scholarly articles. Check out Carnegie Mellon University’s IMRD Cheat Sheet for more details.

Introduction

The introduction section provides a preview of the project’s focus. Within this section, provide an overview on the selected topic for the consumption of a manager. In essence, the manager must be able to understand what the project is and why they should support the endeavor. You are allowed to make the assumption that the manager is knowledgeable in base R concepts. Make sure to answer the following questions:

  • What problem or topic are you addressing?
  • Why is it interesting or important? In particular, what evidence supports this conclusion?
    • Cite papers or reputable sources that back up this claim. (You may want to find material using Google Scholar.)
  • Where did the problem or topic come from?
  • What is your idea for addressing the problem or topic?
  • How does your idea match with the course’s focus on statistical programming?

Related Work

The Related Work section must provide an overview of pre-existing solutions. In essence, please credit those who enabled you to consider embarking on this project or, as Isaac Newton more aptly put it in a letter to Robert Hooke on 15 February 1676:

If I have seen further it is by standing on the shoulders of Giants.

Address the following questions:

  • What other ideas have been attempted?
  • Why is your team’s idea original compared to prior work?

Method

The Method section should contain the overall details of the project including any preliminary work. In particular, the implementation details behind the approach should be explained at length here. The more details you can provide, the better feedback your group can receive. As a result, the section serves as a roadmap of what features are going to be developed and any external dependencies that are required. To satisfy this section, provide detailed responses for the following:

  • What packages will you use in your implementation?
  • What code will the group need to write for the project?
  • Provide low-fidelity prototypes (e.g. sketches on paper) in the Appendix of:
    • Visualisations
      • What kinds of graphs will you use?
      • Label axes, provide a title, and mention any interactivity.
    • Interface
      • All projects need a Shiny Application.
      • Sketch how a user will work with the shiny application.
  • What have you done or learned so far for the project?

We primarily want to ensure that your project meets the criteria of the data science pipeline. In essence, we want to see evidence that your project includes:

  • Reading data into R or accessing data via an API.
  • Data transformations (e.g. Tidying (tidyr), Summarizing (dplyr), et cetera.)
  • Data visualization (e.g. ggplot2, plotly, gganimate)
  • R functions either in external packages or included in a new R package
  • Interactive Interface (e.g. shiny)
  • Reproducibility
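
As a rough illustration of how these stages fit together, the sketch below walks through a miniature pipeline using only base R so that it runs without extra packages; a real project would more likely use tidyr, dplyr, and ggplot2 as listed above. The inline CSV text, the column names, and the function name `summarize_visits()` are made up for illustration.

```r
# Import: read data (inline text stands in for a file or an API response).
csv_text <- "clinic,month,visits
A,Jan,120
A,Feb,135
B,Jan,80
B,Feb,95"
visits <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Transform: total visits per clinic, wrapped in a reusable function.
summarize_visits <- function(data) {
  aggregate(visits ~ clinic, data = data, FUN = sum)
}
totals <- summarize_visits(visits)
print(totals)

# Visualize: a labeled bar chart of the summary.
barplot(totals$visits, names.arg = totals$clinic,
        main = "Total visits by clinic", xlab = "Clinic", ylab = "Visits")

# Reproducibility: record the R version and attached packages.
print(sessionInfo())
```

The interactive interface stage is omitted here; in a Shiny application, the transform and visualize steps would typically be called from the server function.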

Feasibility

The Feasibility section is meant to act as a way to reflect upon the proposal. Generally speaking, there will be three weeks of heavy development time afforded to the group. Building a detailed ecosystem or heavily scripting in a different language will likely not lead your team to success. Hence, please provide a project management overview of who on your team will be doing what and when by answering:

  • Is this project able to be completed before the end of the semester?
  • What steps must occur to complete the project before the end of the semester?
  • What is the work plan to accomplish the necessary tasks before the end of the semester?
    • Specify who is doing what and when.
    • Consider making a Gantt chart to highlight each stage of the project.

Conclusion

The Conclusion section provides a summary of the entire proposal. This acts as the final paragraph that can be used to justify the work being proposed. In general, this means you should make one last push to identify the problem, potential solution, and its novelty.

References

The References section acts as a bibliography for all papers referenced in the Introduction, Related Work, and Method sections. The references should be formatted in Chicago author-date format, which is the default for RMarkdown.

  • Provide a list (5+) of papers or items you have read to write this proposal.
  • Please list all R packages or software referenced.

To acquire software citation information, R has a built-in command that creates a BibTeX entry and an in-line text citation. To generate the citation of an installed R package, type:

# In R
citation(package="pkg_name")
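
For instance, the citation object can be converted to BibTeX and saved to a file that the RMarkdown document reads through its `bibliography` field. This is a small sketch using the `stats` package, which ships with R; the file name `packages.bib` is an arbitrary choice.

```r
# Retrieve the citation for a package that ships with R.
pkg_citation <- citation(package = "stats")

# Convert the citation to BibTeX lines.
bib_entry <- toBibtex(pkg_citation)
print(bib_entry)

# Write the entry to a .bib file (the file name is arbitrary).
writeLines(bib_entry, "packages.bib")
```

Entries for several packages can be collected by appending the output of each `toBibtex()` call to the same file.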

Appendix

The Appendix section contains figures, sample data, and other miscellaneous entries. Generally, this section seeks to contain all of your planning information.

  • Provide the sketches of visualisations and the shiny application.
  • Provide an overview of the desired functions.
    • What is a function’s input? Output? How are functions related to each other?
    • For example, read_data("hospital_data.csv") must be called before tidy_hospital(), et cetera.
  • Provide a sample of the data set you intend to use (~10 observations).
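
One way to present such an overview is to write each planned function as a documented stub that states its input, its output, and what must run before it. The sketch below reuses the example names `read_data()` and `tidy_hospital()` from above; the column names, the roxygen-style comments, and the function bodies are illustrative assumptions rather than a required design.

```r
#' Read the raw hospital data.
#' @param path Path to a CSV file.
#' @return A data frame with one row per record.
read_data <- function(path) {
  stopifnot(file.exists(path))
  read.csv(path, stringsAsFactors = FALSE)
}

#' Tidy the raw hospital data. Must be called on the output of read_data().
#' @param raw Data frame returned by read_data().
#' @return A data frame with lower-case column names and complete rows only.
tidy_hospital <- function(raw) {
  stopifnot(is.data.frame(raw))
  names(raw) <- tolower(names(raw))
  raw[complete.cases(raw), , drop = FALSE]
}

# Usage: write a tiny made-up file, then run the functions in order.
sample_path <- file.path(tempdir(), "hospital_data.csv")
writeLines(c("Ward,Beds", "ICU,12", "ER,NA", "Pediatrics,20"), sample_path)
hospital <- tidy_hospital(read_data(sample_path))
print(hospital)
```

Stating the expected input class at the top of each function, as `stopifnot()` does here, makes the dependency between functions explicit when they are called out of order.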

Project Demo Video

The project demo video is due by Thursday, May 9, 11:00 AM.

The goal behind the demo video is to provide an overview of the project and show how the solution presently works. The video should be between 3 and 7 minutes long.

For the overview, we would like to see 3 to 5 slides that briefly describe the problem/data, the method, and results. After this, please demo the solution in its entirety. This should be self-contained in one video. Be wary of the time limit though!

The videos should be uploaded to either Google Drive or Box Sync and a download link sent in a single email to the STAT 385 e-mail box.

Do not upload the video to the Git repository, as files greater than 100 MB will automatically be rejected.

Suggestions

Below are suggestions on FREE software that can be used to record your screen and create a short demo video. If you need additional assistance, please visit the Media Commons @ UGL.

You do not have to use these software suggestions.

Screen Recording Software

To record your screen, you can use the following free screen recorders:

Editing Tools

To piece together different video clips, add sounds, or title cards, you can use one of the following movie editors:

Final Report

The final report for your project is due by Thursday, May 9, 11:00 AM.

This report is largely an update of the initial project proposal. The goal here is to clean up the methods section so that it resembles the actual project methodology and to include results and discussion sections that clearly convey the takeaways from working with the data.

Please see the rubric for more details.

Peer Evaluation

Peer evaluation of all group members is due by Thursday, May 9, 11:00 AM.

The peer evaluation will involve individually rating each member of your group, suggesting a grade, and indicating whether any issues arose.

Please fill out the peer evaluation form: To Appear

For difficulty logging into the form, please see the FAQ entry: Why is Google Forms telling me I’m outside of the organization when I try to fill out a survey?

Scoring Rubric

Proposal

  • Percent of Final Grade: 2.5%

For the group proposal, the structure is meant to provide an opportunity for intervention or clarification early in the project. The proposal largely serves as the basis for the final report, so time spent on the proposal will pay off significantly when the final report is due.

Having said this, you will be graded on whether each portion of the proposal is answered, the clarity of the content, the appropriateness of the data, formatting, et cetera.

  • Introduction
    • [2] Topic/Problem is explained.
    • [2] Motivation for solving the problem is described
    • [2] If needed, the data set selected or generated is described and the source is given.
    • [2] Project’s focus is on statistical programming.
  • Related Work
    • [2] Familiarity with prior work.
    • [2] Proposed work has not been attempted before.
  • Methods
    • [6] At least one entry exists for each of the stages of the data science workflow.
    • [2] Preliminary work done to undertake the project
    • [2] Sketches provide ample evidence of forethought.
    • [2] Interface provides multiple input controls and output areas.
  • Feasibility
    • [2] Project can be completed within the time frame. (e.g. Not a reimplementation of existing methods.)
    • [2] Gantt chart or breakdown of work displayed
    • [2] Project lists specific contributors for steps in the method.
  • Conclusion
    • [2] Group has provided an executive summary of the proposal.
  • References
    • [2] At least 5 content references are included
    • [2] Citations are appropriately listed
    • [1] Packages being used are cited (citation(package = "pkg_name"))
  • General
    • [3] Grammar and Spelling
      • Free from spelling mistakes
      • Content follows a logical ordering
      • Audience considerations are accounted for (e.g. explain for the layperson / manager.)
    • [3] The appropriate formatting is followed
      • Report includes project title.
      • Team Name, Team Members, and NetIDs are included in the report.
      • Report is appropriately named and submitted.
      • The report does not show code outside of the appendix.

Points      Status
> 38        Approved+
(35, 38]    Approved-
< 34        Pending

Video

  • Percent of Final Grade: 5%

For the demo video, the structure is meant to give you a platform to showcase your work to a larger audience. Audiences typically prefer viewing video content of new software being developed to see how applicable it can be to their workflow.

As an example, consider the following “code demos” GIFs:

More inspiration can be found in:

With this being said, we’ll be focusing more on how well you present to the general public what you managed to develop this semester. Bear in mind, this is an ideal way to showcase your skills to future employers.

  • Slides
    • [2] Provide an introduction to the problem/data
    • [2] Describe what was implemented and the results that can be obtained.
    • [2] Emphasize limitations or problems encountered
    • [2] Potential future improvements
  • Code Demo
    • [15] Show the features of your application
      • Walk through the steps to using the app.
      • Show the different outputs and inputs possible.
      • The app must be functional with no visible warnings to the user unless intended.
  • General
    • [2] Grammar and Spelling
      • Free from spelling mistakes
      • Content follows a logical ordering
      • Audience considerations are accounted for (e.g. explain for the layperson / manager.)
    • [5] The appropriate formatting is followed
      • Include the Project Title, Team Name, Team Members, and NetIDs in the video.
      • The slides are discussed either at the start of the video or throughout.
      • Video should be between 3 - 7 minutes in length.

Final Report

  • Percent of Final Grade: 11.5%

The CAs will use the following point breakdown on the final report:

  • Introduction
    • [5] Problem statement.
      • What is the issue that has arisen?
    • [5] Relevance to audience.
      • Why should we be interested in the project?
    • [5] Description of data
      • What is the data and how is it related to the goal?
      • Please place the code book (e.g. description of each variable) in the appendix.
    • [5] Course connection
      • How does your idea match with the course’s focus on statistical programming?
  • Related Work
    • [5] Previous approaches
      • Who has done what so far on this problem?
    • [5] Novelty of Approach
      • How is your view original in comparison to this body of work?
  • Methods
    • [5] Appropriate methods from class are used.
    • [10] Methods are used correctly.
  • Results
    • [2] Results are clearly organized through visualizations or as a table.
    • [3]
  • Discussion
    • [5] Correct conclusions are drawn from the results.
    • [5] How the results relate to the goal is discussed.
    • [10] Results are connected to the motivation of the project.
  • Conclusion
    • [5] Appropriately summarize the project and end outcomes.
  • References
    • [5] Citations are appropriately listed
  • Code
    • [25] R is used appropriately.
      • Does your code perform the desired tasks?
      • Is your code readable?
      • Is your style consistent?
      • Does your code work on a different computer?
  • General
    • [5] Grammar and Spelling
      • Free from spelling mistakes
      • Content follows a logical ordering
      • Audience considerations are accounted for (e.g. explain for the layperson / manager.)
    • [5] The appropriate formatting is followed
      • Report includes project title.
      • Team Name, Team Members, and NetIDs are included in the report.
      • Report is appropriately named and submitted.
      • The report does not show code outside of the appendix.

Peer Evaluation

  • Percent of Final Grade: 1%

When writing the peer evaluations, you will be asked to grade how well you did and, in turn, how well each other member of the group functioned. As a result, you should put thought into the review of each team member. Evaluations that are simplistic in nature, e.g. scoring all members as 100%, will likely result in a reduced peer evaluation grade.

The instructor reserves the right to further reduce a student’s overall project grade if their team members report that they did not attempt to make a significant contribution to the project.

FAQ

This section will likely be updated as we progress through the remainder of the semester.

What do you mean we cannot embed code in the report sections?

The goal here is to write the final report as a form of documentation for future students or employers to look over. As a result, there is a need to emphasize the project’s end outcomes to a general audience that is not as familiar with your project.

How long should the written reports be?

Reports should emphasize brevity and conciseness. As a result, there is no “minimum” page requirement; however, there is a “maximum” threshold. That is, avoid cluttering the report with extended text when simpler sentences would suffice.

Keep in mind that the group project is intentionally open-ended to see what your group will do without being given explicit steps, so have fun!

