Data Modeling Project

Your work in this class will include a project in which you use the ML programming language to model real-world data and phenomena. You will pick some domain of information, write a set of ML datatypes that capture that information, and write a set of ML functions that operate on that information.

As a quick example, in Sections 1.10 and 2.3 of the textbook, a series of datatypes is used to model meals at a restaurant:

datatype bread = White | MultiGrain | Rye | Kaiser;
datatype spread = Mayo | Mustard;
datatype vegetable = Cucumber | Lettuce | Tomato;
datatype deliMeat = Ham | Turkey | RoastBeef | ExtraVeg of vegetable;
datatype noodle = Spaghetti | Penne | Fusilli | Gemelli | Farfalle;
datatype sauce = Pesto | Marinara | Creamy;
datatype protein = MeatBalls | Sausage | Chicken | Tofu;
datatype entree = Sandwich of bread * spread * vegetable * deliMeat | Pasta of noodle * sauce * protein;
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;
datatype meal = Meal of entree * side * beverage;

Operations on datatypes like these could include calculating something from them (say, the calories or price of a meal), comparing two values of the same datatype (if a sandwich could have an indefinite number of layers, then we could check which of two sandwiches had more stuff), or modifying a value of a datatype (for example, substituting meat with something plant-based to make a meal vegetarian).
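As a rough sketch of what such operations might look like, the code below computes a price, compares two meals by price, and makes a meal vegetarian. The prices (in cents) and the function names mealPrice, cheaper, vegetarianEntree, and makeVegetarian are assumptions made up for illustration; they do not come from the textbook. The datatypes are repeated only so that the sketch loads on its own.

```sml
(* Datatypes repeated from the example above so this sketch loads on its own. *)
datatype bread = White | MultiGrain | Rye | Kaiser;
datatype spread = Mayo | Mustard;
datatype vegetable = Cucumber | Lettuce | Tomato;
datatype deliMeat = Ham | Turkey | RoastBeef | ExtraVeg of vegetable;
datatype noodle = Spaghetti | Penne | Fusilli | Gemelli | Farfalle;
datatype sauce = Pesto | Marinara | Creamy;
datatype protein = MeatBalls | Sausage | Chicken | Tofu;
datatype entree = Sandwich of bread * spread * vegetable * deliMeat
                | Pasta of noodle * sauce * protein;
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;
datatype meal = Meal of entree * side * beverage;

(* Calculation. The prices are hypothetical, in cents. *)
fun entreePrice (Sandwich (_, _, _, ExtraVeg _)) = 550
  | entreePrice (Sandwich _) = 650
  | entreePrice (Pasta (_, _, Tofu)) = 700
  | entreePrice (Pasta _) = 800;

fun sidePrice (Salad _) = 300
  | sidePrice _ = 200;

fun beveragePrice Water = 0
  | beveragePrice _ = 150;

fun mealPrice (Meal (e, s, b)) =
    entreePrice e + sidePrice s + beveragePrice b;

(* Comparison: is the first meal strictly cheaper than the second? *)
fun cheaper (m1, m2) = mealPrice m1 < mealPrice m2;

(* Modification: replace meat with a plant-based alternative.
   A sandwich gets a second helping of its own vegetable; pasta gets tofu. *)
fun vegetarianEntree (Sandwich (b, s, v, ExtraVeg x)) = Sandwich (b, s, v, ExtraVeg x)
  | vegetarianEntree (Sandwich (b, s, v, _)) = Sandwich (b, s, v, ExtraVeg v)
  | vegetarianEntree (Pasta (n, s, _)) = Pasta (n, s, Tofu);

fun makeVegetarian (Meal (e, s, b)) = Meal (vegetarianEntree e, s, b);

val lunch = Meal (Sandwich (Rye, Mustard, Tomato, Turkey), Fries, Coffee);
val veggieLunch = makeVegetarian lunch;
```

With these made-up prices, mealPrice lunch evaluates to 1000 and mealPrice veggieLunch to 900, so cheaper (veggieLunch, lunch) is true.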

The textbook contains many and diverse examples of datatypes modeling real-world data and phenomena, including

There are others in Chapters 8 and 10, not covered in class.

Though you are permitted to expand on one of these examples, you are encouraged to come up with your own domain of information, something that interests you and that you already have knowledge about. To get you started thinking, here are some more ideas:

Ideally you will work on this with one partner (team of two), but I will also allow solo projects and teams of three. Work on this project will be spread throughout the semester so that you can improve your model and make it more sophisticated as you learn more ML. This shouldn't be a reason to put it off, however. For most projects, most of what you need to know will be covered in the first few weeks of the class, so most teams will be able to do most of the work early in the semester.

Your final submission will comprise four parts:

  1. The ML datatypes modeling the data;
  2. Sample data, that is, values of the datatypes you designed;
  3. The ML functions operating on the data;
  4. A write-up (approximately 1-2 pages), consisting of

Schedule

Proposal - 5 points - deadline September 30
Team sends me a proposal by email no later than September 30. I will aim to make sure your proposal is at the right level of difficulty and suggest which ML features and discrete math concepts from throughout the semester will be useful to you in this project. It is likely that some proposals will need some modification or further development, so sending a proposal earlier, even weeks earlier, is not a bad idea. September 30 should be the absolute latest. I can provide some assistance in finding a project topic and "matchmaking" team partners, but I don't want people coming to me on September 29 and saying "I don't have a partner or a topic."
Prototype - 15 points - deadline November 18
Team sends me a prototype of the ML code by email no later than November 18. This should be everything except the write-up, complete as far as you can tell. I will respond with improvements or corrections I would like you to make. If you want more time to work on those improvements, then send me the prototype earlier.

Point distribution:

Final version and write-up - 80 points - deadline December 9
Team sends the final version of the ML code and the write-up (PDF is preferred) by email no later than December 9, which is the last day of class. If you don't want this project to interfere with the other things you'll need to finish up that week, then send me the final version earlier.

Point distribution:

  • 15 points - datatypes
  • 15 points - sample data
  • 15 points - functions
  • 15 points - style, including concision and efficiency
  • 20 points - write-up

Assessment

In assessing your datatypes, I will look at how you strike a balance between fidelity and richness on the one hand, and parsimony on the other hand. Every datatype should be used by a function or by another datatype.

In assessing your sample data, I will look to see that you exercise all datatypes, provide multiple examples that differ in their particulars, and provide enough data to exercise your functions. Ideally, you will identify a real-world dataset that is not too large: perhaps on the order of 20-50 items.
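To illustrate what "exercising all datatypes" could mean, here is a small sketch built on a fragment of the restaurant example. The snack datatype, the snacks list, and the helper usesSide are hypothetical names invented for this illustration, and the data is made up rather than drawn from a real dataset. A real submission would have more items.

```sml
(* A fragment of the restaurant datatypes, repeated so the sketch loads on its own. *)
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;

(* Hypothetical datatype for this sketch: a side paired with a drink. *)
datatype snack = Snack of side * beverage;

(* Made-up sample data. Every constructor of salad, side, and beverage
   appears at least once, and no two values are identical. *)
val snacks = [
  Snack (Fries, Pop),
  Snack (Chips, Lemonade),
  Snack (CarrotSticks, Water),
  Snack (GarlicBread, Coffee),
  Snack (Salad Caesar, IceTea),
  Snack (Salad Garden, Water)
];

(* A quick self-check: does some snack in the list use the given side? *)
fun usesSide (s, xs) = List.exists (fn Snack (s', _) => s = s') xs;

(* True only if every side constructor shows up in the sample data. *)
val allSidesUsed =
    List.all (fn s => usesSide (s, snacks))
             [Fries, Chips, CarrotSticks, GarlicBread, Salad Caesar, Salad Garden];
```

A check like allSidesUsed is optional, but it is one way to confirm that the sample data covers the model before you start writing functions against it.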

In assessing your functions, I will look for correctness, thoroughness, and variety. I will also look for functions that answer realistic questions about the data or perform transformations that someone might really want to do. Get functions working before you worry about efficiency.
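As one possible reading of "variety", the sketch below pairs a calculation, a selection, and a transformation over a list of values. It reuses a fragment of the restaurant datatypes; the snack datatype and the functions countBeverage, withSalad, and lighten are hypothetical names chosen for illustration. The functions are written in the most direct way, with no attention to efficiency.

```sml
(* A fragment of the restaurant datatypes, repeated so the sketch loads on its own. *)
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;

(* Hypothetical datatype for this sketch: a side paired with a drink. *)
datatype snack = Snack of side * beverage;

(* Calculation: how many snacks in the list include the given beverage? *)
fun countBeverage (_, []) = 0
  | countBeverage (b, Snack (_, b') :: rest) =
      (if b = b' then 1 else 0) + countBeverage (b, rest);

(* Selection: which snacks come with a salad of any kind? *)
fun withSalad xs =
    List.filter (fn Snack (Salad _, _) => true | _ => false) xs;

(* Transformation: swap fries or chips for a garden salad; leave others alone. *)
fun lighten (Snack (Fries, b)) = Snack (Salad Garden, b)
  | lighten (Snack (Chips, b)) = Snack (Salad Garden, b)
  | lighten s = s;

(* Made-up data to try the functions on. *)
val snacks = [Snack (Fries, Pop), Snack (Salad Caesar, Water), Snack (Chips, Water)];

val waterCount = countBeverage (Water, snacks);
val saladOnly = withSalad snacks;
val lighter = map lighten snacks;
```

Here waterCount is 2, saladOnly contains only the Caesar salad snack, and every snack in lighter comes with a salad.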

In assessing style, I will look for


Janet Davis (davisj@whitman.edu). Adapted from Thomas VanDrunen's project for CS/MATH 243, Wheaton College.

Created August 28, 2016
Last revised November 25, 2016, 05:15:02 PM PST
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.