From {Solution} Synthesis to {Student Attempt} Synthesis for Block-Based Visual Programming Tasks*

Adish Singla, MPI-SWS, adishs@mpi-sws.org
Nikitas Theodoropoulos, MPI-SWS, ntheodor@mpi-sws.org

ABSTRACT
Block-based visual programming environments are increasingly used to introduce computing concepts to beginners. Given that programming tasks are open-ended and conceptual, novice students often struggle when learning in these environments. AI-driven programming tutors hold great promise in automatically assisting struggling students, and need several components to realize this potential. We investigate the crucial component of student modeling, in particular, the ability to automatically infer students' misconceptions for predicting (synthesizing) their behavior. We introduce a novel benchmark, StudentSyn, centered around the following challenge: For a given student, synthesize the student's attempt on a new target task after observing the student's attempt on a fixed reference task. This challenge is akin to that of program synthesis; however, instead of synthesizing a {solution} (i.e., a program an expert would write), the goal here is to synthesize a {student attempt} (i.e., a program that a given student would write). We first show that human experts (TutorSS) can achieve high performance on the benchmark, whereas simple baselines perform poorly. Then, we develop two neuro/symbolic techniques (NeurSS and SymSS) in a quest to close this gap with TutorSS.

Keywords
block-based visual programming, programming education, program synthesis, neuro-symbolic AI, student modeling

1. INTRODUCTION
The emergence of block-based visual programming platforms has made coding more accessible and appealing to beginners. Block-based programming uses "code blocks" that reduce the burden of syntax and introduce concepts in an interactive way. Led by initiatives like Hour of Code by Code.org [10, 8] and the popularity of languages like Scratch [41], block-based programming has become integral to introductory CS education. Considering the Hour of Code initiative alone, over one billion hours of programming activity have been spent on learning to solve tasks in such environments [8].

Programming tasks on these platforms are conceptual and open-ended, and require multi-step deductive reasoning to solve. Given these aspects, novices often struggle when learning to solve these tasks. The difficulties faced by novice students become evident when looking at the trajectory of attempts by students who are struggling to solve a given task. For instance, in a dataset released by Code.org [10, 8, 35], even for simple tasks where solutions require only 5 code blocks (see Figure 1a), students submitted over 50,000 unique attempts, some exceeding a size of 50 code blocks.

AI-driven programming tutors have the potential to support these struggling students by providing personalized assistance, e.g., feedback as hints or curriculum design [37]. To effectively assist struggling students, AI-driven systems need several components, a crucial one being student modeling. In particular, we need models that can automatically infer a student's knowledge from limited interactions and then predict the student's behavior on new tasks. However, student modeling in block-based visual programming environments can be quite challenging for the following reasons: (i) programming tasks are conceptual, with no well-defined skill-set or problem-solving strategy for mastery [23]; (ii) there can be huge variability in students' attempts for a task [52]; (iii) the objective of predicting a given student's behavior on new tasks is not limited to coarse-grained success/failure indicators (e.g., [50])—ideally, we should be able to do fine-grained synthesis of attempts for the student.
Beyond the above-mentioned challenges, there are two critical issues arising from limited resources and data scarcity for a given domain. First, while the space of tasks that could be designed for personalized curriculum is intractably large [1], the publicly available datasets of real-world students' attempts are limited; e.g., the Hour of Code: Maze Challenge domain has datasets for only two tasks [35]. Second, when a deployed system is interacting with a new student, there is limited prior information [15], and the system would have to infer the student's knowledge by observing behavior on a few reference tasks, e.g., through a quiz [21]. These two issues limit the applicability of state-of-the-art techniques that rely on large-scale datasets across tasks or personalized data per student (e.g., [50, 28, 29, 36])—we need next-generation student modeling techniques that can operate under data scarcity and limited observability. To this end, this paper focuses on the following question: For a given student, can we synthesize the student's attempt on a new target task after observing the student's attempt on a fixed reference task?

*Correspondence to: Adish Singla; Authors listed alphabetically.

A. Singla and N. Theodoropoulos. From solution synthesis to student attempt synthesis for block-based visual programming tasks. In A. Mitrovic and N. Bosch, editors, Proceedings of the 15th International Conference on Educational Data Mining, pages 454–461, Durham, United Kingdom, July 2022. International Educational Data Mining Society. © 2022 Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution NonCommercial NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. https://doi.org/10.5281/zenodo.6853165

[Figure 1: the code listings are omitted here; the panels are (a) Reference task T18 with solution code and datasets, (b) stu's attempt for T18, (c) Target task T18x, (d) stu's attempt for T18x.]
Figure 1: Illustration of our problem setup and objective for the task Maze#18 in the Hour of Code: Maze [9] by Code.org [8]. As explained in Section 2.2, we consider three distinct phases in our problem setup to provide a conceptual separation in terms of information and computation available to a system. (a) In the first phase, we are given a reference task T18 along with its solution code C*_T18 and data resources (e.g., a real-world dataset of different students' attempts); reference tasks are fixed and the system can use any computation a priori. (b) In the second phase, the system interacts with a student, namely stu, who attempts the reference task T18 and submits a code, denoted as C^stu_T18. (c, d) In the third phase, the system seeks to synthesize the student stu's behavior on a target task T18x, i.e., a program that stu would write if the system would assign T18x to the student. Importantly, the target task T18x is not available a priori and this synthesis process would be done in real-time.
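To make the three phases in this caption concrete, the sketch below (ours, in Python; the class and method names are illustrative and not taken from the paper) frames a student-attempt synthesizer as a component that is prepared on a fixed reference task ahead of time, then observes a single submitted attempt, and is finally queried on new target tasks in real time.

    from typing import Dict, List


    class AttemptSynthesizer:
        """Illustrative interface mirroring the three phases of the problem setup."""

        def __init__(self, reference_task: str, attempts_dataset: List[str]):
            # Phase 1: the reference task (e.g., "Maze#18") and its dataset of
            # real-world attempts are fixed, so any expensive precomputation
            # (e.g., code embeddings) can happen here, ahead of deployment.
            self.reference_task = reference_task
            self.attempts_dataset = attempts_dataset
            self.observations: Dict[str, str] = {}

        def observe(self, student_id: str, submitted_code: str) -> None:
            # Phase 2: record the observation tuple (Tref, C^stu_Tref) for this student.
            self.observations[student_id] = submitted_code

        def synthesize(self, student_id: str, target_task: str) -> str:
            # Phase 3: return the code this student would plausibly write on the
            # target task; a concrete technique (e.g., NeurSS or SymSS) goes here.
            raise NotImplementedError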
1.1 Our Approach and Contributions
Figure 1 illustrates this synthesis question for a scenario in the Hour of Code: Maze Challenge [9] by Code.org [8]. This question is akin to that of program synthesis [20]; however, instead of synthesizing a {solution} (i.e., a program an expert would write), the goal here is to synthesize a {student attempt} (i.e., a program that a given student would write). This goal of synthesizing student attempts, and not just solutions, requires going beyond state-of-the-art program synthesis techniques [3, 4, 25]; crucially, we also need to define appropriate metrics to quantitatively measure the performance of different techniques. Our main contributions are:

(1) We formalize the problem of synthesizing a student's attempt on target tasks after observing the student's behavior on a fixed reference task. We introduce a novel benchmark, StudentSyn, centered around the above synthesis question, along with generative/discriminative performance measures for evaluation.
(2) We showcase that human experts (TutorSS) can achieve high performance on StudentSyn, whereas simple baselines perform poorly.
(3) We develop two techniques inspired by neural (NeurSS) and symbolic (SymSS) methods, in a quest to close the gap with human experts (TutorSS).

We provide additional details and results in the longer version of the paper [47]. We will also publicly release the benchmark and implementations to facilitate future research.

1.2 Related Work
Student modeling. For close-ended domains like vocabulary learning ([42, 36, 22]) and Algebra problems ([12, 40, 43]), the skills or knowledge components for mastery are typically well-defined, and we can use Knowledge Tracing techniques to model a student's knowledge state over time [11, 33]. These modeling techniques, in turn, allow us to provide feedback, predict solution strategies, or infer/quiz a student's knowledge state [40, 21, 43]. Open-ended domains pose unique challenges to directly applying these techniques (see [23]); however, there has been some progress in this direction. In recent works [28, 29], models have been proposed to predict human behavior in chess for specific skill levels and to recognize the behavior of individual players. Along these lines, [7] introduced methods to perform early prediction of struggling students in open-ended interactive simulations. There has also been work on student modeling for block-based programming, e.g., clustering-based methods for misconception discovery [18, 44] and deep learning methods to represent knowledge and predict performance [50].

AI-driven systems for programming education. There has been a surge of interest in developing AI-driven systems for programming education, and in particular, for block-based programming domains [37, 38, 51]. Existing works have studied various aspects of intelligent feedback, for instance, providing next-step hints when a student is stuck [35, 53, 31, 15], giving data-driven feedback about a student's misconceptions [45, 34, 39, 52], or generating/recommending new tasks [2, 1, 19]. Depending on the availability of datasets and resources, different techniques are employed: using historical datasets to learn code embeddings [34, 31], using reinforcement learning in a zero-shot setting [15, 46], bootstrapping from a small set of expert annotations [34], or using expert grammars to generate synthetic training data [52].

Neuro-symbolic program synthesis. Our approach is related to program synthesis, i.e., automatically constructing programs that satisfy a given specification [20]. The use of deep learning models for program synthesis has resulted in significant progress in a variety of domains including string transformations [16, 14, 32], block-based visual programming [3, 4, 13, 48], and competitive programming [25]. Program synthesis has also been used to learn compositional symbolic rules and mimic abstract human learning [30, 17].
2. PROBLEM SETUP
Next, we introduce definitions and formalize our objective.

2.1 Preliminaries
The space of tasks. We define the space of tasks as T; in this paper, T is inspired by the popular Hour of Code: Maze Challenge [9] from Code.org [8]; see Figure 1a. We define a task T ∈ T as a tuple (T_vis, T_store, T_size), where T_vis denotes a visual puzzle, T_store the available block types, and T_size the maximum number of blocks allowed in the solution code. The task T in Figure 1a corresponds to Maze#18 in the Hour of Code: Maze Challenge [9], and has been studied in a number of prior works [35, 15, 1].

The space of codes. We define the space of all possible codes as C and represent them using a Domain Specific Language (DSL) [20]. In particular, for codes relevant to tasks considered in this paper, we use a DSL from [1]. A code C ∈ C has the following attributes: C_blocks is the set of types of code blocks used in C, C_size is the number of code blocks used, and C_depth is the depth of the Abstract Syntax Tree of C.

Solution code and student attempt. For a given task T, a solution code C*_T ∈ C should solve the visual puzzle; additionally, it can only use the allowed types of code blocks (i.e., C_blocks ⊆ T_store) and should be within the specified size threshold (i.e., C_size ≤ T_size). We note that a task T ∈ T may have multiple solution codes; in this paper, we typically refer to a single solution code that is provided as input. A student attempt for a task T refers to a code that is being written by a student (including incorrect or partial codes). A student attempt could be any code C ∈ C as long as it uses the set of available types of code blocks (i.e., C_blocks ⊆ T_store).
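To make these definitions concrete, the following sketch (ours, written in Python; the class and function names are illustrative and not part of the paper or its DSL from [1]) encodes a task by the tuple (T_vis, T_store, T_size), a code by its attributes (C_blocks, C_size, C_depth), and the two constraints that separate an admissible student attempt from a candidate solution code.

    from dataclasses import dataclass
    from typing import FrozenSet


    @dataclass(frozen=True)
    class Task:
        """A task T = (T_vis, T_store, T_size); the puzzle itself is kept abstract here."""
        vis: str                  # identifier of the visual puzzle, e.g., "Maze#18"
        store: FrozenSet[str]     # T_store: block types the student may use
        size: int                 # T_size: maximum number of blocks in a solution


    @dataclass(frozen=True)
    class Code:
        """A code C summarized by the attributes used in the paper."""
        blocks: FrozenSet[str]    # C_blocks: block types appearing in C
        size: int                 # C_size: number of code blocks used
        depth: int                # C_depth: depth of the code's Abstract Syntax Tree


    def is_valid_student_attempt(code: Code, task: Task) -> bool:
        """Any code is an admissible attempt as long as it only uses allowed block types."""
        return code.blocks <= task.store


    def respects_solution_constraints(code: Code, task: Task) -> bool:
        """A solution code must additionally stay within the size threshold; checking
        that it actually solves the visual puzzle would require executing it in the
        maze environment, which is omitted from this sketch."""
        return code.blocks <= task.store and code.size <= task.size


    # Example with illustrative block set and size values (not taken from the paper).
    t18 = Task(vis="Maze#18",
               store=frozenset({"move", "turnLeft", "turnRight", "RepeatUntil", "IfElse"}),
               size=5)
    attempt = Code(blocks=frozenset({"move", "turnLeft"}), size=12, depth=2)
    assert is_valid_student_attempt(attempt, t18)
    assert not respects_solution_constraints(attempt, t18)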
2.2 Objective
Distinct phases. To formalize our objective, we introduce three distinct phases in our problem setup that provide a conceptual separation in terms of the information and computation available to a system. More concretely, we have:

(1) Reference task Tref: We are given a reference task Tref for which we have real-world datasets of different students' attempts as well as access to other data resources. Reference tasks are fixed and the system can use any computation a priori (e.g., compute code embeddings).
(2) Student stu attempts Tref: The system interacts with a student, namely stu, who attempts the reference task Tref and submits a code, denoted as C^stu_Tref. At the end of this phase, the system has observed stu's behavior on Tref and we denote this observation by the tuple (Tref, C^stu_Tref).¹
(3) Target task Ttar: The system seeks to synthesize the student stu's behavior on a target task Ttar. Importantly, the target task Ttar is not available a priori and this synthesis process would be done in real-time, possibly with constrained computational resources. Furthermore, the system may have to synthesize stu's behavior on a large number of different target tasks from the space T (e.g., to personalize the next task in a curriculum).²

Granularity level of our objective. There are several different granularity levels at which we can predict the student stu's behavior for Ttar, including: (a) a coarse-level binary prediction of whether stu will successfully solve Ttar; (b) a medium-level prediction about stu's behavior w.r.t. a predefined feature set (e.g., labelled misconceptions); (c) a fine-level prediction in terms of synthesizing C^stu_Ttar, i.e., a program that stu would write if the system would assign Ttar to the student. In this work, we focus on this fine-level, arguably also the most challenging, synthesis objective.

Performance evaluation. So far, we have concretized the synthesis objective; however, there is still the question of how to quantitatively measure the performance of a technique set out to achieve this objective. The key challenge stems from the open-ended and conceptual nature of programming tasks. Even for seemingly simple tasks such as the one in Figure 1a, students' attempts can be highly diverse, thereby making it difficult to detect a student's misconceptions from observed behaviors; moreover, the space of misconceptions itself is not clearly understood. To this end, we begin by designing a benchmark to quantitatively measure the performance of different techniques w.r.t. our objective.

3. BENCHMARK
In this section, we introduce our benchmark, StudentSyn.

3.1 StudentSyn: Data Curation
We begin by curating a synthetic dataset for the benchmark, designed to capture different scenarios of the three distinct phases mentioned in Section 2.2. In particular, each scenario corresponds to a 4-tuple (Tref, C^stu_Tref, Ttar, C^stu_Ttar), where C^stu_Tref (observed by the system) and C^stu_Ttar (to be synthesized by the system) correspond to a student stu's attempts.

Reference and target tasks. We select two reference tasks for this benchmark, namely T4 and T18—they correspond to Maze#4 and Maze#18 in the Hour of Code: Maze Challenge [9]. These tasks have been studied in a number of prior works [35, 15, 1] because of the availability of large-scale datasets of students' attempts. For each reference task, we manually create three target tasks—Figure 2b illustrates the target tasks for T18; the target tasks for T4 can be found in the longer version of the paper [47]. These target tasks are similar to the corresponding reference task in the sense that the set of available block types is the same and the nesting structure of programming constructs in solution codes is the same.

Types of students' behaviors and students' attempts. For a given reference-target task pair (Tref, Ttar), we next seek to simulate a student stu to create stu's attempts C^stu_Tref and C^stu_Ttar. We begin by identifying a set of salient students' behaviors and misconceptions for reference tasks T4 and T18 based on students' attempts observed in the real-world dataset of [35]. In this benchmark, we select 6 types of students' behaviors for each reference task—Figure 2c highlights the 6 selected types for T18; the 6 selected types for T4 can be found in the longer version of the paper [47].³ For a given pair (Tref, Ttar), we first simulate a student stu by associating this student with one of the 6 types, and then manually create stu's attempts C^stu_Tref and C^stu_Ttar. For a given scenario (Tref, C^stu_Tref, Ttar, C^stu_Ttar), the attempt C^stu_Ttar is not observed and serves as a ground truth for evaluation purposes; henceforth, we interchangeably write a scenario as (Tref, C^stu_Tref, Ttar, ?).
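To illustrate how a benchmark scenario of this form could be represented, here is a small sketch (ours, in Python; the type names are illustrative, and the six behavior labels paraphrase the misconception descriptions accompanying Figure 2c). The ground-truth attempt is stored with the scenario but withheld from the technique under evaluation, mirroring the (Tref, C^stu_Tref, Ttar, ?) view.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional


    class BehaviorType(Enum):
        """Six salient behavior/misconception types for T18 (paraphrased from Figure 2c);
        each simulated student in the benchmark is associated with one such type."""
        CONFUSES_LEFT_RIGHT = 1           # confusing left/right in turns or conditionals
        FOLLOWS_WRONG_PATH = 2            # following one of the wrong path segments
        SAME_BLOCKS_IN_BOTH_BRANCHES = 3  # misunderstanding IfElse, duplicating blocks
        IGNORES_IF_ELSE = 4               # ignoring the IfElse structure
        IGNORES_WHILE = 5                 # ignoring the While (RepeatUntil) structure
        ONLY_BASIC_ACTIONS = 6            # using only turnLeft / turnRight / move


    @dataclass
    class Scenario:
        """One benchmark scenario (Tref, C^stu_Tref, Ttar, C^stu_Ttar)."""
        ref_task: str                 # e.g., "T18"
        stu_code_ref: str             # observed attempt on the reference task
        tar_task: str                 # e.g., "T18x"
        stu_code_tar: Optional[str]   # ground-truth attempt; hidden from techniques

        def masked(self) -> "Scenario":
            """The view handed to a technique: (Tref, C^stu_Tref, Ttar, ?)."""
            return Scenario(self.ref_task, self.stu_code_ref, self.tar_task, None)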
Total scenarios. We create 72 scenarios (Tref, C^stu_Tref, Ttar, C^stu_Ttar) in the benchmark, corresponding to (i) 2 reference tasks, (ii) 3 target tasks per reference task, (iii) 6 types of students' behaviors per reference task, and (iv) 2 students per type.

¹ In practice, the system might have more information, e.g., the whole trajectory of edits leading to C^stu_Tref.
² Even though the Hour of Code: Maze Challenge [9] has only 20 tasks, the space T is intractably large and new tasks can be generated, e.g., for providing feedback [1].
³ We note that, in real-world settings, the types of students' behaviors and their attempts have a much larger variability and complexities with a long-tail distribution.
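The count in the Total scenarios paragraph can be reproduced by enumerating the four factors directly; the sketch below (ours, in Python, with illustrative task identifiers, since the target-task names for T4 are only given in the longer version of the paper) simply builds the 2 x 3 x 6 x 2 = 72 scenario keys.

    from itertools import product

    # Illustrative identifiers; the benchmark's actual naming for T4's targets may differ.
    reference_tasks = ["T4", "T18"]
    targets_per_reference = {"T4": ["T4x", "T4y", "T4z"], "T18": ["T18x", "T18y", "T18z"]}
    behavior_types = range(1, 7)      # 6 types of students' behaviors per reference task
    students_per_type = range(1, 3)   # 2 simulated students per type

    scenario_keys = [
        (ref, tar, behavior, student)
        for ref in reference_tasks
        for tar in targets_per_reference[ref]
        for behavior, student in product(behavior_types, students_per_type)
    ]

    assert len(scenario_keys) == 72   # 2 reference tasks x 3 targets x 6 types x 2 students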
[Figure 2: the code listings are omitted here; the panels are (a) Reference task T18 with solution code and datasets, (b) Three target tasks for T18: T18x, T18y, and T18z, (c) Example codes (i)–(vi) corresponding to six types of students' behaviors when attempting T18, each capturing different misconceptions.]
Figure 2: Illustration of the key elements of the StudentSyn benchmark for the reference task T18 shown in (a)—same as in Figure 1a. (b) Shows three target tasks associated with T18; these target tasks are similar to T18 in the sense that the set of available block types is the same as T18_store and the nesting structure of programming constructs in solution codes is the same as in C*_T18. (c) Shows example codes corresponding to six types of students' behaviors when attempting T18, each capturing a different misconception as follows: (i) confusing left/right directions when turning or checking conditionals, (ii) following one of the wrong path segments, (iii) misunderstanding of the IfElse structure's functionality and writing the same blocks in both execution branches, (iv) ignoring the IfElse structure when solving the task, (v) ignoring the While structure when solving the task, (vi) attempting to solve the task by using only the basic action blocks in {turnLeft, turnRight, move}.

[Figure 3: the ten candidate code listings, shown as options (a)–(j), are omitted here; the query panel shows stu's attempt for T18x from Figure 1.]
Figure 3: Illustration of the generative and discriminative objectives in the StudentSyn benchmark for the scenario shown in Figure 1. For the generative objective, the goal is to synthesize the student stu's behavior on the target task T18x, i.e., a program that stu would write if the system would assign T18x to the student. For the discriminative objective, the goal is to choose one of the ten codes, shown as options (a)–(j), that corresponds to the student stu's attempt. For each scenario, ten options are created systematically as discussed in Section 3.2; in this illustration, option (a) corresponds to the solution code C*_T18x for the target task and option (e) corresponds to the student stu's attempt as designed in the benchmark.

3.2 StudentSyn: Performance Measures
We introduce two performance measures to capture our synthesis objective. Our first measure, namely generative performance, is to directly capture the quality of fine-level synthesis of the student stu's attempt—this measure requires a human-in-the-loop evaluation. To further automate the evaluation process, we then introduce a second performance measure, namely discriminative performance.

Generative performance. As a generative performance measure, we introduce a 4-point Likert scale to evaluate the quality of synthesizing stu's attempt C^stu_Ttar for a scenario (Tref, C^stu_Tref, Ttar, ?). The scale is designed to assign scores based on two factors: (a) whether the elements of the student's behavior observed in C^stu_Tref are present, and (b) whether the elements of the target task Ttar (e.g., parts of its solution) are present. More concretely, the scores are assigned as