Our expectations for how visual interactions sound are shaped in part by our own learned understandings of and experiences with objects and actions, and in part by the extent to which we perceive coherence between gestures that can be identified as "sound-generating" and their resultant sonic events. Even as advances in technology have made the creation of dynamic computer-generated audio-visual spaces not only possible but increasingly common, composers and sound designers have sought tighter integration between action and gesture in the visual domain and their accompanying sound and musical events in the auditory domain. Procedural audio and music, or the use of real-time data generated by in-game actors and their interactions in virtual space to dynamically generate sound and music, allows sound artists to create tight couplings across the visual and auditory modalities. Such procedural approaches, however, become problematic when players or observers are presented with audio-visual events within novel environments in which their own prior knowledge and learned expectations about sound, image and interactivity are no longer valid. With the use of procedurally-generated music and audio in interactive systems becoming more prevalent, composers, sound designers and programmers face an increasing need to establish low-level understandings of the crossmodal correlations between visual gesture and sonified musical result, both to convey artistic intent and to present communicative sonifications of visual action and event. For composers and designers attempting to build evocative and expressive procedural sound and music systems, when the local realities of any given virtual space are completely flexible and malleable, there exist few if any dependable locale-specific models upon which to base their choices of mapping schemata.
This research focuses jointly on the creative and technical concerns necessary to build procedurally-generated crossmodal musical interactions, as well as on the perceptual issues surrounding specific mapping schemata linking interactions with sound and music. A software solution and methodology are presented to facilitate the mapping of parameters of action, motion and gesture from virtual space to sound-generating processes, allowing composers and designers to repurpose real-time data as drivers for compositional and sound-related processes. Creative and technical examples drawn from a series of multimodal musical experiences are presented and discussed, exploring a variety of potential mapping schemata as well as the inner workings of the presented codebases. To assess the perceived coherence between motion and gesture in the visual modality and generated sound and musical events in the auditory modality, this research also details a user study measuring the impact of audio-visual crossmodal correspondences between low-level attributes of motion and sound. Subjects taking part in a controlled study were presented with multimodal examples of musically sonified motion in a pairwise comparison task and asked to rate the perceived fit between visual and auditory events. Each example was defined as a composite set of simple motion and sound attributes. Study results were analyzed using the Bradley-Terry statistical model, effectively calculating the relative contribution of each crossmodal attribute within each attribute pairing to the perceived coherence or 'fit' between audio and visual data. The statistical analysis of correlated motion/sound mappings and their relative contributions to the perceptual coherence of audio-visual interactions lays the groundwork for the establishment of predictive models linking attributes of sound and motion to perceived fit.
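For readers unfamiliar with the Bradley-Terry model, the following is a minimal sketch of how preference strengths can be estimated from pairwise comparison counts. It is not the analysis code used in the study: the function name `fit_bradley_terry`, the iterative minorization-maximization update, and the example counts are all illustrative assumptions, and the sketch estimates one strength per stimulus rather than the attribute-level contributions described above.

```python
def fit_bradley_terry(wins, iterations=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a square matrix of counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1, so that the modeled
    probability of preferring i over j is p[i] / (p[i] + p[j]).
    Hypothetical helper for illustration only.
    """
    n = len(wins)
    strengths = [1.0 / n] * n  # start from equal strengths
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes its comparison count divided by
            # the current combined strength of the two items.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        updated = [s / norm for s in updated]
        change = max(abs(a - b) for a, b in zip(updated, strengths))
        strengths = updated
        if change < tol:
            break
    return strengths


# Made-up counts for three hypothetical audio-visual mappings, each pair
# compared ten times. These numbers are not data from the study.
example_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
example_strengths = fit_bradley_terry(example_wins)
```

A larger estimated strength means that a mapping was preferred more often across its comparisons. Extending this to attribute-level effects would require modeling each strength as a function of the motion and sound attributes that define the example, which this sketch does not attempt.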