The efficiency of human-computer interaction is greatly hindered by the small touchscreens of mobile devices such as smartphones and smartwatches. This has prompted widespread interest in handwriting recognition systems, which can be divided into active and passive systems. Active systems either require additional hardware to sense handwriting movements or offer tracking accuracy that is inadequate for handwriting recognition. Passive methods use the acoustic signal of the pen rubbing against the surface and are susceptible to environmental noise (above 60 dB). This paper presents VibWriter, a novel handwriting recognition system based on vibration signals detected by the built-in accelerometer of a smartphone. VibWriter is highly resistant to interference, since ordinary environmental noise does not induce vibration at the accelerometer. Extensive experiments demonstrate the efficacy of the system in terms of accuracy in letter recognition (76.15%) and word recognition (88.14%) when dealing with words of various lengths written by various users in a variety of writing positions under a variety of environmental conditions.
VibWriter uses the built-in accelerometer of a Samsung S7 to detect the vibration signals generated in the desk when a pen comes into contact with it. The smartphone accelerometer can reach a sampling rate of approximately 500 Hz [1], so even a short stroke of 0.1 s yields about 50 samples. We therefore attempt to recognize handwritten letters from the vibration signal. As shown in Fig. 1(a), when writing the letters “C”, “X”, and “Z”, the exceedingly weak amplitude of the vibration signals makes it difficult to differentiate between the three letters directly. The spectrum, in contrast, shows that different letters comprise different numbers of strokes: the letter “Z” comprises three strokes, the letter “X” two, and the letter “C” only one (see Fig. 1(b)).
VibWriter comprises three modules: letter segmentation, letter recognition, and word suggestion. The vibration signal detected by the built-in accelerometer is first sent to the letter segmentation module, which divides it into discrete segments. The letter recognition module then classifies each segment as a letter. Finally, the word suggestion module combines the letters into words.
Running the built-in accelerometer at its highest sampling rate results in an unstable sampling rate for the raw data [1]. Timestamps are accurate to 1 ms, yet only about 490 samples are actually collected per second, so in most situations more than half of the 1 ms time slots contain no sample. We therefore upsample the raw data to 1000 Hz using spline interpolation.
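The resampling step can be illustrated with a minimal sketch. It assumes integer millisecond timestamps and a natural cubic spline; the paper does not state which spline variant is used, and the function names `natural_cubic_spline` and `upsample_to_1khz` are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(ts, ys):
    """Return a function evaluating the natural cubic spline through (ts, ys).

    Assumes ts is strictly increasing and holds at least two points.
    """
    n = len(ts)
    h = [ts[i + 1] - ts[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives m; the natural boundary
    # condition fixes m[0] = m[n-1] = 0 (first and last rows are identity).
    sub, diag, sup, rhs = [0.0] * n, [1.0] * n, [0.0] * n, [0.0] * n
    for i in range(1, n - 1):
        sub[i] = h[i - 1]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        sup[i] = h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i]
                        - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * n
    m[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]

    def evaluate(t):
        # Locate the interval [ts[i], ts[i+1]] that contains t.
        i = min(max(bisect_right(ts, t) - 1, 0), n - 2)
        left, right = t - ts[i], ts[i + 1] - t
        return ((m[i] * right ** 3 + m[i + 1] * left ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * right
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * left)

    return evaluate


def upsample_to_1khz(timestamps_ms, values):
    """Resample irregular samples onto a uniform 1 ms grid (1000 Hz)."""
    spline = natural_cubic_spline(timestamps_ms, values)
    grid = list(range(timestamps_ms[0], timestamps_ms[-1] + 1))
    return grid, [spline(t) for t in grid]
```

With roughly 490 irregularly spaced samples per second as input, the output contains one value per millisecond between the first and last timestamp.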
Generally, the tap of a pen on the desk surface produces a distinctive vibration pattern that marks the beginning of writing. However, in situations where the user wants to write quietly, such as in a meeting room, writing begins with a swipe, which makes the start of writing difficult to identify. The signal produced by a tap presents an abrupt change in amplitude, whereas the amplitude of the signal produced by a swipe grows gradually. Common segmentation approaches [2], [3] often fail on vibration signals that begin with a swipe. We therefore calculate the mean value M(t) of the vibration signal S(t) over a sliding window of tw = 100 ms.
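As a rough illustration of this smoothing step, the sketch below computes a trailing sliding mean. It assumes the signal has already been resampled to 1000 Hz, so that a 100 ms window spans 100 samples; whether the window is trailing or centered is not specified in the text, and the function name `sliding_mean` is hypothetical.

```python
def sliding_mean(signal, window=100):
    """Trailing mean of `signal` over the last `window` samples.

    At 1000 Hz, window=100 corresponds to tw = 100 ms. The first few
    outputs average over the samples available so far.
    """
    means = []
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            # Drop the sample that just left the window.
            running -= signal[i - window]
        means.append(running / min(i + 1, window))
    return means
```

A running sum keeps the cost linear in the signal length, which matters little offline but helps if segmentation runs on streaming data.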
Letter detection relies on three time thresholds, T1, T2, and T3, and three amplitude thresholds, A1, A2, and A3. T1 and T2 are the minimum and maximum durations of a letter, and T3 is the time interval between words. A1 and A2 are the maximum and minimum absolute values of M(t), and A3 is the minimum absolute value of interference. The time thresholds constrain the signal length of letters and words, and the amplitude thresholds determine where the signal begins and ends.
Peak selection is based on the amplitude threshold, where the start threshold is \(M_{start} = 0.2 × A1 + 0.8 × A2\) and the end threshold is \(M_{end} = 0.1 × A1 + 0.9 × A2\).
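One way to read the thresholds above is as a hysteresis detector: a segment opens when |M(t)| rises to the start threshold, closes when it falls below the end threshold, and is kept only if its duration lies between T1 and T2. The sketch below follows that reading, which is an assumption rather than the paper's stated procedure; the function name `detect_segments` is hypothetical, and the word gap T3 and interference level A3 are left out.

```python
def detect_segments(m, a1, a2, t1, t2):
    """Return (start, end) sample indices of candidate letter segments.

    m      : smoothed signal M(t), one value per sample
    a1, a2 : maximum and minimum absolute values of M(t)
    t1, t2 : minimum and maximum letter duration, in samples
    """
    m_start = 0.2 * a1 + 0.8 * a2  # threshold that opens a segment
    m_end = 0.1 * a1 + 0.9 * a2    # threshold that closes a segment
    segments = []
    start = None
    for i, value in enumerate(m):
        level = abs(value)
        if start is None:
            if level >= m_start:
                start = i
        elif level < m_end:
            # Keep the segment only if its length is plausible for a letter.
            if t1 <= i - start <= t2:
                segments.append((start, i))
            start = None
    # A segment still open when the signal ends is discarded.
    return segments
```

Because the end threshold sits below the start threshold, small dips inside a stroke do not split a letter into several segments.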
We adopt the short-time Fourier transform (STFT) to generate features in the frequency domain. We also develop a dynamic denoising algorithm that identifies noise based on a reference signal collected during idle periods. We begin by establishing a noise sample \(S_{noise} = [s_1, s_2, ..., s_l]\), and then update the noise estimate as:
\[\hat{S}_{noise} = \frac{1}{N}\sum_{i=1}^{N}S_{noise_{i}}\] where l is the length of the noise sample, which varies with the handwriting segment, \(S_{noise}\) holds the noise signal between letters and words, and N is the number of samples in \(S_{noise}\). We then denoise the signal using spectral subtraction:
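\[||Y(k)||^2 = ||S_{signal}(k)||^2 + ||\hat{S}_{noise}(k)||^2\] Here \(Y(k)\) is taken to be the spectrum of the observed signal, so the denoised power spectrum \(||S_{signal}(k)||^2\) is recovered by subtracting the noise estimate \(||\hat{S}_{noise}(k)||^2\) from \(||Y(k)||^2\). Convolutional neural networks (CNNs) have proven highly effective in spectrum classification [2], [3]. However, the spectrum of vibration signals is far narrower than that of acoustic signals, so the recognition module has to extract handwriting features at various scales (e.g., single taps, single strokes, and entire letters). We perform handwriting recognition using the Xception model with Focal Loss.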
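Returning to the denoising step, the noise averaging and spectral subtraction can be sketched as follows. The sketch assumes that the power spectra of the idle-period frames and of the observed frame have already been computed (for example, from the STFT), that averaging is done per frequency bin, and that negative results are clipped to zero, a common convention that the text does not specify. The function names are hypothetical.

```python
def average_noise(noise_spectra):
    """Average N noise power spectra bin by bin to form the noise estimate."""
    count = len(noise_spectra)
    bins = len(noise_spectra[0])
    return [sum(spectrum[k] for spectrum in noise_spectra) / count
            for k in range(bins)]


def spectral_subtract(observed_power, noise_power):
    """Estimate the signal power as observed power minus noise power.

    Rearranges |Y(k)|^2 = |S_signal(k)|^2 + |S_noise(k)|^2 for the signal
    term; bins where the estimate would be negative are clipped to zero
    (an assumption of this sketch).
    """
    return [max(y - n, 0.0) for y, n in zip(observed_power, noise_power)]
```

Updating the noise estimate from the gaps between letters and words lets the subtraction track slow changes in the background vibration.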
Users typically write whole words rather than single letters. We therefore develop a word suggestion algorithm to enhance handwriting recognition performance at the word level, using the N-gram algorithm [4] and edit distance to suggest words of different lengths.
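The edit-distance part of word suggestion can be illustrated with a minimal sketch. It assumes Levenshtein distance with unit costs and a hypothetical candidate vocabulary; the paper does not specify the distance variant, the dictionary, or how the N-gram model is combined with it, so the N-gram component is omitted here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest_word(recognized, vocabulary):
    """Return the vocabulary word closest to the recognized letter string.

    Ties are broken in favor of the word that appears first in `vocabulary`.
    """
    return min(vocabulary, key=lambda word: edit_distance(recognized, word))
```

For example, if the recognizer outputs "VIBRATIOR", a vocabulary containing "VIBRATION" yields that word, since the two differ by a single substitution.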
VibWriter is implemented on a Samsung S7, with a MacBook Pro (Intel Core i9 CPU @ 2.3 GHz, 16 GB RAM) acting as the server. The built-in accelerometer provides a sampling rate of about 490 Hz. We use a third-party application, AccDataRec, for display.
We use the top-1 output of the network as the recognition result. As shown in Fig. 4, the average accuracy in letter recognition is 75.69%. Analysis of the misclassifications reveals that around 20% of the letters “K” and “N” are misidentified as “R” and “V”, respectively. A word suggestion algorithm is therefore required to achieve reasonable recognition performance.
The performance of the VibWriter system, which uses the N-gram algorithm for short words and the edit distance algorithm for longer words, is verified by counting the number of correct word suggestions. The proposed algorithms achieve an overall accuracy of 88.14% for words of various lengths.
[1] Z. Ba, T. Zheng, X. Zhang, Z. Qin, B. Li, X. Liu, and K. Ren, “Learning-based practical smartphone eavesdropping with built-in accelerometer,” Jan. 2020.
[2] H. Yin, A. Zhou, G. Su, B. Chen, L. Liu, and H. Ma, “Learning to recognize handwriting input with acoustic features,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 4, no. 2, Jun. 2020.
[3] H. Du, P. Li, H. Zhou, W. Gong, G. Luo, and P. Yang, “Wordrecorder: Accurate acoustic-based handwriting recognition using deep learning,” in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, 2018, pp. 1448–1456.
[4] P. Nather, “N-gram based text categorization,” 2005.