Agent Structure of the Multimodal Interface to the National Cybersecurity Platform, Part 2

Article in Polish. DOI: 10.14313/PAR_234/5

Włodzimierz Kasprzak, Wojciech Szynkiewicz, Maciej Stefańczyk, Wojciech Dudek, Maksym Figat, Maciej Węgierek, Dawid Seredyński, Cezary Zieliński (Politechnika Warszawska, Wydział Elektroniki i Technik Informacyjnych, Instytut Automatyki i Informatyki Stosowanej)


Abstract

This two-part article presents an interface to the National Cybersecurity Platform (NPC). The interface uses gestures and voice commands to control the operation of the platform. This part of the article presents the structure of the interface and the way it operates, and also discusses issues related to its implementation. The interface was specified using an approach based on embodied agents, demonstrating that this approach can be applied not only to the construction of robotic systems, for which it had been used many times before. To adapt this approach to agents operating on the boundary between the physical environment and cyberspace, the monitor screen had to be treated as part of the environment, while windows and cursors were treated as elements of the agents. As a result, a very clear structure of the designed system was obtained. The second part of this article presents the algorithms used for speech, speaker and gesture recognition, as well as the results of testing these algorithms.
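The embodied-agent decomposition mentioned above can be illustrated with a minimal sketch: the screen plays the role of the environment, a window together with the cursor forms the agent's virtual effector, and the recognised gesture or voice commands reach the agent's control subsystem through a virtual receptor. This is only a schematic illustration with assumed, hypothetical names (Window, VirtualEffector, VirtualReceptor, ControlSubsystem); the actual structure of the interface is specified in Part 1 of the article.

```python
# Illustrative sketch only: hypothetical names, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class Window:            # an element of the agent acting on the screen (the environment)
    x: int
    y: int
    scale: float = 1.0


class VirtualEffector:
    """Translates abstract control commands into window/cursor operations."""

    def __init__(self, window: Window):
        self.window = window

    def execute(self, command: str) -> None:
        if command == "zoom_in":
            self.window.scale *= 1.2
        elif command == "zoom_out":
            self.window.scale /= 1.2
        elif command == "pan_left":
            self.window.x -= 10


class VirtualReceptor:
    """Aggregates recognised gesture/voice events into abstract commands."""

    def perceive(self, raw_event: dict) -> str | None:
        # raw_event would come from the gesture- or speech-recognition agents
        return raw_event.get("command")


class ControlSubsystem:
    """The agent's decision loop: perception -> decision -> action."""

    def __init__(self, receptor: VirtualReceptor, effector: VirtualEffector):
        self.receptor, self.effector = receptor, effector

    def step(self, raw_event: dict) -> None:
        command = self.receptor.perceive(raw_event)
        if command is not None:
            self.effector.execute(command)


# Example: a recognised "zoom_in" gesture scales the visualisation window.
agent = ControlSubsystem(VirtualReceptor(), VirtualEffector(Window(0, 0)))
agent.step({"command": "zoom_in"})
```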

Keywords

National Cybersecurity Platform, gesture recognition, speaker recognition, speech recognition, image recognition

Agent Structure of Multimodal User Interface to the National Cybersecurity Platform – Part 2

Abstract

This two-part paper presents an interface to the National Cybersecurity Platform utilising gestures and voice commands as the means of interaction between the operator and the platform. Cyberspace and its underlying infrastructure are vulnerable to a broad range of risks stemming from diverse cyber-threats. The main role of this interface is to support security analysts and operators in controlling the visualisation of cyberspace events, such as incidents or cyber-attacks, especially when manipulating graphical information. The main visualisation control modalities are gesture- and voice-based commands. Thus, the design of the gesture-recognition and speech-recognition modules is provided. The speech module is also responsible for speaker identification, in order to limit access to trusted users only, i.e. those registered with the visualisation control system. This part of the paper focuses on the structure and the activities of the interface, while the second part concentrates on the algorithms employed for the recognition of gestures, voice commands and speakers.
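The role of speaker identification described above, namely gating voice commands by the identity of the speaker, can be sketched as follows. All names in the sketch (identify_speaker, recognise_command, handle_utterance, TRUSTED_SPEAKERS, the threshold value) are hypothetical placeholders, not the platform's actual interfaces; the recognition models themselves are only stubbed out.

```python
# Schematic sketch of the access-control idea, not the platform's actual API.
from typing import Callable

TRUSTED_SPEAKERS = {"analyst_01", "operator_02"}   # users registered with the system


def identify_speaker(audio: bytes) -> tuple[str, float]:
    """Placeholder for a speaker-identification model; returns the most likely
    registered speaker and a confidence score."""
    raise NotImplementedError


def recognise_command(audio: bytes) -> str:
    """Placeholder for a speech-recognition model mapping audio to a command label."""
    raise NotImplementedError


def handle_utterance(audio: bytes,
                     dispatch: Callable[[str], None],
                     threshold: float = 0.8) -> None:
    """Dispatch a visualisation-control command only for trusted, confidently
    identified speakers; otherwise ignore the utterance."""
    speaker, confidence = identify_speaker(audio)
    if speaker in TRUSTED_SPEAKERS and confidence >= threshold:
        dispatch(recognise_command(audio))
```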

Keywords

gesture recognition, image recognition, National Cybersecurity Platform, speaker recognition, speech recognition
