
Cross-Lingual Pretrained Framework Integrating Diverse Feature Aggregation for Classification of English Documents

Dr. Alejandro Martínez López, Department of Computer Science, University of Barcelona, Barcelona, Spain

Abstract

The rapid expansion of textual data across multilingual digital environments has intensified the need for robust, scalable, and linguistically adaptive text classification frameworks. Traditional monolingual models often struggle to generalize across heterogeneous linguistic contexts, particularly when handling semantic ambiguity, domain variation, and cross-lingual knowledge transfer. This study proposes a cross-lingual pretrained framework that integrates diverse feature aggregation mechanisms to enhance the classification performance of English documents. The framework leverages advances in transformer-based architectures, graph neural networks, and multi-view representation learning to capture syntactic, semantic, and contextual dependencies within textual data.

The research builds upon pretrained language models such as BERT and its variants, combining them with feature fusion strategies derived from convolutional neural networks, recurrent neural networks, and graph-based representations. By integrating multiple feature spaces—including lexical embeddings, contextual representations, and structural graph features—the proposed framework addresses limitations associated with single-representation learning. The model incorporates cross-lingual knowledge transfer through pretrained multilingual embeddings and heterogeneous graph attention mechanisms, enabling improved generalization across diverse datasets.
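As a minimal illustration of the multi-view aggregation described above, the sketch below fuses three per-document feature views by concatenation and scores the fused vector with a linear softmax classifier. The view names, dimensions, and random weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical feature views for one document (dimensions are illustrative).
lexical = rng.normal(size=300)     # e.g. averaged static word embeddings
contextual = rng.normal(size=768)  # e.g. a [CLS]-style transformer representation
graph = rng.normal(size=128)       # e.g. a node embedding from a document graph

# Feature aggregation: concatenate the views into one multi-view representation.
fused = np.concatenate([lexical, contextual, graph])  # shape: (1196,)

# A linear classifier over the fused space (weights are random placeholders).
num_classes = 4
W = rng.normal(size=(num_classes, fused.shape[0]))
b = np.zeros(num_classes)

logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over class scores

predicted = int(np.argmax(probs))
print(fused.shape, predicted)
```

In a trained system the placeholder weights would be learned jointly with the encoders, and concatenation could be replaced by gated or attention-based fusion; the point here is only that each view contributes a distinct slice of the fused representation.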

Methodologically, the study adopts a hybrid architecture that combines transformer encoders with feature aggregation layers and graph-based relational modeling. Experimental evaluation is structured around benchmark classification scenarios and assesses accuracy, robustness, and scalability. The findings indicate that multi-feature aggregation substantially enhances classification performance, particularly on complex and noisy datasets where contextual dependencies are critical. Furthermore, integrating cross-lingual pretrained models improves semantic consistency and reduces classification errors caused by linguistic variability.
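The graph-based relational modeling in such hybrid architectures typically aggregates neighbor features with learned attention weights. The sketch below shows the core of that operation for a single node, assuming random illustrative features and a shared attention vector; it is a conceptual sketch of attention-weighted aggregation, not the paper's exact layer.

```python
import numpy as np

def attention_aggregate(node, neighbors, a):
    """Aggregate neighbor features with softmax attention weights.

    `a` is a shared attention vector scoring each (node, neighbor) pair.
    """
    # Score each neighbor by projecting the concatenated pair onto `a`.
    scores = np.array([a @ np.concatenate([node, nb]) for nb in neighbors])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over neighbors
    # Return the attention-weighted sum of neighbor features.
    return (weights[:, None] * np.array(neighbors)).sum(axis=0)

rng = np.random.default_rng(1)
d = 8
node = rng.normal(size=d)
neighbors = [rng.normal(size=d) for _ in range(3)]
a = rng.normal(size=2 * d)  # random placeholder; learned in a real model

agg = attention_aggregate(node, neighbors, a)
print(agg.shape)
```

Heterogeneous graph attention extends this idea with type-specific projections and a second attention level over relation types, but the per-neighbor weighting shown here is the common core.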

The study contributes to the growing body of research on intelligent text classification by proposing a unified framework that bridges gaps between pretrained language models and multi-view feature integration. The implications extend to applications in information retrieval, sentiment analysis, content moderation, and enterprise document management. However, challenges related to computational complexity, data dependency, and interpretability remain critical considerations for future research. The paper concludes by outlining directions for optimizing cross-lingual architectures and enhancing feature fusion strategies for next-generation text classification systems.

Keywords

Cross-lingual learning, text classification, pretrained models

References

Chen, J., Yang, Z., Yang, D.: Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239 (2020)

Chen, Y.: Convolutional neural network for sentence classification (2015)

Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29 (2016)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

Gao, J., Li, P., Laghari, A.A., Srivastava, G., Gadekallu, T.R., Abbas, S., Zhang, J.: Incomplete multiview clustering via semidiscrete optimal transport for multimedia data mining in IoT. ACM Transactions on Multimedia Computing, Communications and Applications 20(6), 1–20 (2024)

Gao, J., Liu, M., Li, P., Laghari, A.A., Javed, A.R., Victor, N., Gadekallu, T.R.: Deep incomplete multi-view clustering via information bottleneck for pattern mining of data in extreme-environment IoT. IEEE Internet of Things Journal (2023)

Gao, J., Liu, M., Li, P., Zhang, J., Chen, Z.: Deep multiview adaptive clustering with semantic invariance. IEEE Transactions on Neural Networks and Learning Systems (2023)

Gururangan, S., Dang, T., Card, D., Smith, N.A.: Variational pretraining for semi-supervised text classification. arXiv preprint arXiv:1906.02242 (2019)

Johnson, R., Zhang, T.: Deep pyramid convolutional neural networks for text categorization. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 562–570 (2017)

Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)

Li, C., Peng, X., Peng, H., Li, J., Wang, L.: Textgtl: Graph-based transductive learning for semi-supervised text classification via structure-sensitive interpolation. In: IJCAI. pp. 2680–2686 (2021)

Li, P., Chen, Z., Yang, L.T., Gao, J., Zhang, Q., Deen, M.J.: An incremental deep convolutional computation model for feature learning on industrial big data. IEEE Transactions on Industrial Informatics 15(3), 1341–1349 (2018)

Li, P., Gao, J., Zhang, J., Jin, S., Chen, Z.: Deep reinforcement clustering. IEEE Transactions on Multimedia (2022)

Li, P., Laghari, A.A., Rashid, M., Gao, J., Gadekallu, T.R., Javed, A.R., Yin, S.: A deep multimodal adversarial cycle-consistent network for smart enterprise system. IEEE Transactions on Industrial Informatics 19(1), 693–702 (2022)

Lin, Y., Meng, Y., Sun, X., Han, Q., Kuang, K., Li, J., Wu, F.: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727 (2021)

Liu, C., Wang, X.: Quality-related english text classification based on recurrent neural network. Journal of Visual Communication and Image Representation 71, 102724 (2020)

Liu, P., Qiu, X., Chen, X., Wu, S., Huang, X.J.: Multi-timescale long short-term memory neural network for modelling sentences and documents. In: conference on empirical methods in natural language processing. pp. 2326–2335 (2015)

Mundra, S., Mittal, N.: Fa-net: fused attention-based network for hindi english code-mixed offensive text classification. Social Network Analysis and Mining 12(1), 100 (2022)

Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., Yang, Q.: Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In: world wide web conference. pp. 1063–1072 (2018)

Sachan, D.S., Zaheer, M., Salakhutdinov, R.: Revisiting lstm networks for semi-supervised text classification via mixed objective function. In: Proceedings of the aaai conference on artificial intelligence. vol. 33, pp. 6940–6948 (2019)

Shabestani, S., Geçikli, M.: Machine learning use for English texts' classification (a mini-review). Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 7(1), 414–423 (2024)

Taha, K., Yoo, P.D., Yeun, C., Taha, A.: Text classification: A review, empirical, and experimental evaluation. arXiv preprint arXiv:2401.12982 (2024)

Wang, G., Li, C., Wang, W., Zhang, Y., Shen, D., Zhang, X., Henao, R., Carin, L.: Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174 (2018)

Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., Yu, P.S.: Heterogeneous graph attention network. In: The world wide web conference. pp. 2022–2032 (2019)

Wang, Z., Liu, X., Yang, P., Liu, S., Wang, Z.: Cross-lingual text classification with heterogeneous graph neural network. arXiv preprint arXiv:2105.11246 (2021)

Xie, Q., Huang, J., Du, P., Peng, M., Nie, J.Y.: Inductive topic variational graph auto-encoder for text classification. pp. 4218–4227 (2021)

Xu, J., Cai, Y., Wu, X., Lei, X., Huang, Q., Leung, H.f., Li, Q.: Incorporating context-relevant concepts into convolutional neural networks for short text classification. Neurocomputing 386, 42–53 (2020)

Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 7370–7377 (2019)

Zhang, H., Zhang, J.: Text graph transformer for document classification. In: Conference on empirical methods in natural language processing (EMNLP) (2020)

Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Advances in neural information processing systems 28 (2015)


How to Cite

Dr. Alejandro Martínez López. (2026). Cross-Lingual Pretrained Framework Integrating Diverse Feature Aggregation for Classification of English Documents. International Journal of Computer Science & Information System, 11(05), 1–12. Retrieved from https://scientiamreearch.org/index.php/ijcsis/article/view/382