Especialización en Estadística Aplicada
Permanent URI for this collection: http://hdl.handle.net/11371/6011
Recent submissions

2024-12
Item: Análisis Estadístico de la Calidad del Agua en Colombia: Comparativa Regional desde 2007 hasta 2023
Ramirez Castaño, Francy Julieth; Romero Cárdenas, Oscar Alfonso; Salamanca Bernal, Julián Andrés
Context. This study analyzes the water quality index in the five regions of Colombia, as established in Resolution 2115 of 2007. Purpose. The index has been reported monthly from 2007 to 2023, providing a broad picture of how water quality has evolved in the different regions of the country. Methodology. The analysis used advanced statistical methods for time series analysis, trend analysis, and predictive modeling. Specialized computational tools such as R were used to process and model the data. Results. The time series were analyzed by applying the VAR model and running the corresponding tests to assess the viability of the models. Conclusions. Two models were estimated for each of the five regional series; model 2, which has more lags, gave the better results.

2023-12-02
Item: Evaluación de modelos de DBO5 para el afluente y efluente de un STARD piloto mediante tecnicas de machine learning.
Pascal Suárez, Angel Camilo; González Martínez, Edwin Fernando; Salamanca Bernal, Julián Andrés
Context. To establish the best predictive model for the BOD5 (DBO5) parameter, using machine learning techniques, from the Chemical Oxygen Demand, Total Suspended Solids, Total Nitrogen, and Total Phosphorus of a pilot treatment plant located at the Facultad del Medio Ambiente of the Universidad Distrital Francisco José de Caldas. Purpose.
To determine, by applying machine learning techniques, a predictive model that supports decision making based on the metrics evaluated for the models applied. Methodology. Missing data were cleaned and imputed; because the data were heterogeneous, the database was transformed to make it more homogeneous. Two groups were identified through Principal Component Analysis, and then Multiple Linear Regression and Random Forest models were applied. The metrics used to determine the best model were RMSE, MAPE, R2, and COR. Results. The Random Forest model fitted with the variables selected by the Akaike criterion gave the best metrics for the influent and effluent of the STARD, including an RMSE of 0.285 and 0.34 respectively. Conclusions. Although the linear regression model obtained good metrics, it does not satisfy the normality assumptions, so the best predictive model was the Random Forest, with better metrics and the variables selected by the Akaike criterion.

2023-12-12
Item: Factores asociados al rendimiento académico. Un análisis desde el desarrollo territorial.
Parra Mateus, Oscar Albeiro; Muñoz Zambrano, Angela Ximena; Eljadue Cock, Brenda; Gonzáles Velazco, José John Fredy
Context. The determinants of academic performance have been the subject of numerous studies addressing various factors that could influence student achievement. In the Colombian context, however, a research gap persists: no studies were found that specifically explain students' scores in relation to territorial development. Purpose.
To determine the influence that territorial development may have on the academic performance of students who took the Saber 11 tests in 2022. Methodology. The SEMMA methodology was applied for sampling, exploring, modifying, modeling, and assessing the data. A predictive model was developed for this test, using as input the information in the Informe de Madurez de Ciudades y Territorios Inteligentes for 2022 and the database of Saber 11 results for the same year. The analysis examined several variables, including overall score, gender, and the students' municipality of residence, among others. Results. The most significant variables were the school shift and having a computer. Model evaluation gave a coefficient of determination (R2) of 0.23, with a variation in the predictions (RMSE) of ±44.52 points. Conclusions. It is possible that municipal quality indicators do not have a significant impact on the level of education reflected in the Saber tests, which measure students' academic performance. Closer or more personal factors, such as parents' education and the school shift, could influence these results more directly.

Item: Zona gris, un experimento para entrenar un modelo de clasificación a partir de valores extremos.
Enriquez Sanchez, Dany Alexander; González Veloza, José John Fredy
In machine learning, supervised regression problems are often converted into dichotomous classification problems based on the definition of the target variable, which simplifies decision making.
Our hypothesis in this work is that training a dichotomous classification model using only the extreme values of the target variable, and discarding the rest (the gray zone), produces better results than using all the data from the development population in the training phase. This could benefit researchers and practitioners in terms of time, savings in computational resources, and possibly better performance in the training phase. Furthermore, this research can serve as a first step toward better understanding the influence of extreme values on training classification models and open a new field of study. To evaluate this hypothesis, we use a database of the 2019 Saber Pro test results from the "Open Data" portal of the Ministry of Information and Communications Technologies. We perform two model training tests: a symmetric scheme that balances the classification values 0 and 1, and an asymmetric scheme that imbalances these values. The best results were obtained when training the model in the range from 0% to 30% of the gray zone using an asymmetric scheme. However, no significant results were observed that supported the hypothesis.

Item: Modelo de pronosticó para la estimación de costos semanales de importación marítima de bases para la producción de lubricantes en Colombia desde las Américas mediante un modelo SARIMA
Osorio Castañeda, Cristhian Camilo; Niño Gutiérrez, Sindy Carolina
The main objective of this study is to analyze and predict weekly CIF (Cost, Insurance and Freight) import prices of bases for lubricant production imported into Colombia from the Americas. It seeks to evaluate historical import price trends and use time series models for forecasting. The SEMMA (Sample, Explore, Modify, Model, Assess) methodology was used for data analysis. The data were obtained from the Treid platform, which provides information on imports.
The data were explored and transformed, and certain characteristics were identified, such as repeated dates and missing records on some days. The data were then modified by filtering the information and calculating the weekly average of CIF values. A time series with a positive trend was obtained. Based on these analyses, a SARIMA predictive model was built, removing seasonal and non-stationary behavior, for a seasonal series with period s, to predict weekly CIF import prices. The results of this study provide a view of the behavior of import prices of bases for lubricants in Colombia, which is of great importance for decision making in the industry. It is concluded that this innovative methodological approach can contribute to a better understanding and management of CIF import prices of bases for lubricants, making it possible to adjust strategies and increase participation in the national lubricants market.

Item: Diagnóstico de la Población Recicladora Independiente del Municipio de Pasto, a partir de Técnicas de Aprendizaje Supervisado
Carlosama Ruales, Yana Stefhania; González Veloza, José John Fredy
Historically, in Colombia the recycling population has carried out the recovery of usable waste under precarious working conditions and systematic restrictions and prohibitions by the State. This has generated a constant struggle by the recyclers' guild, who have seen in associated work the only way to defend their rights. It is therefore necessary to understand why almost half of the recycling population of the municipality of Pasto is not associated with a recycling organization, given the advantages of association, such as being providers of the public cleaning service and thus receiving the usage fee. The objective of this research is to identify the socioeconomic conditions of non-associated recyclers that can explain their lack of interest in organizing.
To this end, using the data from the 2021 gender diagnosis of the recycling population of Pasto, several models were trained with supervised learning techniques. The LGBM (Light Gradient Boosting Machine) method was selected because it gave the best performance metrics for predicting the conditions of the non-associated recycler population. Data processing consisted of data cleaning, transformation of some variables, and simple imputation of null values; finally, the data were split into training and test sets. According to the model's results, work should begin with the recyclers who have been in the trade the longest, because for them recycling is only a subsistence activity and not the basis of their economy.

Item: Análisis con Machine Learning de Peticiones Externas (PQRS) del Servicio Nacional de Aprendizaje - SENA para mitigación de incumplimientos normativos
Ayala Alfonso, Yésica Patricia; Durán Ramírez, Julio Mario; González Martínez, Edwin Fernando
The National Learning Service (SENA) receives Petitions, Complaints, Claims, Suggestions, Denunciations, Acknowledgments, Congratulations and Guardianship Actions (PQRS), which must be managed to guarantee a timely response to citizens who request a solution to their requirement. It must also ensure follow-up of and compliance with the regulations that govern PQRS management in Colombia, and automate processes that are currently carried out manually. For this reason, the purpose of this project is to analyze the PQRS received by SENA with machine learning models, in order to mitigate the risk of regulatory breaches and to resolve the PQRS in a timely manner for the public. The SEMMA methodology is used, as it is the most appropriate for the analysis of large databases.
The Python and R programming languages were used to carry out the analysis of the PQRS, obtaining conclusive and satisfactory results for the machine learning models chosen to predict possible violations of citizens' rights. Based on these results, it is considered necessary to suggest to SENA the implementation of the models.

Item: Reducción de tiempos de ejecución en el proceso de calibración en un laboratorio de metrología a partir de un modelo predictivo
Castro Rodríguez, Ever Daniel; González Veloza, José John Fredy
The calibration of measuring instruments in a metrology laboratory is essential, but it can be affected by internal processes that delay activities. This research proposes a predictive model to reduce execution times, improve resource allocation, and provide a more efficient service. Machine learning and cross-validation were used to train the model with historical data. Implementing the model achieved an average reduction of 30% in execution times, improving the performance of human resources, the development of activities, and customer satisfaction.
This research supports the importance of predictive technology in metrology for improving efficiency and quality of service.

Item: Clasificación de variables que representan mayor impacto al incumplirse al momento de otorgar cartera de microcréditos
Herrera Carranza, Gamaliel; González Veloza, José John Fredy
Granting credit products in the Colombian financial market, specifically in microcredit lines, implies a higher level of risk given the profile of the clients who receive these products: mostly people with high levels of vulnerability, little or minimal credit experience, and marked informality in their economic activities. It is therefore important to identify which credit policies are most relevant when a client's standing deteriorates or the client does not pay the installments of the disbursed credits on time. For this reason, it is considered pertinent to analyze the variables (credit policies) that are evaluated during the analysis and approval of credit applications, since it has been found that customers who have fallen into arrears did not initially comply with some of the credit policies that should be considered for credit approval.
Accordingly, based on the results of the analysis, it is considered important to integrate into the evaluation, analysis, and granting methodology some adjustments to the level of strictness in complying with the policies that have the greatest impact on, or may explain, non-payment by customers. Examples include controlling the disbursement amount, and excluding or requesting greater guarantees from clients engaged in business activities that showed poor payment habits, so that portfolio quality indicators can improve.

Item: Análisis de elementos de tierras raras en cenizas de Carbón a través de técnicas de aprendizaje automático no supervisado (CLUSTERING)
Díaz Moreno, Héctor Felipe; Gonzalez Martinez, Edwin Fernando
Coal ash is a byproduct of coal combustion in power plants. It contains a variety of chemical elements, among them the rare earth elements (REEs), and given the special characteristics of the starting material, coal, there is a high probability of finding them in appreciable quantities. Colombia, as the main coal producer in Latin America and fifth in the world, is a good candidate for the search for these elements, since over the years ash from electricity generation has accumulated around thermal power plants and has not been given uses other than those traditionally established. An ash pile in an ash yard at a thermoelectric plant in Colombia was characterized, obtaining the REE content by ICP-MS and the major elements by XRF. The SEMM methodology for data exploration and treatment was applied to the results. Clustering algorithms such as K-means, Hierarchical Clustering, and DBSCAN are used to explore segregation patterns in the data. Metrics such as the Silhouette coefficient, the sum of squares, and the Calinski-Harabasz and Davies-Bouldin partition criteria are used to assess the quality of the clusters obtained.
In addition, statistical tests are performed to determine the significance of the groupings obtained. Three to four groups are found, depending on the unsupervised algorithm used. The groupings have good quality in terms of between-group variance: the groups are well separated, internally cohesive, and homogeneous, and the segmentation explains more than 80% of the variability of the data. Since three to four groups give the best segmentation, coal ashes can be treated differentially during the extraction and/or beneficiation of rare earth elements from this or other coal combustion by-products, a field of application still to be explored and with good growth prospects.

Item: Modelo para determinar el desempeño de los aspirantes que participan en procesos de selección para empleos de carrera administrativa en el estado colombiano
Galindo León, Edwin Ariel; Salamanca Berna, Julian Andres; Gonzalez Martinez, Edwin Fernando
Context. Admission and promotion to the administrative career in the Colombian state is based on merit; to that end, Decree 1083 of 2015 establishes the selection processes that must be carried out to fill the vacancies offered. Purpose. This work aims to develop a model that identifies the performance that applicants to selection processes for jobs in state entities would have, based on different variables associated with the applicant and the job offered. Methodology. This work used data from a selection process carried out to fill vacancies in administrative career jobs, and followed the CRISP-DM methodology, which involves understanding the business, understanding and preparing the data, developing the model, testing it and, if effective, deploying it. Results.
Two models were built (Cubic Regression and Random Forest), which gave a coefficient of determination (R2) of 0.060 and 0.067 respectively, with a model standard error of 8.436 for the Cubic Regression and 8.455 for the Random Forest. Conclusions. According to the results, the variables used did not yield robust models, so it is necessary to include other sociodemographic variables in order to predict results with a lower percentage of error, since this has practical implications in the context of the study.

Item: Aprendizaje supervisado para clasificación de experiencias de viajes turísticos en Colombia
Ortegón Palomino, Frank Sebastian; Gonzáles Martínez, Edwin Fernando
In Colombia, according to the Colombian Association of Travel and Tourism Agencies (ANATO), close to 2.2 million people traveled within the country as internal visitors during the first half of 2022, a figure that represents 10.4% of the total population. Consequently, companies that offer tourist trips at the national level face a highly competitive environment. To address this issue, an analysis was carried out, and the results showed that the Logistic Regression model is more effective in correctly classifying users' experiences. In conclusion, this research provides valuable information for improving decision making in tourist travel companies and promoting profitability by offering more satisfactory experiences to users.

Item: Análisis de series temporales de precipitación (1990- 2021) en el municipio de Quibdó, departamento Chocó. Usando modelado SARIMA
González Sanclemente, Janier Emir; Sandoval Rodríguez, Wilson
Aim. To analyze the precipitation time series between 1990 and 2021 in the municipality of Quibdó, located in the department of Chocó, using SARIMA modeling to forecast precipitation in the municipality. Materials and methods.
For this study, precipitation data from the meteorological station of the Institute of Hydrology, Meteorology and Environmental Studies (IDEAM) located at the El Caraño airport in Quibdó, from January 1990 to December 2021, were reviewed for incomplete records. The series was checked for any trend. A forecast was then made by fitting a Seasonal Autoregressive Integrated Moving Average (SARIMA) model, using the R program (R Core Team 2014). Results. The results suggest that precipitation does not present a trend, while its variance is unequivocally increasing; a SARIMA (0,0,9) x (0,1,1)12 model was obtained. Conclusion. The monthly precipitation time series exhibits a stationary trend over time. The annual variance of precipitation has not changed for the studied series, the SARIMA model obtained is appropriate for the series, and a forecast was made for a period of 12 months.

Item: Modelo para la predicción de interrupciones de energía en circuitos de distribución reportados a la superintendencia de servicios públicos Colombia
Salinas Moreno, Giovanny; González Veloza, José John Fredy
The quality of the energy service is fundamental to the well-being of society, since it influences business productivity, home comfort, and public safety. It is therefore important to analyze carefully the circuit-level interruptions that energy companies report to the superintendency of public services, in order to identify significant characteristics that support decisions to improve the quality of the electrical service. This work seeks to identify the characteristic features of an electrical circuit that had an interruption in a given year, and thus to forecast the duration of unscheduled interruptions using machine learning techniques.
This was done with open data on scheduled and unscheduled outages reported to the superintendency between 2010 and 2022 (n=85387); ARIMA, LSTM, and Prophet models were trained to forecast circuit outage time. From the results, we identified Prophet as the best model for our data, since it obtained a low MSE and RMSE relative to the others on the test data. We then evaluated the duration of unscheduled circuit interruptions between 01/02/2022 and 07/09/2023, forecasting a decrease in power interruptions. The months, days, and holidays with the most power interruptions were identified, which supports decisions to improve the quality of service.

Item: Modelo de aprendizaje automático aplicado a la desaparición forzada en Colombia
Villa Bustamante, Juan José; González Velosa, José John Fredy
Context. Forced disappearance is classified as a crime against human rights, and it has recurred throughout Colombian history as a way to hide a crime or to impose a warning of violence in a given territory. The act disrupts daily life in both the family and the community. It is also important to understand and address it because of how difficult such cases are to investigate. Purpose. This study sought to identify the sociodemographic factors in Colombia that favor the appearance of a person in a context of enforced disappearance, using a machine learning classification model. Methodology. Based on the analysis of open data on disappearances in Colombia from 1970 to December 2019 (n = 55,145), various classification models were trained to forecast the appearance of people (alive or dead) reported as victims of enforced disappearance. Results. The classification models with the best performance on the test data were Light Gradient Boosting Machine and Extreme Gradient Boosting, which obtained the highest AUC (0.7493 and 0.7485 respectively).
On the other hand, the variables that contributed the most to the prediction of the event were the municipality where the disappearance occurred, and the age and education of the disappeared person. Conclusions. The results showed that the municipality of residence has the greatest impact on the probability that a person appears, and the probability increases if they reside in main cities such as Bogotá, Medellín, and Barranquilla. It is also suggested that the owners of the database improve the dimensionality of the variable "classification of disappearance"; if a model of the same research problem is to be built, it is suggested to do so using a different methodology.

Item: Metodología de evaluación de la incidencia del factor humano en la medición intra-laboratorio de volúmenes de líquidos mediante el método gravimétrico y de transferencia en recipientes volumétricos metálicos.
Aguirre Romero, Elvis; Bermudez Rubio, Dagoberto
The Superintendencia de Industria y Comercio (SIC) Calibration Laboratories implement intra-laboratory verification studies to demonstrate the reliability of the calibration methods for metallic volumetric containers (RVM) used in metrological control inspection and verification activities. The diversity of liquid volume measurement methods and volumetric instruments creates the need to support, with statistical techniques, that the results obtained by the gravimetric and transfer measurement methods are comparable. The objective of this research is to evaluate the incidence of the human factor in the intra-laboratory measurement of liquid volume by gravimetric and transfer methods in RVM with a capacity of 20 liters. An experimental design of two nominal factors, with 3 and 2 levels respectively, was considered.
From the design, the complete 3x2 factorial arrangement with five replicates was constructed, which allows obtaining all the relevant information on the effect of these factors on the total volume measurements. The two-factor ANOVA test shows that there are no significant differences between the means of the volumetric measurements for the two factors: analysts and measurement methods. The proposed statistical model allows the influence of the human factor on the measurement of liquid volumes in RVM by different methods to be traced.

Item: Modelos de machin elearning para la predicción del estado de salud prenatal y la prevención mediante cardiotocogramas.
Arevalo Rodriguez, William Fabian; Jiménez Prieto, Ingrid Natalia; Gonzales Veloza, Jose Jhon Freddy
Reducing child mortality is reflected in several of the UN Goals and is a key indicator of human progress. CTGs are a simple and affordable option for assessing fetal health, allowing health professionals to take action to prevent infant and maternal mortality. The main objective of this work is the implementation and evaluation of various machine learning models in order to determine the best in terms of construction, computational cost, and accuracy in the classification (diagnosis) of the state of the fetus. To that end, we propose a tournament of machine learning models to find a balance between a model that is easy to replicate and apply and one with high sensitivity and accuracy in predicting fetal status. A list of different supervised classification techniques is therefore trained on a dataset provided by a plan that drives the automation of CTG analysis.
The models that performed best in accuracy use gradient boosting techniques, with which a high accuracy value is achieved. These models reveal that accelerations and abnormal variability on short and long timescales play an important role in determining health status. Among the models tested, the GBS presents the best results, reaching an accuracy of 96.0% for fetal status categorization.

Item: Metodología para incrementar la precisión en la estimación del volumen de llamadas en los centros de contacto del sector de telecomunicaciones a través de modelos de series de tiempos en el proceso de Workforce Management
Pacheco Gómez, Alexandra; Sandoval Rodriguez, Wilson
An erroneous inbound call forecast can lead to an excess or shortage of agents, which translates into business losses or customer dissatisfaction. We therefore seek to design and validate an optimal statistical forecasting model that fits real traffic requirements, satisfies the theoretical and practical assumptions of time series, and allows human talent resources to be optimized. The Box-Jenkins time series methodology was used to create a forecast model, which was compared with the traditional model currently used in the Workforce process.
It was possible to build a statistical methodology that adds value to the process and whose ability to characterize the behavior of incoming calls is statistically supported. This leads to better control in the estimation of resources, which will increase efficiency and reliability in the staffing process.

Item: Análisis exploratorio y pronóstico de la cantidad personas implicadas en siniestros viales relacionados con bicicletas en la ciudad de Bogotá D.C, a través del modelamiento de series de tiempo
Castillo Caicedo, José Ricardo; Sandoval Rodriguez, Wilson
Previous research shows how exploratory analysis and time series modeling of road accidents in a specific place can help to understand this phenomenon and reduce it. This study seeks to develop an exploratory analysis and a forecast of the number of people involved in bicycle-related road accidents in the city of Bogotá D.C. through time series modeling. The open data set of road accidents shared by the Bogotá District Mobility Secretariat was used; it contains information on road incidents recorded in police reports of traffic accidents that occurred in the city between 2015 and 2020. Three of the twenty localities of Bogotá D.C. were defined as the study area: Suba, Engativá, and Fontibón.
First, a descriptive analysis of the data was carried out; SARIMA models were then proposed in order to identify the one that best fits the phenomenon.

Item: Determinación de valores de terreno de suelo urbano en Bogotá de acuerdo con las ofertas inmobiliarias de predios NPH
Melo Cerón, William Alexander; Gonzalez Veloza, John Fredy
In the cadastral updating process, the land value of urban land is determined by evaluating the behavior of the real estate market for properties in the city of Bogotá. This study collects the variables that best describe the behavior of the land value of urban land, as well as the most relevant characteristics within the economic information of the cadastral update process. The final product is a multiple linear regression model that describes the behavior of each descriptive variable and the resulting target variable, expressed as the land value of each record. In addition, the spatial distribution of the predicted values and their variations across the geographic frame of the city is obtained.