Parameter inference is perhaps the most fundamental problem in statistics. Both the Bayesian posterior distribution and the frequentist maximum likelihood estimate rely critically on the availability of the probability mass or density function, namely the likelihood function $L(\theta; X) = p_\theta(X)$. However, in many applications the likelihood function cannot be obtained explicitly or is intractable to compute, which precludes direct Bayesian computation and maximum likelihood learning. In these cases, approximate inference can still be performed with approximate Bayesian computation (ABC), as long as it is possible to simulate data samples $X$ from the likelihood-free model given a parameter $\theta$. Both the accuracy and the computational efficiency of ABC depend on the choice of summary statistic, especially for high-dimensional data, but it is unclear which guiding principles can be used to construct effective summary statistics. In Chapter 2, we explore automating the construction of summary statistics by training deep neural networks (DNNs) to predict the parameters from artificially generated data; the resulting summary statistics are approximately the posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model that match or exceed theoretically motivated summary statistics in terms of the accuracy of the resulting posteriors. In many important models, the likelihood function is not fully available but is conditionally computable or known up to a normalizing constant. An example is a model of the form $$p_\theta(x) = e^{-E(x, \theta)-\Lambda(\theta)},$$ where the \textit{energy function} $E(x, \theta)$ is known while the \textit{log-partition function} $\Lambda(\theta)$ is unknown.
Since Markov chain Monte Carlo (MCMC) methods can sample data $X$ given $\theta$, ABC can be used in principle. A more powerful alternative is Geoffrey E. Hinton's Contrastive Divergence (CD) learning algorithm, which approximates $\nabla \Lambda(\theta)$, the missing term in the gradient of the log-likelihood function, by a short MCMC run and maximizes the log-likelihood function using the approximate gradient. Despite CD's empirical success, both computer simulation and theoretical analysis show that CD may fail to converge to the maximum likelihood estimate or the true parameter. In Chapter 3, we study the asymptotic properties of the CD algorithm with a fixed learning rate in exponential families and establish conditions that guarantee its convergence. We prove that, given an i.i.d. data sample $X_1, \dots, X_n \sim p_{\theta^*}$ and the sequence $\{\theta_t\}_{t \ge 0}$ generated by the CD algorithm, any limit point of the time average $\frac{1}{t} \sum_{s=0}^{t-1} \theta_s$ is an asymptotically consistent estimate of $\theta^*$ in the sense that $$\lim_{n \to \infty} \mathbb{P}\left(\limsup_{t \to \infty} \left\Vert \frac{1}{t} \sum_{s=0}^{t-1} \theta_s - \theta^*\right\Vert_2 \ge A_m n^{-(1-2\gamma)/3}\right) = 0$$ for any $\gamma \in (0,1/2)$ and some constant $A_m$ depending on $m$, the number of MCMC transition steps in each iteration of the CD algorithm. In Chapter 4, we extend the results of Chapter 3 to the CD algorithm with an annealed learning rate and obtain analogous asymptotic properties.
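
To make the fixed-learning-rate CD iteration and its time average concrete, the following is a minimal sketch in Python, not the implementation used in this thesis. It assumes a toy one-parameter exponential family $p_\theta(x) \propto e^{\theta x - x^2/2}$ (that is, $N(\theta, 1)$ with sufficient statistic $x$) and a random-walk Metropolis kernel for the $m$ MCMC transition steps. The function name `cd_time_average`, the proposal scale `step`, and all default hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def cd_time_average(data, m=1, lr=0.1, n_iter=2000, step=1.0, seed=0):
    """Time average of CD iterates for the toy model p_theta(x) ∝ exp(theta*x - x**2/2).

    Assumed setup (illustrative only): sufficient statistic phi(x) = x,
    random-walk Metropolis transitions, chains restarted at the data
    in every CD iteration, and a fixed learning rate `lr`.
    """
    rng = np.random.default_rng(seed)
    theta = 0.0          # theta_0
    running_sum = 0.0    # accumulates theta_s for s = 0, ..., n_iter - 1
    for _ in range(n_iter):
        running_sum += theta
        x = data.copy()  # CD starts each Markov chain at an observed data point
        for _ in range(m):
            proposal = x + step * rng.standard_normal(x.shape)
            # Log ratio of unnormalized densities; Lambda(theta) cancels out.
            log_ratio = (theta * proposal - 0.5 * proposal**2) - (theta * x - 0.5 * x**2)
            accept = np.log(rng.random(x.shape)) < log_ratio
            x = np.where(accept, proposal, x)
        # Approximate log-likelihood gradient: data statistic minus m-step chain statistic.
        theta += lr * (data.mean() - x.mean())
    return running_sum / n_iter


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.normal(1.5, 1.0, size=2000)  # synthetic data with theta* = 1.5
    print(cd_time_average(sample, m=1))
```

In this sketch the unknown log-partition function never has to be evaluated, because the Metropolis acceptance ratio depends only on the unnormalized density. The returned value is the finite-horizon time average $\frac{1}{t} \sum_{s=0}^{t-1} \theta_s$, the quantity whose limit points the consistency statement above concerns.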