I originally started learning Python because of NLTK, and it gradually became my primary auxiliary scripting language at work: although our main development language was C/C++, most of my day-to-day text and data processing tasks went to Python. After leaving Tencent to start my own company, my first product, Course Graph (課程圖譜), was built on the Python-based Flask framework, and little by little I handed the great majority of my work over to Python. Over the years I have used a great many Python packages, especially in text processing, scientific computing, machine learning, and data mining; there are so many excellent Python tools in these areas that being a Pythoner is a rather happy thing. If you pay attention on Weibo you will see plenty of posts on this topic, and a quick Google search turns up summaries like "Python machine learning libraries", but they always felt like something was missing. A popular term these days is "full stack engineer"; as a hard-pressed startup founder you naturally have to turn yourself into one, and along the way these Python packages have given me plenty of firepower, which is what prompted this series. Of course, this is merely a starting point, and I hope readers will contribute more leads so that together we can compile an armory of Python tools for web crawling, text processing, scientific computing, machine learning, and data mining.
I. Python Web Crawling Tools
Any real project starts with acquiring data. Whether for text processing, machine learning, or data mining, you need data first. Apart from professional datasets you can buy or download, you often have to crawl the data yourself, and that is where crawlers become indispensable. Fortunately, Python offers a number of excellent web crawling frameworks that can fetch pages as well as extract and clean the data, so that is where we begin:
1. Scrapy
Scrapy, a fast high-level screen scraping and web crawling framework for Python.
The celebrated Scrapy hardly needs an introduction; many of the courses on Course Graph were collected with it. There are plenty of articles about Scrapy; I particularly recommend pluskid's early piece "Scrapy: easily build a custom web crawler" (《Scrapy 輕松定制網(wǎng)絡(luò)爬蟲》), which has aged remarkably well.
Official homepage: http://scrapy.org/
GitHub code page: https://github.com/scrapy/scrapy
2. Beautiful Soup
You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.
I first learned about Beautiful Soup from the book "Programming Collective Intelligence" back in school, and I still use it from time to time; it is an excellent toolkit. Strictly speaking, Beautiful Soup is not a complete crawler by itself — it is usually paired with urllib — but rather a library for parsing, cleaning, and extracting data from HTML/XML.
Official homepage: http://www.crummy.com/software/BeautifulSoup/
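As a minimal sketch of the parse-and-extract workflow described above (the HTML snippet is invented for illustration; in practice you would first fetch the page with urllib):

```python
from bs4 import BeautifulSoup

# A made-up page standing in for something fetched with urllib
html = """
<html><body>
  <h1>Course Graph</h1>
  <ul>
    <li><a href="/c/1">NLP</a></li>
    <li><a href="/c/2">Machine Learning</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the tree and pull out just the data you want
title = soup.h1.get_text()
links = [a["href"] for a in soup.find_all("a")]
```

This is exactly the division of labor mentioned above: urllib (or any HTTP client) does the fetching, Beautiful Soup does the parsing and cleaning.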
3. Python-Goose
HTML Content / Article Extractor, web scraping lib in Python
Goose was originally written in Java and later rewritten in Scala, making it a Scala project. Python-Goose is a Python rewrite that depends on Beautiful Soup. I used it a while ago and liked it a lot: given an article URL, it makes extracting the article's title and body text very convenient.
GitHub homepage: https://github.com/grangier/python-goose
II. Python Text Processing Tools
Once you have pulled text data off the web, the basic processing you need depends on the task: English text needs tokenization, Chinese text needs word segmentation, and beyond that — for either language — there is part-of-speech tagging, parsing, keyword extraction, text classification, sentiment analysis, and more. In this area, especially for English, there are many excellent toolkits; let's go through them one by one.
1. NLTK — Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.
Anyone working in natural language processing knows NLTK, so I won't say much about it here. I will, however, recommend two books for those new to NLTK or who want to study it in depth. The first is the official "Natural Language Processing with Python", which focuses on NLTK's features and usage with some Python background mixed in; Chen Tao has kindly translated it into Chinese (see the post recommending the Chinese translation of the NLTK companion book). The second is "Python Text Processing with NLTK 2.0 Cookbook", which goes deeper, touching on NLTK's code structure and showing how to build your own corpora and models; it is quite good.
Official homepage: http://www.nltk.org/
GitHub code page: https://github.com/nltk/nltk
2. Pattern
Pattern is a web mining module for the Python programming language.
It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.
Pattern comes from the CLiPS lab at the University of Antwerp in Belgium. Frankly, Pattern is more than a text processing toolkit — it is a full web mining suite, covering data acquisition (APIs for Google, Twitter, and Wikipedia, plus a crawler and an HTML DOM parser), text processing (POS tagging, sentiment analysis, etc.), machine learning (vector space model, clustering, SVM), and visualization. In fact, Pattern's overall organization mirrors the organization of this article, but for now we file it under text processing. I mainly use its English module, pattern.en, which offers many very good text processing features: basic tokenization, POS tagging, sentence splitting, grammar checking, spelling correction, sentiment analysis, parsing, and more.
Official homepage: http://www.clips.ua.ac.be/pattern
3. TextBlob: Simplified Text Processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
TextBlob is an interesting Python text processing package. It is essentially a wrapper over the two toolkits above, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both"), exposing a simple API for POS tagging, noun phrase extraction, sentiment analysis, text classification, spell checking, and so on. It even offers translation and language detection, though these rely on the Google API and are subject to call limits. TextBlob is still relatively young; keep an eye on it if you are interested.
Official homepage: http://textblob.readthedocs.org/en/dev/
GitHub code page: https://github.com/sloria/textblob
4. MBSP for Python
MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
MBSP shares its origins with Pattern, also coming from the CLiPS lab at the University of Antwerp. It provides basic text processing: word tokenization, sentence splitting, POS tagging, chunking, lemmatization, parsing, and more. Worth a look if you are interested.
Official homepage: http://www.clips.ua.ac.be/pages/MBSP
5. Gensim: Topic modeling for humans
Gensim is a thoroughly professional topic modeling toolkit for Python, strong in both code and documentation. We covered installing and using Gensim in "How to compute the similarity of two documents" (《如何計(jì)算兩個(gè)文檔的相似度》), so I won't repeat it here.
Official homepage: http://radimrehurek.com/gensim/index.html
GitHub code page: https://github.com/piskvorky/gensim
6. langid.py: Stand-alone language identification system
Language identification is an interesting problem, but a relatively mature one, with many solutions and plenty of good open-source packages. For Python, I have used langid and am happy to recommend it. langid currently detects 97 languages and offers many conveniences: you can start its simple built-in server and call the API with JSON, or train a custom language identification model. Small as it is, it has everything it needs.
GitHub homepage: https://github.com/saffsd/langid.py
7. Jieba: Chinese Word Segmentation
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
At last, a Python text processing package from China: jieba. It supports three segmentation modes (accurate mode, full mode, and search-engine mode), traditional Chinese, and custom dictionaries, and is currently an excellent Chinese word segmentation solution for Python.
GitHub homepage: https://github.com/fxsjy/jieba
8. xTAS
xtas, the eXtensible Text Analysis Suite, a distributed text analysis package based on Celery and Elasticsearch.
Thanks to Weibo friend @大山坡的春 for the tip: "A colleague in our group recently released xTAS, also a Python-based text mining toolkit — feel free to try it: http://t.cn/RPbEZOW." It looks quite promising; I will give it a try.
GitHub code page: https://github.com/NLeSC/xtas
III. Python Scientific Computing Tools
When scientific computing comes up, most people first think of Matlab, which combines numerical computing, visualization tools, and interactivity in one package — but unfortunately it is a commercial product. On the open-source side, besides GNU Octave's attempt at a Matlab-like environment, a few Python packages taken together can stand in for the corresponding Matlab functionality: NumPy + SciPy + Matplotlib + IPython. These packages, NumPy and SciPy in particular, are also the foundation of many Python text processing, machine learning, and data mining toolkits, so they matter a great deal. Finally, I recommend the series "Scientific Computing with Python" (《用Python做科學(xué)計(jì)算》), which covers NumPy, SciPy, and Matplotlib, as a reference.
1. NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
1) a powerful N-dimensional array object
2) sophisticated (broadcasting) functions
3) tools for integrating C/C++ and Fortran code
4) useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is an all-but-unavoidable scientific computing package. Its most used feature is probably the N-dimensional array object, but it also includes mature function libraries, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, and random number facilities. NumPy provides two fundamental objects: ndarray (the N-dimensional array object) and ufunc (the universal function object). An ndarray is a multi-dimensional array holding elements of a single data type, while a ufunc is a function that can operate over whole arrays.
Official homepage: http://www.numpy.org/
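A tiny sketch of the two fundamental objects just described — ndarray and ufunc — plus broadcasting:

```python
import numpy as np

# ndarray: a multi-dimensional array holding a single data type
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# ufuncs operate elementwise over whole arrays, no Python loop needed
s = np.sqrt(a)      # elementwise square root
total = a.sum()     # sum over all elements

# Broadcasting: the row vector is stretched across each row of `a`
shifted = a + np.array([10.0, 20.0])
```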
2. SciPy: Scientific Computing Tools for Python
SciPy refers to several related but distinct entities:
1) The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.
2) The community of people who use and develop this stack.
3) Several conferences dedicated to scientific computing in Python – SciPy, EuroSciPy and SciPy.in.
4) The SciPy library, one component of the SciPy stack, providing many numerical routines.
"SciPy is an open-source library of algorithms and mathematical tools for Python. Its modules cover optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other computations common in science and engineering. Its functionality is similar to MATLAB, Scilab, and GNU Octave. NumPy and SciPy are usually used together, and most Python machine learning libraries depend on these two modules." — quoted from "Python machine learning libraries"
Official homepage: http://www.scipy.org/
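Two of the modules listed in the quote above — integration and optimization — in a minimal sketch:

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; the exact answer is 2
val, err = integrate.quad(np.sin, 0, np.pi)

# Minimize (x - 2)^2; the minimum is at x = 2
res = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
```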
3. Matplotlib
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (a la MATLAB or Mathematica), web application servers, and six graphical user interface toolkits.
matplotlib is Python's best-known plotting library. It offers a full set of Matlab-like command APIs well suited to interactive plotting, and it can also be conveniently embedded as a plotting widget in GUI applications. Paired with the IPython shell it delivers a plotting experience every bit as good as Matlab's; everyone who has used it says so.
Official homepage: http://matplotlib.org/
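A minimal non-interactive sketch: the same plotting API described above, but rendering to a PNG in memory via the Agg backend (useful on servers with no display):

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: render to a buffer, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()

# Save the figure as PNG bytes instead of showing a window
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
```

In an interactive session you would call plt.show() instead of saving to a buffer.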
4. IPython
IPython provides a rich architecture for interactive computing with:
1) Powerful interactive shells (terminal and Qt-based).
2) A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
3) Support for interactive data visualization and use of GUI toolkits.
4) Flexible, embeddable interpreters to load into your own projects.
5) Easy to use, high performance tools for parallel computing.
"IPython is an interactive Python shell, far more pleasant and powerful than the default one. It supports syntax highlighting, tab completion, debugging, and object introspection, supports Bash shell commands, and ships with many useful built-in functions and features — very easy to use." Start it with "ipython --pylab" to enable matplotlib's interactive plotting by default, which is very convenient.
Official homepage: http://ipython.org/
IV. Python Machine Learning & Data Mining Tools
Machine learning and data mining are hard to separate cleanly, so they share a section here. There are many open-source Python packages in this space; I will start with the ones I know well, then add material from other sources — and contributions are welcome.
1. scikit-learn: Machine Learning in Python
scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
First, the celebrated scikit-learn: an open-source machine learning toolkit built on NumPy, SciPy, and Matplotlib. It mainly covers classification, regression, and clustering algorithms — SVM, logistic regression, naive Bayes, random forests, k-means, and more — and both the code and the documentation are excellent. It is used in many Python projects; in our familiar NLTK, for example, the classifier module has a dedicated scikit-learn interface, so you can train classifier models with scikit-learn's algorithms on your own training data. I also recommend a video from when I first encountered scikit-learn: "Tutorial: scikit-learn – Machine Learning in Python".
Official homepage: http://scikit-learn.org/
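A minimal classification sketch in the spirit described above, using scikit-learn's bundled iris dataset and its current API (class and function names as in modern scikit-learn releases):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a quarter for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a logistic regression classifier and evaluate on the held-out set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Swapping in an SVM, naive Bayes, or random forest is a one-line change, which is much of scikit-learn's appeal.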
2. Pandas: Python Data Analysis Library
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
I first came across Pandas because a project in the Udacity course "Introduction to Data Science" required it, so I learned it then. Pandas builds on NumPy and Matplotlib and targets data analysis and visualization. Its DataFrame structure closely resembles R's data.frame, and it has a particularly strong set of machinery for analyzing time series data. I recommend the book "Python for Data Analysis", written by Pandas's lead developer, which walks through IPython, NumPy, and Pandas, along with data visualization, data cleaning and wrangling, and time series handling, with case studies including financial stock data mining — quite good.
Official homepage: http://pandas.pydata.org/
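A small sketch of the DataFrame and the time series machinery mentioned above (the data is made up for illustration):

```python
import pandas as pd

# A small daily time series: one value per day for six days
idx = pd.date_range("2014-01-01", periods=6, freq="D")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Resample into 3-day bins and average each bin
means = df.resample("3D").mean()

# Plain column arithmetic, as with R's data.frame
total = df["value"].sum()
```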
=======================================================
Dividing line: the packages above are essentially all ones I have used myself; those below come from leads contributed by others, in particular "Python machine learning libraries" and "23 Python machine learning packages", with some additions, deletions, and edits. Further contributions are welcome.
========================================================
3. mlpy – Machine Learning in Python
mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries.
mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.
Official homepage: http://mlpy.sourceforge.net/
4. MDP: The Modular toolkit for Data Processing
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.
"MDP, the Modular toolkit for Data Processing, is a Python data processing framework. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into processing sequences and more complex feed-forward network architectures, with computation executed efficiently with respect to speed and memory. From the scientific developer's perspective, MDP is a modular framework that can easily be extended: implementing new algorithms is simple and intuitive, and newly implemented units are automatically integrated with the rest of the library. MDP was written in the context of theoretical research in neuroscience, but it is designed to be useful wherever trainable data processing algorithms apply. Its user-side simplicity, its variety of ready-to-use algorithms, and the reusability of its units also make it a useful teaching tool."
Official homepage: http://mdp-toolkit.sourceforge.net/
5. PyBrain
PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.
PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
"PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning module for Python whose goal is to provide flexible, easy-to-use yet powerful algorithms for machine learning tasks. (Quite a grand name.) As the name suggests, PyBrain covers neural networks, reinforcement learning (and their combination), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) must be used to cope with high-dimensional data. PyBrain is built around neural networks, and all of its training methods take a neural network as an instance."
Official homepage: http://www.pybrain.org/
6. PyML – machine learning in Python
PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.
"PyML is a Python machine learning toolkit providing a flexible architecture for various classification and regression methods. It mainly offers feature selection, model selection, classifier combination, and classification evaluation."
Project homepage: http://pyml.sourceforge.net/
7. Milk: Machine learning toolkit in Python
Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
"Milk is a machine learning toolbox for Python whose focus is supervised classification, with several effective classifiers: SVMs (based on libsvm), k-NN, random forests, and decision trees. It can also perform feature selection, and these classifiers can be combined in many ways to form different classification systems. For unsupervised learning, it provides k-means and affinity propagation clustering."
Official homepage: http://luispedro.org/software/milk
8. PyMVPA: MultiVariate Pattern Analysis (MVPA) in Python
PyMVPA is a Python package intended to ease statistical learning analyses of large datasets. It offers an extensible framework with a high-level interface to a broad range of algorithms for classification, regression, feature selection, data import and export. It is designed to integrate well with related software packages, such as scikit-learn, and MDP. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is free software and requires nothing but free-software to run.
"PyMVPA (MultiVariate Pattern Analysis in Python) is a Python toolkit for statistical learning analyses of large datasets, offering a flexible, extensible framework. It provides classification, regression, feature selection, data import/export, visualization, and more."
Official homepage: http://www.pymvpa.org/
9. Pyrallel – Parallel Data Analytics in Python
Experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.
"Pyrallel (Parallel Data Analytics in Python) is an experimental project exploring distributed computation patterns for machine learning and other semi-interactive data analytics tasks; it can run on small clusters."
GitHub code page: http://github.com/pydata/pyrallel
10. Monte – gradient based learning in Python
Monte (python) is a Python framework for building gradient based learning machines, like neural networks, conditional random fields, logistic regression, etc. Monte contains modules (that hold parameters, a cost-function and a gradient-function) and trainers (that can adapt a module’s parameters by minimizing its cost-function on training data).
Modules are usually composed of other modules, which can in turn contain other modules, etc. Gradients of decomposable systems like these can be computed with back-propagation.
"Monte (machine learning in pure Python) is a pure-Python machine learning library. It lets you quickly build neural networks, conditional random fields, logistic regression, and other models, and it is very easy to use and extend."
Official homepage: http://montepython.sourceforge.net
11. Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
1) tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
2) transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU (float32 only).
3) efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
4) speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
5) dynamic C code generation – Evaluate expressions faster.
6) extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).
"Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, efficiently handling computations over multi-dimensional arrays. Theano's features: tight NumPy integration; efficient data-intensive GPU computation; efficient symbolic differentiation; speed and stability optimizations; dynamic C code generation; extensive unit testing and self-verification. Theano has been widely used in scientific computation since 2007. It makes building deep learning models much easier and lets you implement many kinds of models quickly. PS: Theano was a Greek beauty, daughter of Milo, the most powerful man of Croton, and later became the wife of Pythagoras."
12. Pylearn2
Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).
"Pylearn2 is built on Theano and partly depends on scikit-learn. It is currently under active development and will be able to handle vectors, images, video, and other data, providing deep learning models such as MLPs, RBMs, and SDAs."
Official homepage: http://deeplearning.net/software/pylearn2/
Further suggestions are welcome; this article will be updated continually.