import Python, Finance, Scientific Computing - Latest Comments

Re: Strata NYC 2013 and PyData 2013 Talks

gour_atmarama — Tue, 24 Dec 2013 16:19:49 -0000

Hi Wes,

when exploring what could be used with Python for SEO work I stumbled upon pandas and your book as well.

Now, seeing the development of badger, I have a few questions:

a) what will be the future of pandas development?

b) is badger meant to replace pandas?

c) will badger be initially provided only as some kind of SaaS?

d) considering that you're an expert in dicing & slicing data, which tool do you consider more appropriate for SEO work? (although I prefer old-school desktop tools)

Sincerely,
Gour

Re: A Roadmap for Rich Scientific Data Structures in Python

Merlyn — Mon, 23 Dec 2013 13:34:07 -0000

Have you looked into RootPy? Not PyRoot but Rootpy. It does what you want and is integrated with scipy and numpy. No pandas however. There are never enough pandas. http://www.rootpy.org/

Re: Strata NYC 2013 and PyData 2013 Talks

Gagi — Tue, 19 Nov 2013 03:20:25 -0000

You give DBAs too much credit. :) In the real world I see lots of single tables with VARCHAR columns. Good talk, nice to see DataPad coming along.

Re: Adventures in Aggregating Data (Group By)

Wes McKinney — Sun, 03 Nov 2013 01:17:39 -0000

Could you ask this question on the pydata mailing list?

Re: Adventures in Aggregating Data (Group By)

Rory — Mon, 28 Oct 2013 01:05:32 -0000

Wes, I love pandas, and I've been working through your book and changing the way my organisation approaches data management and analysis. However, I can't for the life of me work out how to pass the values from more than one DataFrame column to a groupby.transform function. That is, I do not want to pass multiple columns one by one to the same function to transform; I want to transform one column while utilising values found in other columns. I am able to get the result by simply iterating through the groups and making lists of the data, but I want to do it with transform. Any tips?

Re: Whirlwind tour of pandas in 10 minutes

Jeannie — Sun, 29 Sep 2013 09:25:58 -0000

I second that, I'm trying to follow the video tutorial

Re: Whirlwind tour of pandas in 10 minutes

Max Richter — Fri, 13 Sep 2013 10:12:35 -0000

Where did you get the data that you are using in your video?

Re: Whirlwind tour of pandas in 10 minutes

Max Richter — Fri, 13 Sep 2013 10:11:27 -0000

Where did you get the stock data that you are using in your video?

Re: PyCon Singapore 2013

Wes McKinney — Fri, 06 Sep 2013 13:57:10 -0000

Maybe ask on StackOverflow?

Re: PyCon Singapore 2013

Cristian — Fri, 06 Sep 2013 06:07:24 -0000

Hello,

Regarding your book, I have to congratulate you for bringing me into Python, as my first choice until now was R. I have a few difficulties trying to fetch JSON data from my online broker (can connect to the server, but when I stream data it comes as JSON). Is there any way to transform each incoming piece of information to pandas DataFrame?
Thanks.
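
For a question like this, one possible approach is to parse each incoming message with the standard `json` module, collect the resulting dicts in a list, and build a DataFrame from the list. This is a hedged sketch: the `messages` list stands in for the broker's stream, and the field names (`symbol`, `price`, `time`) are assumptions, not anything from the broker's actual API.

```python
import json

import pandas as pd

# Hypothetical messages standing in for a streamed feed; field names are made up.
messages = [
    '{"symbol": "AAPL", "price": 500.1, "time": "2013-09-06T06:00:00"}',
    '{"symbol": "AAPL", "price": 500.4, "time": "2013-09-06T06:00:01"}',
    '{"symbol": "MSFT", "price": 31.2, "time": "2013-09-06T06:00:01"}',
]

records = []
for raw in messages:
    # Each message becomes one dict, i.e. one future row.
    records.append(json.loads(raw))

# A list of dicts maps directly onto rows; dict keys become column names.
ticks = pd.DataFrame(records)
ticks["time"] = pd.to_datetime(ticks["time"])
ticks = ticks.set_index("time")
```

Collecting dicts and building the frame in batches tends to be cheaper than growing a DataFrame one row at a time, since each row-wise append copies the existing data.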

Re: Why I’m not on the Julia bandwagon (yet)

Connelly Barnes — Mon, 02 Sep 2013 15:09:26 -0000

As an update a year later, Julia is still overselling its performance on its homepage. A simple array addition in a loop is 70x slower than C++:

julia> function f()
           ans = [1.0 2.0]
           for i=1:1000^2
               ans += [1.0 2.0]
           end
           return ans
       end

Takes 0.23 secs on my system versus 0.0033 secs for C code. If you write scalar code that adds elements 1 and 2 it is only about 1.5x slower than C, but I don't want to write scalar code.

Also Dijkstra settled the 0 and 1 indexing debate in the 1960s:
http://www.cs.utexas.edu/us...

The language does look technically nice except the 1 indexing flaw. Maybe someone can fork and make a 0 indexed Julia.

Re: Filtering out duplicate pandas.DataFrame rows

Nikita — Sat, 31 Aug 2013 07:55:33 -0000

Thanks, by using "groupby" and "duplicated" I managed to get what I needed :)

Re: Filtering out duplicate pandas.DataFrame rows

Wes McKinney — Tue, 27 Aug 2013 18:22:12 -0000

Yes, use `duplicated`.
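
A minimal sketch of what that suggestion might look like for Nikita's question in this thread: `DataFrame.duplicated` returns a boolean mask, which can then select the index labels or the `C` values of the rows that would be dropped. The frame and its contents are invented for illustration; only the column names A, B, and C come from the question.

```python
import pandas as pd

# Made-up example data using the column names from the question.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": ["x", "x", "y", "z", "w"],
    "C": [10, 20, 30, 40, 50],
})

# True for every row whose (A, B) pair already appeared earlier in the frame.
is_dup = df.duplicated(["A", "B"])

# Index labels of the rows that would be dropped (roughly what R's which() gives).
dropped_index = df.index[is_dup]

# Values of C in those rows.
dropped_c = df.loc[is_dup, "C"]

# The de-duplicated frame is the complement of the mask.
deduped = df[~is_dup]
```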

Re: Filtering out duplicate pandas.DataFrame rows

Nikita — Mon, 26 Aug 2013 17:50:56 -0000

Great way to drop duplicates based on multiple columns. Thanks!
Is there a way to get the indices, or the values in a specific column, of the rows that are dropped? E.g. I want to know what value of C is in the row where A and B are found to have duplicates.
I really miss the which() function from R.

Re: PyCon Singapore 2013

Shishir Pandey — Fri, 19 Jul 2013 02:18:31 -0000

Thanks a lot for the link.
Quora has a great collection of questions and answers. I was thinking of downloading some data to play around with, but it does not have an API. What method do you suggest to get data from Quora?

Re: PyCon Singapore 2013

Wes McKinney — Thu, 18 Jul 2013 17:27:45 -0000

check out data.stackexchange.com

Re: PyCon Singapore 2013

Shishir Pandey — Thu, 18 Jul 2013 02:04:48 -0000

I am interested in the script that you used to get data from stackexchange. Can you share that?
Thanks.

Re: PyCon Singapore 2013

Wes McKinney — Thu, 04 Jul 2013 18:51:54 -0000

Please transfer these comments to GitHub to start a discussion with the other developers about your needs. We're keen to improve the read_csv function in this regard.

Re: PyCon Singapore 2013

Mike — Thu, 04 Jul 2013 06:45:33 -0000

Actually, I don't really need a read_csv option that will kick out any values that don't parse to numeric. I need one of these alternatives:

A) I do not want non numeric values to be represented in different ways. Because when I try to clean up the data, I first have to handle "NaN", then I have to handle "None", then I need to handle "?", etc. I only want one representation of non numerics, such as "NaN" only. Then I can handle all non numerics once, with "NaN" statements.

B) If there are many ways to represent non numerics, "NaN", "None", "Inf", "?", etc - then I want an explicit list in the manual of all the non numerics I must handle. This list should be easy to access, i.e. it should not be hidden somewhere. For instance in the read_csv manual: "These are the following non numeric values that can be found in a data frame after read_csv: NaN, None, Inf, ?, ...". Then I can explicitly handle each case.

What I don't want is the situation where I worry whether I have non numerics left in my data. Have I handled all non numerics? As of now, I only did a "pandas.drop"; does that cover all "None", "NaN", etc? I don't know. I hope I only have numerics left. But I don't know. I hope when we go live with my statarb algo, we don't lose a lot of money just because it is not clear how to wash the pandas data?

BTW, I prefer Pandas to R, and am trying to convert people from R to Pandas. Great job you are doing! :)
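
As a rough illustration of alternative A, `read_csv` accepts an `na_values` argument that adds extra strings to its built-in set of missing-value markers, so they all end up as NaN at load time. This is a sketch under assumptions: the CSV text and the markers `?` and `missing` are invented, and the exact default marker list depends on the pandas version, so it is worth checking the `read_csv` documentation for the version in use.

```python
import io

import pandas as pd

# Hypothetical CSV text. "NaN" and "N/A" are among the markers pandas recognises
# by default; "?" and "missing" are not, so they are listed explicitly below.
raw = io.StringIO(
    "price,volume\n"
    "10.5,100\n"
    "?,200\n"
    "NaN,missing\n"
    "N/A,400\n"
)

# na_values extends the default set of strings that are read as NaN.
df = pd.read_csv(raw, na_values=["?", "missing"])

# With every marker mapped to NaN, one check and one drop cover all of them.
n_missing = df.isnull().sum()
clean = df.dropna()
```

If a column still comes back with dtype `object` after loading, that is a hint that some unlisted non-numeric marker is left in it.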

Re: PyCon Singapore 2013

Mike — Mon, 01 Jul 2013 14:57:49 -0000

Great! This was a huge pain in R. Until then, do you have a link that shows how to catch all possible non-numeric values there are in Pandas? How can I catch all of those values? Must I look for "NaN", "none", ...? Do you have a complete list of non numerics that I can look for? Or some code snippet?

Re: PyCon Singapore 2013

Wes McKinney — Thu, 27 Jun 2013 15:37:06 -0000

We've been planning to add a parsing option to read_csv that will "kick out" any values that don't parse to numeric. We haven't done so yet; your voice on the GitHub issue list with example data would be helpful.
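
In later pandas releases, `pandas.to_numeric` with `errors="coerce"` gives a similar effect after loading: anything that does not parse as a number becomes NaN. Whether this matches the read_csv option described above is an assumption; the sketch below only illustrates the coercion idea on made-up values.

```python
import pandas as pd

# Hypothetical column read in as strings, with several non-numeric markers mixed in.
raw = pd.Series(["1.5", "?", "None", "n/a", "2"])

# Values that cannot be parsed as numbers are replaced by NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# A single dropna() then removes every value that failed to parse.
only_numbers = numeric.dropna()
```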

Re: PyCon Singapore 2013

Mike — Thu, 27 Jun 2013 10:15:55 -0000

I have recently switched to Pandas from R. One thing that annoyed me in R was that there were many types of NaN values. For instance, I removed all "NaN" with R.dropna or some similar command. Much later I discovered I had "N/A" values left in my dataset, so I had to remove those values as well. Maybe there were other non-values left, such as "?". How could I know? I would like a simple way to remove _all_ values that are not numbers, so I could catch all NaN, N/A, ?, etc. Could you please make sure that Pandas does not have this problem with pandas.dropna? It was a real pain to try to catch all non-values. So when I use pandas.dropna, I would like it to catch all values that are not floats. AFAIK, pandas.dropna only catches "NaN"?

Re: PyCon Singapore 2013

Boon Kwee — Mon, 17 Jun 2013 00:36:01 -0000

Hi Wes, I really appreciate the talk. The demo with IPython shell is really useful and interesting! I'm already trying out pandas for data visualization using data from various internet sources. Thanks a great deal!

Re: PyCon Singapore 2013

Wes McKinney — Sun, 16 Jun 2013 23:10:15 -0000

pandas could run on pypy but all of the Cython extensions would need to be ported to pure Python or C with cffi. I don't see any conflict between pandas and numba; I see good opportunities to use numba kernels to accelerate operations in pandas. I'm waiting for the pull request.

I don't yet see any strides being made to improve general data processing / data preparation (as it relates to business analytics, for example) in pypy or numba which is my main area of interest.

Re: PyCon Singapore 2013

Carst Vaartjes — Sun, 16 Jun 2013 18:12:22 -0000

Great presentation; one thing I'm really curious about is how you see the future for Pandas (with its Cython setup) in a numba or pypy world? I have positive experiences with pypy, but for pandas (setting aside possible Cython-in-pypy issues) I would guess it means more that "the rest" of Python would get accelerated too? (And not pandas itself, as that has already been pushed/optimized as much as possible to the Cython/C level?) Would love to understand this better!